robots-txt

kostja94/marketing-skills · updated Apr 8, 2026

$ npx skills add https://github.com/kostja94/marketing-skills --skill robots-txt
summary

Guides configuration and auditing of robots.txt for search engine and AI crawler control.

skill.md

SEO Technical: robots.txt

Guides configuration and auditing of robots.txt for search engine and AI crawler control.

When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.

Scope (Technical SEO)

  • Robots.txt: Configure Disallow/Allow, Sitemap, Clean-param; audit for accidental blocks
  • Crawler access: Path-level crawl control; AI crawler allow/block strategy
  • Differentiation: robots.txt = crawl control (who accesses what paths); noindex = index control (what gets indexed). See indexing for page-level exclusions.

Initial Assessment

Check for project context first: If .claude/project-context.md or .cursor/project-context.md exists, read it for site URL and indexing goals.

Identify:

  1. Site URL: Base domain (e.g., https://example.com)
  2. Indexing scope: Full site, partial, or specific paths to exclude
  3. AI crawler strategy: Allow search/indexing vs. block training data crawlers

Best Practices

Purpose and Limitations

| Point | Note |
|---|---|
| Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without a snippet) |
| Advisory | Rules are advisory; malicious crawlers may ignore them |
| Public | robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing |

Crawl vs Index vs Link Equity (Quick Reference)

| Tool | Controls | Prevents indexing? |
|---|---|---|
| robots.txt | Crawl (path-level) | No; blocked URLs may still appear in SERPs |
| noindex (meta / X-Robots-Tag) | Index (page-level) | Yes. See indexing |
| nofollow | Link equity only | No; does not control indexing |
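The crawl-control distinction above can also be checked programmatically. This is a minimal sketch using Python's standard-library `urllib.robotparser` against hypothetical rules; note that Python's parser applies the first matching rule in file order rather than RFC 9309 longest-match, which is why `Allow` is listed before `Disallow` here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; Allow precedes Disallow because Python's parser
# returns the first matching rule, not the longest match.
RULES = """\
User-agent: *
Allow: /admin/help
Disallow: /admin/
Disallow: /api/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("*", "https://example.com/blog/post"))    # unmatched path: crawlable
print(rp.can_fetch("*", "https://example.com/admin/users"))  # blocked by Disallow: /admin/
print(rp.can_fetch("*", "https://example.com/admin/help"))   # Allow overrides the Disallow
```

Remember that `can_fetch` only answers the crawl question; a `True` result says nothing about whether the page will be indexed.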

When to Use robots.txt vs noindex

| Use | Tool | Example |
|---|---|---|
| Path-level (whole directory) | robots.txt | `Disallow: /admin/`, `Disallow: /api/`, `Disallow: /staging/` |
| Page-level (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See indexing for the full list |
| Critical | Do NOT block in robots.txt | Pages that use noindex; crawlers must access the page to read the directive |

Paths to block in robots.txt: /admin/, /api/, /staging/, temp files. Paths to leave crawlable and mark noindex: /login/, /signup/, /thank-you/, etc.; see indexing.
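Illustrated with a hypothetical site, the split looks like this: directory-level blocks live in robots.txt, while page-level exclusions stay crawlable and carry a noindex directive instead.

```text
# robots.txt: path-level crawl control only
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/

# /login/, /signup/, /thank-you/ are deliberately NOT listed here;
# they stay crawlable and are excluded at the page level instead,
# e.g. via a response header:  X-Robots-Tag: noindex
```

Listing a noindex page here would be self-defeating: crawlers could never fetch the page to see the directive.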

Location and Format

| Item | Requirement |
|---|---|
| Path | Site root: https://example.com/robots.txt |
| Encoding | UTF-8 plain text |
| Standard | RFC 9309 (Robots Exclusion Protocol) |

Core Directives

| Directive | Purpose | Example |
|---|---|---|
| `User-agent:` | Target crawler | `User-agent: Googlebot`, `User-agent: *` |
| `Disallow:` | Block a path prefix | `Disallow: /admin/` |
| `Allow:` | Allow a path (can override Disallow) | `Allow: /public/` |
| `Sitemap:` | Declare the sitemap's absolute URL | `Sitemap: https://example.com/sitemap.xml` |
| `Clean-param:` | Strip query params (Yandex only) | See below |
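Putting the core directives together, a minimal file for a hypothetical https://example.com might look like:

```text
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml
```

`Sitemap:` is independent of `User-agent:` groups and may appear anywhere in the file.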

Critical: Do Not Block

| Do not block | Reason |
|---|---|
| CSS, JS, images | Google needs them to render pages; blocking breaks indexing |
| /_next/ (Next.js) | Blocking breaks CSS/JS loading; static assets showing as "Crawled - currently not indexed" in GSC is expected. See indexing |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |

Only block paths that don't need crawling: /admin/, /api/, /staging/, temp files.

AI Crawler Strategy

robots.txt was respected by all AI crawlers measured in a 2024 Vercel/MERJ study. Set rules per user-agent, and check each vendor's docs for current tokens.

| User-agent | Purpose | Typical rule |
|---|---|---|
| OAI-SearchBot | ChatGPT search | Allow |
| GPTBot | OpenAI training | Disallow |
| Claude-SearchBot | Claude search | Allow |
| ClaudeBot | Anthropic training | Disallow |
| PerplexityBot | Perplexity search | Allow |
| Google-Extended | Gemini training | Disallow |
| CCBot | Common Crawl (LLM training) | Disallow |
| Bytespider | ByteDance | Disallow |
| Meta-ExternalAgent | Meta | Disallow |
| AppleBot | Apple (Siri, Spotlight); renders JS | Allow for indexing |

Allow vs Disallow: Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training. See site-crawlability for AI crawler optimization (SSR, URL management).
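The table above translates into per-agent groups. A sketch follows; verify each token against the vendor's current documentation before deploying, since tokens change.

```text
# Block training-only crawlers (one group, multiple user-agents)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
Disallow: /

# Search/answer bots fall through to the default group and stay allowed
User-agent: *
Disallow: /admin/
```

Stacking several `User-agent:` lines over one rule set is valid per RFC 9309 and keeps the file easy to audit.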

Clean-param (Yandex)

Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
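Clean-param is Yandex-specific; per Yandex's documentation it is intersectional, so it can be placed anywhere in the file, and other crawlers simply ignore it. In context:

```text
User-agent: *
Disallow: /admin/

# Yandex only; other crawlers ignore this directive
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
```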

Output Format

  • Current state (if auditing)
  • Recommended robots.txt (full file)
  • Compliance checklist
  • References: Google robots.txt

Related Skills

  • indexing: Full noindex page-type list; when to use noindex vs robots.txt; GSC indexing diagnosis
  • page-metadata: Meta robots (noindex, nofollow) implementation
  • xml-sitemap: Sitemap URL to reference in robots.txt
  • site-crawlability: Broader crawl and structure guidance; AI crawler optimization
  • rendering-strategies: SSR, SSG, CSR; content in initial HTML for crawlers

Discussion

  • No comments yet — start the thread.

Ratings

4.7 · 40 reviews
  • Tariq Shah · Dec 28, 2024

    I recommend robots-txt for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Pratham Ware · Dec 12, 2024

    Useful defaults in robots-txt — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Chen Yang · Dec 12, 2024

    Solid pick for teams standardizing on skills: robots-txt is focused, and the summary matches what you get after install.

  • Zara Sanchez · Nov 19, 2024

    Keeps context tight: robots-txt is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Aditi Wang · Nov 15, 2024

    robots-txt reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Yash Thakker · Nov 3, 2024

    robots-txt is among the better-maintained entries we tried; worth keeping pinned for repeat workflows.

  • Li Farah · Nov 3, 2024

    We added robots-txt from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Daniel Sethi · Oct 22, 2024

    robots-txt fits our agent workflows well — practical, well scoped, and easy to wire into existing repos.
