
The AI bot robots.txt complete guide for 2026

Complete reference for AI bot robots.txt configuration. Every user agent that matters (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, etc.), the gold-standard explicit-allow pattern, Cloudflare and WAF traps, same-day verification procedure.

Data for AI Search Editorial Team · 14 min read

The robots.txt file tells crawlers, including AI bots, which parts of your site they may read. As of mid-2026, at least thirteen AI bot user agents matter for citation: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, anthropic-ai, Claude-Web, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, cohere-ai, ImagesiftBot, and FacebookBot. Each one reads your site for a different purpose (training data ingestion, real-time retrieval, image indexing), and blocking a crawler shuts off the corresponding platform's citation channel entirely.

The single most common audit finding we surface across verticals is a brand inadvertently blocking GPTBot at the infrastructure level, most often through Cloudflare's AI Crawl Control toggle, which was set to block by default on many accounts. A brand can have flawless content, complete directory presence, and active brand mention engineering and still score zero on ChatGPT for months without knowing why.

This guide is the reference document: every AI bot user agent that matters, the gold-standard explicit-allow robots.txt pattern, the Cloudflare and WAF traps to avoid, and the same-day verification procedure that closes the highest-leverage AEO gap.

What is the AI bot robots.txt?

robots.txt is a text file at the root of your domain (https://yourdomain.com/robots.txt) that tells crawlers which paths they can access. The format is simple: one or more rule groups, each specifying a User-agent directive followed by Allow and/or Disallow directives.
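To make the rule-group structure concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The file contents, the /drafts/ path, and the SomeOtherBot name are made up for illustration. Python's parser applies rules within a group in file order (first match wins), which can differ from how an individual crawler resolves overlapping rules, so the Disallow line is placed before the broader Allow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with two rule groups: a wildcard group
# and a group that names GPTBot specifically.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /drafts/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot is matched by its own group: everything except /drafts/ is allowed.
gptbot_blog = parser.can_fetch("GPTBot", "https://yourdomain.com/blog/post")
gptbot_drafts = parser.can_fetch("GPTBot", "https://yourdomain.com/drafts/x")

# A crawler with no group of its own falls back to the wildcard group.
other_admin = parser.can_fetch("SomeOtherBot", "https://yourdomain.com/admin/")

print(gptbot_blog, gptbot_drafts, other_admin)
```

Each group is independent: the parser picks one group for the requesting user agent and evaluates only the rules inside it.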

The traditional use case was managing search engine crawlers: telling Googlebot to skip admin pages, or slowing down Bingbot. AI bots arrived between 2023 and 2025 and adopted the same convention. OpenAI ships GPTBot. Anthropic ships ClaudeBot. Perplexity ships PerplexityBot. Google added Google-Extended for AI training (distinct from Googlebot for search). Each respects standard robots.txt directives when configured correctly.

The complication: many brands have robots.txt files configured for traditional search engines without explicit consideration of AI bots. The default User-agent: * group applies to every crawler that has no group of its own, AI bots included, which works fine. But infrastructure layers above robots.txt (Cloudflare's AI Crawl Control, WAF rules, Vercel firewalls) can block AI bots independently of what robots.txt says. We've audited brands with perfectly permissive robots.txt files that scored zero on ChatGPT because Cloudflare's panel had toggled GPTBot to block six months earlier.

The robots.txt file is necessary but not sufficient for AI bot accessibility. Both layers — robots.txt AND infrastructure — must be configured correctly.

Which AI bots should you allow?

The complete current list of AI bots that materially affect AEO citation in 2026:

User-Agent | Owner | Purpose | Affects
GPTBot | OpenAI | Training data ingestion | ChatGPT citation
ChatGPT-User | OpenAI | User-action triggered retrieval | ChatGPT real-time
OAI-SearchBot | OpenAI | ChatGPT Search retrieval | ChatGPT Search
ClaudeBot | Anthropic | Training data ingestion | Claude citation
anthropic-ai | Anthropic | Alternative crawler ID | Claude citation
Claude-Web | Anthropic | User-action triggered retrieval | Claude real-time
PerplexityBot | Perplexity | Retrieval + training | Perplexity citation
Google-Extended | Google | Gemini training data (robots.txt control token) | Gemini (AI Overviews follow Googlebot)
Applebot-Extended | Apple | Apple Intelligence training | Apple Intelligence
Bytespider | ByteDance | TikTok / Doubao training | Doubao + TikTok AI
cohere-ai | Cohere | Cohere training | Cohere model citation
ImagesiftBot | Imagesift | Image AI training | Image generation models
FacebookBot | Meta | Meta AI training | Meta AI citation

For most brands optimizing for the major AI assistants (ChatGPT, Perplexity, Claude, Gemini), allowing the first eight is essential. Blocking any of them vetoes citation on the corresponding platform per the 10-Point AI Citation Audit Check 1.

The remaining five (Applebot-Extended, Bytespider, cohere-ai, ImagesiftBot, FacebookBot) matter more selectively. Allowing Bytespider is useful for international brands targeting Chinese-speaking markets via Doubao. Applebot-Extended matters for brands optimizing for Apple Intelligence (still emerging as of mid-2026). The others depend on specific business contexts.

What's wrong with robots.txt by default?

Three problems with default robots.txt configurations:

Reliance on User-agent: *. A blanket allow with User-agent: * / Allow: / does work — most AI bots respect it. But it's silent. There's no explicit signal that AI bots are welcome, and infrastructure layers above can independently block bots regardless of what robots.txt says. Brands that explicitly enumerate each AI bot in robots.txt send a clearer signal to both the bots and to internal/external auditors.

Implicit blocks via overly aggressive Disallow rules. A Disallow: /api/ or Disallow: /admin/ is fine. A Disallow: / under any named user-agent is a hard block on that bot. We've audited brands with Disallow: / on user-agents they didn't recognize, which turned out to be AI bots that had been blocked silently for months.

Missing AI-specific bot declarations entirely. Older robots.txt files written before 2024 may not mention any AI bots. They work by default because of the wildcard rule, but they leave the brand unaware of the AI bot crawler layer. When something breaks at the infrastructure level, there's no audit trail in robots.txt to investigate.
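The three problems above can be checked mechanically. The following is a rough sketch, not a complete linter: it classifies each AI agent as blocked, explicitly named, or allowed only implicitly. The function name audit_robots_txt, the status labels, and the sample file are our own illustrative choices, and the check looks only at the site root.

```python
from urllib.robotparser import RobotFileParser

# The eight core agents from the table above.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
             "anthropic-ai", "Claude-Web", "PerplexityBot", "Google-Extended"]

def audit_robots_txt(robots_txt, agents=AI_AGENTS):
    """Classify each agent's access to "/" according to robots.txt only."""
    lines = robots_txt.splitlines()
    parser = RobotFileParser()
    parser.parse(lines)

    # Collect every user agent the file names explicitly.
    named = set()
    for line in lines:
        key, _, value = line.split("#", 1)[0].partition(":")
        if key.strip().lower() == "user-agent":
            named.add(value.strip().lower())

    report = {}
    for agent in agents:
        if not parser.can_fetch(agent, "/"):
            report[agent] = "blocked"
        elif agent.lower() in named:
            report[agent] = "explicitly allowed"
        else:
            report[agent] = "allowed implicitly (not named)"
    return report

# Hypothetical file: GPTBot is hard-blocked, ClaudeBot is named, the rest are not.
SAMPLE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Allow: /
"""

report = audit_robots_txt(SAMPLE)
for agent, status in report.items():
    print(f"{agent}: {status}")
```

This covers the robots.txt layer only. An agent reported as allowed here can still be blocked at the CDN or WAF.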

The Cloudflare GPTBot trap

Cloudflare's AI Crawl Control panel (introduced 2024, evolved through 2025-2026) provides a UI for managing AI bot access at the CDN layer — before requests reach the origin server. The panel includes toggles for GPTBot, PerplexityBot, ClaudeBot, Google-Extended, Bytespider, and others.

The trap: the GPTBot toggle defaulted to "Block" on rollout for many Cloudflare accounts. Brands who hadn't actively configured AI bot access discovered months later that ChatGPT couldn't read their site. The issue was invisible from the brand's perspective — robots.txt looked fine, server logs showed no errors, content was being published — but Cloudflare was returning 403s to GPTBot requests at the edge.

The fix is same-day:

  1. Sign in to Cloudflare → select the domain
  2. Navigate to Security → Bots → AI Crawl Control
  3. Verify GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, anthropic-ai, Claude-Web, PerplexityBot, Google-Extended are all toggled to Allow. Some accounts may show different default states depending on when the account was provisioned.
  4. Save changes
  5. Test using curl with the GPTBot user agent (see verification procedure below)

The Cloudflare panel takes precedence over robots.txt. A site with permissive robots.txt and a blocking Cloudflare toggle is effectively blocked. Per the Two-Track Law, this is the single most common audit finding for Track-1 underperformance — content investment is wasted when the crawlers can't read the site.

The gold-standard explicit-allow pattern

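The pattern we recommend for robots.txt in every audit we produce. It names the eight core agents plus Applebot-Extended, CCBot, Bytespider, and Meta-ExternalAgent; add groups for cohere-ai, ImagesiftBot, or FacebookBot from the table above if those platforms matter to you: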

User-Agent: *
Allow: /

User-Agent: GPTBot
Allow: /

User-Agent: ChatGPT-User
Allow: /

User-Agent: OAI-SearchBot
Allow: /

User-Agent: ClaudeBot
Allow: /

User-Agent: anthropic-ai
Allow: /

User-Agent: Claude-Web
Allow: /

User-Agent: PerplexityBot
Allow: /

User-Agent: Google-Extended
Allow: /

User-Agent: Applebot-Extended
Allow: /

User-Agent: CCBot
Allow: /

User-Agent: Bytespider
Allow: /

User-Agent: Meta-ExternalAgent
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
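If the agent list lives in code or configuration, the file above can be generated rather than hand-edited. This is a minimal sketch under that assumption: build_robots_txt is a hypothetical helper name, and the final lines re-parse the output with Python's standard-library parser as a sanity check.

```python
from urllib.robotparser import RobotFileParser

# Agents in the same order as the pattern above.
AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
          "anthropic-ai", "Claude-Web", "PerplexityBot", "Google-Extended",
          "Applebot-Extended", "CCBot", "Bytespider", "Meta-ExternalAgent"]

def build_robots_txt(agents, sitemap_url):
    """Return an explicit-allow robots.txt: a wildcard group plus one group per agent."""
    groups = ["User-Agent: *\nAllow: /"]
    groups += [f"User-Agent: {agent}\nAllow: /" for agent in agents]
    groups.append(f"Sitemap: {sitemap_url}")
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"

robots_txt = build_robots_txt(AGENTS, "https://yourdomain.com/sitemap.xml")

# Sanity check: every named agent can fetch the site root.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
all_allowed = all(parser.can_fetch(agent, "/") for agent in AGENTS)

print(robots_txt)
print("all agents allowed:", all_allowed)
```

Generating the file keeps the agent list in one place, so adding a new crawler is a one-line change.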

Why explicit allow rather than reliance on User-agent: *:

  • Audit trail. Anyone reading the file (internal team, external consultant, future auditor) sees an explicit signal that AI bots are welcome.
  • Defense against overly-restrictive blanket rules. A crawler follows only the most specific group that names it, so if a security review adds Disallow: / under User-agent: *, the explicit AI bot allows remain in effect.
  • Industry convention. Brands publishing the explicit allow pattern (Stripe, Mintlify, Linear, Stack Overflow, and others) have established it as the standard for AI-friendly sites in 2026.
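The second point, that a named group survives a blanket Disallow, can be demonstrated with Python's standard-library parser. This is a sketch with a made-up file and a made-up UnlistedCrawler agent; it shows the group-selection rule, not the behavior of any specific vendor's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file after a security review locked down the wildcard group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot has its own group, so the wildcard Disallow does not apply to it.
gptbot_allowed = parser.can_fetch("GPTBot", "/pricing")

# A crawler without its own group inherits the wildcard block.
unlisted_allowed = parser.can_fetch("UnlistedCrawler", "/pricing")

print("GPTBot:", gptbot_allowed)
print("UnlistedCrawler:", unlisted_allowed)
```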

We score the explicit-allow pattern at 10/10 on Check 1 of the 10-Point AI Citation Audit. The case study we cite most frequently is Tony's Painting CA (San Diego painting contractor) whose robots.txt explicitly allows every major AI bot by name — they scored 10/10 on crawler accessibility despite being a small local-services business with limited technical resources. The pattern is replicable.

How to test crawler accessibility

Verification procedure for any domain:

# Test GPTBot
curl -A "GPTBot" -I https://yourdomain.com/

# Expected: HTTP/2 200 with response headers

# Test ChatGPT-User
curl -A "ChatGPT-User" -I https://yourdomain.com/

# Test PerplexityBot
curl -A "PerplexityBot" -I https://yourdomain.com/

# Test ClaudeBot
curl -A "ClaudeBot" -I https://yourdomain.com/

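# Google-Extended is a robots.txt control token, not a separate HTTP user agent.
# Google fetches pages with its regular crawler user agents, so there is nothing
# distinct to request here. Check its rules in robots.txt instead.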

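A 2xx response means the request passed every layer for that user-agent string. A 301 or 302 is a redirect, not a block; rerun with -L to check the final destination. A 403, 429, or 503 indicates a block somewhere in the request pipeline. Two caveats: real crawlers send longer user-agent strings that contain these tokens, and edge services that verify crawler IP ranges may treat a spoofed request differently from the real bot, so treat curl results as a strong hint rather than proof. Investigate by:

  1. Checking robots.txt for explicit Disallow rules under that user-agent. robots.txt is advisory and never changes the HTTP status, so check it even when curl returns 200.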
  2. Checking Cloudflare AI Crawl Control panel for that bot's toggle state.
  3. Checking WAF rules for blanket AI-bot blocks.
  4. Checking Vercel firewall configuration (if deployed on Vercel).
  5. Checking any reverse proxy or load balancer rules between Cloudflare and origin.

Test from at least three representative pages — homepage, a content pillar, a service area page or product page. Some misconfigurations apply globally; others apply only to specific paths.
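Running the same check across several pages and agents is easier with a small script. The sketch below uses only Python's standard library. The names classify_status, probe, and probe_matrix are our own, the status groupings are a judgment call rather than a standard, and the same caveat applies as with curl: a spoofed user agent may not be treated like the real crawler.

```python
import urllib.error
import urllib.request

AGENTS = ["GPTBot", "ChatGPT-User", "PerplexityBot", "ClaudeBot"]

def classify_status(status):
    """Map an HTTP status code to a rough verdict for a crawler audit."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if status in (401, 403):
        return "blocked (check Cloudflare, WAF, and firewall rules)"
    if status == 429:
        return "rate limited"
    if status >= 500:
        return "server or edge error"
    return "other client error"

def probe(url, user_agent, timeout=10.0):
    """Send a HEAD request with the given user agent and return the final status.

    urlopen follows redirects, so the returned status is the one for the
    final URL. Network failures (DNS, timeouts) raise URLError to the caller.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses arrive as exceptions; the code is what we want.
        return error.code

def probe_matrix(pages, agents=AGENTS):
    """Probe every page with every agent and return a verdict per pair."""
    return {(page, agent): classify_status(probe(page, agent))
            for page in pages for agent in agents}
```

Calling probe_matrix with your homepage, a content pillar URL, and a product page URL gives a page-by-agent grid. A path-specific misconfiguration shows up as one page failing while the others pass.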

WAF and firewall considerations

Web Application Firewalls and infrastructure firewalls operate independently of robots.txt and Cloudflare's AI Crawl Control panel. Common patterns that inadvertently block AI bots:

Generic "bot challenge" rules. Many WAF configurations challenge or block any user-agent containing "bot", which catches most of the crawlers in this guide, including GPTBot, ClaudeBot, and PerplexityBot. Explicitly exempt named AI bot user agents from challenge rules.

Rate-limiting on legitimate AI crawler request volume. AI bots can request many pages in short windows during training crawls. Aggressive rate-limiting returns 429s and effectively blocks the crawl. Configure rate limits with AI bot user agents specifically exempted or with higher thresholds.

Geographic blocking. Some WAF configurations block requests from countries where major AI provider infrastructure is hosted. This can silently block ClaudeBot (Anthropic infrastructure), GPTBot (OpenAI infrastructure), or Google-Extended (Google infrastructure).

TLS / cipher restrictions. Crawlers are HTTP clients, not browsers, so their TLS handshakes can look different from browser traffic. WAF rules that filter on TLS version, cipher suite, or client fingerprint may block AI bots while allowing browser traffic.

Audit the full request pipeline — origin, Vercel firewall, Cloudflare, WAF, CDN — quarterly. Each layer can independently block AI bots without the others noticing.
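To see how far a generic "bot" rule reaches, the sketch below applies a case-insensitive substring match to the thirteen user-agent tokens from this guide. The function naive_bot_rule_matches is a stand-in for such a WAF rule, not any vendor's actual rule syntax.

```python
# The thirteen user-agent tokens listed in this guide.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
             "anthropic-ai", "Claude-Web", "PerplexityBot", "Google-Extended",
             "Applebot-Extended", "Bytespider", "cohere-ai", "ImagesiftBot",
             "FacebookBot"]

def naive_bot_rule_matches(user_agent):
    """Stand-in for a WAF rule that challenges any user agent containing 'bot'."""
    return "bot" in user_agent.lower()

caught = [agent for agent in AI_AGENTS if naive_bot_rule_matches(agent)]
missed = [agent for agent in AI_AGENTS if not naive_bot_rule_matches(agent)]

print("caught:", caught)
print("missed:", missed)
```

The rule catches the training crawlers for ChatGPT, Claude, and Perplexity while letting several retrieval agents through, so a site can end up partially visible in a way that is hard to diagnose from citation behavior alone.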

How often do AI bots crawl?

Crawl frequency varies by bot and by site authority:

  • GPTBot — crawls high-authority sites frequently (daily or near-daily); medium-authority sites weekly; low-authority sites monthly or less.
  • PerplexityBot — crawls aggressively during real-time retrieval; less for training. Pattern depends heavily on query frequency for the brand's domain.
  • ClaudeBot — training-data crawls happen in batches tied to Anthropic's model update cycles. Less consistent than GPTBot.
  • Google-Extended — follows Google's broader crawling patterns, modulated by Google's AI training schedule.
  • Bytespider — aggressive crawl frequency for sites in its target verticals.

The implication: changes to robots.txt or infrastructure permissions take effect on the timescale of the next crawl cycle, not instantly. A same-day Cloudflare toggle change may take 1-4 weeks to fully appear in ChatGPT citation behavior as GPTBot re-crawls.
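Crawl frequency for your own site can be measured from access logs instead of estimated. The sketch below counts hits per AI bot per day, assuming the common "combined" log format. The sample lines, IP addresses, and user-agent strings are fabricated for illustration, and real crawler strings may differ. Requests blocked at the CDN edge never reach origin logs, so a bot that is missing here may be blocked upstream rather than simply not crawling.

```python
import re
from collections import Counter

AI_TOKENS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
             "PerplexityBot", "Bytespider"]

# Combined log format: ... [day:time zone] "request" status bytes "referer" "user agent"
LINE_RE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

def crawl_hits(log_lines):
    """Count requests per (bot token, day) for lines whose user agent names an AI bot."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua").lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[(token, match.group("day"))] += 1
                break
    return hits

# Fabricated sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [02/Jun/2026:04:11:09 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '203.0.113.7 - - [02/Jun/2026:04:11:12 +0000] "GET /pricing HTTP/1.1" 200 2210 "-" "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '198.51.100.4 - - [03/Jun/2026:10:30:00 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.55 - - [03/Jun/2026:10:31:00 +0000] "GET / HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = crawl_hits(SAMPLE_LOG)
for (token, day), count in sorted(hits.items()):
    print(day, token, count)
```

Comparing these counts before and after a configuration change shows when the next crawl cycle actually arrived.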

Frequently asked questions

Should I block any AI bots?

For most brands, no. Blocking AI bots forfeits citation on those platforms with no compensating benefit. The arguments for blocking (training data privacy, intellectual property concerns) apply primarily to brands publishing sensitive proprietary content or to brands philosophically opposed to AI training. For commercial brands optimizing for AEO/GEO, the right default is allow.

Does blocking Google-Extended affect Google Search rankings?

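No. Google-Extended is a robots.txt control token for Gemini training; Googlebot is the crawler for Search. Blocking Google-Extended does not affect Google Search ranking. It does limit how Gemini can use your content. AI Overviews are a Search feature and, per Google's documentation, are governed by Googlebot rules rather than by Google-Extended.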

What about CCBot?

CCBot is Common Crawl's bot. Common Crawl produces a publicly accessible web archive used by many AI training datasets. Allowing CCBot makes your content available for general AI training across many providers. Most brands should allow it; a minority block it as part of broader AI training opt-out.

Should robots.txt list disallowed paths for AI bots?

If specific paths shouldn't be crawled (admin areas, search result pages with infinite parameter combinations, gated content), yes. Standard disallow patterns work the same for AI bots as for traditional search bots. Just be specific — Disallow: /admin/ not Disallow: /.
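One subtlety when combining path disallows with named AI bot groups: a crawler that has its own group does not inherit rules from User-agent: *. The sketch below, using Python's standard-library parser and a made-up /admin/ path, shows the leak and one way to close it by repeating the Disallow inside the named group. The Disallow is listed before Allow because this parser evaluates rules in file order.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, agent, path):
    """Return True if the given agent may fetch the path under this robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# The Disallow lives only in the wildcard group.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Allow: /
"""

# The Disallow is repeated inside the named group.
REPEATED = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /admin/
Allow: /
"""

leaks = allowed(WILDCARD_ONLY, "GPTBot", "/admin/users")
protected = not allowed(REPEATED, "GPTBot", "/admin/users")
still_open = allowed(REPEATED, "GPTBot", "/blog/")

print("leaks without repeat:", leaks)
print("protected with repeat:", protected)
print("public pages still open:", still_open)
```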

What's the relationship between robots.txt and llms.txt?

robots.txt controls crawler access. llms.txt is a proposed standard for declaring site identity to AI crawlers, but as we documented in our methodology change, llms.txt has no measurable impact on AI citation per the SE Ranking November 2025 study. robots.txt matters; llms.txt is hygiene-flag only.


Companion guides: Schema markup for AI search · The Cloudflare GPTBot trap · AI bot user-agent reference · The 10-Point AI Citation Framework.