AI bot User-Agent reference: every crawler that matters in 2026
The complete reference for AI bot user agents in 2026. GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, anthropic-ai, Claude-Web, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, cohere-ai, Meta-ExternalAgent. Owner, purpose, default treatment, recommended configuration.
This is the reference document for the AI bot user agents that matter for AEO and GEO optimization in 2026. Thirteen user agents account for the AI bot traffic that affects citation behavior across ChatGPT, Perplexity, Claude, Gemini, Grok, Microsoft Copilot, and emerging AI assistants. Each has a distinct purpose (training-data ingestion, real-time retrieval, image indexing), owner, and default treatment under common infrastructure configurations. Brands optimizing for AI citation need to verify that each user agent can reach the site, and need to know what each one does so that configuration decisions are deliberate rather than accidental.
The reference is structured for that use case: a brand auditing its robots.txt, Cloudflare AI Crawl Control panel, or WAF configuration needs to know what each user agent does and which AI assistant's citations depend on it. Per the 10-Point AI Citation Audit, crawler accessibility (Check 1) is the veto: a blocked user agent kills the corresponding citation channel entirely.
The complete AI bot user agent table
| User-Agent string | Owner | Purpose | Affects | Default treatment |
|---|---|---|---|---|
| GPTBot | OpenAI | Training data ingestion | ChatGPT citation | Often blocked by Cloudflare default |
| ChatGPT-User | OpenAI | User-action retrieval | ChatGPT real-time | Usually allowed |
| OAI-SearchBot | OpenAI | ChatGPT Search retrieval | ChatGPT Search | Usually allowed |
| ClaudeBot | Anthropic | Training data ingestion | Claude citation | Usually allowed |
| anthropic-ai | Anthropic | Alternative crawler ID | Claude citation | Usually allowed |
| Claude-Web | Anthropic | User-action retrieval | Claude real-time | Usually allowed |
| PerplexityBot | Perplexity | Retrieval + training | Perplexity citation | Sometimes blocked |
| Google-Extended | Google | Gemini training (robots.txt control token) | Gemini | Usually allowed |
| Applebot-Extended | Apple | Apple Intelligence training | Apple Intelligence | Usually allowed |
| Bytespider | ByteDance | Doubao / TikTok training | Doubao + TikTok AI | Sometimes blocked |
| CCBot | Common Crawl | Open web archive | Many AI providers via Common Crawl | Usually allowed |
| cohere-ai | Cohere | Cohere model training | Cohere model citation | Usually allowed |
| Meta-ExternalAgent | Meta | Meta AI training | Meta AI citation | Usually allowed |
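One way to check whether these user agents actually reach a site is to tally response codes per bot in the server's access log. The sketch below is a minimal illustration, not a definitive audit tool. It assumes the common "combined" log layout, and the sample log lines (IPs, timestamps, user-agent strings) are made up for the example. A run of 403 responses for one token suggests a block somewhere in the stack. Google documents Google-Extended as a robots.txt token without its own request user agent, so it is not expected to show up in logs.

```python
import re
from collections import Counter, defaultdict

# Tokens from the table above. Matching is a case-insensitive substring test.
AI_BOT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "Claude-Web", "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Bytespider", "CCBot", "cohere-ai", "Meta-ExternalAgent",
]

# Assumed layout: ip - - [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"\s(?P<status>\d{3})\s\S+\s"[^"]*"\s"(?P<ua>[^"]*)"\s*$')


def tally_bot_statuses(lines):
    """Return {bot token: Counter({status code: hits})} for matching log lines."""
    tally = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # line is not in the assumed format
        user_agent = match.group("ua").lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                tally[token][int(match.group("status"))] += 1
                break  # count each request once
    return tally


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 403 512 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '203.0.113.6 - - [01/Jan/2026:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.7 - - [01/Jan/2026:00:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]

for token, statuses in tally_bot_statuses(SAMPLE_LOG).items():
    print(token, dict(statuses))
```

In practice the lines would come from the real log file, and the status counts would be compared against the "Default treatment" column to spot unintended blocks.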
Below is the per-user-agent reference.
GPTBot
Owner: OpenAI
Purpose: Training data ingestion for GPT models. This is the primary OpenAI crawler for absorbing web content into future model training cycles.
Affects: ChatGPT citation (training-corpus signal). Critical for long-term ChatGPT optimization.
Default treatment: Often blocked by Cloudflare's AI Crawl Control panel on accounts provisioned 2024 onward. See The Cloudflare GPTBot trap.
Recommended treatment: Allow.
Documentation: https://platform.openai.com/docs/gptbot
Sample robots.txt:
User-Agent: GPTBot
Allow: /
ChatGPT-User
Owner: OpenAI
Purpose: Real-time retrieval triggered by ChatGPT user actions (browsing, fetching specific URLs, performing web actions).
Affects: ChatGPT real-time citation behavior when users explicitly request URL fetches or web actions.
Default treatment: Usually allowed.
Recommended treatment: Allow.
OAI-SearchBot
Owner: OpenAI
Purpose: ChatGPT Search retrieval. This is the crawler that grounds ChatGPT Search responses in real-time web content.
Affects: ChatGPT Search citation behavior. Critical for time-sensitive queries.
Default treatment: Usually allowed.
Recommended treatment: Allow.
ClaudeBot
Owner: Anthropic
Purpose: Training data ingestion for Claude models. The primary Anthropic crawler.
Affects: Claude citation (training-corpus signal).
Default treatment: Usually allowed.
Recommended treatment: Allow.
anthropic-ai
Owner: Anthropic
Purpose: Alternative user agent identifier Anthropic uses in some contexts. Often appears alongside ClaudeBot.
Affects: Claude citation.
Default treatment: Usually allowed.
Recommended treatment: Allow.
Claude-Web
Owner: Anthropic
Purpose: The crawler Claude uses when users explicitly ask the model to fetch a URL or perform a web search.
Affects: Claude real-time citation behavior.
Default treatment: Usually allowed.
Recommended treatment: Allow.
PerplexityBot
Owner: Perplexity
Purpose: Retrieval and training. Perplexity crawls aggressively in real time for current queries and periodically ingests content for its training corpus.
Affects: Perplexity citation. Critical because Perplexity leans heavily on real-time retrieval, per the Two-Track Law.
Default treatment: Sometimes blocked by aggressive WAF rules that catch "bot" user agents broadly.
Recommended treatment: Allow.
Note: PerplexityBot can generate high request volumes on popular domains. Configure rate limits generously (more than 1,000 requests per hour for this user agent).
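To make the rate-limit note concrete, here is a minimal sketch of a per-user-agent sliding-window limiter. It is an illustration of the idea only: real deployments would normally set this in the WAF or CDN rather than in application code, and the class name and the 1,200 requests/hour figure are assumptions chosen to sit above the 1,000/hour floor mentioned above.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within the last `window_seconds`."""

    def __init__(self, limit, window_seconds=3600.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key, now):
        """Record and accept the request at time `now` (seconds), or reject it."""
        hits = self._hits[key]
        cutoff = now - self.window_seconds
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# Hypothetical generous budget for PerplexityBot: 1,200 requests per hour.
perplexity_limiter = SlidingWindowLimiter(limit=1200, window_seconds=3600.0)
accepted = sum(perplexity_limiter.allow("PerplexityBot", now=float(i)) for i in range(1500))
print(accepted)  # 1500 requests arriving within 1,500 seconds: only the budget is accepted
```

The limiter takes the timestamp as an argument instead of reading the clock, which keeps the behavior easy to test.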
Google-Extended
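Owner: Google
Purpose: Gemini training data ingestion. Distinct from Googlebot (which is for traditional Google Search). Google documents Google-Extended as a robots.txt control token rather than a separate crawler: fetching is done by Google's existing user agents, and the token governs whether crawled content may be used for Gemini.
Affects: Gemini citation. Critical for Track 2 optimization on Google's AI surfaces.
Default treatment: Usually allowed because most sites that allow Googlebot also implicitly allow Google-Extended via User-agent: *.
Recommended treatment: Allow.
Important: Blocking Google-Extended does NOT affect Google Search ranking. Per Google's documentation it affects Gemini training and grounding only; AI Overviews are a Search feature and follow Googlebot rules.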
Applebot-Extended
Owner: Apple
Purpose: Apple Intelligence training data ingestion. Distinct from Applebot (which is for Spotlight and Siri).
Affects: Apple Intelligence citation. Still an emerging surface as of mid-2026.
Default treatment: Usually allowed.
Recommended treatment: Allow.
Bytespider
Owner: ByteDance
Purpose: Training data ingestion for ByteDance AI models including Doubao (the Chinese-market ChatGPT competitor) and TikTok AI features.
Affects: Doubao citation + TikTok AI features. Relevant primarily for brands targeting Chinese-speaking markets or with a significant TikTok presence.
Default treatment: Sometimes blocked by WAF rules targeting Chinese-origin crawlers broadly.
Recommended treatment: Allow for international brands; allow for brands with a TikTok strategy; defer or block for brands with strict Chinese-market avoidance policies.
CCBot
Owner: Common Crawl
Purpose: Building the Common Crawl public web archive, which many AI training datasets incorporate.
Affects: Many AI providers indirectly via Common Crawl ingestion.
Default treatment: Usually allowed.
Recommended treatment: Allow. Blocking CCBot is a defensible choice for brands philosophically opposed to AI training, but it reduces the brand's open-web representation across many AI training datasets.
cohere-ai
Owner: Cohere
Purpose: Training data ingestion for Cohere's enterprise-focused language models.
Affects: Cohere model citation. Limited relevance for consumer-facing AEO; significant for enterprise B2B brands where Cohere adoption is meaningful.
Default treatment: Usually allowed.
Recommended treatment: Allow.
Meta-ExternalAgent
Owner: Meta
Purpose: Training data ingestion for Meta AI models including Llama and the consumer-facing Meta AI assistant.
Affects: Meta AI citation.
Default treatment: Usually allowed.
Recommended treatment: Allow.
Sample complete robots.txt
The gold-standard explicit-allow pattern recommended in the AI bot robots.txt complete guide, applied as a complete reference robots.txt:
User-Agent: *
Allow: /
User-Agent: GPTBot
Allow: /
User-Agent: ChatGPT-User
Allow: /
User-Agent: OAI-SearchBot
Allow: /
User-Agent: ClaudeBot
Allow: /
User-Agent: anthropic-ai
Allow: /
User-Agent: Claude-Web
Allow: /
User-Agent: PerplexityBot
Allow: /
User-Agent: Google-Extended
Allow: /
User-Agent: Applebot-Extended
Allow: /
User-Agent: CCBot
Allow: /
User-Agent: Bytespider
Allow: /
User-Agent: cohere-ai
Allow: /
User-Agent: Meta-ExternalAgent
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
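A robots.txt like the one above can be checked mechanically before it ships. The sketch below uses Python's standard `urllib.robotparser` to report which of the thirteen user agents may fetch a given URL. The `audit_robots` helper, the sample file, and the `/pricing` URL are illustrative assumptions. The sample deliberately blocks GPTBot so the audit has something to find.

```python
from urllib.robotparser import RobotFileParser

AI_USER_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "Claude-Web", "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Bytespider", "CCBot", "cohere-ai", "Meta-ExternalAgent",
]


def audit_robots(robots_txt, url):
    """Return {user agent: True/False} for whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, url) for ua in AI_USER_AGENTS}


# Hypothetical robots.txt with one accidental block.
SAMPLE_ROBOTS = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

result = audit_robots(SAMPLE_ROBOTS, "https://yourdomain.com/pricing")
blocked = [ua for ua, allowed in result.items() if not allowed]
print("Blocked:", blocked)
```

This only tests the robots.txt rules as Python's parser reads them. It applies rules in file order, which can differ from crawlers that use longest-match precedence, and it says nothing about blocks applied at the CDN or WAF layer.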
Frequently asked questions
What about newer or less common AI bots?
The list above covers the AI bots that materially affect citation as of mid-2026. New bots may emerge: xAI's crawler for Grok training (user agent currently undocumented), Mistral's training crawler, and others. The list will be updated as new bots cross meaningful citation-impact thresholds.
Should I differentiate per-section access for AI bots?
Generally no for typical brands. Allowing AI bots to crawl the entire site (except genuinely private paths like admin areas) maximizes citation surface. The exceptions: paywalled content (block AI bots from the paywalled paths to protect the business model), highly sensitive proprietary content, and explicitly opt-out user-generated content.
How do I know if a user agent claiming to be an AI bot is legitimate?
User agent strings can be spoofed, so the string alone proves nothing. Verification depends on what each provider publishes. Some publish the IP ranges their crawlers use (OpenAI does this for GPTBot), and some crawlers can be checked by reverse DNS, where the requesting IP must resolve to a hostname under the provider's domain. Check each provider's crawler documentation for the current method. WAF configurations can then include verification rules that require both the user-agent string AND a verified origin IP, rejecting spoofed requests.
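For providers that support reverse-DNS verification, the check is usually done in both directions (often called forward-confirmed reverse DNS): resolve the IP to a hostname, confirm the hostname sits under the expected domain, then resolve that hostname back and confirm it returns the same IP. The sketch below is a minimal version under stated assumptions: the function names are hypothetical, the domain to check against must come from the provider's own documentation, and the lookups are injectable so the logic can be exercised without network access.

```python
import socket


def hostname_in_domain(hostname, domain):
    """True if hostname equals domain or is a subdomain of it."""
    hostname = hostname.rstrip(".").lower()
    domain = domain.strip(".").lower()
    return hostname == domain or hostname.endswith("." + domain)


def _reverse_lookup(ip):
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    # A-record lookup: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler_ip(ip, allowed_domain,
                      reverse_lookup=_reverse_lookup,
                      forward_lookup=_forward_lookup):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        hostname = reverse_lookup(ip)
        if not hostname_in_domain(hostname, allowed_domain):
            return False
        # The hostname must resolve back to the same IP, otherwise the
        # PTR record could have been set by whoever controls the IP.
        return ip in forward_lookup(hostname)
    except OSError:
        return False  # lookup failed: treat as unverified


# Offline demonstration with stubbed lookups (all values are made up).
def fake_reverse(ip):
    return "crawl-7.bots.example.com"


def fake_forward(hostname):
    return ["203.0.113.7"]


print(verify_crawler_ip("203.0.113.7", "example.com", fake_reverse, fake_forward))
print(verify_crawler_ip("203.0.113.8", "example.com", fake_reverse, fake_forward))
```

The suffix test matches on a leading dot so that a lookalike such as `evilexample.com` does not pass as `example.com`. The default forward lookup covers IPv4 only; IPv6 would need `socket.getaddrinfo`.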
Does blocking AI bots affect my SEO?
Not directly. AI bots are separate from search engine bots. Blocking GPTBot doesn't affect Googlebot's crawl or Google Search ranking. Google-Extended is the one that causes confusion: it governs Gemini training, so blocking it doesn't affect Google Search but does affect Gemini citation.
What if I want to allow training for some AI providers but not others?
The toggles are independent. A brand can allow Claude (ClaudeBot, anthropic-ai, Claude-Web) while blocking GPTBot, or any other combination. The choice should reflect strategic AI citation priorities and any training-data preferences the brand holds. Most commercial brands allow all major AI bots; some philosophically-conscious brands block training-data crawlers (GPTBot, ClaudeBot, Google-Extended) while allowing real-time retrieval crawlers (ChatGPT-User, OAI-SearchBot, Claude-Web).
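As an illustration of the split described above, the sketch below generates a robots.txt that blocks the training-data crawlers while allowing the real-time retrieval crawlers. The grouping mirrors the two lists in this answer. The `build_robots_txt` helper and its parameters are hypothetical, and the output should be reviewed against each provider's current documentation before use.

```python
# Grouping taken from the answer above.
TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended"]
RETRIEVAL_CRAWLERS = ["ChatGPT-User", "OAI-SearchBot", "Claude-Web"]


def build_robots_txt(allowed, blocked, sitemap_url=None):
    """Build robots.txt text with one explicit group per user agent."""
    lines = ["User-agent: *", "Allow: /", ""]
    for user_agent in allowed:
        lines += [f"User-agent: {user_agent}", "Allow: /", ""]
    for user_agent in blocked:
        lines += [f"User-agent: {user_agent}", "Disallow: /", ""]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines).rstrip() + "\n"


robots_txt = build_robots_txt(
    allowed=RETRIEVAL_CRAWLERS,
    blocked=TRAINING_CRAWLERS,
    sitemap_url="https://yourdomain.com/sitemap.xml",
)
print(robots_txt)
```

Swapping the two lists, or moving individual tokens between them, produces any other combination the brand prefers.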
Companion guides: The AI bot robots.txt complete guide · The Cloudflare GPTBot trap · Schema markup for AI search · The 10-Point AI Citation Framework.