AI bot User-Agent reference: every crawler that matters in 2026

The complete reference for AI bot user agents in 2026. GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, anthropic-ai, Claude-Web, PerplexityBot, Google-Extended, Applebot-Extended, Bytespider, CCBot, cohere-ai, Meta-ExternalAgent. Owner, purpose, default treatment, recommended configuration.

Data for AI Search Editorial Team·June 23, 2026·10 min read

This is the reference document for AI bot user agents that matter for AEO and GEO optimization in 2026. Thirteen user agents account for the AI bot traffic that affects citation behavior across ChatGPT, Perplexity, Claude, Gemini, Grok, Microsoft Copilot, and emerging AI assistants. Each user agent has a distinct purpose (training-data ingestion, real-time retrieval, image indexing), owner, and default treatment under common infrastructure configurations.

Brands optimizing for AI citation need to verify that each user agent can reach the site, and need to know what each one does so that configuration decisions are deliberate rather than accidental. The reference is structured for that use case: a brand auditing its robots.txt, Cloudflare AI Crawl Control panel, or WAF configuration needs to know what each user agent does and which AI assistant's citations depend on it. Per the 10-Point AI Citation Audit, crawler accessibility (Check 1) is the veto: a blocked user agent kills the corresponding citation channel entirely.

The complete AI bot user agent table

User-Agent string	Owner	Purpose	Affects	Default treatment
`GPTBot`	OpenAI	Training data ingestion	ChatGPT citation	Often blocked by Cloudflare default
`ChatGPT-User`	OpenAI	User-action retrieval	ChatGPT real-time	Usually allowed
`OAI-SearchBot`	OpenAI	ChatGPT Search retrieval	ChatGPT Search	Usually allowed
`ClaudeBot`	Anthropic	Training data ingestion	Claude citation	Usually allowed
`anthropic-ai`	Anthropic	Alternative crawler ID	Claude citation	Usually allowed
`Claude-Web`	Anthropic	User-action retrieval	Claude real-time	Usually allowed
`PerplexityBot`	Perplexity	Retrieval + training	Perplexity citation	Sometimes blocked
`Google-Extended`	Google	Gemini training (robots.txt control token)	Gemini	Usually allowed
`Applebot-Extended`	Apple	Apple Intelligence training	Apple Intelligence	Usually allowed
`Bytespider`	ByteDance	Doubao / TikTok training	Doubao + TikTok AI	Sometimes blocked
`CCBot`	Common Crawl	Open web archive	Many AI providers via Common Crawl	Usually allowed
`cohere-ai`	Cohere	Cohere model training	Cohere model citation	Usually allowed
`Meta-ExternalAgent`	Meta	Meta AI training	Meta AI citation	Usually allowed
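
When auditing server logs against the table above, the first step is usually mapping a raw User-Agent header to one of these tokens. The sketch below is a minimal illustration, not a provider-endorsed method: `identify_ai_bot` and the sample header strings are hypothetical, and real headers carry version numbers and URLs that vary by provider. Google-Extended and Applebot-Extended are left out because, per Google's and Apple's crawler documentation, they act as robots.txt control tokens and do not appear in request headers.

```python
from typing import Optional

# Tokens from the table above that appear in HTTP User-Agent headers.
# Google-Extended and Applebot-Extended are omitted: they are robots.txt
# control tokens, so no request arrives carrying those names.
AI_BOT_OWNERS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "anthropic-ai": "Anthropic",
    "Claude-Web": "Anthropic",
    "PerplexityBot": "Perplexity",
    "Bytespider": "ByteDance",
    "CCBot": "Common Crawl",
    "cohere-ai": "Cohere",
    "Meta-ExternalAgent": "Meta",
}


def identify_ai_bot(user_agent_header: str) -> Optional[str]:
    """Return the first known AI bot token found in the header, else None."""
    lowered = user_agent_header.lower()
    for token in AI_BOT_OWNERS:
        if token.lower() in lowered:
            return token
    return None


if __name__ == "__main__":
    # Illustrative header values, not exact strings sent by any provider.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]
    for header in samples:
        token = identify_ai_bot(header)
        owner = AI_BOT_OWNERS.get(token, "not a listed AI bot")
        print(f"{token!s:15} {owner}")
```

Substring matching is case-insensitive here because log formats and intermediaries sometimes alter casing. A header match alone does not prove the request is genuine, since the string can be spoofed.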

Below is the per-user-agent reference.

GPTBot

Owner: OpenAI
Purpose: Training data ingestion for GPT models. This is the primary OpenAI crawler for absorbing web content into future model training cycles.
Affects: ChatGPT citation (training-corpus signal). Critical for long-term ChatGPT optimization.
Default treatment: Often blocked by Cloudflare's AI Crawl Control panel on accounts provisioned 2024 onward. See The Cloudflare GPTBot trap.
Recommended treatment: Allow.
Documentation: https://platform.openai.com/docs/gptbot

Sample robots.txt:

User-Agent: GPTBot
Allow: /

ChatGPT-User

Owner: OpenAI
Purpose: Real-time retrieval triggered by ChatGPT user actions (browsing, fetching specific URLs, performing web actions).
Affects: ChatGPT real-time citation behavior when users explicitly request URL fetches or web actions.
Default treatment: Usually allowed.
Recommended treatment: Allow.

OAI-SearchBot

Owner: OpenAI
Purpose: ChatGPT Search retrieval. This is the crawler that grounds ChatGPT Search responses in real-time web content.
Affects: ChatGPT Search citation behavior. Critical for time-sensitive queries.
Default treatment: Usually allowed.
Recommended treatment: Allow.

ClaudeBot

Owner: Anthropic
Purpose: Training data ingestion for Claude models. The primary Anthropic crawler.
Affects: Claude citation (training-corpus signal).
Default treatment: Usually allowed.
Recommended treatment: Allow.

anthropic-ai

Owner: Anthropic
Purpose: Alternative user agent identifier Anthropic uses in some contexts. Often appears alongside ClaudeBot.
Affects: Claude citation.
Default treatment: Usually allowed.
Recommended treatment: Allow.

Claude-Web

Owner: Anthropic
Purpose: The crawler Claude uses when users explicitly ask the model to fetch a URL or perform a web search.
Affects: Claude real-time citation behavior.
Default treatment: Usually allowed.
Recommended treatment: Allow.

PerplexityBot

Owner: Perplexity
Purpose: Retrieval and training. Perplexity does aggressive real-time crawling for current queries plus periodic training-corpus ingestion.
Affects: Perplexity citation. Critical because Perplexity is dominated by real-time retrieval per the Two-Track Law.
Default treatment: Sometimes blocked by aggressive WAF rules that catch "bot" user agents broadly.
Recommended treatment: Allow.
Note: PerplexityBot can generate high-volume requests for popular domains. Rate limits for this user agent should be generous (more than 1000 requests/hour).
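
Before setting a rate limit for this user agent, it can help to measure what the crawler actually sends. The sketch below is one way to do that, assuming access-log entries have already been parsed into (timestamp, User-Agent header) pairs; `peak_hourly_requests` is a hypothetical helper, and the 1000 requests/hour figure is the guideline from the note above, not a published Perplexity limit.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def peak_hourly_requests(
    events: Iterable[Tuple[datetime, str]], token: str = "PerplexityBot"
) -> int:
    """Return the busiest clock hour's request count for one user-agent token."""
    per_hour: Counter = Counter()
    for timestamp, user_agent in events:
        if token.lower() in user_agent.lower():
            # Truncate to the start of the hour so requests share a bucket.
            hour_bucket = timestamp.replace(minute=0, second=0, microsecond=0)
            per_hour[hour_bucket] += 1
    return max(per_hour.values(), default=0)


if __name__ == "__main__":
    # Tiny illustrative sample; real input would come from parsed access logs.
    sample = [
        (datetime(2026, 6, 1, 10, 5), "Mozilla/5.0 (compatible; PerplexityBot/1.0)"),
        (datetime(2026, 6, 1, 10, 40), "Mozilla/5.0 (compatible; PerplexityBot/1.0)"),
        (datetime(2026, 6, 1, 11, 2), "Mozilla/5.0 (compatible; PerplexityBot/1.0)"),
    ]
    peak = peak_hourly_requests(sample)
    print(f"peak={peak}/hour; a limit below this would throttle the crawler")
```

If the observed peak approaches the configured limit, the limit is likely to throttle the crawler during busy periods.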

Google-Extended

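Owner: Google
Purpose: Controls whether content Google crawls may be used for Gemini training. Google's documentation describes Google-Extended as a standalone robots.txt product token rather than a separate crawler; fetching is done by Google's existing crawlers. Distinct from Googlebot (which is for traditional Google Search).
Affects: Gemini citation. Critical for Track 2 optimization on Google's AI surfaces.
Default treatment: Usually allowed because most sites that allow Googlebot also implicitly allow Google-Extended via User-agent: *.
Recommended treatment: Allow.
Important: Blocking Google-Extended does NOT affect Google Search ranking. Google documents it as governing Gemini training and grounding. AI Overviews are a Google Search feature and follow Googlebot and Search controls rather than Google-Extended.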

Applebot-Extended

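Owner: Apple
Purpose: Controls use of content for Apple Intelligence training. Per Apple's documentation, Applebot-Extended does not crawl pages itself; it is a robots.txt token that determines how data crawled by Applebot may be used. Distinct from Applebot (which is for Spotlight and Siri).
Affects: Apple Intelligence citation. Still an emerging surface as of mid-2026.
Default treatment: Usually allowed.
Recommended treatment: Allow.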

Bytespider

Owner: ByteDance
Purpose: Training data ingestion for ByteDance AI models including Doubao (the Chinese-market ChatGPT competitor) and TikTok AI features.
Affects: Doubao citation + TikTok AI features. Relevant primarily for brands targeting Chinese-speaking markets or with significant TikTok presence.
Default treatment: Sometimes blocked by WAF rules targeting Chinese-origin crawlers broadly.
Recommended treatment: Allow for international brands and for brands with a TikTok strategy; defer or block for brands with strict Chinese-market avoidance policies.

CCBot

Owner: Common Crawl
Purpose: Building the Common Crawl public web archive, which many AI training datasets incorporate.
Affects: Many AI providers indirectly via Common Crawl ingestion.
Default treatment: Usually allowed.
Recommended treatment: Allow. Blocking CCBot is a defensible choice for brands philosophically opposed to AI training, but it reduces the brand's open-web representation across many AI training datasets.

cohere-ai

Owner: Cohere
Purpose: Training data ingestion for Cohere's enterprise-focused language models.
Affects: Cohere model citation. Limited relevance for consumer-facing AEO; significant for enterprise B2B brands where Cohere adoption is meaningful.
Default treatment: Usually allowed.
Recommended treatment: Allow.

Meta-ExternalAgent

Owner: Meta
Purpose: Training data ingestion for Meta AI models including Llama and the consumer-facing Meta AI assistant.
Affects: Meta AI citation.
Default treatment: Usually allowed.
Recommended treatment: Allow.

Sample complete robots.txt

The gold-standard explicit-allow pattern recommended in the AI bot robots.txt complete guide, applied as a complete reference robots.txt:

User-Agent: *
Allow: /

User-Agent: GPTBot
Allow: /

User-Agent: ChatGPT-User
Allow: /

User-Agent: OAI-SearchBot
Allow: /

User-Agent: ClaudeBot
Allow: /

User-Agent: anthropic-ai
Allow: /

User-Agent: Claude-Web
Allow: /

User-Agent: PerplexityBot
Allow: /

User-Agent: Google-Extended
Allow: /

User-Agent: Applebot-Extended
Allow: /

User-Agent: CCBot
Allow: /

User-Agent: Bytespider
Allow: /

User-Agent: cohere-ai
Allow: /

User-Agent: Meta-ExternalAgent
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
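
A robots.txt like the one above can be checked mechanically rather than by eye. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; `audit_robots` and the `SAMPLE` text are hypothetical, and the standard-library parser's matching rules may differ in edge cases from what each crawler actually implements, so treat the result as a first-pass audit only.

```python
from urllib.robotparser import RobotFileParser

# The thirteen tokens from the reference table.
AI_USER_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "anthropic-ai",
    "Claude-Web", "PerplexityBot", "Google-Extended", "Applebot-Extended",
    "Bytespider", "CCBot", "cohere-ai", "Meta-ExternalAgent",
]


def audit_robots(robots_txt: str, path: str = "/") -> dict:
    """Map each AI user-agent token to True if robots_txt lets it fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in AI_USER_AGENTS}


# Hypothetical file with one accidental block, for demonstration.
SAMPLE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /
"""

if __name__ == "__main__":
    for agent, allowed in audit_robots(SAMPLE).items():
        print(f"{agent:20} {'allowed' if allowed else 'BLOCKED'}")
```

To audit a live site, the text of its robots.txt would be fetched first and passed in as `robots_txt`. This only covers robots.txt; blocks applied at the CDN or WAF layer need to be checked separately.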

Frequently asked questions

What about newer or less common AI bots?

The list above covers the AI bots that materially affect citation as of mid-2026. New bots may emerge, such as xAI's crawler for Grok training (user agent currently undocumented) or Mistral's training crawler. The list will be updated as new bots cross meaningful citation-impact thresholds.

Should I differentiate per-section access for AI bots?

Generally no for typical brands. Allowing AI bots to crawl the entire site (except genuinely private paths like admin areas) maximizes citation surface. The exceptions: paywalled content (block AI bots from the paywalled paths to protect the business model), highly sensitive proprietary content, and explicitly opt-out user-generated content.
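
One detail matters when blocking paths such as a paywall: under the Robots Exclusion Protocol, a crawler that has its own named group follows that group and ignores the `User-agent: *` group. A path rule placed only under `*` therefore does not bind a bot that also has an explicit `Allow: /` group. The sketch below illustrates this with Python's standard-library parser; the `/premium/` path and the `allowed` helper are hypothetical. Python's parser applies the first matching rule in a group, which is why the `Disallow` line is listed before `Allow: /`.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if robots_txt permits agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Paywall rule placed only in the catch-all group.
WILDCARD_ONLY = """\
User-agent: *
Disallow: /premium/

User-agent: GPTBot
Allow: /
"""

# Paywall rule repeated inside the bot's own group.
PER_GROUP = """\
User-agent: *
Disallow: /premium/

User-agent: GPTBot
Disallow: /premium/
Allow: /
"""

if __name__ == "__main__":
    # GPTBot has its own group, so the wildcard Disallow is not applied to it.
    print(allowed(WILDCARD_ONLY, "GPTBot", "/premium/article"))
    # With the rule repeated in its group, the paywalled path is blocked.
    print(allowed(PER_GROUP, "GPTBot", "/premium/article"))
    # Non-paywalled content stays reachable.
    print(allowed(PER_GROUP, "GPTBot", "/blog/post"))
```

In practice this means each explicitly listed AI bot group needs its own copy of the path exclusions.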

How do I know if a user agent claiming to be an AI bot is legitimate?

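User agent strings can be spoofed, so verify the request origin as well. Providers document how to do this, usually through published IP ranges or a reverse-DNS pattern; check each provider's current crawler documentation for the method it supports. With reverse DNS, the requesting IP should resolve to a hostname under the provider's domain (for example *.openai.com for GPTBot or *.anthropic.com for ClaudeBot, where the provider documents that pattern), and that hostname should resolve back to the same IP. WAF configurations can include verification rules that require both the user-agent string AND a verified origin IP, rejecting spoofed requests.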
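
The reverse-DNS check described above can be sketched as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname suffix, then resolve the hostname back and confirm it returns the same IP. This is an illustration under assumptions, not any provider's official procedure: `verify_crawler_ip` is a hypothetical helper, the suffixes are the ones named in the text above and should be confirmed against each provider's documentation, and `socket.gethostbyname_ex` covers IPv4 only. The resolver functions are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Sequence


def default_reverse(ip: str) -> str:
    """Reverse-resolve an IP address to its primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def default_forward(hostname: str) -> Sequence[str]:
    """Forward-resolve a hostname to its IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Sequence[str],
    reverse: Callable[[str], str] = default_reverse,
    forward: Callable[[str], Sequence[str]] = default_forward,
) -> bool:
    """Forward-confirmed reverse DNS: True only if both directions agree."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    # Suffixes start with "." so "evilopenai.com" cannot match ".openai.com".
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False


if __name__ == "__main__":
    # Stub resolvers with documentation-range IPs; not real crawler addresses.
    fake_ptr = {"192.0.2.10": "crawl-1.openai.com"}
    fake_a = {"crawl-1.openai.com": ["192.0.2.10"]}
    ok = verify_crawler_ip(
        "192.0.2.10", [".openai.com"],
        reverse=fake_ptr.__getitem__, forward=fake_a.__getitem__,
    )
    print(ok)
```

The forward confirmation step is what defeats an attacker who controls the PTR record for their own IP. Where a provider publishes IP ranges instead, the equivalent check is a membership test against those ranges.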

Does blocking AI bots affect my SEO?

Not directly. AI bots are separate from search engine bots: blocking GPTBot doesn't affect Googlebot's crawl or Google Search ranking. Google-Extended is the case most often confused with search. It governs Gemini training, so blocking it doesn't affect Google Search but does affect Gemini citation.

What if I want to allow training for some AI providers but not others?

The toggles are independent. A brand can allow Claude (ClaudeBot, anthropic-ai, Claude-Web) while blocking GPTBot, or any other combination. The choice should reflect strategic AI citation priorities and any training-data preferences the brand holds. Most commercial brands allow all major AI bots; some brands that object to AI training block the training-data crawlers (GPTBot, ClaudeBot, Google-Extended) while allowing the real-time retrieval crawlers (ChatGPT-User, OAI-SearchBot, Claude-Web).
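
The split between training and retrieval crawlers can be expressed as a small policy-to-robots.txt generator. This is a sketch under assumptions: `build_robots_txt` and `is_allowed` are hypothetical helpers, the two lists are the examples named in the answer above, and the output should be reviewed before being deployed on a real site.

```python
from urllib.robotparser import RobotFileParser

# Example groupings taken from the answer above.
TRAINING_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended"]
RETRIEVAL_CRAWLERS = ["ChatGPT-User", "OAI-SearchBot", "Claude-Web"]


def build_robots_txt(blocked, allowed) -> str:
    """Build robots.txt text: open by default, one explicit group per bot."""
    groups = ["User-agent: *\nAllow: /"]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allowed]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    # A blank line separates groups, matching the sample file's layout.
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Check the generated text with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    policy = build_robots_txt(blocked=TRAINING_CRAWLERS, allowed=RETRIEVAL_CRAWLERS)
    print(policy)
    for agent in TRAINING_CRAWLERS + RETRIEVAL_CRAWLERS:
        print(f"{agent:16} {'allowed' if is_allowed(policy, agent) else 'blocked'}")
```

Generating the file from two lists keeps the policy decision in one place and makes it easy to re-check after each change.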

Companion guides: The AI bot robots.txt complete guide · The Cloudflare GPTBot trap · Schema markup for AI search · The 10-Point AI Citation Framework.