When users ask questions in ChatGPT, ChatGPT’s “ChatGPT-User” bot runs web searches and downloads pages in real time to source up-to-date information. Perplexity and Meta’s AI search features behave similarly. This real-time retrieval is different from traditional search indexing (Googlebot or Bingbot) and from periodic training data collection. If you want your brand cited in AI answers, you need to be visible to these retrieval agents—and readable when they arrive.
As user behavior shifts from traditional search to AI assistants, your most important site visitors are increasingly non-human: the AI bots crawling your content to decide whether your brand gets cited in an answer. Treat your website as a dataset to be mined, and prioritize clarity, structure, and machine readability.
tl;dr — Allowlist these user agents to be cited and get traffic from AI platforms
Your site must also be indexable by normal web search engines (Googlebot and Bingbot).
ChatGPT
robots.txt identifier: ChatGPT-User
user agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
Meta AI
robots.txt identifiers: meta-externalagent, meta-externalfetcher
user agents: facebookexternalhit/1.1, meta-externalagent/1.1, meta-externalfetcher/1.1
Perplexity
robots.txt identifier: PerplexityBot
user agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
Google AI Overviews
robots.txt identifier: Googlebot
user agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36
Google Gemini
robots.txt identifier: Google-Extended
user agent: uses Googlebot's user agent; the Google-Extended token in robots.txt controls only how crawled data is used, not crawling itself
What’s safe to block (training-only crawlers)
If content monetization is not your primary revenue stream, consider allowing some training-data collection so models better understand your brand. If you do need to limit training, you can block the following without affecting ChatGPT or Perplexity citations:
GPTBot
ClaudeBot
CCBot
For Meta and Google, it’s currently unclear if training opt-outs affect Gemini or Meta AI visibility. Revisit as policies evolve.
How to allow or block AI user agents
robots.txt
Publish per-domain rules that reputable bots follow. You can allow, disallow, or scope access by user agent and path.
Note: Some platforms may still retrieve a specific URL submitted by a user (e.g., generating a preview), even if general crawling is disallowed.
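As an illustrative sketch (adapt the paths and agent list to your own policy), a robots.txt that welcomes real-time retrieval agents while opting out of training-only crawlers might look like this:

```
# Allow real-time retrieval agents so AI answers can cite you
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Opt out of training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Rules are matched per user agent: a bot with its own `User-agent` group ignores the wildcard `User-agent: *` group, so keep any wildcard rules consistent with the specific ones.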
CDN/WAF/bot management
Network-level controls (Cloudflare Bot Management, Imperva, Akamai Bot Manager, AWS WAF Bot Control) enforce allow/deny rules and rate limits using signals beyond the user agent.
If these tools challenge or block retrieval bots, you’ll likely miss citations in AI answers—even when users ask about your brand.
Recommendation: allowlist at least OpenAI’s real-time retrieval agent and search crawler in robots.txt, and configure any anti-bot protection to let them through.
Best platforms to monitor and control AI crawler access
Monitoring (what’s actually hitting your site)
Connect your CDN or host to see AI bot traffic that GA4 won’t capture. Scrunch’s Agent Traffic feature integrates with providers like Cloudflare, Akamai, and Vercel to surface retrieval, indexer, and training visits in real time. See how in the guide on how to track if AI bots are visiting your website.
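If you want a quick first look before wiring up a platform, you can classify raw access-log lines yourself. A minimal sketch in Python (the agent list and sample log lines are illustrative, not exhaustive):

```python
from collections import Counter

# Known AI crawler tokens, grouped by purpose (names from public docs).
AI_AGENTS = {
    "ChatGPT-User": "retrieval",
    "PerplexityBot": "retrieval/index",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "meta-externalagent": "training",
}

def classify_hits(log_lines):
    """Count access-log lines per AI agent token found in the user-agent field."""
    counts = Counter()
    for line in log_lines:
        for token, purpose in AI_AGENTS.items():
            if token in line:
                counts[(token, purpose)] += 1
    return counts

# Hypothetical sample log lines for illustration.
sample = [
    '1.2.3.4 - - "GET /pricing HTTP/1.1" 200 "Mozilla/5.0 ... ChatGPT-User/1.0"',
    '5.6.7.8 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 ... GPTBot/1.1"',
    '1.2.3.4 - - "GET /faq HTTP/1.1" 200 "Mozilla/5.0 ... ChatGPT-User/1.0"',
]
for (token, purpose), n in classify_hits(sample).most_common():
    print(f"{token} ({purpose}): {n}")
```

This separates retrieval visits (which can earn citations) from training visits (which you may choose to block) at a glance.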
Control (who gets in and how)
Cloudflare Bot Management or WAF rules for fine-grained allowlists, challenge bypasses, and rate limiting.
Akamai Bot Manager or Fast DNS/WAF for enterprises already on Akamai.
AWS WAF Bot Control or Imperva for teams standardized on AWS or Imperva.
Vercel Edge Middleware for lightweight header-based routing and bot detection patterns.
Practical stack pairings:
- Cloudflare + Scrunch for most teams: easy allowlists, rate limits, and clear retrieval vs. training visibility.
- Akamai + Scrunch for large enterprises: robust bot categories with enterprise controls and centralized AI bot telemetry.
- Vercel + Scrunch for modern app stacks: edge logic to route AI agents and Scrunch for monitoring and AI-optimized delivery.
Checklist: Optimize your site for AI bots
Technical access
- Allowlist core retrieval agents (ChatGPT-User, PerplexityBot, Meta agents, Googlebot).
- Confirm WAF/bot tools don’t block or challenge retrieval bots.
- Keep sitemaps current; ensure canonical tags and 200 responses for canonical URLs.
- Set reasonable rate limits; return 429 with Retry-After when throttling.
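The throttling bullet above can be handled at the web-server layer. A sketch assuming nginx (zone name and rates are placeholders, not recommendations):

```
# Cap per-IP request rate; well-behaved bots honor 429 and back off.
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    location / {
        limit_req zone=perip burst=10 nodelay;
        # Return 429 instead of nginx's default 503 when the limit is hit.
        limit_req_status 429;
    }
}
```

Attaching a Retry-After header to the 429 is typically done via a custom error page or at the application layer.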
Machine readability
- Render critical information as HTML without requiring JavaScript: pricing, packaging, features, FAQs, documentation, and contact and legal pages.
- Use clear headings (H1–H3), short paragraphs, bullet lists, and simple tables.
- Avoid heavy, superfluous markup that buries content in divs or scripts.
- Provide concise, scannable FAQs that map to common user intents.
Content clarity
- State canonical facts plainly (what you do, who it’s for, pricing model, SLAs, integrations, geographies).
- Keep pages up to date; AI retrieval reflects changes in real time.
- Deduplicate overlapping pages that could split relevance signals.
Governance
- robots.txt: allow retrieval bots; be explicit about any training opt-outs.
- Document internal policies for AI agent access and update as platforms evolve.
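One way to keep these policies honest is to test your published rules before deploying them. A sketch using Python's standard-library parser (the rules below are illustrative):

```python
from urllib import robotparser

# Illustrative policy: allow the retrieval agent, block the training crawler.
RULES = """\
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("ChatGPT-User", "/pricing"))  # True: retrieval agent allowed
print(rp.can_fetch("GPTBot", "/pricing"))        # False: training crawler blocked
```

Running checks like these in CI catches a mistyped agent token before it silently blocks citations.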
Measurement and iteration
- Track retrieval bot volume, top pages accessed, and human vs. bot mix. Start with the Agent Traffic view in Scrunch.
- Monitor brand presence and citations across AI platforms to see what’s being quoted and where gaps exist. Explore Monitoring & Insights.
Optional: Serve an AI-optimized version automatically
- If your site is highly dynamic or heavy on JS, consider a parallel, AI-friendly experience. Scrunch’s Agent Experience Platform (AXP) detects AI traffic via your CDN, restructures pages into an AI-optimized format, and serves that to agents—without changing your human-facing site. It does not affect Google/Bing indexing. Learn more in the AXP FAQ.
How marketers can make websites readable by AI
- Write “answer first”: lead with the direct answer or canonical fact before context.
- Use consistent names for products, plans, and features to reduce ambiguity.
- Provide concise comparison pages that state differentiators in bullet points or simple tables.
- Add FAQ blocks that mirror the real questions users ask in AI chats.
- Keep multimedia supplemental; ensure equivalent text exists on-page.
What to look for in an AI search visibility tool
- Multi-LLM coverage: visibility across ChatGPT, Claude, Gemini, Perplexity, Google AI Mode/Overviews, Microsoft Copilot, Meta AI.
- Brand presence and citation tracking: who’s citing you, on which queries, and with what snippets.
- Agent traffic monitoring: retrieval vs. indexer vs. training, top agents, and top pages.
- AI referrals: attribution of human visits from AI platforms via GA4 integration.
- Site audit for AI readability: detect JS-only content, heavy DOMs, blocked agents, or robots issues.
- Persona-based prompt tracking and segmentation for targeted insights.
- Alerting on gains/losses in presence, citations, or bot activity.
- Evidence exports and shareable reports for stakeholders.
- Optional delivery layer to serve AI-optimized content to agents (e.g., AXP), with safeguards that don’t impact search indexing.
If you want a starting point, Scrunch provides Monitoring & Insights for multi-platform visibility and AXP for automatic AI-optimized delivery to agents.
User agents and JavaScript or dynamic content
Unlike Googlebot, most AI bots cannot execute JavaScript to render content. Pages that require JS to display meaningful text are unlikely to be cited. Ensure core information is present in the server-rendered HTML.
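A quick sanity check is to inspect a page the way a non-JavaScript crawler would and confirm key facts appear in the raw HTML. A Python sketch (the HTML and phrases are hypothetical; in practice you would pass in the body of a real response fetched with, say, curl):

```python
def missing_from_raw_html(html: str, required_phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the raw, pre-JavaScript HTML."""
    lowered = html.lower()
    return [p for p in required_phrases if p.lower() not in lowered]

# Hypothetical server-rendered response body.
html = "<html><body><h1>Pricing</h1><p>Starter plan: $29/month</p></body></html>"
gaps = missing_from_raw_html(html, ["Pricing", "$29/month", "Enterprise SLA"])
print(gaps)  # phrases a non-JS crawler would never see
```

Anything this check flags is content only a JavaScript-executing crawler could read, and is therefore invisible to most AI bots.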
Perplexity
PerplexityBot: used for indexing and real-time retrieval. User agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
Perplexity-User: fetches URLs that users submit directly and, per Perplexity’s documentation, may ignore robots.txt for those requests.
Docs: https://docs.perplexity.ai/guides/bots
Google: Gemini and AI Overviews
Google-Extended: controls training and, if blocked in robots.txt, may affect Gemini “Grounding with Google Search” citations.
AI Overviews follows Googlebot rules; if Google can access content, Overviews generally can as well.