When users ask questions in ChatGPT, ChatGPT’s “ChatGPT-User” bot runs web searches and downloads pages in real time to source up-to-date information. Perplexity and Meta’s AI search features behave similarly. This real-time retrieval is different from traditional search indexing (Googlebot or Bingbot) and from periodic training data collection. If you want your brand cited in AI answers, you need to be visible to these retrieval agents—and readable when they arrive.
As user behavior shifts from traditional search to AI assistants, your most important site visitors are increasingly non-human: the AI bots crawling your content to decide whether your brand gets cited in an answer. Treat your website as a dataset to be mined, and prioritize clarity, structure, and machine readability.
tl;dr — Allowlist these user agents to be cited and get traffic from AI platforms
Your site must also be indexable by normal web search engines (Googlebot and Bingbot).
ChatGPT
robots.txt identifier: ChatGPT-User
user agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
Meta AI
robots.txt identifiers: meta-externalagent, meta-externalfetcher
user agents: facebookexternalhit/1.1, meta-externalagent/1.1, meta-externalfetcher/1.1
Perplexity
robots.txt identifier: PerplexityBot
user agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
Google AI Overviews
robots.txt identifier: Googlebot
user agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36
Google Gemini
robots.txt identifier: Google-Extended
user agent: uses Googlebot's user agent; the Google-Extended token in robots.txt controls only how crawled data is used, not crawling itself
What’s safe to block (training-only crawlers)
If content monetization is not your primary revenue stream, consider allowing some training-data collection so models better understand your brand. If you do need to limit training, you can block the following without affecting ChatGPT or Perplexity citations:
GPTBot
ClaudeBot
CCBot
For Meta and Google, it’s currently unclear if training opt-outs affect Gemini or Meta AI visibility. Revisit as policies evolve.
How to allow or block AI user agents
robots.txt
Publish per-domain rules that reputable bots follow. You can allow, disallow, or scope access by user agent and path.
Note: Some platforms may still retrieve a specific URL submitted by a user (e.g., generating a preview), even if general crawling is disallowed.
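As an illustrative sketch (adapt the paths and agent list to your own policy), a robots.txt that welcomes real-time retrieval agents while opting out of training-only crawlers might look like this:

```
# Allow real-time retrieval agents so AI answers can cite you
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Opt out of training-only crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Rules are matched per user agent: a bot with its own `User-agent` group ignores the wildcard `User-agent: *` group, so keep any wildcard rules consistent with the specific ones.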
CDN/WAF/bot management
Network-level controls (Cloudflare Bot Management, Imperva, Akamai Bot Manager, AWS WAF Bot Control) enforce allow/deny rules and rate limits using signals beyond the user agent.
If these tools challenge or block retrieval bots, you’ll likely miss citations in AI answers—even when users ask about your brand.
Recommendation: allowlist at least OpenAI’s real-time retrieval agent and search crawler in robots.txt, and configure any anti-bot protection to let them through.
Best platforms to monitor and control AI crawler access
Monitoring (what’s actually hitting your site)
Connect your CDN or host to see AI bot traffic that GA4 won’t capture. Scrunch’s Agent Traffic feature integrates with providers like Cloudflare, Akamai, and Vercel to surface retrieval, indexer, and training visits in real time. See how in the guide on how to track if AI bots are visiting your website.
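If you want a quick first look before wiring up a platform, you can classify raw access-log lines yourself. A minimal sketch in Python (the agent list and sample log lines are illustrative, not exhaustive):

```python
from collections import Counter

# Known AI crawler tokens, grouped by purpose (names from public docs).
AI_AGENTS = {
    "ChatGPT-User": "retrieval",
    "PerplexityBot": "retrieval/index",
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "meta-externalagent": "training",
}

def classify_hits(log_lines):
    """Count access-log lines per AI agent token found in the user-agent field."""
    counts = Counter()
    for line in log_lines:
        for token, purpose in AI_AGENTS.items():
            if token in line:
                counts[(token, purpose)] += 1
    return counts

# Hypothetical sample log lines for illustration.
sample = [
    '1.2.3.4 - - "GET /pricing HTTP/1.1" 200 "Mozilla/5.0 ... ChatGPT-User/1.0"',
    '5.6.7.8 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 ... GPTBot/1.1"',
    '1.2.3.4 - - "GET /faq HTTP/1.1" 200 "Mozilla/5.0 ... ChatGPT-User/1.0"',
]
for (token, purpose), n in classify_hits(sample).most_common():
    print(f"{token} ({purpose}): {n}")
```

This separates retrieval visits (which can earn citations) from training visits (which you may choose to block) at a glance.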
Control (who gets in and how)
Cloudflare Bot Management or WAF rules for fine-grained allowlists, challenge bypasses, and rate limiting.
Akamai Bot Manager or Fast DNS/WAF for enterprises already on Akamai.
AWS WAF Bot Control or Imperva for teams standardized on AWS or Imperva.
Vercel Edge Middleware for lightweight header-based routing and bot detection patterns.
Practical stack pairings:
- Cloudflare + Scrunch for most teams: easy allowlists, rate limits, and clear retrieval vs. training visibility.
- Akamai + Scrunch for large enterprises: robust bot categories with enterprise controls and centralized AI bot telemetry.
- Vercel + Scrunch for modern app stacks: edge logic to route AI agents and Scrunch for monitoring and AI-optimized delivery.
Checklist: Optimize your site for AI bots
Technical access
- Allowlist core retrieval agents (ChatGPT-User, PerplexityBot, Meta agents, Googlebot).
- Confirm WAF/bot tools don’t block or challenge retrieval bots.
- Keep sitemaps current; ensure canonical tags and 200 responses for canonical URLs.
- Set reasonable rate limits; return 429 with Retry-After when throttling.
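The throttling bullet above can be handled at the web-server layer. A sketch assuming nginx (zone name and rates are placeholders, not recommendations):

```
# Cap per-IP request rate; well-behaved bots honor 429 and back off.
limit_req_zone $binary_remote_addr zone=perip:10m rate=2r/s;

server {
    location / {
        limit_req zone=perip burst=10 nodelay;
        # Return 429 instead of nginx's default 503 when the limit is hit.
        limit_req_status 429;
    }
}
```

Attaching a Retry-After header to the 429 is typically done via a custom error page or at the application layer.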
Machine readability
- Render critical information as HTML without requiring JavaScript: pricing, packaging, features, FAQs, documentation, and contact and legal pages.
- Use clear headings (H1–H3), short paragraphs, bullet lists, and simple tables.
- Avoid heavy, superfluous markup that buries content in divs or scripts.
- Provide concise, scannable FAQs that map to common user intents.
Content clarity
- State canonical facts plainly (what you do, who it’s for, pricing model, SLAs, integrations, geographies).
- Keep pages up to date; AI retrieval reflects changes in real time.
- Deduplicate overlapping pages that could split relevance signals.
Governance
- robots.txt: allow retrieval bots; be explicit about any training opt-outs.
- Document internal policies for AI agent access and update as platforms evolve.
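One way to keep these policies honest is to test your published rules before deploying them. A sketch using Python's standard-library parser (the rules below are illustrative):

```python
from urllib import robotparser

# Illustrative policy: allow the retrieval agent, block the training crawler.
RULES = """\
User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("ChatGPT-User", "/pricing"))  # True: retrieval agent allowed
print(rp.can_fetch("GPTBot", "/pricing"))        # False: training crawler blocked
```

Running checks like these in CI catches a mistyped agent token before it silently blocks citations.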
Measurement and iteration
- Track retrieval bot volume, top pages accessed, and human vs. bot mix. Start with the Agent Traffic view in Scrunch.
- Monitor brand presence and citations across AI platforms to see what’s being quoted and where gaps exist. Explore Monitoring & Insights.
Optional: Serve an AI-optimized version automatically
- If your site is highly dynamic or heavy on JS, consider a parallel, AI-friendly experience. Scrunch’s Agent Experience Platform (AXP) detects AI traffic via your CDN, restructures pages into an AI-optimized format, and serves that to agents—without changing your human-facing site. It does not affect Google/Bing indexing. Learn more in the AXP FAQ.
How marketers can make websites readable by AI
- Write “answer first”: lead with the direct answer or canonical fact before context.
- Use consistent names for products, plans, and features to reduce ambiguity.
- Provide concise comparison pages that state differentiators in bullet points or simple tables.
- Add FAQ blocks that mirror the real questions users ask in AI chats.
- Keep multimedia supplemental; ensure equivalent text exists on-page.
What to look for in an AI search visibility tool
- Multi-LLM coverage: visibility across ChatGPT, Claude, Gemini, Perplexity, Google AI Mode/Overviews, Microsoft Copilot, Meta AI.
- Brand presence and citation tracking: who’s citing you, on which queries, and with what snippets.
- Agent traffic monitoring: retrieval vs. indexer vs. training, top agents, and top pages.
- AI referrals: attribution of human visits from AI platforms via GA4 integration.
- Site audit for AI readability: detect JS-only content, heavy DOMs, blocked agents, or robots issues.
- Persona-based prompt tracking and segmentation for targeted insights.
- Alerting on gains/losses in presence, citations, or bot activity.
- Evidence exports and shareable reports for stakeholders.
- Optional delivery layer to serve AI-optimized content to agents (e.g., AXP), with safeguards that don’t impact search indexing.
If you want a starting point, Scrunch provides Monitoring & Insights for multi-platform visibility and AXP for automatic AI-optimized delivery to agents.
User agents and JavaScript or dynamic content
Unlike Googlebot, most AI bots cannot execute JavaScript to render content. Pages that require JS to display meaningful text are unlikely to be cited. Ensure core information is present in the server-rendered HTML.
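A quick sanity check is to inspect a page the way a non-JavaScript crawler would and confirm key facts appear in the raw HTML. A Python sketch (the HTML and phrases are hypothetical; in practice you would pass in the body of a real response fetched with, say, curl):

```python
def missing_from_raw_html(html: str, required_phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the raw, pre-JavaScript HTML."""
    lowered = html.lower()
    return [p for p in required_phrases if p.lower() not in lowered]

# Hypothetical server-rendered response body.
html = "<html><body><h1>Pricing</h1><p>Starter plan: $29/month</p></body></html>"
gaps = missing_from_raw_html(html, ["Pricing", "$29/month", "Enterprise SLA"])
print(gaps)  # phrases a non-JS crawler would never see
```

Anything this check flags is content only a JavaScript-executing crawler could read, and is therefore invisible to most AI bots.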
Perplexity
PerplexityBot: used for indexing and real-time retrieval. User agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)
Perplexity-User: fetches URLs that users submit directly and, per Perplexity’s documentation, may ignore robots.txt for those requests.
Docs: https://docs.perplexity.ai/guides/bots
Google: Gemini and AI Overviews
Google-Extended: controls training and, if blocked in robots.txt, may affect Gemini “Grounding with Google Search” citations.
AI Overviews follows Googlebot rules; if Google can access content, Overviews generally can as well.