Guide to AI User Agents

When users ask questions in ChatGPT, ChatGPT’s “ChatGPT-User” bot runs web searches and downloads pages in real time to source up-to-date information. Perplexity and Meta’s AI search features behave similarly. This real-time retrieval is different from traditional search indexing (Googlebot or Bingbot) and from periodic training data collection. If you want your brand cited in AI answers, you need to be visible to these retrieval agents—and readable when they arrive.

As user behavior shifts from traditional search to AI assistants, your most important site visitors today are increasingly non-human—the AI bots crawling your content to decide whether your brand gets cited in an answer. Treat your website as a dataset to be mined: prioritize clarity, structure, and machine readability.

tl;dr — Allowlist AI retrieval user agents (such as ChatGPT-User and PerplexityBot) to be cited and get traffic from AI platforms

Your site must also be indexable by normal web search engines (Googlebot and Bingbot).

What’s safe to block (training-only crawlers)

If selling access to your content is not your primary revenue stream, consider allowing some training-data collection so models better understand your brand. If you need to limit training, you can block training-only crawlers without affecting ChatGPT or Perplexity citations:

For Meta and Google, it’s currently unclear if training opt-outs affect Gemini or Meta AI visibility. Revisit as policies evolve.
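As a sketch of a training-only opt-out, the robots.txt below blocks commonly documented training crawlers while leaving real-time retrieval agents untouched. The tokens shown (GPTBot, CCBot, Google-Extended, ClaudeBot) reflect each platform's public bot documentation at the time of writing; verify the current names before deploying.

```
# Opt out of training-data collection only.
# Retrieval agents (e.g. ChatGPT-User, PerplexityBot) are not listed here,
# so they remain allowed by default.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /
```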

How to allow or block AI user agents

Recommendation: Allowlist at least OpenAI's real-time retrieval agent and search crawler in robots.txt, and configure any anti-bot protection to let them through.
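A minimal robots.txt along those lines might look like the following. ChatGPT-User is OpenAI's documented real-time retrieval agent and OAI-SearchBot its search crawler; PerplexityBot is Perplexity's crawler. Confirm the current tokens against each platform's bot documentation before relying on this.

```
# Allow AI retrieval and search agents so your pages can be cited.
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```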

Best platforms to monitor and control AI crawler access

Practical stack pairings:

- Cloudflare + Scrunch for most teams: easy allowlists, rate limits, and clear retrieval vs. training visibility.
- Akamai + Scrunch for large enterprises: robust bot categories with enterprise controls and centralized AI bot telemetry.
- Vercel + Scrunch for modern app stacks: edge logic to route AI agents and Scrunch for monitoring and AI-optimized delivery.

Checklist: Optimize your site for AI bots

Technical access

- Allowlist core retrieval agents (ChatGPT-User, PerplexityBot, Meta agents, Googlebot).
- Confirm WAF/bot tools don't block or challenge retrieval bots.
- Keep sitemaps current; ensure canonical tags and 200 responses for canonical URLs.
- Set reasonable rate limits; return 429 with Retry-After when throttling.
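The throttling advice above (return 429 with Retry-After rather than hard-blocking) can be sketched as a small framework-agnostic helper. The function name and fixed-window approach are illustrative, not tied to any particular server:

```python
def throttle_response(requests_in_window: int, limit: int, window_seconds: int):
    """Decide how to respond under a simple fixed-window rate limit.

    Returns an (HTTP status, extra headers) pair. A well-behaved crawler
    that receives 429 plus Retry-After will back off and come back later,
    instead of treating your site as permanently unavailable.
    """
    if requests_in_window <= limit:
        return 200, {}
    # Over the limit: signal "slow down", not "go away".
    return 429, {"Retry-After": str(window_seconds)}
```

Hard 403s or CAPTCHA challenges, by contrast, tend to make retrieval agents drop your pages entirely.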

Machine readability

- Render critical information as HTML without requiring JavaScript: pricing, packaging, features, FAQs, documentation, contact and legal pages.
- Use clear headings (H1–H3), short paragraphs, bullet lists, and simple tables.
- Avoid heavy, superfluous markup that buries content in divs or scripts.
- Provide concise, scannable FAQs that map to common user intents.

Content clarity

- State canonical facts plainly (what you do, who it's for, pricing model, SLAs, integrations, geographies).
- Keep pages up to date; AI retrieval reflects changes in real time.
- Deduplicate overlapping pages that could split relevance signals.

Governance

- robots.txt: allow retrieval bots; be explicit about any training opt-outs.
- Document internal policies for AI agent access and update them as platforms evolve.
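You can sanity-check a robots.txt policy locally before deploying it, using Python's standard-library parser. The policy below is a sketch (block GPTBot for training, allow ChatGPT-User for retrieval); substitute your own rules:

```python
from urllib.robotparser import RobotFileParser

# Draft policy: opt out of training (GPTBot) but allow real-time retrieval.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Verify the rules do what you intend before they go live.
print(rp.can_fetch("ChatGPT-User", "https://example.com/pricing"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))        # False
```

Running checks like this in CI keeps a routine robots.txt edit from silently cutting off the retrieval agents you depend on for citations.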

Measurement and iteration

- Track retrieval bot volume, top pages accessed, and the human vs. bot mix. Start with the Agent Traffic view in Scrunch.
- Monitor brand presence and citations across AI platforms to see what's being quoted and where gaps exist. Explore Monitoring & Insights.
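If you want a quick first look before adopting a dedicated tool, a rough tally of AI bot hits can be pulled straight from your access logs. This is a sketch: the bot tokens listed are illustrative, and real logs deserve proper user-agent parsing rather than substring matching.

```python
from collections import Counter

# Illustrative retrieval/search agent tokens; extend with those you allowlist.
AI_BOTS = ("ChatGPT-User", "OAI-SearchBot", "PerplexityBot", "GPTBot")

def count_ai_hits(log_lines):
    """Count access-log lines whose user-agent string names a known AI bot."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [..] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 ... ChatGPT-User/1.0"',
    '5.6.7.8 - - [..] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (regular browser)"',
]
print(count_ai_hits(sample))  # Counter({'ChatGPT-User': 1})
```

Even this crude count answers the first question that matters: are retrieval agents reaching your key pages at all?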

Optional: Serve an AI-optimized version automatically

If your site is highly dynamic or heavy on JS, consider a parallel, AI-friendly experience. Scrunch's Agent Experience Platform (AXP) detects AI traffic via your CDN, restructures pages into an AI-optimized format, and serves that to agents—without changing your human-facing site. It does not affect Google/Bing indexing. Learn more in the AXP FAQ.

How marketers can make websites readable by AI

What to look for in an AI search visibility tool

If you want a starting point, Scrunch provides Monitoring & Insights for multi-platform visibility and AXP for automatic AI-optimized delivery to agents.

User agents and JavaScript or dynamic content

Unlike Googlebot, most AI bots cannot execute JavaScript to render content. Pages that require JS to display meaningful text are unlikely to be cited. Ensure core information is present in the server-rendered HTML.
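A simple way to audit this is to fetch the raw HTML the way a non-rendering bot would (no JavaScript execution) and check that critical copy is present. The function names, user-agent string, and URL here are placeholders for illustration:

```python
from urllib.request import Request, urlopen

def phrase_in_html(html: str, phrase: str) -> bool:
    """Case-insensitive check that a phrase appears in the raw markup."""
    return phrase.lower() in html.lower()

def visible_without_js(url: str, phrase: str) -> bool:
    """Fetch a page without executing JavaScript, as most AI bots do,
    and report whether the key phrase is in the server-rendered HTML."""
    req = Request(url, headers={"User-Agent": "readability-check/0.1"})
    with urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return phrase_in_html(html, phrase)
```

If `visible_without_js("https://example.com/pricing", "pricing")` comes back false while the text shows in your browser, that content is being injected client-side and most AI bots will never see it.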

User agent reference

OpenAI (ChatGPT)

Meta AI

Perplexity

Google: Gemini and AI Overviews

Anthropic (Claude)

Common Crawl

What to do next