Crawlability for AI Bots: How to Let Them Crawl Your Site

If the bots powering answer engines can't crawl your site, your brand never enters the corpus those models cite. Before you polish your schema or your authority, secure step zero: make sure GPTBot, PerplexityBot and ClaudeBot can actually reach your HTML.

Why Crawlability Is the Prerequisite for Everything Else

An answer engine can only cite what a bot already read. If your server returns 403 to the user-agent feeding ChatGPT, nothing you optimized further down matters — not your schema, not your authority, not your copy.

That separation between crawl and consumption is explicit in vendor documentation. According to OpenAI, its infrastructure uses distinct "crawlers and user agents" for training, ChatGPT search and user-triggered fetches — three different behaviors from the same vendor. Each is allowed or blocked independently.

The pattern repeats elsewhere. According to the Claude help center, Anthropic publishes three distinct bots (ClaudeBot, Claude-User and Claude-SearchBot) "to enable website owner transparency and choice". Whether you end up cited is not a binary decision; you make it per bot type.

This is the technical floor beneath the Searchability framework and the condition for your JSON-LD to even be read. Without crawlability there is no indexation, and without indexation there is no citation.

The Three Bots You Need to Let Through

GPTBot (OpenAI)

According to OpenAI, GPTBot "is used to crawl content that may be used in training" the foundation models. It's the bot you block if you do NOT want your content to enter the training data. The canonical user-agent token is GPTBot and the full string is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot.

OpenAI runs two other bots — OAI-SearchBot, which feeds the ChatGPT search index, and ChatGPT-User, which performs live fetches when a user pastes a URL into the chat. Blocking only GPTBot opts you out of training; blocking the other two removes your site from the answers ChatGPT delivers today. IP ranges live at openai.com/gptbot.json.

PerplexityBot (Perplexity)

According to the Perplexity documentation, PerplexityBot "is designed to surface and link websites in search results on Perplexity. It is not used to crawl content for AI foundation models". Its job is to give you visibility — not train models. Blocking it means giving up appearing in its answers.

The user-agent is PerplexityBot and the full string ends in +https://perplexity.ai/perplexitybot. Verifiable IPs live at perplexity.com/perplexitybot.json. Perplexity also runs Perplexity-User, which serves user-initiated fetches and, per its own documentation, "generally ignores robots.txt rules" because the traffic is user-triggered.

A dated caveat: in August 2025, Cloudflare reported that Perplexity "is using stealth, undeclared crawlers to evade website no-crawl directives". It's a public finding with a date, not a permanent property of the bot. Still, if you block Perplexity on policy, it's worth auditing your logs for undeclared user-agents.

ClaudeBot (Anthropic)

According to the Claude help center, ClaudeBot "helps enhance the utility and safety of our generative AI models by collecting web content that could potentially contribute to their training". It's Anthropic's training bot — blocking it opts your site out of the Claude training set.

Anthropic runs three distinct user-agents: ClaudeBot (training), Claude-User (fetches triggered by user questions) and Claude-SearchBot (indexing for search). Verifiable IPs live at claude.com/crawling/bots.json. Anthropic states explicitly that its bots "respect 'do not crawl' signals by honoring industry standard directives in robots.txt" and supports the Crawl-delay extension.

Bot	User-agent token	What it feeds	Respects robots.txt	Directive to allow it
GPTBot	`GPTBot`	OpenAI model training	Yes (documented)	Do not include `Disallow:`
OAI-SearchBot	`OAI-SearchBot`	ChatGPT search index	Yes (documented)	Do not include `Disallow:`
ChatGPT-User	`ChatGPT-User`	ChatGPT live fetches	Yes (documented)	Do not include `Disallow:`
PerplexityBot	`PerplexityBot`	Perplexity results	Yes (documented)	Do not include `Disallow:`
ClaudeBot	`ClaudeBot`	Anthropic model training	Yes (documented)	Do not include `Disallow:`
Claude-SearchBot	`Claude-SearchBot`	Claude search	Yes (documented)	Do not include `Disallow:`

How to Write Your robots.txt for AI Bots

The syntax is unchanged: one User-agent: line per bot, followed by its Allow: or Disallow: rules, with a blank line between groups. To allow a bot, which is the default case if you want to be cited, simply make sure no Disallow: rule applies to that user-agent. That includes rules under User-agent: *, which cover any bot that has no group of its own.

To limit or block, vendors document the standard pattern. According to the Claude help center, a valid directive to throttle crawling is:

User-agent: ClaudeBot
Crawl-delay: 1

And to block entirely:

User-agent: ClaudeBot
Disallow: /

The same pattern applies to GPTBot and PerplexityBot. What changes is the trade-off — blocking training bots (GPTBot, ClaudeBot) opts your site out of the training set; blocking live indexing bots (PerplexityBot, OAI-SearchBot, Claude-SearchBot) gives up being cited by those engines.

Blocking is not neutral.

Every Disallow: / removes you from that engine's corpus or index. If your strategy is to be cited, what you need is the absence of a Disallow: for those user-agents, not an explicit Allow:.

This runs against an idea that has circulated since 2024: that a new, separate file (llms.txt) replaces robots.txt for AI bots. As we argued in the llms.txt cargo cult, robots.txt remains the canonical standard the documented bots read, and the format the ecosystem is converging on.
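The robots.txt rules in this section can be checked mechanically. The following is a minimal sketch, assuming Python's standard urllib.robotparser is a close enough stand-in for how the vendors' crawlers read the file (their own parsers may differ in edge cases). The function name blocked_ai_bots and the SAMPLE file are illustrative, not part of any vendor tooling.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this article.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "PerplexityBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
]


def blocked_ai_bots(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI bot tokens that this robots.txt text forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [token for token in AI_BOT_TOKENS if not parser.can_fetch(token, url)]


# Hypothetical robots.txt: both training bots blocked, everything else open
# except an /admin/ area.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(blocked_ai_bots(SAMPLE))  # ['GPTBot', 'ClaudeBot']
```

Running this against the text of your production robots.txt gives a quick list of which documented bots are shut out of a given URL, including blocks inherited from a User-agent: * group.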

How to Verify the Bots Are Actually Crawling You

The first place to look is your server logs. Filter by user-agent for GPTBot, PerplexityBot, ClaudeBot, Claude-User, Claude-SearchBot, OAI-SearchBot and ChatGPT-User. If you see no traffic from at least two of them in the last 30 days, a silent block somewhere is the likely cause.
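The log filter can be sketched in a few lines. This assumes plain-text access logs where the User-Agent appears somewhere on each line (as in the common combined format). The sample lines, including the shortened ClaudeBot user-agent, are made up for illustration.

```python
import re
from collections import Counter

AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "PerplexityBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
]

# One case-insensitive pattern that matches any of the tokens.
_TOKEN_RE = re.compile("|".join(re.escape(t) for t in AI_BOT_TOKENS), re.IGNORECASE)
_CANONICAL = {t.lower(): t for t in AI_BOT_TOKENS}


def count_ai_bot_hits(log_lines):
    """Count log lines per AI bot token, using the first token found on each line."""
    hits = Counter()
    for line in log_lines:
        match = _TOKEN_RE.search(line)
        if match:
            hits[_CANONICAL[match.group(0).lower()]] += 1
    return hits


def bots_never_seen(hits):
    """Tokens from AI_BOT_TOKENS that did not appear in the counted lines."""
    return [token for token in AI_BOT_TOKENS if hits[token] == 0]


# Hypothetical log lines (documentation IPs, simplified user-agents).
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.1; +https://openai.com/gptbot"',
    '192.0.2.11 - - [01/Jan/2025:00:00:02 +0000] "GET /pricing HTTP/1.1" 403 0 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.12 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = count_ai_bot_hits(SAMPLE_LOG)
print(dict(hits))
print(bots_never_seen(hits))
```

The match is a substring test on the whole line, so a request path that happens to contain a token would also count. Parsing out the User-Agent field first would be stricter.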

The second check is to verify request IPs against the published ranges. Each vendor maintains a JSON endpoint: openai.com/gptbot.json, perplexity.com/perplexitybot.json and claude.com/crawling/bots.json. Traffic with a known UA but an IP outside the published range is most likely an impersonator. The ranges change frequently, so automate the fetch instead of hardcoding them.
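The range check itself is a small amount of code once the JSON has been downloaded. The sketch below assumes the file has the shape {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}. That shape is an assumption, so inspect each vendor's real file and adapt the keys. Fetching is left out, and the sample ranges are documentation-only addresses, not real vendor ranges.

```python
import ipaddress


def extract_prefixes(document: dict) -> list[str]:
    """Pull CIDR strings out of a ranges document.

    ASSUMPTION: entries live under "prefixes" with "ipv4Prefix" or
    "ipv6Prefix" keys. Adjust if a vendor's file is shaped differently.
    """
    cidrs = []
    for entry in document.get("prefixes", []):
        for key in ("ipv4Prefix", "ipv6Prefix"):
            if key in entry:
                cidrs.append(entry[key])
    return cidrs


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """True if `ip` falls inside any of the CIDR blocks."""
    address = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed lists are safe to scan.
    return any(address in ipaddress.ip_network(cidr, strict=False) for cidr in cidrs)


# Hypothetical ranges document using reserved documentation prefixes.
SAMPLE_DOCUMENT = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}

cidrs = extract_prefixes(SAMPLE_DOCUMENT)
print(ip_in_ranges("192.0.2.44", cidrs))    # True
print(ip_in_ranges("198.51.100.7", cidrs))  # False
```

In practice you would load the vendor file on a schedule, pass the parsed JSON to extract_prefixes, and flag any log line whose user-agent claims to be a bot while its IP falls outside the list.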

The third, if your stack supports it, is forward-confirmed reverse DNS. Do a reverse lookup on the request IP to get a hostname, resolve that hostname back to its IPs, and confirm the original IP is among them. Not every vendor publishes that mechanism, but the IP-JSON cross-check covers the main case.
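The reverse DNS check described above can be sketched with the standard socket module. This version is IPv4-only, because socket.gethostbyname_ex does not return IPv6 addresses. The reverse and forward parameters are hypothetical hooks added so the logic can be exercised without real DNS.

```python
import socket


def forward_confirmed_rdns(ip, reverse=None, forward=None):
    """Return the hostname for `ip` if reverse and forward DNS agree, else None.

    `reverse` maps an IP to a hostname; `forward` maps a hostname to a list
    of IPs. Both default to real DNS lookups (IPv4 only).
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse(ip)
        addresses = forward(hostname)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses:
        # no PTR record, or the hostname does not resolve.
        return None
    return hostname if ip in addresses else None
```

A matching hostname only proves the DNS records are consistent. You still need to compare it against a domain the vendor documents for its crawler, where one is published.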

The Silent Killer — Your WAF or CDN Is Blocking Legitimate Bots

This is the lesson teams struggle with most. The most common block doesn't come from robots.txt — it comes from the WAF or CDN, and the site owner never finds out because the page loads normally in their own browser.

According to the Cloudflare blog, their managed "Block AI bots" rule takes precedence over the rest of Super Bot Fight Mode — including "Allow verified bots". The text is blunt: "if you have enabled Block AI bots and Allow verified bots, verified AI bots will also be blocked". A well-intentioned security toggle can remove you from all three answer engines without telling you.

Bot Fight Mode has an additional behavior. It injects JS Detections into every served page and challenges bot traffic by default — most AI bots don't execute JS, so they fail the challenge silently. From your browser, the symptom is invisible — the page loads fine for you and returns 403 to the bot.
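One way to surface this kind of invisible block is to request the same URL with a browser-like User-Agent and with bot-like ones, then compare status codes. The sketch below uses only the standard library. The BROWSER_UA string and the function names are illustrative. A WAF may also reject a spoofed bot user-agent precisely because it comes from a non-vendor IP, so treat a 403 here as a hint to go look at your firewall events, not as proof.

```python
import urllib.error
import urllib.request

# Illustrative browser-like User-Agent used as the baseline.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises on 4xx/5xx; the code is still the answer


def compare_user_agents(url, bot_user_agents, fetch=fetch_status):
    """Map each bot User-Agent to (browser_status, bot_status) where they differ."""
    baseline = fetch(url, BROWSER_UA)
    differences = {}
    for user_agent in bot_user_agents:
        status = fetch(url, user_agent)
        if status != baseline:
            differences[user_agent] = (baseline, status)
    return differences
```

Calling compare_user_agents with your own URL and a list such as ["GPTBot/1.1", "ClaudeBot/1.0"] returns an empty dict when every user-agent gets the same status as the browser baseline.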

The Madbotz Case — What Happened to Our Own Crawler

Madbotz operates a scanner for the Visibility product with canonical user-agent MadbotzVisibilityBot/1.0 (+https://visibility.madbotz.com/bot). In May 2026, while it was scanning madbotz.com from outside, Cloudflare silently blocked it. The response came back as 403 and never showed up in the application logs behind Cloudflare.

The fix was a custom Cloudflare rule with Action = Skip when the User-Agent contains "MadbotzVisibilityBot", applied to All managed rules, Super Bot Fight Mode and Browser Integrity Check. It worked after propagation in the madbotz.com zone. We documented the full playbook on our bot's public identity page, with allowlist snippets for Cloudflare, Vercel Firewall, AWS WAF, Akamai and Imperva.

⚠️

Important clarification — MadbotzVisibilityBot is a visibility scanner, not an answer-engine bot like GPTBot, PerplexityBot or ClaudeBot. We bring it into this section because the WAF mechanism that blocked it is exactly the one blocking the three AI bots by default. If it happened to you with any legitimate bot, it's happening with the others too.

The transferable allowlist recipe is: custom rule in the WAF, high priority, based on the desired bot's User-Agent, Action = Skip or Allow for the managed rules. Do NOT base the allowlist on IP ranges — AI bots and our scanner use dynamic IPs, and the day they change your rule stops working.
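For Cloudflare specifically, the matching part of such a rule is an expression over the request's User-Agent. The helper below is a sketch that builds one, assuming the Cloudflare Rules language field http.user_agent and its contains operator. Confirm both against Cloudflare's current documentation before pasting the result. The Skip action itself is chosen in the rule's configuration, not in the expression, and the function name is hypothetical.

```python
def user_agent_skip_expression(tokens):
    """Build a rule expression that matches any of the given User-Agent tokens."""
    if not tokens:
        raise ValueError("at least one User-Agent token is required")
    clauses = []
    for token in tokens:
        # Keep the sketch simple: refuse tokens that would need escaping.
        if '"' in token or "\\" in token:
            raise ValueError(f"unsupported character in token: {token!r}")
        clauses.append(f'(http.user_agent contains "{token}")')
    return " or ".join(clauses)


print(user_agent_skip_expression(["GPTBot", "ClaudeBot"]))
# (http.user_agent contains "GPTBot") or (http.user_agent contains "ClaudeBot")
```

Generating the expression from one list of tokens keeps the WAF rule in step with the user-agents you check in robots.txt and in your logs.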

AI Crawlability Checklist

Six concrete steps you can run this week:

1. Review your production robots.txt. Confirm there's no Disallow: / for GPTBot, PerplexityBot, ClaudeBot, Claude-User, Claude-SearchBot, OAI-SearchBot or ChatGPT-User.
2. Filter your server logs for those user-agents in the last 30 days. If you don't see traffic from at least two of them, there's a silent block somewhere.
3. If you use Cloudflare, open Security → Bots. If "Block AI bots" is on, you're already out of all three engines. Create a custom rule with Skip per user-agent for the bots you want to let through.
4. Cross-check the IPs that DO appear in logs against each vendor's official JSON endpoint: openai.com/gptbot.json, perplexity.com/perplexitybot.json and claude.com/crawling/bots.json.
5. If you publish your own bot, document its identity at a public URL (example: your-domain.com/bot) and ship per-WAF allowlist snippets to your customers. It's the only way they won't block you.
6. Re-run the analysis two weeks later. WAF rule propagation and the AI bots' crawl cadence make the effect show up with lag.

Frequently Asked Questions

If I block GPTBot, do I disappear from ChatGPT?

Not entirely — it depends on which bot you block. GPTBot feeds the training set; OAI-SearchBot feeds the ChatGPT search index; ChatGPT-User is the live fetch when someone pastes your URL. Blocking all three removes your site from ChatGPT. Blocking only GPTBot opts you out of training but leaves the rest operating.

Is it true that AI bots ignore robots.txt?

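OpenAI, Anthropic and Perplexity state that their declared crawlers respect it, and they document the behavior in their help centers. User-triggered fetchers are a documented exception: Perplexity says Perplexity-User "generally ignores robots.txt rules". The notable dated exception is Cloudflare's August 2025 report about Perplexity's undeclared crawlers. For the documented crawler user-agents, robots.txt remains the primary control.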

Why does my site load fine but the bots don't cite me?

Almost always your WAF or CDN. Bot Fight Mode, Super Bot Fight Mode, AWS WAF Bot Control and Vercel Firewall challenge bot traffic by default, with no symptom visible from your browser. The page loads normally for you and returns 403 to the bot.

Allowlist by IP or by user-agent?

By user-agent. AI bots and your own scanners use dynamic IPs and publish JSON endpoints that change frequently. A User-Agent-based allowlist lives in one place; an IP-based one requires constant automation to stay current.

Conclusion

Three takeaways:

1. Crawlability is step zero of AI visibility. Without bot access to your HTML, neither schema nor authority does you any good.
2. GPTBot, PerplexityBot and ClaudeBot respect robots.txt per their official documentation. The real control is what happens before that file, at your WAF or CDN.
3. The silent killer is Bot Fight Mode (or its AWS, Vercel or Akamai equivalent). It blocks by default and doesn't show up in application logs. Check it manually.

If you want to know whether your site is exposed to any of these blocks — and to the rest of the 131 check items in the Searchability framework — the Visibility analyzer tells you in under 60 seconds.
