You see GPTBot, ClaudeBot, or PerplexityBot in your logs and you have to make a call: allow it, rate-limit it, or block it. But before any of that, you have to answer a more basic question — is it actually that bot? Because a User-Agent header is a claim, not proof. Anyone can run curl -A "GPTBot/1.3" https://your-site.com and your logs will dutifully record "GPTBot."
This isn't theoretical. HUMAN Security's Satori team found that 5.7% of all traffic claiming to be a well-known AI crawler was spoofed, peaking at 7.7% on some days (~2 million requests/day), with the ChatGPT-User agent impersonated in roughly one of every six requests. This guide covers the three real ways to verify a crawler at the IP and crypto layers — forward-confirmed reverse DNS, operator-published CIDR allowlists, and Web Bot Auth — plus the failure modes that trip up almost every other guide.
Why a User-Agent of "GPTBot" proves nothing
The User-Agent is self-reported metadata. There is no authentication in it, nothing signed, nothing an origin server can check. Verifying a crawler means tying the request to infrastructure or a key the operator actually controls. There are exactly three regimes that do that:
- Forward-confirmed reverse DNS (FCrDNS): prove the IP resolves to and from the operator's domain.
- Published CIDR allowlists: match the IP against ranges the operator publishes.
- Web Bot Auth (built on RFC 9421 HTTP Message Signatures): verify a cryptographic signature on the request itself.
Everything below is about doing these correctly, and knowing what each one can and cannot prove.
The 2026 stakes: bots are the majority of the web
Crawler verification stopped being an SEO niche the moment automated traffic became the bulk of the internet. Imperva's 2025 Bad Bot Report found automated traffic hit 51% of all web traffic in 2024 — the first time it surpassed humans in a decade — with bad bots at 37% and 21% of bot attacks riding residential proxies to blend into consumer IP space. When a fifth of attacks come from residential addresses and AI crawlers are exploding in volume, "is this bot real?" becomes a question at your login, signup, and checkout endpoints — not just your robots.txt.
Method 1 — Forward-confirmed reverse DNS, done right
FCrDNS is the canonical method Google documents for verifying Googlebot. Four steps:
```shell
# 1. Reverse-DNS the source IP → hostname
host 66.249.66.1
# → 1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com

# 2. Confirm the suffix matches the operator
#    Googlebot: googlebot.com, google.com, or googleusercontent.com
#    Bingbot:   *.search.msn.com

# 3. Forward-DNS that hostname → IP
host crawl-66-249-66-1.googlebot.com
# → crawl-66-249-66-1.googlebot.com has address 66.249.66.1

# 4. Confirm it matches the original IP. Match = verified.
```
Why the forward step is load-bearing: an attacker controls the reverse (PTR) record for their own IP, so they can point 1.2.3.4 at crawl-fake.googlebot.com. What they cannot do is forge googlebot.com's authoritative forward DNS to make that hostname resolve back to 1.2.3.4. A check that only string-matches the PTR record for "googlebot.com" — which a surprising number of tutorials suggest — is itself spoofable. The forward confirmation is the whole point.
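The four steps can be sketched in Python with the standard library. This is a minimal illustration rather than a hardened implementation: the function names, the three result strings, and the injectable `reverse` and `forward` resolvers (there so the logic can be exercised without live DNS) are all assumptions of this sketch, and the suffix table only covers the two operators named in the steps above.

```python
import ipaddress
import socket

# Suffixes taken from the steps above; extend per operator as needed.
OPERATOR_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com", ".googleusercontent.com"),
    "bingbot": (".search.msn.com",),
}


def system_reverse(ip):
    """Return the PTR hostname for ip, or None when the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def system_forward(hostname):
    """Return the set of addresses the hostname resolves to (A and AAAA)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    # Drop any IPv6 scope suffix such as "%eth0" before comparison.
    return {info[4][0].split("%")[0] for info in infos}


def fcrdns(ip, suffixes, reverse=system_reverse, forward=system_forward):
    """Return 'verified', 'failed', or 'unverifiable' for a claimed crawler IP."""
    hostname = reverse(ip)
    if not hostname:
        return "unverifiable"  # no PTR record: neither pass nor fail
    hostname = hostname.rstrip(".").lower()
    # Step 2: suffixes start with a dot, so "evilgooglebot.com" does not match.
    if not any(hostname.endswith(suffix) for suffix in suffixes):
        return "failed"
    # Steps 3-4: the forward lookup must include the original IP.
    original = ipaddress.ip_address(ip)
    resolved = {ipaddress.ip_address(addr) for addr in forward(hostname)}
    return "verified" if original in resolved else "failed"
```

Calling `fcrdns("66.249.66.1", OPERATOR_SUFFIXES["googlebot"])` performs live lookups. A spoofed PTR that names googlebot.com passes the suffix check but fails at the forward step, which is the case a PTR-only string match misses.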
The accuracy ceiling to flag: PTR records are optional and frequently absent, especially for the newer AI vendors. When there's no PTR, FCrDNS returns "unverifiable" — which is neither pass nor fail. That's where Method 2 comes in.
Method 2 — Published CIDR allowlists, per bot
Most AI vendors skip PTR records and instead publish their crawler IP ranges as JSON. This is the most operationally practical check. Pull each operator's primary-source feed:
- OpenAI: per-bot files at openai.com/gptbot.json, openai.com/searchbot.json, and openai.com/chatgpt-user.json (OpenAI crawler docs).
- Google: five CIDR files under developers.google.com/static/crawling/ipranges/.
- Anthropic: publishes fixed ranges in its platform docs (e.g. inbound 160.79.104.0/23), and explicitly lists retired /32s to remove from allowlists.
- Perplexity: perplexitybot.json and perplexity-user.json.
- Apple iCloud Private Relay: an egress CSV at mask-api.icloud.com/egress-ip-ranges.csv (Apple docs). Note: these are humans on a privacy relay, not bots, so allowlist them rather than blocking them.
Three rules that separate a correct implementation from a broken one:
- Pin each bot to its own file. GPTBot lives in gptbot.json, not "some OpenAI range." Matching the wrong file conflates a training crawler with a user-initiated fetcher.
- Match the exact published prefix, never the hosting ASN. Allowlisting "all of AWS" or "all of Google Cloud" because a bot happens to run there will wave through every attacker on the same cloud.
- Refresh on a schedule. These ranges change; treat them as a feed, not a constant.
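A per-token prefix match might look like the sketch below. It assumes each feed is JSON with a `prefixes` array whose entries carry an `ipv4Prefix` or `ipv6Prefix` key; check that shape against each operator's actual file before relying on it. The sample feeds are hypothetical and use documentation ranges, not real crawler IPs.

```python
import ipaddress
import json

# Hypothetical sample feeds in the assumed JSON shape.
# The prefixes are documentation ranges (RFC 5737 / RFC 3849), not real bot IPs.
SAMPLE_FEEDS = {
    "GPTBot": '{"prefixes": [{"ipv4Prefix": "192.0.2.0/28"}]}',
    "ChatGPT-User": (
        '{"prefixes": [{"ipv4Prefix": "198.51.100.0/24"},'
        ' {"ipv6Prefix": "2001:db8::/48"}]}'
    ),
}


def parse_feed(raw_json):
    """Turn one operator feed into a list of ip_network objects."""
    networks = []
    for entry in json.loads(raw_json).get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if prefix:
            networks.append(ipaddress.ip_network(prefix, strict=False))
    return networks


def build_allowlist(feeds):
    """Map each bot token to its own networks; tokens are never merged."""
    return {token: parse_feed(raw) for token, raw in feeds.items()}


def in_published_range(ip, token, allowlist):
    """True only if ip falls inside a prefix published for this exact token."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in allowlist.get(token, []))


ALLOWLIST = build_allowlist(SAMPLE_FEEDS)
print(in_published_range("192.0.2.5", "GPTBot", ALLOWLIST))      # True
print(in_published_range("198.51.100.20", "GPTBot", ALLOWLIST))  # False
```

The second lookup is false even though the address sits in the ChatGPT-User sample range, which is the point of pinning. Rebuilding `ALLOWLIST` from freshly fetched files on a timer covers the refresh rule.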
Vendor asymmetry: which method even exists
Most guides blur the vendors together. They differ — both in which verification method exists and in the crawler tokens each ships, which carry different policy implications:
| Operator | FCrDNS | Published CIDR | Web Bot Auth | Notable tokens (different policies) |
|---|---|---|---|---|
| Google | ✅ (canonical) | ✅ | ⚠️ experimental | Googlebot vs user-triggered fetchers |
| OpenAI | ✕ (mostly no PTR) | ✅ per-bot | emerging | GPTBot (training) · ChatGPT-User (user fetch) · OAI-SearchBot |
| Anthropic | ✕ | ✅ | emerging | ClaudeBot · Claude-User · Claude-SearchBot |
| Perplexity | ✕ | ✅ (disputed) | — | PerplexityBot · Perplexity-User |
The nuance that breaks naive pipelines: user-directed fetchers behave like real visitors and may legitimately originate outside published bot ranges. When a user asks an assistant to open a specific URL, that fetch can come from general cloud infrastructure with a normal browser User-Agent. Blocking "not in the GPTBot range" can therefore block a real user's agent action. Rate-limit and block policy has to differ per token, not per vendor.
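One way to keep policy per token rather than per vendor is a small lookup table. The sketch below uses the three OpenAI tokens from the table above; the `TokenPolicy` type and the specific actions are placeholder choices for illustration, not recommendations from any vendor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPolicy:
    kind: str          # "training", "search", or "user-fetch"
    out_of_range: str  # action when the IP is outside the published range


# Tokens come from the table above; the actions are placeholder choices.
POLICIES = {
    "GPTBot": TokenPolicy("training", "challenge"),
    "OAI-SearchBot": TokenPolicy("search", "challenge"),
    "ChatGPT-User": TokenPolicy("user-fetch", "rate-limit"),
}


def action_for(token, in_range):
    """Pick an action for a claimed token given the pinned CIDR result."""
    policy = POLICIES.get(token)
    if policy is None:
        return "challenge"  # unknown token: declared but unverifiable here
    if in_range:
        return "allow"
    # Out of range is inconclusive, so the fallback depends on the token:
    # a user-directed fetcher gets a softer response than a crawler.
    return policy.out_of_range
```

Here an out-of-range ChatGPT-User request is rate-limited while an out-of-range GPTBot claim is challenged, and neither is hard-blocked on list absence alone.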
Method 3 — Web Bot Auth: cryptographic verification
The structural fix to spoofing is to stop trusting the network position and verify a signature instead. Web Bot Auth has the agent sign each request with an Ed25519 key using RFC 9421 HTTP Message Signatures (a Proposed Standard, Feb 2024). The request carries:
- Signature: the signature value
- Signature-Input: covered components, created/expires, keyid, and tag="web-bot-auth"
- Signature-Agent: optional, names the signing agent
Keys are discovered as a JWKS at /.well-known/http-message-signatures-directory (built on the RFC 8615 well-known URI mechanism). Verification: fetch the directory, match the keyid thumbprint, verify the signature and the created/expires window. The advantage over IP methods is decisive — it's robust to shared, rotating, and CGNAT addresses, because identity rides the request, not the route.
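Before any cryptography runs, the cheap checks on Signature-Input can be sketched with the standard library. The code below is a simplified illustration. It handles a single signature label with string or integer parameters rather than implementing a full Structured Fields parser. It also assumes `created`, `expires`, and `keyid` are all required, and the `max_skew` allowance is a hypothetical parameter. The Ed25519 verification itself, which needs a crypto library and the RFC 9421 signature base, is deliberately not shown.

```python
import re
import time

# Matches ;name=123 or ;name="text" parameters after the component list.
PARAM_RE = re.compile(r';\s*([a-z][a-z0-9_\-]*)=("([^"]*)"|\d+)')


def parse_signature_input(header_value):
    """Parse one 'label=(components);param=value...' member into a dict."""
    label, _, rest = header_value.partition("=")
    match = re.match(r'\(([^)]*)\)', rest)
    if not match:
        raise ValueError("missing covered-components list")
    components = re.findall(r'"([^"]*)"', match.group(1))
    params = {}
    for name, raw, quoted in PARAM_RE.findall(rest[match.end():]):
        params[name] = quoted if raw.startswith('"') else int(raw)
    return {"label": label.strip(), "components": components, "params": params}


def precheck(parsed, now=None, max_skew=60):
    """Cheap checks to run before the (not shown) signature verification."""
    now = time.time() if now is None else now
    params = parsed["params"]
    if params.get("tag") != "web-bot-auth":
        return "wrong-tag"
    if "keyid" not in params:
        return "no-keyid"
    created, expires = params.get("created"), params.get("expires")
    if created is None or expires is None:
        return "no-window"
    if created > now + max_skew:
        return "not-yet-valid"
    if expires <= now:
        return "expired"
    return "ok"
```

Only a request that returns `"ok"` here would go on to the directory fetch, the `keyid` match, and the signature check. Anything else can be treated as unsigned and handed to the IP-based methods.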
Reality check: it's experimental, not a standard
Be honest with your architecture here. Web Bot Auth is specified in IETF Internet-Drafts (draft-meunier-web-bot-auth-architecture, 2 Mar 2026, Cloudflare + Google authors) that carry the boilerplate "not endorsed by the IETF and has no formal standing" and expire 3 Sep 2026. Google's own support is explicitly experimental — it is "not yet signing every request" and tells operators to keep using IP, reverse DNS, and User-Agent verification. At the infrastructure layer it's further along: Cloudflare folded Message Signatures into its Verified Bots Program (Jul 2025) and AWS WAF added support (Nov 2025). Verdict: prefer signatures where present, but do not depend on them yet.
The layered decision pipeline (copy this)
Combine the three methods into a precedence order, then enrich and classify:
```text
1. Parse the claimed bot token from the User-Agent (treat as a HINT only).
2. If a Web Bot Auth signature is present → verify against the JWKS.
     PASS → verified-good (strongest; IP-independent). Done.
3. Else match source IP against the vendor's published CIDR file,
   pinned to that specific token.
     IN RANGE → verified-good.
4. Else run FCrDNS where a PTR record exists.
     FORWARD-CONFIRMED → verified-good.
5. Enrich the IP (this is where an IP-intelligence API slots in):
     ASN / owner, hosting/datacenter flag, VPN/proxy/residential-proxy/Tor,
     and an IP risk score.
6. Classify:
     verified-good           → allow
     declared-but-unverified → rate-limit or challenge (NOT hard block)
     undeclared automation   → challenge / block by behavior + provenance
```
Step 5 is where GeoIPHub fits: when a self-identified bot's IP isn't in a published range and has no PTR, the ASN owner, datacenter/hosting flag, proxy and residential-proxy detection, and a risk score are what let you score the inconclusive cases at login, signup, and checkout instead of guessing.
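The precedence order can be sketched as a small classifier. The `Evidence` fields, the `looks_automated` flag standing in for behavior plus enrichment signals, and the extra `ordinary-traffic` bucket are assumptions of this sketch. The pipeline does not say what to do with a signature that is present but invalid, so treating it as unverified instead of falling through to the IP checks is a choice made here, not a rule from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    claimed_token: Optional[str]            # parsed from User-Agent: a hint only
    signature_valid: Optional[bool] = None  # None means no signature present
    in_published_range: bool = False        # CIDR match pinned to claimed_token
    fcrdns: str = "unverifiable"            # "verified", "failed", "unverifiable"
    looks_automated: bool = False           # hypothetical behavior/provenance flag


def classify(ev):
    """Return (bucket, action) following the precedence order above."""
    if ev.claimed_token:
        if ev.signature_valid is True:
            return ("verified-good", "allow")
        if ev.signature_valid is False:
            # Assumption: a bad signature is not rescued by an IP match.
            return ("declared-but-unverified", "challenge")
        if ev.in_published_range or ev.fcrdns == "verified":
            return ("verified-good", "allow")
        # Declared but nothing confirmed it: soften, never hard-block.
        return ("declared-but-unverified", "challenge")
    if ev.looks_automated:
        return ("undeclared-automation", "block")
    return ("ordinary-traffic", "allow")
```

For example, `classify(Evidence("GPTBot", in_published_range=True))` returns `("verified-good", "allow")`, while a bare `Evidence("GPTBot")` lands in the declared-but-unverified bucket and is challenged, not blocked.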
Failure modes and false positives
This is what most top-ranking guides get wrong:
- Out-of-range ≠ fake. The Perplexity case (Cloudflare, Aug 2025) showed a crawler using a generic Chrome UA, IPs outside its published range, and rotating ASNs at 3–6M requests/day. In-range is a true positive; out-of-range is inconclusive, not proof of fakery.
- CGNAT geographic bias. Cloudflare research in late 2025 (via The Register) found CGNAT IPs were rate-limited ~3× more often despite lower bot signals — hundreds of users share one IPv4, concentrated in Africa and Asia. Naive IP blocking is a fairness problem.
- iCloud Private Relay. Shared, rotating egress IPs carrying real humans — allowlist via Apple's CSV.
- Residential proxies. They defeat ASN allowlisting entirely (21% of bot attacks use them), which is exactly why detecting anonymized traffic without blocking real customers is a behavioral problem, not a list lookup.
The rule: reserve hard blocks for behavior plus provenance, never for absence from a list alone.
Where this fits in the GIVT/SIVT framework
The neutral industry taxonomy comes from the IAB Tech Lab and the Media Rating Council. Declared bots, crawlers, non-browser user agents, and known datacenter-origin IPs (the MRC names AWS, Google, and Microsoft) are General Invalid Traffic (GIVT) — filterable by lists. Spoofed and evasive bots are Sophisticated Invalid Traffic (SIVT) — requiring advanced detection and human review. Verifying a self-identified crawler against published ranges, reverse DNS, or a signature is precisely how you separate genuine declared bots (GIVT you can allow) from the SIVT impersonating them. It maps cleanly onto the three pipeline buckets above.
Verify AI crawlers and agents with GeoIPHub
GeoIPHub supplies the step-5 enrichment the geo-IP incumbents leave out of their how-to content: ASN and connection-type, hosting/datacenter detection, VPN / proxy / residential-proxy / Tor flags, and an IP fraud-risk score — so you can confirm a self-identified bot actually originates from vendor infrastructure, and score the inconclusive cases instead of blunt-blocking them. Pair it with the verification methods above and the related guides on account-takeover and credential-stuffing detection and how accurate IP geolocation really is.
FAQ
How do I verify that a request is really from GPTBot and not a spoof?
Don't trust the User-Agent header — it is trivially forged. Match the source IP against OpenAI's published per-bot range file (openai.com/gptbot.json), pinning the check to GPTBot specifically rather than a generic OpenAI range. If a valid Web Bot Auth signature is present, verify it against OpenAI's key directory first, since that is IP-independent. Treat an IP that is not in the published range as inconclusive, not proven fake, because user-directed agent traffic can originate from other infrastructure.
What is forward-confirmed reverse DNS (FCrDNS)?
FCrDNS is a four-step check: reverse-DNS the source IP to a hostname, confirm the hostname ends in the operator's domain, forward-DNS that hostname, then confirm it resolves back to the original IP. The forward step is what defeats spoofing — an attacker can set a fake PTR record on their own IP, but cannot forge the operator's authoritative forward DNS. A check that only inspects the PTR string is spoofable.
Why can't I just block AI crawlers by User-Agent string?
Because the User-Agent is a claim anyone can send. HUMAN Security found 5.7% of all traffic labeled as a well-known AI crawler was spoofed, peaking at 7.7% on some days, with the ChatGPT-User agent impersonated in roughly one of every six requests. Acting on the header alone fails in both directions: an allow rule waves spoofed bots through, and a block rule does nothing to stop an operator who simply stops declaring itself.
What is Web Bot Auth and is it a finished standard?
Web Bot Auth is an emerging method where a bot cryptographically signs each HTTP request with an Ed25519 key using RFC 9421 HTTP Message Signatures, and the origin verifies it against keys published at /.well-known/http-message-signatures-directory. RFC 9421 itself is a Proposed Standard (Feb 2024), but Web Bot Auth is still an IETF Internet-Draft with no formal standing as of mid-2026. Treat it as experimental: verify signatures where present, but keep IP and reverse-DNS verification as the baseline.
Where do AI vendors publish their crawler IP ranges?
OpenAI publishes per-bot JSON at openai.com/gptbot.json, searchbot.json, and chatgpt-user.json; Google publishes CIDR JSON under developers.google.com/static/crawling/ipranges/; Anthropic publishes its ranges in its platform docs; Perplexity publishes perplexitybot.json and perplexity-user.json. Pin each check to the specific bot's file, match exact published prefixes rather than the hosting ASN, and refresh the files on a schedule.
If an IP is not in the vendor's published range, does that mean the request is fake?
No. An in-range IP, a passing FCrDNS, or a valid signature is a high-confidence true positive, but an out-of-range IP is inconclusive. It could be legitimate user-directed agent traffic from other infrastructure, a CGNAT-shared address, an iCloud Private Relay egress, a vendor that publishes no PTR records, or a stale range file on your side. Reserve hard blocks for behavioral and provenance signals, not for absence from a list alone.
Won't blocking crawler IP ranges affect real users on shared addresses?
Yes, which is why IP-absence should not trigger hard blocks. Cloudflare research in late 2025 found CGNAT addresses were rate-limited about three times more often than non-CGNAT ones despite showing lower bot indicators, because hundreds to thousands of users share one IPv4 address — and CGNAT is concentrated in parts of Africa and Asia, creating geographic bias. iCloud Private Relay egress IPs likewise carry real humans and should be allowlisted using Apple's published CSV.
