81.8% of My “AI Assistant” Traffic Was Fake. The Googlebot Number Was Worse.
Show Notes: 81.8% of My “AI Assistant” Traffic Was Fake
Episode summary
Over two weeks, a brand-new website with zero promotion behind it logged thirty-three visits from AI assistants. Only six were real. The rest were lying about who they were, and the Googlebot numbers were worse. In this episode I walk through exactly what I found in my own server logs, how I proved each finding past the point of doubt, and the simple method you can run on your own logs this week to see your real numbers. We cover why a bot’s name is a claim and not an identity, the difference between bots that fetch you for a live answer and bots that crawl you to train tomorrow’s models, the one crawler I had to chase four different ways to nail down, and the one major player you structurally cannot measure at all.
What you’ll learn:
• Why the bot names in your analytics are a “claims to be” number, not a real one, and the one check that fixes it.
• The 81.8 percent spoof rate hiding in live AI-assistant traffic, and how the fakes gave themselves away.
• Why Googlebot showed 87 percent impersonation, and why that is an old story, not a new one.
• The difference between retrieval crawlers (today’s visibility) and training crawlers (whether the model knows you tomorrow).
• A repeatable, four-step way to settle any bot you cannot verify on the first pass.
• Why Gemini is the one source you cannot measure by name, and how that rhymes with Google’s old “(not provided)” move.
The numbers, at a glance:
• Live AI-assistant fetches: 33 claimed, 6 verified, 27 spoofed. An 81.8 percent spoof rate among the requests that could be checked.
• Googlebot: 799 claimed, 107 verified, 692 spoofed. Roughly 87 percent not Google.
• Most active verified crawlers: Anthropic’s ClaudeBot 166, Googlebot 107, OpenAI’s GPTBot 46, OpenAI’s search crawler 40.
• CCBot (Common Crawl): 20 claimed, 0 verified. Confirmed as impostors across four independent checks.
A reminder: these numbers come from two weeks on one small, new site. The method is the point, not my totals.
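If you want to reproduce the percentages above from your own counts, the arithmetic is just (claimed minus verified) divided by claimed. A minimal sketch using the totals from this episode; the function name `spoof_rate` is mine, not part of any tool:

```python
def spoof_rate(claimed: int, verified: int) -> float:
    """Percentage of claimed requests that failed verification."""
    if claimed <= 0:
        raise ValueError("claimed must be a positive count")
    if not 0 <= verified <= claimed:
        raise ValueError("verified must be between 0 and claimed")
    return 100.0 * (claimed - verified) / claimed


# (claimed, verified) pairs taken from the numbers listed above.
counts = {
    "Live AI-assistant fetches": (33, 6),
    "Googlebot": (799, 107),
    "CCBot": (20, 0),
}

for name, (claimed, verified) in counts.items():
    spoofed = claimed - verified
    print(f"{name}: {spoofed} spoofed, {spoof_rate(claimed, verified):.1f}%")
```

Swap in your own claimed and verified counts to get your two numbers.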
The published IP-range lists (verify your own logs)
These are the first-party files each operator publishes. A request is only legitimate if its source IP falls inside the matching list. Each link goes straight to the source.
OpenAI
ChatGPT-User (live user fetch): https://openai.com/chatgpt-user.json
OAI-SearchBot (search / retrieval): https://openai.com/searchbot.json
GPTBot (training): https://openai.com/gptbot.json
Anthropic (one file covers all of their bots, including ClaudeBot and Claude-User)
https://claude.com/crawling/bots.json
Perplexity
Perplexity-User (live user fetch): https://www.perplexity.com/perplexity-user.json
PerplexityBot (crawler): https://www.perplexity.com/perplexitybot.json
Google (note: Google moved these to the /crawling/ipranges/ path in 2026, and the old URLs fail quietly)
Common crawlers, including Googlebot: https://developers.google.com/static/crawling/ipranges/common-crawlers.json
Special-case crawlers: https://developers.google.com/static/crawling/ipranges/special-crawlers.json
User-triggered agents: https://developers.google.com/static/crawling/ipranges/user-triggered-agents.json
Common Crawl
CCBot: https://index.commoncrawl.org/ccbot.json
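Once you have downloaded one of these files, the check itself is a CIDR membership test. Here is a rough sketch, assuming the file follows the layout Google uses for its range files (a top-level "prefixes" list whose entries carry an "ipv4Prefix" or "ipv6Prefix" key). Confirm the layout of each operator's file before relying on it. The sample ranges below are reserved documentation addresses, not anyone's real crawler ranges.

```python
import ipaddress


def load_networks(doc: dict) -> list:
    """Turn a parsed range file into a list of network objects.

    Assumes the {"prefixes": [{"ipv4Prefix": ...}, {"ipv6Prefix": ...}]} layout.
    """
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_ranges(ip: str, networks: list) -> bool:
    """True if the source IP falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)


# Placeholder data in the assumed layout (documentation ranges only).
SAMPLE_DOC = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}

networks = load_networks(SAMPLE_DOC)
print(ip_in_ranges("192.0.2.55", networks))    # inside the sample IPv4 range
print(ip_in_ranges("198.51.100.7", networks))  # outside every sample range
```

In practice you would replace `SAMPLE_DOC` with `json.load()` on the file you downloaded from the matching link above, one file per bot name.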
Verification and proof resources
• Google’s full crawler and fetcher reference, where Google states that Google-Extended has no separate user agent and is a robots.txt control token, not a fetcher, and where Google itself warns the user agent string can be spoofed: https://developers.google.com/crawling/docs/crawlers-fetchers/google-common-crawlers
• Google’s guide to verifying that a request really came from Google, including the reverse-DNS method: https://developers.google.com/crawling/docs/crawlers-fetchers/verify-google-requests
• Common Crawl’s public index. Drop in a domain and a recent crawl to check whether your site is actually in the corpus. Use a wildcard, for example yoursite.com/*, so you are not just matching the homepage: https://index.commoncrawl.org/
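The reverse-DNS method mentioned above is a two-way check: look up the hostname for the IP, confirm it ends in one of the operator's domains, then look that hostname up again and confirm it points back to the same IP. A minimal sketch follows. The suffix list reflects my reading of Google's verification guide, so treat it as an assumption and check it against the linked page. The `reverse` and `forward` parameters are hooks I added so the logic can be exercised without touching the network.

```python
import ipaddress
import socket

# Assumed from Google's verification guide; confirm before relying on it.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com", ".googleusercontent.com")


def _reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(host: str) -> set:
    """Forward lookup: hostname to the set of addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def verify_by_reverse_dns(ip, suffixes=GOOGLE_SUFFIXES,
                          reverse=_reverse_lookup, forward=_forward_lookup):
    """Forward-confirmed reverse DNS check for one source IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or lookup failure
        return False
    # The leading dot in each suffix stops look-alikes such as evilgooglebot.com.
    if not host.endswith(tuple(suffixes)):
        return False
    try:
        resolved = {ipaddress.ip_address(a) for a in forward(host)}
    except (OSError, ValueError):
        return False
    return ipaddress.ip_address(ip) in resolved
```

Calling `verify_by_reverse_dns("192.0.2.1")` with the defaults performs live DNS lookups. Pass a different `suffixes` tuple to check another operator, provided that operator documents a hostname pattern.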
Run it yourself: the four-step chase
When a bot will not verify on the first pass, do not stop at “unknown.” Do this:
1. Check the published IP list. Is the source address inside the operator’s ranges?
2. Check reverse DNS. Does the IP resolve back to the operator’s own hostname?
3. Check the corpus or index where one exists, like Common Crawl’s, to see if you were actually captured.
4. Run a WHOIS lookup on the raw IP to see who really owns it. If the owner is commodity hosting in a random country rather than the operator's own network, that is your answer.
When all four angles agree, that is proof. One that does not is a thread worth pulling.
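One way to keep yourself honest about the four checks is to record each result and only call it settled when they all point the same way. This is a small illustration of that rule, not a tool from the episode. The function name, the parameter names, and the three labels are all my own, and the corpus check is optional because not every operator has a public index.

```python
from typing import Optional


def chase_verdict(in_published_ranges: bool,
                  reverse_dns_matches: bool,
                  in_corpus: Optional[bool],
                  whois_owner_is_operator: bool) -> str:
    """Combine the four checks into one label.

    Pass in_corpus=None when no public corpus or index exists for the bot.
    """
    signals = [in_published_ranges, reverse_dns_matches, whois_owner_is_operator]
    if in_corpus is not None:
        signals.append(in_corpus)

    if all(signals):
        return "verified"       # every available check says it is the operator
    if not any(signals):
        return "impostor"       # every available check says it is not
    return "investigate"        # the checks disagree: a thread worth pulling


# The CCBot case from this episode: all four checks came back negative.
print(chase_verdict(False, False, False, False))
```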
Try it and tell me.
Run this on your own logs and send me two numbers: your spoof rate for live, on-demand AI-assistant fetches, and your spoof rate for Googlebot. I suspect the real story is in the spread between them.
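To get those two numbers you need a claimed count and a verified count per bot name. Here is a hedged starting point, assuming your server writes the common "combined" access log format (IP first, user agent last in quotes). The regex, the token list, and the `is_verified` callback are illustrative, so adjust them to your own setup. The callback is where you plug in your IP-range or reverse-DNS check.

```python
import re
from collections import Counter

# Assumes the combined log format: IP ... "request" status bytes "referer" "user agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Bot names mentioned in these notes; extend as needed.
BOT_TOKENS = (
    "ChatGPT-User", "OAI-SearchBot", "GPTBot", "ClaudeBot", "Claude-User",
    "Perplexity-User", "PerplexityBot", "Googlebot", "CCBot",
)


def claimed_bot(user_agent: str):
    """Return the bot name the user agent claims to be, or None."""
    lowered = user_agent.lower()
    for token in BOT_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def tally(lines, is_verified):
    """Count claimed and verified requests per bot name.

    is_verified(bot_name, ip) is supplied by you and returns True or False.
    """
    claimed, verified = Counter(), Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        bot = claimed_bot(match["ua"])
        if bot is None:
            continue
        claimed[bot] += 1
        if is_verified(bot, match["ip"]):
            verified[bot] += 1
    return claimed, verified
```

Feed `tally` the lines of your access log, then divide (claimed minus verified) by claimed for each name to get your spoof rates.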
More on the question of what happens to your content after the fetch: https://www.citationiq.com
Follow the show so the next episode finds you.
Get full access to Duane Forrester Decodes at duaneforresterdecodes.substack.com/subscribe