Methodology
How SeeLLM classifies AI traffic
We separate every request into one of eleven categories using user-agent fingerprints, ASN matching, and request shape. This page documents the rules we apply and the known limitations of each.
Last updated: 2026-05-05
Categories
AI training crawler
A bot that fetches pages to ingest into a model training set or an answer index. Identifies itself in the user-agent.
Signatures
- GPTBot (OpenAI)
- ClaudeBot (Anthropic)
- Bytespider (ByteDance)
- OAI-SearchBot (OpenAI)
- Amazonbot (Amazon)
- Applebot-Extended (Apple)
- CCBot (Common Crawl)
- PerplexityBot (Perplexity)
- Google-Extended (Google)
- Meta-ExternalAgent (Meta)
- Cohere-Training-Data-Crawler
- Diffbot, Firecrawl, HuggingFace-Bot, Webzio-Extended, Omgilibot, PanguBot, ImageSiftBot, Timpibot, Brightbot, AI2Bot
AI assistant
A bot fetching pages on behalf of a live user prompt — the human is waiting for a response. Distinct from training crawlers because the fetch is reactive, not scheduled.
Signatures
- ChatGPT-User (OpenAI)
- Claude-User, Claude-SearchBot (Anthropic)
- Perplexity-User (Perplexity)
- Gemini-Deep-Research (Google)
- OpenAI-User, OAI-SearchBot-User
AI referral
A request from a real human browser whose Referer header points at a known AI surface. Counts as a click-through from an AI assistant to the destination site.
Signatures
- chat.openai.com, chatgpt.com
- claude.ai
- perplexity.ai
- duckduckgo.com (AI mode)
- search.brave.com (Brave AI)
- gemini.google.com
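A minimal sketch of how a Referer header could be checked against the hosts listed above. The function name `is_ai_referral` and the subdomain-matching rule are assumptions made for illustration, not the production logic. duckduckgo.com and search.brave.com are left out because the host alone cannot separate an AI-mode click from an ordinary search click.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hosts copied from the signature list above (search-engine AI modes omitted).
AI_REFERRER_HOSTS = {
    "chat.openai.com",
    "chatgpt.com",
    "claude.ai",
    "perplexity.ai",
    "gemini.google.com",
}


def is_ai_referral(referer: Optional[str]) -> bool:
    """Return True if the Referer host is a known AI surface or one of its subdomains."""
    if not referer:
        # A stripped or missing Referer can never be counted as an AI referral.
        return False
    try:
        host = (urlsplit(referer).hostname or "").lower()
    except ValueError:
        # Malformed header value: treat as no referral signal.
        return False
    # Exact match, or a true subdomain (the leading dot prevents "notclaude.ai"
    # from matching "claude.ai").
    return any(host == h or host.endswith("." + h) for h in AI_REFERRER_HOSTS)
```

Because a missing header returns False, any client that strips the Referer is simply not counted, which is why these counts are a lower bound.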
AI coding agent
Autonomous coding assistants that fetch documentation, source files, or APIs while generating code.
Signatures
- Cursor
- GitHub-Copilot
- Devin
- Cody
- Windsurf
- Aider
Search engine
Traditional search index crawlers, distinct from AI training crawlers.
Signatures
- Googlebot
- Bingbot
- YandexBot
- DuckDuckBot
- Baiduspider
- SeznamBot
- Sogou
- Exabot
SEO tool
Third-party SEO and competitive-intelligence crawlers.
Signatures
- Ahrefs
- Semrush
- MJ12
- DotBot
- DataForSEO
- PetalBot
- Barkrowler
- Serpstat
- Sistrix
Social preview
Bots fetching pages to render link previews in social, messaging, and chat apps.
Signatures
- Twitterbot
- LinkedInBot
- Slackbot
- Discordbot
- TelegramBot
- facebookexternalhit
- SkypeUriPreview
- Redditbot
Monitoring
Synthetic monitoring and uptime checks.
Signatures
- UptimeRobot
- Pingdom
- StatusCake
- HyperPing
- Datadog Synthetic
- New Relic
HTTP client
Generic HTTP libraries with no AI, search, or browser identity. Often scripts or integrations.
Signatures
- curl
- wget
- python-requests
- aiohttp
- axios
- node-fetch
- Go-http-client
- OkHttp
- Java/
- libwww
Scanner
Security probes, vulnerability scanners, and traffic with an empty (zero-length) user-agent string. Almost never legitimate site visitors.
Signatures
- SecurityScanner
- InternetMeasurement
- masscan, nmap, zgrab, nuclei
- Empty user-agent (UA length = 0)
Browser
A real human browser session. Falls through when no other category matches and the request is human-shaped.
Signatures
- Chrome, Safari, Firefox, Edge, mobile browsers
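The fall-through behaviour described for Browser can be sketched as an ordered rule list. This is an illustration under stated assumptions: the rule order, the abbreviated token lists, the `BROWSER_HINTS` check standing in for "human-shaped", and the `None` result for unmatched traffic are all hypothetical, and the real classifier also uses ASN and request shape.

```python
from typing import Optional

# Abbreviated subsets of the signature lists above; the order is an assumption.
RULES = [
    ("AI training crawler", ("gptbot", "claudebot", "ccbot")),
    ("AI assistant", ("chatgpt-user", "claude-user", "perplexity-user")),
    ("Search engine", ("googlebot", "bingbot")),
    ("HTTP client", ("curl", "python-requests", "go-http-client")),
]

# Hypothetical stand-in for the "human-shaped" test, based on the UA alone.
BROWSER_HINTS = ("chrome", "safari", "firefox", "edg")


def classify(user_agent: Optional[str]) -> Optional[str]:
    """Return a category name, or None when nothing in this sketch matches."""
    ua = (user_agent or "").lower()
    if len(ua) == 0:
        # Empty UA goes to Scanner; treating a missing header the same way
        # is an assumption of this sketch.
        return "Scanner"
    # Bot rules run first because many bots embed browser-like tokens.
    for category, tokens in RULES:
        if any(token in ua for token in tokens):
            return category
    # Browser is only reached when no earlier rule matched.
    if any(hint in ua for hint in BROWSER_HINTS):
        return "Browser"
    return None
```

Checking the bot rules before the browser hints matters: a crawler whose user-agent starts with `Mozilla/5.0` would otherwise be counted as a human session.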
Known limitations
Referer stripping undercounts AI referrals
Mobile ChatGPT, in-app browsers, Arc, Brave, and several AI products strip the Referer header on outbound clicks. AI referral counts are therefore conservative: they reflect the clicks we can see, not all AI-driven traffic. Treat referral rankings between AI sources as directional.
User-agent matching is the primary signal
We rely on declared user-agents for bot classification. Some scrapers and small AI projects use python-requests or browser-shaped UAs without identifying themselves. Those land in HTTP client or Browser, not AI training crawler. ASN matching is used as a secondary signal where available.
AI assistant vs. AI training is a fuzzy line
Some bots (e.g., Perplexity) operate in both modes — training crawls and live-user fetches — sometimes with the same UA. We classify by UA token where vendors differentiate (GPTBot vs. ChatGPT-User), and conservatively otherwise.
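One practical wrinkle with token matching is that some tokens in the signature lists are prefixes of others (OAI-SearchBot and OAI-SearchBot-User). A possible way to handle this, shown here as a sketch rather than the actual implementation, is to test longer tokens first. The mapping is a subset of the lists above, and `category_for` is a hypothetical helper name.

```python
from typing import Optional

# Subset of the signature lists above, keyed by UA token.
TOKEN_CATEGORY = {
    "GPTBot": "AI training crawler",
    "ClaudeBot": "AI training crawler",
    "OAI-SearchBot": "AI training crawler",
    "ChatGPT-User": "AI assistant",
    "Claude-SearchBot": "AI assistant",
    "OAI-SearchBot-User": "AI assistant",
}

# Longest tokens first, so "OAI-SearchBot-User" is tested before its
# prefix "OAI-SearchBot".
_ORDERED_TOKENS = sorted(TOKEN_CATEGORY, key=len, reverse=True)


def category_for(user_agent: str) -> Optional[str]:
    """Map a user-agent to a category via case-insensitive substring match."""
    ua = user_agent.lower()
    for token in _ORDERED_TOKENS:
        if token.lower() in ua:
            return TOKEN_CATEGORY[token]
    # No vendor token found: the caller decides how to treat this request.
    return None
```

Without the length ordering, a plain substring test for `OAI-SearchBot` would also match the `-User` variant and file a live-user fetch under training.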
Self-identifying is not the same as truthful
Any client can claim to be GPTBot. We do not verify ownership for every request. For high-stakes use cases, we recommend cross-checking with reverse DNS or vendor-published IP ranges.
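The cross-checks mentioned above could look roughly like the following sketch. Everything named here is an assumption for illustration: the CIDR block is an RFC 5737 documentation range, not a real vendor range, and the hostname suffixes passed to `forward_confirmed_rdns` must come from the vendor's own documentation. The DNS lookups are injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List, Optional

# Placeholder only: 192.0.2.0/24 is reserved for documentation (RFC 5737).
# Replace with ranges the vendor actually publishes.
EXAMPLE_VENDOR_RANGES = {"GPTBot": ["192.0.2.0/24"]}


def ip_in_published_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def forward_confirmed_rdns(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Optional[Callable[[str], str]] = None,
    forward: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Reverse-resolve ip, check the hostname suffix, then confirm the
    hostname resolves back to the same ip."""
    # Defaults use the standard resolver; gethostbyname_ex is IPv4-only.
    reverse = reverse or (lambda a: socket.gethostbyaddr(a)[0])
    forward = forward or (lambda h: socket.gethostbyname_ex(h)[2])
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    if not any(host.endswith("." + s.lower()) for s in allowed_suffixes):
        return False  # hostname is outside the vendor's domain
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The forward step is what makes the check meaningful: a PTR record alone is controlled by whoever owns the IP block, so it is only trusted once the claimed hostname resolves back to the same address.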
Categories evolve
New AI products and crawlers appear monthly. The signature lists above are point-in-time and updated as new bots are observed at scale.
How we collect
- Edge-side classification via Cloudflare Worker on customer domains
- Cloudflare Logpush ingestion for customers preferring no install
- Server-log upload for one-off audits
All classification happens server-side. There is no client-side script and no cross-site tracking.
See it on your site
Run a free Score on any URL to check AI readiness, or install the edge worker to start collecting the same classification data on your own domain.