1. CCBot
Owner: Common Crawl
Builds massive web datasets used to train third-party AI models. Your content trains models that never cite you.
User-agent: CCBot
Disallow: /
2. Bytespider
Owner: ByteDance (TikTok)
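Aggressive crawler that often ignores robots.txt. It consumes server resources gathering training data and provides no user-facing citations. The Crawl-delay: 10 line asks compliant crawlers to wait 10 seconds between requests, but it lives in the same robots.txt file, so a crawler that ignores Disallow is unlikely to honor it either. Real rate limiting has to be enforced on the server.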
User-agent: Bytespider
Disallow: /
Crawl-delay: 10
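Because a crawler that skips robots.txt will not read Crawl-delay either, the 10-second spacing can only be enforced on the server. The following is a minimal sketch of a sliding-window limiter keyed by client IP; the class name, its parameters, and the idea of returning HTTP 429 on refusal are assumptions for illustration, not something the original setup prescribes.

```python
import time
from collections import defaultdict, deque


class UserAgentRateLimiter:
    """Sliding-window limiter for requests whose UA contains a token (hypothetical helper)."""

    def __init__(self, token, max_requests, window_seconds, clock=time.monotonic):
        self.token = token.lower()
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # client IP -> timestamps of recent requests

    def allow(self, user_agent, client_ip):
        """Return True if the request may proceed, False if it should be refused."""
        if self.token not in user_agent.lower():
            return True  # not the crawler being limited
        now = self.clock()
        recent = self.hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


# Roughly mirrors "Crawl-delay: 10": one request per 10 seconds per IP.
bytespider_limiter = UserAgentRateLimiter("bytespider", max_requests=1, window_seconds=10)
```

A request handler would call bytespider_limiter.allow(user_agent, client_ip) and answer with 429 when it returns False. The state is kept in memory, so this only suits a single-process server.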
3. Diffbot
Owner: Diffbot Inc.
Web data extraction service. Not an answer engine users interact with—just a scraper reselling structured data.
User-agent: Diffbot
Disallow: /
4. cohere-ai
Owner: Cohere
Enterprise AI training crawler. Minimal user-facing answer engine presence—primarily collects training data.
User-agent: cohere-ai
Disallow: /
5. AI2Bot
Owner: Allen Institute for AI
Academic research crawler. Not a public-facing answer engine—no citation or traffic opportunities.
User-agent: AI2Bot
Disallow: /
6. Google-Extended
Owner: Google (Gemini Training)
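Safe to block: Google-Extended is a robots.txt control token rather than a separate crawler. It governs whether your content is used for Gemini and Vertex AI, and Google states that it does not affect Google Search inclusion or rankings. Blocking it protects your IP without SEO risk.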
User-agent: Google-Extended
Disallow: /
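To confirm that this group is scoped to Google-Extended only, the rule can be run through Python's standard-library robots.txt parser. This is a sketch that checks how the rule parses; example.com is a placeholder URL, and the check says nothing about how Google itself treats rankings.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# example.com is a placeholder; any URL on the site behaves the same here.
url = "https://example.com/blog/post"
google_extended_allowed = parser.can_fetch("Google-Extended", url)
googlebot_allowed = parser.can_fetch("Googlebot", url)

print(google_extended_allowed, googlebot_allowed)  # False True
```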
7. Omgili/0.5 (NEW 2026)
Owner: Webz.io
Aggressive scraper used to build news and web datasets. Requests sent with the Omgili/0.5 version string are the high-volume variant. Block both the generic Omgilibot agent and the versioned Omgili/0.5 one.
User-agent: Omgilibot
Disallow: /
User-agent: Omgili/0.5
Disallow: /
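Robots.txt groups are normally matched on a crawler's product token (the part of its user-agent before the slash), so it is worth checking whether the versioned group actually takes effect. The sketch below uses Python's standard-library urllib.robotparser; the is_blocked helper and the example.com URL are hypothetical, and the crawler's own parser may match differently.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, user_agent, url="https://example.com/"):
    """Return True if robots_txt disallows url for user_agent (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# The two groups exactly as listed above.
AS_LISTED = """\
User-agent: Omgilibot
Disallow: /

User-agent: Omgili/0.5
Disallow: /
"""

# Same rules plus a group for the bare product token (an assumed addition).
WITH_BARE_TOKEN = AS_LISTED + """
User-agent: omgili
Disallow: /
"""

for label, rules in (("as listed", AS_LISTED), ("with bare token", WITH_BARE_TOKEN)):
    print(label, is_blocked(rules, "Omgilibot"), is_blocked(rules, "Omgili/0.5"))
```

Under this parser, only the omgili part of Omgili/0.5 is compared, so the versioned group does not match and the request is blocked only once the bare omgili group is present. If your own check shows the same, adding that group is a low-cost safeguard.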
8. TurnitinBot (NEW 2026)
Owner: Turnitin LLC
Plagiarism-detection crawler. Irrelevant for marketing and AEO content—no citations, no traffic. Adds unnecessary crawl budget pressure.
User-agent: TurnitinBot
Disallow: /
9. DataDog Synthetic Bot (NEW 2026)
Owner: Datadog Inc.
Synthetic monitoring bot that simulates user traffic for uptime checks. Unless you use Datadog to monitor your own site, these hits are noise—no value, no citations.
User-agent: DataDog Synthetic Bot
Disallow: /
10. AwarioSmartBot (NEW March 2026)
Owner: Awario (brand monitoring SaaS)
Brand monitoring crawler that tracks mentions across the web for paying customers. High crawl volume (143 visits in 30 days on AEOfix), with no referral traffic and no citations. Block it.
User-agent: AwarioSmartBot
Disallow: /
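A per-bot visit count like the one quoted above can be reproduced from your own access logs before deciding what to block. This is a minimal sketch that assumes the server writes the common "combined" log format, where the user-agent is the last quoted field; the count_hits function and the regular expression are illustrative, not taken from any particular analytics tool.

```python
import re
from collections import Counter

# In the combined log format each line ends with: "referer" "user-agent"
UA_FIELD = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def count_hits(log_lines, tokens):
    """Count log lines whose user-agent contains each token (case-insensitive)."""
    lowered = [token.lower() for token in tokens]
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the assumed format
        user_agent = match.group(1).lower()
        for token in lowered:
            if token in user_agent:
                counts[token] += 1
    return counts
```

Calling count_hits(open("access.log"), ["AwarioSmartBot", "trendictionbot"]) on a 30-day log would give the per-bot totals, keyed by the lowercased token. The access.log path is a placeholder.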
11. HubSpot Crawler (NEW March 2026)
Owner: HubSpot Inc.
HubSpot's domain inspection crawler used for CRM and marketing intelligence features. Provides no citations or referral traffic to your site. Commercial data collection only.
User-agent: HubSpot Crawler
Disallow: /
12. trendictionbot (NEW March 2026)
Owner: Trendiction GmbH
Social media and web trend monitoring crawler. Collects content for trend analysis products sold to third parties. No user-facing answer engine, no citations, no traffic.
User-agent: trendictionbot
Disallow: /
13. suspicious-agent (NEW March 2026)
Owner: Unknown (attack tooling)
Active attack probe, with 175 hits recorded in 30 days on AEOfix. It sends the UA string suspicious-agent/1.0 (attack-probe). This is not a legitimate crawler, so it is unlikely to honor robots.txt: list it there for completeness, but rely on server-level blocking to actually stop it.
User-agent: suspicious-agent
Disallow: /
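Server-level blocking can be done in the web server's own configuration or in the application. As one possible shape for the latter, here is a minimal WSGI middleware sketch that answers 403 to any request whose user-agent contains a blocked token. The block_user_agents name and the token list are assumptions for illustration; a reverse proxy or firewall rule would achieve the same result earlier in the request path.

```python
def block_user_agents(app, blocked_tokens):
    """Wrap a WSGI app; refuse requests whose UA contains any blocked token."""
    lowered = [token.lower() for token in blocked_tokens]

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else passes through to the wrapped application untouched.
        return app(environ, start_response)

    return middleware
```

With an existing WSGI application object named application (a placeholder), wrapping it is one line: application = block_user_agents(application, ["suspicious-agent"]). User-agent strings are trivially spoofed, so this stops only tools that announce themselves.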
14. Python-aiohttp scrapers (NEW March 2026)
Owner: Various (automated scraping scripts)
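Generic Python async HTTP scrapers. UA pattern: Python/3.x aiohttp/3.x. This is the default user agent of the aiohttp client library, which established crawlers replace with their own identifying string, so traffic carrying it is almost always an unattended script harvesting content. robots.txt rarely stops these; pair it with server-level blocking.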
User-agent: Python-aiohttp
Disallow: /
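Note that the header these scripts send follows the Python/3.x aiohttp/3.x pattern, not the literal Python-aiohttp token used in the robots.txt group, so a server-level rule needs to match the pattern itself. A hedged sketch of such a matcher follows; the regular expression is an assumption derived from the pattern described above, and the function name is hypothetical.

```python
import re

# Assumed shape of the header, e.g. "Python/3.11 aiohttp/3.9.1".
AIOHTTP_DEFAULT_UA = re.compile(r"^Python/3\.\d+ aiohttp/3\.\d+(\.\w+)*$")


def looks_like_aiohttp_default(user_agent):
    """Return True if the UA is only the Python/aiohttp version pair, nothing else."""
    return bool(AIOHTTP_DEFAULT_UA.match(user_agent.strip()))
```

Anchoring the pattern at both ends keeps it from flagging a crawler that merely mentions aiohttp inside a longer, self-identifying user-agent.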