> CONFIGURATION_FILE
Robots.txt Was Designed to Block Crawlers.
The AI Era Requires You to Use It to Invite Them — Specifically, Selectively, and with the Right Syntax.
Practitioners who have accidentally blocked GPTBot with a blanket Disallow: / rule, then watched their AI citations disappear, understand exactly why this file matters more than any other configuration on a site. This guide names each AI crawler, gives its exact user-agent string, and explains which bots deserve access and which take without giving back.
AEOfix Robots.txt Strategy // Updated Feb 2026
Allow answer engines that provide citations. Block training scrapers that take without giving back.
# ===== ALLOW: Core Search (2026 best practice — explicit overrides) =====
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: GoogleOther
Allow: /
# ===== ALLOW: AI Answer Engines (Provide Citations) =====
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
# Grok / xAI live search — allow for citations
User-agent: xAI
Allow: /
User-agent: Grok
Allow: /
User-agent: Amazonbot
Allow: /
# ===== BLOCK: Training Scrapers (No Referral Value) =====
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
Crawl-delay: 10
User-agent: Diffbot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: AI2Bot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Omgili/0.5
Disallow: /
User-agent: DataDog Synthetic Bot
Disallow: /
User-agent: TurnitinBot
Disallow: /
Sitemap: https://aeofix.com/sitemap.xml
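Before deploying, a policy like the one above can be sanity-checked offline with Python's standard-library robots.txt parser. The excerpt and URL below are illustrative placeholders; paste in your full file to audit every agent.

```python
from urllib import robotparser

# Illustrative excerpt of the policy above; substitute your full robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

URL = "https://example.com/some-article"  # hypothetical page

# Answer engines that cite should get in; training scrapers should not.
for agent in ("GPTBot", "PerplexityBot", "CCBot", "Google-Extended"):
    print(agent, "allowed:", rp.can_fetch(agent, URL))
```

A quick run like this catches the classic mistake described above: a wildcard `Disallow: /` group silently overriding the agents you meant to invite.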
Why This Matters
In the Agentic Web, robots.txt controls AI visibility. Answer engines (ChatGPT, Claude, Perplexity)
cite sources and send referral traffic—allow them. Training scrapers (CCBot, Bytespider, cohere-ai)
consume bandwidth to train models, provide zero citations, and send zero traffic—block them.
Key Distinction: GPTBot feeds OpenAI model training, but allowing OpenAI's crawler family (GPTBot, OAI-SearchBot, ChatGPT-User) is what keeps your pages eligible for live citations.
CCBot just scrapes for datasets. Google-Extended trains Gemini but doesn't affect Google Search rankings, so blocking it protects IP while keeping SEO intact.
2026 Updates: Explicit Allow entries for Googlebot and Bingbot are now best practice to prevent wildcard conflicts. Grok (xAI) is now a Tier 1 allow because it powers live search citations. GoogleOther handles Google's non-search crawls (rich results testing, research). Bytespider gets a Crawl-delay: 10 as a best-effort rate limit, though a bot that ignores Disallow will usually ignore Crawl-delay too, so back it up with the server-level blocks in the Advanced Enforcement section. Omgili/0.5, TurnitinBot, and DataDog Synthetic Bot join the block list.
> BLOCK_LIST: Training Scrapers
These bots scrape content for AI training datasets but provide zero citations and zero traffic.
They consume bandwidth without reciprocal value. Block them to protect IP while allowing answer engines.
1. CCBot
Owner: Common Crawl
Builds massive web datasets used to train third-party AI models. Your content trains models that never cite you.
User-agent: CCBot
Disallow: /
2. Bytespider
Owner: ByteDance (TikTok)
Aggressive crawler that often ignores robots.txt. Exhausts server resources for training data with no user-facing citations. Add Crawl-delay: 10 as a best-effort rate limit, but since a bot that ignores Disallow will likely ignore Crawl-delay too, enforce the block at the server level as well.
User-agent: Bytespider
Disallow: /
Crawl-delay: 10
3. Diffbot
Owner: Diffbot Inc.
Web data extraction service. Not an answer engine users interact with—just a scraper reselling structured data.
User-agent: Diffbot
Disallow: /
4. cohere-ai
Owner: Cohere AI
Enterprise AI training crawler. Minimal user-facing answer engine presence—primarily collects training data.
User-agent: cohere-ai
Disallow: /
5. AI2Bot
Owner: Allen Institute for AI
Academic research crawler. Not a public-facing answer engine—no citation or traffic opportunities.
User-agent: AI2Bot
Disallow: /
6. Google-Extended
Owner: Google (Gemini Training)
Safe to block: Only trains Gemini/Vertex AI. Does NOT affect Google Search rankings. Protect IP without SEO risk.
User-agent: Google-Extended
Disallow: /
7. Omgilibot (Omgili/0.5) NEW 2026
Owner: Webz.io
Aggressive scraper used to build news and web datasets, which has been seen announcing itself with the Omgili/0.5 version string. Block the generic Omgilibot token; some sites also list the versioned string, though many robots.txt parsers match only the product token.
User-agent: Omgilibot
Disallow: /
User-agent: Omgili/0.5
Disallow: /
8. TurnitinBot NEW 2026
Owner: Turnitin LLC
Plagiarism-detection crawler. Irrelevant for marketing and AEO content—no citations, no traffic. Adds unnecessary crawl budget pressure.
User-agent: TurnitinBot
Disallow: /
9. DataDog Synthetic Bot NEW 2026
Owner: Datadog Inc.
Synthetic monitoring bot that simulates user traffic for uptime checks. Unless you use Datadog to monitor your own site, these hits are noise—no value, no citations.
User-agent: DataDog Synthetic Bot
Disallow: /
Complete AEOfix Block List
Copy this section to block all training scrapers while preserving answer engine visibility.
# ===== BLOCK: AI Training Scrapers =====
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
Crawl-delay: 10
User-agent: Diffbot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: AI2Bot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Omgili/0.5
Disallow: /
User-agent: TurnitinBot
Disallow: /
User-agent: DataDog Synthetic Bot
Disallow: /
# ===== BLOCK: SEO Data Resellers =====
User-agent: PetalBot
Disallow: /
User-agent: DataForSeoBot
Disallow: /
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: DotBot
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: SerpStatBot
Disallow: /
User-agent: SerpCo
Disallow: /
User-agent: MauiBot
Disallow: /
User-agent: DomainStatsBot
Disallow: /
# ===== BLOCK: Content Harvesters =====
User-agent: MagpieCrawler
Disallow: /
User-agent: spinn3r
Disallow: /
User-agent: proximic
Disallow: /
User-agent: Scrapy
Disallow: /
# ===== BLOCK: Vulnerability Scanners =====
User-agent: Nikto
Disallow: /
User-agent: sqlmap
Disallow: /
User-agent: Acunetix
Disallow: /
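A list this long invites hand-editing typos. One maintainability sketch (not part of any AEOfix tooling, and trimmed to a subset of the agents above) is to generate the block sections from a single data structure and paste the output into robots.txt:

```python
# Sketch: render robots.txt block sections from one data structure so
# every entry gets a consistent "User-agent / Disallow: /" pair.
BLOCK_SECTIONS = {
    "AI Training Scrapers": [
        "CCBot", "Bytespider", "Diffbot", "cohere-ai", "AI2Bot",
        "Google-Extended", "Meta-ExternalAgent", "Applebot-Extended",
        "FacebookBot", "Omgilibot", "TurnitinBot",
    ],
    "SEO Data Resellers": ["AhrefsBot", "SemrushBot", "MJ12bot", "DotBot"],
    "Content Harvesters": ["MagpieCrawler", "spinn3r", "Scrapy"],
}

def render_blocklist(sections: dict[str, list[str]]) -> str:
    lines = []
    for title, agents in sections.items():
        lines.append(f"# ===== BLOCK: {title} =====")
        for agent in agents:
            lines.append(f"User-agent: {agent}")
            lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

print(render_blocklist(BLOCK_SECTIONS))
```

Adding or retiring a bot then becomes a one-line change to the dict rather than a three-line edit to the config.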
Advanced Enforcement (Server Level)
Since bad bots often ignore robots.txt, use .htaccess
(Apache) or Nginx config for hard blocks.
Apache (.htaccess)
RewriteEngine On
# [NC] = case-insensitive match on the User-Agent header; [F] returns 403 Forbidden.
RewriteCond %{HTTP_USER_AGENT} (CCBot|Bytespider|Diffbot|cohere-ai|AI2Bot) [NC]
RewriteRule .* - [F,L]
Nginx Config
# ~* = case-insensitive regex match; matching agents receive 403 Forbidden.
if ($http_user_agent ~* (CCBot|Bytespider|Diffbot|cohere-ai|AI2Bot)) {
    return 403;
}
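Both server rules hinge on the same case-insensitive alternation against the User-Agent header. A quick way to check its coverage before deploying is to replicate the pattern in Python (the header strings below are hypothetical examples, not captured logs):

```python
import re

# Same alternation as the Apache [NC] and Nginx ~* rules above;
# re.IGNORECASE reproduces the case-insensitive matching.
BLOCK_RE = re.compile(r"CCBot|Bytespider|Diffbot|cohere-ai|AI2Bot", re.IGNORECASE)

def would_403(user_agent: str) -> bool:
    """True if this User-Agent header would hit the block rule."""
    return BLOCK_RE.search(user_agent) is not None

# Hypothetical header strings for illustration.
print(would_403("CCBot/2.0 (https://commoncrawl.org/faq/)"))    # blocked
print(would_403("Mozilla/5.0 AppleWebKit/537.36; GPTBot/1.1"))  # allowed through
```

Extending the server-side block is then a matter of adding the new token to the alternation and re-running the check, so an allowed answer engine is never caught by accident.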
Need Your robots.txt and AI Access Configured Correctly?
AEOfix audits and fixes your robots.txt, llms.txt, and crawler access settings so every major AI engine can index your content.