> CONFIGURATION_FILE

Robots.txt Was Designed to Block Crawlers.
The AI Era Requires You to Use It to Invite Them — Specifically, Selectively, and with the Right Syntax.

Practitioners who have accidentally blocked GPTBot with a blanket Disallow: / rule — and watched AI citations disappear — understand exactly why this file matters more than any other configuration on your site. This guide names each AI crawler, gives its exact user-agent string, and tells you which ones deserve access and which ones take without giving back.

By William Bouch · Updated March 2026

AEOfix Robots.txt Strategy // Updated March 2026

Allow answer engines that provide citations. Block training scrapers that take without giving back.

# ===== ALLOW: Core Search (2026 best practice — explicit overrides) =====
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: GoogleOther
Allow: /

# ===== ALLOW: AI Answer Engines (Provide Citations) =====
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

# Grok / xAI live search — allow for citations
User-agent: xAI
Allow: /

User-agent: Grok
Allow: /

User-agent: Amazonbot
Allow: /

# Meta AI Search — #1 most active AI crawler (meta-webindexer, NOT Meta-ExternalAgent)
User-agent: meta-webindexer
Allow: /

# You.com AI search engine
User-agent: YouBot
Allow: /

# LinkedIn — professional visibility and social sharing
User-agent: LinkedInBot
Allow: /

# Twitter/X — social sharing cards
User-agent: Twitterbot
Allow: /

# Yandex Search — international search index
User-agent: YandexBot
Allow: /

# ===== BLOCK: Training Scrapers (No Referral Value) =====
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
Crawl-delay: 10

User-agent: Diffbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: AI2Bot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili/0.5
Disallow: /

User-agent: DataDog Synthetic Bot
Disallow: /

User-agent: TurnitinBot
Disallow: /

Sitemap: https://aeofix.com/sitemap.xml
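Before deploying, the policy above can be sanity-checked with Python's standard-library robots.txt parser. This is a minimal sketch: it inlines a small excerpt of the file so it runs self-contained, and the aeofix.com URLs are placeholders.

```python
import urllib.robotparser

# Excerpt of the policy above, inlined so the check is self-contained.
POLICY = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(POLICY.splitlines())

# Answer engines that cite sources stay allowed.
assert rp.can_fetch("GPTBot", "https://aeofix.com/guide")
assert rp.can_fetch("PerplexityBot", "https://aeofix.com/guide")

# Training scrapers are shut out.
assert not rp.can_fetch("CCBot", "https://aeofix.com/guide")
assert not rp.can_fetch("Bytespider", "https://aeofix.com/guide")
```

Run this after every edit to the live file; a single mistyped Disallow: / under the wrong user-agent is exactly the citation-killing mistake described in the intro.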

Why This Matters

In the Agentic Web, robots.txt controls AI visibility. Answer engines (ChatGPT, Claude, Perplexity) cite sources and send referral traffic—allow them. Training scrapers (CCBot, Bytespider, cohere-ai) consume bandwidth to train models, provide zero citations, and send zero traffic—block them.

Key Distinction: GPTBot gathers training data for OpenAI models, while OAI-SearchBot and ChatGPT-User handle the live search and user-triggered fetches that produce citations; allowing all three maximizes ChatGPT visibility. CCBot just scrapes for datasets. Google-Extended trains Gemini but does not affect Google Search rankings, so you can block it to protect IP while keeping SEO intact.

March 2026 Updates: Live crawl data from the AEOfix Bot Tracker shows Meta AI Search (meta-webindexer/1.1) is now the #1 most active AI crawler, surpassing even Googlebot, with 939 visits across 113 pages in 30 days. It is now an explicit Tier 1 allow. Note: meta-webindexer is Meta's AI search indexer and is distinct from Meta-ExternalAgent (a training scraper, still blocked). YouBot (You.com), LinkedInBot, Twitterbot, and YandexBot have been added to the allow list. New blocks: AwarioSmartBot (brand monitor, 143 visits), HubSpot Crawler (32 visits), trendictionbot (23 visits), suspicious-agent (175 attack-probe hits), Python-aiohttp scrapers, and 360Spider (Qihoo).
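Rankings like these come from tallying user-agent strings in server access logs. A minimal sketch of that tally, where the log lines and bot list are illustrative rather than real AEOfix data:

```python
from collections import Counter

# Illustrative access-log lines; in practice, read your server's log file.
LOG_LINES = [
    '"GET /guide HTTP/1.1" 200 "meta-webindexer/1.1"',
    '"GET /pricing HTTP/1.1" 200 "meta-webindexer/1.1"',
    '"GET / HTTP/1.1" 200 "GPTBot/1.0"',
    '"GET /blog HTTP/1.1" 200 "CCBot/2.0"',
]

# Crawler tokens to tally, matched case-insensitively as substrings.
KNOWN_BOTS = ["meta-webindexer", "GPTBot", "CCBot", "Bytespider"]

hits = Counter()
for line in LOG_LINES:
    lowered = line.lower()
    for bot in KNOWN_BOTS:
        if bot.lower() in lowered:
            hits[bot] += 1

# Most active crawler first.
print(hits.most_common())
```

Running a tally like this monthly is how you notice a new high-volume crawler (or attack probe) before it becomes a bandwidth problem.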

> BLOCK_LIST: Training Scrapers

These bots scrape content for AI training datasets but provide zero citations and zero traffic. They consume bandwidth without reciprocal value. Block them to protect IP while allowing answer engines.

1. CCBot

Owner: Common Crawl

Builds massive web datasets used to train third-party AI models. Your content trains models that never cite you.

User-agent: CCBot
Disallow: /

2. Bytespider

Owner: ByteDance (TikTok)

Aggressive crawler that often ignores robots.txt and exhausts server resources for training data, with no user-facing citations. The Crawl-delay: 10 is a fallback for the case where it honors rate limits but not the blanket Disallow; for reliable protection, enforce the block at the server level as well.

User-agent: Bytespider
Disallow: /
Crawl-delay: 10

3. Diffbot

Owner: Diffbot Inc.

Web data extraction service. Not an answer engine users interact with—just a scraper reselling structured data.

User-agent: Diffbot
Disallow: /

4. cohere-ai

Owner: Cohere AI

Enterprise AI training crawler. Minimal user-facing answer engine presence—primarily collects training data.

User-agent: cohere-ai
Disallow: /

5. AI2Bot

Owner: Allen Institute for AI

Academic research crawler. Not a public-facing answer engine—no citation or traffic opportunities.

User-agent: AI2Bot
Disallow: /

6. Google-Extended

Owner: Google (Gemini Training)

Safe to block: Only trains Gemini/Vertex AI. Does NOT affect Google Search rankings. Protect IP without SEO risk.

User-agent: Google-Extended
Disallow: /

7. Omgilibot / Omgili/0.5 NEW 2026

Owner: Webz.io

Aggressive scraper used to build news and web datasets. The specific Omgili/0.5 version string is the high-volume variant. Block both the generic agent and the versioned one.

User-agent: Omgilibot
Disallow: /

User-agent: Omgili/0.5
Disallow: /

8. TurnitinBot NEW 2026

Owner: Turnitin LLC

Plagiarism-detection crawler. Irrelevant for marketing and AEO content—no citations, no traffic. Adds unnecessary crawl budget pressure.

User-agent: TurnitinBot
Disallow: /

9. DataDog Synthetic Bot NEW 2026

Owner: Datadog Inc.

Synthetic monitoring bot that simulates user traffic for uptime checks. Unless you use Datadog to monitor your own site, these hits are noise—no value, no citations.

User-agent: DataDog Synthetic Bot
Disallow: /

10. AwarioSmartBot NEW March 2026

Owner: Awario (brand monitoring SaaS)

Brand monitoring crawler that tracks mentions across the web for paying customers. High crawl volume (143 visits/30 days on AEOfix), no referral traffic, no citations. Block it.

User-agent: AwarioSmartBot
Disallow: /

11. HubSpot Crawler NEW March 2026

Owner: HubSpot Inc.

HubSpot's domain inspection crawler used for CRM and marketing intelligence features. Provides no citations or referral traffic to your site. Commercial data collection only.

User-agent: HubSpot Crawler
Disallow: /

12. trendictionbot NEW March 2026

Owner: Trendiction GmbH

Social media and web trend monitoring crawler. Collects content for trend analysis products sold to third parties. No user-facing answer engine, no citations, no traffic.

User-agent: trendictionbot
Disallow: /

13. suspicious-agent NEW March 2026

Owner: Unknown (attack tooling)

Active attack probe with 175 hits recorded in 30 days on AEOfix. Uses UA string suspicious-agent/1.0 (attack-probe). Not a legitimate crawler — block at both robots.txt and server level.

User-agent: suspicious-agent
Disallow: /

14. Python-aiohttp scrapers NEW March 2026

Owner: Various (automated scraping scripts)

Generic Python async HTTP scrapers. UA pattern: Python/3.x aiohttp/3.x. No legitimate crawler uses this string; it signals purely automated content harvesting. These scripts rarely read robots.txt at all, and the Python-aiohttp token below is only a best-effort match for the real UA string, so pair it with server-level blocking.

User-agent: Python-aiohttp
Disallow: /

Complete AEOfix Block List

Copy this section to block all training scrapers while preserving answer engine visibility.

# ===== BLOCK: AI Training Scrapers =====
User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
Crawl-delay: 10

User-agent: Diffbot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: AI2Bot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili/0.5
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: DataDog Synthetic Bot
Disallow: /

# ===== BLOCK: SEO Data Resellers =====
User-agent: PetalBot
Disallow: /

User-agent: DataForSeoBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: BLEXBot
Disallow: /

User-agent: SerpStatBot
Disallow: /

User-agent: SerpCo
Disallow: /

User-agent: MauiBot
Disallow: /

User-agent: DomainStatsBot
Disallow: /

# ===== BLOCK: Content Harvesters =====
User-agent: MagpieCrawler
Disallow: /

User-agent: spinn3r
Disallow: /

User-agent: proximic
Disallow: /

User-agent: Scrapy
Disallow: /

User-agent: Python-aiohttp
Disallow: /

# ===== BLOCK: Brand Monitors & Marketing Tools (March 2026) =====
User-agent: AwarioSmartBot
Disallow: /

User-agent: HubSpot Crawler
Disallow: /

User-agent: trendictionbot
Disallow: /

# ===== BLOCK: Attack Probes & Low-Value Crawlers (March 2026) =====
User-agent: suspicious-agent
Disallow: /

User-agent: 360Spider
Disallow: /

# ===== BLOCK: Vulnerability Scanners =====
User-agent: Nikto
Disallow: /

User-agent: sqlmap
Disallow: /

User-agent: Acunetix
Disallow: /
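A block list this long is easier to keep consistent when generated from data instead of hand-edited. A minimal sketch; the BLOCK_GROUPS data and render_blocklist helper are our own illustration, not part of any standard:

```python
# Grouped block list; extend these lists as new scrapers appear.
BLOCK_GROUPS = {
    "AI Training Scrapers": ["CCBot", "Bytespider", "Diffbot", "cohere-ai"],
    "SEO Data Resellers": ["AhrefsBot", "SemrushBot", "MJ12bot"],
    "Vulnerability Scanners": ["Nikto", "sqlmap", "Acunetix"],
}

def render_blocklist(groups):
    """Render robots.txt records: one Disallow-all record per user agent."""
    out = []
    for title, agents in groups.items():
        out.append(f"# ===== BLOCK: {title} =====")
        for ua in agents:
            out.extend([f"User-agent: {ua}", "Disallow: /", ""])
    return "\n".join(out)

print(render_blocklist(BLOCK_GROUPS))
```

Generating the file means adding a newly spotted scraper is a one-line change, and the section headers and spacing stay uniform across dozens of entries.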

Advanced Enforcement (Server Level)

Since bad bots often ignore robots.txt, use .htaccess (Apache) or Nginx config for hard blocks.

Apache (.htaccess)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|Bytespider|Diffbot|cohere-ai|AI2Bot|AwarioSmartBot|trendictionbot|suspicious-agent|360Spider|aiohttp|Scrapy|HubSpot.Crawler) [NC]
RewriteRule .* - [F,L]

Nginx Config

if ($http_user_agent ~* (CCBot|Bytespider|Diffbot|cohere-ai|AI2Bot|AwarioSmartBot|trendictionbot|suspicious-agent|360Spider|aiohttp|Scrapy|HubSpot.Crawler)) {
    return 403;
}
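It is worth unit-testing the user-agent pattern before deploying it. This Python sketch mirrors the kind of case-insensitive alternation used in the server rules above; note that a bare aiohttp token is needed to catch the real Python/3.x aiohttp/3.x string, which contains no literal Python-aiohttp substring.

```python
import re

# Case-insensitive alternation in the spirit of the server rules above.
BAD_BOTS = re.compile(
    r"(CCBot|Bytespider|Diffbot|cohere-ai|AI2Bot|aiohttp|Scrapy|suspicious-agent)",
    re.IGNORECASE,
)

# Scrapers and probes are caught...
assert BAD_BOTS.search("Mozilla/5.0 (compatible; CCBot/2.0)")
assert BAD_BOTS.search("Python/3.12 aiohttp/3.9.5")
assert BAD_BOTS.search("suspicious-agent/1.0 (attack-probe)")

# ...while legitimate search and answer-engine crawlers pass through.
assert not BAD_BOTS.search("Mozilla/5.0 (compatible; Googlebot/2.1)")
assert not BAD_BOTS.search("GPTBot/1.2 (+https://openai.com/gptbot)")
```

A false positive here is far more costly than a false negative: accidentally matching Googlebot or GPTBot at the server level recreates exactly the blanket-block mistake from the intro, with no robots.txt to audit.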

Need Your robots.txt and AI Access Configured Correctly?

AEOfix audits and fixes your robots.txt, llms.txt, and crawler access settings so every major AI engine can index your content.

View Services & Pricing · Get a Free Consultation