Robots.txt: The Complete Guide

Understanding Harmful Scrapers vs. Beneficial Bots in 2025-2026

📅 Last Updated: January 2026
✍️ AI Big Deal Research Team
⏱️ 15 min read

Introduction

In 2025-2026, the bot landscape has become increasingly complex. With the rise of AI and machine learning, web scraping has intensified dramatically. According to industry research, the web scraping market is projected to nearly double by 2030 (Corporate Compliance Insights), creating what experts call a "free-rider problem" where valuable data is extracted without permission or compensation.

Understanding which bots to allow and which to block is crucial for:

  • Server performance - Preventing resource exhaustion
  • Content protection - Safeguarding intellectual property
  • SEO optimization - Ensuring legitimate search engines can index your content
  • Analytics accuracy - Preventing bot traffic from skewing your data

This guide will help you distinguish between harmful scrapers that drain your resources and beneficial bots that help your site succeed.

What is Robots.txt?

Definition

The robots.txt file is a simple text file placed in your website's root directory (e.g., www.example.com/robots.txt) that provides instructions to web crawlers about which parts of your site they can access (Wikipedia, Moz).

How It Works

When a bot visits your website, it typically:

  1. Looks for robots.txt first - Before crawling any pages (Cloudflare)
  2. Reads the instructions - Checks which paths are allowed or disallowed
  3. Follows the rules - If it's a "good bot" that respects the file (a minimal crawler-side sketch of this check appears after the list)
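
For illustration, here is a minimal crawler-side sketch of that sequence using Python's standard-library robotparser module; the bot name and URLs are placeholders, not a real crawler's values.

# How a compliant crawler consults robots.txt before fetching a page
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # 1. look for robots.txt
rp.read()                                           # 2. read the instructions

# 3. follow the rules: only fetch the page if it is allowed for this user-agent
url = "https://www.example.com/private/report.html"
if rp.can_fetch("ExampleBot", url):
    print("Allowed - the crawler may request this URL")
else:
    print("Disallowed - a well-behaved crawler skips this URL")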

Primary Functions

  • Control indexing - Prevent search engines from indexing specific files or directories (GeeksforGeeks)
  • Manage server load - Limit bot activity to reduce strain on resources (Moz)
  • Optimize crawl budget - Direct crawlers to high-priority content
  • Guide to sitemaps - Point bots to XML sitemaps for efficient indexing

⚠️ Critical Limitations

Robots.txt is a request, not enforcement (Cloudflare, Netacea).

  • Malicious bots ignore it - Bad actors routinely disregard robots.txt instructions
  • Not a security measure - It doesn't prevent access, only requests compliance
  • Can reveal sensitive areas - Listing blocked paths can inadvertently inform attackers
  • Doesn't guarantee de-indexing - Pages blocked by robots.txt can still appear in search results if linked elsewhere

The Harmful Scrapers

Harmful scrapers are bots that consume excessive server resources, steal content, ignore robots.txt rules, and provide little to no value to website owners. According to Barracuda Networks, these "bad bots" and "gray bots" are becoming increasingly sophisticated in 2025, often leveraging AI to mimic human behavior and evade detection (BetaNews).

1. Bytespider HIGHLY PROBLEMATIC

Owner: ByteDance (TikTok's parent company)

Why It's Harmful:

  • Aggressive crawling rates - Generates massive traffic that can stress servers (WordPress Forums, F5 Labs)
  • Ignores robots.txt - Frequently disregards crawl directives (Reddit discussions)
  • Uses multiple IPs - Employs various IP addresses and hosting services to evade blocking
  • LLM data collection - Primarily scrapes content for training large language models (Dark Visitors)
  • Unintentional DoS - Can cause denial-of-service conditions due to traffic volume

Evidence:

  • F5 Labs documented Bytespider's aggressive behavior and evasion tactics (F5 Labs Report)
  • Multiple webmaster communities report resource exhaustion (WordPress, Reddit)
BLOCK THIS BOT
# Robots.txt Entry
User-agent: Bytespider
Disallow: /
Note: Due to its disregard for robots.txt, you'll likely need server-level blocking via .htaccess, firewall rules, or CDN settings.

2. PetalBot RESOURCE-INTENSIVE

Owner: Huawei (for Petal Search engine)

Why It Can Be Problematic:

  • Resource-intensive crawling - Can consume significant server resources (Reddit, Friendly Captcha)
  • Aggressive behavior - Some webmasters report it doesn't fully respect crawl-delay directives
  • Limited SEO value - Petal Search has minimal market share outside specific regions
  • Performance degradation - Can slow down websites during crawl sessions (DataDome)

When to Allow:

  • Your target audience is in regions where Huawei devices are popular (China, parts of Asia, Middle East)
  • You have sufficient server resources

When to Block:

  • Server resources are limited
  • You don't target Huawei/Petal Search users
  • You notice performance issues
BLOCK (unless targeting Huawei users)
# Robots.txt Entry
User-agent: PetalBot
Disallow: /

3. DataForSeoBot GRAY AREA

Owner: DataForSEO (SEO data aggregation company)

Purpose: Collects data for keyword ranking, competitive intelligence, and SERP analysis (DataDome)

Why It's Controversial:

  • Commercial data extraction - Scrapes your content for resale to SEO professionals
  • No direct benefit - Doesn't help your site's visibility or SEO
  • Resource consumption - Adds to server load without providing value
  • Respects robots.txt - Unlike Bytespider, it does honor robots.txt directives (DataForSEO)

The Case For Allowing:

  • DataForSEO categorizes itself as a "Good Bot" that identifies itself clearly
  • It respects robots.txt settings
  • Minimal impact if your server can handle it

The Case For Blocking:

  • Your content is being monetized by a third party without compensation
  • Server resources are limited
  • You don't want competitors analyzing your content via SEO tools
BLOCK (unless you use DataForSEO services)
# Robots.txt Entry
User-agent: DataForSeoBot
Disallow: /

4. Other Harmful Scrapers to Block

Bot Name | Owner | Why Block | Source
AhrefsBot | Ahrefs | Commercial SEO data extraction, high crawl volume | Imperva
SemrushBot | Semrush | Competitive intelligence gathering | DataDome
MJ12bot | Majestic | Link index building, no direct SEO benefit | Imperva
DotBot | Moz/OpenSiteExplorer | Commercial link analysis | DataDome
BLEXBot | WebMeUp | SEO data aggregation | Community reports
MegaIndex.ru | MegaIndex | Russian SEO tool, aggressive crawling | Community reports
GPTBot | OpenAI | AI training data collection | Dark Visitors
CCBot | Common Crawl | AI/ML dataset building | Dark Visitors

Comprehensive Block List:

# Block harmful scrapers
User-agent: Bytespider
User-agent: PetalBot
User-agent: DataForSeoBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
User-agent: DotBot
User-agent: BLEXBot
User-agent: MegaIndex.ru
User-agent: GPTBot
User-agent: CCBot
Disallow: /

The Necessary Bots

Beneficial bots are essential for your website's success. They help users discover your content, improve your search rankings, enable social sharing, and provide valuable analytics. These bots should ALWAYS be allowed to crawl your site (Medium).

1. Googlebot ESSENTIAL

Owner: Google

Why It's Necessary:

  • Search visibility - Powers Google Search, the world's dominant search engine with 90%+ market share
  • Indexing - Discovers and indexes your content for search results (Rank Math)
  • SEO foundation - Critical for organic traffic and search rankings
  • Respects rules - Follows robots.txt directives and crawl-delay settings

Types of Googlebot:

  • Googlebot - Main web crawler
  • Googlebot-Image - Image search
  • Googlebot-Video - Video search
  • Googlebot-News - Google News
  • Google-InspectionTool - Search Console testing
ALWAYS ALLOW THIS BOT

2. Bingbot ESSENTIAL

Owner: Microsoft (Bing Search)

Why It's Necessary:

  • Second-largest search engine - Bing powers roughly 3-10% of searches depending on market and device (notably higher on desktop)
  • Microsoft ecosystem - Integrated with Windows, Edge, Copilot, and ChatGPT search
  • Yahoo partnership - Also powers Yahoo Search results
  • Growing AI integration - Powers Microsoft Copilot (formerly Bing Chat) features
ALWAYS ALLOW THIS BOT

3. Other Essential Bots

Bot Name | Owner/Purpose | Why Allow
Googlebot | Google Search | 90%+ search market share
Bingbot | Microsoft Bing | 2nd-largest search engine
DuckDuckBot | DuckDuckGo | Privacy-focused search engine
Slurp | Yahoo (via Bing) | Yahoo Search results
YandexBot | Yandex | Dominant in Russia/CIS countries
Baiduspider | Baidu | Dominant in China
Facebookbot | Meta | Social sharing previews
Twitterbot | X/Twitter | Link previews and cards
LinkedInBot | LinkedIn | Professional content sharing
ia_archiver | Internet Archive | Historical web preservation
Applebot | Apple | Siri and Spotlight search
Note: You don't need to explicitly allow these bots unless you've set restrictive default rules. By default, all bots are allowed unless you disallow them.

Key Differences: Good vs. Bad Bots

Characteristic | Good Bots | Bad Bots
Respect robots.txt | ✅ Yes, always | ❌ Often ignore it
Identify themselves | ✅ Clear user-agent strings | ❌ Disguise or rotate agents
Crawl rate | ✅ Reasonable, respects delays | ❌ Aggressive, overwhelming
Purpose | ✅ Indexing, legitimate services | ❌ Data theft, scraping, spam
Benefit to site | ✅ SEO, visibility, traffic | ❌ None, only resource drain
IP behavior | ✅ Consistent, documented ranges | ❌ Rotating, proxy-based
Response to blocks | ✅ Stops when blocked | ❌ Attempts to evade
Documentation | ✅ Official docs available | ❌ Little to no transparency

How to Identify Bot Type

Good Bot Indicators:

  1. Official documentation - Company provides clear bot information
  2. Reverse DNS verification - IP addresses resolve to official domains (see the verification sketch after this list)
  3. Consistent behavior - Predictable crawl patterns
  4. Contact information - Clear way to report issues or request changes
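
As a concrete example of the reverse DNS check, the sketch below (Python standard library only) verifies that an IP address claiming to be Googlebot resolves to a Google-owned hostname and that the hostname resolves back to the same IP, i.e. forward-confirmed reverse DNS; the sample IP is just an illustration standing in for a value from your logs.

# Forward-confirmed reverse DNS: does this "Googlebot" IP really belong to Google?
import socket

def is_verified_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse (PTR) lookup
    except socket.herror:
        return False                                          # no PTR record
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False                                          # not a Google hostname
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                                  # must resolve back to the same IP

print(is_verified_googlebot("66.249.66.1"))  # replace with an IP from your server logs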

Bad Bot Indicators:

  1. High request volume - Thousands of requests in short periods
  2. Ignores delays - Doesn't respect crawl-delay directives
  3. Random user-agents - Changes identification frequently
  4. No reverse DNS - IPs don't resolve to legitimate companies
  5. Suspicious patterns - Targets specific content types aggressively

Robots.txt Best Practices

1. Basic Structure

# Basic robots.txt template

# Allow all good bots by default
User-agent: *
Disallow:

# Block specific harmful scrapers
User-agent: Bytespider
User-agent: PetalBot
User-agent: DataForSeoBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: GPTBot
User-agent: CCBot
Disallow: /

# Point to your sitemap
Sitemap: https://www.example.com/sitemap.xml

2. Protect Sensitive Areas

# Block all bots from admin and private areas
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /private/
Disallow: /user/
Disallow: /cart/
Disallow: /checkout/
Disallow: /*.pdf$
⚠️ Warning: Remember that robots.txt is not security. Use proper authentication for truly sensitive areas.

3. E-commerce Specific

# E-commerce robots.txt
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search/
Disallow: /*?sort=
Disallow: /*?filter=
Allow: /products/
Allow: /categories/

# Block scrapers completely
User-agent: Bytespider
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml

4. Blog/Content Site

# Blog robots.txt
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
Allow: /wp-content/uploads/

# Block AI scrapers if you want to protect content
User-agent: GPTBot
User-agent: CCBot
User-agent: ChatGPT-User
Disallow: /

# Block SEO scrapers
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: MJ12bot
Disallow: /

Sitemap: https://www.example.com/sitemap.xml

5. Testing Your Robots.txt

Tools:

  • Google Search Console - robots.txt report and URL Inspection tool
  • Bing Webmaster Tools - built-in robots.txt tester
  • Third-party validators - standalone online robots.txt checkers

For a programmatic check, see the sketch at the end of this section.

Common Mistakes to Avoid:

  • ❌ Blocking CSS/JS files (hurts SEO)
  • ❌ Using robots.txt for security
  • ❌ Forgetting the trailing slash in directories
  • ❌ Blocking your entire site accidentally
  • ❌ Not including sitemap reference
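
Beyond the dedicated testing tools, you can sanity-check a draft robots.txt locally before deploying it; the following is a minimal sketch using Python's standard-library robotparser, with illustrative rules and URLs.

# Verify that a draft robots.txt blocks and allows what you intend
from urllib import robotparser

draft = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(draft.splitlines())

# Googlebot should reach public pages but stay out of /admin/
print(rp.can_fetch("Googlebot", "https://www.example.com/blog/post"))    # True
print(rp.can_fetch("Googlebot", "https://www.example.com/admin/panel"))  # False

# GPTBot should be blocked everywhere
print(rp.can_fetch("GPTBot", "https://www.example.com/blog/post"))       # False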

Advanced Bot Management

Since malicious bots often ignore robots.txt, you need additional layers of protection:

1. Server-Level Blocking (.htaccess for Apache)

# Block Bytespider (case-insensitive match, answered with 403 Forbidden)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]

# Block multiple bad bots by tagging matching user-agents with an environment variable
SetEnvIfNoCase User-Agent "Bytespider" bad_bot
SetEnvIfNoCase User-Agent "PetalBot" bad_bot
SetEnvIfNoCase User-Agent "AhrefsBot" bad_bot
SetEnvIfNoCase User-Agent "SemrushBot" bad_bot
SetEnvIfNoCase User-Agent "MJ12bot" bad_bot

# Deny tagged requests for GET and POST
# (Order/Allow/Deny is Apache 2.2 syntax; Apache 2.4 needs mod_access_compat or the newer Require directives)
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

2. Nginx Configuration

# Block bad bots in nginx (place inside the relevant server block)
# ~* performs a case-insensitive regex match on the User-Agent header
if ($http_user_agent ~* (Bytespider|PetalBot|AhrefsBot|SemrushBot|MJ12bot)) {
    return 403;
}

3. Cloudflare Bot Management

If you use Cloudflare:

  1. Go to Security > Bots
  2. Enable Bot Fight Mode (free) or Super Bot Fight Mode (paid)
  3. Create custom firewall rules for specific bots
  4. Use Rate Limiting to prevent aggressive crawling

Cloudflare Bot Management Documentation

4. Rate Limiting

# Nginx rate limiting: at most 10 requests per second per client IP
# (limit_req_zone goes in the http context; the location block goes inside a server block)
limit_req_zone $binary_remote_addr zone=bot_limit:10m rate=10r/s;

location / {
    # Allow short bursts of up to 20 extra requests; excess requests get 503 by default
    limit_req zone=bot_limit burst=20 nodelay;
}

5. Monitor Bot Traffic

Tools for Bot Detection:

  • Google Analytics 4 - Filter bot traffic
  • Cloudflare Analytics - Bot traffic insights
  • Server logs - Analyze user-agent patterns (a simple log-parsing sketch follows this list)
  • Wordfence (WordPress) - Bot blocking and monitoring
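
If you want a quick look at which crawlers are actually hitting your site, a short script over the raw access log is often enough; here is a minimal sketch that assumes a combined-format Apache/Nginx log at a placeholder path.

# Count requests per user-agent in a combined-format access log
from collections import Counter
import re

# In common/combined log format the user-agent is the last quoted field on each line
ua_pattern = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        match = ua_pattern.search(line)
        if match:
            counts[match.group(1)] += 1

# The most frequent user-agents - aggressive scrapers usually stand out here
for user_agent, hits in counts.most_common(15):
    print(f"{hits:>8}  {user_agent}")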

Conclusion

Key Takeaways

  1. Robots.txt is essential but limited - It's a request, not enforcement. Malicious bots will ignore it.
  2. Always allow search engine bots - Googlebot, Bingbot, and other legitimate search crawlers are critical for SEO and visibility.
  3. Block aggressive scrapers - Bytespider, PetalBot (unless targeting Huawei users), and commercial SEO bots drain resources without providing value.
  4. Use layered protection - Combine robots.txt with server-level blocking, CDN protection, and rate limiting for comprehensive bot management.
  5. Monitor and adapt - Bot landscapes change constantly. Regularly review your server logs and update your blocking rules.

The 2025-2026 Bot Landscape

According to Imperva's Bad Bot Report 2025, bad bots now account for a significant portion of web traffic, with AI-powered "gray bots" becoming increasingly sophisticated (Imperva). The rise of generative AI has created a new category of scrapers specifically designed to extract training data, making bot management more critical than ever.

Recommended Action Plan

Immediate Steps:

  1. ✅ Create or update your robots.txt file
  2. ✅ Block Bytespider, PetalBot, and commercial SEO bots
  3. ✅ Ensure search engine bots are allowed
  4. ✅ Add your sitemap reference

Advanced Steps:

  1. ✅ Implement server-level bot blocking
  2. ✅ Enable CDN bot protection (Cloudflare, etc.)
  3. ✅ Set up rate limiting
  4. ✅ Monitor bot traffic in analytics
  5. ✅ Regularly review and update rules

Final Thoughts

The battle between website owners and malicious scrapers is ongoing. While robots.txt is a crucial first line of defense, it's just one tool in your arsenal. By understanding the difference between beneficial bots that help your site succeed and harmful scrapers that drain your resources, you can make informed decisions about bot management.

Remember: Not all bots are bad, and not all scrapers are evil. The key is distinguishing between those that provide value (search engines, social media, archiving) and those that only extract it (aggressive scrapers, AI trainers, commercial data miners).

Additional Resources