How do you control bot, spider, and crawler activity on your site?

Bots, spiders, and crawlers can place significant strain on a server by repeatedly accessing website pages, especially dynamic content. Using a robots.txt file allows site owners to guide compliant search engines on how their content should be crawled, helping reduce unnecessary load and improve overall site performance.

Overview

Bots, spiders, and crawlers accessing dynamic pages can consume significant server resources (CPU and memory). This may result in high server load and degraded website performance.

One way to reduce unnecessary load is to create a robots.txt file at the root of your website. This file tells compliant search engine bots which parts of your site they may or may not crawl.

⚠️ Important:
robots.txt is advisory only. Well-behaved search engines respect it, but malicious bots, scrapers, or email harvesters may ignore it entirely.


Understanding the Problem

Sometimes search engine bots (Google, Yahoo, etc.) may over-crawl your site. If a bot cannot complete a crawl due to resource limits, it may retry repeatedly, making the problem worse.
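To see which clients are generating the most requests, you can tally hits per IP address from your access log. The following is a minimal sketch, assuming each log line starts with the client IP (as in the common Apache/Nginx log formats); the function name top_clients and the sample log lines are hypothetical and for illustration only.

```python
from collections import Counter


def top_clients(log_lines, limit=5):
    """Return the busiest client IPs, assuming the IP is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(limit)


# Hypothetical sample lines; in practice, read them from your access.log
sample_log = [
    '66.249.66.167 - - [01/Jan/2024:00:00:01 +0000] "GET /a HTTP/1.1" 200 512',
    '66.249.66.167 - - [01/Jan/2024:00:00:02 +0000] "GET /b HTTP/1.1" 200 512',
    '203.0.113.9 - - [01/Jan/2024:00:00:03 +0000] "GET / HTTP/1.1" 200 1024',
]

print(top_clients(sample_log))
# [('66.249.66.167', 2), ('203.0.113.9', 1)]
```

An address that dominates this list is a candidate for the ownership check described in the next section.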

Blocking Specific Bots

Blocking Googlebot

Example scenario:

You find the following IP address in your access.log:
66.249.66.167

Verify who owns the address by running a reverse DNS lookup from an SSH session:  host 66.249.66.167

Result: crawl-66-249-66-167.googlebot.com
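The same check can be scripted. The sketch below performs the reverse lookup and then adds a forward lookup to confirm that the hostname resolves back to the same IP, since a reverse DNS record alone can be spoofed. The function name is_verified_googlebot, the injectable lookup parameters, and the list of accepted domain suffixes are assumptions made for this example, not part of the original procedure.

```python
import socket

# Assumption: hostnames of genuine Googlebot crawlers end in one of these
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    try:
        hostname = reverse_lookup(ip)[0].lower()
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must return the original IP address
        return ip in forward_lookup(hostname)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

Calling is_verified_googlebot("66.249.66.167") performs live DNS queries, so the result depends on your resolver and on current DNS records.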

To block Googlebot entirely, add this to robots.txt:

# Block Googlebot
User-agent: Googlebot
Disallow: /

Field explanation:

  • # → Comment (for your reference only)

  • User-agent → The bot being targeted

  • Disallow: / → Blocks access to the entire site
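Before deploying a rule, you can check how a compliant parser would read it. This is a minimal sketch using Python's standard urllib.robotparser module; the user agent name ExampleBot is hypothetical and simply stands in for any crawler other than Googlebot.

```python
from urllib.robotparser import RobotFileParser

googlebot_rules = """\
# Block Googlebot
User-agent: Googlebot
Disallow: /
"""

block_parser = RobotFileParser()
block_parser.parse(googlebot_rules.splitlines())

# Googlebot is denied everywhere; other bots are unaffected by this group
print(block_parser.can_fetch("Googlebot", "/index.html"))   # False
print(block_parser.can_fetch("ExampleBot", "/index.html"))  # True
```

This only shows how a rule-following crawler interprets the file; it does not prove that any particular bot will obey it.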


Slowing Yahoo Crawlers

Yahoo’s crawler (Slurp) respects the Crawl-delay directive.

Example: Limit Yahoo to one request every 10 seconds:

# Slow down Yahoo
User-agent: Slurp
Crawl-delay: 10

Field explanation:

  • User-agent: Slurp → Yahoo’s crawler

  • Crawl-delay → Time (in seconds) between requests
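To confirm that the delay is attached to the intended crawler, you can read it back with Python's standard urllib.robotparser module. This is a sketch only; ExampleBot is a hypothetical user agent used to show that bots outside the Slurp group receive no delay from this rule.

```python
from urllib.robotparser import RobotFileParser

slurp_rules = """\
# Slow down Yahoo
User-agent: Slurp
Crawl-delay: 10
"""

delay_parser = RobotFileParser()
delay_parser.parse(slurp_rules.splitlines())

# The delay applies only to the user agent named in the group
print(delay_parser.crawl_delay("Slurp"))       # 10
print(delay_parser.crawl_delay("ExampleBot"))  # None
```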

Slowing All Well-Behaved Bots

To apply a crawl delay to most compliant bots:

User-agent: *
Crawl-delay: 10

Notes:

  • Applies to every bot that honors the Crawl-delay directive

  • Googlebot ignores Crawl-delay

  • To control Googlebot’s crawl rate, use Google Search Console

Blocking All Bots

Block All Crawlers Completely

User-agent: *
Disallow: /

Note: Compliant search engines will stop crawling your site, and it will typically drop out of their indexes over time.

Block a Specific Directory 

User-agent: *
Disallow: /yourfolder/

Note:  This blocks compliant crawlers from the specified directory only; the rest of the site remains crawlable.
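Disallow rules match by path prefix, so the trailing slash matters. The sketch below uses Python's standard urllib.robotparser module to show which paths the directory rule covers; ExampleBot and the file names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

directory_rules = """\
User-agent: *
Disallow: /yourfolder/
"""

dir_parser = RobotFileParser()
dir_parser.parse(directory_rules.splitlines())

# Anything under /yourfolder/ is blocked for every compliant bot
print(dir_parser.can_fetch("ExampleBot", "/yourfolder/page.html"))  # False
# Paths outside the directory stay crawlable
print(dir_parser.can_fetch("ExampleBot", "/index.html"))            # True
# Without the trailing slash, the path does not match the prefix
print(dir_parser.can_fetch("ExampleBot", "/yourfolder"))            # True
```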


Default (Allow Everything)

If you do not want to block any bots:

User-agent: *
Disallow:

Note:  Alternatively, you may remove robots.txt entirely if you are not concerned about 404 log entries.

Security & Best Practices

  • Blocking all bots (User-agent: * with Disallow: /) may cause legitimate search engines to de-index your site.

  • Malicious bots often ignore robots.txt.

  • Some bad bots may:

    • Use fake or misleading User-Agents

    • Treat robots.txt as a list of valuable targets

  • Blocking bots via .htaccess, firewall rules, or rate-limiting is often more effective for malicious traffic.
