Subtleties of robots.txt

I recently discovered a subtlety of the /robots.txt file. (For those who don’t know, /robots.txt is a configuration file for web spiders that tells them what URLs not to retrieve from a website.) The issue is this: parsers will not fall back to a more general User-Agent record if they’ve already matched a specific one. This is actually spelled out in A Standard for Robot Exclusion: “If the [User-Agent] value is ‘*’, the record describes the default access policy for any robot that has not matched any of the other records” — but I’d forgotten it when I updated my /robots.txt recently to exclude certain additional URIs (removed from my site due to abuse) that Google seems to have such a fondness for that it won’t drop them from its index.

Here’s an excerpt from /robots.txt that shows the error:

User-agent: Googlebot
Disallow: /CBP/

User-agent: *
Disallow: /cgi-bin/
Disallow: /personal/

Imagine my surprise when Googlebot started spidering /cgi-bin/show?user=Webmaster, my address-obfuscating contact form, and pages under /personal/, which poison the lists of e-mail addresses and URIs that address harvesters collect.

I’ve updated /robots.txt to include both sets of URIs for the Googlebot, but (unlike Inktomi’s Slurp) Google doesn’t retrieve /robots.txt after it’s started spidering a site. I may have to extend my .htaccess blocking for the current dance to prevent the big G from grabbing a bunch of (more) useless garbage from my site.
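Here’s roughly what the corrected records look like; the point is simply that the general disallows have to be repeated under the Googlebot entry, since Googlebot never falls through to the * record:

User-agent: Googlebot
Disallow: /CBP/
Disallow: /cgi-bin/
Disallow: /personal/

User-agent: *
Disallow: /cgi-bin/
Disallow: /personal/

And if the .htaccess route becomes necessary, a minimal sketch (assuming Apache with mod_rewrite, and assuming both directories live under the document root; the pattern is illustrative, not something I’ve actually deployed yet) would be something like:

# Refuse these paths to Googlebot outright, regardless of robots.txt
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
RewriteRule ^(cgi-bin|personal)/ - [F]

That returns 403 Forbidden to Googlebot for anything under those directories, whether or not it ever re-reads /robots.txt.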

Moral of the story: even if you’ve read the documentation, read it again before making assumptions. As my friend John is fond of saying, when you assume you make an ass of u and Dave.
