Bad bots (what you gonna do?)

(Written a month ago, but not posted until now due to lack of editing. I figure if it’s public I’ll be shamed into updating it.)

The following list is gathered from observations of two somewhat related events. The first event is that I inadvertently sent out Trackback pings from a local copy of my weblog, which set the blogbots all a-twitter; when I realized what I’d done, I began to block access to the URLs in question with HTTP status code 410 (Gone). The second event is an examination of my access logs, in which I discovered (and blocked, generally with code 403) spambots and crawlers that I felt were poorly behaved.
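To make the two kinds of blocking concrete, here is a minimal sketch of the decision involved: 403 for an agent that is unwelcome whatever it asks for, 410 for a URL that should never have been advertised. It is not the server configuration actually used on this site; the path, the agent substring, and the function name `status_for` are all made up for illustration.

```python
# Hypothetical block lists; the real paths and agents are not given in the post.
GONE_PATHS = {"/weblog/local-test-entry"}        # URLs pinged by mistake
BLOCKED_AGENT_SUBSTRINGS = ("ExampleSpamBot",)   # agents judged badly behaved


def status_for(path: str, user_agent: str) -> int:
    """Return the HTTP status code to send for a request."""
    agent = user_agent.lower()
    # 403 Forbidden: the agent itself is unwelcome, whatever it asks for.
    if any(s.lower() in agent for s in BLOCKED_AGENT_SUBSTRINGS):
        return 403
    # 410 Gone: the URL was advertised once and will not come back.
    if path in GONE_PATHS:
        return 410
    return 200
```

A well-behaved client that receives the 410 should stop asking for that URL; one that receives the 403 should stop asking altogether.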

Each criterion for badness is a violation of an element in Mark Pilgrim’s aggregator behaviour specification. I’ve interpreted SHOULD as MUST, as I’m unaware of any good reason for these agents not to follow the suggestion. This isn’t a complete list of user-agents or errors; it just lists the most common ones I’ve seen. Additions and updates are welcome.
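For comparison, here is a rough sketch of the client behaviour these criteria are looking for: send a conditional request, follow a permanent redirect, and give up on a URL that is gone. It is an illustration under my own assumptions, not a transcription of the specification; `FeedState`, `conditional_headers`, `handle_response`, and the returned action strings are hypothetical names, and the response headers are assumed to arrive as a plain dict with canonical capitalisation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FeedState:
    """What a client remembers about one subscribed URL (hypothetical)."""
    url: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    active: bool = True


def conditional_headers(state: FeedState) -> Dict[str, str]:
    """Build request headers so the server can answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers


def handle_response(state: FeedState, status: int, headers: Dict[str, str]) -> str:
    """Update the stored state and return the action the client should take."""
    if status == 304:
        return "unchanged"              # nothing new; no body was transferred
    if status == 301:
        # Permanent redirect: remember the new URL instead of re-asking the old one.
        state.url = headers.get("Location", state.url)
        return "moved"
    if status == 410:
        state.active = False            # Gone: stop requesting this URL
        return "unsubscribed"
    if status in (403, 404):
        return "back-off"               # do not hammer a URL that refuses or is missing
    if status == 200:
        # Save the validators for the next conditional request.
        state.etag = headers.get("ETag")
        state.last_modified = headers.get("Last-Modified")
        return "updated"
    return "retry-later"
```

An agent that never sends the conditional headers is doing unconditional retrieval; one that keeps requesting after "unsubscribed" or "moved" is ignoring 410 or 301.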

User-Agent | Flails? | Ignores 403? | Ignores 404? | Ignores 410? | Referrer spam? | Unconditional retrieval? | Notes
Googlebot/2.1 | yes | ? | yes | ? | no | yes |
Technoratibot/0.6 | no | ? | ? | ? | yes | no |
Popdexter/1.0 | yes | ? | ? | ? | yes | yes | unthrottled spider, no robots.txt; adds / to URLs without extension
PubSub.com RSS reader | no | yes | no? | yes | no | no? | 403 and 410 processing fixed; if no more issues, will move to good bots list
Syndic8/1.0 | no | yes | ? | yes | no | no | ignores 301
Feedster Crawler/1.0 | yes | yes | yes | yes | no | no | ignores 301
BlogPulse (ISSpider-3.0) | yes
BlogLines/2.0 | | | | | | | ignores 301
Yahoo! Slurp | | | | | | | ignores 301
kinjabot beta2 | yes, every 6 hours | no
fastbuzz.com | yes
Slower, Friendlier Spiders (BlogShares V1.36) | | | | | | | ignores 301
FeedRover 1.0; Headlines Archive | yes
Twisted PageGetter | yes
everyfeed-spider/1.0 | no | ? | ? | yes | yes
IconSurf/2.0 | yes (by design!) | yes | ? | ? | yes | ? | /favicon.ico is stupid in the first place
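The "Ignores 403?" and "Ignores 410?" columns can be checked against an access log. Here is a minimal sketch, assuming the log is in the common "combined" format (request line, status, size, referrer, user-agent); the function name `repeat_requests` is hypothetical, and this is not the procedure actually used to build the table above. It counts, per user-agent, how many times a path was requested again after that agent had already been given the status in question for it.

```python
import re
from collections import Counter
from typing import Iterable

# Combined Log Format: host ident user [time] "METHOD path proto" status size "referrer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def repeat_requests(lines: Iterable[str], status: int = 410) -> Counter:
    """Count requests an agent made for a path after already receiving `status` for it."""
    already_told = set()   # (agent, path) pairs that have received the status once
    repeats = Counter()    # agent -> number of later requests answered with it again
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or int(m.group("status")) != status:
            continue       # skip unparseable lines and other status codes
        key = (m.group("agent"), m.group("path"))
        if key in already_told:
            repeats[m.group("agent")] += 1
        else:
            already_told.add(key)
    return repeats
```

Called as `repeat_requests(open("access.log"), status=403)`, it gives the same tally for blocked agents. A high count suggests an agent that ignores the response, though a single repeat could simply be a retry, so the threshold is a judgment call.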

Good bots

Giving credit where it’s due, here are the bots (and aggregators) that have been well-behaved so far:

  • NewsMonster
  • nntprss/0.4-beta-6
  • SharpReader/0.9.5.1
