(Written a month ago, but not posted until now due to lack of editing. I figure if it’s public I’ll be shamed into updating it.)
The following list is gathered from observations of two somewhat-related events. The first event is that I inadvertently sent out Trackback pings from a local copy of my weblog, which set the blogbots all a-twitter; when I realized what I’d done, I began to block access to the URLs in question with HTTP status code 410 (Gone). The second event is an examination of my access logs, in which I discovered (and blocked, generally with code 403) spambots and crawlers that I felt were poorly behaved.
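For anyone who wants to run a similar check on their own logs, here is a rough sketch of the kind of scan involved. It assumes the server writes Apache-style combined log lines; the function name `repeat_offenders`, the `threshold` parameter, and the regular expression are all made up for illustration and are not the tooling behind the table in this post. It counts how often each user-agent asks for the same path and is answered with 403 or 410 each time.

```python
import re
from collections import defaultdict

# Assumed format (Apache "combined"):
# host ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def repeat_offenders(lines, statuses=(403, 410), threshold=2):
    """Map each user-agent to the (path, status) pairs it hit repeatedly.

    Only responses whose status is in `statuses` are counted, and a pair is
    reported once it has been seen at least `threshold` times.
    """
    hits = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        status = int(match.group("status"))
        if status in statuses:
            hits[match.group("agent")][(match.group("path"), status)] += 1

    offenders = {}
    for agent, counts in hits.items():
        repeated = {key: n for key, n in counts.items() if n >= threshold}
        if repeated:
            offenders[agent] = repeated
    return offenders
```

Passing an open log file (any iterable of lines) to `repeat_offenders` returns a dictionary keyed by user-agent string. A sensible threshold depends on how long the block has been in place, so the default of 2 is only a starting point.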
Each criterion for badness is a violation of an element in Mark Pilgrim’s aggregator behaviour specification. I’ve interpreted SHOULD as MUST, as I’m unaware of any good reason for these agents to not follow the suggestion. This isn’t a complete list of user-agents or errors; it just lists the most common ones I’ve seen. Additions and updates are welcome.
User-Agent | Flails? | Ignores 403? | Ignores 404? | Ignores 410? | Referrer spam? | Unconditional retrieval? | Notes |
---|---|---|---|---|---|---|---|
Googlebot/2.1 | yes | ? | yes | ? | no | yes | |
Technoratibot/0.6 | no | ? | ? | ? | yes | no | |
Popdexter/1.0 | yes | ? | ? | ? | yes | yes | Note: unthrottled spider, no robots.txt; adds / to URLs without extension |
PubSub.com RSS reader | no | ? | no | ? | | | 403 and 410 processing fixed; if no more issues, will move to good bots list |
Syndic8/1.0 | no | yes | ? | yes | no | no | ignores 301 |
Feedster Crawler/1.0 | yes | yes | yes | yes | no | no | ignores 301 |
BlogPulse (ISSpider-3.0) | yes | ||||||
BlogLines/2.0 | | | | | | | ignores 301 |
Yahoo! Slurp | | | | | | | ignores 301 |
kinjabot beta2 | |||||||
fastbuzz.com | yes | ||||||
Slower, Friendlier Spiders (BlogShares V1.36) | | | | | | | ignores 301 |
FeedRover 1.0; Headlines Archive | yes | ||||||
Twisted PageGetter | yes | ||||||
everyfeed-spider/1.0 | no | ? | ? | yes | yes | ||
IconSurf/2.0 | yes (by design!) | yes | ? | ? | yes | ? | /favicon.ico is stupid in the first place |
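To make the status-code and retrieval columns above concrete, here is a minimal sketch of how a polite client might react to each response. It is one reading of the columns, not a restatement of the aggregator behaviour specification. The names `FeedState`, `conditional_headers`, and `handle_response` are hypothetical, and the choices to stop on 403 and back off on 404 are assumptions. It also assumes the caller supplies response headers as a dictionary with canonical capitalisation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FeedState:
    """What a client remembers about one URL between fetches."""
    url: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    active: bool = True


def conditional_headers(state: FeedState) -> Dict[str, str]:
    """Build request headers so the server can answer 304 instead of
    resending an unchanged document (the opposite of unconditional retrieval)."""
    headers = {}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers


def handle_response(state: FeedState, status: int, headers: Dict[str, str]) -> str:
    """Update `state` for one response and return the action to take."""
    if status == 200:
        # Remember the validators for the next conditional request.
        state.etag = headers.get("ETag")
        state.last_modified = headers.get("Last-Modified")
        return "parse"
    if status == 304:
        return "use-cache"  # nothing changed since the last fetch
    if status == 301:
        # Moved permanently: store the new address instead of the old one.
        location = headers.get("Location")
        if location:
            state.url = location
        return "refetch"
    if status in (403, 410):
        # Assumption: treat Forbidden like Gone and stop asking.
        state.active = False
        return "stop"
    # Assumption: 404 and anything else means back off, not hammer.
    return "retry-later"
```

A client built this way sends `conditional_headers(state)` with every request and only schedules another fetch while `state.active` is true.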
Good bots
Giving credit where it’s due, here are the bots (and aggregators) that have been well-behaved so far:
- NewsMonster
- nntprss/0.4-beta-6
- SharpReader/0.9.5.1