The well-behaved web

This word [SHOULD], or the adjective RECOMMENDED, mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.

RFC 2119, section 3

The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. User agents SHOULD include this field with requests.

RFC 2616, section 14.43, User-Agent

SHOULD means SHOULD. From now on, unconfigured web browsing libraries can take a hike: as of this evening I’m blocking bogus user agents, which will be sent to this page. Mark Pilgrim points out in a comment on his recent article on blocking spambots and spybots that “there is no legal requirement for bots to identify themselves… most users won’t bother [changing their User-Agent], because they don’t understand how the web works or why it would matter. And not a lot of people block by User-Agent, so it really doesn’t matter all that much.” Perhaps there’s no legal requirement, but there is an explicit expectation of proper identification in the RFCs quoted above. (RFCs are the de facto protocol standards, much as the W3C’s TRs are the de facto content standards.) “Because it’s the default” is hardly a valid reason not to set a user-agent string.

This isn’t about getting rid of spambots or spybots, although that may be a corollary of the idea. It’s about following the rules; in societal terms, it’s about being polite. I know my piddly little site isn’t going to change the world; if someone can’t retrieve a page they want, they can contact me or move on somewhere else and I won’t mind a bit. But if one person looks at the bogus-bot page and says “hey, I should do that,” then I’ll be a happy camper.

Without further ado, here’s the mod_rewrite magic. Share and enjoy (and suggest more!).

RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Java1 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Java/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl/ [OR]
RewriteCond %{HTTP_USER_AGENT} ^Python-urllib/
RewriteRule .* /bot-redirect [R]
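For anyone on the other side of these rules: setting a real user-agent string takes one line. Here’s a sketch using Python 3’s urllib.request (the modern counterpart of the Python-urllib agents matched above); the bot name and URL are hypothetical placeholders, so substitute your own.

```python
from urllib.request import Request, urlopen

# Hypothetical example identity: a name/version token plus a contact
# URL, per the conventions of RFC 2616 section 14.43.
UA = "mybot/1.0 (+http://example.com/about-mybot)"

req = Request("http://example.com/", headers={"User-Agent": UA})
# urlopen(req) would now send the custom header instead of the default
# "Python-urllib/x.y" that the rules above redirect.
print(req.get_header("User-agent"))
```

The same idea applies to libwww-perl (the `agent` method) and Java clients; any library that sends its default name unchanged is exactly what the rules above are meant to catch.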

Update: Hmm. I should probably include RewriteCond %{HTTP_USER_AGENT} !.*/[0-9] as well, perhaps in place of the ^Java1 rule, whose matches it subsumes. Interesting to note that a couple of the blogbots (obidos-bot and the myelin ecosystem bot in my logs from the last week) would then show up as bogus.

About 12 minutes later: Or not. Re-reading section 3.8, I see that product = token ["/" product-version]; product-version = token, i.e. the /product-version is optional and need not start with a digit anyway. I’m going to claim to have been thrown because both bots misuse the Referer field too.
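The grammar point is easy to check mechanically. A rough sketch of a single product element from section 3.8, with token approximated by the usual allowed-character set (this is an illustration, not a full User-Agent parser):

```python
import re

# product = token ["/" product-version], product-version = token.
# TOKEN approximates RFC 2616's token rule: any run of the usual
# non-separator, non-control characters.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
PRODUCT = re.compile(rf"^{TOKEN}(/{TOKEN})?$")

# "obidos-bot" is a valid product with no version at all, which is
# exactly why the !.*/[0-9] rule above would misfire.
for ua in ("obidos-bot", "Java/1.4.1", "libwww-perl/5.65"):
    print(ua, bool(PRODUCT.match(ua)))
```

All three match, version or no version, so a bare token like obidos-bot is perfectly legal per the letter of the spec.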

And another few minutes: Re-reading 14.43. The User-Agent field itself should be included, but additional tokens are only present by convention, listed in order of their significance for identifying the application. So the letter of the RFC shoots down most of my rationale, minus the check for an empty string. Still, I think I’m following the spirit of 2616:14.43, so I’m going to keep the rules and modify the verbiage of the redirection page slightly.

2 thoughts on “The well-behaved web”

  1. For some reason diveintomark.org always times out on my Trackback pings, although they seem to be received properly. I’ve just upped MT’s PingTimeout value a bit; we’ll see if that does the trick the next time around.

    Apologies for the multi-ping spam, Mark. Maybe I just need to stop re-editing my posts.

Comments are closed.