{"id":199,"date":"2003-02-28T20:15:33-05:00","date_gmt":"2003-03-01T04:15:33+00:00","guid":{"rendered":"http:\/\/peterjanes.ca\/wordpress\/?p=199"},"modified":"2003-02-28T20:15:33-05:00","modified_gmt":"2003-03-01T04:15:33+00:00","slug":"the-well-behaved-web","status":"publish","type":"post","link":"https:\/\/peterjanes.ca\/blog\/2003\/02\/28\/the-well-behaved-web\/","title":{"rendered":"The well-behaved&nbsp;web"},"content":{"rendered":"<div class='e-content'><blockquote><p>This word [<cite>SHOULD<\/cite>], or the adjective <cite>RECOMMENDED<\/cite>, mean that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.<\/p><\/blockquote>\n<p class=\"source\"><a href=\"http:\/\/www.ietf.org\/rfc\/rfc2119.txt\">RFC 2119<\/a>, section 3<\/p>\n\n<blockquote><p>The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. User agents SHOULD include this field with requests.<\/p><\/blockquote>\n<p class=\"source\"><a href=\"http:\/\/www.ietf.org\/rfc\/rfc2616.txt\">RFC 2616<\/a>, section 14.43, <cite>User-Agent<\/cite><\/p>\n\n<p><cite>SHOULD<\/cite> means <cite>SHOULD<\/cite>.  From now on, unconfigured web browsing libraries can take a hike: as of this evening I&#8217;m blocking bogus user agents, which will be sent to <a href=\"\/bot-redirect\">this page<\/a>.  
Mark Pilgrim points out in a <a href=\"http:\/\/diveintomark.org\/archives\/2003\/02\/26\/how_to_block_spambots_ban_spybots_and_tell_unwanted_robots_to_go_to_hell.html#c000468\">comment on his recent article<\/a> on blocking spambots and spybots that <q>there is no legal requirement for bots to identify themselves&#8230; most users won&#8217;t bother [changing their User-Agent], because they don&#8217;t understand how the web works or why it would matter. And not a lot of people block by User-Agent, so it really doesn&#8217;t matter all that much.<\/q>  Perhaps there&#8217;s no <em>legal<\/em> requirement, but there is an explicit expectation of proper identification given by the RFCs quoted above.  (RFCs are the de facto protocol standards, much as the W3C&#8217;s <abbr title=\"Technical Recommendations\">TRs<\/abbr> are the de facto content standards.)  <q>Because it&#8217;s the default<\/q> is hardly a valid reason not to set a user-agent string.<\/p>\n\n<p>This isn&#8217;t about getting rid of spambots or spybots, although that may be a corollary of the idea.  It&#8217;s about following the rules&#8211;in societal terms, it&#8217;s about <em>being polite<\/em>.  I know my piddly little site isn&#8217;t going to change the world; if someone can&#8217;t retrieve a page they want they can contact me or move on somewhere else and I won&#8217;t mind a bit.  But if one person looks at the bogus-bot page and says <q>hey, I <em>should<\/em> do that<\/q> then I&#8217;ll be a happy camper.<\/p>\n\n<p>Without further ado, here&#8217;s the <cite>mod_rewrite<\/cite> magic.  Share and enjoy (and suggest more!).<\/p>\n\n<pre>RewriteCond %{HTTP_USER_AGENT} ^$ [OR]\nRewriteCond %{HTTP_USER_AGENT} ^Java1 [OR]\nRewriteCond %{HTTP_USER_AGENT} ^Java\/ [OR]\nRewriteCond %{HTTP_USER_AGENT} ^libwww-perl\/ [OR]\nRewriteCond %{HTTP_USER_AGENT} ^Python-urllib\/\nRewriteRule .* \/bot-redirect [R]<\/pre>\n\n<p><ins datetime=\"2003-03-01T02:25:00-05:00\">Update: Hmm.  
I should probably include <code>RewriteCond %{HTTP_USER_AGENT} !.*\/[0-9]<\/code> as well, perhaps in place of the <code>Java1<\/code> rule that it subsumes.  Interesting to note that a couple of the blogbots (<cite>obidos-bot<\/cite> and the <cite>myelin<\/cite> ecosystem bot in my logs from the last week) would then show up as bogus.<\/ins><\/p>\n\n<p><ins datetime=\"2003-03-01T02:37:00-05:00\">About 12 minutes later: Or not.  Re-reading section 3.8 of RFC 2616, I see that <code>product = token [\"\/\" product-version]; product-version = token<\/code>, i.e. the <code>\/product-version<\/code> is optional and need not start with a digit anyway.  I&#8217;m going to claim to have been thrown because both bots <a href=\"http:\/\/www.kottke.org\/03\/01\/030130rss_readers_.html\">misuse the Referer field<\/a> too.<\/ins><\/p>\n\n<p><ins datetime=\"2003-03-01T02:53:00-05:00\">And another few minutes: Re-reading 14.43.  The <cite>User-Agent<\/cite> field itself should be included, but additional tokens are only present by convention, <q>listed in order of their significance for identifying the application.<\/q>  So the letter of the RFC shoots down most of my rationale, minus the check for an empty string.  Still, I think I&#8217;m following the spirit of 2616:14.43, so I&#8217;m going to keep the rules and modify the verbiage of the redirection page slightly.<\/ins><\/p><\/div><div class=\"syndication-links\"><\/div>","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve decided to take <cite>SHOULD<\/cite> as it&#8217;s meant to be taken.  
From now on, unconfigured web browsing libraries can take a hike.<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"mf2_syndication":[],"venue_id":0},"categories":[3],"tags":[],"kind":false,"_links":{"self":[{"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/posts\/199"}],"collection":[{"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/comments?post=199"}],"version-history":[{"count":0,"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/posts\/199\/revisions"}],"wp:attachment":[{"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/media?parent=199"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/categories?post=199"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/peterjanes.ca\/blog\/wp-json\/wp\/v2\/tags?post=199"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}