all you really need in this world is
a glue gun, a bass guitar and a few good punches.
1969-2004
Les Expos sont morts. Vive les Expos!
Bad bots (what you gonna do?)
(Written a month ago, but not posted until now due to lack of editing. I figure if it’s public I’ll be shamed into updating it.)
The following list is gathered from observations of two somewhat-related events. The first event is that I inadvertently sent out Trackback pings from a local copy of my weblog, which set the blogbots all a-twitter; when I realized what I’d done, I began to block access to the URLs in question with HTTP status code 410 (Gone). The second event is an examination of my access logs, in which I discovered (and blocked, generally with code 403) spambots and crawlers that I felt were poorly behaved.
Each criterion for badness is a violation of an element in Mark Pilgrim’s aggregator behaviour specification. I’ve interpreted SHOULD as MUST, as I’m unaware of any good reason for these agents not to follow the suggestion. This isn’t a complete list of user-agents or errors; it just lists the most common ones I’ve seen. Additions and updates are welcome.
User-Agent | Flails? | Ignores 403? | Ignores 404? | Ignores 410? | Referrer spam? | Unconditional retrieval? | Notes |
---|---|---|---|---|---|---|---|
Googlebot/2.1 | yes | ? | yes | ? | no | yes | |
Technoratibot/0.6 | no | ? | ? | ? | yes | no | |
Popdexter/1.0 | yes | ? | ? | ? | yes | yes | Note: unthrottled spider, no robots.txt; adds / to URLs without extension |
PubSub.com RSS reader | no | ? | no | ? | | | 403 and 410 processing fixed; if no more issues, will move to good bots list |
Syndic8/1.0 | no | yes | ? | yes | no | no | ignores 301 |
Feedster Crawler/1.0 | yes | yes | yes | yes | no | no | ignores 301 |
BlogPulse (ISSpider-3.0) | yes | ||||||
BlogLines/2.0 | | | | | | | ignores 301 |
Yahoo! Slurp | | | | | | | ignores 301 |
kinjabot beta2 | |||||||
fastbuzz.com | yes | ||||||
Slower, Friendlier Spiders (BlogShares V1.36) | | | | | | | ignores 301 |
FeedRover 1.0; Headlines Archive | yes | ||||||
Twisted PageGetter | yes | ||||||
everyfeed-spider/1.0 | no | ? | ? | yes | yes | ||
IconSurf/2.0 | yes (by design!) | yes | ? | ? | yes | ? | /favicon.ico is stupid in the first place |
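For the curious, here is a rough sketch in Python of what I'd call good behaviour, keyed to the columns of the table above. It is not taken from the specification itself: the names `FeedState`, `conditional_headers`, `apply_response` and `MAX_FAILURES` are made up for illustration, and the choice to stop outright on 403 and 410 but to tolerate a few 404s before giving up is my own assumption about what "not ignoring" a status code should mean.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical threshold: how many 404s in a row before the robot gives up.
MAX_FAILURES = 3


@dataclass
class FeedState:
    """What a polite robot remembers about one URL between polls."""
    url: str
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    failures: int = 0
    active: bool = True


def conditional_headers(state: FeedState) -> Dict[str, str]:
    """Build request headers so the server can answer 304 Not Modified."""
    headers = {}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers


def apply_response(state: FeedState, status: int, headers: Dict[str, str]) -> str:
    """Update the stored state from one response; return the action taken."""
    headers = {name.lower(): value for name, value in headers.items()}
    if status == 304:
        return "unchanged"              # nothing new, nothing to download
    if status == 301 and "location" in headers:
        state.url = headers["location"]  # permanent move: remember the new URL
        return "moved"
    if status in (403, 410):
        state.active = False            # assumption: forbidden or gone means stop
        return "stopped"
    if status == 404:
        state.failures += 1             # assumption: allow a few misses, then stop
        if state.failures >= MAX_FAILURES:
            state.active = False
            return "stopped"
        return "backoff"
    if status == 200:
        state.failures = 0
        state.etag = headers.get("etag")
        state.last_modified = headers.get("last-modified")
        return "fetched"
    return "backoff"                    # anything else: try again later, slowly
```

A robot built this way would send `conditional_headers(state)` with every request and skip any URL whose `active` flag has been cleared, which covers the 301, 403, 404, 410 and unconditional-retrieval columns; throttling and referrer spam are separate matters.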
Good bots
Giving credit where it’s due, here are the bots (and aggregators) that have been well-behaved so far:
- NewsMonster
- nntprss/0.4-beta-6
- SharpReader/0.9.5.1
Don’t Give Up
Thank you Suzy! You rock!
(And since you won’t accept anything but my gratitude, I’ve doubled my donation on your behalf.)
Get yours today!
Today’s million-dollar idea is this: the Foo Bar™. Comes in caffeinated chocolate, caffeinated peanut, and–for pulling an all-nighter in a lab that can’t get pizza delivered–extra-caffeinated energy varieties. These things would sell like hotcakes at ThinkGeek.
(Hotcakes, eh? Hmmm….)
Oops
I contacted GMPG with my robot profile. However, in the e-mail I sent I munged the URL. (I’ve been trying out WordPress and have a slightly different naming scheme than what’s here. I cut-and-pasted the URL from WordPress, realized the domain was wrong, and fixed it without thinking about the rest of the path.) Matt kindly let me know I’d fubared the link and updated the GMPG folks with the correct one, but I decided I should fix the link anyway through an Apache rewrite:
RewriteRule ^(200.*) /blog/archives/$1 [R=301,L]
So, patch applied and all’s well. I like the shorter URLs, but I’m not going to convert MT just yet.
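For anyone who doesn't read mod_rewrite fluently, the rule's pattern and substitution can be mimicked in a few lines of Python. This is only a sketch: `redirect_target` is a made-up name, the paths in the usage note are hypothetical, and it assumes the rule lives in a per-directory context where the path is matched without its leading slash.

```python
import re
from typing import Optional

# Same pattern as the RewriteRule: anything that starts with "200".
PATTERN = re.compile(r"^(200.*)")


def redirect_target(path: str) -> Optional[str]:
    """Return where a request path would be 301-redirected, or None if the rule does not apply."""
    match = PATTERN.match(path)
    if match is None:
        return None
    # $1 in the substitution is the first captured group.
    return "/blog/archives/" + match.group(1)
```

So a request for a hypothetical `2004/10/example` would be sent to `/blog/archives/2004/10/example`, while anything not starting with `200` is left alone.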
No more Firefox
I will not–can not–promote the Firefox and Thunderbird projects by including their banners on this weblog until this removal is overturned. The hasty, draconian decision has been overruled by Brendan Eich and Ben Goodger. I’ve read the arguments against retaining the feature, and they’re far outweighed by the arguments for it–most overwhelmingly this one from the feature developer that was ignored by drivers:
I am planning to fix all those bugs… for the 1.0 localization freeze (FF 0.10?).
If buggy, untested new features can be added at the last minute to a supposedly stabilizing product, why can’t less-buggy, well-tested, useful features be retained? Asa: “it’s too late in the game to get decent feedback” is not a reason, it’s an excuse.
And a final bit of irony: in Mozilla Seamonkey 1.8’s latest alpha, “Users can now disable CSS via Use Style > None or a global preference”… the very thing that’s been removed from Firefox.
Daniel Glazman sums it up nicely: “If [a list of 7 minor bugs] is accurate, the whole story of the removal of the Style Switcher is a real shame.”
XMDP-style Robot Profile
I’m thinking that HTML should have an element that basically says “content within this section may contain links from external sources; just because they are here does not mean we are endorsing them.”
I’m not convinced an HTML extension is necessary or desirable. Instead, I think this might be better handled through a back-door approach.
…imagine allowing a div that lets you block out links or sections of a page not to index/follow.
It’s a simple idea, really: use an XHTML MetaData Profile and specially-named classes to indicate whether the content of an element should be considered in some way undesirable to a robot. This has several advantages over creating a new element, particularly that it preserves backward and forward compatibility with HTML and XHTML: it’s just a use of an ability that’s there already. It also has an advantage over specially-formatted comments in that it doesn’t require tag-soup handling of XML documents.
Like XFN, the use of special classnames would be indicated by an HTML metadata profile. It might be desirable to have the classnames namespaced–really just another use of the profile attribute–or otherwise made unique so they could be identified without the profile, but for now I’ll stick with the simple case.
On to some examples. The first shows the use of the profile http://example.org/ignore, which indicates that the content marked with an ignore-content class is not to be used in indexing the page.
<head profile="http://example.org/ignore">
...
<div class="ignore-content">There once was a man from Nantucket...</div>
<p>This is not about <span class="ignore-content">porn</span>.</p>
Next, let’s mark some links as not to be followed; example.{tld} links shouldn’t be followed anyway, but this will reinforce that. The text is fair game, though.
<head profile="http://example.org/ignore">
...
<p class="ignore-links">This is <a href="http://example.com/bogus">a bogus link</a>
and so is <a href="http://example.net/bogus">this</a>.</p>
Finally, we’ll cause an entire page to be ignored, similar to the <meta name="robots" content="noindex,nofollow,noarchive"> convention.
<head profile="http://example.org/ignore">
...
<body class="ignore-content ignore-links">
<p>The <a href="http://example.com/">hot girls</a> cooled off with a glass of ice water.</p>
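To show that a robot could honour these classes without much effort, here is a minimal sketch in Python using the standard library's `html.parser`. It is an illustration, not a reference implementation: `ProfileAwareParser` and `scan` are names invented for this example, the classes are only given meaning when the `head` element carries the example profile URI, and I've assumed that `ignore-content` and `ignore-links` are inherited by descendants and act independently of each other.

```python
from html.parser import HTMLParser

PROFILE_URI = "http://example.org/ignore"

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ProfileAwareParser(HTMLParser):
    """Collects indexable text and followable links, honouring the profile's classes."""

    def __init__(self):
        super().__init__()
        self.profile_active = False
        self.stack = []   # one (tag, ignore_content, ignore_links) per open element
        self.text = []    # text a robot may index
        self.links = []   # hrefs a robot may follow

    def _flags(self):
        if self.stack:
            return self.stack[-1][1], self.stack[-1][2]
        return False, False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            # The profile attribute may hold several space-separated URIs.
            self.profile_active = PROFILE_URI in (attrs.get("profile") or "").split()
        classes = (attrs.get("class") or "").split() if self.profile_active else []
        ignore_content, ignore_links = self._flags()   # inherit from the parent
        ignore_content = ignore_content or "ignore-content" in classes
        ignore_links = ignore_links or "ignore-links" in classes
        if tag == "a" and not ignore_links and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag not in VOID:
            self.stack.append((tag, ignore_content, ignore_links))

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self._flags()[0]:
            self.text.append(data)


def scan(html):
    """Return (indexable text, followable links) for one document."""
    parser = ProfileAwareParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.text).split()), parser.links
```

Fed the first two examples above, `scan` would drop the marked spans from the text and return no links for the `ignore-links` paragraph while still keeping its words; a page without the profile is treated as ordinary HTML.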
So, there it is. Is this worth following up? Would GMPG be interested? (I think it follows their principles.)
These are a few of the references and sources for this proposal that haven’t been linked above. In no particular order, and subject to expansion:
- Jim Winstead and other commenters
- Tim Bray’s There’s Still No Such Thing as a Web Site and On Search (particularly Metadata)
- Robots Exclusion META Tag
- Lachlan Hunt has another take. It’s more focused on links, which is what Hixie’s original post was about, and incorporates even more metadata about each link.
I’m not sure a lot of that metadata describes relationships, which is part of the reason I went with classes instead of rel. I’ve seen the light, thanks to the discussion below.
This item is licensed under a Creative Commons License.
First Mac tip
I got rid of the Internet Connect icon!
While trying to get my Mac’s VPN connection to work, I configured several L2TP and PPTP connections in OS X’s Internet Connect control panel. I later discovered that they don’t work with the Cisco VPN software we use at work (nasty words to Cisco) and so I removed them, but the toolbar icon didn’t disappear.
I knew the icon wasn’t there originally, so I was fairly sure it could be gotten rid of. Because it didn’t happen by itself, I decided to experiment.
Menus? Nope.
Preferences? No.
Dragging the icon? No… but hold on, the Option key seems to be a favourite of the single-mouse-button crowd.
How about Option-dragging the icon? Hey, it moved!
How about Option-dragging it to the trash? Bingo, no more icon!
So, thumbs down to Apple for not making the icon disappear in the first place, but thumbs up for making it at least somewhat intuitive to get rid of. (Although describing it as intuitive may be a stretch… others have had the same problem and I haven’t been able to find anyone else’s answer.)
Jeanette and Lenni
Lenni Jabour was recently interviewed on CIUT 89.5 FM, a community radio station in Toronto. As certain of you will understand, I was incredibly disappointed to find when the program aired that I was unable to listen via streaming audio (due to network issues). I got in touch with one of the station programmers, hoping it might be rebroadcast, but the station doesn’t do that, or keep archives of their shows. So it was with reluctance that I admitted to myself that I’d never hear it.
A few days later I made the drive to Toronto to see Lenni’s final show at nia. (The venue is hosting its last performance on Saturday, by the way.) As per usual, I took a seat at one of the front tables. Shortly thereafter, a couple sat down on the other side of the table and an attractive young woman took one of the padded benches at the end. The three of us got talking about music in general and Lenni in particular, and I discovered that the couple were visiting from Michigan and that Lenni’s show had been recommended to them. When I mentioned that Lenni had been on the radio the previous weekend and that despite being a huge fan I’d missed it, the young woman piped up and said that she had done the interview and still had a copy. We exchanged cards and I saw that, sure enough, she was Jeanette (that Jeanette) Cabral, host and producer of About The Music.
Flash forward a couple of weeks. I’m sitting at my desk today when Cathy, a co-worker, comes into the office with a padded manila envelope, return-addressed to Jeanette at CIUT and… containing a CD copy of the interview! Showing enormous self-restraint, I opened it, checked that the CD was intact (it was), and set it aside until I left the office.
I’m pleased to report that Lenni is just as charming on the radio as she is in person, as is Jeanette. There are some intriguing tidbits, including the fact that Lenni appears on an album by a singer whose music is diametrically opposite her own in just about every respect. Lenni and Jeanette talk in some depth about projects I’d previously only heard the names of, discuss The Third Floor and Un Trio D’Hommes Très Gentils, share thoughts on the etiquette of concert-going, and wonder at the loyalty of Lenni’s fans. It’s a wide-ranging, well-done interview, and I’d like to publicly thank Jeanette for her kindness in passing it along. Thank you!