Subtleties of robots.txt

I recently discovered a subtlety of the /robots.txt file. (For those who don’t know, /robots.txt is a configuration file for web spiders that tells them what URLs not to retrieve from a website.) The issue is this: parsers will not fall back to a more general User-Agent setting if they’ve already matched a specific one. This is actually spelled out in A Standard for Robot Exclusion: “If the [User-Agent] value is ‘*’, the record describes the default access policy for any robot that has not matched any of the other records.” I’d forgotten it, though, when I recently updated my /robots.txt to exclude some additional URIs (removed from my site due to abuse) that Google is so fond of that it won’t drop them from its index.

Here’s an excerpt from /robots.txt that shows the error:

User-agent: Googlebot
Disallow: /CBP/

User-agent: *
Disallow: /cgi-bin/
Disallow: /personal/

Imagine my surprise when Googlebot started spidering /cgi-bin/show?user=Webmaster, my address-obfuscating contact form, and pages under /personal/ which poison the lists of e-mail addresses and URIs that address harvesters collect.

I’ve updated /robots.txt to include both sets of URIs for the Googlebot, but (unlike Inktomi’s Slurp) Google doesn’t re-fetch /robots.txt once it has started spidering a site. I may have to extend my .htaccess blocking for the current dance to prevent the big G from grabbing a bunch more useless garbage from my site.
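For anyone who wants to check this behaviour without waiting for a crawler to misbehave, here is a minimal sketch using Python’s standard urllib.robotparser module. BROKEN is the excerpt above. FIXED is only an assumption about what “both sets of URIs for the Googlebot” looks like, not a copy of the real file. OtherBot is a made-up agent name, and the helper allowed() exists purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# The excerpt from the post: Googlebot gets its own record.
BROKEN = """\
User-agent: Googlebot
Disallow: /CBP/

User-agent: *
Disallow: /cgi-bin/
Disallow: /personal/
"""

# Assumed fix: repeat the general rules inside the Googlebot record.
FIXED = """\
User-agent: Googlebot
Disallow: /CBP/
Disallow: /cgi-bin/
Disallow: /personal/

User-agent: *
Disallow: /cgi-bin/
Disallow: /personal/
"""


def allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    url = "/cgi-bin/show?user=Webmaster"
    # Googlebot matched its own record, so the '*' record is never consulted.
    print("broken, Googlebot:", allowed(BROKEN, "Googlebot", url))
    # A robot with no record of its own falls back to '*'.
    print("broken, OtherBot: ", allowed(BROKEN, "OtherBot", url))
    # With the rules duplicated, Googlebot is kept out as intended.
    print("fixed,  Googlebot:", allowed(FIXED, "Googlebot", url))
```

With the broken file the first check comes back True (Googlebot is let in) and the second False. With the assumed fix, Googlebot is refused as well. This only shows how one parser reads the file; it says nothing about when a given crawler actually re-reads it.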

Moral of the story: even if you’ve read the documentation, read it again before making assumptions. As my friend John is fond of saying, when you assume you make an ass of u and Dave.

All hits all the time

More than half a century ago the debut of vinyl LPs was a revelation for music fans. By the early ’70s, albums were being stuffed with up to a dozen hit tracks and sometimes ran close to 40 minutes.

Flash forward to today, when CDs max out north of 70 minutes… The days of releasing an album with 17 or 18 cuts are over….

MTV.com, via The Shifted Librarian

Um, excuse me? The only CD I’ve bought in recent memory that comes close to 70 minutes or 18 tracks is Darlene’s, and I considered it unique enough to blog about. The 12-track CD is most common in my collection by far, and few are longer than 48 to 50 minutes; I often compile the contents of three discs onto two CD-RWs to take with me in the car.

Which isn’t to say I put quantity ahead of quality or musical preference: just the opposite, in fact. Given the choice between a 70-minute 20-track Christina Aguilera disc and a 12-track release from Spirit of the West, I’ll pick the latter every time.

Really Short Item

Finding it difficult to type due to my wrist, so just a link: Blogistan Pie. Classic.

Oh, and the London Fringe Festival rocks!

Update: The Fringe no longer rocks. The first four shows I attended were great, but the last three have been pretty bad, with the most recent (on the history of blues/jazz/rock) the worst yet. This, to me, is not a good trend.

Later update: All in all, the Fringe rocked. I saw three more shows and they were uniformly excellent, for an overall success rate of 70%. That ain’t half bad.

Ouch

My right wrist has been hurting on and off for the last week, and fairly constantly over the last couple of days. I’ve got reasonably ergonomic setups at home and at work, but I think it’s time I start looking more closely at RSI information.

Semantic HTML lyrics markup

I’ve been trying to figure out the best way to mark up lyrics for the Sirens website (and, tangentially, my Lenni Jabour fan site). For my purposes, best means having the most flexibility for entry and layout while retaining as much semantic value as possible. There are a few options:

<pre>
Preformatted text is probably the simplest way to go, but it’s also semantically poor. It also suffers from an apparent bug in IE (who’d’a thunk?) which doesn’t allow the font of a <pre> element to be styled.
<p/>/<div/>
Piggin.Net suggests using a separate paragraph for each line. This makes layout much easier, but drops the semantic idea of paragraphs being groups of sentences. On the other hand, the stanzas example does group lines into verses using <div/> elements.
<p/>/<br/>
Paragraphs and line breaks are reasonable and commonly used, though they provide somewhat less-flexible layout options than simple <p/> elements. Classes can be applied directly to each paragraph, however, which allows things like <p class="chorus">…</p>.
<p/> plus white-space: pre
Simple paragraphs are probably the most semantically correct, and applying the CSS white-space: pre property allows them to be laid out in lines. Non-CSS2 browsers will see long lines of text for each verse, though, which isn’t the desired effect.
All <div/>, all the time
<div/> throws out all the semantics and most of the style inherent in HTML entirely, leaving everything to CSS. Non-CSS browsers will have a fit, and this isn’t much different otherwise from <p/>/<div/> or <p/>/white-space: pre.
XML markup
Definitely the most semantic, but also the least usable for general browsing. TEI’s Base Tag Set for Verse seems to be the ultimate method; all of the other XML music formats (like 4ML) are note-based, so it’s impossible to do just lyrics. Layout-wise, if more browsers supported styling inline XML it might be worthwhile, but ultimately there’s nothing to see here. Move along.

Markupbation aside, I’m thinking <p/>/<br/> gives the biggest bang for the buck. Add some classes (verse and chorus) and some CSS (.chorus:before { content: 'Chorus:'; }) and I’m ready to go.
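To make the <p/>/<br/> option concrete, here is a rough Python sketch that turns plain-text lyrics into that markup. Everything about the input convention is an assumption made for illustration: stanzas are separated by blank lines, and a stanza whose first line is the marker [chorus] gets the chorus class. The function name lyrics_to_html is likewise hypothetical and is not how either site is actually built.

```python
import re
from html import escape


def lyrics_to_html(text, chorus_marker="[chorus]"):
    """Convert plain-text lyrics to <p class="verse|chorus"> blocks with <br/> line breaks."""
    paragraphs = []
    # Stanzas are assumed to be separated by one or more blank lines.
    for stanza in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in stanza.splitlines() if ln.strip()]
        css_class = "verse"
        # A leading marker line (hypothetical convention) flags a chorus.
        if lines and lines[0].lower() == chorus_marker:
            css_class = "chorus"
            lines = lines[1:]
        if not lines:
            continue
        # Escape each line so stray &, < and > do not break the markup.
        body = "<br/>\n".join(escape(ln) for ln in lines)
        paragraphs.append(f'<p class="{css_class}">\n{body}\n</p>')
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line of a verse\nSecond line\n\n[chorus]\nA line to sing along with\n"
    print(lyrics_to_html(sample))
```

The chorus label itself stays out of the markup, on the theory that the .chorus:before rule supplies it in browsers that support generated content.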

As Dave Shea says, “This is an article I needed to find myself…. Feel free to link gratuitously with [appropriate search term phrases]… so that others may benefit from it.”

One month

Gone gone gone, she been gone so long
She been gone gone gone so long
Gone gone gone, she been gone so long
She been gone gone gone so long
Gone gone gone, she been gone so long
She been gone gone gone so long
Gone gone gone, she been gone so long
She been gone gone gone so long

Ever since she left me, I sure feel all alone
A little misunderstanding
I can’t get her on the telephone…

Chilliwack, My Girl (1981)