Feeding the Googlebot

Photo Matt is seeing something that I’ve noticed for a while as well: Googlebot is making up URLs to retrieve. I’m surprised no one at the big G has heard of RSS autodiscovery; they’ve obviously already got lots of content they could use as a basis. Then again, Googlebot doesn’t recognize application/xhtml+xml pages, either:

64.68.82.28 - - [16/Aug/2004:18:36:53 -0700] "GET /blog/archives/2004/08/16/fun-with-xfn HTTP/1.0" 406 398 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

Following up on some of the comments on Matt’s linkback that indicate otherwise, I’ve discovered that Googlebot sometimes retrieves application/xhtml+xml pages:

64.68.82.18 - - [18/Aug/2004:04:41:20 -0700] "GET /blog/archives/2004/08/16/fun-with-xfn HTTP/1.0" 200 6609 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

Nothing has changed on the page or in the way I’m serving it. Anyone have an explanation for this?
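For anyone who wants to check their own logs for the same behaviour, here is a minimal sketch that pulls the client address, path, and status code out of the two log lines quoted above. The regular expression assumes Apache's combined log format, and the names `parse_line` and `googlebot_statuses` are made up for illustration; this is not the tooling I actually used.

```python
import re

# Assumed layout: Apache "combined" log format, as in the two lines quoted above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The two requests from this post, copied verbatim.
LINES = [
    '64.68.82.28 - - [16/Aug/2004:18:36:53 -0700] '
    '"GET /blog/archives/2004/08/16/fun-with-xfn HTTP/1.0" 406 398 "-" '
    '"Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '64.68.82.18 - - [18/Aug/2004:04:41:20 -0700] '
    '"GET /blog/archives/2004/08/16/fun-with-xfn HTTP/1.0" 200 6609 "-" '
    '"Googlebot/2.1 (+http://www.google.com/bot.html)"',
]


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    return entry


def googlebot_statuses(lines, agent_substring="Googlebot"):
    """List (host, path, status) for every request whose agent matches."""
    results = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and agent_substring in entry["agent"]:
            results.append((entry["host"], entry["path"], entry["status"]))
    return results


if __name__ == "__main__":
    for host, path, status in googlebot_statuses(LINES):
        print(host, path, status)
```

Run against a full log, the same function would show whether the 406 and 200 responses line up with particular crawler addresses. That is only a guess at one possible explanation, not something these two lines can confirm.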

GoMeme 4.0

There are by some estimates more than a million weblogs. But most of them get no visibility in search engines. Only a few A-List blogs get into the top search engine results for a given topic, while the majority of blogs just don’t get noticed. The reason is that the smaller blogs don’t have enough links pointing to them. But this posting could solve that. Let’s help the smaller blogs get more visibility!

This posting is GoMeme 4.0. It is part of an experiment to see if we can create a blog posting that helps 1000s of blogs get higher rankings in Google. So far we have tried 3 earlier variations. Our first test, GoMeme 1.0, spread to nearly 740 blogs in 2.5 days. This new version 4.0 is shorter, simpler, and fits more easily into your blog.

Why are we doing this? We want to help thousands of blogs get more visibility in Google and other search engines. How does it work? Just follow the instructions below to re-post this meme in your blog and add your URL to the end of the Path List below. As the meme spreads onwards from your blog, so will your URL. Later, when your blog is indexed by search engines, they will see the links pointing to your blog from all the downstream blogs that got this via you, which will cause them to rank your blog higher in search results. Everyone in the Path List below benefits in a similar way as this meme spreads. Try it!

Instructions

Just copy this entire post and paste it into your blog. Then add your URL to the end of the path list below, and pass it on! (Make sure you add your URLs as live links or HTML code to the Path List below.)

Path List

  1. Minding the Planet
  2. Luke Hutteman’s public virtual MemoryStream
  3. geek ramblings
  4. Petroglyphs
  5. (your URL goes here! But first, please copy this line and move it down to the next line for the next person).

(NOTE: Be sure you paste live links for the Path List or use HTML code.)

Silly result set 1

The first results of the experiment are in, and they’d seem to indicate that Google places no importance at all on heading elements relative to any other text in a page.

In order, the current results (with duplicate hits included) are:

  1. embedded h2
  2. full content of a normal paragraph
  3. embedded h1
  4. normal text embedded in a paragraph
  5. split between h1 and h2

Note, however, that the test with the h1 at the top of the page (the one I initially forgot to link, and probably the most relevant of any of them) is not yet included. I don’t hold high hopes that it will fare any better, though.
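To put a rough number on how far these results are from the order one would expect if headings mattered, here is a small sketch that computes Spearman's rank correlation between the expected order and the observed one. The short labels and the mapping from result descriptions to test pages are my own reading of the two lists, and the unlinked top-of-page h1 test is left out.

```python
# Expected order if headings helped ranking (top-of-page h1 test omitted).
EXPECTED = [
    "h1 within a page",
    "h2 within a page",
    "split between h1 and h2",
    "entire contents of a p",
    "normal text in a paragraph",
]

# Observed order from the result list above (labels mapped by hand).
OBSERVED = [
    "h2 within a page",
    "entire contents of a p",
    "h1 within a page",
    "normal text in a paragraph",
    "split between h1 and h2",
]


def spearman_rho(expected, observed):
    """Spearman's rank correlation for two orderings of the same items (no ties)."""
    n = len(expected)
    if n < 2 or sorted(expected) != sorted(observed):
        raise ValueError("both lists must contain the same two or more items")
    observed_rank = {item: rank for rank, item in enumerate(observed)}
    d_squared = sum(
        (rank - observed_rank[item]) ** 2 for rank, item in enumerate(expected)
    )
    return 1 - (6 * d_squared) / (n * (n * n - 1))


if __name__ == "__main__":
    print(round(spearman_rho(EXPECTED, OBSERVED), 2))  # prints 0.3
```

A value of 1 would mean the results matched the expected order exactly and -1 would mean they were reversed. With only five items, a figure of 0.3 is far too weak to conclude anything either way.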

I’ll leave this to age for a few more days and see if things change at all, then try a new iteration. Suggestions are welcome, as always.

Fun with XFN

Via Eric Meyer I see that XFN 1.1 has been released. I’ve updated my stylesheet accordingly; it’s quite a bit bigger because I chose to duplicate the 1.0 selectors rather than switch to CSS3 *= syntax. As noted in the original post, it’s largely theoretical: most people can’t use the CSS3 rules, and both CSS2 and CSS3 are made redundant by the last rule that overrides everything above.

Anyway, that’s not what this is about. What this is about is a bit of geek humour using the rel attribute.

All of the above are legal… that is, with respect to the specification! That’s not to say there aren’t some illegal combinations that you wouldn’t necessarily want to come across either.

Comments are open… you know what to do!

Another one bites the dust

Hot on the heels of flaming Mac death, my trusty rusty firewall/mailserver died after a power outage this morning. (Yes, it was hooked up to the UPS. Yes, it shut down cleanly. Yes, when the power had been out for almost an hour I had a flashback to last August 14.) I’m sure it’s just another dead power supply, but again it’s hardly worth replacing, this time due to the age of the machine (hint: it’s a Pentium 120). Fortunately I was able to salvage the hard drive and network cards and install them in one of my other boxes, so the overall downtime was fairly minimal.

Ironically, I’ve been meaning to upgrade the dead box to a recent Linux kernel so I could use some of the more advanced firewalling features; the replacement already had 95% of what I needed, so I actually managed to save myself a few hours.

Silly expert experiment

Can anyone confirm that engines like Google actually make use of heading elements in determining how a page ranks? I’m looking for a link to actual results demonstrating the effect of headings on Google’s ranking of a page; if you have one handy, kindly drop it to me via e-mail. I’d just like to know one way or the other.

No existing results here, but it sounds like a fun experiment along the lines of nigritude ultramarine. So here’s what I’ve done: I’ve created a three-word term that’s not currently found in Google (and which I won’t include here so as to not skew the results). It’s embedded in different ways in six randomly named and titled XHTML 1.0 Strict files that contain only Lipsum text:

  1. as an <h1> at the top of a page
  2. as an <h1> within a page
  3. as an <h2> within a page
  4. split between <h1> and <h2> elements
  5. as the entire contents of a <p>
  6. embedded in the middle of a paragraph as normal text

The pages are linked in random order from a single page. If heading elements do help to determine ranking, one would expect the pages to be ranked in the order I’ve listed them above.

Although I’ve tried to reduce bias as much as possible in the test, it’s not exhaustive and hardly scientific. I’m open to any suggestions on how to improve the method.

(In case you’re wondering, the phrase was chosen by taking words from webpages I had open at the time and adding a descriptive adjective. J. is a fellow Lenni Jabour fan and the host of a radio program in Toronto, and M. is the name of a show being put on by former Lenni cohort Andrew Downing.)

Bleah. While checking to see if Google had crawled the pages (it has), I discovered that I forgot to link the very first item in the list above. I’ve added it now and re-requested a crawl, but it may skew the results. (The term doesn’t appear in search results as of yet, so I have some hope it may not matter.)