XMDP-style Robot Profile

I’m thinking that HTML should have an element that basically says content within this section may contain links from external sources; just because they are here does not mean we are endorsing them.

I’m not convinced an HTML extension is necessary or desirable. Instead, I think this might be better handled through a back door approach.

…imagine allowing a div that lets you block out links or sections of a page not to index/follow.

It’s a simple idea, really: use an XHTML MetaData Profile and specially-named classes to indicate whether the content of an element should be considered in some way undesirable to a robot. This has several advantages over creating a new element, particularly that it preserves backward and forward compatibility with HTML and XHTML: it’s just a use of an ability that’s there already. It also has an advantage over specially-formatted comments in that it doesn’t require tag-soup handling of XML documents.

Like XFN, the use of special classnames would be indicated by an HTML metadata profile. It might be desirable to have the classnames namespaced–really just another use of the profile attribute–or otherwise made unique so they could be identified without the profile, but for now I’ll stick with the simple case.

On to some examples. The first shows the use of the profile http://example.org/ignore, which indicates that the content marked with an ignore-content class is not to be used in indexing the page.

<head profile="http://example.org/ignore">
...
<div class="ignore-content">There once was a man from Nantucket...</div>
<p>This is not about <span class="ignore-content">porn</span>.</p>
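To make the robot's side concrete, here is a minimal sketch of how an indexer might honor the profile. It assumes the placeholder http://example.org/ignore URL and the ignore-content class name from the example above (neither is a real standard), and it skips void-element handling for brevity:

```python
# Sketch of an indexer that drops text inside ignore-content subtrees.
# The profile URL and class name are the placeholders used above.
from html.parser import HTMLParser

IGNORE_PROFILE = "http://example.org/ignore"

class IndexableText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.profile_active = False
        self.ignore_depth = 0  # > 0 while inside an ignored subtree
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head" and attrs.get("profile") == IGNORE_PROFILE:
            self.profile_active = True
        classes = attrs.get("class", "").split()
        # The class only takes on meaning once the profile is declared.
        if self.ignore_depth or (
            self.profile_active and "ignore-content" in classes
        ):
            self.ignore_depth += 1  # track nesting within the ignored subtree

    def handle_endtag(self, tag):
        if self.ignore_depth:
            self.ignore_depth -= 1

    def handle_data(self, data):
        if not self.ignore_depth:
            self.words.extend(data.split())

doc = """<html><head profile="http://example.org/ignore"></head><body>
<p>This is not about <span class="ignore-content">porn</span>.</p>
</body></html>"""

parser = IndexableText()
parser.feed(doc)
print(" ".join(parser.words))  # -> This is not about .
```

The anti-keyword simply vanishes from the indexable text; without the profile declaration in head, the same markup would be indexed in full.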

Next, let’s mark some links as not to be followed; example.{tld} links shouldn’t be followed anyway, but this will reinforce that. The text is fair game, though.

<head profile="http://example.org/ignore">
...
<p class="ignore-links">This is <a href="http://example.com/bogus">a bogus link</a>
and so is <a href="http://example.net/bogus">this</a>.</p>
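A crawler could honor ignore-links the same way, skipping hrefs inside marked subtrees while leaving the surrounding text indexable. Another sketch under the same placeholder assumptions (again ignoring void elements):

```python
# Sketch of a crawler that skips links inside ignore-links subtrees.
from html.parser import HTMLParser

IGNORE_PROFILE = "http://example.org/ignore"

class FollowableLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.profile_active = False
        self.ignore_depth = 0  # > 0 while links are being ignored
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head" and attrs.get("profile") == IGNORE_PROFILE:
            self.profile_active = True
        classes = attrs.get("class", "").split()
        if self.ignore_depth or (
            self.profile_active and "ignore-links" in classes
        ):
            self.ignore_depth += 1
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.ignore_depth:
            self.ignore_depth -= 1

doc = """<html><head profile="http://example.org/ignore"></head><body>
<p class="ignore-links">This is <a href="http://example.com/bogus">a bogus
link</a> and so is <a href="http://example.net/bogus">this</a>.</p>
<p>But <a href="http://example.org/">this link</a> is fair game.</p>
</body></html>"""

parser = FollowableLinks()
parser.feed(doc)
print(parser.links)  # -> ['http://example.org/']
```

Note that only the hrefs are suppressed; the link text still reaches the indexer, which matches the "text is fair game" intent.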

Finally, we’ll cause an entire page to be ignored, similar to the <meta name="robots" content="noindex,nofollow,noarchive"> convention.

<head profile="http://example.org/ignore">
...
<body class="ignore-content ignore-links">
<p>The <a href="http://example.com/">hot girls</a> cooled off with a glass of ice water.</p>
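To round things out, the document at the profile URL could itself define the class values using the XMDP convention of nested definition lists. A hypothetical sketch (the wording of the definitions is mine, paraphrasing the examples above):

```html
<!-- Hypothetical XMDP profile, imagined to live at http://example.org/ignore.
     XMDP defines terms with an XHTML definition list of class "profile". -->
<dl class="profile">
  <dt>class</dt>
  <dd>
    <dl>
      <dt>ignore-content</dt>
      <dd>Text within the element should not be used in indexing the page.</dd>
      <dt>ignore-links</dt>
      <dd>Links within the element should not be followed or counted.</dd>
    </dl>
  </dd>
</dl>
```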

So, there it is. Is this worth following up? Would GMPG be interested? (I think it follows their principles.)


This item is licensed under a Creative Commons License.

6 thoughts on “XMDP-style Robot Profile”

  1. Hi,
    Thanks for your feedback about my idea to use rev and rel attributes. Can you explain a little more about why you say “I’m not sure a lot of that metadata describes relationships”?

    With regard to the use of class names, the HTML4 spec explicitly says it is for styling or general-purpose processing. Thus, attaching specific semantics to it would be incorrect. Also, class has been used on nearly every website in existence; so not only would it be impossible to come up with class names that are guaranteed not to have been used elsewhere for other purposes, but you would also be defining semantics for an attribute that was not designed for them.

    In order to define a way to say that all links within a section are to be ignored, you would need to define either a new attribute or element, as you and Hixie suggested. However, I think this could potentially be abused by authors who don’t fully understand the ramifications of doing so. I’m worried that we would start seeing thousands of pages abusing the system so that all links are ignored regardless of what they are actually for. If enough authors abused the system (and history shows us that, given the opportunity, they *will*), it may actually reduce the accuracy of page rank rather than enhance it. This is because doing so only says to ignore the links, but says nothing about the reason for doing so. This is one reason why I went with individual rev and rel attributes — it means the author is forced to think about the semantics before doing so.

  2. Reading it again, I see how the data can describe relationships (actually judgments, which is what we’re really talking about). I think the Accessibility section threw me off; throwing in wcag-A and section-508 in the same document as awful and inaccurate is a huge left turn. I’m also not convinced anyone will ever use anything other than rel="inaccessible". I may be misreading again, but it appears to be implied that Quality values are used to comment on the content of a page and Accessibility values are used to comment on the markup and presentation?

    I admit to taking some liberty with Ian’s original goal; as well as judging links to be worthy or unworthy, I wanted to include the idea that parts of the page itself might be endorsed or not. (The span example basically shows how to use the idea to mark anti-keywords.) As I noted to Ian in the e-mail that was the origin of my post–but which I somehow dropped from the final result–search engines can get the strangest idea about what a particular site is about from a few words or phrases. (See Ian’s own experience from 2002.)

    I disagree on the interpretation of general purpose processing by user agents, by the way; I wasn’t there when it was written, obviously, but I think the intention of the HTML4 spec’s phrasing was to say that classes can be used for any purpose you want. I chose simple terms like ignore-links and ignore-content just to make the purpose clear; in practise they would probably be something more unique. But not necessarily: also note that, as with XFN’s rel/rev values, the specially-named classes only take on meaning when their profile is named in the page. I’d actually suggest something similar for your relationships; who’s to say someone’s not already using rel=”comment” in their markup?

    Finally, what stops a page author from abusing rel and rev the same way you suggest they’ll abuse classes? If I were in an ornery mood I could add rel="awful inaccessible inaccurate mature-adult member-only" to every link regardless of what I really think about it. (Similarly, I could write <a href="http://www.microsoft.com/" rel="excellent accurate believable wcag-AAA endorsed">…</a>. *grin*)

  3. > I think the Accessibility section threw me off; throwing in wcag-A and
    > section-508 in the same document as awful and inaccurate is a huge
    > left turn.

    I don’t see why? It’s quite possible for a linked resource to be highly accessible, yet still contain awful or inaccurate content; and vice versa.

    > I’m also not convinced anyone will ever use anything other
    > than rel=”inaccessible”.

    For those of us who are genuinely concerned about accessibility, declaring content as accessible would help readers find accessible content; and if a user can easily find good, accessible information from your page, they’re likely to have a higher opinion of you and your content.

    Google could also provide options to rank results according to their accessibility, accuracy, quality and rating. It could even show that information for each result, which would help the user decide which sites to visit. Thus it benefits not only the author who is honest, but the web as a whole.

    > I may be misreading again, but it appears to be implied that Quality
    > values are used to comment on the content of a page and Accessibility
    > values are used to comment on the markup and presentation?

    Yes, the Quality values refer to the content; but the Accessibility only refers to the markup. Presentation is irrelevant, which is why I didn’t even consider adding “beautiful” and “ugly” values.

    > I admit to taking some liberty with Ian’s original goal; as well as
    > judging links to be worthy or unworthy, I wanted to include the idea
    > that parts of the page itself might be endorsed or not.

    For a user agent, the author’s opinion about their own content is irrelevant. For a user, the endorsement of content can be obtained from the context. That’s what I believe makes this method work, and in fact what has made Google so powerful — it depends more on others’ opinions of your page than on your own.

    > XFN’s rel/rev values … only take on meaning when their profile
    > is named in the page. I’d actually suggest something similar for
    > your relationships.

    Yes, that’s what I was intending to do. I thought that was clear from where I wrote:

    “It involves defining a profile of values for the rev and rel attributes
    in (X)HTML”

    > Finally, what stops a page author from abusing rel and rev the same
    > way you suggest they’ll abuse classes?

    Firstly, because it takes a lot more effort than just putting a single value in the body, or other containing element. Secondly, because I think this can be implemented in a way that benefits both the user and the author for being honest.

    > If I were in an ornery mood I could add rel="awful inaccessible
    > inaccurate mature-adult member-only" to every link regardless
    > of what I really think about it. (Similarly, I could write
    > <a href="http://www.microsoft.com/" rel="excellent accurate
    > believable wcag-AAA endorsed">…</a>. *grin*)

    Yes, indeed you could. You could also write hreflang="fr", even though the resource is in English. But that doesn’t mean you do, or would even consider it.

    However, that’s the beauty of a democracy! It would take more than the opinion of one rogue individual to seriously affect someone else’s page rank. Finally, if a user can’t find any good, accessible and accurate content from your site, they’ll go elsewhere. Conversely, if you state that some content is good, accessible and accurate, but it is not, then they will not appreciate you for it, and may not return to your site.

  4. I think we’re talking at cross purposes here for the most part, and I’m pretty sure it’s my fault for linking my idea to yours. Both benefit the user and the author, yours through reputation and mine through limiting irrelevancy, but the two don’t solve the same problem. (However, the things they do solve are facets of a larger issue, which I suppose is best referred to as the Semantic Web.)

    > the author’s opinion about their own content is irrelevant

    I think this is the fundamental difference between the two concepts. The basis of mine is to keep undesirable traffic away from certain pages–generally my own–by making it so that robots don’t associate certain words, phrases and links with an otherwise unrelated post. In other words, I may not know or recognize what parts of my content are relevant to a particular topic, but I can hazard a pretty good guess towards the bits that aren’t. XRP can be almost totally summed up as a localized robot exclusion protocol; just as I have a robots.txt file to keep robots from indexing certain specific subsets of my site, I’d have ignore-content and ignore-links to keep them from indexing certain specific content (including entire pages).

    I believe that your profile, on the other hand, is targeted at users and robots (by which I mean non-interactive UAs) and is mainly about external content. (I suppose you could link to your own pages as being accurate and of high quality–and in your case you’d be correct to!–but it could be considered disingenuous to do so by the hoi polloi.) Two parts of it in particular, quality and accuracy, fill in the relevancy aspect an author may not be able to judge objectively, and which can’t be represented in my profile.

    In short, I think the two profiles are complementary. They might overlap slightly, in that unendorsed links are similar to ignored links, but I think the reasons to choose one or the other (or both!) are fairly clear.

    A few quickies regarding other items in your comment.

    Accessibility terminology: wcag-A et al are the accurate terms to use, certainly, but they seem out of place with vernacular words like awful. That’s all I meant to say by a huge left turn. inaccessible is the only one a lot of people will have any clue about; I’ve read the WCAG and I couldn’t tell you what the different levels mean, other than that wcag-AAA is better than wcag-A.

    Quality/accessibility vs. content/markup+presentation: I can make a site that’s got beautiful markup, but if I set it to a 1pt yellow-on-white blinking font it’s going to be inaccessible to most users. Screen readers will be fine, folks who know about turning off CSS will be fine, but everyone else will run screaming from the room.

    Democracy and page rank: Understood, I just wanted to point out the potential for abuse. If I’m a search engine spammer I’m going to put high ratings on every link to every one of my sites. If I’m astroturfing I’m going to slag my competitors with poor ratings.

    Speaking of abuse: I think we’re agreed that anything can be (and will be, and has been) abused, be it rel/rev attributes, classes, profiles, tables, header elements, etc. I’m still not convinced rel/rev is any harder to abuse than classes–just putting a single value in the body is exactly what you’re doing with both–but that’s a quibble.

    So, as the Vogon captain said, tell me how good you thought my poem was.

  5. > I believe that your profile, on the other hand, is targeted at users
    > and robots (by which I mean non-interactive UAs) and is mainly about
    > external content.

    Well, it’s actually targeted at both interactive and non-interactive UAs. E.g., an interactive UA could provide a warning before following a link marked as restricted or mature-adult, or highlight links that the author has marked as good or excellent, etc. But, yes, it is about external content; so in a way it could be complementary to your idea.

    > (I suppose you could link to your own pages as being accurate and of
    > high quality–and in your case you’d be correct to!…)

    Gee, thanks. It’s always nice to get positive feedback — but I have yet to get any negative feedback from anyone, so I don’t know whether I’ve written anything bad; which, IMO, is equally important to know.

    > So, as the Vogon captain said, tell me how good you thought my
    > poem was.

    Only if you promise not to have your Vogon guard throw me off your ship! 🙂 It’s not as bad as the Azgoths of Kria’s Ode to a Small Lump of Green Putty… 😉 and your ideas are floating around my head in exactly the same way the bricks don’t. That is to say, that I think your “poetry” is good.

  6. Pingback: Petroglyphs
