Semantics in HTML 5

I’m going to make a bold prediction. Long after you and I are gone, HTML will still be around. Not just in billions of archived pages from our era, but as a living, breathing entity. Too much effort, energy, and investment has gone into developing the web’s tools, protocols, and platforms for it to be abandoned lightly, if indeed at all.

Let’s stop to consider our responsibility. By an accident of history, we are associated with the development of an important tool our civilization will use to communicate for decades to come. So, when we turn our minds, idly or in earnest, to improving HTML, we must understand just how far-reaching the ramifications of today’s decisions may be.

HTML 5, the W3C’s recently redoubled effort to shape the next generation of HTML, has, over the last year or so, taken on considerable momentum. It is an enormous project, covering not simply the structure of HTML, but also parsing models, error-handling models, the DOM, algorithms for resource fetching, media content, 2D drawing, data templating, security models, page loading models, client-side data storage, and more.

There are also revisions to the structure, syntax, and semantics of HTML, some of which Lachlan Hunt covered in “A Preview of HTML 5.”

But for this article, let’s turn solely to the semantics of HTML. It’s something I’ve been interested in for many years, and something which I believe is fundamentally important to the future of HTML.

The BBC recently announced that they would drop the hCalendar microformat from their program listings, due to accessibility and usability concerns with the abbr design pattern. This demonstrates that we have, beyond any doubt, pushed the semantic capability of HTML far past what was ever intended, and indeed, what is reasonably possible with the language. We have simply run out of HTML elements and attributes with which to mark up more richly semantic documents. If we continue to be clever with the existing constructs of HTML, more problems such as this will arise. But HTML suffers from a fundamental defect as a semantic markup language—its semantics are fixed, not extensible.

This is not simply a theoretical problem. Hundreds of thousands of developers use the class and id attributes of HTML to create more richly semantic markup. (They also use them as “hooks” for CSS styling, but that’s another matter.) Almost invariably, those developers use ad hoc vocabularies—that is, values they have made up, rather than values taken from existing schemas. It’s pseudo semantic markup at best.

Many pages around the web use microformats to add more structured semantics than available in HTML’s impoverished set of elements and attributes. In this case, the values used for the class attribute come from agreed-upon vocabularies, sometimes adopted from other standards, such as vCard, sometimes from newly minted vocabularies where no solid pre-existing standard exists (as is the case for hReview).