HTML5 and Semantics
Author: José Manuel Cantera Fonseca, Telefónica I+D
HTML4, the language of the Web, is intended to define the content of a web page from a structural and presentational point of view, but not from a semantic one. For instance, in HTML4 a <table> element can be used to present information about different kinds of entities, such as cars or hotels. However, there is no machine-readable format for indicating unambiguously the semantics of the information presented. In addition, HTML4 relies heavily on the semantically neutral <div> tag, which is used to group content.
HTML5 is the next major version of HTML. Initially developed by the WHATWG, it is now in the process of being standardized by the W3C. One important step forward brought by HTML5 is the introduction of new elements that enable the creation of web pages with more explicit semantics, replacing many traditional uses of the semantically neutral <div> tag:
- <article> which represents a blog article, forum post or similar.
- <aside> which indicates that the content is tangential to the main content of the page, for example ads or similar.
- <nav> which indicates that such a block gives the main navigation points for a page.
- <figure> to represent figures on a page
- <header> to indicate header content
- <footer> to indicate footer content such as copyright information, additional links, etc.
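As a minimal sketch of how the elements listed above might be combined in one page, consider the markup below. The blog content, link targets and image file name are invented purely for illustration.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example blog</title>
</head>
<body>
  <!-- Header content, including the main navigation points of the page -->
  <header>
    <h1>Example blog</h1>
    <nav>
      <a href="/">Home</a>
      <a href="/archive">Archive</a>
    </nav>
  </header>

  <!-- A self-contained blog article with an accompanying figure -->
  <article>
    <h2>One last summer barbecue</h2>
    <p>I'm holding one last summer barbecue on September 16th at 4pm.</p>
    <figure>
      <img src="barbecue.jpg" alt="Guests around the grill at last year's barbecue">
    </figure>
  </article>

  <!-- Content that is tangential to the main content, such as ads -->
  <aside>
    <p>Sponsored link: garden furniture on sale.</p>
  </aside>

  <!-- Footer content such as copyright information and additional links -->
  <footer>
    <p>Copyright information and additional links go here.</p>
  </footer>
</body>
</html>
```

In HTML4 each of these blocks would typically have been a <div> distinguished only by a class or id chosen by the author.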
Although these elements are going to be extremely helpful for “syntactic” search engines and assistive technologies, they are still far from providing rich semantics and structured data. For instance, in HTML5 a block of HTML code can be marked up as an article, but there is no way to express what the article’s subject is. In fact, authors and publishers are demanding technologies to express such structured data and semantics more completely. Once tools can read this data, a new world of user functionality will become available, letting users transfer structured data between applications and web sites, and allowing browsing applications to improve the user experience:
- an event on a web page can be directly imported into a user’s desktop calendar
- a license on a document can be detected so that users can be informed of their rights automatically
- a photo’s creator, camera setting information, resolution, location and topic can be published as easily as the original photo itself, enabling structured search and sharing.
To address these emerging needs, there are two technologies that complement HTML5: HTML5 Microdata and RDFa. They are based on the same idea: using a series of extra attributes to provide machine-readable labels that make the semantics of the HTML content explicit. For example, using such extra attributes, the author of a page could annotate it to indicate that each row in a table represents a certain kind of entity (car, hotel, etc.).
HTML5 Microdata has been proposed by the WHATWG and can be considered the evolution of microformats. However, it has not yet been implemented by web browsers. HTML5 Microdata allows nested groups of name-value pairs to be added to documents, in parallel with the existing content.
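To illustrate the idea of nested name-value pairs, the following fragment sketches how a table of hotels might be annotated with the Microdata attributes itemscope, itemtype and itemprop. The vocabulary URLs under example.org, the property names and the hotel data are all hypothetical and chosen only for this illustration. The fragment is meant to be placed inside the <body> of a page.

```html
<table>
  <tr>
    <th>Hotel</th>
    <th>City</th>
  </tr>
  <!-- Each row is one item; itemtype names the (hypothetical) kind of entity -->
  <tr itemscope itemtype="http://example.org/vocab/Hotel">
    <td itemprop="name">Hotel Example</td>
    <!-- A nested item: the value of "address" is itself a group of name-value pairs -->
    <td itemprop="address" itemscope itemtype="http://example.org/vocab/Address">
      <span itemprop="locality">Madrid</span>
    </td>
  </tr>
  <tr itemscope itemtype="http://example.org/vocab/Hotel">
    <td itemprop="name">Sample Inn</td>
    <td itemprop="address" itemscope itemtype="http://example.org/vocab/Address">
      <span itemprop="locality">Valladolid</span>
    </td>
  </tr>
</table>
```

The visible table is unchanged for human readers, while a Microdata-aware tool could read each row as a Hotel item whose address property is a nested Address item.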
RDFa is a W3C Recommendation. It specifies only a syntax and relies on independent specification of terms (often called vocabularies or taxonomies) by others. RDFa allows terms from multiple independently-developed vocabularies to be freely intermixed and is designed such that the language can be parsed without knowledge of the specific term vocabulary being used.
In order to better understand how semantic annotations work, an RDFa example (using XHTML) is provided below. A page that describes an upcoming event has been annotated with the ical vocabulary by means of RDFa attributes. Note the use of the typeof and property attributes to convey the semantic annotations on top of the standard HTML elements, and of the content and datatype attributes to supply a machine-readable value for the start date.
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:cal="http://www.w3.org/2002/12/cal/ical#"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <head>
    <title>Jo's Friends and Family Blog</title>
  </head>
  <body>
    <!-- typeof declares that this paragraph describes a calendar event -->
    <p typeof="cal:Vevent">
      I'm holding <span property="cal:summary">one last summer Barbecue</span>,
      on <span property="cal:dtstart" content="2007-09-16T16:00:00-05:00"
               datatype="xsd:dateTime">September 16th at 4pm</span>.
    </p>
  </body>
</html>
Although RDFa and HTML5 Microdata are similar in spirit, there are contentious differences between them that have been thoroughly discussed. See, for example, the discussion on this blog: http://blog.whatwg.org/microdata1. Microdata supporters claim that RDFa is more complex and that its use of XML Namespaces makes it difficult to embed RDFa annotations in HTML5 documents (which are going to be namespace-free). On the other hand, RDFa supporters say that their technology is more powerful and is already being used successfully in industry (Yahoo, Google, the UK Government …). What is true is that both solutions have interesting points worth taking into account. For example, HTML5 Microdata defines a DOM API to extract the semantics from the elements.
Finally, after months of lengthy discussions, the W3C has decided to create a new Working Group on RDFa, which is chartered to “support the developing use of RDFa for embedding structured data in Web documents in general. The Working Group will publish W3C Recommendations to extend and enhance the currently published RDFa 1.0 documents, including an API”. This Working Group is meant to create a new version of RDFa (1.1) that is more flexible and easier to use, taking the best from HTML5 Microdata and RDFa 1.0.