HTML5: Better, Still Not Best

By Dan Corrigan of Scribe

In October of last year, the World Wide Web Consortium gave final approval to HTML5, making it the new default standard for web-based content. From a publishing perspective, it’s a substantial improvement over its predecessor. It offers more possibilities for semantically tagging the structure of content with division elements like “section,” “article,” and “header.”

Because of this, HTML5 has recently come into vogue with some publishers as a practical option for digital publishing and for archiving texts. While it’s hard to dispute its dominance in the digital world as a portable electronic format, it still presents publishers with many of the same problems as its previous incarnations. Tags can be used inconsistently, and print and web requirements are not always compatible. As a result, publishers cannot rely on HTML5 code to describe a book’s structure the same way from one file to the next. These problems are more easily addressed by an XML tailored to book publishing and readily portable to both print and digital formats.

HTML5 has more semantic elements than its predecessor, but still not enough for publishers to implement it without ambiguity. Take head levels as an example. Suppose we have a chapter from a book about Sherlock Holmes with the following content:

	Chapter One (The chapter number)
	Holmes of Baker Street (The chapter title)
	... (Some content)
	The Significance of 221B (A first-level section head)
	... (More content)
	Watson and Holmes on Moving Day (A second-level section head)
	... (More content)
	The Problem of How Much to Tip the Movers (Another second-level section head)
	... (More content)

In HTML5, there are more than a few ways to go about tagging this content, all of them equally valid. For example, we could wind up with this HTML snippet:

<section>
  <header>
    <h1>Chapter One</h1>
    <h1>Holmes of Baker Street</h1>
  </header>
  <p>... (Some content)</p>
  <section>
    <h2>The Significance of 221B</h2>
    <p>... (More content)</p>
    <section>
      <h3>Watson and Holmes on Moving Day</h3>
      <p>... (More content)</p>
    </section>
    <section>
      <h3>The Problem of How Much to Tip the Movers</h3>
      <p>... (More content)</p>
    </section>
  </section>
</section>

This is one of the more straightforward ways to tag the content as HTML5, but it’s far from perfect. Questions begin to arise immediately, such as whether both the chapter number and chapter title should really be “h1.” We could place them in the same “h1” element and separate them with a line break:

<h1>Chapter One<br/>Holmes of Baker Street</h1>

However, we then lose the ability to easily distinguish between them for styling purposes if one part needs to render differently from the other. If we leave the chapter number as “h1” and make the chapter title “h2,” do all the other heads need to shift down accordingly? If so, we quickly run into a situation where no two books will call the same parts of the book by the same names, which is a nightmare for maintaining style sheets and transforming content.
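
To make the shifting concrete, here is a sketch of the same chapter with the chapter number as “h1” and the chapter title as “h2.” It is only one possible tagging, shown for comparison with the earlier snippet rather than as a recommendation:

```html
<section>
  <header>
    <h1>Chapter One</h1>
    <h2>Holmes of Baker Street</h2>
  </header>
  <p>... (Some content)</p>
  <section>
    <!-- First-level section head: was h2 in the earlier snippet -->
    <h3>The Significance of 221B</h3>
    <p>... (More content)</p>
    <section>
      <!-- Second-level section head: was h3 in the earlier snippet -->
      <h4>Watson and Holmes on Moving Day</h4>
      <p>... (More content)</p>
    </section>
    <section>
      <h4>The Problem of How Much to Tip the Movers</h4>
      <p>... (More content)</p>
    </section>
  </section>
</section>
```

A style sheet written for the earlier snippet, where “h2” meant a first-level section head, would now apply that styling to the chapter title instead.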

Someone who is both knowledgeable about HTML and alert to other possibilities might argue that we could add a class to the heads (<h1 class="chapter_num">). Adding a “class” attribute allows for a more granular definition of elements, like the “chapter_num” value given to the class attribute in the snippet. Unlike HTML element names (h1, h2, p, etc.), class values are not defined by the HTML5 specification and are entirely up to the user. The problem with this is that our content comes to depend on a set of rules defined outside HTML5 or any other standard (not to mention the fact that these attributes do not carry to and from InDesign).
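
As a rough illustration of how such outside rules can diverge, the sketch below tags the same two heads under two naming conventions. Only “chapter_num” comes from the text above; “chapter_title,” “ch-number,” and “ch-title” are hypothetical names invented for this example:

```html
<!-- Convention A: uses the class value mentioned in the text;
     "chapter_title" is a hypothetical companion name. -->
<header>
  <h1 class="chapter_num">Chapter One</h1>
  <h1 class="chapter_title">Holmes of Baker Street</h1>
</header>

<!-- Convention B: a hypothetical alternative that is just as valid
     as far as the HTML5 specification is concerned. -->
<header>
  <h1 class="ch-number">Chapter One</h1>
  <h1 class="ch-title">Holmes of Baker Street</h1>
</header>
```

Both versions are valid HTML5, yet a style sheet or transformation written against one convention will not match the other without being rewritten.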

This is just one specific example of the problems publishers will encounter if they try to apply HTML5 as a storage format for all their content. HTML5 is flexible by design, and this makes it difficult to apply rigorously to structured content such as books. The structure of any two books will have more in common than not, but if two employees at the same publishing company (or two different publishers who later decide to partner) decide to signify the structure of a book differently using HTML5, that presents huge problems in both developing the content for common platforms and reusing the content for future projects.

A far better solution is to call elements what they actually are and use a well-defined XML that allows us to do this. A chapter number is always a chapter number, no matter where it falls in a book’s content, and should be named the same thing across projects. This keeps a publisher’s archival files meaningful, manageable, and portable. A well-designed XML allows books to be portable to HTML5 for increasingly diverse digital products (ePub3, mobile platforms) but retains the content in a format that’s equally portable to print formats (InDesign, Quark, and PDF).

The very flexibility that makes HTML5 a boon to the web harms publishers by injecting ambiguity into the structure of their book products. For publishers, HTML5 should be yet another format to be produced, but not an end in itself.