Choosing an XML DTD

Published Thu, Nov 4, 2010

Because of the demand for electronic publications, XML is a hot topic. It seems that everyone is worried about which type of XML to use. There are discussions and meetings focusing on the various XML schemata. Publishers argue over what kind of XML coding they should follow. Frankly, loyalty to an XML schema and its DTD (document type definition) misses the point.

The only reason to select an existing DTD is convenience. Using an extant DTD means that you do not need to develop your own. If a preexisting DTD is well supported, there might be a variety of conversion tools available to you. However, selecting a published DTD does not completely relieve you of the need to perform a content analysis of your publications and reach a consensus on what each element is and how it relates to the other elements.

There are two important issues when approaching XML markup. First, you need to have an XML schema that accommodates all of your element types. Second, you need to apply it validly, correctly, and consistently. Please note that valid (tagged in a way that meets the rules of your DTD) and correct (applying the right tag to an element) are not the same thing. You can have a valid but incorrect tag. For example, you could mark a b-level head as an a-level head. It would validate, but it would still be incorrect.
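The difference between valid and correct can be sketched in a few lines of Python. This is only an illustration: the tag names (chapter, title, head-a, head-b, p) and the toy content model are hypothetical and are not taken from ScML or any published DTD, and the check is far simpler than real DTD validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model: which child tags each element may contain.
# A real DTD is far richer; this only mimics the idea of validity.
ALLOWED_CHILDREN = {
    "chapter": {"title", "head-a", "head-b", "p"},
    "title": set(),
    "head-a": set(),
    "head-b": set(),
    "p": set(),
}


def is_valid(xml_text: str) -> bool:
    """Return True if every element and its children fit the toy model."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.tag not in ALLOWED_CHILDREN:
            return False
        for child in element:
            if child.tag not in ALLOWED_CHILDREN[element.tag]:
                return False
    return True


# "Sources" is a subsection of "Methods", so it should be a b-level head.
correct = (
    "<chapter><title>One</title>"
    "<head-a>Methods</head-a><p>...</p>"
    "<head-b>Sources</head-b><p>...</p></chapter>"
)
# The same chapter with the b-level head wrongly tagged as an a-level head.
mistagged = correct.replace("head-b", "head-a")

print(is_valid(correct))    # True
print(is_valid(mistagged))  # True: it validates, but the tagging is wrong
```

No validator can catch the second case, because the mistake lies in the editorial judgment about what the element is, not in the structure of the markup.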

In publishing, a DTD needs to accommodate the elements contained within your publication in all of their permutations. Since XML is about structure and the relationship between types of content, you must not only be able to define an element but also be able to mark it according to its use within your publications. This is the area in which many DTDs fail.

Most DTDs are built by technical people and are intended to help operate within an electronic environment. The common schemata (e.g., DocBook, TEI, or OSIS) do an excellent job of marking content. But they do not account for all of the needs within a printed work. For example, these DTDs all have paragraph types. But in printing, there are various kinds of paragraphs (first, continued, those after heads, etc.). There are variations in lists, heads, poetry, block quotations, and other elements within publishing that are needed in order to accommodate the special treatment afforded to initial, medial, and final instances of elements. Technical people might argue that these variations are meaningless in the XML world, and to some extent they are correct. But that argument does not address the reality of the publishing business, where the relationship between content types is important. And taking the minimalist stance actually defeats the purpose of XML.

XML is supposed to allow for automation and easy conversion from one format to another. Scribe would also argue that when understood correctly, XML (as a well-formed document) should be the basis of the entire workflow. We believe that the advantages of XML should be derived throughout the publishing process and not be an afterthought or burdensome postproduction application (but this is a different topic altogether). If the supposed advantages of XML are to be derived, then any DTD should accommodate the minimal needs of computer technologies (where an a-level head is an a-level head is an a-level head) and print (where an a-level head can appear alone, following various elements like chapter titles, or be a part of a stack of heads).

Scribe’s markup language (ScML) does just that. We have a full DTD, which essentially has two forms of XML. We have the fully articulated XML, which accounts for all of the elements in printing. And we have the condensed XML, which resolves the permutations that are unnecessary in many electronic publications and databases. Moving from the condensed to the articulated form is a simple activity. We have built a set of conversion tools that allow a publisher to move between the two forms. When combined with our Well-Formed Document Workflow, this means that editorial staff can work with the minimal tag sets (they do this within Microsoft Word using Word’s styles, but this is another discussion). Computer technology can then be employed to articulate the styles so that text can be automatically flowed for typesetting (thus no need for the typesetter to add or manipulate styles). Following the typesetting, XML can be exported and converted into a variety of markup languages (other XML schemata, HTML, ePub, etc.). ScML isn't the only schema that allows for this. And using rendering features, you can achieve this within the common DTDs.
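To give a rough idea of what moving between a condensed and an articulated form could look like, here is a minimal Python sketch. It is not Scribe's conversion tooling and does not use ScML's actual tags: the element names, the "type" attribute, and its values (first, after-head, continued, after-title, stacked, alone) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

HEAD_TAGS = {"head-a", "head-b"}  # hypothetical condensed head tags


def articulate(chapter: ET.Element) -> None:
    """Add a position-based 'type' attribute to paragraphs and heads, in place."""
    previous = None  # tag of the preceding sibling, if any
    for element in chapter:
        if element.tag == "p":
            if previous is None or previous == "title":
                element.set("type", "first")
            elif previous in HEAD_TAGS:
                element.set("type", "after-head")
            else:
                element.set("type", "continued")
        elif element.tag in HEAD_TAGS:
            if previous == "title":
                element.set("type", "after-title")
            elif previous in HEAD_TAGS:
                element.set("type", "stacked")
            else:
                element.set("type", "alone")
        previous = element.tag


def condense(chapter: ET.Element) -> None:
    """Drop the positional 'type' attribute, returning to the minimal tags."""
    for element in chapter:
        element.attrib.pop("type", None)


condensed = (
    "<chapter><title>One</title>"
    "<head-a>Methods</head-a><head-b>Sources</head-b>"
    "<p>x</p><p>y</p></chapter>"
)
root = ET.fromstring(condensed)
articulate(root)
for element in root:
    print(element.tag, element.get("type"))
```

In this sketch the articulated form is derived entirely from the order of the elements, which is why editorial staff could work with the minimal tags and leave the articulation to software. Calling condense(root) afterward strips the added attributes again.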

The point is that when discussing XML, it is important that the flavor that you select can accommodate all of your needs and that it does not require you to perform work that is extraneous to your current workflow. Thus, when discussing XML schemata, it is important that you know your content in its entirety, that you understand the relationship among the various elements within your publications, and that you account for those in the markup. The advantages of XML are not gained merely by selecting an XML schema. No matter what XML schema you choose, you will need to thoroughly investigate the tags and reach a consensus on how to use them. The advantages of XML are gained when your entire enterprise is focused on your publications and what is inside of them. When you gain a common understanding of your content, and you treat similar types in a consistent fashion, then you will realize the advantages of XML and can intelligently discuss DTDs. Of course, when that happens, you will realize that loyalty to a particular flavor of XML is unnecessary.