This Is XML

By David Alan Rech of Scribe Inc.

Recently, an editor asked me to answer a number of questions she had about XML. She had been taking scrupulous notes during several meetings and presented me with a long list of conflicting information. She was baffled and frustrated, and she didn't know what to do.

Her confusion was completely understandable, as there is a lot of information (and misinformation) about XML. Those of us who specialize in markup languages have an interest in maintaining that confusion because we profit from it. No doubt, there is a lot to know about XML, its possibilities, and proper implementation. But XML is not difficult.

Please examine the following image.

This is a screenshot from a Word document. This, exactly as you see it, is XML. How can that be? Where are the angle brackets, the confusing codes, and so on? The answer is that it is a well-formed document.

XML, as defined by the World Wide Web Consortium (W3C), requires that your content be held as well-formed documents. According to the Wikipedia definition (http://en.wikipedia.org/wiki/Well-formed_document), a well-formed document is one in which the content is defined, each element is delimited with a beginning and an end tag, and the nesting rules are followed. The document above accomplishes all of those things. It is a well-formed document. To be fair, there are a few additional rules, some technology, and a little computer programming to get you from here to a typeset book or ePub file. But the basic principle will get you delimited text that meets the basic requirements of XML.
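To make the two rules concrete (every element has a beginning and an end tag, and elements nest rather than overlap), here is a minimal sketch in Python. It assumes the content is already held in angle-bracket form and uses the standard library's parser; the sample element names (chapter, title, para) are invented for illustration and do not come from the screenshot.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Every element is opened, closed, and properly nested.
good = "<chapter><title>This Is XML</title><para>XML is not difficult.</para></chapter>"

# <chapter> is closed before <title>, so the elements overlap instead of nesting.
bad_nesting = "<chapter><title>This Is XML</chapter></title>"

# <para> has a beginning tag but no end tag.
missing_end = "<chapter><para>XML is not difficult.</chapter>"

print(is_well_formed(good))         # True
print(is_well_formed(bad_nesting))  # False
print(is_well_formed(missing_end))  # False
```

The parser does not care what the elements are called. It checks only that the delimiters are complete and correctly nested, which is all that well-formedness asks for.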

You may notice that the document does not contain angle-bracket-delimited material. While the use of left and right angle brackets has become the convention, XML does not require those codes. Technically, content can be delimited with any distinctive character string. The requirement is to delimit content so that the algorithmic processes of a computer can interpret it, which means the markers must be parseable.
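As a rough illustration of what parsing the markers can involve, the sketch below assumes a hypothetical double-brace marker syntax ({{name}} to open an element, {{/name}} to close it). That syntax, the function name, and the sample text are all assumptions made for this example. The code rewrites the markers into angle-bracket form and then hands the result to a standard parser to confirm that the nesting holds.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical marker syntax: {{name}} opens an element, {{/name}} closes it.
MARKER = re.compile(r"\{\{(/?)([A-Za-z][\w-]*)\}\}")


def markers_to_xml(text: str) -> str:
    """Rewrite double-brace markers as angle-bracket tags."""
    out = []
    pos = 0
    for match in MARKER.finditer(text):
        # Escape &, <, and > in the text that sits between markers.
        out.append(escape(text[pos:match.start()]))
        slash, name = match.groups()
        out.append(f"<{slash}{name}>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


source = (
    "{{chapter}}{{title}}This Is XML{{/title}}"
    "{{para}}Tags & rules{{/para}}{{/chapter}}"
)

xml_text = markers_to_xml(source)
print(xml_text)

# Parsing succeeds only if the rewritten markers are complete and nested.
root = ET.fromstring(xml_text)
print([child.tag for child in root])
```

The conversion is mechanical because the markers are distinctive and consistent. If they were ambiguous or unbalanced, the final parsing step would raise an error instead of returning a tree.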

If you look at the document, it is not confusing. It was created using Word styles. If you look at the code, you can probably understand it. In fact, it is probably similar to the work that is done by your editorial staff or typesetters. They may use other codes or styles in Word, InDesign, or QuarkXPress, but in every case they are delimiting content according to a set of rules and in proper relation to each other (i.e., nested). They may not know it, but they are very close to creating well-formed documents.
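The step from styled paragraphs to tagged text can be sketched as follows. This is only an illustration under stated assumptions: the style names, the mapping to element names, and the sample paragraphs are hypothetical, and the (style, text) pairs are assumed to have been extracted from the Word file already, since reading the file itself would need a separate tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from paragraph style names to element names.
STYLE_TO_TAG = {
    "Chapter Title": "title",
    "Body Text": "para",
    "List Item": "item",
}


def paragraphs_to_xml(paragraphs):
    """Build a <chapter> element from (style name, text) pairs."""
    chapter = ET.Element("chapter")
    current_list = None
    for style, text in paragraphs:
        tag = STYLE_TO_TAG[style]  # raises KeyError for an unmapped style
        if tag == "item":
            # Consecutive list items are nested inside one <list> parent.
            if current_list is None:
                current_list = ET.SubElement(chapter, "list")
            parent = current_list
        else:
            current_list = None
            parent = chapter
        ET.SubElement(parent, tag).text = text
    return chapter


styled = [
    ("Chapter Title", "This Is XML"),
    ("Body Text", "XML is not difficult."),
    ("List Item", "Define the content"),
    ("List Item", "Delimit each element"),
    ("Body Text", "Follow the nesting rules."),
]

chapter = paragraphs_to_xml(styled)
print(ET.tostring(chapter, encoding="unicode"))
```

Because every paragraph carries exactly one style and the styles follow a fixed set of rules, the output is well formed by construction. A paragraph with an unmapped style stops the conversion, which is one way of enforcing that the styles were applied consistently.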

At the end of the process, you might wish to hold your content in angle-bracket-delimited form. Preferably that form would be consistent with all the content in the corpus of your published works. It would be exceedingly useful if the material were tagged in such a way as to maintain the full complement of your content (e.g., all of the elements necessary for print). But that would not be an XML requirement. That is a business requirement, and no doubt an extremely important one.

This is the point. XML should not be confusing, because it is so close to the way publishing functions already. Well-formed documents are a necessary and natural output of publishing. Most of us work in a way that is closely compatible with the requirements of well-formed documents. With a little modification of our behavior, we can easily create XML.

When properly understood, the application of XML is actually a normal part of the editorial process. According to page 5 in Amy Einsohn's The Copyeditor's Handbook (Second Edition), "The heart of copyediting consists of making a manuscript conform to an editorial style (also called house style). Editorial style includes…treatment of special elements (headings, lists, tables, charts, and graphs)." In other words, in addition to managing grammar and the like, copyeditors are delimiting content according to a set of predefined rules. They are creating well-formed documents. And when working properly in an electronic environment, they can easily perform their work in a way that results in tagged text.

When you understand well-formed documents, you understand XML. Working correctly, editors can efficiently, without interruption to the normal process, produce well-formed documents. They can easily produce XML. XML should not be confusing. It's a simple concept. Just think well-formed document.