File Format Essentials

Published Mon, Aug 20, 2012

Scribe’s blog/e-mail blast is intended to help provide information and best practices for publishers. When we sent out a request to members of the community, Craig Rairdin immediately responded. Craig is the president of Laridian, which produces Bible study software. While his focus is primarily on biblical materials, the lessons that he espouses apply to all of us.

At Laridian, we create Bible study software for mobile and desktop computing platforms. Our products include PocketBible, MyBible, Theophilos, and several other readers on iPhone, iPad, Android, Blackberry, Windows, and other platforms.

We work with all the major Christian publishers, licensing their Bibles and reference books for inclusion in our Bible software. Between our time here at Laridian and our prior work at Parsons Technology back in the ’90s, we’ve been doing this for twenty-four years. During that time we’ve seen significant changes in the technological sophistication of publishers and in their attitudes toward electronic publishing. Most are for the better, but there is room for improvement.

One area of improvement has been in our ability to get electronic files of most of the titles we publish. Back in the last century, we often found ourselves scanning paper books and using optical character recognition (OCR) to produce files that we could edit and process for use in our software. This seldom is the case today. Most publishers have computerized the editorial and publishing process to the point where we can count on getting some form of electronic files for processing.

It is rare, however, to get files in a format that doesn’t require hours (often weeks) of work to make useful. With publishers demanding a higher percentage of our revenues in royalties, we sometimes have to turn away opportunities because the cost of processing files presented to us in PDF or some other less-than-ideal format is simply too high. (We turned away content from a well-known author and major publisher just last week due to the combination of high processing cost and exorbitant royalty demands.)

Why Should You Care about My Problems?

So what if e-book publishers have to spend days or weeks converting your files into their formats, right? You might think it’s not your problem, but I would argue that it is. The problems that we have today converting your files into our “format” will be your problem tomorrow when you need to repurpose your content for another medium (or simply reissue it in print). The archival format you choose directly impacts your costs in the future and therefore the practicality of making use of your content beyond its original purpose. If we have trouble today, you’ll have trouble tomorrow.

In this article I want to address just a few of the problems we run into with your files.

Portable Document Format (PDF)

Adobe PDF is a popular archival format among publishers. It has the advantage of being supported by a wide variety of word processing and page layout programs on all computing platforms. For us, however, it offers only marginal value over simply scanning the printed book or sending it out to be typed by hand.

Think of PDF as “virtual paper.” PDF represents your document as it should appear when printed. It grew out of PostScript, which was an early printer language. Any file format that is tightly bound to a particular medium is not ideal for archiving and repurposing.

It is often the case with PDF that portions of the text will be stored as an image rather than as characters. To reconstruct the text of the book, the software we use to extract the text must apply OCR technology to convert what is essentially a picture of your book into actual characters that the computer can understand. This introduces misspellings and can also result in blocks of text appearing out of order.

Since paragraph justification and hyphenation at the end of a line are characteristics of printed text, we often have to reconstruct words that have been hyphenated in the PDF file. This can be complicated when hyphenation occurs at the end of a page or column, or when the lines of columnized text appear out of order. Again, this introduces errors into the text.
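The hyphenation repair described above can be sketched in a few lines of Python. This is only an illustration, not the tool we actually use; the function name and the sample lines are made up, and it assumes the extracted text arrives as a list of lines in the correct reading order (which, as noted, is not always true of PDF).

```python
import re

def join_hyphenated_lines(lines):
    """Rejoin words split by end-of-line hyphens in text extracted from a PDF."""
    text = "\n".join(line.rstrip() for line in lines)
    # A lowercase letter, a hyphen, a line break, then another lowercase
    # letter is assumed to be a word that was broken for justification.
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # The remaining line breaks are layout only, so they become spaces.
    return text.replace("\n", " ")

sample = ["Jesus instructs Nico-", "demus on the neces-", "sity of a new birth."]
print(join_hyphenated_lines(sample))
```

Note that the same rule turns "well-" followed by "known" into "wellknown". A simple heuristic cannot tell a line-break hyphen from a real one, which is exactly how errors creep into text recovered from PDF.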

Some of the other problems described in the following section also apply to PDF. It is a convenient format for transmitting documents between computing platforms (Windows, Mac, Linux, and mobile platforms) for the purpose of printing. It is not suitable as a source format for other products, or as an archival format for a publisher.

InDesign and Other Proprietary App Formats

We often have publishers provide files to us in a format that is specific to their word processor or page layout program. In some cases we can open these files in the same app and ask the app to save the file in a more usable format, like HTML. However, it’s rare to find a publisher who is running the latest version of those apps. If we don’t own the app and have to purchase it, and the current version won’t read files created by older versions, we may not be able to read the file at all. If we can’t read it today, you’re not going to be able to read it later when you need it. And sometimes the app simply doesn’t have the ability to save the file in any more readable format.

Some systems in use by publishers are proprietary or very expensive and we simply don’t have access to them. While this may not be a problem for you (since you already own the system, obviously), as time goes on and your needs change and you switch to a better software solution for your editorial and publishing needs, you could be saddled with an expensive conversion process or simply not be able to access your archived books.

Symbols, Foreign Language, or Phonetic/Transliterated Fonts

It is very common for books to make use of symbols, foreign languages, or transliteration schemes. For Bible publishers, Greek and Hebrew are examples of this. Back in the twentieth century you would purchase a particular Greek or Hebrew font and all your books would make use of that font. Your word processor would map your English keyboard to Greek or Hebrew characters. The mapping from English letters to Greek and English letters to Hebrew quickly became natural to your typists. But they’ve all died or moved on to other jobs, and now you have a bunch of files with random sequences of letters and punctuation and nobody remembers what font they were using or what keys they pressed to get each character and diacritical.

Today we use Unicode to represent non-English (and English) characters in a way that is independent of the font you use to display those characters. It is very (very!) common for us to get files, regardless of format, that contain symbols, foreign languages, or transliteration schemes that we simply cannot convert to the appropriate characters. Our only option is to examine every word in the printed edition of the book and reverse engineer the mapping. This is a time-consuming process for us and will be the same for you in the future.
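Once the mapping has been reverse engineered, the conversion itself is mechanical. The Python sketch below assumes an imaginary legacy Greek font; the keystroke table is hypothetical and does not describe any real product. In practice the table has to be rebuilt for each font by comparing the file against the printed page.

```python
import unicodedata

# HYPOTHETICAL keystroke table for an imaginary legacy Greek font.
# A real table must be reconstructed for each font actually used.
LEGACY_TO_UNICODE = {
    "l": "\u03bb",  # lambda
    "o": "\u03bf",  # omicron
    "g": "\u03b3",  # gamma
    "s": "\u03c3",  # sigma
    "j": "\u03c2",  # final sigma
    "/": "\u0301",  # combining acute accent, typed after the vowel
}

def legacy_to_unicode(text, table=LEGACY_TO_UNICODE):
    """Convert legacy-font keystrokes to Unicode, failing on unknown keys."""
    converted = []
    for ch in text:
        if ch not in table:
            # Refuse to guess: an unmapped key means the table is incomplete.
            raise KeyError(f"no mapping for {ch!r}")
        converted.append(table[ch])
    # NFC merges each vowel with the combining accent that follows it.
    return unicodedata.normalize("NFC", "".join(converted))

print(legacy_to_unicode("lo/goj"))
```

With this made-up table, the meaningless string "lo/goj" becomes the Greek word for "word". Without the table, it stays a random sequence of letters and punctuation.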

Inadequate Tagging

When you print a book, you depend on a human brain to read and understand it. The human brain is much more sophisticated than any computer program and is very tolerant of imprecision and vague meanings. If a commentary has an italicized “16” at the start of a paragraph, the brain can immediately recognize that as a verse number, then recall that it just started reading about John chapter 3, and conclude that this paragraph must be the commentary on John 3:16. The computer is less forgiving. An italicized “16” is just an italicized “16.” Nothing about it describes the function of the “16.” All it sees is a “1” and a “6.”

The human reader can then look over at his open Bible and scan down the page to John 3:16 and read that verse to put the commentary in context. The computer, having seen a “1” and a “6,” has no reason to conclude that this paragraph has anything to do with any other book anywhere in your library.

To resolve this problem for the computer, one of the main things we do is insert tags that describe the function of various bits of your text. For example, in PocketBible we would insert <pb_sync type="verse" value="John 3:16" /> right before your italicized "16." This tells our PocketBible program that this bit of text should be synchronized to John 3:16 in the Bible.
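The tagging step above can be sketched roughly as follows. This is a simplified illustration and not our production code. It assumes the commentary arrives as a list of HTML-like paragraphs, that a verse number is an italicized number at the very start of a paragraph, and that the book and chapter are already known from context. The function name is hypothetical; only the pb_sync tag itself comes from PocketBible.

```python
import re

# Assumed input convention: a verse number is an italicized number
# at the very start of a paragraph, e.g. "<p><i>16</i> ...".
VERSE_AT_START = re.compile(r"^<p><i>(\d+)</i>")

def add_sync_tags(paragraphs, book, chapter):
    """Insert a pb_sync tag before each italicized verse number."""
    tagged = []
    for para in paragraphs:
        match = VERSE_AT_START.match(para)
        if match:
            reference = f"{book} {chapter}:{match.group(1)}"
            sync = f'<pb_sync type="verse" value="{reference}" />'
            # Place the tag right before the italicized number.
            para = "<p>" + sync + para[len("<p>"):]
        tagged.append(para)
    return tagged

commentary = [
    "<p><i>16</i> This verse summarizes the message of the chapter.</p>",
    "<p>A paragraph with no verse number is left alone.</p>",
]
for line in add_sync_tags(commentary, "John", 3):
    print(line)
```

The hard part is not inserting the tag. It is knowing that the italicized number is a verse number and which chapter it belongs to, and that knowledge has to come from a person or from tags already in the file.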

You might think this doesn’t apply to you. As long as you have an italicized “16” printed on a page, you’re happy. But you don’t know how you might want to make use of this content in the future. Imagine a web-based application that searched a vast library and coalesced all the data either for a researcher writing another book or for an end user. Your italicized “16” would be useless in that context. Or imagine wanting to produce a book for the blind that used a synthesized voice to read the text to a user. You might want to say “commentary on John 3:16” at this point in the text instead of just tacking “sixteen” onto the first sentence of the paragraph. Unless you know the function of the “16” it really isn’t doing you any good at all.

Similar problems occur throughout your books. We might be able to know that the text at the top of a page is 24-point Times New Roman, but is it the title of the book? the title of the chapter? the title of this section of the chapter? Or is 24-point Times New Roman just the normal size for all the text in this book and there’s nothing special about the text at all? Unless the file contains tags that indicate the function of the text in the book, that text is useless.

Recommendations

At Laridian, we don’t care what format the files you provide us are in as long as it’s Unicode (so we don’t have font mapping problems) and tagged for functionality rather than physical appearance (i.e., <h1>Chapter One</h1> is better than an image of the text "Chapter One" with a really fancy "C" and "O" embedded in a PDF file). A good test for proper tagging is that your content shouldn't be styled in a way that is unique to any particular word processor or software package.
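As a rough first check on the Unicode half of that requirement, a publisher could run something like the Python sketch below over a file before archiving it. The function name is hypothetical, and the check rests on two assumptions of mine: that the file is meant to be UTF-8, and that characters in Unicode's private-use range are a warning sign of a symbol font whose meaning lives in the font rather than in the text. Passing the check does not prove a file is free of font-mapping problems, since legacy-font text made of ordinary letters looks perfectly valid.

```python
import unicodedata

def unicode_problems(raw_bytes):
    """Return a list of warnings about a file's text encoding."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError as err:
        return [f"not valid UTF-8 at byte {err.start}"]
    problems = []
    for index, ch in enumerate(text):
        # Category "Co" is private use: the meaning depends on a font.
        if unicodedata.category(ch) == "Co":
            problems.append(
                f"private-use character U+{ord(ch):04X} at offset {index}"
            )
    return problems

print(unicode_problems("Chapter One".encode("utf-8")))
```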

A large number of tags or a complex tagging system is far less daunting to us than a file that is untagged or tagged with only one medium in mind. We got a file the other day where every single word was wrapped with a tag. That’s crazy, but it’s fine. It beats the Bible file we got from one publisher where the footnotes and verse numbers were in line with and indistinguishable from the text (“1 1 Jesus instructs Nicodemus on the necessity of a new birth from above. Now there was a Pharisee named Nicodemus, a ruler of the Jews. 2 A ruler of the Jews: most likely a member of the Jewish council, the Sanhedrin. 2 He came to Jesus at night…”).

I would argue that our problems are your problems. It’s painful to see otherwise successful and competent publishers who have archived their most valuable resources—their content—in a format that makes it useless to them in the future. And useless to us in the present.