Bad ePub? Blame the PDF!

By David Alan Rech of Scribe Inc.

In a recent Publishers Weekly article, Craig Teicher highlighted some of the problems with ePub development.* He focused on vendors and on other problems that have arisen along the way.

No doubt, some of the problems seen in ePubs are caused by vendors. All of us make mistakes. Some vendors are sloppy and inconsistent, and these faults add problems. Sometimes their employees are not properly trained, do not properly examine their work, or just don't understand how to craft an e-book. Frequently, vendors will treat all publications the same, no matter how they are structured (for more on this, please read our blog on imprint identity). But a great many ePub problems can be traced to the activities of publishers.

Over the next few weeks, we will highlight a number of the common errors that we see in the creation of e-books and offer methods to avoid them. You can follow the entire discussion on our blog. We welcome your comments and questions. We hope that our discussion will help lead to improved e-book development.

Now to the PDF

PDFs are great at representing the visual page, and those interested in preserving page layout rely heavily on them. But PDF files do not contain the information needed to create an ePub. In order to reflect the printed page, PDFs organize content page by page in a linear fashion and carry extra formatting information that causes problems in ePubs. Examples include unintelligible font information, carriage returns in bad places, incorrect hyphenation, and the loss or addition of symbols. To make matters worse, the algorithm that extracts the text reads strictly from top to bottom through a page. Thus, material that is held in sidebars, multiple columns, and other detached boxes can be reorganized, lost, misplaced, or pushed incorrectly into other text. To see this for yourself, try selecting a page of text in a PDF and copying it into a Word or Notepad file (make sure you reveal hidden characters).
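To make the reading-order problem concrete, here is a minimal sketch in Python. It is not how any particular converter works; the Fragment type, the coordinates, and the sample text are all hypothetical. It simply shows what happens when pieces of text are ordered by their vertical position on the page, with no knowledge that some of them belong to a sidebar.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text with its position on the page (hypothetical units)."""
    x: float  # distance from the left edge of the page
    y: float  # distance from the top of the page
    text: str


def read_top_to_bottom(fragments):
    """Order fragments by vertical position, then left to right on each line."""
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in ordered)


# Two lines of main text on the left, a two-line sidebar on the right.
page = [
    Fragment(72, 100, "The main text begins here"),
    Fragment(72, 114, "and continues on this line."),
    Fragment(400, 100, "SIDEBAR: A note"),
    Fragment(400, 114, "about the author."),
]

print(read_top_to_bottom(page))
# The main text begins here SIDEBAR: A note and continues on this line. about the author.
```

The sidebar ends up woven into the middle of the sentence, which is the kind of misplaced text described above.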

In order to work around some of these problems, vendors will often perform OCR on PDF pages, because OCR software can separate the distinct elements on a page. But OCR creates problems of its own. It is not perfect, so it introduces errors. Often those errors are hard to detect, because a single-letter change can produce a valid but wrong word (e.g., the confusion of case and ease). Hyphenation is often lost, as is content.
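The following Python sketch illustrates why such errors slip through. It assumes a simple dictionary-based spell check, which is only one of the ways a vendor might proof OCR output; the word list and the sample sentences are made up for the illustration.

```python
import re

# Hypothetical, tiny word list standing in for a real dictionary.
KNOWN_WORDS = {"the", "court", "closed", "case", "ease", "with"}


def flag_unknown_words(text):
    """Return the words in the text that are not in the word list."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in KNOWN_WORDS]


ocr_bad_glyph = "Tbe court closed the case with ease"    # "h" misread as "b"
ocr_valid_word = "The court closed the ease with ease"   # "c" misread as "e"

print(flag_unknown_words(ocr_bad_glyph))   # ['tbe']
print(flag_unknown_words(ocr_valid_word))  # []
```

The first error is caught because "tbe" is not a word. The second passes unnoticed because "ease" is a perfectly good word that happens to be in the wrong place, so only a careful reading against the original will find it.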

Recently, plug-ins have been developed to "automatically" generate ePubs from the mechanics files. But this also results in problems (this will be the topic of another blog article).

The correct way to create an ePub is to work from the source in the QuarkXPress or InDesign files. Often, however, this can be expensive. If books are not built with a consistent use of styles, then conversion into ePubs can be time-consuming and therefore expensive. And this type of work requires expertise that is not often part of vendors' skill sets.

However, it is possible to extract XML, or XML-like, text out of Quark and InDesign. And if the files are set up in a consistent fashion through the use of style sheets, they can be easily converted into accurate ePubs. By working in the correct manner, publishers can avoid the problems associated with ePubs and keep the costs down.
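As a rough illustration of why consistent style sheets matter, here is a Python sketch that maps paragraph styles to XHTML elements. The XML structure, the element and attribute names, and the style names are all hypothetical; actual exports from Quark and InDesign look different and vary with how the files were built. The sketch uses only Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from paragraph style names to XHTML elements.
STYLE_TO_TAG = {
    "Chapter Title": "h1",
    "Body Text": "p",
    "Extract": "blockquote",
}


def to_xhtml(exported_xml):
    """Convert styled paragraphs to XHTML and report any styles with no mapping."""
    root = ET.fromstring(exported_xml)
    body = ET.Element("body")
    unmapped = []
    for para in root.iter("para"):
        style = para.get("style")
        tag = STYLE_TO_TAG.get(style)
        if tag is None:
            # An unknown style needs a human decision; fall back to a plain paragraph.
            unmapped.append(style)
            tag = "p"
        element = ET.SubElement(body, tag)
        element.text = para.text
    return ET.tostring(body, encoding="unicode"), unmapped


sample = """<story>
  <para style="Chapter Title">Now to the PDF</para>
  <para style="Body Text">PDFs are great at representing the visual page.</para>
  <para style="Normal+bold 14pt">A heading styled by hand</para>
</story>"""

xhtml, unmapped = to_xhtml(sample)
print(xhtml)
print(unmapped)  # ['Normal+bold 14pt']
```

The two paragraphs with named styles convert without any intervention. The paragraph that was formatted by hand cannot be mapped automatically, and every such exception adds time and cost to the conversion.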

PDF files are an end product and should not be the source for any other product. When we create well-formed documents, we can easily avoid the problems caused by PDFs and publish effectively in multiple formats.

*Craig Teicher’s article “Why Some E-books Just Don’t Look Right” appeared in Publishers Weekly on 18 October 2010.