Demand for OCR

By David Alan Rech of Scribe Inc.


When Scribe began in 1993, we were responsible for the conversion of books into electronic form. At the time, scanning, performing optical character recognition (OCR), and verifying text made sense. Many publications had never been in an electronic form, so the conversion process was necessary.

In 2001, we tried to stop training entry-level personnel in OCR and verification (Scribe’s method takes a linguistic approach to OCR verification and requires considerable training). We did so in the belief that demand for OCR would dwindle and that the accuracy of alternative methods would make our service redundant.

In 2012, years after the industry switched to developing publications with computers, demand for OCR is up. This is because publishers are now realizing the value of reissuing backlist titles. OCR is necessary when publishers do not have electronic source files for reissued publications. While some of these are legacy materials that have never been in electronic format, most titles we work on have been produced in recent years. Recent publications should never require OCR, because publishers produce content in electronic form from the start.

Any book produced since the 1990s has been in an electronic form at some stage of production. Thus a need for OCR is usually due to poor business practices. When publishers lack the electronic versions of their books, it is usually because they never possessed them or because they thought that holding dead-end technologies (e.g., PDF or ePub) would be sufficient for any future need. Publishers often have only a PDF, or they rely on vendors to archive materials, only to find that when they need those files, the vendors don’t have them. The sad thing is that this situation is completely avoidable.

Here’s how to avoid being caught without your content in electronic form:

  • Prioritize the possession of archival files. Put this on par with copyright protection. After all, your backlist is a huge asset, and possessing it in a useful form increases its value.
  • Never rely on a format that contains reduced information, is not electronic text, or does not facilitate easy conversion (i.e., a dead-end technology like PDF or ePub).
  • Archive all files on some type of accessible, current storage media.
  • Have every version of your books, both print and electronic, in your own storage system.
  • Never store files in a single place; redundancy is required.
  • Implement a method to store, log, and access files for your publications, including employing a file naming scheme.
  • Require vendors to supply all files to you, especially typesetting files.
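
As a rough illustration of the naming and redundancy points above, here is a minimal Python sketch. It is not Scribe's system. The naming pattern (ISBN, title, version, kind of file), the function names, and the sample values are all assumptions made up for this example, and any consistent scheme would serve. The sketch builds a predictable file name, then checks that the file exists in more than one storage location and that the copies are identical.

```python
import hashlib
import re
from pathlib import Path
from typing import Dict, List


def archive_name(isbn: str, title: str, version: str, kind: str, ext: str) -> str:
    """Build a predictable archive file name (hypothetical pattern)."""
    # Reduce the title to lowercase words joined by hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Keep only the ISBN digits (and a possible ISBN-10 check character X).
    digits = re.sub(r"[^0-9Xx]", "", isbn).upper()
    return f"{digits}_{slug}_{version}_{kind}.{ext.lstrip('.').lower()}"


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(name: str, locations: List[Path]) -> Dict[str, object]:
    """Report how many locations hold the file and whether the copies agree."""
    present = [loc / name for loc in locations if (loc / name).is_file()]
    checksums = {sha256_of(path) for path in present}
    return {
        "name": name,
        "copies": len(present),
        "redundant": len(present) >= 2,    # stored in more than one place
        "identical": len(checksums) <= 1,  # no copy has drifted or been corrupted
    }


if __name__ == "__main__":
    # Placeholder ISBN and folder names, for illustration only.
    name = archive_name("978-0-00-000000-2", "Demand for OCR", "v1", "typeset", ".INDD")
    print(name)
    print(check_copies(name, [Path("archive_a"), Path("archive_b")]))
```

A report along these lines could feed whatever log you keep for each title. The point is less the particular tool than the habit: every file has a name you can predict, and you can confirm at any time that more than one good copy exists.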

Publishers have known about electronic publications, and thus the need to store electronic data, since the 1990s. Many resisted migrating to an electronic workflow and storage system because they refused to imagine a world in which electronic data would be useful. When they grudgingly opened the door to this world, some insisted that PDF was sufficient, or that automated conversion software would have to be developed. Others relied on vendors that have since disappeared or that failed to store the materials themselves. None of us knows what the future will bring, but we do know that it is best to store the most robust, complete set of files in a safe, self-reliant manner. If nothing else, doing so will keep your options open, including the option to never need OCR.