OCR Verification

Use the following procedure to verify the accuracy of files received from an OCR vendor per Scribe’s OCR standards.

The goal of OCR verification is to determine if the Word file meets OCR standards and matches the source materials. Checks will be performed in both the point document (Microsoft Word) and Sublime Text.

If the file is being verified only to determine whether the vendor’s file is correct, it may be rejected at any point at which an unacceptable number of errors is found. If the file is being corrected during verification, the benchmark rate for OCR verification is 100,000 characters per hour (86,000 characters per hour for complex files, such as files with extensive Hebrew usage).
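The benchmark rates can be turned into a rough time estimate. The sketch below is illustrative only; the function name and the sample character count are assumptions, not part of the Scribe procedure.

```python
STANDARD_RATE = 100_000  # characters per hour, per the benchmark above
COMPLEX_RATE = 86_000    # characters per hour for complex files

def estimated_hours(char_count: int, complex_file: bool = False) -> float:
    """Return the benchmark time, in hours, for verifying and correcting a file."""
    rate = COMPLEX_RATE if complex_file else STANDARD_RATE
    return char_count / rate

# Hypothetical 450,000-character file
print(estimated_hours(450_000))                               # 4.5
print(round(estimated_hours(450_000, complex_file=True), 2))  # 5.23
```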

References/Prerequisites

Planning and Approach

Plan the Work

1. Assess

  • Confirm that all text is present and in the proper order.
  • Confirm that all files are named correctly.
  • Confirm that all images have been processed to the specified size and color settings.
  • Spot check major aspects such as paragraph integrity, character style rendering, smart quote usage, and table formatting. (Run the SAI’s Rendering tool in Word to apply colors to character styles.)

2. Plan

Review the aspects of the files that may be most prone to error.

3. Act

Carry out the planned actions.

Working

Maintain a list of all OCR errors found.

Maintain a list of all errors present in the source material.

Note: Depending on the project, some errors present in the source material may need to be corrected at a later stage and some may need to be maintained.

Procedure

Note: Searches may yield false positives, that is, matches that turn out to agree with the source materials.

Run SAI Tools for File Prep

Run the SAI’s Rendering and List Leaders tools in Word.

Spellcheck

Run a complete spellcheck in Microsoft Word.

Optimal settings for verification spellchecks are listed here.

Page IDs

Page IDs should be formatted as {~?~PG: @#@} (where “#” is the page number) and connected to text or an image callout; they should not be in their own paragraphs.
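As a possible aid, the Page ID rule can be checked on the text-only copy of the file. This is a minimal sketch that assumes one paragraph per line and that the page number is any run of characters other than “@”; the function name is hypothetical.

```python
import re

# Assumed pattern for {~?~PG: @#@}; "#" is taken to be any characters other than "@".
PAGE_ID = re.compile(r"\{~\?~PG: @([^@\n]+)@\}")

def standalone_page_ids(text: str) -> list[str]:
    """Return page numbers whose Page ID is the only content of its paragraph."""
    found = []
    for paragraph in text.split("\n"):
        match = PAGE_ID.fullmatch(paragraph.strip())
        if match:
            found.append(match.group(1))
    return found

sample = "{~?~PG: @12@}The chapter begins here.\n{~?~PG: @13@}\nMore text."
print(standalone_page_ids(sample))  # ['13']
```

Any page number returned still needs to be checked against the source before the Page ID is reattached to text or an image callout.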

Paragraph Integrity

Check the file to determine if any paragraphs have been improperly broken or combined.

Spot check Page IDs that occur at the beginning of paragraphs against the source.

Check that paragraph integrity has been maintained across pages.

Check that soft returns are not present.

Check that paragraphs have not been broken to insert content such as images, tables, or sidebars.

Character Rendering

Compare the Word document with the source to confirm that all character rendering (e.g., italics, small-caps, bold) has been applied.

Check for non-ScML styles that have been applied to characters to apply rendering.

Footnotes and Endnotes

Check that footnotes appear in the document using Microsoft Word’s footnotes feature; check that they match the source.

Check that endnotes appear in the document using Microsoft Word’s endnotes feature; check that they match the source.

Image Callouts

Check that image callouts and associated captions are placed at the nearest paragraph break after the content appears in the source.

Check that image callouts use the format {~?~IM: insert projectname-p#.jpg here.} where “#” is the page number on which the image appears.

Check that the callouts match the names of the image files.

Check that figure captions appear as text.
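The comparison of callouts to image file names could be automated along the following lines. This is a sketch under assumptions: the callout pattern is inferred from the format given above, the file names are supplied as a plain set, and the function name and sample names are hypothetical.

```python
import re

# Assumed pattern for {~?~IM: insert projectname-p#.jpg here.}
CALLOUT = re.compile(r"\{~\?~IM: insert (\S+?\.jpg) here\.\}")

def unmatched_callouts(text: str, image_files: set[str]) -> tuple[set[str], set[str]]:
    """Return (callouts with no image file, image files with no callout)."""
    called = set(CALLOUT.findall(text))
    return called - image_files, image_files - called

sample = (
    "Intro.\n{~?~IM: insert demo-p4.jpg here.}\nCaption.\n"
    "{~?~IM: insert demo-p9.jpg here.}"
)
missing_files, unused_files = unmatched_callouts(sample, {"demo-p4.jpg", "demo-p7.jpg"})
print(missing_files)  # {'demo-p9.jpg'}
print(unused_files)   # {'demo-p7.jpg'}
```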

OCR Verification Searches

Copy all content from the Word document into a text-only (.txt) file.

Perform the following searches in Sublime Text.

Enable regular expressions and match case.
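The searches in this section are written for Sublime Text. If the same checks are scripted, the patterns need small adjustments. The sketch below is an approximation only: Python’s re module has no \< and \> word-start and word-end anchors, so \b is substituted, and \x{…} escapes are rewritten as \uXXXX. The helper name is hypothetical.

```python
import re

def to_python_pattern(sublime_pattern: str) -> str:
    """Approximate a Sublime Text pattern in Python's re syntax."""
    # \< and \> have no direct equivalent in re; \b (any word boundary) is the closest.
    pattern = sublime_pattern.replace(r"\<", r"\b").replace(r"\>", r"\b")
    # \x{hh} or \x{hhh} becomes a four-digit \uXXXX escape.
    return re.sub(
        r"\\x\{([0-9A-Fa-f]{1,4})\}",
        lambda m: "\\u" + m.group(1).zfill(4),
        pattern,
    )

# The Thousands search from this procedure, translated and applied to sample text.
thousands = to_python_pattern(r"\<([0-9]{1,3},)[ \x{a0}]([0-9]{3})\>")
match = re.search(thousands, "about 1, 000 copies")
print(match.group(0))  # 1, 000
```

Python searches are case-sensitive by default, which corresponds to enabling match case in Sublime Text.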

Straight Quotes

Search for straight quotes.

["'`´]

Multiple Spaces and Paragraph Breaks

Search for multiple spaces or paragraph breaks in a row.

( ){2,}

(\x{a0}){2,}

(\n){2,}
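For a quick tally rather than a match-by-match review, the three searches above can be counted in one pass. This is a sketch in Python, with \xa0 standing in for \x{a0}; the dictionary labels and function name are hypothetical.

```python
import re

REPEAT_CHECKS = {
    "multiple spaces": re.compile(r"( ){2,}"),
    "multiple nonbreaking spaces": re.compile(r"(\xa0){2,}"),
    "multiple paragraph breaks": re.compile(r"(\n){2,}"),
}

def count_repeats(text: str) -> dict[str, int]:
    """Count how many runs each search finds in the text."""
    return {
        name: sum(1 for _ in pattern.finditer(text))
        for name, pattern in REPEAT_CHECKS.items()
    }

sample = "One  two.\n\nThree\xa0\xa0\xa0four."
print(count_repeats(sample))
```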

Symbols

Search for ANSI symbols and rare punctuation.

[~\`!/@#$%&\*=\+\|\{\}_\^]

Thousands

Search for thousands with a space after the comma (e.g., 1, 000).

\<([0-9]{1,3},)[ \x{a0}]([0-9]{3})\>

Adjacent Numerals and Letters

Search for adjacent numerals and letters.

([A-Za-z\x{c0}-\x{ff}\x{100}-\x{17f}])([0-9])

([0-9])([A-Za-z\x{c0}-\x{ff}\x{100}-\x{17f}])

Number/Letter Confusion

Search for potentially confused numbers and letters.

Numeral 1, Capital I, Lowercase l:

Spot check/Find: \<[1I]\>

Find: \<l\>

Find: \<(11|II|ll)\>

Find: \<(111|III|lll)\>

Numeral zero, Capital O, Lowercase o:

\<[0Oo]\>

\<(I[Oo]|l[Oo]|10)\>

Numeral 5, Capital S:

\<[5S]\>

\<(IS|15|ls)\>

Three of the Same Characters in a Row

Search for three of the same characters in a row (excluding I1l).

([^I1l])\1\1
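The backreference search above carries over to Python unchanged. A small sketch follows; the function name and sample sentence are made up for illustration.

```python
import re

# Same pattern as the search above: any character except I, 1, or l, repeated three times.
TRIPLE = re.compile(r"([^I1l])\1\1")

def triple_runs(text: str) -> list[str]:
    """Return each run of three identical characters found in the text."""
    return [m.group(0) for m in TRIPLE.finditer(text)]

sample = "The commmittee met in room 111 on the hillltop."
print(triple_runs(sample))  # ['mmm']
```

The runs “111” and “lll” are skipped here by design; they are covered by the number/letter confusion searches.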

Letter Case Issues

Adjacent Lowercase and Uppercase Letters

Search for lowercase before uppercase and multiple uppercase before lowercase letters.

([a-z\x{e0}-\x{ff}])([A-Z\x{c0}-\x{dd}])|([A-Z\x{c0}-\x{dd}]){2,}([a-z\x{e0}-\x{ff}])

Mac and O’ Names

Search for names starting with “Mac” or “Mc” and with “O’.”

Ma?c[a-z]

\<(O)(’)([a-z])

Single Letters, “st,” and Contractions

Search for an isolated capital letter. (Note: This search excludes common occurrences and elements addressed in previous searches.)

([^\.])(\<[B-HJ-NP-RT-Z\x{c0}-\x{dd}]\>)

Search for an isolated lowercase letter (other than “a,” “s,” and “t”).

\<[b-ru-z\x{e0}-\x{ff}]\>

Search for an isolated “s” or “t” that is not preceded by an apostrophe.

[^’]\<[st]\>

Search for contractions.

[’][^nst \x{a0}]

Articles “A” and “An”

Search for the article “a” followed by a vowel (or vowel sound) and the article “an” followed by a consonant.

\<[Aa][ \x{a0}]([AaEeFfHhIiLlMmNnOoRrSsXx])\>

\<[Aa]n[ \x{a0}]([BbCcDdGgJjKkPpQqTtUuVvWwYyZz])\>

\<[Aa][ \x{a0}]([AEIOUaeiouÀ-ÆÈ-ÏÒ-ÖØ-Ýà-æè-ïò-öø-üÿ])

\<[Aa]n[ \x{a0}]([B-DF-HJ-NP-TV-Zb-df-hj-np-tv-zÇÑçñ])

\<[Aa]n[ \x{a0}]([1-79]\>|1[02-79]|[2-79][0-9])

\<[Aa][ \x{a0}](1[18]\>|8)
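To illustrate how the article searches behave, including the false positives they produce, here is a sketch of two of them in Python. It is ASCII-only (the accented ranges are omitted for brevity) and uses \b in place of \<; the names are hypothetical.

```python
import re

# ASCII-only approximations of two of the searches above.
A_BEFORE_VOWEL = re.compile(r"\b[Aa][ \xa0]([AEIOUaeiou])")
AN_BEFORE_CONSONANT = re.compile(r"\b[Aa]n[ \xa0]([B-DF-HJ-NP-TV-Zb-df-hj-np-tv-z])")

def article_candidates(text: str) -> list[str]:
    """Return the matched text for each possible a/an error."""
    hits = []
    for pattern in (A_BEFORE_VOWEL, AN_BEFORE_CONSONANT):
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits

sample = "She ate a apple and an banana in an hour."
print(article_candidates(sample))  # ['a a', 'an b', 'an h']
```

The last hit (“an hour”) is correct English, so each result still has to be judged against the source.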

Missing Spaces Near Punctuation

Search for missing spaces between punctuation and letters or numerals.

The standard search will find Bible references and decimals; if numerous instances are present, use the second search set.

Standard missing space searches:

([\.,:;\?!”’\)\]\}>])([0-9A-Za-z\x{c0}-\x{ff}\x{100}-\x{17f}])

([0-9A-Za-z\x{c0}-\x{ff}\x{100}-\x{17f}])([\[\(\{<“‘])

\<([0-9]+)(,)([0-9]{1,2}\>|[1-9][0-9]{2}\>|[0-9]{4,})

If numerous Bible references and decimals:

([\.,:;\?!”’\)\]\}>])([A-Za-z\x{c0}-\x{ff}\x{100}-\x{17f}])

([0-9A-Za-z\x{c0}-\x{ff}\x{100}-\x{17f}])([\[\(\{<“‘])

([,;\?!”’\)\]\}>])([0-9])|([^0-9][\.:][0-9])

\<([0-9]+)(,)([0-9]{1,2}\>|[1-9][0-9]{2}\>|[0-9]{4,})

Extra Spaces Near Punctuation

Search for extra spaces near punctuation.

The standard search will find spaces before closing punctuation or after opening punctuation. If numerous ellipses are present, use the second search set.

Standard extra space search:

([ \x{a0}])([\]\)\}>/\.,:;!\?”’–—-])

([\[\(\{/<“‘–—-])([ \x{a0}])

If numerous ellipses:

([ \x{a0}])([\]\)\}>/,:;!\?”’–—-])

([\[\(\{/<“‘–—-])([ \x{a0}])

([ \x{a0}]\.[ \x{a0}][^\.]|[^\.][ \x{a0}]\.)

Punctuation Location

Search for closing punctuation at the start of a line.

^([\]\)\}>/\.,:;!\?”’–—-])

Search for opening punctuation at the end of a line.

([\[\(\{</“‘–—-])$

Identical Punctuation

Search for two identical punctuation marks in a row.

The standard search will find identical punctuation marks appearing adjacent to each other. If numerous ellipses are present, use the second search set.

Standard identical punctuation search:

([\[\(\{\]\)\}<>/\.,:;!\?“‘”’–—-])\1

If numerous ellipses:

([\[\(\{\]\)\}<>/,:;!\?“‘”’–—-])\1

Em Dashes, En Dashes, and Hyphens

Search for combined dashes.

[–—-][–—-]

Search for hyphens adjacent to numerals, such as in number ranges.

([0-9])-([0-9])

([0-9]-[^0-9]|[^0-9]-[0-9])

Search for en dashes that do not appear in number ranges.

([^0-9]–|–[^0-9])

Search for potential hyphen errors.

([^a-z0-9\x{c0}-\x{ff}\x{100}-\x{17f}])(-)

(-)([^a-z0-9\x{c0}-\x{ff}\x{100}-\x{17f}])

Search for all hyphenated terms. Paste the results into a new text file and permute by unique (Edit > Permute Lines > Unique) to remove duplicates.

\<([0-9A-Za-z\x{c0}-\x{ff}\x{100}-\x{17f}]+)-([0-9A-Za-z\x{c0}-\x{ff}\x{100}-\x{17f}]+)\>
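The collect-and-deduplicate step can also be scripted. The sketch below assumes the text-only file has been read into a string, uses \b in place of \< and \>, and keeps the first occurrence of each term in document order; the function name is hypothetical.

```python
import re

WORD = r"[0-9A-Za-z\u00c0-\u00ff\u0100-\u017f]+"
HYPHENATED = re.compile(rf"\b({WORD})-({WORD})\b")

def unique_hyphenated_terms(text: str) -> list[str]:
    """Return each hyphenated term once, in order of first appearance."""
    return list(dict.fromkeys(m.group(0) for m in HYPHENATED.finditer(text)))

sample = "A well-known author and a well-known editor met a first-rate printer."
print(unique_hyphenated_terms(sample))  # ['well-known', 'first-rate']
```

Wrapping the result in sorted() may make it easier to spot inconsistent variants of the same term.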

Quotation Marks

Search for incorrect spaces around quotation marks.

[“‘][ \x{a0}]?[”’]|[”’][“‘]

[“”‘’][”“‘’]

Search for quotation marks adjacent to em dashes to determine if they are facing the proper direction.

[“”‘’]—|—[“”‘’]

Parentheses and Brackets

Search for empty parentheses and brackets.

([\[\(\{<])([>\]\)\}])

Nonmatching Opening and Closing Punctuation

Search for nonmatching opening and closing double quotation marks.

(”[^“\n]*”|“[^”\n]*“|“[^”\n]*$|^[^“\n]*”)

Search for opening single quotation marks without a closing single quotation mark before another opening single quotation mark or the end of a line.

(‘[^’\n]*‘|‘[^’\n]*$)

Search for nonmatching opening and closing parentheses.

^[^\(\n]*\)|\([^\)\n]*\(|\)[^\(\n]*\)|\([^\)\n]*$

Search for nonmatching opening and closing square brackets.

^[^\[\n]*\]|\][^\[\n]*\]|\[[^\]\n]*\[|\[[^\]\n]*$

Search for nonmatching opening and closing curly brackets.

^[^\{\n]*\}|\{[^\}\n]*\{|\}[^\{\n]*\}|\{[^\}\n]*$

Search for nonmatching opening and closing angle brackets.

^[^<\n]*>|<[^>\n]*<|>[^<\n]*>|<[^>\n]*$
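The four bracket searches above share one structure, so a scripted version can build the pattern from any opening and closing mark. This is a sketch that assumes one paragraph per line; the function names are hypothetical.

```python
import re

def mismatch_pattern(opening: str, closing: str) -> re.Pattern:
    """Build the four-part nonmatching search for one pair of marks."""
    o, c = re.escape(opening), re.escape(closing)
    return re.compile(
        rf"^[^{o}\n]*{c}"      # closing mark with no opening mark before it
        rf"|{o}[^{c}\n]*{o}"   # two opening marks with no closing mark between
        rf"|{c}[^{o}\n]*{c}"   # two closing marks with no opening mark between
        rf"|{o}[^{c}\n]*$",    # opening mark not closed before the end of the line
        re.MULTILINE,
    )

def mismatched_lines(text: str, opening: str, closing: str) -> list[int]:
    """Return 1-based numbers of lines containing a possible mismatch."""
    pattern = mismatch_pattern(opening, closing)
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if pattern.search(line)
    ]

sample = "An (open clause\nA closed) clause\nA (good) one\nNested ((bad) one"
print(mismatched_lines(sample, "(", ")"))  # [1, 2, 4]
```

As with the original searches, correctly nested pairs are reported, and a parenthetical that spans two paragraphs is flagged in both.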

Legal Word Errors

A legal word error is a term that does not match the source material but is itself a valid word, so a standard spellcheck will not find it. Common examples include Cod/God and modem/modern.

If previous searches have revealed legal word errors, search for them throughout the file.
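Once a list of suspect terms exists, a whole-word, case-sensitive count shows where to look. This is a sketch only; the term list and sample sentence are invented, and \b stands in for Sublime Text’s \< and \>.

```python
import re

def count_terms(text: str, terms: list[str]) -> dict[str, int]:
    """Count whole-word, case-sensitive occurrences of each suspect term."""
    return {
        term: len(re.findall(rf"\b{re.escape(term)}\b", text))
        for term in terms
    }

sample = "In the beginning Cod created the modem world. The modern reader trusts God."
print(count_terms(sample, ["Cod", "modem"]))  # {'Cod': 1, 'modem': 1}
```

A nonzero count is only a prompt to compare those passages with the source, since the suspect term may also occur legitimately.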

Search for prefixes appearing on their own.

Standard prefix search:

\<([CcNn]on|[Pp]?[RrDd]e|[Cc]o)\>

If numerous legal hyphenated prefixes:

\<([CcNn]on|[Pp]?[RrDd]e|[Cc]o)[^-]\>

If numerous instances of “Co.” as “Company”:

(\<[Cc]o\>[^\.])