Regular Expressions Resource Supplement

The regular expressions found on the Regular Expressions Resource page are presented with minimal context. This supplement provides some additional detail about how to use the regular expressions to find potential issues in .sam and .scml files using Sublime.

The regular expression searches find patterns that may indicate errors in files. In many cases, however, the text patterns are intentional. Check for issues with an awareness of false positives or the requirements of some files to retain text errors (as when material is quoting text that contains errors). Additionally, the searches do not necessarily indicate how to fix potential issues. In some cases, an error may be due to a problem in an InDesign source; in others, many issues can best be resolved in Word using the SAI Cleanup Tools, particularly when scribing or editing a file.

Searches should be run with the “Case sensitive” setting turned on.

Searches with file types listed in parentheses apply specifically to the preparation of files of that type. These searches should be run before proceeding with further conversion or finalizing the files. When using the Regular Expressions Resource to review the text content as part of Word scribing, for example, the searches listed for .sam/.scml would not need to be done.

Sublime Text Package

For ease of use, a Sublime Text Package is available. In the file to be checked, run Scribe Inc. > Check > Check 1: Text Patterns and Scribe Inc. > Check > Check 2: Titles, Phrases, Alt Text, and Indexes.

To review files after OCR or PDF Extraction, run Scribe Inc. > Check > Check 3: OCR Review.

Each tool will open the results in a new Sublime window. The Check 2 report will open two files: One contains the results for review, and the other contains a full list of all words and phrases contained within italics or quotation marks.

Each tool reads the regular expressions from the Regular Expressions Resource page. It then runs the expressions as searches and reports the number of matches found in each document. If a search returns zero matches, it will not appear in the report. Users, therefore, do not need to spend time searching for text patterns that do not appear within the document.
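The reporting behavior described above can be sketched in a few lines of Python. This is only an illustration and not the package's actual code: the search names and patterns below are hypothetical stand-ins for the expressions on the Regular Expressions Resource page.

```python
import re

# Hypothetical search names and patterns, for illustration only.
# The real expressions are listed on the Regular Expressions Resource page.
SEARCHES = {
    "Double space": r" {2,}",
    "Space before comma": r" ,",
    "Tab character": r"\t",
}

def count_matches(text, searches=SEARCHES):
    """Return {search name: match count}, omitting searches with zero matches."""
    report = {}
    for name, pattern in searches.items():
        # No re.IGNORECASE flag, so matching is case sensitive.
        count = sum(1 for _ in re.finditer(pattern, text))
        if count:
            report[name] = count
    return report
```

Because searches with no matches are dropped, the resulting report lists only the patterns that need review.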

Quotation Marks, Parentheses, and Brackets

The searches listed here find issues with content that is expected to be paired or to face a particular direction.

Check the results for when parentheses/brackets open without closing or close without opening. Look for quotation marks facing the wrong direction.

False positives may include instances of parentheses within parentheses. In some cases, this is intentional. In others, the internal parentheses may need to be changed to brackets. Another common false positive is when closing parentheses are used after list numbers.
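As a rough illustration of the pairing check, the following Python sketch flags paragraphs in which the counts of opening and closing parentheses or brackets differ. It assumes one paragraph per line and is not one of the searches from the resource page.

```python
def unbalanced_paragraphs(text):
    """Return (line number, paragraph) pairs whose ( ) or [ ] counts differ.

    Assumes each paragraph occupies a single line.
    """
    flagged = []
    for number, paragraph in enumerate(text.splitlines(), start=1):
        parens_differ = paragraph.count("(") != paragraph.count(")")
        brackets_differ = paragraph.count("[") != paragraph.count("]")
        if parens_differ or brackets_differ:
            flagged.append((number, paragraph))
    return flagged
```

A list item such as "1) First item" is flagged by this sketch, which mirrors the list-number false positive noted above.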

One search will also find straight double quotation marks that may need to be smart (curly) quotes or the double prime character (for inches): “, ”, or ″. If there is more than one instance in a paragraph, only the last one will be found. If one instance is found, check for others.

Punctuation

The searches here look for a number of different potential issues with punctuation. For example, two periods or two question marks in a row would likely be incorrect, so one of the searches looks for a piece of punctuation followed by itself. Other patterns are included that involve punctuation near other text and spaces in different combinations.
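A search for a piece of punctuation followed by itself might look like the Python sketch below. The set of punctuation marks is an assumption; the expression on the resource page may cover more or fewer characters.

```python
import re

# Assumed character set. The backreference \1 requires the same mark twice.
REPEATED_PUNCTUATION = re.compile(r"([.,;:?!])\1")

def find_repeated_punctuation(text):
    """Return (offset, matched text) for each doubled punctuation mark."""
    return [(m.start(), m.group(0)) for m in REPEATED_PUNCTUATION.finditer(text)]
```

An ellipsis typed as three periods would also match, so results of this kind still need to be reviewed by hand.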

Italic Commas and Periods

Searching for italic and roman commas can reveal editorial inconsistencies. For example, if the specification for a file requires that all punctuation following italic phrases be made italic, searching for commas and periods on either side of a closing <i> tag could reveal inconsistencies that may not be intentional, depending on the style guide. Investigate any outliers to confirm if they are intentional, as when closing punctuation is italic as part of a complete sentence.

Common errors with roman periods following italic abbreviations such as Jr., Dr., St., Ms., Mr., and Mrs. are also included as a separate search.

It may also be necessary to search for other styles (like <b> or <sm>) or to search for other forms of punctuation.
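One way to compare the two placements is to count how often punctuation falls inside versus outside the closing tag. The Python sketch below is a minimal illustration; the function name and its parameters are hypothetical, and the tag syntax is assumed to be the <i>...</i> form mentioned above.

```python
import re

def punctuation_placement(text, tag="i", marks=".,"):
    """Count punctuation marks just inside and just outside a closing tag."""
    closing = re.escape(f"</{tag}>")
    mark_class = "[" + re.escape(marks) + "]"
    inside = len(re.findall(mark_class + closing, text))   # e.g. ,</i>
    outside = len(re.findall(closing + mark_class, text))  # e.g. </i>,
    return {"inside": inside, "outside": outside}
```

If one count is much smaller than the other, the smaller group contains the outliers to investigate. Passing tag="b" or tag="sm", or a different marks string, extends the same check to other styles and punctuation.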

Unexpected Character Patterns

As the section name indicates, these searches look for unexpected character patterns. For example, one would not expect to find a space after opening quotation marks or to find a comma followed by a number (rather than a space). Sometimes, these patterns are intentional, as in the presentation of foreign currency, but often the text found by these searches can reveal formatting or spacing errors.

Structure Tags and Chapter Divisions

These searches find certain errors related to structure tags, especially in situations that may occur due to improper InDesign setup when exporting XML.

Spaces and Tabs

These searches find a variety of text patterns that involve spaces (regular and nonbreaking) and tabs. This includes missing or extra spaces.

Write-in Lines

This search finds consecutive underscores that may need to be scribed with the <freeform> character style.

Potentially Incorrect Line Breaks

This search must be run with the “Case sensitive” setting turned on.

It looks for paragraphs that start with lowercase letters. This may have many false positives, commonly finding poetry and index entries. Especially in the early stages of a project, however, this can reveal broken paragraphs that need to be reconnected. This is particularly common when preparing Notes and Bibliographies in manuscripts.
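A minimal Python sketch of this check follows. It assumes one paragraph per line with no tag at the start of the line, which may not match the actual file format, so treat it as an illustration rather than the expression from the resource page.

```python
import re

def lowercase_paragraph_starts(text):
    """Return line numbers of paragraphs that begin with a lowercase letter."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if re.match(r"[a-z]", line)  # case sensitive: only a-z matches
    ]
```

With case sensitivity turned off, the pattern would match every paragraph that starts with a letter, which is why the setting matters for this search.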

URLs

These searches find the most common issues with tagging URLs. They find potentially bad spaces or characters, as well as any <url> tags that contain sentence- or phrase-ending punctuation.
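The Python sketch below shows one way such a check could work. It assumes URLs are wrapped as <url>...</url> and flags any that contain whitespace or end with punctuation; the list of punctuation marks is an assumption.

```python
import re

URL_TAG = re.compile(r"<url>(.*?)</url>")
ENDING_PUNCTUATION = (".", ",", ";", ":", "!", "?")  # assumed set

def suspicious_urls(text):
    """Return tagged URLs that contain whitespace or end with punctuation."""
    flagged = []
    for match in URL_TAG.finditer(text):
        url = match.group(1)
        if re.search(r"\s", url) or url.endswith(ENDING_PUNCTUATION):
            flagged.append(url)
    return flagged
```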

If “http” appears in the middle of a URL string, check that it is preceded by something that indicates it is part of a web archive.

If amazon.com is found as a general link, the url style should be removed, because linking to amazon.com itself, rather than to a specific page on that site, is prohibited.

To review dead links, process the file to ePub 3 in the Digital Hub. Open the file using Kindle Previewer and run the Quality Checks to view a report about potentially problematic links. Review any listed URL to confirm if it leads to a nonexistent web page and needs to be resolved.

Note: The status of a URL could change at any time. Scribe recommends checking them up to the point of first publication and using a disclaimer on the copyright page and parenthetical “(no longer extant)” notes following inactive links. Sample disclaimer: “References to internet websites (URLs) were accurate at the time of writing. Neither the author nor [Press Name] is responsible for URLs that may have expired or changed since the manuscript was prepared.”

Zip Codes

This search finds en dashes used within zip codes. Replace any such en dash with a hyphen.
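As a sketch, a find-and-replace for this pattern could be written in Python as follows. It assumes the nine-digit form (five digits, en dash, four digits); the actual expression may differ.

```python
import re

# \u2013 is the en dash. Assumes the ZIP+4 form: 5 digits, en dash, 4 digits.
ZIP_EN_DASH = re.compile(r"\b(\d{5})\u2013(\d{4})\b")

def fix_zip_dashes(text):
    """Replace the en dash in ZIP+4 codes with a hyphen."""
    return ZIP_EN_DASH.sub(r"\1-\2", text)
```

A numeric range that happens to have the same shape would also be changed, so each match is best confirmed before replacing.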

Typesetter Spaces

Some spaces are available to typesetters that are not available in ebooks. Ebooks can have regular spaces or nonbreaking spaces. Typesetters may use en spaces, em spaces, thin spaces, and other variations.

If there are any bad or problematic spaces (between single and double quotation marks, for example), these should be resolved before producing an ScML file to be used in creating an ebook.

Hyphen Spacing

This search finds a space on either side of a hyphen, excluding instances in which the hyphen is followed by and, or, to, or und.
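One possible reading of this search is sketched in Python below: flag a space before a hyphen, or a space after a hyphen unless the next word is and, or, to, or und (as in "pre- and postwar"). This interpretation of the exclusion is an assumption, and the expression on the resource page may be written differently.

```python
import re

# " -" (space before hyphen), or "- " unless followed by and/or/to/und.
HYPHEN_SPACING = re.compile(r" -|- (?!(?:and|or|to|und)\b)")

def find_hyphen_spacing(text):
    """Return the offsets where a hyphen has a suspicious adjacent space."""
    return [m.start() for m in HYPHEN_SPACING.finditer(text)]
```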

Missing Spaces around Tags and Commas

These searches find potentially missing spaces around tags and commas. Check that any instances are intentional, as in words like “book(s).”

Scribing/Articulation

These searches look for the potentially incorrect scribing of paragraphs. For example, if <ah> is followed by <p>, the <p> paragraph should be <paft>.

Self-Closing Note Reference Tags

Search for any self-closing note reference tags. These tags can throw off the statistics as reported by the Digital Hub and prevent the proper linking of the tags during a Hub conversion.

Small Caps

This search looks for any bad casing for content in <sm> tags, including text in all caps or mixed casing (lIKe ThiS).

Alt Text

The alt text searches gather all <img> tags. The results can be reviewed to confirm the phrasing, and instances of missing or duplicated alt text will be noted.

Similar Titles, Italics, and Quotes

These searches compare titles in italics and quotation marks. In the resulting list, nonbreaking spaces are changed to regular spaces and select closing punctuation is deleted so that the tool can compare each text string.

Errors may include unintentional inconsistencies in capitalization, spelling, or punctuation usage.

Some inconsistencies may be correct based on their context. For example, sometimes titles will intentionally use title case in one section but sentence case in another.

False positives will result from the comparisons of individual words or small differences like the volume number associated with a journal reference.

Index Section (.scml files)

The searches include looking for numbers preceded by spaces, potential page range issues, potential letters in the wrong place, potential index entries tagged as page numbers, italics within page ranges, numbers pointing to the same page, potentially unlinked “see” references, and an indication of the first and last pages listed in the index.
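Two of these checks, page range issues and numbers pointing to the same page, can be illustrated with the Python sketch below. It works on plain text rather than tagged .scml content and only compares ranges whose two numbers have the same number of digits, so abbreviated ranges such as 123–25 are skipped. Both choices are simplifying assumptions.

```python
import re

# \u2013 is the en dash used in page ranges.
PAGE_RANGE = re.compile(r"\b(\d+)\u2013(\d+)\b")

def suspicious_page_ranges(text):
    """Return ranges whose end page is not greater than the start page."""
    flagged = []
    for match in PAGE_RANGE.finditer(text):
        start, end = match.group(1), match.group(2)
        # Skip abbreviated ranges (different digit counts) to avoid noise.
        if len(start) == len(end) and int(end) <= int(start):
            flagged.append(match.group(0))
    return flagged
```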

To check an index before it has even been typeset, the following procedure can be used:

Export the content from InDesign and process to .sam in the Digital Hub.
Process the index’s Word document to .sam in the Digital Hub.
Place the index at the end of the main .sam file.
Process that to .scml in the Digital Hub.
Run Check 2 on the .scml file.

Self-Closing and Unnecessary Tags (.sam/.scml)

When preparing a .sam or .scml file, search for any unnecessary tags so they can be removed, resulting in a cleaner file.

Position of Tags and Spaces (.sam/.scml)

Search individually throughout the file to check that spaces are properly placed in relation to opening and closing tags.

Page IDs (.sam/.scml)

These searches help to find and resolve issues regarding the placement and formatting of page IDs.

With the exception of page-based indexes, it is not considered a best practice to reference page numbers within ebooks. If internal page number references exist, they must be linked to their corresponding page ID in order for the resulting ebook to be compliant with current standards.

If a page ID interrupts a URL, both halves of the URL should be linked to the full href for an ebook or HTML output.

Page References (.scml)

This search finds references to pages in an ScML file that may need to be linked to the corresponding page ID.

Single-Chapter Bible Books (.scml)

In a file in which Bible books are linked with <xbr> tags, these searches look for single-chapter Bible books in order to implement the proper reference formatting.

If working with blind endnotes, these searches can help find if any pairs of hidden markers are not properly matched.

DTD Validation Troubleshooting (.sam/.scml)

The searches here can assist in finding the reason for a validation error. These searches are also available on the Digital Hub File Alerts page.
