Regular Expressions Resource Supplement

The regular expressions found on the Regular Expressions Resource page are presented with minimal context. This supplement provides some additional detail about how to use the regular expressions to find potential issues in .sam and .scml files using Sublime.

The regular expression searches find patterns that may indicate errors in files. In many cases, however, the text patterns are intentional. Check for issues with an awareness of false positives or the requirements of some files to retain text errors (as when material is quoting text that contains errors). Additionally, the searches do not necessarily indicate how to fix potential issues. In some cases, an error may be due to a problem in an InDesign source; in others, many issues can best be resolved in Word using the SAI Cleanup Tools, particularly when composing or editing a file.

Unless otherwise noted, searches should be run with the “Case sensitive” setting turned on.

Gathered at the end of this resource are groups of searches with file types listed in parentheses. These apply specifically to the preparation of files of that type and should be run before proceeding with further conversion or finalizing the files. When using the Regular Expressions Resource to review the text content as part of a Word composition, for example, the searches listed for .sam/.scml do not need to be run.

Quotation Marks, Parentheses, and Brackets

The searches listed here find issues with content that is expected to be paired or to face a particular direction.

Search individually throughout the file.

Check the results for when parentheses/brackets open without closing or close without opening. Look for quotation marks facing the wrong direction.

False positives may include instances of parentheses within parentheses. In some cases, this is intentional. In others, the internal parentheses may need to be changed to brackets. Another common false positive is when closing parentheses are used after list numbers.
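
The paired-delimiter check described above can be sketched in Python for use outside Sublime. This is only an illustration, not one of the searches from the Regular Expressions Resource page: the function name `unbalanced_delimiters` is hypothetical, and the sketch assumes each paragraph sits on a single line of the file.

```python
def unbalanced_delimiters(text):
    """Return (line_number, character) pairs for unmatched ( ) [ ] on each line."""
    closers = {")": "(", "]": "["}
    openers = set(closers.values())
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        stack = []  # openers on this line still waiting for a close
        for char in line:
            if char in openers:
                stack.append(char)
            elif char in closers:
                if stack and stack[-1] == closers[char]:
                    stack.pop()
                else:
                    problems.append((line_number, char))  # closes without opening
        # Anything left on the stack opened without closing.
        problems.extend((line_number, opener) for opener in stack)
    return problems


sample = "First (complete) sentence.\nSecond (incomplete sentence.\n1) A list number."
print(unbalanced_delimiters(sample))  # [(2, '('), (3, ')')]
```

The third line of the sample shows the list-number false positive mentioned above: the closing parenthesis after "1" is reported even though it is intentional.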


Punctuation

The searches here look for a number of different potential issues with punctuation. For example, two periods or two question marks in a row would likely be incorrect, so one of the searches looks for a piece of punctuation followed by itself. Other patterns are included that involve punctuation near other text and spaces in different combinations.
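
As a rough illustration of "a piece of punctuation followed by itself," the following Python sketch uses a backreference. The pattern is an assumed stand-in, not the expression from the Regular Expressions Resource page, and the name `find_doubled_punctuation` is hypothetical.

```python
import re

# Assumed pattern: one punctuation mark immediately followed by the same mark.
DOUBLED_PUNCTUATION = re.compile(r"([.,;:?!])\1")


def find_doubled_punctuation(text):
    """Return every doubled punctuation mark found in the text."""
    return [match.group(0) for match in DOUBLED_PUNCTUATION.finditer(text)]


sample = "Is this right?? It ends here.. An ellipsis... is flagged too."
print(find_doubled_punctuation(sample))  # ['??', '..', '..']
```

The ellipsis in the sample is a false positive of the kind that needs a manual check.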

Search individually throughout the file for most of these searches.

Click “Find All” to review the results of the search that finds any individually wrapped special character. (This is the search before the italic and roman commas searches.) Check for any instances in which the result is unintentional. This search will likely find false positives for symbols or diacritics.

Searching for italic and roman commas can reveal editorial inconsistencies. For example, if the specification for a file requires that all punctuation following italic phrases be made italic, searching for commas and periods on either side of a closing <i> tag could reveal inconsistencies that may not be intentional, depending on the style guide. For that set of searches, click “Find All” and compare the number of results for italic and roman commas after <i> tags, and then do the same for periods. Investigate any outliers to confirm whether they are intentional. The number of results is displayed in Sublime below the search window. If no instances are found, Sublime will indicate it is unable to find the search pattern.
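
The comparison of italic and roman punctuation can also be tallied in one pass. The sketch below is a hypothetical helper, not part of the resource; it assumes that closing italic tags are written exactly as </i> and that "italic" punctuation sits immediately before the tag and "roman" punctuation immediately after it.

```python
import re


def count_punctuation_around_closing_italic(text):
    """Count commas and periods just inside and just outside a closing </i> tag."""
    return {
        "italic comma": len(re.findall(r",</i>", text)),
        "roman comma": len(re.findall(r"</i>,", text)),
        "italic period": len(re.findall(r"\.</i>", text)),
        "roman period": len(re.findall(r"</i>\.", text)),
    }


sample = "<i>Hamlet,</i> <i>Othello,</i> and <i>Macbeth</i>, in <i>that order</i>."
print(count_punctuation_around_closing_italic(sample))
# {'italic comma': 2, 'roman comma': 1, 'italic period': 0, 'roman period': 1}
```

In this sample the single roman comma is the outlier worth investigating.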

Unexpected Character Patterns

As the section name indicates, these searches look for unexpected character patterns. For example, one would not expect to find a space after opening quotation marks or to find a comma followed by a number (rather than a space). Sometimes, these patterns are intentional, but often the text found by these searches can reveal formatting or spacing errors.

Search individually throughout the file.


Spaces

These searches find a variety of text patterns that involve spaces (regular and nonbreaking), including missing or extra spaces.

Search individually throughout the file.

Incorrect Line Breaks

This search must be run with the “Case sensitive” setting turned on.

It looks for paragraphs that start with lowercase letters. This will have many false positives, commonly finding poetry and index entries. Especially in the early stages of a project, however, this can reveal broken paragraphs that need to be reconnected. This is particularly common when preparing Notes and Bibliographies in manuscripts.
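
A minimal Python sketch of the idea follows. The pattern is an assumption rather than the search from the resource page, and it treats every line of plain text as one paragraph. Python regular expressions are case sensitive unless told otherwise, which matches the setting required above.

```python
import re

# Assumed pattern: a line whose first character is a lowercase letter.
LOWERCASE_START = re.compile(r"^[a-z].*$", re.MULTILINE)


def lowercase_paragraph_starts(text):
    """Return each line (paragraph) that begins with a lowercase letter."""
    return [match.group(0) for match in LOWERCASE_START.finditer(text)]


sample = "The note was split in the\nmiddle of a sentence.\nNext note begins here."
print(lowercase_paragraph_starts(sample))  # ['middle of a sentence.']
```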


URLs

The first two searches here find the most common issues with tagging URLs. They find potentially bad spaces or characters, as well as any <url> tags that contain sentence- or phrase-ending punctuation. Search for them individually.

The next searches find any text that follows the common pattern of a URL. The results can reveal text that is not in fact a URL, as well as strings that may represent two URLs that were run together. If “http” appears in the middle of a URL string, check that it is preceded by something that indicates it is part of a web archive.
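
The check for "http" in the middle of a URL string might look like the following sketch. The pattern and the name `urls_with_embedded_http` are illustrative assumptions, the example.com and example.org addresses are placeholders, and the sketch assumes plain text with no <url> tags wrapped around the URLs.

```python
import re

# Assumed pattern: "http://" or "https://" followed by any run of non-space characters.
URL_TOKEN = re.compile(r"https?://\S+")


def urls_with_embedded_http(text):
    """Return URL-like strings that contain a second 'http' after the scheme."""
    return [url for url in URL_TOKEN.findall(text) if "http" in url[4:]]


sample = (
    "See https://example.com/ahttps://example.org/b and "
    "https://web.archive.org/web/2020/http://example.com/ for details."
)
for url in urls_with_embedded_http(sample):
    print(url)
```

Both strings in the sample are reported. The first is two URLs run together; the second is a web archive address, so it is a false positive that a reviewer would leave alone.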

In some cases, these URL searches should be done individually. In others, it may be beneficial to click “Find All” and copy the results into a different file to review them.

To review dead links, process the file to ePub 3 in the Digital Hub. Open the file using Kindle Previewer and run the Quality Checks to view a report about potentially problematic links. Valid URLs will be listed if they are interrupted by a page ID. Review any listed URL to confirm whether it leads to a nonexistent web page and needs to be resolved.

Special Characters

It is recommended that special characters be reviewed by uploading files to the Digital Hub. The search here is provided in case a user is working offline. Review the stats to confirm all expected characters are present and rendering properly in the output (print or digital).

The Digital Hub special characters list provides search patterns to help find the characters in a .sam or .scml file, as some (like a zero-width nonbreaking space) can be difficult to copy and paste into a search window.

When tracking special character usage throughout the life of a project, pay attention to any extreme or unexpected changes in the characters list. This could reveal the loss or mishandling of content.
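
When working offline, a comparison of this kind can be approximated with a short script. The sketch below is not the Digital Hub's method; it assumes that "special character" means any non-ASCII character, and the name `special_character_changes` is hypothetical.

```python
from collections import Counter


def special_character_changes(before, after):
    """Map each non-ASCII character whose count changed to (old_count, new_count)."""
    old = Counter(char for char in before if ord(char) > 127)
    new = Counter(char for char in after if ord(char) > 127)
    return {
        char: (old[char], new[char])
        for char in sorted(set(old) | set(new))
        if old[char] != new[char]
    }


earlier_version = "café – naïve – résumé"
later_version = "cafe – naïve - resume"
print(special_character_changes(earlier_version, later_version))
# {'é': (3, 0), '–': (2, 1)}
```

A character that drops to zero, as in this sample, is the kind of extreme change that may point to lost or mishandled content.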

ISBNs and Zip Codes

These searches look for common errors with dashes being used in ISBNs and zip codes. If any en dashes or em dashes are used, replace them with hyphens.
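
The replacement can be sketched in Python as follows. Both patterns are assumptions made for illustration (a 13-digit ISBN beginning with 978 or 979 and a ZIP+4 code) and are not the expressions from the resource page; the ISBN and address in the sample are invented.

```python
import re

# Assumed patterns: digits joined by at least one en dash (U+2013) or em dash (U+2014).
ISBN_13 = re.compile(r"\b97[89][\d\-\u2013\u2014]*[\u2013\u2014][\d\-\u2013\u2014]*\d\b")
ZIP_PLUS_FOUR = re.compile(r"\b\d{5}[\u2013\u2014]\d{4}\b")


def fix_number_dashes(text):
    """Replace en and em dashes with hyphens, but only inside ISBNs and zip codes."""
    def to_hyphens(match):
        return re.sub(r"[\u2013\u2014]", "-", match.group(0))

    text = ISBN_13.sub(to_hyphens, text)
    return ZIP_PLUS_FOUR.sub(to_hyphens, text)


sample = (
    "ISBN 978\u20130\u201300\u2013000000\u20132, "
    "Springfield, IL 62701\u20131234, pages 12\u201315."
)
print(fix_number_dashes(sample))
```

The en dash in the page range at the end of the sample is left untouched, because only the ISBN and zip code patterns are rewritten.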

Angle Brackets

Use this search to find any stray angle brackets that remain in a file due to a conversion error or notes left by someone preparing a manuscript outside of the WFDW.

Typesetter Spaces

Some spaces are available to typesetters that are not available in e-books. E-books can have regular spaces or nonbreaking spaces. Typesetters may use en spaces, em spaces, thin spaces, and other variations.

Search individually throughout the file for some of these. If there are any bad or problematic spaces (between single and double quotation marks, for example), these should be resolved before producing an ScML file to be used in creating an e-book.

Click “Find All” to highlight the nonbreaking spaces. They may be used frequently to keep the text together, as in Bible book names or people’s titles (e.g., Dr., Prof., St.). Check that entire paragraphs are not filled with unwanted nonbreaking spaces, as the result could be problematic for the display of a print or digital output.
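
A quick count of the different space characters can help show whether nonbreaking or typesetter spaces are overused. The sketch below assumes the spaces appear in the file as literal Unicode characters (rather than as entities or tags) and covers only four of them; the name `count_special_spaces` is hypothetical.

```python
from collections import Counter

# Unicode code points for the spaces counted in this sketch.
SPACE_NAMES = {
    "\u00a0": "nonbreaking space",
    "\u2002": "en space",
    "\u2003": "em space",
    "\u2009": "thin space",
}


def count_special_spaces(text):
    """Return how many times each listed space character appears."""
    return dict(Counter(SPACE_NAMES[char] for char in text if char in SPACE_NAMES))


sample = "Dr.\u00a0Smith\u2003wrote\u2009this.\u00a0"
print(count_special_spaces(sample))
# {'nonbreaking space': 2, 'em space': 1, 'thin space': 1}
```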

Hyphen Spacing

This search finds a space on either side of a hyphen. This will often have false positives, but it can reveal if hyphens have been misused or if hyphenated terms have become separated.

Pay particular attention to this search if working on a file produced through OCR or a manual conversion process.

Search individually throughout the file.
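
For illustration, a hyphen with a space on either side can be located with a pattern like the one below. It is an assumed stand-in for the search on the resource page, written so that each result includes the neighboring words for context.

```python
import re

# Assumed pattern: a hyphen with a space before it or after it, plus adjacent text.
SPACED_HYPHEN = re.compile(r"\S*(?: -|- )\S*")


def find_spaced_hyphens(text):
    """Return each hyphen that touches a space, with the words around it."""
    return SPACED_HYPHEN.findall(text)


sample = "a well- known author and a long -term plan with a self-evident truth"
print(find_spaced_hyphens(sample))  # ['well- known', 'long -term']
```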

Incorrect Hyphenation

Click “Find All” to pull all the hyphenated words from a document, then paste them into a separate file. In some cases, this requires only a brief review to confirm nothing stands out as problematic. When reviewing a file that was created through OCR or a manual conversion, however, this search can reveal many common source errors.

In the separate file, turn on spell check to review the flagged results, which may be word fragments or other spelling errors.
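
The same list can be pulled with a short script when Sublime is not at hand. The pattern below is an assumption (letters joined by one or more hyphens, ASCII letters only) and the function name is hypothetical; duplicates are removed so that each hyphenated word is reviewed once.

```python
import re

# Assumed pattern: two or more runs of letters joined by hyphens.
HYPHENATED_WORD = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")


def hyphenated_words(text):
    """Return the unique hyphenated words in the text, sorted alphabetically."""
    return sorted(set(HYPHENATED_WORD.findall(text)))


sample = "A well-known, state-of-the-art method; the well-known docu-ment."
print(hyphenated_words(sample))  # ['docu-ment', 'state-of-the-art', 'well-known']
```

In the sample, "docu-ment" is the kind of word fragment that spell check would flag.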

Missing Spaces around Tags and Commas

Search individually throughout the file to find missing spaces around tags and commas. Check that any instances are intentional.

If a book has many long numbers, click “Find All” to check that any commas that are not followed by a space are part of long number strings.
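
Separating long numbers from genuine errors can be sketched as follows. The patterns are assumptions for illustration: a "long number" is taken to be digits grouped in threes with commas, and anything else with a comma not followed by a space is reported.

```python
import re

# Assumed patterns: any token with a comma followed by a non-space character,
# and a well-formed number such as 1,250,000.
COMMA_TOKEN = re.compile(r"\S*,\S+")
GROUPED_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+")


def commas_missing_spaces(text):
    """Return tokens with an unspaced comma that are not well-formed long numbers."""
    flagged = []
    for token in COMMA_TOKEN.findall(text):
        core = token.rstrip(".;:!?)")  # ignore trailing sentence punctuation
        if not GROUPED_NUMBER.fullmatch(core):
            flagged.append(token)
    return flagged


sample = "About 1,250,000 copies sold,but 12,5 looks wrong."
print(commas_missing_spaces(sample))  # ['sold,but', '12,5']
```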

Small Caps

Click “Find All” and copy the results into a separate file. Look for any bad casing for content in <sm> tags.

Text is likely incorrect in <sm> tags if it is typed as all caps or has mixed casing (lIKe ThiS).
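
The two casing problems named above can be screened for with a sketch like this one. It assumes <sm> tags never nest or span lines, and it treats "mixed casing" as any uppercase letter directly after a lowercase one, which is a simplification that can flag intentional forms.

```python
import re

SMALL_CAPS = re.compile(r"<sm>(.*?)</sm>")
MIXED_CASE = re.compile(r"[a-z][A-Z]")  # an uppercase letter right after a lowercase one


def suspicious_small_caps(text):
    """Return <sm> contents that are typed in all caps or with mixed casing."""
    return [
        content
        for content in SMALL_CAPS.findall(text)
        if content.isupper() or MIXED_CASE.search(content)
    ]


sample = "the <sm>Lord</sm> and the <sm>LORD</sm> in 44 <sm>bc</sm>, <sm>lIKe ThiS</sm>"
print(suspicious_small_caps(sample))  # ['LORD', 'lIKe ThiS']
```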


Tetragrammaton

This search is similar to the small caps search, but it is specific to instances of the Tetragrammaton.

Self-Closing Note Reference Tags

Search for any self-closing note reference tags. These tags can throw off the statistics as reported by the Digital Hub and prevent the proper linking of the tags during a Hub conversion.

Self-Closing and Unnecessary Tags (.sam/.scml)

When preparing a .sam or .scml file, search for any unnecessary tags so they can be removed, resulting in a cleaner file.

Index Section (.scml files)

Pull the index section out of the .scml file. In files containing page IDs, the Hub will add <xref> tags around the listed pages.

The first search will find any numbers that have a space in front of them. False positives will include numbers that are part of the index entry, rather than a page number to be linked. If an index page number is found, check the file to confirm the page number is in the book and that the index entry is formatted properly.

The second search will find all the “see” references. Paste the results in a new file. Review each entry to see if any do not have the <xref> tag. If an entry is missing the <xref> tag, there may be a spelling or formatting issue. If the entry is correct, it may not match the main entry (often because the main entry includes a parenthetical phrase) and would require manual linking when producing an e-book version.

The third search will find any number ranges that will require manual linking when producing an e-book version, particularly ranges that index text in an endnote or footnote.

The fourth search looks for letters that look similar to numbers (like “l”) being used instead of digits for page numbers. The letters “n” and “p” are excluded from this search, as they will commonly appear next to numbers for notes and page IDs, and they are unlikely to be mistaken for a digit during the index creation or editing processes.
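
As a rough sketch of the fourth search, the pattern below reports page-number-like strings that mix digits with a few lookalike letters. The set of letters (l, I, O, o) is an assumption chosen for the example, not the set used on the resource page; because "n" and "p" are not in that set, strings containing them are skipped.

```python
import re

# Assumed pattern: a run of digits and lookalike letters containing at least one of each.
LOOKALIKE_NUMBER = re.compile(r"\b(?=\w*\d)(?=\w*[lIOo])[\dlIOo]+\b")


def lookalike_page_numbers(text):
    """Return strings that look like page numbers but contain lookalike letters."""
    return LOOKALIKE_NUMBER.findall(text)


sample = "grace, 12, l5, 1O3, 45n2, p. 7"
print(lookalike_page_numbers(sample))  # ['l5', '1O3']
```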

Run these searches when creating an index, typesetting an index, and producing the ScML/e-book versions. By finding these issues early, indexes in typesets can be correct from the outset (or corrected when reviewing the typeset file before print). Any remaining linking issues can then be addressed when producing the e-book, without uncovering errors that must also be corrected in the print files.

Position of Tags and Spaces (.sam/.scml)

Search individually throughout the file to check that spaces are properly placed in relation to opening and closing tags.

Page IDs (.sam/.scml)

These searches help to find and resolve issues regarding the placement and formatting of page IDs.

With the exception of page-based indexes, it is not considered a best practice to reference page numbers within e-books. If internal page number references exist, they must be linked to their corresponding page ID in order for the resulting e-book to be compliant with current standards.

If a page ID interrupts a URL, both halves of the URL should be linked to the full href for an e-book or HTML output. (Another option, though less accurate, is to move the ID to follow the URL.)

Search individually throughout the file.

Page References (.scml)

This search finds references to pages in an ScML file that may need to be linked to the corresponding page ID.

Search individually throughout the file.

Single-Chapter Bible Books

In a file in which Bible books are linked with <xbr> tags, these searches look for single-chapter Bible books in order to implement the proper reference formatting.