
Regular Expressions Resource Supplement

The regular expressions found on the Regular Expressions Resource page are presented with minimal context. This supplement provides some additional detail about how to use the regular expressions to find potential issues in .sam and .scml files using Sublime.

The regular expression searches find patterns that may indicate errors in files. In many cases, however, the text patterns are intentional. Check for issues with an awareness of false positives or the requirements of some files to retain text errors (as when material is quoting text that contains errors). Additionally, the searches do not necessarily indicate how to fix potential issues. In some cases, an error may be due to a problem in an InDesign source; in others, many issues can best be resolved in Word using the SAI Cleanup Tools, particularly when scribing or editing a file.

Unless otherwise noted, searches should be run with the “Case sensitive” setting turned on.

Gathered at the end of this resource are groups of searches with file types listed in parentheses. These apply specifically to the preparation of files of that type. These searches should be run before proceeding with further conversion or finalizing the files. When using the Regular Expressions Resource to review the text content as part of Word scribing, for example, the searches listed for .sam/.scml would not need to be done.

Sublime Text Package

For ease of use, a Sublime Text Package is available. In the file to be checked, run Scribe Inc. > Check > Regular Expression Result Counts. It will open the results in a new Sublime window. It should take only 10–20 seconds to run on a file of average length; on a particularly large file, it may take up to a minute.

The script reads all the regular expressions from the Regular Expressions Resource page. It then runs each expression as a search and reports the number of matches found in the document. If a search returns zero matches, it will not appear on this list. Users, therefore, do not need to spend time searching for text patterns that do not appear within the document.
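The counting behavior described above can be sketched in Python. This is only an illustration, not the package's actual code: the three patterns and their labels are hypothetical stand-ins for the expressions on the Regular Expressions Resource page, and `count_matches` is an assumed name.

```python
import re

# Hypothetical stand-ins for expressions on the Regular Expressions Resource page.
PATTERNS = {
    "repeated punctuation": r"([.?!,;:])\1",
    "space before comma": r" ,",
    "double space": r"  +",
}

def count_matches(text, patterns=PATTERNS):
    """Return {label: count} for every pattern with at least one match."""
    counts = {}
    for label, pattern in patterns.items():
        # Python's re module is case sensitive by default.
        n = sum(1 for _ in re.finditer(pattern, text))
        if n:  # searches with zero matches are left off the list
            counts[label] = n
    return counts
```

Calling `count_matches` on the text of a file returns only the searches worth running by hand.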

Quotation Marks, Parentheses, and Brackets

The searches listed here find issues with content that is expected to be paired or to face a particular direction.

Search individually throughout the file.

Check the results for when parentheses/brackets open without closing or close without opening. Look for quotation marks facing the wrong direction.

False positives may include instances of parentheses within parentheses. In some cases, this is intentional. In others, the internal parentheses may need to be changed to brackets. Another common false positive is when closing parentheses are used after list numbers.

One search will also find straight double quotation marks that may need to be smart (curly) quotes or the double prime character (for inches): “, ”, or ″. If there is more than one instance in a paragraph, only the last one will be found. If one instance is found, check for others.
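Because that search reports only the last straight quotation mark in a paragraph, a follow-up pass that lists every instance can help. The following is a minimal Python sketch, assuming each paragraph sits on its own line; `straight_quote_positions` is a hypothetical helper, not part of the resource page.

```python
import re

def straight_quote_positions(text):
    """Map each 1-based line number to the offsets of its straight double quotes."""
    hits = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        cols = [m.start() for m in re.finditer('"', line)]
        if cols:
            hits[lineno] = cols
    return hits
```

Each reported position still needs a manual decision: opening quote, closing quote, or double prime.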

Punctuation

The searches here look for a number of different potential issues with punctuation. For example, two periods or two question marks in a row would likely be incorrect, so one of the searches looks for a piece of punctuation followed by itself. Other patterns are included that involve punctuation near other text and spaces in different combinations.

Search individually throughout the file for most of these searches.

Click “Find All” to review the results of the search that finds any individually wrapped special character. (This is the search before the italic and roman commas searches.) Check for any instances in which the result is unintentional. This search will likely find false positives for symbols or diacritics.

These searches include a couple of known text pattern errors. One is for the name of Scribe Inc., which should not include a comma. The other is for scripture references for Zondervan on copyright pages, which should not include a period after the URL.

Italic Commas and Periods

Searching for italic and roman commas can reveal editorial inconsistencies. For example, if the specification for a file requires that all punctuation following italic phrases be made italic, searching for commas and periods on either side of a closing <i> tag could reveal inconsistencies that may not be intentional, depending on the style guide. For that set of searches, click “Find All” and compare the number of results for italic and roman commas after <i> tags, and then do the same for periods. Investigate any outliers to confirm if they are intentional.
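The comparison of italic and roman counts can be sketched as follows. This assumes italic text is wrapped in `<i>…</i>` with a literal `</i>` closing tag; the function name is hypothetical, and the patterns are simplified stand-ins for the ones on the resource page.

```python
import re

def closing_italic_punctuation(text):
    """Count commas and periods just inside and just outside a closing </i> tag."""
    counts = {}
    for mark, name in ((",", "comma"), (".", "period")):
        escaped = re.escape(mark)
        counts[f"italic {name}"] = len(re.findall(escaped + r"</i>", text))
        counts[f"roman {name}"] = len(re.findall(r"</i>" + escaped, text))
    return counts
```

A lopsided result (for example, many roman commas and one italic comma) points to the outliers worth investigating.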

A separate search covers a common error: roman periods following italic abbreviations such as Jr., Dr., St., Ms., Mr., and Mrs.

It may also be necessary to search for other styles (like <b> or <sm>) or to search for other forms of punctuation.

Unexpected Character Patterns

As the section name indicates, these searches look for unexpected character patterns. For example, one would not expect to find a space after opening quotation marks or to find a comma followed by a number (rather than a space). Sometimes, these patterns are intentional, but often the text found by these searches can reveal formatting or spacing errors.

One pattern in this group is checking for a scribing error. If <ah> is used after a part, chapter, or unit opener, it should be <ahaft>.

Another pattern in this group is looking for numbers or letters that may be page numbers appearing in a table of contents. These should be removed when preparing content that has been exported from a typeset file.

Another pattern in this group looks for instances of too many characters included in the dropcap character style.

Another pattern in this group looks for spaces in front of en dashes or en dashes that start paragraphs. In some cases, the en dash should be a minus sign (−); in some cases, the space may need to be closed up.

Another pattern in this group searches for the common misuse of accents as apostrophes.

Another pattern in this group searches for footnotes that begin with a lowercase letter, which could indicate that a character was removed unintentionally.

Search individually throughout the file.

Spaces

These searches find a variety of text patterns that involve spaces (regular and nonbreaking). This includes missing or extra spaces.

Search individually throughout the file.

Bible References

This regular expression looks for spaces after colons that occur between numbers. If found, this space may need to be removed for Bible references. Example: “Psalm 87: 2–3” should be “Psalm 87:2–3.”
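A pattern of this kind might look like the sketch below. The exact expression on the resource page may differ; `COLON_SPACE` and both function names are assumptions made for illustration.

```python
import re

# A colon and a space sitting between two digits.
COLON_SPACE = re.compile(r"(?<=\d): (?=\d)")

def find_colon_spaces(text):
    """Return the offset of each colon that is followed by a space between digits."""
    return [m.start() for m in COLON_SPACE.finditer(text)]

def close_colon_spaces(text):
    """Remove the space after the colon in every match."""
    return COLON_SPACE.sub(":", text)
```

Review the matches from `find_colon_spaces` before applying `close_colon_spaces`, since times or ratios written with a space would also match.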

Incorrect Line Breaks

This search must be run with the “Case sensitive” setting turned on.

It looks for paragraphs that start with lowercase letters. This will have many false positives, commonly finding poetry and index entries. Especially in the early stages of a project, however, this can reveal broken paragraphs that need to be reconnected. This is particularly common when preparing Notes and Bibliographies in manuscripts.
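As a rough illustration, the check can be written in Python as below. It assumes one paragraph per line and that a paragraph may begin with one or more tags such as `<p>`; the real file layout and the actual expression may differ, and the names are hypothetical.

```python
import re

# Optional leading tags, then a lowercase letter as the first text character.
LOWERCASE_START = re.compile(r"^(?:<[^>]+>)*[a-z]")

def lowercase_paragraph_lines(text):
    """Return 1-based line numbers of paragraphs that start with a lowercase letter."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if LOWERCASE_START.match(line)
    ]
```

The match is case sensitive, which is why the same setting matters in Sublime.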

URLs

The first two searches here find the most common issues with tagging URLs. They find potentially bad spaces or characters, as well as any <url> tags that contain sentence- or phrase-ending punctuation. Search for them individually.

The next searches find any text that follows the common pattern of URLs to find text that may not in fact be a URL, as well as text strings that may represent two URLs that got bunched together. If “http” appears in the middle of a URL string, check that it is preceded by something that indicates it is part of a web archive.

If amazon.com is found as a general link, the url style should be removed: linking to amazon.com itself, rather than to a specific page on that site, is prohibited.

In some cases, these URL searches should be done individually. In others, it may be beneficial to click “Find All” and copy the results into a different file to review them.

To review dead links, process the file to ePub 3 in the Digital Hub. Open the file using Kindle Previewer and run the Quality Checks to view a report about potentially problematic links. Valid URLs will be listed if they are interrupted by a page ID. Review any listed URL to confirm if it leads to a nonexistent web page and needs to be resolved.

Note: The status of a URL could change at any time. Scribe recommends checking them up to the point of first publication and using a disclaimer on the copyright page and parenthetical “(no longer extant)” notes following inactive links. Sample disclaimer: “References to internet websites (URLs) were accurate at the time of writing. Neither the author nor [Press Name] is responsible for URLs that may have expired or changed since the manuscript was prepared.”

ISBNs and Zip Codes

These searches look for common errors with dashes being used in ISBNs and zip codes. If any en dashes or em dashes are used, replace them with hyphens.
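The replacement can be sketched in Python as follows. The patterns are assumptions (an "ISBN" label followed by dash-separated digit groups, and ZIP+4 codes), not the expressions from the resource page, and `fix_dashes` is a hypothetical name.

```python
import re

# "ISBN" label, then digit groups joined by hyphens, en dashes, or em dashes.
ISBN_LIKE = re.compile(r"(ISBN(?:-1[03])?:?\s*)([\dXx]+(?:[-\u2013\u2014][\dXx]+)+)")
# ZIP+4 code joined by an en dash or em dash.
ZIP_PLUS4 = re.compile(r"\b(\d{5})[\u2013\u2014](\d{4})\b")

def fix_dashes(text):
    """Replace en and em dashes with hyphens inside ISBNs and ZIP+4 codes only."""
    text = ISBN_LIKE.sub(
        lambda m: m.group(1) + re.sub("[\u2013\u2014]", "-", m.group(2)), text
    )
    return ZIP_PLUS4.sub(r"\1-\2", text)
```

Ordinary number ranges such as page spans keep their en dashes, because neither pattern matches them.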

Angle Brackets

Use this search to find if any stray angle brackets remain in a file due to a conversion error or notes left by someone preparing a manuscript outside of the WFDW.

Typesetter Spaces

Some spaces are available to typesetters that are not available in e-books. E-books can have regular spaces or nonbreaking spaces. Typesetters may use en spaces, em spaces, thin spaces, and other variations.

Search individually throughout the file for some of these. If there are any bad or problematic spaces (between single and double quotation marks, for example), these should be resolved before producing an ScML file to be used in creating an e-book.

Click “Find All” to highlight the nonbreaking spaces. They may be used frequently to keep the text together, as in Bible book names or people’s titles (e.g., Dr., Prof., St.). Check that entire paragraphs are not filled with unwanted nonbreaking spaces, as the result could be problematic for the display of a print or digital output.
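For a quick inventory of the special spaces in a file, a sketch like the following may help. It assumes the spaces are stored as literal Unicode characters rather than as entities; the list of code points and the function name are illustrative choices.

```python
import unicodedata
from collections import Counter

# No-break, en, em, thin, hair, and narrow no-break spaces.
SPECIAL_SPACES = "\u00a0\u2002\u2003\u2009\u200a\u202f"

def special_space_counts(text):
    """Count each special space character in the text, keyed by its Unicode name."""
    return Counter(unicodedata.name(ch) for ch in text if ch in SPECIAL_SPACES)
```

An unusually high count of one kind of space is a cue to look at how it is being used.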

Hyphen Spacing

This search finds a space on either side of a hyphen. This will often have false positives, but it can reveal if hyphens have been misused or if hyphenated terms have become separated.

Pay particular attention to this search if working on a file produced through OCR or a manual conversion process.

Search individually throughout the file.

Potentially Incorrect Hyphenation

Click “Find All” to pull all the hyphenated words from a document, then paste them into a separate file. In some cases, this requires only a brief review to confirm nothing stands out as problematic. When reviewing a file that was created through OCR or a manual conversion, however, this search can reveal many common source errors.

In the separate file, turn on spell check to review the flagged results, which may be word fragments or other spelling errors.
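The pull-and-review step can also be approximated in Python. This sketch assumes hyphenated words consist of ASCII letters joined by plain hyphens; `HYPHENATED` and `hyphenated_words` are hypothetical names.

```python
import re
from collections import Counter

# Two or more runs of letters joined by hyphens.
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")

def hyphenated_words(text):
    """Return sorted (word, count) pairs for every hyphenated word in the text."""
    return sorted(Counter(HYPHENATED.findall(text)).items())
```

Words that appear only once are often the first place to look for OCR fragments.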

Missing Spaces around Tags and Commas

Search individually throughout the file to find missing spaces around tags and commas. Check that any instances are intentional.

If a book has many long numbers, click “Find All” to check that any commas that are not followed by a space are part of long number strings.
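One way to separate suspect commas from thousands separators is sketched below. It is an assumption about how such a check could work, not the expression from the resource page: a comma directly followed by a letter or digit is flagged unless it sits between a digit and exactly three digits.

```python
import re

# Comma followed by a word character, excluding thousands separators like 1,250.
SUSPECT_COMMA = re.compile(r",(?=\w)(?!(?<=\d,)\d{3}(?!\d))")

def suspect_comma_offsets(text):
    """Return the offset of each comma that appears to be missing a following space."""
    return [m.start() for m in SUSPECT_COMMA.finditer(text)]
```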

Scribing/Articulation

This search looks for potentially incorrect scribing of text near heads. Examples would be if the second head in a pair of stacked heads did not use the <aft> variation, or if a head is followed by <p> or <pcon> rather than <paft>.

Note: This search will find if <pf> follows a head. This articulation would be correct if the <pf> paragraph is the first regular paragraph in the chapter, but it would be incorrect if it occurs elsewhere in a chapter.

The second search looks for the <bl> styles followed by letters or numbers instead of bullets or symbols.

The third search looks for an unordered list followed by a number and either a period or closing parenthesis.

Small Caps

Click “Find All” and copy the results into a separate file. Look for any bad casing for content in <sm> tags.

Text is likely incorrect in <sm> tags if it is typed as all caps or has mixed casing (lIKe ThiS).
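The casing review can be partly automated. The sketch below assumes small caps are wrapped in `<sm>…</sm>` on a single line; the rule (all caps, or a capital letter anywhere after the first letter of a word) is one reading of the guidance above, and the names are hypothetical.

```python
import re

SM_TAG = re.compile(r"<sm>(.*?)</sm>")

def suspicious_small_caps(text):
    """Return <sm> contents typed in all caps or with capitals inside a word."""
    return [
        content
        for content in SM_TAG.findall(text)
        if content.isupper() or re.search(r"[A-Za-z][A-Z]", content)
    ]
```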

Tetragrammaton

This is similar to the small caps search, but it is specific to instances of the Tetragrammaton.

Alt Text

The first search finds all <img> tags. Select them all and paste into a new document to review for the presence and accuracy of the alt text.

The second search looks for alt text that only uses the word Presentation or presentation. For decorative images, this should be in all caps as PRESENTATION.

Italic Terms, Phrases, and Titles

Find all italic phrases. Paste them into a new document in Sublime, permute the document, and then sort lines. (It may be necessary to remove ending punctuation in order to fully permute the results.) Compare the results to reveal if there are any errors or inconsistencies.

Errors may include unintentional inconsistencies in capitalization, spelling, or punctuation usage.

Some inconsistencies may be correct based on their context. For example, sometimes titles will intentionally use title case in one section but sentence case in another.
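The find, permute, and sort steps can be sketched in Python as follows, assuming italic phrases are wrapped in `<i>…</i>` on a single line. The function name is hypothetical; sorting case-insensitively is a design choice that places variants such as different capitalizations next to each other.

```python
import re

ITALIC = re.compile(r"<i>(.*?)</i>")

def unique_italic_phrases(text):
    """Return unique italic phrases, ending punctuation removed, sorted for comparison."""
    phrases = {phrase.rstrip(".,;:!?") for phrase in ITALIC.findall(text)}
    # Case-insensitive sort keeps near-duplicates adjacent.
    return sorted(phrases, key=lambda p: (p.casefold(), p))
```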

For more information about how and why to run this search, see Reviewing Italic Terms, Phrases, and Titles.

Self-Closing Note Reference Tags

Search for any self-closing note reference tags. These tags can throw off the statistics as reported by the Digital Hub and prevent the proper linking of the tags during a Hub conversion.

Self-Closing and Unnecessary Tags (.sam/.scml)

When preparing a .sam or .scml file, search for any unnecessary tags so they can be removed, resulting in a cleaner file.

Index Section (.scml files)

Pull the index section out of the .scml file. In files containing page IDs, the Hub will add <xref> tags around the listed pages.

The first search will find any numbers that have a space in front of them. False positives will include numbers that are part of the index entry, rather than a page number to be linked. If an index page number is found, check the file to confirm the page number is in the book and that the index entry is formatted properly.

The search also looks for common roman numeral letters preceded by a space and followed by a comma, semicolon, or period. This may reveal unlinked or incorrectly formatted roman numeral page numbers.

The second search will find all the “see” references. Paste the results in a new file. Review each entry to see if any do not have the <xref> tag. If an entry is missing the <xref> tag, there may be a spelling or formatting issue. If the entry is correct, it may not match the main entry (often due to the main entry including a parenthetical phrase) and would require manual linking when producing an e-book version.

The third search will find any number ranges, particularly when indexing text in an endnote or footnote, that will require manual linking when producing an e-book version. It also looks for non-consecutive numbers in indexed notes by finding numbers separated by commas without spaces.

The fourth search looks for letters that look similar to numbers (like “l”) being used instead of digits for page numbers. The letters “n” and “p” are excluded from this search, as they will commonly appear next to numbers for notes and page IDs, and they are unlikely to be mistaken for a digit during the index creation or editing processes.

Another search looks for all indexed pages. Click “Find All” and copy the results into a new file. Sort the lines and check if any pages fall outside of the expected page range of the indexable pages.
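The range check can be sketched as below. It assumes arabic page numbers appear as standalone digit runs in the index text and that the first and last indexable pages are known; the function and parameter names are hypothetical.

```python
import re

def pages_outside_range(index_text, first_page, last_page):
    """Return sorted page numbers that fall outside first_page..last_page."""
    pages = {int(p) for p in re.findall(r"\b\d+\b", index_text)}
    return sorted(p for p in pages if not first_page <= p <= last_page)
```

Numbers that belong to an entry's wording (a year in a heading, for example) will also be reported and need a manual check.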

Another search looks for the same page number appearing more than once for the same entry/subentry.
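A repeated-page check for a single entry might look like this sketch, which assumes one entry or subentry per line. `duplicate_pages` is a hypothetical name; note that references such as 12n3 are skipped because the digits are not standalone.

```python
import re
from collections import Counter

def duplicate_pages(entry):
    """Return page numbers that appear more than once in one index entry line."""
    counts = Counter(re.findall(r"\b\d+\b", entry))
    return [page for page, n in counts.items() if n > 1]
```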

Run these searches when creating an index, typesetting an index, and producing the ScML/e-book versions. By finding these issues early, indexes in typesets can be correct from the outset (or corrected when reviewing the typeset file before print). Then any linking issues can be addressed when producing an e-book without discovering errors to be corrected in print files.

Position of Tags and Spaces (.sam/.scml)

Search individually throughout the file to check that spaces are properly placed in relation to opening and closing tags.

Page IDs (.sam/.scml)

These searches help to find and resolve issues regarding the placement and formatting of page IDs.

With the exception of page-based indexes, it is not considered a best practice to reference page numbers within e-books. If internal page number references exist, they must be linked to their corresponding page ID in order for the resulting e-book to be compliant with current standards.

If a page ID interrupts a URL, both halves of the URL should be linked to the full href for an e-book or HTML output. (Another option, though less accurate, is to move the ID to follow the URL.)

Search individually throughout the file.

Page References (.scml)

This search finds references to pages in an ScML file that may need to be linked to the corresponding page ID.

Search individually throughout the file.

Single-Chapter Bible Books (.scml)

In a file in which Bible books are linked with <xbr> tags, these searches look for single-chapter Bible books in order to implement the proper reference formatting.

Blind Notes Pairs

If working with blind endnotes, these searches can help find if any pairs of hidden markers are not properly matched.

DTD Validation Troubleshooting (.sam/.scml)

The searches here can assist in finding the reason for a validation error. These searches are also available on the Digital Hub File Alerts page.