Language Scribing

Language Scribing Overview

Web Content Accessibility Guidelines (WCAG) Level AA conformance requires that ebooks mark where the language changes within a book (Success Criterion 3.1.2, Language of Parts). This markup helps screen readers and other assistive technology pronounce the content correctly, so that the output is comprehensible to their users.

The following are exempt from this requirement:

  • Proper names (personal names, place names, and organization names).
  • Technical and scientific terms.
  • Words of indeterminate language.
  • Words or phrases that have become part of the vernacular of the immediately surrounding text. See Loanwords.
  • Constructed (fictional) languages, as may be found in works of science fiction or fantasy.

Scribe book, movie, and publication titles when they are in other languages.

Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) can frequently be identified using the Scribe Language Styles setting in the Digital Hub. Languages that use the Latin alphabet often need to be identified through manual actions and human judgment based on factors including indications within a manuscript and one’s fluency/familiarity with foreign languages.

Language scribing can take place at any stage of the workflow, even as early as when a manuscript is prepared in Word (see When to Scribe Language Styles). In a .docx file, languages can be marked by manually creating a new paragraph or character style with a name that combines an ScML style and an established language code.

  • Pattern in Word: [ScML Style]@lang=[Language Code]
  • Example Style Name: lang-i@lang=es

Note: While most language tagging will occur on the character style level, if entire paragraphs use a different language, the scribing can be applied on the paragraph level.

These styles can be created and applied to scribed manuscripts by using the SAI’s Add Language Style tool. This may be done during the Word Scribing procedure or at a designated time during the copyedit, before the production stages. When creating ebooks from files that have already been produced, language styles may be added to the ScML file.

Language codes generally consist of two or three letters, determined by the BCP-47 standard. See Language Codes for a list of many common languages and how to find a corresponding code. If a language has no corresponding code, Scribe recommends applying lang or lang-i to this content with no additional code.

The metadata (language codes) and language styles can be added in a Word document, a sam file, an ScML file, or an InDesign document. At whatever stage it is added, this metadata will travel through the Well-Formed Document Workflow (WFDW).

This example shows how the metadata for Spanish-language italic text could be identified in Word and carried through to sam, ScML, and InDesign. In each environment, the formatting of the style name is slightly different.

  • In Word: lang-i@lang=es
  • In sam/ScML: <lang-i lang="es">
  • In InDesign: lang-i-language-es

When scribing paragraph styles, the style names follow the same pattern in each environment. For example, the formatting for a Spanish block quotation would be bq@lang=es in Word, <bq lang="es"> in sam/ScML, and bq-language-es in InDesign.
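The naming pattern described above can be sketched in a few lines of Python. This is only an illustration of the convention; the function name style_names and its dictionary keys are hypothetical and are not part of any Scribe tool.

```python
def style_names(base_style: str, lang_code: str) -> dict[str, str]:
    """Build the style name for each environment from a base style and a language code."""
    return {
        # Word: [ScML Style]@lang=[Language Code]
        "word": f"{base_style}@lang={lang_code}",
        # sam/ScML: the opening tag with a lang attribute
        "scml": f'<{base_style} lang="{lang_code}">',
        # InDesign: [ScML Style]-language-[Language Code]
        "indesign": f"{base_style}-language-{lang_code}",
    }


# Character style (Spanish italic) and paragraph style (Spanish block quotation)
print(style_names("lang-i", "es"))
print(style_names("bq", "es"))
```

The same function covers character and paragraph styles because, as noted above, both follow the same pattern in each environment.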

Note: Hyphenated language codes, including those with region subtags (en-US, en-GB), are not fully supported throughout the WFDW, and language codes must be entirely lowercase. If region-level specificity is required, region subtags can be added at the ScML stage before converting to ebook.
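As a rough pre-check, a script could flag codes that may not travel cleanly through the workflow. The sketch below assumes that a safe code is two or three lowercase letters with no hyphen; that rule is inferred from this note and from the description of language codes above, and the function name check_language_code is hypothetical.

```python
import re

# Assumption: a code that is safe for the whole workflow is two or three
# lowercase letters with no hyphenated subtag.
SIMPLE_CODE = re.compile(r"[a-z]{2,3}")


def check_language_code(code: str) -> str:
    """Return a short verdict describing whether a code fits the assumed rule."""
    if SIMPLE_CODE.fullmatch(code):
        return "ok"
    if "-" in code:
        return "hyphenated: add the subtag at the ScML stage"
    if code != code.lower():
        return "not lowercase"
    return "unrecognized pattern"


for candidate in ["es", "en-US", "ES", "spanish"]:
    print(candidate, "->", check_language_code(candidate))
```

This check does not confirm that a code exists in the BCP-47 registry; it only tests the shape of the code.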

Note: Even if language styles are applied at the manuscript stage, later changes to content (adding indexes and praise pages and applying alterations) mean that language scribing needs attention throughout the workflow. For example, the scribing of foreign-language terms in indexes should match how they are scribed in the body text.

Note: If an entire book is in a particular language, this can be indicated for an ePub through the Extended Metadata settings.

Methods of Finding Languages to Be Scribed

Whether starting with a scribed Word file, a sam file, or an ScML file, the following methods can be used to determine what content will require language styles to be applied. If starting with a .docx file, process the file to sam in order to run the listed regular expressions.

  • Use the book topic and TOC as a guide. Take a broad view of what to expect based on the subject matter of the publication.
  • Review the special characters list in the Digital Hub for languages that fall outside the Latin alphabet. Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) should already be well served by the Scribe Language Styles setting in the Digital Hub. When using this setting, the results should be reviewed to confirm all language blocks have been identified correctly by this automated process.
  • Use the spelling and grammar features in programs like Microsoft Word.
  • Use Sublime Check 2 and skim the list of titles to see if there is any widespread use of a language.
  • Review character styles, paragraph styles, book-specific styles, and text in quotation marks. Use the searches listed in the Language Scribing in sam, ScML, or InDesign section.
  • Use AI tools (as a last resort). As of 2026, Scribe does not endorse the use of artificial intelligence to perform any actions within the WFDW, and the official procedure presented here does not provide any recommendations for prompts or methods to interact with AI services. However, users may choose for themselves to prompt AI to flag terms for a human to review.

Determining Languages

Many terms and phrases cannot be identified programmatically, particularly due to the use of a common alphabet. Therefore, a key aspect of language scribing is the need for a human to review terms and decide what action should be taken.

When determining what language a term or phrase is, exceptions abound, and borrowed terms or loanwords may fall into a gray area. Terms like “déjà vu” and “rendezvous” are commonly accepted as English; if surrounded by French text, however, they would reasonably be scribed as French.

Consider the following when encountering terms and phrases:

  • When using the Scribe Language Styles setting, the Digital Hub may assign different subtags to text in some Asian scripts, even when the context makes clear that the entire block of text is in one language. For example, a single line of text could have both the general Chinese language code (“zh”) and the code with the Taiwan region subtag (“zh-TW”) applied to different words. In these cases, the correct tag needs to be applied.
  • Some Unicode blocks may be used in more than one language. Examples include the CJK Unified Ideographs used in Chinese, Japanese, and Korean; the Arabic script used in Arabic, Persian, and others; the Hebrew script used in Hebrew and Yiddish; and the Cyrillic script used in multiple Slavic languages. If using the Scribe Language Styles setting, copy the languages into a new file with the following search and review the list for anything that does not match what is known about the book:

Find: lang="[^"]+"
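Outside of a text editor, the results of this search could also be tallied with a short script. The following Python sketch is one possible approach, not an official Scribe procedure; the function name count_language_codes and the sample text are made up for illustration.

```python
import re
from collections import Counter


def count_language_codes(scml_text: str) -> Counter:
    """Tally each lang="..." value found in the text."""
    # Same search as above, with a group added to capture only the code.
    codes = re.findall(r'lang="([^"]+)"', scml_text)
    return Counter(codes)


sample = (
    '<p>One <lang lang="zh">中文</lang> and '
    '<lang lang="zh-TW">臺灣</lang> and '
    '<lang-i lang="zh">字</lang-i></p>'
)

# List the codes from most to least frequent so unexpected ones stand out.
for code, count in count_language_codes(sample).most_common():
    print(code, count)
```

A code that appears only a handful of times in a long book is a reasonable first candidate for review.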

Scribe Recommendations

  • If there is any doubt about whether a language tag should be applied, default to applying the style.
  • Scribe book, movie, and publication titles when they are in other languages.
  • Identify styles on a large scale before investigating outliers.
  • Use case-sensitive (match case) searches whenever using find-and-replace.

Investigating Outliers

When words require some investigation to determine the language, take the following steps to avoid getting bogged down.

  1. Check footnotes, endnotes, or the surrounding text for any context clues or direct indications of the word origins.
  2. Search the internet for the term.
  3. If no definitive answer can be found, flag the individual instance for an author, editor, or designated expert to review. In Word, this can be marked as an author query (AQ). If there are many instances of terms to be reviewed, it may be best to create a list of terms to review outside of the main document.

Loanwords

In many cases, terms with foreign origins have been adopted into the English language. When making a decision about whether to apply the language scribing, consider the following factors:

  • If the term is well known from English usage, it can be considered English. Foods like “nacho” or “taco” are examples of Spanish words that are commonly used and accepted as English when used in an English-language context.
  • If a term appears in Merriam-Webster’s Collegiate Dictionary without being marked specifically as a foreign term, it can be considered English.
  • If a term has an English-language Wikipedia page, this may be indicative of it being a technical term that is understood across languages.

If the determination is unclear, default to applying a language style.

When to Scribe Language Styles

Language styles can be applied at any time within the WFDW. Consider the following to determine a work plan that will be most efficient, with language choices made by the appropriate person.

  • Scribing stage. If there are relatively few instances, it is recommended to apply the language styles during the scribing stage.
  • Copyediting stage. If the language scribing will be more involved or content could change during author review, it is recommended to schedule the language scribing for after all other editorial considerations have been handled, before proceeding to production stages.
  • Print production stage. It is not recommended to do extensive language scribing during typesetting and page proof stages. However, as content is added and alterations are applied, language scribing should be included and maintained.
  • Ebook production stage. For ebooks being produced from typeset files, language scribing can be included as part of the ScML preparation steps. Language scribing should be handled within the ScML file before processing it to ePub format.

Language Scribing in Word

Use the SAI or SAI Lite to add and apply language styles in Word.

  1. Use the Add Language Style tool to create the necessary language styles for the project in Word.
    • Per the scribing procedure for all projects, load ScML styles into the document.
    • Use the drop-down menu in Load ScML Styles to select Add Language Style.
    • Select the base style to use. Select from the drop-down menu or click the Get base style from selection button to use the style of selected text.
    • Enter the language code. Common languages can be selected from the drop-down menu.
    • The Resulting style field will display the name of the style being added.
    • Click Apply style to selection to apply the style to selected text or Add style to document to add the style without applying it.
  2. At the designated time (during the initial scribing or while copyediting), review all italic text for phrases that need to have language metadata applied. To find italic text, convert the Word file to .sam and use the Character Styles regular expressions or search for italic font in the Word Find window (Format > Font… > Font style > Italic > OK > Find In > Main Document/Endnotes/Footnotes). Apply the language styles created.
  3. Apply additional language styles as needed (e.g., apply lang@lang=es to Spanish text using the default paragraph font or gt@lang=es to Spanish-language glossary terms).

Language Scribing in sam, ScML, or InDesign

If the live file is in a production stage, the language styles can be added in a sam, ScML, or InDesign file. Once the language styles have been identified, add them to the point document with the appropriate formatting.

  • In sam/ScML: <lang-i lang="es">
  • In InDesign: lang-i-language-es

The searches presented here can be used as a starting point for finding languages in books based on the character or paragraph styles applied to them as well as word and letter patterns that are unique to a particular language.

Review the results before changing or replacing any styles.

Paragraph Styles

Block quote (bq) and senseline (sl) paragraphs are common places to find full paragraphs in another language, so review them first. Check other paragraph styles as needed.

Find: <[^>"]*(bq|sl)[^>]*>[^\n]+

Copy into a new file, turn off word wrap, and skim for non-English text.
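For those who prefer a script to an editor search, the same paragraph search might be run in Python as sketched below. The regular expression is the one given above; the function name find_bq_sl_paragraphs and the sample markup are hypothetical.

```python
import re

# The paragraph-style search from above: a tag whose name contains bq or sl,
# followed by the rest of the line.
PARAGRAPH_PATTERN = re.compile(r'<[^>"]*(bq|sl)[^>]*>[^\n]+')


def find_bq_sl_paragraphs(text: str) -> list[str]:
    """Return every matched bq/sl paragraph so it can be skimmed in one list."""
    return [match.group(0) for match in PARAGRAPH_PATTERN.finditer(text)]


sample = "\n".join([
    "<p>Plain paragraph.</p>",
    "<bq>Quoted English text.</bq>",
    '<bq lang="es">Texto citado.</bq>',
    "<sl>Una línea de verso</sl>",
])

for paragraph in find_bq_sl_paragraphs(sample):
    print(paragraph)
```

The script only collects candidates; deciding which paragraphs are non-English still requires a human review.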

Character Styles

Review italic terms, various “-i” styles, and various lang terms in a new Sublime file.

Find: <i>[^<]+</i>|<[^>]+-b?i>[^<]+</[^>]+-b?i>|<lang[^>]*>[^<]+</lang[^>]*>|<[^>]*lang[^>]*>[^<]+</[^>]*>

Copy into a new file, permute unique lines, remove English text, and delete proper names. Add lang attributes as needed to the original file.
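The find, copy, and deduplicate steps could also be approximated with a script. The Python sketch below reuses the search above and removes repeated results while keeping their order; it is an assumption-laden illustration (the function name and sample text are invented), and removing English text and proper names remains a manual step.

```python
import re

# The character-style search from above: italic terms, "-i" and "-bi" styles,
# and any tag that contains "lang".
CHARACTER_PATTERN = re.compile(
    r'<i>[^<]+</i>'
    r'|<[^>]+-b?i>[^<]+</[^>]+-b?i>'
    r'|<lang[^>]*>[^<]+</lang[^>]*>'
    r'|<[^>]*lang[^>]*>[^<]+</[^>]*>'
)


def unique_styled_terms(text: str) -> list[str]:
    """Return each matched term once, in order of first appearance."""
    matches = CHARACTER_PATTERN.findall(text)
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(matches))


sample = (
    '<p>She said <i>hola</i> and <i>hola</i>, then '
    '<lang-i lang="fr">je ne sais quoi</lang-i> and '
    '<lang-i>saudade</lang-i>.</p>'
)

for term in unique_styled_terms(sample):
    print(term)
```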

Book-Specific Styles

Certain books such as Bibles or language/grammar books may have additional paragraph or character styles that are being used to identify languages. Review additional content for languages based on the type of publication.

Text in Quotation Marks

In publications with a significant amount of dialogue or other content in quotation marks, reviewing every quotation may not be feasible. If such a review is judged worthwhile, use this search to find all instances of text in quotation marks and review the results.

Find: (“[^\n”]*”)

Language Names

Use the following searches to identify languages that may be referenced specifically as well as certain text patterns associated with particular languages.

Spanish

Find: (?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)

Find: (“[^\n”]*(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)[^\n”]*”)
Replace with: <lang lang="es">\1</lang>
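The Spanish find-and-replace above could be rehearsed in Python before it is run on a live file. In this sketch, one function lists the candidate quotations for review and a second applies the replacement; both function names and the sample sentence are hypothetical, and the word patterns will produce false positives that a person needs to weed out.

```python
import re

# The quoted-text search from above: a curly-quoted passage that contains
# at least one of the Spanish indicator patterns.
SPANISH_QUOTE = re.compile(
    r"(“[^\n”]*(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios"
    r"|\s[eE]s\s|\s[Ee]res\s|\sy\s)[^\n”]*”)"
)


def find_spanish_quotes(text: str) -> list[str]:
    """List candidate quotations so they can be reviewed before replacing."""
    return SPANISH_QUOTE.findall(text)


def tag_spanish_quotes(text: str) -> str:
    """Wrap each candidate quotation in a lang tag with the Spanish code."""
    return SPANISH_QUOTE.sub(r'<lang lang="es">\1</lang>', text)


sample = "He whispered, “Estoy aquí y no me voy,” and left. She said, “Fine.”"

print(find_spanish_quotes(sample))
print(tag_spanish_quotes(sample))
```

Keeping the review step separate from the replacement step mirrors the earlier advice to review results before changing or replacing any styles.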

French

Find: \b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b

German

Find: \b(?:für|und)\b
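The French and German searches above could be combined into one pass that reports which lines contain indicator words. This Python sketch is one way to do that under the assumption that a line-by-line report is useful; the names HINTS and flag_lines are invented for the example, and a hit is only a prompt for human review.

```python
import re

# The French and German word searches from above, keyed by language code.
HINTS = {
    "fr": re.compile(r"\b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b"),
    "de": re.compile(r"\b(?:für|und)\b"),
}


def flag_lines(text: str) -> list[tuple[int, str, list[str]]]:
    """Return (line number, language code, matched words) for each line with a hit."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        for code, pattern in HINTS.items():
            words = pattern.findall(line)
            if words:
                flagged.append((number, code, words))
    return flagged


sample = "\n".join([
    "Je pense que c’est vrai.",
    "Brot und Butter für alle.",
    "The moon rose over the hills.",
])

for number, code, words in flag_lines(sample):
    print(number, code, words)
```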

Language Codes

Some of the most common language codes are listed here as well as on the BCP-47 Wikipedia page.

The subtag lookup tool can be used to find thousands of additional language codes. Search for a language name in the Find window and look through the results listed under “language codes.”

Language Code
Afrikaans af
Albanian sq
Amharic am
Arabic ar
Armenian hy
Assamese as
Azerbaijani az
Bashkir ba
Basque eu
Belarusian be
Bengali bn
Bosnian bs
Breton br
Bulgarian bg
Burmese my
Catalan ca
Central Kurdish ckb
Chinese zh
Corsican co
Croatian hr
Czech cs
Danish da
Dari prs
Divehi dv
Dutch nl
English en
Estonian et
Faroese fo
Filipino fil
Finnish fi
French fr
Frisian fy
Galician gl
Georgian ka
German de
Gilbertese gil
Greek el
Greenlandic kl
Gujarati gu
Hausa ha
Hebrew he
Hindi hi
Hungarian hu
Icelandic is
Igbo ig
Indonesian id
Inuktitut iu
Irish ga
Italian it
Japanese ja
Kʼicheʼ quc
Kannada kn
Kazakh kk
Khmer km
Kinyarwanda rw
Kiswahili sw
Konkani kok
Korean ko
Kurdish ku
Kyrgyz ky
Lao lo
Latin la
Latvian lv
Lithuanian lt
Lower Sorbian dsb
Luxembourgish lb
Macedonian mk
Malay ms
Malayalam ml
Maltese mt
Maori mi
Mapudungun arn
Marathi mr
Mohawk moh
Mongolian mn
Moroccan Arabic ary
Nepali ne
Norwegian (Bokmål) nb
Norwegian (Nynorsk) nn
Norwegian no
Occitan oc
Odia or
Papiamento pap
Pashto ps
Persian fa
Polish pl
Portuguese pt
Punjabi pa
Quechua qu
Romanian ro
Romansh rm
Russian ru
Sami (Inari) smn
Sami (Lule) smj
Sami (Northern) se
Sami (Skolt) sms
Sami (Southern) sma
Sanskrit sa
Scottish Gaelic gd
Serbian sr
Sesotho st
Sinhala si
Slovak sk
Slovenian sl
Spanish es
Swedish sv
Swiss German gsw
Syriac syc
Tagalog tl
Tajik tg
Tamazight tzm
Tamil ta
Tatar tt
Telugu te
Thai th
Tibetan bo
Tswana tn
Turkish tr
Turkmen tk
Ukrainian uk
Upper Sorbian hsb
Urdu ur
Uyghur ug
Uzbek uz
Vietnamese vi
Welsh cy
Wolof wo
Xhosa xh
Yakut sah
Yi ii
Yoruba yo
Zulu zu

Note: Bold text indicates commonly used languages.

Language Scribing QC Checklist

Quality control for language scribing is a collaborative confirmation of decisions that may not have a definitive right or wrong answer. While the formatting of the language tags must follow particular patterns in environments like Word, InDesign, sam, or ScML, the choices of which terms should be tagged may vary from person to person.

QC should assess the content for the following:

  • Terms that are not tagged that should be
  • Terms that are tagged that should not be
  • Terms that have the wrong language applied to them

Use a sam or ScML file to search for terms to review.

Tagged Paragraph Styles

Find all, copy the results into a new file, and review the list for anything that appears to be incorrect.

Find: ^ *<[^>]*lang[^\n]+

Tagged Character Styles

Find all, copy the results into a new file, and review the list for anything that appears to be incorrect.

Find: <[^>]*lang[^>]*>[^<]*<[^>]*>
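To check for terms that have the wrong language applied, it can help to group the tagged terms by language code. The Python sketch below uses a variant of the search above with two capture groups added (one for the code, one for the term); that variant, the function name group_terms_by_code, and the sample text are illustrative assumptions rather than part of the documented procedure.

```python
import re
from collections import defaultdict

# Variant of the tagged-character-style search: capture the language code
# and the tagged text. Tags without a lang attribute are not captured.
TAGGED_TERM = re.compile(r'<[^>]*lang="([^"]+)"[^>]*>([^<]*)<[^>]*>')


def group_terms_by_code(text: str) -> dict[str, list[str]]:
    """Collect tagged terms under the language code applied to them."""
    grouped = defaultdict(list)
    for code, term in TAGGED_TERM.findall(text):
        grouped[code].append(term)
    return dict(grouped)


sample = (
    '<p><lang-i lang="es">hola</lang-i>, '
    '<lang-i lang="fr">bonjour</lang-i>, '
    '<lang lang="es">merci</lang></p>'
)

for code, terms in group_terms_by_code(sample).items():
    print(code, terms)
```

In this sample, the French word listed under the Spanish code is the kind of mismatch a reviewer would be looking for.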

Untagged Terms

In a copy of the file, remove the content that has language tags applied to it.

Find: ^ *<[^>]*lang[^\n]+
Replace with: (leave the field empty)

Find: <[^>]*lang[^>]*>[^<]*<[^>]*>
Replace with: (leave the field empty)
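The two removals can also be scripted so that the original file is never touched. The following Python sketch applies both searches to a string and returns the stripped copy; the function name strip_tagged_content and the sample markup are hypothetical, and the script assumes the text has already been read from a copy of the file.

```python
import re

# Whole lines that begin with a tag containing "lang" (tagged paragraphs).
TAGGED_PARAGRAPH = re.compile(r"^ *<[^>]*lang[^\n]+", flags=re.MULTILINE)
# Inline tagged spans (tagged character styles).
TAGGED_SPAN = re.compile(r"<[^>]*lang[^>]*>[^<]*<[^>]*>")


def strip_tagged_content(text: str) -> str:
    """Remove tagged paragraphs first, then tagged character spans."""
    without_paragraphs = TAGGED_PARAGRAPH.sub("", text)
    return TAGGED_SPAN.sub("", without_paragraphs)


sample = "\n".join([
    '<p>An English paragraph with <lang-i lang="es">hola</lang-i> inside.</p>',
    '<bq lang="fr">Un paragraphe entier.</bq>',
    "<p>Untagged: je ne sais quoi.</p>",
])

print(strip_tagged_content(sample))
```

What remains is the untagged text, which can then be searched again for terms that were missed.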

With this content deleted, repeat some of the searches and techniques used to find foreign languages. (Use spell-check selectively to help skim through results.)

Review the following:

  • Text in italics
  • Text in quotation marks
  • Text marked by spell-check
  • Terms containing the special characters listed in the Digital Hub stats
  • Words in the body of the book that may indicate certain languages