Language Scribing Overview
Web Content Accessibility Guidelines (WCAG) Level AA conformance requires that ebooks mark where the language changes within a book (Success Criterion 3.1.2, Language of Parts). Marking these changes lets screen readers and other assistive technology switch pronunciation rules, so that users hear comprehensible output rather than jarring or incorrect pronunciation.
The following are exempt from this requirement:
- Proper names (personal names, place names, and organization names).
- Technical and scientific terms.
- Words of indeterminate language.
- Words or phrases that have become part of the vernacular of the immediately surrounding text. See Loanwords.
- Constructed (fictional) languages, as may be found in works of science fiction or fantasy.
Scribe book, movie, and publication titles when they are in other languages.
Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) can frequently be identified using the Scribe Language Styles setting in the Digital Hub. Languages that use the Latin alphabet often need to be identified through manual actions and human judgment based on factors including indications within a manuscript and one’s fluency/familiarity with foreign languages.
Language scribing can take place at any stage of the workflow, even as early as when a manuscript is prepared in Word (see When to Scribe Language Styles). In a .docx file, languages can be marked by manually creating a new paragraph or character style with a name that combines an ScML style and an established language code.
- Pattern in Word: [ScML Style]@lang=[Language Code]
- Example style name: lang-i@lang=es
Note: While most language tagging will occur on the character style level, if entire paragraphs use a different language, the scribing can be applied on the paragraph level.
These styles can be created and applied to scribed manuscripts by using the SAI’s Add Language Style tool. This may be done during the Word Scribing procedure or at a designated time during the copyedit, before the production stages. When creating ebooks from files that have already been produced, language styles may be added to the ScML file.
Language codes generally consist of two or three letters, determined by the BCP-47 standard. See Language Codes for a list of many common languages and how to find a corresponding code. If a language has no corresponding code, Scribe recommends applying lang or lang-i to this content with no additional code.
The metadata (language codes) and language styles can be added in a Word document, a sam file, an ScML file, or an InDesign document. At whatever stage it is added, this metadata will travel through the Well-Formed Document Workflow.
This example shows how the metadata for Spanish-language italic text could be identified in Word and carried through to sam, ScML, and InDesign. In each environment, the formatting of the style name is slightly different.
- In Word: lang-i@lang=es
- In sam/ScML: <lang-i lang="es">
- In InDesign: lang-i-language-es
When scribing paragraph styles, the style names follow the same pattern in each environment. For example, the formatting for a Spanish block quotation would be bq@lang=es in Word, <bq lang="es"> in sam/ScML, and bq-language-es in InDesign.
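The naming pattern across the three environments can be sketched in Python. This is an illustration only, not part of any Scribe tool: the function names are hypothetical, and the restriction to lowercase two- or three-letter codes with no region subtag is an assumption of the sketch.

```python
import re

# Assumption: a Word language style name is a base ScML style, "@lang=",
# and a lowercase two- or three-letter code with no region subtag.
WORD_LANGUAGE_STYLE = re.compile(r"^(?P<base>[A-Za-z0-9-]+)@lang=(?P<code>[a-z]{2,3})$")


def parse_word_style(name):
    """Split a Word style name such as 'lang-i@lang=es' into (base, code)."""
    match = WORD_LANGUAGE_STYLE.match(name)
    if match is None:
        raise ValueError(f"not a supported language style name: {name!r}")
    return match.group("base"), match.group("code")


def to_scml_tag(name):
    """Return the opening sam/ScML tag for a Word language style name."""
    base, code = parse_word_style(name)
    return f'<{base} lang="{code}">'


def to_indesign_style(name):
    """Return the InDesign style name for a Word language style name."""
    base, code = parse_word_style(name)
    return f"{base}-language-{code}"


# Print the three forms for the character and paragraph examples above.
for word_style in ("lang-i@lang=es", "bq@lang=es"):
    print(word_style, to_scml_tag(word_style), to_indesign_style(word_style))
```

A name such as lang-i@lang=en-US is rejected by this sketch with a ValueError, which makes unsupported codes visible early.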
Note: Hyphenated language codes, including region subtags (en-US, en-GB), are not completely supported throughout the WFDW. The language codes must be entirely lowercase. If this level of specificity is required, region subtags can be added at the ScML stage before converting to ebook.
Note: Even if language styles are applied at the manuscript stage, later changes to content (adding indexes and praise pages and applying alterations) mean that language scribing needs attention throughout the workflow. For example, the scribing of foreign-language terms in indexes should match how they are scribed in the body text.
Note: If an entire book is in a particular language, this can be indicated for an ePub through the Extended Metadata settings.
Methods of Finding Languages to Be Scribed
Whether starting with a scribed Word file, a sam file, or an ScML file, the following methods can be used to determine what content will require language styles to be applied. If starting with a .docx file, process the file to sam in order to run the listed regular expressions.
- Use the book topic and TOC as a guide. Take a broad view of what to expect based on the subject matter of the publication.
- Review the special characters list in the Digital Hub for languages that fall outside the Latin alphabet. Languages that use non-Latin scripts (e.g., Hebrew, Arabic, Chinese, Greek) should already be well served by the Scribe Language Styles setting in the Digital Hub. When using this setting, the results should be reviewed to confirm all language blocks have been identified correctly by this automated process.
- Use the spelling and grammar features in programs like Microsoft Word.
- Use Sublime Check 2 and skim the list of titles to see if there is any widespread use of a language.
- Review character styles, paragraph styles, book-specific styles, and text in quotation marks. Use the searches listed in the Language Scribing in sam, ScML, or InDesign section.
- Use AI tools (as a last resort). As of 2026, Scribe does not endorse the use of artificial intelligence to perform any actions within the WFDW, and the official procedure presented here does not provide any recommendations for prompts or methods to interact with AI services. However, users may choose for themselves to prompt AI to flag terms for a human to review.
Determining Languages
Many terms and phrases cannot be identified programmatically, particularly due to the use of a common alphabet. Therefore, a key aspect of language scribing is the need for a human to review terms and decide what action should be taken.
When determining what language a term or phrase is, exceptions abound, and borrowed terms or loanwords may fall into a gray area. Terms like “déjà vu” and “rendezvous” are commonly accepted as English; if surrounded by French text, however, they would reasonably be scribed as French.
Consider the following when encountering terms and phrases:
- When using the Scribe Language Styles setting in the Digital Hub, some Asian scripts may get identified with different subtags by the Digital Hub, even when it’s clear from the context that the same language is being used for the entire block of text. For example, a single line of text could have both the general Chinese language code (“zh”) and the Taiwanese subtag (“zh-TW”) applied to different words. In these cases, the correct tag needs to be applied.
- Some Unicode blocks may be used in more than one language. Examples include the CJK Unified Ideographs used in Chinese, Japanese, and Korean; the Arabic script used in Arabic, Persian, and others; the Hebrew script used in Hebrew and Yiddish; and the Cyrillic script used in multiple Slavic languages. If using the Scribe Language Styles setting, copy the languages into a new file with the following search and review the list for anything that does not match what is known about the book:
Find:
lang="[^"]+"
- See Language Scribing in sam, ScML, or InDesign for a list of searches to help find languages that use the Latin alphabet.
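As an alternative to copying the lang="[^"]+" results into a new file by hand, the same search can be tallied with a short Python script. This is a minimal sketch; the sample string is invented for illustration, and in practice the text would be read from the sam or ScML file.

```python
import re
from collections import Counter

# Same search as above, with a capture group around the code itself.
LANG_ATTRIBUTE = re.compile(r'lang="([^"]+)"')


def count_language_codes(scml_text):
    """Count how often each language code appears in lang attributes."""
    return Counter(LANG_ATTRIBUTE.findall(scml_text))


# Invented sample: one block of Chinese text tagged with two different codes.
sample = (
    '<p>Text <lang lang="zh">中文</lang> and <lang lang="zh-TW">臺灣</lang> '
    'and <lang-i lang="zh">字</lang-i></p>'
)

for code, total in count_language_codes(sample).most_common():
    print(code, total)
```

A code that appears only a handful of times next to a dominant one (here zh-TW beside zh) is a good candidate for review.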
Scribe Recommendations
- If there is any doubt about whether a language tag should be applied, default to applying the style.
- Scribe book, movie, and publication titles when they are in other languages.
- Identify styles on a large scale before investigating outliers.
- Use case-sensitive (match case) searches whenever using find-and-replace.
Investigating Outliers
When words are encountered that require some investigation to determine the language, take the following steps to avoid getting bogged down for an extended period of time.
- Check footnotes, endnotes, or the surrounding text for any context clues or direct indications of the word origins.
- Search the internet for the term.
- If no definitive answer can be found, flag the individual instance for an author, editor, or designated expert to review. In Word, this can be marked as an author query (AQ). If there are many instances of terms to be reviewed, it may be best to create a list of terms to review outside of the main document.
Loanwords
In many cases, terms with foreign origins have been adopted into the English language. When making a decision about whether to apply the language scribing, consider the following factors:
- If the term is well known from English usage, it can be considered English. Foods like “nacho” or “taco” are examples of Spanish words that are commonly used and accepted as English when used in an English-language context.
- If a term appears in the Merriam-Webster Collegiate Dictionary without being marked specifically as a foreign term, it can be considered English.
- If a term has an English-language Wikipedia page, this may be indicative of it being a technical term that is understood across languages.
If the determination is unclear, default to applying a language style.
When to Scribe Language Styles
Language styles can be applied at any time within the WFDW. Consider the following to determine a work plan that will be most efficient, with language choices made by the appropriate person.
- Scribing stage. If there are relatively few instances, it is recommended to apply the language styles during the scribing stage.
- Copyediting stage. If the language scribing will be more involved or content could change during author review, it is recommended to schedule the language scribing for after all other editorial considerations have been handled, before proceeding to production stages.
- Print production stage. It is not recommended to do extensive language scribing during typesetting and page proof stages. However, as content is added and alterations are applied, language scribing should be included and maintained.
- Ebook production stage. For ebooks being produced from typeset files, language scribing can be included as part of the ScML preparation steps. Language scribing should be handled within the ScML file before processing it to ePub format.
Language Scribing in Word
Use the SAI or SAI Lite to add and apply language styles in Word.
- Use the Add Language Style tool to create the necessary language styles for the project in Word.
- Per the scribing procedure for all projects, load ScML styles into the document.
- Use the drop-down menu in Load ScML Styles to select .
- Select the base style to use. Select from the drop-down menu or click the button to use the style of selected text.
- Enter the language code. Common languages can be selected from the drop-down menu.
- The Resulting style field will display the name of the style being added.
- Click to apply the style to selected text or to add the style without applying it.
- At the designated time (during the initial scribing or while copyediting), review all italic text for phrases that need to have language metadata applied. To find italic text, convert the Word file to .sam and use the Character Styles regular expressions or search for italic font in the Word Find window. Apply the language styles created.
- Apply additional language styles as needed (e.g., apply lang@lang=es to Spanish text using the default paragraph font or gt@lang=es to Spanish-language glossary terms).
Language Scribing in sam, ScML, or InDesign
If the live file is in a production stage, the language styles can be added in a sam, ScML, or InDesign file. When each language style has been identified, add them to the point document with the appropriate formatting.
- In sam/ScML: <lang-i lang="es">
- In InDesign: lang-i-language-es
The searches presented here can be used as a starting point for finding languages in books based on the character or paragraph styles applied to them as well as word and letter patterns that are unique to a particular language.
Review the results before changing or replacing any styles.
Paragraph Styles
Review block quote (bq) and senseline (sl) paragraphs as common places to identify if there are full paragraphs in another language. Check other paragraph styles as needed.
Find:
<[^>"]*(bq|sl)[^>]*>[^\n]+
Copy into a new file, turn off word wrap, and skim for non-English text.
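The paragraph search can also be run outside the editor. The following Python sketch applies the same expression and returns one candidate paragraph per list entry; the sample lines are invented, and the function name is hypothetical.

```python
import re

# Same expression as the Find field above.
PARAGRAPH_SEARCH = re.compile(r'<[^>"]*(bq|sl)[^>]*>[^\n]+')


def candidate_paragraphs(sam_text):
    """Return every bq or sl paragraph, one string per match, for review."""
    return [match.group(0) for match in PARAGRAPH_SEARCH.finditer(sam_text)]


# Invented sample with one already-tagged and two untagged candidates.
sample = "\n".join([
    "<p>Plain paragraph.</p>",
    '<bq lang="es">Ya marcado.</bq>',
    "<bq>En un lugar de la Mancha.</bq>",
    "<sl>Arma virumque cano.</sl>",
])

for paragraph in candidate_paragraphs(sample):
    print(paragraph)
```

The list still needs to be skimmed by a person; the search only narrows where to look.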
Character Styles
Review italic terms, various “-i” styles, and various lang terms in a new Sublime file.
Find:
<i>[^<]+</i>|<[^>]+-b?i>[^<]+</[^>]+-b?i>|<lang[^>]*>[^<]+</lang[^>]*>|<[^>]*lang[^>]*>[^<]+</[^>]*>
Copy into a new file, permute unique lines, remove English text, and delete proper names. Add lang attributes as needed to the original file.
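The "copy into a new file and permute unique lines" step can be approximated in Python. This sketch uses the same expression and keeps the first occurrence of each match in document order; the sample text and function name are assumptions for illustration.

```python
import re

# Same expression as the Find field above.
CHARACTER_SEARCH = re.compile(
    r"<i>[^<]+</i>"
    r"|<[^>]+-b?i>[^<]+</[^>]+-b?i>"
    r"|<lang[^>]*>[^<]+</lang[^>]*>"
    r"|<[^>]*lang[^>]*>[^<]+</[^>]*>"
)


def unique_styled_terms(sam_text):
    """Return each distinct match once, in order of first appearance."""
    matches = CHARACTER_SEARCH.findall(sam_text)
    return list(dict.fromkeys(matches))


# Invented sample with a repeated italic term and one tagged phrase.
sample = (
    "<p>He said <i>mañana</i> and again <i>mañana</i>, then "
    '<lang-i lang="fr">je ne sais quoi</lang-i> and <b>bold</b>.</p>'
)

for term in unique_styled_terms(sample):
    print(term)
```

Removing English text and proper names from the resulting list remains a manual judgment.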
Book-Specific Styles
Certain books such as Bibles or language/grammar books may have additional paragraph or character styles that are being used to identify languages. Review additional content for languages based on the type of publication.
Text in Quotation Marks
Publications with a significant amount of dialogue or other content in quotation marks may make reviewing all quotes unfeasible. If it is determined that it will be beneficial, use this search to find all instances of text in quotation marks and review the results.
Find:
(“[^\n”]*”)
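For a quick count before committing to a full review, the quotation search can be run in Python. A minimal sketch with an invented sample sentence:

```python
import re

# Same expression as the Find field above: one curly-quoted passage per match.
QUOTED = re.compile(r"(“[^\n”]*”)")


def quoted_passages(text):
    """Return every passage enclosed in curly double quotation marks."""
    return QUOTED.findall(text)


sample = "He said, “Hola, amigo,” and left. “Goodbye,” she replied."

passages = quoted_passages(sample)
print(len(passages), "quoted passages")
for passage in passages:
    print(passage)
```

The expression only matches curly quotation marks, so files that use straight quotes would need a different pattern.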
Language Names
Use the following searches to identify languages that may be referenced specifically as well as certain text patterns associated with particular languages.
Search:
-
\b(Aari|Abanyom|Abaza|Abkhaz|Abkhazian|Abujmaria|Acehnese|Adele|Adyghe|Afar|Afrikaans|Afro-Seminole Creole|Aimaq|Barbari|Aini|Ainu|Akan|Akawaio|Aklanon|Albanian|Aleut|Algonquin|Alsatian|Altay|Alutor|Amharic|Anda|Amdang|Ancient Meitei|Angika|Anyin|Ao|A-Pucikwar|Arabic|Aragonese|Aramaic|'Are'are|Argobba|Aromanian|Macedo-Romanian|Armenian|Arvanitic|Ashkun|Asi|Assamese|Assyrian Neo-Aramaic|Asturian|Ateso|Teso|A'Tong|'Auhelawa|Auslan|Austro-Bavarian|Avar|Avestan|Awadhi|Aymara|Azerbaijani|Badaga|Badeshi|Bahnar|Balinese|Balochi|Balti|Bambara|Bamanankan|Banjar|Banyumasan|Bartangi|Basaa|Bashkardi|Bashkir|Basque|Batak Karo|Batak Toba|Batak Simalungun|Bats|Beja|Belarusian|Belhare|Berta|Bemba|Bengali|Bezhta|Betawi|Bete|Bhili|Bhojpuri|Bijil Neo-Aramaic|Bikol|Bikya|Furu|Bissa|Blackfoot|Boholano|Bohtan Neo-Aramaic|Bonan|Paoan|Bororo|Bodo|Bosnian|Brahui|Breton|Bua|Buginese|Bukusu|Bulgarian|Bunjevac|Burmese|Burushaski|Buryat|Caddo|Cahuilla|Caluyanon|Caluyanun|Cantonese|Catalan|Cayuga|Cebuano|Chabacano|Chavacano|Chaga|Kichagga|Chakma|Chamorro|Chaouia|Tachawit|Chechen|Chenchu|Chenoua|Cherokee|Cheyenne|Chhattisgarhi|Chickasaw|Chintang|Chhintang|Chilcotin|Chinese|Chiricahua|Mescalero-Chiricahua Apache|Chichewa|Nyanja|Chipewyan|Chittagonian|Choctaw|Chorasmian|Khwarezmian|Chukchi|Chukot|Chulym|Church Slavonic|Chuukese|Trukese|Chuvash|Cocoma|Cocama|Cocopa|Coeur d’Alene|Comanche|Comorian|Cornish|Corsican|Cree|Crimean Tatar|Crimean Turkish|Croatian|Csángó|Cuneiform|Cuyonon|Czech|Dagbani|Dahlik|Dalecarlian|Dameli|Danish|Dargin|Dakota|Dari|Dari-Persian|Daur|Dagur|Dena'ina|Tanaina|Dhatki|Dhivehi|Maldivian|Dida|Dioula|Jula|Dogri|Dogrib|Tli Cho|Dolgan|Domaaki|Dumaki|Dongxiang|Santa|Duala|Dungan|Dutch|Dzhidi|Judeo-Persian|Dzongkha|Eastern Yugur|Edo|Efik|Esan|Egyptian Arabic|Egyptian Hieroglyphs|Ekoti|Enets|Yenisey 
Samoyed|English|Erzya|Esperanto|Estonian|Evenk|Evenki|Ewe|Extremaduran|Faroese|Fang|Fijian|Filipino|Finnish|Flemish|Fon|Franco-Provençal|Arpitan|French|Frisian|Friulian|Fula|Fulfulde|Fulani|Fur|Ga|Gadaba|Gagauz|Galician|Gallo|Gan|Ganda|Gangte|Garhwali|Gayo|Gen|Gẽ|Mina|Georgian|German|Gikuyu|Kikuyu|Gilbertese|Kiribati|Gileki|Goaria|Gondi|Gorani|Gurani|Gowro|Gawar-Bati|Gowari|Narsati|Greek|Guaraní|Guinea-Bissau Creole|Gujarati|Gula Iro|Kulaal|Gullah|Sea Island Creole English|Gusii|Gwichʼin|Hadza|Haida|Haitian Creole|Hakka|Hän|Harari|Harauti|Harsusi|Haryanavi|Harzani|Hausa|Havasupai|Upland Yuman|Hawaiian|Hazaragi|Hebrew|Herero|Hértevin|Hiligaynon|Hindi|Hinukh|Hiri Motu|Hixkaryana|Hmong|Ho|Hobyót|Hopi|Hulaulá|Hungarian|Hunsrik|Hutterite German|Ibibio|Iban|Ibanag|Icelandic|Ido|Ifè|Igbo|Biafra|Ikalanga|Kalanga|Ili Turki|Ilokano|Ilocano|Inari Sami|Indonesian|Ingrian|Izhorian|Ingush|Interlingua|Inuktitut|Inupiaq|Inuvialuktun|Iraqw|Irish|Irish Gaelic|Irish|Irula|Isan|Northeastern Thai|Ishkashimi|Ishkashmi|Istro-Romanian|Italian|Itelmen|Kamchadal|Jacaltec|Jakalteko|Jalaa|Jamaican Patois|Japanese|Jaqaru|Jarai|Javanese|Jen|Jewish Babylonian Aramaic|Jibbali|Shehri|Jicarilla Apache|Juang|Jurchen|Kabardian|Kabyle|Kachin|Jingpo|Kalaallisut|Greenlandic|Kalami|Gawri|Dirwali|Kalasha|Kalmyk|Oirat|Kalto|Nahali|Kamtapuri|Rangpuri|Rajbongshi|Kankanai|Kankanaey|Kannada|Kaonde|Chikaonde|Kapampangan|Karachay-Balkar|Karagas|Karaim|Karakalpak|Karelian|Karenni|Kashmiri|Kashubian|Kazakh|Kerek|Ket|Khakas|Khalaj|Kham|Sheshi|Khandeshi|Khanty|Ostyak|Khasi|Khitan|Khmer|Khmu|Khowar|Kildin Sami|Kimatuumbi|Kinaray-a|Hiraya|Kinyarwanda|Kirombo|Kirundi|Kivunjo|Klallam|Clallam|Klingon|Kodava Takk|Kodagu|Coorgi|Kohistani|Khili|Kolami|Komi|Komi-Zyrian|Konkani|Kongo|Kikongo|Koraga|Korandje|Korean|Korku|Korowai|Korwa|Koryak|Kosraean|Kota|Koyra Chiini|Western Songhay|Koy Sanjaq Surat|Koya|Krymchak|Judeo-Crimean 
Tatar|Krio|Kujarge|Kui|Kumauni|Kumyk|Kumzari|ǃKung|Kurdish|Kurukh|Kurux|Kusunda|Kutenai|Kootenay|Ktunaxa|Kwanyama|Ovambo|Kxoe|Kyrgyz|Kirghiz|Láadan|Laal|Ladakhi|Ladin|Ladino|Judeo-Spanish|Laki|Lakota|Lakhota|Teton|Lambadi|Lamani|Banjari|Lao|Laotian|Larestani|Latin|Latvian|Laverent|Laz|Lazuri|Leonese|Lepcha|Lemerig|Lezgi|Agul|Ligbi|Ligby|Ligurian|Limbu|Limburgish|Lingala|Lipan Apache|Lisan al-Dawat|Lishana Deni|Lishanid Noshan|Lishana Didan|Lithuanian|Livonian|Liv|Lombard|Lotha|Low German|Low Saxon|Lower Sorbian|Lozi|Silozi|Ludic|Ludian|Lunda|Chilunda|Luo|Luri|Lushootseed|Lusoga|Soga|Luvale|Luwati|Luxembourgish|Lycian|Lydian|Macedonian|Magadhi|Maguindanao|Maithili|Makasar|Makhuwa|Makua|Makhuwa-Meetto|Malagasy|Malay|Malayalam|Maltese|Malto|Sauria Paharia|Malvi|Malavi|Ujjaini|Mam|Manchu|Mandaic|Mandarin|Mandinka|Mansi|Vogul|Manx|Manyika|Maori|Mapudungun|Mapuche|Maranao|Marathi|Mari|Cheremis|Marquesan|Marshallese|Ebon|Masaba|Masbatenyo|Minasbate|Maya|Mazandarani|Tabari|Meänkieli|Tornedalen Finnish|Megleno-Romanian|Megrelian|Mingrelian|Mehri|Mahri|Meitei|Manipuri|Meithei|Menominee|Mentawai|Meroitic|Mescalero Apache|Meru|Kimeru|Michif|Mikasuki|Miccosukee|Mi'kmaq|Micmac|Minangkabau|Mirandese|Mobilian Jargon|Moghol|Mohawk|Moksha|Molengue|Mon|Mongolian|Mono|Mono|Mono|Montagnais|Montenegrin|Motu|Muher|Mundari|Munji|Muria|Nafaanra|Nagarchal|Nahuatl|Nama|Nanai|Nauruan|Navajo|Navaho|Ndau|Southeast Shona|Ndebele|Ndonga|Neapolitan|Negidal|Nepal Bhasa|Newari|Nepali|Nihali|Nahali|Nganasan|Tavgi|Ngumba|Nheengatu|Geral|Modern Tupí|Nias|Niellim|Nigerian Pidgin|Nisenan|Niuean|Niue|Nivkh|Gilyak|Nogai|Norfuk|Norfolk|Pitcairn-Norfolk|Norman-French|Northern Sami|Northern Sotho|Sepedi|Northern Yukaghir|Norwegian|Bokmål|Nynorsk|Riksmål|Nuer|Nurt|Nuxálk|Bella Coola|Nyabwa|Nyah Kur|Nyangumarta|Nyoro|Nǀu|Occitan|Provençal|Ojibwe|Ojibwa|Chippewa|Okinawan|Olonets Karelian|Liv|Livvi|Omagua|Ongota|Odia|Ormuri|Oroch|Orok|Oromo|Afaan Oromoo|Ossetic|Ossetian|Old East Slavic|Old Russian|Oostfräisk|East 
Frisian Low Saxon|Old Prussian|Oshimbalantu|Odia|Padaung|Páez|Nasa Yuwe|Palauan|Palawa_kani|Pangasinan|Pa'O|Papiamento|Papiamentu|Parachi|Parya|Pashto|Pushto|Pashtu|Pennsylvania Dutch|Pennsylvania German|Persian|farsi|Phalura|Phuthi|Pig Latin|Picard|Pirahã|Plautdietsch|Mennonite Low German|Polish|Portuguese|Pradhan|Pardhan|Puelche|Puma|Punjabi|Panjabi|Pwo Karen|Palestinian Arabic|Pascenda|Pashandah|Phat Thai|Q’eqchi’|Qashqai|Ghashghai|Quechua|Qui|Rajasthani|Ratagnon|Datagnon|Latagnun|Réunion Creole|Bourbonnais|Romagnol|Romanian|Romansh|Rhaeto-Romance|Romany|Romblomanon|Rotokas|Runyankole|Nyankore|Russian|Ruthenian|Rusyn|Carpathian|Sabaean|Sadri|Salar|Samoan|Sandawe|Sango|Sanskrit|Santali|Saramaccan|Sardinian|Sarikoli|Saurashtra|Sourashtra|Savara|Savi|Sawai|Scots|Ulster Scots|Hiberno-Scots|Ullans|Scots Gaelic|Scottish Gaelic|Gaidhlig|Gaelic|Selkup|Ostyak Samoyed|Semnani|Senaya|Serbian|Serbo-Croatian|Sesotho|Seto|Setu|Seychellois Creole|S'gaw Karen|Shimaore|Shina|Shona|Shor|Shoshoni|Shughni|Shumashti|Shuswap|Sicilian|Sidamo|Sika|Silesian|Silt'e|Selti|East Gurage|Sindhi|Sinhalese|Sioux|Sivandi|Skolt Sami|Slavey|Slovak|Slovene|Slovenian|Soddo|Kistane|Somali|Sonjo|Temi|Sonsorolese|Sonsorol|Soqotri|Sora|Sorbian, Lower|Sorbian, Upper|Sourashtra|Southern Sami|South Estonian|Southern Yukaghir|Tundra Yukaghir|Spanish|Sranan Tongo|St'at'imcets|Lillooet|Sucite|Sìcìté Sénoufo|Suba|Sundanese|Supyire|Supyire Senoufo|Surigaonon|Susu|Svan|Swahili|Swati|Swazi|Siswati|Seswati|Swedish|Syriac|Tabasaran|Tabassaran|Tachelhit|Tagalog|Tahitian|Tajik|Takestani|Talysh|Tamil|Tamasheq|Tamazight|Tanacross|Tangut|Tarifit|Rifi|Riff Berber|Tat|Tati|Tatar|Tausug|Tehuelche|Telugu|Tetum|Tepehua|Tepehuán|Thai|Tharu|Tibetan|Tigre|Xasa|Tigrinya|Timbisha|Panamint|Tiv|Tlingit|Tobian|Toda|Tok Pisin|Tokelauan|Tonga|Tongan|Torwali|Turvali|Tregami|Tsat|Tsez|Dido|Tshiluba|Luba-Kasai|Luba-Lulua|Tsonga|Tswana|Setswana|Tu|Monguor|Tuareg|Tamasheq|Tulu|Tumbuka|Tupiniquim|Turkish|Turkmen|Turoyo|Tuvaluan|Tuvan 
Tuvin|Tyvan|Udihe|Ude|Udege|Udmurt|Votyak|Ukrainian|Ukwuani-Aboh-Ndoni|Ulch|Olcha|Unserdeutsch|Rabaul Creole German|Upper Sorbian|Urdu|Uripiv|Urum|Ute|Uyghur|Uigur|Uzbek|Vafsi|Valencian|Vasi-vari|Prasuni|Venda|Tshivenda|Venetian|Veps|Vietnamese|Volapük|Võro|Votic|Votian|Waddar|Waigali|Kalasha-Ala|Waima|Roro|Wakhi|Walloon|Waray-Waray|Binisaya|Washo|Welsh|Western Frisian|Western Neo-Aramaic|Wolaytta|Wolane|Silt'e|Wolof|Wu|Xhosa|Xiang|Xibe|Sibo|Xipaya|Xóõ|Yaaku|Yaeyama|Yaghnobi|Yakut|Yankunytjatjara|Yanomami|Yanyuwa|Yapese|Yaqui|Yauma|Yavapai|Yazdi|Yazgulyam|Yazgulami|Yemenite Hebrew|Yeni|Yevanic|Yi|Yiddish|Yidgha|Yogur|Yoghur|Sarï Uyghur|Yellow Uyghur|Mongolic|Yokutsan|Yonaguni|Yoruba|Yucatec Maya|Yuchi|Yugur|Yughur|Sarïgh Uyghur|Yellow Uyghur|Turkic|Yukaghir|Yupik|Yurats|Yurok|Záparo|Zapotec|Zazaki|Zulu|Zuñi|Zuni|Zway|Zay)\b
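A search of this shape can also be generated from a plain list of names, which is easier to maintain than a single long expression. The sketch below uses a small hypothetical subset of the names above. It sorts longer names first so that a multiword name such as Irish Gaelic is matched in full rather than stopping at Irish; that ordering is a design choice of this sketch, not something the search above does.

```python
import re

# Hypothetical subset; the full search above lists several hundred names.
LANGUAGE_NAMES = ["Irish", "Irish Gaelic", "Spanish", "Latin", "Low German", "German"]


def build_name_pattern(names):
    """Build a whole-word alternation, longest names first."""
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternation + r")\b")


NAME_PATTERN = build_name_pattern(LANGUAGE_NAMES)


def language_mentions(text):
    """Return the language names mentioned in the text, in order."""
    return NAME_PATTERN.findall(text)


sample = "The glossary gives each term in Irish Gaelic, Low German, and Spanish."
print(language_mentions(sample))
```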
Spanish
Find:
(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)
Find:
(“[^\n”]*(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)[^\n”]*”)
Replace with:
<lang lang="es">\1</lang>
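The quoted-Spanish find-and-replace can be previewed in Python before any change is made to the live file. This sketch uses the same Find and Replace expressions; the sample sentence is invented. The cue words are heuristics, so every hit should still be reviewed by a person.

```python
import re

# Same Find expression as above: a quoted passage containing a Spanish cue.
SPANISH_QUOTE = re.compile(
    r"(“[^\n”]*"
    r"(?:\sdel?\s|[Ee]stoy|¿|[eE]stá|iendo|[sS]í|Dios|\s[eE]s\s|\s[Ee]res\s|\sy\s)"
    r"[^\n”]*”)"
)


def tag_spanish_quotes(text):
    """Wrap each matching quoted passage in a lang element with code es."""
    return SPANISH_QUOTE.sub(r'<lang lang="es">\1</lang>', text)


sample = "She whispered, “¿Dónde está la biblioteca?” and he answered, “I have no idea.”"
print(tag_spanish_quotes(sample))
```

Only the first quotation is wrapped here, because the second contains none of the cue patterns.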
French
Find:
\b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b
German
Find:
\b(?:für|und)\b
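To get a rough sense of which languages are present before searching term by term, the cue-word searches can be counted together. This Python sketch reuses the French and German expressions above; the dictionary layout and sample sentence are assumptions for illustration.

```python
import re

# The French and German Find expressions from above, keyed by language code.
CUE_PATTERNS = {
    "fr": re.compile(r"\b(?:[Jj]e|[vV]ous|moi|[Mm]ais|[Uu]ne|[cCsS]’est|[Ll]eur)\b"),
    "de": re.compile(r"\b(?:für|und)\b"),
}


def count_cues(text):
    """Count cue-word hits per language code."""
    return {code: len(pattern.findall(text)) for code, pattern in CUE_PATTERNS.items()}


sample = "Mais je ne sais pas. Kaffee und Kuchen für alle."
print(count_cues(sample))
```

A high count suggests a language is worth a dedicated pass; a low count does not prove the language is absent.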
Language Codes
Some of the most common language codes are listed here as well as on the BCP-47 Wikipedia page.
The subtag lookup tool can be used to find thousands of additional language codes. Search for a language name in the Find window and look through the results listed under “language codes.”
| Language | Code |
|---|---|
| Afrikaans | af |
| Albanian | sq |
| Amharic | am |
| Arabic | ar |
| Armenian | hy |
| Assamese | as |
| Azerbaijani | az |
| Bashkir | ba |
| Basque | eu |
| Belarusian | be |
| Bengali | bn |
| Bosnian | bs |
| Breton | br |
| Bulgarian | bg |
| Burmese | my |
| Catalan | ca |
| Central Kurdish | ckb |
| Chinese | zh |
| Corsican | co |
| Croatian | hr |
| Czech | cs |
| Danish | da |
| Dari | prs |
| Divehi | dv |
| Dutch | nl |
| English | en |
| Estonian | et |
| Faroese | fo |
| Filipino | fil |
| Finnish | fi |
| French | fr |
| Frisian | fy |
| Galician | gl |
| Georgian | ka |
| German | de |
| Gilbertese | gil |
| Greek | el |
| Greenlandic | kl |
| Gujarati | gu |
| Hausa | ha |
| Hebrew | he |
| Hindi | hi |
| Hungarian | hu |
| Icelandic | is |
| Igbo | ig |
| Indonesian | id |
| Inuktitut | iu |
| Irish | ga |
| Italian | it |
| Japanese | ja |
| Kʼicheʼ | quc |
| Kannada | kn |
| Kazakh | kk |
| Khmer | km |
| Kinyarwanda | rw |
| Kiswahili | sw |
| Konkani | kok |
| Korean | ko |
| Kurdish | ku |
| Kyrgyz | ky |
| Lao | lo |
| Latin | la |
| Latvian | lv |
| Lithuanian | lt |
| Lower Sorbian | dsb |
| Luxembourgish | lb |
| Macedonian | mk |
| Malay | ms |
| Malayalam | ml |
| Maltese | mt |
| Maori | mi |
| Mapudungun | arn |
| Marathi | mr |
| Mohawk | moh |
| Mongolian | mn |
| Moroccan Arabic | ary |
| Nepali | ne |
| Norwegian (Bokmål) | nb |
| Norwegian (Nynorsk) | nn |
| Norwegian | no |
| Occitan | oc |
| Odia | or |
| Papiamento | pap |
| Pashto | ps |
| Persian | fa |
| Polish | pl |
| Portuguese | pt |
| Punjabi | pa |
| Quechua | qu |
| Romanian | ro |
| Romansh | rm |
| Russian | ru |
| Sami (Inari) | smn |
| Sami (Lule) | smj |
| Sami (Northern) | se |
| Sami (Skolt) | sms |
| Sami (Southern) | sma |
| Sanskrit | sa |
| Scottish Gaelic | gd |
| Serbian | sr |
| Sesotho | st |
| Sinhala | si |
| Slovak | sk |
| Slovenian | sl |
| Spanish | es |
| Swedish | sv |
| Swiss German | gsw |
| Syriac | syc |
| Tagalog | tl |
| Tajik | tg |
| Tamazight | tzm |
| Tamil | ta |
| Tatar | tt |
| Telugu | te |
| Thai | th |
| Tibetan | bo |
| Tswana | tn |
| Turkish | tr |
| Turkmen | tk |
| Ukrainian | uk |
| Upper Sorbian | hsb |
| Urdu | ur |
| Uyghur | ug |
| Uzbek | uz |
| Vietnamese | vi |
| Welsh | cy |
| Wolof | wo |
| Xhosa | xh |
| Yakut | sah |
| Yi | ii |
| Yoruba | yo |
| Zulu | zu |
Note: Bold text indicates commonly used languages.
Language Scribing QC Checklist
Quality control for language scribing is a collaborative confirmation of decisions that may not have a definitively right or wrong answer. While the formatting of the language tags must follow particular patterns in environments like Word, InDesign, sam, or ScML, the choices of which terms should be tagged may vary from person to person.
QC should assess the content for the following:
- Terms that are not tagged that should be
- Terms that are tagged that should not be
- Terms that have the wrong language applied to them
Use a sam or ScML file to search for terms to review.
Tagged Paragraph Styles
Find all, copy the results into a new file, and review the list for anything that appears to be incorrect.
Find:
^ *<[^>]*lang[^\n]+
Tagged Character Styles
Find all, copy the results into a new file, and review the list for anything that appears to be incorrect.
Find:
<[^>]*lang[^>]*>[^<]*<[^>]*>
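Grouping the tagged terms by language code can make a wrong code easier to spot. The Python sketch below uses a variant of the search above with capture groups added for the code and the tagged text; that variant, the function name, and the sample are assumptions of this sketch.

```python
import re
from collections import defaultdict

# Variant of the search above: group 1 is the code, group 2 the tagged text.
TAGGED_TERM = re.compile(r'<[^>]*lang="([^"]+)"[^>]*>([^<]*)<[^>]*>')


def terms_by_language(scml_text):
    """Group tagged terms under the language code applied to them."""
    grouped = defaultdict(list)
    for code, term in TAGGED_TERM.findall(scml_text):
        grouped[code].append(term)
    return dict(grouped)


# Invented sample in which a French word carries the Spanish code.
sample = (
    '<p><lang-i lang="fr">bonjour</lang-i>, <lang-i lang="es">merci</lang-i>, '
    '<lang lang="es">hola</lang></p>'
)

for code, terms in terms_by_language(sample).items():
    print(code, terms)
```

In this sample, merci listed under es stands out as a term with the wrong language applied.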
Untagged Terms
In a copy of the file, remove the content that has language tags applied to it.
Find:
^ *<[^>]*lang[^\n]+
Replace with: nothing (leave the Replace field empty)
Find:
<[^>]*lang[^>]*>[^<]*<[^>]*>
Replace with: nothing (leave the Replace field empty)
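The two removal searches can be applied to an in-memory copy with Python, which leaves the original file untouched. This is a minimal sketch using the same two Find expressions with an empty replacement; the sample text is invented.

```python
import re

# Tagged paragraphs: a line whose opening tag contains "lang".
TAGGED_PARAGRAPH = re.compile(r"^ *<[^>]*lang[^\n]+", re.MULTILINE)
# Tagged character styles: an element whose opening tag contains "lang".
TAGGED_CHARACTER = re.compile(r"<[^>]*lang[^>]*>[^<]*<[^>]*>")


def strip_tagged_content(scml_text):
    """Return a copy of the text with language-tagged content removed."""
    without_paragraphs = TAGGED_PARAGRAPH.sub("", scml_text)
    return TAGGED_CHARACTER.sub("", without_paragraphs)


sample = "\n".join([
    '<p>She said <lang-i lang="fr">bonjour</lang-i> to everyone.</p>',
    '<bq lang="es">En un lugar de la Mancha.</bq>',
    "<p>Plain English paragraph.</p>",
])

print(strip_tagged_content(sample))
```

What remains is the untagged text, ready for the follow-up searches.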
With this content deleted, repeat some of the searches and techniques used to find foreign languages. (Use spell-check selectively to help skim through results.)
Review the following:
- Text in italics
- Text in quotation marks
- Text marked by spell-check
- Terms containing the special characters listed in the Digital Hub stats
- Words in the body of the book that may indicate certain languages