IDTT to sam

Use the following procedure to convert InDesign Tagged Text (IDTT) that has been extracted from an InDesign file to a Scribe Abbreviated Markup (sam) file.

Use this procedure only for files that were typeset outside of the Well-Formed Document Workflow (WFDW).

If a book was typeset using the WFDW, use the Export .sam from InDesign process instead.

References/Prerequisites

Planning and Approach

Plan the Work

1. Assess

Develop or review the specifications to identify the requirements of the final output. Check that the IDTT files contain everything required for conversion, including content, page ID placement, and formatting tags.

Search across files for page IDs to identify whether any are missing.

Find: \{~\?~PG: @([a-z0-9]+)@\}
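
As a rough cross-check on the search above, a short script can list the page IDs it finds and flag gaps. This is only a sketch: the function name missing_page_ids and the sample text are hypothetical, and it assumes that page IDs made up of digits should be consecutive. Roman-numeral IDs are matched but not checked.

```python
import re

# Same pattern as the Find expression above.
PAGE_ID = re.compile(r"\{~\?~PG: @([a-z0-9]+)@\}")

def missing_page_ids(text):
    """Return arabic page numbers absent between the lowest and highest found."""
    ids = PAGE_ID.findall(text)
    # Assumption: only all-digit IDs are expected to form a sequence.
    numbers = sorted({int(i) for i in ids if i.isdigit()})
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]

# Hypothetical sample: page 3 has no ID.
sample = "{~?~PG: @ix@}Preface{~?~PG: @1@}One{~?~PG: @2@}Two{~?~PG: @4@}Four"
print(missing_page_ids(sample))  # prints [3]
```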

2. Plan

Determine what elements will need to be checked and addressed specifically in the resulting .sam file.

3. Act

Carry out the planned actions.

Working

1. Save Regularly

Save your file regularly as you work. Saving ensures you can fix a problem or backtrack as needed.

2. Regular Expressions

In Sublime Text, enable regular expression search and turn on the Match Case setting.

Note the following about the regular expressions listed below:

  • Replace with NOTHING means leaving the replacement box empty.
  • Replace with SPACE means a literal space character.

Procedure

Merge Files

Merge all IDTT files into a single text file with a .sam extension.

Place content in order.

Note: In some cases, it may be most efficient to place content into approximate locations and then review the content for final order later in the process, after all extraneous tags have been removed.

Remove Metadata

Delete file setup information.

Find: <(ASCII|Version|Define)[^\n]*\n
Replace with: NOTHING

Find: <FILENAME[^\n]*\n
Replace with: NOTHING
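
The two searches above can also be scripted. The following is a minimal sketch, not part of the documented procedure: the function name remove_metadata and the sample header lines are hypothetical. Python's re module is case sensitive by default, which corresponds to the Match Case setting.

```python
import re

def remove_metadata(text):
    """Delete file setup lines, mirroring the two Find expressions above."""
    text = re.sub(r"<(ASCII|Version|Define)[^\n]*\n", "", text)
    text = re.sub(r"<FILENAME[^\n]*\n", "", text)
    return text

# Hypothetical sample of the opening lines of an IDTT file.
sample = (
    "<ASCII-WIN>\n"
    "<Version:13><FeatureSet:InDesign-Roman>\n"
    "<DefineParaStyle:Body=<cSize:10>>\n"
    "<ParaStyle:Body>First paragraph.\n"
)
print(remove_metadata(sample))
```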

Control Characters

Remove all control characters (e.g., ESC, BEL, BS). In Sublime Text, these appear with a shaded background.

Find: [^ -~\n\t]
Replace with: NOTHING
or: SPACE
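
A scripted version of this search might look like the sketch below. The function name strip_control_characters and the sample string are hypothetical. Note that the character class matches anything other than printable ASCII, tab, and newline, so it would also match a literal non-ASCII character (such as an accented letter) if one is present in the file.

```python
import re

# Same character class as the Find expression above.
NON_PRINTABLE = re.compile(r"[^ -~\n\t]")

def strip_control_characters(text, replacement=""):
    """Replace each matched character with nothing (default) or a space."""
    return NON_PRINTABLE.sub(replacement, text)

# Hypothetical sample containing ESC, BEL, and BS.
sample = "Chapter\x1b One\x07\tends\x08 here\n"
print(repr(strip_control_characters(sample)))
```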

Named Entities

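Replace ampersands and literal less-than and greater-than characters with named entities. In IDTT, literal less-than and greater-than characters are escaped with a backslash, which is why the searches below match \< and \>.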

Find: &
Replace with: &amp;

Find: \\<
Replace with: &lt;

Find: \\>
Replace with: &gt;
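
The order of these three replacements matters: ampersands go first so that the ampersands introduced by the later replacements are not escaped a second time. A minimal sketch, with a hypothetical function name and sample line:

```python
import re

def escape_named_entities(text):
    """Apply the three replacements above in the documented order."""
    # Ampersands first, so the "&" in "&lt;" and "&gt;" is not escaped again.
    text = re.sub(r"&", "&amp;", text)
    # Backslash-escaped angle brackets become entities; tag brackets are untouched.
    text = re.sub(r"\\<", "&lt;", text)
    text = re.sub(r"\\>", "&gt;", text)
    return text

# Hypothetical sample line.
sample = r"<ParaStyle:Body>R&D shows x \< y and y \> z"
print(escape_named_entities(sample))
```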

Line Breaks

Note: Some of the following searches may need to be run again at different stages during conversion.

Add Placeholder Style Name

Add “nostyle” as a placeholder paragraph style name.

Find: (<ParaStyle:)(>)
Replace with: \1nostyle\2

Place Paragraphs on New Lines

Find: ([^\n])(<ParaStyle:)
Replace with: \1\n\2

Remove Empty Lines

Repeat the following until there are no more results:

Find: \n\n
Replace with: \n

Move Closing Tags to the Ends of Lines

Find: \n(<[^>]*:>)
Replace with: \1
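
The four line-break searches above can be chained in a script. This is a sketch under stated assumptions: the function name normalize_line_breaks and the sample text are hypothetical, and a single pass of \n+ stands in for repeating the \n\n search until there are no more results. Because the placeholder step runs first, an empty <ParaStyle:> tag is no longer mistaken for a closing tag in the last step.

```python
import re

def normalize_line_breaks(text):
    # Add the "nostyle" placeholder to empty paragraph style tags.
    text = re.sub(r"(<ParaStyle:)(>)", r"\1nostyle\2", text)
    # Start each paragraph on its own line.
    text = re.sub(r"([^\n])(<ParaStyle:)", r"\1\n\2", text)
    # Collapse runs of newlines (equivalent to repeating \n\n -> \n).
    text = re.sub(r"\n+", "\n", text)
    # Pull closing tags such as <cTypeface:> back onto the preceding line.
    text = re.sub(r"\n(<[^>]*:>)", r"\1", text)
    return text

# Hypothetical sample.
sample = (
    "<ParaStyle:Body>One<cTypeface:Italic>two\n<cTypeface:>\n\n\n"
    "<ParaStyle:>Three<ParaStyle:Body>Four\n"
)
print(normalize_line_breaks(sample))
```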

Remove Unnecessary InDesign Tags

Note: Modify the following searches if a tag can help determine where an ScML style should be applied. For example, “Skew” may be useful for identifying italics, and “TextAlignment” may indicate poetry. Do not delete any tag that may contain vital style information until the appropriate ScML style has been applied.

Remove unnecessary character rendering tags.

Find: <c[^>]*(Leading|Kerning|Tracking|Spacing|Size|Ligatures|OTF|Skew|Language|Baseline)[^>]*>
Replace with: NOTHING

Find: <c(Bouten|Kent(en)|Shatai|Tatech?u|Tsume|Wari(chu)|Hindi|StrokeGradient|NextXChars)([^>]*>)
Replace with: NOTHING

Remove unnecessary paragraph rendering tags.

Find: <p[^>]*(Space|TabRuler|Keep[Ww]ithNext|Auto|Hyphen)[^>]*>
Replace with: NOTHING

Remove unnecessary paragraph styles representing blank lines.

Find: ^<ParaStyle:[^>]*>[ \t]*\n
Replace with: NOTHING

Remove unnecessary hyperlink tags.

Find: <Hyperlink:=(<[^>]*>)*>
Replace with: NOTHING

Remove unnecessary text alignment tags.

Find: <pTextAlignment[^>]*>
Replace with: NOTHING
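
Because some of these tags may carry style information, it can help to see exactly which tags a search would delete before deleting them. The sketch below counts the distinct matches for one of the searches above and then removes them. The function name preview_then_remove and the sample line are hypothetical.

```python
import re
from collections import Counter

# The first character rendering search from this section.
CHAR_RENDERING = re.compile(
    r"<c[^>]*(Leading|Kerning|Tracking|Spacing|Size|Ligatures|OTF|Skew|Language|Baseline)[^>]*>"
)

def preview_then_remove(text, pattern):
    """Count each distinct tag the pattern would delete, then delete them."""
    counts = Counter(m.group(0) for m in pattern.finditer(text))
    return counts, pattern.sub("", text)

# Hypothetical sample line.
sample = "<ParaStyle:Body><cSize:10><cSkew:12>Moby-Dick<cSkew:><cSize:> is long."
counts, cleaned = preview_then_remove(sample, CHAR_RENDERING)
for tag, n in counts.most_common():
    print(n, tag)
print(cleaned)
```

In this sample, the <cSkew:12> tag in the preview would be a cue to apply an italic style before running the deletion.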

Convert Characters to Unicode Entities

Search for the following and determine the best “replace” option based on context.

Typesetting Spaces and Manual Breaks

Search for typesetting spaces.

Find: <0x200[0-9A-F]>
Replace with: NOTHING
or: SPACE

Search for soft hyphens.

Find: <0x00AD>
Replace with: NOTHING

Search for manual line breaks.

Find: <0x000A>
Replace with: \n
or: SPACE
or: NOTHING

Entity Format

Change remaining characters to hexadecimal entity format.

Find: <0(x[A-F0-9]+)>
Replace with: &#\1;

Characters in hexadecimal entity format will be converted to their corresponding Unicode characters when processed to other file formats through the Digital Hub.
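
The sketch below applies the same search and shows what the resulting entities stand for. The function name to_hex_entities and the sample are hypothetical, and html.unescape is used only to illustrate the character each entity represents. It is not a description of how the Digital Hub performs the conversion.

```python
import html
import re

def to_hex_entities(text):
    """Rewrite IDTT character codes such as <0x00E9> as &#x00E9;."""
    return re.sub(r"<0(x[A-F0-9]+)>", r"&#\1;", text)

# Hypothetical sample.
sample = "caf<0x00E9> <0x2014> done"
converted = to_hex_entities(sample)
print(converted)                 # prints caf&#x00E9; &#x2014; done
print(html.unescape(converted))  # shows the characters the entities represent
```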

Convert Character Styles

Construct searches based on what is found in order to apply the appropriate ScML character styles.

Note: The same rendering may be applied to elements that require different ScML styles.

<cPosition

Find: <cPosition

Example:

Find: <cPosition:Superscript>([a-z\d]+)<cPosition:>
Replace with: <enref>\1</enref>
or: <fnref>\1</fnref>
or: <sup>\1</sup>

<c

Find: <c

Example:

Find: <cTypeface:Italic>([^<]*)<cTypeface:>
Replace with: <i>\1</i>
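
The two example searches in this section could be scripted as shown below. This is a sketch only: the function names and the sample line are hypothetical, and the choice among enref, fnref, and sup still has to be made by the person doing the conversion. Because the captured text excludes "<", a styled span that contains another tag is not matched and needs separate handling.

```python
import re

def convert_italics(text):
    return re.sub(r"<cTypeface:Italic>([^<]*)<cTypeface:>", r"<i>\1</i>", text)

def convert_superscripts(text, tag="sup"):
    # The tag is a judgment call based on context: "enref", "fnref", or "sup".
    if tag not in {"enref", "fnref", "sup"}:
        raise ValueError(f"unexpected tag: {tag}")
    return re.sub(
        r"<cPosition:Superscript>([a-z\d]+)<cPosition:>",
        rf"<{tag}>\1</{tag}>",
        text,
    )

# Hypothetical sample line.
sample = "See <cTypeface:Italic>Emma<cTypeface:>.<cPosition:Superscript>12<cPosition:>"
print(convert_superscripts(convert_italics(sample), tag="fnref"))
```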

Page IDs

Convert page IDs to self-closing tags.

Single page IDs:

Find: <CharStyle:page>\{~\?~PG: @([\da-z]+)@\}<CharStyle:>
Replace with: <page id="p\1"/>

Adjacent page IDs:

Find: <CharStyle:page>\{~\?~PG: @([\da-z]+)@\}\{~\?~PG: @([\da-z]+)@\}<CharStyle:>
Replace with: <page id="p\1"/><page id="p\2"/>

Search for any remaining page IDs.

Find: \{~\?~PG:
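
The single and adjacent cases can be handled together in a script by matching a run of one or more page IDs inside the page character style. This is a sketch with hypothetical names (convert_page_ids, PAGE_RUN) and a hypothetical sample, not a replacement for the searches above.

```python
import re

# One or more page IDs wrapped in the "page" character style.
PAGE_RUN = re.compile(r"<CharStyle:page>((?:\{~\?~PG: @[\da-z]+@\})+)<CharStyle:>")
PAGE_ID = re.compile(r"\{~\?~PG: @([\da-z]+)@\}")

def convert_page_ids(text):
    def expand(match):
        # Convert every page ID in the run to a self-closing tag.
        return PAGE_ID.sub(r'<page id="p\1"/>', match.group(1))
    return PAGE_RUN.sub(expand, text)

# Hypothetical sample with an adjacent pair and a single page ID.
sample = (
    "end.<CharStyle:page>{~?~PG: @12@}{~?~PG: @13@}<CharStyle:>"
    "Next<CharStyle:page>{~?~PG: @xiv@}<CharStyle:>"
)
result = convert_page_ids(sample)
print(result)
# Any page ID that was not wrapped in the page style would still be found here.
print(re.search(r"\{~\?~PG:", result) is None)
```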

<CharStyle:

Find: <CharStyle:

Example:

Find: <CharStyle:Italic>([^<]*)<CharStyle:>
Replace with: <i>\1</i>

<cCase:

Find: <cCase:Small Caps>([^<]*)<cCase:>
Replace with: <sm>\1</sm>

Find: <cCase:All Caps>([^<]*)<cCase:>
Replace with: \U\1
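
If these searches are scripted in Python rather than run in the editor, note that Python's re module has no \U replacement escape, so a function is used to uppercase the captured text. A minimal sketch with a hypothetical function name and sample:

```python
import re

def convert_case_tags(text):
    text = re.sub(r"<cCase:Small Caps>([^<]*)<cCase:>", r"<sm>\1</sm>", text)
    # No \U escape in Python replacements: uppercase the capture in a function.
    text = re.sub(
        r"<cCase:All Caps>([^<]*)<cCase:>",
        lambda m: m.group(1).upper(),
        text,
    )
    return text

# Hypothetical sample line.
sample = "<cCase:Small Caps>a.d.<cCase:> 79 <cCase:All Caps>Pompeii<cCase:>"
print(convert_case_tags(sample))  # prints <sm>a.d.</sm> 79 POMPEII
```

In this sketch, uppercasing also affects any named entity inside the span (for example, &amp; would become &AMP;), so matches that contain entities are worth checking by hand.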

Remove Stray Closing Character Style Tags

After ScML character styles have been applied, remove any remaining closing character style tags.

Find: <cTypeface:>|<CharStyle:>
Replace with: NOTHING

Convert Paragraph Styles

Construct searches based on what is found in order to apply the appropriate ScML paragraph styles.

At this time, do not compose spacing variations (f, l, s, or o) unless the existing styles in the file provide a 1-to-1 correspondence. Identify only the structural aspects of the paragraphs. Articulation can be added when converting the .sam file to .scml at a later stage by enabling the Articulate Spacing Distinctions setting in the Digital Hub.

<Para

Find: <Para

Example:

Find: <ParaStyle:Chapter Title>([^\n]*)$
Replace with: <ct>\1</ct>
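
When several paragraph styles need converting, a lookup table can drive the same search. The sketch below is an assumption-laden illustration: STYLE_MAP and convert_paragraph_styles are hypothetical names, the only mapping taken from this document is Chapter Title to ct, and any other entries would have to come from the project specifications. Python needs the re.MULTILINE flag for $ to match at the end of each line.

```python
import re

# Hypothetical mapping; only "Chapter Title" -> "ct" comes from the example above.
STYLE_MAP = {"Chapter Title": "ct"}

def convert_paragraph_styles(text, style_map=STYLE_MAP):
    def repl(match):
        tag = style_map.get(match.group(1))
        if tag is None:
            # Leave unmapped styles untouched for manual review.
            return match.group(0)
        return f"<{tag}>{match.group(2)}</{tag}>"
    return re.sub(r"^<ParaStyle:([^>]*)>([^\n]*)$", repl, text, flags=re.MULTILINE)

# Hypothetical sample.
sample = "<ParaStyle:Chapter Title>Loomings\n<ParaStyle:Body>Call me Ishmael.\n"
print(convert_paragraph_styles(sample))
```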

Remaining InDesign Tags

Search for remaining InDesign tags. Replace the tag with the appropriate ScML style or delete it.

Find: <[^>]*:[^>]*>
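
To get an overview of what is left, the matches for this search can be grouped by tag name. This is a sketch only: the function name remaining_tag_names and the sample are hypothetical, and the tag name is taken to be the text between "<" and the first colon.

```python
import re
from collections import Counter

# Same pattern as the Find expression above.
REMAINING_TAG = re.compile(r"<[^>]*:[^>]*>")

def remaining_tag_names(text):
    """Count leftover InDesign tags by name (the text before the first colon)."""
    tags = REMAINING_TAG.findall(text)
    return Counter(tag[1:].split(":", 1)[0] for tag in tags)

# Hypothetical sample.
sample = "<ct>Loomings</ct>\n<ParaStyle:Body>Call me <cUnderline:1>Ishmael<cUnderline:>.\n"
for name, n in remaining_tag_names(sample).most_common():
    print(n, name)
```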

Images

Place callouts for images in the appropriate locations.

<fig><img src="imagename.jpg"/></fig>

Note: If a logo image is part of the title page, compose it as bkpub (or bkpub1, if necessary) rather than fig. If a logo image is part of the copyright page, compose it as crtf (or a different crt style, if necessary) rather than fig.

Structure Indicators

If required, place structure indicators.

Example:

<structure>{~?~ST: begin chapter}</structure>

and

<structure>{~?~ST: end chapter}</structure>

sam Tags and Validation

Add sam Tags and DOCTYPE Declaration

Add the following text to the beginning of the file.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://www.scribeproduction.com/datafiles/dtd/scml.css" type="text/css"?>
<!DOCTYPE sam PUBLIC "-//Scribe Inc.//DTD sam v1.3.0//EN" "http://scml.scribenet.com/dtds/current/sam.dtd">
<sam>

Add the following text to the end of the file.

</sam>
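
Before validating, a quick well-formedness check can catch leftover InDesign tags or unescaped ampersands. The sketch below assumes the finished file is meant to parse as XML, as the XML declaration suggests. The names wrap_sam and is_well_formed are hypothetical, and this check does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<?xml-stylesheet href="http://www.scribeproduction.com/datafiles/dtd/scml.css" type="text/css"?>\n'
    '<!DOCTYPE sam PUBLIC "-//Scribe Inc.//DTD sam v1.3.0//EN" "http://scml.scribenet.com/dtds/current/sam.dtd">\n'
    "<sam>\n"
)
FOOTER = "</sam>\n"

def wrap_sam(body):
    """Add the opening lines and the closing sam tag around the converted text."""
    return HEADER + body + FOOTER

def is_well_formed(document):
    """Return True if the document parses as XML (no DTD validation)."""
    try:
        ET.fromstring(document.encode("utf-8"))
    except ET.ParseError:
        return False
    return True

# Hypothetical bodies: one converted, one with a leftover InDesign tag.
good = wrap_sam("<ct>Loomings</ct>\n")
bad = wrap_sam("<ParaStyle:Body>Call me Ishmael.\n")
print(is_well_formed(good), is_well_formed(bad))
```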

Validation and sam QC

Validate the file.

Note: To validate, set up Sublime Text as indicated here and use the validation options under Build > XML: DTD Validation. You can also upload your file to the Digital Hub and address the errors it lists.

Once the file is valid, review the file using the .sam QC Checklist.