Procedures/tools for Transcoding Cross-Platform Legacy Text to Unicode Using TECkit

1. Elaborating the cross-platform legacy encoding

One of the most difficult tasks in transcoding (particularly when working cross-platform between Macintosh and Microsoft Windows) is discovering the comprehensive details of the glyph-order (legacy) encoding. A confusing trick from the past that facilitated use of a single font cross-platform was to create the font as a 'symbol' font, which prevented programmes from using the 'cmap' tables in the font that pointed to different glyphs for the same code depending upon platform. Hence extracting the cmap tables using a tool like TTX may not be altogether helpful (example command-line syntax: ttx -t cmap KHMER.ttf). To understand the structure of the tables generated by TTX, you need to look at the TrueType specification (http://developer.apple.com/fonts/TTRefMan/).

One could also open the font in a font editing programme, but the ID numbers displayed there might not match the codes of the source encoding. My preferred (inexpensive!) TrueType font editor is Font Creator Program 4 (http://www.high-logic.com/). The king of font editors is, of course, FontLab (available on both Macintosh and Microsoft Windows OSes), which you should get if you can afford it. Fontographer is another option, but it has not been improved for many years...instead it has been treated like a cash cow by Macromedia.

Probably the best idea is to create a sample text file (a 'legacy code list') that lists all codes, and open it in a text editor on the source platform using the font whose encoding one seeks to elaborate. It is useful to generate tab characters [chr(9) in Perl] to separate the code number from the glyph it generates, and to place on either side of the glyph some character that helps to distinguish vertical and horizontal glyph variants (possibly parentheses such as these). A programme that combines the above with counts of the number of instances of each character is legacystats.pl. (Caution: *.pl file types cannot be accessed from this Web site, so they have all been given an additional *.txt extension.) Note that the incoming and outgoing file names are hard-coded in the Perl script. The incoming text file should be a long sample of text entered using the legacy encoding/font (the source text file). The output file is the 'legacy code list'.
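By way of illustration only, a minimal Perl sketch of the idea (this is not the actual legacystats.pl; the hard-coded file names and the exact output layout here are invented):

#!/usr/bin/perl
# Sketch: read a legacy-encoded sample as raw bytes, count each code, and write a
# tab-separated 'legacy code list': code <tab> (glyph) <tab> number of instances.
use strict;
use warnings;

my $in  = 'source_sample.txt';      # hypothetical hard-coded input (source text file)
my $out = 'legacy_code_list.txt';   # hypothetical hard-coded output (legacy code list)

open my $IN,  '<', $in  or die "Cannot open $in: $!";
open my $OUT, '>', $out or die "Cannot open $out: $!";
binmode $IN;                        # treat the legacy text as raw bytes

my %count;
my $buf;
while (read($IN, $buf, 4096)) {
    $count{ ord $_ }++ for split //, $buf;
}

for my $code (sort { $a <=> $b } keys %count) {
    # parentheses around the glyph help distinguish vertical and horizontal variants
    print $OUT $code, chr(9), '(', chr($code), ')', chr(9), $count{$code}, "\n";
}
close $IN;
close $OUT;

The output is then viewed (or printed) using the legacy font, as described in step 2.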

2. Match the codes of the legacy font with Unicode characters

The 'legacy code list' output from step 1. (in this case ChuonNathDict_Saw.txt) is then opened in a text editor on the same platform on which the source text file was generated. My preference on Windows is UltraEdit (available from http://www.ultraedit.com/). On Macintosh, Alpha 7 (available from http://www.kelehers.org/alpha/ or ftp://ftp.ucsd.edu/alpha/) is relatively useful and inexpensive, but it has a severe limitation in that one cannot print from it under MacOS 10.3; hence I needed to fall back on the simple TextEdit that comes with the MacOS system in order to print. On the other hand, TextEdit has an annoying way of substituting certain characters from another font (i.e., codes 218, 219, 222, 223, 240) and of filling in characters for code points that do not exist in the font. Hence it may be useful to print out the results from TextEdit and correct the printout manually against what is seen on the screen in Alpha 7. BBEdit 7.1 on Macintosh is another option...but rather expensive.

One advantage of a powerful text editor is the ability to select a problematic character and determine the code behind it...very useful in debugging TECkit output (in UltraEdit select a character, then Edit menu->Hex Edit item; in Alpha 7, Utils menu->Ascii Etc->Get Ascii [returning the character with its decimal/octal/hex values]). Another advantage of an advanced text editor is the constant display of the line/character number where the cursor lies (and a quick means to get to a specific line). Often, however, the problematic character in the output is the default character (anything not understood is converted to it...but it is a bit early to discuss that), which is not too enlightening.

Next print out the Unicode charts that reflect the character set being used in the source text file; in the case of Khmer that is at least http://www.unicode.org/charts/PDF/U1780.pdf. Then write the legacy codes from the 'legacy code list' next to the character each is derived from. For Khmer this would involve, for example, writing the legacy code for the subscript form of KA underneath KA (U+1780). For Khmer it may also be useful to place three columns of boxes to the left of the chart, with the four corners of each box offering space to put the codes of ligatures in parallel with their parent character: base with AA, base with AU, spacing subscript with AA, and spacing subscript with AU. As characters are encountered, be sure a Unicode code point is found for each one (it may also be useful to reference the charts http://www.unicode.org/charts/PDF/U0000.pdf and http://www.unicode.org/charts/PDF/U0080.pdf).

3. Creating an algorithm to convert to/from Unicode

TECkit, written by Jonathan Kew of SIL (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit), is an excellent tool (for Macintosh and Microsoft Windows) that allows conversion between legacy encodings and Unicode. The following features make this tool especially attractive: (1) reversibility of the mapping, (2) the ability to convert on the fly into/from any of the common Unicode transformation formats, and (3) the ability to be incorporated as a COM object so as to selectively transcode only text entered in a specific font (especially useful in Microsoft Word) without affecting formatting.

Initial experiments indicate that the forward conversion (to Unicode) of Khmer is usefully broken down into the following passes (a skeletal sample map illustrating several of them follows this list):

a. Separate ligatures within the legacy encoding. Sometimes codes standing for characters not encoded in the font are created at this stage, feeding into the next pass. The output of each pass becomes the input into the succeeding pass. Do not break up legacy subscripts at this point into COENG and BASE, but do separate spacing subscript ligatures (maintaining the vertical variants of the subscripts). Do break up the Khmer word 'to give', creating a 'new' code for U+17b1 or U+17b2 QOO.

b. Reorder the legacy encoding into the order: consonant/independent vowel, ROBAT, register shifters, first subscript (in phonetic order), second subscript (in phonetic order), prefix vowel glyphs, remaining vowel glyphs, signs. This may require four or more passes. It may be best at first to divide all non-base characters into two separate classes, incrementally exchanging pairs of characters step by step until they are all in the right order. Keep the vertical subscript variants intact as a means to ensure that the first subscript precedes the second subscript at the end of the reordering.

c. Normalise super/sub-script glyphs and exception substitutions (NYO, COENG RO) within the legacy encoding.

d. Sift pre-base vowels and subscript RO into the appropriate position after the base, remembering that subscript RO is virtually always the second subscript in a two subscript situation.

e. At this point implement an algorithm to insert ZWNJ before register shifter characters that stay in a superscript position despite the presence of a superscript vowel, and determine which register shifter has assumed the form resembling vowel U (U+17bb) in the presence of a superscript vowel, changing its code to the legacy equivalent of MUUSIKATOAN or TRIISAP. Also combine pre-base vowels with post-base vowels, creating 'new' legacy codes for them.

f. Convert between the legacy encoding and the Unicode encoding (this is the only stage where a default character is automatically substituted if an explicit transcoding has not been provided). Note that a number of two-glyph vowels have 'new' legacy codes derived from previous passes; these too are converted into Unicode here. Subscripts require two Unicode values: U+17d2 plus the base consonant/independent vowel.

g. Many lines of information can cause confusion in a script, so it is recommended to organise the script for ease of recognition. For example, liberally fill the script with comments, separate the class definitions from the rules, and put classes and/or rules in a regular order (for example, the order in which characters should appear in a cluster).

The rules are typed up in a text editor into a file with the .map extension.
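To give a feel for the overall shape of such a multi-pass file, here is a skeletal, hypothetical sketch. It is not an excerpt from KSCIIKhmer.map or ArunBiblicalKhmer.map; the byte values (0x41, 0x90, 0xB6, 0xBC, 0xCC and so on) and class contents are invented for illustration, and the exact keywords should be checked against the TECkit language documentation:

; --- header ---
EncodingName      "Example-Legacy-Khmer"        ; illustrative only
DescriptiveName   "Hypothetical legacy Khmer to/from Unicode mapping"
Version           "1"

; --- pass (a): split ligatures, bytes to bytes ---
pass(Byte)
0x90  <>  0x41 0xB6          ; hypothetical KA-with-AA ligature splits into KA + AA

; --- pass (b): reorder within the cluster, bytes to bytes ---
pass(Byte)
ByteClass [cons]     = ( 0x41 0x42 )    ; base consonants (abbreviated)
ByteClass [prevowel] = ( 0xBC )         ; pre-base vowel glyphs
[prevowel]=v [cons]=c   <>   [cons]=c [prevowel]=v    ; move the vowel after its base

; --- pass (f): bytes to Unicode ---
pass(Byte_Unicode)
UniDefault   U+FFFD          ; default character when going to Unicode
ByteDefault  0x3F            ; default character ('?') when going to bytes
ByteClass [cons]   = ( 0x41 0x42 )
UniClass  [consU]  = ( U+1780 U+1781 )
ByteClass [sub]    = ( 0xCC )           ; subscript (coeng) consonant glyphs
UniClass  [subU]   = ( U+1780 )
[cons]=c   <>   [consU]=c
[sub]=s    <>   U+17D2 [subU]=s         ; a legacy subscript becomes COENG + consonant
0xB6       <>   U+17B6                  ; vowel sign AA
0xBC       <>   U+17C1                  ; vowel sign E

Note how the classes are re-declared in each pass, and how the tags (=c, =v, =s) tie corresponding items on the two sides of a rule together; the hints below elaborate on several of these points.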

Some hints that may help one understand the internal workings of TECkit:

a. Classes need to be declared anew for each pass (that may simply require copying and pasting them into the new pass). This could be simplified by creating a Define for the content of the classes. Obviously there is no need to refer to ligatures in later passes if they have all been split into their component parts in the first pass.

b. Passes are not actually numbered, but they do follow each other sequentially. It is probably preferable to have many 'Byte' passes...and only in the single final pass convert to/from Unicode.

c. If there is no rule to cover a particular code in pure 'Byte' or 'Unicode' passes, the incoming code passes through to the next pass. However, the Byte_Unicode (or Unicode_Byte) passes require a rule to catch each and every code entering. If there is no rule...that code will be converted to the default character (see the fragment following this list).

d. Once a code in a certain position is processed, you cannot reprocess that code in the same pass. Neither can you reprocess a code output from a rule in the same pass.

e. One way to avoid 'processing' later codes in a string is to place them only in the trailing context of a rule, rather than among the items actually being replaced.

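As an illustration of point c., a hypothetical Byte_Unicode fragment (byte values invented; the default-character keywords should be verified against the TECkit language documentation). In an ordinary Byte pass a byte such as 0x45 with no rule would simply pass through, but here it is caught by the default:

pass(Byte_Unicode)
UniDefault  U+FFFD       ; forward default: any byte reaching this pass without a rule becomes U+FFFD
ByteDefault 0x3F         ; reverse default: any unmapped Unicode character becomes '?'
0x41  <>  U+1780         ; KA
; a byte such as 0x45, having no rule of its own, is converted to the default character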

4. Next, the *.map file embedding the conversion routines needs to be compiled and files subsequently transcoded

Sample syntax for compiling a *.map file on the command line in Microsoft Windows:

TECkit_Compile ArunBiblicalKhmer.map

On Macintosh I prefer to use the TECkit Mapping Editor for composing/compiling the .map files into .tec files (but *.tec files created on Windows may be used). There may be a degree of debugging that needs to be done in response to error codes generated at this point, but this seems relatively straightforward...even easy if you use a text editor that will take you directly to a numbered line.

In order to transcode text from a text file on the command line in Microsoft Windows follow this sample syntax:

TxtConv -i Matthew.txt -o Matthew_Unicode.txt -t ArunBiblicalKhmer.tec -of UTF8

Note that you might get a stray BOM (byte order mark) appearing as the first character in your file and an end-of-file mark at the end. Note also that I set the output transformation format to UTF8; although most programmes on Windows use a form of UTF16, they will generally convert from UTF8 on the fly (and UTF8 is more useful for Web or Unix use).
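If the stray BOM causes problems downstream, one possible way to strip it (a sketch only, assuming UTF8 output as above and that Perl is available; this does not touch the end-of-file mark):

perl -i.bak -pe "s/^\xEF\xBB\xBF// if $. == 1" Matthew_Unicode.txt

This removes the three UTF8 BOM bytes from the start of the first line, leaving a backup copy with a .bak extension.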

On the Macintosh use DropTEC, dragging the appropriate files into their matching windows (this tool is also available on Windows).

5. Debugging

It is a bit tedious, but I have found the creation of a test file from a spreadsheet most useful in detecting whether the *.map (i.e., compiled *.tec) file is functioning properly. Starting from the Unicode charts on which legacy codes have been written, insert the appropriate functions one row at a time, using legacy codes representing individual characters and related forms (subscripts and ligatures, for example) in parallel columns. Hence, KA would be represented as =CHAR(65) in the KSCII encoding. In a parallel column it might be necessary to insert a ligature form of KA (=CHAR(128)). For a subscript, however, you would have to insert a base character on which it could 'hang' (=CONCATENATE(CHAR(65),CHAR(193))). For two-glyph vowels you would have to insert a pre-base vowel form, a base, as well as the post-base vowel form (=CONCATENATE(CHAR(188),CHAR(65),CHAR(174))). Very complex disordered clusters could be constructed in the same way to test out the reordering mechanism. This spreadsheet would then be exported as text and used as an input file (after the -i flag of TxtConv).

Also useful has been the temporary separation of passes, running a single pass against a test file appropriate to the functions of that pass (note that the header needs to be included in that test *.map file as well).

Some of the things that can go wrong...with the benefit of hindsight and some hand-holding by that kind man, Jonathan: (1) the source (or *.map) file might inadvertently be converted to Unicode, so that none of the rules would apply, with the introduction of all those 'null' bytes [so far as a single-byte encoding is concerned]; (2) the rules may be too specific...hence skipping combinations present in the incoming text file; (3) some glyph variants might be accidentally omitted from the classes used by the rules [be sure to have a separate class for those variants]; (4) if a number of classes are placed in sequence inside parentheses in a rule, the rule expects a character from each class; normally they should instead be separated by a vertical bar in rules (but not in definitions of groups of classes); see the fragment below.
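For instance, a hypothetical fragment illustrating point (4), with invented byte values (the grouping syntax should be checked against the TECkit language documentation):

pass(Byte)
ByteClass [cons]     = ( 0x41 0x42 )    ; base consonants
ByteClass [subTall]  = ( 0xC1 )         ; tall variant of a subscript glyph
ByteClass [subShort] = ( 0xD1 )         ; short variant of the same subscript

; Wrong: classes in sequence inside parentheses expect one character from EACH class,
; so this would only match a base followed by BOTH variants in a row:
;    [cons]=c ( [subTall] [subShort] )   >   [cons]=c 0xC1
; Right: a vertical bar means one character from EITHER class:
[cons]=c ( [subTall] | [subShort] )   >   [cons]=c 0xC1    ; normalise to the tall variant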

6. Thoughts on Khmer

One could simplify this for multiple legacy encodings by converting each legacy encoding to a common 'normalised' one-byte encoding, and then doing all the reordering and Unicode conversion from that 'normalised' one-byte encoding with a shared set of passes. It should be very easy to convert between legacy encodings (except in Unicode_Byte mode, where different fonts handle ligatures very differently; for example, under KSCII one needs to insert a ligature space at an appropriate place after every base-vowel ligature and a small swish to form AU out of AA [even when ligatured]). The hard work in Khmer Byte_Unicode transcodings is the reordering within a cluster, the conversion of register shifter characters (and the possible insertion of ZWNJ), and potentially the distinguishing of COENG DA from COENG TA.

A couple of working sample files (still in rather rough shape) are KSCIIKhmer.map and ArunBiblicalKhmer.map.