User Friendly Recoding of Legacy Text in RTF/MIF

A Masters Project in Fulfillment of Requirements for CS693

Adviser: Dr. Yonglei Tao, Grand Valley State University

Author: Maurice Bauhahn

Date: 22 April 1998, 11:52am Wednesday

Summary

This paper describes the rationale, difficulties, goals, and processes involved in creating a tool to translate legacy representations of natural languages (in the context of two text formatting languages, Rich Text Format and Maker Interchange Format) to or via Unicode. The JJTree and JavaCC tools are used to generate lexers, parsers, and parse trees, with Khmer legacy encodings as examples. Microsoft Visual C++ 5.0 is used to build the graphical user interface (particularly drag-and-drop of glyphs).

Why Recode?

Encoding is the representation of a glyph or character as a number for the computer's benefit. It is a very compact and useful way to store text (more manageable than a scanned picture of each character, for example). Most legacy systems use only one byte (8 bits) for this purpose. That allows at most 256 distinct codes in plain text, and fewer than 256 characters, because some of these codes are used for control purposes (tab, carriage return, line feed, end of file, et cetera). A computer does not 'recognize' any character; it only uses a different number to represent each character, and then uses the matching number in a font to reference a picture with which to display that character.

Significant problems develop when different fonts or computers use different sets of numbers to represent the same alphabet. That problem has largely disappeared for English due to adoption of the widely used ASCII encoding (the first 128 codes, 0 through 127). But after English and the control codes have used 128 of the 256 codes there is not much room left to represent other languages (only a few European languages, in fact). The real solution is to use the two byte Unicode encoding system (consisting of slightly fewer than 65,536 code points) or the four byte ISO10646 encoding system (which allows almost unlimited expansion of unusual Chinese/Japanese/Korean characters). Unicode and ISO10646 are identical for the first 65,536 code points (meant to encode the major scripts of the world). These two schemes are still very young (and incomplete), so much textual data is being accumulated in incompatible and overlapping encodings based on one byte. Tools are needed to convert this legacy text to compatible (and eventually non-overlapping) representations, on a font by font basis in many cases.

Legacy systems often encoded glyphs (discrete components of writing usually meant for display). Unicode, on the contrary, normally matches a character (alphabetic entity) with each numeric code (in phonetic order), and lets the display mechanism deal with distributing the various glyph components to the requisite display positions (before/above, before/after, above/below). For some languages (e.g., Khmer) two or more glyphs may be needed to display one character.

Non-Roman script texts in many incompatible legacy number-character linkings need to be made shareable via one standardized representation.

Unicode/ISO10646 is the migration goal for a world standard uniform method of character identification. Until it is more widely available for end-user manipulation it can be used within the parser as a comprehensive intermediary encoding between legacy number-character pairings.
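The idea of Unicode as an intermediary can be illustrated with a minimal Java sketch. It assumes two hypothetical legacy fonts ("FontA" and "FontB") whose byte-to-letter bindings are invented here purely for illustration; only the Unicode code points (U+1780 KHMER LETTER KA and U+1781 KHMER LETTER KHA, as assigned in the published standard) are real. The class and method names are likewise assumptions, not part of the project's code.

```java
import java.util.HashMap;
import java.util.Map;

class LegacyRecoder {
    // Hypothetical per-font tables: legacy byte value -> Unicode code point.
    private static final Map<String, Map<Integer, Integer>> FONT_TABLES = new HashMap<>();

    static {
        Map<Integer, Integer> fontA = new HashMap<>();
        fontA.put(0x6B, 0x1780); // FontA binds byte 0x6B to KHMER LETTER KA (invented binding)
        fontA.put(0x78, 0x1781); // FontA binds byte 0x78 to KHMER LETTER KHA (invented binding)
        FONT_TABLES.put("FontA", fontA);

        Map<Integer, Integer> fontB = new HashMap<>();
        fontB.put(0x6B, 0x1781); // FontB binds the same byte 0x6B to KHMER LETTER KHA
        fontB.put(0x4B, 0x1780); // FontB binds byte 0x4B to KHMER LETTER KA
        FONT_TABLES.put("FontB", fontB);
    }

    // Recode a run of legacy bytes, interpreting them according to the named font.
    static String toUnicode(String fontName, byte[] legacy) {
        Map<Integer, Integer> table = FONT_TABLES.get(fontName);
        if (table == null) {
            throw new IllegalArgumentException("No table for font: " + fontName);
        }
        StringBuilder out = new StringBuilder();
        for (byte b : legacy) {
            Integer codePoint = table.get(b & 0xFF); // treat the byte as unsigned
            // Bytes without a binding become U+FFFD REPLACEMENT CHARACTER.
            out.appendCodePoint(codePoint != null ? codePoint : 0xFFFD);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] sameBytes = {0x6B};
        // The same stored byte names a different letter under each font.
        System.out.println(toUnicode("FontA", sameBytes).equals("\u1780"));
        System.out.println(toUnicode("FontB", sameBytes).equals("\u1781"));
    }
}
```

The point of the sketch is only that the font name must travel with the bytes: once both runs are recoded to Unicode, the font distinction no longer carries meaning and the text can be shared.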

Wouldn't it be Easier to Parse Plain Text?

Legacy texts may contain multiple incompatible single byte encodings coexisting even in individual words; these disparate representations should be unified for they would become ambiguous and corrupted if exported as plain text.

Half of the effort of producing a document may be in its formatting; that effort should be preserved, not reinvested.

Why a (User Friendly) GUI?

Most computer users are losing the ability to use command line interfaces, so it is appropriate that an application intended for wide use should have a user friendly graphical user interface (GUI).

Hopefully an application could be made simple enough so that a mildly sophisticated user could customize the recoding of documents on a given computer. This would be carried out in the context of the particular font situation on that computer.

Display of non-Latin characters makes their identification and use much easier and more accurate.

Nevertheless, some characteristics are not visible or distinguishable (character width, character offset, and invisible codes such as those for word demarcation, non-ligation, and conjoining).

Code page changes may vary the binding of a specific glyph between various code points, but that is not considered in this first implementation.

Drag and drop technology facilitates the use of characters (there are many inconsistent keyboard layouts and it is difficult to locate one of 200+ characters using the keyboard).

Is It Necessary or Possible to Generalize Such a Complicated Activity?

There are so many encodings and variations between generations of fonts having the same name that it is difficult, if not impossible, to supply a stock recoder that would meet the needs of every desktop. Therefore the end user must be introduced to character level translation.

Each script requires:

A framework file with all possible glyphs.

A grammar with rules to convert glyph codes to an intermediate encoding (Unicode).

A grammar with rules to convert glyph codes from an intermediate representation (e.g., Unicode) to another encoding (if necessary). The default is to emit a Unicode encoding. This is not provided in the preliminary implementation.

The end user visually matches reference glyphs with source font glyphs, and drags the latter to a box next to the former. One glyph may not be dragged to two locations (that would create ambiguity). Two source font glyphs may not be combined to match one reference glyph. The language grammar is supposed to supply the 'intelligence' to properly handle composition, decomposition, and look-alike characters.
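The constraint that one source glyph may not be dragged to two locations can be sketched as a small bookkeeping class. This is a hypothetical illustration in Java, not VisualFlex code (which is written in Visual C++); the class name, method names, and the reference names "KA" and "KHA" are assumptions. It also assumes that dropping a new source glyph on an already filled box simply replaces the earlier one.

```java
import java.util.HashMap;
import java.util.Map;

class GlyphAssignments {
    // Which reference glyph currently owns each source code, and the reverse.
    private final Map<Integer, String> referenceBySource = new HashMap<>();
    private final Map<String, Integer> sourceByReference = new HashMap<>();

    // Try to bind a source glyph code to a reference glyph.
    // Returns false when the source code already belongs to another reference
    // glyph, since accepting it would make the recoding ambiguous.
    boolean assign(String referenceName, int sourceCode) {
        String owner = referenceBySource.get(sourceCode);
        if (owner != null && !owner.equals(referenceName)) {
            return false;
        }
        // Assumption: a box holds one source glyph, so a new drop replaces the old.
        Integer previous = sourceByReference.put(referenceName, sourceCode);
        if (previous != null && previous != sourceCode) {
            referenceBySource.remove(previous); // release the replaced code
        }
        referenceBySource.put(sourceCode, referenceName);
        return true;
    }

    // The source code bound to a reference glyph, or null if the box is empty.
    Integer sourceFor(String referenceName) {
        return sourceByReference.get(referenceName);
    }

    public static void main(String[] args) {
        GlyphAssignments boxes = new GlyphAssignments();
        System.out.println(boxes.assign("KA", 0x6B));  // accepted
        System.out.println(boxes.assign("KHA", 0x6B)); // rejected: 0x6B is already bound to KA
    }
}
```

Keeping both directions of the mapping makes the ambiguity check a single lookup, and it leaves composition and decomposition entirely to the language grammar, as the text requires.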

Most Java command-line functions can be achieved using menu commands of the VisualFlex application.

Why Maurice Bauhahn?

Domain expert in the Khmer script (writing system used in Cambodia) and programmer.

Author of the Khmer amendment to ISO10646 (approved by the relevant International Organization for Standardization working group, JTC 1/SC 2/Working Group 2, the week of 16 March 1998). The proposal and submission are referenced at the bottom of http://www.csis.gvsu.edu/~bauhahnm/Encoding.html . Subsequent to my proposal there has been some renaming of characters (and removal of one dotted circle character present elsewhere in Unicode); otherwise it stands as submitted.

Committed to creation of such a tool for several years.

Possess substantial textual data in digital format that requires such translation:

Compiled 1,617 page French/Khmer/Vietnamese/English Medical Dictionary (4th Dimension database, Macintosh)

Previously responsible for production of Cambodia's primary and secondary textbooks…over thirty books (FrameMaker 4.0, Macintosh)

1,888 page Khmer-Khmer Dictionary (Choen Nath, Compiler) (FrameMaker 3.0, Macintosh)

Two versions of the Khmer Bible (FrameMaker 3.0, Macintosh)

Involved in large Theological Education by Extension publishing effort of Christian and Missionary Alliance (Khmer, Various Macintosh programs)

Cathedral in the Rice Paddy by François Ponchaud (Khmer version, Microsoft Word 5.01a, Macintosh)

A large Khmer/English legal dictionary from The Asia Foundation (4th Dimension, Macintosh)

Have implemented Khmer plain text recoding, automated insertion of word demarcation, and sorting of Khmer using the Summer Institute of Linguistics' Consistent Changes program.

Committed to production of a Khmer Bible Concordance which relies on some of this technology.

The Complications:

Khmer (of which the presenter is a domain expert and author of the current Khmer amendment to ISO10646) serves as an example of one of the most complicated languages to use on a computer:

There are ten factors that can make a language difficult to use on a computer if their implementation is complex: (1) coding/storage, (2) searching, (3) vertical positioning, (4) horizontal positioning, (5) accurate entry, (6) proper sorting, (7) word demarcation, (8) line layout, (9) word-wrap/justification, and (10) display of glyphs.

Almost all of these are a problem with Khmer. Display codings can easily exceed the character codes available in a one byte scheme, overlapping ASCII. Searching can be made difficult by a varying order of the glyphs of a given word (the seemingly identical visual appearance of a set of superscripts and subscripts may belie a much different ordering of numbers). Superscripts and subscripts may have to be moved up or down depending on the degree of visual overlap with other glyphs. Only horizontal positioning is relatively easily attained: all conjoined characters line up on the right margin and are not horizontally centered in positions over or under other characters. Data entry is difficult due to the large number of characters, the difference between display order and phonetic order, and the presence of multiple discontiguous glyphs to represent a single character. Sorting requires a complicated algorithm based on the weighting of different types of characters. It is not natural for native users to enter word breaks since they are invisible (hence a dictionary lookup scheme is needed to insert them automatically). Line layout is difficult in Khmer because characters entered (phonetically) after other characters sometimes have to be displayed before them. Word wrap and justification are difficult due both to the lack of easy word demarcation and to the need to increase/decrease spaces between phrases and not words (a matter relating to the sense of a phrase…something the computer does not understand). The complex glyphs needed to represent Khmer are very different from those of any other script. Before the days of high resolution bit-mapped screens it was nearly impossible to represent Khmer.

Legacy encodings in glyph order differ from phonetic orderings of standards

Discontiguous multiple glyphs of legacy encodings need to be unified into single code in destination encoding

A single glyph in legacy encoding may require expansion into multiple destination codes (or intermediate glyph codes)

Multiple fonts may be used in a single word, due to (1) the large number of glyphs being distributed over multiple fonts, (2) the unintentional absence of glyphs from a given font, or (3) the user being unable to locate a certain character in the font they are using.

The insertion of formatting codes inside words (due to the mixing of fonts within words) can lead to incomplete tokenizing of grammatical units

Khmer has no visible word breaks, so word breaks are seldom entered, even though they are needed for word-wrap and other functions

Automatic insertion of word breaks requires dictionary lookup facilities

Legacy fonts may have inconsistent encodings between different sizes of the same font family

Source fonts sometimes have Latin characters not present in destination Khmer fonts.

Conventional technology for lexing and parsing is inadequate

Lex, Flex, JLex, Yacc, JavaCUP, Bison, et cetera are inadequate because they are not both cross-platform and Unicode savvy (and they are uniformly lacking the latter). JLex and JavaCUP are described at: http://www.cs.princeton.edu/~appel/modern/java/index.html#software

Only Java Compiler Compiler (JavaCC) supports lexing and parsing of Unicode. The latest release of JavaCC is Version 0.8pre1. This version was released on April 21, 1998. It generates top-down (recursive descent) parsers and allows parsing to any non-terminal in the grammar specification. Although by default it does single lookahead, specification grammars can be modified to do multiple syntactic or semantic lookaheads at the statement level. Features are summarized at http://www.suntest.com/JavaCC/features.html . The unique characteristics of this latest version are elaborated at http://www.suntest.com/JavaCC/DOC/relnotes08.html (moving away from deprecated InputStream, instead using InputStreamReader of Java JDK 1.1). The split grammars of Version 0.8pre2 are eagerly anticipated.

Conventional lexing does not support caching of tokens, reordering of tokens, preservation of the identity of source tokens after unification, or decomposition of tokens (special actions are needed to achieve these). Parsers at least preserve the identity of tokens (and allow their decomposition) when implemented with parse trees, but these operations still have to be undertaken in actions.

Multiple lexers and parsers need to be integrated with a parse tree to coordinate multiple encodings and prune excess formatting branches

Object-oriented lexing and parsing would be beneficial to encapsulate identically named variables in multiple lexers/parsers (via packages)

A Java-based tool allows cross-platform use of the resulting translators

Instead of code generation, the final output will be an RTF or MIF file

Rich Text Format or Maker Interchange Format deficiencies:

No parsing grammars are publicly available for either

These are very large and complicated text formatting languages

Various generations of these languages are quite different from each other

Maker Interchange Format still does not support Unicode

Display of Khmer Unicode encoding is lagging:

No program which reads MIF or RTF file formats can at this time display a complicated Indic script such as Khmer encoded in phonetic order. The Khmer aspects of this proposal anticipate this functionality.

Only Apple's GX fonts and Java 1.2 with Java2D offer adequate display technology at this time (although they have not yet been implemented for Khmer), and no program which supports MIF and RTF takes advantage of either GX fonts or Java 1.2

Further Problems to Resolve:

Even JavaCC poses significant obstacles to lexing and parsing Khmer. Although tokens can be generated by actions within a token, that parent token itself cannot self-destruct. Furthermore, any tokens emitted by actions within a parent token precede the parent token in the output stream. To overcome these obstacles, tokens must be able to change their own identity. To create a new sequence with itself as the leader, the parent token can clone itself in its first production, emit a number of other tokens, and finally change its own identity into the last item required. In order to emit other tokens, the token class needs to contain, in a newToken procedure, the names of ALL possible tokens. To hold special information such as the Unicode value, special variables need to be embedded within the token itself (rather than simply declared as TokenManager variables in the TOKEN_MGR_DECLS block).
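The clone, emit, and rename sequence can be simulated in plain Java. The sketch below does not use the JavaCC-generated Token or TokenManager classes; SelfRenamingToken, cloneAs, and decompose are hypothetical names, and the token names XL_BA_AA, ZBA, and ZAA are invented examples that merely follow the prefix conventions of this paper. It assumes that emitted tokens reach the output stream ahead of the parent, as described above.

```java
import java.util.ArrayList;
import java.util.List;

class SelfRenamingToken {
    String kind;        // token name; mutable so the token can change its identity
    final String image; // matched source text
    int unicode;        // extra data carried inside the token, not in the token manager

    SelfRenamingToken(String kind, String image, int unicode) {
        this.kind = kind;
        this.image = image;
        this.unicode = unicode;
    }

    // Clone this token under a new identity, keeping the source image.
    SelfRenamingToken cloneAs(String newKind, int newUnicode) {
        return new SelfRenamingToken(newKind, image, newUnicode);
    }

    // Decompose a parent token into the sequence kinds[0..n-1].
    // Clones are emitted first (they precede the parent in the stream);
    // the parent then renames itself into the last required item.
    static List<SelfRenamingToken> decompose(SelfRenamingToken parent,
                                             String[] kinds, int[] codePoints) {
        if (kinds.length == 0 || kinds.length != codePoints.length) {
            throw new IllegalArgumentException("need matching, non-empty kinds and code points");
        }
        List<SelfRenamingToken> stream = new ArrayList<>();
        int last = kinds.length - 1;
        for (int i = 0; i < last; i++) {
            stream.add(parent.cloneAs(kinds[i], codePoints[i]));
        }
        parent.kind = kinds[last];        // the parent cannot self-destruct,
        parent.unicode = codePoints[last]; // so it becomes the final token instead
        stream.add(parent);
        return stream;
    }

    public static void main(String[] args) {
        // Hypothetical ligature token decomposed into a consonant and a vowel.
        SelfRenamingToken ligature = new SelfRenamingToken("XL_BA_AA", "b", 0);
        for (SelfRenamingToken t : decompose(ligature,
                new String[] {"ZBA", "ZAA"}, new int[] {0x1794, 0x17B6})) {
            System.out.println(t.kind + " U+" + Integer.toHexString(t.unicode).toUpperCase());
        }
    }
}
```

Carrying the Unicode value in the token, rather than in a shared token manager variable, means each clone keeps its own value even after the parent has been renamed.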

The interaction of a text formatting language parser and one or more text encoding parsers creates significant complexity. The text handed to an encoding parser may be inappropriately chopped, especially when two differently encoded legacy fonts are used in the composition of individual words, resulting in the insertion of text formatting structures in the middle of words. It appears that the Visitor pattern will be required to traverse the abstract syntax tree (AST) formed by the interaction of the formatting language parser and the text encoding parsers. A Visitor can visit objects that do not have a common parent class. Happily the JJTree utility (which comes with JavaCC) supports the Visitor pattern. On the other hand, every Enhanced BNF (EBNF) production will have to have an accept-visitor function, and the visitor classes themselves will have to refer to every EBNF production…not a small task.

Three-tiered tokenizing is being implemented.

The first tier has tokens declared and emitted by VisualFlex as an appendix to Khmer.jjt. It is the tier that recognizes the source encoding. A naming convention associated with this first tier provides that the names of non-Unicode entities have an X_ or X?_ prefix. This applies to (1) display-ordered vowels (and one subscript) which precede their associated consonant, and to glyph variants; (2) subscripts, which have an XS_ prefix [there are no one-code subscripts in the Unicode encoding…these are formed by a combination of <JOENG> and another character]; and (3) ligatures, which have an XL_ prefix [there are no ligatures in Khmer Unicode]. Non-Khmer characters are named with the Unicode name (underscores replace spaces as necessary, and alternate Unicode names are used when the primary Unicode name is not recognized by Americans). Khmer names bear abbreviated Unicode names, if available, with underscores separating multiple components. If VisualFlex has not received a value for a certain glyph (at the moment there are 262 alternative glyphs, and that number will probably double as time goes on), an 'impossible' value is inserted. There are no actions at the level of tier one.

The second tier duplicates the entire first tier and is composed in part of tokens bearing the same names as the first tier, but with a Z prefix. Additional tokens (still with the Z prefix) facilitate decomposition, unification, and Unicode normalization. Actions are associated with tokens on this second level (and the third level). Additional tokens are used to reorder and combine tokens. There is much overlap between the second and the third tier.

The third tier consists of all the pure Unicode characters from the second tier. Only pure Unicode characters are passed from the lexer to the parser.

Major Design Decisions

GUI aspects for the first implementation will be made with Microsoft Visual C++ on Windows NT 4.0 and will be called VisualFlex.

Actions, lexing, parsing, and parse trees will be implemented using Java, JavaCC, and JJTree.

Most processing will be in Unicode. The output (code generation) in this first generation will simulate Unicode; later versions will be able to convert the output to any output encoding.

Priority will be given to encoding, not display (appropriate display technology is not yet available on the Windows platform or in programs which read RTF or MIF).

Code page issues which impinge on encoding will not be considered in the first implementation.

Java applications generated will be command-line driven (but accessible from the VisualFlex menus).

Major components:

RTF.jjt and MIF.jjt Grammar Template files (one each). These are large and complicated. Each is an input file for JJTree, which inserts parse tree building actions into the JavaCC source. The output of JJTree is in turn processed by JavaCC, which generates the lexer and parser for Rich Text Format or Maker Interchange Format text files (analyzing the syntax as it is associated with text and impinges on font use, via construction and translation of a parse tree). The many resulting .java files are then compiled by javac. The output classes RTF or MIF will detect FontSize.class files in the same directory as their own class and convert the corresponding text streams accordingly. The RTF and MIF classes require one parameter (an appropriate document to parse). If a second parameter ("Extract") is supplied, the application will create a table (the next item) listing which font families at what sizes were actually used in the document filtered.

MyDoc.fsu, font/size/usage/translator file for a given document (one each type of document, output of above). This is a simple table listing the font, size, usage and translator classes. This document is opened by the VisualFlex application to suggest source font/sizes needed to be translated.

Khmer.jjt Grammar Template file for each script (this one for Khmer). This file provides a mechanism to condense glyph codes to a minimum essential character set (generally Unicode) and a method to parse them into quanta (as Unicode or another encoding). This early implementation does not do a complete dictionary lookup or generate a third encoding equivalent. It does not contain a list of all possible glyph types that can be used to represent the language, as this is appended to this file by VisualFlex to create FontSizeA.jjt files.

FontSizeA.jjt This is a grammar file for a stated font/size combination (one, generated from above template). Same as above but with specific codes assigned to each possible glyph type (or an impossible combination for those glyph types not represented in this font). Processing of this file by JJTree, JavaCC, and javac yields a parser (and associated classes) to translate a given font and size combination. One or more such files would be called by a translator traversing a RTF or MIF parse tree to handle specific strings. An A after the font size means that this encoding scheme is valid for all sizes of fonts with this name. A comparison of source and destination sizes would allow proportional resizing of replacement formatting.

KhmerOut.jjt Export Grammar Template file for each script (this one is for Khmer). This file contains a list of all possible glyph types that can be exported and Visitor functions that (during traversal of the parse tree) can choose which non-null glyphs to export and in what order. The initial implementation does not provide this functionality (only a Unicode simulation is exported).

KhmerEncodingGUI.vfl, GUI application file for each script (matched with Khmer.jjt template) (one for Khmer). Multiple variants of this file may be stored as foundations for recoding related encoding schemes. A document used to initialize the VisualFlex application containing: name of display fonts, their size, integer index to each font. There follow a long series of datasets containing: Latin script names of all possible glyphs (compatible with the needs of the Khmer.jjt grammar file; Unicode name if available; no spaces allowed), Unicode encoding (if applicable), code(s) to display a reference visual glyph, display font index for that reference glyph, glyph code to display source encoding, font index for that source encoding, code to display destination encoding, font index for that destination encoding.

VisualFlex, a Multiple Document Interface application, which reads in MyDoc.fsu files and reads in and modifies KhmerEncodingGUI.vfl files. It generates FontSize.jjt files one at a time based on the data accumulated in KhmerEncodingGUI.vfl. It displays a limited subset of glyph information in one Window: reference glyph name (Unicode name if possible), reference glyph shape, reference font index, source glyph, destination glyph. As this window is scrolled the targets are filled with data previously saved to/taken from the current KhmerEncodingGUI.vfl document. Source glyphs are either typed in or dragged from a source font glyph list (at the top of this Window the font and font size can be selected [limited to those detected in a recently opened MyDoc.fsu file] as well as [eventually] the code page). Similarly, destination glyphs may be either typed in or dragged from a destination font glyph list (at the top of this Window the font and font size can be selected [unlimited selection] as well as [eventually] the code page). Only a single source font can be processed at one time in the VisualFlex application. The Java parsers (MIF or RTF), however, can take advantage of as many translators as exist in their folder.

The lexer passes individual ordered Unicode character classifications to the parser. It was initially hoped to pass quanta (a quantum consists of a Khmer independent vowel [which has no other characters associated with it] or a consonant [with associated first subscript, second subscript, vowel, primary sign, and sign…, if any, in that order]). However, there are cases where incomplete quanta must be passed (e.g., when a second font is used to complete the glyphs of a quantum which was started by another font). Actions within the lexer recognize the close of a quantum when they encounter another consonant, a number, Khmer punctuation, Latin punctuation, or other Latin script characters. Display order glyphs which precede consonants, variables, and signs are cached in Token Manager switch variables. The associated display order glyphs (those that visually precede a consonant) are cached in a round robin fashion via three stages. Subscripts are reordered as necessary on the second tier token level, and then (other than 'RO', which displays before its associated consonant) are emitted as (<JOENG> consonant) third-tier token pairs.
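The caching of a display-order vowel until its quantum closes can be sketched as follows. This is a simplified Java illustration under stated assumptions, not the lexer itself: the input is assumed to be already recoded to Unicode code points but still in display order; only the three left-positioned vowels U+17C1 to U+17C3 are treated as preceding their consonant; subscripts are assumed to arrive already as pairs of U+17D2 (the sign this paper calls <JOENG>, published in Unicode as KHMER SIGN COENG) plus a consonant; and the subscript 'RO' special case, the round robin cache, and signs are all omitted. The class and method names are hypothetical.

```java
class QuantumReorderer {
    static final char COENG = '\u17D2'; // subscript-forming sign (<JOENG> in this paper)

    // Khmer consonants occupy U+1780 (KA) through U+17A2 (QA).
    static boolean isConsonant(char c) {
        return c >= '\u1780' && c <= '\u17A2';
    }

    // Vowel signs E, AE, and AI are displayed to the left of their consonant.
    static boolean isPreposedVowel(char c) {
        return c >= '\u17C1' && c <= '\u17C3';
    }

    // Convert display order (vowel, consonant, subscripts) to
    // phonetic order (consonant, subscripts, vowel).
    static String toPhoneticOrder(String display) {
        StringBuilder out = new StringBuilder();
        char pendingVowel = 0;     // cached vowel waiting for its quantum to close
        boolean inQuantum = false; // true once the current quantum has its consonant
        for (int i = 0; i < display.length(); i++) {
            char c = display.charAt(i);
            if (isPreposedVowel(c)) {
                // A new preposed vowel closes the previous quantum.
                if (pendingVowel != 0) out.append(pendingVowel);
                pendingVowel = c;
                inQuantum = false;
            } else if (c == COENG && i + 1 < display.length()
                    && isConsonant(display.charAt(i + 1))) {
                // Subscript pair stays attached to the current consonant.
                out.append(c).append(display.charAt(++i));
            } else if (isConsonant(c)) {
                // Another consonant closes a quantum that already has one.
                if (inQuantum && pendingVowel != 0) {
                    out.append(pendingVowel);
                    pendingVowel = 0;
                }
                out.append(c);
                inQuantum = true;
            } else {
                // Numbers, punctuation, Latin text: close the quantum and pass through.
                if (pendingVowel != 0) {
                    out.append(pendingVowel);
                    pendingVowel = 0;
                }
                out.append(c);
                inQuantum = false;
            }
        }
        if (pendingVowel != 0) out.append(pendingVowel); // end of input closes the quantum
        return out.toString();
    }

    public static void main(String[] args) {
        // Display order: vowel E, consonant KA, subscript MO.
        String phonetic = toPhoneticOrder("\u17C1\u1780\u17D2\u1798");
        System.out.println(phonetic.equals("\u1780\u17D2\u1798\u17C1"));
    }
}
```

A single pending slot is enough for this reduced case; the three-stage cache described above would be needed once several kinds of preceding glyph can be outstanding at the same time.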

Construction of an abstract syntax tree (AST) within the parser commences bottom up with independent_consonant() or a series (of whatever is available) in the order: consonant() subscript_consonant1() subscript_consonant2() vowel() primary_sign() sign() as the leaves. These leaves are grouped on quantum() nodes (at most one of each leaf per node). One or more demarcation characters (including punctuation, Latin characters, numbers, et cetera) may complete the node.

If contiguous Khmer text derived from two differently encoded fonts has been normalized to one encoding, a unification visitor class could strip out unnecessary text formatting commands [those that separated two identically styled fonts] that may be arbitrarily dividing quanta or words. It would also be advantageous if it were possible to delete unnecessary end of line markers, which were often inserted mid-paragraph…but that probably requires too generous a dose of artificial intelligence (unless every true paragraph began with a tab character or was preceded by a blank line, in which case a paragraph() node would be advisable).

A series of quanta compose the word() node. These word() nodes may be subdivided or consolidated by word demarcation visitor classes according to dictionary lookups. Such a word demarcation visitor is needed to insert demarcation characters at appropriate points if they are missing (but not before punctuation, for example). The algorithm would be based on a four pointer system. The first pointer would point at the beginning of a set of glyphs. The second pointer denotes the end of an enclosed word, extending one additional glyph at a time. The glyph(s) surrounded by these first two pointers would be compared against a dictionary of words; if they matched, the second pointer would be extended one more glyph. The goal is to have the longest match possible. If at some point in the cycle of extending the second pointer there is no longer a match, the set of glyphs between the first pointer and the previous position of the second pointer is considered to enclose a word. Simultaneously with the above, a third pointer is positioned before the second quantum, and a fourth pointer marks the end of the second quantum. The quantum is compared against a dictionary. If there is no match for the beginning of a word in a dictionary, both the third and the fourth pointers are advanced one quantum. If there is a match, the fourth pointer is advanced in a manner similar to the movement of the second pointer, enclosing and comparing against a dictionary of Khmer words. The longest match achieved by the first or second pair of pointers is assumed to constitute a word, with word demarcation characters inserted after, or before and after, as appropriate. This algorithm allows the 'coining' of unrecognized words (which could be set off with a different color style or in some other fashion, if appropriate). There were many false starts before the above scenario was settled upon. Attempts to restrict the complexity to lower levels failed.
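The four pointer scheme can be approximated by the following Java sketch, offered as one possible reading rather than the project's implementation. It makes several assumptions: quanta are represented as plain strings (Latin letters stand in for Khmer quanta), the dictionary is an in-memory set, and each pointer pair keeps the longest dictionary match found while extending to the end of the input instead of stopping at the first failed extension. An unrecognized quantum is set off on its own as a 'coined' word. The names WordDemarcator, longestMatch, and segment are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WordDemarcator {
    // Length, in quanta, of the longest dictionary word starting at 'start'
    // (0 if none). This plays the role of one pair of pointers.
    static int longestMatch(List<String> quanta, int start, Set<String> dictionary) {
        int best = 0;
        StringBuilder candidate = new StringBuilder();
        for (int end = start; end < quanta.size(); end++) {
            candidate.append(quanta.get(end));
            if (dictionary.contains(candidate.toString())) {
                best = end - start + 1;
            }
        }
        return best;
    }

    // Split a run of quanta into words, comparing a match that starts at the
    // current quantum with one that starts at the following quantum.
    static List<String> segment(List<String> quanta, Set<String> dictionary) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < quanta.size()) {
            int first = longestMatch(quanta, i, dictionary);      // pointers one and two
            int second = (i + 1 < quanta.size())
                    ? longestMatch(quanta, i + 1, dictionary)     // pointers three and four
                    : 0;
            if (first == 0 || second > first) {
                // No match here, or the later match is longer: set this quantum
                // off by itself (a 'coined' word) and continue from the next one.
                words.add(quanta.get(i));
                i++;
            } else {
                words.add(String.join("", quanta.subList(i, i + first)));
                i += first;
            }
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("ab", "bcd"));
        // The match starting at the second quantum ("bcd") is longer than "ab".
        System.out.println(segment(Arrays.asList("a", "b", "c", "d"), dictionary));
    }
}
```

A word demarcation visitor could apply the same comparison while walking quantum() nodes, inserting a demarcation character at each boundary that segment() reports.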

Composition of RTF.jjt or MIF.jjt Grammar File Templates

First section: Options (set STATIC=false to allow recursion; JAVA_UNICODE_ESCAPE=true to allow \u0000 style handling of Unicode encodings; NODE_SCOPE_HOOK=true to allow stepping through tokens [as well as child nodes] in the parse tree, as explained at http://users.ox.ac.uk/~popx/jjtree.html ; and VISITOR=true to cause JJTree to insert jjtAccept() methods into all of the node classes it generates)

Second Section: Java Compilation Unit written in Java

Enclosed between PARSER_BEGIN(RTF or MIF) and PARSER_END(RTF or MIF) expressions

Initially includes a static main() function to declare a command line useable class of parser; later this can be transformed into a GUI type of application

Declares and passes font and font size identification stack ()

Third Section: Productions

Include lexing, parsing (for recursive functions), and parse tree functions with actions to implement much of its intelligence

Fill font identification stack, character style stack, and paragraph style stack.

Call font specific parsers as required

It is important to add first_token and last_token fields to the node classes. This facilitates capture of text embedded between grammatical constructs (http://users.ox.ac.uk/~popx/jjtree.html )

Composition of Khmer.jjt Grammar Template file

First section: Options (set JAVA_UNICODE_ESCAPE=true; NODE_SCOPE_HOOK=true; VISITOR=true, STATIC=false; for reasons as explained above.)

Second Section: Java Compilation Unit written in Java

Enclosed between PARSER_BEGIN(Khmer) and PARSER_END(Khmer) expressions

Initially includes a static main() function to declare a command line useable class of parser for command-line parsing of plain text files (debugging)

Third Section: Productions

Glyph tokens are taken from a list generated by VisualFlex, which will later be appended to the Khmer.jjt Grammar Template file. In that program, glyphs that are not relevant are assigned an unlikely value; those which are relevant are assigned the single value of the source encoding.

All tokens taken from the list generated by VisualFlex are repeated within Khmer.jj with the same name (preceded by 'Z'). This latter set contains all the Java actions connected with each character. Can actions be emitted after the token where they are generated? Erased after the actions?

All multiple-character/multiple-glyph tokens have attached actions that will emit decomposition tokens.

Those glyphs which normally precede the consonant (three vowel glyphs and one subscript glyph) will be 'eaten' and flags set in memory (a buffering function), with all possible combinations tested as unified entities (seven combinations). However, if an in-quantum (consonant) flag is already set, all pre-existing flags (initial subscript, vowel, sign) will be emitted as character tokens in the proper order, with vowel unification as appropriate.

When a consonant (or placeholder or non-alphabetic character) is encountered, all pre-existing flags (initial subscript, vowel, sign) will be emitted as character tokens in the proper order with vowel unification and character correction implemented (for identically shaped KHMER VOWEL SIGN U, alternate form of KHMER SIGN TRUYSAP, or alternate form of KHMER SIGN MUUSIKATOAN ) as appropriate. The register of the previous consonant should have been flagged to facilitate differentiating between these signs.

If a subscript is encountered, it will be emitted as a proper token.

If a second level subscript is encountered before a first level subscript, their order will be switched.

If any vowel or sign glyph is encountered it will be 'eaten' and appropriate flags set.

As a result of the above processing, a set of tokens in proper glyph order will be emitted.

A parse tree will be constructed at this point to package these tokens into individual quanta (a 'quantum' is an independent vowel or a consonant [or placeholder] and associated subscripts, vowel, and signs).

Dictionary look-up algorithms will further unite quanta into words.

Conclusion

There is still a lot to do in this project, to which I am committed. Unfortunately at this stage there is no program to demonstrate. I am learning that even though I am a domain expert in the Khmer language aspects of this exercise, and even though I have implemented much of this functionality using a different tool (the Summer Institute of Linguistics' Consistent Changes), design is still my biggest hurdle (and I have not even started testing!). Continued effort in this project will involve reworking the following texts and code I have produced: specification files for RTF and MIF (over 2800 lines), a 1616 line Khmer specification file, a rudimentary Visual C++ program, and a 2115 line KhmerEncodingGUI.vfl file (tier one).

Parse trees have historically been used to preserve the static structure of a programming language. The present scenario envisions a dynamic restructuring which may extend beyond the limits of current tools. Given the history of this project…one can expect many unforeseen challenges ahead. It is as though there is an overriding structure on top of the text formatting language and the natural language recoding. Suggestions on how to deal with that complexity would be very much appreciated.

Bibliography

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers, Principles, Techniques, and Tools. Reading: Addison-Wesley Publishing Company, 1986. ISBN 0-201-10088-6. 796pp. ("The Dragon Book")

Alfred V. Aho and Jeffrey D. Ullman. Principles of Compiler Design. Reading: Addison-Wesley Publishing Company, 1977. ISBN 0-201-00022-9. 604pp.

Andrew W. Appel. Modern Compiler Implementation in Java: Basic Techniques. Cambridge: Cambridge University Press, 1998. Paperback. ISBN 0-521-58654-2. 398pp. (a more complete edition is anticipated in 1998).

Mike Blaszczak. Professional MFC with Visual C++ 5. Third Edition. Birmingham: Wrox Press Ltd., 1997. Hardcover with CD-ROM. ISBN 1-861000-14-6. 1061pp.

Marshall P. Cline and Greg A. Lomow. C++ FAQS: frequently asked questions. Reading: Addison-Wesley Publishing Company, 1995. ISBN 0-201-58958-3. 461pp.

Frank L. Friedman and Elliot B. Koffman. Problem Solving, Abstraction, and Design Using C++. Reading: Addison-Wesley, 1995? ISBN 0-201-52649-2. 888pp + appendices.

Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Reading: Addison-Wesley Publishing Company, 1995. ISBN 0-201-63361-2. 395pp.

Ken Arnold and James Gosling. The Java™ Programming Language. Reading: Addison-Wesley, 1996. ISBN 0-201-63455-4. 329pp.

Kate Gregory. Special Edition Using Visual C++ 5. Indianapolis: Que Corporation, 1997. Paperback with CD-ROM. ISBN 0-7897-1145-1. 956pp.

Nathan Gurewich and Ori Gurewich. Teach Yourself Visual C++ in 21 Days. Fourth Edition. Indianapolis: SAMS Publishing, 1997. ISBN 0-672-31014-7. 799pp.

David J. Kruglinski. Inside Visual C++. Fourth Edition. Redmond: Microsoft Press, 1997. Paperback with CD-ROM. ISBN 1-57231-565-2. 986pp.

MIF Reference. San Jose: Adobe Systems Incorporated, August, 1997. 336pp.

Patrick Naughton. The Java Handbook. Berkeley: Osborne McGraw-Hill, 1996. ISBN 0-07-882199-1. 424pp.

Fred Pandolfi, Mike Oliver and Michael Wolski. Microsoft Foundation Class 4 Bible. Corte Madera: Waite Group Press, 1996. Paperback with CD-ROM. ISBN 1-57169-021-2. 1132pp.

Rich Text Format (RTF) Specification and Sample RTF Reader Program. Microsoft Technical Support Application Note. RTF Version 1.5. Redmond: The Microsoft Corporation, April, 1997. 157pp.

Bjarne Stroustrup. The C++ Programming Language. Third Edition. Reading: Addison-Wesley, 1997. ISBN 0-201-88954-4. 910pp.

The Unicode Consortium. The Unicode Standard, Version 2.0. Reading: Addison-Wesley Developers Press, 1996. Paperback with CD-ROM. ISBN 0-201-48345-9.

Web sites:

Ask the MFC Pro http://www.inquiry.com/techtips/mfc_pro/index.html

Automata Course http://www.math.grin.edu/~stone/courses/automata

Fm2html http://www.ai.mit.edu/~mpf/homedir/files/proj/Web+Bib/fm2html/fm2html.test/trans.y

Frame Maker Interchange Format (MIF) parser (Perl) http://www.oac.uci.edu/indiv/ehood/mif.pl.doc.html

Introduction to JJTree http://users.ox.ac.uk/~popx/jjtree.html

JavaCC/JavaTree http://www.suntest.com/JavaCC (especially "Description of the JavaCC Grammar File" in subdirectory /doc/DOC/javaccgrm.html and "The README File for JJTreeExamples" in subdirectory /doc/DOC/jjtreeREADME.html)

Java Tree Builder http://www.cs.purdue.edu/homes/taokr/jtb/index.html

Java Test Coverage and Instrumentation Toolkits http://www.glenmccl.com/~glenm/instr/index.html

Khmer Unicode Proposal (Maurice Bauhahn, April 1998) http://www.csis.gvsu.edu/~bauhahnm/Encoding.html and ftp://brookie.csis.gvsu.edu/maurice/KhmerEncoding11.pdf ftp://brookie.csis.gvsu.edu/maurice/KhmerProposal11.pdf

Language Theory http://www.spectra.net/~j_alan/hitech/compiler/comp2.html

MFC Programmer's Sourcebook http://www.codeguru.com

Microsoft http://www.microsoft.com/visualc/

Modern Compiler Implementation… http://www.cs.princeton.edu/~appel/modern/

MSDN Online SDK http://www.microsoft.com/msdn/sdk/changes.htm

RTF Converters ftp://ftp.uni-mannheim.de/packages/WWW/www/converters/rtf/RTF/00-SW-README

Source Code Archive http://www.cuj.com/code/archive.html

Together/J Java design environment http://www.oi.com

Top Down Analysis of the Assignment Statement http://www.cs.um.edu.mt/~hzarb/CSM201/notes/lecture3/node4.html

Unified Modeling Language (UML) Dictionary http://softdocwiz.com/UML.htm

User Friendly Recoding of Legacy Text in RTF/MIF (This document) http://www.csis.gvsu.edu/~bauhahnm/UserFriendlyRecodingLegacy.html

Other Resources:

Adobe FrameMaker version 5.5 (Windows)

Adobe Illustrator version 5.5 (Macintosh)

Frame Technology FrameMaker version 3.0 (Macintosh)

Gateway2000 GP5-166 128mb RAM/4 Gb Hard drive.

Macromedia Fontographer 4.1 (Windows and Macintosh)

Microsoft™ Visual C++ Version 5.0 Professional Edition (Microsoft Developer Network (MSDN) Library was especially helpful)

Microsoft™ Windows NT 4.00 Workstation

Microsoft™ Word 5.01a (Macintosh)

Microsoft™ Word 97 (Windows)

Usenet: news://msnews.microsoft.com microsoft.public.vc.mfc.docview, microsoft.public.vc.events, microsoft.public.vc.mfc, microsoft.public.vc.language; comp.text.frame, comp.compilers.tools.javacc

User Friendly Recoding of Legacy Text in RTF/MIF

Summary

Why Recode?

Non-Roman script texts in many incompatible legacy number-character linkings need to be made shareable via one standardized representation.

Unicode/ISO10646 is the migration goal for a world standard uniform method of character identification. Until it is more widely available for end-user manipulation it can be used within the parser as a comprehensive intermediary encoding between legacy number-character pairings.

Wouldn't it be Easier to Parse Plain Text?

Legacy texts may contain multiple incompatible single byte encodings coexisting even in individual words; these disparate representations should be unified for they would become ambiguous and corrupted if exported as plain text.

Half of the effort of producing a document may be in the formatting of it; that effort should be preserved, not reinvested.

Why a (User Friendly) GUI?

Most computer users are losing the ability to use command line interfaces, so it is appropriate that an application for wide use should have a user friendly graphic user interface (GUI).

Hopefully an application could be made simple enough so that a mildly sophisticated user could customize the recoding of documents on a given computer. This would be carried out in the context of the particular font situation on that computer.

Display of non-Latin characters makes their identification and use much easier and more accurate.

Nevertheless, some characteristics are not visible or distinguishable (character width, character offset, and invisible codes such as those for word demarcation, non-ligation, and conjoining).

Code page changes may vary the binding of a specific glyph between various code points, but that is not considered in this first implementation.

Drag and drop technology facilitates the use of characters (there are many inconsistent keyboard layouts and it is difficult to locate one of 200+ characters using the keyboard).

Is It Necessary or Possible to Generalize Such a Complicated Activity?

There are so many encodings and variations between generations of fonts having the same name that it is difficult, if not impossible, to supply a stock recoder that would meet the needs of every desktop. Therefore the end user must be introduced to character level translation.

Each script requires:

A framework file with all possible glyphs.

A grammar with rules to convert glyph codes to an intermediate encoding (Unicode).

A grammar with rules to convert glyph codes from an intermediate representation (i.e., Unicode) to another encoding (if necessary). The default is to emit a Unicode encoding. This is not provided in the preliminary implementation.

Most Java command-line functions can be achieved using menu commands of the VisualFlex application.

Why Maurice Bauhahn?

Domain expert in the Khmer script (writing system used in Cambodia) and programmer.

Committed to creation of such a tool for several years.

Possess substantial textual data in digital format that requires such translation:

Compiled 1,617 page French/Khmer/Vietnamese/English Medical Dictionary (4th Dimension database, Macintosh)

Previously responsible for production of Cambodia's primary and secondary textbooks…over thirty books (FrameMaker 4.0, Macintosh)

1,888 page Khmer-Khmer Dictionary (Choen Nath, Compiler) (FrameMaker 3.0, Macintosh)

Two versions of the Khmer Bible (FrameMaker 3.0, Macintosh)

Involved in large Theological Education by Extension publishing effort of Christian and Missionary Alliance (Khmer, Various Macintosh programs)

Cathedral in the Rice Paddy by François Ponchaud (Khmer version, Microsoft Word 5.01a, Macintosh)

A large Khmer/English legal dictionary from The Asia Foundation (4th Dimension, Macintosh)

Have implemented Khmer plain text recoding, automated insertion of word demarcation, and sorting of Khmer using the Summer Institute of Linguistics' Consistent Changes program.

Committed to production of a Khmer Bible Concordance which relies on some of this technology.

The Complications:

Example of Khmer (of which the presenter is a domain expert and author of the current Khmer amendment to ISO10646) as one of the most complicated languages to use on the computer:

Legacy encodings in glyph order differ from phonetic orderings of standards

Discontiguous multiple glyphs of legacy encodings need to be unified into single code in destination encoding

A single glyph in legacy encoding may require expansion into multiple destination codes (or intermediate glyph codes)

Multiple fonts may be used in a single word, due to (1) the large number of glyphs being distributed over multiple fonts, (2) the unintentional absence of glyphs from a given font, or (3) the user being unable to locate a certain character in the font they are using.

The insertion of formatting codes inside words (due to the mixing of fonts within words) can lead to incomplete tokenizing of grammatical units

Khmer has no visible word breaks, so word breaks are seldom entered, even though they are needed for word-wrap and other functions

Automatic insertion of word breaks requires dictionary lookup facilities

Legacy fonts may have inconsistent encodings between different sizes of the same font family

Source fonts sometimes have Latin characters not present in destination Khmer fonts.

Conventional technology for lexing and parsing is inadequate

Lex, Flex, JLex, Yacc, JavaCUP, Bison, et cetera are inadequate because they are not both cross-platform and Unicode savvy (and they are uniformly lacking the latter). JLex and JavaCUP are described at: http://www.cs.princeton.edu/~appel/modern/java/index.html#software

Multiple lexers and parsers need to be integrated with a parse tree to coordinate multiple encodings and prune excess formatting branches

Object-oriented lexing and parsing would be beneficial to encapsulate identically named variables in multiple lexers/parsers (via packages)

A Java-based tool allows cross-platform use of the resulting translators

Instead of code generation, the final output will be an RTF or MIF file

Rich Text Format or Maker Interchange Format deficiencies:

No parsing grammars are publicly available for either

These are very large and complicated text formatting languages

Various generations of these languages are quite different from each other

Maker Interchange Format still does not support Unicode

Display of Khmer Unicode encoding is lagging:

No program which reads MIF or RTF file formats can at this time display a complicated Indic script such as Khmer encoded in phonetic order. The Khmer aspects of this proposal anticipate this functionality.

Only Apple's GX fonts and Java 1.2 with Java2D offer adequate display technology at this time (although they have not yet been implemented for Khmer), and no program which supports MIF and RTF takes advantage of either GX fonts or Java 1.2

Further Problems to Resolve:

Major Design Decisions

GUI aspects for the first implementation will be made with Microsoft Visual C++ on Windows NT 4.0 and will be called VisualFlex.

Actions, lexing, parsing, and parsing trees will be implemented using Java, JavaCC and JavaTree.

Most processing will be in Unicode; the output (code generation) in this first generation will simulate Unicode; later versions will be able to convert the output to any output encoding.

Priority will be given to encoding, not display (appropriate display technology is not yet available on the Windows platform or in programs which read RTF or MIF).

Code page issues which impinge on encoding will not be considered in the first implementation.

Java applications generated will be command-line driven (but accessible from the VisualFlex menus).

Major components:

MyDoc.fsu, font/size/usage/translator file for a given document (one for each type of document, output of above). This is a simple table listing the font, size, usage and translator classes. This document is opened by the VisualFlex application to suggest which source fonts and sizes need to be translated.

Composition of RTF.jjt or MIF.jjt Grammar File Templates

Second Section: Java Compilation Unit written in Java

Enclosed between PARSER_BEGIN(RTF or MIF) and PARSER_END(RTF or MIF) expressions

Initially includes a static main() function to declare a command line useable class of parser; later this can be transformed into a GUI type of application

Declares and passes font and font size identification stack ()
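
To illustrate why a stack is the natural structure here, the following Java sketch tracks the current font across nested RTF groups. It relies only on two facts of the RTF syntax: braces open and close a group, with formatting restored when the group closes, and \fN selects font number N. Everything else is an assumption for illustration: the class and method names are hypothetical, font size is not tracked, and escapes such as \'hh, negative parameters, and the font table itself are ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Sketch only: attributes each text run to the font in force when it was read. */
class FontStackSketch {
    static List<String> runs(String rtf) {
        Deque<Integer> saved = new ArrayDeque<>(); // font in force outside each open group
        int current = 0;                            // assumed default font number
        List<String> out = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < rtf.length()) {
            char c = rtf.charAt(i);
            if (c == '{' || c == '}' || c == '\\') {
                // Any group boundary or control word ends the current text run.
                if (text.length() > 0) {
                    out.add("f" + current + ":" + text);
                    text.setLength(0);
                }
            }
            if (c == '{') {
                saved.push(current);                // remember the font to restore later
                i++;
            } else if (c == '}') {
                if (!saved.isEmpty()) current = saved.pop();
                i++;
            } else if (c == '\\') {
                int j = i + 1;                      // letters of the control word
                while (j < rtf.length() && Character.isLetter(rtf.charAt(j))) j++;
                String word = rtf.substring(i + 1, j);
                int k = j;                          // optional numeric parameter
                while (k < rtf.length() && Character.isDigit(rtf.charAt(k))) k++;
                if (word.equals("f") && k > j) {
                    current = Integer.parseInt(rtf.substring(j, k));
                }
                if (k < rtf.length() && rtf.charAt(k) == ' ') k++; // delimiter space
                i = k;
            } else {
                text.append(c);
                i++;
            }
        }
        if (text.length() > 0) out.add("f" + current + ":" + text);
        return out;
    }
}
```

Given the fragment {\f1 ab{\f2 c}d}, the sketch reports three runs: "ab" in font 1, "c" in font 2, and "d" in font 1 again. This is the mixed-font word situation in miniature: the four letters form one word, yet the formatting codes split it into three runs that must be handed to different font-specific translators.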

Third Section: Productions

Include lexing, parsing (for recursive functions), and parse tree functions with actions to implement much of its intelligence

Fill font identification stack, character style stack, and paragraph style stack.

Call font specific parsers as required

It is important to add first_token and last_token fields to the node classes. This facilitates capture of text embedded between grammatical constructs (http://users.ox.ac.uk/~popx/jjtree.html )
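
A minimal Java sketch of the idea follows. The Token class here is a stand-in that copies only the image and next fields of the class JavaCC generates, and Node is a stand-in for a JJTree node class extended by hand; the text() and chain() helpers are hypothetical. In a real grammar the two fields would be assigned in the production's actions.

```java
/** Sketch only: recovering the source text covered by a parse tree node. */
class TokenSpanSketch {
    /** Stand-in for the JavaCC-generated Token class (only the fields used here). */
    static final class Token {
        final String image;   // the matched text
        Token next;           // the following token in the input stream
        Token(String image) { this.image = image; }
    }

    /** Stand-in for a JJTree node class with the two extra fields added. */
    static final class Node {
        Token first_token;
        Token last_token;

        /** Concatenate the images of every token from first_token to last_token. */
        String text() {
            StringBuilder sb = new StringBuilder();
            for (Token t = first_token; t != null; t = t.next) {
                sb.append(t.image);
                if (t == last_token) break;
            }
            return sb.toString();
        }
    }

    /** Helper for experiments: link a sequence of images into a token chain. */
    static Token chain(String... images) {
        Token head = null;
        Token tail = null;
        for (String image : images) {
            Token t = new Token(image);
            if (head == null) head = t; else tail.next = t;
            tail = t;
        }
        return head;
    }
}
```

For a chain of the four tokens {, \f1, ab, and }, a node whose first_token is the second token and whose last_token is the third returns the text "\f1ab". Because the node only holds references into the existing token chain, no text is copied until text() is called.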
