User Friendly Recoding of Legacy Text in RTF/MIF

A Master's Project in Fulfillment of Requirements for CS693

Adviser: Dr. Yonglei Tao, Grand Valley State University

Author: Maurice Bauhahn

Date: 22 April 1998, 11:52am Wednesday


Summary

This paper describes the rationale, difficulties, goals, and processes involved in the creation of a tool to translate legacy representations of natural languages (in the context of two text formatting languages, Rich Text Format and Maker Interchange Format) to or via Unicode. The JJTree and JavaCC tools are used to generate lexers, parsers, and parse trees, with Khmer legacy encodings as the examples. Microsoft Visual C++ 5.0 is used to generate the GUI (particularly drag-and-drop of glyphs).


Why Recode?

 Encoding is the representation of a glyph or character as a number for the computer's benefit. It is a very compact and useful way to store text (far more manageable than a scanned picture of each character, for example). Most legacy systems use only one byte (8 bits) for this purpose. That allows the simultaneous storage of fewer than 256 distinct characters in plain text (some of those codes are reserved for control purposes such as tab, carriage return, line feed, and end of file, so not all are available for character storage). A computer does not 'recognize' any character; it merely uses a different number to represent each character (and then uses a matching number in a font to reference a picture with which to display that character). This representation of characters by numbers is what is meant by 'encoding'. Significant problems develop when different fonts or computers use different sets of numbers to represent the same alphabet. That problem has largely disappeared for English due to adoption of the widely used ASCII encoding (the first 128 code points). But after English and the control codes have used 128 of the 256 available values, there is not much room left to represent other languages (only a few European languages, in fact). The real solution is to use the two-byte Unicode encoding system (consisting of slightly fewer than 65,536 code points) or the four-byte ISO10646 encoding system (which allows almost unlimited expansion for unusual Chinese/Japanese/Korean characters). Unicode and ISO10646 are identical for the first 65,536 code points (meant to encode the major scripts of the world). These two schemes are still very young (and incomplete), so much textual data continues to accumulate in incompatible and overlapping encodings based on one byte. Tools are needed to convert this legacy text to compatible (and eventually non-overlapping) representations, on a font-by-font basis in many cases. Legacy systems often encoded glyphs (discrete components of writing meant primarily for display). Unicode, on the contrary, normally matches a character (an alphabetic entity) with each numeric code (in phonetic order) and lets the display mechanism distribute the various glyph components to their requisite display positions (before, after, above, or below the base). For some scripts (e.g., Khmer) two or more glyphs may be needed to display one character.
 Non-Roman script texts stored under many incompatible legacy number-to-character mappings need to be made shareable via one standardized representation.
 Unicode/ISO10646 is the migration goal: a world-standard, uniform method of character identification. Until it is more widely available for end-user manipulation, it can serve within the parser as a comprehensive intermediary encoding between legacy number-to-character pairings (a minimal recoding sketch follows below).
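
 As an illustration of the byte-to-code-point translation described above, here is a minimal Java sketch of the kind of per-font lookup table a recoder might use. The class name, the two table entries, and the use of U+FFFD for unmapped codes are placeholders for illustration only, not the project's actual implementation.

    public class LegacyRecoder {
        // One Unicode code point per legacy byte value for a single source font;
        // '\uFFFD' (REPLACEMENT CHARACTER) marks codes with no mapping yet.
        private final char[] toUnicode = new char[256];

        public LegacyRecoder() {
            for (int i = 0; i < 256; i++) toUnicode[i] = '\uFFFD';
            // Placeholder entries; a real table covers every glyph code of the font.
            toUnicode[0xC0] = '\u1780';   // e.g. KHMER LETTER KA
            toUnicode[0xC1] = '\u1781';   // e.g. KHMER LETTER KHA
        }

        // Recode a run of legacy single-byte text into a Unicode string.
        public String recode(byte[] legacy) {
            StringBuffer out = new StringBuffer(legacy.length);
            for (int i = 0; i < legacy.length; i++) {
                out.append(toUnicode[legacy[i] & 0xFF]);   // mask to the unsigned byte value
            }
            return out.toString();
        }
    }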


Wouldn't it be Easier to Parse Plain Text?

 Legacy texts may contain multiple incompatible single-byte encodings coexisting even within individual words; these disparate representations must be unified, because they would become ambiguous and corrupted if exported as plain text.
 Half of the effort of producing a document may lie in its formatting; that effort should be preserved rather than reinvested.


Why a (User Friendly) GUI?

 Most computer users are losing the ability to use command-line interfaces, so an application intended for wide use should have a user-friendly graphical user interface (GUI).
 Ideally the application would be simple enough that a moderately sophisticated user could customize the recoding of documents on a given computer, in the context of the particular fonts present on that computer.
 Display of non-Latin characters makes their identification and use much easier and more accurate.
 Drag and drop technology facilitates the use of characters (there are many inconsistent keyboard layouts and it is difficult to locate one of 200+ characters using the keyboard).


Is It Necessary or Possible to Generalize Such a Complicated Activity?

 There are so many encodings and variations between generations of fonts having the same name that it is difficult, if not impossible, to supply a stock recoder that would meet the needs of every desktop. Therefore the end user must be introduced to character level translation.
 Each script requires:
 The end user visually matches reference glyphs with source-font glyphs, dragging the latter to a box next to the former. One glyph may not be dragged to two locations (that would introduce ambiguity), and two source-font glyphs may not be combined to match one reference glyph. The language grammar is supposed to supply the 'intelligence' to properly handle composition, decomposition, and look-alike characters (a sketch of the one-to-one rule follows this list).
 Most Java command-line functions can be achieved using menu commands of the VisualFlex application.
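
 The one-to-one rule mentioned above can be illustrated with a small Java sketch (the class and method names are hypothetical; the real bookkeeping lives inside the VisualFlex GUI): a drop is refused whenever the source glyph has already been matched to some reference glyph.

    import java.util.Hashtable;

    public class GlyphMapping {
        // Source-font glyph code -> reference (Unicode) glyph code.
        private final Hashtable sourceToReference = new Hashtable();

        // Returns false, and the drop should be refused, when this source glyph
        // has already been matched; accepting it would make the mapping ambiguous.
        public boolean assign(char sourceGlyph, char referenceGlyph) {
            Character key = new Character(sourceGlyph);
            if (sourceToReference.containsKey(key)) {
                return false;
            }
            sourceToReference.put(key, new Character(referenceGlyph));
            return true;
        }
    }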


Why Maurice Bauhahn?

 Domain expert in the Khmer script (writing system used in Cambodia) and programmer.
 Author of the Khmer amendment to ISO10646 (approved by the relevant ISO/IEC working group, JTC 1/SC 2/Working Group 2, during the week of 16 March 1998). The proposal and submission are referenced at the bottom of http://www.csis.gvsu.edu/~bauhahnm/Encoding.html. Subsequent to my proposal there has been some renaming of characters (and removal of one dotted-circle character present elsewhere in Unicode); otherwise it stands as submitted.
 Committed to creation of such a tool for several years.
 Possess substantial textual data in digital format that requires such translation:
 Have implemented Khmer plain-text recoding, automated insertion of word demarcation, and sorting of Khmer using the Summer Institute of Linguistics' Consistent Changes program.
 Committed to production of a Khmer Bible Concordance which relies on some of this technology.


The Complications:

 Example of Khmer (in which the presenter is a domain expert, and for which he authored the current Khmer amendment to ISO10646) as one of the most complicated scripts to use on a computer:
 Conventional technology for lexing and parsing is inadequate
 Rich Text Format or Maker Interchange Format deficiencies:
 Display of Khmer Unicode encoding is lagging:
 Further Problems to Resolve:


Major Design Decisions

 The GUI for the first implementation will be built with Microsoft Visual C++ on Windows NT 4.0; the resulting application will be called VisualFlex.
 Actions, lexing, parsing, and parse trees will be implemented using Java, JavaCC, and JJTree.
 Most processing will be in Unicode; the output (code generation) in this first version will simulate Unicode, while later versions will be able to convert the output to any target encoding.
 Priority will be given to encoding, not display (appropriate display technology is not yet available on the Windows platform or in programs which read RTF or MIF).
 Code page issues which impinge on encoding will not be considered in the first implementation.
 Java applications generated will be command-line driven (but accessible from the VisualFlex menus).
 Major components:


Composition of RTF.jjt or MIF.jjt Grammar File Templates

 First Section: Options (set STATIC=false to allow recursion; JAVA_UNICODE_ESCAPE=true to allow \u0000-style handling of Unicode encodings; NODE_SCOPE_HOOK=true to allow stepping through tokens [as well as child nodes] in the parse tree, as explained at http://users.ox.ac.uk/~popx/jjtree.html ; and VISITOR=true to cause JJTree to insert jjtAccept() methods into all of the node classes it generates)
 Second Section: Java compilation unit (the parser class, written in Java)
 Third Section: Productions (a minimal skeleton of all three sections is sketched below)
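
 A minimal sketch of how the three sections fit together in an RTF.jjt template is given below. The token and production names are illustrative only and cover just the outermost group structure of RTF, not the full grammar developed for the project.

    options {                               // First Section: options
      STATIC = false;
      JAVA_UNICODE_ESCAPE = true;
      NODE_SCOPE_HOOK = true;
      VISITOR = true;
    }

    PARSER_BEGIN(RTFParser)                 // Second Section: Java compilation unit
    public class RTFParser {
      // Hooks required because NODE_SCOPE_HOOK=true; they do nothing in this sketch.
      void jjtreeOpenNodeScope(Node n)  { }
      void jjtreeCloseNodeScope(Node n) { }

      public static void main(String[] args) throws ParseException {
        new RTFParser(System.in).RTFFile();
      }
    }
    PARSER_END(RTFParser)

    TOKEN :                                 // Third Section: tokens and productions
    {
        < LBRACE: "{" >
      | < RBRACE: "}" >
      | < CONTROL_WORD: "\\" (["a"-"z"])+ (["0"-"9"])* >
      | < TEXT: (~["{", "}", "\\"])+ >
    }

    void RTFFile() : {} { Group() <EOF> }

    void Group() : {} { <LBRACE> ( <CONTROL_WORD> | <TEXT> | Group() )* <RBRACE> }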


Composition of Khmer.jjt Grammar Template file

 First Section: Options (set JAVA_UNICODE_ESCAPE=true, NODE_SCOPE_HOOK=true, VISITOR=true, and STATIC=false, for the reasons explained above)
 Second Section: Java compilation unit (the parser class, written in Java)
 Third Section: Productions (an illustrative fragment is sketched below)
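
 An illustrative fragment of a Khmer.jjt template is sketched below. The token ranges use the code positions of the proposed Khmer block (U+1780 onward) and are illustrative only, as is the single Cluster production, which moves a left-side vowel back behind its base consonant to show the kind of reordering intelligence the grammar supplies; it is not the project's actual production set.

    options {
      STATIC = false;
      JAVA_UNICODE_ESCAPE = true;
      NODE_SCOPE_HOOK = true;
      VISITOR = true;
    }

    PARSER_BEGIN(KhmerParser)
    public class KhmerParser {
      void jjtreeOpenNodeScope(Node n)  { }   // required by NODE_SCOPE_HOOK=true
      void jjtreeCloseNodeScope(Node n) { }
    }
    PARSER_END(KhmerParser)

    TOKEN :
    {
        < CONSONANT: ["\u1780"-"\u17a2"] >    // Khmer consonants
      | < PRE_VOWEL: ["\u17c1"-"\u17c3"] >    // vowels written to the left of the consonant
      | < OTHER_SIGN: ["\u17b6"-"\u17c0", "\u17c4"-"\u17d3"] >
    }

    /* In legacy (visual) order a left-side vowel arrives before its consonant;
       this production returns the pair in phonetic (Unicode) character order. */
    String Cluster() :
    { Token v = null; Token c = null; }
    {
      ( v = <PRE_VOWEL> )? c = <CONSONANT>
      { return (v == null) ? c.image : c.image + v.image; }
    }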


Conclusion


Bibliography

Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers, Principles, Techniques, and Tools. Reading: Addison-Wesley Publishing Company, 1986. ISBN 0-201-10088-6. 796pp. ("The Dragon Book")

Alfred V. Aho and Jeffrey D. Ullman. Principles of Compiler Design. Reading: Addison-Wesley Publishing Company, 1977. ISBN 0-201-00022-9. 604pp.

Andrew W. Appel. Modern Compiler Implementation in Java: Basic Techniques. Cambridge: Cambridge University Press, 1998. Paperback. ISBN 0-521-58654-2. 398pp. (a more complete edition is anticipated in 1998).

Mike Blaszczak. Professional MFC with Visual C++ 5. Third Edition. Birmingham: Wrox Press Ltd., 1997. Hardcover with CD-ROM. ISBN 1-861000-14-6. 1061pp.

Marshall P. Cline and Greg A. Lomow. C++ FAQs: Frequently Asked Questions. Reading: Addison-Wesley Publishing Company, 1995. ISBN 0-201-58958-3. 461pp.

Frank L. Friedman and Elliot B. Koffman. Problem Solving, Abstraction, and Design Using C++. Reading: Addison-Wesley, 1995? ISBN 0-201-52649-2. 888pp + appendices.

Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Reading: Addison-Wesley Publishing Company, 1995. ISBN 0-201-63361-2. 395pp.

Ken Arnold and James Gosling. The Java™ Programming Language. Reading: Addison-Wesley, 1996. ISBN 0-201-63455-4. 329pp.

Kate Gregory. Special Edition Using Visual C++ 5. Indianapolis: Que Corporation, 1997. Paperback with CD-ROM. ISBN 0-7897-1145-1. 956pp.

Nathan Gurewich and Ori Gurewich. Teach Yourself Visual C++ in 21 Days. Fourth Edition. Indianapolis: SAMS Publishing, 1997. ISBN 0-672-31014-7. 799pp.

David J. Kruglinski. Inside Visual C++. Fourth Edition. Redmond: Microsoft Press, 1997. Paperback with CD-ROM. ISBN 1-57231-565-2. 986pp.

MIF Reference. San Jose: Adobe Systems Incorporated, August, 1997. 336pp.

Patrick Naughton. The Java Handbook. Berkeley: Osborne McGraw-Hill, 1996. ISBN 0-07-882199-1. 424pp.

Fred Pandolfi, Mike Oliver and Michael Wolski. Microsoft Foundation Class 4 Bible. Corte Madera: Waite Group Press, 1996. Paperback with CD-ROM. ISBN 1-57169-021-2. 1132pp.

Rich Text Format (RTF) Specification and Sample RTF Reader Program. Microsoft Technical Support Application Note. RTF Version 1.5. Redmond: The Microsoft Corporation, April, 1997. 157pp.

Bjarne Stroustrup. The C++ Programming Language. Third Edition. Reading: Addison-Wesley, 1997. ISBN 0-201-88954-4. 910pp.

The Unicode Consortium. The Unicode Standard, Version 2.0. Reading: Addison-Wesley Developers Press, 1996. Paperback with CD-ROM. ISBN 0-201-48345-9.


Web sites:

Ask the MFC Pro http://www.inquiry.com/techtips/mfc_pro/index.html

Automata Course http://www.math.grin.edu/~stone/courses/automata

Fm2html http://www.ai.mit.edu/~mpf/homedir/files/proj/Web+Bib/fm2html/fm2html.test/trans.y

Frame Maker Interchange Format (MIF) parser (Perl) http://www.oac.uci.edu/indiv/ehood/mif.pl.doc.html

Introduction to JJTree http://users.ox.ac.uk/~popx/jjtree.html

JavaCC/JJTree http://www.suntest.com/JavaCC (especially "Description of the JavaCC Grammar File" in subdirectory /doc/DOC/javaccgrm.html and "The README File for JJTreeExamples" in subdirectory /doc/DOC/jjtreeREADME.html)

Java Tree Builder http://www.cs.purdue.edu/homes/taokr/jtb/index.html

Java Test Coverage and Instrumentation Toolkits http://www.glenmccl.com/~glenm/instr/index.html

Khmer Unicode Proposal (Maurice Bauhahn, April 1998) http://www.csis.gvsu.edu/~bauhahnm/Encoding.html, ftp://brookie.csis.gvsu.edu/maurice/KhmerEncoding11.pdf, and ftp://brookie.csis.gvsu.edu/maurice/KhmerProposal11.pdf

Language Theory http://www.spectra.net/~j_alan/hitech/compiler/comp2.html

MFC Programmer's Sourcebook http://www.codeguru.com

Microsoft http://www.microsoft.com/visualc/

Modern Compiler Implementation… http://www.cs.princeton.edu/~appel/modern/

MSDN Online SDK http://www.microsoft.com/msdn/sdk/changes.htm

RTF Converters ftp://ftp.uni-mannheim.de/packages/WWW/www/converters/rtf/RTF/00-SW-README

Source Code Archive http://www.cuj.com/code/archive.html

Together/J Java design environment http://www.oi.com

Top Down Analysis of the Assignment Statement http://www.cs.um.edu.mt/~hzarb/CSM201/notes/lecture3/node4.html

Unified Modeling Language (UML) Dictionary http://softdocwiz.com/UML.htm

User Friendly Recoding of Legacy Text in RTF/MIF (This document) http://www.csis.gvsu.edu/~bauhahnm/UserFriendlyRecodingLegacy.html


Other Resources:

Adobe FrameMaker version 5.5 (Windows)

Adobe Illustrator version 5.5 (Macintosh)

Frame Technology FrameMaker version 3.0 (Macintosh)

Gateway 2000 GP5-166, 128 MB RAM, 4 GB hard drive.

Macromedia Fontographer 4.1 (Windows and Macintosh)

Microsoft™ Visual C++ Version 5.0 Professional Edition (Microsoft Developer Network (MSDN) Library was especially helpful)

Microsoft™ Windows NT 4.00 Workstation

Microsoft™ Word 5.01a (Macintosh)

Microsoft™ Word 97 (Windows)

Usenet: news://msnews.microsoft.com microsoft.public.vc.mfc.docview, microsoft.public.vc.events, microsoft.public.vc.mfc, microsoft.public.vc.language

comp.text.frame, comp.compilers.tools.javacc