Remaining Issues with Khmer Unicode

Further discussion is needed to resolve disagreements on what should be deprecated or how the Khmer Unicode standard should be extended. Here are some issues which hopefully will focus on key issues which should be quickly resolved. There is no point in debating further the VIRAMA/COENG model of encoding. The International Standards Organisation via the WG2 unanimously rejected an appeal against that model...and there were very good reasons for implementing it in the first place. However, there are many, many other issues which still need resolution. Now that there are structures within the Khmer government to process such standards, the proposal to the ISO/Unicode should come from those bodies (although to avoid appearance of conflict which could delay implementations, it would be good if the next proposal were submitted with the added signatures of Michael Everson and Maurice Bauhahn). A full and public debate is encouraged over these issues. Some issues are (others are detailed in N2380 with discussion in N2385):

1. Sign in Chhuan Nath dictionary not in Khmer Unicode

I am aware of one unique sign in the Chhuan Nath dictionary which is not in Khmer Unicode (something like three peddles coming away from a central location, NoKhvak). The original linguists committee recommended against including this sign. Another sign used in the dictionary is the Swastika.

2. Ligature characters under AKSAW MUOLE

Two ligature characters (U+178F U+17B7 and U+179C U+17B8?) pointed out in N2380, Appendix point 9) seem to be true ligatures (the constituent parts affect the sorting order). Hence they are not unique characters in their own right, nor worthy to be separately encoded. They are only glyph variants. Are there others? For an analygous example consider KHMER LETTER BA and KHMER VOWEL SIGN AA (and KHMER LETTER BA and KHMER VOWEL SIGN AU). Both form ligatures which are absolutely necessary to be displayed, but should not be encoded. When there is an understanding of the font technology, it is easy to come to an agreement on how to handle ligatures. It is commendable to make available full ligatures in Khmer (much like handwriting) whereby the intelligence added to fonts allows for full ligature formation (it is probably a good idea to make certain levels of ligation optional additions ['features'], however, so characters could optionally be displayed without 'full' ligatures). Hopefully identifying those technologies and illustrating how font 'features' allow different levels of ligature formation would alleviate the misunderstanding. The above ligatures are only present under AKSAW MUOLE, aren't they? Would it be best for them always to be present under that font type? At the moment the only glyph type we have implemented in Unicode is AKSAW CHRUNG. Once the logic of that is complete, however (for OpenType, Graphite, and AAT) it would be relatively simple to move it over to other fonts (however that would also require adding lots of ligatures and changing all the PostScript and Unicode names of the recipient fonts). Additional insight is most welcome on other ways that AKSAW MUOLE has behaviours that are different from AKSAW CHRUNG. Maurice remembers standing on the streets of Phnom Penh asking twenty folks to write the AKSAW MUOLE subscript form of KHMER LETTER CO (only one could do it!).

3. Deprecate U+17A3?

It would be wise to discuss further the merits of deprecating U+17A3 (It appears that the distinction against U+17A2 is only in sort order...which is language specific [not encoding specific]). Could some Buddhist monk linguists (or other Indic language experts) be brought in to identify whether the Pali or Sanskrit independent vowel that this transliterates has any decomposition form or decomposition function? Can those monk linguists point out any other distinction between U+17A2 (Khmer) and U+17A3 (Sanskrit/Pali)? The original linguists committee felt that there was a significant difference between U+17A2 and U+17A3 (the first was a consonant, the second an independent vowel). In the 2002 WG2 meeting in Dublin it was agreed by all parties that so far as Unicode was concerned, U+17A3 should be deprecated.

4. Why U+17A4 should not be deprecated.

U+17A4 has much greater significance and decomposing it would be like decomposing one of the Khmer independent vowels (making it extremely difficult to sort under a separate heading, as this character is in Pali dictionaries: TitlePage, Beginning17A4, Ending17A4). Remember the warning in TUS 3.0 that U+17A4 is "used only for Pali/Sanskrit transliteration". Furthermore, decomposing U+17A4 could break a fundamental rule of consonant cluster formation if it were found in a subscript position in a cluster which had a dependent vowel (The rule is: there can only be one dependent vowel CHARACTER per cluster).

5. Deprecate KHMER SIGN ROBAT (U+17CC)?

There is a strong case for deprecating KHMER SIGN ROBAT because (a) It functions identically to its Indian root which is acknowledged to be an initial RO, (b) In pronunciation the Khmer ROBAT has a RO sound preceding that of the consonant over which it sits (hence for phonetic ordering it must precede the code for the 'BASE' consonant), (c) In most cases that RO is the initial in a consonant cluster it appears in the ROBAT form (which should be encoded U+179A U+17D2 BASE), but only extremely rarely does KHMER LETTER RO (U+179A) have a subscript which appears in typical subscript form (and in those cases that variant could be encoded U+179A U+17D2 U+200C BASE; where the subscripting behaviour is interrupted by ZERO WIDTH NON-JOINER, (d) Theoretically sorting should treat ROBAT as a base (Does it? There do not appear to be any words starting with ROBAT.). (e) Presumably in handwriting the BASE is written before the ROBAT form of RO, so this does introduce a clash of technology against tradition. Note that BASE in this paragraph stands for another consonant/independent vowel in the cluster.

6. Character order within consonant clusters in Khmer.

It is difficult to impose a strict character order in Khmer...but absolutely necessary if searching, spell-checking, and sorting are taken into account. That encoding order is formalised (a normative property which cannot be changed) in Khmer Unicode as: Base RegisterShifter FirstSubscript SecondSubscript Vowel Signs. RegisterShifter does not have a high impact on sorting order (and is only taken into consideration on a secondary level along with some of the signs). It would be a good idea to extend the specify encoding order to (a) force the three signs U+17C6, U+17C7, and U+17C8 to immediately follow the VOWEL when they are present [in fact they create unique vowels in those combinations]...admittedly violating the handwriting order, or (2) allowing all other signs to precede on the of the three signs U+17C6, U+17C7, and U+17C8. Chiefly it is important to force entry into a particular order without that being too complicated.

7. Sorting order

A binary sort is not appropriate for Khmer (it does not handle syllabic sorting which is the Indic/Khmer type, independent vowels are regularly decomposed for sorting purposes, RegisterShifter has a very marginal affect on sort order [except with KHMER LETTER BA, which needs further discussion] even though it appears earlier in the character order, et cetera). Basically, a binary sort of Khmer (which compares two words, one character at a time) would only be possible in a context where there are contiguous consonants without explicit vowels, signs, word breakers, or subscripts. Two documents which discuss Khmer sorting issues are KhmerSortingQuestions.pdf and KhmerSortingUnicodeBeta.pdf.

8. Why the combination U+17B6 U+17C6 should not be encoded as one character

This item might be a bit more difficult to come to an agreement on. It should be resolvable, however, if there can be some agreement in sorting on how to handle the cases: U+17BB + U+17C6, U+17C6 standalone, U+17B6 + U+17C6. Furthermore, if we are going to leave the Khmer script extensible to cover other languages there is no way we can exclude the possibility of other dependent vowel/U17C6 combinations. If the second and third cases were added as separate characters to Unicode, it would be quite natural to alternately enter either the single character or the constituent parts...leading to confusion in searching/sorting/et cetera. Possibly the sorting would need to be adjusted (without changing the encoding) to adapt to using those combinations as unique characters/combinations near the end of vowels [rather than near the beginning as is consistent with the many other vowels which are 'formed' with the addition of U+17C7 or 17C8]. I would like to see greater consistency in sorting...but admit that would require standardisation within Cambodia. It would be interesting to test a cross-section of the Khmer population to see how many could alphabetically order a set of Khmer words (and find words in a Khmer dictionary). I believe the results would be very disappointing. For the sake of information retrieval by future generations, it might be a time for Cambodia to simplify and standardise sorting!

9. Additional signs

Additional signs may be needed to complete the repetoire of Khmer script transliteration of Sanskrit/Pali signs. There is a Sanskrit Grammar which lists five signs.

It appears that the first sign in the pair of graphics above is a filled circle. Does it need to be distinguished from KHMER SIGN NIKAHIT (U+17C6)?

The second sign above probably should be added to Unicode (KHMER TRANSLITERATED ANUNIASIK [U+17DD]...or what suggested name?). A cautionary note should be added that this is not for use with Khmer language, but rather for transliteration of Sanskrit into the Khmer script.

Is the third sign above functionally identical with KHMER SIGN YUUKALEAPINTU (U+17C8)?

The fourth sign above appears to be identical to KHMER SIGN VIRIAM (U+17D1).

The fifth sign above is KHMER SIGN AVAKRAHASANYA (U+17DA). Note that the graphic used in Unicode could be modified to reflect the apostrophe-like sign used above...but other sources adopt the sign intact.

10. Krung consonants which should be added.

Just as the Khmer script is used to write Sanskrit and Pali languages in Cambodia, the Khmer script is also used to write the minority languages of Cambodia. Since some of these languages have unique sounds...additional characters have need of being added. One script which has official approval from the Royal Government of Cambodia is Krung (we need a scan of the certification of that script!). I am aware of ten consonants which have been used with that language (and are temporarily being encoded in the Private Use Area of Unicode):

U+E1C0 KHMER LETTER KA LEFT SINGLE TICK (derived from U+1780)
U+E1C1 KHMER LETTER KA LEFT DOUBLE TICK (derived from U+1780)
U+E1C2 KHMER LETTER KO LEFT SINGLE TICK (derived from U+1782)
U+E1C3 KHMER LETTER KO LEFT DOUBLE TICK (derived from U+1782)

U+E1C4 KHMER LETTER CA LEFT SINGLE TICK (derived from U+1785)
U+E1C5 KHMER LETTER CO LEFT SINGLE TICK (derived from U+1787)
U+E1C6 KHMER LETTER DA LEFT SINGLE TICK (derived from U+178A)
U+E1C7 KHMER LETTER DO LEFT SINGLE TICK (derived from U+178C)

U+E1C8 KHMER LETTER BA LEFT SINGLE TICK (derived from U+1794)
U+E1C9 KHMER LETTER YO LEFT SINGLE TICK (derived from U+1799)