Khmer Unicode and Inherent Vowels

Inherent Vowels

There is some imprecision concerning what constitutes an 'inherent' vowel.
In this note I am referring to normally unwritten vowels that are
nevertheless pronounced.

In Khmer Unicode there are two inherent "a" characters: a long inherent (native Khmer language) at U+17B5 KHMER VOWEL INHERENT AA and a short inherent (Sanskrit/Pali) at U+17B4 KHMER VOWEL INHERENT AQ. Their encoding has raised some outcry (in fact some parties are trying to deprecate them), but the more I analyse grammars, dictionaries, and round-trip transliteration, the more importance they assume.
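As a quick check of the two code points, here is a minimal Python sketch that looks them up in the Unicode character database shipped with the interpreter. The constant and function names (INHERENT_SHORT, INHERENT_LONG, describe) are my own labels for illustration, not part of any standard.

```python
import unicodedata

# The two inherent-vowel code points discussed in this note.
INHERENT_SHORT = "\u17B4"  # short inherent a (Sanskrit/Pali)
INHERENT_LONG = "\u17B5"   # long inherent a (native Khmer)


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in (INHERENT_SHORT, INHERENT_LONG):
    print(describe(ch))
```

Running it prints the code point and formal Unicode name of each character, which is a handy way to confirm which of the two is the short one and which the long one.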

(1) If you look at the dependent vowel series of an Indic script, it often starts with an unwritten 'inherent a', which recognises that vowel's distinct existence.

(2) If you transliterate between an Indic script and a Latin [or other phonetic] transliteration, the inherent vowel must become explicit in the transliteration (hence, for round-trip conversion, it would be extremely useful to have a matching code in the Indic encoding). Dependable round-trip conversion of text is becoming increasingly important when a single minority language spans national borders and the government authorities on opposite sides of the border each insist that the 'national' script of their country be used to render that language.

(3) Not every consonant cluster that lacks an explicit dependent vowel also contains an 'inherent a' (in particular in Khmer it is unpredictable from the context [i.e., without a lookup] whether a final consonant cluster without a dependent vowel has a pronounced inherent or not).

(4) Non-final clusters lacking an explicit dependent vowel 'always' (a dangerous word to use!) have an 'inherent a', possibly short or long.

(5) Depending on the foreignness of the word, an 'inherent a' in Khmer may be short (foreign) or long (Khmer language).

(6) Dictionaries have to make the short 'inherent a' vowel explicit in their pronunciation sections (usually borrowing U+17C8 to display it; however, you would not want to create ambiguity by using that code both where it is normally displayed and where it is there only to make pronunciation clear).

(7) For phonetic rendering of an Indic script, therefore, it would be very useful to be able to encode the inherent vowel selectively. In the future, data input and output will increasingly move to verbal/aural rather than keyboard means. This would be quite an exciting development for Khmer, because Khmer is difficult to keyboard and presumably relatively easy for a computer to recognise (what with about fifty vowel/vowel-sign combinations that are easier for computers to recognise than consonants). Hence, I would assume that codes to capture verbal data converted to Unicode text will similarly become increasingly important.

(8) 'Inherent a' is often used in combination with vowel-like signs such as U+17C6 NIKAHIT, U+17C7 REAHMUK, U+17C8 YUUKALEAPINTU to generate vowels with consonantal final sounds. Failure to recognise the 'inherent a' results in wrongly interpreting those consonant-like signs as vowels. These vowel+sign ligatures are in fact treated like unique vowels in sorting.
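The round-trip argument in point (2) can be illustrated with a toy transliterator. This is only a sketch under loud assumptions: the five-entry mapping table is a hypothetical scheme invented for this example (it is not a published romanisation), and the test strings are arbitrary character sequences rather than claimed dictionary words.

```python
# Hypothetical one-to-one scheme, for illustration only.
KHMER_TO_LATIN = {
    "\u1780": "k",       # KHMER LETTER KA
    "\u1798": "m",       # KHMER LETTER MO
    "\u179A": "r",       # KHMER LETTER RO
    "\u17B4": "\u0103",  # KHMER VOWEL INHERENT AQ -> a with breve (short)
    "\u17B5": "\u0101",  # KHMER VOWEL INHERENT AA -> a with macron (long)
}
LATIN_TO_KHMER = {latin: khmer for khmer, latin in KHMER_TO_LATIN.items()}


def to_latin(text: str) -> str:
    """Transliterate Khmer to the toy Latin scheme, character by character."""
    return "".join(KHMER_TO_LATIN[ch] for ch in text)


def to_khmer(text: str) -> str:
    """Invert to_latin; valid only for strings in the toy scheme."""
    return "".join(LATIN_TO_KHMER[ch] for ch in text)


# KA + explicit long inherent a + RO, and the same string without the vowel.
marked = "\u1780\u17B5\u179A"
unmarked = "\u1780\u179A"

# With the inherent vowel encoded, the conversion is lossless both ways.
assert to_latin(marked) == "k\u0101r"
assert to_khmer(to_latin(marked)) == marked

# Without it, the Latin side carries no vowel at all, so a phonetic
# transliteration would need a lookup to supply one.
assert to_latin(unmarked) == "kr"
```

The point of the sketch is that a purely mechanical, table-driven conversion only round-trips when the inherent vowel has a code of its own on the Khmer side. Otherwise the vowel has to be supplied from a dictionary on the way out and discarded on the way back.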

There are arguments against using 'inherent' vowels.

(a) Unwritten characters tend not to be typed! And if they were, the data stream length would grow considerably.
(b) Binary comparison of words with and words without 'inherent' vowels would be problematic.
(c) The average user would probably gain no advantage from the inclusion of 'inherent' vowels in the text stream.
(d) I could not find more than one instance in the authoritative Chuon Nath Khmer dictionary where two words otherwise spelled the same were distinguished by the length of their inherent vowels. It is hard to write a sorting rule on one data point ;-)
(e) Rendering mechanisms may not recognise the (rarely used) inherent vowel codes, which could cause problems when they are used.
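Objection (b) can be worked around, at some cost, by removing the inherent vowel codes before comparing. The following is a minimal sketch of that idea, not a recommendation from any standard; the names INHERENT_VOWELS, strip_inherent and same_spelling are hypothetical, chosen for this example.

```python
# The two Khmer inherent-vowel code points (AQ and AA).
INHERENT_VOWELS = frozenset({"\u17B4", "\u17B5"})


def strip_inherent(text: str) -> str:
    """Return text with U+17B4 and U+17B5 removed."""
    return "".join(ch for ch in text if ch not in INHERENT_VOWELS)


def same_spelling(a: str, b: str) -> bool:
    """Compare two strings while ignoring explicit inherent vowels."""
    return strip_inherent(a) == strip_inherent(b)


marked = "\u1780\u17B5\u179A"  # KA + long inherent a + RO
unmarked = "\u1780\u179A"      # KA + RO

assert marked != unmarked               # plain binary comparison fails
assert same_spelling(marked, unmarked)  # comparison after stripping succeeds
```

The trade-off is that every comparison, index, or search over such text has to apply the same stripping step, which is one more reason to keep the use of these codes narrowly circumscribed.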

Hence, it would be preferable that the use of inherent vowels be sharply circumscribed...but not eliminated altogether.

In summary, inherent vowels:

(1) Are characters in their own right
(2) Are needed for round-trip script conversion (transliteration)
(3) Are not a trivial case: they are not contained in every consonant cluster, even when that cluster does not contain a visible dependent vowel
(4) Are useful for preserving phonetic value in dictionaries or text-to-speech applications