Khmer Unicode Mailing List 2001/11/13

Dear Mr Nelson

I want to express the appreciation of the Cambodian delegation for your sensitive responses to our position papers, and your open-minded attitude to the problem of encoding Khmer in the best possible manner, taking into account the various (at times conflicting) needs and interests of designers, implementors and, above all, Cambodian users.

The opportunity to meet some of the actors in this saga recently in Singapore was greatly appreciated, and I know I can speak for the whole delegation and for the Committee on Standardization when I say we will redouble our efforts in the coming period to try and reach a mutually acceptable solution.

Looking forward to a continuing and positive sharing of ideas,

Helen Jarvis

Dr Helen Jarvis
Associate Professor, School of Information Systems, Technology & Management
University of New South Wales
Sydney, Australia

currently in Phnom Penh
Phone: 012812658 (international 85512-812658)
Fax: Office of the Council of Ministers (85523) 880-629
Mail: c/- PO Box 1109, Phnom Penh, Cambodia
Email: 012812658@mobitel.com.kh OR h.jarvis@unsw.edu.au (please do not use both)

**********

My sincere apologies for appearing to be so obstinate in this case of COENG! I have come from the same position so ably put forward by the opponents of COENG on this list and now find myself on the other side of the discussion after several years of analysis and thought.

Our discussion reminds me of an experience in Bangkok of finding a book on the history of the Thai script...which ignored the Khmer script!

It is of no advantage to ignore the roots of the Khmer script in the Indic tradition (in this case for supposed principle, proposing the addition of nearly 80 [unnecessary] stand-alone characters to Khmer Unicode). I recall reading that until quite recently the Royal Court in Cambodia had an Indian counsellor. Such an advisor would be helpful in this discussion;-) In another case (which should not distract us now) it was certainly startling to me to find that the relatively obscure ROBAT character was not merely a diacritic sign as widely believed, but in fact an initial RO (look at the words in which that is pronounced and note that the RO sound precedes the consonant above which the ROBAT sits). That insight came from folk with an Indic background. It reminds me of a signature I've seen in some emails which runs something like: "Those that have no roots, have no future."

If this discussion were only about subscript consonants it might be easier to come to an agreement. However, we have seen that the principle of subscript (COENG = foot) in Khmer extends beyond consonants to independent vowels and lunar dates. An even more damaging ambiguity would result if some of those used a COENG encoding and others an explicit subscript encoding (end users would alternate between the two encodings for any given subscript). Furthermore, a goal of investigation with a scientific mindset is to find those unifying principles.

On 11/12/2001 06:51:08 AM Paul Nelson wrote:

>C) This is not about keyboarding. It is just as easy to use your dead
>key approach to enter a subscript character representation as it is to
>enter the COENG + character with the same keyboard.

Keyboarding is in fact a VERY important aspect of this discussion. It is only at the keyboard that the end user has any clue about COENG. Isn't the supposed offence to the end user the chief argument against COENG? The arguments against COENG in the background are really very weak. Yesterday I learned that we can expect terabyte hard disk storage on PCs in a couple of years (incidentally I found 22033 COENG characters in a Unicode document of about a quarter of a million Khmer characters [about 11%]). Networking is also speeding up by leaps and bounds. Microprocessor speed is soaring. But typing...that is the real obstacle! Keyboard efficiency is an extremely important requirement...for that is the most expensive and enduring resource limitation. If there were fewer subscripts even that could conveniently be hidden by having COENG and its base inserted by a single key (and the delete algorithm always taking those two codes out in pairs). However, one of my chief points in this discussion is that the use of COENG helps most vividly at the keyboard. If we did not have it in the encoding, we should have to simulate it (to produce time savings of 25 - 50% in typing subscripts). Although I like 'dead keys' because they remarkably speed up typing, they have one very limiting characteristic: they do not give visual feedback. So with COENG we can get dead key functionality...but without the limitations.
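
A minimal sketch of the two mechanical points above, assuming text is held as a plain Python string: a delete step that removes COENG and its base together, and a count of what share of the Khmer characters in a text are COENG. The function names `backspace` and `coeng_share` are illustrative only and do not come from any existing implementation.

```python
COENG = "\u17D2"  # KHMER SIGN COENG in the existing Khmer block

def backspace(text: str) -> str:
    """Delete one user-perceived unit from the end of the text.

    If the final character is a base preceded by COENG, both code
    points are removed together, so the user never sees a lone COENG.
    """
    if len(text) >= 2 and text[-2] == COENG:
        return text[:-2]
    return text[:-1]

def coeng_share(text: str) -> float:
    """Fraction of Khmer-block characters (U+1780..U+17FF) that are COENG."""
    khmer = [ch for ch in text if "\u1780" <= ch <= "\u17FF"]
    return khmer.count(COENG) / len(khmer) if khmer else 0.0
```

With KA (U+1780) as the base, calling `backspace` on KA + COENG + KA leaves only the first KA, which is the pairwise deletion described above.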

A sparse encoding such as that in the existing Khmer Unicode helps one create a sparse keyboard. Why create a bloated encoding and then have to do workarounds on the keyboard to get it back to a manageable size (especially when the experience at the keyboard is the only substantive rationale for challenging the sparse encoding in the first place!)?

>D) I would say that the ability to convert the Khmer to Latin is
>fascinating, but is not a factor that needs to be considered for
>correctly encoding Khmer into Unicode. It will be possible to take any
>correct encoding of a language and deduce rules for some type of
>morphological transformation to another language. The same rules will
>apply to your transformation if the subscript letters are encoded.

I believe Paul meant to write 'transformation to another script' in the fourth line above.

Encoding is fundamental to many different processes. This is merely illustrative of one of the ways in which explicit subscripts in Khmer are of no advantage. It also brings out the very real existence of hidden characters...in analogy to COENG.

Sincerely,

Maurice

*************

A. One point of view is expressed as "bloated encoding" and one point of view is expressed as "bloated backing store".

While the future is very bright for what may someday be available in some countries, the harsh reality is that there are major parts of Cambodia that will never see fast networks or huge storage spaces. Personally, I find it very difficult when traveling away from my home environment to get "acceptable" connectivity. Here I sit at home connected via DSL (728) and communicating faster than anyone in the country of Indonesia that has a direct connection to the Internet (128).

Data transmission and storage cannot be summarily dismissed as a cheap item. Nor should the approach of encoding subscript characters as two Unicode characters be summarily dismissed because it will take up too much space.

B. One point of view says "return to the roots" and one point of view says "here is where we are today".

One can claim that the only way to write Khmer is as an Indian advisor would have years ago...and of course he would have encoded Khmer as it is currently represented in Unicode. One can also claim that this [is] the current modern (and therefore correct) usage of Khmer, and miss a wealth of historical documentation and minority languages that should be written in the Khmer script.

I have heard of literacy work with rural Cambodians to write their tribal languages into Khmer script. These tribes have sounds that are not represented in the current Khmer script encoding supported by either side of the encoding issue. How will these unencoded sounds be represented in Unicode?

C. One point of view says "COENG solves all subscripts" and one point of view says "encode all subscripts".

The current demonstrative implementation of Unicode I did using the COENG allowed the COENG model to represent all forms of the subscripts that Maurice could find, as well as marking up Lunar dates. The Cambodian proposal encoded all known subscripts and lunar dates, and carefully left blank spaces for those COENG forms that were not currently used for modern usage and marked them as "RESERVED". Both the minimalist approach and the normal usage approach considered that there might be other issues in this realm.

I do not care to take sides and say that one method is better than the other. Each point of view has its merits, and each side needs to be allowed to state its viewpoint objectively. My supposition is that each point of view wants the best for the users of the Khmer script. Otherwise this might be a purely academic exercise.

To those in favor of the COENG model I ask:
Can you please present why the COENG model (based on virama) is so critical for implementing "correct" Unicode use of subscripts? If you remove constraints of the number of characters and any data entry issues, what is the compelling reason to encode Khmer in this manner? Is there any reason why encoding subscript forms and lunar numbers would not work?

To those opposed to the COENG model I ask:
Is it not at all possible to look at COENG KA as the same as COENG + KA? Apart from the extra characters required in the backing store and the perception that one "character" should not be represented as two codepoints, are there any major reasons that the current implementation of Unicode could not be used to correctly represent Khmer text?

One Unicode principle we must maintain in the quest for an acceptable solution is that currently assigned Unicode values cannot be changed.

In the world of Latin script there is a normalization process. A + acute = Aacute. If a proposed Extended Khmer range was considered it could not contain any characters already encoded. It could only contain subscript forms and any new characters that were not put in the Khmer block. Additionally, it would be necessary to have some type of mapping between the current Unicode encoding model and the subscripts encoded in the Extended Khmer range. COENG + KA = COENG KA, etc. Is this a layer that is acceptable to introduce into the Khmer standard?
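
A minimal sketch of what such a mapping layer might look like, assuming Python's standard `unicodedata` module for the Latin case. The Latin composition (A + combining acute = Aacute) is real Unicode behaviour. The Khmer half is purely hypothetical: no precomposed "COENG KA" code point exists, so a Private Use Area value (U+E000) stands in for it, and the names `compose` and `decompose` are illustrative only.

```python
import unicodedata

# Real Unicode behaviour: NFC composes A + COMBINING ACUTE ACCENT.
latin_composed = unicodedata.normalize("NFC", "A\u0301")  # U+00C1

KA = "\u1780"     # KHMER LETTER KA
COENG = "\u17D2"  # KHMER SIGN COENG

# HYPOTHETICAL: a Private Use Area code point standing in for a
# precomposed "COENG KA" in an imagined Extended Khmer range.
HYPOTHETICAL_COENG_KA = "\uE000"

TO_PRECOMPOSED = {COENG + KA: HYPOTHETICAL_COENG_KA}
TO_SEQUENCE = {pre: seq for seq, pre in TO_PRECOMPOSED.items()}

def compose(text: str) -> str:
    """Replace each COENG + base sequence with its hypothetical single code point."""
    for seq, pre in TO_PRECOMPOSED.items():
        text = text.replace(seq, pre)
    return text

def decompose(text: str) -> str:
    """Expand each hypothetical subscript code point back to COENG + base."""
    return "".join(TO_SEQUENCE.get(ch, ch) for ch in text)
```

Under these assumptions the two forms round-trip: `decompose(compose(s))` returns `s` for any text made of the mapped sequences, which is the equivalence COENG + KA = COENG KA expressed as a table.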

Perhaps it is too early to propose such a compromise. Oh well. It is an idea.

There is an additional huge issue that must be considered. If the Khmer encoding is changed to include subscript forms, what rationale can be provided to prevent Indic languages from also demanding that all of their subscript forms be encoded? What makes Khmer more like Tibetan script than Sanskrit? This question does not mean that we should not consider doing the right thing for Khmer...whatever that is. However, this situation cannot be resolved without having this question on the table.

I hope that this discussion can continue, and that I have not irritated either side of the issue too much.

Best regards,

Paul

P.S. - My opinion is still that keyboarding is an implementation issue that needs to be left out of the encoding mix. There are many ways to implement keyboards and other input devices. Those methods may change from device to device and evolve as technology evolves. This issue is about deriving the best Khmer encoding solution to solve this impasse.

*********

Maurice Bauhahn wrote:

>On 11/12/2001 06:51:08 AM Paul Nelson wrote:
>
>>C) This is not about keyboarding. It is just as easy to use your dead
>>key approach to enter a subscript character representation as it is to
>>enter the COENG + character with the same keyboard.
>
>Keyboarding is in fact a VERY important aspect of this discussion...

>A sparse encoding such as that in the existing Khmer Unicode helps one
>create a sparse keyboard.

I'd suggest that there can be independence. A sparse keyboard can be used to generate a rich encoding, and a rich keyboard can be used to generate a sparse encoding. It's just a matter of engineering the input method properly.
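
A rough sketch of that independence, assuming a toy input method with two stages: keystrokes are first turned into abstract (kind, letter) units, and the units are then encoded. All key names and function names here are invented for illustration, and the "rich" subscript code point (U+E000, Private Use Area) is hypothetical, since no such character is encoded.

```python
KA = "\u1780"     # KHMER LETTER KA
COENG = "\u17D2"  # KHMER SIGN COENG
HYPOTHETICAL_SUB_KA = "\uE000"  # HYPOTHETICAL precomposed subscript KA

LETTERS = {"KA": KA}
RICH_SUBSCRIPTS = {"KA": HYPOTHETICAL_SUB_KA}

def units_from_sparse_keys(keys):
    """Sparse keyboard: a COENG key turns the next letter key into a subscript."""
    units, pending = [], False
    for key in keys:
        if key == "COENG":
            pending = True
        else:
            units.append(("sub" if pending else "base", key))
            pending = False
    return units

def units_from_rich_keys(keys):
    """Rich keyboard: dedicated keys such as SUB_KA for each subscript."""
    return [("sub", k[4:]) if k.startswith("SUB_") else ("base", k) for k in keys]

def encode_sparse(units):
    """Sparse encoding: a subscript is stored as COENG + letter."""
    return "".join((COENG if kind == "sub" else "") + LETTERS[name]
                   for kind, name in units)

def encode_rich(units):
    """Rich encoding: a subscript is stored as one (hypothetical) code point."""
    return "".join(RICH_SUBSCRIPTS[name] if kind == "sub" else LETTERS[name]
                   for kind, name in units)
```

In this sketch either keyboard feeds either encoder: the key sequences `["KA", "COENG", "KA"]` and `["KA", "SUB_KA"]` yield the same units, so the choice of keyboard layout does not fix the choice of stored encoding.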

- Peter