Khmer Unicode Mailing List 2001/11/12

Dear Michael

My name is Svay Leng. I'm Cambodian and also a member of KPP. You mentioned that we are bringing politics into the Khmer script problem. I would like to say that it is not a political problem; it is really a cultural problem, and we want to solve it non-politically if at all possible. The Khmer language is part of our culture. Cambodian people know that our language came from Pali or [Sanskrit], but our remote ancestors changed it step by step over more than one thousand years to meet the needs of the Cambodian people. The continuity in the transition of a culture is very important. We can't deny our existing culture and introduce the Virama model to our Khmer script. Technically speaking, we can use the existing UNICODE table without Coeng, and we don't need your proof with the representation with this Virama model. Nobody in Cambodia (except maybe a very few people) says that KA and COENG KA are the same. In Latin I can write your name in minuscule, like michael everson, and get the same pronunciation, but it is very impolite; and for this [L]atin script we have both "A" and "a" in the UNICODE table even though they have the same pronunciation. But in Khmer, a consonant can have a different pronunciation from its COENG form: for example, (july=KA+KA+COENG KA+DA+SRAK AA) and (KA+KA+KA+DA+SRAK AA) do not have the same pronunciation. If you say they are the same we can replace one with the other, but in reality nobody would recognize the result as "july". You are a very good linguist and have made many contributions in the IT field to achieve UNICODE for many countries, so I think you can understand this very well. Frankly speaking, technologies should not be used to change a culture if that culture does not go against public order, public morals, or other cultures in the world. We need to talk to find the best solution, one that makes Khmer people happy and also you and Bauhahn, by looking only at the UNICODE Khmer script table and not at the representation, because it is easy to use technologies to achieve it. Respect for and understanding of each other's culture are the best way to bring about peace and coexistence in the world.
Regards,
Svay Leng
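
For readers following the "july" example in the message above, the two sequences can be written out with the code points of the existing Khmer block. This is only an illustrative sketch: the variable and function names are made up here, and it assumes the COENG model, in which the unit the message calls "COENG KA" is stored as two code points (U+17D2 followed by U+1780).

```python
import unicodedata

# Code points from the existing Unicode Khmer block.
KA = "\u1780"     # KHMER LETTER KA
DA = "\u178A"     # KHMER LETTER DA
COENG = "\u17D2"  # KHMER SIGN COENG
AA = "\u17B6"     # KHMER VOWEL SIGN AA (called SRAK AA in the message)

# "july" as spelled in the message: KA + KA + COENG KA + DA + SRAK AA
july = KA + KA + COENG + KA + DA + AA

# The same letters with the subscript replaced by a plain KA
not_july = KA + KA + KA + DA + AA


def names(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    for label, word in (("july", july), ("not july", not_july)):
        print(label, word, names(word))
```

The two strings differ in the stored data as well as on the page, so the COENG model does keep the two spellings apart; what is in dispute is whether the subscript should be a single character of its own.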

**********

Valuable contributions to the discussion of Khmer encoding were suggested in N2380R. In particular I was grateful to see it subscribed to a phonetic encoding: that is a significant step forward from earlier discussions (when glyph-based encoding was tenaciously held). Therefore in the report of N2394 it is written:

"Recommendation to Cambodian delegate:
Keeping the principle in mind, the ad-hoc recommends the Cambodian delegate to take following actions.

1. Communicate with the author of n2385 to clarify the comments each other.
2. Provide a new proposal of addition of new characters.
3. Propose draft text of additional note (such as annotation) for unreasonably coded characters to avoid a misuse of the characters by the users."

Subsequently the document N2406 has been offered. It also offers some additional valuable insights...but does not respond in the spirit or letter of the N2394 recommendation.

Many issues have been raised and each should be answered in time, but orderly and in a spirit of cooperation.

Obviously the most fundamental issue is the use of 17D2 (COENG). There is a discussion of that on page 6 and following of N2385 (and page 8 of N2406).

(a) It is not insignificant that the use of COENG in one stroke makes unnecessary the addition of about (5 * 16 = ) 80 characters (or spaces reserved for potential future subscript characters) proposed in N2380R. Hence in effect it is one character versus 80 ligatures.

(b) The existing Khmer block has only 25 unused slots. It will take a very big shoehorn to squeeze 80 characters into 25 slots (and this is not even taking into consideration ten minority script characters, ten divination lore numbers and other miscellaneous numbers that might be added in addition).

(c) Given the limited number of characters easily accessible from a keyboard, an implementation something like COENG would have to be improvised to accommodate such an unwieldy group of characters. Under the COENG model of encoding (along with frequently used non-Khmer characters) there are already about 150 characters which need to be typed from a Khmer keyboard. Eliminating COENG would add more than 100 characters.

(d) I am presently undertaking an interesting implementation of a Khmer font which has an optional feature that would facilitate transliteration of the Khmer script into Latin script. The only difference between base characters and subscript characters in this context is figuring out which one is the last in the cluster (in order to attach a vowel to it [and at this point the inherent vowels need to become explicit!]).

(e) The linguists committee (upon which much of the existing Khmer Unicode encoding was based) was not composed of implementation experts; however, they were not offended by the COENG model.
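
Point (d) can be illustrated with a small sketch. It is not Maurice's font implementation, only a guess at the kind of logic involved: under the COENG model, walk through a consonant cluster, note which consonants are subscripts, and pick the last consonant as the one a following vowel attaches to. The function names are hypothetical, and the consonant range U+1780 to U+17A2 is taken from the existing Khmer block.

```python
COENG = "\u17D2"  # KHMER SIGN COENG


def is_consonant(ch):
    """True for the consonant letters of the Khmer block (U+1780..U+17A2)."""
    return "\u1780" <= ch <= "\u17A2"


def consonants_in_cluster(cluster):
    """Return (consonant, is_subscript) pairs in logical order.

    A consonant counts as a subscript when it directly follows COENG.
    """
    pairs = []
    subscript_next = False
    for ch in cluster:
        if ch == COENG:
            subscript_next = True
        elif is_consonant(ch):
            pairs.append((ch, subscript_next))
            subscript_next = False
    return pairs


def vowel_host(cluster):
    """Return the last consonant of the cluster, where a vowel would attach."""
    return consonants_in_cluster(cluster)[-1]


if __name__ == "__main__":
    KHA, MO = "\u1781", "\u1798"
    cluster = KHA + COENG + MO  # opening cluster of the word "Khmer"
    print(consonants_in_cluster(cluster))
    print(vowel_host(cluster))
```

Base and subscript consonants go through the same code path here; the only extra information is the flag saying which form was written.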

More could be added but unfortunately I must close for now.

Sincerely,

Maurice

********

Dear Svay Leng:

>Cambodian people know
>that our language was come from Bali or [Sanskrit] but our remote ancestor
>had change step by step to meet Cambodian people needs for more than one
>thousand year. The continuity in the transition of a culture is very
>important. We can't deny our existing culture and introduce the Virama model
>to our Khmer script. Technically speaking, we can use the existing UNICODE
>table without Coeng and we don't need your proof with the representation
>with this Virama model. Nobody in Cambodia (may be except a very few
>people ) say that KA and COENG KA are the same.

I very much appreciate your concern for the cultural appropriateness of technologies. I certainly know there are cases in which one might be inclined to impose a virama model on scripts of the Brahmic family simply because they are from that family, but for which that may not make sense in relation to the way that particular script actually works or the way it is perceived within the primary culture in which it is used.

I do not personally have an opinion for or against one or the other approach to implementation at this point as far as COENG is concerned. (I have some opinions with respect to the representation of vowels, but that is a different matter.) Before we go very far in judging cultural validity, though, I wonder if it might be helpful to step back and consider a larger perspective. What I have in mind is that we perhaps need to distinguish between two things:

1) the way users will perceive an implementation, which is based on their cultural models and their experience in using the implementation; and

2) the technical details regarding how an implementation actually works and produces the user experience that it does.

In this regard, you have said that Khmer users would not perceive a common identity between KA and COENG KA. Thus, I gather, you are suggesting that they should have distinct and comparable encodings, and that the current implementation in Unicode violates this.

Without suggesting what users should or shouldn't perceive or hold as culturally valid, I'd like to ask whether implementations might be able to hide the technical details of how the implementation is accomplished. For instance, I can easily envision overall implementations based on the current definitions in Unicode in which users are not at all aware that KA and COENG KA do not have distinct and comparable encodings.

There seems to me to be a slight [analogy] with Latin case pairs. There is a measure to which English speakers do view "a" and "A" as being the same. Our history with type has reinforced a distinction, but from typewriters through current computer implementations both are typed using the same key on the keyboard. Now, in terms of the encoding implementation, it just happens that these are encoded as distinct characters of comparable status. Note that it would have been possible to develop Unicode and related implementations on another basis, one in which "a" was represented as a variant of "A", or vice versa. For example, imagine that "a" is encoded as LATIN LETTER A and "A" is encoded as a sequence < LATIN LETTER A, UPPER CASE MODIFIER >. Technically, this would have been entirely possible. What is crucial to note, however, is that users would not necessarily have to be aware of any difference whatsoever. For instance, it would be possible to place two systems side by side, one implemented one way (two comparable characters) and the other implemented another way (a basic character and a casing modifier character), and have these two systems implemented in such a way that users could not distinguish them based on the user experience.
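
The thought experiment above can be sketched in a few lines. Everything here is hypothetical: no UPPER CASE MODIFIER exists in Unicode, so a private-use code point (U+E000) stands in for it, and the function names are invented for illustration. The point is only that text stored under the modifier model can be shown to the user exactly like text stored with distinct upper- and lower-case characters.

```python
# Hypothetical stand-in for the imagined UPPER CASE MODIFIER (private use area).
UPPER_CASE_MODIFIER = "\uE000"


def encode_modifier_model(text):
    """Store each ASCII capital as <lowercase letter, UPPER CASE MODIFIER>."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isupper():
            out.append(ch.lower() + UPPER_CASE_MODIFIER)
        else:
            out.append(ch)
    return "".join(out)


def render_modifier_model(encoded):
    """Turn modifier-model storage back into what the user sees."""
    out = []
    for ch in encoded:
        if ch == UPPER_CASE_MODIFIER:
            if out:
                # The modifier capitalises the letter stored just before it.
                out[-1] = out[-1].upper()
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    stored = encode_modifier_model("Michael Everson")
    print(len("Michael Everson"), len(stored))  # storage differs
    print(render_modifier_model(stored))        # display does not
```

The stored sequences differ in length and content, yet the rendered result is the same string the user typed, which is the sense in which the two systems could not be told apart from the user's side.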

So, what I am asking is this:

While it may be true that the encoding implementation of Khmer script does not closely follow the cultural perceptions that Khmer people have of the script, might it be possible that this inconsistency could be masked from users so that they are not aware of it?

This would be somewhat comparable to the implementation of Latin script not directly reflecting a relationship between upper and lower case pairs that does exist. It would not be intended to suggest that the encoded implementation is how the script should be culturally perceived. It would be merely to facilitate the quickest path to see successful implementation of Khmer script in commercial and other software, something which might be of more immediate benefit to users (particularly keeping in mind that various font implementations that could easily have hidden this inconsistency from users were in process of development at the time when these issues arose).

I realise in asking this question that there may be factors I am not considering, as I am neither a member of the Khmer community nor even thoroughly acquainted with the details of the script. It is for this reason that I do not assert an answer one way or another but rather present this to you as a question. I raise this in case it may present a possibility for finding some solution to this concern.

Kind regards,
- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>

***********

Maurice,

You are well aware that I understand your approach and have even, with your help, implemented Khmer Unicode support. Please see my comments on your thoughts below:

A) Having one character used with each base character instead of "ligatures" doesn't automatically make it the correct manner in which to handle Khmer text.

B) There is room available in the BMP of Unicode for us to be able to do the right thing. We could get an Extended Khmer block added. This is not just about saving the number of characters encoded. It is about handling the Khmer language correctly.

C) This is not about keyboarding. It is just as easy to use your dead key approach to enter a subscript character representation as it is to enter the COENG + character with the same keyboard.

D) I would say that the ability to convert the Khmer to Latin is fascinating, but is not a factor that needs to be considered for correctly encoding Khmer into Unicode. It will be possible to take any correct encoding of a language and deduce rules for some type of morphological transformation to another language. The same rules will apply to your transformation if the subscript letters are encoded.
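
Point C can be made concrete with a rough sketch of a dead-key input layer. All of it is assumed for illustration: the key assignments are invented, and since no separately encoded subscript characters exist, private-use code points stand in for them. The same keystrokes produce either encoding; only the output table changes.

```python
COENG = "\u17D2"  # KHMER SIGN COENG

# Hypothetical layout: which key produces which consonant.
KEY_TO_CONSONANT = {"k": "\u1780", "d": "\u178A"}  # KA, DA
DEAD_KEY = "j"  # hypothetical "next consonant is a subscript" key

# Hypothetical separately encoded subscripts (private use area stand-ins).
SUBSCRIPT_OF = {"\u1780": "\uE100", "\u178A": "\uE10A"}


def type_keys(keys, model):
    """Convert keystrokes to text under the 'coeng' or 'subscript' model."""
    out = []
    subscript_pending = False
    for key in keys:
        if key == DEAD_KEY:
            subscript_pending = True
            continue
        consonant = KEY_TO_CONSONANT[key]
        if subscript_pending and model == "coeng":
            out.append(COENG + consonant)
        elif subscript_pending:
            out.append(SUBSCRIPT_OF[consonant])
        else:
            out.append(consonant)
        subscript_pending = False
    return "".join(out)


if __name__ == "__main__":
    keys = "kkjkd"  # KA, KA, dead key + KA, DA
    print([hex(ord(c)) for c in type_keys(keys, "coeng")])
    print([hex(ord(c)) for c in type_keys(keys, "subscript")])
```

The typist presses the same five keys in both cases, so the keyboard alone does not favour one model over the other.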

Can you please explain why the COENG model (based on virama) is so critical for implementing "correct" Unicode use of subscripts? If you remove constraints on the number of characters and any data entry issues, what is the compelling reason to encode Khmer in this manner? Frankly, I see no difference between the results of the COENG model and those of representing each of the subscript forms as an individual character...except that encoding the subscript forms is more efficient and intuitive to use.

Personally, I believe that resolving the COENG issue will resolve over 80% of the problems that the Cambodian [delegation] has with the current Khmer Unicode implementation. It would be great if we can tackle this issue and bring it to some resolution.

Regards to all,

Paul

***********

On 11/12/2001 06:51:08 AM Paul Nelson wrote:

>C) This is not about keyboarding. It is just as easy to use your dead
>key approach to enter a subscript character representation as it is to
>enter the COENG + character with the same keyboard.
>
>D) I would say that the ability to convert the Khmer to Latin is
>fascinating, but is not a factor that needs to be considered for
>correctly encoding Khmer into Unicode. It will be possible to take any
>correct encoding of a language and deduce rules for some type of
>morphological transformation to another language. The same rules will
>apply to your transformation if the subscript letters are encoded.

I definitely agree strongly with both of these points.

- Peter