Khmer Unicode Mailing List 2001/11/20

> On 11/17/2001 12:07:35 PM "S Leng" wrote:
>
> >And I also wrote "everybody working in this field can see this
> >standard table". For me this means that Cambodians who are programmers,
> >computer engineers, and so on can see this standard when they study
> >computer science.
>
> Without trying to argue for or against either opinion, I would like to
> understand why it is a problem for Cambodian computer engineers if it
> produces the necessary results.
>

Frankly speaking, I'm very surprised by this question. Let me give you an indirect example. Two Cambodians are getting married. They want a Khmer-style ceremony with Khmer dresses. But you say: "You don't need a Khmer dress. Please use this Chinese dress because it's easier to get; nobody can make Khmer dresses here. You can still get married. You get the same result regardless of the dress." I'm sure that they will not wear this Chinese dress. If they are blind (my apologies to blind people; I would rather not use this example, but I don't have an easier one), and if you don't tell them that the dress is Chinese, they will wear it (but if you tell them it is a Chinese dress, they will not, even though they cannot see it). If they can see, what will they think? It is a similar case here. Cambodians working in the IT field can see the table, and they also know that the new Virama model has been introduced for the implementation of the Khmer script. What will they think? (If there were no other solution, maybe it would be OK, but here we have another appropriate solution.)

I very much regret that no official Cambodian body participated in the process when you submitted the proposal for the Khmer codes. By the way, I don't know the members of this mailing list. To make the discussions more open, I think it's time to summarize your response to our document N2406 and put it in the WG2 documents, so that other people can officially learn of these discussions.

Svay Leng

**************

Asmus Freytag [asmusf@ix.netcom.com]

At 12:40 PM 11/20/01 +0900, S Leng wrote:
> By the way, I don't know the members of this mailing list. To make the
> discussions more open, I think it's time to summarize your response to our
> document N2406 and put it in the WG2 documents, so that other people can
> officially learn of these discussions.
>
>Svay Leng

The discussions among the people on this list are only an exchange of ideas. In other words, they allow us to learn more about our respective points of view and exchange information. However, they are not official communications from Unicode, UTC or WG2, and, I presume, likewise not from the Cambodian NB. In time, we will need to use what we learned from our exchange here to formulate an official response to your document N2406 and have it reviewed and approved by our organizations, in my case the Unicode Technical Committee and the US National Body.

A./
**************

Discussions on this mailing list concerning COENG seem to have quieted down. I presume this is because most arguments on either side of the issue have already been presented.

Because COENG has already entered the ISO10646/Unicode standards, the burden of proof lies with those who would deprecate it. Furthermore, this case is not a simple single-character deprecation; it is the deprecation of a model.

Svay Leng, especially, has forcefully asserted that Khmers do not tend to think of COENG as having an independent existence. Rather, the perception is that subscripts are distinct characters from their base character.

Others (including myself) have argued that despite those very real perceptions, the use of COENG is a valid underlying model for the Khmer script (as with most other Indic scripts) and is more consistent with the data (second subscripts, robat, collation, names, phonetics, extensibility). End users could be 'shielded' from COENG if need be. But deprecating it from the Standards would introduce significant instability to the Standards themselves (a bad precedent which would adversely affect the stability of similar Indic scripts).
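
To make the model under discussion concrete, here is a minimal Python sketch of how a subscript is represented under the COENG model: there is no separate subscript character, only the sequence COENG (U+17D2) followed by the ordinary consonant. The example cluster and the helper name `describe` are illustrative choices and are not taken from the standard or from N2406.

```python
import unicodedata

COENG = "\u17D2"  # KHMER SIGN COENG

# Base consonant KA carrying a subscript KA: KA + COENG + KA.
cluster = "\u1780" + COENG + "\u1780"

def describe(text):
    """Return the Unicode character name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]

names = describe(cluster)
for name in names:
    print(name)
```

Under an explicit-subscript model the same cluster would instead be two code points, the base letter and a dedicated subscript letter.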

Unfortunately we cannot simply agree to disagree with regard to the Khmer ISO10646/Unicode Standard. The model is COENG or explicit subscripts, not both.

An overwhelming burden of proof justifying the deprecation of COENG has not been presented. The COENG model should, therefore, remain in the standards as it is. I trust the Cambodian delegation can help us move to a situation where software development consistent with Khmer Unicode can now proceed.

Furthermore there are other issues which need to be discussed...may I suggest sorting/collation be the next one?

Sincerely,

Maurice

************

On 11/19/2001 09:40:55 PM "S Leng" wrote:

>> Without trying to argue for or against either opinion, I would like to
>> understand why it is a problem for Cambodian computer engineers if it
>> produces the necessary results.
>>
>
>Frankly speaking, I'm very surprised by this question. Let me give you an
>indirect example...

Well, my question was intended to find out what the concern was with respect to Cambodian computer engineers, and you appear to have answered that: cultural perceptions regarding how the script works, but not any technical limitations. Thus, you are saying that the usability requirement for engineers is pretty much the same as that for end users: it has to appear in terms of the way they think of the script today.

>Cambodians working in the IT field can see the table, and they also know
>that the new Virama model has been introduced for the implementation of the
>Khmer script. What will they think [?]

If we present it as something new being imposed on the script, I can understand that they may question it; and if it doesn't provide some reasonable functionality or, worse, impedes functionality, I can understand them resenting it.

But speaking hypothetically (I don't know if these suggestions are potentially applicable): if it does provide some reasonable functionality without impeding, perhaps they might come to accept it? And if we present it as reflecting the historical roots of the script rather than something new and foreign, perhaps they might not have any objection to it?

> I very much regret that no official Cambodian body participated in the
>process when you submitted the proposal for the Khmer codes.

I was not involved in that proposal, myself. If I had been, there are some things I would likely have tried to get done differently.

- Peter
***********

Hopefully the topic of Sorting/Collation will be more satisfying than the discussion of COENG:

(1) There are many facets to it
(2) Many decisions need to be made (and none are currently enshrined in a standard;-))
(3) It is adaptable
(4) It might resolve some of the character addition issues
(5) It moves into other facets of how Khmer can be used on computers (not just glorified typewriter output)

A rough outline:

(1) We need to know which Khmer characters have an effect on collation
+ Obviously consonants, independent vowels, and dependent vowels affect it
+ Three signs seem to affect the vowel sorting
+ What effect should other signs have on sorting? singly? in combination?
+ Should punctuation, numbers, abbreviations, etc, affect sorting?
+ What is the largest number of each type of character in a given cluster?
    Consonant or Independent Vowel x 1
    Register Shifter x 1 (secondary priority...so this does not affect primary sort order)
    Subscript Consonant or Independent Vowel x 2 (are there any words with more than 2 subscripts?)
    Vowel x 1 (but note change when used with one of three signs)
    Other signs x ?
+ In what order should Khmer characters be stored? (In the Unicode standard there is a normative ordering: Consonant, Register shifter, Subscripts, Vowel, Signs, but nothing on the ordering of signs.)
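
The cluster shape listed above can be sketched as a regular expression. This is only an illustration under stated assumptions: the character classes are taken from the code point ranges of the Khmer block, the element order follows the list as given in this message, the limit of two subscripts is the open question raised above rather than a settled rule, and the names `CLUSTER` and `is_cluster` are hypothetical.

```python
import re

# Character classes by code point (Khmer block).
BASE = "[\u1780-\u17B3]"               # consonants and independent vowels
SHIFTER = "[\u17C9\u17CA]"             # MUUSIKATOAN, TRIISAP
SUBSCRIPT = "\u17D2[\u1780-\u17B3]"    # COENG + consonant or independent vowel
VOWEL = "[\u17B6-\u17C5]"              # dependent vowels
SIGN = "[\u17C6-\u17C8\u17CB-\u17D1\u17D3]"  # other signs, in any order

# Order as stated above: base, register shifter, up to two subscripts,
# one vowel, then any number of signs.
CLUSTER = re.compile(
    f"{BASE}{SHIFTER}?(?:{SUBSCRIPT}){{0,2}}{VOWEL}?{SIGN}*"
)

def is_cluster(text):
    """True if text is exactly one cluster in the assumed order."""
    return CLUSTER.fullmatch(text) is not None

in_order = is_cluster("\u1781\u17D2\u1798\u17C2")      # KHA, COENG MO, AE
out_of_order = is_cluster("\u1781\u17C2\u17D2\u1798")  # vowel before subscript
```

A check of this kind would let stored text be validated against whichever ordering is finally agreed, independently of how the user typed it.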

(2) How many different dictionary sections should be set aside for collation?
+ Should some/all independent vowels be sorted separately or preprocessed to a base consonant and dependent vowel?
+ Should there be separate sections for kinds of KHMER LETTER BA?

(3) Is historical Khmer ordering appropriate for indexes of the internet age?
+ Is there a standardised test to gauge competency in indexing/sorting/finding words?
+ How common in the Khmer population is skill in putting words in alphabetical order (or finding a word in the dictionary)?
+ Should we just let the computer find words and skip the manual skill?
+ What percentage of current books contain an index in Khmer?
+ How should we handle inconsistency? Lookups?
+ Is the Chuon Nath dictionary THE standard?

(4) What is the Khmer sorting order?
+ Pali sorting seems to be different from Khmer sorting. What about Sanskrit? (are there any Sanskrit dictionaries in Khmer script available?)
+ Dependent vowels
+ Robat
+ Inherent vowels
+ Multiple decompositions of independent vowels
+ Lookup or rules
+ Primary versus secondary sort (cluster versus word)
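
The primary versus secondary distinction can be sketched as a two-level sort key. This is a rough illustration only: it uses code point order as a stand-in for the real Khmer order (which is exactly what remains to be decided), it treats the register shifters as the only secondary elements, and the sample strings are illustrative sequences rather than attested words. The name `sort_key` is hypothetical.

```python
SHIFTERS = {"\u17C9", "\u17CA"}  # MUUSIKATOAN, TRIISAP

def sort_key(word):
    """Two-level key: the primary level ignores register shifters,
    the secondary level breaks ties using the full string."""
    primary = tuple(ord(ch) for ch in word if ch not in SHIFTERS)
    secondary = tuple(ord(ch) for ch in word)
    return (primary, secondary)

words = [
    "\u179F\u17C5",        # SA + AU
    "\u179F\u17CA\u17B6",  # SA + TRIISAP + AA
    "\u179F\u17B6",        # SA + AA
]

by_code_point = sorted(words)
by_two_levels = sorted(words, key=sort_key)
```

With plain code point order the shifted form sorts after SA + AU; with the two-level key it sorts next to the unshifted SA + AA, because the shifter only matters once the primary level is tied.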

Hopefully some expert Khmer linguists and scholar monks can be brought into this discussion. Is there a government body which oversees standardisation of the Khmer language? If at all possible, it would be preferred if specific examples could be quoted (if not graphically, then according to Khmer Unicode code points).

Sincerely,

Maurice

*************

Asmus Freytag [asmusf@ix.netcom.com]

All,

Having read the detailed exchanges that have been made back and forth, I would like to take the occasion of the exchange below to try to come to a general view.

At 12:40 PM 11/20/01 +0900, S Leng wrote:
> > Without trying to argue for or against either opinion, I would like to
> > understand why it is a problem for Cambodian computer engineers if it
> > produces the necessary results.
> >
>
>Frankly speaking, I'm very surprised by this question. Let me give you an
>indirect example. Two Cambodians are getting married...

I would argue that in the case of cultural rituals the manner of the ceremony and the dress very much form part of the core, or essence, of the event, not just the 'outcome'. However, the same is not true for processing bits in the bowels of a modern computer. In that sense the analogy is seriously flawed.

It might be useful to consider which aspects of processing Khmer texts in the two models would allow an end user to differentiate between the two, in other words would be observable.

1) Storage & Transmission

There is a small increase in average length of stored data with the virama model. There are some circumstances where such an increase will be observable due to cache-size or transmission line limits, but these cases will be in the minority and will be limited to what I call 'industrial-strength' text processing, where a few percent in throughput can be measured. For the typical end user, writing some text, viewing the web, etc., even doubling the storage size is rarely observable.
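
The size difference can be illustrated for a single word. This is a sketch under a loud assumption: explicit subscript characters do not exist in Unicode, so they are simulated here with Private Use Area code points starting at U+E000, and the helper `to_explicit_model` is hypothetical.

```python
COENG = "\u17D2"  # KHMER SIGN COENG

def to_explicit_model(text, first_subscript=0xE000):
    """Rewrite each COENG + letter pair as one stand-in code point.

    Assumption: explicit subscripts are simulated with Private Use Area
    code points; no such Khmer characters are actually encoded.
    """
    out = []
    i = 0
    while i < len(text):
        if text[i] == COENG and i + 1 < len(text):
            out.append(chr(first_subscript + ord(text[i + 1]) - 0x1780))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# The word "Khmer": KHA, COENG, MO, AE, RO.
word = "\u1781\u17D2\u1798\u17C2\u179A"
explicit = to_explicit_model(word)

# (code points, UTF-8 bytes) for each model
sizes = {
    "coeng": (len(word), len(word.encode("utf-8"))),
    "explicit": (len(explicit), len(explicit.encode("utf-8"))),
}
print(sizes)
```

For this one word the COENG model costs one extra code point. The average over running text depends on how often subscripts occur, which this sketch does not measure.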

2) Data Input

We are already agreed that this can be decoupled from the encoding, so no observable differences exist.

3) Rendering

Rendering the Khmer script requires a deviation from the simple left-to-right, one-glyph-at-a-time approach that can be used for the bulk of Latin, Greek, Cyrillic, Han and several other scripts. The fonts for both approaches need to contain the same glyphs, and in both cases a rendering engine needs to deal with placement. Modern rendering engines and font technology support the selection of glyphs from character pairs, and this is not a bottleneck in terms of performance. Therefore, I am not aware of any observable differences.
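
Glyph selection from character pairs can be sketched as a simple table lookup. This is a toy illustration, not how any particular rendering engine or font format works: the glyph names and the function `select_glyphs` are hypothetical, only two letters are covered, and glyph placement is left out entirely.

```python
COENG = "\u17D2"  # KHMER SIGN COENG

# Hypothetical glyph names; a real font defines its own.
BASE_GLYPHS = {"\u1780": "ka", "\u1798": "mo"}
SUBSCRIPT_GLYPHS = {"\u1780": "ka.sub", "\u1798": "mo.sub"}

def select_glyphs(text):
    """Map code points to glyph names, turning COENG + letter into
    a single subscript glyph."""
    glyphs = []
    i = 0
    while i < len(text):
        nxt = text[i + 1] if i + 1 < len(text) else None
        if text[i] == COENG and nxt in SUBSCRIPT_GLYPHS:
            glyphs.append(SUBSCRIPT_GLYPHS[nxt])
            i += 2
        else:
            glyphs.append(BASE_GLYPHS.get(text[i], ".notdef"))
            i += 1
    return glyphs

glyphs = select_glyphs("\u1780\u17D2\u1798")  # KA with subscript MO
```

Under an explicit-subscript encoding the lookup would be one code point to one glyph, but the font would still need the same subscript glyphs and the same placement logic.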

4) Sorting, Searching and other processes

If I read the postings correctly, neither sorting nor any other process has been shown to lead to noticeably inferior results for either of the two encodings. All things being equal, the slight increase in length of the virama model can result in an observable performance impact (of the same slight degree) but only for those 'industrial-strength' applications for which these processes are truly time-critical bottlenecks.

5) Expressiveness

This one I cannot judge as easily. An argument has been made that the virama model is more open-ended, in that it can express subscripts that might not initially be coded in the other model. The expressiveness of the virama model depends on the support built into the fonts; the expressiveness of the other model depends on the combinations that are encoded *and* built into the fonts. I suspect that, with some care on our part (and on the part of the font suppliers), no typical end-user will observe any differences.

6) Stability & time of arrival

There is a big observable difference here. If we can accept the existing model and move forward on its base, implementations supporting Khmer can and will be available much sooner and much more widely than if we continue this debate. If we end up 'deprecating the virama' we will forever enshrine an ambiguity in the encoding that will continue to make certain kinds of processing much slower than the differences between either model alone. We have learned at our peril that there are many implementations of a script as soon as it's encoded -- even when they are not visible to the committees. Many of these implementations will have to be supported into the future.

The future

The differences noted in items 1 and 4 will matter less and less in the future. They will not go away, but the sheer amount of text(!) data needed to define an application as 'industrial-strength' will go up. In other words, more and more applications won't reach the limits in bandwidth or cache sizes that make these small differences in the encoding observable.

The differences noted in item 6 will have partially disappeared. In ten years, a two or three year delay in wide-spread support for Khmer may be forgotten. If there are implementations of the virama model that have to be supported for backwards compatibility, they will no longer be unknown, but known, and everyone will have learned how to deal with them - but not without some ongoing cost.

Conclusion

The observable differences between the encodings appear to be minor; certainly they don't appear to be of a degree that one would consider either one as insufficient or gravely defective. The largest observable effects, especially in the near term, are the result of the uncertainty around the Khmer encoding and the threat of ongoing ambiguity. The best service we can do the users of the Khmer script (wherever they are located) is to come to a speedy resolution of this issue.

If I have left out an important aspect, or misunderstood a particular area of processing please let me know.

A./

************

Could you please let me have your comments on all the items mentioned in document N2406? It seems that you comment only on some items, in order to justify the existing code set and force it ahead.
Svay Leng
**************

Therefore I would also like to say that the Khmer code set will be a Cambodian standard, and that the table itself, not only the result of the encoding, is very important for Cambodian people.
Svay Leng

***************

The only thing I have to say right now about Sorting is that we're supposed to be dealing with the Cambodian objections to the way Khmer has been encoded.
--
Michael Everson *** Everson Typography *** http://www.evertype.com