Khmer Unicode Mail List 2001/11/17

> I likewise agree completely that cultural perceptions regarding the
> alphabet and encoding issues are related but *different* matters.
>
> Accordingly, I think the current encoding in the Standard should be judged
> on the basis of whether it allows for implementations that provide
> culturally expected behaviours with regard to input, rendering, sorting,
> etc., and not on whether the encoding itself directly corresponds to those
> cultural expectations. The very fact that cultural expectations can and
> have changed should tell us that a direct correspondence between that and
> encoding is not the right solution.
>
>
>
> - Peter
>

I completely disagree. You said that end user will never see the encoding, I agreed with this point. But UNICODE is an standard in IT field that everybody, every countries have to follow. Everybody working in this field can see this standard table.

Svay Leng

*********

On 11/16/2001 07:45:34 PM "S Leng" wrote:

>I completely disagree. You said that end user will never see the encoding, I
>agreed with this point. But UNICODE is an standard in IT field that
>everybody, every countries have to follow. Everybody working in this field
>can see this standard table.

Could you please clarify this. Why does it matter that people can see what is in the Standard? As long as users are provided with behaviours that fit their cultural expectations, why does it matter how it is implemented internally? What user need is not met?

- Peter

************

I agree with Svay Leng that end users will not [necessarily] see the model used in Khmer Unicode. Furthermore, I agree with Svay Leng that everyone working on Khmer programming or standardisation will necessarily be confronted with the principles of COENG [the virama-like character in Khmer Unicode].

Where we might have to agree to disagree, however, is whether COENG is something that SHOULD be ignored ;-) There is no disagreement that COENG is essentially hidden (although I think I hear it every time a Khmer classroom chants their spellings!).

Three relatively new points which illustrate how the virama model is appropriate for Khmer:

(1) The virama/COENG model applies a unifying principle to the three types of consonant (or independent vowel): base, first subscript, and second subscript: the same consonant occurs in each of these (for KA: U+1780; U+17D2 U+1780; U+17D2 U+1780). Each of these three types might have a different pronunciation...based on whether they occur singly, in pairs or in triplicate. The case of second subscripts has been little emphasized. The last one of these three in a cluster (and it would be the second subscript if all three consonant types are present in a cluster) always bears the vowel sound (with the exception of word finals where often there is no vowel sound in the cluster). Even though a second subscript (normally implying an inherent vowel sound if there is no explicit dependent vowel) has (i) a different sound and (ii) a different effect on collation (sorting) than a first subscript, we have not seen a Unicode proposal to explicitly encode second subscripts. Would this not weaken (in terms of consistency if nothing else) the arguments of those who distinguish base and subscript on those two supposed differentiators of sound and collation?

(2) A Unicode expert yesterday had a case of insomnia and wrote a fascinating paper on 'segmental collation' and the Unicode Collation Algorithm (which I've asked permission to make public). I was pleasantly surprised this morning to find that Khmer Unicode as presently defined appears to slot in with a default description of that algorithm (although independent vowels AND dependent vowels used in combination with NIKAHIT, REAHMUK, or YUUKALEAPINTU require some preprocessing). This means that COENG does not break collation (or even require a workaround). It is also very encouraging that the many different Indic scripts may start to come under the umbrella of universal collation standards!

(3) As mentioned earlier, COENG also works well with a more complete understanding of ROBAT (where the 'base' character looks like a diacritic and the 'subscript' looks like a base character). Look at the last three entries on the page numbered 51 (in Arabic numerals) of the Chuon Nath Dictionary to see something really astounding about ROBAT (which really is equivalent to RO COENG [U+179A U+17D2]). This illustrates that the appearance of a consonant is not as important as its roots or function. http://www.bauhahnm.clara.net/Khmer/Robat.jpg
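
To make point (1) concrete, here is a minimal Python sketch of how one stored sequence carries all three consonant types. The function name consonant_roles is hypothetical (it is not part of any standard or library), and the sketch assumes a well-formed cluster containing only consonants and COENG, with no vowels or other signs.

```python
COENG = "\u17D2"  # KHMER SIGN COENG
KA = "\u1780"     # KHMER LETTER KA

def consonant_roles(cluster):
    """Label each consonant in a cluster by its position.

    Assumption for this sketch: the input holds only consonants and
    COENG, and every COENG is followed by a consonant.
    """
    roles = []
    subscripts = 0
    i = 0
    while i < len(cluster):
        if cluster[i] == COENG and i + 1 < len(cluster):
            # COENG marks the following consonant as a subscript.
            subscripts += 1
            label = "first subscript" if subscripts == 1 else "second subscript"
            roles.append((label, cluster[i + 1]))
            i += 2
        else:
            roles.append(("base", cluster[i]))
            i += 1
    return roles

# KA as base, first subscript, and second subscript.
triple_ka = KA + COENG + KA + COENG + KA
for label, ch in consonant_roles(triple_ka):
    print(f"{label}: U+{ord(ch):04X}")
```

The same code point, U+1780, is reported for all three positions; only the preceding COENG distinguishes a subscript from a base.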

Please appreciate that our early attempts to encourage Khmer standardisation were not plots to force Khmer into a non-Khmer mold (or to sidestep Khmer authorities). They represent the 'general' case which encourages a robust synthesis of the old and the new...and are largely being proved valid as we encounter unexpected issues relating to the script. There are deficiencies, however, and I trust these will be resolved quickly in the spirit of cooperation.

Incidentally, it would be good to know if any of the individuals explicitly copied in this email are already receiving copies via the Khmer mailing list (it is not helpful to send [duplicate] copies given the cost of internet connectivity in Cambodia).

Sincerely,

Maurice

> I completely disagree. You said that end user will never see the encoding, I
> agreed with this point. But UNICODE is an standard in IT field that
> everybody, every countries have to follow. Everybody working in this field
> can see this standard table.

> Svay Leng

*************

At 10:45 +0900 2001-11-17, S Leng wrote:

That's not the *end* user. It's the programmers and font designers who will see it. The end user is the secretary, the person writing his novel, the journalist.

Programmers and font designers benefit if the coeng or virama encoding model is used, because this is the same, familiar model which they use for other scripts which have the same intrinsic principles as Khmer.

The ordinary writer of Khmer neither knows nor cares what internal encoding is used. But the coeng encoding model has proved to work very well for all other Brahmic scripts, just as it does for Khmer. It has benefits for font design, for reuse of software designed for other similar scripts, for transparent transliteration of Sanskrit and Pali (remember, Khmer is not the only language written in the Khmer script), and for permitting a wide range of economic inputting methods.

Some inputting software might signal the coeng character as a little plus sign below a consonant, showing that the coeng has been entered, and that when the next consonant is entered, it will appear with its subscript shape. This might be beneficial for people learning to type. It is a method used by millions of users of other Brahmic scripts.

But other inputting software might hide coeng completely. Different users might have different preferences. All of those can be catered for without changing the encoding model which has been standardized for Khmer and other Brahmic scripts.
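
As a rough illustration of these two preferences, the Python sketch below shows an input preview that either signals a pending coeng or hides it, while the stored text stays the same. The function typing_preview and the ASCII "+" marker are assumptions made for this example; real input software would draw its own indicator.

```python
COENG = "\u17D2"  # KHMER SIGN COENG

def typing_preview(buffer, show_pending_coeng=True):
    """Return what the user sees while typing; the buffer is not changed.

    Hypothetical helper: a trailing COENG is "pending" until the next
    consonant arrives, and is shown as "+" or hidden by preference.
    """
    if buffer.endswith(COENG):
        visible = buffer[:-1]
        return visible + "+" if show_pending_coeng else visible
    # A completed cluster is passed through for the font to shape.
    return buffer

typed = "\u1780" + COENG                 # KA followed by COENG
print(typing_preview(typed))             # learner mode: marker shown
print(typing_preview(typed, False))      # coeng hidden completely
```

In both modes the encoded text is identical, so the display preference never touches the encoding model.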

If we were to change the encoding model for one of these scripts, the door would be open for people to come and ask us to change them all. This would seriously undermine the stability of the entire International Standard.

It *is* possible to conceive of more than one way of encoding the Brahmic scripts. From a certain point of view, it could be said that any of them could be made to work, and that each of them has advantages and disadvantages. But the reality is that one way of encoding has already been selected. I, for one, am confident that it can represent Khmer, Pali, and Sanskrit texts accurately and unambiguously.
--
Michael Everson