Agreed, a sparse keyboard can be used to generate a rich encoding. The point is that basically one would be simulating the sparse encoding...so why bother with the 'rich encoding' in the first place (since its raison d'etre [interface issues with users] has largely evaporated).

As you mention, a 'rich keyboard' could be used to generate a sparse encoding...but anyone concerned with promoting efficiency would avoid that as a matter of principle (particularly in the case of Khmer where even a sparse encoding/keyboard requires an excessive number of characters compared to the number of keys available).

Sincerely,

Maurice

***********

> A. ...the harsh reality is that there are major parts of Cambodia that will never see fast networks or huge storage spaces.

> Data transmi[ss]ion and storage cannot be summarily dismissed as a cheap item.

None of us who care about Cambodia or the rest of the developing world would question the lamentable lack of affordable infrastructure and hardware. However, this discussion relates to the comparable costs of data transmission and storage versus the cost of data entry (not the affordability of computerisation in general). Data entry will continue to be the expensive component, remaining relatively stable or growing slowly, while the costs of transmission and storage drop rapidly.

(A.1) The realities of mass production also dictate that the developing world is never far behind the developed world in the capacities of the hardware it uses. For example, it is practically impossible to buy a new 20 megabyte desktop hard drive today. Similarly, it is difficult to find a 300 baud modem. On the other hand, 500 megabyte hard drives cost less than 20 megabyte hard drives used to cost, and 56K modems cost less than 300 baud modems used to cost.

(A.2) Furthermore, transmission and storage benefit from compression schemes...but data entry is resistant to such improvements at least in the short term. In an effective compression scheme the two encoding models would occupy the same space.
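
To illustrate the compression point, here is a rough, synthetic sketch in Python (not a claim about real Khmer corpora; the "explicit subscript" characters below are placeholders at an arbitrary code point offset, not proposed characters). Because the COENG is highly predictable, a general-purpose compressor recovers most of the raw-size difference between the two models:

```python
import random, zlib

random.seed(0)
BASES = [chr(c) for c in range(0x1780, 0x17A3)]  # Khmer consonants KA..A
COENG = "\u17D2"

# Random clusters: a base consonant, with a subscript ~30% of the time.
clusters = [(random.choice(BASES),
             random.choice(BASES) if random.random() < 0.3 else None)
            for _ in range(5000)]

# COENG model: a subscript is written as COENG + consonant.
coeng_text = "".join(b + (COENG + s if s else "") for b, s in clusters)
# Hypothetical explicit model: one placeholder character per subscript.
explicit_text = "".join(b + (chr(ord(s) + 0x100) if s else "")
                        for b, s in clusters)

for name, text in (("COENG model", coeng_text), ("explicit model", explicit_text)):
    raw = text.encode("utf-8")
    print(f"{name}: raw {len(raw)} bytes, compressed {len(zlib.compress(raw, 9))}")
```

The compressed sizes end up far closer than the raw sizes, which is the point being made above.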

(A.3) The cost per byte of transmission and storage is dropping at an astounding rate, while the cost of data entry is slowly increasing. Certainly reuse is a consideration; I could not find statistics that detail it (it would be affected by compression, revised content, the plane of the encoding, etc). In any case the cost of supporting the 12% increase in size of COENG-enabled encodings drops precipitously over time.

> B. One point of view says "return to the roots" and one point of view says "here is where we are today".

> One can claim that the only way to write Khmer is as an Indian advisor would have years ago...

(B.1) It might be put another way;-) Restrict the encoding to what is in common use today or ensure that it is open-ended enough to accommodate the whole range of the script's history.

(B.2) Having persisted through the years of steps which led to the present Khmer Unicode standardisation and now re-running much of the same turf, I have long been leery of having to engage the standardisation mechanism every time a new subscript form is discovered and proposed. Theoretically, it is a clean, one-year, fast-track process. In fact it is susceptible to political, personal and linguistic preferences. Much of the work of the standardisation process is done by unpaid volunteers. What do they get in return? Disparaging comments, threats to their lives and more work! I'm concerned that exhaustion will overtake many of these experienced folk and the process will either become much more rigid and costly or collapse in anarchy. As the saying goes, 'A bird in the hand is worth two in the bush!'

> How will these unencoded sounds be represented in Unicode?

Hopefully not in pairs;-)

> C. One point of view say[s] "COENG solves all subscripts" and one point of view says "encode all subscripts".

> The Cambodian proposal encoded all known subscripts and lunar dates, and carefully left blank spaces for those COENG forms that were not currently used for modern usage and marked them as "RESERVED".

The recognition of the need for these various forms in the final output is appreciated.

> My supposition is that each point of view wants the best for the users of the Khmer script.

I'm sure this is the case...unfortunately the implications are not always so well understood. The push to expedite the process of getting Khmer standardised occasioned, years later, this challenge. In turn this challenge has stopped or slowed Khmer implementations and introduced instability into encoding standardisation. Cooperation would yield a much more palatable result.

> To those in favor of the COENG model I ask: Can you please present why the COENG model (based on virama) is so critical for implementing "correct" Unicode use of subscripts?

(C.1) At the moment I suppose the strongest argument is that it is the standard which (theoretically at least) cannot be deleted. The idea behind this is that standards can be implemented (sometimes at great cost) knowing that the implementer can guarantee a certain level of capability (say, of searching). The strength of a standard is not its accuracy...it is its dependability. This is illustrated by the embedding of standards within Unicode which violate the principles of Unicode...to promote accurate round-trip conversions. The ramifications of dropping the COENG model in Khmer would, I fear, be felt much more widely than among Khmer script users. It would call into question the very foundations of ISO10646/Unicode. Could this not also put a cloud over other Indic scripts...effectively stopping development in this whole area, which most certainly needs further improvement? The resolutions passed in Singapore confirm the commitment of those standards bodies to the permanency of their encodings in general (precisely to prevent such wobbliness). The unanimous vote on the Khmer issues confirms this in particular.
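
For readers who want the model made concrete, here is a minimal sketch (Python; code points from the published Khmer block): a subscript consonant is not a character of its own but is written as COENG followed by the ordinary consonant, and the rendering engine selects the subscript glyph.

```python
# Minimal illustration of the virama/COENG model: the subscript form of
# a consonant is encoded as U+17D2 COENG + the ordinary consonant, and
# the renderer (not the encoding) produces the stacked glyph.
KA    = "\u1780"  # KHMER LETTER KA
COENG = "\u17D2"  # KHMER SIGN COENG
AA    = "\u17B6"  # KHMER VOWEL SIGN AA

cluster = KA + COENG + KA + AA  # base KA + subscript KA + vowel AA
print([f"U+{ord(c):04X}" for c in cluster])
# ['U+1780', 'U+17D2', 'U+1780', 'U+17B6']
```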

Deprecation of COENG (U+17D2) should not be spoken of so glibly. The effective dropping of the virama/COENG model would not just drop the single character COENG (U+17D2), it would also drop the 34/35 implied subscript consonants, around 17 implied subscript independent vowels, and all the lunar dates. Although there are many compatibility characters in Unicode, they are there to respect other long established standards (part of the cost of getting Unicode accepted). ISO10646/Unicode is not about to knowingly countenance creating new compatibility characters.

Paul has already indicated that Microsoft cannot proceed with Khmer implementations while there is the present cloud over the Khmer Unicode encoding. Certainly there are other software developers (there is no registry of who does what!) who are similarly in a holding pattern (and maybe not just for Khmer!).

> (C.2) If you remove constraints of the number of characters

This is not something so easily removed. There are limited available spaces in the base plane of Unicode. Although there are more than enough to accommodate the proposed Khmer characters, hot competition for those remaining spaces could easily leave an additional Khmer block out cold in the supplementary plane (in which case the encodings would grow significantly in size). Even if Khmer were able to grab an additional block in the base plane of Unicode, it would then put other languages at the disadvantage of having to share space in a supplementary plane. To be consistent, however, I must admit that compression schemes could drastically reduce the overall damage.

> (C.3) [If you remove] any data entry issues,

If there were no data entry issues the complaint about COENG being non-intuitive would also be irrelevant. Data entry is a significant issue. Happily the many vowels in Khmer will some day make verbal data entry of Khmer easier than for most other languages.

I am not proposing a keyboard layout standard at the moment. On the other hand I would offer that Indic scripts do not tend to use IMEs (which are necessary for very large character sets).

> (C.4) Is there any reason why encoding subscript forms and lunar numbers would not work?

Yes; in that case certain rare subscript independent vowels would not be available. These surprises do keep cropping up.

> If a proposed Extended Khmer range was considered it could not contain any characters already encoded. It could only contain subscript forms and any new characters that were not put in the Khmer block. Additionally, it would be necessary to have some type of mapping between the current Unicode encoding model and the subscripts encoded in the Extended Khmer range.

>Perhaps it is to[o] early to propose such a compromise. Oh well. It is an idea.

This is not a case amenable to compromise, unfortunately. It would introduce chaos into the use of Khmer searching (some users would use the COENG model and others the explicit subscript). Although theoretically a mapping between two equivalent forms would be possible (after all we have non-case-sensitive searching in the Latin script), the speed required of searching means that it is typically conducted purely at a binary match level. The demand of a huge user population made case-insensitive searching a priority; it is doubtful that large software development projects would spend that kind of money for a minority language such as Khmer (possibly slowing the speed of searching of other scripts). The open-endedness of ISO10646/Unicode is there so that characters which are rarely encountered (and hence unnoticed in the initial encodings) can be added. It is not there to facilitate major shifts in how the encodings are handled. It is not there to break previous encoding models.
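
A tiny sketch of the search problem (the "explicit subscript" below is a stand-in character from the Private Use Area, since no such code point exists): a plain binary comparison, which is how fast searching is typically done, cannot match the two spellings of the same cluster.

```python
KA, COENG = "\u1780", "\u17D2"
SUB_KA = "\uE000"  # hypothetical explicit-subscript KA, purely illustrative

coeng_spelling    = KA + COENG + KA  # COENG-model spelling
explicit_spelling = KA + SUB_KA      # imagined explicit-subscript spelling

print(coeng_spelling == explicit_spelling)                  # False
print(coeng_spelling in ("..." + explicit_spelling + "..."))  # False: search misses it
```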

> There is an additional huge issue that must be considered. If the Khmer encoding is changed to include subscript forms, what rationale can be provided to prevent Indic languages from also demanding that all of their subscript forms be encoded? What makes Khmer more like Tibetan script than Sanskrit?

Yes, as pointed out above that could be catastrophic.

> P.S. - My opinion is still that keyboarding is an implementation issue that needs to be left out of the encoding mix. There are many ways to implement keyboards and other input devices. Those methods may change from device to device and evolve as technology evolves. This issue is about deriving the best Khmer encoding solution to solve this impass[e].

If keyboarding is not an issue (as I have mentioned earlier) there is no user interface issue and this challenge against COENG lacks merit.

It will be nice when we can conclude the discussion of COENG. It is unfortunately an area where compromise is not a real option. Hopefully the retracing of steps will help readers to appreciate the many issues involved and pave the way for more productive cooperation on other issues.

Possibly for off-line discussion: Paul (or others) could you provide me with some suggestions on how to provide feedback (which resolves when there is a complete entry) to the user when using dead-key data entry (but without corrupting the data store)?

Sincerely,

Maurice

 

*********

On 11/15/2001 12:05:05 AM Maurice Bauhahn wrote:

>> B. One point of view says "return to the roots" and one point of view
>> says "here is where we are today".
>
>> One can claim that the only way to write Khmer is as an Indian advisor
>> would have years ago...
>
>(B.1) It might be put another way;-) Restrict the encoding to what is in
>common use today or ensure that it is open-ended enough to accommodate the
>whole range of the script's history.

This may or may not be relevant, but we should make sure that it is possible to encode archaic Khmer documents. Of course, that alone does not predetermine one implementation over another.

>(B.2) Having persisted through the years of steps which led to the present
>Khmer Unicode standardisation and now re-running much of the same turf, I
>have long been leery of having to engage the standardisation mechanism every
>time a new subscript form is discovered and proposed.

I am inclined to agree, which is exactly why I was not happy with vowels being encoded as precombined combinations. But since they're there in that form, I'll have to live with that.

>I'm concerned
>that exhaustion will overtake many of these experienced folk and the process
>will either become much more rigid and costly or collapse in anarchy. As the
>saying goes, 'A bird in the hand is worth two in the bush!'

I quite agree. At the same time, if there are ways in which the Standard as it stands is inadequate, we should not let anticipation of a difficult process keep us from pursuing a solution. The question I see, then, is whether or not it is inadequate and, if so, in exactly what way.

>> To those in favor of the COENG model I ask:
>> Can you please present why the COENG model (based on virama) is so
>> critical for implementing "correct" Unicode use of subscripts?
>
>(C.1) At the moment I suppose the strongest argument is that it is the
>standard which (theoretically at least) cannot be deleted.

We have to recognise that this is indeed a strong consideration. Maurice is raising very valid issues regarding the stability of the Standard, and the likely refusal of the standardising bodies to completely re-do what was done. At the same time, deprecation of COENG and addition of new characters *without decomposition mappings* just might possibly be within the scope of what they'd consider, though I don't think it's at all obvious that they would. It is my impression from previous incidents that the mere fact that a national body is not entirely happy with it will not be enough. If the standard as it stands is not implementable or is inadequate for some reason, then they will want to make changes to rectify that situation. But that would need to be demonstrated.

>Deprecation of COENG (U+17D2) should not be spoken of so glibly. The
>effective dropping of the virama/COENG model would not just drop the single
>character COENG (U+17D2), it would also drop the 34/35 implied subscript
>consonants, around 17 implied subscript independent vowels, and all the
>lunar dates.

That may be somewhat of an overstatement. In strict terms, it is deprecating one character in the Standard, not 52+ characters. It would only be done on the assumption that the others were included or supported in some other way. There are implications, though, in terms of what impact there may be on existing implementations. Now, we believe that there are not yet likely to be any existing software implementations that have been put into use. (I wouldn't be surprised if there wasn't anybody besides Paul and Maurice who had worked on implementations.) Even so, the standardising bodies will take very seriously the potential implications for implementations that may have been done, whether we know of them or not. Maurice is quite right in pointing out that they have to maintain trust regarding the stability of the Standard. If they didn't, that could put the Standard at serious risk of failure.

>Although there are many compatibility characters in Unicode

I don't think this is really relevant since we are not, that I know of, talking about possible addition of compatibility characters.

>Paul has already indicated that Microsoft cannot proceed with Khmer
>implementations while there is the present cloud over the Khmer Unicode
>encoding. Certainly there are other software developers (there is no
>registry of who does what!) who are similarly in a holding pattern (and
>maybe not just for Khmer!).

Yes. Resolving the issue and removing that cloud is probably more important than how it is implemented.

>> (C.2) If you remove constraints of the number of characters
>
>This is not something so easily removed....

I think we have to conclude at this point that we cannot assume this is an obstacle to a proposal for adding new characters.

>> (C.3) [If you remove] any data entry issues,

I'm inclined to think that data entry issues do not present a strong argument one way or another.

>> (C.4) Is there any reason why encoding subscript forms and lunar numbers
>> would not work?
>
>Yes; in that case certain rare subscript independent vowels would not be
>available. These surprises do keep cropping up.

In general, I think that a productive mechanism for composition (that isn't completely ad hoc or overly complex) is preferable for exactly the reason that Maurice mentions. It is not a show-stopping requirement, though.

>This is not a case amenable to compromise, unfortunately. It would introduce
>chaos into the use of Khmer searching (some users would use the COENG model
>and others the explicit subscript).

If COENG were deprecated, then implementers should be able to assume that it is deprecated and ignore it. There are potential risks in doing that, though.

>Although theoretically a mapping between
>two equivalent forms would be possible (after all we have non-case-sensitive
>searching in the Latin script), the speed required of searching means that
>it is typically conducted purely at a binary match level.

Maurice, I don't think you can argue on the one hand that hardware costs for storage and transmission are dropping rapidly but on the other hand raise a concern for this kind of issue. The processing speed/cost ratio is improving faster than the storage/cost and bandwidth/cost ratios, and major databases can already achieve adequate performance where Latin case issues are involved.

>The demand of a huge user
>population made case-insensitive searching a priority; it is doubtful
>that large software development projects would spend that kind of money for
>a minority language such as Khmer

Someone who needs to make their product work for Khmer markets will do this, I think. Also, consider that the COENG model requires some additional processing so that the COENG is ignored in collation, which is as likely to create performance issues as the other approach. I'm personally inclined toward staying with the COENG model (though I'm trying not to take sides in the discussion at this point), but I question whether this is really a valid argument in favour of it.

>The open-endedness of ISO10646/Unicode is there so that
>characters which are rarely encountered (and hence unnoticed in the initial
>encodings) can be added. It is not there to facilitate major shifts in how
>the encodings are handled. It is not there to break previous encoding
>models.

That's a valid issue, but it's part of your earlier point and not directly related to the concern for database performance.

 

>> There is an additional huge issue that must be considered. If the Khmer
>> encoding is changed to include subscript forms, what rationale can be
>> provided to prevent Indic languages from also demanding that all of
>> their subscript forms be encoded? What makes Khmer more like Tibetan
>> script than Sanskrit?
>
>Yes, as pointed out above that could be catastrophic.

I think this concern is secondary to the issues of stability in the standard in general, but those are obviously related, and both argue against any changes.

>Possibly for off-line discussion: Paul (or others) could you provide me with
>some suggestions on how to provide feedback (which resolves when there is a
>complete entry) to the user when using dead-key data entry (but without
>corrupting the data store)?

Dead-key data entry is a simple form of the general behaviour involved in input method editors: there is some intermediate state at which entry of a minimal unit is incomplete, and multiple keystrokes for that unit are required. Of course, dead keys are usually implemented without any immediate feedback, which is generally not a good user-interface practice. There is no reason why an IME implementation couldn't be used to provide that feedback via a composition window. This would not have to entail that all input is done via composition windows or that candidate windows are needed; it just means that the IME mechanism is used to provide some visual feedback reflecting the intermediate state.
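
As a sketch of that behaviour (hypothetical names, not a real IME API): keystrokes accumulate in an uncommitted composition buffer that the user can see, and only a completed unit is flushed to the data store, so partial clusters never corrupt it.

```python
class DeadKeyComposer:
    """Toy composition state machine for dead-key style input."""

    def __init__(self, complete_sequences):
        self.pending = ""                   # intermediate, uncommitted state
        self.complete = complete_sequences  # sequences that finish a unit

    def keystroke(self, ch):
        self.pending += ch
        if self.pending in self.complete:
            committed, self.pending = self.pending, ""
            return committed                # flush the finished unit to the store
        return None                         # still composing

    def feedback(self):
        return self.pending                 # what the composition window displays

# Example: the cluster KA + COENG + KA is only committed once complete.
composer = DeadKeyComposer({"\u1780\u17D2\u1780"})
for key in "\u1780\u17D2\u1780":
    done = composer.keystroke(key)
    print("committed" if done else "composing", repr(done or composer.feedback()))
```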

- Peter

*********

Is there any information available as to what a culturally-appropriate keyboard layout / input method for Khmer would look like? Were there standard typewriter layouts and, if so, would they be more accessible to a large body of potential users because of familiarity?

- Peter

*********

Peter Constable = PC>
Maurice Bauhahn = MB>
Maurice: Unindented

Thank you, Peter, for your observations.

MB> I have long been leery of having to engage the standardisation
MB> mechanism every time a new subscript form is discovered and proposed.

PC> I am inclined to agree, which is exactly why I was not happy with vowels
PC> being encoded as precombined combinations. But since they're there in that
PC> form, I'll have to live with that.

The existing encoding does not have precombined vowels (although the new counterproposal might be considered to propose two or three such, depending on how you look at it). The reuse of standalone glyphs in combination may appear to produce precombined vowels...but that is only a visual artefact (admittedly a troublesome one in keyboarding, when we try to preserve the rule of only one vowel per cluster). For example, U+17C4 may appear to be a precombined U+17C1 and U+17B6. All three of these have unique names, however (when spelling you pronounce only a single name). They sort as single entities at different places. They are not pronounced as diphthongs.
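
A quick check (standard Python unicodedata) confirms that this is purely visual: U+17C4 has no canonical decomposition, so normalization never equates it with the look-alike pair.

```python
import unicodedata

OO   = "\u17C4"        # KHMER VOWEL SIGN OO (a single character)
PAIR = "\u17C1\u17B6"  # KHMER VOWEL SIGN E + KHMER VOWEL SIGN AA (look-alike)

print(unicodedata.normalize("NFD", OO) == OO)      # True: nothing decomposes
print(unicodedata.normalize("NFC", PAIR) == PAIR)  # True: nothing composes
print(OO == PAIR)                                  # False: distinct spellings
```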

MB>Although there are many compatibility characters in Unicode

PC> I don't think this is really relevant since we are not, that I know of,
PC> talking about possible addition of compatibility characters.

I was considering an explicit subscript to be a compatibility character of COENG + BASE.

MB> This is not a case amenable to compromise, unfortunately. It would introduce
MB> chaos into the use of Khmer searching (some users would use the COENG model
MB> and others the explicit subscript).

PC> If COENG were deprecated, then implementers should be able to assume that
PC> it is deprecated and ignore it. There are potential risks in doing that,
PC> though.

They could not ignore it in existing text as such an action would render about 12% of the script (all the pre-encoded subscripts) invalid.

MB> Although theoretically a mapping between two equivalent forms would be possible
MB> (after all we have non-case-sensitive searching in the Latin script), the speed
MB> required of searching means that it is typically conducted purely at a binary
MB> match level.

PC> Maurice, I don't think you can argue on the one hand that hardware costs
PC> for storage and transmission are dropping rapidly but on the other hand
PC> raise a concern for this kind of issue. The processing speed/cost ratio is
PC> improving faster than the storage/cost and bandwidth/cost ratios, and
PC> major databases can already achieve adequate performance where Latin case
PC> issues are involved.

You do have a good point here. Your point is valid for indexed searching. On the other hand, the software maintenance cost of adding such functionality for minority languages is prohibitive, and the amount of textual information increases exponentially. These two factors tend to counterbalance the processing speed for full-text searching. At work I use a US$100,000 package which indexes full Unicode text (and many proprietary file types) with context-sensitive searching...but it will be a long time before such a tool is available for the average user (other than across the internet;-)).

Sincerely,

Maurice

***********

Dear colleagues,

I am glad to see a dialogue developing, and wish to thank Svay Leng in particular for his remarks. I assure him that it is our wish to see the most appropriate technology used to represent each of the world's languages. This has always been the wish of the UTC and WG2, and it informed our decisions in 1997 when we decided to go forward with the virama model for Khmer, Myanmar, and Sinhala alike.

Regarding Paul Nelson's statement:

"If the Khmer encoding is changed to include subscript forms, what rationale can be provided to prevent Indic languages from also demanding that all of their subscript forms be encoded? What makes Khmer more like Tibetan script than Sanskrit?"

It must be understood WHY the Tibetan encoding differs from the other Indic encodings. This is because Tibetan has unusual stacking properties totally unlike anything that happens in Devanagari, Kannada, Myanmar, or Khmer. In Tibetan, vertical stacks of *any* length can be written. The mantras OM MANI PADME HUM and HAKSHAMALAWARAYAM are fine examples of this. It is commonplace to write them as vertical stacks; the two mantras appear side by side below:

OM HA
MA KSHA
NI MA
PA LA
DME WA
HUM RA
YAM

and it was after extensive discussion that consensus was achieved that the virama model simply wouldn't *work* for Tibetan.

The virama model works just as well for Khmer as it does for all the other Brahmic scripts.

It is possible to come to many technical solutions for representing a script. To reply to a point made by Svay Leng: it was once seriously considered whether we should encode only the Latin letters abcdefghijklmnopqrstuvwxyz and use a COMBINING CAPITAL LETTER control character for the upper-case letters. Naturally, legacy data in ASCII made this unworkable. We already had another model that worked.

There are two points to this.

1) The end user will never see the underlying encoding. As a writer of Irish, I never know whether A ACUTE is represented by one Unicode character or two, because of normalization. All I need is for my input method to conform to my expectations, my fonts to make sure that what I get is what I want, and my system to have sorting and searching algorithms that do what I need them to do.

2) Khmer has *already* been encoded. Khmer can already be written using Unicode. I understand what Svay Leng has said, that, culturally, Khmers are *taught* about the alphabet in a way which suggests that KA and COENG KA are two different things. But that is a different question from the question of encoding.

Historically, we know that the Khmer script is Brahmic, and historically, we know that in fact KA and COENG KA are the *same* thing. COENG KA was simply written differently to save space and to show that there was an inherent vowel deleted -- and that long before the alphabet got to Cambodia -- it was back in King Ashoka's time! The Indian scripts use glyphs *differently* than Myanmar and Khmer do, but the essential process and structure of the writing systems is the *same*.

At the same time, we also know that historically the pronunciation of Khmer changed over the last 1500 years, and that because of this, while Sanskrit words like dharma are still spelt the same as they were in antiquity, they are now pronounced quite differently ([toe] in this case). It is precisely these pronunciations which made Cambodians learn to consider KA and COENG KA to be different -- because, as Svay Leng said, they often sound different.

But this does not mean that they *are* different. And they are not different historically. And they do not need to be distinguished in the encoding. In fact, it would be disadvantageous for historians and linguists and Buddhist scholars were the model for Khmer to be different. All the other Brahmic scripts except Thai and Lao (for industrial compatibility reasons) and Tibetan (because it is weird) use the same model to represent Sanskrit and Pali. Adding yet another method for Khmer will just make things even more complicated, and, importantly, it brings *no* benefit.

In Irish I have to live with spelling ambiguity, which we had to normalize so that <a-acute> and <a> + <combining-acute> are considered equivalent. For Khmer, we have only one way of spelling right now. Adding subscripts will break that model and introduce the possibility of multiple spellings for many Khmer words. This would be disastrous for spell-checking and searching operations.
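
The Irish equivalence can be demonstrated in a few lines (standard Python unicodedata); Khmer today has no such dual spellings, which is exactly what explicit subscripts would introduce:

```python
import unicodedata

precomposed = "\u00E1"   # LATIN SMALL LETTER A WITH ACUTE
combining   = "a\u0301"  # a + COMBINING ACUTE ACCENT

print(precomposed == combining)                                # False: raw mismatch
print(unicodedata.normalize("NFC", combining) == precomposed)  # True: equivalent
print(unicodedata.normalize("NFD", precomposed) == combining)  # True: equivalent
```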

At the top of column 2 of page 1270 of the Chuon Nath dictionary, the line reads:

sanggraaja ru sangghraaja (sang'graaca ru sang'-ghraaca)

Although it is clear that the characters in parentheses facilitate pronunciation, the letters here are sa, nga, ga or gha, ra, -aa, ja, and ca -- they are simply displayed differently in different contexts. The COENG informs the rendering engine that the GA must display as a subscript. Unicode and ISO/IEC 10646 are based on characters, not glyphs, and it just isn't true to say that the subscript ga in sanggraaja and the ordinary ga in sang'graaja are essentially different. See the attached gif (also found at http://www.evertype.com/standards/km/sanggraaja.gif) where the equivalent letters are shown in colour, with comparisons to Devanagari and Myanmar. The intrinsic structure of these scripts is the same. This is why the standard encoding for them is also the same.

With regard to the argument that base characters and subscripts are different because they sort differently, this simply doesn't follow. If that were the case, first subscripts would be different from second subscripts, because they affect sorting differently (the latter have a lower-ranking effect). So sorting is based on other keys, and will work perfectly well with the virama model.
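
To illustrate sorting on derived keys (a toy sketch with made-up weights, not a real Khmer collation): a COENG + consonant pair can be folded into a primary weight for the letter plus a secondary "subscript" weight, so the virama model poses no obstacle to collation.

```python
COENG = "\u17D2"

def toy_sort_key(word):
    # Primary weights: the letters in order, a subscript counting as its
    # consonant; secondary weights: whether each letter was a subscript.
    primary, secondary = [], []
    i = 0
    while i < len(word):
        if word[i] == COENG and i + 1 < len(word):
            primary.append(ord(word[i + 1]))
            secondary.append(1)
            i += 2
        else:
            primary.append(ord(word[i]))
            secondary.append(0)
            i += 1
    return (primary, secondary)

# KA+subscript-KHA, KA KA, KA+subscript-KA: sorted by letter, then level.
words = ["\u1780\u17D2\u1781", "\u1780\u1780", "\u1780\u17D2\u1780"]
print(sorted(words, key=toy_sort_key))
```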

Best regards,
--
Michael Everson