Word Breaks - Borrowing from the Thai Experience

Martin Hosken has provided some useful help on this, and almost all of the material below is from him (the mistakes are mine!). He indicates that the whole topic of word breaking is fraught with difficulties.

1. First, one usually needs a pretty complete wordlist to word break against. There are numerous approaches (it would be very useful if NECTEC were to publish a book on this!).

2. An algorithm needs to be selected, choosing between different speed and accuracy trade-offs:

2.1 Longest match

A single-pass approach that picks the longest matching word at each point, backtracking as needed to try to find a sequence of words that gives a complete cut of the phrase (a minimal sketch in Perl follows this subsection).

Advantages: easy to implement. Fast.
Disadvantages: doesn't work very well. Finds long words early at the expense of good matches later in the phrase.
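
As an illustration (not one of Martin's tools), here is a minimal Perl sketch of the greedy idea. It assumes a plain UTF-8 wordlist with one word per line; the file name wordlist.txt and the single-character fallback for unknown text are my own assumptions, and the backtracking step is omitted for brevity.

#!/usr/bin/perl
# Toy greedy longest-match segmenter (illustration only).
use strict;
use warnings;
use utf8;
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';

# Load the wordlist (assumed: one word per line, UTF-8).
open my $fh, '<:encoding(UTF-8)', 'wordlist.txt' or die "wordlist.txt: $!";
my %dict;
my $maxlen = 0;
while (my $w = <$fh>) {
    chomp $w;
    next unless length $w;
    $dict{$w} = 1;
    $maxlen = length $w if length $w > $maxlen;
}
close $fh;

# At each position take the longest dictionary word found there; if
# nothing matches, emit a single character and move on (no backtracking).
sub longest_match {
    my ($text) = @_;
    my @words;
    my $pos = 0;
    while ($pos < length $text) {
        my $remaining = length($text) - $pos;
        my $try = $maxlen < $remaining ? $maxlen : $remaining;
        my $found;
        for (my $len = $try; $len >= 1; $len--) {
            my $cand = substr($text, $pos, $len);
            if ($dict{$cand}) { $found = $cand; last; }
        }
        $found = substr($text, $pos, 1) unless defined $found;
        push @words, $found;
        $pos += length $found;
    }
    return @words;
}

my $input = decode('UTF-8', $ARGV[0] // '');
print join('|', longest_match($input)), "\n";

Run it with a phrase as the command-line argument; the pipe characters in the output mark the proposed breaks.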

2.2 Maximal match

In effect, find all possible ways of chopping the phrase and choose the one that results in the minimum number of words (a dynamic-programming sketch follows this subsection).

Advantages: works well, especially if the material is known (i.e. covered by the dictionary).
Disadvantages: tends to chew unknown material up into small fragments. Tricky to implement with any speed.
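
The "minimum number of words" criterion can be computed without enumerating every segmentation explicitly, using dynamic programming. The sketch below is a hypothetical illustration of that idea, not the code inside Wordbreak.pm; it reuses the %dict hash and $maxlen value loaded in the previous sketch, and treats unknown single characters as one-character "words" so that a complete cut always exists.

# Dynamic-programming sketch of maximal matching (illustration only).
# %dict and $maxlen are loaded exactly as in the previous sketch.
sub maximal_match {
    my ($text, $dict, $maxlen) = @_;
    my $n = length $text;
    my @best = (0, (undef) x $n);  # best[$i] = fewest words covering the first $i characters
    my @back = (0) x ($n + 1);     # back[$i] = where the last word of that solution starts
    for my $i (1 .. $n) {
        my $limit = $i < $maxlen ? $i : $maxlen;
        for my $len (1 .. $limit) {
            my $start = $i - $len;
            next unless defined $best[$start];
            my $cand = substr($text, $start, $len);
            # Unknown single characters count as one-character "words".
            next unless $dict->{$cand} || $len == 1;
            if (!defined $best[$i] || $best[$start] + 1 < $best[$i]) {
                $best[$i] = $best[$start] + 1;
                $back[$i] = $start;
            }
        }
    }
    # Walk the back pointers to recover the chosen word sequence.
    my @words;
    for (my $i = $n; $i > 0; $i = $back[$i]) {
        unshift @words, substr($text, $back[$i], $i - $back[$i]);
    }
    return @words;
}

# Usage: print join('|', maximal_match($input, \%dict, $maxlen)), "\n";

Called that way it returns the cut with the fewest words; ties are broken arbitrarily by whichever candidate the inner loop happens to find first.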

2.3 Statistical approaches

A whole slew of train-then-apply algorithms, which collect statistics from a training set and then use them to create rules to run on the main material (a deliberately tiny example follows this subsection).

Advantages: can be used for part-of-speech tagging. Fast. No dictionary needed.
Disadvantages: don't handle unknown material at all well.
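
To make the train-then-apply idea concrete, here is a deliberately tiny Perl example, again only an illustration and not any particular published algorithm. It assumes a training file, training.txt, in which the correct breaks are marked with '|'; it estimates, for each pair of adjacent characters, how often a break fell between them, then inserts a break in new text wherever that estimate exceeds 0.5. Real systems use far richer features and smoothing; the file name and the threshold are assumptions.

#!/usr/bin/perl
# Toy train-then-apply segmenter (illustration only).
use strict;
use warnings;
use utf8;
use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';

# Training: count, for each adjacent character pair, how often a break
# ('|' in training.txt) was observed between the two characters.
my (%seen, %break);
open my $fh, '<:encoding(UTF-8)', 'training.txt' or die "training.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my @words = grep { length } split /\|/, $line;
    for my $w (0 .. $#words) {
        my @c = split //, $words[$w];
        $seen{ $c[$_] . $c[$_ + 1] }++ for 0 .. $#c - 1;   # pairs inside a word
        if ($w < $#words) {                                 # pair across a break
            my $pair = substr($words[$w], -1) . substr($words[$w + 1], 0, 1);
            $seen{$pair}++;
            $break{$pair}++;
        }
    }
}
close $fh;

# Apply: insert a break wherever the estimated break probability exceeds 0.5.
sub segment {
    my ($text) = @_;
    my @c = split //, $text;
    my $out = @c ? $c[0] : '';
    for my $i (1 .. $#c) {
        my $pair = $c[$i - 1] . $c[$i];
        my $p = $seen{$pair} ? ($break{$pair} || 0) / $seen{$pair} : 0;
        $out .= '|' if $p > 0.5;
        $out .= $c[$i];
    }
    return $out;
}

print segment(decode('UTF-8', $ARGV[0] // '')), "\n";

Even this toy version shows the weakness listed above: character pairs never seen in training get no break at all, so unknown material tends to come out as one long chunk.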

3. Maximal match approach implementation

Martin wrote and provided software tools to handle the 'maximal match' approach. To use it you need to have Perl installed on your computer (freeware available from http://www.activestate.com/Products/ActivePerl/ [a commercial site] or http://www.perl.org/). The tools (Riwords.pm, thaisplit.pl, Wordbreak.pm) are aimed at Thai, but it should not be too difficult to change them to handle a Khmer wordlist (hopefully Maurice will have something to contribute in this area before long). You either need a training set of text (~10%) correctly word broken or else a pretty complete dictionary. The code Martin has written allows one to get at the unknown material and use that to improve your dictionary in an iterative approach. The approach it takes is to create a type of suffix tree. There aren't that many words. The Thaisplit stuff includes a compressed list of all the words from the Royal Institute word list and it's pretty small. There should be no problem keeping that in memory. (Note that *.pm and *.pl file types cannot be accessed from this Web site, so they have all been supplemented with an ending *.txt)