A Statistical-Machine-Translation Approach to Word Boundary Identification

Word segmentation is a fundamental step in processing a number of Asian languages such as Chinese, Japanese, and Thai. This paper presents a framework that applies statistical phrase-based machine translation to Thai word segmentation. The segmentation task can be viewed as a translation process from an unsegmented sentence to a segmented sentence. We formulate the problem as a mapping from individual characters (unsegmented text) to groups of characters (segmented text). Language and translation models built from the training data are then used to search for the best segmentation. We also provide a simple post-processing system to correct segmentation errors on unknown words. The evaluation shows a promising average F-measure of 92.39%.
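To make the formulation concrete, the following is a minimal sketch of segmentation as monotone phrase-based decoding. It is not the system evaluated in this paper: the tables `TM` and `LM`, their scores, the English stand-in strings, and the function name `segment` are all hypothetical, and a unigram language model is assumed for brevity where a real system would typically use a higher-order model estimated from a segmented corpus.

```python
import math

# Hypothetical toy models (assumptions for illustration only).
# TM: translation-model log-probability that a character span maps to a word.
# LM: unigram language-model log-probability of that word.
TM = {"the": -0.3, "cat": -0.4, "t": -2.0, "he": -1.0, "c": -2.0, "at": -1.0}
LM = {"the": -1.0, "cat": -2.0, "t": -4.0, "he": -2.0, "c": -4.0, "at": -2.5}
MAX_SPAN = 4  # longest character span considered as a single word


def segment(text):
    """Return the highest-scoring segmentation of text, or None if none exists."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_SPAN), end):
            span = text[start:end]
            if span not in TM or best[start] == -math.inf:
                continue
            score = best[start] + TM[span] + LM[span]
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == -math.inf:
        return None  # some character span is not covered by the toy tables
    # Follow back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("thecat"))  # ['the', 'cat']
```

Because segmentation never reorders characters, the search reduces to a left-to-right dynamic program over span boundaries. Input containing a character absent from the tables returns `None` in this sketch; handling such unknown words is the role of the post-processing step described above.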

12/02/2009
