InterBEST 2009


Paper Download

InterBEST 2009: Thai Word Segmentation Workshop
Proceedings of the 2009 Eighth International Symposium on Natural Language Processing (SNLP2009), October 20-21, 2009, Bangkok, Thailand.
Paper Topics

  • A Word and Character-Cluster Hybrid Model for Thai Word Segmentation
  • Thai Word Segmentation using Character-level Information
  • TLex: Thai Lexeme Analyser Based on the Conditional Random Fields
  • A Statistical-Machine-Translation Approach to Word Boundary Identification: A Projective Analogy of Bilingual Translation
  • Thai Word Segmentation based-on GLR Parsing Technique and Word N-gram Model



In this paper, we describe our system used in the InterBEST 2009 Thai Word Segmentation Shared Task. Our system is based on a word and character-cluster hybrid model which can effectively handle both known and unknown words. In addition, our model can be integrated with simple strategies for reducing annotation inconsistencies. Experimental results on in-domain and out-of-domain test data sets show the effectiveness of our system.

This paper describes a Thai word segmentation approach in which Conditional Random Fields (CRFs) classify each character of the input string into classes defined by the character's position within the underlying word. Each character of the Thai writing system is assigned a character function proposed in this work. N-grams of these character functions are combined with character N-grams in the feature templates of the CRF models, so that the models can locate characters likely to mark word boundaries. The proposed method yields a best F-measure of 95.53%, outperforming models based on word trigrams. Character-level constraints are also shown to make segmentation of unseen words more robust.
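The position-based labeling this abstract describes can be illustrated by converting segmented text into per-character labels, the form a character-level CRF is trained on. A minimal sketch with a two-label scheme; the paper's actual label set and character-function features are richer, and `char_labels` is a hypothetical helper name:

```python
def char_labels(words):
    """Turn a list of segmented words into (character, label) pairs:
    B marks a word-beginning character, I an intra-word character."""
    labels = []
    for w in words:
        labels.append((w[0], "B"))
        labels.extend((c, "I") for c in w[1:])
    return labels

# English stand-in for a segmented Thai sentence.
print(char_labels(["the", "cat"]))
```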


In this paper, we present our proposed solution to the InterBEST 2009 Thai Word Segmentation task. We applied Conditional Random Fields (CRFs) to train word segmentation models from a given corpus. With CRFs, the word segmentation problem can be formulated as a sequential labeling task in which each character in the text string is classified into one of two classes: word-beginning or intra-word character. One key factor affecting the performance of the word segmentation models is the design of appropriate feature sets. We proposed and evaluated three feature sets: char (all possible characters as features), char-type (all characters categorized into 10 types), and combined (both characters and character types as features). The evaluation results showed that the combined feature set performed best, with an average F1 of 93.90% across all genres. To further improve the results, we performed a post-processing step that merges named entities (NEs) in the segmented texts, using a list of NEs compiled from the training corpus. The NE merging step raised the performance of the combined feature model to 94.27%.
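The combined feature set described above can be sketched as an extractor that emits both the surrounding characters and their types at each position. The +/-2 window and the toy two-category type function are assumptions for illustration; the paper uses 10 Thai-specific character types:

```python
def features(chars, i, ctype):
    """Emit 'combined' features for position i: surrounding characters
    plus their character types. The +/-2 window is an assumption."""
    feats = {}
    for off in (-2, -1, 0, 1, 2):
        j = i + off
        if 0 <= j < len(chars):
            feats["char[%d]" % off] = chars[j]
            feats["type[%d]" % off] = ctype(chars[j])
    return feats

# Toy type function standing in for the paper's 10 categories.
def toy_type(c):
    return "DIGIT" if c.isdigit() else "OTHER"

print(features(list("ab1"), 1, toy_type))
```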

Word segmentation is a fundamental and essential step in processing a number of Asian languages such as Chinese, Japanese, and Thai. This paper presents a statistical phrase-based machine translation framework for Thai word segmentation. The segmentation task can be viewed as a translation process from an unsegmented sentence to a segmented one: we formulate the problem as mapping individual characters (unsegmented text) to groups of characters (segmented text). Language and translation models constructed from the training data are applied to search for the best segmentation. We also provide a simple post-processing step that corrects segmentation errors on unknown words. The evaluation shows a promising average F-measure of 92.39%.
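The decoding step (searching for the best segmentation under the trained models) can be sketched as dynamic programming over character positions. Here `logp` is a toy word-score table standing in for the paper's language and translation models, and the maximum word length of 10 is an assumption:

```python
import math

def segment(text, logp):
    """Find the highest-scoring segmentation by dynamic programming:
    best[j] is the best score of any segmentation of text[:j]."""
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for j in range(1, len(text) + 1):
        for i in range(max(0, j - 10), j):      # candidate word text[i:j]
            w = text[i:j]
            if w in logp and best[i] + logp[w] > best[j]:
                best[j] = best[i] + logp[w]
                back[j] = i
    words, j = [], len(text)
    while j > 0:                                 # recover the word sequence
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]

# English stand-in vocabulary with toy log-probabilities.
lp = {"in": -2.0, "inter": -3.0, "ter": -4.0, "best": -2.5}
print(segment("interbest", lp))
```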

Word segmentation is one of the basic processes for languages without explicit word boundaries. To date, several approaches to Thai word segmentation have been proposed, such as longest matching and maximum matching. We propose a Thai word segmentation technique based on GLR parsing and a statistical language model. In this technique, an input Thai text is first segmented into a sequence of Thai Character Clusters (TCCs). Each TCC represents a group of inseparable Thai characters according to the Thai writing system; the TCC concept helps avoid segmentation points that violate the writing rules. The most suitable segmentation candidate is then chosen with an interpolated word N-gram model. Both candidate generation and selection are conducted through a two-phase GLR parsing technique. In the first phase, the production rules for TCCs parse the input character sequence into a sequence of TCCs, which become the input tokens for the second phase. The second phase groups TCCs into words, using grammar rules, constructed from the words in the training set, that represent each word as a sequence of TCCs. Because ambiguities in segmenting words affect the parsing result, we apply the statistical language model together with GLR parsing to select the most appropriate segmentation, and a beam search keeps only the best k parsing paths. We evaluate the proposed technique using the test data provided by InterBEST 2009. The experimental results show that the technique achieves an F-measure of 87.04% when the beam size is set to 10.
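The beam-pruned candidate search can be sketched as follows. This is not the paper's GLR parser: `score(prev_word, word)` is a stand-in for the interpolated word N-gram model, TCC tokenization is omitted so spans are raw characters, and the toy lexicon and penalty are assumptions:

```python
import heapq

def beam_segment(text, score, k=10, max_len=10):
    """Beam search over segmentation paths: at each character position
    keep only the k best partial analyses, as in the pruned search."""
    beams = {0: [(0.0, [])]}                    # position -> [(logprob, words)]
    for j in range(1, len(text) + 1):
        cands = []
        for i in range(max(0, j - max_len), j):  # candidate word text[i:j]
            w = text[i:j]
            for lp, words in beams.get(i, []):
                prev = words[-1] if words else None
                cands.append((lp + score(prev, w), words + [w]))
        beams[j] = heapq.nlargest(k, cands, key=lambda t: t[0])
    final = beams[len(text)]
    return final[0][1] if final else []

# Toy unigram scores ignoring the previous word (the paper interpolates
# word N-grams); unknown spans receive a large penalty.
lex = {"ab": -1.0, "cd": -1.0, "abc": -5.0, "d": -2.0}
print(beam_segment("abcd", lambda prev, w: lex.get(w, -1e9)))
```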