InterBEST 2009



Details for Thai Word Segmentation based-on GLR Parsing Technique and Word N-gram Model
Name: Thai Word Segmentation based-on GLR Parsing Technique and Word N-gram Model

Word segmentation is a basic process for languages without explicit word boundaries. Several approaches to Thai word segmentation have been proposed, such as longest matching and maximum matching. We propose a Thai word segmentation technique based on GLR parsing and a statistical language model. In this technique, an input Thai text is first segmented into a sequence of Thai Character Clusters (TCCs). Each TCC represents a group of Thai characters that are inseparable under the Thai writing system. The concept of TCC helps avoid choosing segmentation points that violate the writing rules. Then, the most suitable segmentation candidate is chosen based on a word N-gram model with interpolation. Both the candidate generation and the selection processes are carried out through a two-phase GLR parsing technique. In the first phase, the production rules for TCCs are applied to parse an input sequence of characters into a sequence of TCCs, which becomes the input tokens for parsing in the second phase. The second phase groups TCCs into words, using grammar rules that represent each word as a sequence of TCCs and that are constructed from the words in the prepared training set. However, ambiguities in segmenting words affect the parsing result, so we apply the statistical language model to select the most appropriate segmentation. This statistical model is applied together with the GLR parsing, and beam search is used to keep only the best k parsing paths. We evaluate the proposed technique using the test data provided by InterBEST 2009. The experimental results show that the technique achieves an F-measure of 87.04% when the beam size is set to 10.
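The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain lexicon lookup stands in for the second-phase GLR parser, each ASCII character stands in for one TCC, and the lexicon, the probabilities, the interpolation weight `LAMBDA`, and the names `interp_prob` and `segment` are all hypothetical values chosen for illustration. The sketch only shows how an interpolated bigram score and a beam of size k could be combined to rank segmentation candidates.

```python
import math

# Hypothetical toy data: each character stands in for one TCC.
LEXICON = {"a", "ab", "b", "c", "bc"}
UNIGRAM = {"a": 0.2, "ab": 0.3, "b": 0.1, "c": 0.3, "bc": 0.1}
BIGRAM = {
    ("<s>", "ab"): 0.6, ("ab", "c"): 0.7,
    ("<s>", "a"): 0.4, ("a", "bc"): 0.2,
    ("a", "b"): 0.1, ("b", "c"): 0.5,
}
LAMBDA = 0.7  # assumed interpolation weight, not taken from the paper


def interp_prob(prev, word):
    """Linear interpolation of bigram and unigram estimates."""
    bigram = BIGRAM.get((prev, word), 0.0)
    unigram = UNIGRAM.get(word, 1e-6)
    return LAMBDA * bigram + (1 - LAMBDA) * unigram


def segment(tccs, beam_size=10, max_len=4):
    """Return the best-scoring word sequence for a list of TCCs, or None."""
    n = len(tccs)
    # beams[i] holds (log_prob, words) hypotheses that cover tccs[:i].
    beams = [[] for _ in range(n + 1)]
    beams[0] = [(0.0, ["<s>"])]
    for i in range(n):
        # Keep only the best k hypotheses ending at position i.
        beams[i] = sorted(beams[i], key=lambda h: h[0], reverse=True)[:beam_size]
        for score, words in beams[i]:
            for j in range(i + 1, min(n, i + max_len) + 1):
                word = "".join(tccs[i:j])
                if word not in LEXICON:
                    continue  # lexicon lookup replaces the GLR word grammar
                new_score = score + math.log(interp_prob(words[-1], word))
                beams[j].append((new_score, words + [word]))
    if not beams[n]:
        return None  # no sequence of lexicon words covers the input
    best = max(beams[n], key=lambda h: h[0])
    return best[1][1:]  # drop the "<s>" start marker


print(segment(list("abc")))
```

With this toy data the input admits three segmentations ("ab c", "a bc", "a b c"); the interpolated scores decide among them, and lowering `beam_size` reduces how many partial hypotheses survive at each position.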

Filesize: 316.35 kB
Filetype: pdf (Mime Type: application/pdf)
Created On: 12/02/2009 21:29
Hits: 122
Last updated on 12/02/2009 21:31
MD5 Checksum