InterBEST 2009


Details for TLex: Thai Lexeme Analyser Based on the Conditional Random Fields
Name: TLex: Thai Lexeme Analyser Based on the Conditional Random Fields

In this paper, we present our solution to the InterBEST 2009 Thai Word Segmentation task. We applied Conditional Random Fields (CRFs) to train word segmentation models from a given corpus. With CRFs, word segmentation can be formulated as a sequence labeling task in which each character in the text string is assigned to one of two classes: word-beginning or intra-word. One of the key factors that affect the performance of the word segmentation models is the design of appropriate feature sets. We proposed and evaluated three feature sets: char (using all possible characters as features), char-type (categorizing all possible characters into 10 types) and combined (using both characters and character types as features). The evaluation results showed that the combined feature set yielded the best performance, with an F1 value averaged over all genres of 93.90%. To further improve the results, we performed a post-processing step that merges named entities (NEs) in the segmented texts, using a list of NEs compiled from the training corpus. The NE merging step increased the performance of the combined feature model to 94.27%.
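The formulation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `B`/`I` label names, the context window size, and the greedy longest-match merging strategy are all assumptions made for illustration. In particular, `char_type` uses a hypothetical coarse categorization, since the abstract does not list the paper's 10 character types. The sketch only prepares labels and features and shows one possible NE merging pass; training a CRF on them would require a separate library.

```python
from typing import Dict, List, Set, Tuple


def words_to_labels(words: List[str]) -> Tuple[List[str], List[str]]:
    """Turn a segmented sentence into per-character labels.

    "B" marks a word-beginning character, "I" an intra-word character.
    (The label names are an assumption; the abstract only names the classes.)
    """
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def labels_to_words(chars: List[str], labels: List[str]) -> List[str]:
    """Rebuild words from characters and their predicted B/I labels."""
    words: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words


def char_type(ch: str) -> str:
    """Hypothetical coarse character typing (NOT the paper's 10 types)."""
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    if "\u0e01" <= ch <= "\u0e2e":
        return "thai_consonant"
    if "\u0e30" <= ch <= "\u0e3a" or "\u0e40" <= ch <= "\u0e4e":
        return "thai_vowel_or_mark"
    if ch.isalpha():
        return "other_letter"
    return "symbol"


def combined_features(chars: List[str], i: int, window: int = 1) -> Dict[str, str]:
    """Character and character-type features around position i.

    Mirrors the idea of the "combined" feature set; the window size is assumed.
    """
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(chars)
        feats[f"char[{offset}]"] = chars[j] if inside else "<pad>"
        feats[f"type[{offset}]"] = char_type(chars[j]) if inside else "pad"
    return feats


def merge_named_entities(words: List[str], ne_list: Set[str]) -> List[str]:
    """Post-processing: join consecutive tokens whose concatenation is a known NE.

    Greedy longest match from the left; this strategy is an assumption.
    """
    merged: List[str] = []
    i = 0
    while i < len(words):
        end = i + 1
        # Try the longest span first, down to a span of two tokens.
        for j in range(len(words), i + 1, -1):
            if "".join(words[i:j]) in ne_list:
                end = j
                break
        merged.append("".join(words[i:end]))
        i = end
    return merged
```

For example, `words_to_labels(["ภาษา", "ไทย"])` labels the first character of each word `B` and every other character `I`, and `labels_to_words` inverts that mapping, so a model that predicts the labels correctly recovers the segmentation.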

Filesize: 821.99 kB
Filetype: pdf (Mime Type: application/pdf)
Created On: 12/02/2009 21:26
Hits: 196
Last updated on 12/02/2009 21:27
MD5 Checksum