Property | Value |
Name | TLex: Thai Lexeme Analyser Based on the Conditional Random Fields |
Description | In this paper, we present our solution to the InterBEST 2009 Thai Word Segmentation task. We applied Conditional Random Fields (CRFs) to train word segmentation models from a given corpus. With CRFs, word segmentation can be formulated as a sequential labeling task in which each character in the text string is assigned to one of two classes: word-beginning or intra-word. One of the key factors that affect the performance of word segmentation models is the design of appropriate feature sets. We proposed and evaluated three feature sets: char (using all possible characters as features), char-type (categorizing all possible characters into 10 types) and combined (using both characters and character types as features). The evaluation results showed that the combined feature set yielded the best performance, with an average F1 of 93.90% over all genres. To further improve the results, we added a post-processing step that merges named entities (NEs) in the segmented texts, using a list of NEs compiled from the training corpus. The NE merging step increased the performance of the combined feature model to 94.27%. |
Filename | InterBEST_3.pdf |
Filesize | 821.99 kB |
Filetype | pdf (Mime Type: application/pdf) |
Creator | admin |
Created On | 12/02/2009 21:26 |
Hits | 196 Hits |
Last updated on | 12/02/2009 21:27 |
MD5 Checksum |
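
The description above frames segmentation as labeling each character as word-beginning or intra-word, with character and character-type features, followed by an NE merging step. The following is a minimal sketch of that pipeline's data handling only, not the authors' implementation: it trains no CRF, the function names are hypothetical, `char_type` uses a few coarse illustrative categories (the 10 types used in the paper are not listed on this page), and the greedy longest-match merge is an assumption about how NE merging might be done.

```python
def words_to_labels(words):
    """Turn a segmented sentence into parallel lists of characters and labels.
    'B' marks a word-beginning character, 'I' an intra-word character."""
    chars, labels = [], []
    for word in words:
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def labels_to_words(chars, labels):
    """Rebuild words from per-character labels (inverse of words_to_labels)."""
    words = []
    for ch, label in zip(chars, labels):
        if label == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


def char_type(ch):
    """Hypothetical coarse character categories (NOT the paper's 10 types)."""
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    if "\u0e01" <= ch <= "\u0e2e":   # Thai consonants
        return "thai_consonant"
    if "\u0e00" <= ch <= "\u0e7f":   # rest of the Thai Unicode block
        return "thai_other"
    if ch.isalpha():
        return "other_letter"
    return "symbol"


def combined_features(chars, i, window=1):
    """'Combined' feature set for position i: characters plus character types
    in a small context window (the window size is an assumption)."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(chars):
            feats[f"char[{offset}]"] = chars[j]
            feats[f"type[{offset}]"] = char_type(chars[j])
    return feats


def merge_named_entities(words, ne_list):
    """Post-processing: greedily merge the longest run of adjacent words
    whose concatenation appears in the NE list."""
    ne_set = set(ne_list)
    merged, i = [], 0
    while i < len(words):
        end = i + 1
        # Try the longest candidate span first; spans cover at least two words.
        for j in range(len(words), i + 1, -1):
            if "".join(words[i:j]) in ne_set:
                end = j
                break
        merged.append("".join(words[i:end]))
        i = end
    return merged
```

In practice, the per-character feature dictionaries would be passed to a CRF toolkit for training and prediction, and the predicted labels decoded with `labels_to_words` before NE merging is applied.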