Download Center

Details for TLex: Thai Lexeme Analyser Based on the Conditional Random Fields
Property	Value
Name	TLex: Thai Lexeme Analyser Based on the Conditional Random Fields
Description	In this paper, we present our solution to the InterBEST 2009 Thai Word Segmentation task. We applied Conditional Random Fields (CRFs) to train word segmentation models from a given corpus. With CRFs, word segmentation can be formulated as a sequence labeling task in which each character in the text string is classified into one of two classes: word-beginning or intra-word. One of the key factors that affect the performance of the word segmentation models is the design of appropriate feature sets. We proposed and evaluated three feature sets: char (using all possible characters as features), char-type (categorizing all possible characters into 10 different types) and combined (using both characters and character types as features). The evaluation results showed that the combined feature set yielded the best performance, with an F1 value averaged over all genres of 93.90%. To further improve the results, we performed a post-processing step that merges named entities (NEs) in the segmented texts, using a list of NEs compiled from the training corpus. The NE merging step increased the performance of the combined feature model to 94.27%.
Filename	InterBEST_3.pdf
Filesize	821.99 kB
Filetype	pdf (Mime Type: application/pdf)
Creator	admin
Created On	12/02/2009 21:26
Hits	196 Hits
Last updated on	12/02/2009 21:27
MD5 Checksum
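
The description above frames segmentation as labeling each character as word-beginning or intra-word, followed by a named-entity merging step. The following is a minimal Python sketch of those two ideas only, not the paper's implementation: the CRF model itself is omitted, and the function names, the `B`/`I` label strings, the greedy longest-match merging strategy and the `max_span` parameter are all assumptions made for illustration.

```python
from typing import List, Set


def words_to_labels(words: List[str]) -> List[str]:
    """Turn a segmented sentence into per-character labels.

    'B' marks a word-beginning character, 'I' an intra-word character.
    (The label names are an assumption; the source only names the classes.)
    """
    labels: List[str] = []
    for word in words:
        if not word:
            continue  # skip empty tokens
        labels.append("B")
        labels.extend("I" * (len(word) - 1))
    return labels


def labels_to_words(text: str, labels: List[str]) -> List[str]:
    """Rebuild words from a character string and its predicted labels."""
    words: List[str] = []
    for ch, label in zip(text, labels):
        if label == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


def merge_named_entities(words: List[str], entities: Set[str],
                         max_span: int = 5) -> List[str]:
    """Hypothetical post-processing: merge adjacent words whose
    concatenation appears in a named-entity list (greedy, longest first)."""
    merged: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate span first, down to two words.
        for span in range(min(max_span, len(words) - i), 1, -1):
            candidate = "".join(words[i:i + span])
            if candidate in entities:
                merged.append(candidate)
                i += span
                break
        else:
            # No multi-word entity starts here; keep the word as-is.
            merged.append(words[i])
            i += 1
    return merged


if __name__ == "__main__":
    segmented = ["ภาษา", "ไทย"]            # toy example, not from the corpus
    labels = words_to_labels(segmented)
    print(labels)
    print(labels_to_words("".join(segmented), labels))
    print(merge_named_entities(["New", "York", "is", "big"], {"NewYork"}))
```

In a full system, the labels produced by `words_to_labels` would serve as training targets for a sequence labeler, and `labels_to_words` would decode its predictions back into words before any merging step. Note that `len` counts Unicode code points here, so Thai vowel and tone marks are each labeled as separate characters.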
