Downloads | InterBEST 2009

InterBEST 2009

Corpus and paper download

Note: To download the training/testing data, please log in first.

Categories and files

User Manual/Document (1 file)

SEGMENTATION GUIDELINES FOR InterBEST 2009 THAI WORD SEGMENTATION

Paper Download (5 files)

InterBEST 2009: Thai Word Segmentation Workshop
Proceedings of the 2009 Eighth International Symposium on Natural Language Processing (SNLP2009), October 20-21, 2009, Bangkok, Thailand.
Paper Topic

A Word and Character-Cluster Hybrid Model for Thai Word Segmentation
Thai Word Segmentation using Character-level Information
TLex: Thai Lexeme Analyser Based on the Conditional Random Fields
A Statistical-Machine-Translation Approach to Word Boundary Identification: A Projective Analogy of Bilingual Translation
Thai Word Segmentation based-on GLR Parsing Technique and Word N-gram Model

Corpus (1 file)

Corpus formatting
The text files in each data set are named by genre, e.g. novel-000xx.txt, encyclopedia-000xx.txt, news-000xx.txt. All text files use UTF-8 encoding. Each text file is divided into paragraphs as in its original source, with a new-line character ending each paragraph. There are no paragraph markers and no sentence markers.
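
The layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the official distribution: the function names, the filename pattern (inferred only from the examples novel-000xx.txt, encyclopedia-000xx.txt, news-000xx.txt), and the decision to skip blank lines are all assumptions.

```python
import re
from pathlib import Path

# Assumed pattern: "<genre>-<digits>.txt", inferred from the example names only.
FILENAME_RE = re.compile(r"^(?P<genre>[a-z]+)-(?P<index>\d+)\.txt$")


def genre_of(filename):
    """Return the genre prefix of a corpus filename, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    return match.group("genre") if match else None


def split_paragraphs(text):
    """Split file content into paragraphs, one per new-line-terminated line.

    Blank lines are dropped and a trailing carriage return is stripped;
    both are assumptions about how the files were produced.
    """
    paragraphs = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if line.strip():
            paragraphs.append(line)
    return paragraphs


def read_paragraphs(path):
    """Read one corpus file as UTF-8 and return its paragraphs."""
    return split_paragraphs(Path(path).read_text(encoding="utf-8"))
```

Because the files carry no sentence markers, the paragraph is the smallest unit this sketch can recover without further processing.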

Word and Tag
In the training set, the text is word-segmented by inserting a “|” symbol at the end of each word. Three special tags are also inserted to mark special word types:
... for a named entity e.g. person, organization, and place name
... for an abbreviation
... for a poem stream
Segmenting text within these three tags is beyond our current scope. This means that a string within each of these tags will always be counted as one word.
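
The “|” convention above can be turned into a word list and a set of word-boundary offsets, as in the following minimal sketch. The function names are hypothetical. It assumes that no “|” occurs inside a tagged span, so each tagged string comes out as a single word with its markup still attached; the offsets therefore count any tag markup left in the words.

```python
def split_words(paragraph, delimiter="|"):
    """Split one segmented paragraph into words.

    Every word is followed by the delimiter, so the piece after the last
    delimiter is empty and is dropped. Whitespace-only pieces are kept,
    since a space may be a token in its own right (an assumption).
    """
    return [word for word in paragraph.split(delimiter) if word]


def boundary_offsets(words):
    """Return the end offset (in code points) of each word in the unsegmented text."""
    offsets = []
    position = 0
    for word in words:
        position += len(word)
        offsets.append(position)
    return offsets


def unsegmented(words):
    """Rebuild the raw text by removing the word delimiters."""
    return "".join(words)
```

For example, `split_words("ab|cd|e|")` returns `["ab", "cd", "e"]`, and `boundary_offsets` on that list returns `[2, 4, 5]`. Comparing such offset sets between a reference and a system output is one common way to score word boundaries, although the workshop's own scoring procedure is not described here.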
