Corpus formatting
Text files in each data set are named according to their genre, e.g. novel-000xx.txt, encyclopedia-000xx.txt, news-000xx.txt. All text files use UTF-8 encoding. Each file keeps the paragraph breaks of its original source: paragraphs are separated by a new-line character. There is no explicit paragraph marker and no sentence marker.
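As an illustration of this layout, the following Python sketch derives the genre from a file name and splits a file into paragraphs. The function names and the exact file name pattern (a lowercase genre, a hyphen, digits, then .txt) are assumptions made for this example; they are not part of the corpus release.

```python
import re
from pathlib import Path

# Assumed file name shape: "<genre>-<digits>.txt", e.g. "novel-00001.txt".
FILENAME_RE = re.compile(r"^(?P<genre>[a-z]+)-\d+\.txt$")


def genre_of(filename):
    """Return the genre prefix of a corpus file name, or None if it does not match."""
    match = FILENAME_RE.match(filename)
    return match.group("genre") if match else None


def split_paragraphs(text):
    """Split file content into paragraphs, one per non-blank line."""
    lines = (line.rstrip("\r") for line in text.split("\n"))
    return [line for line in lines if line.strip()]


def read_paragraphs(path):
    """Read a corpus file as UTF-8 and return its paragraphs."""
    return split_paragraphs(Path(path).read_text(encoding="utf-8"))
```

Blank lines are dropped here on the assumption that they carry no content; remove that filter if empty paragraphs need to be preserved.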
Word and Tag
In the training set, the text is segmented into words by inserting a “|” symbol at the end of each word. Three special tags are also inserted to mark special word types:
... for a named entity e.g. person, organization, and place name
... for an abbreviation
... for a poem stream
Segmenting text within these three tags is beyond our current scope. This means that a string within each of these tags will always be counted as one word.
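The word format described above can be read back with a simple split on the “|” symbol. The Python sketch below is a minimal example, not an official tool: the function names are hypothetical, and it assumes that “|” never occurs inside a tagged string, so a tagged string (tags included) comes back as a single item, in line with the one-word rule.

```python
def split_words(paragraph, delimiter="|"):
    """Split a segmented paragraph into words.

    Every word ends with the delimiter, so the piece after the last
    delimiter is empty and is dropped. Whitespace-only pieces are kept.
    """
    return [piece for piece in paragraph.split(delimiter) if piece != ""]


def count_words(paragraphs):
    """Count the words in an iterable of segmented paragraphs."""
    return sum(len(split_words(paragraph)) for paragraph in paragraphs)
```

For example, `split_words("กิน|ข้าว|")` returns `["กิน", "ข้าว"]`. Whether whitespace-only pieces should count as words is left open here; filter them out before counting if they should not.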
Documents
BEST Corpus training set (Release 1): 5,036,228 words from 509 files (96 news files, 108 encyclopedia files, 107 novel files, and 198 article files).