edu.stanford.nlp.wordseg (Stanford JavaNLP API)

Class Summary
Class	Description
AffixDictionary	Affixation information.
ChineseDictionary	Provides a main method that loads various dictionaries, saves them in serialized form, and compiles them at runtime into a word list used as a feature in the segmenter.
ChineseSegmenterFeatureFactory<IN extends CoreLabel>	A Chinese segmenter Feature Factory for the GALE project.
ChineseStringUtils
CorpusChar	Checks the tag of each character across five different corpora.
CorpusDictionary	Checks whether a bigram exists in the bakeoff corpora.
Gale2007ChineseSegmenterFeatureFactory<IN extends CoreLabel>	A Chinese segmenter Feature Factory for the GALE project.
MaxMatchSegmenter	Lexicon-based segmenter.
NonDict2
Sighan2005DocumentReaderAndWriter	DocumentReader for the Chinese segmentation task.
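
To illustrate what a lexicon-based segmenter such as MaxMatchSegmenter does in general terms, here is a minimal sketch of greedy forward maximum matching. It is not the class's actual implementation, which may use different heuristics (see MaxMatchSegmenter.MatchHeuristic). The class name MaxMatchSketch, the method segment, the maxWordLength parameter, and the tiny lexicon are all hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MaxMatchSketch {

    /**
     * Greedy forward maximum matching (illustrative only).
     * At each position, take the longest lexicon entry that starts there;
     * if nothing matches, emit a single character.
     * Indexing is by Java char, so characters outside the BMP are not handled.
     */
    static List<String> segment(String text, Set<String> lexicon, int maxWordLength) {
        if (maxWordLength < 1) {
            throw new IllegalArgumentException("maxWordLength must be at least 1");
        }
        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxWordLength);
            // Shrink the candidate until it is in the lexicon or is one character long.
            while (end > start + 1 && !lexicon.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical toy lexicon, not one of the package's dictionaries.
        Set<String> lexicon = Set.of("中国", "人民", "中国人", "银行");
        // Prints [中国人, 民, 银行]
        System.out.println(segment("中国人民银行", lexicon, 3));
    }
}
```

The example input shows a known weakness of the greedy strategy: it commits to the longest match "中国人" and then cannot recover "人民", which is one reason a statistical model can do better than a pure lexicon lookup.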

Enum Summary
Enum	Description
MaxMatchSegmenter.MatchHeuristic

Package edu.stanford.nlp.wordseg Description

A package for doing Chinese word segmentation.

This package makes use of the CRFClassifier class (a conditional random field sequence classifier) to do Chinese word segmentation.
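
A sequence classifier treats segmentation as a per-character labeling problem: each character receives a label, and words are rebuilt from the label sequence. The sketch below shows only that final rebuilding step, assuming a two-way decision per character (starts a new word, or continues the current one). The class LabelDecoderSketch and the boolean-array encoding are hypothetical and do not reflect the label names or data structures CRFClassifier actually uses.

```java
import java.util.ArrayList;
import java.util.List;

public class LabelDecoderSketch {

    /**
     * Rebuilds words from per-character decisions (illustrative only).
     * startsWord[i] is true when character i begins a new word.
     */
    static List<String> decode(String text, boolean[] startsWord) {
        if (text.length() != startsWord.length) {
            throw new IllegalArgumentException("one label is required per character");
        }
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            // A "start" label closes the word collected so far, if any.
            if (startsWord[i] && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(text.charAt(i));
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // Hand-written labels standing in for a classifier's output.
        boolean[] labels = {true, true, true, false};
        // Prints [我, 爱, 北京]
        System.out.println(decode("我爱北京", labels));
    }
}
```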

On the Stanford NLP machines, usable properties files can be found at: /u/nlp/data/chinese-segmenter/Sighan2005/prop

Usage: For simplified Chinese:

java -mx200m edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict $CH_SEG/data -NormalizationTable $CH_SEG/data/norm.simp.utf8 -normTableEncoding UTF-8 -loadClassifier $CH_SEG/data/ctb.gz -testFile $file -inputEncoding $enc

Author: Pi-Chuan Chang, Huihsin Tseng, Galen Andrew