Class | Description |
---|---|
AffixDictionary | Affixation information. |
ChineseDictionary | Provides a main method that loads various dictionaries, saves them in serialized form, and compiles them at runtime into a word list used as a feature in the segmenter. |
ChineseSegmenterFeatureFactory<IN extends CoreLabel> | A Chinese segmenter Feature Factory for the GALE project. |
ChineseStringUtils | |
CorpusChar | Check tag of each character from 5 different corpora. |
CorpusDictionary | Check if a bigram exists in bakeoff corpora. |
Gale2007ChineseSegmenterFeatureFactory<IN extends CoreLabel> | A Chinese segmenter Feature Factory for the GALE project. |
MaxMatchSegmenter | Lexicon-based segmenter. |
NonDict2 | |
Sighan2005DocumentReaderAndWriter | DocumentReader for the Chinese segmentation task. |

Enum | Description |
---|---|
MaxMatchSegmenter.MatchHeuristic | |
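
The class table lists MaxMatchSegmenter only as a "lexicon-based segmenter". To illustrate the general idea, here is a minimal sketch of greedy forward maximum matching. It is not the package's implementation: the class name `MaxMatchSketch`, the `segment` method, its `maxWordLength` parameter, and the toy lexicon are all hypothetical, and the heuristics selected through MaxMatchSegmenter.MatchHeuristic are not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the MaxMatchSegmenter class from this package.
final class MaxMatchSketch {

    /**
     * Greedy forward maximum matching: at each position, take the longest
     * lexicon entry of at most maxWordLength characters; if nothing matches,
     * emit a single character and move on.
     */
    static List<String> segment(String text, Set<String> lexicon, int maxWordLength) {
        // Work on code points so characters outside the BMP are never split.
        int[] cps = text.codePoints().toArray();
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < cps.length) {
            int longest = Math.min(maxWordLength, cps.length - i);
            int matched = 1; // single-character fallback
            for (int len = longest; len > 1; len--) {
                if (lexicon.contains(new String(cps, i, len))) {
                    matched = len;
                    break;
                }
            }
            words.add(new String(cps, i, matched));
            i += matched;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy lexicon, assumed for this example.
        Set<String> lexicon = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        System.out.println(segment("研究生命起源", lexicon, 3));
    }
}
```

With this toy lexicon the greedy match splits the input into 研究生, 命, 起源 rather than the intended 研究, 生命, 起源. Ambiguities of this kind are one reason a purely lexicon-based segmenter is usually combined with heuristics or a statistical model.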

This package uses the CRFClassifier class (a conditional random field sequence classifier) to perform Chinese word segmentation.
On the Stanford NLP machines, usable properties files can be found at:
/u/nlp/data/chinese-segmenter/Sighan2005/prop
Usage: For simplified Chinese:
java -mx200m edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict $CH_SEG/data -NormalizationTable $CH_SEG/data/norm.simp.utf8 -normTableEncoding UTF-8 -loadClassifier $CH_SEG/data/ctb.gz -testFile $file -inputEncoding $enc