Stanford Chinese Word Segmenter download

This software is a Java implementation of the CRF-based Chinese Word Segmenter described in:

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. "A Conditional Random Field Word Segmenter." In Fourth SIGHAN Workshop on Chinese Language Processing. 2005.

The system requires Java 1.5+ to be installed. With the current settings in the script files, 1 GB of memory is required; for smaller input files, you can change the option java -mx1g to a smaller value. Two models are included, each trained to a different segmentation standard: the Chinese Penn Treebank standard and the Beijing University standard.
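If you lower the -mx setting, it can be useful to confirm how much heap the JVM actually received. The following is a minimal sketch for that purpose only; the class name HeapCheck is a hypothetical helper and is not part of the segmenter distribution.

```java
// HeapCheck is a hypothetical helper, not part of the segmenter package.
// It reports the maximum heap the JVM will attempt to use, which is
// governed by the -mx (equivalently -Xmx) option.
public class HeapCheck {

    // Maximum heap size in megabytes, as reported by the running JVM.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: "
                + maxHeapMegabytes() + " MB");
    }
}
```

After compiling with javac HeapCheck.java, running java -mx512m HeapCheck should report a value close to 512 MB. The JVM may report somewhat less than the requested figure, so treat the number as approximate.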

On May 21, 2008, we released a new version that makes use of lexicon features. With external lexicon features, the segmenter segments more consistently and also achieves a higher F measure when we train and test on the bakeoff data. This version is close to the CRF-Lex segmenter described in:

Pi-Chuan Chang, Michel Galley and Chris Manning. "Optimizing Chinese Word Segmentation for Machine Translation Performance." In ACL Third Workshop on Statistical Machine Translation. 2008.

An example of how to train the segmenter is now also available.
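For readers unfamiliar with the F measure mentioned above: in word segmentation it is usually computed as the harmonic mean of word-level precision and recall, where a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below illustrates that calculation under this assumption. It is not the official bakeoff scoring script, and the class and method names (SegmentationF1, spans, score) are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative word-level precision/recall/F1 for one segmented sentence.
// Not part of the segmenter distribution.
public class SegmentationF1 {

    // Convert a word sequence into character spans encoded as "start-end".
    static Set<String> spans(List<String> words) {
        Set<String> result = new HashSet<String>();
        int offset = 0;
        for (String w : words) {
            int end = offset + w.length();
            result.add(offset + "-" + end);
            offset = end;
        }
        return result;
    }

    // Returns {precision, recall, f1}. Both segmentations are assumed
    // to cover the same underlying character string.
    static double[] score(List<String> gold, List<String> predicted) {
        Set<String> goldSpans = spans(gold);
        Set<String> predSpans = spans(predicted);

        // A predicted word is correct if the same span occurs in the gold.
        Set<String> correct = new HashSet<String>(predSpans);
        correct.retainAll(goldSpans);

        double p = predSpans.isEmpty() ? 0.0
                : (double) correct.size() / predSpans.size();
        double r = goldSpans.isEmpty() ? 0.0
                : (double) correct.size() / goldSpans.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("我", "喜欢", "北京", "大学");
        List<String> pred = Arrays.asList("我", "喜欢", "北京大学");
        double[] s = score(gold, pred);
        System.out.println("P=" + s[0] + " R=" + s[1] + " F=" + s[2]);
    }
}
```

In this toy sentence, two of the three predicted words match gold words, so precision is 2/3, recall is 2/4, and the F measure is 4/7. A real evaluation would sum the counts over all sentences in the test set before computing the ratios.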

The older version (2006-05-11), which does not use external lexicon features, remains available for download, but we recommend using the latest version (2008-05-21). Another new feature of the latest release is that the segmenter can now output k-best segmentations.

The segmenter is licensed under the GNU GPL. (Note that this is the full GPL, which allows its use for research purposes or other free software projects but does not allow its incorporation into any type of distributed proprietary software, even in part or in translation; see the GPL FAQ.) Source is included. The package includes components for command-line invocation and a Java API.

The download is a gzipped tar file consisting mainly of the included model files. Once you unpack the tar file, you should have everything you need. Simple scripts are included to invoke the segmenter. Please send any questions, feedback, extensions, or bug fixes to: java-nlp-support@lists.stanford.edu.

Download Stanford Chinese Segmenter version 2008-05-21 (with external lexicon features; able to output k-best segmentations)

Download Stanford Chinese Segmenter version 2006-05-11 (with no external lexicon features)