Stanford NER CRF FAQ

Questions

  1. How can I train my own NER model?
  2. That wasn't enough information. Can you really tell me how can I train my own NER model?
  3. How can I train an NER model using less memory?
  4. For our Web 5.0 system, can I set the Stanford NER system up to allow single-jar deployment rather than it having to load NER models from separate files?
  5. Why do I get compilation errors when I try to compile recent versions of the source code (such as the 2008-05-07 version)?

Questions with answers

  1. How can I train my own NER model?

    The documentation for training your own classifier is certainly somewhere between bad and non-existent. But nevertheless, you should look through the Javadoc for at least the classes CRFClassifier and NERFeatureFactory.

    Basically, the training data should be in tab-separated columns, and you define the meaning of those columns via a map. You define the data file, the map, and what features to generate via a properties file. There is considerable documentation of what features different properties generate in the Javadoc of NERFeatureFactory, though ultimately you have to go to the source code to answer some questions....

    Here's a sample NER properties file:

    trainFile = training-data.col
    serializeTo = ner-model.ser.gz
    map = word=0,answer=1
    
    useClassFeature=true
    useWord=true
    useNGrams=true
    noMidNGrams=true
    maxNGramLeng=6
    usePrev=true
    useNext=true
    useSequences=true
    usePrevSequences=true
    maxLeft=1
    useTypeSeqs=true
    useTypeSeqs2=true
    useTypeySequences=true
    wordShape=chris2useLC
    useDisjunctive=true
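
    To make the role of the map line concrete, here is a minimal Python sketch of how a map value lines up with the tab-separated columns of a training file. The helper names parse_map and read_row are hypothetical and are not part of Stanford NER; this only illustrates the file layout, not how CRFClassifier itself reads the data.

```python
def parse_map(map_value):
    """Turn a map string such as 'word=0,answer=1' into {'word': 0, 'answer': 1}."""
    columns = {}
    for entry in map_value.split(","):
        name, index = entry.strip().split("=")
        columns[name.strip()] = int(index)
    return columns


def read_row(line, columns):
    """Split one tab-separated line and name its fields according to the map."""
    fields = line.rstrip("\n").split("\t")
    return {name: fields[index] for name, index in columns.items()}


# Illustrative values only: one training line with a word and its label.
columns = parse_map("word=0,answer=1")
row = read_row("Emma\tPERS\n", columns)
print(row)
```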
    
  2. That wasn't enough information. Can you really tell me how can I train my own NER model?

    Oh, okay. Here's an example. Suppose we want to build an NER system for Jane Austen novels. We might train it on chapter 1 of Emma. Download that file. You can convert it to one token per line with our tokenizer using the following command:

    java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer jane-austen-emma-ch1.txt > jane-austen-emma-ch1.tok

    We then need to make training data in which the entities are labeled. There are various annotation tools available, or you could do this by hand in a text editor. One way is to start by labeling every token as "other" (for which the default label is "O" in our software) and then to hand-label the real entities in a text editor. The first step can be done with Perl using this command:

    perl -ne 'chomp; print "$_\tO\n"' jane-austen-emma-ch1.tok > jane-austen-emma-ch1.tsv

    and if you don't want to do the second step, you can skip to downloading our input file. We have marked only one entity type, PERS for person name, but you could easily add a second entity type, such as LOC for location, to this data.
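
    If you would rather script the labeling step in Python, the sketch below does the same default-to-"O" pass. As an optional shortcut, it pre-labels tokens found in a small name set. The function label_tokens and the names argument are hypothetical conveniences, not part of Stanford NER, and pre-labeled output still needs to be checked by hand.

```python
def label_tokens(tokens, names=(), default="O", name_label="PERS"):
    """Return one 'token<TAB>label' line per non-empty token.

    Tokens listed in `names` get `name_label`; everything else gets `default`.
    """
    lines = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue  # skip empty lines in the tokenizer output
        label = name_label if token in names else default
        lines.append(f"{token}\t{label}")
    return lines


# Illustrative input: a few tokens, one per list element.
tokens = ["Emma", "Woodhouse", ",", "handsome"]
for line in label_tokens(tokens, names={"Emma", "Woodhouse"}):
    print(line)
```

    To process a real file, you could pass the lines of jane-austen-emma-ch1.tok as tokens and write the returned lines to jane-austen-emma-ch1.tsv.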

    You will then also want some test data to see how well the system is doing. You can download the text of chapter 2 of Emma and the gold standard annotated version of chapter 2.

    Stanford NER CRF allows all properties to be specified on the command line, but it is easier to use a properties file. Here is a simple properties file (pretty much like the one above!), this time with an explanation of each line given in comments, which start with "#":

    #location of the training file
    trainFile = jane-austen-emma-ch1.tsv
    #location where you would like to save (serialize to) your
    #classifier; adding .gz at the end automatically gzips the file,
    #making it faster and smaller
    serializeTo = ner-model.ser.gz
    
    #structure of your training file; this tells the classifier
    #that the word is in column 0 and the correct answer is in
    #column 1
    map = word=0,answer=1
    
    #these are the features we'd like to train with
    #some are discussed below, the rest can be
    #understood by looking at NERFeatureFactory
    useClassFeature=true
    useWord=true
    useNGrams=true
    #no ngrams will be included that do not contain either the
    #beginning or end of the word
    noMidNGrams=true
    useDisjunctive=true
    maxNGramLeng=6
    usePrev=true
    useNext=true
    useSequences=true
    usePrevSequences=true
    maxLeft=1
    #the next 4 deal with word shape features
    useTypeSeqs=true
    useTypeSeqs2=true
    useTypeySequences=true
    wordShape=chris2useLC
    

    Once you make such a properties file, you can train a classifier with the command:

    java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop propertiesFile.txt

    (where propertiesFile.txt should be the location of the properties file you just created)

    An NER model will have been serialized to the location you specified once the program has completed. To check how well it works, you can run the test command:

    java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -testFile jane-austen-emma-ch2.tsv

    By looking at the output, you can see that the classifier finds most of the person named entities, but not all of them, because of the small amount of training data and the limited features.
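
    If you want a number rather than an eyeball check, the following Python sketch computes token-level precision, recall, and F1 for one label. It assumes you have saved the test output to a file in which each non-blank line holds a token, its gold label, and the predicted label, separated by tabs. That layout is an assumption, so check it against your own output. The function names are hypothetical, and token-level scores are simpler than entity-level scores.

```python
def parse_output(lines):
    """Keep only lines that have exactly three tab-separated fields."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            rows.append(tuple(fields))
    return rows


def score_label(rows, label="PERS"):
    """Return (precision, recall, f1) for `label` over (token, gold, guess) rows."""
    tp = fp = fn = 0
    for _token, gold, guess in rows:
        if guess == label and gold == label:
            tp += 1
        elif guess == label:
            fp += 1
        elif gold == label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up rows for illustration, not real classifier output.
sample = ["Emma\tPERS\tPERS\n", "Woodhouse\tPERS\tO\n", "Highbury\tO\tPERS\n", "the\tO\tO\n"]
print(score_label(parse_output(sample)))
```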

    So how do you apply this to make your own non-example NER model? You need 1) a training data source, 2) a properties file specifying the features you want to use, and (optional, but often nice) 3) a test file to see how you're doing.

    For the training data source, each word must be on a separate line and annotated with the correct answer; all columns must be tab-separated. If you want to explicitly specify more features for a word, you can add them to the file in a new column and then describe the new structure of your file in the map line of the properties file. For example, if you added a third column to your data with a new feature, you might write "map = word=0,answer=1,mySpecialFeature=2". Right now, most arbitrarily named features (like mySpecialFeature) will not work without modifications to the source code, but we are working on adding this feature. In the meantime, there are known names that do work, like tag, lemma, chunk, and web.

    Once you've annotated your data, you make a properties file with the features you want. You can use the example properties file, and refer to the NERFeatureFactory Javadoc for more possible features. Finally, you can test on your annotated test data as shown above, or annotate more text by using the -textFile option rather than -testFile.
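
    Since a stray space in place of a tab is an easy mistake to make when annotating by hand, a quick format check before training can save time. The Python sketch below reports lines whose column count does not match what the map line declares. The function find_bad_lines is hypothetical, and skipping blank lines rather than reporting them is an assumption about how you want them treated.

```python
def find_bad_lines(lines, expected_columns):
    """Return (line_number, problem) pairs for malformed lines."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            continue  # blank line: skipped here, not reported
        fields = text.split("\t")
        if len(fields) != expected_columns:
            problems.append(
                (number, f"expected {expected_columns} columns, found {len(fields)}")
            )
        elif any(not field.strip() for field in fields):
            problems.append((number, "empty column"))
    return problems


# Line 2 uses a space instead of a tab; line 5 is missing its label.
sample = ["Emma\tPERS\n", "was O\n", "\n", "happy\tO\n", "clever\t\n"]
for number, problem in find_bad_lines(sample, expected_columns=2):
    print(f"line {number}: {problem}")
```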

  3. How can I train an NER model using less memory?

    Here are some tips on memory usage for CRFClassifier:

    1. Ultimately, if you have tons of features and lots of classes, you need to have lots of memory to train a CRFClassifier. We frequently train models that require 2 or more gigabytes of RAM and are used to typing java -mx2g.
    2. Decrease the memory of the limited-memory quasi-Newton optimizer (L-BFGS). The optimizer maintains a number of past guesses which are used to approximate the Hessian. Having more guesses makes the estimate more accurate, and optimization is faster, but the memory used by the system during optimization is linear in the number of guesses. This is specified by the parameter qnSize. The default is 25. Using 10 is perfectly adequate. If you're short of memory, things will still work with much smaller values, even just a value of 2.
    3. Use the flag saveFeatureIndexToDisk = true. The feature names aren't actually needed while the core model estimation (optimization) code is run. This option saves them to a file before the optimizer runs, enabling the memory they use to be freed, and then loads the feature index from disk after optimization is finished.
    4. Decrease the order of the CRF. We usually use just first order CRFs (maxLeft=1). While the code supports arbitrary order CRFs, building second, third, or fourth order CRFs will greatly increase memory usage and normally isn't necessary. Remember: maxLeft refers to the size of the class contexts that your features use (that is, it is one smaller than the clique size). A first order CRF can still look arbitrarily far to the left or right to get information about the observed data context.
    5. Decrease the number of features generated. To see all the features generated, you can set the property printFeatures to true. CRFClassifier will then write (potentially huge) files in the current directory listing the features generated for each token position. Options that generate huge numbers of features include useWordPairs and useNGrams when maxNGramLeng is a large number.
    6. Decrease the number of classes in your model. This may or may not be possible, depending on what your modeling requirements are.
    7. Use the flag useObservedSequencesOnly=true. With this flag, adjacent words can only be labeled with label sequences that were seen next to each other in the training data. For some kinds of data this actually gives better accuracy; for other kinds it is worse. But unless the label sequence patterns are dense, it will reduce your memory usage.
    8. If you're concerned about runtime memory usage, some of the above items still apply (number of features and classes, useObservedSequencesOnly, and order of the CRF), but in addition, you can use the flag featureDiffThresh, for example featureDiffThresh=0.05. In training, CRFClassifier will train one model, drop all the features with weight (absolute value) beneath the given threshold, and then train a second model. Training thus takes longer, but the resulting model is smaller and faster at runtime, and usually has very similar performance for a reasonable threshold such as 0.05.
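
    To get a feel for tip 2, here is a back-of-the-envelope Python sketch of how optimizer history grows with qnSize. It assumes the textbook L-BFGS layout of two double-precision vectors per stored guess, each as long as the weight vector. The actual implementation may store more or less than this, and the weight count used below is invented for illustration.

```python
BYTES_PER_DOUBLE = 8


def lbfgs_history_bytes(num_weights, qn_size, vectors_per_guess=2):
    """Estimate the memory held by the L-BFGS history, in bytes.

    The estimate is linear in qn_size, which is the point of tip 2.
    """
    return num_weights * qn_size * vectors_per_guess * BYTES_PER_DOUBLE


# Hypothetical model with 5 million weights.
for qn_size in (25, 10, 2):
    mib = lbfgs_history_bytes(5_000_000, qn_size) / 2**20
    print(f"qnSize={qn_size:>2}: about {mib:.0f} MiB of optimizer history")
```
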
  4. For our Web 5.0 system, can I set the Stanford NER system up to allow single-jar deployment rather than it having to load NER models from separate files?

    Yes! But you'll need to make your own custom jar file. If you insert into the jar file an NER model with name myModel and you put it inside the jar file under the /classifiers/ path as /classifiers/myModel, then you can load it when running from a jar file with a command like:

    java -mx500m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadJarClassifier myModel -textFile sample.txt
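
    Because a jar file is a zip archive, one way to build the custom jar is to append the model as a new entry. The Python sketch below does this with the standard zipfile module. The function add_model_to_jar is hypothetical, and writing the entry as classifiers/myModel (no leading slash) is an assumption about how the /classifiers/ path maps onto jar entry names. Work on a copy of the jar.

```python
import zipfile


def add_model_to_jar(jar_path, model_path, model_name):
    """Append a serialized model to a jar under classifiers/<model_name>."""
    entry = f"classifiers/{model_name}"
    with zipfile.ZipFile(jar_path, "a") as jar:
        if entry in jar.namelist():
            raise ValueError(f"{entry} is already in {jar_path}")
        # The model file is stored as-is under the new entry name.
        jar.write(model_path, arcname=entry)
    return entry
```

    For example, add_model_to_jar("stanford-ner.jar", "ner-model.ser.gz", "myModel") would add the model trained earlier under the name used in the command above. The JDK's jar tool can update an archive as well.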

    You might also be interested in looking at edu.stanford.nlp.ie.NERServer as an example of having the CRFClassifier run on a socket, wait for text to annotate, and then return the results. With a bit of work, we're sure you can adapt that example to work in a REST, SOAP, AJAX, or whatever system. If not, pay us a lot of money, and we'll work it out for you.

  5. Why do I get compilation errors when I try to compile recent versions of the source code (such as the 2008-05-07 version)?

    In recent versions of our NER code, we use the typesafe heterogeneous container pattern that Josh Bloch has talked about in various places such as this talk. It's neat but somewhat stresses the implementation of generic types in Java. The code is correct and should compile okay. It does compile okay in current versions of Sun javac v1.5 or v1.6 and with the current version of the Eclipse compiler. If it doesn't compile for you, you should upgrade your Java compiler or complain to the person who makes it.
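
    For readers who have not met the pattern, here is a rough Python analogue of a typesafe heterogeneous container: a single container whose keys are types, so that each lookup returns a value of the type used as the key. The real code is Java and relies on generics for its compile-time checks. The class below is hypothetical, is not taken from the NER source, and only checks types at run time.

```python
from typing import Any, Dict, Type, TypeVar

T = TypeVar("T")


class TypedContainer:
    """Maps a type to one value of that type."""

    def __init__(self) -> None:
        self._values: Dict[Type[Any], Any] = {}

    def put(self, key: Type[T], value: T) -> None:
        # Run-time stand-in for the check a Java compiler does statically.
        if not isinstance(value, key):
            raise TypeError(f"{value!r} is not an instance of {key.__name__}")
        self._values[key] = value

    def get(self, key: Type[T]) -> T:
        return self._values[key]


box = TypedContainer()
box.put(str, "Emma")
box.put(int, 1815)
print(box.get(str), box.get(int))
```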

Please send any other questions or feedback, or extensions and bugfixes to java-nlp-support@lists.stanford.edu.