
Package edu.stanford.nlp.pipeline

Linguistic Annotation Pipeline


The point of this package is to enable people to quickly and painlessly get complete linguistic annotations of their text. It is designed to be highly flexible and extensible. I will first discuss the organization and functions of the classes, and then I will give some sample code and a run-down of the implemented Annotators.

Annotation

An Annotation is the data structure which holds the results of Annotators. An Annotation is basically a map from keys to bits of annotation, such as the parse tree, the part-of-speech tags, or the named entity tags. Annotations are designed to operate at the sentence level; however, depending on the Annotators you use, this may not be how you choose to use the package.
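Because an Annotation is a map keyed by class objects, lookups are type-safe. A minimal sketch of this (the class name AnnotationDemo is invented for illustration; the keys come from the standard CoreAnnotations):

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;

public class AnnotationDemo {
  public static void main(String[] args) {
    // The String constructor stores the raw text under TextAnnotation.
    Annotation annotation = new Annotation("Stanford University is in California.");

    // Keys are class objects, not strings, so the compiler knows
    // that TextAnnotation maps to a String.
    String text = annotation.get(CoreAnnotations.TextAnnotation.class);
    System.out.println(text);
  }
}
```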

Annotators

The backbone of this package is the Annotators. Annotators are a lot like functions, except that they operate over Annotations instead of Objects. They do things like tokenize, parse, or NER tag sentences. In the javadoc of your Annotator you should specify what the Annotator assumes already exists (for instance, the NERAnnotator assumes that the sentence has been tokenized) and where to find those annotations (in that example, TextAnnotation.class). You should also specify what the Annotator adds to the annotation, and where.
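As a sketch of this contract, here is a hypothetical Annotator that lower-cases each token's text. The class name is invented for illustration, and it assumes a version of the Annotator interface whose only required method is annotate(Annotation); newer CoreNLP versions also require methods declaring requirements, so adapt accordingly:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.Annotator;

// Requires: TokensAnnotation (i.e., a tokenizer must run first).
// Adds: nothing new; modifies TextAnnotation on each token in place.
public class LowercaseAnnotator implements Annotator {
  @Override
  public void annotate(Annotation annotation) {
    for (CoreLabel token : annotation.get(CoreAnnotations.TokensAnnotation.class)) {
      String text = token.get(CoreAnnotations.TextAnnotation.class);
      token.set(CoreAnnotations.TextAnnotation.class, text.toLowerCase());
    }
  }
}
```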

AnnotationPipeline

An AnnotationPipeline is where many Annotators are strung together to form a linguistic annotation pipeline. It is, itself, an Annotator. AnnotationPipelines also keep track of how much time they spend annotating and loading, to help users find the time sinks. However, the AnnotationPipeline class is not meant to be used as-is; it serves as an example of how to build your own pipeline. If you just want a typical NLP pipeline, take a look at StanfordCoreNLP (described later in this document).

Sample Usage

Here is some sample code which illustrates the intended usage of the package:
 public void testPipeline(String text) throws Exception {
   // build the pipeline from individual annotators
   AnnotationPipeline pipeline = new AnnotationPipeline();
   pipeline.addAnnotator(new TokenizerAnnotator(false, "en"));
   pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
   pipeline.addAnnotator(new POSTaggerAnnotator(false));
   pipeline.addAnnotator(new MorphaAnnotator(false));
   pipeline.addAnnotator(new NERCombinerAnnotator(false));
   pipeline.addAnnotator(new ParserAnnotator(false, -1));

   // create an annotation containing the raw text
   Annotation document = new Annotation(text);

   // run all the annotators on this text
   pipeline.annotate(document);

   // demonstrate typical usage: iterate over sentences, then tokens
   for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
     // get the parse tree for the sentence
     Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
     // get the tokens for the sentence and iterate over them
     for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
       // get token attributes
       String tokenText = token.get(CoreAnnotations.TextAnnotation.class);
       String tokenPOS = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
       String tokenLemma = token.get(CoreAnnotations.LemmaAnnotation.class);
       String tokenNE = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
     }
   }
 }

Existing Annotators

There already exist Annotators for many common tasks, all of which include default model locations, so they can be used off the shelf. The sample code above shows several of them: tokenization (TokenizerAnnotator), sentence splitting (WordsToSentencesAnnotator), POS tagging (POSTaggerAnnotator), lemmatization (MorphaAnnotator), named entity recognition (NERCombinerAnnotator), and parsing (ParserAnnotator).

How Do I Use This?

You do not have to construct your pipeline from scratch! For typical NLP processing, use StanfordCoreNLP. This pipeline implements the most commonly needed functionality: tokenization, lemmatization, POS tagging, NER, parsing, and coreference resolution. Read below for how to use this pipeline from the command line or directly in your Java code.

Using StanfordCoreNLP from the Command Line

The command line for StanfordCoreNLP is:
 ./bin/stanfordcorenlp.sh
 
or
 java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR_CONFIGURATION_FILE ] -file YOUR_INPUT_FILE
 
where the following properties are defined (if -props or annotators is not defined, default properties will be loaded via the classpath):
        "annotators" - comma-separated list of annotators to run
                The following annotators are supported: tokenize, ssplit, pos, lemma, ner, truecase, parse, dcoref, nfl
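As a sketch, a minimal configuration file passed via -props might contain a single line selecting annotators from the supported list above (annotators must appear in dependency order, e.g. tokenize before ssplit):

```
# YOUR_CONFIGURATION_FILE: which annotators to run, in dependency order
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
```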
More information is available here: Stanford CoreNLP

The StanfordCoreNLP API

More information is available here: Stanford CoreNLP
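As a sketch of the API, the usual pattern is to construct a StanfordCoreNLP from a Properties object and then annotate documents with it. The class name CoreNLPDemo and the example sentence are invented for illustration:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

public class CoreNLPDemo {
  public static void main(String[] args) {
    // Choose which annotators to run; order matters because of dependencies.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");

    // Constructing the pipeline loads the default models for each annotator.
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation document = new Annotation("Stanford is in California.");
    pipeline.annotate(document);

    // Print each token with its part-of-speech tag.
    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        System.out.printf("%s/%s%n",
            token.get(CoreAnnotations.TextAnnotation.class),
            token.get(CoreAnnotations.PartOfSpeechAnnotation.class));
      }
    }
  }
}
```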
Authors:
Jenny Finkel, Mihai Surdeanu, Steven Bethard, David McClosky
Last modified: May 7, 2012

Stanford NLP Group