Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words and their parts of speech, identify whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools on it with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.
Stanford CoreNLP integrates all our NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system, and provides model files for analysis of English. The goal of this project is to enable people to quickly and painlessly get complete linguistic annotations of natural language texts. It is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled.
The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v2 or later). Source is included. Note that this is the full GPL, which allows many free uses, but not its use in distributed proprietary software. The download is 259 MB and requires Java 1.6+.
Or you can find it on Maven Central.
Parsing a file and saving the output as XML
Before using Stanford CoreNLP, it is usual to create a configuration file (a Java Properties file). Minimally, this file should contain the "annotators" property, which contains a comma-separated list of Annotators to use. For example, the setting below enables: tokenization, sentence splitting (required by most Annotators), POS tagging, lemmatization, NER, syntactic parsing, and coreference resolution.
annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
However, if you just want to specify one or two properties, you can instead place them on the command line.
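As a rough illustration of the configuration file described above, the sketch below builds the minimal "annotators" setting with the standard java.util.Properties class, serializes it in the Java Properties file format, and reads it back. The class name ConfigFileSketch and its helper methods are placeholders invented for this example, not part of Stanford CoreNLP.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical helper class; only java.util.Properties is used.
class ConfigFileSketch {

    // Builds the minimal configuration: only the "annotators" property.
    static Properties minimalConfig() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        return props;
    }

    // Serializes the properties in the Java Properties file format.
    static String toFileContents(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, "Stanford CoreNLP configuration");
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory writer
        }
        return out.toString();
    }

    // Parses text written in the Java Properties file format.
    static Properties parse(String contents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(contents));
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory reader
        }
        return props;
    }

    public static void main(String[] args) {
        String contents = toFileContents(minimalConfig());
        System.out.println(contents);
        System.out.println(parse(contents).getProperty("annotators"));
    }
}
```

Saving the printed contents to a file and passing that file with -props should be equivalent to writing the configuration by hand.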
To process one file using Stanford CoreNLP, use the following sort of command line (adjust the JAR file date extensions to your downloaded release):
java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props <YOUR CONFIGURATION FILE> ] -file <YOUR INPUT FILE>
In particular, to process the included sample file input.txt you can use this command in the distribution directory:
java -cp stanford-corenlp-1.3.5.jar:stanford-corenlp-1.3.5-models.jar:xom.jar:joda-time.jar:jollyday.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file input.txt
Notes:
-Xmx3g specifies the maximum amount of RAM (heap) that Java may use.
On a 64-bit machine, Stanford CoreNLP typically requires 3GB to run
(and it may need even more, depending on the size of the document to
parse). Running this program on a 32-bit machine is not recommended,
as 32-bit Java will not allocate even as much as 2GB. This
is especially a problem on Windows machines.
If you want to process a list of files, use the following command line:
java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props <YOUR CONFIGURATION FILE> ] -filelist <YOUR LIST OF FILES>
where the -filelist parameter points to a file whose content lists all files to be processed (one per line).
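A file in the one-path-per-line format that -filelist expects can be generated in many ways. The following is a minimal sketch in plain Java, assuming you want every .txt file directly inside one directory. The class name FileListSketch and the output name filelist.txt are arbitrary choices for this example.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; it does not use any Stanford CoreNLP code.
class FileListSketch {

    // Collects the .txt files directly inside dir, sorted for a stable order.
    static List<String> textFilesIn(File dir) {
        List<String> paths = new ArrayList<String>();
        File[] children = dir.listFiles();
        if (children == null) {
            return paths; // not a directory, or not readable
        }
        Arrays.sort(children);
        for (File f : children) {
            if (f.isFile() && f.getName().endsWith(".txt")) {
                paths.add(f.getPath());
            }
        }
        return paths;
    }

    // Formats the paths one per line, as described for -filelist.
    static String toFileList(List<String> paths) {
        StringBuilder sb = new StringBuilder();
        for (String p : paths) {
            sb.append(p).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(args.length > 0 ? args[0] : ".");
        PrintWriter out = new PrintWriter(new FileWriter("filelist.txt"));
        try {
            out.print(toFileList(textFilesIn(dir)));
        } finally {
            out.close();
        }
    }
}
```

The resulting filelist.txt can then be passed as the argument of -filelist.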
Note that the -props parameter is optional -- by default, it
will search for StanfordCoreNLP.properties in your classpath
and use the defaults included in the distribution.
By default, output files are written to the current directory.
You may specify an alternate output directory with the flag
-outputDirectory. Output filenames are the same as input
filenames but with -outputExtension added to them (.xml
by default). It will overwrite (clobber) output files by default.
Pass -noClobber to avoid this behavior. Additionally, if you'd
rather it replace the extension with the -outputExtension, pass
the -replaceExtension flag. This will result in filenames like
test.xml instead of test.txt.xml (when given test.txt
as an input file).
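The naming rule just described can be modeled in a few lines. This is only a sketch of the documented behavior, not the code Stanford CoreNLP uses. The class and method names are invented here, the input is assumed to be a bare file name without directories, and the handling of names that contain no dot is an assumption.

```java
// Hypothetical model of the output file naming rule described above.
class OutputNameSketch {

    // Returns the output file name for a bare input file name.
    // With replaceExtension, the last extension of the input name is dropped first.
    static String outputName(String inputName, String outputExtension, boolean replaceExtension) {
        String base = inputName;
        if (replaceExtension) {
            int dot = inputName.lastIndexOf('.');
            if (dot > 0) { // assumption: names without an extension are kept whole
                base = inputName.substring(0, dot);
            }
        }
        return base + outputExtension;
    }

    public static void main(String[] args) {
        System.out.println(outputName("test.txt", ".xml", false)); // test.txt.xml
        System.out.println(outputName("test.txt", ".xml", true));  // test.xml
    }
}
```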
For each input file, Stanford CoreNLP generates one XML file with all relevant annotation. For example, for the above configuration and a file containing the text below:
Stanford University is located in California. It is a great university.
Stanford CoreNLP generates the following output, with the following attributes.
Note that the XML output uses the CoreNLP-to-HTML.xsl stylesheet file, which can be downloaded from here. This stylesheet enables human-readable display of the above XML content. For example, the previous example should be displayed like this.
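Besides viewing the XML through the stylesheet, you can read it programmatically. The sketch below uses the DOM parser from the Java standard library and assumes that each token is a token element with word, POS and NER child elements. Those element names and the embedded SAMPLE string are hand-written assumptions for this example, so check them against a file produced by your own release.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for the XML output; element names are assumptions.
class XmlOutputSketch {

    // Hand-written fragment in the assumed shape of the output, not real program output.
    static final String SAMPLE =
        "<root><document><sentences><sentence id=\"1\"><tokens>"
        + "<token id=\"1\"><word>Stanford</word><POS>NNP</POS><NER>ORGANIZATION</NER></token>"
        + "<token id=\"2\"><word>University</word><POS>NNP</POS><NER>ORGANIZATION</NER></token>"
        + "</tokens></sentence></sentences></document></root>";

    // Returns "word/POS/NER" for every token element in the given XML text.
    static List<String> tokenSummaries(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList tokens = doc.getElementsByTagName("token");
            List<String> result = new ArrayList<String>();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                result.add(child(token, "word") + "/" + child(token, "POS") + "/" + child(token, "NER"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Text of the first descendant element with this name, or "" if there is none.
    static String child(Element parent, String name) {
        NodeList nodes = parent.getElementsByTagName(name);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(tokenSummaries(SAMPLE));
    }
}
```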
Stanford CoreNLP also has the ability to remove most XML from a document before processing it. (CDATA is not correctly handled.) For example, if run with the annotators
annotators = tokenize, cleanxml, ssplit, pos, lemma, ner, parse, dcoref
and given the text
<xml>Stanford University is located in California. It is a great university.</xml>
Stanford CoreNLP generates the following output. Note that the only difference between this and the original output is the change in CharacterOffsets.
Using the Stanford CoreNLP API
The backbone of the CoreNLP package is formed by two classes: Annotation and Annotator. Annotations are the data structures that hold the results of annotators. Annotations are basically maps, from keys to bits of the annotation, such as the parse, the part-of-speech tags, or named entity tags. Annotators are a lot like functions, except that they operate over Annotations instead of Objects. They do things like tokenize, parse, or NER tag sentences. Annotators and Annotations are integrated by AnnotationPipelines, which create sequences of generic Annotators. Stanford CoreNLP inherits from the AnnotationPipeline class, and is customized with NLP Annotators.
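To make this design concrete, here is a deliberately simplified, self-contained model of the idea: a map keyed by class objects, annotators that read and write that map, and a pipeline that runs annotators in sequence. All names (ToyAnnotation, ToyAnnotator, ToyPipeline and the key classes) are invented for this sketch. The real classes in edu.stanford.nlp.pipeline are richer and may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the Annotation / Annotator / AnnotationPipeline design.
class PipelineSketch {

    // A key is a class whose type parameter names the type of the stored value.
    interface Key<T> {}
    static class TextKey implements Key<String> {}
    static class TokensKey implements Key<List<String>> {}
    static class TokenCountKey implements Key<Integer> {}

    // Stand-in for Annotation: a map from key classes to annotation values.
    static class ToyAnnotation {
        private final Map<Class<?>, Object> values = new HashMap<Class<?>, Object>();

        <T> void set(Class<? extends Key<T>> key, T value) {
            values.put(key, value);
        }

        @SuppressWarnings("unchecked")
        <T> T get(Class<? extends Key<T>> key) {
            return (T) values.get(key);
        }
    }

    // Stand-in for Annotator: a function over annotations.
    interface ToyAnnotator {
        void annotate(ToyAnnotation annotation);
    }

    // Splits the text on whitespace (far cruder than a real tokenizer).
    static class WhitespaceTokenizer implements ToyAnnotator {
        public void annotate(ToyAnnotation a) {
            a.set(TokensKey.class, Arrays.asList(a.get(TextKey.class).trim().split("\\s+")));
        }
    }

    // Depends on the tokens written by an earlier annotator.
    static class TokenCounter implements ToyAnnotator {
        public void annotate(ToyAnnotation a) {
            a.set(TokenCountKey.class, a.get(TokensKey.class).size());
        }
    }

    // Stand-in for AnnotationPipeline: runs its annotators in order.
    static class ToyPipeline implements ToyAnnotator {
        private final List<ToyAnnotator> annotators = new ArrayList<ToyAnnotator>();

        ToyPipeline add(ToyAnnotator annotator) {
            annotators.add(annotator);
            return this;
        }

        public void annotate(ToyAnnotation annotation) {
            for (ToyAnnotator annotator : annotators) {
                annotator.annotate(annotation);
            }
        }
    }

    // Creates an annotation for the text and runs both toy annotators on it.
    static ToyAnnotation run(String text) {
        ToyAnnotation document = new ToyAnnotation();
        document.set(TextKey.class, text);
        new ToyPipeline().add(new WhitespaceTokenizer()).add(new TokenCounter()).annotate(document);
        return document;
    }

    public static void main(String[] args) {
        ToyAnnotation document = run("Stanford University is located in California .");
        System.out.println(document.get(TokensKey.class));
        System.out.println(document.get(TokenCountKey.class));
    }
}
```

Using class objects as keys lets the compiler infer the value type of each get call, which is why no casts appear at the call sites.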
The table below summarizes the Annotators currently supported and the Annotations that they generate.
| Property name | Annotator class name | Generated Annotation | Description |
| tokenize | PTBTokenizerAnnotator | TokensAnnotation (list of tokens), and CharacterOffsetBeginAnnotation, CharacterOffsetEndAnnotation, TextAnnotation (for each token) | Tokenizes the text. This component started as a PTB-style tokenizer, but was extended since then to handle noisy and web text. The tokenizer saves the character offsets of each token in the input text, as CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation. |
| cleanxml | CleanXmlAnnotator | XmlContextAnnotation | Removes XML tokens from the document. |
| ssplit | WordToSentenceAnnotator | SentencesAnnotation | Splits a sequence of tokens into sentences. |
| pos | POSTaggerAnnotator | PartOfSpeechAnnotation | Labels tokens with their POS tag. For more details see this page. |
| lemma | MorphaAnnotator | LemmaAnnotation | Generates the word lemmas for all tokens in the corpus. |
| ner | NERClassifierCombiner | NamedEntityTagAnnotation and NormalizedNamedEntityTagAnnotation | Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC) and numerical entities (DATE, TIME, MONEY, NUMBER). Named entities are recognized using a combination of three CRF sequence taggers trained on various corpora, such as ACE and MUC. Numerical entities are recognized using a rule-based system. Numerical entities that require normalization, e.g., dates, are normalized to NormalizedNamedEntityTagAnnotation. For more details on the CRF tagger see this page. |
| regexner | RegexNERAnnotator | NamedEntityTagAnnotation | Implements a rule-based NER using Java regular expressions. The goal of this Annotator is to provide a simple framework to incorporate NE labels that are not annotated in traditional NL corpora. For example, the default list of regular expressions that we distribute in the models file recognizes ideologies (IDEOLOGY), nationalities (NATIONALITY), religions (RELIGION), and titles (TITLE). |
| truecase | TrueCaseAnnotator | TrueCaseAnnotation and TrueCaseTextAnnotation | Recognizes the true case of tokens in text where this information was lost, e.g., all upper case text. This is implemented with a discriminative model implemented using a CRF sequence tagger. The true case label, e.g., INIT_UPPER is saved in TrueCaseAnnotation. The token text adjusted to match its true case is saved as TrueCaseTextAnnotation. |
| parse | ParserAnnotator | TreeAnnotation, BasicDependenciesAnnotation, CollapsedDependenciesAnnotation, CollapsedCCProcessedDependenciesAnnotation | Provides full syntactic analysis, using both the constituent and the dependency representations. The constituent-based output is saved in TreeAnnotation. We generate three dependency-based outputs, as follows: basic, uncollapsed dependencies, saved in BasicDependenciesAnnotation; collapsed dependencies saved in CollapsedDependenciesAnnotation; and collapsed dependencies with processed coordinations, in CollapsedCCProcessedDependenciesAnnotation. Most users of our parser will prefer the latter representation. For more details on the parser, please see this page. For more details about the dependencies, please refer to this page. |
| dcoref | DeterministicCorefAnnotator | CorefChainAnnotation | Implements both pronominal and nominal coreference resolution. The entire coreference graph (with head words of mentions as nodes) is saved in CorefChainAnnotation. For more details on the underlying coreference resolution algorithm, see this page. |
Depending on which annotators you use, please cite the corresponding papers on: POS tagging, NER, parsing, or coreference resolution.
To construct a Stanford CoreNLP object from a given set of properties, use StanfordCoreNLP(Properties props). This constructor creates the pipeline using the annotators given in the "annotators" property (see above for an example setting). The complete list of accepted annotator names is given in the first column of the table above. To process an arbitrary text, use the annotate(Annotation document) method.
The code below shows how to create and use a Stanford CoreNLP object:
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }

    // this is the parse tree of the current sentence
    Tree tree = sentence.get(TreeAnnotation.class);

    // this is the Stanford dependency graph of the current sentence
    SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
}

// This is the coreference link graph
// Each chain stores a set of mentions that link to each other,
// along with a method for getting the most representative mention
// Both sentence and token offsets start at 1!
Map<Integer, CorefChain> graph =
    document.get(CorefChainAnnotation.class);
Annotator options
While all Annotators have a default behavior that is likely to be sufficient for the majority of users, most Annotators take additional options that can be passed as Java properties in the configuration file. We list below the configuration options for all Annotators:
general options:
PTBTokenizer
for example, things like "americanize=false" or
"strictTreebank3=true,untokenizable=allKeep".
StanfordCoreNLP includes SUTime, Stanford's temporal expression recognizer. SUTime is transparently called from the "ner" annotator, so no configuration is necessary. Furthermore, the "cleanxml" annotator now extracts the reference date for a given XML document, so relative dates, e.g., "yesterday", are transparently normalized with no configuration necessary.
SUTime supports the same annotations as before, i.e., NamedEntityTagAnnotation is set with the label of the numeric entity (DATE, TIME, DURATION, MONEY, PERCENT, or NUMBER) and NormalizedNamedEntityTagAnnotation is set to the value of the normalized temporal expression. Note that NormalizedNamedEntityTagAnnotation now follows the TIMEX3 standard, rather than Stanford's internal representation, e.g., "2010-01-01" for the string "January 1, 2010", rather than "20100101".
Also, SUTime now sets the TimexAnnotation key to an edu.stanford.nlp.time.Timex object, which contains the complete list of TIMEX3 fields for the corresponding expressions, such as "val", "alt_val", "type", "tid". This might be useful to developers interested in recovering complete TIMEX3 expressions.
Reference dates are by default extracted from the "datetime" and
"date" tags in an xml document. To set a different set of tags to
use, use the clean.datetags property. When using the API, reference
dates can be added to an Annotation via
edu.stanford.nlp.ling.CoreAnnotations.DocDateAnnotation,
although note that when processing an xml document, the cleanxml
annotator will overwrite the DocDateAnnotation if
"datetime" or "date" are specified in the document.
TokensRegex
StanfordCoreNLP includes TokensRegex, a framework for defining regular expressions over text and tokens, and mapping matched text to semantic objects.
Javadoc
More information is available in the javadoc: Stanford Core NLP Javadoc.
StanfordCoreNLP can also add a new annotator by reflection, without altering the code in StanfordCoreNLP.java. To create a new annotator, implement the interface edu.stanford.nlp.pipeline.Annotator and define a constructor with the signature (String, Properties). Then, add the property customAnnotatorClass.FOO=BAR to the properties used to create the pipeline. If FOO is then added to the list of annotators, the class BAR will be instantiated, with the name used to create it (FOO) and the properties passed in.
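The mechanism described here, looking up customAnnotatorClass.FOO and calling a (String, Properties) constructor by reflection, can be sketched with the standard library alone. The code below is a simplified model, not the StanfordCoreNLP source. ToyAnnotator stands in for the real annotator type, and StopwordAnnotator and the annotator name "stopword" are made up for the example.

```java
import java.lang.reflect.Constructor;
import java.util.Properties;

// Toy model of creating a custom annotator by reflection.
class CustomAnnotatorSketch {

    // Hypothetical stand-in for the real annotator type.
    interface ToyAnnotator {
        String describe();
    }

    // A custom annotator with the required (String, Properties) constructor.
    static class StopwordAnnotator implements ToyAnnotator {
        private final String name;
        private final Properties props;

        public StopwordAnnotator(String name, Properties props) {
            this.name = name;
            this.props = props;
        }

        public String describe() {
            return name + " (" + props.size() + " properties)";
        }
    }

    static final String PREFIX = "customAnnotatorClass.";

    // Looks up customAnnotatorClass.<name> and instantiates that class by reflection.
    static ToyAnnotator create(String name, Properties props) {
        String className = props.getProperty(PREFIX + name);
        if (className == null) {
            throw new IllegalArgumentException("no custom annotator named " + name);
        }
        try {
            Constructor<?> ctor = Class.forName(className).getConstructor(String.class, Properties.class);
            return (ToyAnnotator) ctor.newInstance(name, props);
        } catch (Exception e) {
            throw new IllegalStateException("cannot create " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, stopword");
        props.setProperty(PREFIX + "stopword", StopwordAnnotator.class.getName());
        System.out.println(create("stopword", props).describe());
    }
}
```

Because the class is found by name at run time, a new annotator only needs to be on the classpath and named in the properties.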
It is possible to run StanfordCoreNLP with tagger, parser, and NER
models that ignore capitalization. In order to do this, download the
caseless models package. Be sure to include the path to the
case-insensitive models jar in the -cp classpath flag as well.
Then, set properties which point to these models as follows:
-pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger
-parse.model edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz
-ner.model edu/stanford/nlp/models/ner/english.all.3class.caseless.distsim.crf.ser.gz
edu/stanford/nlp/models/ner/english.muc.7class.caseless.distsim.crf.ser.gz
edu/stanford/nlp/models/ner/english.conll.4class.caseless.distsim.crf.ser.gz
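When using the API rather than the command line, the same model paths can presumably be placed in the Properties object that is passed to the pipeline. The sketch below only builds that object. It assumes that the three NER models are given to ner.model as a single comma-separated value, which you should verify for your release. The class name CaselessConfigSketch is a placeholder.

```java
import java.util.Properties;

// Hypothetical helper that collects the caseless model settings listed above.
class CaselessConfigSketch {

    static final String MODELS = "edu/stanford/nlp/models/";

    static Properties caselessProperties() {
        Properties props = new Properties();
        props.setProperty("pos.model",
            MODELS + "pos-tagger/english-caseless-left3words-distsim.tagger");
        props.setProperty("parse.model",
            MODELS + "lexparser/englishPCFG.caseless.ser.gz");
        // Assumption: several NER models are joined with commas in one value.
        props.setProperty("ner.model",
            MODELS + "ner/english.all.3class.caseless.distsim.crf.ser.gz,"
            + MODELS + "ner/english.muc.7class.caseless.distsim.crf.ser.gz,"
            + MODELS + "ner/english.conll.4class.caseless.distsim.crf.ser.gz");
        return props;
    }

    public static void main(String[] args) {
        Properties props = caselessProperties();
        for (String key : new String[] {"pos.model", "parse.model", "ner.model"}) {
            System.out.println(key + " = " + props.getProperty(key));
        }
    }
}
```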
cleartk-stanford-corenlp is a UIMA wrapper for Stanford CoreNLP built by Steven Bethard in the context of the ClearTK toolkit.
dkpro-core-gpl is a collection of NLP components, principally Stanford CoreNLP, wrapped as UIMA components, based on work at the Ubiquitous Knowledge Processing Lab (UKP) at the Technische Universität Darmstadt. It is part of the DKPro project. See also the DKPro Core wiki.
Python wrapper including JSON-RPC server by Dustin Smith, or an up-to-date fork by Hiroyoshi Komatsu and Johannes Castner.
A stopword removal annotator by John Conwell.
Perl wrapper by Kalle Räisänen.
Ruby bindings by Louis Mullie.
Wrapper for each of Stanford's Chinese tools by Mingli Yuan.
Client server package written for Apache Thrift by Diane Napolitano.
Scala API for CoreNLP by Mihai Surdeanu, one of the original developers of the CoreNLP package.
Questions, feedback, and bug reports/fixes can be sent to our mailing lists.
We have 3 mailing lists for the Stanford Coreference Resolution System, all of which are shared
with other JavaNLP tools (with the exception of the parser). Each address is
at @lists.stanford.edu:
java-nlp-user: This is the best list to post to in order
to ask questions, make announcements, or for discussion among JavaNLP
users. You have to subscribe to be able to use it.
Join the list via this webpage or by emailing
java-nlp-user-join@lists.stanford.edu. (Leave the
subject and message body empty.) You can also look at the list archives.
java-nlp-announce: This list will be used only to announce
new versions of Stanford JavaNLP tools. So it will be very low volume (expect 1-3
messages a year). Join the list via this webpage or by emailing
java-nlp-announce-join@lists.stanford.edu. (Leave the
subject and message body empty.)
java-nlp-support: This list goes only to the software
maintainers. It's a good address for licensing questions, etc. For
general use and support questions, you're better off joining and using
java-nlp-user.
You cannot join java-nlp-support, but you can mail questions to
java-nlp-support@lists.stanford.edu.
| Version 1.3.5 | 2013-04-04 | Bugs fixed, speed improvements, coref improvements, Chinese support; compatible caseless models; compatible Chinese models |
| Version 1.3.4 | 2012-11-12 | Upgrades to sutime, dependency extraction code and English 3-class NER model; compatible caseless models |
| Version 1.3.3 | 2012-07-09 | Minor bug fixes; compatible caseless models |
| Version 1.3.2 | 2012-05-22 | Upgrades to sutime, include tokenregex annotator; compatible caseless models |
| Version 1.3.1 | 2012-04-09 | Fixed thread safety bugs, caseless models available: compatible caseless models |
| Version 1.3.0 | 2012-01-08 | Fix a crashing bug, fix excessive warnings, threadsafe |
| Version 1.2.0 | 2011-09-14 | Added SUTime time phrase recognizer to NER, bug fixes, reduced library dependencies |
| Version 1.1.0 | 2011-06-19 | Greatly improved coref results |
| Version 1.0.4 | 2011-05-15 | DCoref uses less memory, already tokenized input possible |
| Version 1.0.3 | 2011-04-17 | Add the ability to specify an arbitrary annotator |
| Version 1.0.2 | 2010-11-11 | Remove wn.jar for license reasons |
| Version 1.0.1 | 2010-11-10 | Add the ability to remove XML |
| Version 1.0 | 2010-11-01 | Initial release |