About | Citing | Questions | Download | Included Tools | Extensions | Release history | Sample output | Online | FAQ
A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out our parser online.
This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, k-best parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Anna Rafferty, Spence Green, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.
The lexicalized probabilistic parser implements a factored product model, with separate PCFG phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference, using an A* algorithm. Or the software can be used simply as an accurate unlexicalized stochastic context-free grammar parser. Either of these yields a good performance statistical parsing system. A GUI is provided for viewing the phrase structure tree output of the parser.
As well as providing an English parser, the parser can be and has been adapted to work with other languages. A Chinese parser based on the Chinese Treebank, a German parser based on the Negra corpus and Arabic parsers based on the Penn Arabic Treebank are also included. The parser has also been used for other languages, such as Italian, Bulgarian, and Portuguese.
The parser provides Stanford Dependencies output as well as phrase structure trees. Typed dependencies are otherwise known grammatical relations. This style of output is available only for English and Chinese. For more details, please refer to the Stanford Dependencies webpage.
The current version of the parser requires Java 6 (JDK1.6) or later. (You can also download an old version of the parser, version 1.4, which runs under JDK 1.4, or version 2.0 which runs under JDK 1.5, but those distributions are no longer supported.) The parser also requires a reasonable amount of memory (at least 100MB to run as a PCFG parser on sentences up to 40 words in length; typically around 500MB of memory to be able to parse similarly long typical-of-newswire sentences using the factored model).
The parser is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation, a Java parsing GUI, and a Java API. The parser code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gift funding.
The download is a 54 MB zipped file (mainly consisting of included grammar data files). If you unpack the zip file, you should have everything needed. Simple scripts are included to invoke the parser on a Unix or Windows system. For another system, you merely need to similarly configure the classpath.
As of version 3.4 in 2014, the parser includes the code necessary to run a shift reduce parser, a much faster constituent parser with competitive accuracy. Models for this parser are linked below.
The main technical ideas behind how these parsers work appear in these papers. Feel free to cite one or more of the following papers depending on what you are using. Since the parser is regularly updated, we appreciate it if papers with numerical results reflecting parser performance mention the version of the parser being used.
For the new Compositional Vector Grammar parser (starting at version 3.2):
Richard Socher, John Bauer, Christopher D. Manning and Andrew Y. Ng. 2013. Parsing With Compositional Vector Grammars. Proceedings of ACL 2013
For the PCFG parser:
Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
For the factored parser:
Dan Klein and Christopher D. Manning. 2003. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), Cambridge, MA: MIT Press, pp. 3-10.
For the (English) Stanford Dependencies representation:
Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.
For the German parser:
Anna Rafferty and Christopher D. Manning. 2008. Parsing Three German Treebanks: Lexicalized and Unlexicalized Baselines. In ACL Workshop on Parsing German.
For the Chinese Parser:
Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003, pp. 439-446.
For the Chinese Stanford Dependencies:
Pi-Chuan Chang, Huihsin Tseng, Dan Jurafsky, and Christopher D. Manning. 2009. Discriminative Reordering with Chinese Grammatical Relations Features. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation.
For the Arabic parser:
Spence Green and Christopher D. Manning. 2010. Better Arabic Parsing: Baselines, Evaluations, and Analysis. In COLING 2010.
For the French parser:
Spence Green, Marie-Catherine de Marneffe, John Bauer, and Christopher D. Manning. 2010. Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French.. In EMNLP 2011.
lexparserpackage documentation and
LexicalizedParserclass documentation. (Point your web browser at the
index.htmlfile in the included
javadocdirectory and navigate to those items.)
.NET / F# / C#
|Version 3.4.1||2014-08-27||Add Spanish models||shift reduce parser models|
|Version 3.4||2014-06-16||Shift-reduce parser, dependency improvements, French parser uses CC tagset||shift reduce parser models|
|Version 3.3.1||2014-01-04||English dependency "infmod" and "partmod" combined into "vmod", other minor dependency improvements|
|Version 3.3.0||2013-11-12||English dependency "attr" removed, other dependency improvements, imperative training data added|
|Version 3.2.0||2013-06-20||New CVG based English model with higher accuracy|
|Version 2.0.5||2013-04-05||Dependency improvements, -nthreads option, ctb7 model|
|Version 2.0.4||2012-11-12||Improved dependency code extraction efficiency, other dependency changes|
|Version 2.0.3||2012-07-09||Minor bug fixes|
|Version 2.0.2||2012-05-22||Some models now support training with extra tagged, non-tree data|
|Version 2.0.1||2012-03-09||Caseless English model included, bugfix for enforced tags|
|Version 1.6.9||2011-09-14||Improved recognition of imperatives, dependencies now explicitely include a root, parser knows osprey is a noun|
|Version 1.6.8||2011-06-19||New French model, improved foreign language models, bug fixes|
|Version 1.6.7||2011-05-18||Minor bug fixes.|
|Version 1.6.6||2011-04-20||Internal code and API changes (ArrayLists rather than Sentence; use of CoreLabel objects) to match tagger and CoreNLP.|
|Version 1.6.5||2010-11-30||Further improvements to English Stanford Dependencies and other minor changes|
|Version 1.6.4||2010-08-20||More minor bug fixes and improvements to English Stanford Dependencies and question parsing|
|Version 1.6.3||2010-07-09||Improvements to English Stanford Dependencies and question parsing, minor bug fixes|
|Version 1.6.2||2010-02-26||Improvements to Arabic parser models, and to English and Chinese Stanford Dependencies|
|Version 1.6.1||2008-10-26||Slightly improved Arabic and German parsing, and Stanford Dependencies|
|Version 1.6||2007-08-19||Added Arabic, k-best PCCFG parsing; improved English grammatical relations|
|Version 1.5.1||2006-06-11||Improved English and Chinese grammatical relations; fixed UTF-8 handling|
|Version 1.5||2005-07-21||Added grammatical relations output; fixed bugs introduced in 1.4|
|Version 1.4||2004-03-24||Made PCFG faster again (by FSA minimization); added German support|
|Version 1.3||2003-09-06||Made parser over twice as fast; added tokenization options|
|Version 1.2||2003-07-20||Halved PCFG memory usage; added support for Chinese|
|Version 1.1||2003-03-25||Improved parsing speed; included GUI, improved PCFG grammar|
|Version 1.0||2002-12-05||Initial release|
The parser can read various forms of plain text input and can output various analysis formats, including part-of-speech tagged text, phrase structure trees, and a grammatical relations (typed dependency) format. For example, consider the text:
The strongest rain ever recorded in India shut down the financial hub of Mumbai, snapped communication lines, closed airports and forced thousands of people to sleep in their offices or walk home during the night, officials said today.
The following output shows part-of-speech tagged text, then a context-free phrase structure grammar representation, and finally a typed dependency representation. All of these are different views of the output of the parser.
The/DT strongest/JJS rain/NN ever/RB recorded/VBN in/IN India/NNP shut/VBD down/RP the/DT financial/JJ hub/NN of/IN Mumbai/NNP ,/, snapped/VBD communication/NN lines/NNS ,/, closed/VBD airports/NNS and/CC forced/VBD thousands/NNS of/IN people/NNS to/TO sleep/VB in/IN their/PRP$ offices/NNS or/CC walk/VB home/NN during/IN the/DT night/NN ,/, officials/NNS said/VBD today/NN ./. (ROOT (S (S (NP (NP (DT The) (JJS strongest) (NN rain)) (VP (ADVP (RB ever)) (VBN recorded) (PP (IN in) (NP (NNP India))))) (VP (VP (VBD shut) (PRT (RP down)) (NP (NP (DT the) (JJ financial) (NN hub)) (PP (IN of) (NP (NNP Mumbai))))) (, ,) (VP (VBD snapped) (NP (NN communication) (NNS lines))) (, ,) (VP (VBD closed) (NP (NNS airports))) (CC and) (VP (VBD forced) (NP (NP (NNS thousands)) (PP (IN of) (NP (NNS people)))) (S (VP (TO to) (VP (VP (VB sleep) (PP (IN in) (NP (PRP$ their) (NNS offices)))) (CC or) (VP (VB walk) (NP (NN home)) (PP (IN during) (NP (DT the) (NN night)))))))))) (, ,) (NP (NNS officials)) (VP (VBD said) (NP-TMP (NN today))) (. .))) det(rain-3, The-1) amod(rain-3, strongest-2) nsubj(shut-8, rain-3) nsubj(snapped-16, rain-3) nsubj(closed-20, rain-3) nsubj(forced-23, rain-3) advmod(recorded-5, ever-4) partmod(rain-3, recorded-5) prep_in(recorded-5, India-7) ccomp(said-40, shut-8) prt(shut-8, down-9) det(hub-12, the-10) amod(hub-12, financial-11) dobj(shut-8, hub-12) prep_of(hub-12, Mumbai-14) conj_and(shut-8, snapped-16) ccomp(said-40, snapped-16) nn(lines-18, communication-17) dobj(snapped-16, lines-18) conj_and(shut-8, closed-20) ccomp(said-40, closed-20) dobj(closed-20, airports-21) conj_and(shut-8, forced-23) ccomp(said-40, forced-23) dobj(forced-23, thousands-24) prep_of(thousands-24, people-26) aux(sleep-28, to-27) xcomp(forced-23, sleep-28) poss(offices-31, their-30) prep_in(sleep-28, offices-31) xcomp(forced-23, walk-33) conj_or(sleep-28, walk-33) dobj(walk-33, home-34) det(night-37, the-36) prep_during(walk-33, night-37) nsubj(said-40, officials-39) root(ROOT-0, said-40) tmod(said-40, today-41)
This output was generated with the command:
java -mx200m edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "wordsAndTags,penn,typedDependencies" englishPCFG.ser.gz mumbai.txt
Site design by Bill MacCartney