A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads. The figure below shows a dependency parse of a short sentence. The arrow from the word *moving* to the word *faster* indicates that *faster* modifies *moving*, and the label *advmod* assigned to the arrow describes the exact nature of the dependency.
We have built a super-fast transition-based parser which produces typed dependency parses of natural language sentences. The parser is powered by a neural network which accepts word embedding inputs, as described in the paper:
Danqi Chen and Christopher Manning. 2014. A Fast and Accurate Dependency Parser Using Neural Networks. In Proceedings of EMNLP 2014.
This parser supports English (with Universal Dependencies, Stanford Dependencies and CoNLL Dependencies) and Chinese (with CoNLL Dependencies). Future versions of the software will support other languages.
For a quick introduction to the standard approach to transition-based dependency parsing, see Joakim Nivre's EACL 2014 tutorial.
This parser builds a parse by performing a linear-time scan over the words of a sentence. At every step it maintains a partial parse, a stack of words which are currently being processed, and a buffer of words yet to be processed.
The parser continues to apply transitions to its state until its buffer is empty and the dependency graph is completed.
The initial state is to have all of the words in order on the buffer, with a single dummy ROOT node on the stack. The following transitions can be applied:

- LEFT-ARC: marks the second item on the stack as a dependent of the first item, and removes the second item from the stack.
- RIGHT-ARC: marks the first item on the stack as a dependent of the second item, and removes the first item from the stack.
- SHIFT: removes a word from the buffer and pushes it onto the stack.
With just these three types of transitions, a parser can generate any projective dependency parse. Note that for a typed dependency parser, with each transition we must also specify the type of the relationship between the head and dependent being described.
The parser decides among transitions at each state using a neural network classifier. Distributed representations (dense, continuous vector representations) of the parser's current state are provided as inputs to this classifier, which then chooses among the possible transitions to make next. These representations describe various features of the current stack and buffer contents in the parser state.
The classifier which powers the parser is trained using an oracle. This oracle takes each sentence in the training data and produces many training examples indicating which transition should be taken at each state to reach the correct final parse. The neural network is trained on these examples using adaptive gradient descent (AdaGrad) with hidden unit dropout.
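The transition system described above can be summarized in a short sketch. The following Java code is illustrative only and is not the edu.stanford.nlp.parser.nndep implementation; the chooseTransition method stands in for the neural network classifier, and arc labeling is simplified.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the arc-standard transition loop described above.
// Illustrative only; chooseTransition() stands in for the neural-network classifier.
public class ArcStandardSketch {

  public List<String> parse(List<String> words) {
    Deque<Integer> stack = new ArrayDeque<>();   // indices of partially processed words
    Deque<Integer> buffer = new ArrayDeque<>();  // indices of words not yet processed
    List<String> arcs = new ArrayList<>();       // dependencies found so far

    stack.push(0);                               // dummy ROOT node
    for (int i = 1; i <= words.size(); i++) {
      buffer.add(i);
    }

    // Keep applying transitions until the buffer is empty and only ROOT remains.
    while (!buffer.isEmpty() || stack.size() > 1) {
      String t = chooseTransition(stack, buffer, arcs);
      if (t.startsWith("LEFT-ARC")) {            // second stack item depends on the top item
        int head = stack.pop();
        int dep = stack.pop();
        arcs.add(t + ": " + head + " -> " + dep);
        stack.push(head);
      } else if (t.startsWith("RIGHT-ARC")) {    // top stack item depends on the second item
        int dep = stack.pop();
        arcs.add(t + ": " + stack.peek() + " -> " + dep);
      } else {                                   // SHIFT: move the next buffer word to the stack
        stack.push(buffer.poll());
      }
    }
    return arcs;
  }

  private String chooseTransition(Deque<Integer> stack, Deque<Integer> buffer, List<String> arcs) {
    // In the real parser this is the neural-network classifier over dense
    // features of the current stack and buffer contents; left abstract here.
    throw new UnsupportedOperationException("classifier stub");
  }
}
```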
- edu/stanford/nlp/models/parser/nndep/english_UD.gz (default, English, Universal Dependencies)
- edu/stanford/nlp/models/parser/nndep/english_SD.gz (English, Stanford Dependencies)
- edu/stanford/nlp/models/parser/nndep/PTB_CoNLL_params.txt.gz (English, CoNLL Dependencies)
- edu/stanford/nlp/models/parser/nndep/CTB_CoNLL_params.txt.gz (Chinese, CoNLL Dependencies)
Note that these models were trained with an earlier Matlab version of the code, so results from training with the Java code may be slightly worse.
This parser is integrated into Stanford CoreNLP as a new annotator.
If you want to use the transition-based parser from the command line, invoke StanfordCoreNLP with the depparse annotator. This annotator has dependencies on the tokenize, ssplit, and pos annotators. An example invocation follows (assuming CoreNLP is on your classpath):
java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -file <INPUT_FILE>
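The same annotator can also be used from Java through the CoreNLP pipeline API. Here is a minimal sketch (the input sentence is arbitrary); the depparse annotator stores its output as a SemanticGraph on each sentence:

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class DepparsePipelineSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    Annotation document = new Annotation("The cat is moving faster than before.");
    pipeline.annotate(document);

    for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
      // Basic typed dependencies produced by the depparse annotator
      SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class);
      System.out.println(graph);
    }
  }
}
```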
It is also possible to access the parser directly in the Stanford Parser or Stanford CoreNLP packages. With direct access to the parser, you can train new models, evaluate models with test treebanks, or parse raw sentences. Note that this package currently still reads and writes CoNLL-X files, not CoNLL-U files.
The main program to use is the class edu.stanford.nlp.parser.nndep.DependencyParser. The Javadoc for this class's main method describes all possible options in detail. Some usage examples follow:
java edu.stanford.nlp.parser.nndep.DependencyParser -model modelOutputFile.txt.gz -textFile rawTextToParse -outFile dependenciesOutputFile.txt
java edu.stanford.nlp.parser.nndep.DependencyParser -model modelOutputFile.txt.gz -textFile - -outFile -
java edu.stanford.nlp.parser.nndep.DependencyParser -model edu/stanford/nlp/models/parser/nndep/english_UD.gz -testFile test.conllx -outFile test-output.conllx
The options for specifying files to the parser at training and test time are:
Option | Required for training | Required for testing / parsing | Description |
---|---|---|---|
‑devFile | Optional | No | Path to a development-set treebank in CoNLL-X format. If provided, the dev set performance is monitored during training. |
‑embedFile | Optional (recommended!) | No | A word embedding file, containing distributed representations of English words. Each line of the provided file should contain a single word followed by the elements of the corresponding word embedding (space-delimited). It is not absolutely necessary that all words in the treebank be covered by this embedding file, though the parser's performance will generally improve if you are able to provide better embeddings for more words. |
‑model | Yes | Yes | Path to a model file. If the path ends in .gz, the model will be read as a Gzipped model file. During training, we write to this path; at test time we read a pre-trained model from this path. |
‑textFile | No | Yes (or testFile) | Path to a plaintext file containing sentences to be parsed. |
‑testFile | No | Yes (or textFile) | Path to a test-set treebank in CoNLL-X format for final evaluation of the parser. |
‑trainFile | Yes | No | Path to a training treebank in CoNLL-X format. |
It's also possible to use this parser directly in your own Java code. There is a DependencyParserDemo example class in the package edu.stanford.nlp.parser.nndep.demo, included in the source of both the Stanford Parser and Stanford CoreNLP.
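The following sketch mirrors that demo class: load a POS tagger and the default parser model, then tokenize, tag, and parse raw text. The model paths are the standard ones shipped with the distribution; the example sentence is arbitrary.

```java
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.parser.nndep.DependencyParser;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import edu.stanford.nlp.trees.GrammaticalStructure;

public class ParseRawTextSketch {
  public static void main(String[] args) {
    String text = "The quick brown fox jumped over the lazy dog.";

    // Load a POS tagger and the default (English UD) parsing model.
    MaxentTagger tagger = new MaxentTagger(
        "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    DependencyParser parser = DependencyParser.loadFromModelFile(DependencyParser.DEFAULT_MODEL);

    // Tokenize and sentence-split the raw text, then tag and parse each sentence.
    DocumentPreprocessor tokenizer = new DocumentPreprocessor(new StringReader(text));
    for (List<HasWord> sentence : tokenizer) {
      List<TaggedWord> tagged = tagger.tagSentence(sentence);
      GrammaticalStructure gs = parser.predict(tagged);
      System.out.println(gs.typedDependencies());
    }
  }
}
```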
You can train a new dependency parser using your own data in the CoNLL-X data format. (Many dependency treebanks are provided in this format by default; even if not, conversion is often trivial.)
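For reference, a CoNLL-X file has one token per line with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line between sentences. An illustrative fragment (the sentence and tags here are made up):

```
1	The	_	DT	DT	_	2	det	_	_
2	dog	_	NN	NN	_	3	nsubj	_	_
3	barks	_	VBZ	VBZ	_	0	root	_	_
4	.	_	.	.	_	3	punct	_	_
```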
To train a new English model, you need the following pieces of data:

- A dependency treebank in CoNLL-X format, split into a training section and (optionally) a development section (see -trainFile and -devFile above).
- Optionally (but recommended), a word embedding file as described for the -embedFile option above.

This word embedding file is only used for training. The parser will build its own improved embeddings and save them as part of the learned model.
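As noted in the option table above, the embedding file is plain text with one word per line followed by the space-delimited components of its vector. An illustrative fragment with made-up 4-dimensional vectors (a real file should have embeddingSize dimensions, e.g. 50):

```
the 0.418 0.250 -0.412 0.122
cat -0.153 0.367 0.091 -0.588
runs 0.221 -0.104 0.336 0.475
```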
To start training with the data described above, run this command with the parser on your classpath:
java edu.stanford.nlp.parser.nndep.DependencyParser -trainFile <trainPath> -devFile <devPath> -embedFile <wordEmbeddingFile> -embeddingSize <wordEmbeddingDimensionality> -model nndep.model.txt.gz
On the Stanford NLP machines, training data is available in /u/nlp/data/depparser/nn/data:
java edu.stanford.nlp.parser.nndep.DependencyParser \
-trainFile /u/nlp/data/depparser/nn/data/dependency_treebanks/PTB_Stanford/train.conll \
-devFile /u/nlp/data/depparser/nn/data/dependency_treebanks/PTB_Stanford/dev.conll \
-embedFile /u/nlp/data/depparser/nn/data/embeddings/en-cw.txt -embeddingSize 50 \
-model nndep.model.txt.gz
To train the parser for languages other than English, you need the data as described in the previous section, along with a TreebankLanguagePack describing the particularities of your treebank and the language it contains. (The Stanford Parser package may already contain a TLP for your language of choice: check the package edu.stanford.nlp.trees.international.)
Note that at test time, a language-appropriate tagger will also be necessary.
For example, here is a command used to train a Chinese model. The only difference from the English case (apart from the fact that we changed datasets) is that we also provide a different TreebankLanguagePack class with the -tlp option.
java edu.stanford.nlp.parser.nndep.DependencyParser -tlp edu.stanford.nlp.trees.international.pennchinese.ChineseTreebankLanguagePack -trainFile chinese/train.conll -devFile chinese/dev.conll -embedFile chinese/embeddings.txt -embeddingSize 50 -model nndep.chinese.model.txt.gz
The only complicated part here is the TreebankLanguagePack, which is a Java class you need to provide. It's not hard to write, and it's only used for a couple of things: a default character encoding, a list of punctuation POS tags and sentence-final punctuation words, and a tokenizer (which you might also need to write). Some of these, like the tokenizer, are only needed for running the parser on raw text; you can train and test on CoNLL files without one. To get started, if your language uses the Latin alphabet, you can probably get away with using the default English TreebankLanguagePack, PennTreebankLanguagePack.
Option | Default | Description |
---|---|---|
-adaAlpha | 0.01 | Global learning rate for AdaGrad training. |
-adaEps | 1e-6 | Epsilon value added to the denominator of AdaGrad update expression for numerical stability. |
-batchSize | 10000 | Size of mini-batch used for training. |
-dropProb | 0.5 | Dropout probability. For each training example we randomly disable some of the hidden units in the neural network classifier; this parameter controls the proportion of units "dropped out." |
-embeddingSize | 50 | Dimensionality of word embeddings provided. |
-evalPerIter | 100 | Run full UAS (unlabeled attachment score) evaluation on the development set every time we finish this number of iterations. |
-hiddenSize | 200 | Dimensionality of hidden layer in neural network classifier. |
-initRange | 0.01 | Bounds of range within which weight matrix elements should be initialized. Each element is drawn from a uniform distribution over the range [-initRange, initRange]. |
-maxIter | 20000 | Number of training iterations to complete before stopping and saving the final model. |
-numPreComputed | 100000 | The parser pre-computes hidden-layer unit activations for particular input words at both training and testing time in order to speed up feedforward computation in the neural network. This parameter determines the number of words for which hidden-layer activations should be pre-computed. |
-regParameter | 1e-8 | Regularization parameter for training. |
-trainingThreads | 1 | Number of threads to use during training. Note that depending on the training batch size, it may be unwise to simply use the maximum number of threads your machine supports. On our 16-core test machines: a batch size of 10,000 runs fastest with around 6 threads; a batch size of 100,000 runs best with around 10 threads. |
-wordCutOff | 1 | The parser can optionally ignore rare words by mapping any word that appears fewer than n times in the corpus to a single "unknown" word representation. This n is controlled by the wordCutOff parameter. |
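For example, these training options can be combined with the file options from the earlier table. The particular values below are arbitrary illustrations rather than recommended settings:

java edu.stanford.nlp.parser.nndep.DependencyParser -trainFile <trainPath> -devFile <devPath> -embedFile <wordEmbeddingFile> -embeddingSize 50 -hiddenSize 300 -maxIter 10000 -trainingThreads 6 -model nndep.model.txt.gz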
Option | Default | Description |
---|---|---|
-escaper | N/A | Only applicable for testing with -textFile. If provided, use this word-escaper when parsing raw sentences. Should be a fully-qualified class name like edu.stanford.nlp.trees.international.arabic.ATBEscaper. |
-numPreComputed | 100000 | The parser pre-computes hidden-layer unit activations for particular input words at both training and testing time in order to speed up feedforward computation in the neural network. This parameter determines the number of words for which hidden-layer activations should be pre-computed. |
-sentenceDelimiter | N/A | Only applicable for testing with -textFile. If provided, assume that the given textFile has already been sentence-split, and that sentences are separated by this delimiter. |
-tagger.model | edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger | Only applicable for testing with -textFile. Path to a part-of-speech tagger to use to pre-tag the raw sentences before parsing. |
The table below describes this parser's performance on the Penn Treebank, converted to dependencies using Stanford Dependencies. The part-of-speech tags used as input for training and testing were generated by the Stanford POS Tagger (using the bidirectional5words model).
Data | Metric | Score |
---|---|---|
Development (1700 sentences) | UAS | 92.0 |
Development (1700 sentences) | LAS | 89.7 |
Test (2416 sentences) | UAS | 91.7 |
Test (2416 sentences) | LAS | 89.5 |