Please send any other questions or feedback, or extensions and bugfixes to
parser-support@lists.stanford.edu.
On Unix, try using GNU tar, if you're not already. (If you're using
Linux, you're almost certainly using GNU tar.) For some reason we don't
understand, it doesn't seem to unpack with classic Unix tar. Make sure you
specify the -z option if you are not gunzipping it in
advance: tar -xzf filename.
On Windows, it unpacks fine with most common tools, such as WinZip or 7-Zip. The latter is open source. (As of Sep 2007, WinRAR doesn't work: it apparently does not handle tar files correctly.)
On the Mac, just double-click it to unpack. The default unarchiver
(BOMArchiveHelper) works fine. To make it easier to run the parser from
the GUI by double-clicking, you should rename
lexparser-gui.csh to lexparser-gui.command.
If it won't unpack, you normally have either a corrupted download (try downloading it again) or there is some configuration error on your system, which we can't help with. The download should be 60,783,166 bytes with an MD5 checksum of 6a4929a2d4e93697ea9d688ec63e3d6a (for version 1.6).
There is considerable Javadoc documentation included in the
javadoc/
directory of the distribution. You should start by looking at the
javadoc for the parser.lexparser package and the LexicalizedParser class.
(The documentation appearing on the nlp.stanford.edu
website refers to code under development and is not necessarily consistent
with the released version of the parser.) If you're interested in the
theory and algorithms behind how the parser works, look at the research
papers listed.
For part-of-speech tags and phrasal categories, this depends on the language and treebank on which the parser was trained (and was decided by the treebank producers not us). The parser can be used for English, Chinese, Arabic, or German (among other languages). For part of speech and phrasal categories, here are relevant links:
Please read the documentation for each of these corpora to learn about their tagsets and phrasal categories. You can often also find additional documentation resources by doing web searches.
The typed dependency (grammatical relations) output available for English and Chinese was defined by us. For English, there is an introduction in the paper:
Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.
There is also a Stanford Dependencies manual (for English).
Further information (definitions and examples of nearly all the
grammatical relations) appears in the included Javadoc documentation.
Look at the EnglishGrammaticalRelations and
ChineseGrammaticalRelations classes. (To do this,
use your web browser's Open File command on the index.html file in the javadoc
folder of the parser distribution, and then click on the given class
names in the bottom-left scroll list.) At some point we may
produce better user-level documentation of these relations, but this is
what is available currently.
A corpus of English biomedical texts, with hand-corrected annotations in a slight variant of the Stanford typed dependency format, is available from the BioInfer project.
Yes, you can train a parser. You will need a collection of
syntactically annotated data such as the Penn Treebank
to train the parser. If the data is not in the same format as the currently
supported treebanks, you may need to write classes to read in the trees,
etc. Read the Javadocs for the
main method of the LexicalizedParser class, particularly the
-train option to find out about the command options for
training parsers. The supplied file
makeSerialized.csh shows exactly what options we used to
train the parsers that are included in the distribution. If you
want to train the parser on a new language and/or treebank format,
you can (and people have done so), but you need to spend a while learning about
the code, especially if you wish to develop language-specific features.
Start by trying to train a plain PCFG on the data, and then
look at the TreebankLangParserParams class for how to do
language-specific processing.
Use the -sentences option. If you want to give the
parser one sentence per line, include the option -sentences
newline in your invocation of LexicalizedParser.
From the commandline, if you give the option -tokenized, then the parser will
assume white-space separated tokens, and use your tokenization as is.
Of course, parsing will suffer unless your tokenization accurately
matches the tokenization of the underlying treebank, for instance Penn
Treebank tokenization. A common occurrence is that your text is already
correctly tokenized but does not escape characters the way the Penn
Treebank does (turning parentheses into -LRB- and
-RRB-, and putting a backslash in front of forward slashes
and asterisks - presumably a holdover from Lisp). In this case, you can
use the -tokenized option but also add the flag:
-escaper edu.stanford.nlp.process.PTBEscapingProcessor
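If you are preparing tokens in your own code instead, the escaping described above can be sketched as follows. This is a minimal illustration that covers only the cases named in the text (parentheses, forward slashes and asterisks); the class and method names are hypothetical, and it is not the implementation of PTBEscapingProcessor, which may handle further cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the parser distribution.
class PtbEscapeSketch {

    // Escape one already-tokenized word in the Penn Treebank style described above.
    static String escapeToken(String token) {
        if (token.equals("(")) return "-LRB-";
        if (token.equals(")")) return "-RRB-";
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            // Do not escape a slash or asterisk that already has a backslash in front.
            boolean alreadyEscaped = i > 0 && token.charAt(i - 1) == '\\';
            if ((c == '/' || c == '*') && !alreadyEscaped) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Escape every token of a sentence, keeping the tokenization unchanged.
    static List<String> escapeAll(List<String> tokens) {
        List<String> out = new ArrayList<String>();
        for (String t : tokens) {
            out.add(escapeToken(t));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("He", "paid", "(", "1/2", ")", "*");
        System.out.println(escapeAll(tokens));
    }
}
```

Only whole-token parentheses are rewritten here, on the assumption that the text has already been tokenized so that each parenthesis is its own token.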
If calling the parser within your own program, the main
parse methods
take a List of words which should already be correctly tokenized and
escaped before calling the parser. You don't
need to and cannot give the -tokenized option. If you have
untokenized text, it needs to be tokenized before parsing. You may
use the parse method that takes a String argument to have
this done for you, or you
may be able to use classes in the
process package, such as DocumentPreprocessor
and PTBTokenizer, for tokenization, much as the main method of the parser
does. Or you may want to use your own tokenizer.
Yes, you can. However, you will need to provide correctly tokenized input if you want to provide POS-annotated input. (That is, the input must be tokenized and normalized exactly as the material in the treebank underlying the grammar is.)
Read the Javadocs for the main method of the LexicalizedParser
class. The relevant options are -sentences (see above),
-tokenized, and -tagSeparator. If, for
example, you want to denote a POS tag by the suffix /POS
on a word, you would include the options -tokenized
-tagSeparator / in your invocation of
LexicalizedParser. You could then give the parser input such as
The/DT quick/JJ brown/JJ fox/NN jumped/VBD over/IN the/DT lazy/JJ dog/NN ./.
Partially-tagged input (only indicating the POS of some words) is also OK.
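To make the input convention concrete, here is a small sketch of how a line such as the one above could be split into words and optional tags. It assumes the tag follows the last occurrence of the separator, so that a token like 1/2/CD keeps its internal slash; that rule and all names below are assumptions for illustration, not the parser's own code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the parser distribution.
class TagSeparatorSketch {

    // Returns {word, tag}; tag is null when the token carries no tag.
    static String[] splitToken(String token, String sep) {
        int idx = token.lastIndexOf(sep);
        // No separator, empty word, or empty tag: treat the token as untagged.
        if (idx <= 0 || idx + sep.length() >= token.length()) {
            return new String[] { token, null };
        }
        return new String[] { token.substring(0, idx), token.substring(idx + sep.length()) };
    }

    // Split a whitespace-tokenized line into {word, tag} pairs.
    static List<String[]> splitLine(String line, String sep) {
        List<String[]> out = new ArrayList<String[]>();
        for (String tok : line.trim().split("\\s+")) {
            out.add(splitToken(tok, sep));
        }
        return out;
    }

    public static void main(String[] args) {
        // Partially tagged input: "fox" has no tag.
        for (String[] wt : splitLine("The/DT quick/JJ fox jumped/VBD ./.", "/")) {
            System.out.println(wt[0] + "\t" + (wt[1] == null ? "(untagged)" : wt[1]));
        }
    }
}
```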
If you wish to work with POS-tagged text programmatically, then
things are different. You pass to the parse method a
List (Sentence). If the items in this list implement
HasTag, such as being of type TaggedWord, then
the parser will use the tags that you provide. You can use the
DocumentPreprocessor class, as the main method
does, to produce these lists, or you could use
WhitespaceTokenizer followed by
WordToTaggedWordProcessor, or you can do this with
code that you write.
Not yet, but in the future, very possibly.
Yes, for the PCFG parser (only). With a PCFG parser, you can give
the option -printPCFGkBest n and it will print the
n highest-scoring parses for a sentence. They can be printed
either as phrase structure trees or as typed dependencies in the usual
way via the -outputFormat option, and each receives a score
(log probability). The k best parses are extracted efficiently
by using the algorithm of Huang and Chiang (2005).
This may be because the parser chose an incorrect structure for your sentence (it does make mistakes), or because the phrase structure annotation conventions used for training the parser don't match your expectations. To make sure you understand the annotation conventions, please read the bracketing guidelines for the parser model that you're using, which are referenced above. While our goal is to improve the parser when we can, we can't fix individual examples. The parser is just choosing the highest probability analysis according to its grammar.
This parser is in the space of modern statistical parsers whose goal is to give the most likely sentence analysis to a list of words. It does not attempt to determine grammaticality, though it will normally prefer a "grammatical" parse for a sentence if one exists. This is appropriate in many circumstances, such as when wanting to interpret user input, or dealing with conversational speech, web pages, non-native speakers, etc.
For other applications, such as grammar checking, this is less appropriate. One could attempt to assess grammaticality by looking at the probabilities that the parser returns for sentences, but it is difficult to normalize this number to give a useful "grammaticality" score, since the probability strongly depends on other factors like the length of the sentence, the rarity of the words in the sentence, and whether word dependencies in the sentence being tested were seen in the training data or not.
The parser uses considerable amounts of memory. If you see a
java.lang.OutOfMemoryError, you either need to give the
parser more memory or to take steps to reduce the memory needed. (You
give java more memory at the command line by using the -mx
flag, for example -mx500m.)
Memory usage by the parser depends on a number of factors:
-maxLength and to skip long
sentences.
Below are some statistics for 32-bit operation with the supplied englishPCFG and englishFactored grammars. We have parsed sentences as long as 234 words, but you need lots of RAM and patience.
| Length | PCFG | Factored |
|---|---|---|
| 20 | 50 MB | 250 MB |
| 50 | 125 MB | 600 MB |
| 100 | 350 MB | 2100 MB |
If you see the error:
Exception in thread "main" java.lang.UnsupportedClassVersionError:
edu/stanford/nlp/parser/lexparser/LexicalizedParser (Unsupported
major.minor version 49.0)
This means that you don't have JDK 1.5 installed. You should upgrade at java.sun.com.
You can use the -outputFormat wordsAndTags option.
Note: if you want to tag a lot of text, it'd be much faster to use a
dedicated POS tagger (such as
ours or
someone else's),
since this option has the parser parse the sentences
and just not print the other information. There isn't a separate included
tagger; the parser does POS tagging as part of parsing.
Yes, you can. You can use the main method of
EnglishGrammaticalStructure (for English, or the
corresponding class for Chinese). You can give it options like
-treeFile to read in trees, and, say,
-collapsed to output
typedDependenciesCollapsed. For example, this command
(with appropriate paths) will convert a Penn Treebank file to uncollapsed
typed dependencies:
java -cp stanford-parser-2007-08-19/stanford-parser.jar
edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile
wsj/02/wsj_0201.mrg -basic
Also, here is a sample Java class that you can download that converts from an input file of trees to typed dependencies.
Fine print: There is one subtlety. The conversion code
generally expects Penn Treebank style trees which have been stripped of
functional tags and empty elements. This generally corresponds to the
output of the Stanford, Charniak or Collins/Bikel parsers. The
exception is that it gets value from the -TMP annotation on bare
temporal NPs in order to recognize them as having temporal function
(tmod). (It also allows a -ADV annotation on
NPs.) Without the temporal annotation, some simple temporals like
today will still be recognized, but a bare temporal like
last week in I left last week will be tagged as an object
(dobj). With the Stanford parser, you can get marking of
temporal NPs in the tree output by giving the option
-retainTmpSubcategories, either on the command line or by
passing it to the setOptionFlags(String[]) method of the parser.
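If your trees still carry functional tags, a preprocessing pass along the following lines can bring them closer to what the converter expects. This is a rough sketch under stated assumptions: it works on bracketed tree strings, keeps -TMP and -ADV only on NP labels (following the fine print above), drops all other functional tags and indices, and does not remove empty elements. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the parser distribution.
class FunctionTagSketch {

    // A node label is the non-space, non-parenthesis text right after "(".
    private static final Pattern LABEL = Pattern.compile("\\(([^\\s()]+)");

    // Strip functional tags and indices from one label, e.g. NP-SBJ-1 becomes NP.
    static String stripLabel(String label) {
        // Leave special labels such as -LRB-, -RRB- and -NONE- untouched.
        if (label.isEmpty() || label.startsWith("-")) {
            return label;
        }
        String[] parts = label.split("[-=]");
        String base = parts[0];
        if (base.equals("NP")) {
            for (int i = 1; i < parts.length; i++) {
                // Keep the two annotations the conversion code can make use of.
                if (parts[i].equals("TMP") || parts[i].equals("ADV")) {
                    return base + "-" + parts[i];
                }
            }
        }
        return base;
    }

    // Apply stripLabel to every node label in a bracketed tree string.
    static String stripTree(String tree) {
        Matcher m = LABEL.matcher(tree);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement("(" + stripLabel(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String tree = "(S (NP-SBJ (PRP I)) (VP (VBD left) (NP-TMP (JJ last) (NN week))))";
        System.out.println(stripTree(tree));
    }
}
```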
This is an element of the dependency analysis we adopted. It's not uncontroversial, and it could have been done differently, but we'll try to explain briefly why we did things the way we did. The general philosophy of the grammatical relations design is that main predicates should be heads and auxiliaries should not. So, for the sentence Jill is singing, you will see nsubj(singing, Jill). We feel that this is more useful for most semantic interpretation applications, because it directly connects the main predicate with its arguments, while the auxiliary is rendered as modifying the verb (aux(singing, is)). Most people seem to agree.
What then when the main predicate is an adjective or a noun? That is, sentences like Jill is busy or Jill is a teacher. We continue to regard the adjective or noun as the predicate of which the subject is the argument, rather than changing and now regarding the copular verb is as the head and busy/teacher as a complement. That is, we produce nsubj(busy, Jill) and nsubj(teacher, Jill). This frequently seems to confuse people, because the main predicate of the clause is now not a verb. But we believe that this is the best thing to do for several reasons:
Yes, you can. Various tokenizers are included. The one used for English is called PTBTokenizer. It is a hand-written rule-based (FSM) tokenizer, but is quite accurate over newswire-style text. Because it is rule-based it is quite fast (about 100,000 tokens per second on an Intel box in 2007). You can use it as follows:
java edu.stanford.nlp.process.PTBTokenizer inputFile > outputFile
There are several options, including one for batch-processing lots of
files; see the Javadoc documentation of the main method of PTBTokenizer.
There's not much in the way of secret sauce (partly by the design of
the parsers as guaranteed to find model optimal solutions).
If you're not using englishPCFG.ser.gz for English, then
you should be - it's much faster than the Factored parser. If you can
exclude extremely long sentences (especially ones over 60 words or so),
then that helps since they take disproportionately long times to parse.
If POS-tagging sentences prior to parsing is an option, that speeds
things up (fewer possibilities to search).
The parser doesn't support multithreading (don't expect correct results if you try it!). The main tool remaining is to run multiple parsers in separate JVMs in parallel. This can be on multiple machines, but you can usefully run multiple parsing processes on one machine if you have dual CPU/dual core machines and enough memory. We've parsed at a rate of about 1,000,000 sentences a day by distributing the work over 6 dual processor machines.
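One simple way to prepare work for several parser processes is to split the input sentences into as many shards as you have JVMs. The sketch below deals sentences out round-robin so that long and short sentences are spread fairly evenly; it only does the splitting in memory, and the class name, the method name and the idea of writing each shard to its own one-sentence-per-line file are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the parser distribution.
class ShardSketch {

    // Deal sentences out round-robin into n shards, one shard per parser JVM.
    static List<List<String>> shard(List<String> sentences, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one shard");
        }
        List<List<String>> shards = new ArrayList<List<String>>();
        for (int i = 0; i < n; i++) {
            shards.add(new ArrayList<String>());
        }
        for (int i = 0; i < sentences.size(); i++) {
            shards.get(i % n).add(sentences.get(i));
        }
        return shards;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("s1", "s2", "s3", "s4", "s5");
        List<List<String>> shards = shard(sentences, 2);
        for (int i = 0; i < shards.size(); i++) {
            // In practice each shard would be written to its own file and
            // handed to a separate parser process.
            System.out.println("shard " + i + ": " + shards.get(i));
        }
    }
}
```

Each shard could then be written out one sentence per line and parsed with the -sentences newline option in its own process.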
Sure!! These instructions concentrate on parsing from the command line, since you need to use that to be able to set most options. But you can also use the parser on Chinese from within the GUI.
The parser is supplied with 5 Chinese grammars (and, with access to suitable training data, you could train other versions). All of these are trained on data from the Penn Chinese Treebank, and you should consult their site for details of the syntactic representation of Chinese which they use. They are:
| | PCFG | Factored | Factored, segmenting |
|---|---|---|---|
| Xinhua (mainland, newswire) | xinhuaPCFG.ser.gz | xinhuaFactored.ser.gz | xinhuaFactoredSegmenting.ser.gz |
| Mixed Chinese | chinesePCFG.ser.gz | chineseFactored.ser.gz | |
The PCFG parsers are smaller and faster. But the Factored parser is
significantly better for Chinese, and we would generally recommend its
use. The xinhua grammars are trained solely on Xinhua
newspaper text from mainland China. We would recommend their use for
parsing material from mainland China. The chinese
grammars also include some training material from Hong Kong SAR and
Taiwan. We'd recommend their use if parsing material from these areas
or a mixture of text types.
Four of the parsers assume input that has already been word
segmented, while the fifth does word segmentation internal to the
parser. This is discussed further below. The parser also comes with 3
Chinese example sentences, in files whose names all begin with
chinese.
Character encoding: The first thing to get straight is the
character encoding of the text you wish to parse. By default, our
Chinese parser uses GB18030 (the native character encoding of the Penn
Chinese Treebank and the national encoding of China) for input and output.
However, it is
very easy to parse text in another character encoding: you simply give
the flag -encoding encoding to the parser, where
encoding is a
character set encoding name recognized within Java, such as:
UTF-8, Big5-HKSCS, or GB18030.
This changes the input and output encoding. If you want to display the
output in a command window, you separately also need
to work out what character set your computer supports for display. If
that is different to the encoding of the file, you will need to convert
the encoding for display. If any of this encoding stuff is wrong, then
you are likely to see gibberish.
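If you need to convert between encodings in your own code rather than with an external tool, the standard Java charset classes are enough. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and it assumes your Java runtime provides the GB18030 charset (standard JDK builds normally do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the parser distribution.
class TranscodeSketch {

    // Decode bytes in one encoding and re-encode them in another.
    static byte[] transcode(byte[] input, String from, String to) {
        String text = new String(input, Charset.forName(from));
        return text.getBytes(Charset.forName(to));
    }

    public static void main(String[] args) {
        // Two Chinese words written with Unicode escapes so that the source
        // file itself stays plain ASCII.
        String sentence = "\u4ED6 \u5728 \u5B66\u6821";
        byte[] gb = sentence.getBytes(Charset.forName("GB18030"));
        byte[] utf8 = transcode(gb, "GB18030", "UTF-8");
        System.out.println(gb.length + " bytes in GB18030, " + utf8.length + " bytes in UTF-8");
        // The round trip should give back the original text.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(sentence));
    }
}
```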
Here are example commands for parsing two of
the test files, one in UTF-8 and one in GB18030. The (Linux) computer
that this is being run on is set up to work with UTF-8 (and this webpage
is also in UTF-8), so for the case of GB18030, the output is piped
through the Unix iconv utility for display.
$ java -server -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 /u/nlp/data/lexparser/chineseFactored.ser.gz chinese-onesent-utf8.txt
Loading parser from serialized file /u/nlp/data/lexparser/chineseFactored.ser.gz ... done [20.7 sec].
Parsing file: chinese-onesent-utf8.txt with 2 sentences.
Parsing [sent. 1 len. 8]: 俄国 希望 伊朗 没有 制造 核武器 计划 。
(ROOT
(IP
(NP (NR 俄国))
(VP (VV 希望)
(IP
(NP (NR 伊朗))
(VP (VE 没有)
(NP (NN 制造) (NN 核武器) (NN 计划)))))
(PU 。)))
Parsing [sent. 2 len. 6]: 他 在 学校 里 学习 。
(ROOT
(IP
(NP (PN 他))
(VP
(PP (P 在)
(LCP
(NP (NN 学校))
(LC 里)))
(VP (VV 学习)))
(PU 。)))
Parsed file: chinese-onesent-utf8.txt [2 sentences].
Parsed 14 words in 2 sentences (6.55 wds/sec; 0.94 sents/sec).
$ java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser chineseFactored.ser.gz chinese-onesent |& iconv -f gb18030 -t utf-8
Loading parser from serialized file chineseFactored.ser.gz ... done [13.3 sec].
Parsing file: chinese-onesent with 1 sentences.
Parsing [sent. 1 len. 10]: 他 和 我 在 学校 里 常 打 桌球 。
(ROOT
(IP
(NP (PN 他)
(CC 和)
(PN 我))
(VP
(PP (P 在)
(LCP
(NP (NN 学校))
(LC 里)))
(ADVP (AD 常))
(VP (VV 打)
(NP (NN 桌球))))
(PU 。)))
Parsed file: chinese-onesent [1 sentences].
Parsed 10 words in 1 sentences (10.78 wds/sec; 1.08 sents/sec).
Normalization: As well as the character set, there are also issues of
"normalization" for characters: for instance, basic Latin letters can
appear in either their "regular ASCII" forms or as "full width" forms,
equivalent in size to Chinese characters. Character normalization is
something we may revisit in the future, but at present, the parser was
trained on text which mainly has fullwidth Latin letters and punctuation
and does no normalization, and so you will get far better results if you
also represent such characters as fullwidth letters. The parser
does provide an escaper that will do this mapping for you on input. You
can invoke it with the -escaper flag, by using a command
like the following (which also shows output being sent to a file):
$ java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -escaper edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper -sentences newline chineseFactored.ser.gz chinese-onesent > chinese-onesent.stp
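To illustrate what mapping to fullwidth forms means, here is a minimal sketch that shifts printable ASCII characters to their Unicode fullwidth counterparts. It is not the code of ChineseEscaper, whose exact mapping is not described here; the class name is hypothetical, and leaving spaces untouched (so that token boundaries survive) is an assumption.

```java
// Hypothetical helper, not part of the parser distribution.
class FullwidthSketch {

    // Fullwidth forms sit at a fixed offset from printable ASCII ('!' to '~').
    private static final int OFFSET = 0xFEE0;

    static String toFullwidth(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '!' && c <= '~') {
                sb.append((char) (c + OFFSET));
            } else {
                // Spaces and non-ASCII characters (e.g. Chinese) are left as they are.
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String out = toFullwidth("IBM 2007 .");
        // Print code points so the result is visible on any terminal.
        for (int i = 0; i < out.length(); i++) {
            System.out.print(Integer.toHexString(out.charAt(i)).toUpperCase() + " ");
        }
        System.out.println();
    }
}
```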
Word segmentation: Chinese is not normally written with spaces
between words. But the examples shown above were all parsing
text that had already been segmented into words according to the
conventions of the Penn Chinese Treebank. For best results, we
recommend that you first segment input text with a high quality word
segmentation system which provides word segmentation according to Penn
Chinese Treebank conventions (note that there are many different
conventions for
Chinese word segmentation...). You can find out much more information
about CTB word segmentation from the
First,
Second, or
Third
International Chinese Word Segmentation Bakeoff.
In particular, you can now download a version of our CRF-based word
segmenter (similar to
the system we used in the Second Sighan Bakeoff)
from our software page.
However, for
convenience, we also provide an ability for the parser to do word
segmentation. Essentially, it misuses the parser as a first-order HMM
Chinese word segmentation system. This gives a reasonable, but not
excellent, Chinese word segmentation system. (Its performance
isn't as good as the Stanford CRF word segmenter mentioned above.)
To use it,
you use the -segmentMarkov option or a grammar trained with
this option. For example:
$ iconv -f gb18030 -t utf8 < chinese-onesent-unseg.txt
他在学校学习。
$ java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser xinhuaFactoredSegmenting.ser.gz chinese-onesent-unseg.txt |& iconv -f gb18030 -t utf-8
Loading parser from serialized file xinhuaFactoredSegmenting.ser.gz ... done [6.8 sec].
Parsing file: chinese-onesent-unseg.txt with 1 sentences.
Parsing [sent. 1 len. 5]: 他 在 学校 学习 。
Trying recovery parse...
Sentence couldn't be parsed by grammar.... falling back to PCFG parse.
(ROOT
(IP
(NP (PN 他))
(VP
(PP (P 在)
(NP (NN 学校)))
(VP (VV 学习)))
(PU 。)))
Parsed file: chinese-onesent-unseg.txt [1 sentences].
Parsed 5 words in 1 sentences (6.08 wds/sec; 1.22 sents/sec).
1 sentences were parsed by fallback to PCFG.
Grammatical relations: The Chinese parser also supports grammatical relations (typed dependencies) output. For instance:
$ java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependencies xinhuaFactored.ser.gz chinese-onesent |& iconv -f gb18030 -t utf-8
Loading parser from serialized file xinhuaFactored.ser.gz ... done [4.9 sec].
Parsing file: chinese-onesent with 1 sentences.
Parsing [sent. 1 len. 10]: 他 和 我 在 学校 里 常 打 桌球 。
conj(我-3, 他-1)
cc(我-3, 和-2)
nsubj(打-8, 我-3)
prep(打-8, 在-4)
lobj(里-6, 学校-5)
plmod(在-4, 里-6)
advmod(打-8, 常-7)
dobj(打-8, 桌球-9)
Parsed file: chinese-onesent [1 sentences].
Parsed 10 words in 1 sentences (7.10 wds/sec; 0.71 sents/sec).
Sure! See the Stanford Arabic Parser IAQ.
There are many kinds of 'vanilla', but, providing your treebank is in
Penn Treebank format, then, yes, this is easy to do. You can train and
test the parser as follows, assuming that your training trees
are in train.txt and your test trees are in
test.txt:
java -mx1g edu.stanford.nlp.parser.lexparser.LexicalizedParser
-PCFG -vMarkov 1 -uwm 0 -headFinder
edu.stanford.nlp.trees.LeftHeadFinder -train train.txt
-test test.txt > output.txt
Going through the options, we ask for just the PCFG model
(-PCFG), for just conditioning context-free rules based on their
left-hand side (parent) (-vMarkov 1), whereas the default also
conditions on grandparents (-vMarkov 2), to use no
language-specific heuristics for unknown word processing
(-uwm 0),
and to always just choose the left-most category on a rule RHS as the
head (-headFinder edu.stanford.nlp.trees.LeftHeadFinder).
When using a plain PCFG (i.e., no markovization of
rules), the headFinder does not affect results, but unless
you use this head finder, you will see errors about the parser not
finding head categories (if your categories differ from those of the
Penn Treebank). This HeadFinder will give consistent left-branching
binarization.
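For readers unfamiliar with what a plain PCFG amounts to, the textbook estimate conditions each rule only on its left-hand side and uses relative frequencies: P(A -> beta) = count(A -> beta) / count(A). The sketch below computes that from a list of observed rules. It is a generic illustration with hypothetical names; whether the parser's own training applies exactly this estimate (for example, without any smoothing) is not stated here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the parser distribution.
class PcfgEstimateSketch {

    // Each observed rule is written as "LHS -> RHS", once per occurrence in the treebank.
    static Map<String, Double> estimate(List<String> ruleOccurrences) {
        Map<String, Integer> ruleCounts = new HashMap<String, Integer>();
        Map<String, Integer> lhsCounts = new HashMap<String, Integer>();
        for (String rule : ruleOccurrences) {
            String lhs = rule.substring(0, rule.indexOf(" -> "));
            ruleCounts.merge(rule, 1, Integer::sum);
            lhsCounts.merge(lhs, 1, Integer::sum);
        }
        // Relative frequency: rule count divided by the count of its left-hand side.
        Map<String, Double> probs = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : ruleCounts.entrySet()) {
            String lhs = e.getKey().substring(0, e.getKey().indexOf(" -> "));
            probs.put(e.getKey(), e.getValue() / (double) lhsCounts.get(lhs));
        }
        return probs;
    }

    public static void main(String[] args) {
        List<String> rules = Arrays.asList(
            "NP -> DT NN", "NP -> DT NN", "NP -> NN", "VP -> VBD NP");
        for (Map.Entry<String, Double> e : estimate(rules).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```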
At present, we don't have any documentation beyond what you get in the
download and what's on this page. If you would like to help by producing
better documentation, feel free to write to
parser-support@lists.stanford.edu.
Some parser command-line options are documented. See the
parser.lexparser package documentation, the
LexicalizedParser.main method documentation, the
TreePrint class, and the documentation of variables in the
Train, Test, and Options classes,
and appropriate language-particular
TreebankLangParserParams. For the rest, you need to look
at the source code. The public API is somewhat documented in the
LexicalizedParser class JavaDoc. See especially the sample
invocation in the parser.lexparser package documentation.
The included file makeSerialized.csh effectively documents
how the included grammars were made.
The included file ParserDemo.java gives an example of
how to call the parser programmatically, including getting
Tree and typedDependencies output. It is
reproduced below:
import java.util.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {
    public static void main(String[] args) {
        // Load the serialized English PCFG grammar and set parser options.
        LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

        // The input must already be tokenized (and escaped) as a list of words.
        String[] sent = { "This", "is", "an", "easy", "sentence", "." };
        Tree parse = (Tree) lp.apply(Arrays.asList(sent));
        parse.pennPrint();
        System.out.println();

        // Convert the phrase structure tree to collapsed typed dependencies.
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection tdl = gs.typedDependenciesCollapsed();
        System.out.println(tdl);
        System.out.println();

        // Alternatively, print both output formats via TreePrint.
        TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
        tp.printTree(parse);
    }
}
-outputFormat and -outputFormatOptions
options?
You can give the options -outputFormat typedDependencies or
-outputFormat typedDependenciesCollapsed to get typed
dependencies (or grammatical relations) output (for English and Chinese
only, currently).
You can print out lexicalized trees (head words and tags at each phrasal
node) with the -outputFormatOptions lexicalize option.
You can see all the other options by looking in the Javadoc of the
TreePrint class.
Yes, you use a filename of a single dash/minus character: -. E.g.,
java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
englishPCFG.ser.gz -
For interactive use, you may find it convenient to turn off the stderr output. For example, in bash you could use the command:
java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
englishPCFG.ser.gz - 2> /dev/null
null or an untyped
dependency parse?
This answer is specific to English. It mostly applies to other
languages although some components are missing in some languages.
The file englishPCFG.ser.gz comprises just an unlexicalized PCFG
grammar. It is basically the parser described in the ACL 2003 Accurate
Unlexicalized Parsing paper. The typed dependencies are produced in a
postprocessing step after parsing by matching patterns on CFG trees.
This process is described in the several papers on the topic by
Marie-Catherine de Marneffe. Confusingly, the current code to generate
Stanford Dependencies requires a phrase structure (CFG) parse.
It doesn't require or use a dependency parse. The file
englishFactored.ser.gz contains two grammars and leads the
system to run three parsers. It first runs a (simpler) PCFG
parser and then an untyped dependency parser, and then runs a third
parser which finds the parse with the best joint score across the two
other parsers via a product model. This is described in the NIPS
Fast Exact Inference paper. You can get Stanford Dependencies from the
output of this parser, since it generates a phrase structure parse.
At the API level, with the factored parser, if you ask for
getBestDependencyParse(), then you will get the best untyped dependency
parse. If you call that method with englishPCFG.ser.gz, it
will return null, as there is no dependency parse. For
either, you need to use the separate GrammaticalStructure classes to get
the typed Stanford Dependencies representation. In general, with
appropriate grammars loaded, you can parse with and ask for output of the PCFG,
(untyped) dependency, or factored parsers. For English, although the
grammars and parsing methods differ, the average
quality of englishPCFG.ser.gz and
englishFactored.ser.gz is similar, and so many people opt
for the faster englishPCFG.ser.gz, though
englishFactored.ser.gz sometimes does better because it
does include lexicalization. For other languages, the factored models
are considerably better than the PCFG models, and are what people
generally use. (Since these parsers were written, direct typed
dependency parsers have been increasingly explored; in the future, we
might produce a parser that directly parses to Stanford Dependencies,
but that isn't what we have now.)
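The idea of a product model can be illustrated in a few lines: multiplying the two models' probabilities is the same as adding their log scores, and the factored parser prefers the analysis whose combined score is best. The sketch below simply picks the best of a handful of candidates. It is only an illustration: the candidate list, scores and names are hypothetical, and the actual parser finds the best joint parse by search rather than by enumerating candidates like this.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of the parser distribution.
class ProductModelSketch {

    static class Candidate {
        final String tree;
        final double pcfgLogProb;
        final double depLogProb;

        Candidate(String tree, double pcfgLogProb, double depLogProb) {
            this.tree = tree;
            this.pcfgLogProb = pcfgLogProb;
            this.depLogProb = depLogProb;
        }

        // Product of probabilities = sum of log probabilities.
        double jointLogScore() {
            return pcfgLogProb + depLogProb;
        }
    }

    // Return the candidate with the highest joint score (null for an empty list).
    static Candidate best(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.jointLogScore() > best.jointLogScore()) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
            new Candidate("parse A", -10.0, -9.0),
            new Candidate("parse B", -9.0, -12.0),   // best PCFG score alone
            new Candidate("parse C", -11.0, -7.0));  // best joint score
        System.out.println(best(candidates).tree);
    }
}
```

Note that the candidate preferred by the PCFG alone (parse B in this made-up data) need not be the one with the best joint score.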
Site design by Bill MacCartney |