QuestionBank corpus improvements done at Stanford

About

A traditional problem with newswire-trained statistical parsers and part-of-speech taggers is that they are not very good at parsing things like questions and imperatives, which are rare in newswire. This problem was partially addressed in 2006, when John Judge annotated 4000 questions:

John Judge, Aoife Cahill and Josef van Genabith. 2006. QuestionBank: Creating a Corpus of Parse Annotated Questions. Proceedings of COLING/ACL 2006, Sydney, Australia.

which he released here:

Original QuestionBank

However, the quality of the original annotations is only moderate. In 2011, Chris Manning spent some time improving the parses in QuestionBank, and the results of that work appear here. I'm releasing the corrections as a script that maps the original trees to the corrected versions.

However, I'm sure there are still more errors to be fixed in these trees. Feel free to pass along any that you find!

To use the script, you need: Perl (v5), bash shell, Java (1.5+), and Stanford Tregex/Tsurgeon (http://nlp.stanford.edu/software/tregex.shtml). You then need to edit autofix.sh to provide flags to Java so that it can find Tregex/Tsurgeon on the classpath (unless it is already there). That is, set JAVAFLAGS to:

JAVAFLAGS="-cp /path/to/my/stanford-tregex.jar"
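
If you are unsure whether the jar is being picked up, one hypothetical sanity check (not part of the release) is to run the Tsurgeon main class directly and confirm that Java can load it; a usage or missing-argument message is fine, whereas a ClassNotFoundException means the classpath is wrong:

# hypothetical check; adjust the jar path to match your JAVAFLAGS setting
java -cp /path/to/my/stanford-tregex.jar edu.stanford.nlp.trees.tregex.tsurgeon.Tsurgeon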

You also need the original version of QuestionBank from the site linked above. You should then be able to run the code as a filter, like this:

./autofix.sh < 4000qs-originalVersion.txt > 4000qs-fixedVersion-1.0.txt

Note one fine point: simply testing a parser against the improved treebank (or even training on some of the material and testing on the rest) will not necessarily give you higher parse accuracy numbers. This is because the original release tended to preserve errors made by statistical parsers trained on WSJ newswire rather than correcting them, and because WSJ material dominates parser training data, such parsers make systematic errors, for example dispreferring WH- categories even when they are correct. Hence, such a result does not affect the claim that the trees in this release are much closer to correct under Penn Treebank 3 annotation standards.

The scripts in this download are in the public domain. That is not necessarily true of the source data or of the tools that the scripts use (Tregex is GPL). Use at your own risk. But the data should give you more correct question parses, according to the Penn Treebank version 3 conventions. (That is, this is "old Treebank" annotation, not the "new Treebank" annotation used in recent projects like OntoNotes, which adds categories such as NML and HYPH.)

There's also an option in the scripts to reduce the 4000 questions to 3924 by excluding the ones that appear in the test set used by Laura Rimell et al. in:

Laura Rimell, Stephen Clark and Mark Steedman. 2009. Unbounded Dependency Recovery for Parser Evaluation. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-09), pp. 813-821, Singapore.

Look at the $what variable, an integer interpreted as a set of flags detailed at the top of autofix.pl. It can be passed in as the sole argument to autofix.sh, as in the sketch below.
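
For example, an invocation with the flag value might look like the following (the value 3 and the output file name are only placeholders of mine; check the flag definitions at the top of autofix.pl for the bit that actually enables the Rimell et al. filtering):

# hypothetical example; '3' is a placeholder flag value, see autofix.pl
./autofix.sh 3 < 4000qs-originalVersion.txt > 3924qs-fixedVersion.txt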

Train/dev/test split

The first 2000 questions in QuestionBank are TREC questions, while the second 2000 come from Dan Roth's group, mainly from ISI and answers.com, so it seems best to sample training and test data from both halves. I'd suggest these splits (a sketch for extracting them follows the list):

Train: 1-1000, 2001-3000
Dev: 1001-1500, 3001-3500
Test: 1501-2000, 3501-4000
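
As a minimal sketch, assuming the fixed file has one parse tree per line (if your copy has trees spanning multiple lines, you would need to split on whole trees instead), the suggested ranges can be pulled out with sed; the output file names here are just placeholders:

# hypothetical split commands, assuming one tree per line
IN=4000qs-fixedVersion-1.0.txt
sed -n '1,1000p;2001,3000p' "$IN" > qb-train.txt
sed -n '1001,1500p;3001,3500p' "$IN" > qb-dev.txt
sed -n '1501,2000p;3501,4000p' "$IN" > qb-test.txt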

Downloads

QuestionBank-Stanford-1.0.zip (2011-07-01)
QuestionBank-Stanford-1.0.1.zip (2013-11-15; makes about 50 further fixes, nearly all to POS tags)