public final class TreebankPreprocessor
extends java.lang.Object
A simple framework for preparing various kinds of treebank data. The original goal was to prepare the Penn Arabic Treebank (PATB) trees for parsing. This pipeline arose from the need to prepare various data sets in a uniform manner for the execution of experiments that require multiple tools. The design objectives are:
These objectives are realized through three features:
- ConfigParser -- reads the plain text configuration file and creates configuration parameter objects for each data set
- Dataset interface -- generic interface for loading, processing, and writing datasets
- Mapper interface -- generic interface for applying transformations to strings (usually words and POS tags)
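
As a rough illustration of how the Dataset and Mapper roles fit together, here is a minimal sketch. The names `StringMapper`, `SimpleDataset`, `InMemoryDataset`, and `run` are hypothetical stand-ins invented for this example; they are not the actual interfaces, whose method signatures are not listed on this page. The example mapper strips a function suffix such as `-SBJ` from a tag, which is assumed here to be one typical string transformation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DatasetSketch {

    /** Hypothetical stand-in for a Mapper: transforms one string (a word or POS tag). */
    interface StringMapper {
        String map(String element);
    }

    /** Hypothetical stand-in for a Dataset: load, process, then write. */
    interface SimpleDataset {
        void load(List<String> rawTokens);
        void process(StringMapper mapper);
        List<String> write();
    }

    /** Toy data set that keeps its tokens in memory instead of reading files. */
    static class InMemoryDataset implements SimpleDataset {
        private final List<String> tokens = new ArrayList<>();

        @Override
        public void load(List<String> rawTokens) {
            tokens.clear();
            tokens.addAll(rawTokens);
        }

        @Override
        public void process(StringMapper mapper) {
            // Apply the mapper to every token in place.
            for (int i = 0; i < tokens.size(); i++) {
                tokens.set(i, mapper.map(tokens.get(i)));
            }
        }

        @Override
        public List<String> write() {
            // Return a copy so callers cannot mutate internal state.
            return new ArrayList<>(tokens);
        }
    }

    /** Runs the load -> process -> write cycle with an example mapper. */
    public static List<String> run(List<String> rawTokens) {
        // Assumed transformation: drop everything from the first '-' onward,
        // unless the dash is the first character (as in "-NONE-").
        StringMapper stripFunctionTag = s -> {
            int dash = s.indexOf('-');
            return dash > 0 ? s.substring(0, dash) : s;
        };
        SimpleDataset dataset = new InMemoryDataset();
        dataset.load(rawTokens);
        dataset.process(stripFunctionTag);
        return dataset.write();
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("NP-SBJ", "VP", "-NONE-")));
    }
}
```

Keeping the transformation behind a one-method interface means that a new tag or word normalization can be swapped in without touching the data set code, which is the separation the feature list describes.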
The process for preparing arbitrary data set X is as follows:
1. Update ConfigParser as necessary
2. Implement the Dataset interface for the new data set (or use one of the existing classes)
3. Write Mapper classes as needed
4. Run TreebankPreprocessor using the plain text file as the argument
Modifier and Type | Field and Description |
---|---|
static java.util.Map<java.lang.String,java.lang.Integer> | optionArgDefs |

Modifier and Type | Method and Description |
---|---|
static void | main(java.lang.String[] args) Execute with no arguments for usage. |
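
The type of `optionArgDefs` (a map from strings to integers) suggests a table from option flag to the number of arguments that flag consumes, though this page does not confirm that reading. The sketch below shows that common pattern under this assumption. The class `OptionSketch`, the flags `-v` and `-o`, and the usage string are all hypothetical and do not reflect the real command line of `TreebankPreprocessor`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OptionSketch {

    // Hypothetical flags; the real contents of optionArgDefs are not shown on this page.
    static final Map<String, Integer> OPTION_ARG_DEFS = new HashMap<>();
    static {
        OPTION_ARG_DEFS.put("-v", 0); // assumed: verbose flag, takes no argument
        OPTION_ARG_DEFS.put("-o", 1); // assumed: output directory, takes one argument
    }

    /** Returns flag -> values; positional arguments are collected under the key "". */
    public static Map<String, List<String>> parse(String[] args) {
        Map<String, List<String>> parsed = new HashMap<>();
        parsed.put("", new ArrayList<>());
        int i = 0;
        while (i < args.length) {
            String arg = args[i++];
            Integer arity = OPTION_ARG_DEFS.get(arg);
            if (arity == null) {
                // Not a known flag: treat it as a positional argument.
                parsed.get("").add(arg);
                continue;
            }
            // Consume as many following tokens as the flag's arity allows.
            List<String> values = new ArrayList<>();
            for (int k = 0; k < arity && i < args.length; k++) {
                values.add(args[i++]);
            }
            parsed.put(arg, values);
        }
        return parsed;
    }

    public static void main(String[] args) {
        // Mirrors the documented behavior of printing usage when run with no arguments.
        if (args.length == 0) {
            System.err.println("Usage: java OptionSketch [-v] [-o dir] config_file");
            return;
        }
        System.out.println(parse(args));
    }
}
```

With this layout, the configuration file path would arrive as the single positional argument, and any flags would be resolved by looking up their arity in the map.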