TrainOptions (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.parser.lexparser.TrainOptions

All Implemented Interfaces:

java.io.Serializable

Direct Known Subclasses:

ShiftReduceTrainOptions
```
public class TrainOptions
extends java.lang.Object
implements java.io.Serializable
```
Non-language-specific options for training a grammar from a treebank. These options are not used at parsing time.

Author:

Dan Klein, Christopher Manning

See Also:

Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`TrainOptions.TransformMatrixType`

Field Summary

Fields
Modifier and Type	Field and Description
`boolean`	`basicCategoryTagsInDependencyGrammar` Whether to use the basic or split tags in the dependency grammar
`int`	`batchSize`
`boolean`	`cheatPCFG` Add all test set trees to training data for PCFG.
`boolean`	`collinsPunc` Promote/delete punctuation like Collins.
`int`	`compactGrammar` How to compact grammars as FSMs.
`int`	`debugOutputFrequency` If larger than 0, the parser may choose to output debug information every X seconds, X iterations, or some other similar metric
`static int`	`DEFAULT_BATCH_SIZE` When training using batches of trees, such as in the DVParser, how many trees to use in one batch
`static double`	`DEFAULT_DELTA_MARGIN`
`static int`	`DEFAULT_K_BEST` When training the DV parsing method, how many of the top K trees to analyze from the underlying parser
`static double`	`DEFAULT_LEARNING_RATE`
`static int`	`DEFAULT_QN_ITERATIONS_PER_BATCH` When training the DV parsing method, how many iterations to loop for one batch of trees
`static double`	`DEFAULT_REGCOST` regularization constant
`static double`	`DEFAULT_SCALING_FOR_INIT`
`static int`	`DEFAULT_STALLED_ITERATION_LIMIT`
`static int`	`DEFAULT_TRAINING_ITERATIONS` When training a parsing method where the training has a (max) number of iterations, how many iterations to loop
`static java.lang.String`	`DEFAULT_UNK_WORD`
`java.util.Set<java.lang.String>`	`deleteSplitters`
`double`	`deltaMargin` How much to penalize the wrong trees for how different they are from the gold tree when training
`int`	`dvKBest`
`boolean`	`dvSimplifiedModel` Make the dv model as simple as possible
`double`	`fractionBeforeUnseenCounting` Start aggregating signature-tag pairs only for words that are unseen in the first part of the data; this value is the fraction of the data that makes up that first part.
`boolean`	`gPA` This variable controls doing 2 levels of parent annotation.
`int`	`HSEL_CUT`
`boolean`	`hSelSplit`
`double`	`learningRate` How fast to learn (can mean different things for different algorithms)
`boolean`	`leftRec` Left edge is left-recursive (X << X) Bad.
`boolean`	`leftToRight`
`boolean`	`lowercaseWordVectors` Whether or not to lowercase word vectors
`boolean`	`markFinalStates` Whether or not to mark final states in binarized grammar.
`boolean`	`markovFactor` Whether to do "horizontal Markovization" (as in ACL 2003 paper).
`int`	`markovOrder`
`boolean`	`markStrahler` Horton-Strahler number/dimension (Maximilian Schlund)
`int`	`markUnary` Mark all unary nodes specially.
`boolean`	`markUnaryTags` Mark POS tags which are the sole member of their phrasal constituent.
`int`	`maxTrainTimeSeconds`
`boolean`	`noRebinarization` When binarizing trees, don't binarize trees with two children.
`boolean`	`noTagSplit`
`int`	`openClassTypesThreshold` A POS tag has to have been attributed to more than this number of word types before it is regarded as an open-class tag.
`boolean`	`PA` This variable controls doing parent annotation of phrasal nodes.
`boolean`	`postGPA`
`boolean`	`postPA`
`java.util.Set`	`postSplitters`
`boolean`	`postSplitWithBaseCategory` Whether, in post-splitting of categories, nodes are annotated with the (grand)parent's base category or with its complete subcategorized category.
`boolean`	`predictSplits` Use the method reported by Berkeley for splitting and recombining states.
`TreeTransformer`	`preTransformer` A transformer to use on the training data before any other processing step.
`java.io.PrintWriter`	`printAnnotatedPW`
`boolean`	`printAnnotatedRuleCounts`
`boolean`	`printAnnotatedStateCounts`
`java.io.PrintWriter`	`printBinarizedPW`
`boolean`	`printStates`
`int`	`printTreeTransformations` Just for debugging: check that your tree transforms work correctly.
`int`	`qnEstimates` When training the DV parsing method, how many estimates to keep for the qn approximation.
`int`	`qnIterationsPerBatch`
`double`	`qnTolerance` When training the DV parsing method, the tolerance to use if we want to stop qn early
`long`	`randomSeed`
`double`	`regCost`
`boolean`	`rightRec` Right edge is right-recursive (X << X) Bad.
`double`	`ruleDiscount` Discounts the count of BinaryRules (only, apparently) in training data.
`boolean`	`ruleSmoothing` Enables linear rule smoothing during grammar extraction but before grammar compaction.
`double`	`ruleSmoothingAlpha`
`double`	`scalingForInit` How much to scale certain parameters when initializing models.
`boolean`	`selectivePostSplit`
`double`	`selectivePostSplitCutOff`
`boolean`	`selectiveSplit` Only split the "common high KL divergence" parent categories....
`double`	`selectiveSplitCutOff`
`boolean`	`simpleBinarizedLabels` When binarizing trees, don't annotate the labels with anything
`boolean`	`sisterAnnotate` Selective Sister annotation.
`java.util.Set<java.lang.String>`	`sisterSplitters`
`boolean`	`smoothing` TODO wsg2011: This is the old grammar smoothing parameter that no longer does anything in the parser.
`int`	`splitCount` If we are predicting splits, we loop this many times
`boolean`	`splitPrePreT` Mark all pre-preterminals (also does splitBaseNP: don't need both)
`double`	`splitRecombineRate` If we are predicting splits, we recombine states at this rate every loop
`java.util.Set<java.lang.String>`	`splitters` Set the splitter strings.
`int`	`stalledIterationLimit` How many iterations to allow training to stall before taking the best model, if training in an iterative manner
`java.lang.String`	`taggedFiles` A set of files to use as extra information in the lexicon.
`boolean`	`tagPA` Parent annotation on tags.
`boolean`	`tagSelectivePostSplit`
`double`	`tagSelectivePostSplitCutOff`
`boolean`	`tagSelectiveSplit` Do parent annotation on tags selectively.
`double`	`tagSelectiveSplitCutOff`
`int`	`trainingIterations`
`int`	`trainingThreads` If the training algorithm allows for parallelization, how many threads to use
`int`	`trainLengthLimit`
`java.lang.String`	`trainTreeFile`
`boolean`	`trainWordVectors` Do we want a model that uses word vectors (such as the DVParser) to train those word vectors when training the model? Note: models prior to 2014-02-13 may have incorrect values in this field, as it was originally a compile time constant
`TrainOptions.TransformMatrixType`	`transformMatrixType`
`boolean`	`unknownCapsVector` Whether or not to build an unknown word vector for words with caps in them
`boolean`	`unknownChineseNumberVector` Whether or not to build an unknown word vector to match Chinese numbers
`boolean`	`unknownChinesePercentVector` Whether or not to build an unknown word vector to match Chinese percentages
`boolean`	`unknownChineseYearVector` Whether or not to build an unknown word vector to match Chinese years
`boolean`	`unknownDashedWordVectors` Whether or not to handle unknown dashed words by taking the last part
`boolean`	`unknownNumberVector` Whether or not to build an unknown word vector specifically for numbers
`java.lang.String`	`unkWord` Some models will use external data sources which contain information about unknown words.
`boolean`	`useContextWords` Specifically for the DVModel, uses words on either side of a context when combining constituents.

Constructor Summary

Constructors
Constructor and Description
`TrainOptions()`

Method Summary

Methods
Modifier and Type	Method and Description
`int`	`compactGrammar()`
`void`	`display()`
`boolean`	`outsideFactor()` If true, declare early -- leave this on except maybe with markov on.
`static void`	`printTrainTree(java.io.PrintWriter pw, java.lang.String message, Tree t)`
`java.lang.String`	`toString()`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - trainTreeFile
```
public java.lang.String trainTreeFile
```
  - trainLengthLimit
```
public int trainLengthLimit
```
  - cheatPCFG
```
public boolean cheatPCFG
```
    Add all test set trees to training data for PCFG. (Currently only supported in FactoredParser main.)
  - markovFactor
```
public boolean markovFactor
```
    Whether to do "horizontal Markovization" (as in ACL 2003 paper). False means regular PCFG expansions.
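    As a rough illustration of horizontal Markovization, the sketch below lists the intermediate states produced while a rule's children are generated left to right, where each state remembers only the most recent siblings. The class name, the `order` parameter, and the `@NP|DT` label format are assumptions made for this example; they are not the parser's own binarization code or label scheme.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of horizontal Markovization; not the parser's implementation. */
class HorizontalMarkovSketch {

    /**
     * Returns one intermediate state per already-generated prefix of the children.
     * Each state keeps only the last `order` generated siblings (assumed label format).
     */
    static List<String> intermediateStates(String parent, List<String> children, int order) {
        List<String> states = new ArrayList<>();
        for (int i = 1; i < children.size(); i++) {      // i = number of children generated so far
            int from = Math.max(0, i - order);           // forget siblings older than `order`
            states.add("@" + parent + "|" + String.join("_", children.subList(from, i)));
        }
        return states;
    }

    public static void main(String[] args) {
        List<String> kids = Arrays.asList("DT", "JJ", "JJ", "NN");
        // Order 1: the two JJ histories collapse into the same state.
        System.out.println(intermediateStates("NP", kids, 1));   // [@NP|DT, @NP|JJ, @NP|JJ]
        // Unlimited history: every prefix gets its own state.
        System.out.println(intermediateStates("NP", kids, Integer.MAX_VALUE));
    }
}
```

    With a short history, different rules share intermediate states, which is what lets the grammar generalize to child sequences not seen in training.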
  - markovOrder
```
public int markovOrder
```
  - hSelSplit
```
public boolean hSelSplit
```
  - HSEL_CUT
```
public int HSEL_CUT
```
  - markFinalStates
```
public boolean markFinalStates
```
    Whether or not to mark final states in binarized grammar. This must be off to get most value out of grammar compaction.
  - openClassTypesThreshold
```
public int openClassTypesThreshold
```
    A POS tag has to have been attributed to more than this number of word types before it is regarded as an open-class tag. Unknown words will only possibly be tagged as open-class tags (unless flexiTag is on). If flexiTag is on, unknown words will be able to be tagged any POS for which the unseenMap has nonzero count (that is, the tag was seen for a new word after unseen signature counting was started).
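    The threshold test described above can be sketched as follows. This is a minimal illustration that assumes the training data is available as (word, tag) pairs; the class and method names are hypothetical and the parser's lexicon code may count types differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of the open-class test; not the parser's lexicon code. */
class OpenClassTagSketch {

    /** Returns the tags seen with strictly more than `threshold` distinct word types. */
    static Set<String> openClassTags(List<String[]> wordTagPairs, int threshold) {
        Map<String, Set<String>> typesByTag = new TreeMap<>();
        for (String[] pair : wordTagPairs) {             // pair[0] = word, pair[1] = tag
            typesByTag.computeIfAbsent(pair[1], k -> new HashSet<>()).add(pair[0]);
        }
        Set<String> open = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : typesByTag.entrySet()) {
            if (e.getValue().size() > threshold) {       // "more than this number of word types"
                open.add(e.getKey());
            }
        }
        return open;
    }

    public static void main(String[] args) {
        List<String[]> data = Arrays.asList(
            new String[] {"dog", "NN"}, new String[] {"cat", "NN"}, new String[] {"tree", "NN"},
            new String[] {"the", "DT"}, new String[] {"a", "DT"}, new String[] {"dog", "NN"});
        // NN has 3 word types, DT has 2; only NN exceeds a threshold of 2.
        System.out.println(openClassTags(data, 2));      // prints [NN]
    }
}
```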
  - fractionBeforeUnseenCounting
```
public double fractionBeforeUnseenCounting
```
    Start aggregating signature-tag pairs only for words that are unseen in the first part of the data; this value is the fraction of the data that makes up that first part.
  - PA
```
public boolean PA
```
    This variable controls doing parent annotation of phrasal nodes. Good.
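    To make parent annotation concrete, the sketch below appends each phrasal node's parent category to its label and leaves POS tags and words untouched. The `Node` type, the `^` separator, and the bracketed output are assumptions of this example rather than the parser's tree classes.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of parent annotation on phrasal nodes; not the parser's code. */
class ParentAnnotationSketch {

    static final class Node {
        final String label;
        final List<Node> children;
        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
        boolean isLeaf() { return children.isEmpty(); }
        boolean isPreterminal() { return children.size() == 1 && children.get(0).isLeaf(); }
    }

    /** Renders the tree with phrasal labels rewritten as label^parentLabel. */
    static String annotate(Node n, String parentLabel) {
        if (n.isLeaf()) {
            return n.label;                              // words are never annotated
        }
        String label = n.label;
        if (!n.isPreterminal() && parentLabel != null) { // phrasal, non-root nodes only
            label = label + "^" + parentLabel;
        }
        StringBuilder sb = new StringBuilder("(").append(label);
        for (Node child : n.children) {
            sb.append(' ').append(annotate(child, n.label));
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        Node tree = new Node("S",
            new Node("NP", new Node("DT", new Node("the")), new Node("NN", new Node("dog"))),
            new Node("VP", new Node("VBD", new Node("barked"))));
        // prints (S (NP^S (DT the) (NN dog)) (VP^S (VBD barked)))
        System.out.println(annotate(tree, null));
    }
}
```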
  - gPA
```
public boolean gPA
```
    This variable controls doing 2 levels of parent annotation. Bad.
  - postPA
```
public boolean postPA
```
  - postGPA
```
public boolean postGPA
```
  - selectiveSplit
```
public boolean selectiveSplit
```
    Only split the "common high KL divergence" parent categories.... Good.
  - selectiveSplitCutOff
```
public double selectiveSplitCutOff
```
  - selectivePostSplit
```
public boolean selectivePostSplit
```
  - selectivePostSplitCutOff
```
public double selectivePostSplitCutOff
```
  - postSplitWithBaseCategory
```
public boolean postSplitWithBaseCategory
```
    Whether, in post-splitting of categories, nodes are annotated with the (grand)parent's base category or with its complete subcategorized category.
  - sisterAnnotate
```
public boolean sisterAnnotate
```
    Selective Sister annotation.
  - sisterSplitters
```
public java.util.Set<java.lang.String> sisterSplitters
```
  - markUnary
```
public int markUnary
```
    Mark all unary nodes specially. Good for just PCFG. Bad for factored. markUnary affects phrasal nodes. A value of 0 means to do nothing; a value of 1 means to mark the parent (higher) node of a unary rewrite. A value of 2 means to mark the child (lower) node of a unary rewrite. Values of 1 and 2 only apply if the child (lower) node is phrasal. (A value of 1 is better than 2 in combos.) A value of 1 corresponds to the old boolean -unary flag.
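    The three values can be summarized in a small sketch. It assumes a unary rewrite is given as a parent label and a child label, and it uses a made-up `-U` suffix as the mark; the suffix the parser actually attaches is not stated here.

```java
/** Hypothetical sketch of the markUnary values 0, 1 and 2; not the parser's annotator. */
class MarkUnarySketch {

    /**
     * Returns {parentLabel, childLabel} for a unary rewrite parent -> child.
     * The "-U" suffix is an assumed marker chosen for this example.
     */
    static String[] mark(String parent, String child, boolean childIsPhrasal, int markUnary) {
        if (!childIsPhrasal) {
            return new String[] {parent, child};         // values 1 and 2 need a phrasal child
        }
        switch (markUnary) {
            case 1:
                return new String[] {parent + "-U", child};   // mark the higher node
            case 2:
                return new String[] {parent, child + "-U"};   // mark the lower node
            default:
                return new String[] {parent, child};          // 0: do nothing
        }
    }

    public static void main(String[] args) {
        System.out.println(String.join(" -> ", mark("S", "VP", true, 1)));   // S-U -> VP
        System.out.println(String.join(" -> ", mark("S", "VP", true, 2)));   // S -> VP-U
        System.out.println(String.join(" -> ", mark("NP", "NN", false, 1))); // NP -> NN
    }
}
```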
  - markUnaryTags
```
public boolean markUnaryTags
```
    Mark POS tags which are the sole member of their phrasal constituent. This is like markUnary=2, applied to POS tags.
  - splitPrePreT
```
public boolean splitPrePreT
```
    Mark all pre-preterminals (also does splitBaseNP: don't need both)
  - tagPA
```
public boolean tagPA
```
    Parent annotation on tags. Good (for PCFG?)
  - tagSelectiveSplit
```
public boolean tagSelectiveSplit
```
    Do parent annotation on tags selectively. Neutral, but fewer splits.
  - tagSelectiveSplitCutOff
```
public double tagSelectiveSplitCutOff
```
  - tagSelectivePostSplit
```
public boolean tagSelectivePostSplit
```
  - tagSelectivePostSplitCutOff
```
public double tagSelectivePostSplitCutOff
```
  - rightRec
```
public boolean rightRec
```
    Right edge is right-recursive (X << X) Bad. (NP only is good)
  - leftRec
```
public boolean leftRec
```
    Left edge is left-recursive (X << X) Bad.
  - collinsPunc
```
public boolean collinsPunc
```
    Promote/delete punctuation like Collins. Bad (!)
  - splitters
```
public java.util.Set<java.lang.String> splitters
```
    Set the splitter strings. These are a set of parent and/or grandparent annotated categories which should be split off.
  - postSplitters
```
public java.util.Set postSplitters
```
  - deleteSplitters
```
public java.util.Set<java.lang.String> deleteSplitters
```
  - printTreeTransformations
```
public int printTreeTransformations
```
    Just for debugging: check that your tree transforms work correctly. This will print the transformations of the first printTreeTransformations trees.
  - printAnnotatedPW
```
public java.io.PrintWriter printAnnotatedPW
```
  - printBinarizedPW
```
public java.io.PrintWriter printBinarizedPW
```
  - printStates
```
public boolean printStates
```
  - compactGrammar
```
public int compactGrammar
```
    How to compact grammars as FSMs.
    - 0 = no compaction [uses makeSyntheticLabel1]
    - 1 = no compaction but use label names that wrap from right to left in binarization [uses makeSyntheticLabel2]
    - 2 = wrapping labels and materialize unary at top rewriting passive to active
    - 3 = ExactGrammarCompactor
    - 4 = LossyGrammarCompactor
    - 5 = CategoryMergingGrammarCompactor

    (May 2007 CDM note: options 4 and 5 don't seem to be functioning sensibly. 0, 1, and 3 seem to be the 'good' options. 2 is only useful as input to 3. There seems to be no reason not to use 0, despite the default.)
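    For callers that want to validate a value before training, the option codes described above could be wrapped in a small helper like the one below. The class and method names are hypothetical; only the numeric codes and the May 2007 recommendation come from the description.

```java
/** Hypothetical helper that names the compactGrammar codes; not part of the API. */
class CompactGrammarModes {

    /** Returns a short description of a compactGrammar value, or throws if it is unknown. */
    static String describe(int mode) {
        switch (mode) {
            case 0: return "no compaction";
            case 1: return "no compaction, label names wrap right to left";
            case 2: return "wrapping labels, unary materialized at top";
            case 3: return "ExactGrammarCompactor";
            case 4: return "LossyGrammarCompactor";
            case 5: return "CategoryMergingGrammarCompactor";
            default:
                throw new IllegalArgumentException("unknown compactGrammar value: " + mode);
        }
    }

    /** True for the values the May 2007 note calls the 'good' options (0, 1 and 3). */
    static boolean reportedGood(int mode) {
        return mode == 0 || mode == 1 || mode == 3;
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 5; mode++) {
            System.out.println(mode + ": " + describe(mode) + (reportedGood(mode) ? " (good)" : ""));
        }
    }
}
```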
  - leftToRight
```
public boolean leftToRight
```
  - noTagSplit
```
public boolean noTagSplit
```
  - ruleSmoothing
```
public boolean ruleSmoothing
```
    Enables linear rule smoothing during grammar extraction but before grammar compaction. The alpha term is the same as that described in Petrov et al. (2006), and has range [0,1].
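    A minimal sketch of linear smoothing with an alpha in [0,1] follows. It assumes the quantity being smoothed is one rule's probability under each subcategory of a single base category, and that each value is interpolated toward the mean of the group; whether the parser applies the formula to exactly these quantities is an assumption of this example.

```java
/** Hypothetical sketch of linear smoothing toward a group mean; not the parser's code. */
class RuleSmoothingSketch {

    /**
     * Interpolates each probability toward the mean of the group:
     * smoothed[i] = (1 - alpha) * probs[i] + alpha * mean(probs).
     */
    static double[] smooth(double[] probs, double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0,1]: " + alpha);
        }
        double mean = 0.0;
        for (double p : probs) {
            mean += p;
        }
        mean /= probs.length;
        double[] smoothed = new double[probs.length];
        for (int i = 0; i < probs.length; i++) {
            smoothed[i] = (1.0 - alpha) * probs[i] + alpha * mean;
        }
        return smoothed;
    }

    public static void main(String[] args) {
        // alpha = 0 leaves the values unchanged; alpha = 1 replaces each value with the mean.
        double[] result = smooth(new double[] {0.2, 0.6}, 0.5);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

    Because the interpolation is linear, the sum of the group's values is preserved for any alpha in the range.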
  - ruleSmoothingAlpha
```
public double ruleSmoothingAlpha
```
  - smoothing
```
public boolean smoothing
```
    TODO wsg2011: This is the old grammar smoothing parameter that no longer does anything in the parser. It should be removed.
  - ruleDiscount
```
public double ruleDiscount
```
    Discounts the count of BinaryRules (only, apparently) in training data.
  - printAnnotatedRuleCounts
```
public boolean printAnnotatedRuleCounts
```
  - printAnnotatedStateCounts
```
public boolean printAnnotatedStateCounts
```
  - basicCategoryTagsInDependencyGrammar
```
public boolean basicCategoryTagsInDependencyGrammar
```
    Whether to use the basic or split tags in the dependency grammar
  - preTransformer
```
public TreeTransformer preTransformer
```
    A transformer to use on the training data before any other processing step. This is specified by using the -preTransformer flag when training the parser. A comma separated list of classes will be turned into a CompositeTransformer. This can be used to strip subcategories, to run a tsurgeon pattern, or any number of other useful operations.
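    The idea of turning a comma-separated list into one composite step can be sketched as below. To stay self-contained, the example works on label strings rather than trees and looks names up in a small registry instead of loading classes; the registry entries and all names are hypothetical and do not correspond to real transformer classes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of composing transformers from a comma-separated list. */
class PreTransformerSketch {

    /** Made-up transformers keyed by name; a real setup would instantiate classes instead. */
    static final Map<String, UnaryOperator<String>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("StripSubcategories", label -> label.replaceAll("-[A-Z]+$", ""));
        REGISTRY.put("Lowercase", label -> label.toLowerCase(Locale.ROOT));
    }

    /** Builds one operator that applies the named transformers in the order listed. */
    static UnaryOperator<String> compose(String commaSeparated) {
        UnaryOperator<String> result = UnaryOperator.identity();
        for (String name : commaSeparated.split(",")) {
            UnaryOperator<String> next = REGISTRY.get(name.trim());
            if (next == null) {
                throw new IllegalArgumentException("unknown transformer: " + name);
            }
            UnaryOperator<String> previous = result;
            result = s -> next.apply(previous.apply(s));   // run earlier steps first
        }
        return result;
    }

    public static void main(String[] args) {
        UnaryOperator<String> composite = compose("StripSubcategories, Lowercase");
        System.out.println(composite.apply("NP-SBJ"));     // prints np
    }
}
```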
  - taggedFiles
```
public java.lang.String taggedFiles
```
    A set of files to use as extra information in the lexicon. This can provide tagged words which are not part of trees.
  - predictSplits
```
public boolean predictSplits
```
    Use the method reported by Berkeley for splitting and recombining states. This is an experimental reimplementation of that work and is still in development.
  - splitCount
```
public int splitCount
```
    If we are predicting splits, we loop this many times
  - splitRecombineRate
```
public double splitRecombineRate
```
    If we are predicting splits, we recombine states at this rate every loop
  - simpleBinarizedLabels
```
public boolean simpleBinarizedLabels
```
    When binarizing trees, don't annotate the labels with anything
  - noRebinarization
```
public boolean noRebinarization
```
    When binarizing trees, don't binarize trees with two children. Only applies when using inside markov binarization for now.
  - trainingThreads
```
public int trainingThreads
```
    If the training algorithm allows for parallelization, how many threads to use
  - DEFAULT_K_BEST
```
public static final int DEFAULT_K_BEST
```
    When training the DV parsing method, how many of the top K trees to analyze from the underlying parser
    
    See Also:
    
    Constant Field Values
  - dvKBest
```
public int dvKBest
```
  - DEFAULT_TRAINING_ITERATIONS
```
public static final int DEFAULT_TRAINING_ITERATIONS
```
    When training a parsing method where the training has a (max) number of iterations, how many iterations to loop
    
    See Also:
    
    Constant Field Values
  - trainingIterations
```
public int trainingIterations
```
  - DEFAULT_BATCH_SIZE
```
public static final int DEFAULT_BATCH_SIZE
```
    When training using batches of trees, such as in the DVParser, how many trees to use in one batch
    
    See Also:
    
    Constant Field Values
  - batchSize
```
public int batchSize
```
  - DEFAULT_REGCOST
```
public static final double DEFAULT_REGCOST
```
    regularization constant
    
    See Also:
    
    Constant Field Values
  - regCost
```
public double regCost
```
  - DEFAULT_QN_ITERATIONS_PER_BATCH
```
public static final int DEFAULT_QN_ITERATIONS_PER_BATCH
```
    When training the DV parsing method, how many iterations to loop for one batch of trees
    
    See Also:
    
    Constant Field Values
  - qnIterationsPerBatch
```
public int qnIterationsPerBatch
```
  - qnEstimates
```
public int qnEstimates
```
    When training the DV parsing method, how many estimates to keep for the qn approximation.
  - qnTolerance
```
public double qnTolerance
```
    When training the DV parsing method, the tolerance to use if we want to stop qn early
  - debugOutputFrequency
```
public int debugOutputFrequency
```
    If larger than 0, the parser may choose to output debug information every X seconds, X iterations, or some other similar metric
  - randomSeed
```
public long randomSeed
```
  - DEFAULT_LEARNING_RATE
```
public static final double DEFAULT_LEARNING_RATE
```
    See Also:
    
    Constant Field Values
  - learningRate
```
public double learningRate
```
    How fast to learn (can mean different things for different algorithms)
  - DEFAULT_DELTA_MARGIN
```
public static final double DEFAULT_DELTA_MARGIN
```
    See Also:
    
    Constant Field Values
  - deltaMargin
```
public double deltaMargin
```
    How much to penalize the wrong trees for how different they are from the gold tree when training
  - unknownNumberVector
```
public boolean unknownNumberVector
```
    Whether or not to build an unknown word vector specifically for numbers
  - unknownDashedWordVectors
```
public boolean unknownDashedWordVectors
```
    Whether or not to handle unknown dashed words by taking the last part
  - unknownCapsVector
```
public boolean unknownCapsVector
```
    Whether or not to build an unknown word vector for words with caps in them
  - dvSimplifiedModel
```
public boolean dvSimplifiedModel
```
    Make the dv model as simple as possible
  - unknownChineseYearVector
```
public boolean unknownChineseYearVector
```
    Whether or not to build an unknown word vector to match Chinese years
  - unknownChineseNumberVector
```
public boolean unknownChineseNumberVector
```
    Whether or not to build an unknown word vector to match Chinese numbers
  - unknownChinesePercentVector
```
public boolean unknownChinesePercentVector
```
    Whether or not to build an unknown word vector to match Chinese percentages
  - DEFAULT_SCALING_FOR_INIT
```
public static final double DEFAULT_SCALING_FOR_INIT
```
    See Also:
    
    Constant Field Values
  - scalingForInit
```
public double scalingForInit
```
    How much to scale certain parameters when initializing models. For example, the DVParser uses this to rescale its initial matrices.
  - maxTrainTimeSeconds
```
public int maxTrainTimeSeconds
```
  - DEFAULT_UNK_WORD
```
public static final java.lang.String DEFAULT_UNK_WORD
```
    See Also:
    
    Constant Field Values
  - unkWord
```
public java.lang.String unkWord
```
    Some models will use external data sources which contain information about unknown words. This variable is a way to provide the name of the unknown word in the external data source.
  - lowercaseWordVectors
```
public boolean lowercaseWordVectors
```
    Whether or not to lowercase word vectors
  - transformMatrixType
```
public TrainOptions.TransformMatrixType transformMatrixType
```
  - useContextWords
```
public boolean useContextWords
```
    Specifically for the DVModel, uses words on either side of a context when combining constituents. Gives perhaps a microscopic improvement in performance but causes a large slowdown.
  - trainWordVectors
```
public boolean trainWordVectors
```
    Do we want a model that uses word vectors (such as the DVParser) to train those word vectors when training the model?
    Note: models prior to 2014-02-13 may have incorrect values in this field, as it was originally a compile time constant
  - DEFAULT_STALLED_ITERATION_LIMIT
```
public static final int DEFAULT_STALLED_ITERATION_LIMIT
```
    See Also:
    
    Constant Field Values
  - stalledIterationLimit
```
public int stalledIterationLimit
```
    How many iterations to allow training to stall before taking the best model, if training in an iterative manner
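    One way to read this limit is as an early-stopping rule, sketched below over a precomputed list of per-iteration scores. The class name, the use of "higher score is better", and stopping once the count of non-improving iterations reaches the limit are assumptions of this example; the exact comparison used by any given trainer may differ.

```java
/** Hypothetical early-stopping sketch for a stalled-iteration limit; not the parser's trainer. */
class StalledIterationSketch {

    /**
     * Returns the index of the best-scoring iteration, stopping early once
     * `stalledIterationLimit` consecutive iterations fail to improve on the best score.
     */
    static int bestIteration(double[] scores, int stalledIterationLimit) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        int stalled = 0;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                best = i;
                stalled = 0;                              // progress resets the counter
            } else if (++stalled >= stalledIterationLimit) {
                break;                                    // give up and keep the best model
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.80, 0.82, 0.81, 0.815, 0.819, 0.90};
        System.out.println(bestIteration(scores, 3));     // prints 1 (stops before the late jump)
        System.out.println(bestIteration(scores, 5));     // prints 5
    }
}
```

    The example shows the trade-off: a small limit saves training time but can stop before a late improvement.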
  - markStrahler
```
public boolean markStrahler
```
    Horton-Strahler number/dimension (Maximilian Schlund)
- Constructor Detail
  - TrainOptions
```
public TrainOptions()
```
- Method Detail
  - outsideFactor
```
public boolean outsideFactor()
```
    If true, declare early -- leave this on except maybe with markov on.
    
    Returns:
    
    Whether to do outside factorization in binarization of the grammar
  - compactGrammar
```
public int compactGrammar()
```
  - display
```
public void display()
```
  - toString
```
public java.lang.String toString()
```
    Overrides:
    
    toString in class java.lang.Object
  - printTrainTree
```
public static void printTrainTree(java.io.PrintWriter pw,
                                  java.lang.String message,
                                  Tree t)
```
