MaxentTagger (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.tagger.common.Tagger
- - edu.stanford.nlp.tagger.maxent.MaxentTagger

All Implemented Interfaces:: ListProcessor<java.util.List<? extends HasWord>,java.util.List<TaggedWord>>, java.io.Serializable, java.util.function.Function<java.util.List<? extends HasWord>,java.util.List<TaggedWord>>

public class MaxentTagger
extends Tagger
implements ListProcessor<java.util.List<? extends HasWord>,java.util.List<TaggedWord>>, java.io.Serializable

The main class for users to run, train, and test the part of speech tagger. You can tag things through the Java API or from the command line. The two English taggers included in this distribution are:

A bi-directional dependency network tagger in edu/stanford/nlp/models/pos-tagger/english-bidirectional-distsim.tagger. Its accuracy was 97.32% on Penn Treebank WSJ secs. 22-24.
A model using only left second-order sequence information and similar but fewer unknown-word and lexical features than the previous model, in edu/stanford/nlp/models/pos-tagger/english-left3words-distsim.tagger. This tagger runs much faster and is recommended for general use. Its accuracy was 96.92% on Penn Treebank WSJ secs. 22-24.

Using the Java API

A MaxentTagger can be made with a constructor taking as argument the location of parameter files for a trained tagger:: MaxentTagger tagger = new MaxentTagger("models/left3words-wsj-0-18.tagger");
A default path is provided for the location of the tagger on the Stanford NLP machines:: MaxentTagger tagger = new MaxentTagger(DEFAULT_NLP_GROUP_MODEL_PATH);
If you set the NLP_DATA_HOME environment variable, DEFAULT_NLP_GROUP_MODEL_PATH will instead point to the directory given in NLP_DATA_HOME.
To tag a List of HasWord and get a List of TaggedWord, you can use one of the following, where sentence is a List<? extends HasWord>:: List<TaggedWord> taggedSentence = tagger.tagSentence(sentence); List<TaggedWord> taggedSentence = tagger.apply(sentence);
To tag a list of sentences and get back a list of tagged sentences, where sentences is a List of sentence Lists:: List<List<TaggedWord>> taggedList = tagger.process(sentences);
To tag a String of text and to get back a String with tagged words:: String taggedString = tagger.tagString("Here's a tagged string.")
To tag a string of correctly tokenized, whitespace-separated words and get a string of tagged words back:: String taggedString = tagger.tagTokenizedString("Here 's a tagged string .")

The tagString method uses the default tokenizer (PTBTokenizer). To control tokenization, call tokenizeText(Reader, TokenizerFactory) and then call process() on the result.

Using the command line

Tagging, testing, and training can all also be done via the command line.

Training from the command line

To train a model from the command line, first generate a property file:

java edu.stanford.nlp.tagger.maxent.MaxentTagger -genprops

This gets you a default properties file with descriptions of each parameter you can set in your trained model. You can modify the properties file, or use the default options. To train, run:

java -mx1g edu.stanford.nlp.tagger.maxent.MaxentTagger -props myPropertiesFile.props

with the appropriate properties file specified. Any argument you give in the properties file can also be specified on the command line. You must have specified a model using -model, either in the properties file or on the command line, as well as a file containing tagged words using -trainFile. Useful flags for controlling the amount of output are -verbose, which prints extra debugging information, and -verboseResults, which prints full information about intermediate results. -verbose defaults to false and -verboseResults defaults to true.
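
A properties file can also be assembled programmatically rather than edited by hand. The following is a minimal sketch, assuming the tagger reads standard `java.util.Properties` syntax; the class name `TrainingPropsSketch` and the paths `my-model.tagger` and `train.txt` are hypothetical placeholders, while the property names are taken from the parameter table on this page.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: builds a training configuration from property names listed on this page. */
final class TrainingPropsSketch {

    /** Returns training properties; both arguments are caller-chosen placeholder paths. */
    static Properties trainingProps(String modelPath, String trainFileSpec) {
        Properties props = new Properties();
        props.setProperty("model", modelPath);         // where the trained model will be saved
        props.setProperty("trainFile", trainFileSpec); // setting trainFile selects training mode
        props.setProperty("tagSeparator", "/");        // documented default separator
        props.setProperty("encoding", "UTF-8");        // documented default encoding
        props.setProperty("arch", "generic");          // documented default architecture
        props.setProperty("lang", "english");          // documented default language
        props.setProperty("verbose", "false");         // documented default verbosity
        return props;
    }

    /** Renders the properties as text in java.util.Properties format. */
    static String render(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, "Illustrative MaxentTagger training properties");
        } catch (IOException e) {
            // A StringWriter does not fail in practice; wrap to keep the signature simple.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Print the file contents; redirect to myPropertiesFile.props to use with -props.
        System.out.print(render(trainingProps("my-model.tagger", "train.txt")));
    }
}
```

Starting from the output of -genprops remains the safer route, since that file documents every parameter; this sketch only covers a handful of them.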

Tagging and Testing from the command line

Usage: For tagging (plain text):

java edu.stanford.nlp.tagger.maxent.MaxentTagger -model modelFile -textFile textfile

For testing (evaluating against tagged text):

java edu.stanford.nlp.tagger.maxent.MaxentTagger -model modelFile -testFile testfile

You can use the same properties file as for training if you pass it in with the "-props" argument. The most important arguments for tagging (besides "model" and "file") are "tokenize" and "tokenizerFactory". See below for more details. Note that the tagger assumes input has not yet been tokenized and by default tokenizes it using a default English tokenizer. If your input has already been tokenized, use the flag "-tokenize false". Parameters can be defined using a Properties file (specified on the command line with -props propFile), or directly on the command line (by preceding their name with a minus sign ("-") to turn them into a flag). The following properties are recognized:

MaxentTagger command line params
Property Name	Type	Default Value	Relevant Phase(s)	Description
model	String	N/A	All	Path and filename where you would like to save the model (training) or where the model should be loaded from (testing, tagging).
trainFile	String	N/A	Train	Path to the file holding the training data; specifying this option puts the tagger in training mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. There are three formats possible. The first is a text file of tagged data. Each line is considered a separate sentence. In each sentence, words are separated by whitespace. Each word must have a tag, which is separated from the token using the specified `tagSeparator`. This format, called TEXT, is the default format. The second format is a file of Penn Treebank formatted (i.e., s-expression) trees. Trees are loaded one at a time and the tagged words in a tree are used as a training sentence. To specify this format, preface the filename with "`format=TREES,`". The final possible format is TSV files (tab-separated columns). To specify a TSV file, set `trainFile` to "`format=TSV,wordColumn=x,tagColumn=y,filename`". Column numbers are indexed from 0, and sentences are separated with blank lines. The default wordColumn is 0 and default tagColumn is 1. If comments=true, then comment lines will be skipped (a common thing to appear in conllu files). A file can be in a different character set encoding than the tagger's default encoding by prefacing the filename with `"encoding=ENC,"`. You can specify the tagSeparator character in a TEXT file by prefacing the filename with "tagSeparator=c,". Tree files can be fed through TreeTransformers and TreeNormalizers. To specify a transformer, preface the filename with "treeTransformer=CLASSNAME,". To specify a normalizer, preface the filename with "treeNormalizer=CLASSNAME,". You can also filter trees using a `Filter<Tree>`, which can be specified with "treeFilter=CLASSNAME,". A specific range of trees to be used can be specified with treeRange=X-Y. Multiple parts of the range can be separated by : as opposed to the normal separator of ,. For example, one could use the argument "-treeRange=25-50:75-100". You can specify a TreeReaderFactory by prefacing the filename with "trf=CLASSNAME,". Note: If it includes a TreeNormalizer, you want to specify it as the treeNormalizer as well. Multiple files can be specified by making a semicolon separated list of files. Each file can have its own format specifiers as above. Note that filenames cannot contain any of , ; or =.
testFile	String	N/A	Test	Path to the file holding the test data; specifying this option puts the tagger in testing mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. The same format as trainFile applies, but only one file can be specified.
textFile	String	N/A	Tag	Path to the file holding the text to tag; specifying this option puts the tagger in tagging mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified. No file reading options may be specified for textFile
dump	String	N/A	Dump	Path to the file holding the model to dump; specifying this option puts the tagger in dumping mode. Only one of 'trainFile','testFile','textFile', and 'dump' may be specified.
genprops	boolean	N/A	N/A	Use this option to output a default properties file, containing information about each of the possible configuration options.
tagSeparator	char	/	All	Separator character that separates word and part of speech tags, such as out/IN or out_IN. For training and testing, this is the separator used in the train/test files. For tagging, this is the character that will be inserted between words and tags in the output.
encoding	String	UTF-8	All	Encoding of the read files (training, testing) and the output text files.
tokenize	boolean	true	Tag,Test	Whether or not the file needs to be tokenized. If this is false, the tagger assumes that white space separates words if and only if they should be tagged as separate tokens, and that the input is strictly one sentence per line.
tokenizerFactory	String	edu.stanford.nlp.process.PTBTokenizer	Tag,Test	Fully qualified class name of the tokenizer to use. edu.stanford.nlp.process.PTBTokenizer does basic English tokenization.
tokenizerOptions	String		Tag,Test	Known options for the particular tokenizer used. A comma-separated list. For PTBTokenizer, options of interest include `americanize=false` and `quotes=ascii` (for German). Note that any choice of tokenizer options that conflicts with the tokenization used in the tagger training data will likely degrade tagger performance.
sentenceDelimiter	String	null	Tag,Test	A marker used to separate a text into sentences. If not set (equal to `null`), sentence breaking is done by content (looking for periods, etc.). Otherwise, it will break on this String, except that if the String is "newline", it breaks on the newline character "\n".
arch	String	generic	Train	Architecture of the model, as a comma-separated list of options, some with a parenthesized integer argument written k here: this determines what features are used to build your model. See `ExtractorFrames` and `ExtractorFramesRare` for more information.
wordFunction	String	(none)	Train	A function to apply to the text before training or testing. Must inherit from `java.util.function.Function<String, String>`. Can be blank.
lang	String	english	Train	Language from which the part of speech tags are drawn. This option determines which tags are considered closed-class (only fixed set of words can be tagged with a closed-class tag, such as prepositions). Defined languages are 'english' (Penn tag set), 'polish' (very rudimentary), 'french', 'chinese', 'arabic', 'german', and 'medline'.
openClassTags	String	N/A	Train	Space separated list of tags that should be considered open-class. All tags encountered that are not in this list are considered closed-class. E.g. format: "NN VB"
closedClassTags	String	N/A	Train	Space separated list of tags that should be considered closed-class. All tags encountered that are not in this list are considered open-class.
learnClosedClassTags	boolean	false	Train	If true, induce which tags are closed-class by counting as closed-class tags all those tags which have fewer unique word tokens than closedClassTagThreshold.
closedClassTagThreshold	int	40	Train	Number of unique word tokens that a tag may have and still be considered closed-class; relevant only if learnClosedClassTags is true.
sgml	boolean	false	Tag, Test	Very basic tagging of the contents of all sgml fields; for more complex mark-up, consider using the xmlInput option.
xmlInput	String		Tag, Test	Give a space separated list of tags in an XML file whose content you would like tagged. Any internal tags that appear in the content of fields you would like tagged will be discarded; the rest of the XML will be preserved and the original text of specified fields will be replaced with the tagged text.
outputFile	String	""	Tag	Path to write output to. If blank, stdout is used.
outputFormat	String	""	Tag	Output format. One of: slashTags (default), xml (or inlineXML as a synonym), or tsv
outputFormatOptions	String	""	Tag	Output format options. Currently used: lemmatize, verbose, keepEmptySentences
tagInside	String	""	Tag	Tags inside elements that match the regular expression given in the String.
search	String	cg	Train	Specify the search method to be used in the optimization method for training. Options are 'cg' (conjugate gradient), 'iis' (improved iterative scaling), or 'qn' (quasi-newton).
sigmaSquared	double	0.5	Train	Sigma-squared smoothing/regularization parameter to be used for conjugate gradient search. Default usually works reasonably well.
iterations	int	100	Train	Number of iterations to be used for improved iterative scaling.
rareWordThresh	int	5	Train	Words that appear fewer than this number of times during training are considered rare words and use extra rare word features.
minFeatureThreshold	int	5	Train	Features whose history appears fewer than this number of times are discarded.
curWordMinFeatureThreshold	int	2	Train	Words that occur more than this number of times will generate features with all of the tags they've been seen with.
rareWordMinFeatureThresh	int	10	Train	Features of rare words whose histories occur fewer than this number of times are discarded.
veryCommonWordThresh	int	250	Train	Words that occur more than this number of times form an equivalence class by themselves. Ignored unless you are using ambiguity classes.
debug	boolean	false	All	Whether to write debugging information (words, top words, unknown words, confusion matrix). Useful for error analysis.
debugPrefix	String	N/A	All	File (path) prefix for where to write out the debugging information (relevant only if debug=true).
nthreads	int	1	Tag,Test	Number of threads to use when processing text.
minWordsLockTags	int	1	All	If a word occurs at least this many times in the training data, the tagger will only propose known tags as candidates.
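
The trainFile conventions in the table above (comma-separated format specifiers before the filename, semicolons between files) are easy to get wrong by hand. The following is a small sketch of a string builder for such values; the class `TrainFileSpecSketch`, its methods, and the filenames `train.tsv` and `wsj.mrg` are hypothetical and are not part of the tagger API.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles trainFile values in the syntax described in the table. */
final class TrainFileSpecSketch {

    /** Builds "format=TSV,wordColumn=x,tagColumn=y,filename" for one TSV file. */
    static String tsv(String filename, int wordColumn, int tagColumn) {
        checkFilename(filename);
        return "format=TSV,wordColumn=" + wordColumn + ",tagColumn=" + tagColumn + "," + filename;
    }

    /** Builds "format=TREES,filename" for one Penn Treebank tree file. */
    static String trees(String filename) {
        checkFilename(filename);
        return "format=TREES," + filename;
    }

    /** Joins per-file specifications with semicolons. */
    static String join(List<String> specs) {
        return String.join(";", specs);
    }

    /** Rejects filenames containing ',', ';' or '=', since those characters delimit the spec. */
    static void checkFilename(String filename) {
        for (char c : new char[] {',', ';', '='}) {
            if (filename.indexOf(c) >= 0) {
                throw new IllegalArgumentException(
                    "Filename must not contain '" + c + "': " + filename);
            }
        }
    }

    public static void main(String[] args) {
        String spec = join(Arrays.asList(tsv("train.tsv", 0, 1), trees("wsj.mrg")));
        // prints format=TSV,wordColumn=0,tagColumn=1,train.tsv;format=TREES,wsj.mrg
        System.out.println(spec);
    }
}
```

The resulting string would be passed as the value of trainFile, either in the properties file or after -trainFile on the command line.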

Author:: Kristina Toutanova, Miler Lee, Joseph Smarr, Anna Rafferty, Michel Galley, Christopher Manning, John Bauer
See Also:: Serialized Form

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`BASE_TAGGER_HOME` The directory from which to get taggers when using DEFAULT_NLP_GROUP_MODEL_PATH.
`static java.lang.String`	`DEFAULT_DISTRIBUTION_PATH`
`static java.lang.String`	`DEFAULT_JAR_PATH`
`static java.lang.String`	`DEFAULT_NLP_GROUP_MODEL_PATH`
`static java.lang.String`	`TAGGER_HOME`

Fields inherited from class edu.stanford.nlp.tagger.common.Tagger
EOS_TAG, EOS_WORD

Constructor Summary

Constructors
Constructor and Description
`MaxentTagger()`
`MaxentTagger(java.io.InputStream modelStream)` Constructor for a tagger, loading a model from the given InputStream.
`MaxentTagger(java.io.InputStream modelStream, java.util.Properties config, boolean printLoading)` Initializer that loads the tagger.
`MaxentTagger(java.lang.String modelFile)` Constructor for a tagger, loading a model stored in a particular file, classpath resource, or URL.
`MaxentTagger(java.lang.String modelFile, java.util.Properties config)` Constructor for a tagger using a model stored in a particular file, with options taken from the supplied TaggerConfig.
`MaxentTagger(java.lang.String modelFile, java.util.Properties config, boolean printLoading)` Initializer that loads the tagger.
`MaxentTagger(TaggerConfig config)`

Method Summary

Modifier and Type	Method and Description
`int`	`addTag(java.lang.String tag)` Will return the index of a tag, adding it if it doesn't already exist
`java.util.List<TaggedWord>`	`apply(java.util.List<? extends HasWord> in)` Expects a sentence and returns a tagged sentence.
`protected TokenizerFactory<? extends HasWord>`	`chooseTokenizerFactory()` Figures out what tokenizer factory might be described by the config.
`protected static TokenizerFactory<? extends HasWord>`	`chooseTokenizerFactory(boolean tokenize, java.lang.String tokenizerFactory, java.lang.String tokenizerOptions, boolean invertible)`
`java.lang.String`	`getTag(int index)`
`int`	`getTagIndex(java.lang.String tag)` Will return the index of a tag if known, -1 if not already known
`static void`	`lemmatize(java.util.List<CoreLabel> sentence, Morphology morpha)` Adds lemmas to the given list of CoreLabels, using the given Morphology object.
`static void`	`main(java.lang.String[] args)` Command-line tagger interface.
`int`	`numTags()`
`void`	`outputTaggedSentence(java.util.List<? extends HasWord> sentence, boolean outputLemmas, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle, boolean outputVerbosity, int numSentences, java.lang.String separator, java.io.Writer writer)`
`java.util.List<java.util.List<TaggedWord>>`	`process(java.util.List<? extends java.util.List<? extends HasWord>> sentences)` Tags the Words in each Sentence in the given List with their grammatical part-of-speech.
`protected void`	`readModelAndInit(java.util.Properties config, java.io.DataInputStream rf, boolean printLoading)` This reads the complete tagger from a single model file, and inits the tagger using a combination of the properties passed in and parameters from the file.
`protected void`	`readModelAndInit(java.util.Properties config, java.io.InputStream modelStream, boolean printLoading)` This reads the complete tagger from a single model provided as an InputStream, and initializes the tagger using a combination of the properties passed in and parameters from the file.
`protected void`	`readModelAndInit(java.util.Properties config, java.lang.String modelFileOrUrl, boolean printLoading)` This reads the complete tagger from a single model stored in a file, at a URL, or as a resource in a jar file, and initializes the tagger using a combination of the properties passed in and parameters from the file.
`void`	`runTagger(java.io.BufferedReader reader, java.io.BufferedWriter writer, java.lang.String tagInside, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle)` This method runs the tagger on the provided reader and writer.
`<X extends HasWord> void`	`runTagger(java.lang.Iterable<java.util.List<X>> document, java.io.BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle)`
`void`	`runTaggerSGML(java.io.BufferedReader reader, java.io.BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle)`
`void`	`runTaggerStdin(java.io.BufferedReader reader, java.io.BufferedWriter writer, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle)`
`protected void`	`saveModel(java.io.DataOutputStream file)`
`protected void`	`saveModel(java.lang.String filename)`
`void`	`tagAndOutputSentence(java.util.List<? extends HasWord> sentence, boolean outputLemmas, Morphology morpha, PlainTextDocumentReaderAndWriter.OutputStyle outputStyle, boolean outputVerbosity, int numSentences, java.lang.String separator, java.io.Writer writer)`
`void`	`tagCoreLabels(java.util.List<CoreLabel> sentence)` Takes a sentence composed of CoreLabels and adds the tags to the CoreLabels, modifying the input sentence.
`void`	`tagCoreLabels(java.util.List<CoreLabel> sentence, boolean reuseTags)` Takes a sentence composed of CoreLabels and adds the tags to the CoreLabels, modifying the input sentence.
`java.util.List<? extends HasWord>`	`tagCoreLabelsOrHasWords(java.util.List<? extends HasWord> sentence, Morphology morpha, boolean outputLemmas)`
`void`	`tagFromXML(java.io.InputStream input, java.io.Writer writer, java.lang.String... xmlTags)` Uses an XML transformer to turn an input stream into a bunch of output.
`void`	`tagFromXML(java.io.Reader input, java.io.Writer writer, java.lang.String... xmlTags)`
`java.util.List<TaggedWord>`	`tagSentence(java.util.List<? extends HasWord> sentence)` Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech.
`java.util.List<TaggedWord>`	`tagSentence(java.util.List<? extends HasWord> sentence, boolean reuseTags)` Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech.
`java.util.Set<java.lang.String>`	`tagSet()`
`java.lang.String`	`tagString(java.lang.String toTag)` Tags the input string and returns the tagged version.
`java.lang.String`	`tagTokenizedString(java.lang.String toTag)` Tags the tokenized input string and returns the tagged version.
`static java.util.List<java.util.List<HasWord>>`	`tokenizeText(java.io.Reader r)` Reads data from r, tokenizes it with the default (Penn Treebank) tokenizer, and returns a List of Sentence objects, which can then be fed into tagSentence.
`static java.util.List<java.util.List<HasWord>>`	`tokenizeText(java.io.Reader r, TokenizerFactory<? extends HasWord> tokenizerFactory)` Reads data from r, tokenizes it with the given tokenizer, and returns a List of Lists of (extends) HasWord objects, which can then be fed into tagSentence.

Methods inherited from class edu.stanford.nlp.tagger.common.Tagger
loadModel

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.function.Function
andThen, compose, identity

- Field Detail
  - BASE_TAGGER_HOME
```
public static final java.lang.String BASE_TAGGER_HOME
```
    The directory from which to get taggers when using DEFAULT_NLP_GROUP_MODEL_PATH. Normally set to the location of the latest left3words tagger on the NLP machines, but can be changed by setting the environment variable NLP_DATA_HOME.
    
    See Also:
    
    Constant Field Values
  - TAGGER_HOME
```
public static final java.lang.String TAGGER_HOME
```
  - DEFAULT_NLP_GROUP_MODEL_PATH
```
public static final java.lang.String DEFAULT_NLP_GROUP_MODEL_PATH
```
  - DEFAULT_JAR_PATH
```
public static final java.lang.String DEFAULT_JAR_PATH
```
    See Also:
    
    Constant Field Values
  - DEFAULT_DISTRIBUTION_PATH
```
public static final java.lang.String DEFAULT_DISTRIBUTION_PATH
```
    See Also:
    
    Constant Field Values
- Constructor Detail
  - MaxentTagger
```
public MaxentTagger()
```
  - MaxentTagger
```
public MaxentTagger(TaggerConfig config)
```
  - MaxentTagger
```
public MaxentTagger(java.lang.String modelFile)
```
    Constructor for a tagger, loading a model stored in a particular file, classpath resource, or URL. The tagger data is loaded when the constructor is called (this can be slow). This constructor first constructs a TaggerConfig object, which loads the tagger options from the modelFile.
    
    Parameters:
    
    modelFile - Filename, classpath resource, or URL for the trained model
    
    Throws:
    
    RuntimeIOException - if I/O errors or serialization errors
  - MaxentTagger
```
public MaxentTagger(java.io.InputStream modelStream)
```
    Constructor for a tagger, loading a model from the given InputStream. The tagger data is loaded when the constructor is called (this can be slow). This constructor first constructs a TaggerConfig object, which loads the tagger options from the model stream.
    
    Parameters:
    
    modelStream - The InputStream from which to read the model
    
    Throws:
    
    RuntimeIOException - if I/O errors or serialization errors
  - MaxentTagger
```
public MaxentTagger(java.lang.String modelFile,
                    java.util.Properties config)
```
    Constructor for a tagger using a model stored in a particular file, with options taken from the supplied TaggerConfig. The tagger data is loaded when the constructor is called (this can be slow). This version assumes that the tagger options in the modelFile have already been loaded into the TaggerConfig (if that is desired).
    
    Parameters:
    
    modelFile - Filename, classpath resource, or URL for the trained model
    
    config - The configuration for the tagger
    
    Throws:
    
    RuntimeIOException - if I/O errors or serialization errors
  - MaxentTagger
```
public MaxentTagger(java.lang.String modelFile,
                    java.util.Properties config,
                    boolean printLoading)
```
    Initializer that loads the tagger.
    
    Parameters:
    
    modelFile - Where to initialize the tagger from. Most commonly, this is the filename of the trained model, for example, /u/nlp/data/pos-tagger/wsj3t0-18-left3words/left3words-wsj-0-18.tagger. However, if it starts with "https?://" it will be interpreted as a URL. One can also load models directly from the classpath, as in loading from edu/stanford/nlp/models/pos-tagger/bidirectional-distsim-wsj-0-18.tagger.
    
    config - TaggerConfig based on command-line arguments
    
    printLoading - Whether to print a message saying what model file is being loaded and how long it took when finished.
    
    Throws:
    
    RuntimeIOException - if I/O errors or serialization errors
  - MaxentTagger
```
public MaxentTagger(java.io.InputStream modelStream,
                    java.util.Properties config,
                    boolean printLoading)
```
    Initializer that loads the tagger.
    
    Parameters:
    
    modelStream - An InputStream for reading the model file
    
    config - TaggerConfig based on command-line arguments
    
    printLoading - Whether to print a message saying what model file is being loaded and how long it took when finished.
    
    Throws:
    
    RuntimeIOException - if I/O errors or serialization errors
- Method Detail
  - addTag
```
public int addTag(java.lang.String tag)
```
    Will return the index of a tag, adding it if it doesn't already exist
  - getTagIndex
```
public int getTagIndex(java.lang.String tag)
```
    Will return the index of a tag if known, -1 if not already known
  - numTags
```
public int numTags()
```
  - getTag
```
public java.lang.String getTag(int index)
```
  - tagSet
```
public java.util.Set<java.lang.String> tagSet()
```
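    The contract of the tag-index methods above (addTag returns an existing index or assigns a new one, getTagIndex returns -1 for an unknown tag) can be illustrated with a stand-in class. This is only a sketch of the described behavior, not the tagger's implementation; the class `TagIndexSketch` is hypothetical, and assigning indices in insertion order starting at 0 is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in mirroring the documented addTag/getTagIndex/numTags/getTag contract. */
final class TagIndexSketch {
    private final List<String> tags = new ArrayList<>();
    private final Map<String, Integer> indices = new HashMap<>();

    /** Returns the index of the tag, adding it first if it is not yet known. */
    int addTag(String tag) {
        Integer existing = indices.get(tag);
        if (existing != null) {
            return existing;
        }
        tags.add(tag);
        int index = tags.size() - 1; // assumption: indices follow insertion order from 0
        indices.put(tag, index);
        return index;
    }

    /** Returns the index of the tag if known, -1 otherwise. */
    int getTagIndex(String tag) {
        return indices.getOrDefault(tag, -1);
    }

    /** Returns the number of distinct tags seen so far. */
    int numTags() {
        return tags.size();
    }

    /** Returns the tag stored at the given index. */
    String getTag(int index) {
        return tags.get(index);
    }

    public static void main(String[] args) {
        TagIndexSketch index = new TagIndexSketch();
        int nn = index.addTag("NN");
        System.out.println(index.addTag("NN") == nn);   // prints true: re-adding keeps the index
        System.out.println(index.getTagIndex("VB"));    // prints -1: VB was never added
    }
}
```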
  - chooseTokenizerFactory
```
protected TokenizerFactory<? extends HasWord> chooseTokenizerFactory()
```
    Figures out what tokenizer factory might be described by the config. If it's described by name in the config, uses reflection to get the factory (which may cause an exception, of course...)
  - chooseTokenizerFactory
```
protected static TokenizerFactory<? extends HasWord> chooseTokenizerFactory(boolean tokenize,
                                                                            java.lang.String tokenizerFactory,
                                                                            java.lang.String tokenizerOptions,
                                                                            boolean invertible)
```
  - saveModel
```
protected void saveModel(java.lang.String filename)
```
  - saveModel
```
protected void saveModel(java.io.DataOutputStream file)
                  throws java.io.IOException
```
    Throws:
    
    java.io.IOException
  - readModelAndInit
```
protected void readModelAndInit(java.util.Properties config,
                                java.lang.String modelFileOrUrl,
                                boolean printLoading)
```
    This reads the complete tagger from a single model stored in a file, at a URL, or as a resource in a jar file, and initializes the tagger using a combination of the properties passed in and parameters from the file.
    Note for the future: This assumes that the TaggerConfig in the file has already been read and used. This work is done inside the constructor of TaggerConfig. It might be better to refactor things so that is all done inside this method, but for the moment it seemed better to leave working code alone [cdm 2008].
    
    Parameters:
    
    config - The tagger config
    
    modelFileOrUrl - The name of the model file. This routine opens and closes it.
    
    printLoading - Whether to print a message saying what model file is being loaded and how long it took when finished.
    
    Throws:
    
    RuntimeIOException - if I/O errors or serialization errors
  - readModelAndInit
```
protected void readModelAndInit(java.util.Properties config,
                                java.io.InputStream modelStream,
                                boolean printLoading)
```
    This reads the complete tagger from a single model provided as an InputStream, and initializes the tagger using a combination of the properties passed in and parameters from the file.
    Note for the future: This assumes that the TaggerConfig in the file has already been read and used. This work is done inside the constructor of TaggerConfig. It might be better to refactor things so that is all done inside this method, but for the moment it seemed better to leave working code alone [cdm 2008].
    
    Parameters:
    
    config - The tagger config
    
    modelStream - The model provided as an InputStream
    
    printLoading - Whether to print a message saying what model file is being loaded and how long it took when finished.
    
    Throws:
    
    RuntimeIOException - if I/O errors or serialization errors
  - readModelAndInit
```
protected void readModelAndInit(java.util.Properties config,
                                java.io.DataInputStream rf,
                                boolean printLoading)
```
    This reads the complete tagger from a single model file, and inits the tagger using a combination of the properties passed in and parameters from the file.
    Note for the future: This assumes that the TaggerConfig in the file has already been read and used. It might be better to refactor things so that is all done inside this method, but for the moment it seemed better to leave working code alone [cdm 2008].
    
    Parameters:
    
    config - The tagger config
    
    rf - DataInputStream to read from. It's the caller's job to open and close this stream.
    
    printLoading - Whether to print a message saying what model file is being loaded and how long it took when finished.
    
    Throws:
    
    RuntimeIOException - if I/O errors or serialization errors
  - tagTokenizedString
```
public java.lang.String tagTokenizedString(java.lang.String toTag)
```
    Tags the tokenized input string and returns the tagged version. This method requires the input to already be tokenized. The tagger wants input that is whitespace separated tokens, tokenized according to the conventions of the training data. (For instance, for the Penn Treebank, punctuation marks and possessive "'s" should be separated from words.)
    
    Parameters:
    
    toTag - The untagged input String
    
    Returns:
    
    The same string with tags inserted in the form word/tag
  - tagString
```
public java.lang.String tagString(java.lang.String toTag)
```
    Tags the input string and returns the tagged version. This method tokenizes the input into words in perhaps multiple sentences and then tags those sentences. The default (PTB English) tokenizer is used.
    
    Parameters:
    
    toTag - The untagged input String
    
    Returns:
    
    A String of sentences with tags inserted in the form word/tag
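    Both tagTokenizedString and tagString return text in the form word/tag, so callers often need to split that text back into pairs. The following is a minimal sketch of such a parser; the class `TaggedOutputSketch` is hypothetical, the sample tags are illustrative rather than actual tagger output, and splitting on the last separator assumes that tags themselves never contain the separator character.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits "word/tag word/tag ..." text into word and tag pairs. */
final class TaggedOutputSketch {

    /** Returns one {word, tag} array per whitespace-separated token. */
    static List<String[]> parse(String tagged, char tagSeparator) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            // Split on the last separator so that words such as "1/2" stay intact.
            int cut = token.lastIndexOf(tagSeparator);
            if (cut < 0) {
                throw new IllegalArgumentException("No tag separator in token: " + token);
            }
            pairs.add(new String[] {token.substring(0, cut), token.substring(cut + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Illustrative input only; real tags depend on the model in use.
        String sample = "Here/RB 's/VBZ a/DT tagged/VBN string/NN ./.";
        for (String[] pair : parse(sample, '/')) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

    If a different tagSeparator was configured, pass that character instead of '/'.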
  - apply
```
public java.util.List<TaggedWord> apply(java.util.List<? extends HasWord> in)
```
    Expects a sentence and returns a tagged sentence.
    
    Specified by:
    
    apply in interface java.util.function.Function<java.util.List<? extends HasWord>,java.util.List<TaggedWord>>
    
    Specified by:
    
    apply in class Tagger
    
    Parameters:
    
    in - This needs to be a sentence (List of words)
    
    Returns:
    
    A sentence of TaggedWord
  - process
```
public java.util.List<java.util.List<TaggedWord>> process(java.util.List<? extends java.util.List<? extends HasWord>> sentences)
```
    Tags the Words in each Sentence in the given List with their grammatical part-of-speech. The returned List contains Sentences consisting of TaggedWords.
    NOTE: The input document must contain sentences as its elements, not words. To turn a Document of words into a Document of sentences, run it through WordToSentenceProcessor.
    
    Specified by:
    
    process in interface ListProcessor<java.util.List<? extends HasWord>,java.util.List<TaggedWord>>
    
    Parameters:
    
    sentences - A List of Sentence
    
    Returns:
    
    A List of Sentence of TaggedWord
  - tagSentence
```
public java.util.List<TaggedWord> tagSentence(java.util.List<? extends HasWord> sentence)
```
    Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. This is a convenience method for tagging a single List rather than a Document of sentences.
    
    Parameters:
    
    sentence - sentence to tag
    
    Returns:
    
    tagged sentence
  - tagSentence
```
public java.util.List<TaggedWord> tagSentence(java.util.List<? extends HasWord> sentence,
                                              boolean reuseTags)
```
    Returns a new Sentence that is a copy of the given sentence with all the words tagged with their part-of-speech. This is a convenience method for tagging a single List rather than a List of Lists. If you supply tagSentence with a List of HasTag and set reuseTags to true, the tagger will reuse the supplied tags.
    
    Parameters:
    
    sentence - sentence to tag
    
    reuseTags - whether to reuse the tags already supplied with the sentence
    
    Returns:
    
    tagged sentence
  - tagCoreLabels
```
public void tagCoreLabels(java.util.List<CoreLabel> sentence)
```
    Takes a sentence composed of CoreLabels and adds the tags to the CoreLabels, modifying the input sentence.
  - tagCoreLabels
```
public void tagCoreLabels(java.util.List<CoreLabel> sentence,
                          boolean reuseTags)
```
    Takes a sentence composed of CoreLabels and adds the tags to the CoreLabels, modifying the input sentence. If reuseTags is set to true, any tags supplied with the CoreLabels are taken as correct.
  - lemmatize
```
public static void lemmatize(java.util.List<CoreLabel> sentence,
                             Morphology morpha)
```
    Adds lemmas to the given list of CoreLabels, using the given Morphology object. The input list must already have tags set.
  - tokenizeText
```
public static java.util.List<java.util.List<HasWord>> tokenizeText(java.io.Reader r)
```
    Reads data from r, tokenizes it with the default (Penn Treebank) tokenizer, and returns a List of sentences (each a List of HasWord). Each sentence can then be passed to tagSentence.
    
    Parameters:
    
    r - Reader where untokenized text is read
    
    Returns:
    
    List of tokenized sentences
  - tokenizeText
```
public static java.util.List<java.util.List<HasWord>> tokenizeText(java.io.Reader r,
                                                                   TokenizerFactory<? extends HasWord> tokenizerFactory)
```
    Reads data from r, tokenizes it with the given tokenizer, and returns a List of sentences (each a List of objects extending HasWord). Each sentence can then be passed to tagSentence.
    
    Parameters:
    
    r - Reader where untokenized text is read
    
    tokenizerFactory - The tokenizer factory to use. This can be null, in which case the default English tokenizer (PTBTokenizerFactory) is used.
    
    Returns:
    
    List of tokenized sentences
  - tagFromXML
```
public void tagFromXML(java.io.InputStream input,
                       java.io.Writer writer,
                       java.lang.String... xmlTags)
```
    Uses an XML transformer to read the input stream and write the tagged result to the writer. Tags all of the text between xmlTags. The difference between using this and using runTagger in XML mode is that this preserves the XML structure outside of the list of elements to tag, whereas the runTagger method throws away all of the surrounding structure and returns tagged plain text.
  - tagFromXML
```
public void tagFromXML(java.io.Reader input,
                       java.io.Writer writer,
                       java.lang.String... xmlTags)
```
  - runTaggerStdin
```
public void runTaggerStdin(java.io.BufferedReader reader,
                           java.io.BufferedWriter writer,
                           PlainTextDocumentReaderAndWriter.OutputStyle outputStyle)
                    throws java.io.IOException
```
    Throws:
    
    java.io.IOException
  - runTaggerSGML
```
public void runTaggerSGML(java.io.BufferedReader reader,
                          java.io.BufferedWriter writer,
                          PlainTextDocumentReaderAndWriter.OutputStyle outputStyle)
                   throws java.io.IOException
```
    Throws:
    
    java.io.IOException
  - runTagger
```
public <X extends HasWord> void runTagger(java.lang.Iterable<java.util.List<X>> document,
                                          java.io.BufferedWriter writer,
                                          PlainTextDocumentReaderAndWriter.OutputStyle outputStyle)
                                   throws java.io.IOException
```
    Throws:
    
    java.io.IOException
  - runTagger
```
public void runTagger(java.io.BufferedReader reader,
                      java.io.BufferedWriter writer,
                      java.lang.String tagInside,
                      PlainTextDocumentReaderAndWriter.OutputStyle outputStyle)
               throws java.io.IOException
```
    This method runs the tagger on the provided reader and writer. It takes input from the given reader, applies the tagger to it one sentence at a time (determined using documentPreprocessor), and writes the output to the given writer. The document is broken into sentences using the sentence processor determined in the tagger's TaggerConfig. tagInside makes the tagger run in XML mode: if it is non-empty, then instead of processing the document as one large text blob, the tagger treats each region enclosed by the given tag as a separate text blob.
    
    Throws:
    
    java.io.IOException
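    
    To make the idea of per-region processing concrete, the sketch below pulls out the text enclosed by a given element name, so that each region could be handled as its own text blob. This is not how runTagger is implemented. It is a regular-expression illustration rather than a real XML parser, the class name TagRegionExtractor is hypothetical, and nested elements of the same name are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Stanford NLP.
final class TagRegionExtractor {
  // Returns the text found inside each <tagInside ...>...</tagInside> element.
  static List<String> regionsInside(String document, String tagInside) {
    String name = Pattern.quote(tagInside);
    // Opening tag with optional attributes, lazily matched body, closing tag.
    Pattern region = Pattern.compile(
        "<" + name + "(?:\\s[^>]*)?>(.*?)</" + name + "\\s*>",
        Pattern.DOTALL);
    List<String> regions = new ArrayList<>();
    Matcher m = region.matcher(document);
    while (m.find()) {
      regions.add(m.group(1)); // text between the opening and closing tag
    }
    return regions;
  }
}
```

    Given the document `<doc><p>One.</p><h>skip</h><p id="x">Two.</p></doc>` and the tag name p, the sketch returns the two regions "One." and "Two." and ignores everything outside them, which mirrors the "separate text blob" behaviour described above.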
  - tagCoreLabelsOrHasWords
```
public java.util.List<? extends HasWord> tagCoreLabelsOrHasWords(java.util.List<? extends HasWord> sentence,
                                                                 Morphology morpha,
                                                                 boolean outputLemmas)
```
  - tagAndOutputSentence
```
public void tagAndOutputSentence(java.util.List<? extends HasWord> sentence,
                                 boolean outputLemmas,
                                 Morphology morpha,
                                 PlainTextDocumentReaderAndWriter.OutputStyle outputStyle,
                                 boolean outputVerbosity,
                                 int numSentences,
                                 java.lang.String separator,
                                 java.io.Writer writer)
```
  - outputTaggedSentence
```
public void outputTaggedSentence(java.util.List<? extends HasWord> sentence,
                                 boolean outputLemmas,
                                 PlainTextDocumentReaderAndWriter.OutputStyle outputStyle,
                                 boolean outputVerbosity,
                                 int numSentences,
                                 java.lang.String separator,
                                 java.io.Writer writer)
```
  - main
```
public static void main(java.lang.String[] args)
                 throws java.lang.Exception
```
    Command-line tagger interface. Can be used to train or test taggers, or to tag text, taking input from stdin or a file. See class documentation for usage.
    
    Parameters:
    
    args - Command-line arguments
    
    Throws:
    
    java.io.IOException - If any file problems
    
    java.lang.Exception
