public class DocumentPreprocessor extends java.lang.Object implements java.lang.Iterable<java.util.List<HasWord>>
Tokenization: The default tokenizer is PTBTokenizer. If null is passed to setTokenizerFactory, then whitespace tokenization is assumed.
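As a rough illustration of what whitespace tokenization means here, the following stdlib-only sketch splits a line on runs of whitespace. The class and method names are hypothetical and this is not the DocumentPreprocessor implementation; it only shows the assumed behavior for input that is already tokenized.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; not part of the Stanford library.
final class WhitespaceTokenizeSketch {

    // Splits already-tokenized text on runs of whitespace and drops empty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("  The cat  sat ")); // prints [The, cat, sat]
    }
}
```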
Adding a new document type requires two steps:
NOTES: This document preprocessor is principally used in the Stanford Parser (and also a bit in the POS tagger). It is not used by CoreNLP. This implementation should not use external libraries since it is used in the parser.
Modifier and Type | Class and Description |
---|---|
static class | DocumentPreprocessor.DocType |
Constructor and Description |
---|
DocumentPreprocessor(java.io.Reader input): Constructs a preprocessor from an existing input stream. |
DocumentPreprocessor(java.io.Reader input, DocumentPreprocessor.DocType t) |
DocumentPreprocessor(java.lang.String docPath) |
DocumentPreprocessor(java.lang.String docPath, DocumentPreprocessor.DocType t) |
DocumentPreprocessor(java.lang.String docPath, DocumentPreprocessor.DocType t, java.lang.String encoding): Constructs a preprocessor from a file at a path, which can be a filesystem location, a classpath entry, or a URL. |
Modifier and Type | Method and Description |
---|---|
java.util.Iterator<java.util.List<HasWord>> | iterator(): Returns sentences until the document is exhausted. |
static void | main(java.lang.String[] args): A simple, deterministic sentence-splitter. |
void | setElementDelimiter(java.lang.String s): Only read text from inside these XML elements if in XML mode. |
void | setEscaper(java.util.function.Function<java.util.List<HasWord>,java.util.List<HasWord>> e): Set an escaper. |
void | setKeepEmptySentences(boolean keepEmptySentences): Set whether or not the tokenizer keeps empty sentences in whitespace mode. |
void | setSentenceDelimiter(java.lang.String s): Make the processor assume that the document is already delimited by the supplied parameter. |
void | setSentenceFinalPuncWords(java.lang.String[] sentenceFinalPuncWords): Sets the end-of-sentence delimiters. |
void | setTagDelimiter(java.lang.String s): Split tags from tokens. |
void | setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory): Sets the factory from which to produce a Tokenizer. |
public DocumentPreprocessor(java.io.Reader input)
input - An existing reader
public DocumentPreprocessor(java.io.Reader input, DocumentPreprocessor.DocType t)
public DocumentPreprocessor(java.lang.String docPath)
public DocumentPreprocessor(java.lang.String docPath, DocumentPreprocessor.DocType t)
public DocumentPreprocessor(java.lang.String docPath, DocumentPreprocessor.DocType t, java.lang.String encoding)
docPath - The path
encoding - The character encoding used by Readers
public void setKeepEmptySentences(boolean keepEmptySentences)
public void setSentenceFinalPuncWords(java.lang.String[] sentenceFinalPuncWords)
For newline tokenization, use the argument {"\n"}.
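To make the role of sentence-final words concrete, here is a minimal stdlib-only sketch that closes a sentence whenever a token matches one of the supplied words. The class and method names are hypothetical, and the sketch keeps the delimiter token inside the sentence, which is an assumption; this page does not say whether DocumentPreprocessor keeps or drops it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in; not part of the Stanford library.
final class SentenceFinalSketch {

    // Groups tokens into sentences, ending a sentence at each sentence-final word.
    static List<List<String>> split(List<String> tokens, String[] sentenceFinalPuncWords) {
        Set<String> finals = new HashSet<>(Arrays.asList(sentenceFinalPuncWords));
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token); // assumption: the delimiter stays in the sentence
            if (finals.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens with no final punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Hi", ".", "Bye", "!", "ok");
        System.out.println(split(tokens, new String[] {".", "!"}));
        // prints [[Hi, .], [Bye, !], [ok]]
    }
}
```

Passing `new String[] {"\n"}` to the same sketch would end a sentence at every newline token, which mirrors the newline case described above, provided the tokenizer emits newlines as tokens.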
sentenceFinalPuncWords - An array of words that count as sentence final punctuation.
public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.
NOTE: If a null argument is used, then the document is assumed to be tokenized and DocumentPreprocessor performs no tokenization.
public void setEscaper(java.util.function.Function<java.util.List<HasWord>,java.util.List<HasWord>> e)
e - The escaper
public void setSentenceDelimiter(java.lang.String s)
s - The sentence delimiter
public void setTagDelimiter(java.lang.String s)
Note that for strings that contain two or more instances of the tag delimiter, the last instance is treated as the split point.
The tag delimiter should not contain any characters that must be escaped in a Java regex.
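The last-instance rule can be sketched with plain string search. The helper below is hypothetical and is not how DocumentPreprocessor is implemented; the note about regex escaping suggests the real class builds a regular expression from the delimiter, whereas this sketch uses `lastIndexOf` to avoid that concern.

```java
// Hypothetical stand-in; not part of the Stanford library.
final class TagSplitSketch {

    // Returns {word, tag}, splitting at the LAST occurrence of the delimiter.
    // If the delimiter is absent, the tag slot is null.
    static String[] splitAtLastDelimiter(String token, String delimiter) {
        int idx = token.lastIndexOf(delimiter);
        if (idx < 0) {
            return new String[] {token, null};
        }
        String word = token.substring(0, idx);
        String tag = token.substring(idx + delimiter.length());
        return new String[] {word, tag};
    }

    public static void main(String[] args) {
        String[] parts = splitAtLastDelimiter("1/2/CD", "/");
        System.out.println(parts[0] + " -> " + parts[1]); // prints 1/2 -> CD
    }
}
```

With the delimiter `/`, the token `1/2/CD` yields the word `1/2` and the tag `CD`, because only the final `/` is treated as the split point.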
s - POS tag delimiter
public void setElementDelimiter(java.lang.String s)
public java.util.Iterator<java.util.List<HasWord>> iterator()
iterator in interface java.lang.Iterable<java.util.List<HasWord>>
Although each sentence is typed as List<HasWord>, in practice you get a List of CoreLabel, and you can cast down to that. (Someday we might manage to fix the generic typing....)
public static void main(java.lang.String[] args) throws java.io.IOException
args - Command-line arguments
java.io.IOException