edu.stanford.nlp.process
Class DocumentPreprocessor

java.lang.Object
  extended by edu.stanford.nlp.process.DocumentPreprocessor
All Implemented Interfaces:
Iterable<List<HasWord>>

public class DocumentPreprocessor
extends Object
implements Iterable<List<HasWord>>

Produces a list of sentences from either a plain text or XML document.

Tokenization: The default tokenizer is PTBTokenizer. If null is passed to setTokenizerFactory, the document is assumed to be already tokenized, with tokens separated by whitespace.
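
As a rough illustration of what whitespace tokenization means here, the sketch below splits a pre-tokenized line on runs of whitespace. The class WhitespaceTokenSketch and its tokenize method are hypothetical stand-ins that use plain strings. They are not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not DocumentPreprocessor's own code:
// treats the input as already tokenized and splits on whitespace.
class WhitespaceTokenSketch {
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        // trim() avoids a leading empty piece; "\\s+" collapses runs of spaces.
        for (String piece : line.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

Under this assumption, punctuation becomes its own token only if the input already separates it with whitespace.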

Adding a new document type requires two steps:

  1. Add a new DocType.
  2. Create an iterator for the new DocType and modify the iterator() function to return the new iterator.
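
The two steps above can be sketched in miniature as follows. Everything in this block is illustrative: DocTypeDispatchSketch, the Lines constant, and the helper methods are hypothetical and operate on plain strings. The real class and its iterators may be organized differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical miniature of the extension pattern described above.
class DocTypeDispatchSketch implements Iterable<List<String>> {
    // Step 1: add a constant for the new document type (Lines is made up).
    enum DocType { Plain, Lines }

    private final String text;
    private final DocType type;

    DocTypeDispatchSketch(String text, DocType type) {
        this.text = text;
        this.type = type;
    }

    private static List<String> words(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    // Existing behavior in this sketch: the whole text is one sentence.
    private Iterator<List<String>> plainIterator() {
        List<List<String>> sentences = new ArrayList<>();
        sentences.add(words(text));
        return sentences.iterator();
    }

    // Step 2a: an iterator for the new type, one sentence per non-blank line.
    private Iterator<List<String>> lineIterator() {
        List<List<String>> sentences = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (!line.trim().isEmpty()) {
                sentences.add(words(line));
            }
        }
        return sentences.iterator();
    }

    // Step 2b: iterator() returns the iterator that matches the type.
    @Override
    public Iterator<List<String>> iterator() {
        switch (type) {
            case Lines:
                return lineIterator();
            default:
                return plainIterator();
        }
    }
}
```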

NOTE: This implementation should not use external libraries since it is used in the parser.

Author:
Spence Green

Nested Class Summary
static class DocumentPreprocessor.DocType
           
 
Constructor Summary
DocumentPreprocessor(Reader input)
          Constructs a preprocessor from an existing input stream.
DocumentPreprocessor(Reader input, DocumentPreprocessor.DocType t)
          Constructs a preprocessor from an existing input stream, read as the given document type.
DocumentPreprocessor(String docPath)
          Constructs a preprocessor from a file at a path, which can be either a filesystem location or a URL.
DocumentPreprocessor(String docPath, DocumentPreprocessor.DocType t)
          Constructs a preprocessor from a file at a path, read as the given document type.
 
Method Summary
 Iterator<List<HasWord>> iterator()
          Returns sentences until the document is exhausted.
static void main(String[] args)
          This provides a simple test method for DocumentPreprocessor.
 void setElementDelimiter(String s)
          In XML mode, only reads text from between these XML tags.
 void setEncoding(String encoding)
          Set the character encoding.
 void setEscaper(Function<List<HasWord>,List<HasWord>> e)
          Set an escaper.
 void setSentenceDelimiter(String s)
          Makes the processor assume that the sentences in the document are already delimited by the supplied string.
 void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
          Sets the end-of-sentence delimiters.
 void setTagDelimiter(String s)
          Split tags from tokens.
 void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
          Sets the factory from which to produce a Tokenizer.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DocumentPreprocessor

public DocumentPreprocessor(Reader input)
Constructs a preprocessor from an existing input stream.

Parameters:
input - An existing reader

DocumentPreprocessor

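public DocumentPreprocessor(Reader input,
                            DocumentPreprocessor.DocType t)
Constructs a preprocessor from an existing input stream, read as the given document type.

Parameters:
input - An existing reader
t - The document type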

DocumentPreprocessor

public DocumentPreprocessor(String docPath)
Constructs a preprocessor from a file at a path, which can be either a filesystem location or a URL.

Parameters:
docPath - The path to the document, either a filesystem location or a URL

DocumentPreprocessor

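public DocumentPreprocessor(String docPath,
                            DocumentPreprocessor.DocType t)
Constructs a preprocessor from a file at a path, read as the given document type. The path can be either a filesystem location or a URL.

Parameters:
docPath - The path to the document, either a filesystem location or a URL
t - The document type

Method Detail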

setEncoding

public void setEncoding(String encoding)
                 throws IllegalCharsetNameException
Set the character encoding.

Parameters:
encoding - The character encoding used by Readers
Throws:
IllegalCharsetNameException - If the JVM does not support the named character set.
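
A caller may want to check an encoding name before passing it in. The sketch below uses only the standard java.nio.charset API, where Charset.isSupported returns false for a legal but unavailable name and throws IllegalCharsetNameException for a syntactically illegal one. EncodingCheckSketch and describe are hypothetical names, and how setEncoding itself validates its argument is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for validating an encoding name up front.
class EncodingCheckSketch {
    static String describe(String encoding) {
        try {
            // true: the JVM has this charset; false: legal name, no such charset.
            return Charset.isSupported(encoding) ? "supported" : "unsupported";
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not allowed in charset names.
            return "illegal name";
        }
    }
}
```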

setSentenceFinalPuncWords

public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
Sets the end-of-sentence delimiters.

For newline tokenization, use the argument {"\n"}.
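
To make the role of these delimiters concrete, the sketch below groups a flat token list into sentences, closing a sentence whenever a token equals one of the supplied words. It is a simplified model over plain strings under the assumption that delimiters are compared to whole tokens. SentenceSplitSketch is a hypothetical name, and the sketch keeps the delimiter token in the sentence, which the library may or may not do for "\n".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of sentence splitting on sentence-final tokens.
class SentenceSplitSketch {
    static List<List<String>> split(List<String> tokens, String[] sentenceFinalPuncWords) {
        Set<String> delimiters = new HashSet<>(Arrays.asList(sentenceFinalPuncWords));
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            current.add(token);
            // A delimiter token closes the current sentence.
            if (delimiters.contains(token)) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        // Trailing tokens without a final delimiter form the last sentence.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

Passing new String[] {"\n"} to this sketch yields one sentence per line, provided newlines survive as tokens.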

Parameters:
sentenceFinalPuncWords - The tokens that mark the end of a sentence

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory<? extends HasWord> newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.

NOTE: If a null argument is used, then the document is assumed to be tokenized and DocumentPreprocessor performs no tokenization.


setEscaper

public void setEscaper(Function<List<HasWord>,List<HasWord>> e)
Set an escaper.

Parameters:
e - The escaper
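
An escaper maps one sentence to another. As one plausible example, the sketch below replaces round brackets with the Penn Treebank tokens -LRB- and -RRB-. It uses java.util.function.Function over plain strings as a stand-in, which is an assumption: the Function and HasWord types in the signature above belong to the library and are not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical escaper over plain strings; the real method takes
// a function over List<HasWord>.
class EscaperSketch {
    static final Function<List<String>, List<String>> BRACKET_ESCAPER = sentence -> {
        List<String> out = new ArrayList<>(sentence.size());
        for (String word : sentence) {
            if (word.equals("(")) {
                out.add("-LRB-");   // Penn Treebank left round bracket
            } else if (word.equals(")")) {
                out.add("-RRB-");   // Penn Treebank right round bracket
            } else {
                out.add(word);
            }
        }
        return out;
    };
}
```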

setSentenceDelimiter

public void setSentenceDelimiter(String s)
Makes the processor assume that the sentences in the document are already delimited by the supplied string.

Parameters:
s - The sentence delimiter

setTagDelimiter

public void setTagDelimiter(String s)
Splits tags from tokens at the given delimiter. The tag is placed in the TagAnnotation of the returned label.

Note that for strings that contain two or more instances of the tag delimiter, the last instance is treated as the split point.

The tag delimiter should not contain any characters that must be escaped in a Java regex.
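
The last-instance rule can be illustrated with a small sketch. TagSplitSketch is a hypothetical helper that returns the word and tag as plain strings and uses lastIndexOf rather than a regex, so it only models the split point, not how the library stores the tag.

```java
// Hypothetical model of splitting "word/TAG" at the last delimiter.
class TagSplitSketch {
    // Returns {word, tag}; tag is null when the delimiter is absent.
    static String[] split(String token, String delimiter) {
        int idx = token.lastIndexOf(delimiter);
        if (idx < 0) {
            return new String[] { token, null };
        }
        String word = token.substring(0, idx);
        String tag = token.substring(idx + delimiter.length());
        return new String[] { word, tag };
    }
}
```

With the delimiter "/", the token 1/2/CD splits into the word 1/2 and the tag CD.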

Parameters:
s - POS tag delimiter

setElementDelimiter

public void setElementDelimiter(String s)
In XML mode, only reads text from between these XML tags. Otherwise, text is read from all tokens.


iterator

public Iterator<List<HasWord>> iterator()
Returns sentences until the document is exhausted. The underlying stream is closed with close() once the end of the document is reached. If iteration stops before that point, the caller must close the stream.

Specified by:
iterator in interface Iterable<List<HasWord>>

main

public static void main(String[] args)
This provides a simple test method for DocumentPreprocessor.
Usage: java edu.stanford.nlp.process.DocumentPreprocessor -file filename [-xml tag] [-suppressEscaping] [-noTokenization]

A filename is required. The code does not currently run as a filter, so it cannot read from standard input.

tag is the name of the XML element from which to extract text. It can be a regular expression, which is applied to the element name with the matches() method, for example 'TITLE|P'.
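
Because matches() in java.util.regex requires the whole input to match, a pattern such as 'TITLE|P' selects elements named exactly TITLE or P. The sketch below demonstrates this with the standard Pattern.matches call. ElementMatchSketch and selects are hypothetical names, not part of this class.

```java
import java.util.regex.Pattern;

// Hypothetical check of an element name against the -xml tag pattern.
class ElementMatchSketch {
    static boolean selects(String tagPattern, String elementName) {
        // Pattern.matches succeeds only if the entire name matches.
        return Pattern.matches(tagPattern, elementName);
    }
}
```

Under this full-match behavior, an element named PRE or SUBTITLE is not selected by 'TITLE|P'.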

Parameters:
args - Command-line arguments


Stanford NLP Group