WordToSentenceProcessor (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.process.WordToSentenceProcessor<IN>

Type Parameters:

IN - The type of the tokens in the sentences

All Implemented Interfaces:

ListProcessor<IN,java.util.List<IN>>
```
public class WordToSentenceProcessor<IN>
extends java.lang.Object
implements ListProcessor<IN,java.util.List<IN>>
```
Transforms a List of words into a List of Lists of words (that is, a List of sentences), by grouping the words. The word stream is assumed to already be adequately tokenized, and this class just divides the List into sentences, perhaps discarding some separator tokens as it goes.
The main behavior is to look for sentence ending tokens like "." or "?!?", and to split after them and any following sentence closers like ")". Overlaid on this is an overall choice of state: The WordToSentenceProcessor can be a non-splitter, which always returns one sentence. Otherwise, the WordToSentenceProcessor will also split based on paragraphs using one of these three states: (1) Ignore line breaks in splitting sentences, (2) Treat each line as a separate paragraph, or (3) Treat two consecutive line breaks as marking the end of a paragraph. The details of sentence breaking within paragraphs are controlled by the following four variables:
- sentenceBoundaryTokens are tokens that are left in a sentence, but are to be regarded as ending a sentence. A canonical example is a period. If two of these follow each other, the second will be a sentence consisting of only the sentenceBoundaryToken.
- sentenceBoundaryFollowers are tokens that are left in a sentence, and which can follow a sentenceBoundaryToken while still belonging to the previous sentence. They cannot begin a sentence (except at the beginning of a document). A canonical example is a close parenthesis ')'. The default (English) set is in DEFAULT_BOUNDARY_FOLLOWERS_REGEX.
- sentenceBoundaryToDiscard are tokens which separate sentences and which should be thrown away. In web documents, a typical example would be a '<p>' tag. If two of these follow each other, they are coalesced: no empty Sentence is output. The end-of-file is not represented in this Set, but the code behaves as if it were a member.
- regionElementRegex is a regular expression for the names of elements that contain a sentence region. Only tokens inside such elements are included in sentences. The start and end tags themselves are not included in the sentence.
Instances of this class are now immutable. ☺
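The grouping rules described above can be illustrated with a small standalone sketch. It is not the library's implementation: the class name `SentenceGroupingSketch`, the method `split`, and the use of plain `String` tokens are assumptions made for illustration, and it ignores newline strategies, region filtering, and multi-token patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative only: a simplified take on the grouping rules, not the library's code. */
final class SentenceGroupingSketch {

    static List<List<String>> split(List<String> tokens,
                                    Pattern boundaryToken,
                                    Pattern boundaryFollower,
                                    Set<String> boundaryToDiscard) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        // True once a boundary token has been seen in the current sentence.
        boolean seenBoundary = false;

        for (String tok : tokens) {
            if (boundaryToDiscard.contains(tok)) {
                // Separator: close the current sentence if it has content, drop the token.
                // Consecutive separators coalesce because an empty sentence is never emitted.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                seenBoundary = false;
                continue;
            }
            if (seenBoundary && !boundaryFollower.matcher(tok).matches()) {
                // Anything other than a follower starts a new sentence after a boundary.
                sentences.add(current);
                current = new ArrayList<>();
                seenBoundary = false;
            }
            current.add(tok);
            if (boundaryToken.matcher(tok).matches()) {
                seenBoundary = true;
            }
        }
        // End of input behaves like a discarded separator.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

In this sketch a second consecutive boundary token becomes a sentence of its own, a follower such as ")" stays attached to the preceding sentence, and repeated separators produce no empty sentence, mirroring the three bullet points on boundary tokens, followers, and discarded separators.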
Author:

Joseph Smarr (jsmarr@stanford.edu), Christopher Manning, Teg Grenager (grenager@stanford.edu), Sarah Spikes (sdspikes@cs.stanford.edu) (Templatization)

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`WordToSentenceProcessor.NewlineIsSentenceBreak`

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`DEFAULT_BOUNDARY_FOLLOWERS_REGEX` Pe = Close_Punctuation (close brackets), Pf = Final_Punctuation (close quotes); add straight quotes, PTB escaped right brackets (-RRB-, etc.), greater than as close angle bracket, and those forms in full width range.
`static java.lang.String`	`DEFAULT_BOUNDARY_REGEX` Default pattern for sentence ending punctuation.
`static java.util.Set<java.lang.String>`	`DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD`

Constructor Summary

Constructors
Constructor and Description
`WordToSentenceProcessor()` Create a `WordToSentenceProcessor` using a sensible default list of tokens for sentence ending for English/Latin writing systems.
`WordToSentenceProcessor(boolean isOneSentence)` Create a `WordToSentenceProcessor` which never breaks the input into multiple sentences.
`WordToSentenceProcessor(java.util.Set<java.lang.String> boundaryToDiscard)` Set the set of Strings that will mark the end of a sentence, and which will be discarded after doing so.
`WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.lang.String boundaryFollowersRegex, java.util.Set<java.lang.String> boundariesToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, java.lang.String regionElementRegex, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern, java.util.Set<java.lang.String> tokenRegexesToDiscard, boolean isOneSentence, boolean allowEmptySentences)` Configure all parameters for converting a list of tokens into sentences.
`WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.lang.String boundaryFollowersRegex, java.util.Set<java.lang.String> boundaryToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern, java.util.Set<java.lang.String> tokenRegexesToDiscard)` Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens.
`WordToSentenceProcessor(java.lang.String boundaryTokenRegex, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, boolean isOneSentence)` Create a basic `WordToSentenceProcessor` specifying just a few top-level options.
`WordToSentenceProcessor(WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak)` Create a `WordToSentenceProcessor` using a sensible default list of tokens for sentence ending for English/Latin writing systems.

Method Summary

Modifier and Type	Method and Description
`java.util.List<java.util.List<IN>>`	`process(java.util.List<? extends IN> words)` Returns a List of Lists where each element is built from a run of Words in the input Document.
`<L,F> Document<L,F,java.util.List<IN>>`	`processDocument(Document<L,F,IN> in)`
`static WordToSentenceProcessor.NewlineIsSentenceBreak`	`stringToNewlineIsSentenceBreak(java.lang.String name)`

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - DEFAULT_BOUNDARY_REGEX
```
public static final java.lang.String DEFAULT_BOUNDARY_REGEX
```
    Default pattern for sentence ending punctuation. Now Chinese-friendly as well as English.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_BOUNDARY_FOLLOWERS_REGEX
```
public static final java.lang.String DEFAULT_BOUNDARY_FOLLOWERS_REGEX
```
    Pe = Close_Punctuation (close brackets), Pf = Final_Punctuation (close quotes); add straight quotes, PTB escaped right brackets (-RRB-, etc.), greater than as close angle bracket, and those forms in full width range.
    
    See Also:
    
    Constant Field Values
  - DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD
```
public static final java.util.Set<java.lang.String> DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD
```
- Constructor Detail
  - WordToSentenceProcessor
```
public WordToSentenceProcessor()
```
    Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. The default set is: {".","?","!"} and any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!. A sequence of two or more consecutive line breaks is taken as a paragraph break, which also splits sentences. This is the usual constructor for sentence-breaking reasonable text that uses hard line breaks, where two consecutive line breaks (that is, a blank line) indicate a paragraph break.
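    The default English set described here can be approximated by a single regular expression. The pattern below is a sketch written from that description only; it is not copied from DEFAULT_BOUNDARY_REGEX (which is also described as Chinese-friendly), and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Illustrative only: approximates the described English default boundary set. */
final class DefaultBoundarySketch {

    // A single period, or any run made up of '!' and '?' characters.
    static final Pattern BOUNDARY = Pattern.compile("\\.|[!?]+");

    /** Returns true if the whole token is a sentence boundary under this sketch. */
    static boolean isBoundary(String token) {
        return BOUNDARY.matcher(token).matches();
    }
}
```

    Because `matches()` requires the whole token to match, this only works when the tokenizer has already made sentence-ending punctuation its own token.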
  - WordToSentenceProcessor
```
public WordToSentenceProcessor(WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak)
```
    Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. The default set is: {".","?","!"} and any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!. You can specify the treatment of newlines as sentence breaks as one of ignored, every newline is a sentence break, or only two or more consecutive newlines are a sentence break.
    
    Parameters:
    
    newlineIsSentenceBreak - Strategy for treating newlines as paragraph breaks.
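    The three newline treatments can be sketched as follows. This is a standalone illustration, not the library's code: the enum constants, the `NL` token, and the method name are hypothetical, and the sketch covers only paragraph breaks, not splitting on sentence-ending punctuation.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: the three newline strategies applied to a token list. */
final class NewlineStrategySketch {

    /** Hypothetical names for the three treatments described above. */
    enum Mode { IGNORE, EVERY_NEWLINE, TWO_OR_MORE }

    /** Hypothetical newline token; real tokenizers may use a different marker. */
    static final String NL = "\n";

    static List<List<String>> paragraphs(List<String> tokens, Mode mode) {
        List<List<String>> out = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int run = 0; // number of consecutive newline tokens seen so far

        for (String tok : tokens) {
            if (tok.equals(NL)) {
                run++;
                boolean breakHere = mode == Mode.EVERY_NEWLINE
                        || (mode == Mode.TWO_OR_MORE && run >= 2);
                if (breakHere && !current.isEmpty()) {
                    out.add(current);
                    current = new ArrayList<>();
                }
                continue; // newline tokens never appear in the output
            }
            run = 0;
            current.add(tok);
        }
        if (!current.isEmpty()) {
            out.add(current);
        }
        return out;
    }
}
```

    For the tokens `a, NL, b, NL, NL, c`, the sketch yields one group under `IGNORE`, three under `EVERY_NEWLINE`, and two (`a b` and `c`) under `TWO_OR_MORE`.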
  - WordToSentenceProcessor
```
public WordToSentenceProcessor(boolean isOneSentence)
```
    Create a WordToSentenceProcessor that, when the argument is true, never breaks the input into multiple sentences: the input stream is always output as one sentence. (If it is false, this is equivalent to the no-argument constructor, so why use this?)
    
    Parameters:
    
    isOneSentence - Marker argument: true means to treat input as one sentence
  - WordToSentenceProcessor
```
public WordToSentenceProcessor(java.util.Set<java.lang.String> boundaryToDiscard)
```
    Set the set of Strings that will mark the end of a sentence, and which will be discarded after doing so. This constructor is used for, and usually only for, doing one-sentence-per-line sentence splitting. Since in such cases, you generally want to strictly preserve the set of lines in the input, it preserves empty lines as empty sentences in the output.
    
    Parameters:
    
    boundaryToDiscard - A Set of String that will be matched with .equals() and will mark an end of sentence and be discarded.
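    The one-sentence-per-line behavior, including the preservation of empty lines as empty sentences, can be sketched like this. The class, method, and `allowEmpty` flag are hypothetical stand-ins chosen for illustration; the sketch matches separators with `Set.contains`, which uses `.equals()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: splitting purely on discarded separator tokens. */
final class OneSentencePerLineSketch {

    static List<List<String>> split(List<String> tokens,
                                    Set<String> boundaryToDiscard,
                                    boolean allowEmpty) {
        List<List<String>> out = new ArrayList<>();
        List<String> current = new ArrayList<>();

        for (String tok : tokens) {
            if (boundaryToDiscard.contains(tok)) {
                // Each separator ends a line; an empty line becomes an empty
                // sentence only when allowEmpty is true.
                if (allowEmpty || !current.isEmpty()) {
                    out.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(tok);
            }
        }
        // A final line without a trailing separator is still a sentence.
        if (!current.isEmpty()) {
            out.add(current);
        }
        return out;
    }
}
```

    With `allowEmpty` set to true the number of output sentences tracks the number of input lines, which is the property that strict line-preserving pipelines rely on.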
  - WordToSentenceProcessor
```
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex,
                               WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak,
                               boolean isOneSentence)
```
    Create a basic WordToSentenceProcessor specifying just a few top-level options.
    
    Parameters:
    
    boundaryTokenRegex - The set of boundary tokens
    
    newlineIsSentenceBreak - Strategy for treating newlines as sentence breaks
    
    isOneSentence - Whether to treat whole text as one sentence (if true, the other two parameters are ignored).
  - WordToSentenceProcessor
```
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex,
                               java.lang.String boundaryFollowersRegex,
                               java.util.Set<java.lang.String> boundaryToDiscard,
                               java.util.Set<java.lang.String> xmlBreakElementsToDiscard,
                               WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak,
                               SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern,
                               java.util.Set<java.lang.String> tokenRegexesToDiscard)
```
    Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens. You can also set the sentence boundary tokens to discard and xmlBreakElementsToDiscard, and set the treatment of newlines (boundaryToDiscard) as sentence ends. This constructor is convenient because any of the first three arguments may be null, in which case the usual default is substituted for it. The allowed set of boundary followers is the regex: "[\\p{Pe}\\p{Pf}'\"]|''|-R[CRS]B-". The default set of discarded separator tokens includes the newline tokens used by WhitespaceLexer and PTBLexer.
    
    Parameters:
    
    boundaryTokenRegex - The regex of boundary tokens. If null, use default.
    
    boundaryFollowersRegex - The regex of boundary following tokens. If null, use default. These are tokens which should normally be added on to the current sentence even after something normally sentence ending has been seen. For example, typically a close parenthesis or close quotes goes with the current sentence, even after a period or question mark have been seen.
    
    boundaryToDiscard - The set of regex for sentence boundary tokens that should be discarded. If null, use default.
    
    xmlBreakElementsToDiscard - xml element names like "p", which will be recognized, treated as sentence ends, and discarded. If null, use none.
    
    newlineIsSentenceBreak - Strategy for counting line ends (boundaryToDiscard) as sentence ends.
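    To see what the boundary follower regex quoted in the description accepts, it can be compiled with `java.util.regex` and applied to whole tokens. The class and method names below are hypothetical, and the current value of DEFAULT_BOUNDARY_FOLLOWERS_REGEX may differ from the quoted string, since its field description also mentions greater-than signs and full-width forms.

```java
import java.util.regex.Pattern;

/** Illustrative only: exercises the follower regex quoted in the description. */
final class BoundaryFollowerRegexDemo {

    // \p{Pe} = close punctuation, \p{Pf} = final (closing) quote punctuation,
    // plus straight quotes, '' and PTB-escaped right brackets.
    static final Pattern FOLLOWERS =
            Pattern.compile("[\\p{Pe}\\p{Pf}'\"]|''|-R[CRS]B-");

    /** Returns true if the whole token matches the follower regex. */
    static boolean isFollower(String token) {
        return FOLLOWERS.matcher(token).matches();
    }
}
```

    Closing tokens such as `)`, `]`, `”`, `''`, and `-RRB-` match, while opening tokens such as `(` and `-LRB-` do not.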
  - WordToSentenceProcessor
```
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex,
                               java.lang.String boundaryFollowersRegex,
                               java.util.Set<java.lang.String> boundariesToDiscard,
                               java.util.Set<java.lang.String> xmlBreakElementsToDiscard,
                               java.lang.String regionElementRegex,
                               WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak,
                               SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern,
                               java.util.Set<java.lang.String> tokenRegexesToDiscard,
                               boolean isOneSentence,
                               boolean allowEmptySentences)
```
    Configure all parameters for converting a list of tokens into sentences. The whole enchilada.
    
    Parameters:
    
    boundaryTokenRegex - Tokens that match this regex will end a sentence, but are retained at the end of the sentence. Substantive value must be supplied.
    
    boundaryFollowersRegex - Tokens that match this regex are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")". Substantive value must be supplied.
    
    boundariesToDiscard - This is normally used for newline tokens if they are included in the tokenization. They may end the sentence (depending on the setting of newlineIsSentenceBreak), but at any rate are deleted from sentences in the output. Substantive value must be supplied.
    
    xmlBreakElementsToDiscard - These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary. May be null; means discard none.
    
    regionElementRegex - XML element name regex to delimit regions processed. Tokens outside one of these elements are discarded. May be null; means to not filter by regions
    
    newlineIsSentenceBreak - How to treat newlines. Must have substantive value.
    
    sentenceBoundaryMultiTokenPattern - A TokensRegex multi-token pattern for finding boundaries. May be null; means that there are no such patterns.
    
    tokenRegexesToDiscard - Regex for tokens to discard. May be null; means that no tokens are discarded in this way.
    
    isOneSentence - Whether to treat whole of input as one sentence regardless. Must have substantive value. Overrides anything else.
    
    allowEmptySentences - Whether to allow empty sentences to be output. Must have substantive value. Empty sentences are often suppressed, but that is not wanted in things like strict one-sentence-per-line mode.
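    The effect of regionElementRegex, keeping only tokens inside matching elements and dropping the tags themselves, can be sketched as below. This is a simplification under stated assumptions: each tag is a single token of the form `<name>` or `</name>` with no attributes, regions do not nest, and the class and method names are hypothetical. The library's own XML matching is described only as approximate and may behave differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: filters tokens down to those inside region elements. */
final class RegionFilterSketch {

    // Assumed tag shape: <name> or </name>, one token, no attributes.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static List<String> keepRegionTokens(List<String> tokens, String regionElementRegex) {
        Pattern region = Pattern.compile(regionElementRegex);
        List<String> kept = new ArrayList<>();
        boolean inside = false;

        for (String tok : tokens) {
            Matcher m = TAG.matcher(tok);
            if (m.matches() && region.matcher(m.group(2)).matches()) {
                // An opening region tag turns inclusion on, a closing one turns it off.
                inside = m.group(1).isEmpty();
                continue; // the region tags themselves are never kept
            }
            if (inside) {
                kept.add(tok);
            }
        }
        return kept;
    }
}
```

    Filtering of this kind would run before sentence grouping, so tokens outside every region never reach the boundary logic.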
- Method Detail
  - stringToNewlineIsSentenceBreak
```
public static WordToSentenceProcessor.NewlineIsSentenceBreak stringToNewlineIsSentenceBreak(java.lang.String name)
```
  - process
```
public java.util.List<java.util.List<IN>> process(java.util.List<? extends IN> words)
```
    Returns a List of Lists where each element is built from a run of Words in the input Document. Specifically, reads through each word in the input document and breaks off a sentence after finding a valid sentence boundary token or end of file. Note that for this to work, the words in the input document must have been tokenized with a tokenizer that makes sentence boundary tokens their own tokens (e.g., PTBTokenizer).
    
    Specified by:
    
    process in interface ListProcessor<IN,java.util.List<IN>>
    
    Parameters:
    
    words - A list of already tokenized words (must implement HasWord or be a String).
    
    Returns:
    
    A list of sentences.
    
    See Also:
    
    WordToSentenceProcessor(String, String, Set, Set, String, NewlineIsSentenceBreak, SequencePattern, Set, boolean, boolean)
  - processDocument
```
public <L,F> Document<L,F,java.util.List<IN>> processDocument(Document<L,F,IN> in)
```
