IN - The type of the tokens in the sentences
public class WordToSentenceProcessor<IN> extends java.lang.Object implements ListProcessor<IN,java.util.List<IN>>
The main behavior is to look for sentence ending tokens like "." or "?!?", and to split after them and any following sentence closers like ")".
Overlaid on this is an overall choice of state: The WordToSentenceProcessor can be a non-splitter, which always returns one sentence. Otherwise, the WordToSentenceProcessor will also split based on paragraphs using one of these three states: (1) Ignore line breaks in splitting sentences, (2) Treat each line as a separate paragraph, or (3) Treat two consecutive line breaks as marking the end of a paragraph.
The details of sentence breaking within paragraphs are controlled by the following three variables:
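- boundaryTokenRegex: tokens that end a sentence but are retained at the end of the sentence, such as a period.
- boundaryFollowersRegex: tokens that may be tacked onto the end of a sentence after a sentence boundary token, such as ")".
- boundariesToDiscard: tokens that separate sentences and are deleted from the output, for example a '<p>' tag. If two of these follow each other, they are coalesced: no empty Sentence is output. The end-of-file is not represented in this Set, but the code behaves as if it were a member.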
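The splitting behavior described above (split after a sentence ending token, keep any closers that follow it, drop and coalesce discarded separators) can be illustrated with a small standalone sketch. This is not the library's implementation: the class name `SentenceSplitSketch`, the two simplified regexes, and the discard set are assumptions made for the example, and it works on plain `String` tokens only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Simplified illustration of boundary/follower/discard splitting; hypothetical, not CoreNLP code. */
class SentenceSplitSketch {
    // Assumed, simplified patterns; the library's real defaults are broader.
    static final Pattern BOUNDARY = Pattern.compile("\\.|[!?]+");
    static final Pattern FOLLOWER = Pattern.compile("[)\\]\"']|-RRB-");
    static final Set<String> DISCARD = new HashSet<>(Arrays.asList("\n", "<p>"));

    static List<List<String>> split(List<String> tokens) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean sawBoundary = false; // true once the current sentence has seen its ending token
        for (String tok : tokens) {
            if (DISCARD.contains(tok)) {
                // Discarded separators end the sentence; consecutive ones coalesce,
                // so no empty sentence is ever emitted.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                sawBoundary = false;
                continue;
            }
            if (sawBoundary && !FOLLOWER.matcher(tok).matches()) {
                // The first token that is not a follower starts a new sentence.
                sentences.add(current);
                current = new ArrayList<>();
                sawBoundary = false;
            }
            current.add(tok);
            if (BOUNDARY.matcher(tok).matches()) {
                sawBoundary = true;
            }
        }
        // End of input behaves like a discarded boundary.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

With these assumed patterns, the tokens `He left . ) Why ?!?` followed by two newline tokens and `Ok` come out as three sentences: `[He, left, ., )]`, `[Why, ?!?]`, and `[Ok]`.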
Modifier and Type | Class and Description |
---|---|
static class | WordToSentenceProcessor.NewlineIsSentenceBreak |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | DEFAULT_BOUNDARY_FOLLOWERS_REGEX: Pe = Close_Punctuation (close brackets), Pf = Final_Punctuation (close quotes); plus straight quotes, PTB escaped right brackets (-RRB-, etc.), greater than as close angle bracket, and those forms in the full-width range. |
static java.lang.String | DEFAULT_BOUNDARY_REGEX: Default pattern for sentence ending punctuation. |
static java.util.Set<java.lang.String> | DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD |
Constructor and Description |
---|
WordToSentenceProcessor(): Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. |
WordToSentenceProcessor(boolean isOneSentence): Create a WordToSentenceProcessor which never breaks the input into multiple sentences. |
WordToSentenceProcessor(java.util.Set<java.lang.String> boundaryToDiscard): Set the set of Strings that will mark the end of a sentence, and which will be discarded after doing so. |
WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.lang.String boundaryFollowersRegex, java.util.Set<java.lang.String> boundariesToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, java.lang.String regionElementRegex, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern, java.util.Set<java.lang.String> tokenRegexesToDiscard, boolean isOneSentence, boolean allowEmptySentences): Configure all parameters for converting a list of tokens into sentences. |
WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.lang.String boundaryFollowersRegex, java.util.Set<java.lang.String> boundaryToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern, java.util.Set<java.lang.String> tokenRegexesToDiscard): Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens. |
WordToSentenceProcessor(java.lang.String boundaryTokenRegex, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, boolean isOneSentence): Create a basic WordToSentenceProcessor specifying just a few top-level options. |
WordToSentenceProcessor(WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak): Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. |
Modifier and Type | Method and Description |
---|---|
java.util.List<java.util.List<IN>> | process(java.util.List<? extends IN> words): Returns a List of Lists where each element is built from a run of Words in the input Document. |
<L,F> Document<L,F,java.util.List<IN>> | processDocument(Document<L,F,IN> in) |
static WordToSentenceProcessor.NewlineIsSentenceBreak | stringToNewlineIsSentenceBreak(java.lang.String name) |
public static final java.lang.String DEFAULT_BOUNDARY_REGEX
public static final java.lang.String DEFAULT_BOUNDARY_FOLLOWERS_REGEX
public static final java.util.Set<java.lang.String> DEFAULT_SENTENCE_BOUNDARIES_TO_DISCARD
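To make the two regex fields more concrete, the sketch below shows what patterns of this kind could look like in Java. The regex strings here are assumptions written from the field descriptions (sentence ending punctuation; Unicode categories Pe and Pf plus straight quotes, PTB escaped right brackets, greater than, and full-width forms). They are not the actual values of `DEFAULT_BOUNDARY_REGEX` or `DEFAULT_BOUNDARY_FOLLOWERS_REGEX`, and the class name `BoundaryRegexSketch` is hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical patterns written from the field descriptions; not the library constants. */
class BoundaryRegexSketch {
    // Assumed: a single period, or any run of '!' and '?'.
    static final Pattern BOUNDARY = Pattern.compile("\\.|[!?]+");

    // Assumed: close punctuation (Pe), final punctuation (Pf), straight quotes,
    // '>' as a close angle bracket, their full-width forms, and PTB escapes
    // such as -RRB-, -RCB-, -RSB-.
    static final Pattern FOLLOWERS =
        Pattern.compile("[\\p{Pe}\\p{Pf}\"'>\uFF02\uFF07\uFF1E]|-R[CRS]B-");

    /** True if the whole token is sentence ending punctuation. */
    static boolean isBoundary(String token) {
        return BOUNDARY.matcher(token).matches();
    }

    /** True if the whole token may attach to a sentence after its boundary token. */
    static boolean isFollower(String token) {
        return FOLLOWERS.matcher(token).matches();
    }
}
```

Both helpers use `matches()`, so a token qualifies only if the entire token fits the pattern; a word that merely contains a period does not count as a boundary.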
public WordToSentenceProcessor()
Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. The default set is: {".","?","!"} and any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!. A sequence of two or more consecutive line breaks is taken as a paragraph break, which also splits sentences. This is the usual constructor for sentence breaking ordinary hard-wrapped text, where a blank line (two consecutive line breaks) indicates a paragraph break. People commonly use this constructor.
public WordToSentenceProcessor(WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak)
Create a WordToSentenceProcessor using a sensible default list of tokens for sentence ending for English/Latin writing systems. The default set is: {".","?","!"} and any combination of ! or ?, as in !!!?!?!?!!!?!!?!!!. You can specify the treatment of newlines as one of three options: newlines are ignored, every newline is a sentence break, or only two or more consecutive newlines are a sentence break.
newlineIsSentenceBreak - Strategy for treating newlines as paragraph breaks.
public WordToSentenceProcessor(boolean isOneSentence)
Create a WordToSentenceProcessor which never breaks the input into multiple sentences. If the argument is true, the input stream is always output as one sentence. (If it is false, this is equivalent to the no-argument constructor, so why use this?)
isOneSentence - Marker argument: true means to treat input as one sentence
public WordToSentenceProcessor(java.util.Set<java.lang.String> boundaryToDiscard)
boundaryToDiscard - A Set of String that will be matched with .equals() and will mark an end of sentence and be discarded.
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, boolean isOneSentence)
Create a basic WordToSentenceProcessor specifying just a few top-level options.
boundaryTokenRegex - The regex for the set of boundary tokens
newlineIsSentenceBreak - Strategy for treating newlines as sentence breaks
isOneSentence - Whether to treat whole text as one sentence (if true, the other two parameters are ignored).
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.lang.String boundaryFollowersRegex, java.util.Set<java.lang.String> boundaryToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern, java.util.Set<java.lang.String> tokenRegexesToDiscard)
boundaryTokenRegex - The regex of boundary tokens. If null, use default.
boundaryFollowersRegex - The regex of boundary following tokens. If null, use default. These are tokens which should normally be added on to the current sentence even after something normally sentence ending has been seen. For example, a close parenthesis or close quote typically goes with the current sentence, even after a period or question mark has been seen.
boundaryToDiscard - The set of regexes for sentence boundary tokens that should be discarded. If null, use default.
xmlBreakElementsToDiscard - XML element names like "p", which will be recognized, treated as sentence ends, and discarded. If null, use none.
newlineIsSentenceBreak - Strategy for counting line ends (boundaryToDiscard) as sentence ends.
public WordToSentenceProcessor(java.lang.String boundaryTokenRegex, java.lang.String boundaryFollowersRegex, java.util.Set<java.lang.String> boundariesToDiscard, java.util.Set<java.lang.String> xmlBreakElementsToDiscard, java.lang.String regionElementRegex, WordToSentenceProcessor.NewlineIsSentenceBreak newlineIsSentenceBreak, SequencePattern<? super IN> sentenceBoundaryMultiTokenPattern, java.util.Set<java.lang.String> tokenRegexesToDiscard, boolean isOneSentence, boolean allowEmptySentences)
boundaryTokenRegex - Tokens that match this regex will end a sentence, but are retained at the end of the sentence. Substantive value must be supplied.
boundaryFollowersRegex - Tokens that match this regex are allowed to be tacked onto the end of a sentence after a sentence boundary token, for example ")". Substantive value must be supplied.
boundariesToDiscard - This is normally used for newline tokens if they are included in the tokenization. They may end the sentence (depending on the setting of newlineIsSentenceBreak), but at any rate are deleted from sentences in the output. Substantive value must be supplied.
xmlBreakElementsToDiscard - These are elements like "p" or "sent", which will be wrapped into regex for approximate XML matching. They will be deleted in the output, and will always trigger a sentence boundary. May be null; means discard none.
regionElementRegex - XML element name regex to delimit regions processed. Tokens outside one of these elements are discarded. May be null; means to not filter by regions.
newlineIsSentenceBreak - How to treat newlines. Must have substantive value.
sentenceBoundaryMultiTokenPattern - A TokensRegex multi-token pattern for finding boundaries. May be null; means that there are no such patterns.
tokenRegexesToDiscard - Regex for tokens to discard. May be null; means that no tokens are discarded in this way.
isOneSentence - Whether to treat whole of input as one sentence regardless. Must have substantive value. Overrides anything else.
allowEmptySentences - Whether to allow empty sentences to be output. Must have substantive value. Empty sentences are often suppressed, but you don't want that in things like strict one-sentence-per-line mode.
public static WordToSentenceProcessor.NewlineIsSentenceBreak stringToNewlineIsSentenceBreak(java.lang.String name)
public java.util.List<java.util.List<IN>> process(java.util.List<? extends IN> words)
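Returns a List of Lists where each element is built from a run of Words in the input Document. The words must already be tokenized (for example, with PTBTokenizer).
Specified by: process in interface ListProcessor<IN,java.util.List<IN>>
words - A list of already tokenized words (must implement HasWord or be a String).
See also: WordToSentenceProcessor(String, String, Set, Set, String, NewlineIsSentenceBreak, SequencePattern, Set, boolean, boolean)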
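The three newline strategies mentioned for the constructors (ignore line breaks, break at every line, break only at two or more consecutive line breaks) can be compared with a small standalone sketch. It is a simplified illustration, not the library's `process` method: the enum `Mode` is a local stand-in for `NewlineIsSentenceBreak`, the newline token `"\n"` is an assumption, and punctuation-based splitting is deliberately left out so that only the newline handling is visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of newline-driven splitting only; not CoreNLP code. */
class NewlineModeSketch {
    /** Local stand-in for the three newline strategies described in the text. */
    enum Mode { NEVER, ALWAYS, TWO_CONSECUTIVE }

    static final String NEWLINE = "\n"; // assumed newline token

    static List<List<String>> split(List<String> tokens, Mode mode) {
        List<List<String>> out = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int run = 0; // length of the current run of consecutive newline tokens
        for (String tok : tokens) {
            if (NEWLINE.equals(tok)) {
                run++;
                boolean breakHere = mode == Mode.ALWAYS
                    || (mode == Mode.TWO_CONSECUTIVE && run >= 2);
                if (breakHere && !current.isEmpty()) {
                    out.add(current);
                    current = new ArrayList<>();
                }
                continue; // newline tokens are dropped from the output in every mode
            }
            run = 0;
            current.add(tok);
        }
        if (!current.isEmpty()) {
            out.add(current);
        }
        return out;
    }
}
```

For the tokens `a`, newline, `b`, newline, newline, `c`, this sketch yields one group in `NEVER` mode, three groups in `ALWAYS` mode, and two groups (`[a, b]` and `[c]`) in `TWO_CONSECUTIVE` mode.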