L - The type of the labels
F - The type of the features

public class StripTagsProcessor<L,F> extends AbstractListProcessor<Word,Word,L,F>
Processor whose process method deletes all SGML/XML/HTML tags (tokens starting with < and ending with >). Optionally, newlines can be inserted after the end of block-level tags to roughly simulate where continuous text was broken up (this helps when finding sentence boundaries, for example).

Modifier and Type | Field and Description |
---|---|
static java.util.Set<java.lang.String> | blockTags: Block-level HTML tags that are rendered with surrounding line breaks. |

Constructor and Description |
---|
StripTagsProcessor(): Constructs a new StripTagsProcessor that doesn't mark line breaks. |
StripTagsProcessor(boolean markLineBreaks): Constructs a new StripTagsProcessor that marks line breaks as specified. |

Modifier and Type | Method and Description |
---|---|
boolean | getMarkLineBreaks(): Returns whether the output of the processor will contain newline words ("\n") at the end of block-level tags. |
static void | main(java.lang.String[] args): For internal debugging purposes only. |
java.util.List<Word> | process(java.util.List<? extends Word> in): Returns a new Document with the same meta-data as in, and the same words except tags are stripped. |
void | setMarkLineBreaks(boolean markLineBreaks): Sets whether the output of the processor will contain newline words ("\n") at the end of block-level tags. |

Methods inherited from class AbstractListProcessor: processDocument, processLists

public static final java.util.Set<java.lang.String> blockTags
public StripTagsProcessor()
public StripTagsProcessor(boolean markLineBreaks)
public boolean getMarkLineBreaks()
public void setMarkLineBreaks(boolean markLineBreaks)
public java.util.List<Word> process(java.util.List<? extends Word> in)
public static void main(java.lang.String[] args)
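
The tag-stripping behavior described for process can be illustrated with a small standalone sketch. This is not the library's implementation: it works on plain strings rather than Word objects, and the class StripTagsSketch, the helper tagName, and the contents of BLOCK_TAGS are all hypothetical names and values chosen for illustration. It also assumes that a newline token is emitted for any block-level tag (opening or closing) and that consecutive newline tokens are collapsed, neither of which the documentation above confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not the library's StripTagsProcessor.
final class StripTagsSketch {

    // Hypothetical subset of block-level tags; the real blockTags set may differ.
    static final Set<String> BLOCK_TAGS =
            new HashSet<>(Arrays.asList("p", "div", "br", "li", "table", "tr", "h1", "h2"));

    // Extracts the lowercase tag name from a token such as "<p>", "</table>" or "<br/>".
    static String tagName(String token) {
        int start = 1;
        while (start < token.length() && !Character.isLetter(token.charAt(start))) {
            start++;
        }
        int end = start;
        while (end < token.length() && Character.isLetterOrDigit(token.charAt(end))) {
            end++;
        }
        return token.substring(start, end).toLowerCase(Locale.ROOT);
    }

    // Drops every token that starts with '<' and ends with '>'. When markLineBreaks
    // is true, a block-level tag is replaced by a single "\n" token (assumption:
    // runs of adjacent block-level tags yield only one newline).
    static List<String> stripTags(List<String> tokens, boolean markLineBreaks) {
        List<String> out = new ArrayList<>();
        boolean justInsertedNewline = false;
        for (String token : tokens) {
            boolean isTag = token.length() >= 2 && token.startsWith("<") && token.endsWith(">");
            if (!isTag) {
                out.add(token);
                justInsertedNewline = false;
            } else if (markLineBreaks && !justInsertedNewline
                    && BLOCK_TAGS.contains(tagName(token))) {
                out.add("\n");
                justInsertedNewline = true;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<p>", "Hello", "<b>", "world", "</b>", "</p>", "<p>", "Bye", "</p>");
        System.out.println(stripTags(tokens, false));
        System.out.println(stripTags(tokens, true));
    }
}
```

With markLineBreaks set to false, only the non-tag tokens survive. With it set to true, the inline <b> tags still vanish silently, while the <p> tags leave newline tokens behind that a downstream sentence splitter could treat as boundaries.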