public class SpanishTokenizer<T extends HasWord> extends AbstractTokenizer<T>
Tokenizes Spanish text according to a modified version of the AnCora corpus tokenization standard, so its rules differ somewhat from those of the Penn Treebank (PTB).
A single SpanishTokenizer instance is not thread-safe, because it uses a non-thread-safe JFlex object to do the processing. Multiple instances can be created safely, however. A single SpanishTokenizerFactory instance is also not thread-safe, because it keeps its options in a local variable.
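The note above implies a simple usage pattern: give each task or thread its own tokenizer instance rather than sharing one. The following is a minimal, self-contained sketch of that pattern only. `StandInTokenizer` is a hypothetical placeholder (a whitespace splitter) and not the real SpanishTokenizer; with the real library, each task would presumably construct its own SpanishTokenizer, and each thread its own SpanishTokenizerFactory, at the marked line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskTokenizerSketch {

    /** Hypothetical stand-in for a stateful tokenizer that must not be shared across threads. */
    static final class StandInTokenizer {
        private final String text;

        StandInTokenizer(String text) {
            this.text = text;
        }

        /** Splits on whitespace; the real tokenizer applies AnCora-style rules instead. */
        List<String> tokenize() {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                return new ArrayList<>();
            }
            return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        }
    }

    /** Tokenizes each document on a thread pool, creating one tokenizer per task. */
    public static List<List<String>> tokenizeAll(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String doc : docs) {
                // A fresh instance per task: no tokenizer object is ever visible to two threads.
                Callable<List<String>> task = () -> new StandInTokenizer(doc).tokenize();
                futures.add(pool.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                results.add(future.get()); // results keep the input order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll(Arrays.asList("Hola mundo", "Buenos dias a todos"), 2));
    }
}
```

Creating an instance per task trades a small allocation cost for not needing any locking around the tokenizer.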
Modifier and Type | Class and Description |
---|---|
static class | SpanishTokenizer.SpanishTokenizerFactory<T extends HasWord>: A factory for Spanish tokenizer instances. |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | ANCORA_OPTIONS |
static java.lang.String | DEFAULT_OPTIONS |
Fields inherited from class AbstractTokenizer: NEWLINE_TOKEN, nextToken
Constructor and Description |
---|
SpanishTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties, boolean splitCompounds, boolean splitVerbs, boolean splitContractions): Constructor. |
Modifier and Type | Method and Description |
---|---|
static TokenizerFactory<CoreLabel> | ancoraFactory(): Returns a tokenizer factory that uses AnCora tokenization. |
static TokenizerFactory<CoreLabel> | coreLabelFactory(): Returns a factory that vends CoreLabel tokens with default tokenization. |
static TokenizerFactory<CoreLabel> | factory() |
static <T extends HasWord> TokenizerFactory<T> | factory(LexedTokenFactory<T> factory) |
static <T extends HasWord> TokenizerFactory<T> | factory(LexedTokenFactory<T> factory, java.lang.String options): The recommended factory method. |
protected T | getNext(): Internally fetches the next token. |
static void | main(java.lang.String[] args): A fast, rule-based tokenizer for Spanish based on AnCora. |
Methods inherited from class AbstractTokenizer: hasNext, next, peek, remove, tokenize
public static final java.lang.String ANCORA_OPTIONS
public static final java.lang.String DEFAULT_OPTIONS
public SpanishTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties, boolean splitCompounds, boolean splitVerbs, boolean splitContractions)
Parameters: r, tf, lexerProperties, splitCompounds
protected T getNext()
Internally fetches the next token. (Description copied from class AbstractTokenizer.)
Specified by: getNext in class AbstractTokenizer<T extends HasWord>
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory, java.lang.String options)
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory)
public static TokenizerFactory<CoreLabel> ancoraFactory()
public static TokenizerFactory<CoreLabel> coreLabelFactory()
public static TokenizerFactory<CoreLabel> factory()
public static void main(java.lang.String[] args)
Currently, this tokenizer does not do line splitting. It assumes that the input file is delimited by the system line separator. The output will be equivalently delimited.
Parameters: args - Command-line arguments
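The line-delimited behavior described for main can be illustrated with a short, self-contained sketch: read the input one line at a time, tokenize each line independently, and write one output line per input line using the system line separator. This is not the library's implementation. `standInTokenize` is a hypothetical whitespace splitter standing in for the real tokenizer, and joining tokens with a single space is an assumption about the output format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

public class LineDelimitedSketch {

    /** Tokenizes each input line separately and emits exactly one output line per input line. */
    public static String processLines(Reader input, Function<String, List<String>> tokenizeLine) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: tokens on a line are joined with a single space.
                out.append(String.join(" ", tokenizeLine.apply(line)));
                out.append(System.lineSeparator());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Hypothetical stand-in tokenizer: splits on whitespace only. */
    public static List<String> standInTokenize(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String ls = System.lineSeparator();
        Reader input = new StringReader("Hola   mundo" + ls + "Buenos dias" + ls);
        System.out.print(processLines(input, LineDelimitedSketch::standInTokenize));
    }
}
```

Because no sentence or line splitting happens inside the tokenizer, input that is not already one unit per line would need to be split before it reaches a loop like this.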