SpanishTokenizer (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.process.AbstractTokenizer<T>
- - edu.stanford.nlp.international.spanish.process.SpanishTokenizer<T>

All Implemented Interfaces:

Tokenizer<T>, java.util.Iterator<T>
```
public class SpanishTokenizer<T extends HasWord>
extends AbstractTokenizer<T>
```
Tokenizer for raw Spanish text. The tokenization scheme derives from PTB tokenization, with extra rules for Spanish contractions and assimilations. It is based heavily on the FrenchTokenizer.
The tokenizer follows the modified AnCora corpus tokenization standards, so its rules differ slightly from PTB.
A single SpanishTokenizer instance is not thread-safe, because it uses a non-thread-safe JFlex object to do the processing. Multiple instances can be created safely, though. A single SpanishTokenizerFactory instance is also not thread-safe, because it keeps its options in a local variable.
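
Since neither a tokenizer nor a factory should be shared, one workable pattern is to create both inside each task. The following is a minimal sketch, assuming Stanford CoreNLP is on the classpath; the class name `PerThreadSpanishTokenization`, the helper `tokenize`, and the sample sentences are illustrative and not part of the library.

```java
import edu.stanford.nlp.international.spanish.process.SpanishTokenizer;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.TokenizerFactory;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadSpanishTokenization {

  /** Tokenizes one string with a factory and tokenizer created inside the call. */
  static List<String> tokenize(String text) {
    // Fresh factory and tokenizer per call, so no instance crosses a thread boundary.
    TokenizerFactory<CoreLabel> factory = SpanishTokenizer.factory();
    Tokenizer<CoreLabel> tokenizer = factory.getTokenizer(new StringReader(text));
    List<String> words = new ArrayList<>();
    while (tokenizer.hasNext()) {
      words.add(tokenizer.next().word());
    }
    return words;
  }

  public static void main(String[] args) throws Exception {
    List<String> sentences = Arrays.asList("Hola mundo.", "El perro come.");
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      List<Future<List<String>>> results = new ArrayList<>();
      for (String sentence : sentences) {
        // Each task builds its own tokenizer via tokenize(...).
        results.add(pool.submit(() -> tokenize(sentence)));
      }
      for (Future<List<String>> result : results) {
        System.out.println(result.get());
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

Creating a factory per task costs a little more than reusing one, but it sidesteps both thread-safety caveats at once.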

Author:

Ishita Prasad

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`SpanishTokenizer.SpanishTokenizerFactory<T extends HasWord>` A factory for Spanish tokenizer instances.

Field Summary

Fields
Modifier and Type	Field and Description
`static java.lang.String`	`ANCORA_OPTIONS`
`static java.lang.String`	`DEFAULT_OPTIONS`

Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
NEWLINE_TOKEN, nextToken

Constructor Summary

Constructors
Constructor and Description
`SpanishTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties, boolean splitCompounds, boolean splitVerbs, boolean splitContractions)` Constructor.

Method Summary

Modifier and Type	Method and Description
`static TokenizerFactory<CoreLabel>`	`ancoraFactory()` Returns a tokenizer factory that uses AnCora tokenization.
`static TokenizerFactory<CoreLabel>`	`coreLabelFactory()` Returns a factory that vends CoreLabel tokens with default tokenization.
`static TokenizerFactory<CoreLabel>`	`factory()`
`static <T extends HasWord> TokenizerFactory<T>`	`factory(LexedTokenFactory<T> factory)`
`static <T extends HasWord> TokenizerFactory<T>`	`factory(LexedTokenFactory<T> factory, java.lang.String options)` Recommended factory method.
`protected T`	`getNext()` Internally fetches the next token.
`static void`	`main(java.lang.String[] args)` A fast, rule-based tokenizer for Spanish based on AnCora.

Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.Iterator
forEachRemaining

Field Detail

ANCORA_OPTIONS

public static final java.lang.String ANCORA_OPTIONS

See Also: Constant Field Values

DEFAULT_OPTIONS

public static final java.lang.String DEFAULT_OPTIONS

See Also: Constant Field Values
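
The sketch below compares the two constants by passing each to the two-argument `factory` method. It assumes that both are option strings accepted by that method and that Stanford CoreNLP is on the classpath; `SpanishOptionsExample`, `tokenizeWith`, and the sample sentence are illustrative names. No expected output is shown, because the exact tokens depend on the option values in the installed version.

```java
import edu.stanford.nlp.international.spanish.process.SpanishTokenizer;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.TokenizerFactory;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SpanishOptionsExample {

  /** Tokenizes text with a factory configured by the given option string. */
  static List<String> tokenizeWith(String options, String text) {
    // Assumption: the constants are valid values for the options argument.
    TokenizerFactory<CoreLabel> factory =
        SpanishTokenizer.factory(new CoreLabelTokenFactory(), options);
    List<String> words = new ArrayList<>();
    for (CoreLabel token : factory.getTokenizer(new StringReader(text)).tokenize()) {
      words.add(token.word());
    }
    return words;
  }

  public static void main(String[] args) {
    String text = "Vamos al mercado del pueblo.";
    // Print each option string next to the tokens it produces.
    System.out.println("options: " + SpanishTokenizer.DEFAULT_OPTIONS);
    System.out.println(tokenizeWith(SpanishTokenizer.DEFAULT_OPTIONS, text));
    System.out.println("options: " + SpanishTokenizer.ANCORA_OPTIONS);
    System.out.println(tokenizeWith(SpanishTokenizer.ANCORA_OPTIONS, text));
  }
}
```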

Constructor Detail

SpanishTokenizer

public SpanishTokenizer(java.io.Reader r,
                        LexedTokenFactory<T> tf,
                        java.util.Properties lexerProperties,
                        boolean splitCompounds,
                        boolean splitVerbs,
                        boolean splitContractions)

Constructor.

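Parameters:

r - the reader supplying the raw text to tokenize
tf - the factory used to create tokens of type T
lexerProperties - properties passed to the underlying lexer
splitCompounds - whether to split compounds
splitVerbs - whether to split verbs
splitContractions - whether to split contractions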
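
For callers who need the constructor rather than a factory, a direct call might look like the sketch below. It assumes Stanford CoreNLP is on the classpath and that an empty `Properties` object leaves the lexer on its defaults; `SpanishConstructorExample` and `tokenize` are illustrative names. All three split flags are set to `false` to keep the example minimal.

```java
import edu.stanford.nlp.international.spanish.process.SpanishTokenizer;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;

import java.io.StringReader;
import java.util.List;
import java.util.Properties;

public class SpanishConstructorExample {

  /** Builds a tokenizer directly and returns every token in the text. */
  static List<CoreLabel> tokenize(String text) {
    // Assumption: empty properties mean "use the lexer's default behavior".
    Properties lexerProperties = new Properties();
    SpanishTokenizer<CoreLabel> tokenizer = new SpanishTokenizer<>(
        new StringReader(text),
        new CoreLabelTokenFactory(),
        lexerProperties,
        false,   // splitCompounds
        false,   // splitVerbs
        false);  // splitContractions
    // tokenize() is inherited from AbstractTokenizer and drains the stream.
    return tokenizer.tokenize();
  }

  public static void main(String[] args) {
    for (CoreLabel token : tokenize("Hola mundo.")) {
      System.out.println(token.word());
    }
  }
}
```

In most code the factory methods are the more convenient entry point, since the two-argument `factory` is described as the recommended one.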

Method Detail
- getNext
```
protected T getNext()
```
  Description copied from class: AbstractTokenizer
  
  Internally fetches the next token.
  
  Specified by:
  
  getNext in class AbstractTokenizer<T extends HasWord>
  
  Returns:
  
  the next token in the token stream, or null if none exists.
- factory
```
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory,
                                                              java.lang.String options)
```
  Recommended factory method.
- factory
```
public static <T extends HasWord> TokenizerFactory<T> factory(LexedTokenFactory<T> factory)
```
- ancoraFactory
```
public static TokenizerFactory<CoreLabel> ancoraFactory()
```
  Returns a tokenizer factory that uses AnCora tokenization.
- coreLabelFactory
```
public static TokenizerFactory<CoreLabel> coreLabelFactory()
```
  Returns a factory that vends CoreLabel tokens with default tokenization.
- factory
```
public static TokenizerFactory<CoreLabel> factory()
```
- main
```
public static void main(java.lang.String[] args)
```
  A fast, rule-based tokenizer for Spanish based on AnCora. Performs punctuation splitting and light tokenization by default.
  Currently, this tokenizer does not do line splitting. It assumes that the input file is delimited by the system line separator. The output will be equivalently delimited.
  
  Parameters:
  
  args - Command-line arguments
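
The line-delimited behavior described for `main` can be approximated in application code. The sketch below is not the library's `main` implementation; it assumes Stanford CoreNLP is on the classpath, reads UTF-8 text from standard input, and writes one space-separated line of tokens per input line. `SpanishLineTokenizer` and `tokenizeLine` are illustrative names.

```java
import edu.stanford.nlp.international.spanish.process.SpanishTokenizer;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.TokenizerFactory;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.StringJoiner;

public class SpanishLineTokenizer {

  /** Tokenizes a single line and joins the token words with spaces. */
  static String tokenizeLine(TokenizerFactory<CoreLabel> factory, String line) {
    StringJoiner joined = new StringJoiner(" ");
    // A new tokenizer per line; the factory is only used from this thread.
    for (CoreLabel token : factory.getTokenizer(new StringReader(line)).tokenize()) {
      joined.add(token.word());
    }
    return joined.toString();
  }

  public static void main(String[] args) throws IOException {
    TokenizerFactory<CoreLabel> factory = SpanishTokenizer.factory();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
      String line;
      // Input lines map one-to-one onto output lines; no sentence splitting is done.
      while ((line = in.readLine()) != null) {
        System.out.println(tokenizeLine(factory, line));
      }
    }
  }
}
```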
