public class ArabicTokenizer<T extends HasWord> extends AbstractTokenizer<T>
A single instance of an ArabicTokenizer is not thread safe, because it uses a non-thread-safe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of an ArabicTokenizerFactory is also not thread safe, because it keeps its options in a local variable.
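The usual way to work with a non-thread-safe tokenizer is to create one instance per task and never share it. The sketch below shows that pattern with a hypothetical stand-in class, StatefulTokenizer, which is not part of the library; it only mimics an object that keeps mutable scratch state. With the real class, the line that constructs the stand-in would presumably construct a fresh ArabicTokenizer instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerTaskTokenizerSketch {

    /** Hypothetical stand-in for a stateful tokenizer that must not be shared across threads. */
    static final class StatefulTokenizer {
        private final StringBuilder buffer = new StringBuilder(); // mutable scratch state

        List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            buffer.setLength(0);
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (Character.isWhitespace(c)) {
                    flush(tokens);
                } else {
                    buffer.append(c);
                }
            }
            flush(tokens);
            return tokens;
        }

        private void flush(List<String> tokens) {
            if (buffer.length() > 0) {
                tokens.add(buffer.toString());
                buffer.setLength(0);
            }
        }
    }

    /** Tokenizes each document on a thread pool, creating a fresh tokenizer inside every task. */
    static List<List<String>> tokenizeAll(List<String> docs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String doc : docs) {
                // A new instance per task: no tokenizer state is visible to two threads.
                futures.add(pool.submit(() -> new StatefulTokenizer().tokenize(doc)));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("tokenization task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll(Arrays.asList("a b", "c  d e"), 2));
    }
}
```

The same reasoning applies to the factory: give each thread its own factory, or finish configuring it before any worker thread starts.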
TODO(spenceg): Merge in rules from ibm tokenizer (v5).
TODO(spenceg): Add XML escaping.
TODO(spenceg): When running from the command line, the tokenizer does not produce the correct number of newline-delimited lines for the ATB data sets.
Modifier and Type | Class and Description |
---|---|
static class | ArabicTokenizer.ArabicTokenizerFactory<T extends HasWord> |
NEWLINE_TOKEN, nextToken
Constructor and Description |
---|
ArabicTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties) |
Modifier and Type | Method and Description |
---|---|
static TokenizerFactory<CoreLabel> | atbFactory() |
static TokenizerFactory<CoreLabel> | factory() |
protected T | getNext() Internally fetches the next token. |
static void | main(java.lang.String[] args) A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding). |
static ArabicTokenizer<CoreLabel> | newArabicTokenizer(java.io.Reader r, java.util.Properties lexerProperties) |
hasNext, next, peek, remove, tokenize
public ArabicTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties)
public static ArabicTokenizer<CoreLabel> newArabicTokenizer(java.io.Reader r, java.util.Properties lexerProperties)
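Both the constructor and newArabicTokenizer take a java.util.Properties object named lexerProperties. A minimal sketch of building one is given here. It assumes that an option is switched on by mapping its name to the string "true"; this page does not state the expected values, so treat that as an assumption to check against the source. The helper enable is hypothetical; the option names come from the list documented under main.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

final class LexerPropertiesSketch {

    /** Builds a Properties object with the named options switched on. */
    static Properties enable(String... optionNames) {
        Properties props = new Properties();
        for (String name : optionNames) {
            // Assumption: an option is enabled by mapping its name to "true".
            props.setProperty(name, "true");
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = enable("normArDigits", "normArPunc", "removeTatweel");
        Reader reader = new StringReader("sample text");

        // With the library on the classpath, the call would then follow the
        // signature documented on this page:
        // ArabicTokenizer<CoreLabel> tok = ArabicTokenizer.newArabicTokenizer(reader, props);

        System.out.println(props.stringPropertyNames());
    }
}
```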
protected T getNext()
Specified by: getNext in class AbstractTokenizer<T extends HasWord>
public static TokenizerFactory<CoreLabel> factory()
public static TokenizerFactory<CoreLabel> atbFactory()
public static void main(java.lang.String[] args)
Currently, this tokenizer does not do line splitting. It normalizes non-printing line separators across platforms and prints the system default line separator to the output.
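To make the line-separator behaviour concrete, here is a rough sketch of that kind of normalization in plain Java. It is not the tokenizer's own code: the class and method names are hypothetical, and the set of characters treated as line breaks is whatever the regex \R matches, which may differ from what the lexer recognizes.

```java
import java.util.regex.Matcher;

final class LineSeparatorSketch {

    /** Replaces every line break sequence with the platform default separator. */
    static String normalizeLineSeparators(String text) {
        // \R matches CR LF as a single unit, as well as lone LF, CR, vertical tab,
        // form feed, NEL (U+0085), and the Unicode line and paragraph separators.
        String replacement = Matcher.quoteReplacement(System.lineSeparator());
        return text.replaceAll("\\R", replacement);
    }

    public static void main(String[] args) {
        String mixed = "line one\r\nline two\rline three\nline four";
        System.out.print(normalizeLineSeparators(mixed));
        System.out.println();
    }
}
```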
The following normalization options are provided:
useUTF8Ellipsis
: Replaces sequences of three or more full stops with …
normArDigits
: Convert Arabic digits to ASCII equivalents
normArPunc
: Convert Arabic punctuation to ASCII equivalents
normAlif
: Change all alif forms to bare alif
normYa
: Map ya to alif maqsura
removeDiacritics
: Strip all diacritics
removeTatweel
: Strip tatweel elongation character
removeQuranChars
: Remove diacritics that appear in the Quran
removeProMarker
: Remove the ATB null pronoun marker
removeSegMarker
: Remove the ATB clitic segmentation marker
removeMorphMarker
: Remove the ATB morpheme boundary markers
removeLengthening
: Replace all sequences of three or more identical (non-period) characters with one copy
atbEscaping
: Replace left/right parentheses with ATB escape characters
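As an illustration of what a few of these options describe, the sketch below re-creates four of them with plain string operations. It is an approximation written from the one-line descriptions above, not the lexer's rules. In particular, it assumes that "Arabic digits" means the Arabic-Indic block U+0660 to U+0669 and that tatweel is U+0640. All class and method names are hypothetical.

```java
final class ArabicNormalizationSketch {

    /** normArDigits-like step: maps Arabic-Indic digits U+0660..U+0669 to ASCII 0..9. */
    static String normalizeDigits(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean arabicDigit = c >= '\u0660' && c <= '\u0669';
            sb.append(arabicDigit ? (char) ('0' + (c - '\u0660')) : c);
        }
        return sb.toString();
    }

    /** removeTatweel-like step: strips the elongation character U+0640. */
    static String removeTatweel(String s) {
        return s.replace("\u0640", "");
    }

    /** useUTF8Ellipsis-like step: three or more full stops become one ellipsis character. */
    static String useEllipsis(String s) {
        return s.replaceAll("\\.{3,}", "\u2026");
    }

    /** removeLengthening-like step: three or more identical non-period characters become one. */
    static String removeLengthening(String s) {
        // ([^.]) captures one non-period character; \1{2,} requires at least two more copies.
        return s.replaceAll("([^.])\\1{2,}", "$1");
    }

    public static void main(String[] args) {
        System.out.println(normalizeDigits("\u0661\u0662\u0663"));
        System.out.println(removeTatweel("\u0643\u0640\u0640\u062A"));
        System.out.println(useEllipsis("wait...."));
        System.out.println(removeLengthening("sooo"));
    }
}
```

Runs of exactly two identical characters are left alone by the lengthening step, and runs of periods are handled only by the ellipsis step, matching the "(non-period)" wording in the option description.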