ArabicTokenizer (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.process.AbstractTokenizer<T>
- - edu.stanford.nlp.international.arabic.process.ArabicTokenizer<T>

All Implemented Interfaces:

Tokenizer<T>, java.util.Iterator<T>
```
public class ArabicTokenizer<T extends HasWord>
extends AbstractTokenizer<T>
```
Tokenizer for UTF-8 Arabic. Buckwalter encoding is not supported.
A single instance of an ArabicTokenizer is not thread safe, as it uses a non-thread-safe JFlex object to do the processing. Multiple instances can be created safely, though. A single instance of an ArabicTokenizerFactory is also not thread safe, as it keeps its options in a local variable.
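Since an instance must not be shared between threads, one way to tokenize in parallel is to create a fresh instance inside each task. The sketch below shows that pattern with the JDK only. `StandInTokenizer` is a hypothetical placeholder that merely splits on whitespace; it stands in for an ArabicTokenizer built per task and is not part of the Stanford API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerTaskTokenizerDemo {
  // Tokenizes each document on a worker thread, creating one tokenizer per task.
  static List<List<String>> tokenizeAll(List<String> docs) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      List<Future<List<String>>> futures = new ArrayList<>();
      for (String doc : docs) {
        futures.add(pool.submit(() -> {
          // The instance lives and dies inside this task; it is never shared.
          StandInTokenizer tok = new StandInTokenizer(doc);
          List<String> out = new ArrayList<>();
          while (tok.hasNext()) out.add(tok.next());
          return out;
        }));
      }
      List<List<String>> results = new ArrayList<>();
      for (Future<List<String>> f : futures) results.add(f.get());
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println(tokenizeAll(List.of("a b c", "d e")));
  }
}

// Hypothetical stand-in for a stateful, non-thread-safe tokenizer.
// It only splits on whitespace; it is NOT the Stanford ArabicTokenizer.
class StandInTokenizer implements Iterator<String> {
  private final String[] parts;
  private int pos = 0; // mutable cursor: the reason an instance must not be shared

  StandInTokenizer(String text) {
    String trimmed = text.trim();
    this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }

  @Override public boolean hasNext() { return pos < parts.length; }

  @Override public String next() {
    if (!hasNext()) throw new NoSuchElementException();
    return parts[pos++];
  }
}
```

Results are collected in submission order, so the output lines up with the input documents even though the tasks may finish in any order.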
- TODO(spenceg): Merge in rules from ibm tokenizer (v5).
- TODO(spenceg): Add XML escaping.
- TODO(spenceg): When running from the command line, the tokenizer does not produce the correct number of newline-delimited lines for the ATB data sets.

Author:

Spence Green

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`static class`	`ArabicTokenizer.ArabicTokenizerFactory<T extends HasWord>`

Field Summary
- Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
  NEWLINE_TOKEN, nextToken

Constructor Summary

Constructors
Constructor and Description
`ArabicTokenizer(java.io.Reader r, LexedTokenFactory<T> tf, java.util.Properties lexerProperties)`

Method Summary

Modifier and Type	Method and Description
`static TokenizerFactory<CoreLabel>`	`atbFactory()`
`static TokenizerFactory<CoreLabel>`	`factory()`
`protected T`	`getNext()` Internally fetches the next token.
`static void`	`main(java.lang.String[] args)` A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding).
`static ArabicTokenizer<CoreLabel>`	`newArabicTokenizer(java.io.Reader r, java.util.Properties lexerProperties)`

Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface java.util.Iterator
forEachRemaining

- Constructor Detail
  - ArabicTokenizer
```
public ArabicTokenizer(java.io.Reader r,
                       LexedTokenFactory<T> tf,
                       java.util.Properties lexerProperties)
```
- Method Detail
  - newArabicTokenizer
```
public static ArabicTokenizer<CoreLabel> newArabicTokenizer(java.io.Reader r,
                                                            java.util.Properties lexerProperties)
```
  - getNext
```
protected T getNext()
```
    Description copied from class: AbstractTokenizer
    
    Internally fetches the next token.
    
    Specified by:
    
    getNext in class AbstractTokenizer<T extends HasWord>
    
    Returns:
    
    the next token in the token stream, or null if none exists.
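    The contract above (return the next token, or null when none exists) pairs naturally with a one-token lookahead buffer behind `hasNext`, `next` and `peek`. The following is a minimal JDK-only sketch of that pattern, not the actual AbstractTokenizer source; `LookaheadTokenizer`, `WordTokenizer` and `GetNextDemo` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class GetNextDemo {
  public static void main(String[] args) {
    WordTokenizer tok = new WordTokenizer("one two three");
    System.out.println(tok.peek());     // inspect the next token without consuming it
    System.out.println(tok.tokenize()); // drain all remaining tokens into a list
  }
}

// Hypothetical base class: subclasses implement only getNext().
abstract class LookaheadTokenizer<T> implements Iterator<T> {
  private T nextToken; // buffered token, or null if nothing is buffered

  // Returns the next token in the stream, or null if none exists.
  protected abstract T getNext();

  @Override public boolean hasNext() {
    if (nextToken == null) nextToken = getNext();
    return nextToken != null;
  }

  @Override public T next() {
    if (!hasNext()) throw new NoSuchElementException();
    T result = nextToken;
    nextToken = null; // the buffer is refilled lazily on the next call
    return result;
  }

  public T peek() {
    if (!hasNext()) throw new NoSuchElementException();
    return nextToken;
  }

  public List<T> tokenize() {
    List<T> out = new ArrayList<>();
    while (hasNext()) out.add(next());
    return out;
  }
}

// Hypothetical subclass: whitespace-separated words as tokens.
class WordTokenizer extends LookaheadTokenizer<String> {
  private final String[] words;
  private int pos = 0;

  WordTokenizer(String text) {
    String trimmed = text.trim();
    this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }

  @Override protected String getNext() {
    return pos < words.length ? words[pos++] : null; // null signals end of stream
  }
}
```

    Using null as the end-of-stream signal means a subclass written this way cannot emit null as a legitimate token.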
  - factory
```
public static TokenizerFactory<CoreLabel> factory()
```
  - atbFactory
```
public static TokenizerFactory<CoreLabel> atbFactory()
```
  - main
```
public static void main(java.lang.String[] args)
```
    A fast, rule-based tokenizer for Modern Standard Arabic (UTF-8 encoding). Performs punctuation splitting and light tokenization by default. Orthographic normalization options are available, and can be enabled with command line options.
    Currently, this tokenizer does not do line splitting. It normalizes non-printing line separators across platforms and prints the system default line separator to the output.
    The following normalization options are provided:
    - useUTF8Ellipsis : Replace sequences of three or more full stops with the single ellipsis character …
    - normArDigits : Convert Arabic digits to ASCII equivalents
    - normArPunc : Convert Arabic punctuation to ASCII equivalents
    - normAlif : Change all alif forms to bare alif
    - normYa : Map ya to alif maqsura
    - removeDiacritics : Strip all diacritics
    - removeTatweel : Strip tatweel elongation character
    - removeQuranChars : Remove diacritics that appear in the Quran
    - removeProMarker : Remove the ATB null pronoun marker
    - removeSegMarker : Remove the ATB clitic segmentation marker
    - removeMorphMarker : Remove the ATB morpheme boundary markers
    - removeLengthening : Replace all sequences of three or more identical (non-period) characters with one copy
    - atbEscaping : Replace left/right parentheses with ATB escape characters
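    To make a few of these options concrete, the sketch below re-creates four of them (useUTF8Ellipsis, normArDigits, removeTatweel, removeLengthening) as plain string operations using only the JDK. It illustrates the behaviour described in the list and is not the tokenizer's JFlex implementation. The class `NormalizationSketch` and its method names are hypothetical, and the digit mapping assumes that "Arabic digits" means the Arabic-Indic digits U+0660 to U+0669.

```java
class NormalizationSketch {
  // useUTF8Ellipsis: three or more full stops become one ellipsis character (U+2026).
  static String useUtf8Ellipsis(String s) {
    return s.replaceAll("\\.{3,}", "\u2026");
  }

  // normArDigits: map Arabic-Indic digits (U+0660 to U+0669) to ASCII 0-9.
  static String normArDigits(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      boolean arabicDigit = c >= '\u0660' && c <= '\u0669';
      sb.append(arabicDigit ? (char) ('0' + (c - '\u0660')) : c);
    }
    return sb.toString();
  }

  // removeTatweel: strip the tatweel elongation character (U+0640).
  static String removeTatweel(String s) {
    return s.replace("\u0640", "");
  }

  // removeLengthening: three or more identical non-period characters collapse to one copy.
  static String removeLengthening(String s) {
    return s.replaceAll("([^.])\\1{2,}", "$1");
  }

  public static void main(String[] args) {
    System.out.println(useUtf8Ellipsis("wait...."));
    System.out.println(normArDigits("\u0661\u0662\u0663"));
    System.out.println(removeTatweel("a\u0640\u0640b"));
    System.out.println(removeLengthening("soooo good..."));
  }
}
```

    Excluding the period from the lengthening rule keeps it from interfering with the ellipsis rule, so the two can be applied in either order.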
    Parameters:
    
    args - command-line arguments, such as the normalization options listed above
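    Both the constructor and `newArabicTokenizer` accept a `java.util.Properties` object. A sketch of assembling one from the option names listed above follows. It assumes that an option is switched on by setting a property of the same name to "true", which this page does not state explicitly; `LexerOptions` and `fromOptions` are hypothetical helper names.

```java
import java.util.Properties;

class LexerOptions {
  // Turns a comma-separated option string into Properties, one "true" entry per option.
  // Assumption: the lexer reads each option as a property named after it.
  static Properties fromOptions(String commaSeparated) {
    Properties props = new Properties();
    for (String option : commaSeparated.split(",")) {
      String name = option.trim();
      if (!name.isEmpty()) props.setProperty(name, "true");
    }
    return props;
  }

  public static void main(String[] args) {
    Properties p = fromOptions("normArDigits,normArPunc,removeDiacritics");
    System.out.println(p.stringPropertyNames());
    // Assumed usage with the library on the classpath (not compiled here):
    // ArabicTokenizer<CoreLabel> tok = ArabicTokenizer.newArabicTokenizer(reader, p);
  }
}
```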
