ChineseCharacterBasedLexicon (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.parser.lexparser.ChineseCharacterBasedLexicon

All Implemented Interfaces:

Lexicon, java.io.Serializable
```
public class ChineseCharacterBasedLexicon
extends java.lang.Object
implements Lexicon
```
Author:

Galen Andrew

See Also:

Serialized Form

Field Summary
- Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
  BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD

Constructor Summary

Constructors
Constructor and Description
`ChineseCharacterBasedLexicon(ChineseTreebankParserParams params, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)`

Method Summary

Modifier and Type	Method and Description
`void`	`finishTraining()` Done collecting statistics for the lexicon.
`Distribution<java.lang.String>`	`getPOSDistribution()`
`UnknownWordModel`	`getUnknownWordModel()`
`void`	`incrementTreesRead(double weight)` If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens.
`void`	`initializeTraining(double numTrees)` Start training this lexicon on the expected number of trees.
`static boolean`	`isForeign(java.lang.String s)`
`boolean`	`isKnown(int word)` Checks whether a word is in the lexicon.
`boolean`	`isKnown(java.lang.String word)` Checks whether a word is in the lexicon.
`int`	`numRules()` Returns the number of rules (tag rewrites as word) in the Lexicon.
`void`	`readData(java.io.BufferedReader in)` Read the lexicon from the BufferedReader in the format written by writeData.
`java.util.Iterator<IntTaggedWord>`	`ruleIteratorByWord(int word, int loc, java.lang.String featureSpec)` Get an iterator over all rules (pairs of (word, POS)) for this word.
`java.util.Iterator<IntTaggedWord>`	`ruleIteratorByWord(java.lang.String word, int loc, java.lang.String featureSpec)` Same thing, but with a string that needs to be translated by the lexicon's word index
`java.lang.String`	`sampleFrom()` Samples over words regardless of POS: first samples POS, then samples word according to that POS
`java.lang.String`	`sampleFrom(java.lang.String tag)` Samples from the distribution over words with this POS according to the lexicon.
`float`	`score(IntTaggedWord iTW, int loc, java.lang.String word, java.lang.String featureSpec)` Get the score of this word with this tag (as an IntTaggedWord) at this loc.
`void`	`setUnknownWordModel(UnknownWordModel uwm)`
`java.util.Set<java.lang.String>`	`tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction)` Return the Set of tags used by this tagger (available after training the tagger).
`void`	`train(java.util.Collection<Tree> trees)` Train this lexicon on the given set of trees.
`void`	`train(java.util.Collection<Tree> trees, java.util.Collection<Tree> rawTrees)`
`void`	`train(java.util.Collection<Tree> trees, double weight)` Train this lexicon on the given set of trees.
`void`	`train(java.util.List<TaggedWord> sentence, double weight)` Not all subclasses support this particular method.
`void`	`train(TaggedWord tw, int loc, double weight)` Not all subclasses support this particular method.
`void`	`train(Tree tree, double weight)` TODO: make this method do something with the weight
`void`	`trainUnannotated(java.util.List<TaggedWord> sentence, double weight)` Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
`void`	`writeData(java.io.Writer w)` Write the lexicon in human-readable format to the Writer.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - ChineseCharacterBasedLexicon
```
public ChineseCharacterBasedLexicon(ChineseTreebankParserParams params,
                                    Index<java.lang.String> wordIndex,
                                    Index<java.lang.String> tagIndex)
```
- Method Detail
  - initializeTraining
```
public void initializeTraining(double numTrees)
```
    Description copied from interface: Lexicon
    
    Start training this lexicon on the expected number of trees. (Some UnknownWordModels use the number of trees to know when to start counting statistics.)
    
    Specified by:
    
    initializeTraining in interface Lexicon
  - train
```
public void train(java.util.Collection<Tree> trees)
```
    Train this lexicon on the given set of trees.
    
    Specified by:
    
    train in interface Lexicon
    
    Parameters:
    
    trees - Trees to train on
  - train
```
public void train(java.util.Collection<Tree> trees,
                  double weight)
```
    Train this lexicon on the given set of trees.
    
    Specified by:
    
    train in interface Lexicon
  - train
```
public void train(Tree tree,
                  double weight)
```
    Train this lexicon on a single tree. The weight argument does not appear to be used yet (TODO: make this method do something with the weight).
    
    Specified by:
    
    train in interface Lexicon
  - trainUnannotated
```
public void trainUnannotated(java.util.List<TaggedWord> sentence,
                             double weight)
```
    Description copied from interface: Lexicon
    
    Sometimes we might have a sentence of tagged words which we would like to add to the lexicon, but they weren't part of a binarized, markovized, or otherwise annotated tree.
    
    Specified by:
    
    trainUnannotated in interface Lexicon
  - incrementTreesRead
```
public void incrementTreesRead(double weight)
```
    Description copied from interface: Lexicon
    
    If training on a per-word basis instead of on a per-tree basis, we will want to increment the tree count as this happens.
    
    Specified by:
    
    incrementTreesRead in interface Lexicon
  - train
```
public void train(TaggedWord tw,
                  int loc,
                  double weight)
```
    Description copied from interface: Lexicon
    
    Not all subclasses support this particular method. Those that don't will fail when it is called.
    
    Specified by:
    
    train in interface Lexicon
  - train
```
public void train(java.util.List<TaggedWord> sentence,
                  double weight)
```
    Description copied from interface: Lexicon
    
    Not all subclasses support this particular method. Those that don't will fail when it is called.
    
    Specified by:
    
    train in interface Lexicon
  - finishTraining
```
public void finishTraining()
```
    Description copied from interface: Lexicon
    
    Done collecting statistics for the lexicon.
    
    Specified by:
    
    finishTraining in interface Lexicon
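    
    The training methods above suggest a call order: initializeTraining, then one or more train calls (with incrementTreesRead when training per word), then finishTraining. The following is a minimal sketch of that order only. TrainableLexiconSketch, RecordingLexicon, and TaggedToken are hypothetical stand-ins written for this illustration; they are not the real Lexicon interface or TaggedWord class, and treating one sentence as one tree is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a (word, tag) pair; not the real TaggedWord class.
final class TaggedToken {
    final String word;
    final String tag;

    TaggedToken(String word, String tag) {
        this.word = word;
        this.tag = tag;
    }
}

// Hypothetical stand-in mirroring only the lifecycle methods listed above.
interface TrainableLexiconSketch {
    void initializeTraining(double numTrees);
    void train(TaggedToken token, int loc, double weight);
    void incrementTreesRead(double weight);
    void finishTraining();
}

// Records every call so the call order can be inspected afterwards.
class RecordingLexicon implements TrainableLexiconSketch {
    final List<String> calls = new ArrayList<>();
    double treesRead = 0.0;

    @Override
    public void initializeTraining(double numTrees) {
        calls.add("initializeTraining");
    }

    @Override
    public void train(TaggedToken token, int loc, double weight) {
        calls.add("train:" + token.word + "/" + token.tag + "@" + loc);
    }

    @Override
    public void incrementTreesRead(double weight) {
        treesRead += weight;
        calls.add("incrementTreesRead");
    }

    @Override
    public void finishTraining() {
        calls.add("finishTraining");
    }
}

class TrainingLifecycleSketch {
    // Trains word by word and bumps the tree count once per sentence
    // (assumption: one sentence corresponds to one tree).
    static void trainPerWord(TrainableLexiconSketch lex,
                             List<List<TaggedToken>> sentences,
                             double weight) {
        lex.initializeTraining(sentences.size());
        for (List<TaggedToken> sentence : sentences) {
            for (int loc = 0; loc < sentence.size(); loc++) {
                lex.train(sentence.get(loc), loc, weight);
            }
            lex.incrementTreesRead(weight);
        }
        lex.finishTraining();
    }
}
```

    Whether a concrete lexicon supports per-word training at all depends on the implementation, so a per-tree train call may be the only usable path for some classes.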
  - getPOSDistribution
```
public Distribution<java.lang.String> getPOSDistribution()
```
  - isForeign
```
public static boolean isForeign(java.lang.String s)
```
  - score
```
public float score(IntTaggedWord iTW,
                   int loc,
                   java.lang.String word,
                   java.lang.String featureSpec)
```
    Description copied from interface: Lexicon
    
    Get the score of this word with this tag (as an IntTaggedWord) at this loc. (Presumably an estimate of P(word | tag).)
    
    Specified by:
    
    score in interface Lexicon
    
    Parameters:
    
    iTW - An IntTaggedWord pairing a word and POS tag
    
    loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial.
    
    word - The word itself; useful so we don't have to look it up in an index
    
    featureSpec - TODO
    
    Returns:
    
    A score, usually log P(word|tag)
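    
    To make the "log P(word|tag)" return value concrete, here is a minimal sketch that estimates it by relative frequency from weighted counts. This is an illustrative assumption, not the model this class uses: RelativeFrequencyScorer and its methods are hypothetical names, and the sketch ignores loc, featureSpec, and unknown-word handling.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: scores (word, tag) pairs as log(count(word, tag) / count(tag)).
class RelativeFrequencyScorer {
    private final Map<String, Double> tagCounts = new HashMap<>();
    private final Map<String, Double> wordTagCounts = new HashMap<>();

    // Adds one weighted observation of a word with a tag.
    void observe(String word, String tag, double weight) {
        tagCounts.merge(tag, weight, Double::sum);
        wordTagCounts.merge(key(word, tag), weight, Double::sum);
    }

    // Returns log P(word | tag), or negative infinity for unseen pairs.
    float score(String word, String tag) {
        double joint = wordTagCounts.getOrDefault(key(word, tag), 0.0);
        double total = tagCounts.getOrDefault(tag, 0.0);
        if (joint == 0.0 || total == 0.0) {
            return Float.NEGATIVE_INFINITY;
        }
        return (float) Math.log(joint / total);
    }

    // Joins tag and word with a tab so the pair can be used as a single map key.
    private static String key(String word, String tag) {
        return tag + "\t" + word;
    }
}
```

    Because the result is a log probability, scores are at most 0, and an unseen pair maps to negative infinity unless some smoothing or unknown-word model is added.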
  - sampleFrom
```
public java.lang.String sampleFrom(java.lang.String tag)
```
    Samples from the distribution over words with this POS according to the lexicon.
    
    Parameters:
    
    tag - the POS of the word to sample
    
    Returns:
    
    a sampled word
  - sampleFrom
```
public java.lang.String sampleFrom()
```
    Samples a word regardless of POS: first samples a POS, then samples a word according to that POS.
    
    Returns:
    
    a sampled word
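    
    The two-stage sampling described above (first a POS, then a word given that POS) can be sketched with plain weighted maps. TwoStageSampler and its map-based distributions are hypothetical and stand in for whatever distributions the lexicon holds internally; this is a sketch of the idea, not the class's implementation.

```java
import java.util.Map;
import java.util.Random;

// Hypothetical helper illustrating "sample a tag, then sample a word for that tag".
class TwoStageSampler {
    // Draws a key with probability proportional to its non-negative weight.
    static String sampleKey(Map<String, Double> weights, Random rng) {
        double total = 0.0;
        for (double w : weights.values()) {
            total += w;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("weights must have a positive total");
        }
        double r = rng.nextDouble() * total;
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            last = e.getKey();
            r -= e.getValue();
            if (r < 0.0) {
                return last;
            }
        }
        return last; // reached only through floating-point round-off
    }

    // Stage 1: sample a tag. Stage 2: sample a word from that tag's distribution.
    static String sampleWord(Map<String, Double> tagDist,
                             Map<String, Map<String, Double>> wordDistByTag,
                             Random rng) {
        String tag = sampleKey(tagDist, rng);
        Map<String, Double> words = wordDistByTag.get(tag);
        if (words == null) {
            throw new IllegalStateException("no word distribution for tag " + tag);
        }
        return sampleKey(words, rng);
    }
}
```

    Passing in the Random instance keeps the sketch reproducible under a fixed seed.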
  - ruleIteratorByWord
```
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(int word,
                                                            int loc,
                                                            java.lang.String featureSpec)
```
    Description copied from interface: Lexicon
    
    Get an iterator over all rules (pairs of (word, POS)) for this word.
    
    Specified by:
    
    ruleIteratorByWord in interface Lexicon
    
    Parameters:
    
    word - The word, represented as an integer in Index
    
    loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
    
    featureSpec - Additional word features like morphosyntactic information.
    
    Returns:
    
    An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)
  - ruleIteratorByWord
```
public java.util.Iterator<IntTaggedWord> ruleIteratorByWord(java.lang.String word,
                                                            int loc,
                                                            java.lang.String featureSpec)
```
    Description copied from interface: Lexicon
    
    Same as ruleIteratorByWord(int, int, String), but takes the word as a String that is translated to an integer by the lexicon's word index.
    
    Specified by:
    
    ruleIteratorByWord in interface Lexicon
  - numRules
```
public int numRules()
```
    Returns the number of rules (tag rewrites as word) in the Lexicon. This method isn't yet implemented in this class. It currently just returns 0, which may or may not be helpful.
    
    Specified by:
    
    numRules in interface Lexicon
    
    Returns:
    
    The number of rules (tag rewrites as word) in the Lexicon.
  - readData
```
public void readData(java.io.BufferedReader in)
              throws java.io.IOException
```
    Description copied from interface: Lexicon
    
    Read the lexicon from the BufferedReader in the format written by writeData. (An optional operation.)
    
    Specified by:
    
    readData in interface Lexicon
    
    Parameters:
    
    in - The BufferedReader to read from
    
    Throws:
    
    java.io.IOException - If any I/O problem occurs
  - writeData
```
public void writeData(java.io.Writer w)
               throws java.io.IOException
```
    Description copied from interface: Lexicon
    
    Write the lexicon in human-readable format to the Writer. (An optional operation.)
    
    Specified by:
    
    writeData in interface Lexicon
    
    Parameters:
    
    w - The writer to output to
    
    Throws:
    
    java.io.IOException - If any I/O problem occurs
  - isKnown
```
public boolean isKnown(int word)
```
    Description copied from interface: Lexicon
    
    Checks whether a word is in the lexicon.
    
    Specified by:
    
    isKnown in interface Lexicon
    
    Parameters:
    
    word - The word as an int
    
    Returns:
    
    Whether the word is in the lexicon
  - isKnown
```
public boolean isKnown(java.lang.String word)
```
    Description copied from interface: Lexicon
    
    Checks whether a word is in the lexicon.
    
    Specified by:
    
    isKnown in interface Lexicon
    
    Parameters:
    
    word - The word as a String
    
    Returns:
    
    Whether the word is in the lexicon
  - tagSet
```
public java.util.Set<java.lang.String> tagSet(java.util.function.Function<java.lang.String,java.lang.String> basicCategoryFunction)
```
    Return the Set of tags used by this tagger (available after training the tagger).
    
    Specified by:
    
    tagSet in interface Lexicon
    
    Returns:
    
    The Set of tags used by this tagger
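    
    The basicCategoryFunction parameter is not described above. One plausible reading, offered here as an assumption, is that it maps each annotated tag to a basic category before the tags are collected into a set. The sketch below illustrates that reading only; TagSetSketch and stripAnnotation are hypothetical, and the "strip everything from the first hyphen" rule is an invented example of such a function.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical helper: collapses annotated tags to basic categories.
class TagSetSketch {
    // Applies the function to every tag and collects the distinct results.
    static Set<String> tagSet(Collection<String> annotatedTags,
                              Function<String, String> basicCategoryFunction) {
        Set<String> tags = new TreeSet<>();
        for (String tag : annotatedTags) {
            tags.add(basicCategoryFunction.apply(tag));
        }
        return tags;
    }

    // Example function (an assumption): drops everything from the first '-' onward.
    static String stripAnnotation(String tag) {
        int cut = tag.indexOf('-');
        return cut > 0 ? tag.substring(0, cut) : tag;
    }
}
```

    With this reading, passing Function.identity() would return the tags unchanged.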
  - getUnknownWordModel
```
public UnknownWordModel getUnknownWordModel()
```
    Specified by:
    
    getUnknownWordModel in interface Lexicon
  - setUnknownWordModel
```
public void setUnknownWordModel(UnknownWordModel uwm)
```
    Specified by:
    
    setUnknownWordModel in interface Lexicon
  - train
```
public void train(java.util.Collection<Tree> trees,
                  java.util.Collection<Tree> rawTrees)
```
    Specified by:
    
    train in interface Lexicon
