BaseUnknownWordModel (Stanford JavaNLP API)

java.lang.Object
- edu.stanford.nlp.parser.lexparser.BaseUnknownWordModel

All Implemented Interfaces:

UnknownWordModel, java.io.Serializable

Direct Known Subclasses:

ArabicUnknownWordModel, ChineseUnknownWordModel, EnglishUnknownWordModel, FrenchUnknownWordModel, GermanUnknownWordModel, SpanishUnknownWordModel
```
public class BaseUnknownWordModel
extends java.lang.Object
implements UnknownWordModel
```
An unknown word model for a generic language. It was originally designed for German and was changed only to remove German-specific numeric features. It models unknown words based on their prefixes and suffixes, as well as capital letters.

Author:

Roger Levy, Greg Donaker (corrections and modeling improvements), Christopher Manning (generalized and improved what Greg did), Anna Rafferty

See Also:

Serialized Form

Field Summary

Fields
| Modifier and Type | Field and Description |
| --- | --- |
| `protected static IntTaggedWord` | `NULL_ITW` |
| `protected static short` | `nullTag` |
| `protected static int` | `nullWord` |
| `protected java.util.Map<Label,ClassicCounter<java.lang.String>>` | `tagHash` This maps from a tag (as a label) to a Counter from word signatures to their P(sig\|tag), as estimated in the model. |
| `protected Index<java.lang.String>` | `tagIndex` |
| `protected TrainOptions` | `trainOptions` |
| `protected static java.lang.String` | `unknown` |
| `protected int` | `unknownLevel` What type of equivalence classing is done in getSignature |
| `protected ClassicCounter<IntTaggedWord>` | `unSeenCounter` Has counts for taggings in terms of unseen signatures. |
| `protected boolean` | `useFirst` |
| `protected boolean` | `useGT` |
| `protected static boolean` | `VERBOSE` |
| `protected Index<java.lang.String>` | `wordIndex` |

Constructor Summary

Constructors
Constructor and Description
`BaseUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)` This constructor creates a UWM (unknown word model) with empty data structures.
`BaseUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter, java.util.Map<Label,ClassicCounter<java.lang.String>> tagHash, java.util.Map<java.lang.String,java.lang.Float> unknownGT, java.util.Set<java.lang.String> seenEnd)`

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `void` | `addTagging(boolean seen, IntTaggedWord itw, double count)` Adds the tagging with count to the data structures in this Lexicon. |
| `Lexicon` | `getLexicon()` Get the lexicon associated with this unknown word model; usually not used, but might be useful to tell you if a related word is known or unknown, for example. |
| `java.lang.String` | `getSignature(java.lang.String word, int loc)` Signature for a specific word; loc parameter is ignored. |
| `int` | `getSignatureIndex(int wordIndex, int sentencePosition, java.lang.String word)` Returns an unknown word signature as an integer index rather than as a String. |
| `int` | `getUnknownLevel()` Get the level of equivalence classing for the model. |
| `float` | `score(IntTaggedWord itw, int loc, double c_Tseen, double total, double smooth, java.lang.String word)` Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them. |
| `float` | `score(IntTaggedWord itw, java.lang.String word)` |
| `protected float` | `scoreGT(java.lang.String tag)` |
| `double` | `scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, java.lang.String word)` Calculate P(Tag\|Signature) with Bayesian smoothing via just P(Tag\|Unknown) |
| `Counter<IntTaggedWord>` | `unSeenCounter()` Returns a Counter from IntTaggedWord to how often they have been seen. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - VERBOSE
```
protected static final boolean VERBOSE
```
    See Also:
    
    Constant Field Values
  - useFirst
```
protected boolean useFirst
```
  - useGT
```
protected boolean useGT
```
  - unknownLevel
```
protected final int unknownLevel
```
    What type of equivalence classing is done in getSignature
  - unknown
```
protected static final java.lang.String unknown
```
    See Also:
    
    Constant Field Values
  - nullWord
```
protected static final int nullWord
```
    See Also:
    
    Constant Field Values
  - nullTag
```
protected static final short nullTag
```
    See Also:
    
    Constant Field Values
  - NULL_ITW
```
protected static final IntTaggedWord NULL_ITW
```
  - trainOptions
```
protected final TrainOptions trainOptions
```
  - wordIndex
```
protected final Index<java.lang.String> wordIndex
```
  - tagIndex
```
protected final Index<java.lang.String> tagIndex
```
  - unSeenCounter
```
protected final ClassicCounter<IntTaggedWord> unSeenCounter
```
    Has counts for taggings in terms of unseen signatures. The IntTaggedWords are for (tag,sig), (tag,null), (null,sig), (null,null). (None for basic UNK if there are signatures.)
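
    The four kinds of keys listed above can be pictured with a small stand-alone sketch. It does not use the Stanford classes: `UnseenCountsSketch`, `addUnseen`, `count`, and the `"<null>"` placeholder are hypothetical names, and a plain `Map` stands in for `ClassicCounter<IntTaggedWord>`. Whether the library records all four counts in one call or leaves that to the trainer is not stated here, so treat the single `addUnseen` method as an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Stanford API.
final class UnseenCountsSketch {
    // Placeholder standing in for a "null" tag or a "null" signature.
    static final String NULL = "<null>";

    private final Map<String, Double> counts = new HashMap<>();

    private static String key(String tag, String sig) {
        return tag + "\t" + sig;
    }

    // Record one unseen tagging under all four keys:
    // (tag,sig), (tag,null), (null,sig), (null,null).
    void addUnseen(String tag, String sig, double count) {
        counts.merge(key(tag, sig), count, Double::sum);
        counts.merge(key(tag, NULL), count, Double::sum);
        counts.merge(key(NULL, sig), count, Double::sum);
        counts.merge(key(NULL, NULL), count, Double::sum);
    }

    // Missing keys count as zero, as a counter would report them.
    double count(String tag, String sig) {
        return counts.getOrDefault(key(tag, sig), 0.0);
    }

    public static void main(String[] args) {
        UnseenCountsSketch c = new UnseenCountsSketch();
        c.addUnseen("NN", "UNK-ung", 2.0);
        c.addUnseen("VB", "UNK-ung", 1.0);
        System.out.println(c.count("NN", "UNK-ung"));
        System.out.println(c.count(NULL, "UNK-ung"));
        System.out.println(c.count(NULL, NULL));
    }
}
```

    Keeping the marginal counts next to the joint count means a later estimate of P(tag|signature) needs only lookups, not a pass over all entries.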
  - tagHash
```
protected final java.util.Map<Label,ClassicCounter<java.lang.String>> tagHash
```
    This maps from a tag (as a label) to a Counter from word signatures to their P(sig|tag), as estimated in the model. For Chinese, the word signature is just the first character or its unicode type for things that aren't Chinese characters.
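
    As a rough picture of this nested mapping, the sketch below uses plain `String` keys and `Map` values in place of `Label` and `ClassicCounter`. The class `TagHashSketch` and its methods are hypothetical, and returning the log of the stored probability is an assumption made for illustration, not a description of how the class computes its score.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the Stanford API.
final class TagHashSketch {
    // tag -> (signature -> P(sig|tag))
    private final Map<String, Map<String, Double>> tagHash = new HashMap<>();

    void put(String tag, String sig, double prob) {
        tagHash.computeIfAbsent(tag, t -> new HashMap<>()).put(sig, prob);
    }

    // Unknown tags or signatures yield probability 0.0.
    double prob(String tag, String sig) {
        Map<String, Double> bySig = tagHash.get(tag);
        return bySig == null ? 0.0 : bySig.getOrDefault(sig, 0.0);
    }

    // Log of the stored probability; negative infinity when nothing is stored.
    double logProb(String tag, String sig) {
        return Math.log(prob(tag, sig));
    }

    public static void main(String[] args) {
        TagHashSketch t = new TagHashSketch();
        t.put("NN", "UNK-ung", 0.5);
        System.out.println(t.logProb("NN", "UNK-ung"));
        System.out.println(t.logProb("NN", "UNK-xyz"));
    }
}
```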
- Constructor Detail
  - BaseUnknownWordModel
```
public BaseUnknownWordModel(Options op,
                            Lexicon lex,
                            Index<java.lang.String> wordIndex,
                            Index<java.lang.String> tagIndex,
                            ClassicCounter<IntTaggedWord> unSeenCounter,
                            java.util.Map<Label,ClassicCounter<java.lang.String>> tagHash,
                            java.util.Map<java.lang.String,java.lang.Float> unknownGT,
                            java.util.Set<java.lang.String> seenEnd)
```
  - BaseUnknownWordModel
```
public BaseUnknownWordModel(Options op,
                            Lexicon lex,
                            Index<java.lang.String> wordIndex,
                            Index<java.lang.String> tagIndex)
```
    This constructor creates a UWM (unknown word model) with empty data structures. Use it only if the data is loaded separately, such as by reading in text lines containing the data.
- Method Detail
  - score
```
public float score(IntTaggedWord itw,
                   int loc,
                   double c_Tseen,
                   double total,
                   double smooth,
                   java.lang.String word)
```
    Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them.
    
    Specified by:
    
    score in interface UnknownWordModel
    
    Parameters:
    
    itw - An IntTaggedWord pairing a word and POS tag
    
    loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial. Now, a negative value
    
    c_Tseen - Total count of this tag (on seen words) in training
    
    total - Total count of word tokens in training
    
    smooth - Weighting on prior P(T|U) in estimate
    
    word - The word itself; useful so we don't look it up in the index
    
    Returns:
    
    A float-valued score, usually - log P(word|tag)
  - score
```
public float score(IntTaggedWord itw,
                   java.lang.String word)
```
  - scoreProbTagGivenWordSignature
```
public double scoreProbTagGivenWordSignature(IntTaggedWord iTW,
                                             int loc,
                                             double smooth,
                                             java.lang.String word)
```
    Calculate P(Tag|Signature) with Bayesian smoothing via just P(Tag|Unknown)
    
    Specified by:
    
    scoreProbTagGivenWordSignature in interface UnknownWordModel
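
    One common way to write this kind of smoothing is to add `smooth` pseudo-counts distributed according to the prior P(Tag|Unknown). The sketch below shows that arithmetic under stated assumptions: the class `TagGivenSignatureSketch`, its method, and its parameter names are hypothetical, the counts are passed in directly rather than read from `unSeenCounter`, and the exact formula used by this class is not given on this page.

```java
// Hypothetical illustration only; not part of the Stanford API.
final class TagGivenSignatureSketch {
    private TagGivenSignatureSketch() {}

    /**
     * Smoothed estimate of P(tag|signature).
     *
     * @param cTagSig  count of (tag, signature) among unseen taggings
     * @param cSig     count of the signature with any tag
     * @param cTag     count of the tag with any unknown word
     * @param cUnknown total count of unknown-word taggings (must be positive)
     * @param smooth   weight on the prior P(tag|unknown)
     */
    static double probTagGivenSignature(double cTagSig, double cSig,
                                        double cTag, double cUnknown,
                                        double smooth) {
        if (cUnknown <= 0.0) {
            throw new IllegalArgumentException("cUnknown must be positive");
        }
        double pTagGivenUnknown = cTag / cUnknown;
        return (cTagSig + smooth * pTagGivenUnknown) / (cSig + smooth);
    }

    public static void main(String[] args) {
        // Signature seen 4 times, 3 of them with this tag; prior is 10/40.
        System.out.println(probTagGivenSignature(3, 4, 10, 40, 1.0));
        // Signature never seen: the estimate falls back to the prior.
        System.out.println(probTagGivenSignature(0, 0, 10, 40, 1.0));
    }
}
```

    With no evidence for a signature the estimate equals the prior, and as the signature count grows the observed ratio dominates.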
  - scoreGT
```
protected float scoreGT(java.lang.String tag)
```
  - getSignature
```
public java.lang.String getSignature(java.lang.String word,
                                     int loc)
```
    Signature for a specific word; loc parameter is ignored.
    
    Specified by:
    
    getSignature in interface UnknownWordModel
    
    Parameters:
    
    word - The word
    
    loc - Its sentence position
    
    Returns:
    
    A "signature" (which represents an equivalence class of Strings), e.g., a suffix of the string
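
    To make the idea of a signature as an equivalence class concrete, here is a minimal sketch that maps a word to a string built from a capitalization flag and a lowercase suffix. The class `SignatureSketch`, the `UNK` prefix, and the `suffixLength` parameter are hypothetical choices for illustration; the actual signatures produced by `getSignature` depend on `unknownLevel` and are not specified on this page.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of the Stanford API.
final class SignatureSketch {
    private SignatureSketch() {}

    // Words sharing capitalization and final characters get the same signature.
    static String signature(String word, int suffixLength) {
        if (word.isEmpty()) {
            return "UNK";
        }
        StringBuilder sig = new StringBuilder("UNK");
        if (Character.isUpperCase(word.charAt(0))) {
            sig.append("-CAP");
        }
        // Take the whole word when it is shorter than the requested suffix.
        int start = Math.max(0, word.length() - suffixLength);
        sig.append('-').append(word.substring(start).toLowerCase(Locale.ROOT));
        return sig.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("Wanderung", 3)); // UNK-CAP-ung
        System.out.println(signature("walking", 3));   // UNK-ing
    }
}
```

    Like the documented method, this sketch takes no account of sentence position.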
  - getSignatureIndex
```
public int getSignatureIndex(int wordIndex,
                             int sentencePosition,
                             java.lang.String word)
```
    Description copied from interface: UnknownWordModel
    
    Returns an unknown word signature as an integer index rather than as a String.
    
    Specified by:
    
    getSignatureIndex in interface UnknownWordModel
  - getLexicon
```
public Lexicon getLexicon()
```
    Get the lexicon associated with this unknown word model; usually not used, but might be useful to tell you if a related word is known or unknown, for example.
    
    Specified by:
    
    getLexicon in interface UnknownWordModel
    
    Returns:
    
    The lexicon used by this unknown word model
  - getUnknownLevel
```
public int getUnknownLevel()
```
    Description copied from interface: UnknownWordModel
    
    Get the level of equivalence classing for the model. One unknown word model may allow different options to be set; for example, several models of unknown words for a given language could be included in one class. The unknown level can be queried with this method.
    
    Specified by:
    
    getUnknownLevel in interface UnknownWordModel
    
    Returns:
    
    The current level of unknown word equivalence classing
  - addTagging
```
public void addTagging(boolean seen,
                       IntTaggedWord itw,
                       double count)
```
    Adds the tagging with count to the data structures in this Lexicon.
    
    Specified by:
    
    addTagging in interface UnknownWordModel
    
    Parameters:
    
    seen - Whether tagging is seen
    
    itw - The tagging
    
    count - Its weight
  - unSeenCounter
```
public Counter<IntTaggedWord> unSeenCounter()
```
    Description copied from interface: UnknownWordModel
    
    Returns a Counter from IntTaggedWord to how often they have been seen.
    
    Specified by:
    
    unSeenCounter in interface UnknownWordModel
