public class BaseUnknownWordModel extends java.lang.Object implements UnknownWordModel
Modifier and Type | Field and Description |
---|---|
protected static IntTaggedWord | NULL_ITW |
protected static short | nullTag |
protected static int | nullWord |
protected java.util.Map<Label,ClassicCounter<java.lang.String>> | tagHash: This maps from a tag (as a label) to a Counter from word signatures to their P(sig\|tag), as estimated in the model. |
protected Index<java.lang.String> | tagIndex |
protected TrainOptions | trainOptions |
protected static java.lang.String | unknown |
protected int | unknownLevel: What type of equivalence classing is done in getSignature. |
protected ClassicCounter<IntTaggedWord> | unSeenCounter: Has counts for taggings in terms of unseen signatures. |
protected boolean | useFirst |
protected boolean | useGT |
protected static boolean | VERBOSE |
protected Index<java.lang.String> | wordIndex |
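
The tagHash field is described as a map from a tag to a counter over word signatures holding P(sig|tag). A minimal sketch of that shape follows, using plain java.util maps instead of the library's Label and ClassicCounter types. The class name TagSignatureTable, its methods, and the relative-frequency estimate are assumptions made for illustration; the source does not say how the model estimates these probabilities.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a tag -> (signature -> P(sig|tag)) table. */
final class TagSignatureTable {
  // Raw counts: tag -> signature -> count.
  private final Map<String, Map<String, Double>> counts = new HashMap<>();

  /** Records that a word with this signature was observed with this tag. */
  void add(String tag, String signature, double count) {
    counts.computeIfAbsent(tag, t -> new HashMap<>())
          .merge(signature, count, Double::sum);
  }

  /**
   * Relative-frequency estimate of P(sig|tag) (an assumption, not the
   * library's documented estimator). Returns 0.0 for a tag never seen.
   */
  double probSigGivenTag(String signature, String tag) {
    Map<String, Double> sigCounts = counts.get(tag);
    if (sigCounts == null) {
      return 0.0;
    }
    double total = 0.0;
    for (double c : sigCounts.values()) {
      total += c;
    }
    return total == 0.0 ? 0.0 : sigCounts.getOrDefault(signature, 0.0) / total;
  }
}
```

Storing raw counts and normalising on lookup keeps the table correct as more taggings are added; a real model would more likely normalise once after training.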
Constructor and Description |
---|
BaseUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex): This constructor creates an unknown word model (UWM) with empty data structures. |
BaseUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter, java.util.Map<Label,ClassicCounter<java.lang.String>> tagHash, java.util.Map<java.lang.String,java.lang.Float> unknownGT, java.util.Set<java.lang.String> seenEnd) |
Modifier and Type | Method and Description |
---|---|
void | addTagging(boolean seen, IntTaggedWord itw, double count): Adds the tagging with count to the data structures in this Lexicon. |
Lexicon | getLexicon(): Get the lexicon associated with this unknown word model; usually not used, but might be useful to tell you if a related word is known or unknown, for example. |
java.lang.String | getSignature(java.lang.String word, int loc): Signature for a specific word; loc parameter is ignored. |
int | getSignatureIndex(int wordIndex, int sentencePosition, java.lang.String word): Returns an unknown word signature as an integer index rather than as a String. |
int | getUnknownLevel(): Get the level of equivalence classing for the model. |
float | score(IntTaggedWord itw, int loc, double c_Tseen, double total, double smooth, java.lang.String word): Currently we don't consider loc or the other parameters in determining score in the default implementation; only English uses them. |
float | score(IntTaggedWord itw, java.lang.String word) |
protected float | scoreGT(java.lang.String tag) |
double | scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, java.lang.String word): Calculate P(Tag\|Signature) with Bayesian smoothing via just P(Tag\|Unknown). |
Counter<IntTaggedWord> | unSeenCounter(): Returns a Counter from IntTaggedWord to how often they have been seen. |
protected static final boolean VERBOSE
protected boolean useFirst
protected boolean useGT
protected final int unknownLevel
protected static final java.lang.String unknown
protected static final int nullWord
protected static final short nullTag
protected static final IntTaggedWord NULL_ITW
protected final TrainOptions trainOptions
protected final Index<java.lang.String> wordIndex
protected final Index<java.lang.String> tagIndex
protected final ClassicCounter<IntTaggedWord> unSeenCounter
protected final java.util.Map<Label,ClassicCounter<java.lang.String>> tagHash
public BaseUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex, ClassicCounter<IntTaggedWord> unSeenCounter, java.util.Map<Label,ClassicCounter<java.lang.String>> tagHash, java.util.Map<java.lang.String,java.lang.Float> unknownGT, java.util.Set<java.lang.String> seenEnd)
public BaseUnknownWordModel(Options op, Lexicon lex, Index<java.lang.String> wordIndex, Index<java.lang.String> tagIndex)
public float score(IntTaggedWord itw, int loc, double c_Tseen, double total, double smooth, java.lang.String word)
score
in interface UnknownWordModel
itw
- An IntTaggedWord pairing a word and POS tag
loc
- The position in the sentence. In the default implementation
this is used only for unknown words to change their
probability distribution when sentence initial. Now,
a negative value
c_Tseen
- Total count of this tag (on seen words) in training
total
- Total count of word tokens in training
smooth
- Weighting on prior P(T|U) in estimate
word
- The word itself; useful so we don't look it up in the index
public float score(IntTaggedWord itw, java.lang.String word)
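
The c_Tseen and total parameters of the six-argument score suggest a Bayes-rule inversion from P(tag|signature) to P(word|tag). The sketch below shows one way a language-specific model might combine them; it is not the default implementation, which by its own description ignores these parameters. The class and method names are hypothetical, and the estimates P(word) = 1/total and P(tag) = c_Tseen/total are assumptions.

```java
/** Hypothetical sketch of how c_Tseen and total could enter an unknown word score. */
final class UnknownWordScoreSketch {

  /**
   * Converts P(tag|signature) into log P(word|tag) with Bayes' rule:
   *   P(word|tag) = P(tag|word) * P(word) / P(tag)
   * Assumptions: P(tag|word) is approximated by P(tag|signature),
   * P(word) = 1/total for a word never seen in training, and
   * P(tag) = cTseen/total.
   */
  static float logProbWordGivenTag(double pTagGivenSig, double cTseen, double total) {
    double pWord = 1.0 / total;
    double pTag = cTseen / total;
    return (float) Math.log(pTagGivenSig * pWord / pTag);
  }
}
```

Returning a log probability as a float matches the float return type of score; whether the library's value is a log probability is also an assumption here.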
public double scoreProbTagGivenWordSignature(IntTaggedWord iTW, int loc, double smooth, java.lang.String word)
scoreProbTagGivenWordSignature
in interface UnknownWordModel
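
The summary describes this method as computing P(Tag|Signature) with Bayesian smoothing via P(Tag|Unknown), and the smooth parameter as the weighting on that prior. A common form of such an estimate is sketched below; the exact formula used by the library is not given in this page, so the formula, the class name, and the parameter names are assumptions.

```java
/** Hypothetical sketch of a prior-smoothed estimate of P(tag|signature). */
final class TagGivenSignatureSketch {

  /**
   * Assumed estimate:
   *   P(T|S) = (c(T,S) + smooth * P(T|U)) / (c(S) + smooth)
   * where c(T,S) counts the tag with this signature, c(S) counts the
   * signature overall, and P(T|U) is the prior over tags for unknown words.
   */
  static double smoothedProb(double cTagSig, double cSig,
                             double pTagGivenUnknown, double smooth) {
    return (cTagSig + smooth * pTagGivenUnknown) / (cSig + smooth);
  }
}
```

With no observations of a signature the estimate falls back to the prior P(T|U), and as c(S) grows it approaches the raw relative frequency c(T,S)/c(S).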
protected float scoreGT(java.lang.String tag)
public java.lang.String getSignature(java.lang.String word, int loc)
getSignature
in interface UnknownWordModel
word
- The word
loc
- Its sentence position
public int getSignatureIndex(int wordIndex, int sentencePosition, java.lang.String word)
Returns an unknown word signature as an integer index rather than as a String.
getSignatureIndex
in interface UnknownWordModel
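
To illustrate the relationship between getSignature and getSignatureIndex, the sketch below maps a word to a coarse equivalence class and then interns that class as an integer. The shape features used here (capitalisation, digits, hyphen, two-letter suffix) are invented for illustration and are not the classing this model performs; SignatureSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical signature function plus a string-to-int index. */
final class SignatureSketch {
  private final Map<String, Integer> indexOf = new HashMap<>();
  private final List<String> signatures = new ArrayList<>();

  /** Illustrative equivalence classing; NOT the library's actual rules. */
  static String signature(String word) {
    StringBuilder sb = new StringBuilder("UNK");
    if (word.isEmpty()) {
      return sb.toString();
    }
    if (Character.isUpperCase(word.charAt(0))) {
      sb.append("-CAP");
    }
    if (word.chars().anyMatch(Character::isDigit)) {
      sb.append("-NUM");
    }
    if (word.indexOf('-') >= 0) {
      sb.append("-DASH");
    }
    if (word.length() > 3) {
      // Last two characters act as a crude suffix feature.
      sb.append('-').append(word.substring(word.length() - 2).toLowerCase(Locale.ROOT));
    }
    return sb.toString();
  }

  /** Returns a stable integer for the word's signature, assigning a new one if needed. */
  int signatureIndex(String word) {
    String sig = signature(word);
    Integer idx = indexOf.get(sig);
    if (idx == null) {
      idx = signatures.size();
      signatures.add(sig);
      indexOf.put(sig, idx);
    }
    return idx;
  }
}
```

Words that share a signature share an index, so downstream code can key counts on an int instead of repeatedly hashing strings.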
public Lexicon getLexicon()
getLexicon
in interface UnknownWordModel
public int getUnknownLevel()
Get the level of equivalence classing for the model.
getUnknownLevel
in interface UnknownWordModel
public void addTagging(boolean seen, IntTaggedWord itw, double count)
addTagging
in interface UnknownWordModel
seen
- Whether tagging is seen
itw
- The tagging
count
- Its weight
public Counter<IntTaggedWord> unSeenCounter()
Returns a Counter from IntTaggedWord to how often they have been seen.
unSeenCounter
in interface UnknownWordModel
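
The interplay between addTagging and unSeenCounter can be sketched as a weighted counter keyed on (signature, tag) pairs. This is a simplified stand-in built on java.util types rather than ClassicCounter and IntTaggedWord; the class name UnseenTaggingCounter is hypothetical, and the choice to ignore taggings flagged as seen is an assumption, not documented behaviour.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical counter of taggings, keyed on (signature, tag). */
final class UnseenTaggingCounter {
  private final Map<SimpleImmutableEntry<String, String>, Double> counts = new HashMap<>();

  /** Adds a weighted tagging; taggings marked as seen are skipped (assumption). */
  void addTagging(boolean seen, String signature, String tag, double count) {
    if (seen) {
      return;
    }
    counts.merge(new SimpleImmutableEntry<>(signature, tag), count, Double::sum);
  }

  /** Returns the accumulated weight for this pair, or 0.0 if never added. */
  double getCount(String signature, String tag) {
    return counts.getOrDefault(new SimpleImmutableEntry<>(signature, tag), 0.0);
  }
}
```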