This class contains feature extractors for the MaxentTagger that are only
applied to rare (low frequency/unknown) words.
The following options are supported:
Options for rare extractor frames
Name | Args | Effect |
wordshapes | left, right |
Word shape features, e.g., transform Foo5 into Xxx#
(not exactly like that, but that general idea).
Creates one feature for each word position from left ... right.
If only one argument is given, as in wordshapes(-2), then right is taken as 0.
If left is greater than right, no features are made.
Fairly English-specific. |
unicodeshapes | left, right |
Same as wordshapes, but works for Unicode characters generally. |
unicodeshapeconjunction | left, right |
Instead of individual word shape features, combines the word
shapes for positions left ... right into one feature. |
suffix | length, position |
Features for suffixes of the word at position. One feature for
each suffix of length 1 ... length. |
prefix | length, position |
Features for prefixes of the word at position. One feature for
each prefix of length 1 ... length. |
prefixsuffix | length |
Features for concatenated prefix and suffix. One feature for
each of length 1 ... length. |
capitalizationsuffix | length |
Current word only. Combines character suffixes up to size length with a
binary value for whether the word contains any capital letters. |
distsim | filename, left, right |
Individual features for each position left ... right.
Compares that word with the dictionary in filename. |
distsimconjunction | filename, left, right |
A concatenation of distsim features from left ... right. |
nonalphanumeric | - |
A single boolean: whether or not the word is non-alphanumeric. |
numeric | - |
A single boolean: whether or not the word is entirely numeric. |
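The table rows can be made concrete with a small sketch. The class and method names below are hypothetical and are not the tagger's own code. The shape mapping is a simplification of the "general idea" described for wordshapes, and capping affix lengths at the word length is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helpers, not the MaxentTagger implementation.
class RareFeatureSketch {

  // Simplified word shape: uppercase -> 'X', lowercase -> 'x', digit -> '#';
  // any other character is kept unchanged.
  static String shape(String word) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < word.length(); i++) {
      char c = word.charAt(i);
      if (Character.isUpperCase(c)) {
        sb.append('X');
      } else if (Character.isLowerCase(c)) {
        sb.append('x');
      } else if (Character.isDigit(c)) {
        sb.append('#');
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  // Relative positions covered by an option such as wordshapes(left, right).
  // The list is empty when left is greater than right.
  static List<Integer> positions(int left, int right) {
    List<Integer> out = new ArrayList<>();
    for (int p = left; p <= right; p++) {
      out.add(p);
    }
    return out;
  }

  // One-argument form, e.g. wordshapes(-2): right is taken as 0.
  static List<Integer> positions(int left) {
    return positions(left, 0);
  }

  // Suffixes of length 1 ... length (assumption: capped at the word length).
  static List<String> suffixes(String word, int length) {
    List<String> out = new ArrayList<>();
    int max = Math.min(length, word.length());
    for (int n = 1; n <= max; n++) {
      out.add(word.substring(word.length() - n));
    }
    return out;
  }

  // Prefixes of length 1 ... length (assumption: capped at the word length).
  static List<String> prefixes(String word, int length) {
    List<String> out = new ArrayList<>();
    int max = Math.min(length, word.length());
    for (int n = 1; n <= max; n++) {
      out.add(word.substring(0, n));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(shape("Foo5"));          // Xxx#
    System.out.println(positions(-2));          // [-2, -1, 0]
    System.out.println(positions(1, 0));        // []
    System.out.println(suffixes("walked", 3));  // [d, ed, ked]
    System.out.println(prefixes("walked", 3));  // [w, wa, wal]
  }
}
```

Each element returned by suffixes or prefixes would correspond to one feature, and each entry of positions to one word whose shape is extracted.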
Also available are the macros "naacl2003unknowns",
"lnaacl2003unknowns", and "naacl2003conjunctions".
naacl2003unknowns and lnaacl2003unknowns include suffix extractors
and extractors for specific word shape features, such as containing
or not containing a digit.
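As a rough illustration of how a macro and individual options might be combined, the sketch below builds a comma-separated option list. It assumes the tagger reads this list from a property named "arch", which this section does not state. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Properties;

// Illustrative only: a hypothetical helper for assembling an option list.
class ArchStringSketch {

  // Joins macro names and options into one comma-separated string.
  static String joinArch(List<String> options) {
    return String.join(",", options);
  }

  // Assumption: the option list is passed under the property key "arch".
  static Properties taggerProps(List<String> options) {
    Properties props = new Properties();
    props.setProperty("arch", joinArch(options));
    return props;
  }

  public static void main(String[] args) {
    Properties props = taggerProps(List.of("naacl2003unknowns", "unicodeshapes(-1,1)"));
    System.out.println(props.getProperty("arch"));
  }
}
```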
The macro "frenchunknowns" expands to five extractors specific
to French, which test whether the end of the word matches
common suffixes for various POS classes and plural words. Adding
these extractors did not improve accuracy over the regular
naacl2003unknowns extractor macro, though.