The classify package provides facilities for training classifiers. In this package, data points are viewed as single instances, not sequences. The most commonly used classifier is the log-linear classifier with binary features. More classifiers, such as SVM and Naive Bayes, are also available in this package.
The Classifier contract only guarantees routines for getting a classification for an example, and the scores assigned to each class for that example. Note that training is dependent upon the individual classifier.
Classifiers operate over Datum objects. A Datum is a list of descriptive features and a class label; features and labels can be any object, but usually Strings are used. Datum objects are grouped using Dataset objects. Some classifiers use Dataset objects as a way of grouping inputs.
The following examples outline how to create, train, and use each of the different classifier types.
To build a classifier, one first creates a GeneralDataset, which is a list of Datum objects. A Datum is a list of descriptive features, along with a label; features and labels can be any object, though we usually use strings.
    GeneralDataset dataSet = new Dataset();
    while (/* more datums to make */) {
        // ... make featureList: e.g., ["PrevWord=at", "CurrentTag=NNP", "isUpperCase"]
        // ... make label: e.g., "PLACE"
        Datum d = new BasicDatum(featureList, label);
        dataSet.add(d);
    }
There are some useful methods in GeneralDataset, such as:
    dataSet.applyFeatureCountThreshold(int cutoff);
    dataSet.summaryStatistics(); // dumps the number of features and datums
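To make the effect of a feature count threshold concrete, here is a minimal sketch of the idea using only java.util collections. It is not the package's implementation: the class FeatureThresholdSketch and its applyThreshold method are hypothetical names, and the sketch assumes that the cutoff means the minimum number of occurrences a feature needs in order to be kept.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of the classify package.
class FeatureThresholdSketch {

    /** Returns copies of the feature lists, keeping only features that occur at least cutoff times overall. */
    static List<List<String>> applyThreshold(List<List<String>> data, int cutoff) {
        // First pass: count how often each feature occurs across all datums.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> features : data) {
            for (String f : features) {
                counts.merge(f, 1, Integer::sum);
            }
        }
        // Second pass: drop the features that fall below the cutoff.
        List<List<String>> pruned = new ArrayList<>();
        for (List<String> features : data) {
            List<String> kept = new ArrayList<>();
            for (String f : features) {
                if (counts.get(f) >= cutoff) {
                    kept.add(f);
                }
            }
            pruned.add(kept);
        }
        return pruned;
    }

    public static void main(String[] args) {
        List<List<String>> data = List.of(
            List.of("PrevWord=at", "CurrentTag=NNP", "isUpperCase"),
            List.of("PrevWord=in", "CurrentTag=NNP"),
            List.of("PrevWord=at", "isUpperCase"));
        // "PrevWord=in" occurs only once, so a cutoff of 2 removes it.
        // prints [[PrevWord=at, CurrentTag=NNP, isUpperCase], [CurrentTag=NNP], [PrevWord=at, isUpperCase]]
        System.out.println(applyThreshold(data, 2));
    }
}
```

Pruning rare features in this way shrinks the feature space before training, which is the usual reason to apply such a threshold.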
Next, one makes a LinearClassifierFactory and calls its trainClassifier(GeneralDataset dataSet) method:
    LinearClassifierFactory lcFactory = new LinearClassifierFactory();
    LinearClassifier c = lcFactory.trainClassifier(dataSet);
LinearClassifierFactory has options for different optimizers (default: QNMinimizer), the convergence threshold for minimization, etc. Check the class description for detailed information.
A classifier, once built, can be used to classify new Datum instances:
    Object label = c.classOf(mysteryDatum);
If you want scores instead, you can ask:
    Counter scores = c.scoresOf(mysteryDatum);
The scores which are returned by the log-linear classifiers are the feature-weight dot products, not the normalized probabilities.
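For a log-linear model, unnormalized scores of this kind are turned into probabilities by exponentiating and normalizing (a softmax). The sketch below shows that relationship with plain java.util maps rather than the package's Counter; the class ScoreNormalizationSketch and its methods are hypothetical names used only for illustration, and the score values are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of the classify package.
class ScoreNormalizationSketch {

    /** Converts unnormalized scores into probabilities with a numerically stable softmax. */
    static Map<String, Double> toProbabilities(Map<String, Double> scores) {
        // Subtracting the maximum score avoids overflow in Math.exp.
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores.values()) {
            max = Math.max(max, s);
        }
        double total = 0.0;
        for (double s : scores.values()) {
            total += Math.exp(s - max);
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            probs.put(e.getKey(), Math.exp(e.getValue() - max) / total);
        }
        return probs;
    }

    /** Returns the label with the highest score, or null if there are no scores. */
    static String argmax(Map<String, Double> scores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (best == null || e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("PLACE", 2.0);     // made-up dot-product scores
        scores.put("PERSON", 0.5);
        scores.put("OTHER", -1.0);
        System.out.println("best label: " + argmax(scores));
        System.out.println("probabilities: " + toProbabilities(scores));
    }
}
```

Normalizing does not change the ranking of the labels, so the highest-scoring label is also the most probable one.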
There are some other useful methods, such as justificationOf(Datum d) and logProbabilityOf(Datum d), as well as various methods for visualizing the weights and the most highly weighted features.
This concludes the discussion of log-linear classifiers with binary features.
We can also train log-linear classifiers with real-valued features. In this case, RVFDatum should be used.
Real Valued Classifiers (RVF) operate over RVFDatum objects. An RVFDatum is composed of a set of feature and real-value pairs. RVFDatums are grouped using an RVFDataset.
An RVFDatum is assembled by filling a Counter with feature values and assigning an Object label to it:
    Counter features = new Counter();
    features.incrementCount("FEATURE_A", 1.2);
    features.incrementCount("FEATURE_B", 2.3);
    features.incrementCount("FEATURE_C", 0.5);
    RVFDatum rvfDatum = new RVFDatum(features, "DATUM_LABEL");
RVFDataset objects are representations of RVFDatum objects that efficiently store the data with which to train the classifier. This type of dataset only accepts RVFDatum objects via its add method (other Datum objects that are not instances of RVFDatum will be ignored), and is equivalent to a Dataset if all RVFDatum objects have only features with value 1.0. Since it is a subclass of GeneralDataset, the methods shown above as applied to the GeneralDataset can also be applied to the RVFDataset.
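The equivalence with the binary case can be seen from how a linear score is formed: each feature's weight is multiplied by its value, so a value of 1.0 contributes exactly the weight itself. The following sketch illustrates this with plain java.util maps; the class RvfScoreSketch, its score method, and the weights are hypothetical and are not taken from the package.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of the classify package.
class RvfScoreSketch {

    /** Dot product of one class's feature weights with a datum's feature values. */
    static double score(Map<String, Double> weights, Map<String, Double> featureValues) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : featureValues.entrySet()) {
            // Features without a learned weight contribute nothing.
            sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up weights for a single class.
        Map<String, Double> weights = new LinkedHashMap<>();
        weights.put("FEATURE_A", 0.5);
        weights.put("FEATURE_B", -1.0);
        weights.put("FEATURE_C", 2.0);

        // Real-valued datum, using the feature values from the RVFDatum example.
        Map<String, Double> realValued = new LinkedHashMap<>();
        realValued.put("FEATURE_A", 1.2);
        realValued.put("FEATURE_B", 2.3);
        realValued.put("FEATURE_C", 0.5);

        // The same features with value 1.0 behave like binary features:
        // the score is simply the sum of the weights of the active features.
        Map<String, Double> binaryLike = new LinkedHashMap<>();
        for (String f : realValued.keySet()) {
            binaryLike.put(f, 1.0);
        }

        System.out.println("real-valued score: " + score(weights, realValued));
        System.out.println("binary-like score: " + score(weights, binaryLike));
    }
}
```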
(TODO) An example for LinearType2Classifier.
(TODO) Saving Classifier out to file (from LearningExperiment)
    private static void saveClassifierToFile(LinearClassifier classifier, String serializePath) {
        System.err.print("Serializing classifier to " + serializePath + "...");
        try {
            ObjectOutputStream oos;
            if (serializePath.endsWith(".gz")) {
                oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(serializePath))));
            } else {
                oos = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(serializePath)));
            }
            oos.writeObject(classifier);
            oos.close();
            System.err.println("done.");
        } catch (Exception e) {
            e.printStackTrace();
            throw new RuntimeException("Serialization failed: " + e.getMessage());
        }
    }

@author Dan Klein