edu.stanford.nlp.classify
Class Dataset

java.lang.Object
  extended by edu.stanford.nlp.classify.GeneralDataset
      extended by edu.stanford.nlp.classify.Dataset
Direct Known Subclasses:
WeightedDataset

public class Dataset
extends GeneralDataset

An interface class for ClassifierFactory that incrementally builds a more memory-efficient representation of a List of Datum objects, for the purpose of training a Classifier with a ClassifierFactory.
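
The memory saving described above typically comes from replacing feature and label objects with small integer ids stored in arrays. The following is a minimal standalone sketch of that idea, not the Stanford implementation: the class CompactDatasetSketch, its HashMap-based indices, and the array-doubling growth policy are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of an indexed, array-backed dataset. */
final class CompactDatasetSketch {
    // Each distinct feature or label object is mapped to a small integer id.
    private final Map<Object, Integer> featureIds = new HashMap<>();
    private final Map<Object, Integer> labelIds = new HashMap<>();
    private int[][] data = new int[4][]; // one row of feature ids per datum
    private int[] labels = new int[4];   // one label id per datum
    private int size = 0;

    private static int idFor(Map<Object, Integer> index, Object key) {
        Integer id = index.get(key);
        if (id == null) {
            id = index.size();
            index.put(key, id);
        }
        return id;
    }

    /** Adds one datum, doubling the backing arrays when they are full. */
    void add(Collection<?> features, Object label) {
        if (size == labels.length) {
            data = Arrays.copyOf(data, size * 2);
            labels = Arrays.copyOf(labels, size * 2);
        }
        int[] row = new int[features.size()];
        int i = 0;
        for (Object f : features) {
            row[i++] = idFor(featureIds, f);
        }
        data[size] = row;
        labels[size] = idFor(labelIds, label);
        size++;
    }

    int size() { return size; }
    int numFeatures() { return featureIds.size(); }
    int numClasses() { return labelIds.size(); }
    int[] row(int i) { return data[i].clone(); }
    int label(int i) { return labels[i]; }
}
```

Each datum costs one int per feature occurrence plus one int for its label, regardless of how large the original feature objects are.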

Author:
Roger Levy (rog@stanford.edu), Anna Rafferty (various refactoring with GeneralDataset/RVFDataset)

Field Summary
 
Fields inherited from class edu.stanford.nlp.classify.GeneralDataset
data, featureIndex, labelIndex, labels, size
 
Constructor Summary
Dataset()
           
Dataset(Index labelIndex, int[] labels, Index featureIndex, int[][] data)
          Constructor that fully specifies a Dataset.
Dataset(Index labelIndex, int[] labels, Index featureIndex, int[][] data, int size)
          Constructor that fully specifies a Dataset.
Dataset(int numDatums)
           
Dataset(int numDatums, Index featureIndex, Index labelIndex)
           
 
Method Summary
 void add(Collection features, Object label)
           
 void add(Datum d)
           
protected  void addFeatures(Collection features)
           
protected  void addLabel(Object label)
           
 void applyFeatureCountThreshold(List<Pair<Pattern,Integer>> thresholds)
          Applies feature count thresholds to the Dataset.
 void changeFeatureIndex(Index newFeatureIndex)
           
 void changeLabelIndex(Index newLabelIndex)
           
protected  void ensureSize()
           
 Datum getDatum(int index)
           
 Counter getFeatureCounter()
          Gets the number of datums in which a given feature appears.
 double[] getInformationGains()
           
 Dataset getRandomSubDataset(double p, int seed)
           
 RVFDatum getRVFDatum(int index)
           
 double[][] getValuesArray()
           
protected  void initialize(int numDatums)
          Resets the dataset so that it is empty with an initial capacity of numDatums. Should be called only by appropriate methods within the class, such as clear(), which take care of the other parts of emptying the data.
static void main(String[] args)
           
 void printFullFeatureMatrix(PrintWriter pw)
          Prints the full feature matrix in tab-delimited form.
 void printSparseFeatureMatrix()
          Prints the sparse feature matrix to System.out using printSparseFeatureMatrix(PrintWriter).
 void printSparseFeatureMatrix(PrintWriter pw)
          Prints a sparse feature matrix representation of the Dataset.
static void printSVMLightFormat(PrintWriter pw, Counter<Integer> c, int classNo)
          Sorts the counter by feature key and writes it out in SVM light format.
static Dataset readSVMLightFormat(String filename)
          Constructs a Dataset by reading in a file in SVM light format.
static Dataset readSVMLightFormat(String filename, Index featureIndex, Index labelIndex)
          Constructs a Dataset by reading in a file in SVM light format.
static Dataset readSVMLightFormat(String filename, Index featureIndex, Index labelIndex, List<String> lines)
          Constructs a Dataset by reading in a file in SVM light format.
static Dataset readSVMLightFormat(String filename, List<String> lines)
          Constructs a Dataset by reading in a file in SVM light format.
 void selectFeaturesBinaryInformationGain(int numFeatures)
           
 Pair<GeneralDataset,GeneralDataset> split(double percentDev)
           
 Pair<GeneralDataset,GeneralDataset> split(int start, int end)
           
 void summaryStatistics()
          Prints some summary statistics to stderr for the Dataset.
static Datum svmLightLineToDatum(String l)
           
 String toString()
           
 String toSummaryStatistics()
           
 String toSummaryString()
           
 
Methods inherited from class edu.stanford.nlp.classify.GeneralDataset
addAll, applyFeatureCountThreshold, clear, clear, featureIndex, getDataArray, getFeatureCounts, getLabelsArray, labelIndex, labelIterator, numClasses, numFeatures, numFeatureTokens, numFeatureTypes, printSVMLightFormat, printSVMLightFormat, size, trimData, trimLabels, trimToSize, trimToSize, trimToSize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

Dataset

public Dataset()

Dataset

public Dataset(int numDatums,
               Index featureIndex,
               Index labelIndex)

Dataset

public Dataset(int numDatums)

Dataset

public Dataset(Index labelIndex,
               int[] labels,
               Index featureIndex,
               int[][] data)
Constructor that fully specifies a Dataset. Needed for MulticlassDataset.


Dataset

public Dataset(Index labelIndex,
               int[] labels,
               Index featureIndex,
               int[][] data,
               int size)
Constructor that fully specifies a Dataset. Needed for MulticlassDataset.

Method Detail

split

public Pair<GeneralDataset,GeneralDataset> split(double percentDev)
Specified by:
split in class GeneralDataset

split

public Pair<GeneralDataset,GeneralDataset> split(int start,
                                                 int end)
Specified by:
split in class GeneralDataset

getRandomSubDataset

public Dataset getRandomSubDataset(double p,
                                   int seed)

getValuesArray

public double[][] getValuesArray()
Specified by:
getValuesArray in class GeneralDataset

readSVMLightFormat

public static Dataset readSVMLightFormat(String filename)
Constructs a Dataset by reading in a file in SVM light format.


readSVMLightFormat

public static Dataset readSVMLightFormat(String filename,
                                         List<String> lines)
Constructs a Dataset by reading in a file in SVM light format. The lines parameter is filled with the lines of the file for further processing. If lines is null, it is assumed that no line information is desired.
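
The optional lines argument follows a common "collect if non-null" convention. A minimal sketch of that convention is given here. The helper countRecords is hypothetical: it takes a Reader rather than a filename so that it can be tried without a file on disk, and skipping blank lines is an assumption rather than documented behavior.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical illustration of an optional "lines" out-parameter. */
final class OptionalLinesSketch {
    /**
     * Counts the non-blank lines in source. Each counted line is also
     * appended to lines, unless lines is null.
     */
    static int countRecords(Reader source, List<String> lines) {
        // BufferedReader.lines() reports I/O failures as UncheckedIOException.
        List<String> all = new BufferedReader(source).lines()
                .collect(Collectors.toList());
        int count = 0;
        for (String line : all) {
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines carry no datum
            }
            if (lines != null) {
                lines.add(line);
            }
            count++;
        }
        return count;
    }
}
```

Passing an empty ArrayList collects the raw lines alongside the parse; passing null skips the bookkeeping entirely.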


readSVMLightFormat

public static Dataset readSVMLightFormat(String filename,
                                         Index featureIndex,
                                         Index labelIndex)
Constructs a Dataset by reading in a file in SVM light format. The created dataset uses the given feature and label indices.


readSVMLightFormat

public static Dataset readSVMLightFormat(String filename,
                                         Index featureIndex,
                                         Index labelIndex,
                                         List<String> lines)
Constructs a Dataset by reading in a file in SVM light format. The created dataset uses the given feature and label indices.


svmLightLineToDatum

public static Datum svmLightLineToDatum(String l)
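
An SVM light line has the form "target feature:value feature:value ... # comment". The sketch below parses one such line into a label and a feature-to-value map. It is a standalone illustration, not this method's implementation: the class SvmLightLineSketch is hypothetical, and how Dataset maps feature values onto the returned Datum is not stated on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of parsing one SVM light line. */
final class SvmLightLineSketch {
    final String label;
    final Map<String, Double> features = new LinkedHashMap<>();

    private SvmLightLineSketch(String label) {
        this.label = label;
    }

    static SvmLightLineSketch parse(String line) {
        // Drop a trailing "# comment", if present.
        int hash = line.indexOf('#');
        String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
        String[] tokens = body.split("\\s+");
        // The first token is the target (class label).
        SvmLightLineSketch parsed = new SvmLightLineSketch(tokens[0]);
        // Every remaining token is "feature:value".
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Bad token: " + tokens[i]);
            }
            String feature = tokens[i].substring(0, colon);
            double value = Double.parseDouble(tokens[i].substring(colon + 1));
            parsed.features.put(feature, value);
        }
        return parsed;
    }
}
```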

getFeatureCounter

public Counter getFeatureCounter()
Gets the number of datums in which a given feature appears.
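
Counting datums rather than occurrences means a feature repeated within one datum is counted once for that datum. A minimal sketch of such a count over rows of integer feature ids follows. The class FeatureDatumCountSketch and the use of a plain Map in place of Counter are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of per-feature datum counts. */
final class FeatureDatumCountSketch {
    /**
     * data[i] holds the feature ids of datum i. Returns a map from
     * feature id to the number of datums that contain it.
     */
    static Map<Integer, Integer> datumCounts(int[][] data) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] row : data) {
            // Track ids already seen in this datum so repeats count once.
            Set<Integer> seen = new HashSet<>();
            for (int f : row) {
                if (seen.add(f)) {
                    counts.merge(f, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```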


add

public void add(Datum d)
Specified by:
add in class GeneralDataset

add

public void add(Collection features,
                Object label)

ensureSize

protected void ensureSize()

addLabel

protected void addLabel(Object label)

addFeatures

protected void addFeatures(Collection features)

initialize

protected void initialize(int numDatums)
Description copied from class: GeneralDataset
Resets the dataset so that it is empty with an initial capacity of numDatums. Should be called only by appropriate methods within the class, such as clear(), which take care of the other parts of emptying the data.

Specified by:
initialize in class GeneralDataset
Parameters:
numDatums - initial capacity of dataset

getDatum

public Datum getDatum(int index)
Parameters:
index - the position of the datum in the dataset
Returns:
the datum at the given index

getRVFDatum

public RVFDatum getRVFDatum(int index)
Specified by:
getRVFDatum in class GeneralDataset
Parameters:
index - the position of the datum in the dataset
Returns:
the datum at the given index

summaryStatistics

public void summaryStatistics()
Prints some summary statistics to stderr for the Dataset.

Specified by:
summaryStatistics in class GeneralDataset

toSummaryStatistics

public String toSummaryStatistics()

applyFeatureCountThreshold

public void applyFeatureCountThreshold(List<Pair<Pattern,Integer>> thresholds)
Applies feature count thresholds to the Dataset. Only features that match pattern_i and occur at least threshold_i times (for some i) are kept.

Parameters:
thresholds - a list of pattern, threshold pairs
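
The keep rule above (a feature survives if, for some i, it matches pattern_i and occurs at least threshold_i times) can be sketched as follows. This is a standalone illustration: the Rule class stands in for Pair<Pattern,Integer>, the feature counts are passed in as a plain Map, and the use of a full-string Matcher.matches() rather than a substring find() is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration of pattern-specific count thresholds. */
final class PatternThresholdSketch {
    /** Stand-in for one (pattern, threshold) pair. */
    static final class Rule {
        final Pattern pattern;
        final int threshold;

        Rule(String regex, int threshold) {
            this.pattern = Pattern.compile(regex);
            this.threshold = threshold;
        }
    }

    /** Returns the features kept by at least one rule, in map iteration order. */
    static List<String> keep(Map<String, Integer> counts, List<Rule> rules) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            for (Rule r : rules) {
                boolean matches = r.pattern.matcher(e.getKey()).matches();
                if (matches && e.getValue() >= r.threshold) {
                    kept.add(e.getKey());
                    break; // one satisfied rule is enough
                }
            }
        }
        return kept;
    }
}
```

Under this reading, a feature that matches none of the patterns is dropped no matter how often it occurs.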

printFullFeatureMatrix

public void printFullFeatureMatrix(PrintWriter pw)
Prints the full feature matrix in tab-delimited form. These can be BIG matrices, so be careful!
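
The size warning follows from the arithmetic: a full matrix has one cell per (datum, feature) pair, whereas the sparse form stores only the features that are present. A minimal sketch of expanding sparse rows into tab-delimited 0/1 lines is shown here. The class DenseMatrixSketch and the exact cell layout are assumptions, not the format this method emits.

```java
/** Hypothetical illustration of expanding sparse rows to a full matrix. */
final class DenseMatrixSketch {
    /**
     * data[i] holds the feature ids present in datum i. Returns one
     * tab-delimited line per datum with numFeatures 0/1 cells.
     */
    static String toTabDelimited(int[][] data, int numFeatures) {
        StringBuilder sb = new StringBuilder();
        for (int[] row : data) {
            // Mark which feature columns are switched on for this datum.
            boolean[] on = new boolean[numFeatures];
            for (int f : row) {
                on[f] = true;
            }
            for (int j = 0; j < numFeatures; j++) {
                if (j > 0) {
                    sb.append('\t');
                }
                sb.append(on[j] ? '1' : '0');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

With 100,000 datums and 1,000,000 features, this layout would write 10^11 cells, even if each datum has only a handful of active features.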


printSparseFeatureMatrix

public void printSparseFeatureMatrix()
Prints the sparse feature matrix to System.out using printSparseFeatureMatrix(PrintWriter).


printSparseFeatureMatrix

public void printSparseFeatureMatrix(PrintWriter pw)
Prints a sparse feature matrix representation of the Dataset. Prints the actual Object.toString() representations of features.


main

public static void main(String[] args)

changeLabelIndex

public void changeLabelIndex(Index newLabelIndex)

changeFeatureIndex

public void changeFeatureIndex(Index newFeatureIndex)

selectFeaturesBinaryInformationGain

public void selectFeaturesBinaryInformationGain(int numFeatures)

getInformationGains

public double[] getInformationGains()
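
This page gives no description for getInformationGains(). Assuming the name refers to the standard information gain of a binary (present or absent) feature with respect to the class label, IG = H(Y) - [P(f) H(Y | f) + P(not f) H(Y | not f)], the computation for a single feature can be sketched as below. The class BinaryInfoGainSketch, its boolean-array input, and the choice of base-2 logarithms are assumptions for illustration.

```java
/** Hypothetical illustration of information gain for one binary feature. */
final class BinaryInfoGainSketch {
    /** Entropy in bits of a class-count histogram with the given total. */
    private static double entropy(int[] counts, int total) {
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** present[i]: does the feature occur in datum i; labels[i]: its class id. */
    static double infoGain(boolean[] present, int[] labels, int numClasses) {
        int n = labels.length;
        if (n == 0) {
            return 0.0;
        }
        int[] all = new int[numClasses];
        int[] with = new int[numClasses];
        int[] without = new int[numClasses];
        int nWith = 0;
        for (int i = 0; i < n; i++) {
            all[labels[i]]++;
            if (present[i]) {
                with[labels[i]]++;
                nWith++;
            } else {
                without[labels[i]]++;
            }
        }
        int nWithout = n - nWith;
        // Expected label entropy after observing whether the feature is present.
        double conditional = ((double) nWith / n) * entropy(with, nWith)
                + ((double) nWithout / n) * entropy(without, nWithout);
        return entropy(all, n) - conditional;
    }
}
```

A feature that perfectly separates two balanced classes scores 1 bit, and a feature distributed identically across both classes scores 0.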

toString

public String toString()
Overrides:
toString in class Object

toSummaryString

public String toSummaryString()

printSVMLightFormat

public static void printSVMLightFormat(PrintWriter pw,
                                       Counter<Integer> c,
                                       int classNo)
Sorts the counter by feature key and writes it out in SVM light format.

Parameters:
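pw - the writer to print to
c - the counter to print, keyed by integer feature
classNo - the class number for the printed line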


Stanford NLP Group