edu.stanford.nlp.classify
Class GeneralDataset

java.lang.Object
  extended by edu.stanford.nlp.classify.GeneralDataset
Direct Known Subclasses:
Dataset, RVFDataset

public abstract class GeneralDataset
extends Object

The purpose of this abstract class is to unify Dataset and RVFDataset.

Author:
Kristina Toutanova (kristina@cs.stanford.edu), Anna Rafferty (various refactoring with subclasses)

Field Summary
protected  int[][] data
           
 Index<Object> featureIndex
           
 Index<Object> labelIndex
           
protected  int[] labels
           
protected  int size
           
 
Constructor Summary
GeneralDataset()
           
 
Method Summary
abstract  void add(Datum d)
           
 void addAll(Collection<Datum> data)
          Adds all Datums in the given collection of data to this dataset.
 void applyFeatureCountThreshold(int k)
          Applies a feature count threshold to the Dataset.
 void clear()
          Resets the Dataset so that it is empty and ready to collect data.
 void clear(int numDatums)
          Resets the Dataset so that it is empty and ready to collect data.
 Index featureIndex()
           
 int[][] getDataArray()
           
protected  float[] getFeatureCounts()
          Gets the total count (over all data instances) of each feature.
 int[] getLabelsArray()
           
abstract  RVFDatum getRVFDatum(int index)
           
abstract  double[][] getValuesArray()
           
protected abstract  void initialize(int numDatums)
          Resets the values of the dataset so that it is empty with an initial capacity of numDatums. Should be called only by appropriate methods within the class, such as clear(), which take care of the other parts of emptying the data.
 Index labelIndex()
           
 Iterator labelIterator()
          Returns an iterator over the class labels of the Dataset.
 int numClasses()
           
 int numFeatures()
           
 int numFeatureTokens()
          Returns the number of feature tokens in the Dataset.
 int numFeatureTypes()
          Returns the number of distinct feature types in the Dataset.
 void printSVMLightFormat()
          Dumps the Dataset as a training/test file for SVMLight.
 void printSVMLightFormat(PrintWriter pw)
          Prints the Dataset in SVMLight format.
 int size()
          Returns the number of examples (Datums) in the Dataset.
abstract  Pair<GeneralDataset,GeneralDataset> split(double p)
           
abstract  Pair<GeneralDataset,GeneralDataset> split(int start, int end)
           
abstract  void summaryStatistics()
          Prints some statistics summarizing the dataset.
protected  void trimData()
           
protected  void trimLabels()
           
protected  double[][] trimToSize(double[][] i)
           
protected  int[] trimToSize(int[] i)
           
protected  int[][] trimToSize(int[][] i)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

labelIndex

public Index<Object> labelIndex

featureIndex

public Index<Object> featureIndex

labels

protected int[] labels

data

protected int[][] data

size

protected int size
Constructor Detail

GeneralDataset

public GeneralDataset()
Method Detail

labelIndex

public Index labelIndex()

featureIndex

public Index featureIndex()

numFeatures

public int numFeatures()

numClasses

public int numClasses()

getLabelsArray

public int[] getLabelsArray()

getDataArray

public int[][] getDataArray()

getValuesArray

public abstract double[][] getValuesArray()

clear

public void clear()
Resets the Dataset so that it is empty and ready to collect data.


clear

public void clear(int numDatums)
Resets the Dataset so that it is empty and ready to collect data.

Parameters:
numDatums - initial capacity of dataset

initialize

protected abstract void initialize(int numDatums)
Resets the values of the dataset so that it is empty with an initial capacity of numDatums. Should be called only by appropriate methods within the class, such as clear(), which take care of the other parts of emptying the data.

Parameters:
numDatums - initial capacity of dataset
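
The relationship between clear() and initialize(int) described above can be sketched as follows. This is a minimal illustration, not the library's code: the classes ClearInitializeSketch, MiniDataset and MiniIntDataset, the default capacity of 10, and the array-doubling growth policy are all assumptions made for the example.

```java
import java.util.Arrays;

final class ClearInitializeSketch {
  public static void main(String[] args) {
    MiniIntDataset ds = new MiniIntDataset();
    ds.add(0, new int[] {1, 2});
    ds.add(1, new int[] {3});
    System.out.println("size before clear: " + ds.size());
    ds.clear(5);
    System.out.println("size after clear: " + ds.size());
  }
}

/** Hypothetical stand-in for an abstract dataset base class. */
abstract class MiniDataset {
  protected int[] labels;
  protected int[][] data;
  protected int size;

  /** Empties the dataset with a default capacity (10 is an arbitrary choice). */
  public void clear() {
    clear(10);
  }

  /** Empties the dataset and reserves room for numDatums examples. */
  public void clear(int numDatums) {
    // Bookkeeping shared by all subclasses would be reset here,
    // before delegating the storage reset to the subclass.
    initialize(numDatums);
  }

  /** Subclasses reset their own storage; callers go through clear(). */
  protected abstract void initialize(int numDatums);

  public int size() {
    return size;
  }
}

/** Hypothetical concrete subclass storing integer feature indices. */
final class MiniIntDataset extends MiniDataset {
  MiniIntDataset() {
    clear();
  }

  @Override
  protected void initialize(int numDatums) {
    labels = new int[numDatums];
    data = new int[numDatums][];
    size = 0;
  }

  void add(int label, int[] features) {
    if (size == labels.length) {
      // Grow both arrays together so they stay parallel.
      int newCapacity = Math.max(1, size * 2);
      labels = Arrays.copyOf(labels, newCapacity);
      data = Arrays.copyOf(data, newCapacity);
    }
    labels[size] = label;
    data[size] = features;
    size++;
  }
}
```

Keeping initialize(int) protected means outside callers can only empty the dataset through clear(), so the shared reset steps cannot be skipped.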

getRVFDatum

public abstract RVFDatum getRVFDatum(int index)

add

public abstract void add(Datum d)

getFeatureCounts

protected float[] getFeatureCounts()
Gets the total count (over all data instances) of each feature.

Returns:
an array containing the counts, indexed by feature index

applyFeatureCountThreshold

public void applyFeatureCountThreshold(int k)
Applies a feature count threshold to the Dataset. All features that occur fewer than k times are expunged.
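
A rough sketch of the thresholding idea on a plain int[][] array of feature indices, written without the library. The class FeatureThresholdSketch and its methods are hypothetical, and renumbering the surviving features to a dense range is an assumption of this example rather than documented behavior.

```java
import java.util.Arrays;

final class FeatureThresholdSketch {

  /** Counts how often each feature index occurs across all rows. */
  static int[] countFeatures(int[][] data, int numFeatures) {
    int[] counts = new int[numFeatures];
    for (int[] row : data) {
      for (int f : row) {
        counts[f]++;
      }
    }
    return counts;
  }

  /**
   * Returns a copy of data in which every feature occurring fewer than k
   * times is removed. Kept features are renumbered 0, 1, 2, ... in their
   * original order (an assumption made for this sketch).
   */
  static int[][] applyThreshold(int[][] data, int numFeatures, int k) {
    int[] counts = countFeatures(data, numFeatures);
    int[] newIndex = new int[numFeatures];
    int next = 0;
    for (int f = 0; f < numFeatures; f++) {
      newIndex[f] = counts[f] >= k ? next++ : -1; // -1 marks a dropped feature
    }
    int[][] result = new int[data.length][];
    for (int i = 0; i < data.length; i++) {
      result[i] = Arrays.stream(data[i])
          .filter(f -> newIndex[f] >= 0)
          .map(f -> newIndex[f])
          .toArray();
    }
    return result;
  }

  public static void main(String[] args) {
    int[][] data = {{0, 1}, {0, 2}, {0, 1}};
    System.out.println(Arrays.deepToString(applyThreshold(data, 3, 2)));
  }
}
```

In the sample data, feature 2 occurs only once, so a threshold of k = 2 removes it and leaves features 0 and 1 in place.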


numFeatureTokens

public int numFeatureTokens()
Returns the number of feature tokens in the Dataset.


numFeatureTypes

public int numFeatureTypes()
Returns the number of distinct feature types in the Dataset.
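
The difference between feature tokens and feature types can be illustrated on a plain int[][] array of feature indices. This sketch assumes that a token is one occurrence of a feature in one datum and that a type is a distinct feature index; the class FeatureTokenTypeSketch is hypothetical and does not reproduce how the library computes these values.

```java
import java.util.HashSet;
import java.util.Set;

final class FeatureTokenTypeSketch {

  /** Total number of feature occurrences over all rows. */
  static int numFeatureTokens(int[][] data) {
    int total = 0;
    for (int[] row : data) {
      total += row.length;
    }
    return total;
  }

  /** Number of distinct feature indices over all rows. */
  static int numFeatureTypes(int[][] data) {
    Set<Integer> seen = new HashSet<>();
    for (int[] row : data) {
      for (int f : row) {
        seen.add(f);
      }
    }
    return seen.size();
  }

  public static void main(String[] args) {
    int[][] data = {{0, 1}, {0, 2}, {0}};
    System.out.println("tokens: " + numFeatureTokens(data));
    System.out.println("types: " + numFeatureTypes(data));
  }
}
```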


addAll

public void addAll(Collection<Datum> data)
Adds all Datums in the given collection of data to this dataset.

Parameters:
data - collection of datums you would like to add to the dataset

split

public abstract Pair<GeneralDataset,GeneralDataset> split(int start,
                                                          int end)

split

public abstract Pair<GeneralDataset,GeneralDataset> split(double p)

size

public int size()
Returns the number of examples (Datums) in the Dataset.


trimData

protected void trimData()

trimLabels

protected void trimLabels()

trimToSize

protected int[] trimToSize(int[] i)

trimToSize

protected int[][] trimToSize(int[][] i)

trimToSize

protected double[][] trimToSize(double[][] i)

summaryStatistics

public abstract void summaryStatistics()
Prints some statistics summarizing the dataset.


labelIterator

public Iterator labelIterator()
Returns an iterator over the class labels of the Dataset.


printSVMLightFormat

public void printSVMLightFormat()
Dumps the Dataset as a training/test file for SVMLight. Each line has the form:
class [fno:val]+
The features must occur in consecutive order.


printSVMLightFormat

public void printSVMLightFormat(PrintWriter pw)
Prints the Dataset in SVMLight format. If the Dataset has more than 2 classes, it prints using the label index (+1) (for svm_struct). If it has exactly 2 classes, labelIndex.get(0) is mapped to +1 and labelIndex.get(1) is mapped to -1 (for svm_light).
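
The label mapping and line layout described above can be sketched without the library. The class SvmLightSketch and its methods are hypothetical. Two further points are assumptions of this example: feature numbers are written 1-based (SVMLight's usual convention), and "consecutive order" is read as increasing feature number.

```java
import java.util.Map;
import java.util.TreeMap;

final class SvmLightSketch {

  /** Maps a label index to its printed form, following the rule above. */
  static String svmLabel(int labelIdx, int numClasses) {
    if (numClasses == 2) {
      return labelIdx == 0 ? "+1" : "-1"; // svm_light binary labels
    }
    return Integer.toString(labelIdx + 1); // svm_struct: label index + 1
  }

  /** Formats one example as "class fno:val fno:val ...". */
  static String formatLine(int labelIdx, int numClasses, int[] features, double[] values) {
    // TreeMap keeps feature numbers in increasing order; repeated
    // features have their values summed.
    TreeMap<Integer, Double> sorted = new TreeMap<>();
    for (int j = 0; j < features.length; j++) {
      sorted.merge(features[j] + 1, values[j], Double::sum); // 1-based (assumption)
    }
    StringBuilder sb = new StringBuilder(svmLabel(labelIdx, numClasses));
    for (Map.Entry<Integer, Double> e : sorted.entrySet()) {
      sb.append(' ').append(e.getKey()).append(':').append(e.getValue());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(formatLine(0, 2, new int[] {2, 0}, new double[] {1.0, 2.5}));
    System.out.println(formatLine(2, 3, new int[] {1}, new double[] {1.0}));
  }
}
```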



Stanford NLP Group