Class DatasetSplitter

java.lang.Object
org.apache.lucene.classification.utils.DatasetSplitter

public class DatasetSplitter extends Object
Utility class for creating training / test / cross validation indexes from the original index.
  • Constructor Details

    • DatasetSplitter

      public DatasetSplitter(double testRatio, double crossValidationRatio)
      Creates a DatasetSplitter with the given sizes for the test and cross validation indexes.
      Parameters:
      testRatio - the ratio of the original index to be used for the test index, as a double between 0.0 and 1.0
      crossValidationRatio - the ratio of the original index to be used for the cross validation index, as a double between 0.0 and 1.0
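
The arithmetic implied by the two ratios can be sketched in plain Java. This is only an illustration: SplitSizes is a hypothetical helper, not part of Lucene, and the rounding (floor) and the rule that the two ratios may not sum to more than 1.0 are assumptions that this page does not state. Whatever is not assigned to the test or cross validation index is treated as training data.

```java
/** Hypothetical helper (not a Lucene class): turns split ratios into document counts. */
final class SplitSizes {
    final long training;
    final long test;
    final long crossValidation;

    private SplitSizes(long training, long test, long crossValidation) {
        this.training = training;
        this.test = test;
        this.crossValidation = crossValidation;
    }

    /**
     * Computes how many documents each index would receive.
     * Assumption: counts are rounded down and the remainder goes to training.
     */
    static SplitSizes of(long totalDocs, double testRatio, double crossValidationRatio) {
        if (totalDocs < 0) {
            throw new IllegalArgumentException("totalDocs must not be negative");
        }
        // The negated form also rejects NaN, which fails every comparison.
        if (!(testRatio >= 0.0 && testRatio <= 1.0)
                || !(crossValidationRatio >= 0.0 && crossValidationRatio <= 1.0)) {
            throw new IllegalArgumentException("ratios must be between 0.0 and 1.0");
        }
        // Assumption: the two held-out shares may not exceed the whole index.
        if (testRatio + crossValidationRatio > 1.0) {
            throw new IllegalArgumentException("ratios must not sum to more than 1.0");
        }
        long test = (long) Math.floor(totalDocs * testRatio);
        long crossValidation = (long) Math.floor(totalDocs * crossValidationRatio);
        return new SplitSizes(totalDocs - test - crossValidation, test, crossValidation);
    }

    public static void main(String[] args) {
        SplitSizes sizes = SplitSizes.of(1000, 0.25, 0.125);
        System.out.println("training=" + sizes.training
                + " test=" + sizes.test
                + " crossValidation=" + sizes.crossValidation);
    }
}
```

Under these assumptions, an original index of 1000 documents with testRatio 0.25 and crossValidationRatio 0.125 would leave 625 documents for training.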
  • Method Details

    • split

      public void split(IndexReader originalIndex, Directory trainingIndex, Directory testIndex, Directory crossValidationIndex, Analyzer analyzer, boolean termVectors, String classFieldName, String... fieldNames) throws IOException
      Splits a given index into three indexes, used for training, test and cross validation tasks respectively.
      Parameters:
      originalIndex - an IndexReader on the source index
      trainingIndex - a Directory used to write the training index
      testIndex - a Directory used to write the test index
      crossValidationIndex - a Directory used to write the cross validation index
      analyzer - the Analyzer used to create the new documents
      termVectors - true if term vectors should be kept
      classFieldName - name of the field used as the label for classification; this must be indexed with sorted doc values
      fieldNames - names of the fields to copy into the new indexes, or null if all fields should be used
      Throws:
      IOException - if any writing operation fails on any of the indexes
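
Because fieldNames is a varargs parameter, it matters how a caller expresses "all fields". The sketch below uses a hypothetical stand-in method, describe, which has the same trailing parameters as split but is not part of Lucene. It shows standard Java behaviour: omitting the trailing arguments passes an empty array, while passing null requires an explicit (String[]) cast. How split treats an empty array is not stated on this page, so the explicit null form is the one that matches the documented contract.

```java
/** Hypothetical demo class (not a Lucene class) showing how varargs arguments arrive. */
final class VarargsFieldNames {

    /** Stand-in with the same trailing parameters as split(...); reports what it received. */
    static String describe(String classFieldName, String... fieldNames) {
        if (fieldNames == null) {
            return "all fields";
        }
        return fieldNames.length + " named field(s)";
    }

    public static void main(String[] args) {
        // Two explicit field names are packed into a String[2].
        System.out.println(describe("category", "title", "body"));
        // The cast makes the whole varargs array null, as the parameter description requires.
        System.out.println(describe("category", (String[]) null));
        // Omitting the arguments passes an empty array, not null.
        System.out.println(describe("category"));
    }
}
```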