java.lang.Object

org.apache.lucene.classification.SimpleNaiveBayesClassifier

All Implemented Interfaces: Classifier<BytesRef>

Direct Known Subclasses: CachingNaiveBayesClassifier, SimpleNaiveBayesDocumentClassifier

public class SimpleNaiveBayesClassifier extends Object implements Classifier<BytesRef>

A simplistic Lucene-based Naive Bayes classifier; see http://en.wikipedia.org/wiki/Naive_Bayes_classifier
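For orientation, the textbook Naive Bayes decision rule can be written as below. This is the standard formulation, not a statement of this class's internals: how the prior and the per-token likelihoods are estimated from the index (for example, which smoothing is applied) is not specified on this page. Here c ranges over the values of the class field and w_1, ..., w_n are the tokens of the input text.

```latex
\hat{c}
  = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
  = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]
```

The log form is the one usually computed in practice, because a product of many small probabilities underflows quickly.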

WARNING: This API is experimental and might change in incompatible ways in the next release.

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected final Analyzer | analyzer | Analyzer to be used for tokenizing unseen input text |
| protected final String | classFieldName | Name of the field to be used as a class / category output |
| protected final IndexReader | indexReader | IndexReader used to access the Classifier's index |
| protected final IndexSearcher | indexSearcher | IndexSearcher to run searches on the index for retrieving frequencies |
| protected final Query | query | Query used to optionally filter the document set used for classification |
| protected final String[] | textFieldNames | Names of the fields to be used as input text |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| SimpleNaiveBayesClassifier(IndexReader indexReader, Analyzer analyzer, Query query, String classFieldName, String... textFieldNames) | Creates a new NaiveBayes classifier. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| ClassificationResult<BytesRef> | assignClass(String inputDocument) | Assign a class (with score) to the given text String |
| protected List<ClassificationResult<BytesRef>> | assignClassNormalizedList(String inputDocument) | Calculate probabilities for all classes for a given input text |
| protected int | countDocsWithClass() | Count the number of documents in the index that have at least one value for the 'class' field |
| List<ClassificationResult<BytesRef>> | getClasses(String text) | Get all the classes (sorted by score, descending) assigned to the given text String. |
| List<ClassificationResult<BytesRef>> | getClasses(String text, int max) | Get the first max classes (sorted by score, descending) assigned to the given text String. |
| protected ArrayList<ClassificationResult<BytesRef>> | normClassificationResults(List<ClassificationResult<BytesRef>> assignedClasses) | Normalize the classification results based on the max score available |
| protected String[] | tokenize(String text) | Tokenize a String using this classifier's text fields and analyzer |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- indexReader
  
  protected final IndexReader indexReader
  
  IndexReader used to access the Classifier's index
- textFieldNames
  
  protected final String[] textFieldNames
  
  names of the fields to be used as input text
- classFieldName
  
  protected final String classFieldName
  
  name of the field to be used as a class / category output
- analyzer
  
  protected final Analyzer analyzer
  
  Analyzer to be used for tokenizing unseen input text
- indexSearcher
  
  protected final IndexSearcher indexSearcher
  
  IndexSearcher to run searches on the index for retrieving frequencies
- query
  
  protected final Query query
  
  Query used to optionally filter the document set used for classification
Constructor Details
- SimpleNaiveBayesClassifier
  
  public SimpleNaiveBayesClassifier(IndexReader indexReader, Analyzer analyzer, Query query, String classFieldName, String... textFieldNames)
  
  Creates a new NaiveBayes classifier.
  
  Parameters:
  
  indexReader - the reader on the index to be used for classification
  
  analyzer - an Analyzer used to analyze unseen text
  
  query - a Query to optionally filter the docs used for training the classifier, or null if all the indexed docs should be used
  
  classFieldName - the name of the field used as the output for the classifier. NOTE: this field must not be heavily analyzed, as the returned class will be a token indexed for this field
  
  textFieldNames - the names of the fields used as the inputs for the classifier; per-field boosting is not supported
Method Details
- assignClass
  
  public ClassificationResult<BytesRef> assignClass(String inputDocument) throws IOException
  
  Description copied from interface: Classifier
  
  Assign a class (with score) to the given text String
  
  Specified by:
  
  assignClass in interface Classifier<BytesRef>
  
  Parameters:
  
  inputDocument - a String containing text to be classified
  
  Returns:
  
  a ClassificationResult holding the assigned class of type T and its score
  
  Throws:
  
  IOException - If there is a low-level I/O error.
- getClasses
  
  public List<ClassificationResult<BytesRef>> getClasses(String text) throws IOException
  
  Description copied from interface: Classifier
  
  Get all the classes (sorted by score, descending) assigned to the given text String.
  
  Specified by:
  
  getClasses in interface Classifier<BytesRef>
  
  Parameters:
  
  text - a String containing text to be classified
  
  Returns:
  
  the whole list of ClassificationResult, the classes and scores. Returns null if the classifier can't make lists.
  
  Throws:
  
  IOException - If there is a low-level I/O error.
- getClasses
  
  public List<ClassificationResult<BytesRef>> getClasses(String text, int max) throws IOException
  
  Description copied from interface: Classifier
  
  Get the first max classes (sorted by score, descending) assigned to the given text String.
  
  Specified by:
  
  getClasses in interface Classifier<BytesRef>
  
  Parameters:
  
  text - a String containing text to be classified
  
  max - the number of return list elements
  
  Returns:
  
  the list of ClassificationResult (classes and scores), cut to at most max elements. Returns null if the classifier can't make lists.
  
  Throws:
  
  IOException - If there is a low-level I/O error.
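The "sorted by score, descending, then cut to max" contract of getClasses(String text, int max) can be illustrated with a small plain-Java sketch. It is not Lucene code: the class TopClassesSketch, the method topClasses, and the use of Map.Entry<String, Double> as a stand-in for ClassificationResult<BytesRef> are all assumptions made so the example runs without the Lucene libraries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: a (class label, score) pair is
// modelled as Map.Entry<String, Double> instead of ClassificationResult<BytesRef>.
final class TopClassesSketch {

  // Returns at most max entries, highest score first. The input list is left unchanged.
  static List<Map.Entry<String, Double>> topClasses(
      List<Map.Entry<String, Double>> scored, int max) {
    List<Map.Entry<String, Double>> sorted = new ArrayList<>(scored);
    // Sort by score, descending.
    sorted.sort(Map.Entry.<String, Double>comparingByValue().reversed());
    // Clamp max to [0, size] so oversized or negative values do not throw.
    int limit = Math.min(Math.max(max, 0), sorted.size());
    return new ArrayList<>(sorted.subList(0, limit));
  }
}
```

With scores sports=0.2, tech=0.7 and politics=0.1, a call with max = 2 yields tech followed by sports, and a max larger than the number of classes simply returns all of them.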
- assignClassNormalizedList
  
  protected List<ClassificationResult<BytesRef>> assignClassNormalizedList(String inputDocument) throws IOException
  
  Calculate probabilities for all classes for a given input text
  
  Parameters:
  
  inputDocument - the input text as a String
  
  Returns:
  
  a List of ClassificationResult, one for each existing class
  
  Throws:
  
  IOException - if assigning probabilities fails
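To make "calculate probabilities for all classes" concrete, here is a minimal plain-Java sketch of per-class Naive Bayes scoring in log space. It is an illustration under stated assumptions, not the library's implementation: the class NaiveBayesScoreSketch, its logScores method, the in-memory count maps (standing in for frequencies that the real classifier retrieves through its IndexSearcher), and the add-one smoothing are all hypothetical choices.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Lucene code: in-memory counts stand in for
// the frequencies a real classifier would read from the index.
final class NaiveBayesScoreSketch {

  // Returns log P(c) + sum_i log P(token_i | c) for every class c.
  //   docsPerClass:   class -> number of training documents with that class
  //   termCounts:     class -> (term -> occurrences of the term in that class)
  //   vocabularySize: number of distinct terms, used for add-one smoothing (an assumption)
  // Assumes docsPerClass contains at least one document in total.
  static Map<String, Double> logScores(
      String[] tokens,
      Map<String, Integer> docsPerClass,
      Map<String, Map<String, Integer>> termCounts,
      int vocabularySize) {
    int totalDocs = 0;
    for (int n : docsPerClass.values()) {
      totalDocs += n;
    }

    Map<String, Double> scores = new LinkedHashMap<>();
    for (Map.Entry<String, Integer> e : docsPerClass.entrySet()) {
      String c = e.getKey();
      Map<String, Integer> counts =
          termCounts.getOrDefault(c, Collections.<String, Integer>emptyMap());
      long totalTerms = 0;
      for (int n : counts.values()) {
        totalTerms += n;
      }

      // Log prior: share of training documents labelled with class c.
      double score = Math.log((double) e.getValue() / totalDocs);
      // Add-one smoothed log likelihood of each input token given class c.
      for (String token : tokens) {
        int tf = counts.getOrDefault(token, 0);
        score += Math.log((tf + 1.0) / (totalTerms + vocabularySize));
      }
      scores.put(c, score);
    }
    return scores;
  }
}
```

The result holds one unnormalized log score per class; turning these into values that sum to 1 is a separate normalization step.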
- countDocsWithClass
  
  protected int countDocsWithClass() throws IOException
  
  Count the number of documents in the index that have at least one value for the 'class' field
  
  Returns:
  
  the number of documents having a value for the 'class' field
  
  Throws:
  
  IOException - if accessing term vectors or searching fails
- tokenize
  
  protected String[] tokenize(String text) throws IOException
  
  Tokenize a String using this classifier's text fields and analyzer
  
  Parameters:
  
  text - the String representing an input text (to be classified)
  
  Returns:
  
  a String array of the resulting tokens
  
  Throws:
  
  IOException - if tokenization fails
- normClassificationResults
  
  protected ArrayList<ClassificationResult<BytesRef>> normClassificationResults(List<ClassificationResult<BytesRef>> assignedClasses)
  
  Normalize the classification results based on the max score available
  
  Parameters:
  
  assignedClasses - the list of assigned classes
  
  Returns:
  
  the normalized results
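One common way to normalize log-space scores relative to the maximum score is the log-sum-exp technique, sketched below in plain Java. This is an assumption about what such a normalization can look like, not a copy of this method's body: the class NormalizeSketch and its normalize method are hypothetical, and a double[] of log scores stands in for the list of ClassificationResult<BytesRef>.

```java
// Hypothetical illustration, not Lucene code.
final class NormalizeSketch {

  // Converts unnormalized log scores into values in [0, 1] that sum to 1.
  // Subtracting the max score before exponentiating keeps exp() from underflowing.
  // Assumes at least one score is finite when the array is non-empty.
  static double[] normalize(double[] logScores) {
    double[] out = new double[logScores.length];
    if (logScores.length == 0) {
      return out;
    }

    double max = Double.NEGATIVE_INFINITY;
    for (double s : logScores) {
      max = Math.max(max, s);
    }

    // log(sum(exp(x_i))) = max + log(sum(exp(x_i - max)))
    double sum = 0.0;
    for (double s : logScores) {
      sum += Math.exp(s - max);
    }
    double logTotal = max + Math.log(sum);

    for (int i = 0; i < logScores.length; i++) {
      out[i] = Math.exp(logScores[i] - logTotal);
    }
    return out;
  }
}
```

Anchoring on the max score matters for inputs such as -1000 and -1001: exponentiating them directly underflows to 0, while the shifted form still produces two well-defined values that sum to 1.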
