java.lang.Object

org.apache.lucene.classification.SimpleNaiveBayesClassifier

org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier

All Implemented Interfaces: Classifier<BytesRef>, DocumentClassifier<BytesRef>

public class SimpleNaiveBayesDocumentClassifier extends SimpleNaiveBayesClassifier implements DocumentClassifier<BytesRef>

A simplistic Lucene-based naive Bayes classifier; see http://en.wikipedia.org/wiki/Naive_Bayes_classifier

WARNING: This API is experimental and might change in incompatible ways in the next release.
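To make the scoring idea behind a naive Bayes classifier concrete, the following is a minimal plain-Java sketch of the multinomial variant with add-one smoothing. It does not use Lucene and is not the implementation of this class; the class name `NaiveBayesSketch`, its methods, and the choice of smoothing are assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of naive Bayes scoring; not the Lucene implementation. */
final class NaiveBayesSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> totalTermsPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Records one training document, given as its class label and its tokens. */
    void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            totalTermsPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** log P(class) + sum of log P(token | class), with add-one smoothing (an assumption). */
    double logScore(String label, List<String> tokens) {
        double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
        Map<String, Integer> counts = termCounts.get(label);
        int total = totalTermsPerClass.getOrDefault(label, 0);
        for (String token : tokens) {
            int count = counts.getOrDefault(token, 0);
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the label with the highest log score, or null if nothing was trained. */
    String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = logScore(label, tokens);
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

Working in log space avoids underflow when many small probabilities are multiplied together.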

Field Summary

Fields

Modifier and Type

Field

Description

protected final Map<String,Analyzer>

field2analyzer

Analyzer to be used for tokenizing document fields

Fields inherited from class org.apache.lucene.classification.SimpleNaiveBayesClassifier
analyzer, classFieldName, indexReader, indexSearcher, query, textFieldNames

Constructor Summary

Constructors

Constructor

Description

SimpleNaiveBayesDocumentClassifier(IndexReader indexReader, Query query, String classFieldName, Map<String,Analyzer> field2analyzer, String... textFieldNames)

Creates a new NaiveBayes classifier.

Method Summary

Modifier and Type

Method

Description

ClassificationResult<BytesRef>

assignClass(Document document)

Assign a class (with score) to the given Document

List<ClassificationResult<BytesRef>>

getClasses(Document document)

Get all the classes (sorted by score, descending) assigned to the given Document.

List<ClassificationResult<BytesRef>>

getClasses(Document document, int max)

Get the first max classes (sorted by score, descending) assigned to the given Document.

protected String[]

getTokenArray(TokenStream tokenizedText)

Returns an array of the tokens read from the input TokenStream

Methods inherited from class org.apache.lucene.classification.SimpleNaiveBayesClassifier
assignClass, assignClassNormalizedList, countDocsWithClass, getClasses, getClasses, normClassificationResults, tokenize

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- field2analyzer
  
  protected final Map<String,Analyzer> field2analyzer
  
  Analyzer to be used for tokenizing document fields
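The idea of a per-field analyzer map can be sketched without Lucene: each field name is looked up in a map to find the tokenizer that applies to it. In the sketch below, plain `Function<String, List<String>>` values stand in for Lucene `Analyzer` instances; the class `PerFieldTokenizerSketch` and its two tokenizers are hypothetical and only illustrate the lookup pattern.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical stand-in for a field-name-to-analyzer map; not Lucene code. */
final class PerFieldTokenizerSketch {
    private final Map<String, Function<String, List<String>>> field2tokenizer = new HashMap<>();

    /** Associates a tokenizer with a field name. */
    void register(String field, Function<String, List<String>> tokenizer) {
        field2tokenizer.put(field, tokenizer);
    }

    /** Tokenizes text with the tokenizer registered for the given field. */
    List<String> tokenize(String field, String text) {
        Function<String, List<String>> tokenizer = field2tokenizer.get(field);
        if (tokenizer == null) {
            throw new IllegalArgumentException("No tokenizer registered for field: " + field);
        }
        return tokenizer.apply(text);
    }

    /** Lowercases the text and splits it on runs of non-word characters. */
    static List<String> lowercaseWords(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\W+"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    /** Keeps the whole value as a single token. */
    static List<String> keyword(String text) {
        return List.of(text);
    }
}
```

Different fields often need different treatment: free text such as a title is usually split into words, while an identifier-like field is better kept as one token.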
Constructor Details
- SimpleNaiveBayesDocumentClassifier
  
  public SimpleNaiveBayesDocumentClassifier(IndexReader indexReader, Query query, String classFieldName, Map<String,Analyzer> field2analyzer, String... textFieldNames)
  
  Creates a new NaiveBayes classifier.
  
  Parameters:
  
  indexReader - the reader on the index to be used for classification
  
  query - a Query that optionally filters the docs used for training the classifier, or null if all the indexed docs should be used
  
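  classFieldName - the name of the field used as the output for the classifier. NOTE: must not be heavily analyzed, as the returned class will be a token indexed for this field
  
  field2analyzer - a map from each field name to the Analyzer used for tokenizing that document field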
  
  textFieldNames - the names of the fields used as the inputs for the classifier; each name can carry a boost indication, e.g. title^10
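The `title^10` boost notation can be illustrated with a small parser that splits a field specification into a name and a numeric boost. This is a sketch under stated assumptions, not the parsing code used by the classifier: the class `BoostedFieldSketch` is hypothetical, and defaulting to a boost of 1.0 when no `^` is present is an assumption.

```java
/** Hypothetical holder for a field name and its boost; not part of Lucene. */
final class BoostedFieldSketch {
    final String name;
    final float boost;

    BoostedFieldSketch(String name, float boost) {
        this.name = name;
        this.boost = boost;
    }

    /** Parses "field" or "field^boost"; a missing boost defaults to 1.0 (an assumption). */
    static BoostedFieldSketch parse(String spec) {
        int caret = spec.lastIndexOf('^');
        if (caret < 0) {
            return new BoostedFieldSketch(spec, 1.0f);
        }
        String name = spec.substring(0, caret);
        float boost = Float.parseFloat(spec.substring(caret + 1));
        return new BoostedFieldSketch(name, boost);
    }
}
```

With this sketch, `BoostedFieldSketch.parse("title^10")` yields the name `title` with boost 10, and `BoostedFieldSketch.parse("body")` yields `body` with the assumed default boost.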
Method Details
- assignClass
  
  public ClassificationResult<BytesRef> assignClass(Document document) throws IOException
  
  Description copied from interface: DocumentClassifier
  
  Assign a class (with score) to the given Document
  
  Specified by:
  
  assignClass in interface DocumentClassifier<BytesRef>
  
  Parameters:
  
  document - a Document to be classified. Fields are considered features for the classification.
  
  Returns:
  
  a ClassificationResult holding the assigned class (of type T) and its score
  
  Throws:
  
  IOException - If there is a low-level I/O error.
- getClasses
  
  public List<ClassificationResult<BytesRef>> getClasses(Document document) throws IOException
  
  Description copied from interface: DocumentClassifier
  
  Get all the classes (sorted by score, descending) assigned to the given Document.
  
  Specified by:
  
  getClasses in interface DocumentClassifier<BytesRef>
  
  Parameters:
  
  document - a Document to be classified. Fields are considered features for the classification.
  
  Returns:
  
  the whole list of ClassificationResult objects, each holding a class and its score. Returns null if the classifier can't make lists.
  
  Throws:
  
  IOException - If there is a low-level I/O error.
- getClasses
  
  public List<ClassificationResult<BytesRef>> getClasses(Document document, int max) throws IOException
  
  Description copied from interface: DocumentClassifier
  
  Get the first max classes (sorted by score, descending) assigned to the given Document.
  
  Specified by:
  
  getClasses in interface DocumentClassifier<BytesRef>
  
  Parameters:
  
  document - a Document to be classified. Fields are considered features for the classification.
  
  max - the maximum number of elements in the returned list
  
  Returns:
  
  the list of ClassificationResult objects, each holding a class and its score, cut to at most max elements. Returns null if the classifier can't make lists.
  
  Throws:
  
  IOException - If there is a low-level I/O error.
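The "sorted by score, descending, cut to max" contract described above can be sketched in plain Java. The `Scored` type below is a hypothetical stand-in for `ClassificationResult`, and clamping a negative `max` to zero is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of sorting results by score and keeping the first max. */
final class TopClassesSketch {
    /** Stand-in for a classification result: a label and its score. */
    static final class Scored {
        final String label;
        final double score;

        Scored(String label, double score) {
            this.label = label;
            this.score = score;
        }
    }

    /** Returns up to max results, highest score first; the input list is not modified. */
    static List<Scored> top(List<Scored> results, int max) {
        List<Scored> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        int limit = Math.min(Math.max(max, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, limit));
    }
}
```

When `max` exceeds the number of available results, the sketch simply returns all of them.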
- getTokenArray
  
  protected String[] getTokenArray(TokenStream tokenizedText) throws IOException
  
  Returns an array of the tokens read from the input TokenStream
  
  Parameters:
  
  tokenizedText - the tokenized content of a field
  
  Returns:
  
  a String array of the resulting tokens
  
  Throws:
  
  IOException - If tokenization fails because there is a low-level I/O error
