Package org.apache.lucene.index

Code to maintain and access indices.

Table Of Contents

  1. Postings APIs
  2. Index Statistics

Postings APIs

Fields

Fields is the initial entry point into the postings APIs. It can be obtained in several ways:

 // access indexed fields for an index segment
 Fields fields = reader.fields();
 // alternatively, access term vector fields for a specified document
 Fields termVectorFields = reader.getTermVectors(docid);
 
Fields implements Java's Iterable interface, so it's easy to enumerate the list of fields:
 // enumerate list of fields
 for (String field : fields) {
   // access the terms for this field
   Terms terms = fields.terms(field);
 }
 

Terms

Terms represents the collection of terms within a field. It exposes some metadata and statistics, and provides an API for enumeration.

 // metadata about the field
 System.out.println("positions? " + terms.hasPositions());
 System.out.println("offsets? " + terms.hasOffsets());
 System.out.println("payloads? " + terms.hasPayloads());
 // iterate through terms
 TermsEnum termsEnum = terms.iterator(null);
 BytesRef term = null;
 while ((term = termsEnum.next()) != null) {
   doSomethingWith(termsEnum.term());
 }
 
TermsEnum provides an iterator over the list of terms within a field, some statistics about the term, and methods to access the term's documents and positions.
 // seek to a specific term
 boolean found = termsEnum.seekExact(new BytesRef("foobar"));
 if (found) {
   // get the document frequency
   System.out.println(termsEnum.docFreq());
   // enumerate through documents
   PostingsEnum docs = termsEnum.postings(null, null);
   // enumerate through documents and positions
   PostingsEnum docsAndPositions = termsEnum.postings(null, null, PostingsEnum.FLAG_POSITIONS);
 }
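
To make the enumerate-and-seek pattern concrete without an index on hand, the following is a minimal sketch in plain Java. ToyTerms and its methods are hypothetical names chosen for illustration. They are not Lucene classes, and the sketch makes no claim about how Lucene stores its term dictionary. It only assumes that terms are kept in sorted order and that each term maps to the documents containing it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical toy model, not a Lucene class: the terms of one field, kept sorted.
class ToyTerms {
  // term -> ascending list of IDs of the documents that contain the term
  private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

  // Record that docId contains term. Documents are assumed to arrive in increasing ID order.
  void add(String term, int docId) {
    List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
    if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
      docs.add(docId); // skip repeated occurrences within the same document
    }
  }

  // Sorted enumeration of all terms, analogous to looping over termsEnum.next().
  List<String> terms() {
    return new ArrayList<>(postings.keySet());
  }

  // Exact lookup, analogous to seekExact: true only if the term is present.
  boolean seekExact(String term) {
    return postings.containsKey(term);
  }

  // Number of documents containing the term, or 0 if the term is absent.
  int docFreq(String term) {
    List<Integer> docs = postings.get(term);
    return docs == null ? 0 : docs.size();
  }
}
```

In this toy model the document frequency is simply the length of the term's document list, which is why it is only meaningful after a successful exact seek.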
 

Documents

PostingsEnum is an extension of DocIdSetIterator that iterates over the list of documents for a term, along with the term frequency within each document.

 // docs is the PostingsEnum obtained from termsEnum.postings(...)
 int docid;
 while ((docid = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
   System.out.println(docid);
   System.out.println(docs.freq());
 }
 

Positions

PostingsEnum also allows iteration over the positions at which a term occurs within a document, and over any additional per-position information (offsets and payload). The information available is controlled by the flags passed to TermsEnum#postings.

 int docid;
 PostingsEnum postings = termsEnum.postings(null, null, PostingsEnum.FLAG_PAYLOADS | PostingsEnum.FLAG_OFFSETS);
 while ((docid = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
   System.out.println(docid);
   int freq = postings.freq();
   for (int i = 0; i < freq; i++) {
     System.out.println(postings.nextPosition());
     System.out.println(postings.startOffset());
     System.out.println(postings.endOffset());
     System.out.println(postings.getPayload());
   }
 }
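
The relationship between documents, term frequency, and positions in the loops above can be illustrated with a small self-contained sketch. ToyPostings, Posting, and index are hypothetical names introduced here, not part of Lucene. The sketch assumes whitespace tokenization and uses the array index of each input string as its document ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model, not Lucene's implementation.
class ToyPostings {
  // One entry per document that contains the term.
  static final class Posting {
    final int docId;
    final List<Integer> positions = new ArrayList<>();

    Posting(int docId) {
      this.docId = docId;
    }

    // Term frequency within the document: one count per recorded position.
    int freq() {
      return positions.size();
    }
  }

  // Build term -> postings from whitespace-tokenized documents.
  static Map<String, List<Posting>> index(String... docs) {
    Map<String, List<Posting>> result = new TreeMap<>();
    for (int docId = 0; docId < docs.length; docId++) {
      String[] tokens = docs[docId].trim().split("\\s+");
      for (int pos = 0; pos < tokens.length; pos++) {
        if (tokens[pos].isEmpty()) {
          continue; // an empty document yields a single empty token
        }
        List<Posting> list = result.computeIfAbsent(tokens[pos], t -> new ArrayList<>());
        if (list.isEmpty() || list.get(list.size() - 1).docId != docId) {
          list.add(new Posting(docId)); // first occurrence of the term in this document
        }
        list.get(list.size() - 1).positions.add(pos);
      }
    }
    return result;
  }
}
```

For example, ToyPostings.index("a b a", "b c") yields a single posting for "a" with document ID 0, frequency 2, and positions 0 and 2. This mirrors why the position loop above runs exactly freq() times per document.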
 

Index Statistics

Term statistics

Field statistics

Segment statistics

Document statistics

Document statistics are available during the indexing process for an indexed field. Typically, a Similarity implementation stores some of these values (possibly in a lossy way) in the normalization value for the document, in its Similarity.computeNorm(org.apache.lucene.index.FieldInvertState) method.
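
As an illustration of what storing a statistic "in a lossy way" can mean, the sketch below compresses a field length into a single byte by keeping only the position of its highest set bit. ToyNorm, encode, and decode are hypothetical names. This encoding is an assumption made for the example, and it is not the scheme used by any particular Similarity implementation.

```java
// Hypothetical toy model: a lossy one-byte encoding of a field length.
class ToyNorm {
  // Encode a field length (>= 1) as floor(log2(length)); lengths 64..127 all map to 6.
  static byte encode(int fieldLength) {
    if (fieldLength < 1) {
      throw new IllegalArgumentException("fieldLength must be >= 1");
    }
    return (byte) (31 - Integer.numberOfLeadingZeros(fieldLength));
  }

  // Decode to the smallest length in the bucket; the exact original length is not recoverable.
  static int decode(byte norm) {
    return 1 << norm;
  }
}
```

Under this scheme a field of 100 terms is stored as 6 and decoded as 64. The trade-off is one byte of storage per document and field against a coarse approximation of the original value.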

Additional user-supplied statistics can be added to the document as DocValues fields and accessed via LeafReader.getNumericDocValues(java.lang.String).

Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.