Package org.apache.lucene.index
Index APIs
IndexWriter
IndexWriter is used to create an index, and to add, update and delete documents. The IndexWriter class is thread-safe and enforces a single instance per index. Creating an IndexWriter either creates a new index or opens an existing one for writing, in a Directory, depending on the configuration in IndexWriterConfig. A Directory is an abstraction that typically represents a local file-system directory (see the various implementations of FSDirectory), but it may also stand for some other kind of storage, such as RAM.
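As a concrete illustration, here is a minimal sketch of opening a Directory and writing one document. The index path, analyzer choice, and field names are illustrative only, and a recent Lucene release is assumed on the classpath.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CreateIndex {
  public static void main(String[] args) throws Exception {
    // Open (or create) an index in a local file-system directory (path is illustrative)
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/example-index"))) {
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      // CREATE_OR_APPEND (the default) opens an existing index or creates a new one
      config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
      try (IndexWriter writer = new IndexWriter(dir, config)) {
        Document doc = new Document();
        doc.add(new TextField("body", "hello lucene", Field.Store.YES));
        writer.addDocument(doc);
        writer.commit(); // make the change durable and visible to newly opened readers
      }
    }
  }
}
```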
IndexReader
IndexReader is used to read data from the index, and supports searching. Many thread-safe readers may be opened concurrently (see DirectoryReader.open(org.apache.lucene.store.Directory)) with a single (or no) writer. Each reader maintains a consistent "point in time" view of the index and must be explicitly refreshed (see DirectoryReader.openIfChanged(org.apache.lucene.index.DirectoryReader)) in order to incorporate writes that occur after it was opened.
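The refresh step described above might be wrapped like this (a sketch, assuming a recent Lucene release; openIfChanged returns null when the index has not changed):

```java
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;

public class ReaderRefresh {
  /** Returns a reader that reflects all writes committed so far. */
  static DirectoryReader refresh(DirectoryReader reader) throws IOException {
    // openIfChanged returns null if the index is unchanged since 'reader' was opened
    DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
    if (newReader == null) {
      return reader; // still current; keep using the same point-in-time view
    }
    reader.close(); // release the stale view
    return newReader;
  }
}
```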
Segments and docids
Lucene's index is composed of segments, each of which contains a subset of all the documents in the index, and is a complete searchable index in itself, over that subset. As documents are written to the index, new segments are created and flushed to directory storage. Segments are immutable; updates and deletions may only create new segments and do not modify existing ones. Over time, the writer merges groups of smaller segments into single larger ones in order to maintain an index that is efficient to search, and to reclaim dead space left behind by deleted (and updated) documents.
Each document is identified by a 32-bit number, its "docid," and is composed of a collection
of Field values of diverse types (postings, stored fields, doc values, and points). Docids come
in two flavors: global and per-segment. A document's global docid is just the sum of its
per-segment docid and that segment's base docid offset. External, high-level APIs only handle
global docids, but internal APIs that reference a LeafReader, which is a reader for a single segment, deal in per-segment docids.
Docids are assigned sequentially within each segment (starting at 0). Thus the number of documents in a segment is the same as its maximum docid; some may be deleted, but their docids are retained until the segment is merged. When segments merge, their documents are assigned new sequential docids. Accordingly, docid values must always be treated as internal implementation, not exposed as part of an application, nor stored or referenced outside of Lucene's internal APIs.
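The global/per-segment relationship is plain base-offset arithmetic, which can be sketched without Lucene at all. The segment sizes below are invented for illustration; in Lucene itself the base offsets are exposed as LeafReaderContext.docBase.

```java
import java.util.Arrays;

public class DocidMapping {
  // Hypothetical per-segment document counts: three segments of 100, 50, and 75 docs.
  static final int[] SEGMENT_SIZES = {100, 50, 75};

  /** docBases[i] is the global docid of the first document in segment i. */
  static int[] docBases(int[] sizes) {
    int[] bases = new int[sizes.length];
    for (int i = 1; i < sizes.length; i++) {
      bases[i] = bases[i - 1] + sizes[i - 1];
    }
    return bases;
  }

  /** Global docid = segment's base docid offset + per-segment docid. */
  static int toGlobal(int[] bases, int segment, int segmentDocid) {
    return bases[segment] + segmentDocid;
  }

  /** Find the segment containing a global docid, then subtract its base. */
  static int[] toPerSegment(int[] bases, int globalDocid) {
    int idx = Arrays.binarySearch(bases, globalDocid);
    int segment = idx >= 0 ? idx : -idx - 2; // last segment whose base is <= globalDocid
    return new int[] {segment, globalDocid - bases[segment]};
  }
}
```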
Field Types
Lucene supports a variety of different document field data structures. Lucene's core, the inverted index, consists of "postings." The postings, with their term dictionary, can be thought of as a map that, given a Term (roughly, a word or token), provides efficient lookup of the ordered list of Documents containing that Term. Codecs may additionally record impacts alongside postings in order to be able to skip over low-scoring documents at search time. Postings do not provide any way of retrieving terms given a document, short of scanning the entire index.
Stored fields are essentially the opposite of postings, providing efficient retrieval of field
values given a docid. All stored field values for a document are stored together in a block.
Different types of stored fields provide high-level datatypes such as strings and numbers on top of the underlying bytes. Stored field values are usually retrieved by the searcher using an implementation of StoredFieldVisitor.
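For example, here is a sketch of loading a single stored field by docid. The "title" field name is illustrative; this assumes Lucene 9.5+, where the StoredFields accessor is available on IndexReader.

```java
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.StoredFields;

public class StoredFieldAccess {
  /** Load the stored "title" value for a global docid (field name is illustrative). */
  static String title(IndexReader reader, int docid) throws IOException {
    StoredFields storedFields = reader.storedFields();
    // Loading only the fields we need avoids decoding the whole stored-field block
    Document doc = storedFields.document(docid, Set.of("title"));
    return doc.get("title");
  }
}
```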
DocValues fields are what are sometimes referred to as columnar, or column-stride, fields, by analogy to relational database terminology, in which documents are considered as rows and fields as columns. DocValues fields store values per-field: a value for every document is held in a single data structure, providing for rapid, sequential lookup of a field value given a docid. These fields are used for efficient value-based sorting and for faceting, but they are not useful for filtering.
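A sketch of a per-document lookup against a NumericDocValues field (the field name is illustrative; since Lucene 7 DocValues are iterators, so advanceExact is used to position on a document):

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

public class DocValuesLookup {
  /** Read a per-document long value, or return a default if the document has none. */
  static long value(LeafReader leafReader, String field, int segmentDocid, long missing)
      throws IOException {
    NumericDocValues dv = leafReader.getNumericDocValues(field);
    if (dv == null) {
      return missing; // no document in this segment has a value for the field
    }
    // advanceExact returns true only if this document has a value
    return dv.advanceExact(segmentDocid) ? dv.longValue() : missing;
  }
}
```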
PointValues represent numeric values using a k-d tree data structure. Efficient one- and higher-dimensional implementations make these the data structure of choice for numeric range and interval queries, as well as geo-spatial queries.
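For instance, a sketch of a numeric range query over an IntPoint field (the "year" field name is illustrative):

```java
import java.io.IOException;

import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class PointRange {
  /** Count documents whose "year" point value falls in [from, to]. */
  static int countInRange(IndexReader reader, int from, int to) throws IOException {
    IndexSearcher searcher = new IndexSearcher(reader);
    // IntPoint.newRangeQuery matches documents whose point value lies in the range
    return searcher.count(IntPoint.newRangeQuery("year", from, to));
  }
}
```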
Postings APIs
Fields
Fields is the initial entry point into the postings APIs; it can be obtained in several ways:

// access indexed fields for an index segment
Fields fields = reader.fields();

// access term vector fields for a specified document
TermVectors vectors = reader.termVectors();
Fields vectorFields = vectors.get(docid);

Fields implements Java's Iterable interface, so it is easy to enumerate the list of fields:
// enumerate list of fields
for (String field : fields) {
  // access the terms for this field
  Terms terms = fields.terms(field);
}
Terms
Terms
represents the collection of terms within a field, exposes some metadata and statistics about them, and provides an API for enumeration.
// metadata about the field
System.out.println("positions? " + terms.hasPositions());
System.out.println("offsets? " + terms.hasOffsets());
System.out.println("payloads? " + terms.hasPayloads());

// iterate through terms
TermsEnum termsEnum = terms.iterator();
BytesRef term;
while ((term = termsEnum.next()) != null) {
  doSomethingWith(termsEnum.term());
}
TermsEnum
provides an iterator over the list of terms within a
field, some statistics about the term, and methods to access the term's
documents and positions.
// seek to a specific term
boolean found = termsEnum.seekExact(new BytesRef("foobar"));
if (found) {
  // get the document frequency
  System.out.println(termsEnum.docFreq());
  // enumerate through documents
  PostingsEnum docs = termsEnum.postings(null);
  // enumerate through documents and positions
  PostingsEnum docsAndPositions = termsEnum.postings(null, PostingsEnum.POSITIONS);
}
Documents
PostingsEnum
is an extension of DocIdSetIterator
that iterates over the list of documents for a term, exposing the term's frequency within each document.
int docid;
while ((docid = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  System.out.println(docs.freq());
}
Positions
PostingsEnum also allows iterating over the positions at which a term occurred within a document, along with any additional per-position information (offsets and payloads). The information available is controlled by the flags passed to TermsEnum#postings.
int docid;
PostingsEnum postings = termsEnum.postings(null, PostingsEnum.PAYLOADS | PostingsEnum.OFFSETS);
while ((docid = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  int freq = postings.freq();
  for (int i = 0; i < freq; i++) {
    System.out.println(postings.nextPosition());
    System.out.println(postings.startOffset());
    System.out.println(postings.endOffset());
    System.out.println(postings.getPayload());
  }
}
Index Statistics
Term statistics
TermsEnum.docFreq(): Returns the number of documents that contain at least one occurrence of the term. This statistic is always available for an indexed term. Note that it also counts deleted documents; when segments are merged, the statistic is updated as those deleted documents are merged away.
TermsEnum.totalTermFreq(): Returns the number of occurrences of this term across all documents. Like docFreq(), it also counts occurrences that appear in deleted documents.
Field statistics
Terms.size(): Returns the number of unique terms in the field. This statistic may be unavailable (returns -1) for some Terms implementations such as MultiTerms, where it cannot be efficiently computed. Note that this count also includes terms that appear only in deleted documents: when segments are merged, such terms are also merged away and the statistic is then updated.
Terms.getDocCount(): Returns the number of documents that contain at least one occurrence of any term for this field. This can be thought of as a field-level docFreq(). Like docFreq(), it also counts deleted documents.
Terms.getSumDocFreq(): Returns the number of postings (term-document mappings in the inverted index) for the field. This can be thought of as the sum of TermsEnum.docFreq() across all terms in the field, and like docFreq() it also counts postings that appear in deleted documents.
Terms.getSumTotalTermFreq(): Returns the number of tokens for the field. This can be thought of as the sum of TermsEnum.totalTermFreq() across all terms in the field, and like totalTermFreq() it also counts occurrences that appear in deleted documents.
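These field-level statistics can be read through Terms, for example via MultiTerms across a whole reader (a sketch; the "body" field name is illustrative):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;

public class FieldStats {
  /** Print the field-level statistics described above (field name is illustrative). */
  static void print(IndexReader reader, String field) throws IOException {
    Terms terms = MultiTerms.getTerms(reader, field);
    if (terms == null) {
      return; // field is not indexed with postings
    }
    System.out.println("unique terms:    " + terms.size()); // may be -1 for MultiTerms
    System.out.println("docs with field: " + terms.getDocCount());
    System.out.println("postings:        " + terms.getSumDocFreq());
    System.out.println("tokens:          " + terms.getSumTotalTermFreq());
  }
}
```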
Segment statistics
IndexReader.maxDoc(): Returns the number of documents (including deleted documents) in the index.
IndexReader.numDocs(): Returns the number of live documents (excluding deleted documents) in the index.
IndexReader.numDeletedDocs(): Returns the number of deleted documents in the index.
Fields.size(): Returns the number of indexed fields.
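The three reader counts are related by a simple invariant, maxDoc() == numDocs() + numDeletedDocs(). A sketch that exercises it with an in-memory directory (field name and values are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SegmentStats {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        for (int i = 0; i < 3; i++) {
          Document doc = new Document();
          doc.add(new StringField("id", Integer.toString(i), Field.Store.NO));
          writer.addDocument(doc);
        }
        writer.deleteDocuments(new Term("id", "1")); // mark one document deleted
        writer.commit();
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        // maxDoc counts deleted docs; numDocs does not
        System.out.println("maxDoc=" + reader.maxDoc()
            + " numDocs=" + reader.numDocs()
            + " numDeletedDocs=" + reader.numDeletedDocs());
      }
    }
  }
}
```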
Document statistics
Document statistics are available during the indexing process for an indexed field: typically a Similarity implementation will store some of these values (possibly in a lossy way) into the normalization value for the document in its Similarity.computeNorm(org.apache.lucene.index.FieldInvertState) method.
FieldInvertState.getLength(): Returns the number of tokens for this field in the document. Note that this is just the number of times that TokenStream.incrementToken() returned true, and is unrelated to the values in PositionIncrementAttribute.
FieldInvertState.getNumOverlap(): Returns the number of tokens for this field in the document that had a position increment of zero. This can be used to compute a document length that discounts artificial tokens such as synonyms.
FieldInvertState.getPosition(): Returns the accumulated position value for this field in the document: computed from the values of PositionIncrementAttribute and including Analyzer.getPositionIncrementGap(java.lang.String)s across multivalued fields.
FieldInvertState.getOffset(): Returns the total character offset value for this field in the document: computed from the values of OffsetAttribute returned by TokenStream.end(), and including Analyzer.getOffsetGap(java.lang.String)s across multivalued fields.
FieldInvertState.getUniqueTermCount(): Returns the number of unique terms encountered for this field in the document.
FieldInvertState.getMaxTermFrequency(): Returns the maximum frequency across all unique terms encountered for this field in the document.
Additional user-supplied statistics can be added to the document as DocValues fields and
accessed via LeafReader.getNumericDocValues(java.lang.String)
.