Apache Lucene Migration Guide

Four-dimensional enumerations

Flexible indexing changed the low level fields/terms/docs/positions enumeration APIs. Here are the major changes:

LUCENE-2380: FieldCache.getStrings/Index –> FieldCache.getDocTerms/Index

LUCENE-2600: IndexReaders are now read-only

Instead of IndexReader.isDeleted, do this:

  import org.apache.lucene.util.Bits;
  import org.apache.lucene.index.MultiFields;

  Bits liveDocs = MultiFields.getLiveDocs(indexReader);
  if (liveDocs != null && !liveDocs.get(docID)) {
    // document is deleted...
  }

LUCENE-2858, LUCENE-3733: IndexReader –> AtomicReader/CompositeReader/DirectoryReader refactoring

The abstract class IndexReader has been refactored to expose only essential methods to access stored fields during display of search results. It is no longer possible to retrieve terms or postings data from the underlying index, not even deletions are visible anymore. You can still pass IndexReader as constructor parameter to IndexSearcher and execute your searches; Lucene will automatically delegate procedures like query rewriting and document collection atomic subreaders.

If you want to dive deeper into the index and want to write own queries, take a closer look at the new abstract sub-classes AtomicReader and CompositeReader:

AtomicReader instances are now the only source of Terms, Postings, DocValues and FieldCache. Queries are forced to execute on a Atomic reader on a per-segment basis and FieldCaches are keyed by AtomicReaders.

Its counterpart CompositeReader exposes a utility method to retrieve its composites. But watch out, composites are not necessarily atomic. Next to the added type-safety we also removed the notion of index-commits and version numbers from the abstract IndexReader, the associations with IndexWriter were pulled into a specialized DirectoryReader. To open Directory-based indexes use DirectoryReader.open(), the corresponding method in IndexReader is now deprecated for easier migration. Only DirectoryReader supports commits, versions, and reopening with openIfChanged(). Terms, postings, docvalues, and norms can from now on only be retrieved using AtomicReader; DirectoryReader and MultiReader extend CompositeReader, only offering stored fields and access to the sub-readers (which may be composite or atomic).

If you have more advanced code dealing with custom Filters, you might have noticed another new class hierarchy in Lucene (see LUCENE-2831): IndexReaderContext with corresponding Atomic-/CompositeReaderContext.

The move towards per-segment search Lucene 2.9 exposed lots of custom Queries and Filters that couldn't handle it. For example, some Filter implementations expected the IndexReader passed in is identical to the IndexReader passed to IndexSearcher with all its advantages like absolute document IDs etc. Obviously this "paradigm-shift" broke lots of applications and especially those that utilized cross-segment data structures (like Apache Solr).

In Lucene 4.0, we introduce IndexReaderContexts "searcher-private" reader hierarchy. During Query or Filter execution Lucene no longer passes raw readers down Queries, Filters or Collectors; instead components are provided an AtomicReaderContext (essentially a hierarchy leaf) holding relative properties like the document-basis in relation to the top-level reader. This allows Queries & Filter to build up logic based on document IDs, albeit the per-segment orientation.

There are still valid use-cases where top-level readers ie. "atomic views" on the index are desirable. Let say you want to iterate all terms of a complete index for auto-completion or faceting, Lucene provides utility wrappers like SlowCompositeReaderWrapper (LUCENE-2597) emulating an AtomicReader. Note: using "atomicity emulators" can cause serious slowdowns due to the need to merge terms, postings, DocValues, and FieldCache, use them with care!

LUCENE-4306: getSequentialSubReaders(), ReaderUtil.Gather

The method IndexReader#getSequentialSubReaders() was moved to CompositeReader (see LUCENE-2858, LUCENE-3733) and made protected. It is solely used by CompositeReader itself to build its reader tree. To get all atomic leaves of a reader, use IndexReader#leaves(), which also provides the doc base of each leave. Readers that are already atomic return itself as leaf with doc base 0. To emulate Lucene 3.x getSequentialSubReaders(), use getContext().children().

LUCENE-2413,LUCENE-3396: Analyzer package changes

Lucene's core and contrib analyzers, along with Solr's analyzers, were consolidated into lucene/analysis. During the refactoring some package names have changed, and ReusableAnalyzerBase was renamed to Analyzer:

LUCENE-2514: Collators

The option to use a Collator's order (instead of binary order) for sorting and range queries has been moved to lucene/queries. The Collated TermRangeQuery/Filter has been moved to SlowCollatedTermRangeQuery/Filter, and the collated sorting has been moved to SlowCollatedStringComparator.

Note: this functionality isn't very scalable and if you are using it, consider indexing collation keys with the collation support in the analysis module instead.

To perform collated range queries, use a suitable collating analyzer: CollationKeyAnalyzer or ICUCollationKeyAnalyzer, and set qp.setAnalyzeRangeTerms(true).

TermRangeQuery and TermRangeFilter now work purely on bytes. Both have helper factory methods (newStringRange) similar to the NumericRange API, to easily perform range queries on Strings.

LUCENE-2883: ValueSource changes

Lucene's o.a.l.search.function ValueSource based functionality, was consolidated into lucene/queries along with Solr's similar functionality. The following classes were moved:

The following lists the replacement classes for those removed:

DocValues are now named FunctionValues, to not confuse with Lucene's per-document values.

LUCENE-2392: Enable flexible scoring

The existing "Similarity" api is now TFIDFSimilarity, if you were extending Similarity before, you should likely extend this instead.

Weight.normalize no longer takes a norm value that incorporates the top-level boost from outer queries such as BooleanQuery, instead it takes 2 parameters, the outer boost (topLevelBoost) and the norm. Weight.sumOfSquaredWeights has been renamed to Weight.getValueForNormalization().

The scorePayload method now takes a BytesRef. It is never null.

LUCENE-3283: Query parsers moved to separate module

Lucene's core o.a.l.queryParser QueryParsers have been consolidated into lucene/queryparser, where other QueryParsers from the codebase will also be placed. The following classes were moved:

LUCENE-2308, LUCENE-3453: Separate IndexableFieldType from Field instances

With this change, the indexing details (indexed, tokenized, norms, indexOptions, stored, etc.) are moved into a separate FieldType instance (rather than being stored directly on the Field).

This means you can create the FieldType instance once, up front, for a given field, and then re-use that instance whenever you instantiate the Field.

Certain field types are pre-defined since they are common cases:

If your usage fits one of those common cases you can simply instantiate the above class. If you need to store the value, you can add a separate StoredField to the document, or you can use TYPE_STORED for the field:

Field f = new Field("field", "value", StringField.TYPE_STORED);

Alternatively, if an existing type is close to what you want but you need to make a few changes, you can copy that type and make changes:

FieldType bodyType = new FieldType(TextField.TYPE_STORED);
bodyType.setStoreTermVectors(true);

You can of course also create your own FieldType from scratch:

FieldType t = new FieldType();
t.setIndexed(true);
t.setStored(true);
t.setOmitNorms(true);
t.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
t.freeze();

FieldType has a freeze() method to prevent further changes.

There is also a deprecated transition API, providing the same Index, Store, TermVector enums from 3.x, and Field constructors taking these enums.

When migrating from the 3.x API, if you did this before:

new Field("field", "value", Field.Store.NO, Field.Indexed.NOT_ANALYZED_NO_NORMS)

you can now do this:

new StringField("field", "value")

(though note that StringField indexes DOCS_ONLY).

If instead the value was stored:

new Field("field", "value", Field.Store.YES, Field.Indexed.NOT_ANALYZED_NO_NORMS)

you can now do this:

new Field("field", "value", StringField.TYPE_STORED)

If you didn't omit norms:

new Field("field", "value", Field.Store.YES, Field.Indexed.NOT_ANALYZED)

you can now do this:

FieldType ft = new FieldType(StringField.TYPE_STORED);
ft.setOmitNorms(false);
new Field("field", "value", ft)

If you did this before (value can be String or Reader):

new Field("field", value, Field.Store.NO, Field.Indexed.ANALYZED)

you can now do this:

new TextField("field", value)

If instead the value was stored:

new Field("field", value, Field.Store.YES, Field.Indexed.ANALYZED)

you can now do this:

new Field("field", value, TextField.TYPE_STORED)

If in addition you omit norms:

new Field("field", value, Field.Store.YES, Field.Indexed.ANALYZED_NO_NORMS)

you can now do this:

FieldType ft = new FieldType(TextField.TYPE_STORED);
ft.setOmitNorms(true);
new Field("field", value, ft)

If you did this before (bytes is a byte[]):

new Field("field", bytes)

you can now do this:

new StoredField("field", bytes)

If you previously used Document.setBoost, you must now pre-multiply the document boost into each Field.setBoost. If you have a multi-valued field, you should do this only for the first Field instance (ie, subsequent Field instance sharing the same field name should only include their per-field boost and not the document level boost) as the boost for multi-valued field instances are multiplied together by Lucene.

Other changes