Apache Lucene Migration Guide

SPI lookups for codecs and analysis changed (LUCENE-7873)

Due to serious problems with context class loaders in several frameworks (OSGi, Java 9 Jigsaw), the lookup of Codecs, PostingsFormats, DocValuesFormats and all analysis factories was changed to inspect only the class loader that defined the interface class (the one that loaded lucene-core.jar). Normal applications should not encounter any issues with that change, because the application class loader (the unnamed module in Java 9) can load all SPIs from all JARs on the classpath.

For any code that relies on the old behaviour (e.g., certain web applications or components in application servers) one can manually instruct the Lucene SPI implementation to also inspect the context classloader. To do this, add this code to the early startup phase of your application before any Apache Lucene component is used:

ClassLoader cl = Thread.currentThread().getContextClassLoader();
// Codecs:
PostingsFormat.reloadPostingsFormats(cl);
DocValuesFormat.reloadDocValuesFormats(cl);
Codec.reloadCodecs(cl);
// Analysis:
CharFilterFactory.reloadCharFilters(cl);
TokenFilterFactory.reloadTokenFilters(cl);
TokenizerFactory.reloadTokenizers(cl);

This code will reload all service providers from the given class loader (in our case the context class loader). Instead of specifying the context class loader, it is recommended to use the application's main class loader or the module class loader.

If you are migrating your project to the Java 9 Jigsaw module system, keep in mind that Lucene does not yet support module-info.java declarations of service provider implementations (the provides statement). It is therefore recommended to keep all of Lucene in one uber-module and not to split Lucene into several modules. Once Lucene requires Java 9 as a minimum, we will work on improving that.

For OSGi, the same applies. You have to create a single bundle containing all of Lucene for SPI to work correctly.

CustomAnalyzer resources (LUCENE-7883)

Lucene no longer uses the context class loader when resolving resources in CustomAnalyzer or ClassPathResourceLoader. By default, resources are resolved only against Lucene's class loader. To resolve them against a custom class loader, use one of the other builder methods.

Query.hashCode and Query.equals are now abstract methods (LUCENE-7277)

Any custom query subclass should redefine the equivalence relationship according to the subclass's details. See the code patterns used in existing core Lucene query classes for details.
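The pattern used by core query classes can be sketched roughly as follows. To keep the sketch compilable without Lucene on the classpath, SketchQuery is a hypothetical stand-in for org.apache.lucene.search.Query that mimics its sameClassAs and classHash helpers; FieldPrefixQuery and its fields are likewise made up for illustration.

```java
import java.util.Objects;

// Hypothetical stand-in for org.apache.lucene.search.Query (not the real class).
abstract class SketchQuery {
    // Redeclared as abstract so every subclass is forced to implement them.
    @Override
    public abstract boolean equals(Object obj);

    @Override
    public abstract int hashCode();

    // True only if 'other' is non-null and has exactly the same class.
    protected final boolean sameClassAs(Object other) {
        return other != null && getClass() == other.getClass();
    }

    // Seed for hashCode so that different query classes hash differently.
    protected final int classHash() {
        return getClass().getName().hashCode();
    }
}

// Illustrative custom query with two fields that define its identity.
final class FieldPrefixQuery extends SketchQuery {
    private final String field;
    private final String prefix;

    FieldPrefixQuery(String field, String prefix) {
        this.field = Objects.requireNonNull(field);
        this.prefix = Objects.requireNonNull(prefix);
    }

    @Override
    public boolean equals(Object other) {
        return sameClassAs(other) && equalsTo((FieldPrefixQuery) other);
    }

    // Compares exactly the state that determines which documents match.
    private boolean equalsTo(FieldPrefixQuery other) {
        return field.equals(other.field) && prefix.equals(other.prefix);
    }

    @Override
    public int hashCode() {
        int h = classHash();
        h = 31 * h + field.hashCode();
        h = 31 * h + prefix.hashCode();
        return h;
    }
}
```

In a real subclass you would extend Query directly and compare whatever state affects matching or scoring.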

CompressionTools removed (LUCENE-7322)

Per-field compression has been superseded by codec-level compression, which has the benefit of being able to compress several fields, or even documents at once, yielding better compression ratios. In case you would still like to compress on top of the codec, you can do it on the application side by using the utility classes from the java.util.zip package.
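As a minimal sketch of application-side compression, the following uses only Deflater and Inflater from java.util.zip. The class name FieldCompression and its method names are illustrative and not part of Lucene; how you store the resulting bytes (for example in a binary stored field) is up to the application.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative helper, not a Lucene class.
final class FieldCompression {
    private FieldCompression() {}

    /** Compresses the input with DEFLATE at the given level (e.g. Deflater.BEST_COMPRESSION). */
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Reverses compress(); throws IllegalArgumentException on corrupt or truncated data. */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt compressed data", e);
        } finally {
            inflater.end(); // release native resources
        }
    }
}
```

Compress the value before adding it to the document and decompress it after loading the stored bytes back.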

Explanation.toHtml() removed (LUCENE-7360)

Clients wishing to render Explanations as HTML should implement their own utilities for this.
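One possible shape for such a utility is sketched below. To stay compilable without Lucene, ExplainNode is a hypothetical stand-in that carries the same kind of data an Explanation exposes (a value, a description and nested details); ExplanationHtml and its markup are assumptions, not a reproduction of the removed method.

```java
import java.util.List;

// Hypothetical stand-in for an Explanation: value, description, nested details.
final class ExplainNode {
    final float value;
    final String description;
    final List<ExplainNode> details;

    ExplainNode(float value, String description, List<ExplainNode> details) {
        this.value = value;
        this.description = description;
        this.details = details;
    }
}

// Illustrative renderer that turns the explanation tree into nested HTML lists.
final class ExplanationHtml {
    private ExplanationHtml() {}

    static String toHtml(ExplainNode root) {
        StringBuilder sb = new StringBuilder();
        append(root, sb);
        return sb.toString();
    }

    // Emits one <ul><li>…</li></ul> per node, recursing into its details.
    private static void append(ExplainNode node, StringBuilder sb) {
        sb.append("<ul><li>")
          .append(node.value)
          .append(" = ")
          .append(escape(node.description));
        for (ExplainNode detail : node.details) {
            append(detail, sb);
        }
        sb.append("</li></ul>");
    }

    // Minimal escaping so descriptions cannot inject markup.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

With a real Explanation you would read the same three pieces of information from getValue(), getDescription() and getDetails().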

Similarity.coord and BooleanQuery.disableCoord removed (LUCENE-7369)

Coordination factors were a workaround for the fact that ClassicSimilarity does not have strong enough term frequency saturation. This causes disjunctions to score documents that have many occurrences of a few query terms higher than documents that match most clauses, which is undesirable most of the time. The new BM25Similarity does not suffer from this problem since it has better saturation for the contribution of the term frequency, so the coord factors have been removed from scores. Things now work as if coords were always disabled when constructing boolean queries.

Weight.getValueForNormalization() and Weight.normalize() removed (LUCENE-7368)

Query normalization's goal was to make scores comparable across queries, which was only implemented by the ClassicSimilarity. Since ClassicSimilarity is not the default similarity anymore, this functionality has been removed. Boosts are now propagated through Query#createWeight.

AnalyzingQueryParser removed (LUCENE-7355)

The functionality of AnalyzingQueryParser has been folded into the classic QueryParser, which now passes terms through Analyzer#normalize when generating queries.

CommonQueryParserConfiguration.setLowerCaseExpandedTerms removed (LUCENE-7355)

This option has been removed as expanded terms are now normalized through Analyzer#normalize.

Cache key and close listener refactoring (LUCENE-7410)

The way to access cache keys and add close listeners has been refactored in order to be less trappy. You should now use IndexReader.getReaderCacheHelper() to manage caches that take deleted docs and doc values updates into account, and LeafReader.getCoreCacheHelper() to manage per-segment caches that do not take deleted docs and doc values updates into account.

Index-time boosts removal (LUCENE-6819)

Index-time boosts are not supported anymore. As a replacement, index-time scoring factors should be indexed in a doc value field and combined with the score at query time using FunctionScoreQuery for instance.

Grouping collector refactoring (LUCENE-7701)

Groups are now defined by GroupSelector classes, making it easier to define new types of groups. Rather than having term or function specific collection classes, FirstPassGroupingCollector, AllGroupsCollector and AllGroupHeadsCollector are now concrete classes taking a GroupSelector.

SecondPassGroupingCollector is no longer specifically aimed at collecting TopDocs for each group, but instead takes a GroupReducer that will perform any type of reduction on the top groups collected on a first-pass. To reproduce the old behaviour of SecondPassGroupingCollector, you should instead use TopGroupsCollector.

Removed legacy numerics (LUCENE-7850)

Support for legacy numerics has been removed; they had been deprecated since Lucene 6.0. Points should be used instead. See org.apache.lucene.index.PointValues for an introduction.

TopDocs.totalHits is now a long (LUCENE-7872)

TopDocs.totalHits is now a long so that TopDocs instances can be used to represent top hits that have more than 2B matches. This is necessary when multiple TopDocs instances are merged together with TopDocs#merge, as they might have more than 2B matches in total. However, TopDocs instances returned by IndexSearcher will still have a total number of hits below 2B, since Lucene indexes are still bound to at most 2B documents, so the value can safely be cast to an int in that case.
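If existing code expects an int hit count, a checked narrowing conversion is safer than a plain cast, because it fails loudly if the value ever comes from merged results that exceed the int range. The sketch below uses only Math.toIntExact from the JDK; the class and method names are illustrative.

```java
// Illustrative helper, not a Lucene class.
final class HitCounts {
    private HitCounts() {}

    /**
     * Narrows a long hit count (such as TopDocs.totalHits) to an int.
     * Throws ArithmeticException instead of silently overflowing.
     */
    static int toIntHits(long totalHits) {
        return Math.toIntExact(totalHits);
    }
}
```

For results that come straight from IndexSearcher this conversion never throws, per the bound described above.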

PrefixAwareTokenFilter and PrefixAndSuffixAwareTokenFilter removed (LUCENE-7877)

Use ConcatenatingTokenStream instead, which allows the use of custom attributes.

FieldValueQuery is renamed to DocValuesFieldExistsQuery (LUCENE-7899)

This query matches only documents that have a value for the specified doc values field.

RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated (LUCENE-8467, LUCENE-8438)

This RAM-based directory implementation is an old piece of code that uses inefficient thread synchronization primitives and can be mistaken for being "faster" than the NIO-based MMapDirectory. It is deprecated and scheduled for removal in future versions of Lucene.