Lucene Change Log
For more information on past and future Lucene versions, please see:
http://s.apache.org/luceneversions
- API Changes (9)
- GITHUB#13806: Add TermInSetQuery#getBytesRefIterator to be able to iterate over query terms.
(Christoph Büscher)
- GITHUB#13469: Expose FlatVectorsFormat as a first-class format; can be configured using a custom Codec.
(Michael Sokolov)
- GITHUB#13612: Hunspell: add Suggester#proceedPastRep to avoid losing relevant suggestions.
(Peter Gromov)
- GITHUB#13603: Introduced `IndexSearcher#searchLeaf(LeafReaderContext, Weight, Collector)` protected method to
facilitate customizing per-leaf behavior of search without having to override
`search(LeafReaderContext[], Weight, Collector)`, which requires overriding the entire loop across the leaves.
(Luca Cavanna)
- GITHUB#13559: Add BitSet#nextSetBit(int, int) to get the index of the first set bit in range.
(Egor Potemkin)
- GITHUB#13568: Add DoubleValuesSource#toSortableLongDoubleValuesSource and
MultiDoubleValuesSource#toSortableMultiLongValuesSource methods.
(Shradha Shankar)
- GITHUB#13568, GITHUB#13750: Add DrillSideways#search method that supports any CollectorManagers for drill-sideways dimensions
or drill-down.
(Egor Potemkin)
- GITHUB#13737: Deprecate the FacetsCollector#search utility methods and add new corresponding methods to
FacetsCollectorManager that accept a FacetsCollectorManager as last argument in place of a Collector.
(Luca Cavanna)
- GITHUB#13794: Deprecate BulkScorer#score(LeafCollector collector, Bits acceptDocs) in favour of
BulkScorer#score(LeafCollector collector, Bits acceptDocs, int min, int max). The method will be removed in the next
major version. Replace usages with the latter, providing 0 as min and DocIdSetIterator.NO_MORE_DOCS as max in case
the entire segment should be scored. Subclasses that override the method should instead override its replacement.
(Luca Cavanna)
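As a rough illustration of the range-limited lookup described in GITHUB#13559, the sketch below emulates it on top of `java.util.BitSet` rather than Lucene's own `BitSet`. It assumes the upper bound is exclusive and that a miss is reported with a sentinel. `RangeNextSetBitSketch` and `NO_MORE` are hypothetical names; the sentinel simply mirrors the value of `DocIdSetIterator.NO_MORE_DOCS` (`Integer.MAX_VALUE`).

```java
import java.util.BitSet;

final class RangeNextSetBitSketch {
  // Hypothetical stand-in for DocIdSetIterator.NO_MORE_DOCS.
  static final int NO_MORE = Integer.MAX_VALUE;

  /**
   * Returns the index of the first set bit in [start, upperBound),
   * or NO_MORE if no bit is set in that range (assumed semantics).
   */
  static int nextSetBit(BitSet bits, int start, int upperBound) {
    int next = bits.nextSetBit(start); // -1 when no set bit remains
    return (next >= 0 && next < upperBound) ? next : NO_MORE;
  }
}
```

Callers that previously scanned past the range and then compared the result against the bound can express the same check in a single call.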
- New Features (5)
- GITHUB#13430: Allow configuring the search concurrency via
TieredMergePolicy#setTargetSearchConcurrency. This in turn instructs the
merge policy to try to have at least this number of segments on the highest
tier.
(Adrien Grand, Carlos Delgado)
- GITHUB#13517: Allow configuring the search concurrency on LogDocMergePolicy
and LogByteSizeMergePolicy via a new #setTargetConcurrency setter.
(Adrien Grand)
- GITHUB#13568: Add sandbox facets module to compute facets while collecting.
(Egor Potemkin, Shradha Shankar)
- GITHUB#13678: Add support for JDK 23 to the Panama Vectorization Provider.
(Chris Hegarty)
- GITHUB#13689: Add a new faceting feature, dynamic range facets, which automatically picks a balanced set of numeric
ranges based on the distribution of values that occur across all hits. For use cases that have a highly variable
numeric doc values field, such as "price" in an e-commerce application, this facet method is powerful as it allows the
presented ranges to adapt depending on what hits the query actually matches. This is in contrast to existing range
faceting that requires the application to provide the specific fixed ranges up front.
(Yuting Gan, Greg Miller, Stefan Vodita)
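To make the idea behind dynamic range facets (GITHUB#13689) more concrete, here is a minimal sketch that derives equal-count ranges from the values seen across hits. It is not Lucene's implementation: `BalancedRangesSketch` and `balancedRanges` are hypothetical names, and "balanced" is assumed here to mean roughly the same number of values per range.

```java
import java.util.Arrays;

final class BalancedRangesSketch {
  /**
   * Splits the given values into at most k ranges holding a similar number
   * of values each. Every range is returned as an inclusive {min, max} pair.
   */
  static long[][] balancedRanges(long[] values, int k) {
    long[] sorted = values.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    int buckets = Math.min(k, n); // never more ranges than values
    long[][] ranges = new long[buckets][];
    for (int b = 0; b < buckets; b++) {
      int from = (int) ((long) b * n / buckets);        // first index of bucket b
      int to = (int) ((long) (b + 1) * n / buckets) - 1; // last index of bucket b
      ranges[b] = new long[] {sorted[from], sorted[to]};
    }
    return ranges;
  }
}
```

With prices `{5, 1, 100, 3, 50, 2}` and `k = 3`, the sketch yields the ranges `[1, 2]`, `[3, 5]` and `[50, 100]`; a different set of hits would produce different boundaries. Heavily duplicated values can make neighbouring ranges share an endpoint, which a real implementation would need to handle.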
- Improvements (10)
- GITHUB#13475: Re-enable intra-merge parallelism except for terms, norms, and doc values.
Related to GITHUB#13478.
(Ben Trent)
- GITHUB#13548: Refactor and javadoc update for KNN vector writer classes.
(Patrick Zhai)
- GITHUB#13562: Add Intervals.regexp and Intervals.range methods to produce IntervalsSource
for regexp and range queries.
(Mayya Sharipova)
- GITHUB#13625: Remove BitSet#nextSetBit code duplication.
(Greg Miller)
- GITHUB#13285: Early terminate graph searches of AbstractVectorSimilarityQuery to follow timeout set from
IndexSearcher#setTimeout(QueryTimeout).
(Kaival Parikh)
- GITHUB#13633: Add ability to read/write knn vector values to a MemoryIndex.
(Ben Trent)
- GITHUB#12627: Patch HNSW graphs to improve reachability of all nodes from entry points.
- GITHUB#13201: Better cost estimation on MultiTermQuery over few terms.
(Michael Froh)
- GITHUB#13735: Migrate monitor package usage of deprecated IndexSearcher#search(Query, Collector)
to IndexSearcher#search(Query, CollectorManager).
(Greg Miller)
- GITHUB#13746: Introduce ProfilerCollectorManager to parallelize search when using ProfilerCollector.
(Luca Cavanna)
- Optimizations (18)
- GITHUB#13439: Avoid unnecessary memory allocation in PackedLongValues#Iterator.
(Zhang Chao)
- GITHUB#13425: Rewrite SortedNumericDocValuesRangeQuery to MatchNoDocsQuery when the upper bound is smaller than the
lower bound.
(Ioana Tagirta)
- GITHUB#13322: Implement Weight#count for vector values in the FieldExistsQuery.
(Pan Guixin)
- GITHUB#13454: MultiTermQuery returns null ScoreSupplier in cases where
no query terms are present in the index segment
(Mayya Sharipova)
- GITHUB#13431: Replace TreeMap and use compiled Patterns in Japanese UserDictionary.
(Bruno Roustant)
- GITHUB#12941: Don't preserve auxiliary buffer contents in LSBRadixSorter if it grows.
(Stefan Vodita)
- GITHUB#13175: Stop double-checking priority queue inserts in some FacetCount classes.
(Jakub Slowinski)
- GITHUB#13538: Slightly reduce heap usage for HNSW and scalar quantized vector writers.
(Ben Trent)
- GITHUB#12100: WordBreakSpellChecker.suggestWordBreaks now does a breadth first search, allowing it to return
better matches with fewer evaluations
(hossman)
- GITHUB#13582: Stop requiring MaxScoreBulkScorer's outer window to have at
least INNER_WINDOW_SIZE docs.
(Adrien Grand)
- GITHUB#13570, GITHUB#13574, GITHUB#13535: Avoid performance degradation with closing shared Arenas.
Closing many individual index files can potentially lead to a degradation in execution performance.
Index files are mmapped one-to-one with the JDK's foreign shared Arena. The JVM deoptimizes the top
few frames of all threads when closing a shared Arena (see JDK-8335480). We mitigate this situation
when running with JDK 21 and greater, by 1) using a confined Arena where appropriate, and 2) grouping
files from the same segment to a single shared Arena.
A system property has been added that allows controlling the total maximum number of mmapped files
that may be associated with a single shared Arena. For example, to set the max number of permits to
256, pass the following on the command line:
-Dorg.apache.lucene.store.MMapDirectory.sharedArenaMaxPermits=256. Setting a value of 1 associates
a single file to a single shared arena.
(Chris Hegarty, Michael Gibney, Uwe Schindler)
- GITHUB#13585: Lucene912PostingsFormat, the new default postings format, now
only has 2 levels of skip data, which are inlined into postings instead of
being stored at the end of postings lists. This translates into better
performance for queries that need skipping such as conjunctions.
(Adrien Grand)
- GITHUB#13581: OnHeapHnswGraph no longer allocates a lock for every graph node
(Mike Sokolov)
- GITHUB#13636, GITHUB#13658: Optimizations to the decoding logic of blocks of
postings.
(Adrien Grand, Uwe Schindler, Greg Miller)
- GITHUB#13644: Improve NumericComparator competitive iterator logic by comparing the missing value with the top
value even after the hit queue is full
(Pan Guixin)
- GITHUB#13587: Use Max WAND optimizations with ToParentBlockJoinQuery when using ScoreMode.Max
(Mike Pellegrini)
- GITHUB#13742: Reorder checks in LRUQueryCache#count
(Shubham Chaudhary)
- GITHUB#13697: Add a bulk scorer to ToParentBlockJoinQuery, which delegates to the bulk scorer of the child query.
This should speed up query evaluation when the child query has a specialized bulk scorer, such as disjunctive queries.
(Mike Pellegrini)
- Changes in runtime behavior (1)
- GITHUB#13472: When an executor is provided to the IndexSearcher constructor, the searcher now executes tasks on the
thread that invoked a search as well as its configured executor. Users should reduce the executor's thread-count by 1
to retain the previous level of parallelism. Moreover, it is now possible to start searches from the same executor
that is configured in the IndexSearcher without risk of deadlocking. A separate executor for starting searches is no
longer required.
(Armin Braun)
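The sizing advice in GITHUB#13472 can be sketched with plain `java.util.concurrent` types. This is only an illustration of the "calling thread also does work" pattern, not Lucene's TaskExecutor; `CallerParticipatesSketch`, `executorThreadsFor` and `runAll` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Supplier;

final class CallerParticipatesSketch {
  /** Pool size that keeps total parallelism at the desired level when the caller also runs tasks. */
  static int executorThreadsFor(int desiredParallelism) {
    return Math.max(1, desiredParallelism - 1); // a pool needs at least one thread
  }

  /** Hands all tasks but the last to the executor and runs the last one on the calling thread. */
  static <T> List<T> runAll(Executor executor, List<Supplier<T>> tasks) {
    List<CompletableFuture<T>> futures = new ArrayList<>();
    for (int i = 0; i + 1 < tasks.size(); i++) {
      futures.add(CompletableFuture.supplyAsync(tasks.get(i), executor));
    }
    List<T> results = new ArrayList<>();
    // The calling thread contributes instead of only waiting.
    T last = tasks.isEmpty() ? null : tasks.get(tasks.size() - 1).get();
    for (CompletableFuture<T> future : futures) {
      results.add(future.join()); // results stay in task order
    }
    if (!tasks.isEmpty()) {
      results.add(last);
    }
    return results;
  }
}
```

For example, an application that previously used a pool of 4 threads to search 4 slices concurrently would, under this assumption, create the pool with `executorThreadsFor(4)`, that is 3 threads, and let the searching thread take the remaining share.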
- Bug Fixes (12)
- GITHUB#13498: Avoid performance regression by constructing lazily the PointTree in NumericComparator.
(Ignacio Vera)
- GITHUB#13384: Fix highlighter to use longer passages instead of shorter individual terms.
(Zack Kendall)
- GITHUB#13463: Address bug in MultiLeafKnnCollector causing #minCompetitiveSimilarity to stay artificially low in
some corner cases.
(Greg Miller)
- GITHUB#13553: Correct RamUsageEstimate for scalar quantized knn vector formats so that raw vectors are correctly
accounted for.
(Ben Trent)
- GITHUB#13615: Correct scalar quantization when used in conjunction with COSINE similarity. Vectors are normalized
before quantization to ensure the cosine similarity is correctly calculated.
(Ben Trent)
- GITHUB#13627: Fix race condition on flush for DWPT seqNo generation.
(Ben Trent, Ao Li)
- GITHUB#13646: Fix rare test bug in TestLongValueFacetCounts that was introduced in 9.6.
(Greg Miller)
- GITHUB#13691: Fix incorrect exponent value in explain of SigmoidFunction.
(Owais Kazi)
- GITHUB#13703: Fix bug in LatLonPoint queries where narrow polygons close to latitude 90 don't
match any points due to an Integer overflow.
(Ignacio Vera)
- GITHUB#13641: Unify how KnnFormats handle missing fields and correctly handle missing vector fields when
merging segments.
(Ben Trent)
- GITHUB#13519: 8 bit scalar vector quantization is no longer
supported: it was buggy starting in 9.11 (GITHUB#13197). 4 and 7
bit quantization are still supported. Existing (9.x) Lucene indices
that previously used 8 bit quantization can still be read/searched
but the results from `KNN*VectorQuery` are silently buggy. Further
8 bit quantized vector indexing into such (9.11) indices is not
permitted, so your path forward if you wish to continue using the
same 9.11 index is to index additional vectors into the same field
with either 4 or 7 bit quantization (or no quantization), and ensure
all older (9.11 written) segments are rewritten either via
`IndexWriter.forceMerge` or
`IndexWriter.addIndexes(CodecReader...)`, or reindexing entirely.
- GITHUB#13799: Disable intra-merge parallelism for all structures but kNN vectors.
(Ben Trent)
- Build (1)
- GITHUB#13695, GITHUB#13696: Fix Gradle build sometimes giving spurious "unreferenced license file" warnings.
(Uwe Schindler)
- Other (1)
- GITHUB#13720: Add float comparison based on unit of least precision and use it to stop test failures caused by float
summation not being associative in IEEE 754.
(Alex Herbert, Stefan Vodita)
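A comparison based on units of least precision (ULPs), as mentioned in GITHUB#13720, can be sketched as follows. This is a simplified illustration and not the helper that was added to Lucene; `UlpFloatCompareSketch` and `equalWithinUlps` are hypothetical names, and the tolerance is assumed to be a multiple of the larger ULP of the two operands.

```java
final class UlpFloatCompareSketch {
  /** Returns true if a and b differ by at most maxUlps units of least precision. */
  static boolean equalWithinUlps(float a, float b, int maxUlps) {
    if (Float.isNaN(a) || Float.isNaN(b)) {
      return false; // NaN never compares equal
    }
    if (Float.isInfinite(a) || Float.isInfinite(b)) {
      return a == b; // infinities only match themselves
    }
    if (a == b) {
      return true; // also covers +0.0f and -0.0f
    }
    // Scale the tolerance to the magnitude of the operands.
    float tolerance = maxUlps * Math.max(Math.ulp(a), Math.ulp(b));
    return Math.abs(a - b) <= tolerance;
  }
}
```

A test that sums floats in a different order than the code under test can then assert `equalWithinUlps(expected, actual, n)` for a small `n` instead of exact equality.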
- Bug Fixes (5)
- GITHUB#13498: Avoid performance regression by constructing lazily the PointTree in NumericComparator.
(Ignacio Vera)
- GITHUB#13501, GITHUB#13478: Remove intra-merge parallelism for everything except HNSW graph merges.
(Ben Trent)
- GITHUB#13498, GITHUB#13340: Allow adding a parent field to an index with no fields
(Michael Sokolov)
- GITHUB#12431: Fix IndexOutOfBoundsException thrown in DefaultPassageFormatter
by unordered matches.
(Stephane Campinas)
- GITHUB#13493: StringValueFacetCounts stops throwing NPE when faceting over an empty match-set.
(Grebennikov Roman, Stefan Vodita)
- API Changes (2)
- GITHUB#13145: Deprecate ByteBufferIndexInput as it will be removed in Lucene 10.0.
(Uwe Schindler)
- GITHUB#13422: an explicit dependency on the HPPC library is removed in favor of an internal repackaged copy in
oal.internal.hppc. If you relied on HPPC as a transitive dependency, you'll have to add it to your project explicitly.
The HPPC classes now bundled in Lucene core are internal and will have restricted access in future releases, please do
not use them.
(Bruno Roustant, Dawid Weiss, Uwe Schindler, Chris Hegarty)
- New Features (9)
- GITHUB#13125: Recursive graph bisection is now supported on indexes that have blocks, as long as
they configure a parent field via `IndexWriterConfig#setParentField`.
(Adrien Grand)
- GITHUB#12915: Add new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter
and JapaneseKatakanaUppercaseFilter.
(Dai Sugimori)
- GITHUB#13196, GITHUB#13222: Add support for posix_madvise to MMapDirectory: If running on
Linux/macOS and Java 21 or later, MMapDirectory uses IOContext to pass suitable MADV flags to
kernel of operating system. In particular, merging now passes POSIX_MADV_SEQUENTIAL to the readers
that are being merged, and searching passes POSIX_MADV_RANDOM to vector data files - including
quantized vector data files, HNSW graphs, stored fields data files and term vectors data files.
This may improve paging logic especially when working with large indexes under memory pressure.
(Uwe Schindler, Chris Hegarty, Robert Muir, Adrien Grand)
- GITHUB#13197: Expand support for new scalar bit levels for HNSW vectors. This includes 4-bit vectors and an option
to compress them to gain a 50% reduction in memory usage.
(Ben Trent)
- GITHUB#13268: Add ability for UnifiedHighlighter to highlight a field based on combined matches from multiple fields.
(Mayya Sharipova, Jim Ferenczi)
- GITHUB#13288: Make HNSW and Flat storage vector formats easier to extend with new FlatVectorScorer interface. Add
new Hnsw format for binary quantized vectors.
(Ben Trent)
- GITHUB#13181: Add new VectorScorer interface to vector value iterators. This allows for vector codecs to supply
simpler and more optimized vector scoring when iterating vector values directly.
(Ben Trent)
- GITHUB#13414: Counts are always available in the result when using taxonomy facets.
(Stefan Vodita)
- GITHUB#13445: Add new option when calculating scalar quantiles. The new option of setting `confidenceInterval` to
`0` will now dynamically determine the quantiles through a grid search over multiple quantiles calculated
by multiple intervals.
(Ben Trent)
- Improvements (14)
- GITHUB#13092: `static final Map` constants have been made immutable
(Dmitry Cherniachenko)
- GITHUB#13041: TokenizedPhraseQueryNode code cleanup
(Dmitry Cherniachenko)
- GITHUB#13087: Changed `static final Set` constants to be immutable. Among others it affected
ScandinavianNormalizer.ALL_FOLDINGS set with public access.
(Dmitry Cherniachenko)
- GITHUB#13155: Hunspell: allow ignoring exceptions on duplicate ICONV/OCONV mappings
(Peter Gromov)
- GITHUB#13156: Hunspell: don't proceed with other suggestions if we found good REP ones
(Peter Gromov)
- GITHUB#13066: Support getMaxScore of DisjunctionSumScorer for non top level scoring clause
(Shintaro Murakami)
- GITHUB#13124: MergeScheduler can now provide an executor for intra-merge parallelism. The first
implementation is the ConcurrentMergeScheduler and the Lucene99HnswVectorsFormat will use it if no other
executor is provided.
(Ben Trent)
- GITHUB#13239: Upgrade icu4j to version 74.2.
(Robert Muir)
- GITHUB#13202: Early terminate graph and exact searches of AbstractKnnVectorQuery to follow timeout set from
IndexSearcher#setTimeout(QueryTimeout).
(Kaival Parikh)
- GITHUB#12966: Move most of the responsibility from TaxonomyFacets implementations to TaxonomyFacets itself.
This reduces code duplication and enables future development.
(Stefan Vodita)
- GITHUB#13362: Add sub query explanations to DisjunctionMaxQuery, if the overall query didn't match.
(Tim Grein)
- GITHUB#13385: Add Intervals.noIntervals() method to produce an empty IntervalsSource.
(Aniketh Jain, Uwe Schindler, Alan Woodward)
- GITHUB#13276: UnifiedHighlighter: new 'passageSortComparator' option to allow sorting other than offset order.
(Seunghan Jung)
- GITHUB#13429: Hunspell: speed up "compress"; minimize the number of the generated entries; don't even consider "forbidden" entries anymore
(Peter Gromov)
- Optimizations (24)
- GITHUB#13306: Use RWLock to access LRUQueryCache to reduce contention.
(Boice Huang)
- GITHUB#13252: Replace handwritten comparison loops with Arrays.compareUnsigned in SegmentTermsEnum.
(zhouhui)
- GITHUB#12996: Reduce ArrayUtil#grow in decompress.
(Zhang Chao)
- GITHUB#13115: Short circuit queued flush check when flush on update is disabled
(Prabhat Sharma)
- GITHUB#13085: Remove unnecessary toString() / substring() calls to save some String allocations
(Dmitry Cherniachenko)
- GITHUB#13121: Speedup multi-segment HNSW graph search for diversifying child kNN queries. Builds on GITHUB#12962.
(Ben Trent)
- GITHUB#13184: Make the HitQueue size more appropriate for KNN exact search
(Pan Guixin)
- GITHUB#13199: Speed up dynamic pruning by breaking point estimation when the threshold is exceeded.
(Guo Feng)
- GITHUB#13203: Speed up writeGroupVInts
(Zhang Chao)
- GITHUB#13224: Use singleton for all-zeros DirectMonotonicReader.Meta
(Armin Braun)
- GITHUB#13232: Introduce singleton for PackedInts.NullReader of size 256
(Armin Braun)
- GITHUB#11888: Binary search the BlockTree terms dictionary entries when all suffixes have the same length
in a leaf block, speeding up cases like primary key lookup on an id field when all ids are the same length.
(zhouhui)
- GITHUB#13149: Made PointRangeQuery faster, for some segment sizes, by reducing the amount of virtual calls to
IntersectVisitor::visit(int).
(Anton Hägerstrand)
- GITHUB#12966: FloatTaxonomyFacets can now collect values into a sparse structure, like IntTaxonomyFacets already
could.
(Stefan Vodita)
- GITHUB#13284: Per-field doc values and knn vectors readers now use a HashMap internally instead of
a TreeMap.
(Adrien Grand)
- GITHUB#13321: Improve compressed int4 quantized vector search by utilizing SIMD inline with the decompression
process.
(Ben Trent)
- GITHUB#12408: Lazy initialization improvements for Facets implementations when there are segments with no hits
to count.
(Greg Miller)
- GITHUB#13327: Reduce memory usage of field maps in FieldInfos and BlockTree TermsReader.
(Bruno Roustant, David Smiley)
- GITHUB#13339: Add a MemorySegment Vector scorer - for scoring without copying on-heap
(Chris Hegarty)
- GITHUB#13368: Replace Map<Integer, Object> by primitive IntObjectHashMap.
(Bruno Roustant)
- GITHUB#13392: Replace Map<Long, Object> by primitive LongObjectHashMap.
(Bruno Roustant)
- GITHUB#13400: Replace Set<Integer> by IntHashSet and Set<Long> by LongHashSet.
(Bruno Roustant)
- GITHUB#13406: Replace List<Integer> by IntArrayList and List<Long> by LongArrayList.
(Bruno Roustant)
- GITHUB#13420: Replace Map<Character> by CharObjectHashMap and Set<Character> by CharHashSet.
(Bruno Roustant)
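Two of the optimizations above, binary searching same-length suffixes (GITHUB#11888) and comparing with Arrays.compareUnsigned (GITHUB#13252), combine naturally. The sketch below shows the general technique on a plain byte array; it does not reflect the BlockTree on-disk layout, and `FixedLengthSuffixSearchSketch` and `find` are hypothetical names.

```java
import java.util.Arrays;

final class FixedLengthSuffixSearchSketch {
  /**
   * Binary searches entries of identical length that are stored back to back
   * in unsigned byte order. Returns the entry index if found, otherwise
   * -(insertionPoint) - 1, following the java.util.Arrays convention.
   */
  static int find(byte[] suffixes, int entryLength, byte[] target) {
    int lo = 0;
    int hi = suffixes.length / entryLength - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      // Fixed length means entry i starts at i * entryLength, so no scan is needed.
      int cmp = Arrays.compareUnsigned(
          suffixes, mid * entryLength, (mid + 1) * entryLength,
          target, 0, target.length);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -(lo + 1);
  }
}
```

Because every entry has the same length, the offset of the i-th entry is known without decoding its predecessors, which is what makes a binary search possible instead of a linear scan.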
- Bug Fixes (16)
- GITHUB#13105: Fix ByteKnnVectorFieldSource & FloatKnnVectorFieldSource to work correctly when a segment does not contain
any docs with vectors
(hossman)
- GITHUB#13017: Fix DV update files referenced by a merge being deleted by a concurrent flush.
(Jialiang Guo)
- GITHUB#13145: Detect MemorySegmentIndexInput correctly in NRTSuggester.
(Uwe Schindler)
- GITHUB#13154: Hunspell GeneratingSuggester: ensure there are never more than 100 roots to process
(Peter Gromov)
- GITHUB#13162: Fix NPE when LeafReader returns null VectorValues
(Pan Guixin)
- GITHUB#13169: Fix potential race condition in DocumentsWriter & DocumentsWriterDeleteQueue
(Ben Trent)
- GITHUB#13204: Fix equals/hashCode of IOContext.
(Uwe Schindler, Robert Muir)
- GITHUB#13206: Subtract deleted file size from the cache size of NRTCachingDirectory.
(Jean-François Boeuf)
- GITHUB#12966: Aggregation facets no longer assume that aggregation values are positive.
(Stefan Vodita)
- GITHUB#13356: Ensure negative scores are not returned from scalar quantization scorer.
(Ben Trent)
- GITHUB#13366: Disallow NaN and Inf values in scalar quantization and better handle extreme cases.
(Ben Trent)
- GITHUB#13369: Fix NRT opening failure when soft deletes are enabled and the document fails to index before a point
field is written
(Ben Trent)
- GITHUB#13378: Fix points writing with no values
(Chris Hegarty)
- GITHUB#13374: Fix bug in SQ when just a single vector present in a segment
(Chris Hegarty)
- GITHUB#13376: Fix integer overflow exception in postings encoding as group-varint.
(Zhang Chao, Guo Feng)
- GITHUB#13421: Fixes TestOrdinalMap.testRamBytesUsed for multiple default PackedInts.NullReader instances.
(Amir Raza)
- Build (1)
- Upgrade forbiddenapis to version 3.7 and ASM for APIJAR extraction to 9.7.
(Uwe Schindler)
- Other (3)
- GITHUB#13068: Replace numerous `brToString(BytesRef)` copies with a `ToStringUtils` method
(Dmitry Cherniachenko)
- GITHUB#13077: Add public getter for SynonymQuery#field
(Andrey Bozhko)
- GITHUB#13393: Add support for reloading the SPI for KnnVectorsFormat class
(Navneet Verma)
- API Changes (4)
- GITHUB#12243: Mark TermInSetQuery ctors with varargs terms as @Deprecated. SortedSetDocValuesField#newSlowSetQuery,
SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery now take a collection of terms as a param.
(Jakub Slowinski)
- GITHUB#11041: Deprecate IndexSearcher#search(Query, Collector) in favor of
IndexSearcher#search(Query, CollectorManager) for TopFieldCollectorManager
and TopScoreDocCollectorManager.
(Zach Chen, Adrien Grand, Michael McCandless, Greg Miller, Luca Cavanna)
- GITHUB#12854: Mark DrillSideways#createDrillDownFacetsCollector as @Deprecated.
(Greg Miller)
- GITHUB#12624, GITHUB#12831: Allow FSTCompiler to stream to any DataOutput while building, and
make compile() only return the FSTMetadata. For on-heap (default) use case, please use
FST.fromFSTReader(fstMetadata, fstCompiler.getFSTReader()) to create the FST.
(Anh Dung Bui)
- New Features (4)
- GITHUB#12679: Add support for similarity-based vector searches using [Byte|Float]VectorSimilarityQuery. Uses a new
VectorSimilarityCollector to find all vectors scoring above a `resultSimilarity` while traversing the HNSW graph till
better-scoring nodes are available, or the best candidate is below a score of `traversalSimilarity` in the lowest
level.
(Aditya Prakash, Kaival Parikh)
- GITHUB#12829: For indices newly created as of 9.10.0 onwards, IndexWriter preserves document blocks indexed via
IndexWriter#addDocuments or IndexWriter#updateDocuments also when index sorting is configured. Document blocks are
maintained alongside their parent documents during sort and merge. IndexWriterConfig accepts a parent field that is used
to maintain block orders if index sorting is used. Note, this is fully optional in Lucene 9.x but will be mandatory for
indices that use document blocks together with index sorting as of 10.0.0.
(Simon Willnauer)
- GITHUB#12336: Index additional data per facet label in the taxonomy.
(Shai Erera, Egor Potemkin, Mike McCandless, Stefan Vodita)
- GITHUB#12706: Add support for the final release of Java foreign memory API in Java 22 (and later).
Lucene's MMapDirectory will now mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) starting
from Java 19. Indexes closed while queries are running can no longer crash the JVM.
Support for vectorized implementations of VectorUtil based on jdk.incubator.vector APIs was added
for exactly Java 22. Therefore, applications started with command line parameter
"java --add-modules jdk.incubator.vector" will automatically use the new vectorized implementations
if running on a supported platform (Java 20/21/22 on x86 CPUs with AVX2 or later or ARM NEON CPUs).
This is an opt-in feature and requires an explicit Java command line flag! When enabled, Lucene logs
a notice using java.util.logging. Please test thoroughly and report bugs/slowness to Lucene's mailing
list.
(Uwe Schindler, Chris Hegarty)
- Improvements (7)
- GITHUB#12870: Tighten synchronized loop in DirectoryTaxonomyReader#getOrdinal.
(Stefan Vodita)
- GITHUB#12812: Avoid overflows and false negatives in int slice buffer filled-with-zeros assertion.
(Stefan Vodita)
- GITHUB#12910: Refactor around NeighborArray to make it more self-contained.
(Patrick Zhai)
- GITHUB#12999: Use Automaton for SurroundQuery prefix/pattern matching
(Michael Gibney)
- GITHUB#13043: Support getMaxScore of ConjunctionScorer for non top level scoring clause.
(Shintaro Murakami)
- GITHUB#13055: Make DEFAULT_STOP_TAGS in KoreanPartOfSpeechStopFilter immutable
(Dmitry Cherniachenko)
- GITHUB#888: Use native byte order varhandles to spare CPU's byte swapping.
Tests are running with random byte order to ensure that the order does not affect correctness
of code. Native order was enabled for LZ4 compression.
(Uwe Schindler)
- Optimizations (11)
- LUCENE-10366: Override readVInt() and readVLong() in ByteBufferDataInput to help Hotspot inline method.
(Guo Feng)
- GITHUB#12839: Introduce method to grow arrays up to a given upper limit and use it to reduce overallocation for
DirectoryTaxonomyReader#getBulkOrdinals.
(Stefan Vodita)
- GITHUB#12841: Move group-varint encoding/decoding logic to DataOutput/DataInput.
(Adrien Grand, Zhang Chao, Uwe Schindler)
- GITHUB#12997: Avoid resetting BlockDocsEnum#freqBuffer when indexHasFreq is false.
(Zhang Chao, Adrien Grand)
- GITHUB#12989: Split taxonomy facet arrays across reusable chunks of elements to reduce allocations.
(Michael Froh, Stefan Vodita)
- GITHUB#13033: PointRangeQuery now exits earlier on segments whose values
don't intersect with the query range. When a PointRangeQuery is a required
clause of a boolean query, this helps save work on other required clauses of
the same boolean query.
(Adrien Grand)
- GITHUB#13026: ReqOptSumScorer will now propagate minimum competitive scores
to the optional clause if the required clause doesn't score. In practice,
this will help boolean queries that consist of a mix of FILTER clauses and
SHOULD clauses.
(Adrien Grand)
- GITHUB#13052: Avoid set.removeAll(list) O(n^2) performance trap in the UpgradeIndexMergePolicy
(Dmitry Cherniachenko)
- GITHUB#13036: Optimize counts on two clause term disjunctions.
(Adrien Grand, Johannes Fredén)
- GITHUB#12962: Speedup concurrent multi-segment HNSW graph search
(Mayya Sharipova, Tom Veasey)
- GITHUB#13090: Prevent humongous allocations in ScalarQuantizer when building quantiles.
(Ben Trent)
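The `set.removeAll(list)` trap mentioned in GITHUB#13052 comes from `java.util.AbstractSet#removeAll`: when the set is not larger than the argument, it iterates over the set and calls `contains` on the argument, which is a linear scan for a list. A minimal sketch of the usual workaround follows; `RemoveAllSketch` and `removeEach` are hypothetical names, not the code used in UpgradeIndexMergePolicy.

```java
import java.util.Collection;
import java.util.Set;

final class RemoveAllSketch {
  /**
   * Removes every element of toRemove from set with one set lookup per
   * element, regardless of the relative sizes of the two collections.
   */
  static <T> void removeEach(Set<T> set, Collection<? extends T> toRemove) {
    for (T item : toRemove) {
      set.remove(item); // constant time on average for a HashSet
    }
  }
}
```

Wrapping the list in a `HashSet` before calling `removeAll` is another option, at the cost of an extra allocation.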
- Bug Fixes (7)
- GITHUB#12866: Prevent extra similarity computation for single-level HNSW graphs.
(Kaival Parikh)
- GITHUB#12558: Ensure #finish is called on all drill-sideways FacetsCollectors even when no hits are scored.
(Greg Miller)
- GITHUB#12920: Address bug in TestDrillSideways#testCollectionTerminated that could occasionally cause the test to
fail with certain random seeds.
(Greg Miller)
- GITHUB#12885: Fix bug where JapaneseReadingFormFilter could not convert some hiragana to romaji.
(Takuma Kuramitsu)
- GITHUB#12287: Fix a bug in ShapeTestUtil.
(Heemin Kim)
- GITHUB#13031: ScorerSupplier created by QueryProfilerWeight will propagate topLevelScoringClause to the sub ScorerSupplier.
(Shintaro Murakami)
- GITHUB#13059: Fixed missing IndicNormalization and DecimalDigit filters in TeluguAnalyzer normalization
(Dmitry Cherniachenko)
- Build (1)
- GITHUB#12931, GITHUB#12936, GITHUB#12937: Improve source file validation to detect incorrect
UTF-8 sequences and forbid U+200B; enable errorprone DisableUnicodeInCode check.
(Robert Muir, Uwe Schindler)
- Other (5)
- GITHUB#11023: Removing some dead code in CheckIndex.
(Jakub Slowinski)
- GITHUB#11023: Removing @lucene.experimental tags in testXXX methods in CheckIndex.
(Jakub Slowinski)
- GITHUB#12934: Cleaning up old references to Lucene/Solr.
(Jakub Slowinski)
- GITHUB#12967, GITHUB#13038, GITHUB#13040, GITHUB#13042, GITHUB#13047, GITHUB#13048, GITHUB#13049, GITHUB#13050, GITHUB#13051, GITHUB#13039:
Code cleanups and optimizations.
(Dmitry Cherniachenko)
- GITHUB#13053: Minor AnyQueryNode code cleanup
(Dmitry Cherniachenko)
- Bug Fixes (2)
- GITHUB#13027: Fix NPE when sampling for quantization in Lucene99HnswScalarQuantizedVectorsFormat
(Ben Trent)
- GITHUB#13014: Rollback the tmp storage of BytesRefHash to -1 after sort
(Guo Feng)
- Bug Fixes (2)
- GITHUB#12898: JVM SIGSEGV crash when compiling computeCommonPrefixLengthAndBuildHistogram
(Chris Hegarty)
- GITHUB#12900: Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped
(Guo Feng, Mike McCandless)
- API Changes (13)
- GITHUB#12578: Deprecate IndexSearcher#getExecutor in favour of executing concurrent tasks using
the TaskExecutor that the searcher holds, retrieved via IndexSearcher#getTaskExecutor
(Luca Cavanna)
- GITHUB#12556: StoredFieldVisitor has a new expert method StoredFieldVisitor#binaryField(FieldInfo, DataInput, int)
that allows implementors to read binary values directly from the DataInput without having to allocate a byte[].
The default implementation allocates a new byte array and calls StoredFieldVisitor#binaryField(FieldInfo, byte[]).
(Ignacio Vera)
- GITHUB#12592: Add RandomAccessInput#length method to the RandomAccessInput interface. In addition deprecate
ByteBuffersDataInput#size in favour of this new method.
(Ignacio Vera)
- GITHUB#12718: Make IndexSearcher#getSlices final as it is not expected to be overridden
(Luca Cavanna)
- GITHUB#12427: Automata#makeStringUnion and #makeBinaryStringUnion now accept Iterable<BytesRef> instead of
Collection<BytesRef>. They also now explicitly throw IllegalArgumentException if input data is not properly sorted
instead of relying on assert.
(Shubham Chaudhary)
- GITHUB#12180: Add TaxonomyReader#getBulkOrdinals method to more efficiently retrieve facet ordinals for multiple
FacetLabel at once.
(Egor Potemkin)
- GITHUB#12816: Add HumanReadableQuery which takes a description parameter for debugging purposes.
(Jakub Slowinski)
- GITHUB#12646, GITHUB#12690: Move FST#addNode to FSTCompiler to avoid a circular dependency
between FST and FSTCompiler
(Anh Dung Bui)
- GITHUB#12709: Consolidate FSTStore and BytesStore in FST. Created FSTReader which contains the common methods
of the two
(Anh Dung Bui)
- GITHUB#12735: Remove FSTCompiler#getTermCount() and FSTCompiler.UnCompiledNode#inputCount
(Anh Dung Bui)
- GITHUB#12695: Remove public constructor of FSTCompiler. Please use FSTCompiler.Builder
instead.
(Juan M. Caicedo)
- GITHUB#12799: Make TaskExecutor constructor public and use TaskExecutor for concurrent
HNSW graph build.
(Shubham Chaudhary)
- GITHUB#12758, GITHUB#12803: Remove FST constructor with DataInput for metadata. Please
use the constructor with FSTMetadata instead.
(Anh Dung Bui)
- New Features (5)
- GITHUB#12548: Added similarityToQueryVector API to compute vector similarity scores
with DoubleValuesSource.
(Shubham Chaudhary)
- GITHUB#12685: Lucene now records if documents have been indexed as blocks in SegmentInfo. This is recorded on a per
segment basis and maintained across merges. The property is exposed via LeafReaderMetadata.
(Simon Willnauer)
- GITHUB#12582: Add int8 scalar quantization to the HNSW vector format. This optionally allows for more compact lossy
storage for the vectors, requiring about 75% less memory for fast HNSW search.
(Ben Trent)
- GITHUB#12660: HNSW graphs can now be merged using multiple threads. Configurable in Lucene99HnswVectorsFormat.
(Patrick Zhai)
- GITHUB#12729: Add new Lucene99FlatVectorsFormat for writing vectors in a flat format and refactor
Lucene99HnswVectorsFormat to reuse the flat format. Added new Lucene99HnswQuantizedVectorsFormat that uses
quantized vectors for its flat storage.
(Ben Trent)
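The memory saving from int8 scalar quantization (GITHUB#12582) follows from storing one byte per dimension instead of a 4-byte float. The sketch below shows a simple linear quantizer under stated assumptions: the bounds `min` and `max` are supplied by the caller, values are mapped to the range 0..127, and `ScalarQuantSketch` with its methods is a hypothetical name. It is not the ScalarQuantizer shipped with Lucene.

```java
final class ScalarQuantSketch {
  /** Maps each float linearly from [min, max] to a byte in [0, 127], clamping outliers. */
  static byte[] quantize(float[] vector, float min, float max) {
    if (!(max > min)) {
      throw new IllegalArgumentException("max must be greater than min");
    }
    float scale = 127f / (max - min);
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      float clamped = Math.max(min, Math.min(max, vector[i]));
      out[i] = (byte) Math.round((clamped - min) * scale);
    }
    return out;
  }

  /** Approximate inverse of quantize; the result is lossy. */
  static float[] dequantize(byte[] quantized, float min, float max) {
    float step = (max - min) / 127f;
    float[] out = new float[quantized.length];
    for (int i = 0; i < quantized.length; i++) {
      out[i] = min + quantized[i] * step;
    }
    return out;
  }
}
```

A 768-dimensional vector takes 3072 bytes as floats and 768 bytes once quantized this way, which is the roughly fourfold reduction the entry refers to.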
- Improvements (16)
- GITHUB#12523: TaskExecutor waits for all tasks to complete before returning when Exceptions
are thrown during concurrent operations
(Michael Peterson)
- GITHUB#12574: Make TaskExecutor public so that it can be retrieved from the searcher and used
outside of the o.a.l.search package
(Luca Cavanna)
- GITHUB#12603: Simplify the TaskExecutor API by exposing a single invokeAll method that takes a
collection of callables, executes them and returns their results
(Luca Cavanna)
- GITHUB#12606: Create a TaskExecutor when an executor is not provided to the IndexSearcher, in
order to simplify consumer's code
(Luca Cavanna)
- GITHUB#12676: Improve logging of vector support if vector module was enabled but Java version
is too old. It also logs partial vectorization support if the CPU is old or AVX is disabled.
(Uwe Schindler, Robert Muir)
- GITHUB#12677: Better detect vector module in non-default setups (e.g., custom module layers).
(Uwe Schindler)
- GITHUB#12634, GITHUB#12632, GITHUB#12680, GITHUB#12681, GITHUB#12731, GITHUB#12737: Speed up
Panama vector support and test improvements.
(Uwe Schindler, Robert Muir)
- GITHUB#12586: Remove over-counting of deleted terms.
(Guo Feng)
- GITHUB#12705, GITHUB#12785: Improve handling of NullPointerException and
IllegalStateException in MMapDirectory's IndexInputs. Also makes sure to close master
MemorySegmentIndexInput while not throwing IllegalStateException (retry in spin loop).
Also improve TestMmapDirectory.testAceWithThreads to run faster and use less resources.
(Uwe Schindler, Maurizio Cimadamore, Michael Sokolov)
- GITHUB#12689: TaskExecutor to cancel all tasks on exception to avoid needless computation.
(Luca Cavanna)
- GITHUB#12754: Refactor lookup of Hotspot VM options and do not initialize constants with NULL
if SecurityManager prevents access.
(Uwe Schindler)
- GITHUB#12801: Remove possible contention on a ReentrantReadWriteLock in
Monitor which could result in searches waiting for commits.
(Davis Cook)
- GITHUB#11277, LUCENE-10241: Upgrade OpenNLP to 1.9.4.
(Jeff Zemerick)
- GITHUB#12542: FSTCompiler can now approximately limit how much RAM it uses to share
suffixes during FST construction using the suffixRAMLimitMB method. Larger values
result in a more minimal FST (more common suffixes are shared). Pass
Double.POSITIVE_INFINITY to use as much RAM as is needed to create a purely
minimal FST. Inspired by this Rust FST implementation:
https://blog.burntsushi.net/transducers
(Mike McCandless)
- GITHUB#12738: NodeHash now stores the FST nodes data instead of just node addresses
(Anh Dung Bui)
- GITHUB#12847: Test2BFST now reports the time it took to build the FST and the real FST size
(Anh Dung Bui)
- Optimizations (26)
- GITHUB#12183: Make TermStates#build concurrent.
(Shubham Chaudhary)
- GITHUB#12573: Use radix sort to speed up the sorting of deleted terms.
(Guo Feng)
- GITHUB#12382: Faster top-level conjunctions on term queries when sorting by
descending score.
(Adrien Grand)
- GITHUB#12591: Use stable radix sort to speed up the sorting of update terms.
(Guo Feng)
- GITHUB#12587: Use radix sort to speed up the sorting of terms in TermInSetQuery.
(Guo Feng)
- GITHUB#12604: Estimate the block size of FST BytesStore in BlockTreeTermsWriter
to reduce GC load during indexing.
(Guo Feng)
- GITHUB#12623: Use a MergeSorter taking advantage of extra storage for StableMSBRadixSorter.
(Guo Feng)
- GITHUB#12631: Write MSB VLong for better output sharing in the block tree index, decreasing the size
of the .tip file by ~14%.
(Guo Feng)
- GITHUB#12668: ImpactsEnums now decode frequencies lazily like PostingsEnums.
(Adrien Grand)
- GITHUB#12651: Use 2d array for OnHeapHnswGraph representation.
(Patrick Zhai)
- GITHUB#12653: Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip.
(Shubham Chaudhary)
- GITHUB#12589: Disjunctions now sometimes run as conjunctions when the minimum
competitive score requires multiple clauses to match.
(Adrien Grand)
- GITHUB#12710: Use Arrays#mismatch for Outputs#common operations.
(Guo Feng)
- GITHUB#12712: Speed up sorting postings file with an offline radix sorter in BPIndexReader.
(Guo Feng)
- GITHUB#12702: Disable suffix sharing for block tree index, making writing the terms dictionary index faster
and less RAM hungry, while making the index a bit larger (~1.X% for the terms index file on Wikipedia).
(Guo Feng, Mike McCandless)
- GITHUB#12726: Return the same input vector if it's a unit vector in VectorUtil#l2normalize.
(Shubham Chaudhary)
- GITHUB#12719: Top-level conjunctions that are not sorted by score now have a
specialized bulk scorer.
(Adrien Grand)
- GITHUB#12696: Change Postings back to using FOR in Lucene99PostingsFormat. Freqs, positions and offsets keep using PFOR.
(Jakub Slowinski)
- GITHUB#1052: Faster merging of terms enums.
(Adrien Grand)
- GITHUB#11903: Faster sort on high-cardinality string fields.
(Adrien Grand)
- GITHUB#12381: Skip docs with DocValues in NumericLeafComparator.
(Lu Xugang, Adrien Grand)
- GITHUB#12784: Cache buckets to speed up BytesRefHash#sort.
(Guo Feng)
- GITHUB#12806: Utilize exact kNN search when gathering k >= numVectors in a segment
(Ben Trent)
- GITHUB#12782: Use group-varint encoding for the tail of postings.
(Adrien Grand, Zhang Chao)
- GITHUB#12748: Specialize arc store for continuous label in FST.
(Guo Feng, Chao Zhang)
- GITHUB#12825, GITHUB#12834: Hunspell: improved dictionary loading performance, allowed in-memory entry sorting.
(Peter Gromov)
- Changes in runtime behavior (3)
- GITHUB#12569: Prevent concurrent tasks from parallelizing execution further which could cause deadlock
(Luca Cavanna)
- GITHUB#12765: Disable vectorization on VMs that are not Hotspot-based.
(Uwe Schindler, Robert Muir)
- GITHUB#12552: Make FSTPostingsFormat load FSTs off-heap.
(Tony X)
- Bug Fixes (11)
- GITHUB#12654: TestIndexWriterOnVMError.testUnknownError times out (fixes potential IndexWriter
deadlock with tragic exceptions).
(Benjamin Trent, Dawid Weiss, Simon Willnauer)
- GITHUB#12614: Make LRUQueryCache respect Accountable queries on eviction and consistency check
(Grigoriy Troitskiy)
- GITHUB#11556: HTMLStripCharFilter fails on '>' or '<' characters in attribute values.
(Elliot Lin)
- GITHUB#12698: Fix IndexOutOfBoundsException when saving FSTStore-backed FST with different DataOutput for metadata
(Anh Dung Bui)
- GITHUB#12642: Ensure #finish only gets called once on the base collector during drill-sideways
(Greg Miller)
- GITHUB#12682: Scorer should sum up scores into a double.
(Shubham Chaudhary)
- GITHUB#12727: Ensure negative scores are not returned by vector similarity functions
(Ben Trent)
- GITHUB#12736: Fix NullPointerException when Monitor.getQuery cannot find the requested queryId
(Davis Cook)
- GITHUB#12770: Stop exploring HNSW graph if scores are not getting better.
(Ben Trent)
- GITHUB#12640: Ensure #finish is called on all drill-sideways collectors even if one throws a
CollectionTerminatedException
(Greg Miller)
- GITHUB#12626: Fix segmentInfos replace to set userData
(Shibi Balamurugan, Uwe Schindler, Marcus Eagan, Michael Froh)
- Build (5)
- GITHUB#12752: tests.multiplier could be omitted in test failure reproduce lines (esp. in
nightly mode).
(Dawid Weiss)
- GITHUB#12742: JavaCompile tasks may be in up-to-date state when modular dependencies have changed
leading to odd runtime errors
(Chris Hostetter, Dawid Weiss)
- GITHUB#12612: Upgrade forbiddenapis to version 3.6 and ASM for APIJAR extraction to 9.6.
(Uwe Schindler)
- GITHUB#12655: Upgrade to Gradle 8.4
(Kevin Risden)
- GITHUB#12845: Only enable support for tests.profile if jdk.jfr module is available
in Gradle runtime.
(Uwe Schindler)
- Other (5)
- GITHUB#12817: Add demo for faceting with StringValueFacetCounts over KeywordField and SortedDocValuesField.
(Stefan Vodita)
- GITHUB#12657: Internal refactor of HNSW graph merging
(Ben Trent)
- GITHUB#12625: Refactor ByteBlockPool so it is just a "shift/mask big array".
(Ignacio Vera)
- GITHUB#6675: Various improvements related to ByteBlockPool. Slice functionality on top of ByteBlockPool moved to its
own class, ByteSlicePool. ByteBlockPool's array of buffers is made private. There are new exceptions for buffer index
overflows and slices that are too large. Some bits of code are simplified. Documentation is updated and expanded.
(Stefan Vodita)
- GITHUB#12762: Refactor BKD HeapPointWriter to hide the internal data structure.
(Ignacio Vera)
- API Changes (3)
- GITHUB#12554: Allow FilteredDocIdSetIterator.match(doc) to throw IOException
(Gokul Manoj)
- GITHUB#11248: IntBlockPool's SliceReader, SliceWriter, and all int slice functionality are moved out to MemoryIndex.
(Stefan Vodita)
- GITHUB#12436: Move max vector dims limit to Codec
(Mayya Sharipova)
- New Features (6)
- GITHUB#12380: Introduced LeafCollector#finish, a hook that runs after
collection has finished running on a leaf.
(Adrien Grand)
- LUCENE-8183, GITHUB#9231: Added the ability to get noSubMatches and noOverlappingMatches in
HyphenationCompoundWordFilter
(Martin Demberger, original from Rupert Westenthaler)
- GITHUB#12434: Add `KnnCollector` to `LeafReader` and `KnnVectorReader` so that custom collection of vector
search results can be provided. The first custom collector provides `ToParentBlockJoin[Float|Byte]KnnVectorQuery`
joining child vector documents with their parent documents.
(Ben Trent)
- GITHUB#12479: Add new Maximum Inner Product vector similarity function for non-normalized dot-product
vector search.
(Jack Mazanec, Ben Trent)
- GITHUB#12525: `WordDelimiterGraphFilterFactory` now supports the `ignoreKeywords` flag
(Thomas De Craemer)
- GITHUB#12489: Add support for recursive graph bisection, also called
bipartite graph partitioning, and often abbreviated BP, an algorithm for
reordering doc IDs that results in more compact postings and faster queries,
especially conjunctions.
(Adrien Grand)
- Improvements (5)
- GITHUB#12374: Add CachingLeafSlicesSupplier to compute the LeafSlices for concurrent segment search
(Sorabh Hamirwasia)
- GITHUB#12499: Simplify task executor for concurrent operations by offloading concurrent operations to the
provided executor unconditionally.
(Luca Cavanna)
- GITHUB#12464: Hunspell: allow customizing the hash table load factor
(Peter Gromov)
- GITHUB#12468: Hunspell: check for aff file wellformedness more strictly
(Peter Gromov)
- GITHUB#12491: Hunspell: speed up the dictionary enumeration on suggestion
(Peter Gromov)
- Optimizations (13)
- GITHUB#12377: Avoid redundant loop for computing the min value in DirectMonotonicWriter.
(Zhang Chao)
- GITHUB#12361: Faster top-level disjunctions sorted by descending score.
(Adrien Grand)
- GITHUB#12444: Faster top-level disjunctions sorted by descending score in
case of many terms or queries that expose suboptimal score upper bounds.
(Adrien Grand)
- GITHUB#12383: Assign a dummy simScorer in TermsWeight if score is not needed.
(Sagar Upadhyaya)
- GITHUB#12372: Reduce allocation during HNSW construction
(Jonathan Ellis)
- GITHUB#12385: Restore parallel knn query rewrite across segments rather than slices
(Luca Cavanna)
- GITHUB#12381: Speed up NumericDocValuesWriter with index sorting.
(Zhang Chao)
- GITHUB#12453: Faster bulk numeric reads from BufferedIndexInput
(Armin Braun)
- GITHUB#12415: Optimized counts on disjunctive queries.
(Adrien Grand)
- GITHUB#12518: Use panama vector API to speed up l2norm calculations
(Ben Trent)
- GITHUB#12480: Lazy computation of similarity score during initializeFromGraph
(Jack Wang)
- GITHUB#12490: Faster computation of top-k hits on boolean queries.
(Adrien Grand)
- GITHUB#12560: ExpressionValueSource defers #advanceExact on dependencies until their values are needed, avoiding
unnecessary advancing on values that are never evaluated (e.g., because of ternary expressions).
(Greg Miller)
- Changes in runtime behavior (3)
- GITHUB#12516: Unwrap and throw the cause of execution exceptions when performing concurrent search
(Luca Cavanna)
- GITHUB#12498: Offload concurrent search execution to the executor that's optionally provided to the IndexSearcher.
Tasks are no longer executed on the caller thread when rejected or if the executor queue goes above a predefined
threshold. Adaptive behaviour as well as a saturation policy can be incorporated in the provided executor instead.
(Luca Cavanna)
- GITHUB#12515: Offload sequential search execution to the executor that's optionally provided to the IndexSearcher
(Luca Cavanna)
- Bug Fixes (10)
- GITHUB#9660: Throw an ArithmeticException when the offset overflows in a ByteBlockPool.
(Stefan Vodita)
- GITHUB#11537: Fix stack overflow in RegExp for long strings by reducing recursion.
(Jakub Slowinski)
- GITHUB#12388: JoinUtil queries were ignoring boosts.
(Alan Woodward)
- GITHUB#12413: Fix HNSW graph search bug that potentially leaked unapproved docs
(Ben Trent)
- GITHUB#12423: Respect timeouts in ExitableDirectoryReader when searching with byte[] vectors
(Ben Trent)
- GITHUB#12451: Change TestStringsToAutomaton validation to avoid automaton conversion bug discovered in GH#12458
(Greg Miller)
- GITHUB#12472: UTF32ToUTF8 would sometimes accept extra invalid UTF-8 binary sequences. This should not have any
impact on the user, unless you explicitly invoke the convert function of UTF32ToUTF8, and in the extremely rare
scenario of searching a non-UTF-8 inverted field with Unicode search terms
(Tang Donghai)
- LUCENE-12521: Fix Sort After returning incorrect results when missing values are competitive.
(Chaitanya Gohel)
- GITHUB#12555: Fix bug in TermsEnum#seekCeil on doc values terms enums
that causes IndexOutOfBoundsException.
(Egor Potemkin)
- GITHUB#12571: Fix HNSW graph read bug when built with excessive connections.
(Ben Trent)
- Other (4)
- GITHUB#12404: Remove usages of some legacy java.util classes (Stack, Hashtable, Vector) and add them to forbiddenapis.
(Uwe Schindler)
- GITHUB#12410: Refactor vectorization support (split provider from implementation classes).
(Uwe Schindler, Chris Hegarty)
- GITHUB#12428: Replace consecutive close() calls and close() calls with null checks with IOUtils.close().
(Shubham Chaudhary)
- GITHUB#12512: Remove unused variable in BKDWriter.
(Zhang Chao)
- API Changes (4)
- GITHUB#11840, GITHUB#12304: Query rewrite now takes an IndexSearcher instead of
IndexReader to enable concurrent rewriting. Please note: This is implemented in
a backwards compatible way. A query overriding either of the two rewrite methods is
supported. To implement this backwards compatibility layer in Lucene 9.x, the
RuntimePermission "accessDeclaredMembers" is needed in applications using
SecurityManager.
(Patrick Zhai, Ben Trent, Uwe Schindler)
- GITHUB#12321: DaciukMihovAutomatonBuilder has been marked deprecated in preparation of reducing its visibility in
a future release.
(Greg Miller)
- GITHUB#12268: Add BitSet.clear() without parameters for clearing the entire set
(Jonathan Ellis)
- GITHUB#12346: Add new IndexWriter#updateDocuments(Query, Iterable<Document>) API
to update documents atomically, with respect to refresh and commit using a query.
(Patrick Zhai)
- New Features (4)
- GITHUB#12257: Create OnHeapHnswGraphSearcher to let OnHeapHnswGraph be searched in a thread-safe manner.
(Patrick Zhai)
- GITHUB#12302, GITHUB#12311, GITHUB#12363: Add vectorized implementations of VectorUtil.dotProduct(),
squareDistance(), cosine() with Java 20 or 21 jdk.incubator.vector APIs. Applications started
with command line parameter "java --add-modules jdk.incubator.vector" on exactly Java 20 or 21
will automatically use the new vectorized implementations if running on a supported platform
(x86 AVX2 or later, ARM NEON). This is an opt-in feature and requires explicit Java
command line flag! When enabled, Lucene logs a notice using java.util.logging. Please test
thoroughly and report bugs/slowness to Lucene's mailing list.
(Chris Hegarty, Robert Muir, Uwe Schindler)
- GITHUB#12294: Add support for Java 21 foreign memory API. If Java 19 up to 21 is used,
MMapDirectory will mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) and indexes
closed while queries are running can no longer crash the JVM. To disable this feature,
pass the following sysprop on Java command line:
"-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
(Uwe Schindler)
- GITHUB#12252: Add function queries for computing similarity scores between knn vectors.
(Elia Porciani, Alessandro Benedetti)
- Improvements (7)
- GITHUB#12245: Add support for Score Mode to `ToParentBlockJoinQuery` explain.
(Marcus Eagan via Mikhail Khludnev)
- GITHUB#12305: Minor cleanup and improvements to DaciukMihovAutomatonBuilder.
(Greg Miller)
- GITHUB#12325: Parallelize AbstractKnnVectorQuery rewrite across slices rather than segments.
(Luca Cavanna)
- GITHUB#12333: NumericLeafComparator#competitiveIterator makes better use of a "search after" value when paginating.
(Chaitanya Gohel)
- GITHUB#12290: Make memory fence in ByteBufferGuard explicit using `VarHandle.fullFence()`
- GITHUB#12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit.
(Greg Miller)
- GITHUB#12281: Require indexed KNN float vectors and query vectors to be finite.
(Jonathan Ellis, Uwe Schindler)
- Optimizations (9)
- GITHUB#12324: Speed up sparse block advanceExact with tiny step in IndexedDISI.
(Guo Feng)
- GITHUB#12270: Don't generate stacktrace in CollectionTerminatedException.
(Armin Braun)
- GITHUB#12160: Concurrent rewrite for AbstractKnnVectorQuery.
(Kaival Parikh)
- GITHUB#12286: Toposort uses an iterator to avoid stack overflow.
(Tang Donghai)
- GITHUB#12235: Optimize HNSW diversity calculation.
(Patrick Zhai)
- GITHUB#12328: Optimize ConjunctionDISI.createConjunction
(Armin Braun)
- GITHUB#12357: Better paging when doing backwards random reads. This speeds up
queries relying on terms in NIOFSDirectory and SimpleFSDirectory.
(Alan Woodward)
- GITHUB#12339: Optimize away part of the duplicate numDeletesToMerge calculation in the merge phase
(fudongying)
- GITHUB#12334: Honor after value for skipping documents even if queue is not full for PagingFieldCollector
(Chaitanya Gohel)
- Bug Fixes (4)
- GITHUB#12291: Skip blank lines from stopwords list.
(Jerry Chin)
- GITHUB#11350: Handle possible differences in FieldInfo when merging indices created with Lucene 8.x
(Tomás Fernández Löbbe)
- GITHUB#12352: [Tessellator] Improve the checks that validate the diagonal between two polygon nodes so
the resulting polygons are valid counter clockwise polygons.
(Ignacio Vera)
- LUCENE-10181: Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth.
(Chris Fournier)
- Other (1)
- (No changes)
- API Changes (4)
- GITHUB#12116: Introduce IndexableField#storedValue() to expose the value that
should be stored to IndexingChain without needing to guess the field's type.
(Adrien Grand, Robert Muir)
- GITHUB#12129: Move DocValuesTermsQuery from sandbox to SortedDocValuesField#newSlowSetQuery
and SortedSetDocValuesField#newSlowSetQuery.
(Robert Muir)
- GITHUB#12173: TermInSetQuery#getTermData has been deprecated. This exposes internal implementation details that we
may want to change in the future, and users shouldn't rely on the encoding directly.
(Greg Miller)
- GITHUB#11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount.
(Greg Miller)
- New Features (3)
- GITHUB#12054: Introduce a new KeywordField for simple and efficient
filtering, sorting and faceting.
(Adrien Grand)
- GITHUB#12188: Add support for Java 20 foreign memory API. If exactly Java 19
or 20 is used, MMapDirectory will mmap Lucene indexes in chunks of 16 GiB
(instead of 1 GiB) and indexes closed while queries are running can no longer
crash the JVM. To disable this feature, pass the following sysprop on Java command line:
"-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
(Uwe Schindler)
- GITHUB#12169: Introduce a new token filter to expand synonyms based on Word2Vec DL4j models.
(Daniele Antuzi, Ilaria Petreti, Alessandro Benedetti)
- Improvements (5)
- GITHUB#12055: MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE rewrite method introduced and used as the new default
for multi-term queries with a FILTER rewrite (PrefixQuery, WildcardQuery, TermRangeQuery). This introduces better
skipping support for common use-cases.
(Adrien Grand, Greg Miller)
- GITHUB#12156: TermInSetQuery now extends MultiTermQuery instead of providing its own custom implementation (which
was essentially a clone of MultiTermQuery#CONSTANT_SCORE_REWRITE). It uses the new CONSTANT_SCORE_BLENDED_REWRITE
by default, but can be overridden through the constructor.
(Greg Miller)
- GITHUB#12175: Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod.
(Greg Miller)
- GITHUB#12166: Remove the now unused class pointInPolygon.
(Marcus Eagan via Christine Poerschke and Nick Knize)
- GITHUB#12126: Refactor part of IndexFileDeleter and ReplicaFileDeleter into a public common utility class
FileDeleter.
(Patrick Zhai)
- Optimizations (9)
- GITHUB#11900: BloomFilteringPostingsFormat now uses multiple hash functions
in order to achieve the same false positive probability with less memory.
(Jean-François Boeuf)
- GITHUB#12118: Optimize FeatureQuery to TermQuery & weight when scoring is not required.
(Ben Trent, Robert Muir)
- GITHUB#12128, GITHUB#12133: Speed up docvalues set query by making use of sortedness.
(Robert Muir, Uwe Schindler)
- GITHUB#12050: Reuse HNSW graph for initialization during merge
(Jack Mazanec)
- GITHUB#12155: Speed up DocValuesRewriteMethod by making use of sortedness.
(Greg Miller)
- GITHUB#12139: Faster indexing of string fields.
(Adrien Grand)
- GITHUB#12179: Better PostingsEnum reuse in MultiTermQueryConstantScoreBlendedWrapper.
(Greg Miller)
- GITHUB#12198, GITHUB#12199: Reduced contention when indexing with many threads.
(Adrien Grand)
- GITHUB#12241: Add ordering of files in compound files.
(Christoph Büscher)
- Bug Fixes (8)
- GITHUB#12158: KeywordField#newSetQuery should clone input BytesRef[] to avoid modifying provided array.
(Greg Miller)
- GITHUB#12196: Fix MultiFieldQueryParser to handle both query boost and phrase slop at the same time.
(Jasir KT)
- GITHUB#12202: Fix MultiFieldQueryParser to apply boosts to regexp, wildcard, prefix, range, fuzzy queries.
(Jasir KT)
- GITHUB#12178: Add explanations for TermAutomatonQuery
(Marcus Eagan via Patrick Zhai, Mike McCandless, Robert Muir, Mikhail Khludnev)
- GITHUB#12214: Fix ordered intervals query to avoid skipping some of the results over interleaved terms.
(Hongyu Yan)
- GITHUB#12212: Bug fix for a DrillSideways issue where matching hits could occasionally be missed.
(Frederic Thevenet)
- GITHUB#12220: Hunspell: disallow hidden title-case entries from compound middle/end
(Peter Gromov)
- GITHUB#12260: Fix SynonymQuery equals implementation to take the targeted field name into account
(Luca Cavanna)
- Build (3)
- GITHUB#12131: Generate gradle.properties from gradlew, if absent
(Colvin Cowie, Uwe Schindler)
- GITHUB#12188: Building the lucene-core MR-JAR file is now possible without installing
additionally required Java versions (Java 19, Java 20,...). For compilation, a special
JAR file with Panama-foreign API signatures of each supported Java version was added to
source tree. Those can be regenerated on demand with "gradlew :lucene:core:regenerate".
(Uwe Schindler)
- GITHUB#12215: Upgrade forbiddenapis to version 3.5. This tones down some verbose warnings
printed while checking Java 19 and Java 20 sourcesets for the MR-JAR.
(Uwe Schindler)
- Documentation (1)
- GITHUB#10633: Update javadocs in TestBackwardsCompatibility to use gradle and not ant.
(Usman Shaikh)
- Other (2)
- GITHUB#11868: Add a FilterIndexInput and FilterIndexOutput class to more easily and safely create delegate
IndexInput and IndexOutput classes
(Marc D'Mello)
- GITHUB#12239: Hunspell: reduced suggestion set dependency on the hash table order
(Peter Gromov)
- API Changes (20)
- GITHUB#12093: Deprecate support for UTF8TaxonomyWriterCache and changed default to LruTaxonomyWriterCache.
Please use LruTaxonomyWriterCache instead.
(Vigya Sharma)
- GITHUB#11998: Add new stored fields and termvectors interfaces: IndexReader.storedFields()
and IndexReader.termVectors(). Deprecate IndexReader.document() and IndexReader.getTermVector().
The new APIs do not rely upon ThreadLocal storage for each index segment, which can greatly
reduce RAM requirements when there are many threads and/or segments.
(Adrien Grand, Robert Muir)
- GITHUB#11742: MatchingFacetSetsCounts#getTopChildren now properly returns "top" children instead
of all children.
(Greg Miller)
- GITHUB#11772: Removed native subproject and WindowsDirectory implementation from lucene.misc. Recommendation:
use MMapDirectory implementation on Windows.
(Robert Muir, Uwe Schindler, Dawid Weiss)
- GITHUB#11804: FacetsCollector#collect is no longer final, allowing extension.
(Greg Miller)
- GITHUB#11761: TieredMergePolicy now allows a maximum allowable deletes percentage as low as 5%, and the default
maximum allowable deletes percentage is changed from 33% to 20%.
(Marc D'Mello)
- GITHUB#11822: Configure replicator PrimaryNode replica shutdown timeout.
(Steven Schlansker)
- GITHUB#11930: Added IOContext#LOAD for files that are a small fraction of the
total index size and heavily accessed with a random access pattern. Some
Directory implementations may choose to load files that use this IOContext in
memory to provide stronger guarantees on query latency.
(Adrien Grand, Uwe Schindler)
- GITHUB#11941: QueryBuilder#add and #newSynonymQuery methods now take a `field` parameter,
to avoid possible exceptions when building queries from an empty term list. The helper
TermAndBoost class now holds a BytesRef rather than a Term.
(Alan Woodward)
- GITHUB#11961: VectorValues#EMPTY was removed as this instance was not
necessary and also illegal as it reported a number of dimensions equal to
zero.
(Adrien Grand)
- GITHUB#11962: VectorValues#cost() now delegates to VectorValues#size().
(Adrien Grand)
- GITHUB#11984: Improved TimeLimitBulkScorer to check the timeout at an exponential rate.
(Costin Leau)
- GITHUB#12004: Add new KnnByteVectorQuery for querying vector fields that are encoded as BYTE. Removes the ability to
use KnnVectorQuery against fields encoded as BYTE
(Ben Trent)
- GITHUB#11997: Introduce IntField, LongField, FloatField and DoubleField.
These new fields index both 1D points and sorted numeric doc values and
provide best performance for filtering and sorting.
(Francisco Fernández Castaño, Adrien Grand)
- GITHUB#12066: Retire/deprecate instance method MMapDirectory#setUseUnmap().
Like the new setting for MemorySegments, this feature is enabled by default and
can only be disabled globally by passing the following sysprop on Java command line:
"-Dorg.apache.lucene.store.MMapDirectory.enableUnmapHack=false"
(Uwe Schindler)
- GITHUB#12038: Deprecate non-NRT replication support.
Please migrate to org.apache.lucene.replicator.nrt instead.
(Robert Muir)
- GITHUB#12087: Move DocValuesNumbersQuery from sandbox to NumericDocValuesField#newSlowSetQuery
and SortedNumericDocValuesField#newSlowSetQuery. IntField, LongField, FloatField, and DoubleField
implement newSetQuery with best-practice use of IndexOrDocValuesQuery.
(Robert Muir)
- GITHUB#12064: Create new KnnByteVectorField, ByteVectorValues and KnnVectorsReader#getByteVectorValues(String)
that are specialized for byte-sized vectors, and clarify the public API by making a clear distinction
between classes that produce and read float vectors and those that produce and read byte vectors.
(Ben Trent)
- GITHUB#12101: Remove VectorValues#binaryValue(). Vectors should only be
accessed through their high-level representation, via
VectorValues#vectorValue().
(Adrien Grand)
- GITHUB#12105: Deprecate KnnVectorField in favour of KnnFloatVectorField,
KnnVectorQuery in favour of KnnFloatVectorQuery, and LeafReader#getVectorValues
in favour of LeafReader#getFloatVectorValues.
(Luca Cavanna)
- New Features (7)
- GITHUB#11795: Add ByteWritesTrackingDirectoryWrapper to expose metrics for bytes merged, flushed, and overall
write amplification factor.
(Marc D'Mello)
- GITHUB#11929: MMapDirectory gives more granular control on which files to
preload.
(Adrien Grand, Uwe Schindler)
- GITHUB#11999: MemoryIndex now supports stored fields.
(Alan Woodward)
- GITHUB#11997: Add IntField, LongField, FloatField and DoubleField: easy to
use numeric fields that perform well both for filtering and sorting.
(Francisco Fernández Castaño)
- GITHUB#12033: Java 19 foreign memory support is now enabled by default,
no need to pass "--enable-preview" on the command line. If exactly Java 19 is used,
MMapDirectory will mmap Lucene indexes in chunks of 16 GiB (instead of 1 GiB) and
indexes closed while queries are running can no longer crash the JVM.
To disable this feature, pass the following sysprop on Java command line:
"-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false"
(Uwe Schindler)
- GITHUB#11869: RangeOnRangeFacetCounts added, supporting numeric range "relationship" faceting over docvalue-stored
ranges.
(Marc D'Mello)
- LUCENE-10626: Hunspell: add tools to aid dictionary editing:
analysis introspection, stem expansion and stem/flag suggestion
(Peter Gromov)
- Improvements (9)
- GITHUB#11785: Improve Tessellator performance by delaying calls to the method
#isIntersectingPolygon
(Ignacio Vera)
- GITHUB#687: Speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocIdSetIterator
construction using bkd binary search.
(Jianping Weng)
- GITHUB#11985: ExitableTerms to override Terms#getMin and Terms#getMax in order to avoid
iterating through the terms when the wrapped implementation caches such values.
(Luca Cavanna)
- GITHUB#11860: Improve storage efficiency of connections in the HNSW graph that Lucene uses for
vector search.
(Ben Trent)
- GITHUB#12008: Clean up LongRange#verifyAndEncode logic to remove unnecessary NaN checks.
(Greg Miller)
- GITHUB#12003: Minor cleanup/improvements to IndexSortSortedNumericDocValuesRangeQuery.
(Greg Miller)
- GITHUB#12016: Upgrade lucene/expressions to use antlr 4.11.1
(Andriy Redko)
- GITHUB#12034: Remove null check in IndexReaderContext#leaves() usages
(Erik Pellizzon)
- GITHUB#12070: Compound file creation is no longer subject to merge throttling.
(Adrien Grand)
- Bug Fixes (15)
- GITHUB#11726: Indexing term vectors on large documents could fail due to
trying to apply a dictionary whose size is greater than the maximum supported
window size for LZ4.
(Adrien Grand)
- GITHUB#11768: Taxonomy and SSDV faceting now correctly breaks ties by preferring smaller ordinal
values.
(Greg Miller)
- GITHUB#11907: Fix latent casting bugs in BKDWriter.
(Ben Trent)
- GITHUB#11954: Remove QueryTimeout#isTimeoutEnabled method and move check to caller.
(Shubham Chaudhary)
- GITHUB#11950: Fix NPE in BinaryRangeFieldRangeQuery variants when the queried field doesn't exist
in a segment or is of the wrong type.
(Greg Miller)
- GITHUB#11990: PassageSelector now has a larger minimum size for its priority queue,
so that subsequent passage merges don't mean that we return too few passages in
total.
(Alan Woodward, Dawid Weiss)
- GITHUB#11986: Fix algorithm that chooses the bridge between a polygon and a hole when there is
a common vertex.
(Ignacio Vera)
- GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain intersecting geometries.
(Craig Taverner)
- GITHUB#12058: Fix detection of Hotspot in TestRamUsageEstimator so it works with OpenJ9.
(Uwe Schindler)
- GITHUB#12046: Fix out-of-bounds error in CombinedFieldQuery#addTerm.
(Lu Xugang)
- GITHUB#12072: Fix exponential runtime for nested BooleanQuery#rewrite when a
BooleanClause is non-scoring.
(Ben Trent)
- GITHUB#11807: Don't rewrite queries in unified highlighter.
(Alan Woodward)
- GITHUB#12088: WeightedSpanTermExtractor should not throw UnsupportedOperationException
when it encounters a FieldExistsQuery.
(Alan Woodward)
- GITHUB#12084: Same bound with fallbackQuery.
(Lu Xugang)
- GITHUB#12077: WordBreakSpellChecker now correctly respects maxEvaluations
(hossman)
- Optimizations (18)
- GITHUB#11738: Optimize MultiTermQueryConstantScoreWrapper when a term is present that matches all
docs in a segment.
(Greg Miller)
- GITHUB#11735: KeywordRepeatFilter + OpenNLPLemmatizer always drops last token of a stream.
(Luke Kot-Zaniewski)
- GITHUB#11771: KeywordRepeatFilter + OpenNLPLemmatizer sometimes arbitrarily exits token stream.
(Luke Kot-Zaniewski)
- GITHUB#11803: DrillSidewaysScorer has been improved to leverage "advance" instead of "next" where
possible, and splits out first and second phase checks to delay match confirmation.
(Greg Miller)
- GITHUB#11828: Tweak TermInSetQuery "dense" optimization to only require all terms present in a
given field to match a term (rather than all docs in a segment). This is consistent with
MultiTermQueryConstantScoreWrapper.
(Greg Miller)
- GITHUB#11876: Use ByteArrayComparator to speed up PointInSetQuery in single dimension case.
(Guo Feng)
- GITHUB#11880: Use ByteArrayComparator to speed up BinaryRangeFieldRangeQuery, RangeFieldQuery,
LatLonPointDistanceFeatureQuery and CheckIndex.
(Guo Feng)
- GITHUB#11881: Further optimize drill-sideways scoring by specializing the single dimension case
and borrowing some concepts from "min should match" scoring.
(Greg Miller)
- GITHUB#11884: Simplify the logic of matchAll() in IndexSortSortedNumericDocValuesRangeQuery.
(Lu Xugang)
- GITHUB#11895: count() in BooleanQuery can now exit early.
(Lu Xugang)
- GITHUB#11972: `IndexSortSortedNumericDocValuesRangeQuery` can now also
optimize query execution with points for descending sorts.
(Adrien Grand)
- GITHUB#12006: Compare ints instead of calling ArrayUtil#compareUnsigned4 in LatlonPointQueries.
(Guo Feng)
- GITHUB#12011: Minor speedup to flushing long postings lists when an index
sort is configured.
(Adrien Grand)
- GITHUB#12017: Aggressive count in BooleanWeight.
(Lu Xugang)
- GITHUB#12079: Faster merging of 1D points.
(Adrien Grand)
- GITHUB#12081: Small merging speedup on sorted indexes.
(Adrien Grand)
- GITHUB#12078: Enhance XXXField#newRangeQuery.
(Lu Xugang)
- GITHUB#11857, GITHUB#11859, GITHUB#11893, GITHUB#11909: Hunspell: improved suggestion performance
(Peter Gromov)
- Other (9)
- GITHUB#11856: Fix nanos to millis conversion for tests
(Marios Trivyzas)
- LUCENE-10423: Remove usages of System.currentTimeMillis() from tests.
(Marios Trivyzas)
- GITHUB#11811: Upgrade google java format to 1.15.0
(Dawid Weiss)
- GITHUB#11834: Upgrade forbiddenapis to version 3.4.
(Uwe Schindler)
- LUCENE-10635: Ensure test coverage for WANDScorer by using a test query.
(Zach Chen, Adrien Grand)
- GITHUB#11752: Added interface to relate a LatLonShape with another shape represented as Component2D.
(Navneet Verma)
- GITHUB#11983: Make constructors for OffsetFromPositions and OffsetsFromMatchIterator
public.
(Alan Woodward)
- LUCENE-10546: Update Faceting user guide.
(Egor Potemkin)
- GITHUB#12099: Add getters to KnnVectorQuery.
(Alessandro Benedetti)
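As an illustration of the nanosecond-to-millisecond conversion mentioned in GITHUB#11856, here is a minimal stdlib-only sketch. The class and method names are hypothetical, and this is not the test code touched by that change; it only shows one way to convert a System.nanoTime() delta without hand-written division.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Lucene: converts an elapsed nanosecond
// interval to whole milliseconds.
final class ElapsedTime {
  private ElapsedTime() {}

  // TimeUnit truncates toward zero, so 1_999_999 ns becomes 1 ms.
  static long nanosToMillis(long elapsedNanos) {
    return TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
  }

  public static void main(String[] args) throws InterruptedException {
    // System.nanoTime() is meant for measuring elapsed time and is not
    // tied to the wall clock, unlike System.currentTimeMillis().
    long start = System.nanoTime();
    Thread.sleep(5);
    long elapsedMs = nanosToMillis(System.nanoTime() - start);
    System.out.println("elapsed ms: " + elapsedMs);
  }
}
```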
- Build (1)
- GITHUB#11886: Upgrade to gradle 7.5.1
(Dawid Weiss)
- Bug Fixes (2)
- GITHUB#11905: Fix integer overflow when seeking the vector index for connections in a single segment.
This addresses a bug that was introduced in 9.2.0 where having many vectors is not handled well
in the vector connections reader.
- GITHUB#11939: Fix incorrect cost calculation in DocIdSetBuilder after upgradeToBitSet when doc list is growing.
This addresses a bug where the cost of TermRangeQuery/TermInSetQuery and some other queries will be highly underestimated.
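The kind of integer overflow described in GITHUB#11905 can be sketched in plain Java. This is a hypothetical illustration, not the vector connections reader itself: the names OffsetMath, ordinal and bytesPerRecord are assumptions chosen for the example.

```java
// Hypothetical illustration, not Lucene's reader code: computing a byte
// offset from an int ordinal and an int record size.
final class OffsetMath {
  private OffsetMath() {}

  // Buggy: the multiplication happens in 32-bit int arithmetic and wraps
  // before the result is widened to long.
  static long offsetOverflowing(int ordinal, int bytesPerRecord) {
    return ordinal * bytesPerRecord;
  }

  // Fixed: widen one operand first so the product is computed as a long.
  static long offsetSafe(int ordinal, int bytesPerRecord) {
    return (long) ordinal * bytesPerRecord;
  }

  // Alternative that throws ArithmeticException instead of wrapping silently.
  static int offsetChecked(int ordinal, int bytesPerRecord) {
    return Math.multiplyExact(ordinal, bytesPerRecord);
  }
}
```

With 20,000,000 records of 128 bytes each, offsetSafe yields 2,560,000,000 while offsetOverflowing wraps to a negative value, which is why such bugs tend to surface only on large segments.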
- Improvements (2)
- GITHUB#11912, GITHUB#11918: Port generic exception handling from MemorySegmentIndexInput
to ByteBufferIndexInput. This also adds the invalid position while seeking or reading
to the exception message. Allows better debugging and analysis of bugs like GITHUB#11905.
(Uwe Schindler, Robert Muir)
- GITHUB#11916: Improve CheckIndex to be more thorough for vectors.
(Ben Trent)
- Bug Fixes (1)
- GITHUB#11858: Fix kNN vectors format validation on large segments. This
addresses a regression in 9.4.0 where validation could fail, preventing
further writes or searches on the index.
(Julie Tibshirani)
- API Changes (1)
- LUCENE-10577: Add VectorEncoding to enable byte-encoded HNSW vectors
(Michael Sokolov, Julie Tibshirani)
- New Features (4)
- LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape.
(Nick Knize)
- LUCENE-10629: Support match set filtering with a query in MatchingFacetSetCounts.
(Stefan Vodita, Shai Erera)
- LUCENE-10633: SortField#setOptimizeSortWithIndexedData and
SortField#getOptimizeSortWithIndexedData were introduced to provide
an option to disable sort optimization for various sort fields.
(Mayya Sharipova)
- GITHUB#912: Support for the Java 19 foreign memory API was added. Applications started
with command line parameter "java --enable-preview" will automatically use the new
foreign memory API of Java 19 to access indexes on disk with MMapDirectory. This is
an opt-in feature and requires an explicit Java command line flag! When enabled, Lucene logs
a notice using java.util.logging. Please test thoroughly and report bugs/slowness to Lucene's
mailing list. When the new API is used, MMapDirectory will mmap Lucene indexes in chunks of
16 GiB (instead of 1 GiB) and indexes closed while queries are running can no longer crash
the JVM.
(Uwe Schindler)
- Improvements (4)
- LUCENE-10592: Build HNSW Graph on indexing.
(Mayya Sharipova, Adrien Grand, Julie Tibshirani)
- LUCENE-10207: TermInSetQuery can now provide a ScoreSupplier with cost estimation, making it
usable in IndexOrDocValuesQuery.
(Greg Miller)
- LUCENE-10216: Use MergePolicy to define and MergeScheduler to trigger the reader merges
required by addIndexes(CodecReader[]) API.
(Vigya Sharma, Michael McCandless)
- GITHUB#11715: Add Integer awareness to RamUsageEstimator.sizeOf
(Mike Drob)
- Optimizations (5)
- LUCENE-10661: Reduce memory copy in BytesStore.
(luyuncheng)
- GITHUB#1020: Support #scoreSupplier and small optimizations to DocValuesRewriteMethod.
(Greg Miller)
- LUCENE-10633: Added support for dynamic pruning to queries sorted by a string
field that is indexed with terms and SORTED or SORTED_SET doc values.
(Adrien Grand)
- LUCENE-10627: Use ByteBuffersDataInput to reduce memory copies when compressing data.
(luyuncheng)
- GITHUB#1062: Optimize TermInSetQuery when a term is present that matches all docs in a segment.
(Greg Miller)
- Bug Fixes (7)
- LUCENE-10663: Fix KnnVectorQuery explain with multiple segments.
(Shiming Li)
- LUCENE-10673: Improve check of equality for latitudes for spatial3d GeoBoundingBox
(Ignacio Vera)
- LUCENE-10678: Fix potential overflow when building a BKD tree with more than 4 billion points. The overflow
occurs when computing the partition point.
(Ignacio Vera)
- LUCENE-10644: Facets#getAllChildren testing should ignore child order.
(Yuting Gan)
- LUCENE-10665, GITHUB#11701: Fix classloading deadlock in analysis factories / AnalysisSPILoader
initialization.
(Uwe Schindler)
- LUCENE-10674: Ensure BitSetConjDISI returns NO_MORE_DOCS when sub-iterator exhausts.
(Jack Mazanec)
- GITHUB#11794: Guard FieldExistsQuery against null pointers
(Luca Cavanna)
- Build (2)
- GITHUB#11720: Upgrade randomizedtesting to 2.8.1 (potential fix for odd wall-clock-related
timeout failures).
(Dawid Weiss)
- LUCENE-10669: The build should be more helpful when generated resources are touched
(Dawid Weiss)
- Other (1)
- LUCENE-10559: Add Prefilter Option to KnnGraphTester
(Kaival Parikh)
- API Changes (2)
- LUCENE-10603: SortedSetDocValues#NO_MORE_ORDS marked @deprecated in favor of iterating with
SortedSetDocValues#docValueCount().
(Greg Miller)
- GITHUB#978: Deprecate (remove in Lucene 10) obsolete constants in oal.util.Constants; remove
code which is no longer executed after Java 9.
(Uwe Schindler)
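The iteration style that LUCENE-10603 recommends can be sketched without depending on Lucene. The OrdIterator interface below is a hypothetical stand-in that mirrors only the nextOrd() and docValueCount() methods named in the entry; its return types are assumptions, and real code would call the SortedSetDocValues instance directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors only the two SortedSetDocValues
// methods named in the entry; it is NOT the Lucene class.
interface OrdIterator {
  int docValueCount(); // number of ordinals for the current document
  long nextOrd();      // next ordinal for the current document
}

final class OrdIteration {
  private OrdIteration() {}

  // Count-driven loop: call nextOrd() exactly docValueCount() times
  // rather than looping until a NO_MORE_ORDS sentinel is returned.
  static List<Long> collectOrds(OrdIterator values) {
    int count = values.docValueCount();
    List<Long> ords = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      ords.add(values.nextOrd());
    }
    return ords;
  }

  // Tiny array-backed implementation, for demonstration only.
  static OrdIterator of(long... ords) {
    return new OrdIterator() {
      private int next = 0;

      @Override
      public int docValueCount() {
        return ords.length;
      }

      @Override
      public long nextOrd() {
        return ords[next++];
      }
    };
  }
}
```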
- New Features (4)
- LUCENE-10550: Add getAllChildren functionality to facets
(Yuting Gan)
- LUCENE-10274: Added facetsets module for high dimensional (hyper-rectangle) faceting
(Shai Erera, Marc D'Mello, Greg Miller)
- LUCENE-10151: Enable timeout support in IndexSearcher.
(Deepika Sharma)
- Improvements (5)
- LUCENE-10078: Merge on full flush is now enabled by default with a timeout of
500ms.
(Adrien Grand)
- LUCENE-10585: Facet module code cleanup (copy/paste scrubbing, simplification and some very minor
optimization tweaks).
(Greg Miller)
- LUCENE-10603: Update SortedSetDocValues iteration to use SortedSetDocValues#docValueCount().
(Greg Miller, Stefan Vodita)
- LUCENE-10619: Optimize the writeBytes in TermsHashPerField.
(Tang Donghai)
- GITHUB#983: AbstractSortedSetDocValueFacetCounts internal code cleanup/refactoring.
(Greg Miller)
- Optimizations (11)
- LUCENE-8519: MultiDocValues.getNormValues should not call getMergedFieldInfos
(Rushabh Shah)
- GITHUB#961: BooleanQuery can return quick counts for simple boolean queries.
(Adrien Grand)
- LUCENE-10618: Implement BooleanQuery rewrite rules based on minimumShouldMatch.
(Fang Hou)
- LUCENE-10480: Implement Block-Max-Maxscore scorer for 2 clauses disjunction.
(Zach Chen, Adrien Grand)
- LUCENE-10606: For KnnVectorQuery, optimize case where filter is backed by BitSetIterator
(Kaival Parikh)
- LUCENE-10593: Vector similarity function and NeighborQueue reverse removal.
(Alessandro Benedetti)
- GITHUB#984: Use primitive type data structures in FloatTaxonomyFacets and IntTaxonomyFacets
#getAllChildren() internal implementation to avoid some garbage creation.
(Greg Miller)
- GITHUB#1010: Specialize ordinal encoding for common case in SortedSetDocValues.
(Greg Miller)
- LUCENE-10657: CopyBytes now saves one memory copy on ByteBuffersDataOutput.
(luyuncheng)
- GITHUB#1007: Optimize IntersectVisitor#visit implementations for certain bulk-add cases.
(Greg Miller)
- LUCENE-10653: BlockMaxMaxscoreScorer uses heapify instead of individual adds.
(Greg Miller)
- Changes in runtime behavior (1)
- GITHUB#978: IndexWriter diagnostics written to index only contain java's runtime version
and vendor.
(Uwe Schindler)
- Bug Fixes (13)
- LUCENE-10574: Prevent pathological O(N^2) merging.
(Adrien Grand)
- LUCENE-10584: Properly support #getSpecificValue for hierarchical dims in SSDV faceting.
(Greg Miller)
- LUCENE-10582: Fix merging of overridden CollectionStatistics in CombinedFieldQuery
(Yannick Welsch)
- LUCENE-10563: Fix failure to tessellate complex polygon
(Craig Taverner)
- LUCENE-10605: Fix error in 32bit jvm object alignment gap calculation
(Sun Wuqiang)
- GITHUB#956: Make sure KnnVectorQuery applies search boost.
(Julie Tibshirani)
- LUCENE-10598: SortedSetDocValues#docValueCount() should be always greater than zero.
(Lu Xugang)
- LUCENE-10600: SortedSetDocValues#docValueCount should be an int, not long
(Lu Xugang)
- LUCENE-10611: Fix failure when KnnVectorQuery has very selective filter
(Kaival Parikh)
- LUCENE-10607: Fix potential integer overflow in maxArcs computations
(Tang Donghai)
- GITHUB#986: Fix FieldExistsQuery rewrite when all docs have vectors.
(Julie Tibshirani)
- LUCENE-10623: Fix incorrect implementation of docValueCount for SortingSortedSetDocValues
(Lu Xugang)
- GITHUB#1028: Fix error in TieredMergePolicy
(Lin Jian)
- Other (4)
- GITHUB#991: Update randomizedtesting to 2.8.0, hppc to 0.9.1, morfologik to 2.1.9.
(Dawid Weiss)
- LUCENE-10370: pass proper classpath/module arguments for forking jvms from within tests.
(Dawid Weiss)
- LUCENE-10604: Improve ability to test and debug triangulation algorithm in Tessellator.
(Craig Taverner)
- GITHUB#922: Remove unused and confusing FacetField indexing options
(Gautam Worah)
- Build (1)
- GITHUB#976: Exclude Lucene's own JAR files from classpath entries in Eclipse config.
(Uwe Schindler)
- API Changes (3)
- LUCENE-10325: Facets API extended to support getTopFacets.
(Yuting Gan)
- LUCENE-10482: Allow users to create their own DirectoryTaxonomyReaders with empty taxoArrays instead of letting the
taxoEpoch decide. Add a test case that demonstrates the inconsistencies caused when you reuse taxoArrays on older
checkpoints.
(Gautam Worah)
- LUCENE-10558: Add new constructors to Kuromoji and Nori dictionary classes to support classpath /
module system usage. It is now possible to use JDK's Class/ClassLoader/Module#getResource(...) apis
and pass their returned URL to dictionary constructors to load resources from Classpath or Module
resources.
(Uwe Schindler, Tomoko Uchida, Mike Sokolov)
- New Features (6)
- LUCENE-10312: Add PersianStemmer based on the Arabic stemmer.
(Ramin Alirezaee)
- LUCENE-10539: Return a stream of completions from FSTCompletion.
(Dawid Weiss)
- LUCENE-10385: Implement Weight#count on IndexSortSortedNumericDocValuesRangeQuery
to speed up computing the number of hits when possible.
(Lu Xugang, Luca Cavanna, Adrien Grand)
- LUCENE-10422: Monitor Improvements: `Monitor` can use a custom `Directory`
implementation. `Monitor` can be created with a readonly `QueryIndex` in order to
have readonly `Monitor` instances.
(Niko Usai)
- LUCENE-10456: Implement rewrite and Weight#count for MultiRangeQuery
by merging overlapping ranges.
(Jianping Weng)
- LUCENE-10444: Support alternate aggregation functions in association facets.
(Greg Miller)
- Improvements (6)
- LUCENE-10229: return -1 for unknown offsets in ExtendedIntervalsSource. Modify highlighting to
work properly with or without offsets.
(Dawid Weiss)
- LUCENE-10494: Implement method to bulk add all collection elements to a PriorityQueue.
(Bauyrzhan Sakhariyev)
- LUCENE-10484: Add support for concurrent random sampling by calling
RandomSamplingFacetsCollector#createManager.
(Luca Cavanna)
- LUCENE-10467: Throws IllegalArgumentException for Facets#getAllDims and Facets#getTopChildren
if topN <= 0.
(Yuting Gan)
- LUCENE-9848: Correctly sort HNSW graph neighbors when applying diversity criterion
(Mayya Sharipova, Michael Sokolov)
- LUCENE-10527: Use 2*maxConn for the last layer in HNSW
(Mayya Sharipova)
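The difference between per-element insertion and the bulk add described in LUCENE-10494 can be sketched with the JDK's java.util.PriorityQueue as a stand-in. This is an analogy only: Lucene's org.apache.lucene.util.PriorityQueue is a different class, and the BulkHeap helper below is hypothetical.

```java
import java.util.List;
import java.util.PriorityQueue;

// Uses java.util.PriorityQueue as a stand-in; Lucene's own
// org.apache.lucene.util.PriorityQueue is a different class.
final class BulkHeap {
  private BulkHeap() {}

  // One insert per element: each offer() restores the heap property
  // for that single element.
  static PriorityQueue<Integer> oneByOne(List<Integer> items) {
    PriorityQueue<Integer> pq = new PriorityQueue<>();
    for (Integer item : items) {
      pq.offer(item);
    }
    return pq;
  }

  // Bulk construction: the collection constructor receives every element
  // up front and can establish heap order for the whole array at once.
  static PriorityQueue<Integer> bulk(List<Integer> items) {
    return new PriorityQueue<>(items);
  }
}
```

Both queues return elements in the same order; the bulk form simply gives the implementation the chance to build the heap in a single pass.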
- Optimizations (16)
- LUCENE-10555: avoid NumericLeafComparator#iteratorCost repeated initialization
when NumericLeafComparator#setScorer is called.
(Jianping Weng)
- LUCENE-10452: Hunspell: call checkCanceled less frequently to reduce the overhead
(Peter Gromov)
- LUCENE-10451: Hunspell: don't perform potentially expensive spellchecking after timeout
(Peter Gromov)
- LUCENE-10418: More `Query#rewrite` optimizations for the non-scoring case.
(Adrien Grand)
- LUCENE-10436: Deprecate DocValuesFieldExistsQuery, NormsFieldExistsQuery and KnnVectorFieldExistsQuery
in favor of FieldExistsQuery.
(Zach Chen, Michael McCandless, Adrien Grand)
- LUCENE-10481: FacetsCollector will not request scores if it does not use them.
(Mike Drob)
- LUCENE-10503: Potential speedup for pure disjunctions whose clauses produce
scores that are very close to each other.
(Adrien Grand)
- LUCENE-10315: Use SIMD instructions to decode BKD doc IDs.
(Guo Feng, Adrien Grand, Ignacio Vera)
- LUCENE-8836: Speed up calls to TermsEnum#lookupOrd on doc values terms enums
and sequences of increasing ords.
(Bruno Roustant, Adrien Grand)
- LUCENE-10536: Doc values terms dictionaries now use the first (uncompressed)
term of each block as a dictionary when compressing suffixes of the other 63
terms of the block.
(Adrien Grand)
- LUCENE-10411: Add nearest neighbors vectors support to ExitableDirectoryReader.
(Zach Chen, Adrien Grand, Julie Tibshirani, Tomoko Uchida)
- LUCENE-10542: FieldSource exists implementations can avoid value retrieval
(Kevin Risden)
- LUCENE-10534: MinFloatFunction / MaxFloatFunction exists check can be slow
(Kevin Risden)
- LUCENE-10496: Queries sorted by field now better handle the degenerate case
when the search order and the index order are in opposite directions.
(Jianping Weng)
- LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle
ordToDoc in HNSW vectors
(Lu Xugang)
- LUCENE-10488: Facets#getTopDims optimized for taxonomy faceting and
ConcurrentSortedSetDocValuesFacetCounts.
(Yuting Gan)
- Bug Fixes (13)
- LUCENE-10477: Highlighter: WeightedSpanTermExtractor.extractWeightedSpanTerms to Query#rewrite
multiple times if necessary.
(Christine Poerschke, Adrien Grand)
- LUCENE-10491: A correctness bug in the way scores are provided within TaxonomyFacetSumValueSource
was fixed.
(Michael McCandless, Greg Miller)
- LUCENE-10466: Ensure IndexSortSortedNumericDocValuesRangeQuery handles sort field
types besides LONG
(Andriy Redko)
- LUCENE-10292: Suggest: Fix AnalyzingInfixSuggester / BlendedInfixSuggester to correctly return
existing lookup() results during concurrent build(). Fix other FST based suggesters so that
getCount() returned results consistent with lookup() during concurrent build().
(hossman)
- LUCENE-10508: Fixes some edge cases where GeoArea instances were built in a way that vertical planes
could not evaluate their sign, either because the planes were the same or the center between those
planes was lying in one of the planes.
(Ignacio Vera)
- LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets.
(Yuting Gan)
- LUCENE-10533: SpellChecker.formGrams is missing bounds check
(Kevin Risden)
- LUCENE-10529: Properly handle when TestTaxonomyFacetAssociations test case randomly indexes
no documents instead of throwing an NPE.
(Greg Miller)
- LUCENE-10470: Check if polygon has been successfully tessellated before we fail (we are failing some valid
tessellations) and allow filtering edges that fold on top of the previous one.
(Ignacio Vera)
- LUCENE-10530: Avoid floating point precision test case bug in TestTaxonomyFacetAssociations.
(Greg Miller)
- LUCENE-10552: KnnVectorQuery has incorrect equals/hashCode.
(Lu Xugang)
- LUCENE-10558: Restore behaviour of deprecated Kuromoji and Nori dictionary constructors for
custom dictionary support. Please also use new URL-based constructors for classpath/module
system resources.
(Uwe Schindler, Tomoko Uchida, Mike Sokolov)
- LUCENE-10564: Make sure SparseFixedBitSet#or updates ramBytesUsed.
(Julie Tibshirani)
- Build (3)
- GITHUB#768: Upgrade forbiddenapis to version 3.3.
(Uwe Schindler)
- GITHUB#890: Detect CI builds on Github or Jenkins and enable errorprone.
(Uwe Schindler, Dawid Weiss)
- LUCENE-10532: Remove LuceneTestCase.Slow annotation. All tests can be fast.
(Robert Muir)
- Other (4)
- LUCENE-10526: Test-framework: Add FilterFileSystemProvider.wrapPath(Path) method for mock filesystems
to override if they need to extend the Path implementation.
(Gautam Worah, Robert Muir)
- LUCENE-10525: Test-framework: Add detection of illegal windows filenames to WindowsFS.
(Gautam Worah)
- LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255.
(Robert Muir, Uwe Schindler, Tomoko Uchida, Dawid Weiss)
- GITHUB#854: Allow to link to GitHub pull request from CHANGES.
(Tomoko Uchida, Jan Høydahl)
- API Changes (16)
- LUCENE-10244: MultiCollector::getCollectors is now public, allowing users to access the wrapped
collectors.
(Andriy Redko)
- LUCENE-10197: UnifiedHighlighter now has a Builder to construct it. The UH's setters are now
deprecated.
(Animesh Pandey, David Smiley)
- LUCENE-10301: the test framework is now a module. All the classes have been moved from
org.apache.lucene.* to org.apache.lucene.tests.* to avoid package name conflicts with the
core module.
(Dawid Weiss)
- LUCENE-10183: KnnVectorsWriter#writeField to take KnnVectorsReader instead of VectorValues.
(Zach Chen, Michael Sokolov, Julie Tibshirani, Adrien Grand)
- LUCENE-10335: Deprecate helper methods for resource loading in IOUtils and StopwordAnalyzerBase
that are not compatible with module system (Class#getResourceAsStream() and Class#getResource()
are caller sensitive in Java 11). Instead add utility method IOUtils#requireResourceNonNull(T)
to test existence of resource based on null return value.
(Uwe Schindler, Dawid Weiss)
- LUCENE-10349: WordListLoader methods now return unmodifiable CharArraySets.
(Uwe Schindler)
- LUCENE-10377: SortField.getComparator() has changed signature. The second parameter is now
a boolean indicating whether or not skipping should be enabled on the comparator.
(Alan Woodward)
- LUCENE-10381: Require users to provide FacetsConfig for SSDV faceting.
(Greg Miller)
- LUCENE-10368: IntTaxonomyFacets has been deprecated and is no longer a supported extension point
for user-created faceting implementations.
(Greg Miller)
- LUCENE-10400: Add constructors that take external resource Paths to dictionary classes in Kuromoji and Nori:
ConnectionCosts, TokenInfoDictionary, and UnknownDictionary. Old constructors that take resource scheme and
resource path in those classes are deprecated; they are replaced by the new constructors and are planned to be
removed in a future release.
(Tomoko Uchida, Uwe Schindler, Mike Sokolov)
- LUCENE-10050: Deprecate DrillSideways#search(Query, Collector) in favor of
DrillSideways#search(Query, CollectorManager). This reflects the change (LUCENE-10002) being made in
IndexSearcher#search that trends towards using CollectorManagers over Collectors.
(Gautam Worah)
- LUCENE-10420: Move functional interfaces in IOUtils to top-level interfaces.
(David Smiley, Uwe Schindler, Dawid Weiss, Tomoko Uchida)
- LUCENE-10398: Add static method for getting Terms from LeafReader.
(Spike Liu)
- LUCENE-10440: TaxonomyFacets and FloatTaxonomyFacets have been deprecated and are no longer
supported extension points for user-created faceting implementations.
(Greg Miller)
- LUCENE-10431: MultiTermQuery.setRewriteMethod() has been deprecated, and constructor
parameters for the various implementations added.
(Alan Woodward)
- LUCENE-10171: OpenNLPOpsFactory.getLemmatizerDictionary(String, ResourceLoader) now returns a
DictionaryLemmatizer object instead of a raw String serialization of the dictionary.
(Spyros Kapnissis via Michael Gibney, Alessandro Benedetti)
- New Features (19)
- LUCENE-10255: Lucene JARs are now proper modules, with module descriptors and dependency information.
(Chris Hegarty, Uwe Schindler, Tomoko Uchida, Dawid Weiss)
- LUCENE-10342: Lucene Core now depends on java.logging (JUL) module and reports
if MMapDirectory cannot unmap mapped ByteBuffers or RamUsageEstimator's object size
calculations may be off. This was added especially for users running Lucene with the
Java Module System where some optional features are not available by default or supported.
For all apps using Lucene it is strongly recommended to explicitly require non-standard
JDK modules: jdk.unsupported (unmapping) and jdk.management (OOP size for RAM usage calculations).
It is also recommended to install JUL logging adapters to feed the log events into your app's
logging system.
(Uwe Schindler, Dawid Weiss, Tomoko Uchida, Robert Muir)
- LUCENE-10330: Make MMapDirectory tests fail by default, if unmapping does not work.
(Uwe Schindler, Dawid Weiss)
- LUCENE-10223: Add interval function support to StandardQueryParser. Add min-should-match operator
support to StandardQueryParser. Update and clean up package documentation in flexible query parser
module.
(Dawid Weiss, Alan Woodward)
- LUCENE-10220: Add a utility method to get IntervalSource from analyzed text (or token stream).
(Uwe Schindler, Dawid Weiss, Alan Woodward)
- LUCENE-10085: Added Weight#count on DocValuesFieldExistsQuery to speed up the query if terms or
points are indexed.
(Quentin Pradet, Adrien Grand)
- LUCENE-10263: Added Weight#count to NormsFieldExistsQuery to speed up the query if all
documents have the field.
(Alan Woodward)
- LUCENE-10248: Add SpanishPluralStemFilter, for precise stemming of Spanish plurals.
For more information, see https://s.apache.org/spanishplural
(Xavier Sanchez Loro)
- LUCENE-10243: StandardTokenizer, UAX29URLEmailTokenizer, and HTMLStripCharFilter have
been upgraded to Unicode 12.1
(Robert Muir)
- LUCENE-10335: Add ModuleResourceLoader as complement to ClasspathResourceLoader.
(Uwe Schindler)
- LUCENE-10245: MultiDoubleValues(Source) and MultiLongValues(Source) were added as multi-valued
versions of DoubleValues(Source) and LongValues(Source) to the facets module. LongValueFacetCounts,
LongRangeFacetCounts and DoubleRangeFacetCounts were augmented to support these new multi-valued
abstractions. DoubleRange and LongRange also support creating queries from these multi-valued
sources.
(Greg Miller)
- LUCENE-10250: Add support for arbitrary length hierarchical SSDV facets.
(Marc D'Mello)
- LUCENE-10395: Add support for TotalHitCountCollectorManager, a collector manager
based on TotalHitCountCollector that allows users to parallelize counting the
number of hits.
(Luca Cavanna, Adrien Grand)
- LUCENE-10403: Add ArrayUtil#grow(T[]).
(Greg Miller)
- LUCENE-10414: Add fn:fuzzyTerm interval function to flexible query parser
(Dawid Weiss, Alan Woodward)
- LUCENE-10378: Implement Weight#count for PointRangeQuery to provide a faster way to calculate
the number of matching range docs when each doc has at-most one point and the points are 1-dimensional.
(Gautam Worah, Ignacio Vera, Adrien Grand)
- LUCENE-10415: FunctionScoreQuery and IndexOrDocValuesQuery delegate Weight#count.
(Ignacio Vera)
- LUCENE-10382: Add support for filtering in KnnVectorQuery. This allows for finding the
nearest k documents that also match a query.
(Julie Tibshirani, Joel Bernstein)
- LUCENE-10237: Add MergeOnFlushMergePolicy to sandbox.
(Michael Froh, Anand Kotriwal)
- Improvements (9)
- LUCENE-10313: Use java.util.logging in Luke. Add dynamic log filtering. Drop
the persistent log previously written to ~/.luke.d/luke.log. Configure Java's default
logging handlers to persist Luke logs according to your needs.
(Tomoko Uchida, Dawid Weiss)
- LUCENE-10238: Upgrade icu4j dependency to 70.1.
(Dawid Weiss)
- LUCENE-9820: Extract BKD tree interface and move intersecting logic to the
PointValues abstract class.
(Ignacio Vera, Adrien Grand)
- LUCENE-10262: Lift up restrictions for navigating PointValues#PointTree
added in LUCENE-9820
(Ignacio Vera)
- LUCENE-9538: Detect polygon self-intersections in the Tessellator.
(Ignacio Vera)
- LUCENE-10275: Speed up MultiRangeQuery by using an interval tree.
(Ignacio Vera)
- LUCENE-10229: Unify behaviour of match offsets for interval queries on fields
with or without offsets enabled.
(Patrick Zhai)
- LUCENE-10054: Make HnswGraph hierarchical
(Mayya Sharipova, Julie Tibshirani, Mike Sokolov, Adrien Grand)
- LUCENE-10371: Make IndexRearranger able to arrange segment in a determined order.
(Patrick Zhai)
- Optimizations (20)
- LUCENE-10329: Use computed block mask for DirectMonotonicReader#get.
(Guo Feng)
- LUCENE-10280: Optimize BKD leaves' doc IDs codec when they are continuous.
(Guo Feng)
- LUCENE-10233: Store BKD leaves' doc IDs as bitset in some cases (typically for low cardinality fields
or sorted indices) to speed up addAll.
(Guo Feng, Adrien Grand)
- LUCENE-10225: Improve IntroSelector with 3-ways partitioning.
(Bruno Roustant, Adrien Grand)
- LUCENE-10321: Tweak MultiRangeQuery interval tree creation to skip "pulling up" mins.
(Greg Miller)
- LUCENE-10252: ValueSource.asDoubleValues and asLongValues should not compute the score unless
asked to -- typically never. This fixes a performance regression since 7.3 LUCENE-8099 when some
older boosting queries were replaced with this.
(David Smiley)
- LUCENE-10346: Optimize facet counting for single-valued TaxonomyFacetCounts.
(Guo Feng)
- LUCENE-10356: Further optimize facet counting for single-valued TaxonomyFacetCounts.
(Greg Miller)
- LUCENE-10379: Count directly into the dense values array in FastTaxonomyFacetCounts#countAll.
(Guo Feng, Greg Miller)
- LUCENE-10375: Speed up HNSW vectors merge by first writing combined vector
data to a file.
(Julie Tibshirani, Adrien Grand)
- LUCENE-10388: Remove MultiLevelSkipListReader#SkipBuffer to make JVM less confused.
(Guo Feng)
- LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of
matching clauses is a constant.
(LuYunCheng via Adrien Grand)
- LUCENE-10412: More `Query#rewrite` optimizations for MatchNoDocsQuery.
(Adrien Grand)
- LUCENE-10408: Better encoding of doc Ids in vectors.
(Mayya Sharipova, Julie Tibshirani, Adrien Grand)
- LUCENE-10424, LUCENE-10439: Optimize the "everything matches" case for count query in PointRangeQuery.
(Ignacio Vera, Lu Xugang)
- LUCENE-10084, LUCENE-10435: Rewrite DocValuesFieldExistsQuery to MatchAllDocsQuery whenever
terms or points have a docCount that is equal to maxDoc.
(Vigya Sharma, Lu Xugang)
- LUCENE-10442: When indexQuery and/or dvQuery is a MatchAllDocsQuery,
IndexOrDocValuesQuery is rewritten to MatchAllDocsQuery.
(Lu Xugang)
- LUCENE-10450: IndexSortSortedNumericDocValuesRangeQuery can now be rewritten to MatchAllDocsQuery.
(Lu Xugang)
- LUCENE-10453: Indexing and search speedup with KNN vectors when using
euclidean distance.
(Adrien Grand)
- LUCENE-10455: IndexSortSortedNumericDocValuesRangeQuery now implements the scorerSupplier API.
(Lu Xugang)
- Changes in runtime behavior (2)
- LUCENE-10291: Lucene now only writes files for terms and postings if at least
one field is indexed with postings.
(Yannick Welsch)
- LUCENE-10311: FixedBitSet#approximateCardinality now trades accuracy for
speed instead of delegating to FixedBitSet#cardinality.
(Robert Muir, Adrien Grand)
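The accuracy-for-speed trade-off described in LUCENE-10311 can be illustrated with a sampling-based bit count. This is a hypothetical sketch, not the FixedBitSet implementation: the class name, the stride parameter and the scaling formula are assumptions chosen for the example.

```java
// Hypothetical sketch, not Lucene's FixedBitSet: estimates the number of
// set bits by counting only a sample of the backing words.
final class SampledCardinality {
  private SampledCardinality() {}

  // Exact count: visits every word.
  static long exact(long[] words) {
    long total = 0;
    for (long word : words) {
      total += Long.bitCount(word);
    }
    return total;
  }

  // Approximate count: visits every stride-th word and scales the result.
  // The stride is an illustrative parameter, not a value taken from Lucene.
  static long approximate(long[] words, int stride) {
    if (stride <= 1 || words.length <= stride) {
      return exact(words); // small inputs: sampling saves nothing
    }
    long sampledBits = 0;
    int sampledWords = 0;
    for (int i = 0; i < words.length; i += stride) {
      sampledBits += Long.bitCount(words[i]);
      sampledWords++;
    }
    // Scale the sampled density up to the full word count.
    return sampledBits * words.length / sampledWords;
  }
}
```

The estimate is exact for uniformly filled bit sets but can be far off when set bits correlate with the sampling pattern, so callers that need a precise count should keep using the exact method.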
- Bug Fixes (16)
- LUCENE-10316: fix TestLRUQueryCache.testCachingAccountableQuery failure.
(Patrick Zhai)
- LUCENE-10279: Fix equals in MultiRangeQuery.
(Ignacio Vera)
- LUCENE-10349: Fix all analyzers to behave according to their documentation:
getDefaultStopSet() methods now return unmodifiable CharArraySets.
(Uwe Schindler)
- LUCENE-10352: Add missing service provider entries: KoreanNumberFilterFactory,
DaitchMokotoffSoundexFilterFactory
(Uwe Schindler, Robert Muir)
- LUCENE-10352: Fixed ctor argument checks: JapaneseKatakanaStemFilter,
DoubleMetaphoneFilter
(Uwe Schindler, Robert Muir)
- LUCENE-10236: Stop duplicating norms when scoring in CombinedFieldQuery.
(Zach Chen, Jim Ferenczi, Julie Tibshirani)
- LUCENE-10353: Add random null injection to TestRandomChains.
(Robert Muir, Uwe Schindler)
- LUCENE-10377: CheckIndex could incorrectly throw an error when checking index sorts
defined on older indexes.
(Alan Woodward)
- LUCENE-9952: Address inaccurate dim counts for SSDV faceting in cases where a dim is configured
as multi-valued.
(Greg Miller)
- LUCENE-10401: Fix lookups on empty doc-value terms dictionaries to no longer
throw an ArrayIndexOutOfBoundsException.
(Adrien Grand)
- LUCENE-10402: Prefix intervals should declare their automaton as binary, otherwise prefixes
containing multibyte characters will not correctly match.
(Alan Woodward)
- LUCENE-10407: Containing intervals could sometimes yield incorrect matches when wrapped
in a disjunction.
(Alan Woodward, Dawid Weiss)
- LUCENE-10405: When using the MemoryIndex, binary and Sorted doc values are stored
as BytesRef instead of BytesRefHash so they don't have a limit on size.
(Ignacio Vera)
- LUCENE-10428: Queries with a misbehaving score function may no longer cause
infinite loops in their parent BooleanQuery.
(Ankit Jain, Daniel Doubrovkine, Adrien Grand)
- LUCENE-10431: MultiTermQuery no longer includes its rewrite method in its hashcode
calculation, as this could cause problems with wrapper queries like BooleanQuery which
expect their child queries hashcodes to be stable.
(Alan Woodward)
- LUCENE-10469: Fix ScoreMode propagation by ConstantScoreQuery.
(Adrien Grand)
- Other (7)
- LUCENE-10273: Deprecate SpanishMinimalStemFilter in favor of SpanishPluralStemFilter.
(Robert Muir)
- LUCENE-10284: Upgrade morfologik-stemming to 2.1.8.
(Dawid Weiss)
- LUCENE-10310: TestXYDocValuesQueries#doRandomDistanceTest no longer produces random circles
with a radius of 0.
- LUCENE-10352: Removed duplicate instances of StringMockResourceLoader and migrated class to
test-framework.
(Uwe Schindler, Robert Muir)
- LUCENE-10352: Convert TestAllAnalyzersHaveFactories and TestRandomChains to a global integration test
and discover classes to check from module system. The test now checks all analyzer modules,
so it may discover new bugs outside of analysis:common module.
(Uwe Schindler, Robert Muir)
- LUCENE-10413: Make Ukrainian default stop words list available as a public getter.
(Alan Woodward)
- LUCENE-10437: Polygon tessellator throws a more informative error message when the provided polygon
does not contain enough non-collinear points.
(Ignacio Vera)
- New Features (8)
- LUCENE-9322, LUCENE-9855: Vector-valued fields, Lucene90 Codec
(Mike Sokolov, Julie Tibshirani, Tomoko Uchida)
- LUCENE-9004, LUCENE-10040: Approximate nearest vector search via NSW graphs
(Mike Sokolov, Tomoko Uchida et al.)
- LUCENE-9659: SpanPayloadCheckQuery now supports inequalities.
(Kevin Watters, Gus Heck)
- LUCENE-9589: Swedish Minimal Stemmer
(janhoy)
- LUCENE-9313: Add SerbianAnalyzer based on the snowball stemmer.
(Dragan Ivanovic)
- LUCENE-10095: Add NepaliAnalyzer based on the snowball stemmer.
(Robert Muir)
- LUCENE-10096: Add TamilAnalyzer based on the snowball stemmer.
(Robert Muir)
- LUCENE-10102: Add JapaneseCompletionFilter for Input Method-aware auto-completion
(Tomoko Uchida, Robert Muir, Jun Ohtani)
- System Requirements (1)
- LUCENE-8738: Move to Java 11 as minimum Java version.
(Adrien Grand, Uwe Schindler)
- API Changes (44)
- LUCENE-8638: Remove many deprecated methods and classes including FST.lookupByOutput(),
LegacyBM25Similarity and Jaspell suggester.
- LUCENE-8982: Separate out native code to another module to allow cpp
build with gradle. This also changes the name of the native "posix-support"
library to LuceneNativeIO.
(Zachary Chen, Dawid Weiss)
- LUCENE-9562: All binary analysis packages (and corresponding
Maven artifacts) with names containing '-analyzers-' have been renamed
to '-analysis-'.
(Dawid Weiss)
- LUCENE-8474: RAMDirectory and associated deprecated classes have been
removed.
(Dawid Weiss)
- LUCENE-3041: The deprecated Weight#extractTerms() method has been
removed
(Alan Woodward, Simon Willnauer, David Smiley, Luca Cavanna)
- LUCENE-8805: StoredFieldVisitor#stringField now takes a String rather than a
byte[] that stores the UTF-8 bytes of the stored string.
(Namgyu Kim via Adrien Grand)
- LUCENE-8811: BooleanQuery#setMaxClauseCount() and #getMaxClauseCount() have
moved to IndexSearcher. The checks are now implemented using a QueryVisitor
and apply to all queries, rather than only booleans.
(Atri Sharma, Adrien Grand, Alan Woodward)
- LUCENE-8909: The deprecated IndexWriter#getFieldNames() method has been removed.
(Adrien Grand, Munendra S N)
- LUCENE-8948: Change "name" argument in ICU factories to "form". Here, "form" is
named after "Unicode Normalization Form".
(Tomoko Uchida)
- LUCENE-8933: Validate JapaneseTokenizer user dictionary entry.
(Tomoko Uchida)
- LUCENE-8905: Better defence against malformed arguments in TopDocsCollector
(Atri Sharma)
- LUCENE-9089: FST Builder renamed FSTCompiler with fluent-style Builder.
(Bruno Roustant)
- LUCENE-9212: Deprecated Intervals.multiterm() methods that take a bare Automaton
have been removed
(Alan Woodward)
- LUCENE-9264: SimpleFSDirectory has been removed in favor of NIOFSDirectory.
(Yannick Welsch)
- LUCENE-9281: Use java.util.ServiceLoader to load codec components and analysis
factories to be compatible with Java Module System. This allows factories to be loaded
without META-INF/services from a Java module exposing the factory in the module
descriptor. This breaks backwards compatibility as custom analysis factories
must now also implement the default constructor (see MIGRATE.md).
(Uwe Schindler, Dawid Weiss)
- LUCENE-9307: BufferedIndexInput#setBufferSize has been removed.
(Adrien Grand)
- LUCENE-9340: SimpleBindings#add(SortField) has been removed.
(Alan Woodward)
- LUCENE-9462: Fields without positions should still return MatchIterator.
(Alan Woodward, Dawid Weiss)
- LUCENE-9516: Removed the ability to replace the IndexingChain / DocConsumer
in Lucene's IndexWriter. The interface is not sufficient to efficiently
replace the functionality with reasonable effort.
(Simon Willnauer)
- LUCENE-9317, LUCENE-9318, LUCENE-9319, LUCENE-9558, LUCENE-9600: Clean up package name conflicts
between modules. See MIGRATE.md for details.
(David Ryan, Tomoko Uchida, Uwe Schindler, Dawid Weiss)
- LUCENE-9646: Set BM25Similarity discountOverlaps via the constructor
(Patrick Marty via Bruno Roustant)
- LUCENE-9480: Make DataInput's skipBytes(long) abstract as the implementation was not performant.
IndexInput's api is unaffected: skipBytes() is implemented via seek().
(Greg Miller)
- LUCENE-9796: SortedDocValues no longer extends BinaryDocValues, as binaryValue() was not performant.
See MIGRATE.md for details.
(Robert Muir)
- LUCENE-9853: JapaneseAnalyzer should use CJKWidthCharFilter for full-width and half-width character normalization.
(Tomoko Uchida)
- LUCENE-9387: Removed CodecReader#ramBytesUsed.
(Adrien Grand)
- LUCENE-9334: Require consistency between data-structures on a per-field basis.
A field across all documents within an index must be indexed with the same index
options and data-structures. As a consequence of this, doc values updates are
only applicable for fields that are indexed with doc values only.
(Mayya Sharipova, Adrien Grand, Simon Willnauer)
- LUCENE-9047: Directory API is now little endian.
(Ignacio Vera, Adrien Grand)
- LUCENE-9948: No longer require the user to specify whether-or-not a field is multi-valued in
LongValueFacetCounts (detect automatically based on what is indexed).
(Greg Miller)
- LUCENE-9843: Remove compression option on default codec's docvalues.
(Jack Conradson)
- LUCENE-9204: SpanQuery and its subclasses have been moved from core/ into the
queries/ module.
(Alan Woodward)
- LUCENE-9454: Analyzer no longer has a mutable version field.
(Alan Woodward)
- LUCENE-9956: Expose the getBaseQuery, getDrillDownQueries APIs from DrillDownQuery
(Gautam Worah)
- LUCENE-8143: SpanBoostQuery has been removed.
(Alan Woodward)
- LUCENE-9998: Remove unused parameter fis in StoredFieldsWriter.finish() and TermVectorsWriter.finish(),
including those subclasses.
(kkewwei)
- LUCENE-7020: TieredMergePolicy#setMaxMergeAtOnceExplicit has been removed.
TieredMergePolicy no longer sets a limit on the maximum number of segments
that can be merged at once via a forced merge.
(Adrien Grand, Shawn Heisey)
- LUCENE-10027: Directory reader open API from indexCommit and leafSorter has been modified
to add an extra parameter - minSupportedMajorVersion.
(Mayya Sharipova)
- LUCENE-9620: Added a (sometimes) faster implementation for IndexSearcher#count that relies on the new Weight#count API.
The Weight#count API represents a cleaner way for Query classes to optimize their counting method.
(Gautam Worah, Adrien Grand)
- LUCENE-10089: Add a method to SortField that allows enabling or disabling the numeric sort
optimization that uses the points index to skip over non-competitive documents,
which is enabled by default from 9.0.
(Mayya Sharipova, Adrien Grand)
- LUCENE-10115: Add an extension point, BaseQueryParser#getFuzzyDistance, to allow custom
query parsers to determine the similarity distance for fuzzy queries.
(Chris Hegarty)
- LUCENE-10132: Support addition of diagnostics by custom merge policies
(Chris Hegarty)
- LUCENE-9325: Sort is now final, and the `setSort()` method has been removed
(Alan Woodward)
- LUCENE-9431: The UnifiedHighlighter's WEIGHT_MATCHES flag is now set by default, provided its
requirements are met. It can be disabled via over-riding getFlags
(Animesh Pandey, David Smiley)
- LUCENE-10158: Add a new interface Unwrappable to the utils package to allow code to
unwrap wrappers/delegators that are added by Lucene's testing framework. This will allow
testing new MMapDirectory implementation based on JDK Project Panama.
(Uwe Schindler)
- LUCENE-10260: LucenePackage class has been removed. The implementation string can be
retrieved from Version.getPackageImplementationVersion().
(Uwe Schindler, Dawid Weiss)
- Improvements (48)
- LUCENE-10234: Added Automatic-Module-Name to all JARs. This is the first step to enable full Java
module system (JMS) support in later Lucene versions. At the moment, the automatic names should
not be considered stable.
(Dawid Weiss, Uwe Schindler)
- LUCENE-10182: TestRamUsageEstimator used RamUsageTester.sizeOf throughout, making some of the
tests trivial. Now, it compares results from RamUsageEstimator with those from RamUsageTester.
To prevent this error in the future, RamUsageTester.sizeOf was renamed to ramUsed.
(Uwe Schindler, Dawid Weiss, Stefan Vodita)
- LUCENE-10129: RamUsageEstimator overloads the shallowSizeOf method for primitive arrays
to avoid falling back on shallowSizeOf(Object), which could lead to performance traps.
(Robert Muir, Uwe Schindler, Stefan Vodita)
- LUCENE-10139: ExternalRefSorter now returns a covariant subtype of BytesRefIterator
that is Closeable.
(Dawid Weiss)
- LUCENE-10135: Correct passage selector behavior for long matching snippets
(Dawid Weiss)
- LUCENE-9960: Avoid unnecessary top element replacement for equal elements in PriorityQueue.
(Dawid Weiss)
- LUCENE-9633: Improve match highlighter behavior for degenerate intervals (on non-existing positions).
(Dawid Weiss)
- LUCENE-9618: Do not call IntervalIterator.nextInterval after NO_MORE_DOCS is returned.
(Patrick Zhai)
- LUCENE-9576: Improve ConcurrentMergeScheduler settings by default, assuming modern I/O.
Previously Lucene was too conservative, jumping through hoops to detect if disks were SSD-backed.
In many common modern cases (VMs, RAID arrays, containers, encrypted mounts, non-Linux OS),
the pessimistic heuristics were wrong, resulting in slower indexing performance. Heuristics were
also complex and would trigger JDK issues even on unrelated mount points. Merge scheduler defaults
are now modernized and the heuristics removed. Users with spinning disks that want to maximize I/O
performance should tweak ConcurrentMergeScheduler.
(Robert Muir)
- LUCENE-9463: Query match region retrieval component, passage scoring and formatting
for building custom highlighters.
(Alan Woodward, Dawid Weiss)
- LUCENE-9370: RegExp query is no longer lenient about inappropriate backslashes and
follows the Java Pattern policy for rejecting illegal syntax.
(Mark Harwood)
- LUCENE-9336: RegExp query now supports \w \W \d \D \s \S expressions.
This is a break with previous behaviour where these were (mis)interpreted
as literally the characters w W d etc.
(Mark Harwood)
- LUCENE-8757: When provided with an ExecutorService to run queries across
multiple threads, IndexSearcher now groups small segments together, up to
250k docs per slice.
(Atri Sharma via Adrien Grand)
- LUCENE-8857: Introduce Custom Tiebreakers in TopDocs.merge for tie breaking on
docs on equal scores. Also, remove the ability of TopDocs.merge to set shard
indices
(Atri Sharma, Adrien Grand, Simon Willnauer)
- LUCENE-8958: Shared count early termination for relevance sorted indices
(Atri Sharma)
- LUCENE-8937: Avoid aggressive stemming on numbers in the FrenchMinimalStemmer.
(Adrien Gallou via Tomoko Uchida)
- LUCENE-8596: Kuromoji user dictionary now accepts entries containing hash mark (#) that were
previously treated as beginning a line-ending comment
(Satoshi Kato and Masaru Hasegawa via Michael Sokolov)
- LUCENE-9109: Use StackWalker to implement TestSecurityManager's detection
of JVM exit
(Uwe Schindler)
- LUCENE-9110: Refactor stack analysis in tests to use generalized LuceneTestCase
methods that use StackWalker
(Uwe Schindler)
- LUCENE-9206: IndexMergeTool gets additional options to control the merging.
This tool no longer forceMerge(1)s to a single segment by default. If you
rely upon this behavior, pass -max-segments 1 instead.
(Robert Muir)
- LUCENE-9220: Upgrade snowball to 2.0. New snowball stemmers: Hindi, Indonesian,
Nepali, Serbian, and Tamil. New stoplist: Indonesian. Adds gradle 'snowball'
task to regenerate and ease future upgrades.
(Robert Muir, Dawid Weiss)
- LUCENE-9354: Improvements to snowball french stopwords list, so that it is less
aggressive.
(Philippe Ouellet)
- LUCENE-9114: Improve ValueSourceScorer's Default Cost Implementation
(Atri Sharma, David Smiley)
- LUCENE-9074: Introduce Slice Executor For Dynamic Runtime Execution Of Slices
(Atri Sharma)
- LUCENE-9280: Add an ability for field comparators to skip non-competitive documents.
Creating a TopFieldCollector with totalHitsThreshold less than Integer.MAX_VALUE
instructs Lucene to skip non-competitive documents whenever possible. For numeric
sort fields the skipping functionality works when the same field is indexed both
with doc values and points. In this case, there is an assumption that the same data is
stored in these points and doc values
(Mayya Sharipova, Jim Ferenczi, Adrien Grand)
- LUCENE-9449: Enhance DocComparator to provide an iterator over competitive
documents when searching with "after". This iterator can quickly position
on the desired "after" document skipping all documents and segments before
"after". Also redesign numeric comparators to provide skipping functionality
by default.
(Mayya Sharipova, Jim Ferenczi)
- LUCENE-9527: Upgrade javacc to 7.0.4, regenerate query parsers.
(Dawid Weiss)
- LUCENE-9531: Consolidated CharStream and FastCharStream classes: these have been moved
from each query parser package to org.apache.lucene.queryparser.charstream
(Dawid Weiss)
- LUCENE-9450: Use BinaryDocValues for the taxonomy index instead of StoredFields.
Add backwards compatibility tests for the taxonomy index.
(Gautam Worah, Michael McCandless)
- LUCENE-9605: Update snowball to d8cf01ddf37a, adds Yiddish stemmer.
(Robert Muir)
- LUCENE-8982: Make NativeUnixDirectory pure java with FileChannel direct IO flag,
and rename to DirectIODirectory.
(Zach Chen, Uwe Schindler, Mike McCandless, Dawid Weiss)
- LUCENE-9674: Implement faster advance on VectorValues using binary search.
(Anand Kotriwal, Mike Sokolov)
- LUCENE-9794: Speed up implementations of DataInput.skipBytes().
(Greg Miller)
- LUCENE-9898: Removes no longer used scorePayload method from BM25Similarity
(Pieter van Boxtel)
- LUCENE-9850: Switch to PFOR encoding for doc IDs (instead of FOR).
(Greg Miller)
- LUCENE-9929: Add NorwegianNormalizationFilter, which does the same as ScandinavianNormalizationFilter except
it does not fold oo->ø and ao->å.
(janhoy, Robert Muir, Adrien Grand)
- LUCENE-9535: Improve DocumentsWriterPerThreadPool to prefer larger instances.
(Adrien Grand)
- LUCENE-10000: MultiCollectorManager now has parity with MultiCollector with respect to how it
handles CollectionTerminationException and setMinCompetitiveScore calls.
(Greg Miller)
- LUCENE-10019: Align file starts in CFS files to have proper alignment (8 bytes)
(Uwe Schindler)
- LUCENE-9662: Make CheckIndex concurrent by parallelizing index check across segments.
(Zach Chen, Mike McCandless, Dawid Weiss, Robert Muir)
- LUCENE-9476: Add new getBulkPath API to DirectoryTaxonomyReader to more efficiently retrieve FacetLabels for multiple
facet ordinals at once. This API is 2-4% faster than iteratively calling getPath.
The getPath API now throws an IAE instead of returning null if the ordinal is out of bounds.
(Gautam Worah, Mike McCandless)
- LUCENE-10113: Use VarHandles to access int/long/short primitive types in byte arrays.
This improves readability and performance of encoding/decoding of primitives to index
file format in input/output classes like DataInput / DataOutput and codecs.
(Uwe Schindler, Robert Muir)
- LUCENE-10112: Improve LZ4 Compression performance with direct primitive read/writes.
(Tim Brooks, Uwe Schindler, Robert Muir, Adrien Grand)
- LUCENE-10125: Optimize primitive writes in OutputStreamIndexOutput.
(Uwe Schindler, Robert Muir, Adrien Grand)
- LUCENE-10143: Delegate primitive writes in RateLimitedIndexOutput.
(Uwe Schindler, Robert Muir, Adrien Grand)
- LUCENE-10145, LUCENE-10153: Faster flushes and merges of points by leveraging
VarHandles.
(Adrien Grand)
- LUCENE-10201: Spatial-Extras: Upgrade Spatial4j to 0.8, which improves a variety of minor things.
See the release notes: https://github.com/locationtech/spatial4j/releases/tag/spatial4j-0.8
(David Smiley)
- LUCENE-10062: Switch taxonomy faceting to use numeric doc values for storing ordinals instead of binary doc values
with its own custom encoding.
(Greg Miller)
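As a rough illustration of the technique named in LUCENE-10113 above, the following sketch reads and writes primitives in a byte array through JDK VarHandles. It uses only `java.lang.invoke` and is not Lucene's code: the class and method names are hypothetical, and little-endian order is chosen here to match the LUCENE-9047 entry.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

/** Hypothetical helper, for illustration only; not part of Lucene. */
final class VarHandleBytesSketch {
  // Views over a byte[] that read/write whole ints and longs in little-endian order.
  private static final VarHandle INT_LE =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);
  private static final VarHandle LONG_LE =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  static void writeInt(byte[] buf, int offset, int value) {
    INT_LE.set(buf, offset, value);
  }

  static int readInt(byte[] buf, int offset) {
    return (int) INT_LE.get(buf, offset);
  }

  static void writeLong(byte[] buf, int offset, long value) {
    LONG_LE.set(buf, offset, value);
  }

  static long readLong(byte[] buf, int offset) {
    return (long) LONG_LE.get(buf, offset);
  }

  public static void main(String[] args) {
    byte[] buf = new byte[12];
    writeInt(buf, 0, 0x01020304);
    writeLong(buf, 4, 42L);
    // The least significant byte is stored first in little-endian order.
    System.out.println("first byte: " + buf[0]);
    System.out.println("int round trip ok: " + (readInt(buf, 0) == 0x01020304));
    System.out.println("long read back: " + readLong(buf, 4));
  }
}
```

Compared with assembling values from individual bytes with shifts and masks, a single VarHandle call states the intent directly and leaves the byte-order handling to the JDK.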
- Bug fixes (15)
- LUCENE-9686: Fix read past EOF handling in DirectIODirectory.
(Zach Chen, Julie Tibshirani)
- LUCENE-8663: NRTCachingDirectory.slowFileExists may open a file while
it's inaccessible.
(Dawid Weiss)
- LUCENE-9117: RamUsageEstimator hangs with AOT compilation. Removed any attempt to
estimate Long.valueOf cache size.
(Cleber Muramoto, Dawid Weiss)
- LUCENE-9290: Don't assume that different XYPoint have different hash code
(Ignacio Vera via Mike Drob)
- LUCENE-9372: Fix paths for cygwin/msys before gradle wrapper jar lookup.
(Peter Barna)
- LUCENE-9365: FuzzyQuery was missing matches when prefix length was equal to the term length
(Mark Harwood, Mike Drob)
- LUCENE-9580: Fix bug in the polygon tessellator when introducing collinear edges during polygon
splitting.
(Ignacio Vera)
- LUCENE-9930: The Ukrainian analyzer was reloading its dictionary for every new
TokenStreamComponents, which could lead to memory leaks.
(Alan Woodward)
- LUCENE-9940: The order of disjuncts in DisjunctionMaxQuery does not matter
for equality checks
(Alan Woodward)
- LUCENE-9971: Requesting facet counts for unseen dimensions in SortedSetDocValueFacetCounts and
ConcurrentSortedSetDocValueFacetCounts now returns null / -1 instead of throwing
IllegalArgumentException as per Javadoc spec in Facets.
(Alexander Lukyanchikov)
- LUCENE-9823: Prevent unsafe rewrites for SynonymQuery and CombinedFieldQuery. Before, rewriting
could slightly change the scoring when weights were specified.
(Naoto Minami via Julie Tibshirani)
- LUCENE-10047: Fix a value de-duping bug in LongValueFacetCounts and RangeFacetCounts
(Greg Miller)
- LUCENE-10101, LUCENE-9281: Use getField() instead of getDeclaredField() to
minimize security impact by analysis SPI discovery.
(Uwe Schindler)
- LUCENE-10114: Remove unused byte order mark in Lucene90PostingsWriter. This
was initially introduced by accident in Lucene 8.4.
(Uwe Schindler)
- LUCENE-10140: Fix cases where minimizing interval iterators could return
incorrect matches
(Nikolay Khitrin, Alan Woodward)
- Changes in Backwards Compatibility Policy (3)
- LUCENE-9904: regenerated UAX29URLEmailTokenizer and the corresponding analyzer with up-to-date top
level domains. This may change the token sequence compared to previous Lucene versions.
(Dawid Weiss)
- LUCENE-9669: DirectoryReader#open now accepts an argument to open indices created with versions
older than N-1. Lucene now can open indices created with a major version of N-2 in read-only mode.
Opening an index created with a major version of N-2 with an IndexWriter is not supported.
Furthermore, Lucene only supports file-format compatibility, which enables reading of old indices, while
semantic changes like analysis or certain encoding on top of the file format are only supported on
a best-effort basis.
(Simon Willnauer)
- LUCENE-10232: Fix MultiRangeQuery to confirm all dimensions for a given range match.
(Greg Miller)
- Build (6)
- LUCENE-9077 LUCENE-9433: Support Gradle build, remove Ant support from trunk
(Dawid Weiss, Erick Erickson, Uwe Schindler et al.)
- LUCENE-8768: Fix Javadocs build in Java 11.
(Namgyu Kim)
- LUCENE-9544: add regenerate gradle script for nori dictionary
(Namgyu Kim)
- LUCENE-10195: Add gradle cache option and make some tasks cacheable.
(Jerome Prinet, Dawid Weiss)
- LUCENE-10198: Allow external JAVA_OPTS in gradlew scripts; use sane defaults
(balmukund.mandal@intel.com, Dawid Weiss)
- LUCENE-10163: Move LICENSE and NOTICE files to top level to satisfy src artifact requirements
(janhoy)
- Other (20)
- LUCENE-10122: Use NumericDocValues to store taxonomy parent array
(Patrick Zhai)
- LUCENE-10136: allow 'var' declarations in source code
(Dawid Weiss)
- LUCENE-9570, LUCENE-9564: Apply google java format and enforce it on source Java files.
Review diffs and correct automatic formatting oddities.
(Erick Erickson, Bruno Roustant, Dawid Weiss)
- LUCENE-9631: Properly override slice() on subclasses of OffsetRange.
(Dawid Weiss)
- LUCENE-9391: Upgrade HPPC to 0.8.2.
(Patrick Zhai)
- LUCENE-10021: Upgrade HPPC to 0.9.0. Replace usage of ...ScatterMap to ...HashMap.
(Patrick Zhai)
- LUCENE-9092: upgrade randomizedtesting to 2.7.5
(Dawid Weiss)
- LUCENE-8656: Deprecations in FuzzyQuery and get compiler warnings out of
queryparser code
(Alan Woodward, Erick Erickson)
- LUCENE-9344: Convert .txt files to properly formatted .md files.
(Tomoko Uchida, Uwe Schindler)
- LUCENE-9267: Update MatchingQueries documentation to correct
time unit.
(Pierre-Luc Perron via Mike Drob)
- LUCENE-9411: Fail compilation on warnings, 9x gradle-only.
Deserves mention here as well as Lucene CHANGES.txt since it affects both.
(Erick Erickson, Dawid Weiss)
- LUCENE-9215: Replace checkJavaDocs.py with doclet
(Robert Muir, Dawid Weiss, Uwe Schindler)
- LUCENE-9497: Integrate Error Prone, a static analysis tool during compilation
(Dawid Weiss, Varun Thacker)
- LUCENE-9627: Remove unused Lucene50FieldInfosFormat codec and slightly refactor some codecs
to separate reading header/footer from reading content of the file.
(Ignacio Vera)
- LUCENE-9773: Upgrade icu to 68.2
(Robert Muir)
- LUCENE-9822: Add assertion to PFOR exception encoding, documenting the BLOCK_SIZE assumption.
(Greg Miller)
- LUCENE-9883: Turn on ecj missingEnumCaseDespiteDefault setting.
(Zach Chen)
- LUCENE-9705: Make new versions of all index formats for the Lucene90 codec and move
the existing ones to the backwards codecs.
(Julie Tibshirani, Ignacio Vera)
- LUCENE-9907: Remove dependency on PackedInts#getReader() from the current codecs and move the
method to backwards codec.
(Ignacio Vera)
- LUCENE-10024: Catch NoSuchFileException when opening index directory with Luke.
(Michael Wechner, Tomoko Uchida)
- Bug Fixes (7)
- LUCENE-9580: Fix bug in the polygon tessellator when introducing collinear edges during polygon
splitting.
(Ignacio Vera)
- LUCENE-10470: Check if polygon has been successfully tessellated before we fail (we are failing some valid
tessellations) and allow filtering edges that fold on top of the previous one.
(Ignacio Vera)
- LUCENE-10563: Fix failure to tessellate complex polygon
(Craig Taverner)
- LUCENE-10678: Fix potential overflow when building a BKD tree with more than 4 billion points. The overflow
occurs when computing the partition point.
(Ignacio Vera)
- GITHUB#11986: Fix algorithm that chooses the bridge between a polygon and a hole when there is
a common vertex.
(Ignacio Vera)
- GITHUB#12020: Fixes bug whereby very flat polygons can incorrectly contain intersecting geometries.
(Craig Taverner)
- GITHUB#12352: [Tessellator] Improve the checks that validate the diagonal between two polygon nodes so
the resulting polygons are valid counter clockwise polygons.
(Ignacio Vera)
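The LUCENE-10678 entry above concerns an overflow once a tree holds more than 4 billion points. The sketch below shows only the general pattern such a fix guards against, namely a point count computed in 32-bit arithmetic before being widened. The method and parameter names are hypothetical, and the expression is an assumption rather than the one in the Lucene source.

```java
/** Hypothetical illustration of 32-bit overflow in a point count; not Lucene code. */
final class PointCountOverflowSketch {
  // Overflow-prone: the multiplication happens in int and wraps around
  // before the result is returned.
  static int pointCountInt(int numLeaves, int pointsPerLeaf) {
    return numLeaves * pointsPerLeaf;
  }

  // Safe: widening one operand to long first makes the whole
  // multiplication happen in 64-bit arithmetic.
  static long pointCountLong(int numLeaves, int pointsPerLeaf) {
    return (long) numLeaves * pointsPerLeaf;
  }

  public static void main(String[] args) {
    int numLeaves = 1 << 23; // 8,388,608 leaves
    int pointsPerLeaf = 512; // 2^9 points per leaf, so 2^32 points in total
    System.out.println("int arithmetic:  " + pointCountInt(numLeaves, pointsPerLeaf));
    System.out.println("long arithmetic: " + pointCountLong(numLeaves, pointsPerLeaf));
  }
}
```

With these inputs the true product is 2^32, which does not fit in an int, so the 32-bit version wraps to 0 while the 64-bit version yields 4294967296.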
- Optimizations (1)
- GITHUB#12604: Estimate the block size of FST BytesStore in BlockTreeTermsWriter
to reduce GC load during indexing.
(Guo Feng)
- Bug Fixes (2)
- LUCENE-10564: Make sure SparseFixedBitSet#or updates ramBytesUsed.
(Julie Tibshirani)
- LUCENE-10477: Highlighter: WeightedSpanTermExtractor.extractWeightedSpanTerms now calls Query#rewrite
multiple times if necessary.
(Christine Poerschke, Adrien Grand)
- Optimizations (1)
- LUCENE-10481: FacetsCollector will not request scores if it does not use them.
(Mike Drob)
- API Changes (1)
- (No changes)
- New Features (1)
- (No changes)
- Improvements (2)
- LUCENE-9662: Make CheckIndex concurrent by parallelizing index check across segments.
(Zach Chen, Mike McCandless, Dawid Weiss, Robert Muir)
- LUCENE-10103: Make QueryCache respect Accountable queries.
(Patrick Zhai)
- Optimizations (2)
- LUCENE-9673: Substantially improve RAM efficiency of how MemoryIndex stores
postings in memory, and reduced a bit of RAM overhead in
IndexWriter's internal postings book-keeping
(mashudong)
- LUCENE-10196: Improve IntroSorter with 3-ways partitioning.
(Bruno Roustant)
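For readers unfamiliar with the 3-way partitioning mentioned in LUCENE-10196 above, here is a minimal sketch of the idea on a plain `int[]`. It is a textbook quicksort with a three-way ("Dutch national flag") partition, not Lucene's IntroSorter: it omits the heap-sort fallback and small-range handling an introsort would have, and the class name is hypothetical.

```java
import java.util.Arrays;

/** Textbook 3-way partitioning quicksort, for illustration only; not Lucene's IntroSorter. */
final class ThreeWayQuickSortSketch {
  static void sort(int[] a) {
    sort(a, 0, a.length - 1);
  }

  private static void sort(int[] a, int lo, int hi) {
    if (lo >= hi) {
      return;
    }
    int pivot = a[lo + (hi - lo) / 2];
    int lt = lo; // a[lo..lt-1] holds values smaller than the pivot
    int i = lo;  // a[lt..i-1] holds values equal to the pivot
    int gt = hi; // a[gt+1..hi] holds values greater than the pivot
    while (i <= gt) {
      if (a[i] < pivot) {
        swap(a, lt++, i++);
      } else if (a[i] > pivot) {
        swap(a, i, gt--);
      } else {
        i++;
      }
    }
    // Values equal to the pivot are already in place; recurse only on the two outer parts.
    sort(a, lo, lt - 1);
    sort(a, gt + 1, hi);
  }

  private static void swap(int[] a, int i, int j) {
    int tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }

  public static void main(String[] args) {
    int[] values = {5, 1, 5, 3, 5, 2, 2, 9, 0, 5};
    sort(values);
    System.out.println(Arrays.toString(values));
  }
}
```

The benefit over a two-way partition shows up on inputs with many duplicate keys: every element equal to the pivot is excluded from both recursive calls.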
- Bug Fixes (6)
- LUCENE-10111: Fix missing accounting of the bytes used by DocsWithFieldSet in NormValuesWriter.
(Lu Xugang)
- LUCENE-10116: Fix missing accounting of the bytes used by DocsWithFieldSet and currentValues in SortedSetDocValuesWriter.
(Lu Xugang)
- LUCENE-10070: Skip deleted docs when accumulating facet counts for all docs.
(Ankur Goel, Greg Miller)
- LUCENE-10134: ConcurrentSortedSetDocValuesFacetCounts shouldn't share liveDocs Bits across threads.
(Ankur Goel)
- LUCENE-10154: NumericLeafComparator to define getPointValues.
(Mayya Sharipova, Adrien Grand)
- LUCENE-10208: Ensure that the minimum competitive score does not decrease in concurrent search.
(Jim Ferenczi, Adrien Grand)
- Build (1)
- LUCENE-10104, SOLR-15631: Upgrade forbiddenapis to version 3.2.
(Uwe Schindler)
- Other (1)
- LUCENE-10098: Add docs/links to GermanAnalyzer describing how to decompound nouns.
(Robert Muir)
- Bug Fixes (3)
- LUCENE-10110: MultiCollector now handles single leaf collector that wants to skip low-scoring hits
but the combined score mode doesn't allow it.
(Jim Ferenczi)
- LUCENE-10119: Sort optimization with search_after can wrongly skip documents
whose values are equal to the last value of the previous page
(Nhat Nguyen)
- LUCENE-10126: Sort optimization with a chunked bulk scorer
can wrongly skip documents
(Nhat Nguyen, Mayya Sharipova)
- API Changes (5)
- LUCENE-9962: DrillSideways allows sub-classes to provide "drill down" FacetsCollectors. They
may provide a null collector if they choose to bypass "drill down" facet collection.
(Greg Miller)
- LUCENE-9902: Change the getValue method from IntTaxonomyFacets to be protected instead of private.
Users can now access the count of an ordinal directly without constructing an extra FacetLabel.
Also use variable length arguments for the getOrdinal call in TaxonomyReader.
(Gautam Worah)
- LUCENE-10036: Replaced the ScoreCachingWrappingScorer ctor with a static factory method that
ensures unnecessary wrapping doesn't occur.
(Greg Miller)
- LUCENE-10027: Add a new Directory reader open API from indexCommit and
a custom comparator for sorting leaf readers.
(Mayya Sharipova)
- LUCENE-7020: TieredMergePolicy#setMaxMergeAtOnceExplicit is deprecated
and the number of segments that get merged via explicit merges is unlimited
by default.
(Adrien Grand, Shawn Heisey)
- New Features (2)
- LUCENE-10083: Analyzer and stemmer for Telugu language
(Vinod Singh)
- LUCENE-10035: The SimpleText codec now writes skip lists.
(wuda via Adrien Grand)
- Improvements (12)
- LUCENE-9944: Allow DrillSideways users to provide their own CollectorManager without also requiring
them to provide an ExecutorService.
(Greg Miller)
- LUCENE-9946: Support for multi-value fields in LongRangeFacetCounts and
DoubleRangeFacetCounts.
(Greg Miller)
- LUCENE-9965: Added QueryProfilerIndexSearcher and ProfilerCollector to support debugging
query execution strategy and timing.
(Jack Conradson, Julie Tibshirani)
- LUCENE-9981: Operations.getCommonSuffix/Prefix(Automaton) is now much more
efficient, from a worst case exponential down to quadratic cost in the
number of states + transitions in the Automaton. These methods no longer
use the costly determinize method, removing the risk of
TooComplexToDeterminizeException
(Robert Muir, Mike McCandless)
- LUCENE-9981: Operations.determinize now throws TooComplexToDeterminizeException
based on too much "effort" spent determinizing rather than a precise state
count on the resulting returned automaton, to better handle adversarial
cases like det(rev(regexp("(.*a){2000}"))) that spend lots of effort but
result in smallish eventual returned automata.
(Robert Muir, Mike McCandless)
- LUCENE-9983: Stop sorting determinize powersets unnecessarily.
(Patrick Zhai)
- LUCENE-9177: ICUNormalizer2CharFilter no longer requires normalization-inert
characters as boundaries for incremental processing, vastly improving worst-case
performance.
(Michael Gibney)
- LUCENE-10030: Lazily evaluate score in DrillSidewaysScorer.doQueryFirstScoring
(Grigoriy Troitskiy)
- LUCENE-9945: Extend DrillSideways to support exposing FacetCollectors directly.
(Greg Miller, Sejal Pawar)
- LUCENE-10043: Decrease default for LRUQueryCache's skipCacheFactor to 10.
This prevents caching a query clause when it is much more expensive than
running the top-level query.
(Julie Tibshirani)
- LUCENE-5309: Optimize facet counting for single-valued SSDV / StringValueFacetCounts.
(Greg Miller)
- LUCENE-9917: The BEST_SPEED compression mode now trades more compression ratio
in exchange for faster reads.
(Adrien Grand)
- Optimizations (4)
- LUCENE-9996: Improved memory efficiency of IndexWriter's RAM buffer, in
particular in the case of many fields and many indexing threads.
(Adrien Grand)
- LUCENE-10022: Rewrite empty DisjunctionMaxQuery to MatchNoDocsQuery.
(David Harsha via Julie Tibshirani)
- LUCENE-10031: Slightly faster segment merging for sorted indices.
(Adrien Grand)
- LUCENE-10014: Lucene90DocValuesFormat was using too many bits per
value when compressing via gcd, unnecessarily wasting index storage.
(weizijun)
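The LUCENE-10014 entry above refers to compressing numeric values via their greatest common divisor. The sketch below illustrates that general idea, assuming a non-empty array whose value range fits in a long: subtract the minimum, divide by the GCD of the deltas, and store only the quotients. It is not the Lucene90DocValuesFormat code, and all names are hypothetical.

```java
/** Hypothetical sketch of GCD-based bit-width selection; not the Lucene90DocValuesFormat code. */
final class GcdCompressionSketch {
  static long gcd(long a, long b) {
    while (b != 0) {
      long t = a % b;
      a = b;
      b = t;
    }
    return Math.abs(a);
  }

  /** Bits needed to store any unsigned value in [0, maxValue], with a minimum of 1. */
  static int bitsRequired(long maxValue) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  }

  /** Bits per value when storing (value - min) / gcd instead of the raw value. */
  static int bitsPerValueWithGcd(long[] values) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    long g = 0;
    for (long v : values) {
      g = gcd(g, v - min);
    }
    if (g == 0) {
      return 1; // all values are identical
    }
    // The largest stored quotient is (max - min) / g, so size the bit width
    // for that quotient rather than for (max - min) itself.
    return bitsRequired((max - min) / g);
  }

  public static void main(String[] args) {
    long[] values = {1000, 3000, 7000, 9000};
    System.out.println("bits for raw delta range: " + bitsRequired(9000 - 1000));
    System.out.println("bits with gcd:            " + bitsPerValueWithGcd(values));
  }
}
```

In this example the deltas share a GCD of 2000, so the largest stored quotient is 4 and 3 bits suffice, whereas sizing for the undivided range of 8000 would take 13 bits.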
- Bug Fixes (12)
- LUCENE-9988: Fix DrillSideways correctness bug introduced in LUCENE-9944
(Greg Miller)
- LUCENE-9964: Duplicate long values in a document field should only be counted once when using SortedNumericDocValuesFields
(Gautam Worah)
- LUCENE-9999: CombinedFieldQuery can fail with an exception when document
is missing some fields.
(Jim Ferenczi, Julie Tibshirani)
- LUCENE-10020: DocComparator should not skip docs with the same docID on
multiple sorts with search after
(Mayya Sharipova, Julie Tibshirani)
- LUCENE-10026: Fix CombinedFieldQuery equals and hashCode, which ensures
query rewrites don't drop CombinedFieldQuery clauses.
(Julie Tibshirani)
- LUCENE-10039: Correct CombinedFieldQuery scoring when there is a single
field.
(Julie Tibshirani)
- LUCENE-10046: Counting bug fixed in StringValueFacetCounts.
(Greg Miller)
- LUCENE-9963: FlattenGraphFilter is now more robust when handling
incoming holes in the input token graph
(Geoff Lawson)
- LUCENE-10008: Respect ignoreCase in CommonGramsFilterFactory
(Vigya Sharma)
- LUCENE-10060: Ensure DrillSidewaysQuery instances never get cached.
(Greg Miller, Zachary Chen)
- LUCENE-10081: KoreanTokenizer should check the max backtrace gap on whitespaces.
(Jim Ferenczi)
- LUCENE-10106: Sort optimization can wrongly skip the first document of
each segment
(Nhat Nguyen)
- Other (1)
- (No changes)
- API Changes (1)
- LUCENE-9680: IndexWriter#getFieldNames() method added to get fields present in index.
This method was removed in LUCENE-8909.
(Oren Ovadia)
- New Features (8)
- LUCENE-9507: Custom order for leaves in IndexReader and IndexWriter
(Mayya Sharipova, Mike McCandless, Jim Ferenczi)
- LUCENE-9575: PatternTypingFilter has been added to allow setting a type attribute on tokens based on
a configured set of regular expressions
(Gus Heck)
- LUCENE-9572: TypeAsSynonymFilter has been enhanced to support ignoring some types, and to allow
the generated synonyms to copy some or all flags from the original token
(Gus Heck)
- LUCENE-9574: A token filter to drop tokens that match all specified flags.
(Gus Heck, Uwe Schindler)
- LUCENE-9537: Added smoothingScore method and default implementation to
Scorable abstract class. The smoothing score allows scorers to calculate a
score for a document where the search term or subquery is not present. The
smoothing score acts like an idf so that documents that do not have terms or
subqueries that are more frequent in the index are not penalized as much as
documents that do not have less frequent terms or subqueries and prevents
scores which are the product of terms or subqueries from going to zero. Added
the implementation of the Indri AND and the IndriDirichletSimilarity from the
academic Indri search engine: http://www.lemurproject.org/indri.php.
(Cameron VandenBerg)
- LUCENE-9694: New tool for creating a deterministic index to enable benchmarking changes
on a consistent multi-segment index even when they require re-indexing.
(Patrick Zhai)
- LUCENE-9385: Add FacetsConfig option to control which drill-down
terms are indexed for a FacetLabel
(Zachary Chen)
- LUCENE-9950: New facet counting implementation for general string doc value fields
(SortedSetDocValues / SortedDocValues) not created through FacetsConfig
(Greg Miller)
- Improvements (5)
- LUCENE-9725: BM25FQuery was extended to handle similarities beyond BM25Similarity. It
was renamed to CombinedFieldQuery to reflect its more general scope.
(Julie Tibshirani)
- LUCENE-9663: Adding compression to terms dict from SortedSet/Sorted DocValues.
(Jaison Bi via Bruno Roustant)
- LUCENE-9687: Hunspell support improvements: add API for spell-checking and suggestions, support compound words,
fix various behavior differences between Java and C++ implementations, improve performance
(Peter Gromov, Dawid Weiss)
- LUCENE-9877: Reduce index size by increasing allowable exceptions in PForUtil from 3 to 7.
(Greg Miller)
- LUCENE-9935: Enable bulk merge for stored fields with index sort.
(Robert Muir, Adrien Grand, Nhat Nguyen)
- Optimizations (2)
- LUCENE-9932: Performance improvement for BKD index building
(neoremind)
- LUCENE-9827: Speed up merging of stored fields and term vectors for smaller segments.
(Daniel Mitterdorfer, Dimitrios Liapis, Adrien Grand, Robert Muir)
- Bug Fixes (6)
- LUCENE-9791: BytesRefHash.equals/find is now thread safe, fixing a
Luwak/Monitor bug causing registered queries to sometimes fail to
match.
(Paweł Bugalski)
- LUCENE-9887: Fixed parameter use in RadixSelector.
(liupanfeng via Adrien Grand)
- LUCENE-9958: Fixed performance regression for boolean queries that configure a
minimum number of matching clauses.
(Adrien Grand, Matt Weber)
- LUCENE-9953: LongValueFacetCounts should count each document at most once when determining
the total count for a dimension. Prior to this fix, multi-value docs could contribute a > 1
count to the dimension count.
(Greg Miller)
- LUCENE-9967: Do not throw NullPointerException while trying to handle another exception in
ReplicaNode.start
(Steven Schlansker)
- LUCENE-9991: Fix edge case failure in TestStringValueFacetCounts
(Greg Miller)
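The LUCENE-9953 entry above distinguishes per-value counts from the total count of a dimension. A minimal sketch of that distinction in plain Java follows; it is not Lucene's LongValueFacetCounts code, and the class and method names are hypothetical. Per-value counts grow for every listed value, while the total grows once per document that has at least one value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not Lucene's LongValueFacetCounts.
final class DimensionCountSketch {
  final Map<Long, Integer> perValue = new HashMap<>();
  int totalCount; // number of documents with at least one value

  // Each inner list holds the values of one document.
  void countAll(List<List<Long>> docs) {
    for (List<Long> values : docs) {
      if (values.isEmpty()) {
        continue; // document has no value for this dimension
      }
      totalCount++; // once per document, however many values it has
      for (long v : values) {
        perValue.merge(v, 1, Integer::sum);
      }
    }
  }
}
```

With this shape, a document holding two values raises two per-value counters but adds only 1 to the total.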
- Other (4)
- LUCENE-9836: Removed the pure Maven build. It is no longer possible to build
artifacts using Maven (this feature was no longer working correctly). Due to
migration to Gradle for Lucene/Solr 9.0, the maintenance of the Maven build
was no longer reasonable. POM files are generated for deployment to Maven
Central only. Please use "ant generate-maven-artifacts" to produce and deploy
artifacts to any repository.
(Uwe Schindler, Dawid Weiss)
- LUCENE-9836: Migrate Maven tasks to use "maven-resolver-ant-tasks"
instead of the no longer maintained "maven-ant-tasks".
(Uwe Schindler)
- LUCENE-9985: Upgrade jetty to 9.4.41
(janhoy)
- LUCENE-9976: Fix WANDScorer assertion error.
(Zach Chen, Adrien Grand, Dawid Weiss)
- Bug Fixes (3)
- LUCENE-9870: Fix Circle2D intersectsLine t-value (distance) range clamp
(Jørgen Nystad)
- LUCENE-9744: NPE on a degenerate query in MinimumShouldMatchIntervalsSource
$MinimumMatchesIterator.getSubMatches().
(Alan Woodward)
- LUCENE-9762: DoubleValuesSource.fromQuery (also used by FunctionScoreQuery.boostByQuery) could
throw an exception when the query implements TwoPhaseIterator and when the score is requested
repeatedly.
(David Smiley, hossman)
- New Features (5)
- LUCENE-9552: New LatLonPoint query that accepts an array of LatLonGeometries.
(Ignacio Vera)
- LUCENE-9641: LatLonPoint query support for spatial relationships.
(Ignacio Vera)
- LUCENE-9553: New XYPoint query that accepts an array of XYGeometries.
(Ignacio Vera)
- LUCENE-9378: Doc values now allow configuring how to trade compression for
retrieval speed.
(Adrien Grand)
- LUCENE-9413: Add CJKWidthCharFilter and its factory
(Tomoko Uchida)
- Improvements (3)
- LUCENE-9455: ExitableTermsEnum should sample timeout and interruption
check before calling next().
(Zach Chen via Bruno Roustant)
- LUCENE-9023: GlobalOrdinalsWithScore should not compute occurrences when the
provided min is 1.
(Jim Ferenczi)
- LUCENE-9675: Binary doc values fields now expose their configured compression mode
in the attributes of the field info.
(Jim Ferenczi)
- Optimizations (4)
- LUCENE-9536: Reduced memory usage for OrdinalMap when a segment has all
values.
(Julie Tibshirani via Adrien Grand)
- LUCENE-9021: QueryParser: re-use the LookaheadSuccess exception.
(Przemek Bruski via Mikhail Khludnev)
- LUCENE-9636: Faster decoding of postings for some numbers of bits per value.
(Guo Feng via Adrien Grand)
- LUCENE-9346: WANDScorer now supports queries that have a
`minimumNumberShouldMatch` configured.
(Xi Zachary Chen via Adrien Grand)
- Bug Fixes (8)
- LUCENE-9508: DocumentsWriter was only stalling threads for 1 second allowing
documents to be indexed even if the DocumentsWriter wasn't able to keep up flushing.
Unless IW can't make progress due to an ill behaving DWPT this issue was barely
noticeable.
(Simon Willnauer)
- LUCENE-9581: Japanese tokenizer should discard the compound token instead of disabling the decomposition
of long tokens when discardCompoundToken is activated.
(Jim Ferenczi)
- LUCENE-9595: Make Component2D#withinPoint implementations consistent with ShapeQuery logic.
(Ignacio Vera)
- LUCENE-9606: Wrap boolean queries generated by shape fields with a Constant score query.
(Ignacio Vera)
- LUCENE-9635: BM25FQuery - Mask encoded norm long value in array lookup.
(Yilun Cui)
- LUCENE-9617: Fix per-field memory leak in IndexWriter.deleteAll(). Reset next available internal
field number to 0 on FieldInfos.clear(), to avoid wasting FieldInfo references.
(Michael Froh)
- LUCENE-9642: When encoding triangles in ShapeField, make sure generated triangles are CCW by rotating
triangle points before checking triangle orientation.
(Ignacio Vera)
- LUCENE-9661: Fix deadlock in TermsEnum.EMPTY that occurs when trying to initialize TermsEnum and BaseTermsEnum
at the same time
(Namgyu Kim)
- Other (2)
- SOLR-14995: Update Jetty to 9.4.34
(Mike Drob)
- LUCENE-9637: Removes some unused code and replaces the Point implementation on ShapeField/ShapeQuery
random tests.
(Ignacio Vera)
- API Changes (2)
- LUCENE-9437: Lucene's facet module's DocValuesOrdinalsReader.decode method
is now public, making it easier for applications to decode facet
ordinals into their corresponding labels
(Ankur Goel)
- LUCENE-9515: IndexingChain now accepts individual primitives rather than a
DocumentsWriterPerThread instance in order to create a new DocConsumer.
(Simon Willnauer)
- New Features (4)
- LUCENE-9386: RegExpQuery added case insensitive matching option.
(Mark Harwood)
- LUCENE-8962: Add IndexWriter merge-on-refresh feature to selectively merge
small segments on getReader, subject to a configurable timeout, to improve
search performance by reducing the number of small segments for searching.
(Simon Willnauer)
- LUCENE-9484: Allow sorting an index after it was created. With SortingCodecReader, existing
unsorted segments can be wrapped and merged into a fresh index using IndexWriter#addIndices
API.
(Simon Willnauer, Adrien Grand)
- LUCENE-9444: Add utility class to retrieve facet labels from the
taxonomy index for a facet field so such fields do not also have to
be redundantly stored
(Ankur Goel)
- Improvements (10)
- LUCENE-8574: Add a new ExpressionValueSource which will enforce only one value per name
per hit in dependencies, ExpressionFunctionValues will no longer
recompute already computed values
(Patrick Zhai)
- LUCENE-9416: Fix CheckIndex to print an invalid non-zero norm as
unsigned long when detecting corruption.
- LUCENE-9440: FieldInfo#checkConsistency called twice from Lucene50(60)FieldInfosFormat#read;
Removed the (redundant?) assert and do these checks for real.
(Yauheni Putsykovich)
- LUCENE-9446: In BooleanQuery rewrite, always remove MatchAllDocsQuery filter clauses
when possible.
(Julie Tibshirani)
- LUCENE-9501: Improve coverage for Asserting* test classes: make sure to handle singleton doc
values, and sometimes exercise Weight#scorer instead of Weight#bulkScorer for top-level
queries.
(Julie Tibshirani)
- LUCENE-9511: Include StoredFieldsWriter in DWPT accounting to ensure that its
heap consumption is taken into account when IndexWriter stalls or should flush
DWPTs.
(Simon Willnauer)
- LUCENE-9514: Include TermVectorsWriter in DWPT accounting to ensure that its
heap consumption is taken into account when IndexWriter stalls or should flush
DWPTs.
(Simon Willnauer)
- LUCENE-9523: In query shapes over shape fields, skip points while traversing the
BKD tree when the relationship with the document is already known.
(Ignacio Vera)
- LUCENE-9539: Use more compact datastructures to represent sorted doc-values in memory when
sorting a segment before flush and in SortingCodecReader.
(Simon Willnauer)
- LUCENE-9458: WordDelimiterGraphFilter should order tokens at the same position by endOffset to
emit longer tokens first. The same graph is produced.
(David Smiley)
- Optimizations (4)
- LUCENE-9395: ConstantValuesSource now shares a single DoubleValues
instance across all segments
(Tony Xu)
- LUCENE-9447, LUCENE-9486: Stored fields now get higher compression ratios on
highly compressible data.
(Adrien Grand)
- LUCENE-9373: FunctionMatchQuery now accepts a "matchCost" optimization hint.
(Maxim Glazkov, David Smiley)
- LUCENE-9510: Indexing with an index sort is now faster by not compressing
temporary representations of the data.
(Adrien Grand)
- Bug Fixes (6)
- LUCENE-9427: Fix a regression where the unified highlighter didn't produce
highlights on fuzzy queries that correspond to exact matches.
(Julie Tibshirani)
- LUCENE-9467: Fix NRTCachingDirectory to use Directory#fileLength to check if a file
already exists instead of opening an IndexInput on the file which might throw an AccessDeniedException
in some Directory implementations.
(Simon Willnauer)
- LUCENE-9501: Fix a bug in IndexSortSortedNumericDocValuesRangeQuery where it could violate the
DocIdSetIterator contract.
(Julie Tibshirani)
- LUCENE-9401: Include field in ComplexPhraseQuery's toString()
(Thomas Hecker via Munendra S N)
- LUCENE-9578: Fix TermRangeQuery when there is no upper bound and the lower
bound is the empty string excluded. This would previously match no strings at
all while it should match all non-empty strings.
(Christoph Buescher via Adrien Grand)
- LUCENE-9524: Fix NPE in SpanWeight#explain when no scoring is required and
SpanWeight has null Similarity.SimScorer.
(Zach Chen)
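The LUCENE-9578 entry above describes the intended range semantics: no upper bound plus an excluded empty-string lower bound should match every non-empty term. A minimal sketch of those semantics in plain Java follows. It is not Lucene's TermRangeQuery implementation; the class and method are hypothetical, and it compares Java strings rather than term bytes as a simplification.

```java
// Hypothetical illustration of range semantics; not Lucene's TermRangeQuery.
final class TermRangeSketch {
  // A null bound means "unbounded" on that side.
  static boolean inRange(String term, String lower, boolean includeLower,
                         String upper, boolean includeUpper) {
    if (lower != null) {
      int cmp = term.compareTo(lower);
      if (cmp < 0 || (cmp == 0 && !includeLower)) {
        return false; // below the lower bound, or equal to an excluded bound
      }
    }
    if (upper != null) {
      int cmp = term.compareTo(upper);
      if (cmp > 0 || (cmp == 0 && !includeUpper)) {
        return false; // above the upper bound, or equal to an excluded bound
      }
    }
    return true;
  }
}
```

Under these rules `inRange("a", "", false, null, false)` accepts the term, while the empty string itself is rejected because the lower bound is excluded.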
- Documentation (1)
- LUCENE-9424: Add a performance warning to AttributeSource.captureState javadocs
(Patrick Zhai)
- Changes in Runtime Behavior (1)
- LUCENE-9539: SortingCodecReader now doesn't cache doc values fields anymore. Previously, SortingCodecReader
used to cache all doc values fields after they were loaded into memory. This reader should only be used
to sort segments after the fact using IndexWriter#addIndices.
(Simon Willnauer)
- Other (3)
- LUCENE-9292: Refactor BKD point configuration into its own class.
(Ignacio Vera)
- LUCENE-9470: Make TestXYMultiPolygonShapeQueries more resilient for CONTAINS queries.
(Ignacio Vera)
- LUCENE-9512: Move LockFactory stress test to be a unit/integration
test.
(Uwe Schindler, Dawid Weiss, Robert Muir)
- Build (1)
- Upgrade forbiddenapis to version 3.1.
(Uwe Schindler)
- Bug Fixes (1)
- LUCENE-9478: Prevent DWPTDeleteQueue from referencing itself and leaking memory. The queue
passed an implicit this reference to the next queue instance on flush which leaked about 500 bytes
of memory on each full flush, commit or getReader call.
(Simon Willnauer)
- Bug Fixes (1)
- LUCENE-9443: The UnifiedHighlighter was closing the underlying reader when there were multiple term-vector fields.
This was a regression in 8.6.0.
(David Smiley, Chris Beer)
- API Changes (9)
- LUCENE-9265: SimpleFSDirectory is deprecated in favor of NIOFSDirectory.
(Yannick Welsch)
- LUCENE-9304: Removed ability to set DocumentsWriterPerThreadPool on IndexWriterConfig.
The DocumentsWriterPerThreadPool is a package-private final class which made it impossible
to customize.
(Simon Willnauer)
- LUCENE-9339: MergeScheduler#merge no longer accepts a parameter indicating whether a new merge was found.
(Simon Willnauer)
- LUCENE-9330: SortFields are now responsible for writing themselves into index headers if they
are used as index sorts.
(Alan Woodward, Uwe Schindler, Adrien Grand)
- LUCENE-9340: Deprecate SimpleBindings#add(SortField).
(Alan Woodward)
- LUCENE-9345: MergeScheduler is now decoupled from IndexWriter. Instead it accepts a MergeSource
interface that offers the basic methods to acquire pending merges, run the merge and do accounting
around it.
(Simon Willnauer)
- LUCENE-9349: QueryVisitor.consumeTermsMatching() now takes a
Supplier<ByteRunAutomaton> to enable queries that build large automata to
provide them lazily. TermInSetQuery switches to using this method
to report matching terms.
(Alan Woodward)
- LUCENE-9366: DocValues.emptySortedNumeric() no longer takes a maxDoc parameter
(Alan Woodward)
- LUCENE-7822: CodecUtil#checkFooter(IndexInput, Throwable) now throws a
CorruptIndexException if checksums mismatch or if checksums can't be verified.
(Martin Amirault, Adrien Grand)
- New Features (2)
- LUCENE-7889: Grouping by range based on values from DoubleValuesSource and LongValuesSource
(Alan Woodward)
- LUCENE-8962: Add IndexWriter merge-on-commit feature to selectively merge small segments on commit,
subject to a configurable timeout, to improve search performance by reducing the number of small
segments for searching
(Michael Froh, Mike Sokolov, Mike Mccandless, Simon Willnauer)
- Improvements (13)
- LUCENE-9276: Use same code-path for updateDocuments and updateDocument in IndexWriter and
DocumentsWriter.
(Simon Willnauer)
- LUCENE-9279: Update dictionary version for Ukrainian analyzer to 4.9.1
(Andriy Rysin via Dawid Weiss)
- LUCENE-8050: PerFieldDocValuesFormat should not get the DocValuesFormat on a field that has no doc values.
(David Smiley, Juan Rodriguez)
- LUCENE-9304: Removed ThreadState abstraction from DocumentsWriter which allows pooling of DWPT directly and
improves the approachability of the IndexWriter code.
(Simon Willnauer)
- LUCENE-9324: Add an ID to SegmentCommitInfo in order to compare commits for equality and make
snapshots incremental on generational files.
(Simon Willnauer, Mike Mccandless, Adrien Grand)
- LUCENE-9342: TotalHits' relation will be EQUAL_TO when the number of hits is lower than TopDocsCollector's numHits
(Tomás Fernández Löbbe)
- LUCENE-9353: Metadata of the terms dictionary moved to its own file, with the
`.tmd` extension. This allows checksums of metadata to be verified when
opening indices and helps save seeks when opening an index.
(Adrien Grand)
- LUCENE-9359: SegmentInfos#readCommit now always returns a
CorruptIndexException if the content of the file is invalid.
(Adrien Grand)
- LUCENE-9393: Make FunctionScoreQuery use ScoreMode.COMPLETE for creating the inner query weight when
ScoreMode.TOP_DOCS is requested.
(Tomás Fernández Löbbe)
- LUCENE-9392: Make FacetsConfig.DELIM_CHAR publicly accessible
(Ankur Goel)
- LUCENE-9397: UniformSplit supports encodable fields metadata.
(Bruno Roustant)
- LUCENE-9396: Improved truncation detection for points.
(Adrien Grand, Robert Muir)
- LUCENE-9402: Let MultiCollector handle minCompetitiveScore
(Tomás Fernández Löbbe, Adrien Grand)
- Optimizations (8)
- LUCENE-9254: UniformSplit keeps FST off-heap.
(Bruno Roustant)
- LUCENE-8103: DoubleValuesSource and QueryValueSource now use a TwoPhaseIterator if one is provided by the Query.
(Michele Palmia, David Smiley)
- LUCENE-9287: UsageTrackingQueryCachingPolicy no longer caches DocValuesFieldExistsQuery.
(Ignacio Vera)
- LUCENE-9286: FST.Arc.BitTable reads FST bytes directly. Arc is lightweight again and FSTEnum traversal is faster.
(Bruno Roustant)
- LUCENE-7788: fail precommit on unparameterised log messages and examine for wasted work/objects
(Erick Erickson)
- LUCENE-9273: Speed up geometry queries by specialising Component2D spatial operations. Instead of using a generic
relate method for all relations, we use specialized methods for each one. In addition, the type of triangle is
computed at deserialization time, therefore we can be more selective when decoding points of a triangle.
(Ignacio Vera)
- LUCENE-9087: Always build trees with full leaves and lower the default value for maxPointsPerLeafNode to 512.
(Ignacio Vera)
- LUCENE-9148: Points now write their index in a separate file.
(Adrien Grand)
- Bug Fixes (14)
- LUCENE-9259: Fix wrong NGramFilterFactory argument name for preserveOriginal option
(Paul Pazderski)
- LUCENE-8849: DocValuesRewriteMethod.visit wasn't visiting its embedded query
(Michele Palmia, David Smiley)
- LUCENE-9258: DocTermsIndexDocValues assumed it was operating on a SortedDocValues (single valued) field when
it could be a multi-valued field used with a SortedSetSelector
(Michele Palmia)
- LUCENE-9164: Ensure IW processes all internal events before it closes itself on a rollback.
(Simon Willnauer, Nhat Nguyen, Dawid Weiss, Mike Mccandless)
- LUCENE-8908: Return default value from objectVal when doc doesn't match the query in QueryValueSource
(Bill Bell, hossman, Munendra S N, Michele Palmia)
- LUCENE-9133: Fix for potential NPE in TermFilteredPresearcher for empty fields
(Marvin Justice via Mike Drob)
- LUCENE-9309: Wait for #addIndexes merges when aborting merges.
(Simon Willnauer)
- LUCENE-9337: Ensure CMS updates its thread accounting datastructures consistently.
CMS today releases its lock after finishing a merge before it re-acquires it to update
the thread accounting datastructures. This causes threading issues where concurrently
finishing threads fail to pick up pending merges causing potential thread starvation on
forceMerge calls.
(Simon Willnauer)
- LUCENE-9314: Single-document monitor runs were using the less efficient MultiDocumentBatch
implementation.
(Pierre-Luc Perron, Alan Woodward)
- LUCENE-9362: Fix equality check in ExpressionValueSource#rewrite. This fixes rewriting of inner value sources.
(Dmitry Emets)
- LUCENE-9405: IndexWriter incorrectly calls closeMergeReaders twice when the merged segment is 100% deleted.
(Michael Froh, Simon Willnauer, Mike Mccandless, Mike Sokolov)
- LUCENE-9400: Tessellator might build illegal polygons when several holes share the same vertex.
(Ignacio Vera)
- LUCENE-9417: Tessellator might build illegal polygons when several holes are connected to the same
vertex.
(Ignacio Vera)
- LUCENE-9418: Fix ordered intervals over interleaved terms
(Alan Woodward)
- Other (12)
- LUCENE-9257: Always keep FST off-heap. FSTLoadMode, Reader attributes and openedFromWriter removed.
(Bruno Roustant)
- LUCENE-9272: Checksums of the terms index are now verified when
LeafReader#checkIntegrity is called rather than when opening the index.
(Adrien Grand)
- LUCENE-9270: Update Javadoc about normalizeEntry in the Kuromoji DictionaryBuilder.
(Namgyu Kim)
- LUCENE-9275: Make TestLatLonMultiPolygonShapeQueries more resilient for CONTAINS queries.
(Ignacio Vera)
- LUCENE-9244: Adjust TestLucene60PointsFormat#testEstimatePointCount2Dims so it does not fail when a point
is shared by multiple leaves.
(Ignacio Vera)
- LUCENE-9271: ByteBufferIndexInput was refactored to work on top of the
ByteBuffer API.
(Adrien Grand)
- LUCENE-9191: Make LineFileDocs's random seeking more efficient, making tests using LineFileDocs faster
(Robert Muir, Mike McCandless)
- LUCENE-9338: Refactors SimpleBindings to improve type safety and cycle detection
(Alan Woodward, Adrien Grand)
- LUCENE-9358: Change the way the multi-dimensional BKD tree builder generates the intermediate tree representation to be
equal to the one dimensional case to avoid unnecessary tree and leaves rotation.
(Ignacio Vera)
- LUCENE-9288: poll_mirrors.py release script can handle HTTPS mirrors.
(Ignacio Vera)
- LUCENE-9232: Fix or suppress 13 resource leak precommit warnings in lucene/replicator
(Andras Salamon via Erick Erickson)
- LUCENE-9398: Always keep BKD index off-heap. BKD reader does not implement Accountable any more.
(Ignacio Vera)
- Build (4)
- Upgrade forbiddenapis to version 3.0.1.
(Uwe Schindler)
- LUCENE-9376: Fix or suppress 20 resource leak precommit warnings in lucene/search
(Andras Salamon via Erick Erickson)
- LUCENE-9380: Fix auxiliary class warnings in Lucene
(Erick Erickson)
- LUCENE-9389: Enhance gradle logging calls validation: eliminate getMessage()
(Andras Salamon via Erick Erickson)
- Optimizations (1)
- LUCENE-9350: Partial reversion of LUCENE-9068; holding levenshtein automata on FuzzyQuery can end
up blowing up query caches which use query objects as cache keys, so building the automata is
now delayed to search time again.
(Alan Woodward, Mike Drob)
- Bug Fixes (1)
- LUCENE-9300: Fix corruption of the new gen field infos when doc values updates are applied on a segment created
externally and added to the index with IndexWriter#addIndexes(Directory).
(Jim Ferenczi, Adrien Grand)
- API Changes (9)
- LUCENE-9093: Not an API change but a change in behavior of the UnifiedHighlighter's LengthGoalBreakIterator that will
yield Passages sized a little different due to the fact that the sizing pivot is now the center of the first match and
not its left edge.
- LUCENE-9116: PostingsWriterBase and PostingsReaderBase no longer support
setting a field's metadata via a `long[]`.
(Adrien Grand)
- LUCENE-9116: The FSTOrd postings format has been removed.
(Adrien Grand)
- LUCENE-8369: Remove obsolete spatial module.
(Nick Knize, David Smiley)
- LUCENE-8621: Refactor LatLonShape, XYShape, and all query and utility classes to core.
(Nick Knize)
- LUCENE-9218: XY geometries API works in float space.
(Ignacio Vera)
- LUCENE-9212: Intervals.multiterm() takes CompiledAutomaton rather than plain Automaton
(Alan Woodward)
- LUCENE-9150: Restore support for dynamic PlanetModel in spatial3d.
(Nick Knize)
- LUCENE-9171: QueryBuilder.newTermQuery() and .newSynonymQuery() now take boost parameters.
(Alessandro Benedetti, Alan Woodward)
- New Features (3)
- LUCENE-8903: Add LatLonShape and XYShape point query.
(Ignacio Vera)
- LUCENE-8707: Add LatLonShape and XYShape distance query.
(Ignacio Vera)
- LUCENE-9238: New XYPointField field and Queries for indexing, searching and sorting
cartesian points.
(Ignacio Vera)
- Improvements (12)
- LUCENE-9149: Increase data dimension limit in BKD.
(Nick Knize)
- LUCENE-9102: Add maxQueryLength option to DirectSpellchecker.
(Andy Webb via Bruno Roustant)
- LUCENE-9091: UnifiedHighlighter HTML escaping should only escape essentials
(Nándor Mátravölgyi)
- LUCENE-9105: UniformSplit postings format detects corrupted index and better handles IO exceptions.
(Bruno Roustant)
- LUCENE-9106: UniformSplit postings format allows extension of block/line serializers.
(Bruno Roustant)
- LUCENE-9093: UnifiedHighlighter's LengthGoalBreakIterator has a new fragmentAlignment option to better center the
first match in the passage. Also the sizing point now pivots at the center of the first match term and not its left
edge. This yields Passages that won't be identical to the previous behavior.
(Nándor Mátravölgyi, David Smiley)
- LUCENE-9153: Allow WhitespaceAnalyzer to set a maxTokenLength other than the default of 255
(Alan Woodward)
- LUCENE-9152: Improve line intersections with polygons when they are touching from the outside.
(Ignacio Vera)
- LUCENE-9123: Add new JapaneseTokenizer constructors with discardCompoundToken option that controls whether
the tokenizer emits original (compound) tokens when the mode is not NORMAL.
(Kazuaki Hiraga via Tomoko Uchida)
- LUCENE-9253: KoreanTokenizer now supports custom dictionaries(system, unknown).
(Namgyu Kim)
- LUCENE-9171: QueryBuilder can now use BoostAttributes on input token streams to selectively
boost particular terms or synonyms in parsed queries.
(Alessandro Benedetti, Alan Woodward)
- LUCENE-9298: Improve RAM accounting in BufferedUpdates when deleted doc IDs and terms are cleared.
(Yu Binglei, Simon Willnauer)
- Optimizations (10)
- LUCENE-9211: Add compression for Binary doc value fields.
(Mark Harwood)
- LUCENE-4702: Better compression of terms dictionaries.
(Adrien Grand)
- LUCENE-9228: Sort dvUpdates in the term order before applying if they all update a
single field to the same value. This optimization can reduce the flush time by around
20% for the docValues update use cases.
(Nhat Nguyen, Adrien Grand, Simon Willnauer)
- LUCENE-9245: Reduce AutomatonTermsEnum memory usage.
(Bruno Roustant, Robert Muir)
- LUCENE-9237: Faster UniformSplit intersect TermsEnum.
(Bruno Roustant)
- LUCENE-9260: LeafReader#checkIntegrity verifies checksums of CFS files.
(Adrien Grand)
- LUCENE-9068: FuzzyQuery builds its Automaton up-front
(Alan Woodward, Mike Drob)
- LUCENE-9113: Faster merging of SORTED/SORTED_SET doc values.
(Adrien Grand)
- LUCENE-9125: Optimize Automaton.step() with binary search and introduce Automaton.next().
(Bruno Roustant)
- LUCENE-9147: The index of stored fields and term vectors is now off-heap.
(Adrien Grand)
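The LUCENE-9125 entry above mentions optimizing Automaton.step() with binary search. As a rough illustration of that technique, and not of Lucene's Automaton internals, the sketch below looks up a label among one state's transitions, assuming they are stored as non-overlapping ranges sorted by their minimum label. The class name, the parallel-array layout, and the -1 "no transition" result are assumptions of this sketch.

```java
// Hypothetical illustration of a binary-searched transition lookup; not Lucene's Automaton.
final class StepSketch {
  // mins/maxs/dests describe one state's transitions: labels in
  // [mins[i], maxs[i]] lead to dests[i]. Ranges are sorted and do not overlap.
  static int step(int[] mins, int[] maxs, int[] dests, int label) {
    int lo = 0;
    int hi = mins.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (label < mins[mid]) {
        hi = mid - 1; // label lies before this range
      } else if (label > maxs[mid]) {
        lo = mid + 1; // label lies after this range
      } else {
        return dests[mid]; // label falls inside this range
      }
    }
    return -1; // no transition accepts the label
  }
}
```

Compared with a linear scan, the lookup cost grows with the logarithm of the number of transitions leaving the state.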
- Bug Fixes (11)
- LUCENE-9084: Fix potential deadlock due to circular synchronization in AnalyzingInfixSuggester
(Paul Ward)
- LUCENE-9115: NRTCachingDirectory no longer caches files of unknown size.
(Adrien Grand)
- LUCENE-9144: Fix error message on OneDimensionBKDWriter when too many points are added to the writer.
(Ignacio Vera)
- LUCENE-9135: Make UniformSplit FieldMetadata counters long.
(Bruno Roustant)
- LUCENE-9200: Fix TieredMergePolicy to use double (not float) math to make its merging decisions, fixing
a corner-case bug uncovered by fun randomized tests
(Robert Muir, Mike McCandless)
- LUCENE-9099: Unordered and Ordered interval queries now correctly handle
repeated subterms - ordered intervals could supply an 'extra' minimized
interval, resulting in odd matches when combined with eg CONTAINS queries;
and unordered intervals would match duplicate subterms on the same position,
so a query for UNORDERED(foo, foo) would match a document containing 'foo'
only once.
(Alan Woodward)
- LUCENE-9250: Add support for Circle2d#intersectsLine around the dateline.
(Ignacio Vera)
- LUCENE-9243: Add fudge factor when creating a bounding box of a XYCircle.
(Ignacio Vera)
- LUCENE-9239: Circle2D#WithinTriangle detects properly if a triangle is Within distance.
(Ignacio Vera)
- LUCENE-9251: Fix bug in the polygon tessellator where edges with different value on #isEdgeFromPolygon
were not filtered out properly.
(Ignacio Vera)
- LUCENE-9263: Fix wrong transformation of distance in meters to radians in Geo3DPoint.
(Ignacio Vera)
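The LUCENE-9200 entry above switches TieredMergePolicy from float to double math. The sketch below does not reproduce the corner case that was fixed; it only illustrates, under that caveat, why float arithmetic is fragile for byte-sized quantities: a float has a 24-bit significand, so not every integer above 2^24 survives a round trip, whereas a double represents such values exactly. The class and method names are hypothetical.

```java
// Hypothetical illustration of float vs. double precision; not TieredMergePolicy code.
final class FloatVsDoubleSketch {
  // Round-trips a byte count through float.
  static long viaFloat(long bytes) {
    return (long) (float) bytes;
  }

  // Round-trips a byte count through double.
  static long viaDouble(long bytes) {
    return (long) (double) bytes;
  }

  public static void main(String[] args) {
    long size = (1L << 24) + 1; // 16,777,217 bytes, just over 16 MiB
    System.out.println(viaFloat(size) == size);  // false: float cannot hold this value exactly
    System.out.println(viaDouble(size) == size); // true
  }
}
```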
- Other (6)
- LUCENE-9109: Backport some changes from master (except StackWalker) to improve
TestSecurityManager
(Uwe Schindler)
- LUCENE-9110: Backport refactored stack analysis in tests to use generalized
LuceneTestCase methods
(Uwe Schindler)
- LUCENE-9141: Simplify LatLonShapeXQuery API by adding a new abstract class called LatLonGeometry. Queries are
executed with input objects that extend such interface.
(Ignacio Vera)
- LUCENE-9194: Simplify XYShapeXQuery API by adding a new abstract class called XYGeometry. Queries are
executed with input objects that extend such interface.
(Ignacio Vera)
- LUCENE-9096: Simplification of CompressingTermVectorsWriter#flushOffsets.
(kkewwei via Adrien Grand)
- LUCENE-9225: Rectangle extends LatLonGeometry so it can be used in a geometry collection.
(Ignacio Vera)
- API Changes (1)
- LUCENE-9029: Deprecate SloppyMath toRadians/toDegrees in favor of Java Math.
(Jack Conradson via Adrien Grand)
- New Features (1)
- LUCENE-8620: Add CONTAINS support for LatLonShape and XYShape.
(Ignacio Vera)
- Improvements (7)
- LUCENE-9002: Skip costly caching clause in LRUQueryCache if it makes the query
many times slower.
(Guoqiang Jiang)
- LUCENE-9006: WordDelimiterGraphFilter's catenateAll token is now ordered before any token parts, like WDF did.
(David Smiley)
- LUCENE-9028: introducing Intervals.multiterm()
(Mikhail Khludnev)
- LUCENE-9018: ConcatenateGraphFilter now has a configurable separator.
(Stanislav Mikulchik, David Smiley)
- LUCENE-9036: ExitableDirectoryReader may interrupt scanning over DocValues
(Mikhail Khludnev)
- LUCENE-9062: QueryVisitor now has a consumeTermsMatching() method, allowing queries
that match a class of terms to pass a ByteRunAutomaton matching that class
back to the visitor.
(Alan Woodward, David Smiley)
- LUCENE-9073: IntervalQuery now reports its field in toString() and explain()
(Mikhail Khludnev)
- Optimizations (9)
- LUCENE-8928: When building a kd-tree for dimensions n > 2, compute exact bounds for an inner node every N splits
to improve the quality of the tree. N is defined by SPLITS_BEFORE_EXACT_BOUNDS which is set to 4.
(Ignacio Vera, Adrien Grand)
- BaseDirectoryReader no longer sums up the `LeafReader#numDocs` of its leaves
eagerly. This especially helps when creating views of readers that hide
documents, since computing the number of live documents is an expensive
operation.
(Adrien Grand)
- LUCENE-8992: TopFieldCollector and TopScoreDocCollector can now share minimum scores across leaves
concurrently.
(Adrien Grand, Atri Sharma, Jim Ferenczi)
- LUCENE-8932: BKDReader's index is now stored off-heap when the IndexInput is
an instance of ByteBufferIndexInput.
(Jack Conradson via Adrien Grand)
- LUCENE-9024: IntroSelector now falls back to the median of medians algorithm
instead of sorting when the maximum recursion level is exceeded, providing
better worst-case runtime.
(Paul Sanwald via Adrien Grand)
- LUCENE-8920: The denser arcs of FST now index labels with a bitset in order
to provide near constant time access.
(Bruno Roustant, Mike Sokolov via Adrien Grand)
- LUCENE-9027: Use SIMD instructions to decode postings.
(Adrien Grand)
- LUCENE-9049: Remove FST cached root arcs now redundant with labels indexed by bitset.
This frees some on-heap FST space.
(Jack Conradson via Bruno Roustant)
- LUCENE-9045: Do not use TreeMap/TreeSet in BlockTree and PerFieldPostingsFormat.
(Bruno Roustant)
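The LUCENE-8920 entry above describes indexing the labels of dense arcs with a bitset. A minimal sketch of the general idea follows, in plain Java and not taken from Lucene's FST code: one presence bit per label in the node's label range, with the arc's position recovered by counting the set bits below it. The class name, constructor, and in-memory layout are assumptions of this sketch, and it expects a non-empty, ascending label array.

```java
// Hypothetical illustration of bitset-indexed labels; not Lucene's FST code.
final class BitTableSketch {
  private final int firstLabel;
  private final long[] bits; // bit i set => label (firstLabel + i) has an arc

  // sortedLabels must be non-empty and in ascending order.
  BitTableSketch(int[] sortedLabels) {
    firstLabel = sortedLabels[0];
    int range = sortedLabels[sortedLabels.length - 1] - firstLabel + 1;
    bits = new long[(range + 63) >>> 6];
    for (int label : sortedLabels) {
      int i = label - firstLabel;
      bits[i >>> 6] |= 1L << (i & 63);
    }
  }

  // Returns the position of the label's arc among the stored arcs, or -1 if absent.
  int arcIndex(int label) {
    int i = label - firstLabel;
    if (i < 0 || (i >>> 6) >= bits.length) {
      return -1; // outside the label range of this node
    }
    long word = bits[i >>> 6];
    long mask = 1L << (i & 63);
    if ((word & mask) == 0) {
      return -1; // label inside the range but no arc for it
    }
    int index = 0;
    for (int w = 0; w < (i >>> 6); w++) {
      index += Long.bitCount(bits[w]); // arcs recorded in earlier words
    }
    return index + Long.bitCount(word & (mask - 1)); // arcs below this bit
  }
}
```

The presence test is a single bit probe, and the position costs a handful of popcounts for the small label ranges that dense nodes typically span.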
- Bug Fixes (7)
- LUCENE-9001: Fix race condition in SetOnce.
(Przemko Robakowski)
- LUCENE-9030: Fix WordnetSynonymParser behaviour so it behaves similar to
SolrSynonymParser.
(Christoph Buescher via Alan Woodward)
- LUCENE-9054: Fix reproduceJenkinsFailures.py to not overwrite junit XML files when retrying
(hossman)
- LUCENE-9031: UnsupportedOperationException on MatchesIterator.getQuery()
(Alan Woodward, Mikhail Khludnev)
- LUCENE-8996: maxScore was sometimes missing from distributed grouped responses.
(Julien Massenet, Diego Ceccarelli, Munendra S N, Christine Poerschke)
- LUCENE-9055: Fix the detection of lines crossing triangles through edge points.
(Ignacio Vera)
- LUCENE-9103: Disjunctions can miss some hits in some rare conditions.
(Adrien Grand)
- Other (6)
- LUCENE-8979: Code Cleanup: Use entryset for map iteration wherever possible. - Part 2
(Koen De Groote)
- LUCENE-8994: Code Cleanup - Pass values to list constructor instead of empty constructor followed by addAll().
(Koen De Groote)
- LUCENE-8746: Refactor EdgeTree - Introduce a Component tree that represents the tree of components (e.g polygons).
Edge tree is now just a tree of edges.
(Ignacio Vera)
- LUCENE-9046: Fix wrong example in Javadoc of TermInSetQuery
(Namgyu Kim)
- LUCENE-8983: Add sandbox PhraseWildcardQuery to control multi-terms expansions in a phrase.
(Bruno Roustant)
- LUCENE-9067: Polygon2D#contains() is now thread safe.
(Ignacio Vera)
- Build (2)
- Upgrade forbiddenapis to version 2.7; upgrade Groovy to 2.4.17.
(Uwe Schindler)
- LUCENE-9041: Upgrade ecj to 3.19.0 to fix sporadic precommit javadoc issues
(Kevin Risden)
- Bug Fixes (1)
- LUCENE-9050: MultiTermIntervalsSource.visit() was not calling back to its
visitor.
(Alan Woodward)
- API Changes (5)
- LUCENE-8909: IndexWriter#getFieldNames() method is used to get fields present in index. After LUCENE-8316, this
method is no longer required. Hence, deprecate IndexWriter#getFieldNames() method.
(Adrien Grand, Munendra S N)
- LUCENE-8755: SpatialPrefixTreeFactory now consumes the "version" parsed with Lucene's Version class. The quad
and packed quad prefix trees are sensitive to this. It's recommended to pass the version, as you
should likewise do for analysis components for tokenized text, or else changes to the encoding in future versions
may be incompatible with older indexes.
(Chongchen Chen, David Smiley)
- LUCENE-8956: QueryRescorer now only sorts the first topN hits instead of all
initial hits.
(Paul Sanwald via Adrien Grand)
- LUCENE-8921: IndexSearcher.termStatistics() no longer takes a TermStates; it takes the docFreq and totalTermFreq.
And don't call if docFreq <= 0. The previous implementation survives as deprecated and final. It's removed in 9.0.
(Bruno Roustant, David Smiley, Alan Woodward)
- LUCENE-8990: PointValues#estimateDocCount(visitor) estimates the number of documents that would be matched by
the given IntersectVisitor. The method is used to compute the cost() of ScorerSuppliers instead of
PointValues#estimatePointCount(visitor).
(Ignacio Vera, Adrien Grand)
- New Features (6)
- LUCENE-8936: Add SpanishMinimalStemFilter
(vinod kumar via Tomoko Uchida)
- LUCENE-8764 LUCENE-8945: Add "export all terms and doc freqs" feature to Luke with delimiters.
(Leonardo Menezes, Amish Shah via Tomoko Uchida)
- LUCENE-8747: Composite Matches from multiple subqueries now allow access to
their submatches, and a new NamedMatches API allows marking of subqueries
and a simple way to find which subqueries have matched on a given document
(Alan Woodward, Jim Ferenczi)
- LUCENE-8769: Introduce Range Query For Multiple Connected Ranges
(Atri Sharma)
- LUCENE-8960: Introduce LatLonDocValuesPointInPolygonQuery for LatLonDocValuesField
(Ignacio Vera)
- LUCENE-8753: New UniformSplitPostingsFormat (name "UniformSplit") primarily benefiting in simplicity and
extensibility. New STUniformSplitPostingsFormat (name "SharedTermsUniformSplit") that shares a single internal
term dictionary across fields.
(Bruno Roustant, Juan Rodriguez, David Smiley)
- Improvements (15)
- LUCENE-8874: Show SPI names instead of class names in Luke Analysis tab.
(Tomoko Uchida)
- LUCENE-8894: Add APIs to find SPI names for Tokenizer/CharFilter/TokenFilter factory classes.
(Tomoko Uchida)
- LUCENE-8914: Move the logic for discarding inner nodes in FloatPointNearestNeighbor to the IntersectVisitor
so we take advantage of the change introduced in LUCENE-7862.
(Ignacio Vera)
- LUCENE-8955: Move the logic for discarding inner nodes in LatLonPoint NearestNeighbor to the IntersectVisitor
so we take advantage of the change introduced in LUCENE-7862.
(Ignacio Vera)
- LUCENE-8918: PhraseQuery throws exceptions at construction time if it is passed
null arguments.
(Alan Woodward)
- LUCENE-8916: GraphTokenStreamFiniteStrings preserves all Token attributes
through its finite strings TokenStreams
(Alan Woodward)
- LUCENE-8906: Expose Lucene50PostingsFormat.IntBlockTermState as public so that other postings formats can re-use it.
(Bruno Roustant)
- LUCENE-8942: Remove redundant parameters and improve visibility strictness in
LRUQueryCache
(Atri Sharma)
- SOLR-13663: Introduce <SpanPositionRange> into XML Query Parser
(Alessandro Benedetti via Mikhail Khludnev)
- LUCENE-8952: Use a sort key instead of true distance in NearestNeighbor
(Julie Tibshirani)
- LUCENE-8620: Tessellator labels the edges of the generated triangles to indicate whether they belong to
the original polygon. This information is added to the triangle encoding.
(Ignacio Vera)
- LUCENE-8964: Fix geojson shape parsing on string arrays in properties
(Alexander Reelsen)
- LUCENE-8976: Use exact distance between point and bounding rectangle in FloatPointNearestNeighbor.
(Ignacio Vera)
- LUCENE-8966: The Korean analyzer now splits tokens on boundaries between digits and alphabetic characters.
(Jim Ferenczi)
- LUCENE-8984: MoreLikeThis MLT is biased for uncommon fields
(Andy Hind via Anshum Gupta)
- Optimizations (8)
- LUCENE-8922: DisjunctionMaxQuery more efficiently leverages impacts to skip
non-competitive hits.
(Adrien Grand)
- LUCENE-8935: BooleanQuery with no scoring clause can now early terminate the query when
the total hit count is not requested.
(Jim Ferenczi)
- LUCENE-8941: Matches on wildcard queries will defer building their full
disjunction until a MatchesIterator is pulled
(Alan Woodward)
- LUCENE-8755: spatial-extras quad and packed quad prefix trees now index points faster.
(Chongchen Chen, David Smiley)
- LUCENE-8860: add additional leaf node level optimizations in LatLonShapeBoundingBoxQuery.
(Igor Motov via Ignacio Vera)
- LUCENE-8968: Improve performance of WITHIN and DISJOINT queries for Shape queries by
doing just one pass whenever possible.
(Ignacio Vera)
- LUCENE-8939: Introduce shared count based early termination across multiple slices
(Atri Sharma)
- LUCENE-8980: Blocktree's seekExact now short-circuits to false if the term isn't in the min-max range of the segment.
Large performance gain for ID/time-like data when populated sequentially.
(Guoqiang Jiang)
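The min-max short-circuit described in LUCENE-8980 above can be illustrated with a minimal Java sketch. Everything here is an assumption made for illustration: `MinMaxSeekSketch` is a hypothetical class, a sorted `String[]` stands in for a segment's term dictionary, and `String#compareTo` stands in for Lucene's byte-wise term order. It is not the BlockTree implementation.

```java
import java.util.Arrays;

// Hypothetical stand-in for a segment's term dictionary (not a Lucene class).
final class MinMaxSeekSketch {
  private final String[] sortedTerms;
  // Counts how often the (comparatively expensive) dictionary lookup actually runs.
  int dictionaryLookups = 0;

  MinMaxSeekSketch(String... terms) {
    sortedTerms = terms.clone();
    Arrays.sort(sortedTerms);
  }

  boolean seekExact(String term) {
    if (sortedTerms.length == 0) {
      return false;
    }
    String min = sortedTerms[0];
    String max = sortedTerms[sortedTerms.length - 1];
    // Cheap range check first: a term outside [min, max] cannot be in this segment.
    if (term.compareTo(min) < 0 || term.compareTo(max) > 0) {
      return false;
    }
    dictionaryLookups++;
    return Arrays.binarySearch(sortedTerms, term) >= 0;
  }

  public static void main(String[] args) {
    MinMaxSeekSketch segment = new MinMaxSeekSketch("id0100", "id0101", "id0105");
    System.out.println(segment.seekExact("id0099")); // below min, no lookup
    System.out.println(segment.seekExact("id0101")); // in range, lookup runs
    System.out.println(segment.dictionaryLookups);
  }
}
```

With sequentially assigned IDs or timestamps, each segment covers a narrow range, so most lookups for a given key are likely rejected by the two comparisons alone.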
- Bug Fixes (2)
- LUCENE-8755: spatial-extras quad and packed quad prefix trees could throw a
NullPointerException for certain cell edge coordinates
(Chongchen Chen, David Smiley)
- LUCENE-9005: BooleanQuery.visit() would pull subVisitors from its parent visitor, rather
than from a visitor for its own specific query. This could cause problems when BQ was
nested under another BQ. Instead, we now pull a MUST subvisitor, pass it to any MUST
subclauses, and then pull SHOULD, MUST_NOT and FILTER visitors from it rather than from
the parent.
(Alan Woodward)
- Other (7)
- LUCENE-8778 LUCENE-8911 LUCENE-8957: Define analyzer SPI names as static final fields and document the names in Javadocs.
(Tomoko Uchida, Uwe Schindler)
- LUCENE-8758: QuadPrefixTree: removed levelS and levelN fields which weren't used.
(Amish Shah)
- LUCENE-8975: Code Cleanup: Use entryset for map iteration wherever possible.
(Koen De Groote)
- LUCENE-8993, LUCENE-8807: Changed all repository and download references in build files
to HTTPS.
(Uwe Schindler)
- LUCENE-8998: Fix OverviewImplTest.testIsOptimized reproducible failure.
(Tomoko Uchida)
- LUCENE-8999: LuceneTestCase.expectThrows now propagates assert/assumption failures up to the test
w/o wrapping in a new assertion failure unless the caller has explicitly expected them
(hossman)
- LUCENE-8062: GlobalOrdinalsWithScoreQuery is no longer eligible for query caching.
(Jim Ferenczi)
- API Changes (3)
- LUCENE-8865: IndexSearcher now uses Executor instead of ExecutorService.
This change is fully backwards compatible since ExecutorService directly
extends Executor.
(Simon Willnauer)
- LUCENE-8856: Intervals queries have moved from the sandbox to the queries
module.
(Alan Woodward)
- LUCENE-8893: Intervals.wildcard() and Intervals.prefix() methods now take
BytesRef rather than String.
(Alan Woodward)
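The compatibility claim in LUCENE-8865 above follows from plain JDK typing: `java.util.concurrent.ExecutorService` is a sub-interface of `Executor`, so any existing pool can be passed where only an `Executor` is required. The sketch below uses a hypothetical helper, `ExecutorCompatSketch.runAll`, in place of IndexSearcher; only the `java.util.concurrent` calls are real APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical consumer that, like IndexSearcher after LUCENE-8865,
// only needs Executor#execute(Runnable).
final class ExecutorCompatSketch {
  // Runs `tasks` trivial tasks on the executor and returns how many completed.
  static int runAll(Executor executor, int tasks) {
    AtomicInteger completed = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(tasks);
    for (int i = 0; i < tasks; i++) {
      executor.execute(() -> {
        completed.incrementAndGet();
        done.countDown();
      });
    }
    try {
      done.await(); // asynchronous executors need time to finish
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return completed.get();
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      System.out.println(runAll(pool, 4));          // an ExecutorService is an Executor
      System.out.println(runAll(Runnable::run, 4)); // so is a same-thread executor
    } finally {
      pool.shutdown();
    }
  }
}
```

Callers that previously passed an ExecutorService need no source changes. Accepting the narrower interface additionally allows lightweight executors such as `Runnable::run`.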
- New Features (10)
- LUCENE-8632: New XYShape Field and Queries for indexing and searching general cartesian
geometries.
(Nick Knize)
- LUCENE-8891: Snowball stemmer/analyzer for the Estonian language.
(Gert Morten Paimla via Tomoko Uchida)
- LUCENE-8815: Provide a DoubleValues implementation for retrieving the value of features without
requiring a separate numeric field. Note that as feature values are stored with only 8 bits of
mantissa the values returned may have a delta from the original values indexed.
(Colin Goodheart-Smithe via Adrien Grand)
- LUCENE-8803: Provide a FeatureSortfield to allow sorting search hits by descending value of a
feature. This is exposed via the factory method FeatureField#newFeatureSort.
(Colin Goodheart-Smithe via Adrien Grand)
- LUCENE-8784: The KoreanTokenizer now preserves punctuation if discardPunctuation is set
to false (defaults to true).
(Namgyu Kim via Jim Ferenczi)
- LUCENE-8812: Add new KoreanNumberFilter that can convert Hangul characters to numbers
and process decimal points. It is similar to the JapaneseNumberFilter.
(Namgyu Kim)
- LUCENE-8362: Add doc-value support to range fields.
(Atri Sharma via Adrien Grand)
- LUCENE-8766: Add monitor subproject (previously Luwak monitoring library). This
allows a stream of documents to be matched against a set of registered queries
in an efficient manner, for use as a monitoring or classification tool.
(Alan Woodward)
- LUCENE-7714: Add a numeric range query in sandbox that takes advantage of index sorting.
(Julie Tibshirani via Jim Ferenczi)
- LUCENE-8859: The completion suggester's postings format now has an option to
load its internal FST off-heap.
(Jim Ferenczi)
- Bug Fixes (9)
- LUCENE-8831: Fixed LatLonShapeBoundingBoxQuery.hashCode methods.
(Ignacio Vera)
- LUCENE-8775: Improve tessellator to better handle cases where a hole shares a vertex
with the polygon.
(Ignacio Vera)
- LUCENE-8785: Ensure new threadstates are locked before retrieving the number of active threadstates.
Without this lock, assertion errors and potentially broken field attributes could occur in the IndexWriter when
IndexWriter#deleteAll is called while actively indexing.
(Simon Willnauer)
- LUCENE-8804: Forbid calls to putAttribute on frozen FieldType instances.
(Vamshi Vijay Nakkirtha via Adrien Grand)
- LUCENE-8828: Removes the buggy 'disallow overlaps' boolean from Intervals.unordered(),
and replaces it with a new Intervals.unorderedNoOverlaps() method
(Alan Woodward)
- LUCENE-8843: Don't ignore exceptions that are thrown when trying to open a
file in IOUtils#fsync.
(Jason Tedor via Adrien Grand)
- LUCENE-8835: FileSwitchDirectory now respects the file extension when listing directory
contents to ensure we don't expose pending deletes if both directories point to the same
underlying filesystem directory.
(Simon Willnauer)
- LUCENE-8853: FileSwitchDirectory now applies best effort to place tmp files in the same
directory as the target files.
(Simon Willnauer)
- LUCENE-8892: Add missing closing parentheses in MultiBoolFunction's description()
(Florian Diebold, Munendra S N)
- Improvements (8)
- LUCENE-7840: Non-scoring BooleanQuery now removes SHOULD clauses before building the scorer supplier
as opposed to eliminating them during scoring construction.
(Atri Sharma via Jim Ferenczi)
- LUCENE-8770: BlockMaxConjunctionScorer now leverages two-phase iterators in order to avoid
executing the second phase when scorers don't intersect.
(Adrien Grand, Jim Ferenczi)
- LUCENE-8818: Fix smokeTestRelease.py encoding bug
(janhoy)
- LUCENE-8845: Allow Intervals.prefix() and Intervals.wildcard() to specify
their maximum allowed expansions
(Alan Woodward)
- LUCENE-8875: Introduce a Collector optimized for use cases when a large
number of hits is requested
(Atri Sharma)
- LUCENE-8848 LUCENE-7757 LUCENE-8492: The UnifiedHighlighter now detects that parts of the query are not understood by
it, and thus it should not make optimizations that result in no highlights or slow highlighting. This generally works
best for WEIGHT_MATCHES mode. Consequently queries produced by ComplexPhraseQueryParser and the surround QueryParser
will now highlight correctly.
(David Smiley)
- LUCENE-8793: Luke enhanced UI for CustomAnalyzer: show detailed analysis steps.
(Jun Ohtani via Tomoko Uchida)
- LUCENE-8855: Add Accountable to some Query implementations
(ab, Adrien Grand)
- Optimizations (8)
- LUCENE-8796: Use exponential search instead of binary search in
IntArrayDocIdSet#advance method
(Luca Cavanna via Adrien Grand)
- LUCENE-8865: Use incoming thread for execution if IndexSearcher has an executor.
Now caller threads execute at least one search on an index even if there is
an executor provided to minimize thread context switching.
(Simon Willnauer)
- LUCENE-8868: New storing strategy for BKD tree leaves with low cardinality.
It stores the distinct values once with the cardinality value reducing the
storage cost.
(Ignacio Vera)
- LUCENE-8885: Optimise BKD reader by exploiting cardinality information stored
on leaves.
(Ignacio Vera)
- LUCENE-8896: Override default implementation of IntersectVisitor#visit(DocIDSetBuilder, byte[])
for several queries.
(Ignacio Vera)
- LUCENE-8901: Load frequencies lazily only when needed in BlockDocsEnum and
BlockImpactsEverythingEnum
(Mayya Sharipova)
- LUCENE-8888: Optimize distribution of points with data dimensions in
BKD tree leaves.
(Ignacio Vera)
- LUCENE-8311: Phrase queries now leverage impacts.
(Adrien Grand)
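The exponential search mentioned in LUCENE-8796 above can be sketched generically as follows. This is a textbook galloping search over a sorted `int[]` and is an illustration under stated assumptions, not the actual IntArrayDocIdSet code; the class and method names are hypothetical. The idea is that `advance` targets are usually close to the current position, so doubling the step from there finds a small window faster than a binary search over the whole remaining array.

```java
// Hypothetical sketch of exponential (galloping) search on a sorted doc ID array.
final class ExponentialSearchSketch {
  // Returns the index of the first element >= target in docs[from, length),
  // or `length` if there is none. Ignores int overflow for arrays close to
  // Integer.MAX_VALUE in size.
  static int advance(int[] docs, int from, int length, int target) {
    // Gallop: double the step until we pass the target or run off the end.
    int bound = 1;
    while (from + bound < length && docs[from + bound] < target) {
      bound *= 2;
    }
    // The answer now lies in [from + bound / 2, min(from + bound, length)].
    int lo = from + bound / 2;
    int hi = Math.min(from + bound, length);
    // Plain lower-bound binary search inside that window.
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] docs = {2, 5, 7, 11, 13, 20, 21, 40};
    int idx = advance(docs, 0, docs.length, 12);
    System.out.println(idx < docs.length ? docs[idx] : -1);
  }
}
```

The cost is logarithmic in the distance between `from` and the result rather than in the array length, which suits iterators that advance in small steps.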
- Test Framework (1)
- LUCENE-8825: CheckHits now displays the shard index in case of mismatch
between top hits.
(Atri Sharma via Adrien Grand)
- Other (6)
- LUCENE-8847: Code Cleanup: Remove StringBuilder.append with concatenated
strings.
(Koen De Groote via Uwe Schindler)
- LUCENE-8861: Script to find open GitHub PRs that need attention
(janhoy)
- LUCENE-8852: ReleaseWizard tool for release managers
(janhoy)
- LUCENE-8838: Remove support for Steiner points on Tessellator.
(Ignacio Vera)
- LUCENE-8879: Improve BKDRadixSelector tests.
(Ignacio Vera)
- LUCENE-8886: Fix TestMutablePointsReaderUtils tests.
(Ignacio Vera)
- Improvements (1)
- LUCENE-8781: FST lookup performance has been improved in many cases by
encoding Arcs using full-sized arrays with gaps. The new encoding is
enabled for postings in the default codec and for suggesters.
(Mike Sokolov)
- API Changes (2)
- LUCENE-3041: A query introspection API has been added. Queries should
implement a visit() method, taking a QueryVisitor, and either pass the
visitor down to any child queries, or call a visitX() or consumeX() method
on it. All locations in the code that called Weight.extractTerms()
have been changed to use this API, and the extractTerms() method has
been deprecated.
(Alan Woodward, Simon Willnauer, David Smiley, Luca Cavanna)
- LUCENE-8735: Directory.getPendingDeletions is now abstract to ensure
subclasses override it. FilterDirectory now delegates the call, ensuring
correct default behaviour for subclasses.
(Henning Andersen)
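The visitor contract described in LUCENE-3041 above (a leaf query reports itself to the visitor, a compound query hands a visitor down to its children) can be sketched with a toy model. All type and method names below (`ToyVisitor`, `ToyQuery`, `consumeTerm`, `subVisitor`, and so on) are hypothetical and chosen so they are not mistaken for Lucene's real QueryVisitor API, whose exact signatures are not given in this change log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy visitor: receives leaf terms and supplies a visitor for child queries.
interface ToyVisitor {
  void consumeTerm(String field, String text);

  // A compound query asks for the visitor its children should use.
  default ToyVisitor subVisitor(ToyQuery parent) {
    return this;
  }
}

interface ToyQuery {
  void visit(ToyVisitor visitor);
}

// Leaf query: calls a consume method on the visitor.
final class ToyTermQuery implements ToyQuery {
  private final String field;
  private final String text;

  ToyTermQuery(String field, String text) {
    this.field = field;
    this.text = text;
  }

  @Override
  public void visit(ToyVisitor visitor) {
    visitor.consumeTerm(field, text);
  }
}

// Compound query: pulls a sub-visitor for itself and passes it to each child.
final class ToyCompoundQuery implements ToyQuery {
  private final List<ToyQuery> children;

  ToyCompoundQuery(ToyQuery... children) {
    this.children = Arrays.asList(children);
  }

  @Override
  public void visit(ToyVisitor visitor) {
    ToyVisitor sub = visitor.subVisitor(this);
    for (ToyQuery child : children) {
      child.visit(sub);
    }
  }
}

// Example visitor that collects every term in the query tree.
final class ToyTermCollector implements ToyVisitor {
  final List<String> terms = new ArrayList<>();

  @Override
  public void consumeTerm(String field, String text) {
    terms.add(field + ":" + text);
  }
}
```

Calling `visit` on the root with a `ToyTermCollector` walks the whole tree without the caller knowing the concrete query types, which is the role the entry describes for code that previously relied on Weight.extractTerms().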
- New Features (1)
- LUCENE-2562: The well-known graphical user interface for inspecting Lucene
indexes "Luke" was added as a Lucene module. It can be started from the
binary distribution by calling the shell scripts in the module folder
or from the source checkout by using `ant -f lucene/luke/build.xml run`.
Luke provides a Swing-based user interface and can be used to open
Lucene or Solr (or Elasticsearch) indexes, inspect documents, check index
commits and segments, or test (custom) analyzers. It also has maintenance
functions to check index structures and force merge indexes for archival.
Luke was originally developed by Andrzej Bialecki, later maintained by
Dmitry Kan and finally rewritten by Tomoko Uchida to use the ASF licensing
compatible Swing framework (as shipped with JDKs).
(Tomoko Uchida, Uwe Schindler)
- Bug fixes (10)
- LUCENE-8736: LatLonShapePolygonQuery returns incorrect WITHIN results
with shared boundaries. Point in Polygon now correctly includes boundary
points. Box and Polygon relations with triangles have also been improved to
correctly include boundary points.
(Nick Knize)
- LUCENE-8712: Polygon2D does not detect crossings through segment edges.
(Ignacio Vera)
- LUCENE-8720: NameIntCacheLRU (in the facets module) had an int
overflow bug that disabled cleaning of the cache
(Russell A Brown)
- LUCENE-8726: ValueSource.asDoubleValuesSource() could leak a reference to
IndexSearcher
(Alan Woodward, Yury Pakhomov)
- LUCENE-8719: FixedShingleFilter can miss shingles at the end of a token stream if
there are multiple paths with different lengths.
(Alan Woodward)
- LUCENE-8688: TieredMergePolicy#findForcedMerges now tries to create the
cheapest merges that allow the index to go down to `maxSegmentCount` segments
or less.
(Armin Braun via Adrien Grand)
- LUCENE-8477: Interval disjunctions could miss valid hits if some of the
clauses of the disjunction are minimized away. We now rewrite intervals
if a source contains a disjunction and the internal gaps matter for
matching. This behaviour can be disabled if users are more interested
in speed rather than accuracy of matching.
(Alan Woodward, Jim Ferenczi)
- LUCENE-8741: ValueSource.fromDoubleValuesSource() was casting to
Scorer instead of Scorable, leading to ClassCastExceptions
(Markus Jelsma, Alan Woodward)
- LUCENE-8754: Fix ConcurrentModificationException in SegmentInfo if
attributes are accessed in MergePolicy while the merge is running
(Simon Willnauer)
- LUCENE-8765: Fixed validation of the number of added points in KD trees.
(Zhao Yang via Adrien Grand)
- Improvements (13)
- LUCENE-8673: Use radix partitioning when merging dimensional points instead
of sorting all dimensions beforehand.
(Ignacio Vera, Adrien Grand)
- LUCENE-8687: Optimise radix partitioning for points on heap.
(Ignacio Vera)
- LUCENE-8699: Change HeapPointWriter to use a single byte array instead of a list
of byte arrays. In addition a new interface PointValue is added to abstract out
the different formats between offline and on-heap writers.
(Ignacio Vera)
- LUCENE-8703: Build point writers in the BKD tree only when they are needed.
(Ignacio Vera)
- LUCENE-8652: SynonymQuery can now deboost the document frequency of each term when
blending the score of the synonym.
(Jim Ferenczi)
- LUCENE-8631: The Korean user dictionary now picks the longest-matching word and discards
the other matches.
(Yeongsu Kim via Jim Ferenczi)
- LUCENE-8732: ConstantScoreQuery can now early terminate the query if the minimum score is
greater than the constant score and total hits are not requested.
(Jim Ferenczi)
- LUCENE-8750: Implements setMissingValue() on sort fields produced from
DoubleValuesSource and LongValuesSource
(Mike Sokolov via Alan Woodward)
- LUCENE-8701: ToParentBlockJoinQuery now creates a child scorer that disallows skipping over
non-competitive documents if the score of a parent depends on the score of multiple
children (avg, max, min). Additionally the score mode `none` that assigns a constant score to
each parent can early terminate the collection of top scores.
(Jim Ferenczi)
- LUCENE-8751: Weight#matches now uses the ScorerSupplier to build scorers with a lead cost of 1
(single document).
(Jim Ferenczi)
- LUCENE-8752: Japanese new era name '令和' (Reiwa) is added to the dictionary used in
JapaneseTokenizer so that the analyzer handles the era name correctly.
Reiwa is set to replace the Heisei Era on May 1, 2019.
(Tomoko Uchida)
- LUCENE-8671: Introduced reader attributes that allow per-IndexReader configuration
of codec internals. This enables per-reader configuration of whether FSTs are on- or off-heap on a
per-field basis
(Simon Willnauer)
- LUCENE-8787: spatial-extras DateRangePrefixTree used to only parse ISO-8601 timestamps with 0 or 3
digits of milliseconds precision but now parses other lengths (although > 3 not used).
(Thomas Lemmé via David Smiley)
- Changes in Runtime Behavior (4)
- LUCENE-8671: Load FST off-heap also for ID-like fields if reader is not opened
from an IndexWriter.
(Simon Willnauer)
- LUCENE-8730: WordDelimiterGraphFilter always emits its original token first. This
brings its behaviour into line with the deprecated WordDelimiterFilter, so that
the only difference in output between the two is in the position length
attribute.
(Alan Woodward, Jim Ferenczi)
- LUCENE-7386: Disjunctions nested in disjunctions are now flattened. This might
trigger changes in the produced scores due to changes to the order in which
scores of sub clauses are summed up.
(Adrien Grand)
- LUCENE-8756: MoreLikeThisQuery now respects custom term frequencies
(TermFrequencyAttribute) at search time
(Olli Kuonanoja)
- Other (5)
- LUCENE-8680: Refactor EdgeTree#relateTriangle method.
(Ignacio Vera)
- LUCENE-8685: Refactor LatLonShape tests.
(Ignacio Vera)
- LUCENE-8713: Add Line2D tests.
(Ignacio Vera)
- LUCENE-8729: Workaround: Disable accessibility doclints (Java 13+),
so compilation with recent JDK succeeds.
(Uwe Schindler)
- LUCENE-8725: Make TermsQuery.SeekingTermSetTermsEnum a top level class and public
(noble)
- API Changes (31)
- LUCENE-8662: Change TermsEnum.seekExact(BytesRef) to abstract and delegate seekExact(BytesRef)
in FilterLeafReader.FilterTermsEnum.
(Jeffery Yuan via Tomás Fernández Löbbe, Simon Willnauer)
- LUCENE-8469: Deprecated StringHelper.compare has been removed.
(Dawid Weiss)
- LUCENE-8039: Introduce a "delta distance" method set to GeoDistance. This
allows distance calculations, especially for paths, to take into account an
"excursion" to include the specified point.
- LUCENE-8007: Index statistics Terms.getSumDocFreq(), Terms.getDocCount() are
now required to be stored by codecs. Additionally, TermsEnum.totalTermFreq()
and Terms.getSumTotalTermFreq() are now required: if frequencies are not
stored they are equal to TermsEnum.docFreq() and Terms.getSumDocFreq(),
respectively, because all freq() values equal 1.
(Adrien Grand, Robert Muir)
- LUCENE-8038: Deprecated PayloadScoreQuery constructors have been removed
(Alan Woodward)
- LUCENE-8014: Similarity.computeSlopFactor() and
Similarity.computePayloadFactor() have been removed
(Alan Woodward)
- LUCENE-7996: Queries are now required to produce positive scores.
(Adrien Grand)
- LUCENE-8099: CustomScoreQuery, BoostedQuery and BoostingQuery have been
removed
(Alan Woodward)
- LUCENE-8012: Explanation now takes Number rather than float
(Alan Woodward, Robert Muir)
- LUCENE-8116: SimScorer now only takes a frequency and a norm as per-document
scoring factors.
(Adrien Grand)
- LUCENE-8113: TermContext has been renamed to TermStates, and can now be
constructed lazily if term statistics are not required
(Alan Woodward)
- LUCENE-8242: Deprecated method IndexSearcher#createNormalizedWeight() has
been removed
(Alan Woodward)
- LUCENE-8267: Memory codecs removed from the codebase (MemoryPostings,
MemoryDocValues).
(Dawid Weiss)
- LUCENE-8144: Moved QueryCachingPolicy.ALWAYS_CACHE to the test framework.
(Nhat Nguyen via Adrien Grand)
- LUCENE-8356: StandardFilter and StandardFilterFactory have been removed
(Alan Woodward)
- LUCENE-8373: StandardAnalyzer.ENGLISH_STOP_WORD_SET has been removed
(Alan Woodward)
- LUCENE-8388: Unused PostingsEnum#attributes() method has been removed
(Alan Woodward)
- LUCENE-8405: TopDocs.maxScore is removed. IndexSearcher and TopFieldCollector
no longer have an option to compute the maximum score when sorting by field.
(Adrien Grand)
- LUCENE-8411: TopFieldCollector no longer takes a fillFields option, it now
always fills fields.
(Adrien Grand)
- LUCENE-8412: TopFieldCollector no longer takes a trackDocScores option. Scores
need to be set on top hits via TopFieldCollector#populateScores instead.
(Adrien Grand)
- LUCENE-6228: A new Scorable abstract class has been added, containing only those
methods from Scorer that should be called from Collectors. LeafCollector.setScorer()
now takes a Scorable rather than a Scorer.
(Alan Woodward, Adrien Grand)
- LUCENE-8475: Deprecated constants have been removed from RamUsageEstimator.
(Dimitrios Athanasiou)
- LUCENE-8483: Scorers may no longer take null as a Weight
(Alan Woodward)
- LUCENE-8352: TokenStreamComponents is now final, and can take a Consumer<Reader>
in its constructor
(Mark Harwood, Alan Woodward, Adrien Grand)
- LUCENE-8498: LowerCaseTokenizer has been removed, and CharTokenizer no longer
takes a normalizer function.
(Alan Woodward)
- LUCENE-7875: Moved MultiFields static methods out of the class. getLiveDocs is now
in MultiBits which is now public. getMergedFieldInfos and getIndexedFields are now in
FieldInfos. getTerms is now in MultiTerms. getTermPositionsEnum and getTermDocsEnum
were collapsed and renamed to just getTermPostingsEnum and moved to MultiTerms.
(David Smiley)
- LUCENE-8513: MultiFields.getFields is now removed. Please avoid this class,
and Fields in general, when possible.
(David Smiley)
- LUCENE-8497: MultiTermAwareComponent has been removed, and in its place
TokenFilterFactory and CharFilterFactory now expose type-safe normalize()
methods. This decouples normalization from tokenization entirely.
(Mayya Sharipova, Alan Woodward)
- LUCENE-8597: IntervalIterator now exposes a gaps() method that reports the
number of gaps between its component sub-intervals. This can be used in a
new filter available via Intervals.maxgaps().
(Alan Woodward)
- LUCENE-8609: Remove IndexWriter#numDocs() and IndexWriter#maxDoc() in favor
of IndexWriter#getDocStats().
(Simon Willnauer)
- LUCENE-8292: Make TermsEnum fully abstract.
(Simon Willnauer)
- Changes in Runtime Behavior (15)
- LUCENE-8333: Switch MoreLikeThis.setMaxDocFreqPct to use maxDoc instead of
numDocs.
(Robert Muir, Dawid Weiss)
- LUCENE-7837: Indices that were created before the previous major version
will now fail to open even if they have been merged with the previous major
version.
(Adrien Grand)
- LUCENE-8020: Similarities are no longer passed terms that don't exist by
queries such as SpanOrQuery, so scoring formulas no longer require
divide-by-zero hacks. IndexSearcher.termStatistics/collectionStatistics return null
instead of returning bogus values for a non-existent term or field.
(Robert Muir)
- LUCENE-7996: FunctionQuery and FunctionScoreQuery now return a score of 0
when the function produces a negative value.
(Adrien Grand)
- LUCENE-8116: Similarities now score fields that omit norms as if the norm was
1. This might change score values on fields that omit norms.
(Adrien Grand)
- LUCENE-8134: Index options are no longer automatically downgraded.
(Adrien Grand)
- LUCENE-8031: Length normalization correctly reflects omission of term frequencies.
(Robert Muir, Adrien Grand)
- LUCENE-7444: StandardAnalyzer no longer defaults to removing English stopwords
(Alan Woodward)
- LUCENE-8060: IndexSearcher's search and searchAfter methods now only compute
total hit counts accurately up to 1,000 in order to enable top-hits
optimizations such as block-max WAND (LUCENE-8135).
(Adrien Grand)
- LUCENE-8505: IndexWriter#addIndices will now fail if the target index is sorted but
the candidate is not.
(Jim Ferenczi)
- LUCENE-8535: Highlighter and FVH don't support ToParent and ToChildBlockJoinQuery out of the
box anymore. In order to highlight on Block-Join Queries a custom WeightedSpanTermExtractor / FieldQuery
should be used.
(Simon Willnauer, Jim Ferenczi, Julie Tibshirani)
- LUCENE-8563: BM25 scores don't include the (k1+1) factor in their numerator
anymore. This doesn't affect ordering as this is a constant factor which is
the same for every document.
(Luca Cavanna via Adrien Grand)
- LUCENE-8509: WordDelimiterGraphFilter will no longer set the offsets of internal
tokens by default, preventing a number of bugs when the filter is chained with
tokenfilters that change the length of their tokens
(Alan Woodward)
- LUCENE-8633: IntervalQuery scores do not use term weighting any more, the score
is instead calculated as a function of the sloppy frequency of the matching
intervals.
(Alan Woodward, Jim Ferenczi)
- LUCENE-8635: FSTs can now remain off-heap, accessed via
IndexInput, and the default codec's term dictionary
(BlockTreeTermsReader) will now leave the FST for the terms index
off-heap for non-primary-key fields using MMapDirectory, reducing
heap usage for such fields.
(Ankit Jain)
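The ordering claim in LUCENE-8563 above can be checked with a small sketch based on the textbook BM25 term score. The class and method names are hypothetical, `idf` is taken as an input rather than computed, and the formula is the standard published one rather than a copy of Lucene's BM25Similarity code. Because (k1 + 1) multiplies every document's score by the same positive constant, dropping it rescales scores without changing their relative order.

```java
// Hypothetical sketch comparing BM25 term scores with and without the (k1 + 1) factor.
final class Bm25FactorSketch {
  // Saturating term-frequency component shared by both variants.
  static double tfNorm(double freq, double docLen, double avgDocLen, double k1, double b) {
    return freq / (freq + k1 * (1 - b + b * docLen / avgDocLen));
  }

  // Classic form: the numerator carries the constant (k1 + 1) factor.
  static double scoreWithFactor(
      double idf, double freq, double docLen, double avgDocLen, double k1, double b) {
    return idf * (k1 + 1) * tfNorm(freq, docLen, avgDocLen, k1, b);
  }

  // Form described in LUCENE-8563: the same score without the constant factor.
  static double scoreWithoutFactor(
      double idf, double freq, double docLen, double avgDocLen, double k1, double b) {
    return idf * tfNorm(freq, docLen, avgDocLen, k1, b);
  }

  public static void main(String[] args) {
    double k1 = 1.2, b = 0.75, idf = 2.0, avgDocLen = 80;
    double[][] docs = {{3, 100}, {1, 50}}; // {term frequency, document length}
    for (double[] doc : docs) {
      System.out.println(
          scoreWithFactor(idf, doc[0], doc[1], avgDocLen, k1, b)
              + " vs "
              + scoreWithoutFactor(idf, doc[0], doc[1], avgDocLen, k1, b));
    }
  }
}
```

Absolute score values do change, so code that compares scores against fixed thresholds, or mixes BM25 scores with other signals, may need retuning.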
- New Features (12)
- LUCENE-8340: LongPoint#newDistanceFeatureQuery may be used to boost scores based on
how close a value of a long field is to a configurable origin. This is
typically useful to boost by recency.
(Adrien Grand)
- LUCENE-8482: LatLonPoint#newDistanceFeatureQuery may be used to boost scores
based on the haversine distance of a LatLonPoint field to a provided point. This is
typically useful to boost by distance.
(Ignacio Vera)
- LUCENE-8216: Added a new BM25FQuery in sandbox to blend statistics across several fields
using the BM25F formula.
(Adrien Grand, Jim Ferenczi)
- LUCENE-8564: GraphTokenFilter is an abstract class useful for token filters that need
to read-ahead in the token stream and take into account graph structures. This
also changes FixedShingleFilter to extend GraphTokenFilter
(Alan Woodward)
- LUCENE-8612: Intervals.extend() treats an interval as if it covered a wider
span than it actually does, allowing users to force minimum gaps between
intervals in a phrase.
(Alan Woodward)
- LUCENE-8629: New interval functions: Intervals.before(), Intervals.after(),
Intervals.within() and Intervals.overlapping().
(Alan Woodward)
- LUCENE-8622: Adds a minimum-should-match interval function that produces intervals
spanning a subset of a set of sources.
(Alan Woodward)
- LUCENE-8645: Intervals.fixField() allows you to report intervals from one field
as if they came from another.
(Alan Woodward)
- LUCENE-8646: New interval functions: Intervals.prefix() and Intervals.wildcard()
(Alan Woodward)
- LUCENE-8655: Add a getter to the FunctionScoreQuery class to access the
underlying DoubleValuesSource.
(Gérald Quaire via Alan Woodward)
- LUCENE-8697: GraphTokenStreamFiniteStrings correctly handles side paths
containing gaps
(Alan Woodward)
- LUCENE-8702: Simplify intervals returned from vararg Intervals factory methods
(Alan Woodward)
- Improvements (6)
- LUCENE-7997: Add BaseSimilarityTestCase to sanity check similarities.
SimilarityBase switches to 64-bit doubles internally to help avoid common numeric issues.
Add missing range checks for similarity parameters.
Improve BM25 and ClassicSimilarity's explanations.
(Robert Muir)
- LUCENE-8011: Improved similarity explanations.
(Mayya Sharipova via Adrien Grand)
- LUCENE-4198: Codecs now have the ability to index score impacts.
(Adrien Grand)
- LUCENE-8135: Boolean queries now implement the block-max WAND algorithm in
order to speed up selection of top scored documents.
(Adrien Grand)
- LUCENE-8279: CheckIndex now cross-checks terms with norms.
(Adrien Grand)
- LUCENE-8660: TopDocsCollectors now return an accurate count (instead of a lower bound)
if the total hit count is equal to the provided threshold.
(Adrien Grand, Jim Ferenczi)
- Optimizations (12)
- LUCENE-8040: Optimize IndexSearcher.collectionStatistics, avoiding MultiFields/MultiTerms
(David Smiley, Robert Muir)
- LUCENE-4100: Disjunctions now support faster collection of top hits when the
total hit count is not required.
(Stefan Pohl, Adrien Grand, Robert Muir)
- LUCENE-7993: Phrase queries are now faster if total hit counts are not
required.
(Adrien Grand)
- LUCENE-8109: Boolean queries propagate information about the minimum
competitive score in order to make collection faster if there are disjunctions
or phrase queries as sub queries, which know how to leverage this information
to run faster.
(Adrien Grand)
- LUCENE-8439: Disjunction max queries can skip blocks to select the top documents
if the total hit count is not required.
(Jim Ferenczi, Adrien Grand)
- LUCENE-8204: Boolean queries with a mix of required and optional clauses are
now faster if the total hit count is not required.
(Jim Ferenczi, Adrien Grand)
- LUCENE-8448: Boolean queries now propagate the minimum score to their sub-scorers.
(Jim Ferenczi, Adrien Grand)
- LUCENE-8511: MultiFields.getIndexedFields is now optimized; does not call getMergedFieldInfos
(David Smiley)
- LUCENE-8507: TopFieldCollector can now update the minimum competitive score if the primary sort
is by relevancy and the total hit count is not required.
(Jim Ferenczi)
- LUCENE-8464: ConstantScoreScorer now implements setMinCompetitiveScore in order
to early terminate the iterator if the minimum score is greater than the constant
score.
(Christophe Bismuth via Jim Ferenczi)
- LUCENE-8607: MatchAllDocsQuery can shortcut when total hit count is not
required
(Alan Woodward, Adrien Grand)
- LUCENE-8585: Index-time jump-tables for DocValues, for O(1) advance when retrieving doc values.
(Toke Eskildsen, Adrien Grand)
- Bug fixes (6)
- LUCENE-8726: ValueSource.asDoubleValuesSource() could leak a reference to
IndexSearcher
(Alan Woodward, Yury Pakhomov)
- LUCENE-8735: FilterDirectory.getPendingDeletions now forwards to the delegate
even though the method is not abstract in the super class. This prevents issues
where our best effort at carrying on generations in the IndexWriter fails because pending
deletions are swallowed by the FilterDirectory.
(Henning Andersen, Simon Willnauer)
- LUCENE-8688: TieredMergePolicy#findForcedMerges now tries to create the
cheapest merges that allow the index to go down to `maxSegmentCount` segments
or less.
(Armin Braun via Adrien Grand)
- LUCENE-8785: Ensure new threadstates are locked before retrieving the number of active threadstates.
Without this lock, assertion errors and potentially broken field attributes could occur in the IndexWriter when
IndexWriter#deleteAll is called while actively indexing.
(Simon Willnauer)
- LUCENE-8720: NameIntCacheLRU (in the facets module) had an int
overflow bug that disabled cleaning of the cache
(Russell A Brown)
- LUCENE-8809: Refresh and rollback concurrently can leave segment states unclosed
(Nhat Nguyen)
- Changes in Runtime Behavior (1)
- LUCENE-8527: StandardTokenizer and UAX29URLEmailTokenizer now support Unicode 9.0,
and provide Unicode UTS#51 v11.0 Emoji tokenization with the "<EMOJI>" token type.
- Build (2)
- LUCENE-8611: Update randomizedtesting to 2.7.2, JUnit to 4.12, add hamcrest-core
dependency.
(Dawid Weiss)
- LUCENE-8537: ant test command fails under lucene/tools
(Peter Somogyi)
- Bug fixes (9)
- LUCENE-8669: Fix LatLonShape WITHIN queries that fail with Multiple search Polygons
that share the dateline.
(Nick Knize)
- LUCENE-8603: Fix the inversion of right ids for additional nouns in the Korean user dictionary.
(Yoo Jeongin via Jim Ferenczi)
- LUCENE-8624: int overflow in ByteBuffersDataOutput.size().
(Mulugeta Mammo, Dawid Weiss)
- LUCENE-8625: int overflow in ByteBuffersDataInput.sliceBufferList.
(Mulugeta Mammo, Dawid Weiss)
- LUCENE-8639: Newly created threadstates while flushing / refreshing can cause duplicated
sequence IDs on IndexWriter.
(Simon Willnauer)
- LUCENE-8649: LatLonShape's within and disjoint queries can return false positives with
indexed multi-shapes.
(Ignacio Vera)
- LUCENE-8654: Polygon2D#relateTriangle returns the wrong answer if polygon is inside
the triangle.
(Ignacio Vera)
- LUCENE-8650: ConcatenatingTokenStream did not correctly clear its state in reset(), and
was not propagating final position increments from its child streams correctly.
(Dan Meehl, Alan Woodward)
- LUCENE-8676: The Korean tokenizer does not update the last position if the backtrace is caused
by a big buffer (1024 chars).
(Jim Ferenczi)
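
The int overflows fixed in LUCENE-8624 and LUCENE-8625 belong to a common class of bug that can be shown without Lucene. The sketch below is not the ByteBuffersDataOutput code; it assumes, purely for illustration, that a total size is computed as a block count times a block size. The class and method names are hypothetical.

```java
public class SizeOverflowSketch {
    // Overflow-prone: the multiplication is done in 32-bit int arithmetic,
    // and only the already-wrapped result is widened to long.
    static long sizeWithIntMath(int blockCount, int blockSize) {
        return blockCount * blockSize;
    }

    // Safe: widen one operand first so the multiplication is done in 64 bits.
    static long sizeWithLongMath(int blockCount, int blockSize) {
        return (long) blockCount * blockSize;
    }

    public static void main(String[] args) {
        int blockSize = 1 << 30; // 1 GiB blocks, chosen only to force the overflow
        System.out.println(sizeWithIntMath(3, blockSize));  // wraps to a negative value
        System.out.println(sizeWithLongMath(3, blockSize)); // 3221225472
    }
}
```

Declaring the return type as long is not enough on its own; the widening has to happen before the arithmetic.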
- New Features (3)
- LUCENE-8026: ExitableDirectoryReader may now time out queries that run on
points such as range queries or geo queries.
(Christophe Bismuth via Adrien Grand)
- LUCENE-8508: IndexWriter can now set the created version via
IndexWriterConfig#setIndexCreatedVersionMajor. This is an expert feature.
(Adrien Grand)
- LUCENE-8601: Attributes set in the IndexableFieldType for each field during indexing will
now be recorded into the corresponding FieldInfo's attributes, accessible at search
time
(Murali Krishna P)
- Improvements (8)
- LUCENE-8463: TopFieldCollector can now early-terminate queries when sorting by SortField.DOC.
(Christophe Bismuth via Jim Ferenczi)
- LUCENE-8562: Speed up merging segments of points with data dimensions by only sorting on the indexed
dimensions.
(Ignacio Vera)
- LUCENE-8529: TopSuggestDocsCollector will now use the completion key to tiebreak completion
suggestions with identical scores.
(Jim Ferenczi)
- LUCENE-8575: SegmentInfos#toString now includes attributes and diagnostics.
(Namgyu Kim via Adrien Grand)
- LUCENE-8548: The KoreanTokenizer no longer splits unknown words on combining diacritics and
detects script boundaries more accurately with Character#UnicodeScript#of.
(Christophe Bismuth, Jim Ferenczi)
- LUCENE-8581: Change LatLonShape encoding to use 4 bytes per dimension.
(Ignacio Vera, Nick Knize, Adrien Grand)
- LUCENE-8527: Upgrade JFlex dependency to 1.7.0; in StandardTokenizer and UAX29URLEmailTokenizer,
increase supported Unicode version from 6.3 to 9.0, and support Unicode UTS#51 v11.0 Emoji tokenization.
- LUCENE-8640: Date Range format validation
(Lucky Sharma, David Smiley via Mikhail Khludnev)
- Optimizations (6)
- LUCENE-8552: FieldInfos.getMergedFieldInfos no longer does any merging if there is <= 1 segment.
(Christophe Bismuth via David Smiley)
- LUCENE-8590: BufferedUpdates now uses an optimized storage for buffering docvalues updates that
can save up to 80% of the heap used compared to the previous implementation and uses non-object
based data structures.
(Simon Willnauer, Mike McCandless, Shai Erera, Adrien Grand)
- LUCENE-8598: Moving to the default accepted overhead ratio for packed ints in DocValuesFieldUpdates
yields an up-to 4x performance improvement when applying doc values updates.
(Simon Willnauer, Adrien Grand)
- LUCENE-8599: Use sparse bitset to store docs in SingleValueDocValuesFieldUpdates.
(Simon Willnauer, Adrien Grand)
- LUCENE-8600: Doc-value updates get applied faster by sorting with quicksort,
which needs to perform fewer swaps than an in-place mergesort.
(Adrien Grand)
- LUCENE-8623: Decrease I/O pressure when merging high dimensional points.
(Ignacio Vera)
- Test Framework (1)
- LUCENE-8604: TestRuleLimitSysouts now has an optional "hard limit" of bytes that can be written
to stderr and stdout (anything beyond the hard limit is ignored). The default hard limit is 2 GB of
logs per test class.
(Dawid Weiss)
- Other (3)
- LUCENE-8573: BKDWriter now uses FutureArrays#mismatch to compute shared prefixes.
(Christoph Büscher via Adrien Grand)
- LUCENE-8605: Separate bounding box spatial logic from query logic on LatLonShapeBoundingBoxQuery.
(Ignacio Vera)
- LUCENE-8609: Deprecated IndexWriter#numDocs() and IndexWriter#maxDoc() in favor of IndexWriter#getDocStats(),
which returns consistent numDocs and maxDoc stats that are not subject to concurrent changes.
(Simon Willnauer, Nhat Nguyen)
- Build (2)
- LUCENE-8504: Upgrade forbiddenapis to version 2.6.
(Uwe Schindler)
- LUCENE-8493: Stop publishing insecure .sha1 files with releases
(janhoy)
- Bug fixes (13)
- LUCENE-8479: QueryBuilder#analyzeGraphPhrase now throws a TooManyClauses exception
if the number of expanded paths reaches the BooleanQuery#maxClause limit.
(Jim Ferenczi)
- LUCENE-8522: throw InvalidShapeException when constructing a polygon and
all points are coplanar.
(Ignacio Vera)
- LUCENE-8531: QueryBuilder#analyzeGraphPhrase now creates one phrase query per finite string
in the graph if the slop is greater than 0. Span queries cannot be used in this case because
they don't handle slop the same way as phrase queries.
(Steve Rowe, Uwe Schindler, Jim Ferenczi)
- LUCENE-8524: Add the Hangul Letter Araea (interpunct) as a separator in Nori's tokenizer.
This change also removes empty terms and trims surface forms in Nori's Korean dictionary.
(Trey Jones, Jim Ferenczi)
- LUCENE-8550: Fix filtering of coplanar points when creating linked list on
polygon tessellator.
(Ignacio Vera)
- LUCENE-8549: Polygon tessellator throws an error if some parts of the shape
could not be processed.
(Ignacio Vera)
- LUCENE-8540: Better handling of min/max values for Geo3d encoding.
(Ignacio Vera)
- LUCENE-8534: Fix incorrect computation for triangles intersecting polygon edges in
shape tessellation.
(Ignacio Vera)
- LUCENE-8559: Fix bug where polygon edges were skipped when checking for intersections.
(Ignacio Vera)
- LUCENE-8556: Use latitude and longitude instead of encoding values to check if triangle is ear
when using morton optimisation.
(Ignacio Vera)
- LUCENE-8586: Intervals.or() could get stuck in an infinite loop on certain indexes
(Alan Woodward)
- LUCENE-8595: Fix interleaved DV update and reset. An interleaved update and reset of a value
for the same doc in the same updates package loses an update if the reset comes before
the update, and loses the reset if the update comes first.
(Simon Willnauer, Adrien Grand)
- LUCENE-8592: Fix index sorting corruption due to numeric overflow. The merge of sorted segments
can produce an invalid sort if the sort field is an Integer/Long that uses reverse order and contains
values equal to Integer/Long#MIN_VALUE. These values are always sorted first during a merge
(instead of last because of the reverse order) due to this bug. Indices affected by the bug can be
detected by running the CheckIndex command on a distribution that contains the fix (7.6+).
(Jim Ferenczi, Adrien Grand, Mike McCandless, Simon Willnauer)
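
The MIN_VALUE behavior described in LUCENE-8592 can be reproduced in isolation. The sketch below is not Lucene's merge code; it assumes, as an illustration only, that reverse order is implemented by negating each value before comparing. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class ReverseSortOverflowSketch {
    // Hypothetical "reverse by negation" key. In two's complement,
    // -Long.MIN_VALUE overflows and is still Long.MIN_VALUE.
    static long negatedKey(long value) {
        return -value;
    }

    // Descending sort by comparing negated keys (the overflow-prone approach).
    static Long[] sortByNegatedKey(Long[] values) {
        Long[] copy = values.clone();
        Arrays.sort(copy, (a, b) -> Long.compare(negatedKey(a), negatedKey(b)));
        return copy;
    }

    // Descending sort by swapping the comparison arguments; no arithmetic
    // is performed on the values, so nothing can overflow.
    static Long[] sortBySwappedCompare(Long[] values) {
        Long[] copy = values.clone();
        Arrays.sort(copy, (a, b) -> Long.compare(b, a));
        return copy;
    }

    public static void main(String[] args) {
        Long[] values = {5L, Long.MIN_VALUE, -3L};
        // MIN_VALUE lands first instead of last:
        System.out.println(Arrays.toString(sortByNegatedKey(values)));
        // MIN_VALUE lands last, as reverse order requires:
        System.out.println(Arrays.toString(sortBySwappedCompare(values)));
    }
}
```

Under this assumption, MIN_VALUE is the only value whose negation does not change sign, which matches the entry's note that only fields containing Integer/Long#MIN_VALUE in reverse order are affected.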
- New Features (5)
- LUCENE-8496: Selective indexing - modify BKDReader/BKDWriter to allow users
to select fewer dimensions to be used for creating the index than
the total number of dimensions used for field encoding. i.e., dimensions 0 to N
may be used to determine how to split the inner nodes, and dimensions N+1 to D
are ignored and stored as data dimensions at the leaves.
(Nick Knize)
- LUCENE-8538: Add a Simple WKT Shape Parser for creating Lucene Geometries (Polygon, Line,
Rectangle) from WKT format.
(Nick Knize)
- LUCENE-8462: Adds an Arabic snowball stemmer based on
https://github.com/snowballstem/snowball/blob/master/algorithms/arabic.sbl
(Ryadh Dahimene via Jim Ferenczi)
- LUCENE-8554: Add new LatLonShapeLineQuery that queries indexed LatLonShape fields
by arbitrary lines.
(Nick Knize)
- LUCENE-8555: Add dateline crossing support to LatLonShapeBoundingBoxQuery.
(Nick Knize)
- Improvements (3)
- LUCENE-8521: Change LatLonShape encoding to 7 dimensions instead of 6; where the
first 4 are index dimensions defining the bounding box of the Triangle and the
remaining 3 data dimensions define the vertices of the triangle.
(Nick Knize)
- LUCENE-8557: LeafReader.getFieldInfos is now documented, and tested, to return
the same cached instance. MemoryIndex's impl now pre-creates the FieldInfos instead of
re-calculating a new instance each time.
(Tim Underwood, David Smiley)
- LUCENE-8558: Replace O(N) lookup with O(1) lookup in PerFieldMergeState#FilterFieldInfos.
(Kranthi via Simon Willnauer)
- Other (2)
- LUCENE-8523: Correct typo in JapaneseNumberFilterFactory javadocs
(Ankush Jhalani via Alan Woodward)
- LUCENE-8533: Fix Javadocs of DataInput#readVInt(): Negative numbers are
supported, but should be avoided.
(Vladimir Dolzhenko via Uwe Schindler)
- Bug Fixes (1)
- LUCENE-8454: Fix incorrect vertex indexing and other computation errors in
shape tessellation that would sometimes cause an infinite loop.
(Nick Knize)
- API Changes (18)
- LUCENE-8467: RAMDirectory, RAMFile, RAMInputStream, RAMOutputStream are deprecated
(Dawid Weiss)
- LUCENE-8356: StandardFilter is deprecated
(Alan Woodward)
- LUCENE-8373: ENGLISH_STOP_WORD_SET on StandardAnalyzer is deprecated. Instead
use EnglishAnalyzer.ENGLISH_STOP_WORD_SET. The default constructor for
StopAnalyzer is also deprecated, and a stop word set should be explicitly
passed to the constructor.
(Alan Woodward)
- LUCENE-8378: Add DocIdSetIterator.range static method to return an iterator
matching a range of docids
(Mike McCandless)
- LUCENE-8379: Add experimental TermQuery.getTermStates method
(Mike McCandless)
- LUCENE-8407: Add experimental SpanTermQuery.getTermStates method
(David Smiley)
- LUCENE-8390: MatchesIteratorSupplier replaced by IOSupplier
(Alan Woodward, David Smiley)
- LUCENE-8397: Add DirectoryTaxonomyWriter.getCache
(Mike McCandless)
- LUCENE-8387: Add experimental IndexSearcher.getSlices API to see which slices
IndexSearcher is searching concurrently when it's created with an ExecutorService
(Mike McCandless)
- LUCENE-8263: TieredMergePolicy's reclaimDeletesWeight has been replaced with a
new deletesPctAllowed setting to control how aggressively deletes should be
reclaimed.
(Erick Erickson, Adrien Grand)
- LUCENE-7314: Graduate LatLonPoint and query classes to core
(Nick Knize)
- LUCENE-8428: The way that oal.util.PriorityQueue creates sentinel objects has
been changed from a protected method to a java.util.function.Supplier as a
constructor argument.
(Adrien Grand)
- LUCENE-8437: CheckIndex.Status.cantOpenSegments and missingSegmentVersion
have been removed as they were not computed correctly.
(Adrien Grand)
- LUCENE-8286: The UnifiedHighlighter has a new HighlightFlag.WEIGHT_MATCHES flag that
will tell this highlighter to use the new MatchesIterator API as the underlying
approach to navigate matching hits for a query. This mode will highlight more
accurately than any other highlighter, and can mark up phrases as one span instead of
word-by-word. The UH's public internal APIs changed a bit in the process.
(David Smiley)
- LUCENE-8471: IndexWriter.getFlushingBytes() returns how many bytes are currently
being flushed to disk.
(Alan Woodward)
- LUCENE-8422: Static helper functions for Matches and MatchesIterator implementations
have been moved from Matches to MatchesUtils
(Alan Woodward)
- LUCENE-8343: Suggesters now require Long (versus long, previously) from weight() method
while indexing, and provide double (versus long, previously) scores at lookup time
(Alessandro Benedetti)
- LUCENE-8459: SearcherTaxonomyManager now has a constructor taking already opened
IndexReaders, allowing the caller to pass a FilterDirectoryReader, for example.
(Mike McCandless)
- Bug Fixes (13)
- LUCENE-8445: Tighten condition when two planes are identical to prevent constructing
bogus tiles when building GeoPolygons.
(Ignacio Vera)
- LUCENE-8444: Prevent building functionally identical plane bounds when constructing
DualCrossingEdgeIterator.
(Ignacio Vera)
- LUCENE-8380: UTF8TaxonomyWriterCache inconsistency.
(Ruslan Torobaev, Dawid Weiss)
- LUCENE-8164: IndexWriter silently accepts broken payload. This has been fixed
via LUCENE-8165 since we are now checking for offset+length going out of bounds.
(Robert Muir, Nhat Nguyen, Simon Willnauer)
- LUCENE-8370: Reproducing
TestLucene{54,70}DocValuesFormat.testSortedSetVariableLengthBigVsStoredFields()
failures
(Erick Erickson)
- LUCENE-8376, LUCENE-8371: ConditionalTokenFilter.end() would not propagate correctly
if the last token in the stream was subsequently dropped; FixedShingleFilter did
not set position increment in end()
(Alan Woodward)
- LUCENE-8395: WordDelimiterGraphFilter would incorrectly insert a hole into a
TokenStream if a token consisting entirely of delimiter characters was
encountered, but preserve_original was set.
(Alan Woodward)
- LUCENE-8398: TieredMergePolicy.getMaxMergedSegmentMB has rounding error
(Erick Erickson)
- LUCENE-8429: DaciukMihovAutomatonBuilder is no longer prone to stack
overflows by enforcing a maximum term length.
(Adrien Grand)
- LUCENE-8441: IndexWriter now checks doc value type for index sort fields
and fails the document if they are not compatible.
(Jim Ferenczi, Mike McCandless)
- LUCENE-8458: Adjust initialization condition of PendingSoftDeletes and ensure
it is initialized before accepting deletes
(Simon Willnauer, Nhat Nguyen)
- LUCENE-8466: IndexWriter.deleteDocs(Query... query) incorrectly applies deletes on flush
if the index is sorted.
(Adrien Grand, Jim Ferenczi, Vish Ramachandran)
- LUCENE-8502: Allow access to delegate in FilterCodecReader. FilterCodecReader didn't
allow access to its delegate like other filter readers. This adds a new #getDelegate method
to access the wrapped reader.
(Simon Willnauer)
- Changes in Runtime Behavior (3)
- LUCENE-7976: TieredMergePolicy now respects maxSegmentSizeMB by default when executing
findForcedMerges and findForcedDeletesMerges
(Erick Erickson)
- LUCENE-8263: TieredMergePolicy now reclaims deleted documents more
aggressively by default ensuring that no more than ~1/3 of the index size is
used by deleted documents.
(Adrien Grand)
- LUCENE-8503: Call #getDelegate instead of direct member access during unwrap.
Filter*Reader instances access the member or the delegate directly instead of
calling getDelegate(). In order to track access of the delegate these methods
should call #getDelegate()
(Simon Willnauer)
- Improvements (14)
- LUCENE-8468: A ByteBuffer based Directory implementation.
(Dawid Weiss)
- LUCENE-8447: Add DISJOINT and WITHIN support to LatLonShape queries.
(Nick Knize)
- LUCENE-8440: Add support for indexing and searching Line and Point shapes using LatLonShape encoding
(Nick Knize)
- LUCENE-8435: Add new LatLonShapePolygonQuery for querying indexed LatLonShape fields by arbitrary polygons
(Nick Knize)
- LUCENE-8367: Make per-dimension drill down optional for each facet dimension
(Mike McCandless)
- LUCENE-8396: Add Points Based Shape Indexing and Search that decomposes shapes
into a triangular mesh and indexes individual triangles as a 6 dimension point
(Nick Knize)
- LUCENE-8345, GitHub PR #392: Remove instantiation of redundant wrapper classes for primitives;
add wrapper class constructors to forbiddenapis.
(Michael Braun via Uwe Schindler)
- LUCENE-8415: Clean up Directory contracts and JavaDoc comments.
(Dawid Weiss)
- LUCENE-8414: Make segmentInfos private in IndexWriter
(Simon Willnauer, Nhat Nguyen)
- LUCENE-8446: The UnifiedHighlighter's DefaultPassageFormatter now treats overlapping matches in
the passage as merged (as if one larger match).
(David Smiley)
- LUCENE-8460: Better argument validation in StoredField.
(Namgyu Kim)
- LUCENE-8432: TopFieldComparator stops comparing documents if the index is
sorted, even if hits still need to be visited to compute the hit count.
(Nikolay Khitrin)
- LUCENE-8422: IntervalQuery now returns useful Matches
(Alan Woodward)
- LUCENE-7862: Store the real bounds of the leaf cells in the BKD index when the
number of dimensions is bigger than 1. It improves performance when there is
correlation between the dimensions, for example ranges.
(Ignacio Vera, Adrien Grand)
- Build (1)
- LUCENE-5143: Stop publishing KEYS file with each version, use topmost lucene/KEYS file only.
The buildAndPushRelease.py script validates that RM's PGP key is in the KEYS file.
Remove unused 'copy-to-stage' and '-dist-keys' targets from ant build.
(janhoy)
- Other (9)
- LUCENE-8485: Update randomizedtesting to version 2.6.4.
(Dawid Weiss)
- LUCENE-8366: Upgrade to ICU 62.1. Emoji handling now uses Unicode 11's
Extended_Pictographic property.
(Robert Muir)
- LUCENE-8408: original Highlighter: Remove obsolete static AttributeFactory instance
in TokenStreamFromTermVector.
(Michael Braun, David Smiley)
- LUCENE-8420: Upgrade OpenNLP to 1.9.0 so OpenNLP tool can read the new model format which 1.8.x
cannot read. 1.9.0 can read the old format.
(Koji Sekiguchi)
- LUCENE-8453: Add documentation to analysis factories of Korean (Nori) analyzer
module.
(Tomoko Uchida via Uwe Schindler)
- LUCENE-8455: Upgrade ECJ compiler to 4.6.1 in lucene/common-build.xml
(Erick Erickson)
- LUCENE-8456: Upgrade Apache Commons Compress to v1.18
(Steve Rowe)
- LUCENE-765: Improved org.apache.lucene.index javadocs.
(Mike Sokolov)
- LUCENE-8476: Remove redundant nullity check and switch to optimized List.sort in the
Korean user dictionary.
(Namgyu Kim)
- Bug Fixes (4)
- LUCENE-8365: Fix ArrayIndexOutOfBoundsException in UnifiedHighlighter. This fixes
an "off by one" error in the UnifiedHighlighter's code that is only triggered when
two nested SpanNearQueries contain the same term.
(Marc-Andre Morissette via Simon Willnauer)
- LUCENE-8381: Fix IndexWriter incorrectly interpreting hard-deletes as soft-deletes
while wrapping reader for merges.
(Simon Willnauer, Nhat Nguyen)
- LUCENE-8384: Fix missing advance docValues generation while handling docValues
update in PendingSoftDeletes.
(Simon Willnauer, Nhat Nguyen)
- LUCENE-8472: Always rewrite the soft-deletes merge retention query.
(Adrien Grand, Nhat Nguyen)
- Upgrading (1)
- LUCENE-8344: If you are using the AnalyzingSuggester or FuzzySuggester subclass, and if you
explicitly use the preservePositionIncrements=false setting (not the default), then you ought
to rebuild your suggester index. If you don't, queries or indexed data with trailing position
gaps (e.g. stop words) may not work correctly.
(David Smiley, Jim Ferenczi)
- API Changes (3)
- LUCENE-8242: IndexSearcher.createNormalizedWeight() has been deprecated.
Instead use IndexSearcher.createWeight(), rewriting the query first.
(Alan Woodward)
- LUCENE-8248: MergePolicyWrapper is renamed to FilterMergePolicy and now
also overrides getMaxCFSSegmentSizeMB
(Mike Sokolov via Mike McCandless)
- LUCENE-8303: LiveDocsFormat is now only responsible for (de)serialization of
live docs.
(Adrien Grand)
- Changes in Runtime Behavior (2)
- LUCENE-8309: Live docs are no longer backed by a FixedBitSet.
(Adrien Grand)
- LUCENE-8330: Detach IndexWriter from MergePolicy. Instead of requiring
IndexWriter as a hard dependency, MergePolicy now expects a MergeContext, which
IndexWriter implements.
(Simon Willnauer, Robert Muir, Dawid Weiss, Mike McCandless)
- New Features (19)
- LUCENE-8200: Allow doc-values to be updated atomically together
with a document. Doc-Values updates now can be used as a soft-delete
mechanism to allow keeping several versions of a document or already
deleted documents around for later reuse. See "IW.softUpdateDocument(...)"
for reference.
(Simon Willnauer)
- LUCENE-8197: A new FeatureField makes it easy and efficient to integrate
static relevance signals into the final score.
(Adrien Grand, Robert Muir)
- LUCENE-8202: Add a FixedShingleFilter
(Alan Woodward, Adrien Grand, Jim Ferenczi)
- LUCENE-8125: ICUTokenizer support for emoji/emoji sequence tokens.
(Robert Muir)
- LUCENE-8196, LUCENE-8300: A new IntervalQuery in the sandbox allows efficient proximity
searches based on minimum-interval semantics.
(Alan Woodward, Adrien Grand, Jim Ferenczi, Simon Willnauer, Matt Weber)
- LUCENE-8233: Add support for soft deletes to IndexWriter delete accounting.
Soft deletes are accounted for inside the index writer and therefore also
by merge policies. A SoftDeletesRetentionMergePolicy is added that allows
soft-deleted documents to be selectively carried over across merges for retention
policies.
(Simon Willnauer, Mike McCandless, Robert Muir)
- LUCENE-8237: Add a SoftDeletesDirectoryReaderWrapper that respects
soft deletes if the reader is opened from a directory.
(Simon Willnauer, Mike McCandless, Uwe Schindler, Adrien Grand)
- LUCENE-8229, LUCENE-8270: Add a method Weight.matches(LeafReaderContext, doc)
that returns an iterator over matching positions for a given query and document.
This allows exact hit extraction and will enable implementation of accurate
highlighters.
(Alan Woodward, Adrien Grand, David Smiley)
- LUCENE-8249: Implement Matches API for phrase queries
(Alan Woodward, Adrien Grand)
- LUCENE-8246: Allow to customize the number of deletes a merge claims. This
helps merge policies in the soft-delete case to correctly implement retention
policies without triggering unnecessary merges.
(Simon Willnauer, Mike McCandless)
- LUCENE-8231: A new analysis module (nori) similar to Kuromoji
but to handle Korean using mecab-ko-dic and morphological analysis.
(Robert Muir, Jim Ferenczi)
- LUCENE-8265: WordDelimiterFilter and WordDelimiterGraphFilter now have an option to skip tokens
marked with KeywordAttribute
(Mike Sokolov via Mike McCandless)
- LUCENE-8297: Add IW#tryUpdateDocValues(Reader, int, Fields...). IndexWriter can
update doc values for a specific term but this might affect all documents
containing the term. With tryUpdateDocValues users can update doc-values
fields for individual documents. This allows for instance to soft-delete
individual documents.
(Simon Willnauer)
- LUCENE-8298: Allow DocValues updates to reset a value. Passing a DV field with a null
value to IW#updateDocValues or IW#tryUpdateDocValues will now remove the value from the
provided document. This makes it possible to undelete a soft-deleted document unless it's been claimed
by a merge.
(Simon Willnauer)
- LUCENE-8273: ConditionalTokenFilter allows analysis chains to skip particular token
filters based on the attributes of the current token. This generalises the keyword
token logic currently used for stemmers and WDF. It is integrated into
CustomAnalyzer by using the `when` and `whenTerm` builder methods, and a new
ProtectedTermFilter is added as an example.
(Alan Woodward, Robert Muir, David Smiley, Steve Rowe, Mike Sokolov)
- LUCENE-8310: Ensure IndexFileDeleter accounts for pending deletes. Today we fail
creating the IndexWriter when the directory has a pending delete. Yet, this
is mainly done to prevent writing still existing files more than once.
IndexFileDeleter already accounts for that for existing files which we can
now use to also take pending deletes into account which ensures that all file
generations per segment always go forward.
(Simon Willnauer)
- LUCENE-7960: Add preserveOriginal option to the NGram and EdgeNGram filters.
(Ingomar Wesp, Shawn Heisey via Robert Muir)
- LUCENE-8335: Enforce soft-deletes field up-front. Soft deletes field must be marked
as such once it's introduced and can't be changed after the fact.
(Nhat Nguyen via Simon Willnauer)
- LUCENE-8332: New ConcatenateGraphFilter for concatenating all tokens into one (or more
in the event of a graph input). This is useful for fast analyzed exact-match lookup,
suggesters, and as a component of a named entity recognition system. This was excised
out of CompletionTokenStream in the NRT doc suggester.
(David Smiley, Jim Ferenczi)
- Bug Fixes (19)
- LUCENE-8221: MoreLikeThis.setMaxDocFreqPct can easily int-overflow on larger
indexes.
- LUCENE-8266: Detect bogus tiles when creating a standard polygon and
throw a TileException.
(Ignacio Vera)
- LUCENE-8234: Fixed bug in how spatial relationship is computed for
GeoStandardCircle when it covers the whole world.
(Ignacio Vera)
- LUCENE-8236: Filter duplicated points when creating GeoPath shapes to
avoid creation of bogus planes.
(Ignacio Vera)
- LUCENE-8243: IndexWriter.addIndexes(Directory[]) did not properly preserve
index file names for updated doc values fields
(Simon Willnauer, Michael McCandless, Nhat Nguyen)
- LUCENE-8275: Push up #checkPendingDeletes to Directory to ensure IW fails if
the directory has pending deletes files even if the directory is filtered or
a FileSwitchDirectory
(Simon Willnauer, Robert Muir)
- LUCENE-8244: Do not leak open file descriptors in SearcherTaxonomyManager's
refresh on exception
(Mike McCandless)
- LUCENE-8305: ComplexPhraseQuery.rewrite now handles an embedded MultiTermQuery
that rewrites to a MatchNoDocsQuery instead of throwing an exception.
(Bjarke Mortensen, Andy Tran via David Smiley)
- LUCENE-8287: Ensure that empty regex completion queries always return no results.
(Julie Tibshirani via Jim Ferenczi)
- LUCENE-8317: Prevent concurrent deletes from being applied during full flush.
Future deletes could potentially be exposed to flushes/commits/refreshes if the
amount of RAM used by deletes is greater than half of the IW RAM buffer.
(Simon Willnauer)
- LUCENE-8320: Fix WindowsFS to correctly account for rename and hardlinks.
(Simon Willnauer, Nhat Nguyen)
- LUCENE-8328: Ensure ReadersAndUpdates consistently executes under lock.
(Nhat Nguyen via Simon Willnauer)
- LUCENE-8325: Fixed the smartcn tokenizer to not split UTF-16 surrogate pairs.
(chengpohi via Jim Ferenczi)
- LUCENE-8186: LowerCaseTokenizerFactory now lowercases text in multi-term
queries.
(Tim Allison via Adrien Grand)
- LUCENE-8278: Some end-of-input no-scheme domain-only URL tokens are typed as
<ALPHANUM> rather than <URL>.
(Junte Zhang, Steve Rowe)
- LUCENE-8355: Prevent IW from opening an already dropped segment while DV updates
are written.
(Nhat Nguyen via Simon Willnauer)
- LUCENE-8344: TokenStreamToAutomaton (used by some suggesters) was not ignoring a trailing
position increment when the preservePositionIncrement setting is false.
(David Smiley, Jim Ferenczi)
- LUCENE-8357: FunctionScoreQuery.boostByQuery() and boostByValue() were
producing truncated Explanations
(Markus Jelsma, Alan Woodward)
- LUCENE-8360: NGramTokenFilter and EdgeNGramTokenFilter did not correctly
set position increments in end()
(Alan Woodward)
- Other (9)
- LUCENE-8301: Update randomizedtesting to 2.6.0.
(Dawid Weiss)
- LUCENE-8299: Geo3D wrapper uses new polygon method factory that gives better
support for polygons with many points (>100).
(Ignacio Vera)
- LUCENE-8261: InterpolatedProperties.interpolate and recursive property
references.
(Steve Rowe, Dawid Weiss)
- LUCENE-8228: removed obsolete IndexDeletionPolicy clone() requirements from
the javadoc.
(Dawid Weiss)
- LUCENE-8219: Use a realistic estimate of the number of nodes and links in
LevensteinAutomaton.java, to save reallocation of arrays.
(Christian Ziech)
- LUCENE-8214: Improve selection of testPoint for GeoComplexPolygon.
(Ignacio Vera)
- SOLR-10912: Add automatic patch validation.
(Mano Kovacs, Steve Rowe)
- LUCENE-8122, LUCENE-8175: Upgrade analysis/icu to ICU 61.1.
(Robert Muir, Adrien Grand, Uwe Schindler)
- LUCENE-8291: Remove QueryTemplateManager utility class from XML queryparser.
This class is just a general XML transforming tool (using property files and
XSLT) and has nothing to do with query parsing. It can easily be implemented
using more sophisticated libraries or using XSL transformers from the JDK.
This change also removes the Lucene demo webapp to prevent XSS issues in
untested/unmaintained code.
(Uwe Schindler)
- Build (2)
- LUCENE-7935: Publish .sha512 hash files with the release artifacts and stop
publishing .md5 hashes since the algorithm is broken
(janhoy)
- LUCENE-8230: Upgrade forbiddenapis to version 2.5.
(Uwe Schindler)
- Documentation (1)
- LUCENE-8238: Improve WordDelimiterFilter and WordDelimiterGraphFilter javadocs
(Mike Sokolov via Mike McCandless)
- Bug fixes (1)
- LUCENE-8254: LRUQueryCache could cause IndexReader to hang on close, when
shared with another reader with no CacheHelper
(Alan Woodward, Simon Willnauer, Adrien Grand)
- API Changes (4)
- LUCENE-8051: LevensteinDistance renamed to LevenshteinDistance.
(Pulak Ghosh via Adrien Grand)
- LUCENE-8099: Deprecate CustomScoreQuery, BoostedQuery and BoostingQuery.
Users should instead use FunctionScoreQuery, possibly combined with
a lucene expression
(Alan Woodward)
- LUCENE-8104: Remove facets module compile-time dependency on queries
(Alan Woodward)
- LUCENE-8145: UnifiedHighlighter now uses a unitary OffsetsEnum rather
than a list of enums
(Alan Woodward, David Smiley, Jim Ferenczi, Timothy Rodriguez)
- New Features (2)
- LUCENE-2899: Add new module analysis/opennlp, with analysis components
to perform tokenization, part-of-speech tagging, lemmatization and phrase
chunking by invoking the corresponding OpenNLP tools. Named entity
recognition is also provided as a Solr update request processor.
(Lance Norskog, Grant Ingersoll, Joern Kottmann, Em, Kai Gülzau,
Rene Nederhand, Robert Muir, Steven Bower, Steve Rowe)
- LUCENE-8126: Add new spatial prefix tree (SPT) based on Google S2 geometry.
It can only be used currently with Geo3D spatial context and it provides
improvements on indexing time for non-points shapes and on query performance.
(Ignacio Vera, David Smiley)
- Improvements (11)
- LUCENE-8081: Allow IndexWriter to opt out of flushing on indexing threads.
Index/Update Threads try to help out flushing pending document buffers to
disk. This change adds an expert setting to opt out of this behavior unless
flushing is falling behind.
(Simon Willnauer)
- LUCENE-8086: spatial-extras Geo3dFactory: Use GeoExactCircle with
configurable precision for non-spherical planet models.
(Ignacio Vera via David Smiley)
- LUCENE-8093: TrimFilterFactory implements MultiTermAwareComponent
(Alan Woodward)
- LUCENE-8094: TermInSetQuery.toString now returns "field:(A B C)"
(Mike McCandless)
- LUCENE-8121: UnifiedHighlighter passage relevancy is improved for terms that are
position sensitive (e.g. part of a phrase) by having an accurate freq.
(David Smiley)
- LUCENE-8129: A Unicode set filter can now be specified when using ICUFoldingFilter.
(Ere Maijala)
- LUCENE-7966: Build Multi-Release JARs to enable usage of optimized intrinsic methods
from Java 9 for index bounds checking and array comparison/mismatch. This change
introduces Java 8 replacements for those Java 9 methods and patches the compiled
classes to use the optimized variants through the MR-JAR mechanism.
(Uwe Schindler, Robert Muir, Adrien Grand, Mike McCandless)
- LUCENE-8127: Speed up rewriteNoScoring when there are no MUST clauses.
(Michael Braun via Adrien Grand)
- LUCENE-8152: Improve consumption of doc-value iterators.
(Horatiu Lazu via Adrien Grand)
- LUCENE-8033: FieldInfos now always use a dense encoding.
(Mayya Sharipova via Adrien Grand)
- LUCENE-8190: Specialized cell interface to allow any spatial prefix tree to
benefit from the setting setPruneLeafyBranches on RecursivePrefixTreeStrategy.
(Ignacio Vera)
- Bug Fixes (10)
- LUCENE-8077: Fixed bug in how CheckIndex verifies doc-value iterators.
(Xiaoshan Sun via Adrien Grand)
- SOLR-11758: Fixed FloatDocValues.boolVal to correctly return true for all values != 0.0F
(Munendra S N via hossman)
- LUCENE-8121: The UnifiedHighlighter would highlight some terms within some nested
SpanNearQueries at positions where it should not have. It's fixed in the UH by
switching to the SpanCollector API. The original Highlighter still has this
problem (LUCENE-2287, LUCENE-5455, LUCENE-6796). Some public but internal parts of
the UH were refactored.
(David Smiley, Steve Davids)
- LUCENE-8120: Fix LatLonBoundingBox's toString() method
(Martijn van Groningen, Adrien Grand)
- LUCENE-8130: Fix NullPointerException from TermStates.toString()
(Mike McCandless)
- LUCENE-8124: Fixed HyphenationCompoundWordTokenFilter to correctly handle
hyphenation patterns with indicator >= 7.
(Holger Bruch via Adrien Grand)
- LUCENE-8163: BaseDirectoryTestCase could produce random filenames that fail
on Windows
(Alan Woodward)
- LUCENE-8174: Fixed {Float,Double,Int,Long}Range.toString().
(Oliver Kaleske via Adrien Grand)
- LUCENE-8182: Fixed BoostingQuery to apply the context boost instead of the parent query
boost
(Jim Ferenczi)
- LUCENE-8188: Fixed bugs in OpenNLPOpsFactory that were causing InputStreams fetched from the
ResourceLoader to be leaked
(hossman)
- Other (8)
- LUCENE-8111: IndexOrDocValuesQuery Javadoc references outdated method name.
(Kai Chan via Adrien Grand)
- LUCENE-8106: Add script (reproduceJenkinsFailures.py) to attempt to reproduce
failing tests from a Jenkins log.
(Steve Rowe)
- LUCENE-8075: Removed unnecessary null check in IntersectTermsEnum.
(Pulak Ghosh via Adrien Grand)
- LUCENE-8156: Require users to not have ASM on the Ant classpath during build.
This is required by LUCENE-7966.
(Adrien Grand, Uwe Schindler)
- LUCENE-8161: spatial-extras: the Spatial4j dependency has been updated from 0.6 to 0.7,
which is drop-in compatible (Lucene doesn't expressly use any of the few API differences).
Spatial4j 0.7 is compatible with JTS 1.15.0 and not any prior version. JTS 1.15.0 is
dual-licensed to include BSD; prior versions were LGPL.
(David Smiley)
- LUCENE-8155: Add back support in smoke tester to run against later Java versions.
(Uwe Schindler)
- LUCENE-8169: Migrated build to use OpenClover 4.2.1 for checking code coverage.
(Uwe Schindler)
- LUCENE-8170: Improve OpenClover reports (separate test from production code);
enable coverage reports inside test-frameworks.
(Uwe Schindler)
- Build (2)
- LUCENE-8168: Moved Groovy scripts in build files to separate files.
Update Groovy to 2.4.13.
(Uwe Schindler)
- LUCENE-8176: HttpReplicatorTest awaits more than a minute for stopping Jetty threads
(Mikhail Khludnev)
- Bug Fixes (1)
- LUCENE-8117: Fix advanceExact on SortedNumericDocValues produced by Lucene54DocValues.
(Jim Ferenczi).
- API Changes (8)
- LUCENE-8017, LUCENE-8042: Weight, DoubleValuesSource and related objects
now implement a SegmentCacheable interface, with a single method
isCacheable(LeafReaderContext) determining whether or not the object may
be cached against a LeafReader.
(Alan Woodward, Robert Muir)
- LUCENE-8038: Payload factors for scoring in PayloadScoreQuery are now
calculated by a PayloadDecoder, instead of delegating to the Similarity.
(Alan Woodward)
- LUCENE-8014: Similarity.computeSlopFactor() and
Similarity.computePayloadFactor() have been deprecated.
(Alan Woodward)
- LUCENE-6278: Scorer.freq() has been removed
(Alan Woodward)
- LUCENE-7736: DoubleValuesSource and LongValuesSource now expose a
rewrite(IndexSearcher) function.
(Alan Woodward)
- LUCENE-7998: DoubleValuesSource.fromQuery() allows you to use the scores
from a Query as a DoubleValuesSource.
(Alan Woodward)
- LUCENE-8049: IndexWriter.getMergingSegments()'s return type was changed from
Collection to Set to more accurately reflect its nature.
(David Smiley)
- LUCENE-8059: TopFieldCollector can now early terminate collection when
the sort order is compatible with the index order. As a consequence,
EarlyTerminatingSortingCollector is now deprecated.
(Adrien Grand)
- New Features (3)
- LUCENE-8061: Add convenience factory methods to create BBoxes and XYZSolids
directly from bounds objects.
- LUCENE-7736: IndexReaderFunctions expose various IndexReader statistics as
DoubleValuesSources.
(Alan Woodward)
- LUCENE-8068: Allow IndexWriter to write a single DWPT to disk. Adds a
flushNextBuffer method to IndexWriter that allows the caller to
synchronously move the next pending or the biggest non-pending index buffer to
disk. This enables flushing a selected buffer to disk without hijacking an
indexing thread. This is useful, for instance, if more than one IW (shards) must
be maintained in a single JVM / system.
(Simon Willnauer)
- Bug Fixes (11)
- LUCENE-8076: Normalize Vincenty distance calculation for planet models that aren't normalized.
(Ignacio Vera)
- LUCENE-8057: Exact circle bounds computation was incorrect.
(Ignacio Vera)
- LUCENE-8056: Exact circle segment bounding suffered from precision errors.
(Karl Wright)
- LUCENE-8054: Fix the exact circle case where relationships fail when the
planet model has c <= ab, because the planes are constructed incorrectly.
(Ignacio Vera)
- LUCENE-7991: KNearestNeighborDocumentClassifier.knnSearch no longer applies
a previous boosted field's factor to subsequent unboosted fields.
(Christine Poerschke)
- LUCENE-7999: Switch from int to long to track the name for the next
segment to write, so that very long lived indices with very frequent
refreshes or commits, and high indexing thread counts, do not
overflow an int
(Mykhailo Demianenko via Mike McCandless)
- LUCENE-8025: Use sumTotalTermFreq=sumDocFreq when scoring DOCS_ONLY fields
that omit term frequency information, as it is equivalent in that case.
Previously bogus numbers were used, and many similarities would
completely degrade.
(Robert Muir, Adrien Grand)
- LUCENE-8045: ParallelLeafReader did not correctly report FieldInfo.dvGen
(Alan Woodward)
- LUCENE-8034: Use subtraction instead of addition to sidestep int
overflow in SpanNotQuery.
(Hari Menon via Mike McCandless)
- LUCENE-8078: The query cache should not cache instances of
MatchNoDocsQuery.
(Jon Harper via Adrien Grand)
- LUCENE-8048: Filesystems do not guarantee order of directories updates
(Nikolay Martynov, Simon Willnauer, Erick Erickson)
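The overflow fix described in LUCENE-8034 above relies on a general technique: rearranging a comparison so that no intermediate sum can exceed the int range. The following is a minimal, self-contained sketch of that idea. The class and method names are hypothetical and this is not the SpanNotQuery code; it assumes non-negative positions.

```java
final class OverflowSafeWindow {

  // Naive check: position + gap can wrap around to a negative int
  // when gap is very large (e.g. Integer.MAX_VALUE).
  static boolean reachesNaive(int position, int gap, int target) {
    return position + gap >= target;
  }

  // Rearranged check: for non-negative position and target,
  // target - position always fits in an int, so nothing overflows.
  static boolean reachesSafe(int position, int gap, int target) {
    return gap >= target - position;
  }

  public static void main(String[] args) {
    int position = 10;
    int target = 20;
    int gap = Integer.MAX_VALUE;
    // The naive form wraps around and wrongly reports that the target is out of reach.
    System.out.println("naive: " + reachesNaive(position, gap, target));
    // The subtraction form gives the intended answer.
    System.out.println("safe:  " + reachesSafe(position, gap, target));
  }
}
```

Both forms agree whenever the sum stays in range; they only differ once the addition would overflow.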
- Optimizations (6)
- LUCENE-8018: Smaller FieldInfos memory footprint by not retaining unnecessary
references to TreeMap entries.
(Julian Vassev via Adrien Grand)
- LUCENE-7994: Use int/int scatter map to gather facet counts when the
number of hits is small relative to the number of unique facet labels
(Dawid Weiss, Robert Muir, Mike McCandless)
- LUCENE-8062: GlobalOrdinalsQuery is no longer eligible for caching.
(Jim Ferenczi)
- LUCENE-8058: Large instances of TermInSetQuery are no longer eligible for
caching as they could break memory accounting of the query cache.
(Adrien Grand)
- LUCENE-8055: MemoryIndex.MemoryDocValuesIterator returns 2 documents
instead of 1.
(Simon Willnauer)
- LUCENE-8043: Fix document accounting in IndexWriter to prevent writing too many
documents. Once this happens, Lucene refuses to open the index and throws a
CorruptIndexException.
(Simon Willnauer, Yonik Seeley, Mike McCandless)
- Tests (1)
- LUCENE-8035: Run tests with JDK-specific options: --illegal-access=deny
on Java 9+.
(Uwe Schindler)
- Build (1)
- LUCENE-6144: Upgrade Ivy to 2.4.0; 'ant ivy-bootstrap' now removes old Ivy
jars in ~/.ant/lib/.
(Shawn Heisey, Steve Rowe)
- Changes in Runtime Behavior (1)
- Resolving of external entities in queryparser/xml/CoreParser is disallowed
by default. See SOLR-11477 for details.
- New Features (18)
- LUCENE-7970: Add a shape to Geo3D that consists of multiple planes that
approximate a true circle, rather than an ellipse, for non-spherical planet models.
(Karl Wright, Ignacio Vera)
- LUCENE-7955: Add support for the concept of "nearest distance" to Geo3D's
GeoPath abstraction, which is the distance along the path to the point that is
closest to the provided point.
(Karl Wright)
- LUCENE-7906: Add spatial relationships between all currently-defined Geo shapes.
(Ignacio Vera)
- LUCENE-7955: Add support for zero-width paths.
(Karl Wright)
- LUCENE-7936: Add serialization and deserialization support to Geo3D.
(Karl Wright, Ignacio Vera)
- LUCENE-7942: Distance computations now have the ability to accurately aggregate
distances, rather than just doing sums.
(Karl Wright)
- LUCENE-7934: Add a planet model interface.
(Karl Wright)
- LUCENE-7918: Revamp the API for composites so that it's generic and can be used
for many kinds of shapes.
(Ignacio Vera)
- LUCENE-7621: Add CoveringQuery, a query whose required number of matching
clauses can be defined per document.
(Adrien Grand)
- LUCENE-7927: Add LongValueFacetCounts, to compute facet counts for individual
numeric values
(Mike McCandless)
- LUCENE-7940: Add BengaliAnalyzer.
(Md. Abdulla-Al-Sun via Robert Muir)
- LUCENE-7392: Add point based LatLonBoundingBox as new RangeField Type.
(Nick Knize)
- LUCENE-7951: Spatial-extras has much better Geo3d support by implementing Spatial4j
abstractions: SpatialContextFactory, ShapeFactory, BinaryCodec, DistanceCalculator.
(Ignacio Vera, David Smiley)
- LUCENE-7973: Update dictionary version for Ukrainian analyzer to 3.9.0
(Andriy Rysin via Dawid Weiss)
- LUCENE-7974: Add FloatPointNearestNeighbor, an N-dimensional FloatPoint
K-nearest-neighbor search implementation.
(Steve Rowe)
- LUCENE-7975: Change the default taxonomy facets cache to a faster
byte[] (UTF-8) based cache.
(Mike McCandless)
- LUCENE-7972: DirectoryTaxonomyReader, in Lucene's facet module, now
implements Accountable, so you can more easily track how much heap
it's using.
(Mike McCandless)
- LUCENE-7982: A new NormsFieldExistsQuery matches documents that have
norms in a specified field
(Colin Goodheart-Smithe via Mike McCandless)
- Optimizations (6)
- LUCENE-7905: Optimize how OrdinalMap (used by
SortedSetDocValuesFacetCounts and others) builds its map
(Robert Muir, Adrien Grand, Mike McCandless)
- LUCENE-7655: Speed up geo-distance queries in case of dense single-valued
fields when most documents match.
(Maciej Zasada via Adrien Grand)
- LUCENE-7897: IndexOrDocValuesQuery now requires the range cost to be more
than 8x greater than the cost of the lead iterator in order to use doc values.
(Murali Krishna P via Adrien Grand)
- LUCENE-7925: Collapse duplicate SHOULD or MUST clauses by summing up their
boosts.
(Adrien Grand)
- LUCENE-7939: MinShouldMatchSumScorer now leverages two-phase iteration in
order to be faster when used in conjunctions.
(Adrien Grand)
- LUCENE-7827: AnalyzingInfixSuggester doesn't create "textgrams"
when minPrefixChar=0
(Mikhail Khludnev)
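As an illustration of the clause collapsing described in LUCENE-7925 above, the sketch below merges duplicate clauses by summing their boosts. It is a simplified model, not the BooleanQuery rewrite code: `BoostedClause` is a hypothetical stand-in type, and clauses are identified by a plain string rather than a Query object.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DuplicateClauseCollapser {

  // Hypothetical stand-in for a boosted clause; not a Lucene class.
  static final class BoostedClause {
    final String query;
    final float boost;

    BoostedClause(String query, float boost) {
      this.query = query;
      this.boost = boost;
    }
  }

  // Collapses clauses with the same query into one entry whose boost is
  // the sum of the individual boosts. Insertion order is preserved.
  static Map<String, Float> collapse(BoostedClause... clauses) {
    Map<String, Float> merged = new LinkedHashMap<>();
    for (BoostedClause clause : clauses) {
      merged.merge(clause.query, clause.boost, Float::sum);
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Float> merged = collapse(
        new BoostedClause("title:lucene", 1.0f),
        new BoostedClause("body:search", 2.0f),
        new BoostedClause("title:lucene", 3.0f));
    // Two entries remain; the duplicate "title:lucene" clauses are merged.
    System.out.println(merged);
  }
}
```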
- Bug Fixes (9)
- LUCENE-8066: It was still possible to construct a concave GeoExactCircle, so use
a sector approach to prevent that.
(Ignacio Vera)
- LUCENE-7967: The GeoDegeneratePoint isWithin() method needed allowance for
numerical precision.
(Karl Wright)
- LUCENE-7965: GeoBBoxFactory was constructing the wrong shape at the poles
if the longitude span was greater than 180 degrees.
(Karl Wright)
- LUCENE-7916: Prevent ArrayIndexOutOfBoundsException if ICUTokenizer is used
with a different ICU JAR version than it is compiled against. Note that this is
not recommended: lucene-analyzers-icu contains binary data structures
specific to the ICU/Unicode versions it is built against.
(Chris Koenig, Robert Muir)
- LUCENE-7891: Lucene's taxonomy facets now use a non-buggy LRU cache
by default.
(Jan-Willem van den Broek via Mike McCandless)
- LUCENE-7959: Improve NativeFSLockFactory's exception message if it cannot create
write.lock for an empty index due to bad permissions/read-only filesystem/etc.
(Erick Erickson, Shawn Heisey, Robert Muir)
- LUCENE-7968: AnalyzingSuggester would sometimes order suggestions incorrectly:
it did not properly break ties on the surface forms when both the weights and
the analyzed forms were equal.
(Robert Muir)
- LUCENE-7957: ConjunctionScorer.getChildren was failing to return all
child scorers
(Adrien Grand, Mike McCandless)
- SOLR-11477: Disallow resolving of external entities in queryparser/xml/CoreParser
by default.
(Michael Stepankin, Olga Barinova, Uwe Schindler, Christine Poerschke)
- Build (3)
- SOLR-11181: Switch order of maven artifact publishing procedure: deploy first
instead of locally installing first, to work around a double repository push of
*-sources.jar and *-javadoc.jar files.
(Lynn Monson via Steve Rowe)
- LUCENE-6673: Maven build fails for target javadoc:jar.
(Ramkumar Aiyengar, Daniel Collins via Steve Rowe)
- LUCENE-7985: Upgrade forbiddenapis to 2.4.1.
(Uwe Schindler)
- Other (5)
- LUCENE-7948, LUCENE-7937: Upgrade randomizedtesting to 2.5.3 (minor fixes
in test filtering for IDEs).
(Mike Sokolov, Dawid Weiss)
- LUCENE-7933: LongBitSet now validates the numBits parameter
(Won Jonghoon, Mike McCandless)
- LUCENE-7978: Add some more documentation about setting up build
environment.
(Anton R. Yuste via Uwe Schindler)
- LUCENE-7983: IndexWriter.IndexReaderWarmer is now a functional interface
instead of an abstract class with a single method
(Dawid Weiss)
- LUCENE-5753: Update TLDs recognized by UAX29URLEmailTokenizer.
(Steve Rowe)
- Bug Fixes (1)
- LUCENE-7957: ConjunctionScorer.getChildren was failing to return all
child scorers
(Adrien Grand, Mike McCandless)
- New Features (8)
- LUCENE-7703: SegmentInfos now record the major Lucene version at index
creation time.
(Adrien Grand)
- LUCENE-7756: LeafReader.getMetaData now exposes the index created version as
well as the oldest Lucene version that contributed to the segment.
(Adrien Grand)
- LUCENE-7854: The new TermFrequencyAttribute used during analysis
with a custom token stream allows indexing custom term frequencies
(Mike McCandless)
- LUCENE-7866: Add a new DelimitedTermFrequencyTokenFilter that allows marking
tokens with a custom term frequency (LUCENE-7854). It parses a numeric
value after a separator char ('|') at the end of each token and changes
the term frequency to this value.
(Uwe Schindler, Robert Muir, Mike McCandless)
- LUCENE-7868: Multiple threads can now resolve deletes and doc values
updates concurrently, giving sizable speedups in update-heavy
indexing use cases
(Simon Willnauer, Mike McCandless)
- LUCENE-7823: Pure query based naive bayes classifier using BM25 scores
(Tommaso Teofili)
- LUCENE-7838: Knn classifier based on fuzzified term queries
(Tommaso Teofili)
- LUCENE-7855: Added advanced options of the Wikipedia tokenizer to its factory.
(Juan Pedro via Adrien Grand)
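The token format described for LUCENE-7866 above (a numeric value after a '|' separator at the end of each token) can be sketched in plain Java as follows. This is an illustrative parser only, not the filter's implementation: the class names are hypothetical, and defaulting to a frequency of 1 when no separator is present is an assumption made for this sketch.

```java
final class DelimitedTermFreqSketch {

  // Hypothetical holder for the parsed result; not a Lucene class.
  static final class TermAndFreq {
    final String term;
    final int freq;

    TermAndFreq(String term, int freq) {
      this.term = term;
      this.freq = freq;
    }
  }

  // Splits "term|freq" at the last delimiter. Tokens without a delimiter
  // keep an assumed default frequency of 1. A non-numeric suffix makes
  // Integer.parseInt throw NumberFormatException.
  static TermAndFreq parse(String token, char delimiter) {
    int idx = token.lastIndexOf(delimiter);
    if (idx < 0) {
      return new TermAndFreq(token, 1);
    }
    int freq = Integer.parseInt(token.substring(idx + 1));
    return new TermAndFreq(token.substring(0, idx), freq);
  }

  public static void main(String[] args) {
    TermAndFreq parsed = parse("lucene|5", '|');
    System.out.println(parsed.term + " -> " + parsed.freq);
  }
}
```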
- API Changes (23)
- LUCENE-2605: Classic QueryParser no longer splits on whitespace by default.
Use setSplitOnWhitespace(true) to get the old behavior.
(Steve Rowe)
- LUCENE-7369: Similarity.coord and BooleanQuery.disableCoord are removed.
(Adrien Grand)
- LUCENE-7368: Removed query normalization.
(Adrien Grand)
- LUCENE-7355: AnalyzingQueryParser has been removed as its functionality has
been folded into the classic QueryParser.
(Adrien Grand)
- LUCENE-7407: Doc values APIs have been switched from random access
to iterators, enabling future codec compression improvements.
(Mike McCandless)
- LUCENE-7475: Norms now support sparsity, so that you only pay for what is
actually used.
(Adrien Grand)
- LUCENE-7494: Points now have a per-field API, like doc values.
(Adrien Grand)
- LUCENE-7410: Cache keys and close listeners have been refactored in order
to be less trappy. See IndexReader.getReaderCacheHelper and
LeafReader.getCoreCacheHelper.
(Adrien Grand)
- LUCENE-6819: Index-time boosts are not supported anymore. As a replacement,
index-time scoring factors should be indexed into a doc value field and
combined at query time using e.g. FunctionScoreQuery.
(Adrien Grand)
- LUCENE-7734: FieldType's copy constructor was widened to accept any IndexableFieldType.
(David Smiley)
- LUCENE-7701: Grouping collectors have been refactored, such that groups are
now defined by a GroupSelector implementation.
(Alan Woodward)
- LUCENE-7741: DoubleValuesSource now has an explain() method
(Alan Woodward, Adrien Grand)
- LUCENE-7815: Removed the PostingsHighlighter; you should use the UnifiedHighlighter
instead, which was derived from it. WholeBreakIterator and
CustomSeparatorBreakIterator were moved to UH's package.
(David Smiley)
- LUCENE-7850: Removed support for legacy numerics.
(Adrien Grand)
- LUCENE-7500: Removed abstract LeafReader.fields(); instead terms(fieldName)
has been made abstract (it was formerly final). Also, MultiFields.getTerms
was optimized to work directly instead of being implemented on getFields.
(David Smiley)
- LUCENE-7872: TopDocs.totalHits is now a long.
(Adrien Grand, hossman)
- LUCENE-7868: IndexWriterConfig.setMaxBufferedDeleteTerms is
removed.
(Simon Willnauer, Mike McCandless)
- LUCENE-7877: PrefixAwareTokenStream is replaced with ConcatenatingTokenStream
(Alan Woodward, Uwe Schindler, Adrien Grand)
- LUCENE-7867: The deprecated Token class is now only available in the test
framework
(Alan Woodward, Adrien Grand)
- LUCENE-7723: DoubleValuesSource enforces implementation of equals() and
hashCode()
(Alan Woodward)
- LUCENE-7737: The spatial-extras module no longer has a dependency on the
queries module. All uses of ValueSource are either replaced with core
DoubleValuesSource extensions, or with the new ShapeValuesSource and
ShapeValuesPredicate classes
(Alan Woodward, David Smiley)
- LUCENE-7892: Doc-values query factory methods have been renamed so that their
name contains "slow" in order to clearly indicate that they would usually be a
bad choice.
(Adrien Grand)
- LUCENE-7899: FieldValueQuery is renamed to DocValuesFieldExistsQuery
(Adrien Grand, Mike McCandless)
- Bug Fixes (7)
- LUCENE-7626: IndexWriter will no longer accept broken token offsets
(Mike McCandless)
- LUCENE-7859: Spatial-extras PackedQuadPrefixTree bug that only revealed itself
with the new pointsOnly optimizations in LUCENE-7845.
(David Smiley)
- LUCENE-7871: Fix false positive match in BlockJoinSelector when children have no value, introducing
wrap methods that accept children as DISI. Also extract ToParentDocValues.
(Mikhail Khludnev)
- LUCENE-7914: Add a maximum recursion level in automaton recursive
functions (Operations.isFinite and Operations.topsortState) to prevent
large automata from overflowing the stack
(Robert Muir, Adrien Grand, Jim Ferenczi)
- LUCENE-7864: IndexMergeTool is not using intermediate hard links (even
if possible).
(Dawid Weiss)
- LUCENE-7956: Fixed potential stack overflow error in ICUNormalizer2CharFilter.
(Adrien Grand)
- LUCENE-7963: Remove useless getAttribute() in DefaultIndexingChain that
causes performance drop, introduced by LUCENE-7626.
(Daniel Mitterdorfer via Uwe Schindler)
- Improvements (4)
- LUCENE-7489: Better storage of sparse doc-values fields with the default
codec.
(Adrien Grand)
- LUCENE-7730: More accurate encoding of the length normalization factor
thanks to the removal of index-time boosts.
(Adrien Grand)
- LUCENE-7901: Original Highlighter now eagerly throws an exception if you
provide components that are null.
(Jason Gerlowski, David Smiley)
- LUCENE-7841: Normalize ґ to г in Ukrainian analyzer.
(Andriy Rysin via Dawid Weiss)
- Optimizations (7)
- LUCENE-7416: BooleanQuery optimizes queries that have sub-queries that occur both
in the sets of SHOULD and FILTER clauses, or both in MUST/FILTER and MUST_NOT
clauses.
(Spyros Kapnissis via Adrien Grand, Uwe Schindler)
- LUCENE-7506: FastTaxonomyFacetCounts should use CPU in proportion to
the size of the intersected set of hits from the query and documents
that have a facet value, so sparse faceting works as expected
(Adrien Grand via Mike McCandless)
- LUCENE-7519: Add optimized APIs to compute browse-only top level
facets
(Mike McCandless)
- LUCENE-7589: Numeric doc values now have the ability to encode blocks of
values using different numbers of bits per value if this proves to save
storage.
(Adrien Grand)
- LUCENE-7845: Enhance spatial-extras RecursivePrefixTreeStrategy queries when the
query is a point (for 2D) or is a simple date interval (e.g. 1 month). When
the strategy is marked as pointsOnly, the result is a TermQuery.
(David Smiley)
- LUCENE-7874: DisjunctionMaxQuery rewrites to a BooleanQuery when tiebreaker is set to 1.
(Jim Ferenczi)
- LUCENE-7828: Speed up range queries on range fields by improving how we
compute the relation between the query and inner nodes of the BKD tree.
(Adrien Grand)
- Other (14)
- LUCENE-7923: Removed FST.Arc.node field (unused).
(Dawid Weiss)
- LUCENE-7328: Remove LegacyNumericEncoding from GeoPointField.
(Nick Knize)
- LUCENE-7360: Remove Explanation.toHtml()
(Alan Woodward)
- LUCENE-7681: MemoryIndex uses new DocValues API
(Alan Woodward)
- LUCENE-7753: Make fields static when possible.
(Daniel Jelinski via Adrien Grand)
- LUCENE-7540: Upgrade ICU to 59.1
(Mike McCandless, Jim Ferenczi)
- LUCENE-7852: Correct copyright year(s) in lucene/LICENSE.txt file.
(Christine Poerschke, Steve Rowe)
- LUCENE-7719: Generalized the UnifiedHighlighter's support for AutomatonQuery
for character & binary automata. Added AutomatonQuery.isBinary.
(David Smiley)
- LUCENE-7873: Due to serious problems with context class loaders in several
frameworks (OSGI, Java 9 Jigsaw), the lookup of Codecs, PostingsFormats,
DocValuesFormats and all analysis factories was changed to only inspect the
current classloader that defined the interface class (lucene-core.jar).
See MIGRATE.txt for more information!
(Uwe Schindler, Dawid Weiss)
- LUCENE-7883: Lucene no longer uses the context class loader when resolving
resources in CustomAnalyzer or ClassPathResourceLoader. Resources are only
resolved against Lucene's class loader by default. Please use another builder
method to change to a custom classloader.
(Uwe Schindler)
- LUCENE-5822: Convert README to Markdown
(Jason Gerlowski via Mike Drob)
- LUCENE-7773: Remove unused/deprecated token types from StandardTokenizer.
(Ahmet Arslan via Steve Rowe)
- LUCENE-7800: Remove code that potentially rethrows checked exceptions
from methods that don't declare them ("sneaky throw" hack).
(Robert Muir, Uwe Schindler, Dawid Weiss)
- LUCENE-7876: Avoid calls to LeafReader.fields() and MultiFields.getFields()
that are trivially replaced by LeafReader.terms() and MultiFields.getTerms()
(David Smiley)
- Build (1)
- LUCENE-6144: Upgrade Ivy to 2.4.0; 'ant ivy-bootstrap' now removes old Ivy
jars in ~/.ant/lib/.
(Shawn Heisey, Steve Rowe)
- Changes in Runtime Behavior (1)
- Resolving of external entities in queryparser/xml/CoreParser is disallowed
by default. See SOLR-11477 for details.
- Bug Fixes (1)
- SOLR-11477: Disallow resolving of external entities in queryparser/xml/CoreParser
by default.
(Michael Stepankin, Olga Barinova, Uwe Schindler, Christine Poerschke)
- Bug Fixes (2)
- LUCENE-7869: Changed MemoryIndex to sort 1d points. In case of 1d points, the PointInSetQuery.MergePointVisitor expects
that these points are visited in ascending order. The memory index didn't do this, which could cause documents
with multiple points that should match to not match.
(Martijn van Groningen)
- LUCENE-7878: Fix query builder to keep the SHOULD clause that wraps multi-word synonyms.
(Jim Ferenczi)
- New Features (1)
- LUCENE-7811: Add a concurrent SortedSet facets implementation.
(Mike McCandless)
- Bug Fixes (14)
- LUCENE-7777: ByteBlockPool.readBytes sometimes throws
ArrayIndexOutOfBoundsException when byte blocks larger than 32 KB
were added
(Mike McCandless)
- LUCENE-7797: The static FSDirectory.listAll(Path) method was always
returning an empty array.
(Atkins Chang via Mike McCandless)
- LUCENE-7481: Fixed missing rewrite methods for SpanPayloadCheckQuery
and PayloadScoreQuery.
(Erik Hatcher)
- LUCENE-7808: Fixed PayloadScoreQuery and SpanPayloadCheckQuery
.equals and .hashCode methods.
(Erik Hatcher)
- LUCENE-7798: Add .equals and .hashCode to ToParentBlockJoinSortField
(Mikhail Khludnev)
- LUCENE-7814: DateRangePrefixTree (in spatial-extras) had edge-case bugs for
years >= 292,000,000.
(David Smiley)
- LUCENE-5365, LUCENE-7818: Fix incorrect condition in queryparser's
QueryNodeOperation#logicalAnd().
(Olivier Binda, Amrit Sarkar, AppChecker via Uwe Schindler)
- LUCENE-7821: The classic and flexible query parsers, as well as Solr's
"lucene"/standard query parser, should require " TO " in range queries,
and accept "TO" as endpoints in range queries.
(hossman, Steve Rowe)
- LUCENE-7824: Fix graph query analysis for multi-word synonym rules with common terms (e.g. new york, new york city).
(Jim Ferenczi)
- LUCENE-7817: Pass cached query to onQueryCache instead of null.
(Christoph Kaser via Adrien Grand)
- LUCENE-7831: CodecUtil should not seek to negative offsets.
(Adrien Grand)
- LUCENE-7833: ToParentBlockJoinQuery computed the min score instead of the max
score with ScoreMode.MAX.
(Adrien Grand)
- LUCENE-7847: Fixed all-docs-match optimization of range queries on range
fields.
(Adrien Grand)
- LUCENE-7810: Fix equals() and hashCode() methods of several join queries.
(Hossman, Adrien Grand, Martijn van Groningen)
- Improvements (5)
- LUCENE-7782: OfflineSorter now passes the total number of items it
will write to getWriter
(Mike McCandless)
- LUCENE-7785: Move dictionary for Ukrainian analyzer to external dependency.
(Andriy Rysin via Steve Rowe, Dawid Weiss)
- LUCENE-7801: SortedSetDocValuesReaderState now implements
Accountable so you can see how much RAM it's using
(Robert Muir, Mike McCandless)
- LUCENE-7792: OfflineSorter can now run concurrently if you pass it
an optional ExecutorService
(Dawid Weiss, Mike McCandless)
- LUCENE-7811: Sorted set facets now use sparse storage when
collecting hits, when appropriate.
(Mike McCandless)
- Optimizations (1)
- LUCENE-7787: spatial-extras HeatmapFacetCounter will now short-circuit its
work when Bits.MatchNoBits is passed.
(David Smiley)
- Other (5)
- LUCENE-7796: Make IOUtils.reThrow idiom declare Error return type so
callers may use it in a way that lets the compiler know subsequent code is
unreachable. reThrow is now deprecated in favor of IOUtils.rethrowAlways
with slightly different semantics (see javadoc).
(Hossman, Robert Muir, Dawid Weiss)
- LUCENE-7754: Inner classes should be static whenever possible.
(Daniel Jelinski via Adrien Grand)
- LUCENE-7751: Avoid boxing primitives only to call compareTo.
(Daniel Jelinski via Adrien Grand)
- LUCENE-7743: Never call new String(String).
(Daniel Jelinski via Adrien Grand)
- LUCENE-7761: Fixed comment in ReqExclScorer.
(Pablo Pita Leira via Adrien Grand)
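The cleanup in LUCENE-7751 above (avoid boxing primitives only to call compareTo) corresponds to a standard Java idiom. The sketch below contrasts the two forms using only JDK methods; the class and method names are hypothetical and are not taken from the Lucene code base.

```java
final class PrimitiveCompareSketch {

  // Boxes both operands into Long objects just to compare them.
  static int compareBoxed(long a, long b) {
    return Long.valueOf(a).compareTo(b);
  }

  // Compares the primitives directly; no boxing is involved.
  static int compareDirect(long a, long b) {
    return Long.compare(a, b);
  }

  public static void main(String[] args) {
    long a = 42L;
    long b = 1_000_000L;
    // Both forms return a value with the same sign.
    System.out.println(Integer.signum(compareBoxed(a, b)));
    System.out.println(Integer.signum(compareDirect(a, b)));
  }
}
```

The same static helpers exist for the other primitive wrappers, for example `Integer.compare`, `Double.compare` and `Float.compare`.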
- Bug Fixes (3)
- LUCENE-7755: Fixed join queries to not reference IndexReaders, as it could
cause leaks if they are cached.
(Adrien Grand)
- LUCENE-7749: Made LRUQueryCache delegate the scoreSupplier method.
(Martin Amirault via Adrien Grand)
- LUCENE-7769: The UnifiedHighlighter wasn't highlighting portions of the query
wrapped in BoostQuery or SpanBoostQuery.
(David Smiley, Dmitry Malinin)
- Other (1)
- LUCENE-7763: Remove outdated comment in IndexWriterConfig.setIndexSort javadocs.
(马可阳 via Christine Poerschke)
- API Changes (12)
- LUCENE-7740: Refactor Range Fields to remove Field suffix (e.g., DoubleRange),
move InetAddressRange and InetAddressPoint from sandbox to misc module, and
refactor all other range fields from sandbox to core.
(Nick Knize)
- LUCENE-7624: TermsQuery has been renamed as TermInSetQuery and moved to core.
(Alan Woodward)
- LUCENE-7637: TermInSetQuery requires that all terms come from the same field.
(Adrien Grand)
- LUCENE-7644: FieldComparatorSource.newComparator() and
SortField.getComparator() no longer throw IOException
(Alan Woodward)
- LUCENE-7643: Replaced doc-values queries in lucene/sandbox with factory
methods on the *DocValuesField classes.
(Adrien Grand)
- LUCENE-7659: Added an IndexWriter#getFieldNames() method (experimental) to return
all field names as visible from the IndexWriter. This would be useful for
IndexWriter#updateDocValues() calls, to prevent calling with non-existent
docValues fields
(Ishan Chattopadhyaya, Adrien Grand, Mike McCandless)
- LUCENE-6959: Removed ToParentBlockJoinCollector in favour of
ParentChildrenBlockJoinQuery, which can return the matching children documents per
parent document. This query should be executed for each matching parent document
after the main query has been executed.
(Adrien Grand, Martijn van Groningen, Mike McCandless)
- LUCENE-7628: Scorer.getChildren() now only returns Scorers that are
positioned on the current document, and can throw an IOException.
AssertingScorer checks that getChildren() is not called on an unpositioned
Scorer.
(Alan Woodward, Adrien Grand)
- LUCENE-7702: Removed GraphQuery in favour of simple boolean query.
(Matt Webber via Jim Ferenczi)
- LUCENE-7707: TopDocs.merge now takes a boolean option telling it
when to use the incoming shard index versus when to assign the shard
index itself, allowing users to merge shard responses incrementally
instead of once all shard responses are present.
(Simon Willnauer, Mike McCandless)
- LUCENE-7700: A cleanup of merge throughput control logic. Refactored all the
code previously scattered throughout the IndexWriter and
ConcurrentMergeScheduler into a more accessible set of public methods (see
MergePolicy.OneMergeProgress, MergeScheduler.wrapForMerge and
OneMerge.mergeInit).
(Dawid Weiss, Mike McCandless).
- LUCENE-7734: FieldType's copy constructor was widened to accept any IndexableFieldType.
(David Smiley)
- New Features (10)
- LUCENE-7738: Add new InetAddressRange for indexing and querying InetAddress
ranges.
(Nick Knize)
- LUCENE-7449: Add CROSSES relation support to RangeFieldQuery.
(Nick Knize)
- LUCENE-7623: Add FunctionScoreQuery and FunctionMatchQuery
(Alan Woodward, Adrien Grand, David Smiley)
- LUCENE-7619: Add WordDelimiterGraphFilter, just like
WordDelimiterFilter except it produces correct token graphs so that
proximity queries at search time will produce correct results
(Mike McCandless)
- LUCENE-7656: Added the LatLonDocValuesField.new(Box/Distance)Query() factory
methods that are the equivalent of factory methods on LatLonPoint but operate
on doc values. These new methods should be wrapped in an IndexOrDocValuesQuery
for best performance.
(Adrien Grand)
- LUCENE-7673: Added MultiValued[Int/Long/Float/Double]FieldSource that given a
SortedNumericSelector.Type can give a ValueSource view of a
SortedNumericDocValues field.
(Tomás Fernández Löbbe)
- LUCENE-7465: Add SimplePatternTokenizer and
SimplePatternSplitTokenizer, using Lucene's regexp/automaton
implementation for analysis/tokenization
(Clinton Gormley, Mike McCandless)
- LUCENE-7688: Add OneMergeWrappingMergePolicy class.
(Keith Laban, Christine Poerschke)
- LUCENE-7686: The near-real-time document suggester can now
efficiently filter out duplicate suggestions
(Uwe Schindler, Mike McCandless)
- LUCENE-7712: SimpleQueryParser now supports default fuzziness
syntax, mapping foo~ to a FuzzyQuery with edit distance 2.
(Lee Hinman, David Pilato via Mike McCandless)
- Bug Fixes (6)
- LUCENE-7630: Fix (Edge)NGramTokenFilter to no longer drop payloads
and preserve all attributes.
(Nathan Gass via Uwe Schindler)
- LUCENE-7679: MemoryIndex was ignoring omitNorms settings on passed-in
IndexableFields.
(Alan Woodward)
- LUCENE-7692: PatternReplaceCharFilterFactory now implements MultiTermAware.
(Adrien Grand)
- LUCENE-7685: ToParentBlockJoinQuery and ToChildBlockJoinQuery now use the
rewritten child query in their equals and hashCode implementations.
(Adrien Grand)
- LUCENE-7698: CommonGramsQueryFilter was producing a disconnected
token graph, messing up phrase queries when it was used during query
parsing
(Ere Maijala via Mike McCandless)
- LUCENE-7708: ShingleFilter without unigram was producing a disconnected
token graph, messing up queries when it was used during query
parsing
(Jim Ferenczi)
- Improvements (8)
- LUCENE-7055: Added Weight#scorerSupplier, which allows estimating the cost
of a Scorer before actually building it, in order to optimize how the query
should be run, e.g. using points or doc values depending on costs of other
parts of the query.
(Adrien Grand)
- LUCENE-7643: IndexOrDocValuesQuery allows executing range queries using
either points or doc values depending on which one is more efficient.
(Adrien Grand)
- LUCENE-7662: If index files are missing, throw CorruptIndexException instead
of the less descriptive FileNotFoundException or NoSuchFileException
(Mike Drob via Mike McCandless, Erick Erickson)
- LUCENE-7680: UsageTrackingQueryCachingPolicy never caches term filters anymore
since they are plenty fast. This also has the side-effect of leaving more
space in the history for costly filters.
(Adrien Grand)
- LUCENE-7677: UsageTrackingQueryCachingPolicy now caches compound queries a bit
earlier than regular queries in order to improve cache efficiency.
(Adrien Grand)
- LUCENE-7710: BlockPackedReader throws CorruptIndexException and includes
IndexInput description instead of plain IOException
(Mike Drob via Mike McCandless)
- LUCENE-7695: ComplexPhraseQueryParser to support query time synonyms
(Markus Jelsma via Mikhail Khludnev)
- LUCENE-7747: QueryBuilder now iterates lazily over the possible paths when building a graph query
(Jim Ferenczi)
- Optimizations (10)
- LUCENE-7641: Optimized point range queries to compute documents that do not
match the range on single-valued fields when more than half the documents in
the index would match.
(Adrien Grand)
- LUCENE-7656: Speed up LatLonPointDistanceQuery by computing distances even
less often.
(Adrien Grand)
- LUCENE-7661: Speed up LatLonPointInPolygonQuery by pre-computing the
relation of the polygon with a grid.
(Adrien Grand)
- LUCENE-7660: Speed up LatLonPointDistanceQuery by improving the detection of
whether BKD cells are entirely within the distance close to the dateline.
(Adrien Grand)
- LUCENE-7654: ToParentBlockJoinQuery now implements two-phase iteration and
computes scores lazily in order to be faster when used in conjunctions.
(Adrien Grand)
- LUCENE-7667: BKDReader now calls `IntersectVisitor.grow()` on larger
increments.
(Adrien Grand)
- LUCENE-7638: Query parsers now analyze the token graph for articulation
points (or cut vertices) in order to create more efficient queries for
multi-token synonyms.
(Jim Ferenczi)
- LUCENE-7699: Query parsers now use span queries to produce more efficient
phrase queries for multi-token synonyms.
(Matt Weber via Jim Ferenczi)
- LUCENE-7742: Fix places where we were unboxing and then re-boxing
according to FindBugs
(Daniel Jelinski via Mike McCandless)
- LUCENE-7739: Fix places where we unnecessarily boxed while parsing
a numeric value according to FindBugs
(Daniel Jelinski via Mike McCandless)
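
The articulation-point analysis mentioned in LUCENE-7638 can be illustrated outside Lucene. The following is a minimal sketch, not Lucene's implementation: it assumes the token graph is given as numbered states joined by undirected edges, and the class and method names are hypothetical. It finds cut vertices with a standard depth-first search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not part of Lucene.
final class ArticulationPointsSketch {

  /** Returns the states whose removal disconnects the graph reachable from state 0. */
  static Set<Integer> articulationPoints(int numStates, int[][] edges) {
    List<List<Integer>> adj = new ArrayList<>();
    for (int i = 0; i < numStates; i++) {
      adj.add(new ArrayList<>());
    }
    for (int[] e : edges) {
      adj.get(e[0]).add(e[1]);
      adj.get(e[1]).add(e[0]);
    }
    int[] disc = new int[numStates]; // discovery time of each state, -1 if unvisited
    int[] low = new int[numStates];  // lowest discovery time reachable from the subtree
    Arrays.fill(disc, -1);
    Set<Integer> result = new TreeSet<>();
    if (numStates > 0) {
      dfs(0, -1, adj, disc, low, new int[] {0}, result);
    }
    return result;
  }

  private static void dfs(int u, int parent, List<List<Integer>> adj,
                          int[] disc, int[] low, int[] timer, Set<Integer> result) {
    disc[u] = low[u] = timer[0]++;
    int children = 0;
    for (int v : adj.get(u)) {
      if (disc[v] == -1) {
        children++;
        dfs(v, u, adj, disc, low, timer, result);
        low[u] = Math.min(low[u], low[v]);
        // A non-root state is a cut vertex if some child subtree cannot reach above it.
        if (parent != -1 && low[v] >= disc[u]) {
          result.add(u);
        }
      } else if (v != parent) {
        low[u] = Math.min(low[u], disc[v]);
      }
    }
    // The root is a cut vertex only if it has more than one DFS child.
    if (parent == -1 && children > 1) {
      result.add(u);
    }
  }
}
```

For example, with the paths "wifi" (0 to 2) and "wi fi" (0 to 1 to 2) followed by "network" (2 to 3), state 2 is the only articulation point, so the query can be split there instead of enumerating every full path.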
- Build (7)
- LUCENE-7653: Update randomizedtesting to version 2.5.0.
(Dawid Weiss)
- LUCENE-7665: Remove grouping dependency from the join module.
(Martijn van Groningen)
- SOLR-10023: Add non-recursive 'test-nocompile' target: Only runs unit tests.
Jars are not downloaded; compilation is not updated; and Clover is not enabled.
(Steve Rowe)
- LUCENE-7694: Update forbiddenapis to version 2.3.
(Uwe Schindler)
- LUCENE-7693: Replace "org.apache." logic in GetMavenDependenciesTask.
(Daniel Collins, Christine Poerschke)
- LUCENE-7726: Fix HTML entity bugs in Javadocs to be able to build with
Java 9.
(Uwe Schindler, Hossman)
- LUCENE-7727: Replace end-of-life Markdown parser "Pegdown" by "Flexmark"
for compatibility with Java 9.
(Uwe Schindler)
- Other (3)
- LUCENE-7666: Fix typos in lucene-join package info javadoc.
(Tom Saleeba via Christine Poerschke)
- LUCENE-7658: queryparser/xml CoreParser now implements SpanQueryBuilder interface.
(Daniel Collins, Christine Poerschke)
- LUCENE-7715: NearSpansUnordered simplifications.
(Paul Elschot via Adrien Grand)
- Bug Fixes (2)
- LUCENE-7676: Fixed FilterCodecReader to override more super-class methods.
Also added TestFilterCodecReader class.
(Christine Poerschke)
- LUCENE-7717: The UnifiedHighlighter and PostingsHighlighter were not highlighting
prefix queries with multi-byte characters. TermRangeQuery is affected too.
(Dmitry Malinin, David Smiley)
- Build (1)
- LUCENE-7651: Fix Javadocs build for Java 8u121 by injecting "Google Code
Prettify" without adding Javascript to Javadocs's -bottom parameter.
Also update Prettify to latest version to fix Google Chrome issue.
(Uwe Schindler)
- Bug Fixes (3)
- LUCENE-7657: Fixed potential memory leak in the case that a (Span)TermQuery
with a TermContext is cached.
(Adrien Grand)
- LUCENE-7647: Made stored fields reclaim native memory more aggressively when
configured with BEST_COMPRESSION. This could otherwise result in out-of-memory
issues.
(Adrien Grand)
- LUCENE-7670: AnalyzingInfixSuggester should not immediately open an
IndexWriter over an already-built index.
(Steve Rowe)
- API Changes (6)
- LUCENE-7533: Classic query parser no longer allows autoGeneratePhraseQueries
to be set to true when splitOnWhitespace is false (and vice-versa).
- LUCENE-7607: LeafFieldComparator.setScorer and SimpleFieldComparator.setScorer
are declared as throwing IOException
(Alan Woodward)
- LUCENE-7617: Collector construction for two-pass grouping queries is
abstracted into a new Grouper class, which can be passed as a constructor
parameter to GroupingSearch. The abstract base classes for the different
grouping Collectors are renamed to remove the Abstract* prefix.
(Alan Woodward, Martijn van Groningen)
- LUCENE-7609: The expressions module now uses the DoubleValuesSource API, and
no longer depends on the queries module. Expression#getValueSource() is
replaced with Expression#getDoubleValuesSource().
(Alan Woodward, Adrien Grand)
- LUCENE-7610: The facets module now uses the DoubleValuesSource API, and
methods that take ValueSource parameters are deprecated
(Alan Woodward)
- LUCENE-7611: DocumentValueSourceDictionary now takes a LongValuesSource
as a parameter, and the ValueSource equivalent is deprecated
(Alan Woodward)
- New Features (9)
- LUCENE-5867: Added BooleanSimilarity.
(Robert Muir, Adrien Grand)
- LUCENE-7466: Added AxiomaticSimilarity.
(Peilin Yang via Tommaso Teofili)
- LUCENE-7590: Added DocValuesStatsCollector to compute statistics on DocValues
fields.
(Shai Erera)
- LUCENE-7587: The new FacetQuery and MultiFacetQuery helper classes
make it simpler to execute drill down when drill sideways counts are
not needed
(Emmanuel Keller via Mike McCandless)
- LUCENE-6664: A new SynonymGraphFilter outputs a correct graph
structure for multi-token synonyms, separating out a
FlattenGraphFilter that is hardwired into the current
SynonymFilter. This finally makes it possible to implement
correct multi-token synonyms at search time. See
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
for details.
(Mike McCandless)
- LUCENE-5325: Added LongValuesSource and DoubleValuesSource, intended as
type-safe replacements for ValueSource in the queries module. These
expose per-segment LongValues or DoubleValues iterators.
(Alan Woodward, Adrien Grand)
- LUCENE-7603: Graph token streams are now handled accurately by query
parsers, by enumerating all paths and creating the corresponding
query/ies as sub-clauses
(Matt Weber via Mike McCandless)
- LUCENE-7588: DrillSideways can now run queries concurrently, and
supports an IndexSearcher using an executor service to run each query
concurrently across all segments in the index
(Emmanuel Keller via Mike McCandless)
- LUCENE-7627: Added .intersect methods to SortedDocValues and
SortedSetDocValues to allow filtering their TermsEnums with a
CompiledAutomaton
(Alan Woodward, Mike McCandless)
- Bug Fixes (11)
- LUCENE-7547: JapaneseTokenizerFactory was failing to close the
dictionary file it opened
(Markus via Mike McCandless)
- LUCENE-7562: CompletionFieldsConsumer sometimes throws
NullPointerException on ghost fields
(Oliver Eilhard via Mike McCandless)
- LUCENE-7533: Classic query parser: disallow autoGeneratePhraseQueries=true
when splitOnWhitespace=false (and vice-versa).
(Steve Rowe)
- LUCENE-7536: ASCIIFoldingFilterFactory used to return an illegal multi-term
component when preserveOriginal was set to true.
(Adrien Grand)
- LUCENE-7576: Fix Terms.intersect in the default codec to detect when
the incoming automaton is a special case and throw a clearer
exception than NullPointerException
(Tom Mortimer via Mike McCandless)
- LUCENE-6989: Fix exception handling in MMapDirectory's unmap hack
support code to work with Java 9's new InaccessibleObjectException,
which does not extend ReflectiveOperationException.
(Uwe Schindler)
- LUCENE-7581: Lucene now prevents updating a doc values field that is used
in the index sort, since this would lead to corruption.
(Jim Ferenczi via Mike McCandless)
- LUCENE-7570: IndexWriter may deadlock if a commit is running while
there are too many merges running and one of the merges hits a
tragic exception
(Joey Echeverria via Mike McCandless)
- LUCENE-7594: Fixed point range queries on floating-point types to recommend
using helpers for exclusive bounds that are consistent with Double.compare.
(Adrien Grand, Dawid Weiss)
- LUCENE-7606: Normalization with CustomAnalyzer would only apply the last
token filter.
(Adrien Grand)
- LUCENE-7612: Removed an unused dependency from the suggester to the misc
module.
(Alan Woodward)
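
The exclusive-bound helpers referred to in LUCENE-7594 have to agree with Double.compare, which orders -0.0 before +0.0. The following is a minimal sketch of that idea using only the JDK; the class and method names are hypothetical and this is not presented as Lucene's code.

```java
// Hypothetical illustration; not part of Lucene.
final class ExclusiveBoundsSketch {

  /** Smallest double greater than d in Double.compare order. */
  static double nextUp(double d) {
    // Math.nextUp(-0.0) returns Double.MIN_VALUE and so skips +0.0,
    // even though Double.compare(-0.0, +0.0) < 0.
    if (Double.doubleToLongBits(d) == 0x8000_0000_0000_0000L) {
      return +0d;
    }
    return Math.nextUp(d);
  }

  /** Largest double smaller than d in Double.compare order. */
  static double nextDown(double d) {
    // Math.nextDown(+0.0) returns -Double.MIN_VALUE and so skips -0.0.
    if (Double.doubleToLongBits(d) == 0L) {
      return -0d;
    }
    return Math.nextDown(d);
  }
}
```

An exclusive lower bound `x` can then be expressed as the inclusive bound `nextUp(x)`, and an exclusive upper bound `y` as the inclusive bound `nextDown(y)`.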
- Improvements (16)
- LUCENE-7532: Add back lost codec file format documentation
(Shinichiro Abe via Mike McCandless)
- LUCENE-6824: TermAutomatonQuery now rewrites to TermQuery,
PhraseQuery or MultiPhraseQuery when the word automaton is simple
(Mike McCandless)
- LUCENE-7431: Allow a certain amount of overlap to be specified between the include
and exclude arguments of SpanNotQuery via negative pre and/or post arguments.
(Marc Morissette via David Smiley)
- LUCENE-7544: UnifiedHighlighter: add extension points for handling custom queries.
(Michael Braun, David Smiley)
- LUCENE-7538: Asking IndexWriter to store a too-massive text field
now throws IllegalArgumentException instead of a cryptic exception
that closes your IndexWriter
(Steve Chen via Mike McCandless)
- LUCENE-7524: Added more detailed explanation of how IDF is computed in
ClassicSimilarity and BM25Similarity.
(Adrien Grand)
- LUCENE-7564: AnalyzingInfixSuggester should close its IndexWriter by default
at the end of build().
(Steve Rowe)
- LUCENE-7526: Enhanced UnifiedHighlighter's passage relevancy for queries with
wildcards and sometimes just terms. Added shouldPreferPassageRelevancyOverSpeed()
which can be overridden to return false to eke out more speed in some cases.
(Timothy M. Rodriguez, David Smiley)
- LUCENE-7560: QueryBuilder.createFieldQuery is no longer final,
giving custom query parsers subclassing QueryBuilder more freedom to
control how text is analyzed and converted into a query
(Matt Weber via Mike McCandless)
- LUCENE-7537: Index time sorting now supports multi-valued sorts
using selectors (MIN, MAX, etc.)
(Jim Ferenczi via Mike McCandless)
- LUCENE-7575: UnifiedHighlighter can now highlight fields with queries that don't
necessarily refer to that field (AKA requireFieldMatch==false). Disabled by default.
See UnifiedHighlighter get/setFieldMatcher.
(Jim Ferenczi via David Smiley)
- LUCENE-7592: If the segments file is truncated, we now throw
CorruptIndexException instead of the more confusing EOFException
(Mike Drob via Mike McCandless)
- LUCENE-6989: Make MMapDirectory's unmap hack work with Java 9 EA (b150+):
Unmapping uses new sun.misc.Unsafe#invokeCleaner(ByteBuffer).
Java 9 now needs the same permissions as Java 8;
RuntimePermission("accessClassInPackage.jdk.internal.ref")
is no longer needed. Support for older Java 9 builds was removed.
(Uwe Schindler)
- LUCENE-7401: Changed the way BKD trees pick the split dimension in order to
ensure all dimensions are indexed.
(Adrien Grand)
- LUCENE-7614: Complex Phrase Query parser ignores double quotes around single token
prefix, wildcard, range queries
(Mikhail Khludnev)
- LUCENE-7620: Added LengthGoalBreakIterator, a wrapper around another BreakIterator to skip breaks
that would create Passages that are too short. Only for use with the UnifiedHighlighter
(and probably PostingsHighlighter).
(David Smiley)
- Optimizations (4)
- LUCENE-7568: Optimize merging when index sorting is used but the
index is already sorted
(Jim Ferenczi via Mike McCandless)
- LUCENE-7563: The BKD in-memory index for dimensional points now uses
a compressed format, using substantially less RAM in some cases
(Adrien Grand, Mike McCandless)
- LUCENE-7583: BKD writing now buffers each leaf block in heap before
writing to disk, giving a small speedup in points-heavy use cases.
(Mike McCandless)
- LUCENE-7572: Doc values queries now cache their hash code.
(Adrien Grand)
- Other (5)
- LUCENE-7546: Fixed references to benchmark wikipedia data and the Jenkins line-docs file
(David Smiley)
- LUCENE-7534: Fix smokeTestRelease.py to run on Cygwin
(Mikhail Khludnev)
- LUCENE-7559: UnifiedHighlighter: Make Passage and OffsetsEnum more exposed to allow
passage creation to be customized.
(David Smiley)
- LUCENE-7599: Simplify TestRandomChains using Java's built-in Predicate and
Function interfaces.
(Ahmet Arslan via Adrien Grand)
- LUCENE-7595: Improve RAMUsageTester in test-framework to estimate memory usage of
runtime classes and work with Java 9 EA (b148+). Disable static field heap usage
checker in LuceneTestCase.
(Uwe Schindler, Dawid Weiss)
- Build (3)
- LUCENE-7387: Fix defaultCodec in build.xml to account for the line ending
(hossman)
- LUCENE-7543: Make changes-to-html target an offline operation, by moving the
Lucene and Solr DOAP RDF files into the Git source repository under
dev-tools/doap/ and then pulling release dates from those files, rather than
from JIRA.
(Mano Kovacs, hossman, Steve Rowe)
- LUCENE-7596: Update Groovy to version 2.4.8 to allow building with Java 9
build 148+. Also update JGit version for working-copy checks.
(Uwe Schindler)
- API Changes (none)
- New Features (2)
- LUCENE-7438: New "UnifiedHighlighter" derivative of the PostingsHighlighter that
can consume offsets from postings, term vectors, or analysis. It can highlight phrases
as accurately as the standard Highlighter. Light term vectors can be used with offsets
in postings for fast wildcard (MultiTermQuery) highlighting.
(David Smiley, Timothy Rodriguez)
- LUCENE-7490: SimpleQueryParser now parses '*' to MatchAllDocsQuery
(Lee Hinman via Mike McCandless)
- Bug Fixes (13)
- LUCENE-7507: Upgrade morfologik-stemming to version 2.1.1 (fixes security
manager issue with Polish dictionary lookup).
(Dawid Weiss)
- LUCENE-7472: MultiFieldQueryParser.getFieldQuery() drops queries that are
neither BooleanQuery nor TermQuery.
(Steve Rowe)
- LUCENE-7456: PerFieldPostings/DocValues was failing to delegate the
merge method
(Julien MASSENET via Mike McCandless)
- LUCENE-7468: ASCIIFoldingFilter should not emit duplicated tokens when
preserve original is on.
(David Causse via Adrien Grand)
- LUCENE-7484: FastVectorHighlighter failed to highlight SynonymQuery
(Jim Ferenczi via Mike McCandless)
- LUCENE-7476: JapaneseNumberFilter should not invoke incrementToken
on its input after it's exhausted
(Andy Hind via Mike McCandless)
- LUCENE-7486: DisjunctionMaxQuery does not work correctly with queries that
return negative scores.
(Ivan Provalov, Uwe Schindler, Adrien Grand)
- LUCENE-7491: Suddenly turning on dimensional points for some fields
that already exist in an index but didn't previously index
dimensional points could cause unexpected merge exceptions
(Hans Lund, Mike McCandless)
- LUCENE-6914: Fixed DecimalDigitFilter in case of supplementary code points.
(Hossman)
- LUCENE-7493: FacetsCollector.search threw an unexpected exception if
you asked for zero hits but wanted facets
(Mahesh via Mike McCandless)
- LUCENE-7505: AnalyzingInfixSuggester returned invalid results when
allTermsRequired is false and context filters are specified
(Mike McCandless)
- LUCENE-7429: AnalyzerWrapper can now modify the normalization chain too and
DelegatingAnalyzerWrapper does the right thing automatically.
(Adrien Grand)
- LUCENE-7135: Lucene's check for 32 or 64 bit JVM now works around security
manager blocking access to some properties
(Aaron Madlon-Kay via Mike McCandless)
- Improvements (3)
- LUCENE-7439: FuzzyQuery now matches all terms within the specified
edit distance, even if they are short terms
(Mike McCandless)
- LUCENE-7496: Better toString for SweetSpotSimilarity
(janhoy)
- LUCENE-7520: Highlighter's WeightedSpanTermExtractor shouldn't attempt to expand a MultiTermQuery
when its field doesn't match the field the extraction is scoped to.
(Cao Manh Dat via David Smiley)
- Optimizations (1)
- LUCENE-7501: BKDReader should not store the split dimension explicitly in the
1D case.
(Adrien Grand)
- Other (3)
- LUCENE-7513: Upgrade randomizedtesting to 2.4.0.
(Dawid Weiss)
- LUCENE-7452: Block join query exception suggests how to find a doc, which
violates orthogonality requirement.
(Mikhail Khludnev)
- LUCENE-7438: Renovate the Benchmark module's support for benchmarking highlighting. All
highlighters are supported via SearchTravRetHighlight.
(David Smiley)
- Build (1)
- LUCENE-7292: Fix build to use "--release 8" instead of "-release 8" on
Java 9 (this changed with recent EA build b135).
(Uwe Schindler)
- API Changes (1)
- LUCENE-7436: MinHashFilter's constructor, and some of its default
settings, should be public.
(Doug Turnbull via Mike McCandless)
- Bug Fixes (4)
- LUCENE-7417: The standard Highlighter could throw an IllegalArgumentException when
trying to highlight a query containing a degenerate case of a MultiPhraseQuery with one
term.
(Thomas Kappler via David Smiley)
- LUCENE-7440: Document id skipping (PostingsEnum.advance) could throw an
ArrayIndexOutOfBoundsException on large index segments (>1.8B docs)
with large skips.
(yonik)
- LUCENE-7442: MinHashFilter's ctor should validate its args.
(Cao Manh Dat via Steve Rowe)
- LUCENE-7318: Fix backwards compatibility issues around StandardAnalyzer
and its components, introduced with Lucene 6.2.0. The moved classes
were restored in their original packages: LowerCaseFilter and StopFilter,
as well as several utility classes.
(Uwe Schindler, Mike McCandless)
- API Changes (1)
- ScoringWrapperSpans was removed since it had no purpose or effect as of Lucene 5.5.
- New Features (11)
- LUCENE-7388: Add point based IntRangeField, FloatRangeField, LongRangeField along with
supporting queries and tests
(Nick Knize)
- LUCENE-7381: Add point based DoubleRangeField and RangeFieldQuery for
indexing and querying on Ranges up to 4 dimensions
(Nick Knize)
- LUCENE-6968: LSH Filter
(Tommaso Teofili, Andy Hind, Cao Manh Dat)
- LUCENE-7302: IndexWriter methods that change the index now return a
long "sequence number" indicating the effective equivalent
single-threaded execution order
(Mike McCandless)
- LUCENE-7335: IndexWriter's commit data is now late binding,
recording key/values from a provided iterable based on when the
commit actually takes place
(Mike McCandless)
- LUCENE-7287: UkrainianMorfologikAnalyzer is a new dictionary-based
analyzer for the Ukrainian language
(Andriy Rysin via Mike McCandless)
- LUCENE-7373: Directory.renameFile, which did both renaming and fsync
of the directory metadata, has been deprecated; use the new separate
methods Directory.rename and Directory.syncMetaData instead
(Robert Muir, Uwe Schindler, Mike McCandless)
- LUCENE-7355: Added Analyzer#normalize(), which only applies normalization to
an input string.
(Adrien Grand)
- LUCENE-7380: Add Polygon.fromGeoJSON for more easily creating
Polygon instances from a standard GeoJSON string
(Robert Muir, Mike McCandless)
- LUCENE-7395: PerFieldSimilarityWrapper requires a default similarity
for calculating query norm and coordination factor in Lucene 6.x.
Lucene 7 will no longer have those factors.
(Uwe Schindler, Sascha Markus)
- SOLR-9279: Queries module: new ComparisonBoolFunction base class
(Doug Turnbull via David Smiley)
- Bug Fixes (8)
- LUCENE-6662: Fixed potential resource leaks.
(Rishabh Patel via Adrien Grand)
- LUCENE-7340: MemoryIndex.toString() could throw NPE; fixed. Renamed to toStringDebug().
(Daniel Collins, David Smiley)
- LUCENE-7382: Fix bug introduced by LUCENE-7355 that used the
wrong default AttributeFactory for new Tokenizers.
(Terry Smith, Uwe Schindler)
- LUCENE-7389: Fix FieldType.setDimensions(...) validation for the dimensionNumBytes
parameter.
(Martijn van Groningen)
- LUCENE-7391: Fix performance regression in MemoryIndex's fields() introduced
in Lucene 6.
(Steve Mason via David Smiley)
- LUCENE-7395, SOLR-9315: Fix PerFieldSimilarityWrapper to also delegate query
norm and coordination factor using a default similarity added as ctor param.
(Uwe Schindler, Sascha Markus)
- SOLR-9413: Fix analysis/kuromoji's CSVUtil.quoteEscape logic, add TestCSVUtil test.
(AppChecker, Christine Poerschke)
- LUCENE-7419: Fix performance bug with TokenStream.end(), where it would lookup
PositionIncrementAttribute every time.
(Mike McCandless, Robert Muir)
- Improvements (16)
- LUCENE-7323: Compound file writing now verifies the incoming
sub-files' checksums and segment IDs, to catch hardware issues or
filesystem bugs earlier
(Robert Muir, Mike McCandless)
- LUCENE-6766: Index time sorting has graduated from the misc module
to core, is much simpler to use, via
IndexWriter.setIndexSort, and now works with dimensional points.
(Adrien Grand, Mike McCandless)
- LUCENE-5931: Detect when an application tries to reopen an
IndexReader after (illegally) removing the old index and
reindexing
(Vitaly Funstein, Robert Muir, Mike McCandless)
- LUCENE-6171: Lucene now passes the StandardOpenOption.CREATE_NEW
option when writing new files so the filesystem enforces our
write-once architecture, possibly catching externally caused
issues sooner
(Robert Muir, Mike McCandless)
- LUCENE-7318: StandardAnalyzer has been moved from the analysis
module into core and is now the default analyzer in
IndexWriterConfig
(Robert Muir, Mike McCandless)
- LUCENE-7345: RAMDirectory now enforces write-once files as well
(Robert Muir, Mike McCandless)
- LUCENE-7337: MatchNoDocsQuery now scores with 0 normalization factor
and empty boolean queries now rewrite to MatchNoDocsQuery instead of
vice versa
(Jim Ferenczi via Mike McCandless)
- LUCENE-7359: Add equals() and hashCode() to Explanation
(Alan Woodward)
- LUCENE-7353: ScandinavianFoldingFilterFactory and
ScandinavianNormalizationFilterFactory now implement MultiTermAwareComponent.
(Adrien Grand)
- LUCENE-2605: Add classic QueryParser option setSplitOnWhitespace() to
control whether to split on whitespace prior to text analysis. Default
behavior remains unchanged: split-on-whitespace=true.
(Steve Rowe)
- LUCENE-7276: MatchNoDocsQuery now includes an optional reason for
why it was used
(Jim Ferenczi via Mike McCandless)
- LUCENE-7355: AnalyzingQueryParser now only applies the subset of the analysis
chain that is about normalization for range/fuzzy/wildcard queries.
(Adrien Grand)
- LUCENE-7376: Add support for ToParentBlockJoinQuery to fast vector highlighter's
FieldQuery.
(Martijn van Groningen)
- LUCENE-7385: Improve/fix assert messages in SpanScorer.
(David Smiley)
- LUCENE-7393: Add ICUTokenizer option to parse Myanmar text as syllables instead of words,
because the ICU word-breaking algorithm has some issues. This allows for the previous
tokenization used before Lucene 5.
(AM, Robert Muir)
- LUCENE-7409: Changed MMapDirectory's unmapping to be safer, but still with
no guarantees. This uses a store-store barrier and yields the current thread
before unmapping to allow in-flight requests to finish. The new code no longer
uses WeakIdentityMap as it delegates all ByteBuffer reads through a new
ByteBufferGuard wrapper that is shared between all ByteBufferIndexInput clones.
(Robert Muir, Uwe Schindler)
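
The write-once behaviour that LUCENE-6171 relies on comes from the JDK's StandardOpenOption.CREATE_NEW, which makes file creation fail if the file already exists. The following is a minimal sketch using java.nio only; the class, method names and file name are hypothetical and this is not Lucene's Directory code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration; not part of Lucene.
final class WriteOnceSketch {

  /** Writes the file only if it does not exist yet; returns false if it was already there. */
  static boolean createOnce(Path file, String content) {
    try {
      Files.write(file, content.getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
      return true;
    } catch (FileAlreadyExistsException e) {
      // The filesystem refused to overwrite: the existing file is left untouched.
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Creates a scratch directory for trying the sketch out. */
  static Path newTempDir() {
    try {
      return Files.createTempDirectory("write-once-sketch");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling `createOnce` twice on the same path returns `true` the first time and `false` the second, so an accidental second writer is detected by the filesystem rather than silently overwriting data.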
- Optimizations (7)
- LUCENE-7330, LUCENE-7339: Speed up conjunction queries.
(Adrien Grand)
- LUCENE-7356: SearchGroup tweaks.
(Christine Poerschke)
- LUCENE-7351: Doc id compression for points.
(Adrien Grand)
- LUCENE-7371: Point values are now better compressed using run-length
encoding.
(Adrien Grand)
- LUCENE-7311: Cached term queries do not seek the terms dictionary anymore.
(Adrien Grand)
- LUCENE-7396, LUCENE-7399: Faster flush of points.
(Adrien Grand, Mike McCandless)
- LUCENE-7406: Automaton and PrefixQuery tweaks (fewer object (re)allocations).
(Christine Poerschke)
- Other (6)
- LUCENE-4787: Fixed some highlighting javadocs.
(Michael Dodsworth via Adrien Grand)
- LUCENE-7334: Update ASM dependency to 5.1.
(Uwe Schindler)
- LUCENE-7346: Update forbiddenapis to version 2.2.
(Uwe Schindler)
- LUCENE-7360: Explanation.toHtml() is deprecated.
(Alan Woodward)
- LUCENE-7372: Factor out an org.apache.lucene.search.FilterWeight class.
(Christine Poerschke, Adrien Grand, David Smiley)
- LUCENE-7384: Removed ScoringWrapperSpans. And tweaked SpanWeight.buildSimWeight() to
reuse the existing Similarity instead of creating a new one.
(David Smiley)
- New Features (5)
- LUCENE-7099: Add LatLonDocValuesField.newDistanceSort to the sandbox.
(Robert Muir)
- LUCENE-7140: Add PlanetModel.bisection to spatial3d
(Karl Wright via Mike McCandless)
- LUCENE-7069: Add LatLonPoint.nearest, to find nearest N points to a
provided query point
(Mike McCandless)
- LUCENE-7234: Added InetAddressPoint.nextDown/nextUp to easily generate range
queries with excluded bounds.
(Adrien Grand)
- LUCENE-7300: The misc module now has a directory wrapper that uses hard-links if
applicable and supported when copying files from another FSDirectory in
Directory#copyFrom.
(Simon Willnauer)
- API Changes (6)
- LUCENE-7184: Refactor LatLonPoint encoding methods to new GeoEncodingUtils
helper class in core geo package. Also refactors LatLonPointTests to
TestGeoEncodingUtils
(Nick Knize)
- LUCENE-7163: Refactor GeoRect, Polygon, and GeoUtils tests to geo
package in core
(Nick Knize)
- LUCENE-7152: Refactor GeoUtils from lucene-spatial package to
core
(Nick Knize)
- LUCENE-7141: Switch OfflineSorter's ByteSequencesReader to
BytesRefIterator
(Mike McCandless)
- LUCENE-7150: Spatial3d gets useful APIs to create common shape
queries, matching LatLonPoint.
(Karl Wright via Mike McCandless)
- LUCENE-7243: Removed the LeafReaderContext parameter from
QueryCachingPolicy#shouldCache.
(Adrien Grand)
- Optimizations (14)
- LUCENE-7071: Reduce bytes copying in OfflineSorter, giving ~10%
speedup on merging 2D LatLonPoint values
(Mike McCandless)
- LUCENE-7105, LUCENE-7215: Optimize LatLonPoint's newDistanceQuery.
(Robert Muir)
- LUCENE-7097: IntroSorter now recurses to 2 * log_2(count) quicksort
stack depth before switching to heapsort
(Adrien Grand, Mike McCandless)
- LUCENE-7115: Speed up FieldCache.CacheEntry toString by setting initial
StringBuilder capacity
(Gregory Chanan)
- LUCENE-7147: Improve disjoint check for geo distance query traversal
(Ryan Ernst, Robert Muir, Mike McCandless)
- LUCENE-7153: GeoPointField and LatLonPoint polygon queries now support
multiple polygons and holes, with memory usage independent of
polygon complexity.
(Karl Wright, Mike McCandless, Robert Muir)
- LUCENE-7159: Speed up LatLonPoint polygon performance.
(Robert Muir, Ryan Ernst)
- LUCENE-7211: Reduce memory & GC for spatial RPT Intersects when the number of
matching docs is small.
(Jeff Wartes, David Smiley)
- LUCENE-7235: LRUQueryCache should not take a lock for segments that it will
not cache on anyway.
(Adrien Grand)
- LUCENE-7238: Explicitly disable the query cache in MemoryIndex#createSearcher.
(Adrien Grand)
- LUCENE-7237: LRUQueryCache now prefers returning an uncached Scorer over
waiting on a lock.
(Adrien Grand)
- LUCENE-7261, LUCENE-7262, LUCENE-7264, LUCENE-7258: Speed up DocIdSetBuilder
(which is used by TermsQuery, multi-term queries and several point queries).
(Adrien Grand, Jeff Wartes, David Smiley)
- LUCENE-7299: Speed up BytesRefHash.sort() using radix sort.
(Adrien Grand)
- LUCENE-7306: Speed up points indexing and merging using radix sort.
(Adrien Grand)
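
The depth budget described in LUCENE-7097 can be sketched in plain Java. This is an illustration under assumptions, not Lucene's IntroSorter: the names are hypothetical, log_2 is rounded down, and Arrays.sort on the sub-range stands in for the heapsort fallback.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of Lucene.
final class IntroSortDepthSketch {

  /** Quicksort recursion budget: 2 * floor(log2(count)). */
  static int maxDepth(int count) {
    if (count <= 1) {
      return 0;
    }
    return 2 * (31 - Integer.numberOfLeadingZeros(count));
  }

  static void sort(int[] a) {
    sort(a, 0, a.length, maxDepth(a.length));
  }

  private static void sort(int[] a, int from, int to, int depth) {
    if (to - from < 2) {
      return;
    }
    if (depth == 0) {
      // Budget exhausted: switch algorithms (stand-in for heapsort).
      Arrays.sort(a, from, to);
      return;
    }
    int pivot = a[(from + to - 1) >>> 1];
    int i = from;
    int j = to - 1;
    while (i <= j) {
      while (a[i] < pivot) i++;
      while (a[j] > pivot) j--;
      if (i <= j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
        i++;
        j--;
      }
    }
    sort(a, from, j + 1, depth - 1);
    sort(a, i, to, depth - 1);
  }
}
```

With this rounding, 1024 elements allow 20 levels of quicksort recursion before the fallback kicks in, which bounds the worst case while keeping quicksort's speed on typical inputs.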
- Bug Fixes (9)
- LUCENE-7127: Fix corner case bugs in GeoPointDistanceQuery.
(Robert Muir)
- LUCENE-7166: Fix corner case bugs in LatLonPoint/GeoPointField bounding box
queries.
(Robert Muir)
- LUCENE-7168: Switch to stable encode for geo3d, remove quantization
test leniency, remove dead code
(Mike McCandless)
- LUCENE-7301: Multiple doc values updates to the same document within
one update batch could be applied in the wrong order resulting in
the wrong updated value
(Ishan Chattopadhyaya, hossman, Mike McCandless)
- LUCENE-7312: Fix geo3d's x/y/z double to int encoding to ensure it always
rounds down
(Karl Wright, Mike McCandless)
- LUCENE-7132: BooleanQuery sometimes assigned too-low scores in cases
where ranges of documents had only a single clause matching while
other ranges had more than one clause matching
(Ahmet Arslan, hossman, Mike McCandless)
- LUCENE-7286: Added support for highlighting SynonymQuery.
(Adrien Grand)
- LUCENE-7291: Spatial heatmap faceting could mis-count when the heatmap crosses the
dateline and indexed non-point shapes are much bigger than the heatmap region.
(David Smiley)
- LUCENE-7333: Fix test bug where randomSimpleString() generated a filename
that is a reserved device name on Windows.
(Uwe Schindler, Mike McCandless)
- Other (9)
- LUCENE-7295: TermAutomatonQuery.hashCode calculated Automaton.toDot().hash;
the equivalence relationship is now replaced with object identity.
(Dawid Weiss)
- LUCENE-7277: Make Query.hashCode and Query.equals abstract.
(Paul Elschot, Dawid Weiss)
- LUCENE-7174: Upgrade randomizedtesting to 2.3.4.
(Uwe Schindler, Dawid Weiss)
- LUCENE-7205: Remove repeated nl.getLength() calls in
(Boolean|DisjunctionMax|FuzzyLikeThis)QueryBuilder.
(Christine Poerschke)
- LUCENE-7210: Make TestCore*Parser's analyzer choice override-able
(Christine Poerschke, Daniel Collins)
- LUCENE-7263: Make queryparser/xml/CoreParser's SpanQueryBuilderFactory
accessible to deriving classes.
(Daniel Collins via Christine Poerschke)
- SOLR-9109/SOLR-9121: Allow specification of a custom Ivy settings file via system
property "ivysettings.xml".
(Misha Dmitriev, Christine Poerschke, Uwe Schindler, Steve Rowe)
- LUCENE-7206: Improve the ToParentBlockJoinQuery's explain by including the explain
of the best matching child doc.
(Ilya Kasnacheev, Jeff Evans via Martijn van Groningen)
- LUCENE-7307: Add getters to the PointInSetQuery and PointRangeQuery queries.
(Martijn van Groningen, Adrien Grand)
- Build (2)
- LUCENE-7292: Use '-release' instead of '-source/-target' during
compilation on Java 9+ to ensure real cross-compilation.
(Uwe Schindler)
- LUCENE-7296: Update forbiddenapis to version 2.1.
(Uwe Schindler)
- New Features (1)
- LUCENE-7278: Spatial-extras DateRangePrefixTree's Calendar is now configurable, to
e.g. clear the Gregorian Change Date. Also, toString(cal) is now identical to
DateTimeFormatter.ISO_INSTANT.
(David Smiley)
- Bug Fixes (10)
- LUCENE-7187: Block join queries' Weight#extractTerms(...) implementations
should delegate to the wrapped weight.
(Martijn van Groningen)
- LUCENE-7209: Fixed explanations of FunctionScoreQuery.
(Adrien Grand)
- LUCENE-7232: Fixed InetAddressPoint.newPrefixQuery, which was generating an
incorrect query when the prefix length was not a multiple of 8.
(Adrien Grand)
- LUCENE-7279: JapaneseTokenizer throws ArrayIndexOutOfBoundsException
on some valid inputs
(Mike McCandless)
- LUCENE-7188: Remove incorrect sanity check in NRTCachingDirectory.listAll()
that led to IllegalStateException being thrown when nothing was wrong.
(David Smiley, yonik)
- LUCENE-7219: Make queryparser/xml (Point|LegacyNumeric)RangeQuery builders
match the underlying queries' (lower|upper)Term optionality logic.
(Kaneshanathan Srivisagan, Christine Poerschke)
- LUCENE-7257: Fixed PointValues#size(IndexReader, String), docCount,
minPackedValue and maxPackedValue to skip leaves that do not have points
rather than raising an IllegalStateException.
(Adrien Grand)
- LUCENE-7284: GapSpans needs to implement positionsCost().
(Daniel Bigham, Alan Woodward)
- LUCENE-7231: WeightedSpanTermExtractor didn't deal correctly with single-term
phrase queries.
(Eva Popenda, Alan Woodward)
- LUCENE-7293: Don't try to highlight GeoPoint queries
(Britta Weber, Nick Knize, Mike McCandless, Uwe Schindler)
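To illustrate the LUCENE-7232 entry above: a prefix whose length is not a multiple of 8 ends in the middle of a byte, so the lower and upper bounds of the matching range have to be derived bit by bit rather than byte by byte. The following is a minimal sketch of that idea using only the JDK. It is not InetAddressPoint's code; the class PrefixRangeSketch and its methods are hypothetical names chosen for this example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class PrefixRangeSketch {

  /** Returns {lower, upper}: the smallest and largest addresses sharing the first prefixLength bits. */
  static byte[][] prefixRange(byte[] address, int prefixLength) {
    if (prefixLength < 0 || prefixLength > 8 * address.length) {
      throw new IllegalArgumentException("illegal prefixLength: " + prefixLength);
    }
    byte[] lower = address.clone();
    byte[] upper = address.clone();
    // Visit every bit after the prefix individually, so a prefix that ends
    // inside a byte is handled the same way as one ending on a byte boundary.
    for (int i = prefixLength; i < 8 * address.length; i++) {
      int mask = 1 << (7 - (i & 7));
      lower[i >> 3] &= ~mask; // clear the bit in the lower bound
      upper[i >> 3] |= mask;  // set the bit in the upper bound
    }
    return new byte[][] {lower, upper};
  }

  /** Convenience wrapper for IP literals (hypothetical helper); returns {lowerText, upperText}. */
  static String[] describe(String ipLiteral, int prefixLength) {
    try {
      byte[][] range = prefixRange(InetAddress.getByName(ipLiteral).getAddress(), prefixLength);
      return new String[] {
        InetAddress.getByAddress(range[0]).getHostAddress(),
        InetAddress.getByAddress(range[1]).getHostAddress()
      };
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    String[] range = describe("192.168.1.77", 20);
    System.out.println(range[0] + " .. " + range[1]); // prints "192.168.0.0 .. 192.168.15.255"
  }
}
```

With a 20-bit prefix the third byte is split: its upper four bits belong to the prefix and its lower four bits vary, which is why the bounds are 192.168.0.0 and 192.168.15.255.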
- Documentation (1)
- LUCENE-7223: Improve XXXPoint javadocs to make it clear that you
should separately add StoredField if you want to retrieve these
field values at search time
(Greg Huber, Robert Muir, Mike McCandless)
- System Requirements (2)
- LUCENE-5950: Move to Java 8 as minimum Java version.
(Ryan Ernst, Uwe Schindler)
- LUCENE-6069: Lucene Core now gets compiled with Java 8 "compact1" profile,
all other modules with "compact2".
(Robert Muir, Uwe Schindler)
- New Features (17)
- LUCENE-6631: Lucene Document classification
(Tommaso Teofili, Alessandro Benedetti)
- LUCENE-6747: FingerprintFilter is a TokenFilter that outputs a single
token which is a concatenation of the sorted and de-duplicated set of
input tokens. Useful for normalizing short text in clustering/linking
tasks.
(Mark Harwood, Adrien Grand)
- LUCENE-5735: NumberRangePrefixTreeStrategy now includes interval/range faceting
for counting ranges that align with the underlying terms as defined by the
NumberRangePrefixTree (e.g. familiar date units like days).
(David Smiley)
- LUCENE-6711: Use CollectionStatistics.docCount() for IDF and average field
length computations, to avoid skew from documents that don't have the field.
(Ahmet Arslan via Robert Muir)
- LUCENE-6758: Use docCount+1 for DefaultSimilarity's IDF, so that queries
containing nonexistent fields won't screw up querynorm.
(Terry Smith, Robert Muir)
- SOLR-7876: The QueryTimeout interface now has an isTimeoutEnabled method
that can return false to exit from ExitableDirectoryReader wrapping at
the point fields() is called.
(yonik)
- LUCENE-6825: Add low-level support for block-KD trees
(Mike McCandless)
- LUCENE-6852, LUCENE-6975: Add support for points (dimensionally
indexed values) to index, document and codec APIs, including a
simple text implementation.
(Mike McCandless)
- LUCENE-6861: Create Lucene60Codec, supporting points.
(Mike McCandless)
- LUCENE-6879: Allow defining custom CharTokenizer instances without
subclassing using Java 8 lambdas or method references.
(Uwe Schindler)
- LUCENE-6881: Cutover all BKD implementations to points
(Mike McCandless)
- LUCENE-6837: Add N-best output support to JapaneseTokenizer.
(Hiroharu Konno via Christian Moen)
- LUCENE-6962: Add per-dimension min/max to points
(Mike McCandless)
- LUCENE-6975: Add ExactPointQuery, to match a single N-dimensional
point
(Robert Muir, Mike McCandless)
- LUCENE-6989: Add preliminary support for MMapDirectory unmapping in Java 9.
(Uwe Schindler, Chris Hegarty, Peter Levart)
- LUCENE-7040: Upgrade morfologik-stemming to version 2.1.0.
(Dawid Weiss)
- LUCENE-7048: Add XXXPoint.newSetQuery, to create a query that
efficiently matches all documents containing any of the specified
point values. This is the analog of TermsQuery, but for points
instead.
(Adrien Grand, Robert Muir, Mike McCandless)
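The LUCENE-6747 entry in this list describes FingerprintFilter as emitting a single token built from the sorted, de-duplicated set of input tokens. The transformation itself can be sketched in plain Java as below. This is only an illustration of the idea, not the TokenFilter implementation: FingerprintSketch is a hypothetical class, the separator is passed in as a parameter, and tokens are ordered by String's natural ordering, which is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class FingerprintSketch {

  /** Joins the sorted, de-duplicated tokens into one string. */
  static String fingerprint(List<String> tokens, char separator) {
    // A TreeSet both sorts the tokens and drops duplicates.
    TreeSet<String> unique = new TreeSet<>(tokens);
    return String.join(String.valueOf(separator), unique);
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("the", "quick", "fox", "the", "quick");
    System.out.println(fingerprint(tokens, ' ')); // prints "fox quick the"
  }
}
```

Two short texts that contain the same words in a different order, or with repetitions, map to the same fingerprint, which is what makes it useful for clustering and linking.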
- API Changes (17)
- LUCENE-7094: BBoxStrategy and PointVectorStrategy now support
PointValues (in addition to legacy numeric trie). Their APIs
were changed a little and also made more consistent. PointValues/Trie
is optional, DocValues is optional, stored value is optional.
(Nick Knize, David Smiley)
- LUCENE-6067: Accountable.getChildResources has a default
implementation returning the empty list.
(Robert Muir)
- LUCENE-6583: FilteredQuery has been removed. Instead, you can construct a
BooleanQuery with one MUST clause for the query, and one FILTER clause for
the filter.
(Adrien Grand)
- LUCENE-6651: AttributeImpl#reflectWith(AttributeReflector) was made
abstract and has no reflection-based default implementation anymore.
(Uwe Schindler)
- LUCENE-6706: PayloadTermQuery and PayloadNearQuery have been removed.
Instead, use PayloadScoreQuery to wrap any SpanQuery.
(Alan Woodward)
- LUCENE-6829: OfflineSorter, and the classes that use it (suggesters,
hunspell) now do all temporary file IO via Directory instead of
directly through java's temp dir. Directory.createTempOutput
creates a uniquely named IndexOutput, and the new
IndexOutput.getName returns its name
(Dawid Weiss, Robert Muir, Mike McCandless)
- LUCENE-6917: Deprecate and rename NumericXXX classes to
LegacyNumericXXX in favor of points
(Mike McCandless)
- LUCENE-6947: SortField.missingValue is now protected. You can read its
value using the new SortField.getMissingValue getter.
(Adrien Grand)
- LUCENE-7028: Remove duplicate method in LegacyNumericUtils.
(Uwe Schindler)
- LUCENE-7052, LUCENE-7053: Remove custom comparators from BytesRef
class and solely use natural byte[] comparator throughout codebase.
This also simplifies API of BytesRefHash. It also replaces the natural
comparator in ArrayUtil by Java 8's Comparator#naturalOrder().
(Mike McCandless, Uwe Schindler, Robert Muir)
- LUCENE-7060: Update Spatial4j to 0.6. The package com.spatial4j.core
is now org.locationtech.spatial4j.
(David Smiley)
- LUCENE-7058: Add getters to various Query implementations
(Guillaume Smet via Alan Woodward)
- LUCENE-7064: MultiPhraseQuery is now immutable and should be constructed
with MultiPhraseQuery.Builder.
(Luc Vanlerberghe via Adrien Grand)
- LUCENE-7072: Geo3DPoint always uses WGS84 planet model.
(Robert Muir, Mike McCandless)
- LUCENE-7056: Geo3D classes are in different packages now.
(David Smiley)
- LUCENE-6952: These classes are now abstract: FilterCodecReader, FilterLeafReader,
FilterCollector, FilterDirectory. And some Filter* classes in
lucene-test-framework too.
(David Smiley)
- SOLR-8867: FunctionValues.getRangeScorer now takes a LeafReaderContext instead
of an IndexReader, and avoids matching documents without a value in the field
for numeric fields.
(yonik)
- Optimizations (5)
- LUCENE-6891: Use prefix coding when writing points in
each leaf block in the default codec, to reduce the index
size
(Mike McCandless)
- LUCENE-6901: Optimize points indexing: use faster
IntroSorter instead of InPlaceMergeSorter, and specialize 1D
merging to merge sort the already sorted segments instead of
re-indexing
(Mike McCandless)
- LUCENE-6793: LegacyNumericRangeQuery.hashCode() is now less subject to hash
collisions.
(J.B. Langston via Adrien Grand)
- LUCENE-7050: TermsQuery is now cached more aggressively by the default
query caching policy.
(Adrien Grand)
- LUCENE-7066: PointRangeQuery got optimized for the case that all documents
have a value and all points from the segment match.
(Adrien Grand)
- Changes in Runtime Behavior (3)
- LUCENE-6789: IndexSearcher's default Similarity is changed to BM25Similarity.
Use ClassicSimilarity to get the old vector space DefaultSimilarity.
(Robert Muir)
- LUCENE-6886: Reserve the .tmp file name extension for temp files,
and codec components are no longer allowed to use this extension
(Robert Muir, Mike McCandless)
- LUCENE-6835: Directory.listAll now returns entries in sorted order,
to not leak platform-specific behavior, and "retrying file deletion"
is now the responsibility of Directory.deleteFile, not the caller.
(Robert Muir, Mike McCandless)
- Tests (1)
- LUCENE-7009: Add expectThrows utility to LuceneTestCase. This uses a lambda
expression to encapsulate a statement that is expected to throw an exception.
(Ryan Ernst)
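The pattern behind LUCENE-7009 is to wrap the statement that should fail in a lambda, run it, and return the caught exception so the test can inspect it. A minimal stand-alone sketch of such a helper follows. It is not LuceneTestCase's implementation: ExpectThrowsSketch and its nested ThrowingRunnable interface are hypothetical names defined here so that the example needs nothing beyond the JDK.

```java
final class ExpectThrowsSketch {

  /** A block of code that is allowed to throw anything (hypothetical interface for this sketch). */
  @FunctionalInterface
  interface ThrowingRunnable {
    void run() throws Throwable;
  }

  /** Runs the block and returns the exception if it has the expected type; fails otherwise. */
  static <T extends Throwable> T expectThrows(Class<T> expectedType, ThrowingRunnable runnable) {
    try {
      runnable.run();
    } catch (Throwable t) {
      if (expectedType.isInstance(t)) {
        return expectedType.cast(t);
      }
      throw new AssertionError(
          "Expected " + expectedType.getSimpleName() + " but got " + t.getClass().getSimpleName(), t);
    }
    throw new AssertionError("Expected " + expectedType.getSimpleName() + " but nothing was thrown");
  }

  public static void main(String[] args) {
    NumberFormatException e =
        expectThrows(NumberFormatException.class, () -> Integer.parseInt("not a number"));
    System.out.println("caught: " + e.getClass().getSimpleName()); // prints "caught: NumberFormatException"
  }
}
```

Returning the exception lets the caller assert on its message or cause, which a plain try/fail/catch block makes more verbose.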
- Bug Fixes (7)
- LUCENE-7065: Fix the explain for the global ordinals join query. Previously, the
explain would also indicate that non-matching documents would match.
On top of that, with score mode average, the explain would fail with an NPE.
(Martijn van Groningen)
- LUCENE-7101: OfflineSorter had O(N^2) merge cost, and used too many
temporary file descriptors, for large sorts
(Mike McCandless)
- LUCENE-7111: DocValuesRangeQuery.newLongRange behaves incorrectly for
Long.MAX_VALUE and Long.MIN_VALUE
(Ishan Chattopadhyaya via Steve Rowe)
- LUCENE-7139: Fix bugs in geo3d's Vincenty surface distance
implementation
(Karl Wright via Mike McCandless)
- LUCENE-7112: WeightedSpanTermExtractor.extractUnknownQuery is only called
on queries that could not be extracted.
(Adrien Grand)
- LUCENE-7126: Remove GeoPointDistanceRangeQuery. This query was implemented
with boolean NOT, and incorrect for multi-valued documents.
(Robert Muir)
- LUCENE-7158: Consistently use earth's WGS84 mean radius wherever our
geo search implementations approximate the earth as a sphere
(Karl Wright via Mike McCandless)
- Other (5)
- LUCENE-7035: Upgrade icu4j to 56.1/unicode 8.
(Robert Muir)
- LUCENE-7087: Let MemoryIndex#fromDocument(...) accept 'Iterable<? extends IndexableField>'
as document instead of 'Document'.
(Martijn van Groningen)
- LUCENE-7091: Add doc values support to MemoryIndex
(Martijn van Groningen, David Smiley)
- LUCENE-7093: Add point values support to MemoryIndex
(Martijn van Groningen, Mike McCandless)
- LUCENE-7095: Add point values support to the numeric field query time join.
(Martijn van Groningen, Mike McCandless)
- Changes in Runtime Behavior (1)
- Resolving of external entities in queryparser/xml/CoreParser is disallowed
by default. See SOLR-11477 for details.
- Bug Fixes (2)
- LUCENE-7419: Fix performance bug with TokenStream.end(), where it would look up
PositionIncrementAttribute every time.
(Mike McCandless, Robert Muir)
- SOLR-11477: Disallow resolving of external entities in queryparser/xml/CoreParser
by default.
(Michael Stepankin, Olga Barinova, Uwe Schindler, Christine Poerschke)
- Bug Fixes (8)
- LUCENE-7417: The standard Highlighter could throw an IllegalArgumentException when
trying to highlight a query containing a degenerate case of a MultiPhraseQuery with one
term.
(Thomas Kappler via David Smiley)
- LUCENE-7657: Fixed potential memory leak in the case that a (Span)TermQuery
with a TermContext is cached.
(Adrien Grand)
- LUCENE-7647: Made stored fields reclaim native memory more aggressively when
configured with BEST_COMPRESSION. This could otherwise result in out-of-memory
issues.
(Adrien Grand)
- LUCENE-7562: CompletionFieldsConsumer sometimes throws
NullPointerException on ghost fields
(Oliver Eilhard via Mike McCandless)
- LUCENE-7547: JapaneseTokenizerFactory was failing to close the
dictionary file it opened
(Markus via Mike McCandless)
- LUCENE-6914: Fixed DecimalDigitFilter in case of supplementary code points.
(Hossman)
- LUCENE-7440: Document id skipping (PostingsEnum.advance) could throw an
ArrayIndexOutOfBoundsException on large index segments (>1.8B docs)
with large skips.
(yonik)
- LUCENE-7570: IndexWriter may deadlock if a commit is running while
there are too many merges running and one of the merges hits a
tragic exception
(Joey Echeverria via Mike McCandless)
- Other (1)
- LUCENE-6989: Backport MMapDirectory's unmapping code from Lucene 6.4 to use
MethodHandles. This allows it to work with Java 9 (EA build 150 and later).
(Uwe Schindler)
- Build (3)
- LUCENE-7543: Make changes-to-html target an offline operation, by moving the
Lucene and Solr DOAP RDF files into the Git source repository under
dev-tools/doap/ and then pulling release dates from those files, rather than
from JIRA.
(Mano Kovacs, hossman, Steve Rowe)
- LUCENE-7596: Update Groovy to version 2.4.8 to allow building with Java 9
build 148+. Also update JGit version for working-copy checks. This does not
fix all issues with Java 9, but allows building the distribution.
(Uwe Schindler)
- LUCENE-7651: Backport (Lucene 6.4.1) fix for Java 8u121 to allow documentation
build to inject "Google Code Prettify" without adding Javascript to Javadocs's
-bottom parameter. Unfortunately, this fix disables Prettify if Javadocs are
built with Java 7, as there is no generic way in Java 7 to inject Javascript
without breaking Java 8 (and possible paid Java 7 security updates). This
fix also updates Prettify to latest version to work around a Google Chrome
issue.
(Uwe Schindler)
- Bug Fixes (11)
- LUCENE-7065: Fix the explain for the global ordinals join query. Previously, the
explain would also indicate that non-matching documents would match.
On top of that, with score mode average, the explain would fail with an NPE.
(Martijn van Groningen)
- LUCENE-7111: DocValuesRangeQuery.newLongRange behaves incorrectly for
Long.MAX_VALUE and Long.MIN_VALUE
(Ishan Chattopadhyaya via Steve Rowe)
- LUCENE-7139: Fix bugs in geo3d's Vincenty surface distance
implementation
(Karl Wright via Mike McCandless)
- LUCENE-7187: Block join queries' Weight#extractTerms(...) implementations
should delegate to the wrapped weight.
(Martijn van Groningen)
- LUCENE-7279: JapaneseTokenizer throws ArrayIndexOutOfBoundsException
on some valid inputs
(Mike McCandless)
- LUCENE-7219: Make queryparser/xml (Point|LegacyNumeric)RangeQuery builders
match the underlying queries' (lower|upper)Term optionality logic.
(Kaneshanathan Srivisagan, Christine Poerschke)
- LUCENE-7284: GapSpans needs to implement positionsCost().
(Daniel Bigham, Alan Woodward)
- LUCENE-7231: WeightedSpanTermExtractor didn't deal correctly with single-term
phrase queries.
(Eva Popenda, Alan Woodward)
- LUCENE-7301: Multiple doc values updates to the same document within
one update batch could be applied in the wrong order resulting in
the wrong updated value
(Ishan Chattopadhyaya, hossman, Mike McCandless)
- LUCENE-7132: BooleanQuery sometimes assigned too-low scores in cases
where ranges of documents had only a single clause matching while
other ranges had more than one clause matching
(Ahmet Arslan, hossman, Mike McCandless)
- LUCENE-7291: Spatial heatmap faceting could mis-count when the heatmap crosses the
dateline and indexed non-point shapes are much bigger than the heatmap region.
(David Smiley)
- Bug Fixes (3)
- LUCENE-7112: WeightedSpanTermExtractor.extractUnknownQuery is only called
on queries that could not be extracted.
(Adrien Grand)
- LUCENE-7188: remove incorrect sanity check in NRTCachingDirectory.listAll()
that led to IllegalStateException being thrown when nothing was wrong.
(David Smiley, yonik)
- LUCENE-7209: Fixed explanations of FunctionScoreQuery.
(Adrien Grand)
- New Features (6)
- LUCENE-5868: JoinUtil.createJoinQuery(..,NumericType,..) query-time join
for LONG and INT fields with NUMERIC and SORTED_NUMERIC doc values.
(Alexey Zelin via Mikhail Khludnev)
- LUCENE-6939: Add exponential reciprocal scoring to
BlendedInfixSuggester, to even more strongly favor suggestions that
match closer to the beginning
(Arcadius Ahouansou via Mike McCandless)
- LUCENE-6958: Improved CustomAnalyzer to take class references to factories
as alternative to their SPI name. This enables compile-time safety when
defining analyzer's components.
(Uwe Schindler, Shai Erera)
- LUCENE-6818, LUCENE-6986: Add DFISimilarity implementing the divergence
from independence model.
(Ahmet Arslan via Robert Muir)
- SOLR-4619: Added removeAllAttributes() to AttributeSource, which removes
all previously added attributes.
- LUCENE-7010: Added MergePolicyWrapper to allow easy wrapping of other policies.
(Shai Erera)
- API Changes (10)
- LUCENE-6997: refactor sandboxed GeoPointField and query classes to lucene-spatial
module under new lucene.spatial.geopoint package
(Nick Knize)
- LUCENE-6908: GeoUtils static relational methods have been refactored to new
GeoRelationUtils and now correctly handle large irregular rectangles, and
pole crossing distance queries.
(Nick Knize)
- LUCENE-6900: Grouping sortWithinGroup variables used to allow null to mean
Sort.RELEVANCE. Null is no longer permitted.
(David Smiley)
- LUCENE-6919: The Scorer class has been refactored to expose an iterator
instead of extending DocIdSetIterator. asTwoPhaseIterator() has been renamed
to twoPhaseIterator() for consistency.
(Adrien Grand)
- LUCENE-6973: TeeSinkTokenFilter no longer accepts a SinkFilter (the latter
has been removed). If you wish to filter the sinks, you can wrap them with
any other TokenFilter (e.g. a FilteringTokenFilter). Also, you can no longer
add a SinkTokenStream to an existing TeeSinkTokenFilter. If you need to
share multiple streams with a single sink, chain them with multiple
TeeSinkTokenFilters.
DateRecognizerSinkFilter was renamed to DateRecognizerFilter and moved under
analysis/common. TokenTypeSinkFilter was removed (use TypeTokenFilter instead).
TokenRangeSinkFilter was removed.
(Shai Erera, Uwe Schindler)
- LUCENE-6980: Default applyAllDeletes to true when opening
near-real-time readers
(Mike McCandless)
- LUCENE-6981: SpanQuery.getTermContexts() helper methods are now public, and
SpanScorer has a public getSpans() method.
(Alan Woodward)
- LUCENE-6932: IndexInput.seek implementations now throw EOFException
if you seek beyond the end of the file
(Adrien Grand, Mike McCandless)
- LUCENE-6988: IndexableField.tokenStream() no longer throws IOException
(Alan Woodward)
- LUCENE-7028: Deprecate a duplicate method in NumericUtils.
(Uwe Schindler)
- Optimizations (9)
- LUCENE-6930: Decouple GeoPointField from NumericType by using a custom
and efficient GeoPointTokenStream and TermEnum designed for GeoPoint prefix
terms.
(Nick Knize)
- LUCENE-6951: Improve GeoPointInPolygonQuery using point orientation based
line crossing algorithm, and adding result for multi-value docs when at least
1 point satisfies polygon criteria.
(Nick Knize)
- LUCENE-6889: BooleanQuery.rewrite now performs some query optimization, in
particular to rewrite queries that look like: "+*:* #filter" to a
"ConstantScore(filter)".
(Adrien Grand)
- LUCENE-6912: Grouping's Collectors now calculate a response to needsScores()
instead of always 'true'.
(David Smiley)
- LUCENE-6815: DisjunctionScorer now advances two-phased iterators lazily,
and stops evaluating them as soon as a single one matches. The other iterators
will be confirmed lazily when computing score() or freq().
(Adrien Grand)
- LUCENE-6926: MUST_NOT clauses now use the match cost API to run the slow bits
last whenever possible.
(Adrien Grand)
- LUCENE-6944: BooleanWeight no longer creates sub-scorers if BS1 is not
applicable.
(Adrien Grand)
- LUCENE-6940: MUST_NOT clauses execute faster, especially when they are sparse.
(Adrien Grand)
- LUCENE-6470: Improve efficiency of TermsQuery constructors.
(Robert Muir)
- Bug Fixes (10)
- LUCENE-6976: BytesRefTermAttributeImpl.copyTo NPE'ed if BytesRef was null.
Added equals & hashCode, and a new test for these things.
(David Smiley)
- LUCENE-6932: RAMDirectory's IndexInput was failing to throw
EOFException in some cases
(Stéphane Campinas, Adrien Grand via Mike McCandless)
- LUCENE-6896: Don't treat the smallest possible norm value as an infinitely
long document in SimilarityBase or BM25Similarity. Add more warnings to sims
that will not work well with extreme tf values.
(Ahmet Arslan, Robert Muir)
- LUCENE-6984: SpanMultiTermQueryWrapper no longer modifies its wrapped query.
(Alan Woodward, Adrien Grand)
- LUCENE-6998: Fix a couple places to better detect truncated index files
as corruption.
(Robert Muir, Mike McCandless)
- LUCENE-7002: Fixed MultiCollector to not throw an NPE if setScorer is called
after one of the sub collectors is done collecting.
(John Wang, Adrien Grand)
- LUCENE-7027: Fixed NumericTermAttribute to not throw IllegalArgumentException
after NumericTokenStream was exhausted.
(Uwe Schindler, Lee Hinman, Mike McCandless)
- LUCENE-7018: Fix GeoPointTermQueryConstantScoreWrapper to add document on
first GeoPointField match.
(Nick Knize)
- LUCENE-7019: Add two-phase iteration to GeoPointTermQueryConstantScoreWrapper.
(Robert Muir via Nick Knize)
- LUCENE-6989: Improve MMapDirectory's unmapping checks to catch more non-working
cases. The unmap-hack does not yet work with recent Java 9. Official support
will come with Lucene 6.
(Uwe Schindler)
- Other (15)
- LUCENE-6924: Upgrade randomizedtesting to 2.3.2.
(Dawid Weiss)
- LUCENE-6920: Improve custom function checks in expressions module
to use MethodHandles and work without extra security privileges.
(Uwe Schindler, Robert Muir)
- LUCENE-6921: Fix SPIClassIterator#isParentClassLoader to not
require extra permissions.
(Uwe Schindler)
- LUCENE-6923: Fix RamUsageEstimator to access private fields inside
AccessController block for computing size.
(Robert Muir)
- LUCENE-6907: make TestParser extendable, rename test/.../xml/
NumericRangeQueryQuery.xml to NumericRangeQuery.xml
(Christine Poerschke)
- LUCENE-6925: add ForceMergePolicy class in test-framework
(Christine Poerschke)
- LUCENE-6945: factor out TestCorePlus(Queries|Extensions)Parser from
TestParser, rename TestParser to TestCoreParser
(Christine Poerschke)
- LUCENE-6949: fix (potential) resource leak in SynonymFilterFactory
(https://scan.coverity.com/projects/5620 CID 120656)
(Christine Poerschke, Coverity Scan (via Rishabh Patel))
- LUCENE-6961: Improve Exception handling in AnalysisFactories /
AnalysisSPILoader: Don't wrap exceptions occurring in factory's
ctor inside InvocationTargetException.
(Uwe Schindler)
- LUCENE-6965: Expression's JavascriptCompiler now throws ParseException
with bad function names or bad arity instead of IllegalArgumentException.
(Tomás Fernández Löbbe, Uwe Schindler, Ryan Ernst)
- LUCENE-6964: String-based signatures in JavascriptCompiler replaced
with better compile-time-checked MethodType; generated class files
are no longer marked as synthetic.
(Uwe Schindler)
- LUCENE-6978: Refactor several code places that look up locales
by string name to use BCP47 locale tag instead. LuceneTestCase
now also prints locales on failing tests this way.
Locale#forLanguageTag() and Locale#toString() were placed on list
of forbidden signatures.
(Uwe Schindler, Robert Muir)
- LUCENE-6988: You can now add IndexableFields directly to a MemoryIndex,
and create a MemoryIndex from a lucene Document.
(Alan Woodward)
- LUCENE-7005: TieredMergePolicy tweaks (>= vs. >, @see get vs. set)
(Christine Poerschke)
- LUCENE-7006: increase BaseMergePolicyTestCase use (TestNoMergePolicy and
TestSortingMergePolicy now extend it, TestUpgradeIndexMergePolicy added)
(Christine Poerschke)
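For readers unfamiliar with the distinction drawn in LUCENE-6978: a BCP47 language tag ("de-CH") is not the same string as the legacy Locale#toString() form ("de_CH"), and the two should not be mixed when locales are stored or printed by name. The sketch below shows one strict way to round-trip a locale through its BCP47 tag with JDK classes only. LocaleTagSketch is a hypothetical class, and the use of Locale.Builder for parsing is an assumption of this example, not a statement about which call Lucene's code uses.

```java
import java.util.Locale;

final class LocaleTagSketch {

  /** Parses a BCP47 tag strictly; ill-formed tags raise IllformedLocaleException. */
  static Locale parse(String bcp47Tag) {
    return new Locale.Builder().setLanguageTag(bcp47Tag).build();
  }

  /** Formats a locale as its BCP47 tag. */
  static String format(Locale locale) {
    return locale.toLanguageTag();
  }

  public static void main(String[] args) {
    Locale swissGerman = parse("de-CH");
    System.out.println(format(swissGerman));    // prints "de-CH"
    System.out.println(swissGerman.toString()); // prints "de_CH" (legacy form, not a BCP47 tag)
  }
}
```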
- Bug Fixes (9)
- LUCENE-6910: fix 'if ... > Integer.MAX_VALUE' check in
(Binary|Numeric)DocValuesFieldUpdates.merge
(https://scan.coverity.com/projects/5620 CID 119973 and CID 120081)
(Christine Poerschke, Coverity Scan (via Rishabh Patel))
- LUCENE-6946: SortField.equals now takes the missingValue parameter into
account.
(Adrien Grand)
- LUCENE-6918: LRUQueryCache.onDocIdSetEviction is only called when at least
one DocIdSet is being evicted.
(Adrien Grand)
- LUCENE-6929: Fix SpanNotQuery rewriting to not drop the pre/post parameters.
(Tim Allison via Adrien Grand)
- LUCENE-6950: Fix FieldInfos handling of UninvertingReader, e.g. do not
hide the true docvalues update generation or other properties.
(Ishan Chattopadhyaya via Robert Muir)
- LUCENE-6948: Fix ArrayIndexOutOfBoundsException in PagedBytes$Reader.fill
by removing an unnecessary long-to-int cast.
(Michael Lawley via Christine Poerschke)
- SOLR-7865: BlendedInfixSuggester was returning too many results
(Arcadius Ahouansou via Mike McCandless)
- LUCENE-6970: Fixed off-by-one error in Lucene54DocValuesProducer that could
potentially corrupt doc values.
(Adrien Grand)
- LUCENE-2229: Fix Highlighter's SimpleSpanFragmenter when multiple adjacent
stop words following a span can unduly make the fragment way too long.
(Elmer Garduno, Lukhnos Liu via David Smiley)
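Two entries in this list (LUCENE-6910 and LUCENE-6948) concern arithmetic around the int/long boundary. The general hazard can be shown with a small, self-contained sketch: narrowing a long position to int before shifting it yields a negative block index once the position exceeds Integer.MAX_VALUE, whereas shifting first and narrowing afterwards stays correct. This is an illustration of the class of bug only, not the code of PagedBytes or the doc values updates; BlockAddressSketch, BLOCK_BITS and both methods are hypothetical.

```java
final class BlockAddressSketch {

  /** Hypothetical block size of 2^15 = 32768 bytes. */
  static final int BLOCK_BITS = 15;

  /** Buggy variant: the cast truncates the position before the shift is applied. */
  static int blockIndexNarrowedTooEarly(long position) {
    return ((int) position) >> BLOCK_BITS;
  }

  /** Correct variant: shift in 64-bit arithmetic, then narrow the (much smaller) result. */
  static int blockIndex(long position) {
    return (int) (position >> BLOCK_BITS);
  }

  public static void main(String[] args) {
    long position = 3_000_000_000L; // larger than Integer.MAX_VALUE
    System.out.println(blockIndexNarrowedTooEarly(position)); // negative, unusable as an array index
    System.out.println(blockIndex(position));                 // prints 91552
  }
}
```

Using the negative value as an array index is what surfaces as an ArrayIndexOutOfBoundsException in code of this shape.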
- New Features (9)
- LUCENE-6875: New Serbian Filter.
(Nikola Smolenski via Robert Muir, Dawid Weiss)
- LUCENE-6720: New FunctionRangeQuery wrapper around ValueSourceScorer
(returned from ValueSource/FunctionValues.getRangeScorer()).
(David Smiley)
- LUCENE-6724: Add utility APIs to GeoHashUtils to compute neighbor
geohash cells
(Nick Knize via Mike McCandless)
- LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin.
(Robert Muir)
- LUCENE-6699: Add integration of BKD tree and geo3d APIs to give
fast, very accurate query to find all indexed points within an
earth-surface shape
(Karl Wright, Mike McCandless)
- LUCENE-6838: Added IndexSearcher#getQueryCache and #getQueryCachingPolicy.
(Adrien Grand)
- LUCENE-6844: PayloadScoreQuery can include or exclude underlying span scores
from its score calculations
(Bill Bell, Alan Woodward)
- LUCENE-6778: Add GeoPointDistanceRangeQuery, to search for points
within a "ring" (beyond a minimum distance and below a maximum
distance)
(Nick Knize via Mike McCandless)
- LUCENE-6874: Add a new UnicodeWhitespaceTokenizer to analysis/common
that uses Unicode character properties extracted from ICU4J to tokenize
text on whitespace. This tokenizer will split on non-breaking
space (NBSP), too.
(David Smiley, Uwe Schindler, Steve Rowe)
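The digit folding described for DecimalDigitFilter (LUCENE-6737) can be approximated in plain Java as follows. This is a rough sketch of the idea, not the filter itself: DigitFoldingSketch is a hypothetical class, and relying on Character.isDigit and Character.digit to identify and convert decimal digits is an assumption of this example. The loop walks code points rather than chars so that digits outside the Basic Multilingual Plane are not split into surrogate halves.

```java
final class DigitFoldingSketch {

  /** Replaces every non-ASCII Unicode decimal digit with its ASCII equivalent. */
  static String foldDigits(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      int codePoint = input.codePointAt(i);
      i += Character.charCount(codePoint); // advance by 1 or 2 chars
      if (codePoint > 0x7F && Character.isDigit(codePoint)) {
        out.append((char) ('0' + Character.digit(codePoint, 10)));
      } else {
        out.appendCodePoint(codePoint);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // U+0661..U+0663 are ARABIC-INDIC DIGIT ONE..THREE
    System.out.println(foldDigits("\u0661\u0662\u0663")); // prints "123"
  }
}
```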
- API Changes (12)
- LUCENE-6590: Query.setBoost(), Query.getBoost() and Query.clone() are gone.
In order to apply boosts, you now need to wrap queries in a BoostQuery.
(Adrien Grand)
- LUCENE-6716: SpanPayloadCheckQuery now takes a List<BytesRef> rather than
a Collection<byte[]>.
(Alan Woodward)
- LUCENE-6489: The various span payload queries have been moved to the queries
submodule, and PayloadSpanUtil is now in sandbox.
(Alan Woodward)
- LUCENE-6650: The spatial module no longer uses Filter in any way. All
spatial Filters now subclass Query. The spatial heatmap/facet API
now accepts a Bits parameter to filter counts.
(David Smiley, Adrien Grand)
- LUCENE-6803: Deprecate sandbox Regexp Query.
(Uwe Schindler)
- LUCENE-6301: org.apache.lucene.search.Filter is now deprecated. You should use
Query objects instead of Filters, and the BooleanClause.Occur.FILTER clause in
order to let Lucene know that a Query should be used for filtering but not
scoring.
- LUCENE-6939: SpanOrQuery.addClause is now deprecated, clauses should all be
provided at construction time.
(Paul Elschot via Adrien Grand)
- LUCENE-6855: CachingWrapperQuery is deprecated and will be removed in 6.0.
(Adrien Grand)
- LUCENE-6870: DisjunctionMaxQuery#add is now deprecated, clauses should all be
provided at construction time.
(Adrien Grand)
- LUCENE-6884: Analyzer.tokenStream() and Tokenizer.setReader() are no longer
declared as throwing IOException.
(Alan Woodward)
- LUCENE-6849: Expose IndexWriter.flush() method, to move all
in-memory segments to disk without opening a near-real-time reader
nor calling fsync
(Robert Muir, Simon Willnauer, Mike McCandless)
- LUCENE-6911: Add correct StandardQueryParser.getMultiFields() method,
deprecate no-op StandardQueryParser.getMultiFields(CharSequence[]) method.
(Christine Poerschke, Mikhail Khludnev, Coverity Scan (via Rishabh Patel))
- Optimizations (18)
- LUCENE-6708: TopFieldCollector does not compute the score several times on the
same document anymore.
(Adrien Grand)
- LUCENE-6720: ValueSourceScorer, returned from
FunctionValues.getRangeScorer(), now uses TwoPhaseIterator.
(David Smiley)
- LUCENE-6756: MatchAllDocsQuery now has a dedicated BulkScorer for better
performance when used as a top-level query.
(Adrien Grand)
- LUCENE-6746: DisjunctionMaxQuery, BoostingQuery and BoostedQuery now create
sub weights through IndexSearcher so that they can be cached.
(Adrien Grand)
- LUCENE-6754: Optimized IndexSearcher.count for the cases when it can use
index statistics instead of collecting all matches.
(Adrien Grand)
- LUCENE-6773: Nested conjunctions now iterate over documents as if clauses
were all at the same level.
(Adrien Grand)
- LUCENE-6777: Reuse BytesRef when visiting term ranges in
GeoPointTermsEnum to reduce GC pressure
(Nick Knize via Mike McCandless)
- LUCENE-6779: Reduce memory allocated by CompressingStoredFieldsWriter to write
strings larger than 64kb by an amount equal to string's utf8 size.
(Dawid Weiss, Robert Muir, shalin)
- LUCENE-6850: Optimize BooleanScorer for sparse clauses.
(Adrien Grand)
- LUCENE-6840: Ordinal indexes for SORTED_SET/SORTED_NUMERIC fields and
addresses for BINARY fields are now stored on disk instead of in memory.
(Adrien Grand)
- LUCENE-6878: Speed up TopDocs.merge.
(Daniel Jelinski via Adrien Grand)
- LUCENE-6885: StandardDirectoryReader (initialCapacity) tweaks
(Christine Poerschke)
- LUCENE-6863: Optimized storage requirements of doc values fields when less
than 1% of documents have a value.
(Adrien Grand)
- LUCENE-6892: various lucene.index initialCapacity tweaks
(Christine Poerschke)
- LUCENE-6276: Added TwoPhaseIterator.matchCost() which allows confirming the
least costly TwoPhaseIterators first.
(Paul Elschot via Adrien Grand)
- LUCENE-6898: In the default codec, the last stored field value will not
be fully read from disk if the supplied StoredFieldVisitor doesn't want it.
So put your largest text field value last to benefit.
(David Smiley)
- LUCENE-6909: Remove unnecessary synchronized from
FacetsConfig.getDimConfig for better concurrency
(Sanne Grinovero via Mike McCandless)
- SOLR-7730: Speed up SlowCompositeReaderWrapper.getSortedSetDocValues() by
avoiding merging FieldInfos just to check doc value type.
(Paul Vasilyev, Yuriy Pakhomov, Mikhail Khludnev, yonik)
- Bug Fixes (19)
- LUCENE-6905: Unwrap center longitude for dateline crossing
GeoPointDistanceQuery.
(Nick Knize)
- LUCENE-6817: ComplexPhraseQueryParser.ComplexPhraseQuery does not display
slop in toString().
(Ahmet Arslan via Dawid Weiss)
- LUCENE-6730: Hyper-parameter c is ignored in term frequency NormalizationH1.
(Ahmet Arslan via Robert Muir)
- LUCENE-6742: Lovins & Finnish implementation of SnowballFilter was
fixed to behave exactly as specified. A bug in the snowball compiler
caused differences in output of the filter in comparison to the original
test data. In addition, the performance of those filters was improved
significantly.
(Uwe Schindler, Robert Muir)
- LUCENE-6783: Removed side effects from FuzzyLikeThisQuery.rewrite.
(Adrien Grand)
- LUCENE-6776: Fix geo3d math to handle randomly squashed planet
models
(Karl Wright via Mike McCandless)
- LUCENE-6792: Fix TermsQuery.toString() to work with binary terms.
(Ruslan Muzhikov, Robert Muir)
- LUCENE-5503: When Highlighter's WeightedSpanTermExtractor converts a
PhraseQuery to an equivalent SpanQuery, it would sometimes use a slop that is
too low (no highlight) or determine inOrder wrong.
(Tim Allison via David Smiley)
- LUCENE-6790: Fix IndexWriter thread safety when one thread is
handling a tragic exception but another is still committing
(Mike McCandless)
- LUCENE-6810: Upgrade to Spatial4j 0.5 -- fixes some edge-case bugs in the
spatial module. See https://github.com/locationtech/spatial4j/blob/master/CHANGES.md
(David Smiley)
- LUCENE-6813: OfflineSorter no longer removes its output Path up
front, and instead opens it for write with the
StandardCopyOption.REPLACE_EXISTING to overwrite any prior file, so
that callers can safely use Files.createTempFile for the output.
This change also fixes OfflineSorter's default temp directory when
running tests to use mock filesystems so e.g. we detect file handle
leaks
(Dawid Weiss, Robert Muir, Mike McCandless)
- LUCENE-6813: RangeTreeWriter was failing to close all file handles
it opened, leading to intermittent failures on Windows
(Dawid Weiss, Robert Muir, Mike McCandless)
- LUCENE-6826: Fix ClassCastException when merging a field that has no
terms because they were filtered out by e.g. a FilterCodecReader
(Trejkaz via Mike McCandless)
- LUCENE-6823: LocalReplicator should use System.nanoTime as its clock
source for checking for expiration
(Ishan Chattopadhyaya via Mike McCandless)
- LUCENE-6856: The Weight wrapper used by LRUQueryCache now delegates to the
original Weight's BulkScorer when applicable.
(Adrien Grand)
- LUCENE-6858: Fix ContextSuggestField to correctly wrap token stream
when using CompletionAnalyzer.
(Areek Zillur)
- LUCENE-6872: IndexWriter handles any VirtualMachineError, not just OOM,
as tragic.
(Robert Muir)
- LUCENE-6814: PatternTokenizer no longer hangs onto heap sized to the
maximum input string it's ever seen, which can be a large memory
"leak" if you tokenize large strings with many threads across many
indices
(Alex Chow via Mike McCandless)
- LUCENE-6888: Explain output of map() function now also prints default value
(janhoy)
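The reasoning behind LUCENE-6823 is that expiration checks need a monotonic clock: wall-clock time can jump when the system clock is adjusted, while System.nanoTime only measures elapsed time. A minimal sketch of an expiration check built that way is shown below. It is not LocalReplicator's code; ExpirySketch, its fields and the injectable clock are hypothetical and exist so the behaviour can be exercised deterministically.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

final class ExpirySketch {

  private final LongSupplier nanoClock; // System::nanoTime in production, a fake in tests
  private final long timeoutNanos;
  private long lastAccessNanos;

  ExpirySketch(LongSupplier nanoClock, long timeoutMillis) {
    this.nanoClock = nanoClock;
    this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    this.lastAccessNanos = nanoClock.getAsLong();
  }

  /** Records activity, restarting the timeout window. */
  void markAccessed() {
    lastAccessNanos = nanoClock.getAsLong();
  }

  /** True once at least the timeout has elapsed since the last access. */
  boolean isExpired() {
    // Compare the difference of two readings, never the readings themselves:
    // nanoTime values have an arbitrary origin and may be negative or wrap around.
    return nanoClock.getAsLong() - lastAccessNanos >= timeoutNanos;
  }

  public static void main(String[] args) {
    ExpirySketch session = new ExpirySketch(System::nanoTime, 1000);
    System.out.println(session.isExpired()); // prints "false" (far less than one second has elapsed)
  }
}
```

Passing the clock in as a LongSupplier is a design choice of this sketch: it keeps the expiry logic testable without sleeping.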
- Other (26)
- LUCENE-6899: Upgrade randomizedtesting to 2.3.1.
(Dawid Weiss)
- LUCENE-6478: Test execution can hang with java.security.debug.
(Dawid Weiss)
- LUCENE-6862: Upgrade of RandomizedRunner to version 2.2.0.
(Dawid Weiss)
- LUCENE-6857: Validate StandardQueryParser with NOT operator
within parentheses.
(Jigar Shah via Dawid Weiss)
- LUCENE-6827: Use explicit capacity ArrayList instead of a LinkedList
in MultiFieldQueryNodeProcessor.
(Dawid Weiss)
- LUCENE-6812: Upgrade RandomizedTesting to 2.1.17.
(Dawid Weiss)
- LUCENE-6174: Improve "ant eclipse" to select right JRE for building.
(Uwe Schindler, Dawid Weiss)
- LUCENE-6417, LUCENE-6830: Upgrade ANTLR used in expressions module
to version 4.5.1-1.
(Jack Conradson, Uwe Schindler)
- LUCENE-6729: Upgrade ASM used in expressions module to version 5.0.4.
(Uwe Schindler)
- LUCENE-6738: remove IndexWriterConfig.[gs]etIndexingChain
(Christine Poerschke)
- LUCENE-6755: more tests of ToChildBlockJoinScorer.advance
(hossman)
- LUCENE-6571: fix some private access level javadoc errors and warnings
(Cao Manh Dat, Christine Poerschke)
- LUCENE-6768: AbstractFirstPassGroupingCollector.groupSort private member
is not needed.
(Christine Poerschke)
- LUCENE-6761: MatchAllDocsQuery's Scorers do not expose approximations
anymore.
(Adrien Grand)
- LUCENE-6775, LUCENE-6833: Improved MorfologikFilterFactory to allow
loading of custom dictionaries from ResourceLoader. Upgraded
Morfologik to version 2.0.1. The 'dictionary' attribute has been
reverted back and now points at the dictionary resource to be
loaded instead of the default Polish dictionary.
(Uwe Schindler, Dawid Weiss)
- LUCENE-6797: Make GeoCircle an interface and use a factory to create
it, to eventually handle degenerate cases
(Karl Wright via Mike McCandless)
- LUCENE-6800: Use XYZSolidFactory to create XYZSolids
(Karl Wright via Mike McCandless)
- LUCENE-6798: Geo3d now models degenerate (too tiny) circles as a
single point
(Karl Wright via Mike McCandless)
- LUCENE-6770: Add javadocs that FSDirectory canonicalizes the path.
(Uwe Schindler, Vladimir Kuzmin)
- LUCENE-6795: Fix various places where code used
AccessibleObject#setAccessible() without a privileged block. Code
without a hard requirement to do reflection was rewritten. This
makes Lucene and Solr ready for Java 9 Jigsaw's module system, where
reflection on Java's runtime classes is very restricted.
(Robert Muir, Uwe Schindler)
- LUCENE-6467: Simplify Query.equals.
(Paul Elschot via Adrien Grand)
- LUCENE-6845: SpanScorer is now merged into Spans
(Alan Woodward, David Smiley)
- LUCENE-6887: DefaultSimilarity is deprecated, use ClassicSimilarity for equivalent behavior,
or consider switching to BM25Similarity which will become the new default in Lucene 6.0
(hossman)
- LUCENE-6893: factor out CorePlusQueriesParser from CorePlusExtensionsParser
(Christine Poerschke)
- LUCENE-6902: Don't retry to fsync files / directories; fail
immediately.
(Daniel Mitterdorfer, Uwe Schindler)
- LUCENE-6801: Clarify JavaDocs of PhraseQuery that it in fact supports terms
at the same position (as does MultiPhraseQuery), treated like a conjunction.
Added test.
(David Smiley, Adrien Grand)
- Build (2)
- LUCENE-6732: Improve checker for invalid source patterns to also
detect javadoc-style license headers. Use Groovy to implement the
checks instead of plain Ant.
(Uwe Schindler)
- LUCENE-6594: Update forbiddenapis to 2.0.
(Uwe Schindler)
- Tests (1)
- LUCENE-6752: Add Math#random() to forbiddenapis.
(Uwe Schindler, Mikhail Khludnev, Andrei Beliakov)
- Changes in Backwards Compatibility Policy (1)
- LUCENE-6742: The Lovins & Finnish implementations of SnowballFilter
were fixed to behave exactly like the original Snowball stemmer.
If you have indexed text using those stemmers you may need to reindex.
(Uwe Schindler, Robert Muir)
- Changes in Runtime Behavior (3)
- LUCENE-6772: MultiCollector now catches CollectionTerminatedException and
removes the collector that threw this exception from the list of sub
collectors to collect.
(Adrien Grand)
- LUCENE-6784: IndexSearcher's query caching is enabled by default. Run
indexSearcher.setQueryCache(null) to disable.
(Adrien Grand)
- LUCENE-6305: BooleanQuery.equals and hashCode do not depend on the order of
clauses anymore.
(Adrien Grand)
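For LUCENE-6305 above, a minimal sketch of order-independent equality, assuming clauses can be compared as a multiset. The class below is a hypothetical stand-in that uses plain strings for clauses; it is not BooleanQuery's implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a query with clauses; not Lucene's BooleanQuery.
final class UnorderedClauses {
  // clause -> number of occurrences, so duplicates still count
  private final Map<String, Integer> counts = new HashMap<>();

  UnorderedClauses(List<String> clauses) {
    for (String clause : clauses) {
      counts.merge(clause, 1, Integer::sum);
    }
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof UnorderedClauses
        && counts.equals(((UnorderedClauses) other).counts);
  }

  @Override
  public int hashCode() {
    // Map.hashCode sums the entry hashes, so insertion order is irrelevant.
    return counts.hashCode();
  }
}
```

Two instances built from the same clauses in a different order compare equal and hash identically, while a repeated clause is still distinguished from a single occurrence.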
- Bug Fixes (1)
- SOLR-7865: BlendedInfixSuggester was returning too many results
(Arcadius Ahouansou via Mike McCandless)
- Bug Fixes (3)
- LUCENE-6774: Remove classloader hack in MorfologikFilter.
(Robert Muir, Uwe Schindler)
- LUCENE-6748: UsageTrackingQueryCachingPolicy no longer caches trivial queries
like MatchAllDocsQuery.
(Adrien Grand)
- LUCENE-6781: Fixed BoostingQuery to rewrite wrapped queries.
(Adrien Grand)
- Tests (1)
- LUCENE-6760, SOLR-7958: Move TestUtil#randomWhitespace to the only
Solr test that is using it. The method is not useful for Lucene tests
(and easily breaks, e.g., in Java 9 caused by Unicode version updates).
(Uwe Schindler)
- New Features (31)
- LUCENE-6485: Add CustomSeparatorBreakIterator to postings
highlighter which splits on any character. For example, it
can be used with getMultiValueSeparator to render whole field
values.
(Luca Cavanna via Robert Muir)
- LUCENE-6459: Add common suggest API that mirrors Lucene's
Query/IndexSearcher APIs for Document based suggester.
Adds PrefixCompletionQuery, RegexCompletionQuery,
FuzzyCompletionQuery and ContextQuery.
(Areek Zillur via Mike McCandless)
- LUCENE-6487: Spatial Geo3D API now has a WGS84 ellipsoid world model option.
(Karl Wright via David Smiley)
- LUCENE-6477: Add experimental BKD geospatial tree doc values format
and queries, for fast "bbox/polygon contains lat/lon points"
(Mike McCandless)
- LUCENE-6526: Asserting(Query|Weight|Scorer) now ensure scores are not computed
if they are not needed.
(Adrien Grand)
- LUCENE-6481: Add GeoPointField, GeoPointInBBoxQuery,
GeoPointInPolygonQuery for simple "indexed lat/lon point in
bbox/shape" searching.
(Nick Knize via Mike McCandless)
- LUCENE-5954: The segments_N commit point now stores the Lucene
version that wrote the commit as well as the Lucene version that
wrote the oldest segment in the index, for faster checking of "too
old" indices
(Ryan Ernst, Robert Muir, Mike McCandless)
- LUCENE-6519: BKDPointInPolygonQuery is much faster by avoiding
the per-hit polygon check when a leaf cell is fully contained by the
polygon.
(Nick Knize, Mike McCandless)
- LUCENE-6549: Add preload option to MMapDirectory.
(Robert Muir)
- LUCENE-6504: Add Lucene53Codec, with norms implemented directly
via the Directory's RandomAccessInput api.
(Robert Muir)
- LUCENE-6539: Add new DocValuesNumbersQuery, to match any document
containing one of the specified long values. This change also
moves the existing DocValuesTermsQuery and DocValuesRangeQuery
to Lucene's sandbox module, since in general these queries are
quite slow and are only fast in specific cases.
(Adrien Grand, Robert Muir, Mike McCandless)
- LUCENE-6577: Give earlier and better error message for invalid CRC.
(Robert Muir)
- LUCENE-6544: Geo3D: (1) Regularize path & polygon construction, (2) add
PlanetModel.surfaceDistance() (ellipsoidal calculation), (3) cache lat & lon
in GeoPoint, (4) add thread-safety where missing -- Geo3dShape.
(Karl Wright, David Smiley)
- LUCENE-6606: SegmentInfo.toString now confesses how the documents
were sorted, when SortingMergePolicy was used
(Christine Poerschke via Mike McCandless)
- LUCENE-6524: IndexWriter can now be initialized from an already open
near-real-time or non-NRT reader.
(Boaz Leskes, Robert Muir, Mike McCandless)
- LUCENE-6578: Geo3D can now compute the distance from a point to a shape, both
inner distance and to an outside edge. Multiple distance algorithms are
available.
(Karl Wright, David Smiley)
- LUCENE-6632: Geo3D: Compute circle planes more accurately.
(Karl Wright via David Smiley)
- LUCENE-6653: Added general purpose BytesTermAttribute to basic token
attributes package that can be used for TokenStreams that solely produce
binary terms.
(Uwe Schindler)
- LUCENE-6365: Add Operations.topoSort, to run topological sort of the
states in an Automaton
(Markus Heiden via Mike McCandless)
- LUCENE-6365: Replace Operations.getFiniteStrings with a
more scalable iterator API (FiniteStringsIterator)
(Markus Heiden via Mike McCandless)
- LUCENE-6589: Add a new org.apache.lucene.search.join.CheckJoinIndex class
that can be used to validate that an index has an appropriate structure to
run join queries.
(Adrien Grand)
- LUCENE-6659: Remove IndexWriter's unnecessary hard limit on max concurrency
(Robert Muir, Mike McCandless)
- LUCENE-6547: Add GeoPointDistanceQuery, matching all points within
the specified distance from the center point. Fix
GeoPointInBBoxQuery to handle dateline crossing.
- LUCENE-6694: Add LithuanianAnalyzer and LithuanianStemmer.
(Dainius Jocas via Robert Muir)
- LUCENE-6695: Added a new BlendedTermQuery to blend statistics across several
terms.
(Simon Willnauer, Adrien Grand)
- LUCENE-6706: Added a new PayloadScoreQuery that generalises the behaviour of
PayloadTermQuery and PayloadNearQuery to all Span queries.
(Alan Woodward)
- LUCENE-6697: Add experimental range tree doc values format and
queries, based on a 1D version of the spatial BKD tree, for a faster
and smaller alternative to postings-based numeric and binary term
filtering. Range trees can also handle values larger than 64 bits.
(Adrien Grand, Mike McCandless)
- LUCENE-6647: Add GeoHash string utility APIs
(Nick Knize via Mike McCandless)
- LUCENE-6710: GeoPointField now uses full 64 bits (up from 62) to encode
lat/lon
(Nick Knize via Mike McCandless)
- LUCENE-6580: SpanNearQuery now allows defined-width gaps in its subqueries
(Alan Woodward, Adrien Grand)
- LUCENE-6712: Use doc values to post-filter GeoPointField hits that
fall in boundary cells, resulting in smaller index, faster searches
and less heap used for each query
(Nick Knize via Mike McCandless)
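For the Operations.topoSort entry (LUCENE-6365) above, a minimal sketch of topologically sorting the states of an acyclic automaton. It uses Kahn's algorithm over a plain adjacency array, which may differ from the approach Lucene takes; the class name and the int[][] representation are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class StateTopoSort {
  // transitions[s] lists the destination states of state s.
  // Returns the states ordered so that every state precedes its destinations.
  static int[] sort(int[][] transitions) {
    int n = transitions.length;
    int[] inDegree = new int[n];
    for (int[] dests : transitions) {
      for (int d : dests) {
        inDegree[d]++;
      }
    }
    // States with no incoming transitions can be emitted right away.
    Deque<Integer> ready = new ArrayDeque<>();
    for (int s = 0; s < n; s++) {
      if (inDegree[s] == 0) {
        ready.add(s);
      }
    }
    int[] order = new int[n];
    int upto = 0;
    while (!ready.isEmpty()) {
      int s = ready.remove();
      order[upto++] = s;
      for (int d : transitions[s]) {
        if (--inDegree[d] == 0) {
          ready.add(d);
        }
      }
    }
    if (upto != n) {
      throw new IllegalArgumentException("automaton has a cycle");
    }
    return order;
  }
}
```

A cycle leaves some states with a non-zero in-degree, so the sketch rejects such input instead of returning a partial order.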
- API Changes (20)
- LUCENE-6508: Simplify Lock api, there is now just
Directory.obtainLock() which returns a Lock that can be
released (or fails with exception). Add lock verification
to IndexWriter. Improve exception messages when locking fails.
(Uwe Schindler, Mike McCandless, Robert Muir)
- LUCENE-6371, LUCENE-6490: Payload collection from Spans is moved to a more generic
SpanCollector framework. Spans no longer implements .hasPayload() and
.getPayload() methods, and instead exposes a collect() method that allows
the collection of arbitrary postings information. SpanPayloadCheckQuery and
SpanNearPayloadCheckQuery have moved from the .spans package to the .payloads
package.
(Alan Woodward, David Smiley, Paul Elschot, Robert Muir)
- LUCENE-6529: Removed an optimization in UninvertingReader that was causing
incorrect results for Numeric fields using precisionStep
(hossman, Robert Muir)
- LUCENE-6551: Add missing ConcurrentMergeScheduler.getAutoIOThrottle
getter
(Simon Willnauer, Mike McCandless)
- LUCENE-6552: Add MergePolicy.OneMerge.getMergeInfo and rename
setInfo to setMergeInfo
(Simon Willnauer, Mike McCandless)
- LUCENE-6525: Deprecate IndexWriterConfig's writeLockTimeout.
(Robert Muir)
- LUCENE-6583: FilteredQuery is deprecated and will be removed in 6.0. It should
be replaced with a BooleanQuery which handles the query as a MUST clause and
the filter as a FILTER clause.
(Adrien Grand)
- LUCENE-6553: The postings, spans and scorer APIs no longer take an acceptDocs
parameter. Live docs are now always checked on top of these APIs.
(Adrien Grand)
- LUCENE-6634: PKIndexSplitter now takes a Query instead of a Filter to decide
how to split an index.
(Adrien Grand)
- LUCENE-6643: GroupingSearch from lucene/grouping was changed to take a Query
object to define groups instead of a Filter.
(Adrien Grand)
- LUCENE-6554: ToParentBlockJoinFieldComparator was removed because of a bug
with missing values that could not be fixed. ToParentBlockJoinSortField now
works with string or numeric doc values selectors. Sorting on anything other
than a string or numeric field requires implementing a custom selector.
(Adrien Grand)
- LUCENE-6648: All lucene/facet APIs now take Query objects where they used to
take Filter objects.
(Adrien Grand)
- LUCENE-6640: Suggesters now take a BitsProducer object instead of a Filter
object to reduce the scope of doc IDs that may be returned, emphasizing the
fact that these objects need to support random-access.
(Adrien Grand)
- LUCENE-6646: Make EarlyTerminatingCollector take a Sort object directly
instead of a SortingMergePolicy.
(Christine Poerschke via Adrien Grand)
- LUCENE-6649: BitDocIdSetFilter and BitDocIdSetCachingWrapperFilter are now
deprecated in favour of BitSetProducer and QueryBitSetProducer, which do not
extend oal.search.Filter.
(Adrien Grand)
- LUCENE-6607: Factor out geo3d into its own spatial3d module.
(Karl Wright, Nick Knize, David Smiley, Mike McCandless)
- LUCENE-6531: PhraseQuery is now immutable and can be built using the
PhraseQuery.Builder class.
(Adrien Grand)
- LUCENE-6570: BooleanQuery is now immutable and can be built using the
BooleanQuery.Builder class.
(Adrien Grand)
- LUCENE-6702: NRTSuggester: Add a method to inject context values at index time
in ContextSuggestField. Simplify ContextQuery logic for extracting contexts and
add dedicated method to consider all context values at query time.
(Areek Zillur, Mike McCandless)
- LUCENE-6719: NumericUtils getMinInt, getMaxInt, getMinLong, getMaxLong now
return null if there are no terms for the specified field, previously these
methods returned primitive values and raised an undocumented NullPointerException
if there were no terms for the field.
(hossman, Timothy Potter)
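For LUCENE-6719 above, a minimal sketch of the caller-side consequence of a boxed return value that may be null. The helper below is hypothetical: a SortedSet stands in for the terms of a field, and none of this is NumericUtils code.

```java
import java.util.SortedSet;
import java.util.TreeSet;

final class MinMaxSketch {
  // Hypothetical stand-in: returns null instead of throwing when there are no terms.
  static Long minLong(SortedSet<Long> terms) {
    return terms.isEmpty() ? null : terms.first();
  }

  // Caller-side pattern: test for null before unboxing, since assigning a
  // null Long straight to a primitive long would throw NullPointerException.
  static long minOrDefault(SortedSet<Long> terms, long defaultValue) {
    Long min = minLong(terms);
    return min == null ? defaultValue : min;
  }

  public static void main(String[] args) {
    SortedSet<Long> none = new TreeSet<>();
    System.out.println("min of empty field: " + minLong(none));
    System.out.println("min with fallback: " + minOrDefault(none, 0L));
  }
}
```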
- Bug fixes (27)
- LUCENE-6500: ParallelCompositeReader did not always call
closed listeners. This was fixed by LUCENE-6501.
(Adrien Grand, Uwe Schindler)
- LUCENE-6520: Geo3D GeoPath.done() would throw an NPE if adjacent path
segments were co-linear.
(Karl Wright via David Smiley)
- LUCENE-5805: QueryNodeImpl.removeFromParent was doing nothing in a
costly manner
(Christoph Kaser, Cao Manh Dat via Mike McCandless)
- LUCENE-6533: SlowCompositeReaderWrapper no longer caches its live docs
instance since this can prevent future improvements like a
disk-backed live docs
(Adrien Grand, Mike McCandless)
- LUCENE-6558: Highlighters now work with CustomScoreQuery
(Cao Manh Dat via Mike McCandless)
- LUCENE-6560: BKDPointInBBoxQuery now handles "dateline crossing"
correctly
(Nick Knize, Mike McCandless)
- LUCENE-6564: Change PrintStreamInfoStream to use thread safe Java 8
ISO-8601 date formatting (in Lucene 5.x use Java 7 FileTime#toString
as workaround); fix output of tests to use same format.
(Uwe Schindler, Ramkumar Aiyengar)
- LUCENE-6593: Fixed ToChildBlockJoinQuery's scorer to not refuse to advance
to a document that belongs to the parent space.
(Adrien Grand)
- LUCENE-6591: Never write a negative vLong
(Robert Muir, Ryan Ernst, Adrien Grand, Mike McCandless)
- LUCENE-6588: Fix how ToChildBlockJoinQuery deals with acceptDocs.
(Christoph Kaser via Adrien Grand)
- LUCENE-6597: Geo3D's GeoCircle now supports a world-globe diameter.
(Karl Wright via David Smiley)
- LUCENE-6608: Fix potential resource leak in BigramDictionary.
(Rishabh Patel via Uwe Schindler)
- LUCENE-6614: Improve partition detection in IOUtils#spins() so it
works with NVMe drives.
(Uwe Schindler, Mike McCandless)
- LUCENE-6586: Fix typo in GermanStemmer, causing possible wrong value
for substCount.
(Christoph Kaser via Mike McCandless)
- LUCENE-6658: Fix IndexUpgrader to also upgrade indexes without any
segments.
(Trejkaz, Uwe Schindler)
- LUCENE-6677: QueryParserBase fails to enforce maxDeterminizedStates when
creating a WildcardQuery
(David Causse via Mike McCandless)
- LUCENE-6680: Preserve two suggestions that have same key and weight but
different payloads
(Arcadius Ahouansou via Mike McCandless)
- LUCENE-6681: SortingMergePolicy must override MergePolicy.size(...).
(Christine Poerschke via Adrien Grand)
- LUCENE-6682: StandardTokenizer performance bug: scanner buffer is
unnecessarily copied when maxTokenLength doesn't change. Also stop silently
maxing out buffer size (and effectively also max token length) at 1M chars,
but instead throw an exception from setMaxTokenLength() when the given
length is greater than 1M chars.
(Piotr Idzikowski, Steve Rowe)
- LUCENE-6696: Fix FilterDirectoryReader.close() to never close the
underlying reader several times.
(Adrien Grand)
- LUCENE-6334: FastVectorHighlighter failed to highlight phrases across
more than one value in a multi-valued field.
(Chris Earle, Nik Everett via Mike McCandless)
- LUCENE-6704: GeoPointDistanceQuery was visiting too many term ranges,
consuming too much heap for a large radius
(Nick Knize via Mike McCandless)
- SOLR-5882: fix ScoreMode.Min at ToParentBlockJoinQuery
(Mikhail Khludnev)
- LUCENE-6718: JoinUtil.createJoinQuery failed to rewrite queries before
creating a Weight.
(Adrien Grand)
- LUCENE-6713: TooComplexToDeterminizeException claims to be serializable
but wasn't
(Simon Willnauer, Mike McCandless)
- LUCENE-6723: Fix date parsing problems in Java 9 with date formats using
English weekday/month names.
(Uwe Schindler)
- LUCENE-6618: Properly set MMapDirectory.UNMAP_SUPPORTED when it is not allowed
by security policy.
(Robert Muir)
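For LUCENE-6564 above, a minimal stdlib sketch of the two formatting routes the entry names: the Java 8 java.time formatter and the Java 7 FileTime#toString workaround. The class and method names are hypothetical, and this is not PrintStreamInfoStream's code.

```java
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.time.format.DateTimeFormatter;

final class IsoTimestamps {
  // Java 8+: DateTimeFormatter instances are immutable and thread-safe,
  // unlike java.text.SimpleDateFormat, so one shared constant is enough.
  static String java8(long epochMillis) {
    return DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(epochMillis));
  }

  // Java 7 workaround: FileTime#toString renders an ISO 8601 UTC timestamp.
  static String java7(long epochMillis) {
    return FileTime.fromMillis(epochMillis).toString();
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    System.out.println(java8(now));
    System.out.println(java7(now));
  }
}
```

Both routes avoid a shared mutable formatter, which is the thread-safety concern in the entry. The two outputs are not guaranteed to agree in how they print fractional seconds.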
- Changes in Runtime Behavior (12)
- LUCENE-6501: The subreader structure in ParallelCompositeReader
was flattened, because the current implementation had too many
hidden bugs regarding refcounting and close listeners.
If you create a new ParallelCompositeReader, it will just take
all leaves of the passed readers and form a flat structure of
ParallelLeafReaders instead of trying to assemble the original
structure of composite and leaf readers.
(Adrien Grand, Uwe Schindler)
- LUCENE-6537: NearSpansOrdered no longer tries to minimize its
Span matches. This means that the matching algorithm is entirely
lazy. All spans returned by the previous implementation are still
reported, but matching documents may now also return additional
spans that were previously discarded in preference to shorter
overlapping ones.
(Alan Woodward, Adrien Grand, Paul Elschot)
- LUCENE-6538: Also include java.vm.version and java.runtime.version
in per-segment diagnostics
(Robert Muir, Mike McCandless)
- LUCENE-6569: Optimize MultiFunction.anyExists and allExists to eliminate
excessive array creation in common 2 argument usage
(Jacob Graves, hossman)
- LUCENE-2880: Span queries now score more consistently with regular queries.
(Robert Muir, Adrien Grand)
- LUCENE-6601: FilteredQuery now always rewrites to a BooleanQuery which handles
the query as a MUST clause and the filter as a FILTER clause.
LEAP_FROG_QUERY_FIRST_STRATEGY and LEAP_FROG_FILTER_FIRST_STRATEGY no longer
guarantee which iterator will be advanced first; this now depends on the
respective costs of the iterators. QUERY_FIRST_FILTER_STRATEGY and
RANDOM_ACCESS_FILTER_STRATEGY still consume the filter using its random-access
API, however the returned bits may be called on different documents compared
to before.
(Adrien Grand)
- LUCENE-6542: FSDirectory's ctor now works with security policies or file systems
that restrict write access.
(Trejkaz, hossman, Uwe Schindler)
- LUCENE-6651: The default implementation of AttributeImpl#reflectWith(AttributeReflector)
now uses AccessController#doPrivileged() to do the reflection. Please consider
implementing this method in all your custom attributes, because the method will be
made abstract in Lucene 6.
(Uwe Schindler)
- LUCENE-6639: LRUQueryCache and CachingWrapperQuery now consider a query as
"used" when the first Scorer is pulled instead of when a Scorer is pulled on
the first segment of an index.
(Terry Smith, Adrien Grand)
- LUCENE-6579: IndexWriter now sacrifices (closes) itself to protect the index
when an unexpected, tragic exception strikes while merging.
(Robert Muir, Mike McCandless)
- LUCENE-6691: SortingMergePolicy.isSorted now considers FilterLeafReader instances.
EarlyTerminatingSortingCollector.terminatedEarly accessor added.
TestEarlyTerminatingSortingCollector.testTerminatedEarly test added.
(Christine Poerschke)
- LUCENE-6609: Add getSortField impls to many subclasses of FieldCacheSource which return
the most direct SortField implementation. In many trivial sort by ValueSource usages, this
will result in less RAM, and more precise sorting of extreme values due to no longer
converting to double.
(hossman)
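For LUCENE-6609 above, a minimal sketch of why sorting extreme long values through double loses precision. A double has a 53-bit significand, so neighbouring longs near Long.MAX_VALUE collapse to the same double. The class name is hypothetical.

```java
final class ExtremeValueSort {
  // Comparison after converting to double: neighbouring extreme longs can tie.
  static int compareAsDouble(long a, long b) {
    return Double.compare((double) a, (double) b);
  }

  // Direct comparison on the native type keeps every value distinct.
  static int compareAsLong(long a, long b) {
    return Long.compare(a, b);
  }

  public static void main(String[] args) {
    long a = Long.MAX_VALUE;
    long b = Long.MAX_VALUE - 1;
    System.out.println("as double: " + compareAsDouble(a, b));
    System.out.println("as long:   " + compareAsLong(a, b));
  }
}
```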
- Optimizations (9)
- LUCENE-6548: Some optimizations for BlockTree's intersect with very
finite automata
(Mike McCandless)
- LUCENE-6585: Flatten conjunctions and conjunction approximations into
parent conjunctions. For example a sloppy phrase query of "foo bar"~5
with a filter of "baz" will internally leapfrog foo,bar,baz as one
conjunction.
(Ryan Ernst, Robert Muir, Adrien Grand)
- LUCENE-6325: Reduce RAM usage of FieldInfos, and speed up lookup by
number, by using an array instead of TreeMap except in very sparse
cases
(Robert Muir, Mike McCandless)
- LUCENE-6617: Reduce heap usage for small FSTs
(Mike McCandless)
- LUCENE-6616: IndexWriter now lists the files in the index directory
only once on init, and IndexFileDeleter no longer suppresses
FileNotFoundException and NoSuchFileException. This also improves
IndexFileDeleter to delete segments_N files last, so that in the
presence of a virus checker, the index is never left in a state
where an expired segments_N references non-existing files
(Robert Muir, Mike McCandless)
- LUCENE-6645: Optimized the way we merge postings lists in multi-term queries
and TermsQuery. This should especially help when there are lots of small
postings lists.
(Adrien Grand, Mike McCandless)
- LUCENE-6668: Optimized storage for sorted set and sorted numeric doc values
in the case that there are few unique sets of values.
(Adrien Grand, Robert Muir)
- LUCENE-6690: Sped up MultiTermsEnum.next() on high-cardinality fields.
(Adrien Grand)
- LUCENE-6621: Removed two unused variables in
analysis/stempel/src/java/org/egothor/stemmer/Compile.java
(Rishabh Patel via Christine Poerschke)
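For LUCENE-6325 above, a minimal sketch of a lookup-by-number structure that prefers a dense array and falls back to a TreeMap for very sparse numbering. The class is hypothetical, and the density threshold of 16 is an assumption for illustration, not the value FieldInfos uses.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical structure; not Lucene's FieldInfos.
final class ByNumberLookup {
  private final String[] dense;                  // direct index when numbers are packed
  private final TreeMap<Integer, String> sparse; // fallback for very sparse numbering

  ByNumberLookup(Map<Integer, String> namesByNumber) {
    int max = -1;
    for (int number : namesByNumber.keySet()) {
      if (number < 0) {
        throw new IllegalArgumentException("negative number: " + number);
      }
      max = Math.max(max, number);
    }
    // Assumed heuristic: use an array unless it would be mostly empty.
    if (max < 16L * namesByNumber.size()) {
      String[] table = new String[max + 1];
      for (Map.Entry<Integer, String> e : namesByNumber.entrySet()) {
        table[e.getKey()] = e.getValue();
      }
      dense = table;
      sparse = null;
    } else {
      dense = null;
      sparse = new TreeMap<>(namesByNumber);
    }
  }

  boolean isDense() {
    return dense != null;
  }

  // Returns the name for the number, or null if the number is unknown.
  String lookup(int number) {
    if (dense != null) {
      return number >= 0 && number < dense.length ? dense[number] : null;
    }
    return sparse.get(number);
  }
}
```

The array path is a single bounds check and index, while the TreeMap path only pays its per-entry overhead when an array would waste far more memory.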
- Build (6)
- LUCENE-6518: Don't report false thread leaks from IBM J9
ClassCache Reaper in test framework.
(Dawid Weiss)
- LUCENE-6567: Simplify payload checking in SpanPayloadCheckQuery
(Alan Woodward)
- LUCENE-6568: Make rat invocation depend on ivy configuration being set up
(Ramkumar Aiyengar)
- LUCENE-6683: ivy-fail goal directs people to non-existent page
(Mike Drob via Steve Rowe)
- LUCENE-6693: Updated Groovy to 2.4.4, Pegdown to 1.5, Svnkit to 1.8.10.
Also fixed some PermGen errors while running full build caused by
these updates: Tasks are now installed from root's build.xml.
(Uwe Schindler)
- LUCENE-6741: Fix jflex files to regenerate the java files correctly.
(Uwe Schindler)
- Test Framework (4)
- LUCENE-6637: Fix FSTTester to not violate file permissions
on -Dtests.verbose=true.
(Mesbah M. Alam, Uwe Schindler)
- LUCENE-6542: LuceneTestCase now has runWithRestrictedPermissions() to run
an action with reduced permissions. This can be used to simulate special
environments (e.g., read-only dirs). If tests are running without a security
manager, an assume cancels test execution automatically.
(Uwe Schindler)
- LUCENE-6652: Removed lots of useless Byte(s)TermAttributes all over test
infrastructure.
(Uwe Schindler)
- LUCENE-6563: Improve MockFileSystemTestCase.testURI to check if a path
can be encoded according to local filesystem requirements. Otherwise
stop test execution.
(Christine Poerschke via Uwe Schindler)
- Changes in Backwards Compatibility Policy (4)
- LUCENE-6553: The iterator returned by the LeafReader.postings method now
always includes deleted docs, so you have to check for deleted documents on
top of the iterator.
(Adrien Grand)
- LUCENE-6633: DuplicateFilter has been deprecated and will be removed in 6.0.
DiversifiedTopDocsCollector can be used instead with a maximum number of hits
per key equal to 1.
(Adrien Grand)
- LUCENE-6653: The workflow for consuming the TermToBytesRefAttribute was changed:
getBytesRef() now does all work and is called on each token, fillBytesRef()
was removed. The implementation is free to reuse the internal BytesRef
or return a new one on each call.
(Uwe Schindler)
- LUCENE-6682: StandardTokenizer.setMaxTokenLength() now throws an exception if
a length greater than 1M chars is given. Previously the effective max token
length (the scanner's buffer) was capped at 1M chars, but getMaxTokenLength()
incorrectly returned the previously requested length, even when it exceeded 1M.
(Piotr Idzikowski, Steve Rowe)
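For the LUCENE-6553 entry above, a minimal sketch of checking deletions on top of a postings iterator. Plain arrays and java.util.BitSet stand in for Lucene's iterator and live-docs bits; the class name is hypothetical. The null convention mirrors the way a null live-docs instance means that no document is deleted.

```java
import java.util.BitSet;

final class LiveDocsFilter {
  // postings: doc IDs as returned by an iterator that includes deleted docs.
  // liveDocs: one bit per document, set when the document is live;
  //           null is taken to mean "no deletions".
  static int[] liveOnly(int[] postings, BitSet liveDocs) {
    int[] kept = new int[postings.length];
    int upto = 0;
    for (int doc : postings) {
      if (liveDocs == null || liveDocs.get(doc)) {
        kept[upto++] = doc;
      }
    }
    return java.util.Arrays.copyOf(kept, upto);
  }
}
```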
- Bug Fixes (4)
- LUCENE-6482: Fix class loading deadlock relating to Codec initialization,
default codec and SPI discovery.
(Shikhar Bhushan, Uwe Schindler)
- LUCENE-6523: NRT readers now reflect a new commit even if there is
no change to the commit user data
(Mike McCandless)
- LUCENE-6527: Queries now get a dummy Similarity when scores are not needed
in order to not load unnecessary information like norms.
(Adrien Grand)
- LUCENE-6559: TimeLimitingCollector now also checks for timeout when a new
leaf reader is pulled, i.e., if we move from one segment to another even without
collecting a hit.
(Simon Willnauer)
- New Features (16)
- LUCENE-6308, LUCENE-6385, LUCENE-6391: Span queries now share
document conjunction/intersection
code with boolean queries, and use two-phased iterators for
faster intersection by avoiding loading positions in certain cases.
(Paul Elschot, Terry Smith, Robert Muir via Mike McCandless)
- LUCENE-6393: Add two-phase support to SpanPositionCheckQuery
and its subclasses: SpanPositionRangeQuery, SpanPayloadCheckQuery,
SpanNearPayloadCheckQuery, SpanFirstQuery.
(Paul Elschot, Robert Muir)
- LUCENE-6394: Add two-phase support to SpanNotQuery and refactor
FilterSpans to just have an accept(Spans candidate) method for
subclasses.
(Robert Muir)
- LUCENE-6373: SpanOrQuery shares disjunction logic with boolean
queries, and supports two-phased iterators to avoid loading
positions when possible.
(Paul Elschot via Robert Muir)
- LUCENE-6352, LUCENE-6472: Added a new query time join to the join module
that uses global ordinals, which is faster for subsequent joins between
reopens.
(Martijn van Groningen, Adrien Grand)
- LUCENE-5879: Added experimental auto-prefix terms to BlockTree terms
dictionary, exposed as AutoPrefixPostingsFormat
(Adrien Grand, Uwe Schindler, Robert Muir, Mike McCandless)
- LUCENE-5579: New CompositeSpatialStrategy combines speed of RPT with
accuracy of SDV. Includes optimized Intersect predicate to avoid many
geometry checks. Uses TwoPhaseIterator.
(David Smiley)
- LUCENE-5989: Allow passing BytesRef to StringField to make it easier
to index arbitrary binary tokens, and change the experimental
StoredFieldVisitor.stringField API to take UTF-8 byte[] instead of
String
(Mike McCandless)
- LUCENE-6389: Added ScoreMode.Min that aggregates the lowest child score
to the parent hit.
(Martijn van Groningen, Adrien Grand)
- LUCENE-6423: New LimitTokenOffsetFilter that limits tokens to those before
a configured maximum start offset.
(David Smiley)
- LUCENE-6422: New spatial PackedQuadPrefixTree, a generally more efficient
choice than QuadPrefixTree, especially for high precision shapes.
When used, you should typically disable RPT's pruneLeafyBranches option.
(Nick Knize, David Smiley)
- LUCENE-6451: Expressions now support bindings keys that look like
zero arg functions
(Jack Conradson via Ryan Ernst)
- LUCENE-6083: Add SpanWithinQuery and SpanContainingQuery that return
spans inside of / containing other spans.
(Paul Elschot via Robert Muir)
- LUCENE-6454: Added distinction between member variable and method in
expression helper VariableContext
(Jack Conradson via Ryan Ernst)
- LUCENE-6196: New Spatial "Geo3d" API with partial Spatial4j integration.
It is a set of shapes implemented using 3D planar geometry for calculating
spatial relations on the surface of a sphere. Shapes include Point, BBox,
Circle, Path (buffered line string), and Polygon.
(Karl Wright via David Smiley)
- LUCENE-6464: Add a new expert lookup method to
AnalyzingInfixSuggester to accept an arbitrary BooleanQuery to
express how contexts should be filtered.
(Arcadius Ahouansou via Mike McCandless)
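For the two-phase iteration entries above (LUCENE-6308, LUCENE-6393, LUCENE-6394, LUCENE-6373), a minimal sketch of the idea: intersect cheap doc ID lists first, and run the expensive confirmation (for example a positions check) only on documents that survive the intersection. Sorted int arrays stand in for postings and an IntPredicate stands in for the costly check; the names are hypothetical and this is not Lucene's TwoPhaseIterator API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

final class TwoPhaseSketch {
  // Phase 1 (approximation): leapfrog two sorted doc ID lists.
  // Phase 2 (confirmation): call the costly check only on common docs.
  static List<Integer> search(int[] docsA, int[] docsB, IntPredicate confirm) {
    List<Integer> hits = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < docsA.length && j < docsB.length) {
      if (docsA[i] < docsB[j]) {
        i++;
      } else if (docsA[i] > docsB[j]) {
        j++;
      } else {
        int doc = docsA[i];
        if (confirm.test(doc)) {
          hits.add(doc);
        }
        i++;
        j++;
      }
    }
    return hits;
  }
}
```

With this shape the confirmation runs once per document common to both lists, not once per posting, which is where the saving from not loading positions comes from.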
- Optimizations (10)
- LUCENE-6379: IndexWriter.deleteDocuments(Query...) now detects if
one of the queries is MatchAllDocsQuery and just invokes the much
faster IndexWriter.deleteAll in that case
(Robert Muir, Adrien Grand, Mike McCandless)
- LUCENE-6388: Optimize SpanNearQuery when payloads are not present.
(Robert Muir)
- LUCENE-6421: Defer reading of positions in MultiPhraseQuery until
they are needed.
(Robert Muir)
- LUCENE-6392: Highlighter: reduce memory of tokens in
TokenStreamFromTermVector, and add maxStartOffset limit.
(David Smiley)
- LUCENE-6456: Queries that generate doc id sets that are too large for the
query cache are not cached instead of evicting everything.
(Adrien Grand)
- LUCENE-6455: Require a minimum index size to enable query caching in order
not to cache e.g. on MemoryIndex.
(Adrien Grand)
- LUCENE-6330: BooleanScorer (used for top-level disjunctions) does not decode
norms when not necessary anymore.
(Adrien Grand)
- LUCENE-6350: TermsQuery is now compressed with PrefixCodedTerms.
(Robert Muir, Mike McCandless, Adrien Grand)
- LUCENE-6458: Multi-term queries matching few terms per segment now execute
like a disjunction.
(Adrien Grand)
- LUCENE-6360: TermsQuery rewrites to a disjunction when there are 16 matching
terms or fewer.
(Adrien Grand)
- Bug Fixes (16)
- LUCENE-329: Fix FuzzyQuery defaults to rank exact matches highest.
(Mark Harwood, Adrien Grand)
- LUCENE-6378: Fix all RuntimeExceptions to throw the underlying root cause.
(Varun Thacker, Adrien Grand, Mike McCandless)
- LUCENE-6415: TermsQuery.extractTerms is a no-op (used to throw an
UnsupportedOperationException).
(Adrien Grand)
- LUCENE-6416: BooleanQuery.extractTerms now only extracts terms from scoring
clauses.
(Adrien Grand)
- LUCENE-6409: Fixed integer overflow in LongBitSet.ensureCapacity.
(Luc Vanlerberghe via Adrien Grand)
- LUCENE-6424, LUCENE-6430: Fix many bugs with mockfs filesystems in the
test-framework: always consistently wrap Path, fix buggy behavior for
globs, implement equals/hashcode for filtered Paths, etc.
(Ryan Ernst, Simon Willnauer, Robert Muir)
- LUCENE-6426: Fix FieldType's copy constructor to also copy over the numeric
precision step.
(Adrien Grand)
- LUCENE-6345: Null check terms/fields in Lucene queries
(Lee Hinman via Mike McCandless)
- LUCENE-6400: SolrSynonymParser should preserve original token instead
of replacing it with a synonym, when expand=true and there is no
explicit mapping
(Ian Ribas, Robert Muir, Mike McCandless)
- LUCENE-6449: Don't throw NullPointerException if some segments are
missing the field being highlighted, in PostingsHighlighter
(Roman Khmelichek via Mike McCandless)
- LUCENE-6427: Added assertion about the presence of ghost bits in
(Fixed|Long)BitSet.
(Luc Vanlerberghe via Adrien Grand)
- LUCENE-6468: Fixed NPE with empty Kuromoji user dictionary.
(Jun Ohtani via Christian Moen)
- LUCENE-6483: Ensure core closed listeners are called on the same cache key as
the reader which has been used to register the listener.
(Adrien Grand)
- LUCENE-6486: DocumentDictionary iterator no longer skips
documents with no payloads and now returns an empty BytesRef instead
(Marius Grama via Michael McCandless)
- LUCENE-6505: NRT readers now reflect segments_N filename and commit
user data from previous commits
(Mike McCandless)
- LUCENE-6507: Don't let NativeFSLock.close() release other locks
(Simon Willnauer, Robert Muir, Uwe Schindler, Mike McCandless)
- API Changes (8)
- LUCENE-6377: SearcherFactory#newSearcher now accepts the previous reader
to simplify warming logic during opening new searchers.
(Simon Willnauer)
- LUCENE-6410: Removed unused "reuse" parameter to
Terms.iterator.
(Robert Muir, Mike McCandless)
- LUCENE-6425: Replaced Query.extractTerms with Weight.extractTerms.
(Adrien Grand)
- LUCENE-6446: Simplified Explanation API.
(Adrien Grand)
- LUCENE-6445: Two new methods in Highlighter's TokenSources; the existing
methods are now marked deprecated.
(David Smiley)
- LUCENE-6484: Removed EliasFanoDocIdSet, which was unused.
(Paul Elschot via Adrien Grand)
- LUCENE-6466: Moved SpanQuery.getSpans() and .extractTerms() to SpanWeight
(Alan Woodward, Robert Muir)
- LUCENE-6497: Allow subclasses of FieldType to check frozen state
(Ryan Ernst)
- Other (6)
- LUCENE-6413: Test runner should report the number of suites completed/
remaining.
(Dawid Weiss)
- LUCENE-5439: Add 'ant jacoco' build target.
(Robert Muir)
- LUCENE-6315: Simplify the private iterator Lucene uses internally
when resolving deleted terms to matched docids.
(Robert Muir, Adrien Grand, Mike McCandless)
- LUCENE-6399: Benchmark module's QueryMaker.resetInputs should call setConfig
so queries can react to property changes in new rounds.
(David Smiley)
- LUCENE-6382: Lucene now enforces that positions never exceed the
maximum value IndexWriter.MAX_POSITION.
(Robert Muir, Mike McCandless)
- LUCENE-6372: Simplified and improved equals/hashcode of span queries.
(Paul Elschot via Adrien Grand)
- Build (1)
- LUCENE-6420: Update forbiddenapis to v1.8
(Uwe Schindler)
- Test Framework (2)
- LUCENE-6419: Added two-phase iteration assertions to AssertingQuery.
(Adrien Grand)
- LUCENE-6437: Randomly set CPU core count and spins, derived from
test's master seed, used by ConcurrentMergeScheduler to set dynamic
defaults, for better test randomization and to help tests reproduce
(Robert Muir, Mike McCandless)
- New Features (9)
- LUCENE-6066: Added DiversifiedTopDocsCollector to misc for collecting no more
than a given number of results under a choice of key. Introduces new remove
method to core's PriorityQueue.
(Mark Harwood)
- LUCENE-6191: New spatial 2D heatmap faceting for PrefixTreeStrategy.
(David Smiley)
- LUCENE-6227: Added BooleanClause.Occur.FILTER to filter documents without
participating in scoring (unlike MUST).
(Adrien Grand)
- LUCENE-6294: Added oal.search.CollectorManager to allow for parallelization
of the document collection process on IndexSearcher.
(Adrien Grand)
- LUCENE-6303: Added filter caching baked into IndexSearcher, disabled by
default.
(Adrien Grand)
- LUCENE-6304: Added a new MatchNoDocsQuery that matches no documents.
(Lee Hinman via Adrien Grand)
- LUCENE-6341: Add a -fast option to CheckIndex.
(Robert Muir)
- LUCENE-6355: IndexWriter's infoStream now also logs time to write FieldInfos
during merge
(Lee Hinman via Mike McCandless)
- LUCENE-6339: Added Near-real time Document Suggester via custom postings format
(Areek Zillur, Mike McCandless, Simon Willnauer)
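The CollectorManager idea from LUCENE-6294 above can be pictured as two steps: create one collector per slice of the index, then reduce the per-slice results into one answer. The following is a plain-Java sketch of that pattern only. `CountingCollector` and `countInParallel` are hypothetical stand-ins, and the slices are simple arrays of doc IDs rather than index segments.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of "one collector per slice, then reduce"; not Lucene code.
final class CollectorManagerSketch {
    /** Hypothetical per-slice collector that just counts hits. */
    static final class CountingCollector {
        int count;
        void collect(int doc) { count++; }
    }

    static int countInParallel(List<int[]> slices) {
        List<CountingCollector> collectors = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int[] slice : slices) {
            CountingCollector collector = new CountingCollector(); // one per slice
            collectors.add(collector);
            Thread thread = new Thread(() -> {
                for (int doc : slice) {
                    collector.collect(doc);
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();                   // also makes each count visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        int total = 0;
        for (CountingCollector collector : collectors) {
            total += collector.count;            // reduce step
        }
        return total;
    }

    public static void main(String[] args) {
        List<int[]> slices = List.of(new int[] {0, 1, 2}, new int[] {3, 4});
        System.out.println(countInParallel(slices)); // 5
    }
}
```

Because each collector is touched by exactly one thread, no locking is needed until the final reduce.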
- Bug Fixes (11)
- LUCENE-6368: FST.save can truncate output (BufferedOutputStream may be closed
after the underlying stream).
(Ippei Matsushima via Dawid Weiss)
- LUCENE-6249: StandardQueryParser doesn't support pure negative clauses.
(Dawid Weiss)
- LUCENE-6190: Spatial pointsOnly flag on PrefixTreeStrategy shouldn't switch all predicates to
Intersects.
(David Smiley)
- LUCENE-6242: Ram usage estimation was incorrect for SparseFixedBitSet when
object alignment was different from 8.
(Uwe Schindler, Adrien Grand)
- LUCENE-6293: Fixed TimSorter bug.
(Adrien Grand)
- LUCENE-6001: DrillSideways hits NullPointerException for certain
BooleanQuery searches.
(Dragan Jotannovic, jane chang via Mike McCandless)
- LUCENE-6311: Fix NIOFSDirectory and SimpleFSDirectory so that the
toString method of IndexInputs confesses when they are from a compound
file.
(Robert Muir, Mike McCandless)
- LUCENE-6381: Add defensive wait time limit in
DocumentsWriterStallControl to prevent hangs during indexing if we
miss a .notify/All somewhere
(Mike McCandless)
- LUCENE-6386: Correct IndexWriter.forceMerge documentation to state
that up to 3X (X = current index size) spare disk space may be needed
to complete forceMerge(1).
(Robert Muir, Shai Erera, Mike McCandless)
- LUCENE-6395: Seeking by term ordinal was failing to set the term's
bytes in MemoryIndex
(Mike McCandless)
- LUCENE-6429: Removed the TermQuery(Term,int) constructor which could lead to
inconsistent term statistics.
(Adrien Grand, Robert Muir)
- Optimizations (16)
- LUCENE-6183, LUCENE-5647: Avoid recompressing stored fields
and term vectors when merging segments without deletions.
Lucene50Codec's BEST_COMPRESSION mode uses a higher deflate
level for more compact storage.
(Robert Muir)
- LUCENE-6184: Make BooleanScorer only score windows that contain
matches.
(Adrien Grand)
- LUCENE-6161: Speed up resolving of deleted terms to docIDs by doing
a combined merge sort between deleted terms and segment terms
instead of a separate merge sort for each segment. In delete-heavy
use cases this can be a sizable speedup.
(Mike McCandless)
- LUCENE-6201: BooleanScorer can now deal with values of minShouldMatch that
are greater than one and is used when queries produce dense result sets.
(Adrien Grand)
- LUCENE-6218: Don't decode frequencies or match all positions when scoring
is not needed.
(Robert Muir)
- LUCENE-6233: Speed up CheckIndex when the index has term vectors
(Robert Muir, Mike McCandless)
- LUCENE-6198: Added the TwoPhaseIterator API, exposed on scorers which
is for now only used on phrase queries and conjunctions in order to check
positions lazily if the phrase query is in a conjunction with other queries.
(Robert Muir, Adrien Grand, David Smiley)
- LUCENE-6244, LUCENE-6251: All boolean queries but those that have a
minShouldMatch > 1 now either propagate or take advantage of the two-phase
iteration capabilities added in LUCENE-6198.
(Adrien Grand, Robert Muir)
- LUCENE-6241: FSDirectory.listAll() doesn't filter out subdirectories anymore,
for faster performance. Subdirectories don't matter to Lucene. If you need to
filter out non-index files with some custom usage, you may want to look at
the IndexFileNames class.
(Robert Muir)
- LUCENE-6262: ConstantScoreQuery does not wrap the inner weight anymore when
scores are not required.
(Adrien Grand)
- LUCENE-6263: MultiCollector automatically caches scores when several
collectors need them.
(Adrien Grand)
- LUCENE-6275: SloppyPhraseScorer now uses the same logic as ConjunctionScorer
in order to advance doc IDs, which takes advantage of the cost() API.
(Adrien Grand)
- LUCENE-6290: QueryWrapperFilter propagates approximations and FilteredQuery
rewrites to a BooleanQuery when the filter is a QueryWrapperFilter in order
to leverage approximations.
(Adrien Grand)
- LUCENE-6318: Reduce RAM usage of FieldInfos when there are many fields.
(Mike McCandless, Robert Muir)
- LUCENE-6320: Speed up CheckIndex.
(Robert Muir)
- LUCENE-4942: Optimized the encoding of PrefixTreeStrategy indexes for
non-point data: 33% smaller index, 68% faster indexing, and 44% faster
searching. YMMV
(David Smiley)
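To make the two-phase idea from LUCENE-6198 above concrete, here is a minimal plain-Java sketch. It assumes a cheap "approximation" (a sorted list of candidate doc IDs), a second clause (another sorted list), and a costly per-document check standing in for position verification. None of these names come from Lucene; the point is only that the costly check runs after the cheap clauses agree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Plain-Java sketch of two-phase matching; TwoPhaseSketch is a hypothetical name.
final class TwoPhaseSketch {
    /**
     * Intersects two sorted doc ID lists and runs the costly check only on
     * documents present in both.
     */
    static List<Integer> search(int[] approximation, int[] otherClause, IntPredicate matches) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < approximation.length && j < otherClause.length) {
            if (approximation[i] < otherClause[j]) {
                i++;                              // phase 1: cheap advance
            } else if (approximation[i] > otherClause[j]) {
                j++;
            } else {
                if (matches.test(approximation[i])) {
                    hits.add(approximation[i]);   // phase 2: costly confirmation
                }
                i++;
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Docs 3 and 8 pass both cheap clauses; only 3 survives the costly check.
        System.out.println(search(new int[] {1, 3, 5, 8}, new int[] {3, 4, 8, 9}, doc -> doc != 8)); // [3]
    }
}
```

In this example the costly check runs twice (for docs 3 and 8) instead of four times, which is the saving the two-phase approach aims for.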
- API Changes (21)
- LUCENE-6204, LUCENE-6208: Simplify CompoundFormat: remove files()
and remove files parameter to write().
(Robert Muir)
- LUCENE-6217: Add IndexWriter.isOpen and getTragicException.
(Simon Willnauer, Mike McCandless)
- LUCENE-6218, LUCENE-6220: Add Collector.needsScores() and needsScores
parameter to Query.createWeight().
(Robert Muir, Adrien Grand)
- LUCENE-4524, LUCENE-6246, LUCENE-6256, LUCENE-6271: Merge DocsEnum and DocsAndPositionsEnum
into a single PostingsEnum iterator. TermsEnum.docs() and TermsEnum.docsAndPositions()
are replaced by TermsEnum.postings().
(Alan Woodward, Simon Willnauer, Robert Muir, Ryan Ernst)
- LUCENE-6222: Removed TermFilter, use a QueryWrapperFilter(TermQuery)
instead. This will be as efficient now that queries can opt out from
scoring.
(Adrien Grand)
- LUCENE-6269: Removed BooleanFilter, use a QueryWrapperFilter(BooleanQuery)
instead.
(Adrien Grand)
- LUCENE-6270: Replaced TermsFilter with TermsQuery, use a
QueryWrapperFilter(TermsQuery) instead.
(Adrien Grand)
- LUCENE-6223: Move BooleanQuery.BooleanWeight to BooleanWeight.
(Robert Muir)
- LUCENE-1518: Make Filter extend Query and return 0 as score.
(Uwe Schindler, Adrien Grand)
- LUCENE-6245: Force Filter subclasses to implement toString API from Query.
(Ryan Ernst)
- LUCENE-6268: Replace FieldValueFilter and DocValuesRangeFilter with equivalent
queries that support approximations.
(Adrien Grand)
- LUCENE-6289: Replace DocValuesRangeFilter with DocValuesRangeQuery which
supports approximations.
(Adrien Grand)
- LUCENE-6266: Remove unnecessary Directory params from SegmentInfo.toString,
SegmentInfos.files/toString, and SegmentCommitInfo.toString.
(Robert Muir)
- LUCENE-6272: Scorer extends DocIdSetIterator rather than DocsEnum
(Alan Woodward)
- LUCENE-6281: Removed support for slow collations from lucene/sandbox. Better
performance would be achieved through CollationKeyAnalyzer or
ICUCollationKeyAnalyzer.
(Adrien Grand)
- LUCENE-6286: Removed IndexSearcher methods that take a Filter object.
A BooleanQuery with a filter clause must be used instead.
(Adrien Grand)
- LUCENE-6300: PrefixFilter, TermRangeFilter and NumericRangeFilter have been
removed. Use PrefixQuery, TermRangeQuery and NumericRangeQuery instead.
(Adrien Grand)
- LUCENE-6303: Replaced FilterCache with QueryCache and CachingWrapperFilter
with CachingWrapperQuery.
(Adrien Grand)
- LUCENE-6317: Deprecate DataOutput.writeStringSet and writeStringStringMap.
Use writeSetOfStrings/Maps instead.
(Mike McCandless, Robert Muir)
- LUCENE-6307: Rename SegmentInfo.getDocCount -> .maxDoc,
SegmentInfos.totalDocCount -> .totalMaxDoc, MergeInfo.totalDocCount ->
.totalMaxDoc and MergePolicy.OneMerge.totalDocCount -> .totalMaxDoc
(Adrien Grand, Robert Muir, Mike McCandless)
- LUCENE-6367: PrefixQuery now subclasses AutomatonQuery, removing the
specialized PrefixTermsEnum.
(Robert Muir, Mike McCandless)
- Other (6)
- LUCENE-6248: Remove unused odd constants from StandardSyntaxParser.jj
(Dawid Weiss)
- LUCENE-6193: Collapse identical catch branches in try-catch statements.
(shalin)
- LUCENE-6239: Removed RAMUsageEstimator's sun.misc.Unsafe calls.
(Robert Muir, Dawid Weiss, Uwe Schindler)
- LUCENE-6292: Seed StringHelper better.
(Robert Muir)
- LUCENE-6333: Refactored queries to delegate their equals and hashcode
impls to the super class.
(Lee Hinman via Adrien Grand)
- LUCENE-6343: DefaultSimilarity javadocs had the wrong float value to
demonstrate precision of encoded norms
(András Péteri via Mike McCandless)
- Changes in Runtime Behavior (2)
- LUCENE-6255: PhraseQuery now ignores leading holes and requires that
positions are positive and added in order.
(Adrien Grand)
- LUCENE-6298: SimpleQueryParser returns an empty query rather than
null, if e.g. the terms were all stopwords.
(Lee Hinman via Robert Muir)
- New Features (32)
- LUCENE-5945: All file handling converted to NIO.2 apis.
(Robert Muir)
- LUCENE-5946: SimpleFSDirectory now uses Files.newByteChannel, for
portability with custom FileSystemProviders. If you want the old
non-interruptible behavior of RandomAccessFile, use RAFDirectory
in the misc/ module.
(Uwe Schindler, Robert Muir)
- SOLR-3359: Added analyzer attribute/property to SynonymFilterFactory.
(Ryo Onodera via Koji Sekiguchi)
- LUCENE-5648: Index and search date ranges, particularly multi-valued ones. It's
implemented in the spatial module as DateRangePrefixTree used with
NumberRangePrefixTreeStrategy.
(David Smiley)
- LUCENE-5895: Lucene now stores a unique id per-segment and per-commit to aid
in accurate replication of index files
(Robert Muir, Mike McCandless)
- LUCENE-5889: Add commit method to AnalyzingInfixSuggester, and allow just using .add
to build up the suggester.
(Varun Thacker via Mike McCandless)
- LUCENE-5123: Add a "pull" option to the postings writing API, so
that a PostingsFormat now receives a Fields instance and it is
responsible for iterating through all fields, terms, documents and
positions.
(Robert Muir, Mike McCandless)
- LUCENE-5268: Full cutover of all postings formats to the "pull"
FieldsConsumer API, removing PushFieldsConsumer. Added new
PushPostingsWriterBase for single-pass push of docs/positions to the
postings format.
(Mike McCandless)
- LUCENE-5906: Use Files.delete everywhere instead of File.delete, so that
when things go wrong, you get a real exception message why.
(Uwe Schindler, Robert Muir)
- LUCENE-5933: Added FilterSpans for easier wrapping of Spans instance.
(Shai Erera)
- LUCENE-5925: Remove fallback logic from opening commits, instead use
Directory.renameFile so that in-progress commits are never visible.
(Robert Muir)
- LUCENE-5820: SuggestStopFilter should have a factory.
(Varun Thacker via Steve Rowe)
- LUCENE-5949: Add Accountable.getChildResources().
(Robert Muir)
- SOLR-5986: Added ExitableDirectoryReader that extends FilterDirectoryReader and enables
exiting requests that take too long to enumerate over terms.
(Anshum Gupta, Steve Rowe, Robert Muir)
- LUCENE-5911: Add MemoryIndex.freeze() to allow thread-safe searching over a
MemoryIndex.
(Alan Woodward, David Smiley, Robert Muir)
- LUCENE-5969: Lucene 5.0 has a new index format with mismatched file detection,
improved exception handling, and indirect norms encoding for sparse fields.
(Mike McCandless, Ryan Ernst, Robert Muir)
- LUCENE-6053: Add Serbian analyzer.
(Nikola Smolenski via Robert Muir, Mike McCandless)
- LUCENE-4400: Add support for new NYSIIS Apache commons phonetic
codec
(Thomas Neidhart via Mike McCandless)
- LUCENE-6059: Add Daitch-Mokotoff Soundex phonetic Apache commons
phonetic codec, and upgrade to Apache commons codec 1.10.
(Thomas Neidhart via Mike McCandless)
- LUCENE-6058: With the upgrade to Apache commons codec 1.10, the
experimental BeiderMorseFilter has changed its behavior, so any
index using it will need to be rebuilt.
(Thomas Neidhart via Mike McCandless)
- LUCENE-6050: Accept MUST and MUST_NOT (in addition to SHOULD) for
each context passed to Analyzing/BlendedInfixSuggester
(Arcadius Ahouansou, jane chang via Mike McCandless)
- LUCENE-5929: Also extract terms to highlight from block join
queries.
(Julie Tibshirani via Mike McCandless)
- LUCENE-6063: Allow overriding whether/how ConcurrentMergeScheduler
stalls incoming threads when merges are falling behind
(Mike McCandless)
- LUCENE-5833: DocumentDictionary now enumerates each value separately
in a multi-valued field (not just the first value), so you can build
suggesters from multi-valued fields.
(Varun Thacker via Mike McCandless)
- LUCENE-6077: Added a filter cache.
(Adrien Grand, Robert Muir)
- LUCENE-6088: TermsFilter implements Accountable.
(Adrien Grand)
- LUCENE-6034: The default highlighter when used with QueryScorer will highlight payload-sensitive
queries provided that term vectors with positions, offsets, and payloads are present. This is the
only highlighter that can highlight such queries accurately.
(David Smiley)
- LUCENE-5914: Add an option to Lucene50Codec to support either BEST_SPEED
or BEST_COMPRESSION for stored fields.
(Adrien Grand, Robert Muir)
- LUCENE-6119: Add auto-IO-throttling to ConcurrentMergeScheduler, to
rate limit IO writes for each merge depending on incoming merge
rate.
(Mike McCandless)
- LUCENE-6155: Add payload support to MemoryIndex. The default highlighter's
QueryScorer and WeightedSpanTermExtractor now have setUsePayloads(bool).
(David Smiley)
- LUCENE-6166: Deletions (alone) can now trigger new merges.
(Mike McCandless)
- LUCENE-6177: Add CustomAnalyzer that lets you configure analyzers
like you do in Solr's index schema. This class has a builder API to configure
Tokenizers, TokenFilters, and CharFilters based on their SPI names
and parameters as documented by the corresponding factories.
(Uwe Schindler)
- Optimizations (18)
- LUCENE-5960: Use a more efficient bitset, not a Set<Integer>, to
track visited states.
(Markus Heiden via Mike McCandless)
- LUCENE-5959: Don't allocate excess memory when building automaton in
finish.
(Markus Heiden via Mike McCandless)
- LUCENE-5963: Reduce memory allocations in
AnalyzingSuggester.
(Markus Heiden via Mike McCandless)
- LUCENE-5938: MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE is now faster on
queries that match few documents by using a sparse bit set implementation.
(Adrien Grand)
- LUCENE-5969: Refactor merging to be more efficient, checksum calculation is
per-segment/per-producer, and norms and doc values merging no longer cause
RAM spikes for latent fields.
(Mike McCandless, Robert Muir)
- LUCENE-5983: CachingWrapperFilter now uses a new DocIdSet implementation
called RoaringDocIdSet instead of WAH8DocIdSet.
(Adrien Grand)
- LUCENE-6022: DocValuesDocIdSet checks live docs before doc values.
(Adrien Grand)
- LUCENE-6030: Add norms patched compression for a small number of common values
(Ryan Ernst)
- LUCENE-6040: Speed up EliasFanoDocIdSet through broadword bit selection.
(Paul Elschot)
- LUCENE-6033: CachingTokenFilter now uses ArrayList not LinkedList, and has new
isCached() method.
(David Smiley)
- LUCENE-6031: TokenSources (in the default highlighter) converts term vectors into a
TokenStream much faster, in linear time (not N*log(N)), using less memory, and with reset()
implemented. Only one of offsets or positions is required of the term vector.
(David Smiley)
- LUCENE-6089, LUCENE-6090: Tune CompressionMode.HIGH_COMPRESSION for
better compression and less cpu usage.
(Adrien Grand, Robert Muir)
- LUCENE-6034: QueryScorer, used by the default highlighter, needn't re-index the provided
TokenStream with MemoryIndex when it comes from TokenSources (term vectors) with offsets and
positions.
(David Smiley)
- LUCENE-5951: ConcurrentMergeScheduler detects whether the index is on SSD or not
and does a better job defaulting its settings. This only works on Linux for now;
other OS's will continue to use the previous defaults (tuned for spinning disks).
(Robert Muir, Uwe Schindler, hossman, Mike McCandless)
- LUCENE-6131: Optimize SortingMergePolicy.
(Robert Muir)
- LUCENE-6133: Improve default StoredFieldsWriter.merge() to be more efficient.
(Robert Muir)
- LUCENE-6145: Make EarlyTerminatingSortingCollector able to early-terminate
when the sort order is a prefix of the index-time order.
(Adrien Grand)
- LUCENE-6178: Score boolean queries containing MUST_NOT clauses with BooleanScorer2,
to use skip list data and avoid unnecessary scoring.
(Adrien Grand, Robert Muir)
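The switch described in LUCENE-5960 above (a bitset rather than a `Set<Integer>` for tracking visited states) can be sketched in plain Java with `java.util.BitSet`. This is only an illustration of the technique on a hypothetical adjacency-array graph; it is not the code that was changed in Lucene.

```java
import java.util.BitSet;

// Plain-Java sketch: track visited states with one bit each instead of boxed Integers.
final class VisitedStatesSketch {
    /** Counts states reachable from start; transitions[s] lists the targets of state s. */
    static int countReachable(int[][] transitions, int start) {
        BitSet visited = new BitSet(transitions.length);
        int[] stack = new int[transitions.length]; // each state is pushed at most once
        int size = 0;
        visited.set(start);
        stack[size++] = start;
        while (size > 0) {
            int state = stack[--size];
            for (int target : transitions[state]) {
                if (!visited.get(target)) {
                    visited.set(target);
                    stack[size++] = target;
                }
            }
        }
        return visited.cardinality();
    }

    public static void main(String[] args) {
        int[][] transitions = {{1, 2}, {2}, {0}, {3}};
        System.out.println(countReachable(transitions, 0)); // 3 (states 0, 1 and 2)
    }
}
```

A bitset needs one bit per state and avoids allocating an `Integer` and a hash entry for every visited state, which is where the saving comes from when state numbers are dense.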
- API Changes (40)
- LUCENE-5900: Deprecated more constructors taking Version in *InfixSuggester and
ICUCollationKeyAnalyzer, and removed TEST_VERSION_CURRENT from the test framework.
(Ryan Ernst)
- LUCENE-4535: oal.util.FilterIterator is now an internal API.
(Adrien Grand)
- LUCENE-4924: DocIdSetIterator.docID() must now return -1 when the iterator is
not positioned. This change affects all classes that inherit from
DocIdSetIterator, including DocsEnum and DocsAndPositionsEnum.
(Adrien Grand)
- LUCENE-5127: Reduce RAM usage of FixedGapTermsIndex. Remove
IndexWriterConfig.setTermIndexInterval, IndexWriterConfig.setReaderTermsIndexDivisor,
and termsIndexDivisor from StandardDirectoryReader. These options have been no-ops
with the default codec since Lucene 4.0. If you want to configure the interval for
this term index, pass it directly in your codec, where it can also be configured
per-field.
(Robert Muir)
- LUCENE-5388: Remove Reader from Tokenizer's constructor and from
Analyzer's createComponents. TokenStreams now always get their input
via setReader.
(Benson Margulies via Robert Muir - pull request #16)
- LUCENE-5527: The Collector API has been refactored to use a dedicated Collector
per leaf.
(Shikhar Bhushan, Adrien Grand)
- LUCENE-5702: The FieldComparator API has been refactored to a per-leaf API, just
like Collectors.
(Adrien Grand)
- LUCENE-4246: IndexWriter.close now always closes, even if it throws
an exception. The new IndexWriterConfig.setCommitOnClose (default
true) determines whether close() should commit before closing.
- LUCENE-5608, LUCENE-5565: Refactor SpatialPrefixTree/Cell API. Doesn't use Strings
as tokens anymore, and now iterates cells on-demand during indexing instead of
building a collection. RPT now has more setters.
(David Smiley)
- LUCENE-5666: Change uninverted access (sorting, faceting, grouping, etc)
to use the DocValues API instead of FieldCache. For FieldCache functionality,
use UninvertingReader in lucene/misc (or implement your own FilterReader).
UninvertingReader is more efficient: supports multi-valued numeric fields,
detects when a multi-valued field is single-valued, reuses caches
of compatible types (e.g. SORTED also supports BINARY and SORTED_SET access
without insanity). "Insanity" is no longer possible unless you explicitly want it.
Rename FieldCache* and DocTermOrds* classes in the search package to DocValues*.
Move SortedSetSortField to core and add SortedSetFieldSource to queries/, which
takes the same selectors. Add helper methods to DocValues.java that are better
suited for search code (never return null, etc).
(Mike McCandless, Robert Muir)
- LUCENE-5871: Remove Version from IndexWriterConfig. Use
IndexWriterConfig.setCommitOnClose to change the behavior of IndexWriter.close().
The default has been changed to match that of 4.x.
(Ryan Ernst, Mike McCandless)
- LUCENE-5965: CorruptIndexException requires a String or DataInput resource.
(Robert Muir)
- LUCENE-5972: IndexFormatTooOldException and IndexFormatTooNewException now
extend from IOException.
(Ryan Ernst, Robert Muir)
- LUCENE-5569: *AtomicReader/AtomicReaderContext have been renamed to *LeafReader/LeafReaderContext.
(Ryan Ernst)
- LUCENE-5938: Removed MultiTermQuery.ConstantScoreAutoRewrite as
MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE is usually better.
(Adrien Grand)
- LUCENE-5924: Rename CheckIndex -fix option to -exorcise. This option does not
actually fix the index, it just drops data.
(Robert Muir)
- LUCENE-5969: Add Codec.compoundFormat, which handles the encoding of compound
files. Add getMergeInstance() to codec producer APIs, which can be overridden
to return an instance optimized for merging instead of searching. Add
Terms.getStats() which can return additional codec-specific statistics about a field.
Change instance method SegmentInfos.read() to two static methods: SegmentInfos.readCommit()
and SegmentInfos.readLatestCommit().
(Mike McCandless, Robert Muir)
- LUCENE-5992: Remove FieldInfos from SegmentInfosWriter.write API.
(Robert Muir, Mike McCandless)
- LUCENE-5998: Simplify Field/SegmentInfoFormat to read+write methods.
(Robert Muir)
- LUCENE-6000: Removed StandardTokenizerInterface. Tokenizers now use
their jflex impl directly.
(Ryan Ernst)
- LUCENE-6006: Removed FieldInfo.normType since it's redundant: it
will be DocValuesType.NUMERIC if the field is indexed and does not omit
norms, else null.
(Robert Muir, Mike McCandless)
- LUCENE-6013: Removed indexed boolean from IndexableFieldType and
FieldInfo, since it's redundant with IndexOptions != null.
(Robert Muir, Mike McCandless)
- LUCENE-6021: FixedBitSet.nextSetBit now returns DocIdSetIterator.NO_MORE_DOCS
instead of -1 when there are no more bits which are set.
(Adrien Grand)
- LUCENE-5953: Directory and LockFactory APIs were restructured: Locking is
now under the responsibility of the Directory implementation. LockFactory is
only used by subclasses of BaseDirectory to delegate locking to an impl
class. LockFactories are now singletons and are responsible to create a Lock
instance based on a Directory implementation passed to the factory method.
See MIGRATE.txt for more details.
(Uwe Schindler, Robert Muir)
- LUCENE-6062: Throw exception instead of silently doing nothing if you try to
sort/group/etc on a misconfigured field (e.g. no docvalues, no UninvertingReader, etc).
(Robert Muir)
- LUCENE-6068: LeafReader.fields() never returns null.
(Robert Muir)
- LUCENE-6082: Remove abort() from codec apis.
(Robert Muir)
- LUCENE-6084: IndexOutput's constructor now requires a String
resourceDescription so its toString is sane
(Robert Muir, Mike McCandless)
- LUCENE-6087: Allow passing custom DirectoryReader to SearcherManager
(Mike McCandless)
- LUCENE-6085: Undeprecate SegmentInfo attributes, but add safety so they
won't be trappy if codec tries to use them during docvalues updates.
(Robert Muir)
- LUCENE-6097: Remove dangerous / overly expert
IndexWriter.abortMerges and waitForMerges methods.
(Robert Muir, Mike McCandless)
- LUCENE-6099: Add FilterDirectory.unwrap and
FilterDirectoryReader.unwrap
(Simon Willnauer, Mike McCandless)
- LUCENE-6121: CachingTokenFilter.reset() now propagates to its input if called before
incrementToken(). You must call reset() now on this filter instead of doing it a-priori on the
input(), which previously didn't work.
(David Smiley, Robert Muir)
- LUCENE-6147: Make the core Accountables.namedAccountable function public
(Ryan Ernst)
- LUCENE-6150: Remove staleFiles set and onIndexOutputClosed() from FSDirectory.
(Uwe Schindler, Robert Muir, Mike McCandless)
- LUCENE-6146: Replaced Directory.copy() with Directory.copyFrom().
(Robert Muir)
- LUCENE-6149: Infix suggesters' highlighting and allTermsRequired can
be set at the constructor for non-contextual lookup.
(Boon Low, Tomás Fernández Löbbe)
- LUCENE-6158, LUCENE-6165: IndexWriter.addIndexes(IndexReader...) changed to
addIndexes(CodecReader...)
(Robert Muir)
- LUCENE-6179: Out-of-order scoring is not allowed anymore, so
Weight.scoresDocsOutOfOrder and LeafCollector.acceptsDocsOutOfOrder have been
removed and boolean queries now always score in order.
- LUCENE-6212: IndexWriter no longer accepts per-document Analyzer to
add/updateDocument. These methods were trappy as they made it
easy to accidentally index tokens that were not easily
searchable.
(Mike McCandless)
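The sentinel change in LUCENE-6021 above (returning DocIdSetIterator.NO_MORE_DOCS instead of -1 when no further bit is set) affects how iteration loops are written. The sketch below imitates that contract on top of `java.util.BitSet`, which itself returns -1. `NO_MORE_DOCS` here is a local constant set to `Integer.MAX_VALUE`, and the wrapper class is hypothetical, not Lucene's FixedBitSet.

```java
import java.util.BitSet;

// Plain-Java sketch of iterating with a NO_MORE_DOCS sentinel instead of -1.
final class NextSetBitSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE; // local stand-in for the sentinel

    /** Like BitSet.nextSetBit, but returns NO_MORE_DOCS when nothing is left. */
    static int nextSetBit(BitSet bits, int from) {
        int next = bits.nextSetBit(from);
        return next == -1 ? NO_MORE_DOCS : next;
    }

    /** Counts set bits at or after from, using the sentinel as the loop exit. */
    static int countFrom(BitSet bits, int from) {
        int count = 0;
        for (int doc = nextSetBit(bits, from); doc != NO_MORE_DOCS; doc = nextSetBit(bits, doc + 1)) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(2);
        bits.set(5);
        bits.set(9);
        System.out.println(countFrom(bits, 3)); // 2 (bits 5 and 9)
    }
}
```

Callers that previously tested for a negative result need to compare against the sentinel instead; a loop written as `doc >= 0` would never terminate under the new contract.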
- Bug Fixes (28)
- LUCENE-5650: Enforce read-only access to any path outside the temporary
folder via security manager, and make test temp dirs absolute.
(Ryan Ernst, Dawid Weiss)
- LUCENE-5948: RateLimiter now fully inits itself on init.
(Varun Thacker via Mike McCandless)
- LUCENE-5981: CheckIndex obtains write.lock, since with some parameters it
may modify the index, and to prevent false corruption reports, as it does
not have the regular "spinlock" of DirectoryReader.open. It now implements
Closeable and you must close it to release the lock.
(Mike McCandless, Robert Muir)
- LUCENE-6004: Don't highlight the LookupResult.key returned from
AnalyzingInfixSuggester
(Christian Reuschling, jane chang via Mike McCandless)
- LUCENE-5980: Don't let document length overflow.
(Robert Muir)
- LUCENE-5961: Fix the exists() method for FunctionValues returned by many ValueSources to
behave properly when wrapping other ValueSources which do not exist for the specified document
(hossman)
- LUCENE-6039: Add IndexOptions.NONE and DocValuesType.NONE instead of
using null to mean not indexed and no doc values, renamed
IndexOptions.DOCS_ONLY to DOCS, and pulled IndexOptions and
DocValues out of FieldInfo into their own classes in
org.apache.lucene.index
(Simon Willnauer, Robert Muir, Mike McCandless)
- LUCENE-6041: Remove sugar methods FieldInfo.isIndexed and
FieldInfo.hasDocValues.
(Robert Muir, Mike McCandless)
- LUCENE-6044: Fix backcompat support for token filters with enablePositionIncrements=false.
Also fixed backcompat for TrimFilter with updateOffsets=true. These options
are supported with a match version before 4.4, and no longer valid at all with 5.0.
(Ryan Ernst)
- LUCENE-6042: CustomScoreQuery explain was incorrect in some cases,
such as when nested inside a boolean query.
(Denis Lantsman via Robert Muir)
- LUCENE-6046: Add maxDeterminizedStates safety to determinize (which has
an exponential worst case) so that if it would create too many states, it
now throws an exception instead of exhausting CPU/RAM.
(Nik Everett via Mike McCandless)
- LUCENE-6054: Allow repeating the empty automaton
(Nik Everett via Mike McCandless)
- LUCENE-6049: Don't throw cryptic exception writing a segment when
the only docs in it had fields that hit non-aborting exceptions
during indexing but also had doc values.
(Mike McCandless)
- LUCENE-6055: PayloadAttribute.clone() now does a deep clone of the underlying
bytes.
(Shai Erera)
- LUCENE-6060: Remove dangerous IndexWriter.unlock method
(Simon Willnauer, Mike McCandless)
- LUCENE-6062: Pass correct fieldinfos to docvalues producer when the
segment has updates.
(Mike McCandless, Shai Erera, Robert Muir)
- LUCENE-6075: Don't overflow int in SimpleRateLimiter
(Boaz Leskes via Mike McCandless)
- LUCENE-5987: IndexWriter will now forcefully close itself on
aborting exception (an exception that would otherwise cause silent
data loss).
(Robert Muir, Mike McCandless)
- LUCENE-6094: Allow IW.rollback to stop ConcurrentMergeScheduler even
when it's stalling because there are too many merges.
(Mike McCandless)
- LUCENE-6105: Don't cache FST root arcs if the number of root arcs is
small, or if the cache would be > 20% of the size of the FST.
(Robert Muir, Mike McCandless)
- LUCENE-6124: Fix double-close() problems in codec and store APIs.
(Robert Muir)
- LUCENE-6152: Fix double close problems in OutputStreamIndexOutput.
(Uwe Schindler)
- LUCENE-6139: Highlighter: TokenGroup start & end offset getters should have
been returning the offsets of just the matching tokens in the group when
there's a distinction.
(David Smiley)
- LUCENE-6173: NumericTermAttribute and spatial/CellTokenStream do not clone
their BytesRef(Builder)s. Also equals/hashCode was missing.
(Uwe Schindler)
- LUCENE-6205: Fixed intermittent concurrency issue that could cause
FileNotFoundException when writing doc values updates at the same
time that a merge kicks off.
(Mike McCandless)
- LUCENE-6192: Fix int overflow corruption case in skip data for
high frequency terms in extremely large indices
(Robert Muir, Mike McCandless)
- LUCENE-6093: Don't throw NullPointerException from
BlendedInfixSuggester for lookups that do not end in a prefix
token.
(jane chang via Mike McCandless)
- LUCENE-6214: Fixed IndexWriter deadlock when one thread is
committing while another opens a near-real-time reader and an
unrecoverable (tragic) exception is hit.
(Simon Willnauer, Mike McCandless)
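Several fixes in the list above (LUCENE-5980, LUCENE-6075, LUCENE-6192) concern int overflow. The change log does not say which expressions overflowed, so the following plain-Java sketch only shows the general class of bug and two common remedies: widening to `long`, or failing fast with `Math.multiplyExact`. The class and method names are hypothetical.

```java
// Plain-Java sketch of silent int overflow and two ways to avoid it.
final class OverflowSketch {
    /** Wraps silently once the product no longer fits in an int (mb >= 2048). */
    static int bytesAsInt(int mb) {
        return mb * 1024 * 1024;
    }

    /** Widening one operand to long makes the whole multiplication 64-bit. */
    static long bytesAsLong(int mb) {
        return mb * 1024L * 1024L;
    }

    /** Throws ArithmeticException instead of wrapping. */
    static int bytesChecked(int mb) {
        return Math.multiplyExact(mb, 1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println(bytesAsInt(2048));  // -2147483648
        System.out.println(bytesAsLong(2048)); // 2147483648
    }
}
```

Widening is usually the right fix when the larger value is legitimate; the checked variant suits cases where an out-of-range value signals bad input.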
- Documentation (3)
- LUCENE-5392: Add/improve analysis package documentation to reflect
analysis API changes.
(Benson Margulies via Robert Muir - pull request #17)
- LUCENE-6057: Improve Sort(SortField) docs
(Martin Braun via Mike McCandless)
- LUCENE-6112: Fix compile error in FST package example code
(Tomoko Uchida via Koji Sekiguchi)
- Tests (6)
- LUCENE-5957: Add option for tests to not randomize codec
(Ryan Ernst)
- LUCENE-5974: Add check that backcompat indexes use default codecs
(Ryan Ernst)
- LUCENE-5971: Create addBackcompatIndexes.py script to build and add
backcompat test indexes for a given lucene version. Also renamed backcompat
index files to use Version.toString() in filename.
(Ryan Ernst)
- LUCENE-6002: Monster tests no longer fail. Most of them now have an 80 hour
timeout, effectively removing the timeout. The tests that operate near the 2
billion limit now use IndexWriter.MAX_DOCS instead of Integer.MAX_VALUE.
Some of the slow Monster tests now explicitly choose the default codec.
(Mike McCandless, Shawn Heisey)
- LUCENE-5968: Improve error message when 'ant beast' is run on top-level
modules.
(Ramkumar Aiyengar, Uwe Schindler)
- LUCENE-6120: Fix MockDirectoryWrapper's close() handling.
(Mike McCandless, Robert Muir)
- Build (5)
- LUCENE-5909: Smoke tester now has better command line parsing and
optionally also runs on Java 8.
(Ryan Ernst, Uwe Schindler)
- LUCENE-5902: Add bumpVersion.py script to manage version increase after release branch is cut.
- LUCENE-5962: Rename diffSources.py to createPatch.py and make it work with all text file types.
(Ryan Ernst)
- LUCENE-5995: Upgrade ICU to 54.1
(Robert Muir)
- LUCENE-6070: Upgrade forbidden-apis to 1.7
(Uwe Schindler)
- Other (5)
- LUCENE-5563: Removed sep layout, which has fallen behind on features and doesn't
perform as well as other options.
(Robert Muir)
- LUCENE-4086: Removed support for Lucene 3.x indexes. See migration guide for
more information.
(Robert Muir)
- LUCENE-5858: Moved Lucene 4 compatibility codecs to 'lucene-backward-codecs.jar'.
(Adrien Grand, Robert Muir)
- LUCENE-5915: Remove Pulsing postings format.
(Robert Muir)
- LUCENE-6213: Add useful exception message when commit contains segments from legacy codecs.
(Ryan Ernst)
- Bug fixes (12)
- LUCENE-6019, LUCENE-6117: Remove -Dtests.assert to make IndexWriter
infoStream sane.
(Robert Muir, Mike McCandless)
- LUCENE-6161: Resolving deletes was failing to reuse DocsEnum, likely
causing substantial performance cost for use cases that frequently
delete old documents.
(Mike McCandless)
- LUCENE-6192: Fix int overflow corruption case in skip data for
high frequency terms in extremely large indices
(Robert Muir, Mike McCandless)
- LUCENE-6207: Fixed consumption of several terms enums on the same
sorted (set) doc values instance at the same time.
(Tom Shally, Robert Muir, Adrien Grand)
- LUCENE-6093: Don't throw NullPointerException from
BlendedInfixSuggester for lookups that do not end in a prefix
token.
(jane chang via Mike McCandless)
- LUCENE-6279: Don't let an abusive leftover _N_upgraded.si in the
index directory cause index corruption on upgrade
(Robert Muir, Mike McCandless)
- LUCENE-6287: Fix concurrency bug in IndexWriter that could cause
index corruption (missing _N.si files) the first time 4.x kisses a
3.x index if merges are also running.
(Simon Willnauer, Mike McCandless)
- LUCENE-6205: Fixed intermittent concurrency issue that could cause
FileNotFoundException when writing doc values updates at the same
time that a merge kicks off.
(Mike McCandless)
- LUCENE-6214: Fixed IndexWriter deadlock when one thread is
committing while another opens a near-real-time reader and an
unrecoverable (tragic) exception is hit.
(Simon Willnauer, Mike McCandless)
- LUCENE-6105: Don't cache FST root arcs if the number of root arcs is
small, or if the cache would be > 20% of the size of the FST.
(Robert Muir, Mike McCandless)
- LUCENE-6001: DrillSideways hits NullPointerException for certain
BooleanQuery searches.
(Dragan Jotannovic, jane chang via Mike McCandless)
- LUCENE-6306: Merging of doc values and norms now checks whether the
merge was aborted so IndexWriter.rollback can more promptly abort a
running merge.
(Robert Muir, Mike McCandless)
- API Changes (1)
- LUCENE-6212: Deprecate IndexWriter APIs that accept per-document Analyzer.
These methods were trappy as they made it easy to accidentally index
tokens that were not easily searchable and will be removed in 5.0.0.
(Mike McCandless)
- Bug fixes (12)
- LUCENE-6046: Add maxDeterminizedStates safety to determinize (which has
an exponential worst case) so that if it would create too many states, it
now throws an exception instead of exhausting CPU/RAM.
(Nik Everett via Mike McCandless)
- LUCENE-6054: Allow repeating the empty automaton
(Nik Everett via Mike McCandless)
- LUCENE-6049: Don't throw cryptic exception writing a segment when
the only docs in it had fields that hit non-aborting exceptions
during indexing but also had doc values.
(Mike McCandless)
- LUCENE-6060: Deprecate IndexWriter.unlock
(Simon Willnauer, Mike McCandless)
- LUCENE-3229: Overlapping ordered SpanNearQuery spans should not match.
(Ludovic Boutros, Paul Elschot, Greg Dearing, ehatcher)
- LUCENE-6004: Don't highlight the LookupResult.key returned from
AnalyzingInfixSuggester
(Christian Reuschling, jane chang via Mike McCandless)
- LUCENE-6075: Don't overflow int in SimpleRateLimiter
(Boaz Leskes via Mike McCandless)
- LUCENE-5980: Don't let document length overflow.
(Robert Muir)
- LUCENE-6042: CustomScoreQuery explain was incorrect in some cases,
such as when nested inside a boolean query.
(Denis Lantsman via Robert Muir)
- LUCENE-5948: RateLimiter now fully inits itself on init.
(Varun Thacker via Mike McCandless)
- LUCENE-6055: PayloadAttribute.clone() now does a deep clone of the underlying
bytes.
(Shai Erera)
- LUCENE-6094: Allow IW.rollback to stop ConcurrentMergeScheduler even
when it's stalling because there are too many merges.
(Mike McCandless)
- Documentation (1)
- LUCENE-6057: Improve Sort(SortField) docs
(Martin Braun via Mike McCandless)
- Bug fixes (2)
- LUCENE-5977: Fix tokenstream safety checks in IndexWriter to properly
work across multi-valued fields. Previously some cases across multi-valued
fields would happily create a corrupt index.
(Dawid Weiss, Robert Muir)
- LUCENE-6019: Detect when DocValuesType illegally changes for the
same field name. Also added -Dtests.asserts=true|false so we can
run tests with and without assertions.
(Simon Willnauer, Robert Muir, Mike McCandless)
- Bug fixes (7)
- LUCENE-5934: Fix backwards compatibility for 4.0 indexes.
(Ian Lea, Uwe Schindler, Robert Muir, Ryan Ernst)
- LUCENE-5939: Regenerate old backcompat indexes to ensure they were built with
the exact release
(Ryan Ernst, Uwe Schindler)
- LUCENE-5952: Improve error messages when version cannot be parsed;
don't check for too old or too new major version (it's too low level
to enforce here); use simple string tokenizer.
(Ryan Ernst, Uwe Schindler, Robert Muir, Mike McCandless)
- LUCENE-5958: Don't let exceptions during checkpoint corrupt the index.
Refactor existing OOM handling too, so you don't need to handle OOM specially
for every IndexWriter method: instead such disasters will cause IW to close itself
defensively.
(Robert Muir, Mike McCandless)
- LUCENE-5904: Fixed a corruption case that can happen when 1)
IndexWriter is uncleanly shut-down (OS crash, power loss, etc.), 2)
on startup, when a new IndexWriter is created, a virus checker is
holding some of the previously written but unused files open and
preventing deletion, 3) IndexWriter writes these files again during
the course of indexing, then the files can later be deleted, causing
corruption. This case was detected by adding evilness to
MockDirectoryWrapper to have it simulate a virus checker holding a
file open and preventing deletion
(Robert Muir, Mike McCandless)
- LUCENE-5916: Static scope test components should be consistent between
tests (and test iterations). Fix for FaultyIndexInput in particular.
(Dawid Weiss)
- LUCENE-5975: Fix reading of 3.0-3.3 indexes, where bugs in these old
index formats would result in CorruptIndexException "did not read all
bytes from file" when reading the deleted docs file.
(Patrick Mi, Robert Muir)
- Tests (1)
- LUCENE-5936: Add backcompat checks to verify what is tested matches known versions
(Ryan Ernst)
- New Features (11)
- LUCENE-5778: Support hunspell morphological description fields/aliases.
(Robert Muir)
- LUCENE-5801: Added (back) OrdinalMappingAtomicReader for merging search
indexes that contain category ordinals from separate taxonomy indexes.
(Nicola Buso via Shai Erera)
- LUCENE-4175, LUCENE-5714, LUCENE-5779: Index and search rectangles with spatial
BBoxSpatialStrategy using most predicates. Sort documents by relative overlap
of query areas or just by indexed shape area.
(Ryan McKinley, David Smiley)
- LUCENE-5806: Extend expressions grammar to support array access in variables.
Added helper class VariableContext to parse complex variable into pieces.
(Ryan Ernst)
- LUCENE-5826: Support proper hunspell case handling, LANG, KEEPCASE, NEEDAFFIX,
and ONLYINCOMPOUND flags.
(Robert Muir)
- LUCENE-5815: Add TermAutomatonQuery, a proximity query allowing you
to create an arbitrary automaton, using terms on the transitions,
expressing which sequence of sequential terms (including a special
"any" term) are allowed. This is a generalization of
MultiPhraseQuery and span queries, and enables "correct" (including
position length) search-time graph synonyms.
(Mike McCandless)
- LUCENE-5819: Add OrdsLucene41 block tree terms dict and postings
format, to include term ordinals in the index so the optional
TermsEnum.ord() and TermsEnum.seekExact(long ord) APIs work.
(Mike McCandless)
- LUCENE-5835: TermValComparator can sort missing values last.
(Adrien Grand)
- LUCENE-5825: Benchmark module can use custom postings format, e.g.:
codec.postingsFormat=Memory
(Varun Shenoy, David Smiley)
- LUCENE-5842: When opening large files (where it's too expensive to compare
checksum against all the bytes), retrieve checksum to validate structure
of footer, this can detect some forms of corruption such as truncation.
(Robert Muir)
- LUCENE-5739: Added DataInput.readZ(Int|Long) and DataOutput.writeZ(Int|Long)
to read and write small signed integers.
(Adrien Grand)
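The LUCENE-5739 entry above concerns compact storage of small signed integers. A minimal sketch of the zig-zag mapping commonly used for this purpose follows. It assumes the ZInt methods combine zig-zag encoding with a 7-bits-per-byte variable-length format; the class and method names here (ZigZagSketch, encode, decode, vIntLength) are hypothetical and are not Lucene APIs.

```java
// Illustrative sketch only; not Lucene's implementation.
final class ZigZagSketch {

    // Maps signed ints to unsigned-style ints so that values of small
    // magnitude (positive or negative) get small codes: 0, -1, 1, -2 -> 0, 1, 2, 3.
    static int encode(int v) {
        return (v << 1) ^ (v >> 31);
    }

    // Inverse of encode.
    static int decode(int z) {
        return (z >>> 1) ^ -(z & 1);
    }

    // Number of bytes a 7-bits-per-byte variable-length encoding needs for v,
    // treating v as unsigned.
    static int vIntLength(int v) {
        int n = 1;
        while ((v & ~0x7F) != 0) {
            v >>>= 7;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, -1, 1, -2, 2, -64, 64}) {
            int z = encode(v);
            System.out.println(v + " -> " + z + " (" + vIntLength(z)
                + " byte(s) zig-zagged, " + vIntLength(v) + " byte(s) raw)");
        }
    }
}
```

Without the mapping, a negative value such as -1 has all high bits set and needs the maximum number of bytes in a variable-length format; after the mapping it fits in one byte.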
- API Changes (8)
- LUCENE-5752: Simplified Automaton API to be immutable.
(Mike McCandless)
- LUCENE-5793: Add equals/hashCode to FieldType.
(Shay Banon, Robert Muir)
- LUCENE-5692: DisjointSpatialFilter is deprecated (used by RecursivePrefixTreeStrategy)
(David Smiley)
- LUCENE-5771: SpatialOperation's predicate names are now aliased to OGC standard names.
Thus you can use: Disjoint, Equals, Intersects, Overlaps, Within, Contains, Covers,
CoveredBy. The area requirement on the predicates was removed, and Overlaps' definition
was fixed.
(David Smiley)
- LUCENE-5850: Made Version handling more robust and extensible. Deprecated
Constants.LUCENE_MAIN_VERSION, Constants.LUCENE_VERSION and current Version
constants of the form LUCENE_X_Y. Added version constants that include bugfix
number of form LUCENE_X_Y_Z. Changed Version.LUCENE_CURRENT to Version.LATEST.
CheckIndex now prints the Lucene version used to write each segment.
(Ryan Ernst, Uwe Schindler, Robert Muir, Mike McCandless)
- LUCENE-5836: BytesRef has been split into BytesRef, whose intended usage is
to be just a reference to a section of a larger byte[] and BytesRefBuilder
which is a StringBuilder-like class for BytesRef instances.
(Adrien Grand)
- LUCENE-5883: You can now change the MergePolicy instance on a live IndexWriter,
without first closing and reopening the writer. This allows you to, e.g., run a special
merge with UpgradeIndexMergePolicy without reopening the writer. Also, MergePolicy
no longer implements Closeable; if you need to release your custom MergePolicy's
resources, you need to implement close() and call it explicitly.
(Shai Erera)
- LUCENE-5859: Deprecate Analyzer constructors taking Version. Use Analyzer.setVersion()
to set the version an analyzer should use to replicate behavior from a specific release.
(Ryan Ernst, Robert Muir)
- Optimizations (14)
- LUCENE-5780: Make OrdinalMap more memory-efficient, especially in case the
first segment has all values.
(Adrien Grand, Robert Muir)
- LUCENE-5782: OrdinalMap now sorts enums before being built in order to
improve compression.
(Adrien Grand)
- LUCENE-5798: Optimize MultiDocsEnum reuse.
(Robert Muir)
- LUCENE-5799: Optimize numeric docvalues merging.
(Robert Muir)
- LUCENE-5797: Optimize norms merging
(Adrien Grand, Robert Muir)
- LUCENE-5803: Add DelegatingAnalyzerWrapper, an optimized variant
of AnalyzerWrapper that doesn't allow wrapping components or readers.
This wrapper class is the base class of all analyzers that just delegate
to another analyzer, e.g. per field name: PerFieldAnalyzerWrapper and
Solr's schema support.
(Shay Banon, Uwe Schindler, Robert Muir)
- LUCENE-5795: MoreLikeThisQuery now only collects the top N terms instead
of collecting all terms from the like text when building the query.
(Alex Ksikes, Simon Willnauer)
- LUCENE-5681: Fix RAMDirectory's IndexInput to not do double buffering
on slices (causes useless data copying, especially on random access slices).
This also improves slices of NRTCachingDirectory, because the cache
is based on RAMDirectory. BufferedIndexInput.wrap() was marked with a
warning in javadocs. It is almost always a better idea to implement
slicing on your own!
(Uwe Schindler, Robert Muir)
- LUCENE-5834: Empty sorted set and numeric doc values are now singletons.
(Adrien Grand)
- LUCENE-5841: Improve performance of block tree terms dictionary when
assigning terms to blocks.
(Mike McCandless)
- LUCENE-5856: Optimize Fixed/Open/LongBitSet to remove unnecessary AND.
(Robert Muir)
- LUCENE-5884: Optimize FST.ramBytesUsed.
(Adrien Grand, Robert Muir, Mike McCandless)
- LUCENE-5882: Add Lucene410DocValuesFormat, with faster term lookups
for SORTED/SORTED_SET fields.
(Robert Muir)
- LUCENE-5887: Remove WeakIdentityMap caching in AttributeFactory,
AttributeSource, and VirtualMethod in favour of Java 7's ClassValue.
Always use MethodHandles to create AttributeImpl classes.
(Uwe Schindler)
- Bug Fixes (9)
- LUCENE-5796: Fixes the Scorer.getChildren() method for two combinations
of BooleanQuery.
(Terry Smith via Robert Muir)
- LUCENE-5790: Fix compareTo in MutableValueDouble and MutableValueBool; this caused
incorrect results when grouping on fields with missing values.
(海老澤 志信, hossman)
- LUCENE-5817: Fix hunspell zero-affix handling: previously only zero-strips worked
correctly.
(Robert Muir)
- LUCENE-5818, LUCENE-5823: Fix hunspell overgeneration for short strings that also
match affixes; words are only stripped to a zero-length string if the FULLSTRIP option
is specified in the dictionary.
(Robert Muir)
- LUCENE-5824: Fix hunspell 'long' flag handling.
(Robert Muir)
- LUCENE-5838: Fix hunspell when the .aff file has over 64k affixes.
(Robert Muir)
- LUCENE-5869: Added restriction to positive values for maxExpansions in
FuzzyQuery.
(Ryan Ernst)
- LUCENE-5672: IndexWriter.addIndexes() calls maybeMerge(), to ensure the index stays
healthy. If you don't want merging use NoMergePolicy instead.
(Robert Muir)
- LUCENE-5908: Fix Lucene43NGramTokenizer to be final
- Test Framework (2)
- LUCENE-5786: Unflushed/truncated events file (hung testing subprocess).
(Dawid Weiss)
- LUCENE-5881: Add "beasting" of tests: repeats the whole "test" Ant target
N times with "ant beast -Dbeast.iters=N".
(Uwe Schindler, Robert Muir, Ryan Ernst, Dawid Weiss)
- Build (2)
- LUCENE-5770: Upgrade to JFlex 1.6, which has direct support for
supplementary code points - as a result, ICU4J is no longer used
to generate surrogate pairs to augment JFlex scanner specifications.
(Steve Rowe)
- SOLR-6358: Remove VcsDirectoryMappings from idea configuration
vcs.xml
(Ramkumar Aiyengar via Steve Rowe)
- Bug fixes (7)
- LUCENE-5907: Fix corruption case when opening a pre-4.x index with
IndexWriter, then opening an NRT reader from that writer, then
calling commit from the writer, then closing the NRT reader. This
case would remove the wrong files from the index leading to a
corrupt index.
(Mike McCandless)
- LUCENE-5919: Fix exception handling inside IndexWriter when
deleteFile throws an exception, to not over-decRef index files,
possibly deleting a file that's still in use in the index, leading
to corruption.
(Mike McCandless)
- LUCENE-5922: DocValuesDocIdSet on 5.x and FieldCacheDocIdSet on 4.x
are not cacheable.
(Adrien Grand)
- LUCENE-5843: Added IndexWriter.MAX_DOCS which is the maximum number
of documents allowed in a single index, and any operations that add
documents will now throw IllegalStateException if the max count
would be exceeded, instead of silently creating an unusable
index.
(Mike McCandless)
- LUCENE-5844: ArrayUtil.grow/oversize now returns a maximum of
Integer.MAX_VALUE - 8 for the maximum array size.
(Robert Muir, Mike McCandless)
- LUCENE-5827: Make all Directory implementations correctly fail with
IllegalArgumentException if slices are out of bounds.
(Uwe Schindler)
- LUCENE-5897, LUCENE-5400: JFlex-based tokenizers StandardTokenizer and
UAX29URLEmailTokenizer tokenize extremely slowly over long sequences of
text partially matching certain grammar rules. The scanner default
buffer size was reduced, and scanner buffer growth was disabled, resulting
in much, much faster tokenization for these text sequences.
(Chris Geeringh, Robert Muir, Steve Rowe)
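The LUCENE-5844 entry above caps array growth at Integer.MAX_VALUE - 8. A simplified sketch of such a clamped growth function follows. The growth factor (one eighth, with a minimum of three extra slots) and the names OversizeSketch, oversize and MAX_ARRAY_LENGTH are assumptions chosen for illustration; the real ArrayUtil logic may differ, for example by also accounting for element size.

```java
// Illustrative sketch only; not Lucene's ArrayUtil.
final class OversizeSketch {

    // Assumed cap taken from the changelog entry: Integer.MAX_VALUE - 8.
    static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

    // Returns a length >= minTargetSize with some headroom for future growth,
    // never exceeding MAX_ARRAY_LENGTH.
    static int oversize(int minTargetSize) {
        if (minTargetSize < 0) {
            throw new IllegalArgumentException("invalid array size " + minTargetSize);
        }
        if (minTargetSize > MAX_ARRAY_LENGTH) {
            throw new IllegalArgumentException(
                "requested array size " + minTargetSize + " exceeds maximum " + MAX_ARRAY_LENGTH);
        }
        if (minTargetSize == 0) {
            return 0; // nothing requested yet, so do not allocate headroom
        }
        // Do the arithmetic in long so the addition cannot overflow int.
        long extra = Math.max(3, minTargetSize >> 3);
        long newSize = (long) minTargetSize + extra;
        return (int) Math.min(newSize, MAX_ARRAY_LENGTH);
    }

    public static void main(String[] args) {
        System.out.println(oversize(80));
        System.out.println(oversize(Integer.MAX_VALUE - 100) == MAX_ARRAY_LENGTH);
    }
}
```

Computing the candidate size in a wider type and clamping afterwards avoids the silent wrap-around to a negative length that plain int arithmetic would produce near the limit.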
- Changes in Runtime Behavior (2)
- LUCENE-5611: Changing the term vector options for multiple field
instances by the same name in one document is no longer accepted;
IndexWriter will now throw IllegalArgumentException.
(Robert Muir, Mike McCandless)
- LUCENE-5646: Remove rare/undertested bulk merge algorithm in
CompressingStoredFieldsWriter.
(Robert Muir, Adrien Grand)
- New Features (8)
- LUCENE-5610: Add Terms.getMin and Terms.getMax to get the lowest and
highest terms, and NumericUtils.get{Min/Max}{Int/Long} to get the
minimum and maximum numeric values from the provided Terms.
(Robert Muir, Mike McCandless)
- LUCENE-5675: Add IDVersionPostingsFormat, a postings format
optimized for primary-key (ID) fields that also record a version
(long) for each ID.
(Robert Muir, Mike McCandless)
- LUCENE-5680: Add ability to atomically update a set of DocValues
fields.
(Shai Erera)
- LUCENE-5717: Add support for multiterm queries nested inside
filtered and constant-score queries to postings highlighter.
(Luca Cavanna via Robert Muir)
- LUCENE-5731, LUCENE-5760: Add RandomAccessInput, a random access API for directory.
Add DirectReader/Writer, optimized for reading packed integers directly
from Directory. Add Lucene49Codec and Lucene49DocValuesFormat that make
use of these.
(Robert Muir)
- LUCENE-5743: Add Lucene49NormsFormat, which can compress in some cases
such as very short fields.
(Ryan Ernst, Adrien Grand, Robert Muir)
- LUCENE-5748: Add SORTED_NUMERIC docvalues type, which is efficient
for processing numeric fields with multiple values.
(Robert Muir)
- LUCENE-5754: Allow "$" as part of variable and function names in
expressions module.
(Uwe Schindler)
- Changes in Backwards Compatibility Policy (4)
- LUCENE-5634: Add reuse argument to IndexableField.tokenStream. This
can be used by custom fieldtypes, which don't use the Analyzer, but
implement their own TokenStream.
(Uwe Schindler, Robert Muir)
- LUCENE-5640: AttributeSource.AttributeFactory was moved to a
top-level class: org.apache.lucene.util.AttributeFactory
(Uwe Schindler, Robert Muir)
- LUCENE-4371: Removed IndexInputSlicer and Directory.createSlicer() and replaced
with IndexInput.slice().
(Robert Muir)
- LUCENE-5727, LUCENE-5678: Remove IndexOutput.seek, IndexOutput.setLength().
(Robert Muir, Uwe Schindler)
- API Changes (20)
- LUCENE-5756: IndexWriter now implements Accountable and IW#ramSizeInBytes()
has been deprecated in favor of IW#ramBytesUsed()
(Simon Willnauer)
- LUCENE-5725: MoreLikeThis#like now accepts multiple values per field.
The pre-existing method has been deprecated in favor of variable arguments
for the like text.
(Alex Ksikes via Simon Willnauer)
- LUCENE-5711: MergePolicy accepts an IndexWriter instance
on each method rather than holding state against a single
IndexWriter instance.
(Simon Willnauer)
- LUCENE-5582: Deprecate IndexOutput.length (just use
IndexOutput.getFilePointer instead) and IndexOutput.setLength.
(Mike McCandless)
- LUCENE-5621: Deprecate IndexOutput.flush: this is not used by Lucene.
(Robert Muir)
- LUCENE-5611: Simplified Lucene's default indexing chain / APIs.
AttributeSource/TokenStream.getAttribute now returns null if the
attribute is not present (previously it threw
IllegalArgumentException). StoredFieldsWriter.startDocument no
longer receives the number of fields that will be added
(Robert Muir, Mike McCandless)
- LUCENE-5632: In preparation for coming Lucene versions, the Version
enum constants were renamed to make them more readable. The constant
for Lucene 4.9 is now "LUCENE_4_9". Version.parseLeniently() is still
able to parse the old strings ("LUCENE_49"). The old identifiers got
deprecated and will be removed in Lucene 5.0.
(Uwe Schindler, Robert Muir)
- LUCENE-5633: Change NoMergePolicy to a singleton with no distinction between
compound and non-compound types.
(Shai Erera)
- LUCENE-5640: The Token class was deprecated. Since Lucene 2.9, TokenStreams
have used Attributes, so Token is no longer used.
(Uwe Schindler, Robert Muir)
- LUCENE-5679: Consolidated IndexWriter.deleteDocuments(Term) and
IndexWriter.deleteDocuments(Query) with their varargs counterparts.
(Shai Erera)
- LUCENE-5701: Core closed listeners are now available in the AtomicReader API,
they used to sit only in SegmentReader.
(Adrien Grand, Robert Muir)
- LUCENE-5706: Removed the option to unset a DocValues field through DocValues
updates.
(Shai Erera)
- LUCENE-5700: Added oal.util.Accountable that is now implemented by all
classes whose memory usage can be estimated.
(Robert Muir, Adrien Grand)
- LUCENE-5708: Remove IndexWriterConfig.clone, so now IndexWriter
simply uses the IndexWriterConfig you pass it, and you must create a
new IndexWriterConfig for each IndexWriter.
(Mike McCandless)
- LUCENE-5678: IndexOutput no longer allows seeking, so it is no longer required
to use RandomAccessFile to write indexes. Lucene now uses standard FileOutputStream
wrapped with OutputStreamIndexOutput to write index data. BufferedIndexOutput was
removed, because buffering and checksumming are provided by the JDK's
FilterOutputStreams.
(Uwe Schindler, Mike McCandless)
- LUCENE-5703: BinaryDocValues API changed to work like TermsEnum and not allocate/copy
bytes on each access; you are responsible for cloning if you want to keep
data around.
(Adrien Grand)
- LUCENE-5695: DocIdSet implements Accountable.
(Adrien Grand)
- LUCENE-5757: Moved RamUsageEstimator's reflection-based processing to RamUsageTester
in the test-framework module.
(Robert Muir)
- LUCENE-5761: Removed DiskDocValuesFormat; it was very inefficient and saved very little
RAM over the default codec.
(Robert Muir)
- LUCENE-5775: Deprecate JaspellLookup.
(Mike McCandless)
- Optimizations (18)
- LUCENE-5603: hunspell stemmer more efficiently strips prefixes
and suffixes.
(Robert Muir)
- LUCENE-5599: HttpReplicator did not properly delegate bulk read() to wrapped
InputStream.
(Christoph Kaser via Shai Erera)
- LUCENE-5591: pass an IOContext with estimated flush size when applying DV
updates.
(Shai Erera)
- LUCENE-5634: IndexWriter reuses TokenStream instances for String and Numeric
fields by default.
(Uwe Schindler, Shay Banon, Mike McCandless, Robert Muir)
- LUCENE-5638, LUCENE-5640: TokenStream uses a more performant AttributeFactory
by default, that packs the core attributes into one implementation
(PackedTokenAttributeImpl), for faster clearAttributes(), saveState(), and
restoreState(). In addition, AttributeFactory uses Java 7 MethodHandles for
instantiating Attribute implementations.
(Uwe Schindler, Robert Muir)
- LUCENE-5609: Changed the default NumericField precisionStep from 4
to 8 (for int/float) and 16 (for long/double), for faster indexing
time and smaller indices.
(Robert Muir, Uwe Schindler, Mike McCandless)
- LUCENE-5670: Add skip/FinalOutput to FST Outputs.
(Christian Ziech via Mike McCandless)
- LUCENE-4236: Optimize BooleanQuery's in-order scoring. This speeds up
some types of boolean queries.
(Robert Muir)
- LUCENE-5694: Don't score() subscorers in DisjunctionSumScorer or
DisjunctionMaxScorer unless score() is called.
(Robert Muir)
- LUCENE-5720: Optimize DirectPackedReader's decompression.
(Robert Muir)
- LUCENE-5722: Optimize ByteBufferIndexInput#seek() by specializing
implementations. This improves random access as used by docvalues codecs
if used with MMapDirectory.
(Robert Muir, Uwe Schindler)
- LUCENE-5730: FSDirectory.open returns MMapDirectory for 64-bit operating
systems, not just Linux and Windows.
(Robert Muir)
- LUCENE-5703: BinaryDocValues producers don't allocate or copy bytes on
each access anymore.
(Adrien Grand)
- LUCENE-5721: Monotonic compression doesn't use zig-zag encoding anymore.
(Robert Muir, Adrien Grand)
- LUCENE-5750: Speed up monotonic addressing for BINARY and SORTED_SET
docvalues.
(Robert Muir)
- LUCENE-5751: Speed up MemoryDocValues.
(Adrien Grand, Robert Muir)
- LUCENE-5767: OrdinalMap optimizations, that mostly help on low cardinalities.
(Martijn van Groningen, Adrien Grand)
- LUCENE-5769: SingletonSortedSetDocValues now supports random access ordinals.
(Robert Muir)
- Bug fixes (11)
- LUCENE-5738: Ensure NativeFSLock prevents opening the file channel for the
lock if the lock is already obtained by the JVM. Trying to obtain an already
obtained lock in the same JVM can unlock the file, which might allow other processes
to lock the file even without explicitly unlocking the FileLock. This behavior
is operating system dependent.
(Simon Willnauer)
- LUCENE-5673: MMapDirectory: Work around a "bug" in the JDK that throws
a confusing OutOfMemoryError wrapped inside IOException if the FileChannel
mapping failed because of lack of virtual address space. The IOException is
rethrown with more useful information about the problem, omitting the
incorrect OutOfMemoryError.
(Robert Muir, Uwe Schindler)
- LUCENE-5682: NPE in QueryRescorer when Scorer is null
(Joel Bernstein, Mike McCandless)
- LUCENE-5691: DocTermOrds lookupTerm(BytesRef) would return incorrect results
if the underlying TermsEnum supports ord() and the insertion point would
be at the end.
(Robert Muir)
- LUCENE-5618, LUCENE-5636: SegmentReader referenced unneeded files following
doc-values updates. Now doc-values field updates are written in separate file
per field.
(Shai Erera, Robert Muir)
- LUCENE-5684: Make best effort to detect invalid usage of Lucene,
when IndexReader is reopened after all files in its index were
removed and recreated by the application (the proper way to do
this is IndexWriter.deleteAll, or opening an IndexWriter with
OpenMode.CREATE)
(Mike McCandless)
- LUCENE-5704: Fix compilation error with Java 8u20.
(Uwe Schindler)
- LUCENE-5710: Include the inner exception as the cause and in the
exception message when an immense term is hit during indexing
(Lee Hinman via Mike McCandless)
- LUCENE-5724: CompoundFileWriter was failing to pass through the
IOContext in some cases, causing NRTCachingDirectory to cache
compound files when it shouldn't, then causing OOMEs.
(Mike McCandless)
- LUCENE-5747: Project-specific settings for the Eclipse development
environment will prevent automatic code reformatting.
(Shawn Heisey)
- LUCENE-5768, LUCENE-5777: Hunspell condition checks containing character classes
were buggy.
(Clinton Gormley, Robert Muir)
- Test Framework (2)
- LUCENE-5622: Fail tests if they print over the given limit of bytes to
System.out or System.err.
(Robert Muir, Dawid Weiss)
- LUCENE-5619: Added backwards compatibility tests to ensure we can update existing
indexes with doc-values updates.
(Shai Erera, Robert Muir)
- Build (2)
- LUCENE-5442: The Ant check-lib-versions target now runs Ivy resolution
transitively, then fails the build when it finds a version conflict: when a
transitive dependency's version is more recent than the direct dependency's
version specified in lucene/ivy-versions.properties. Exceptions are
specifiable in lucene/ivy-ignore-conflicts.properties.
(Steve Rowe)
- LUCENE-5715: Upgrade direct dependencies known to be older than transitive
dependencies: com.sun.jersey.version:1.8->1.9; com.sun.xml.bind:jaxb-impl:2.2.2->2.2.3-1;
commons-beanutils:commons-beanutils:1.7.0->1.8.3; commons-digester:commons-digester:2.0->2.1;
commons-io:commons-io:2.1->2.3; commons-logging:commons-logging:1.1.1->1.1.3;
io.netty:netty:3.6.2.Final->3.7.0.Final; javax.activation:activation:1.1->1.1.1;
javax.mail:mail:1.4.1->1.4.3; log4j:log4j:1.2.16->1.2.17; org.apache.avro:avro:1.7.4->1.7.5;
org.tukaani:xz:1.2->1.4; org.xerial.snappy:snappy-java:1.0.4.1->1.0.5
(Steve Rowe)
- Bug fixes (15)
- LUCENE-5639: Fix PositionLengthAttribute implementation in Token class.
(Uwe Schindler, Robert Muir)
- LUCENE-5635: IndexWriter didn't properly handle IOException on TokenStream.reset(),
which could leave the analyzer in an inconsistent state.
(Robert Muir)
- LUCENE-5599: HttpReplicator did not properly delegate bulk read() to wrapped
InputStream.
(Christoph Kaser via Shai Erera)
- LUCENE-5600: HttpClientBase did not properly consume a connection if a server
error occurred.
(Christoph Kaser via Shai Erera)
- LUCENE-5628: Change getFiniteStrings to iterative not recursive
implementation, so that building suggesters on a long suggestion
doesn't risk overflowing the stack; previously it consumed one Java
stack frame per character in the expanded suggestion. If you are building
a suggester this is a nasty trap.
(Robert Muir, Simon Willnauer, Mike McCandless)
- LUCENE-5559: Add additional argument validation for CapitalizationFilter
and CodepointCountFilter.
(Ahmet Arslan via Robert Muir)
- LUCENE-5641: SimpleRateLimiter would silently rate limit at 8 MB/sec
even if you asked for higher rates.
(Mike McCandless)
- LUCENE-5644: IndexWriter clears which threads use which internal
thread states on flush, so that if an application reduces how many
threads it uses for indexing, that results in a reduction of how
many segments are flushed on a full-flush (e.g. to obtain a
near-real-time reader).
(Simon Willnauer, Mike McCandless)
- LUCENE-5653: JoinUtil with ScoreMode.Avg on a multi-valued field
with more than 256 values would throw exception.
(Mikhail Khludnev via Robert Muir)
- LUCENE-5654: Fix various close() methods that could suppress
throwables such as OutOfMemoryError, instead returning scary messages
that look like index corruption.
(Mike McCandless, Robert Muir)
- LUCENE-5656: Fix rare fd leak in SegmentReader when multiple docvalues
fields have been updated with IndexWriter.updateXXXDocValue and one
hits exception.
(Shai Erera, Robert Muir)
- LUCENE-5660: AnalyzingSuggester.build will now throw IllegalArgumentException if
you give it a longer suggestion than it can handle
(Robert Muir, Mike McCandless)
- LUCENE-5662: Add missing checks to Field to prevent IndexWriter.abort
if a stored value is null.
(Robert Muir)
- LUCENE-5668: Fix off-by-one in TieredMergePolicy
(Mike McCandless)
- LUCENE-5671: Upgrade ICU version to fix an ICU concurrency problem that
could cause exceptions when indexing.
(feedly team, Robert Muir)
- System Requirements (1)
- LUCENE-4747, LUCENE-5514: Move to Java 7 as minimum Java version.
(Robert Muir, Uwe Schindler)
- Changes in Runtime Behavior (1)
- LUCENE-5472: IndexWriter.addDocument will now throw an IllegalArgumentException
if a Term to be indexed exceeds IndexWriter.MAX_TERM_LENGTH. To recreate previous
behavior of silently ignoring these terms, use LengthFilter in your Analyzer.
(hossman, Mike McCandless, Varun Thacker)
- New Features (24)
- LUCENE-5356: Morfologik filter can accept custom dictionary resources.
(Michal Hlavac, Dawid Weiss)
- LUCENE-5454: Add SortedSetSortField to lucene/sandbox, to allow sorting
on multi-valued field.
(Robert Muir)
- LUCENE-5478: CommonTermsQuery now allows creating custom term queries
similar to the query parser by overriding a newTermQuery method.
(Simon Willnauer)
- LUCENE-5477: AnalyzingInfixSuggester now supports near-real-time
additions and updates (to change weight or payload of an existing
suggestion).
(Mike McCandless)
- LUCENE-5482: Improve default TurkishAnalyzer by adding apostrophe
handling suitable for Turkish.
(Ahmet Arslan via Robert Muir)
- LUCENE-5479: FacetsConfig subclass can now customize the default
per-dim facets configuration.
(Rob Audenaerde via Mike McCandless)
- LUCENE-5485: Add circumfix support to HunspellStemFilter.
(Robert Muir)
- LUCENE-5224: Add iconv, oconv, and ignore support to HunspellStemFilter.
(Robert Muir)
- LUCENE-5493: SortingMergePolicy, and EarlyTerminatingSortingCollector
support arbitrary Sort specifications.
(Robert Muir, Mike McCandless, Adrien Grand)
- LUCENE-3758: Allow the ComplexPhraseQueryParser to search ordered or
unordered proximity queries.
(Ahmet Arslan via Erick Erickson)
- LUCENE-5530: ComplexPhraseQueryParser throws ParseException for fielded queries.
(Erick Erickson via Tomas Fernandez Lobbe and Ahmet Arslan)
- LUCENE-5513: Add IndexWriter.updateBinaryDocValue which lets
you update the value of a BinaryDocValuesField without reindexing the
document(s).
(Shai Erera)
- LUCENE-4072: Add ICUNormalizer2CharFilter, which lets you do unicode normalization
with offset correction before the tokenizer.
(David Goldfarb, Ippei UKAI via Robert Muir)
- LUCENE-5476: Add RandomSamplingFacetsCollector for computing facets on a sampled
set of matching hits, in cases where there are millions of hits.
(Rob Audenaerde, Gilad Barkai, Shai Erera)
- LUCENE-4984: Add SegmentingTokenizerBase, abstract class for tokenizers
that want to do two-pass tokenization such as by sentence and then by word.
(Robert Muir)
- LUCENE-5489: Add Rescorer/QueryRescorer, to resort the hits from a
first pass search using scores from a more costly second pass
search.
(Simon Willnauer, Robert Muir, Mike McCandless)
- LUCENE-5528: Add context to suggesters (InputIterator and Lookup
classes), and fix AnalyzingInfixSuggester to handle contexts.
Suggester contexts allow you to filter suggestions.
(Areek Zillur, Mike McCandless)
- LUCENE-5545: Add SortRescorer and Expression.getRescorer, to
resort the hits from a first pass search using a Sort or an
Expression.
(Simon Willnauer, Robert Muir, Mike McCandless)
- LUCENE-5558: Add TruncateTokenFilter which truncates terms to
the specified length.
(Ahmet Arslan via Robert Muir)
- LUCENE-2446: Added checksums to Lucene index files. As of 4.8, the last 8
bytes of each file contain a zlib-crc32 checksum. Small metadata files are
verified on load. Larger files can be checked on demand via
AtomicReader.checkIntegrity. You can configure this to happen automatically
before merges by enabling IndexWriterConfig.setCheckIntegrityAtMerge.
(Robert Muir)
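As a rough illustration of the trailing-checksum idea in LUCENE-2446, the sketch below appends a CRC32 of the payload as the final 8 bytes and verifies it on read. It is a simplified stand-in written only with java.util.zip.CRC32; the class and method names are hypothetical, and it does not reproduce Lucene's actual file footer layout or its CodecUtil code.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Illustrative only: a payload followed by an 8-byte CRC32 trailer (hypothetical class). */
final class ChecksumFooterSketch {

    /** Returns the payload followed by its CRC32, stored as a big-endian 8-byte long. */
    static byte[] withFooter(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + Long.BYTES)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /** Recomputes the CRC32 over everything except the last 8 bytes and compares it to the trailer. */
    static boolean verify(byte[] file) {
        if (file.length < Long.BYTES) {
            return false; // too short to even hold a trailer
        }
        int bodyLength = file.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(file, 0, bodyLength);
        long stored = ByteBuffer.wrap(file, bodyLength, Long.BYTES).getLong();
        return stored == crc.getValue();
    }
}
```

Calling ChecksumFooterSketch.verify on an unmodified array returns true; flipping any single bit of the body makes it return false, which is the kind of corruption an integrity check is meant to catch.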
- LUCENE-5580: Checksums are automatically verified on the default stored
fields format when performing a bulk merge.
(Adrien Grand)
- LUCENE-5602: Checksums are automatically verified on the default term
vectors format when performing a bulk merge.
(Adrien Grand, Robert Muir)
- LUCENE-5583: Added DataInput.skipBytes. ChecksumIndexInput can now seek, but
only forward.
(Adrien Grand, Mike McCandless, Simon Willnauer, Uwe Schindler)
- LUCENE-5588: Lucene now calls fsync() on the index directory, ensuring
that all file metadata is persisted on disk in case of power failure.
This does not work on all file systems and operating systems, but Linux
and MacOSX are known to work. On Windows, fsyncing a directory is not
possible with Java APIs.
(Mike McCandless, Uwe Schindler)
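One way to fsync a directory from Java, in the spirit of LUCENE-5588, is to open the directory itself as a read-only FileChannel and force it. The sketch below assumes that approach; the class name is hypothetical, and the method reports false instead of failing on platforms (such as Windows) where a directory cannot be opened this way.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only: best-effort directory fsync (hypothetical class). */
final class DirectoryFsyncSketch {

    /** Tries to fsync the directory entry metadata; returns false if the platform refuses. */
    static boolean fsyncDirectory(Path dir) {
        if (!Files.isDirectory(dir)) {
            throw new IllegalArgumentException("not a directory: " + dir);
        }
        try (FileChannel channel = FileChannel.open(dir, StandardOpenOption.READ)) {
            channel.force(true); // flush directory metadata to the storage device
            return true;
        } catch (IOException e) {
            // Some platforms cannot open a directory as a channel; treat as "not supported".
            return false;
        }
    }
}
```

Returning a boolean rather than throwing keeps the call best-effort, matching the note above that directory fsync is not available on every file system and operating system.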
- API Changes (10)
- LUCENE-5454: Add RandomAccessOrds, an optional extension of SortedSetDocValues
that supports random access to the ordinals in a document.
(Robert Muir)
- LUCENE-5468: Move offline Sort (from suggest module) to OfflineSort.
(Robert Muir)
- LUCENE-5493: SortingMergePolicy and EarlyTerminatingSortingCollector take
Sort instead of Sorter. BlockJoinSorter is removed, replaced with
BlockJoinComparatorSource, which can take a Sort for ordering of parents
and a separate Sort for ordering of children within a block.
(Robert Muir, Mike McCandless, Adrien Grand)
- LUCENE-5516: MergeScheduler#merge() now accepts a MergeTrigger as well as
a boolean that indicates if a new merge was found in the caller thread before
the scheduler was called.
(Simon Willnauer)
- LUCENE-5487: Separated bulk scorer (new Weight.bulkScorer method) from
normal scoring (Weight.scorer) for those queries that can do bulk
scoring more efficiently, e.g. BooleanQuery in some cases. This
also simplified the Weight.scorer API by removing the two confusing
booleans.
(Robert Muir, Uwe Schindler, Mike McCandless)
- LUCENE-5519: TopNSearcher now allows retrieving incomplete results if the max
size of the candidate queue is unknown. The queue can still be bounded in order
to apply pruning while retrieving the top N, but it will not throw an exception if
too many results are rejected to guarantee an absolutely correct top N result.
The TopNSearcher now returns a struct-like class that indicates whether the result
is complete in the sense of the top N. Consumers of this API should assert
on the completeness if the bounded queue size is known ahead of time.
(Simon Willnauer)
- LUCENE-4984: Deprecate ThaiWordFilter and smartcn SentenceTokenizer and WordTokenFilter.
These filters would not work correctly with CharFilters and could not be safely placed
at an arbitrary position in the analysis chain. Use ThaiTokenizer and HMMChineseTokenizer
instead.
(Robert Muir)
- LUCENE-5543: Remove/deprecate Directory.fileExists
(Mike McCandless)
- LUCENE-5573: Move docvalues constants and helper methods to o.a.l.index.DocValues.
(Dawid Weiss, Robert Muir)
- LUCENE-5604: Switched BytesRef.hashCode to MurmurHash3 (32 bit).
TermToBytesRefAttribute.fillBytesRef no longer returns the hash
code. BytesRefHash now uses MurmurHash3 for its hashing.
(Robert Muir, Mike McCandless)
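For readers unfamiliar with the hash named in LUCENE-5604, here is a standalone sketch of the public MurmurHash3 x86 32-bit algorithm over a byte slice. The class name and the seed values used below are assumptions for illustration; this is not Lucene's own code, and the seed Lucene uses internally is not shown here.

```java
/** Illustrative only: MurmurHash3 x86 32-bit over a byte slice (hypothetical class). */
final class Murmur3Sketch {

    static int hash32(byte[] data, int offset, int len, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int roundedEnd = offset + (len & 0xfffffffc); // end of the last full 4-byte block

        // Body: mix one little-endian 4-byte block at a time.
        for (int i = offset; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | (data[i + 3] << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: mix the remaining 0 to 3 bytes.
        int k1 = 0;
        switch (len & 0x03) {
            case 3:
                k1 = (data[roundedEnd + 2] & 0xff) << 16;
                // fall through
            case 2:
                k1 |= (data[roundedEnd + 1] & 0xff) << 8;
                // fall through
            case 1:
                k1 |= (data[roundedEnd] & 0xff);
                k1 *= c1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= c2;
                h1 ^= k1;
                break;
            default:
                break;
        }

        // Finalization: fold in the length and avalanche the bits.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }
}
```

Because the function takes an offset and a length, a slice of a larger array hashes to the same value as a standalone copy of the same bytes, which suits slice-style byte references.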
- Optimizations (4)
- LUCENE-5468: HunspellStemFilter uses 10 to 100x less RAM. It also loads
all known openoffice dictionaries without error, and supports an additional
longestOnly option for a less aggressive approach.
(Robert Muir)
- LUCENE-4848: Use Java 7 NIO2-FileChannel instead of RandomAccessFile
for NIOFSDirectory and MMapDirectory. This allows deleting open files
on Windows if NIOFSDirectory is used; mmapped files are still locked.
(Michael Poindexter, Robert Muir, Uwe Schindler)
- LUCENE-5515: Improved TopDocs#merge to create a merged ScoreDoc
array whose length is at most the specified size, instead of at most
from + size as before.
(Martijn van Groningen)
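The sizing rule in LUCENE-5515 can be sketched with plain score arrays: merge the per-shard lists, skip the first `from` hits, and allocate a result no longer than `size`. The class below is a hypothetical simplification that works on bare scores rather than on TopDocs and ScoreDoc objects.

```java
import java.util.PriorityQueue;

/** Illustrative only: paged merge of per-shard score lists (hypothetical class). */
final class MergeTopHitsSketch {

    /** Each shard's scores must be sorted descending. Returns hits from..from+size-1 of the merged order. */
    static double[] merge(int from, int size, double[][] shardScores) {
        // Queue entries are {shardIndex, positionInShard}; the highest current score comes first.
        PriorityQueue<int[]> queue = new PriorityQueue<>(
                (a, b) -> Double.compare(shardScores[b[0]][b[1]], shardScores[a[0]][a[1]]));
        int available = 0;
        for (int shard = 0; shard < shardScores.length; shard++) {
            available += shardScores[shard].length;
            if (shardScores[shard].length > 0) {
                queue.add(new int[] {shard, 0});
            }
        }

        int requested = Math.min(available, from + size);
        int resultLength = Math.max(0, requested - from); // never larger than size
        double[] merged = new double[resultLength];
        if (resultLength == 0) {
            return merged;
        }
        for (int rank = 0; rank < requested; rank++) {
            int[] top = queue.poll();
            if (rank >= from) {
                merged[rank - from] = shardScores[top[0]][top[1]];
            }
            if (top[1] + 1 < shardScores[top[0]].length) {
                queue.add(new int[] {top[0], top[1] + 1});
            }
        }
        return merged;
    }
}
```

The hits before `from` are still consumed from the queue, but they are never stored, so the returned array holds only the requested page.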
- LUCENE-5529: Spatial search of non-point indexed shapes should be a little
faster due to skipping intersection tests on redundant cells.
(David Smiley)
- Bug fixes (14)
- LUCENE-5483: Fix inaccuracies in HunspellStemFilter. Multi-stage affix-stripping,
prefix-suffix dependencies, and COMPLEXPREFIXES now work correctly according
to the hunspell algorithm. Removed recursionCap parameter, as it's no longer needed: rules for
recursive affix application are driven correctly by continuation classes in the affix file.
(Robert Muir)
- LUCENE-5497: HunspellStemFilter properly handles escaped terms and affixes without conditions.
(Robert Muir)
- LUCENE-5505: HunspellStemFilter ignores BOM markers in dictionaries and handles varying
types of whitespace in SET/FLAG commands.
(Robert Muir)
- LUCENE-5507: Fix HunspellStemFilter loading of dictionaries with large numbers of aliases
etc before the encoding declaration.
(Robert Muir)
- LUCENE-5111: Fix WordDelimiterFilter to return offsets in correct order.
(Robert Muir)
- LUCENE-5555: Fix SortedInputIterator to correctly encode/decode contexts in presence of payload
(Areek Zillur)
- LUCENE-5559: Add missing argument checks to tokenfilters taking
numeric arguments.
(Ahmet Arslan via Robert Muir)
- LUCENE-5568: Benchmark module's "default.codec" option didn't work.
(David Smiley)
- SOLR-5983: HTMLStripCharFilter is treating CDATA sections incorrectly.
(Dan Funk, Steve Rowe)
- LUCENE-5615: Validate per-segment delete counts at write time, to
help catch bugs that might otherwise cause corruption
(Mike McCandless)
- LUCENE-5612: NativeFSLockFactory no longer deletes its lock file. This cannot be done
safely without the risk of deleting someone else's lock file. If you use NativeFSLockFactory,
you may see write.lock hanging around from time to time: it's harmless.
(Uwe Schindler, Mike McCandless, Robert Muir)
- LUCENE-5624: Ensure NativeFSLockFactory does not leak file handles if it is unable
to obtain the lock.
(Uwe Schindler, Robert Muir)
- LUCENE-5626: Fix bug in SimpleFSLockFactory's obtain() that sometimes threw
IOException (ERROR_ACCESS_DENIED) on Windows if the lock file was created
concurrently. This error is now handled the same way as in NativeFSLockFactory,
by returning false.
(Uwe Schindler, Robert Muir, Dawid Weiss)
- LUCENE-5630: Add missing META-INF entry for UpperCaseFilterFactory.
(Robert Muir)
- Tests (1)
- LUCENE-5630: Fix TestAllAnalyzersHaveFactories to correctly check for existence
of class and corresponding Map<String,String> ctor.
(Uwe Schindler, Robert Muir)
- Test Framework (5)
- LUCENE-5592: Incorrectly reported uncloseable files.
(Dawid Weiss)
- LUCENE-5577: Temporary folder and file management (and cleanup facilities)
(Mark Miller, Uwe Schindler, Dawid Weiss)
- LUCENE-5567: When a suite fails with zombie threads, the failure marker and count
are not propagated properly.
(Dawid Weiss)
- LUCENE-5449: Rename _TestUtil and _TestHelper to remove the leading _.
- LUCENE-5501: Added random out-of-order collection testing (when the collector
supports it) to AssertingIndexSearcher.
(Adrien Grand)
- Build (4)
- LUCENE-5463: RamUsageEstimator.(human)sizeOf(Object) is now a forbidden API.
(Adrien Grand, Robert Muir)
- LUCENE-5512: Remove redundant typing (use diamond operator) throughout
the codebase.
(Furkan KAMACI via Robert Muir)
- LUCENE-5614: Enable building on Java 8 using Apache Ant 1.8.3 or 1.8.4
by adding a workaround for the Ant bug.
(Uwe Schindler)
- LUCENE-5612: Add a new Ant target in lucene/core to test LockFactory
implementations: "ant test-lock-factory".
(Uwe Schindler, Mike McCandless, Robert Muir)
- Documentation (1)
- LUCENE-5534: Add javadocs to GreekStemmer methods.
(Stamatis Pitsios via Robert Muir)
- Bug Fixes (2)
- LUCENE-5574: Closing a near-real-time reader no longer attempts to
delete unreferenced files if the original writer has been closed;
this could cause index corruption in certain cases where index files
were directly changed (deleted, overwritten, etc.) in the index
directory outside of Lucene.
(Simon Willnauer, Shai Erera, Robert Muir, Mike McCandless)
- LUCENE-5570: Don't let FSDirectory.sync() create new zero-byte files, instead throw
exception if a file is missing.
(Uwe Schindler, Mike McCandless, Robert Muir)
- Changes in Runtime Behavior (1)
- LUCENE-5532: AutomatonQuery.equals is no longer implemented as "accepts same language".
This was inconsistent with hashCode, and unnecessary for any subclasses in Lucene.
If you desire this in a custom subclass, minimize the automaton.
(Robert Muir)
- Bug Fixes (14)
- LUCENE-5450: Fix getField() NPE issues with SpanOr/SpanNear when they have an
empty list of clauses. This can happen for example, when a wildcard matches
no terms.
(Tim Allison via Robert Muir)
- LUCENE-5473: Throw IllegalArgumentException, not
NullPointerException, if the synonym map is empty when creating
SynonymFilter
(帅广应 via Mike McCandless)
- LUCENE-5432: EliasFanoDocIdSet: Fix number of index entry bits when the maximum
entry is a power of 2.
(Paul Elschot via Adrien Grand)
- LUCENE-5466: query is always null in countDocsWithClass() of SimpleNaiveBayesClassifier.
(Koji Sekiguchi)
- LUCENE-5502: Fixed TermsFilter.equals that could return true for different
filters.
(Igor Motov via Adrien Grand)
- LUCENE-5522: FacetsConfig didn't add drill-down terms for association facet
fields labels.
(Shai Erera)
- LUCENE-5520: ToChildBlockJoinQuery would hit
ArrayIndexOutOfBoundsException if a parent document had no children
(Sally Ang via Mike McCandless)
- LUCENE-5532: AutomatonQuery.hashCode was not thread-safe.
(Robert Muir)
- LUCENE-5525: Implement MultiFacets.getAllDims, so you can do sparse
facets through DrillSideways, for example.
(Jose Peleteiro, Mike McCandless)
- LUCENE-5481: IndexWriter.forceMerge used to run a merge even if there was a
single segment in the index.
(Adrien Grand, Mike McCandless)
- LUCENE-5538: Fix FastVectorHighlighter bug with index-time synonyms when the
query is more complex than a single phrase.
(Robert Muir)
- LUCENE-5544: Exceptions during IndexWriter.rollback could leak file handles
and the write lock.
(Robert Muir)
- LUCENE-4978: Spatial RecursivePrefixTree queries could result in false-negatives for
indexed shapes within 1/2 maxDistErr from the edge of the query shape. This meant
searching for a point by the same point as a query rarely worked.
(David Smiley)
- LUCENE-5553: IndexReader#ReaderClosedListener is not always invoked when
IndexReader#close() is called or if refCount is 0. If an exception is
thrown during internal close or on any of the close listeners some or all
listeners might be missed. This can cause memory leaks if the core listeners
are used to clear caches.
(Simon Willnauer)
- Build (1)
- LUCENE-5511: "ant precommit" / "ant check-svn-working-copy" now work again
with any working copy format (thanks to svnkit 1.8.4).
(Uwe Schindler)
- New Features (25)
- LUCENE-5336: Add SimpleQueryParser: parser for human-entered queries.
(Jack Conradson via Robert Muir)
- LUCENE-5337: Add Payload support to FileDictionary (Suggest) and make it more
configurable
(Areek Zillur via Erick Erickson)
- LUCENE-5329: suggest: DocumentDictionary and
DocumentExpressionDictionary are now lenient for dirty documents
(missing the term, weight or payload).
(Areek Zillur via Mike McCandless)
- LUCENE-5404: Add .getCount method to all suggesters (Lookup); persist count
metadata on .store(); Dictionary returns InputIterator; Dictionary.getWordIterator
renamed to .getEntryIterator.
(Areek Zillur)
- SOLR-1871: The RangeMapFloatFunction accepts an arbitrary ValueSource
as target and default values.
(Chris Harris, shalin)
- LUCENE-5371: Speed up Lucene range faceting from O(N) per hit to
O(log(N)) per hit using segment trees; this only really starts to
matter in practice if the number of ranges is over 10 or so.
(Mike McCandless)
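To give a feel for how per-hit cost in LUCENE-5371 can drop from O(N) to O(log(N)) in the number of ranges, the sketch below uses a simpler substitute for the segment tree: it splits the value space into elementary intervals at every range boundary, locates each hit's interval by binary search, and then derives each range's count from prefix sums. The class and method names are hypothetical, and this is not the data structure Lucene uses.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Illustrative only: range counting via elementary intervals and binary search (hypothetical class). */
final class RangeCountSketch {

    /** ranges[i] = {minInclusive, maxInclusive}; returns how many values fall inside each range. */
    static int[] count(long[][] ranges, long[] values) {
        // Every range start and (end + 1) is a boundary between elementary intervals.
        TreeSet<Long> points = new TreeSet<>();
        for (long[] range : ranges) {
            if (range[0] > range[1]) {
                throw new IllegalArgumentException("min > max: " + Arrays.toString(range));
            }
            points.add(range[0]);
            points.add(Math.addExact(range[1], 1));
        }
        long[] boundaries = points.stream().mapToLong(Long::longValue).toArray();

        // perInterval[i] counts values in [boundaries[i], boundaries[i + 1]).
        int[] perInterval = new int[Math.max(0, boundaries.length - 1)];
        for (long value : values) {
            int pos = Arrays.binarySearch(boundaries, value); // O(log N) per hit
            int interval = pos >= 0 ? pos : -pos - 2;          // index of the last boundary <= value
            if (interval >= 0 && interval < perInterval.length) {
                perInterval[interval]++;
            }
        }

        // Prefix sums turn each range's total into a single subtraction.
        int[] prefix = new int[perInterval.length + 1];
        for (int i = 0; i < perInterval.length; i++) {
            prefix[i + 1] = prefix[i] + perInterval[i];
        }

        int[] counts = new int[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            int lo = Arrays.binarySearch(boundaries, ranges[i][0]);
            int hi = Arrays.binarySearch(boundaries, Math.addExact(ranges[i][1], 1));
            counts[i] = prefix[hi] - prefix[lo];
        }
        return counts;
    }
}
```

Overlapping ranges are handled naturally, because a hit is counted once in its elementary interval and every range covering that interval picks it up during the roll-up.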
- LUCENE-5379: Add Analyzer for Kurdish.
(Robert Muir)
- LUCENE-5369: Added an UpperCaseFilter to make UPPERCASE tokens.
(ryan)
- LUCENE-5345: Add a new BlendedInfixSuggester, which is like
AnalyzingInfixSuggester but boosts suggestions that matched tokens
with lower positions.
(Remi Melisson via Mike McCandless)
- LUCENE-5399: When sorting by String (SortField.STRING), you can now
specify whether missing values should be sorted first (the default),
using SortField.setMissingValue(SortField.STRING_FIRST), or last,
using SortField.setMissingValue(SortField.STRING_LAST).
(Rob Muir, Mike McCandless)
- LUCENE-5099: QueryNode should have the ability to detach from its node
parent. Added QueryNode.removeFromParent() that allows nodes to be
detached from their parent node.
(Adriano Crestani)
- LUCENE-5395 LUCENE-5451: Upgrade to Spatial4j 0.4.1: Parses WKT (including
ENVELOPE) with extension "BUFFER"; buffering a point results in a Circle.
JTS isn't needed for WKT any more but remains required for Polygons. New
Shapes: ShapeCollection and BufferedLineString. Various other improvements and
bug fixes too. More info:
https://github.com/spatial4j/spatial4j/blob/master/CHANGES.md
(David Smiley)
- LUCENE-5415: Add multitermquery (wildcards, prefix, etc.) to PostingsHighlighter.
(Mike McCandless, Robert Muir)
- LUCENE-3069: Add two memory resident dictionaries (FST terms dictionary and
FSTOrd terms dictionary) to improve primary key lookups. The PostingsBaseFormat
API is also changed so that term dictionaries get the ability to block
encode term metadata, and all dictionary implementations can now plug in any
PostingsBaseFormat.
(Han Jiang, Mike McCandless)
- LUCENE-5353: ShingleFilter's filler token should be configurable.
(Ahmet Arslan, Simon Willnauer, Steve Rowe)
- LUCENE-5320: Add SearcherTaxonomyManager over search and taxonomy index
directories (i.e. not only NRT).
(Shai Erera)
- LUCENE-5410: Add fuzzy and near support via '~' operator to SimpleQueryParser.
(Lee Hinman via Robert Muir)
- LUCENE-5426: Make SortedSetDocValuesReaderState abstract to allow
custom implementations for Lucene doc values faceting
(John Wang via Mike McCandless)
- LUCENE-5434: NRT support for file systems that do not have delete on last
close or cannot delete while referenced semantics.
(Mark Miller, Mike McCandless)
- LUCENE-5418: Drilling down or sideways on a Lucene facet range
(using Range.getFilter()) is now faster for costly filters (uses
random access, not iteration); range facet counts now accept a
fast-match filter to avoid computing the value for documents that
are out of bounds, e.g. using a bounding box filter with distance
range faceting.
(Mike McCandless)
- LUCENE-5440: Add LongBitSet for managing more than 2.1B bits (otherwise use
FixedBitSet).
(Shai Erera)
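The idea behind a bit set addressed by long indexes, as in LUCENE-5440, can be sketched with a long[] of 64-bit words. The class below is a hypothetical minimal version for illustration and is not Lucene's LongBitSet.

```java
/** Illustrative only: a fixed-size bit set addressed by long indexes (hypothetical class). */
final class LongBitSetSketch {
    private final long[] words;
    private final long numBits;

    LongBitSetSketch(long numBits) {
        if (numBits < 0) {
            throw new IllegalArgumentException("numBits must be >= 0");
        }
        long numWords = (numBits + 63) >>> 6; // 64 bits per word, rounded up
        if (numWords > Integer.MAX_VALUE - 8) {
            throw new IllegalArgumentException("too many bits for a single long[]");
        }
        this.numBits = numBits;
        this.words = new long[(int) numWords];
    }

    void set(long index) {
        checkIndex(index);
        words[(int) (index >>> 6)] |= 1L << index; // shift distance uses the low 6 bits of index
    }

    boolean get(long index) {
        checkIndex(index);
        return (words[(int) (index >>> 6)] & (1L << index)) != 0;
    }

    long cardinality() {
        long sum = 0;
        for (long word : words) {
            sum += Long.bitCount(word);
        }
        return sum;
    }

    private void checkIndex(long index) {
        if (index < 0 || index >= numBits) {
            throw new IndexOutOfBoundsException("index=" + index + ", numBits=" + numBits);
        }
    }
}
```

Since each array slot holds 64 bits, a single long[] can address far more bits than an int index can express, which is the point of using long indexes.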
- LUCENE-5437: ASCIIFoldingFilter now has an option to preserve the original token
and emit it on the same position as the folded token only if the actual token was
folded.
(Simon Willnauer, Nik Everett)
- LUCENE-5408: Add spatial SerializedDVStrategy that serializes a binary
representation of a shape into BinaryDocValues. It supports exact geometry
relationship calculations.
(David Smiley)
- LUCENE-5457: Add SloppyMath.earthDiameter(double latitude) that returns an
approximate value of the diameter of the earth at the given latitude.
(Adrien Grand)
- LUCENE-5979: FilteredQuery uses the cost API to decide on whether to use
random-access or leap-frog to intersect the filter with the query.
(Adrien Grand)
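The two intersection strategies named in LUCENE-5979 can be contrasted on plain sorted doc ID arrays. In the sketch below, leap-frog alternately advances whichever side is behind, while random access walks the query's docs and tests each one against a bit set. The class is hypothetical, doc IDs are assumed strictly increasing, and the cost-based choice between the two strategies is not modeled.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Illustrative only: two ways to intersect a query with a filter (hypothetical class). */
final class FilterIntersectSketch {

    /** Leap-frog: both inputs are sorted, strictly increasing doc IDs. */
    static int[] leapFrog(int[] queryDocs, int[] filterDocs) {
        int[] out = new int[Math.min(queryDocs.length, filterDocs.length)];
        int n = 0;
        int i = 0;
        int j = 0;
        while (i < queryDocs.length && j < filterDocs.length) {
            if (queryDocs[i] == filterDocs[j]) {
                out[n++] = queryDocs[i];
                i++;
                j++;
            } else if (queryDocs[i] < filterDocs[j]) {
                i = advance(queryDocs, i + 1, filterDocs[j]); // jump the query side forward
            } else {
                j = advance(filterDocs, j + 1, queryDocs[i]); // jump the filter side forward
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Random access: iterate the query's docs and test each against the filter's bits. */
    static int[] randomAccess(int[] queryDocs, BitSet filterBits) {
        int[] out = new int[queryDocs.length];
        int n = 0;
        for (int doc : queryDocs) {
            if (filterBits.get(doc)) {
                out[n++] = doc;
            }
        }
        return Arrays.copyOf(out, n);
    }

    /** Returns the index of the first element >= target at or after position from. */
    private static int advance(int[] docs, int from, int target) {
        int pos = Arrays.binarySearch(docs, from, docs.length, target);
        return pos >= 0 ? pos : -pos - 1;
    }
}
```

Random access tends to pay off when testing a single doc against the filter is cheap, while leap-frog avoids visiting docs that the sparser side can skip over.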
- Build (11)
- LUCENE-5217,LUCENE-5420: Maven config: get dependencies from Ant+Ivy config;
disable transitive dependency resolution for all depended-on artifacts by
putting an exclusion for each transitive dependency in the
<dependencyManagement> section of the grandparent POM.
(Steve Rowe)
- LUCENE-5322: Clean up / simplify Maven-related Ant targets.
(Steve Rowe)
- LUCENE-5347: Upgrade forbidden-apis checker to version 1.4.
(Uwe Schindler)
- LUCENE-4381: Upgrade analysis/icu to 52.1.
(Robert Muir)
- LUCENE-5357: Upgrade StandardTokenizer and UAX29URLEmailTokenizer to
Unicode 6.3; update UAX29URLEmailTokenizer's recognized top level
domains in URLs and Emails from the IANA Root Zone Database.
(Steve Rowe)
- LUCENE-5360: Add support for developing in Netbeans IDE.
(Michal Hlavac, Uwe Schindler, Steve Rowe)
- SOLR-5590: Upgrade HttpClient/HttpComponents to 4.3.x.
(Karl Wright via Shawn Heisey)
- LUCENE-5385: "ant precommit" / "ant check-svn-working-copy" now work
for SVN 1.8 or GIT checkouts. The ANT target prints a warning instead
of failing. It also instructs the user how to run on SVN 1.8 working
copies.
(Robert Muir, Uwe Schindler)
- LUCENE-5383: fix changes2html to link pull requests
(Steve Rowe)
- LUCENE-5411: Upgrade to released JFlex 1.5.0; stop requiring
a locally built JFlex snapshot jar.
(Steve Rowe)
- LUCENE-5465: Solr Contrib "map-reduce" breaks Manifest of all other
JAR files by adding a broken Main-Class attribute.
(Uwe Schindler, Steve Rowe)
- Bug fixes (14)
- LUCENE-5285: Improved highlighting of multi-valued fields with
FastVectorHighlighter.
(Nik Everett via Adrien Grand)
- LUCENE-5391: UAX29URLEmailTokenizer should not tokenize no-scheme
domain-only URLs that are followed by an alphanumeric character.
(Chris Geeringh, Steve Rowe)
- LUCENE-5405: If an analysis component throws an exception, Lucene
logs the field name to the info stream to assist in
diagnosis.
(Benson Margulies)
- SOLR-5661: PriorityQueue now refuses to allocate itself if the
incoming maxSize is too large
(Raintung Li via Mike McCandless)
- LUCENE-5228: IndexWriter.addIndexes(Directory[]) now acquires a
write lock in each Directory, to ensure that no open IndexWriter is
changing the incoming indices. This also means that you cannot pass
the same Directory to multiple concurrent addIndexes calls (which is
unusual anyway).
(Robert Muir, Mike McCandless)
- LUCENE-5415: SpanMultiTermQueryWrapper didn't handle its boost in
hashcode/equals/tostring/rewrite.
(Robert Muir)
- LUCENE-5409: ToParentBlockJoinCollector.getTopGroups would fail to
return any groups when the joined query required more than one
rewrite step
(Peng Cheng via Mike McCandless)
- LUCENE-5398: NormValueSource was incorrectly casting the long value
to byte, before calling Similarity.decodeNormValue.
(Peng Cheng via
Mike McCandless)
- LUCENE-5436: ReferenceManager#acquire can result in an infinite loop if the
managed resource is abused outside of the ReferenceManager. Decrementing
the reference without a corresponding incRef() call can cause an infinite
loop. ReferenceManager now throws IllegalStateException if the currently managed
resource's ref count is 0.
(Simon Willnauer)
- LUCENE-5443: Lucene45DocValuesProducer.ramBytesUsed() may throw
ConcurrentModificationException.
(Shai Erera, Simon Willnauer)
- LUCENE-5444: MemoryIndex didn't respect the analyzer's offset gap and
offsets were corrupted if multiple fields with the same name were
added to the memory index.
(Britta Weber, Simon Willnauer)
- LUCENE-5447: StandardTokenizer should break at consecutive chars matching
Word_Break = MidLetter, MidNum and/or MidNumLet
(Steve Rowe)
- LUCENE-5462: RamUsageEstimator.sizeOf(Object) is not used anymore to
estimate memory usage of segments. This used to make
SegmentReader.ramBytesUsed very CPU-intensive.
(Adrien Grand)
- LUCENE-5461: ControlledRealTimeReopenThread would sometimes wait too
long (up to targetMaxStaleSec) when a searcher is waiting for a
specific generation, when it should have waited for at most
targetMinStaleSec.
(Hans Lund via Mike McCandless)
- API Changes (6)
- LUCENE-5339: The facet module was simplified/reworked to make the
APIs more approachable to new users. Note: when migrating to the new
API, you must pass the Document that is returned from FacetsConfig.build()
to IndexWriter.addDocument().
(Shai Erera, Gilad Barkai, Rob Muir, Mike McCandless)
- LUCENE-5405: Make ShingleAnalyzerWrapper.getWrappedAnalyzer() public final
(gsingers)
- LUCENE-5395: The SpatialArgsParser now only reads WKT, no more "lat, lon"
etc. but it's easy to override the parseShape method if you wish.
(David Smiley)
- LUCENE-5414: DocumentExpressionDictionary was renamed to
DocumentValueSourceDictionary and all dependencies to the lucene-expression
module were removed from lucene-suggest. DocumentValueSourceDictionary now
only accepts a ValueSource instead of a convenience ctor for an expression
string.
(Simon Willnauer)
- LUCENE-3069: PostingsWriterBase and PostingsReaderBase are no longer
responsible for encoding/decoding a block of terms. Instead, they
should encode/decode each term to/from a long[] and byte[].
(Han Jiang, Mike McCandless)
- LUCENE-5425: FacetsCollector and MatchingDocs use a general DocIdSet,
allowing for custom implementations to be used when faceting.
(John Wang, Lei Wang, Shai Erera)
- Optimizations (3)
- LUCENE-5372: Replace StringBuffer by StringBuilder, where possible.
(Joshua Hartman via Uwe Schindler, Dawid Weiss, Mike McCandless)
- LUCENE-5271: A slightly more accurate SloppyMath distance.
(Gilad Barkai via Ryan Ernst)
- LUCENE-5399: Deep paging using IndexSearcher.searchAfter when
sorting by fields is faster
(Rob Muir, Mike McCandless)
- Changes in Runtime Behavior (1)
- LUCENE-5362: IndexReader and SegmentCoreReaders now throw
AlreadyClosedException if the refCount is incremented but
is less than 1.
(Simon Willnauer)
- Documentation (2)
- LUCENE-5384: Add some tips for making tokenfilters and tokenizers
to the analysis package overview.
(Benson Margulies via Robert Muir - pull request #12)
- LUCENE-5389: Add more guidance in the analysis documentation
package overview.
(Benson Margulies via Robert Muir - pull request #14)
- Bug fixes (8)
- LUCENE-5373: Memory usage of
[Lucene40/Lucene42/Memory/Direct]DocValuesFormat was over-estimated.
(Shay Banon, Adrien Grand, Robert Muir)
- LUCENE-5361: Fixed handling of query boosts in FastVectorHighlighter.
(Nik Everett via Adrien Grand)
- LUCENE-5374: IndexWriter processes internal events after it
closed itself internally. This rare condition can happen if an
IndexWriter has internal changes that were not fully applied yet
like when index / flush requests happen concurrently to the close or
rollback call.
(Simon Willnauer)
- LUCENE-5394: Fix TokenSources.getTokenStream to return payloads if
they were indexed with the term vectors.
(Mike McCandless)
- LUCENE-5344: Flexible StandardQueryParser behaves differently than
ClassicQueryParser.
(Adriano Crestani)
- LUCENE-5375: ToChildBlockJoinQuery works harder to detect mis-use,
when the parent query incorrectly returns child documents, and throw
a clear exception saying so.
(Dr. Oleg Savrasov via Mike McCandless)
- LUCENE-5401: Field.StringTokenStream#end() calls super.end() now,
preventing wrong term positions for fields that use
StringTokenStream.
(Michael Busch)
- LUCENE-5377: IndexWriter.addIndexes(Directory[]) would cause corruption
on Lucene 4.6 if any index segments were Lucene 4.0-4.5.
(Littlestar, Mike McCandless, Shai Erera, Robert Muir)
- New Features (23)
- LUCENE-4906: PostingsHighlighter can now render to custom Object,
for advanced use cases where String is too restrictive
(Luca Cavanna, Robert Muir, Mike McCandless)
- LUCENE-5133: Changed AnalyzingInfixSuggester.highlight to return
Object instead of String, to allow for advanced use cases where
String is too restrictive
(Robert Muir, Shai Erera, Mike McCandless)
- LUCENE-5207, LUCENE-5334: Added expressions module for customizing ranking
with script-like syntax.
(Jack Conradson, Ryan Ernst, Uwe Schindler via Robert Muir)
- LUCENE-5180: ShingleFilter now creates shingles with trailing holes,
for example if a StopFilter had removed the last token.
(Mike McCandless)
- LUCENE-5219: Add support to SynonymFilterFactory for custom
parsers.
(Ryan Ernst via Robert Muir)
- LUCENE-5235: Tokenizers now throw an IllegalStateException if the
consumer does not call reset() before consuming the stream. Previous
versions threw NullPointerException or ArrayIndexOutOfBoundsException
on best effort which was not user-friendly.
(Uwe Schindler, Robert Muir)
- LUCENE-5240: Tokenizers now throw an IllegalStateException if the
consumer neglects to call close() on the previous stream before consuming
the next one.
(Uwe Schindler, Robert Muir)
- LUCENE-5214: Add new FreeTextSuggester, to predict the next word
using a simple ngram language model. This is useful for the "long
tail" suggestions, when a primary suggester fails to find a
suggestion.
(Mike McCandless)
- LUCENE-5251: New DocumentDictionary allows building suggesters via
contents of existing field, weight and optionally payload stored
fields in an index
(Areek Zillur via Mike McCandless)
- LUCENE-5261: Add QueryBuilder, a simple API to build queries from
the analysis chain directly, or to make it easier to implement
query parsers.
(Robert Muir, Uwe Schindler)
- LUCENE-5270: Add Terms.hasFreqs, to determine whether a given field
indexed per-doc term frequencies.
(Mike McCandless)
- LUCENE-5269: Add CodepointCountFilter.
(Robert Muir)
- LUCENE-5294: Suggest module: add DocumentExpressionDictionary to
compute each suggestion's weight using a javascript expression.
(Areek Zillur via Mike McCandless)
- LUCENE-5274: FastVectorHighlighter now supports highlighting against several
indexed fields.
(Nik Everett via Adrien Grand)
- LUCENE-5304: SingletonSortedSetDocValues can now return the wrapped
SortedDocValues
(Robert Muir, Adrien Grand)
- LUCENE-2844: The benchmark module can now test the spatial module. See
spatial.alg
(David Smiley, Liviy Ambrose)
- LUCENE-5302: Make StemmerOverrideMap's methods public
(Alan Woodward)
- LUCENE-5296: Add DirectDocValuesFormat, which holds all doc values
in heap as uncompressed java native arrays.
(Mike McCandless)
- LUCENE-5189: Add IndexWriter.updateNumericDocValues, to update
numeric DocValues fields of documents, without re-indexing them.
(Shai Erera, Mike McCandless, Robert Muir)
- LUCENE-5298: Add SumValueSourceFacetRequest for aggregating facets by
a ValueSource, such as a NumericDocValuesField or an expression.
(Shai Erera)
- LUCENE-5323: Add .sizeInBytes method to all suggesters (Lookup).
(Areek Zillur via Mike McCandless)
- LUCENE-5312: Add BlockJoinSorter, a new Sorter implementation that makes sure
to never split up blocks of documents indexed with IndexWriter.addDocuments.
(Adrien Grand)
- LUCENE-5297: Allow range faceting on any ValueSource, not just
NumericDocValues fields.
(Shai Erera)
- Bug Fixes (5)
- LUCENE-5272: OpenBitSet.ensureCapacity did not modify numBits, causing
false assertion errors in fastSet.
(Shai Erera)
- LUCENE-5303: OrdinalsCache did not use coreCacheKey, resulting in
over caching across multiple threads.
(Mike McCandless, Shai Erera)
- LUCENE-5307: Fix topScorer inconsistency in handling QueryWrapperFilter
inside ConstantScoreQuery, which now rewrites to a query removing the
obsolete QueryWrapperFilter.
(Adrien Grand, Uwe Schindler)
- LUCENE-5330: IndexWriter didn't process all internal events on
#getReader(), #close() and #rollback() which causes files to be
deleted at a later point in time. This could cause short-term disk
pollution or OOM if in-memory directories are used.
(Simon Willnauer)
- LUCENE-5342: Fixed bulk-merge issue in CompressingStoredFieldsFormat which
created corrupted segments when mixing chunk sizes.
Lucene41StoredFieldsFormat is not impacted.
(Adrien Grand, Robert Muir)
- API Changes (9)
- LUCENE-5222: Add SortField.needsScores(). Previously it was not possible
for a custom Sort that makes use of the relevance score to work correctly
with IndexSearcher when an ExecutorService is specified.
(Ryan Ernst, Mike McCandless, Robert Muir)
- LUCENE-5275: Change AttributeSource.toString() to display the current
state of attributes.
(Robert Muir)
- LUCENE-5277: Modify FixedBitSet copy constructor to take an additional
numBits parameter to allow growing/shrinking the copied bitset. You can
use FixedBitSet.clone() if you only need to clone the bitset.
(Shai Erera)
- LUCENE-5260: Use TermFreqPayloadIterator for all suggesters; those
suggesters that can't support payloads will throw an exception if
hasPayloads() is true.
(Areek Zillur via Mike McCandless)
- LUCENE-5280: Rename TermFreqPayloadIterator -> InputIterator, along
with associated suggest/spell classes.
(Areek Zillur via Mike McCandless)
- LUCENE-5157: Rename OrdinalMap methods to clarify API and internal structure.
(Boaz Leskes via Adrien Grand)
- LUCENE-5313: Move preservePositionIncrements from setter to ctor in
Analyzing/FuzzySuggester.
(Areek Zillur via Mike McCandless)
- LUCENE-5321: Remove Facet42DocValuesFormat. Use DirectDocValuesFormat if you
want to load the category list into memory.
(Shai Erera, Mike McCandless)
- LUCENE-5324: AnalyzerWrapper.getPositionIncrementGap and getOffsetGap can now
be overridden.
(Adrien Grand)
- Optimizations (4)
- LUCENE-5225: The ToParentBlockJoinQuery only keeps track of the child
doc ids and child scores if the ToParentBlockJoinCollector is used.
(Martijn van Groningen)
- LUCENE-5236: EliasFanoDocIdSet now has an index and uses broadword bit
selection to speed-up advance().
(Paul Elschot via Adrien Grand)
- LUCENE-5266: Improved number of read calls and branches in DirectPackedReader.
(Ryan Ernst)
- LUCENE-5300: Optimized SORTED_SET storage for fields which are single-valued.
(Adrien Grand)
- Documentation (1)
- LUCENE-5211: Better javadocs and error checking of 'format' option in
StopFilterFactory, as well as comments in all snowball formatted files
about specifying format option.
(hossman)
- Changes in backwards compatibility policy (2)
- LUCENE-5235: Subclasses of Tokenizer have to call super.reset()
when implementing reset(). Otherwise the consumer will get an
IllegalStateException because the Reader is not correctly assigned.
It is important to never change the "input" field on Tokenizer
without using setReader(). The "input" field must not be used
outside reset(), incrementToken(), or end() - especially not in
the constructor.
(Uwe Schindler, Robert Muir)
- LUCENE-5204: Directory doesn't have default implementations for
LockFactory-related methods, which have been moved to BaseDirectory. If you
had a custom Directory implementation that extended Directory, you need to
extend BaseDirectory instead.
(Adrien Grand)
- Build (4)
- LUCENE-5283: Fail the build if ant test didn't execute any tests
(everything filtered out).
(Dawid Weiss, Uwe Schindler)
- LUCENE-5249, LUCENE-5257: All Lucene/Solr modules should use the same
dependency versions.
(Steve Rowe)
- LUCENE-5273: Binary artifacts in Lucene and Solr convenience binary
distributions accompanying a release, including on Maven Central,
should be identical across all distributions.
(Steve Rowe, Uwe Schindler, Shalin Shekhar Mangar)
- LUCENE-4753: Run forbidden-apis Ant task per module. This allows more
improvements and prevents OOMs after the number of class files
rose recently.
(Uwe Schindler)
- Tests (1)
- LUCENE-5278: Fix MockTokenizer to work better with more regular expression
patterns. Previously it could only behave like CharTokenizer (where a character
is either a "word" character or not), but now it gives a general longest-match
behavior.
(Nik Everett via Robert Muir)
- Bug Fixes (8)
- LUCENE-4998: Fixed a few places to pass IOContext.READONCE instead
of IOContext.READ.
(Shikhar Bhushan via Mike McCandless)
- LUCENE-5242: DirectoryTaxonomyWriter.replaceTaxonomy did not fully reset
its state, which could result in exceptions being thrown, as well as
incorrect ordinals returned from getParent.
(Shai Erera)
- LUCENE-5254: Fixed bounded memory leak, where objects like live
docs bitset were not freed from a starting reader after reopening
to a new reader and closing the original one.
(Shai Erera, Mike McCandless)
- LUCENE-5262: Fixed file handle leaks when multiple attempts to open an
NRT reader hit exceptions.
(Shai Erera)
- LUCENE-5263: Transient IOExceptions, e.g. due to disk full or file
descriptor exhaustion, hit at unlucky times inside IndexWriter could
lead to silently losing deletions.
(Shai Erera, Mike McCandless)
- LUCENE-5264: CommonTermsQuery ignored minMustMatch if only high-frequency
terms were present in the query and the high-frequency operator was set
to SHOULD.
(Simon Willnauer)
- LUCENE-5269: Fix bug in NGramTokenFilter where it would sometimes count
Unicode characters incorrectly.
(Mike McCandless, Robert Muir)
- LUCENE-5289: IndexWriter.hasUncommittedChanges was returning false
when there were buffered delete-by-Term.
(Shalin Shekhar Mangar, Mike McCandless)
- New features (15)
- LUCENE-5084: Added new Elias-Fano encoder, decoder and DocIdSet
implementations.
(Paul Elschot via Adrien Grand)
- LUCENE-5081: Added WAH8DocIdSet, an in-memory doc id set implementation based
on word-aligned hybrid encoding.
(Adrien Grand)
- LUCENE-5098: New broadword utility methods in oal.util.BroadWord.
(Paul Elschot via Adrien Grand, Dawid Weiss)
- LUCENE-5030: FuzzySuggester now supports an optional unicodeAware setting
(default is false). If true then edits are measured in Unicode code
points instead of UTF-8 bytes.
(Artem Lukanin via Mike McCandless)
- LUCENE-5118: SpatialStrategy.makeDistanceValueSource() now has an optional
multiplier for scaling degrees to another unit.
(David Smiley)
- LUCENE-5091: SpanNotQuery can now be configured with pre and post slop to act
as a hypothetical SpanNotNearQuery.
(Tim Allison via David Smiley)
- LUCENE-4985: FacetsAccumulator.create() is now able to create a
MultiFacetsAccumulator over a mixed set of facet requests. MultiFacetsAccumulator
allows wrapping multiple FacetsAccumulators, making it easy to mix
existing and custom ones. TaxonomyFacetsAccumulator supports any
FacetRequest which implements createFacetsAggregator and was indexed
using the taxonomy index.
(Shai Erera)
- LUCENE-5153: AnalyzerWrapper.wrapReader allows wrapping the Reader given to
inputReader.
(Shai Erera)
- LUCENE-5155: FacetRequest.getValueOf and .getFacetArraysSource replaced by
FacetsAggregator.createOrdinalValueResolver. This gives better options for
resolving an ordinal's value by FacetAggregators.
(Shai Erera)
- LUCENE-5165: Add SuggestStopFilter, to be used with analyzing
suggesters, so that a stop word at the very end of the lookup query,
and without any trailing token characters, will be preserved. This
enables query "a" to suggest apple; see
http://blog.mikemccandless.com/2013/08/suggeststopfilter-carefully-removes.html
for details.
- LUCENE-5178: Added support for missing values to DocValues fields.
AtomicReader.getDocsWithField returns a Bits of documents with a value,
and FieldCache.getDocsWithField forwards to that for DocValues fields. Things like
SortField.setMissingValue, FunctionValues.exists, and FieldValueFilter now
work with DocValues fields.
(Robert Muir)
- LUCENE-5124: Lucene 4.5 has a new Lucene45Codec with Lucene45DocValues,
supporting missing values and with most data structures residing off-heap.
Added a "Memory" docvalues format that works entirely in heap, and a "Disk"
format that loads no data structures into RAM. Both of these also support
missing values. Added DiskNormsFormat (in case you want norms entirely on disk).
(Robert Muir)
- LUCENE-2750: Added PForDeltaDocIdSet, an in-memory doc id set implementation
based on the PFOR encoding.
(Adrien Grand)
- LUCENE-5186: Added CachingWrapperFilter.getFilter in order to be able to get
the wrapped filter.
(Trejkaz via Adrien Grand)
- LUCENE-5197: Added SegmentReader.ramBytesUsed to return approximate heap RAM
used by index datastructures.
(Areek Zillur via Robert Muir)
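The unicodeAware option described in LUCENE-5030 above matters because one visible character can occupy several UTF-8 bytes. The following is a minimal JDK-only sketch of that difference; it is not FuzzySuggester code, and the class name `EditUnits` and its two methods are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

public class EditUnits {
    /** Length of the text when every UTF-8 byte counts as one unit. */
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Length of the text when every Unicode code point counts as one unit. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // "café": the last letter needs two UTF-8 bytes
        System.out.println(utf8Bytes(word));  // 5
        System.out.println(codePoints(word)); // 4
    }
}
```

Under this view, replacing the accented letter with a plain "e" is a single change in code points but touches two bytes, which is why the unit of measurement can change which suggestions fall within a given edit distance.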
- Bug Fixes (16)
- LUCENE-5116: IndexWriter.addIndexes(IndexReader...) should drop empty (or all
deleted) segments.
(Robert Muir, Shai Erera)
- LUCENE-5132: Spatial RecursivePrefixTree Contains predicate will throw an NPE
when there's no indexed data and maybe in other circumstances too.
(David Smiley)
- LUCENE-5146: The AnalyzingSuggester sort comparator read part of the input key as
the weight. As a result the sorter never sorted by weight first: the weight is only
considered when the inputs are equal, in which case the malformed weight is identical as well.
(Simon Willnauer)
- LUCENE-5151: Associations FacetsAggregators could enter an infinite loop when
some result documents were missing category associations.
(Shai Erera)
- LUCENE-5152: Fix MemoryPostingsFormat to not modify borrowed BytesRef from FSTEnum
seek/lookup which can cause side effects if done on a cached FST root arc.
(Simon Willnauer)
- LUCENE-5160: Handle the case where reading from a file or FileChannel returns -1,
which could happen in rare cases where something happens to the file between the
time we start the read loop (where we check the length) and when we actually do
the read.
(gsingers, yonik, Robert Muir, Uwe Schindler)
- LUCENE-5166: PostingsHighlighter would throw IOOBE if a term spanned the maxLength
boundary, made it into the top-N and went to the formatter.
(Manuel Amoabeng, Michael McCandless, Robert Muir)
- LUCENE-4583: Indexing core no longer enforces a limit on maximum
length binary doc values fields, but individual codecs (including
the default one) have their own limits.
(David Smiley, Robert Muir, Mike McCandless)
- LUCENE-3849: TokenStreams now set the position increment in end(),
so we can handle trailing holes. If you have a custom TokenStream
implementing end() then be sure it calls super.end().
(Robert Muir, Mike McCandless)
- LUCENE-5192: IndexWriter could allow adding same field name with different
DocValueTypes under some circumstances.
(Shai Erera)
- LUCENE-5191: SimpleHTMLEncoder in Highlighter module broke Unicode
outside BMP because it encoded UTF-16 chars instead of codepoints.
The escaping of codepoints > 127 was removed (not needed for valid HTML)
and missing escaping for ' and / was added.
(Uwe Schindler)
- LUCENE-5201: Fixed compression bug in LZ4.compressHC when the input is highly
compressible and the start offset of the array to compress is > 0.
(Adrien Grand)
- LUCENE-5221: SimilarityBase did not write norms the same way as DefaultSimilarity
if discountOverlaps == false and index-time boosts are present for the field.
(Yubin Kim via Robert Muir)
- LUCENE-5223: Fixed IndexUpgrader command line parsing: -verbose is not required
and -dir-impl option now works correctly.
(hossman)
- LUCENE-5245: Fix MultiTermQuery's constant score rewrites to always
return a ConstantScoreQuery to make scoring consistent. Previously it
returned an empty unwrapped BooleanQuery, if no terms were available,
which has a different query norm.
(Nik Everett, Uwe Schindler)
- LUCENE-5218: In some cases, trying to retrieve or merge a 0-length
binary doc value would hit an ArrayIndexOutOfBoundsException.
(Littlestar via Mike McCandless)
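The LUCENE-5191 entry above turns on the difference between iterating UTF-16 chars and iterating code points. The sketch below shows a code-point-based escaper using only the JDK. It is an illustration rather than the Highlighter module's SimpleHTMLEncoder: the class name `HtmlEscapeSketch` is hypothetical and the exact entity strings chosen for ' and / are assumptions.

```java
public class HtmlEscapeSketch {
    /** Escapes the characters that are special in HTML; everything else is copied unchanged. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // reads a whole code point, even a surrogate pair
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break; // assumed entity form
                case '/':  out.append("&#x2F;"); break; // assumed entity form
                default:   out.appendCodePoint(cp);     // no escaping above 127
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String gClef = new String(Character.toChars(0x1D11E)); // a code point outside the BMP
        System.out.println(escape("<b>" + gClef + "</b>"));
    }
}
```

Because the loop advances by `Character.charCount(cp)`, a supplementary character is never split into its two surrogate halves.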
- API Changes (13)
- LUCENE-5094: Add ramBytesUsed() to MultiDocValues.OrdinalMap.
(Robert Muir)
- LUCENE-5114: Remove unused boolean useCache parameter from
TermsEnum.seekCeil and .seekExact
(Mike McCandless)
- LUCENE-5128: IndexSearcher.searchAfter throws IllegalArgumentException if
searchAfter exceeds the number of documents in the reader.
(Crocket via Shai Erera)
- LUCENE-5129: CategoryAssociationsContainer no longer supports null
association values for categories. If you want to index categories without
associations, you should add them using FacetFields.
(Shai Erera)
- LUCENE-4876: IndexWriter no longer clones the given IndexWriterConfig. If you
need to use the same config more than once, e.g. when sharing between multiple
writers, make sure to clone it before passing to each writer.
(Shai Erera, Mike McCandless)
- LUCENE-5144: StandardFacetsAccumulator renamed to OldFacetsAccumulator, and all
associated classes were moved under o.a.l.facet.old. The intention is to remove it
one day, once the features it covers (complements, partitions, sampling) have been
migrated to the new FacetsAggregator and FacetsAccumulator API. Also,
FacetRequest.createAggregator was replaced by OldFacetsAccumulator.createAggregator.
(Shai Erera)
- LUCENE-5149: CommonTermsQuery now allows setting the minimum number of terms that
should match for both its high- and low-frequency sub-queries. Previously this was only
supported on the low-frequency terms query.
(Simon Willnauer)
- LUCENE-5156: CompressingTermVectors TermsEnum no longer supports ord().
(Robert Muir)
- LUCENE-5161, LUCENE-5164: Fix default chunk sizes in FSDirectory to not be
unnecessarily large (now 8192 bytes); also use chunking when writing to index
files. FSDirectory#setReadChunkSize() is now deprecated and will be removed
in Lucene 5.0.
(Uwe Schindler, Robert Muir, gsingers)
- LUCENE-5170: Analyzer.ReuseStrategy instances are now stateless and can
be reused in other Analyzer instances, which was not possible before.
Lucene now ships with stateless singletons for per-field and global reuse.
Legacy code can still instantiate the deprecated implementation classes,
but new code should use the constants. Implementors of custom strategies
have to take care of new method signatures. AnalyzerWrapper can now be
configured to use a custom strategy, too, ideally the one from the wrapped
Analyzer. Analyzer adds a getter to retrieve the strategy for this use-case.
(Uwe Schindler, Robert Muir, Shay Banon)
- LUCENE-5173: Lucene never writes segments with 0 documents anymore.
(Shai Erera, Uwe Schindler, Robert Muir)
- LUCENE-5178: SortedDocValues always returns -1 ord when a document is missing
a value for the field. Previously it only did this if the SortedDocValues
was produced by uninversion on the FieldCache.
(Robert Muir)
- LUCENE-5183: Removed BinaryDocValues.MISSING. To determine whether a document
is missing a field, use getDocsWithField instead.
(Robert Muir)
- Changes in Runtime Behavior (2)
- LUCENE-5178: DocValues codec consumer APIs (iterables) return null values
when the document has no value for the field.
(Robert Muir)
- LUCENE-5200: The HighFreqTerms command-line tool returns the true top-N
by totalTermFreq when using the -t option. It uses the term statistics (faster)
and now always shows totalTermFreq in the output.
(Robert Muir)
- Optimizations (12)
- LUCENE-5088: Added TermFilter to filter docs by a specific term.
(Martijn van Groningen)
- LUCENE-5119: DiskDV keeps the document-to-ordinal mapping on disk for
SortedDocValues.
(Robert Muir)
- LUCENE-5145: Added AppendingPackedLongBuffer, a variant of the former
AppendingLongBuffer which assumes values are 0-based.
(Boaz Leskes via Adrien Grand)
- LUCENE-5145: All Appending*Buffer now support bulk get.
(Boaz Leskes via Adrien Grand)
- LUCENE-5140: Fixed a performance regression of span queries caused by
LUCENE-4946.
(Alan Woodward, Adrien Grand)
- LUCENE-5150: Make WAH8DocIdSet able to invert its encoding in order to
compress dense sets efficiently as well.
(Adrien Grand)
- LUCENE-5159: Prefix-code the sorted/sortedset value dictionaries in DiskDV.
(Robert Muir)
- LUCENE-5170: Fixed several wrapper analyzers to inherit the reuse strategy
of the wrapped Analyzer.
(Uwe Schindler, Robert Muir, Shay Banon)
- LUCENE-5006: Simplified DocumentsWriter and DocumentsWriterPerThread
synchronization and concurrent interaction with IndexWriter. DWPT is now
only set up once and has no reset logic. All segment publishing and state
transition from DWPT into IndexWriter is now done via an Event-Queue
processed from within the IndexWriter in order to prevent situations
where DWPT or DW calling into IW causes deadlocks.
(Simon Willnauer)
- LUCENE-5182: Terminate phrase searches early if max phrase window is
exceeded in FastVectorHighlighter to prevent very long-running phrase
extraction when phrase terms are highly frequent.
(Simon Willnauer)
- LUCENE-5188: CompressingStoredFieldsFormat now slices chunks containing big
documents into fixed-size blocks so that requesting a single field does not
necessarily force decompression of the whole chunk.
(Adrien Grand)
- LUCENE-5101: CachingWrapperFilter makes it easier to plug in a custom cacheable
DocIdSet implementation and uses WAH8DocIdSet by default, which should be
more memory efficient than FixedBitSet on average as well as faster on small
sets.
(Robert Muir)
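The prefix coding mentioned in LUCENE-5159 above relies on the fact that neighbouring entries of a sorted dictionary tend to share leading characters. The sketch below illustrates the general technique only. It works on Java chars rather than bytes and says nothing about DiskDV's actual on-disk layout; `PrefixCodingSketch` and `Entry` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixCodingSketch {
    /** One encoded term: count of leading chars shared with the previous term, plus the remainder. */
    public static final class Entry {
        public final int shared;
        public final String suffix;
        Entry(int shared, String suffix) { this.shared = shared; this.suffix = suffix; }
    }

    /** Encodes sorted terms so that each one stores only what differs from its predecessor. */
    public static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int max = Math.min(prev.length(), term.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) shared++;
            out.add(new Entry(shared, term.substring(shared)));
            prev = term;
        }
        return out;
    }

    /** Rebuilds the original terms by replaying the entries in order. */
    public static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String term = prev.substring(0, e.shared) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("apple", "applet", "apply", "banana");
        for (Entry e : encode(terms)) {
            System.out.println(e.shared + " + \"" + e.suffix + "\"");
        }
    }
}
```

For the four sample terms, "applet" is stored as 5 shared characters plus "t" and "apply" as 4 shared characters plus "y". The trade-off is that reading a term requires replaying the entries before it.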
- Documentation (2)
- LUCENE-4894: Removed the facet user guide as it was outdated. It was partially
absorbed into the package documentation and class javadocs.
(Shai Erera)
- LUCENE-5206: Clarify FuzzyQuery's unexpected behavior on short
terms.
(Tim Allison via Mike McCandless)
- Changes in backwards compatibility policy (5)
- LUCENE-5141: CheckIndex.fixIndex(Status,Codec) is now
CheckIndex.fixIndex(Status). If you used to pass a codec to this method, just
remove it from the arguments.
(Adrien Grand)
- LUCENE-5089, SOLR-5126: Update to Morfologik 1.7.1. MorfologikAnalyzer and MorfologikFilter
no longer support multiple "dictionaries" as there is only one dictionary available.
(Dawid Weiss)
- LUCENE-5170: Changed method signatures of Analyzer.ReuseStrategy to take
Analyzer. Closeable interface was removed because the class was changed to
be stateless.
(Uwe Schindler, Robert Muir, Shay Banon)
- LUCENE-5187: SlowCompositeReaderWrapper constructor is now private,
SlowCompositeReaderWrapper.wrap should be used instead.
(Adrien Grand)
- LUCENE-5101: CachingWrapperFilter doesn't always return FixedBitSet instances
anymore. Users of the join module can use
oal.search.join.FixedBitSetCachingWrapperFilter instead.
(Adrien Grand)
- Build (2)
- SOLR-5159: Manifest includes non-parsed maven variables.
(Artem Karpenko via Steve Rowe)
- LUCENE-5193: Add jar-src as top-level target to generate all Lucene and Solr
*-src.jar.
(Steve Rowe, Shai Erera)
- Changes in backwards compatibility policy (18)
- LUCENE-5085: MorfologikFilter will no longer stem words marked as keywords
(Dawid Weiss, Grzegorz Sobczyk)
- LUCENE-4955: NGramTokenFilter now emits all n-grams for the same token at the
same position and preserves the position length and the offsets of the
original token.
(Simon Willnauer, Adrien Grand)
- LUCENE-4955: NGramTokenizer now emits n-grams in a different order
(a, ab, b, bc, c) instead of (a, b, c, ab, bc) and doesn't trim trailing
whitespace.
(Adrien Grand)
- LUCENE-5042: The n-gram and edge n-gram tokenizers and filters now correctly
handle supplementary characters, and the tokenizers have the ability to
pre-tokenize the input stream similarly to CharTokenizer.
(Adrien Grand)
- LUCENE-4967: NRTManager is replaced by
ControlledRealTimeReopenThread, for controlling which requests must
see which indexing changes, so that it can work with any
ReferenceManager
(Mike McCandless)
- LUCENE-4973: SnapshotDeletionPolicy no longer requires a unique
String id
(Mike McCandless, Shai Erera)
- LUCENE-4946: The internal sorting API (SorterTemplate, now Sorter) has been
completely refactored to allow for a better implementation of TimSort.
(Adrien Grand, Uwe Schindler, Dawid Weiss)
- LUCENE-4963: Some TokenFilter options that generate broken TokenStreams have
been deprecated: updateOffsets=true on TrimFilter and
enablePositionIncrements=false on all classes that inherit from
FilteringTokenFilter: JapanesePartOfSpeechStopFilter, KeepWordFilter,
LengthFilter, StopFilter and TypeTokenFilter.
(Adrien Grand)
- LUCENE-4963: In order not to take position increments into account in
suggesters, you now need to call setPreservePositionIncrements(false) instead
of configuring the token filters to not increment positions.
(Adrien Grand)
- LUCENE-3907: EdgeNGramTokenizer now supports maxGramSize > 1024, doesn't trim
the input, sets position increment = 1 for all tokens and doesn't support
backward grams anymore.
(Adrien Grand)
- LUCENE-3907: EdgeNGramTokenFilter does not support backward grams and does
not update offsets anymore.
(Adrien Grand)
- LUCENE-4981: PositionFilter is now deprecated as it can corrupt token stream
graphs. Since its main use-case was to make query parsers generate boolean
queries instead of phrase queries, it is now advised to use
QueryParser.setAutoGeneratePhraseQueries(false) (for simple cases) or to
override QueryParser.newFieldQuery.
(Adrien Grand, Steve Rowe)
- LUCENE-5018: CompoundWordTokenFilterBase and its children
DictionaryCompoundWordTokenFilter and HyphenationCompoundWordTokenFilter don't
update offsets anymore.
(Adrien Grand)
- LUCENE-5015: SamplingAccumulator no longer corrects the counts of the sampled
categories. You should set TakmiSampleFixer on SamplingParams if required (but
notice that this means slower search).
(Rob Audenaerde, Gilad Barkai, Shai Erera)
- LUCENE-4933: Replace ExactSimScorer/SloppySimScorer with just SimScorer. Previously
there were 2 implementations as a performance hack to support tableization of
sqrt(), but this caching is removed, as sqrt is implemented in hardware on modern
JVMs and it is faster not to cache.
(Robert Muir)
- LUCENE-5038: MergePolicy now has a default implementation for useCompoundFile based
on segment size and noCFSRatio. The default implementation was pulled up from
TieredMergePolicy.
(Simon Willnauer)
- LUCENE-5063: FieldCache.get(Bytes|Shorts), SortField.Type.(BYTE|SHORT) and
FieldCache.DEFAULT_(BYTE|SHORT|INT|LONG|FLOAT|DOUBLE)_PARSER are now
deprecated. These methods/types assume that data is stored as strings although
Lucene has much better support for numeric data through (Int|Long)Field,
NumericRangeQuery and FieldCache.get(Int|Long)s.
(Adrien Grand)
- LUCENE-5078: TFIDFSimilarity lets you encode the norm value as any arbitrary long.
As a result, encode/decodeNormValue were made abstract with their signatures changed.
The default implementation was moved to DefaultSimilarity, which encodes the norm as
a single-byte value.
(Shai Erera)
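The ordering change described in the LUCENE-4955 NGramTokenizer entry above, from (a, b, c, ab, bc) to (a, ab, b, bc, c), comes down to which loop is the outer one. The sketch below reproduces both orders on a plain String. It is not the tokenizer's code, works on chars rather than code points, and the names `NGramOrderSketch`, `bySize` and `byOffset` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramOrderSketch {
    /** Older order: all grams of the smallest size first, then the next size, and so on. */
    public static List<String> bySize(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= s.length(); start++) {
                out.add(s.substring(start, start + n));
            }
        }
        return out;
    }

    /** Newer order: all grams starting at offset 0, then all grams starting at offset 1, and so on. */
    public static List<String> byOffset(String s, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < s.length(); start++) {
            for (int n = minGram; n <= maxGram && start + n <= s.length(); n++) {
                out.add(s.substring(start, start + n));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bySize("abc", 1, 2));   // [a, b, c, ab, bc]
        System.out.println(byOffset("abc", 1, 2)); // [a, ab, b, bc, c]
    }
}
```

Both methods produce the same set of grams; only the emission order differs, which is what code depending on token order needs to account for.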
- Bug Fixes (23)
- LUCENE-4890: QueryTreeBuilder.getBuilder() only finds interfaces on the
most derived class.
(Adriano Crestani)
- LUCENE-4997: Internal test framework's tests are sensitive to previous
test failures and tests.failfast.
(Dawid Weiss, Shai Erera)
- LUCENE-4955: NGramTokenizer now supports inputs larger than 1024 chars.
(Adrien Grand)
- LUCENE-4959: Fix incorrect return value in
SimpleNaiveBayesClassifier.assignClass.
(Alexey Kutin via Adrien Grand)
- LUCENE-4972: DirectoryTaxonomyWriter created empty commits even if no changes
were made.
(Shai Erera, Michael McCandless)
- LUCENE-949: AnalyzingQueryParser can't work with leading wildcards.
(Tim Allison, Robert Muir, Steve Rowe)
- LUCENE-4980: Fix issues preventing mixing of RangeFacetRequest and
non-RangeFacetRequest when using DrillSideways.
(Mike McCandless, Shai Erera)
- LUCENE-4996: Ensure DocInverterPerField always includes field name
in exception messages.
(Markus Jelsma via Robert Muir)
- LUCENE-4992: Fix constructor of CustomScoreQuery to take FunctionQuery
for scoringQueries. Instead use QueryValueSource to safely wrap arbitrary
queries and use them with CustomScoreQuery.
(John Wang, Robert Muir)
- LUCENE-5016: SamplingAccumulator returned inconsistent label if asked to
aggregate a non-existing category. Also fixed a bug in RangeAccumulator if
some readers did not have the requested numeric DV field.
(Rob Audenaerde, Shai Erera)
- LUCENE-5028: Remove pointless and confusing doShare option in FST's
PositiveIntOutputs
(Han Jiang via Mike McCandless)
- LUCENE-5032: Fix IndexOutOfBoundsExc in PostingsHighlighter when
multi-valued fields exceed maxLength
(Tomás Fernández Löbbe via Mike McCandless)
- LUCENE-4933: SweetSpotSimilarity didn't apply its tf function to some
queries (SloppyPhraseQuery, SpanQueries).
(Robert Muir)
- LUCENE-5033: SlowFuzzyQuery was accepting too many terms (documents) when
provided minSimilarity is an int > 1
(Tim Allison via Mike McCandless)
- LUCENE-5045: DrillSideways.search did not work on an empty index.
(Shai Erera)
- LUCENE-4995: CompressingStoredFieldsReader now only reuses an internal buffer
when there is no more than 32 KB to decompress. This prevents out-of-memory
errors when working with large stored fields.
(Adrien Grand)
- LUCENE-5062: If the spatial data for a document comprised multiple
overlapping or adjacent parts, a CONTAINS predicate query might not match
when the sum of those shapes contains the query shape but none does individually.
A flag was added to use the original faster algorithm.
(David Smiley)
- LUCENE-4971: Fixed NPE in AnalyzingSuggester when there are too many
graph expansions.
(Alexey Kudinov via Mike McCandless)
- LUCENE-5080: Combined setMaxMergeCount and setMaxThreadCount into one
setter in ConcurrentMergeScheduler: setMaxMergesAndThreads. Previously these
setters would not work unless you invoked them very carefully.
(Robert Muir, Shai Erera)
- LUCENE-5068: QueryParserUtil.escape() does not escape forward slash.
(Matias Holte via Steve Rowe)
- LUCENE-5103: A join on a single-valued field with deleted docs scored too few
docs.
(David Smiley)
- LUCENE-5090: Detect mismatched readers passed to
SortedSetDocValuesReaderState and SortedSetDocValuesAccumulator.
(Robert Muir, Mike McCandless)
- LUCENE-5120: AnalyzingSuggester modified its FST's cached root arc if payloads
are used and the entire output resided on the root arc on the first access. This
caused subsequent suggest calls to fail.
(Simon Willnauer)
- Optimizations (7)
- LUCENE-4936: Improve numeric doc values compression in case all values share
a common divisor. In particular, this improves the compression ratio of dates
without time when they are encoded as milliseconds since Epoch. Also support
TABLE compressed numerics in the Disk codec.
(Robert Muir, Adrien Grand)
- LUCENE-4951: DrillSideways uses the new Scorer.cost() method to make
better decisions about which scorer to use internally.
(Mike McCandless)
- LUCENE-4976: PersistentSnapshotDeletionPolicy writes its state to a
single snapshots_N file, and no longer requires closing.
(Mike McCandless, Shai Erera)
- LUCENE-5035: Compress addresses in FieldCacheImpl.SortedDocValuesImpl more
efficiently.
(Adrien Grand, Robert Muir)
- LUCENE-4941: Sort "from" terms only once when using JoinUtil.
(Martijn van Groningen)
- LUCENE-5050: Close the stored fields and term vectors index files as soon as
the index has been loaded into memory to save file descriptors.
(Adrien Grand)
- LUCENE-5086: RamUsageEstimator now uses official Java 7 API or a proprietary
Oracle Java 6 API to get Hotspot MX bean, preventing AWT classes from being
loaded on Mac OS X.
(Shay Banon, Dawid Weiss, Uwe Schindler)
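The common-divisor idea in LUCENE-4936 above can be made concrete with dates stored as milliseconds since Epoch: when every value is a whole number of days, each delta from the minimum is a multiple of 86,400,000. The sketch below illustrates the arithmetic only and is not the codec's implementation; `GcdCompressionSketch` and its method names are hypothetical.

```java
public class GcdCompressionSketch {
    /** Euclid's algorithm on non-negative longs. */
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Greatest common divisor of (value - min) over all values; 0 if every value equals min. */
    public static long commonDivisor(long[] values, long min) {
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);
        }
        return g;
    }

    /** Bits needed to represent a non-negative value (at least 1). */
    public static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    public static void main(String[] args) {
        long day = 86_400_000L; // milliseconds per day
        long[] dates = { 15_000 * day, 15_001 * day, 15_007 * day, 15_030 * day };
        long min = dates[0];
        long max = dates[dates.length - 1];
        long g = commonDivisor(dates, min);

        System.out.println(g);                             // 86400000
        System.out.println(bitsRequired(max - min));       // 32 bits per raw delta
        System.out.println(bitsRequired((max - min) / g)); // 5 bits per quotient
        // Each value is recoverable as min + quotient * g.
    }
}
```

In this example, storing the quotient `(value - min) / g` instead of the raw delta cuts the width from 32 bits to 5 bits per value, at the cost of one multiply and one add on read.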
- New Features (19)
- LUCENE-5085: MorfologikFilter will no longer stem words marked as keywords
(Dawid Weiss, Grzegorz Sobczyk)
- LUCENE-5064: Added PagedMutable (internal), a paged extension of
PackedInts.Mutable which allows for storing more than 2B values.
(Adrien Grand)
- LUCENE-4766: Added a PatternCaptureGroupTokenFilter that uses Java regexes to
emit multiple tokens, one for each capture group in one or more patterns.
(Simon Willnauer, Clinton Gormley)
- LUCENE-4952: Expose control (protected method) in DrillSideways to
force all sub-scorers to be on the same document being collected.
This is necessary when using collectors like
ToParentBlockJoinCollector with DrillSideways.
(Mike McCandless)
- SOLR-4761: Add SimpleMergedSegmentWarmer, which just initializes terms,
norms, docvalues, and so on.
(Mark Miller, Mike McCandless, Robert Muir)
- LUCENE-4964: Allow arbitrary Query for per-dimension drill-down to
DrillDownQuery and DrillSideways, to support future dynamic faceting
methods
(Mike McCandless)
- LUCENE-4966: Add CachingWrapperFilter.sizeInBytes()
(Mike McCandless)
- LUCENE-4965: Add dynamic (no taxonomy index used) numeric range
faceting to Lucene's facet module
(Mike McCandless, Shai Erera)
- LUCENE-4979: LiveFieldValues can work with any ReferenceManager, not
just ReferenceManager<IndexSearcher>
(Mike McCandless)
- LUCENE-4975: Added a new Replicator module which can replicate index
revisions between server and client.
(Shai Erera, Mike McCandless)
- LUCENE-5022: Added FacetResult.mergeHierarchies to merge multiple
FacetResult of the same dimension into a single one with the reconstructed
hierarchy.
(Shai Erera)
- LUCENE-5026: Added PagedGrowableWriter, a new internal packed-ints structure
that grows the number of bits per value on demand, can store more than 2B
values and supports random write and read access.
(Adrien Grand)
- LUCENE-5025: FST's Builder can now handle more than 2.1 billion
"tail nodes" while building a minimal FST.
(Aaron Binns, Adrien Grand, Mike McCandless)
- LUCENE-5063: FieldCache.DEFAULT.get(Ints|Longs) now uses bit-packing to save
memory.
(Adrien Grand)
- LUCENE-5079: IndexWriter.hasUncommittedChanges() returns true if there are
changes that have not been committed.
(yonik, Mike McCandless, Uwe Schindler)
- SOLR-4565: Extend NorwegianLightStemFilter and NorwegianMinimalStemFilter
to handle "nynorsk"
(Erlend Garåsen, janhoy via Robert Muir)
- LUCENE-5087: Add getMultiValuedSeparator to PostingsHighlighter, for cases
where you want a different logical separator between field values. This can
be set to e.g. U+2029 PARAGRAPH SEPARATOR if you never want passes to span
values.
(Mike McCandless, Robert Muir)
- LUCENE-5013: Added ScandinavianFoldingFilterFactory and
ScandinavianNormalizationFilterFactory
(Karl Wettin via janhoy)
- LUCENE-4845: AnalyzingInfixSuggester finds suggestions based on
matches to any tokens in the suggestion, not just based on pure
prefix matching.
(Mike McCandless, Robert Muir)
- API Changes (3)
- LUCENE-5077: Make it easier to use compressed norms. Lucene42NormsFormat takes
an overhead parameter, so you can easily pass a value other than
PackedInts.FASTEST from your own codec.
(Robert Muir)
- LUCENE-5097: Analyzer now has an additional tokenStream(String fieldName,
String text) method, so wrapping by StringReader for common use is no
longer needed. This method uses an internal reusable reader, which was
previously only used by the Field class.
(Uwe Schindler, Robert Muir)
- LUCENE-4542: HunspellStemFilter's maximum recursion level is now configurable.
(Piotr, Rafał Kuć via Adrien Grand)
- Build (4)
- LUCENE-4987: Upgrade randomized testing to version 2.0.10:
Test framework may fail internally due to overly aggressive J9 optimizations.
(Dawid Weiss, Shai Erera)
- LUCENE-5043: The eclipse target now uses the containing directory for the
project name. This also enforces UTF-8 encoding when files are copied with
filtering.
- LUCENE-5055: "rat-sources" target now also checks build.xml, ivy.xml,
forbidden-api signatures, and parts of resources folders.
(Ryan Ernst, Uwe Schindler)
- LUCENE-5072: Automatically patch javadocs generated by JDK versions
before 7u25 to work around the frame injection vulnerability (CVE-2013-1571,
VU#225657).
(Uwe Schindler)
- Tests (1)
- LUCENE-4901: TestIndexWriterOnJRECrash should work on any
JRE vendor via Runtime.halt().
(Mike McCandless, Robert Muir, Uwe Schindler, Rodrigo Trujillo, Dawid Weiss)
- Changes in runtime behavior (2)
- LUCENE-5038: New segments written by IndexWriter are now wrapped into CFS
by default. DocumentsWriterPerThread doesn't consult MergePolicy anymore
to decide if a CFS must be written, instead IndexWriterConfig now has a
property to enable / disable CFS for newly created segments.
(Simon Willnauer)
- LUCENE-5107: Properties files written by Lucene now use UTF-8 encoding;
Unicode is no longer escaped. Reading of legacy properties files with
\u escapes is still possible.
(Uwe Schindler, Robert Muir)
- Bug Fixes (12)
- SOLR-4813: Fix SynonymFilterFactory to allow init parameters for
tokenizer factory used when parsing synonyms file.
(Shingo Sasaki, hossman)
- LUCENE-4935: CustomScoreQuery wrongly applied its query boost twice
(boost^2).
(Robert Muir)
- LUCENE-4948: Fixed ArrayIndexOutOfBoundsException in PostingsHighlighter
if you had a 64-bit JVM without compressed OOPS: IBM J9, or Oracle with
large heap/explicitly disabled.
(Mike McCandless, Uwe Schindler, Robert Muir)
- LUCENE-4953: Fixed ParallelCompositeReader to inform ReaderClosedListeners of
its synthetic subreaders. FieldCaches keyed on the atomic children will be purged
earlier and FC insanity prevented. In addition, ParallelCompositeReader's
toString() was changed to better reflect the reader structure.
(Mike McCandless, Uwe Schindler)
- LUCENE-4968: Fixed ToParentBlockJoinQuery/Collector: correctly handle parent
hits that had no child matches, don't throw IllegalArgumentEx when
the child query has no hits, more aggressively catch cases where childQuery
incorrectly matches parent documents
(Mike McCandless)
- LUCENE-4970: Fix boost value of rewritten NGramPhraseQuery.
(Shingo Sasaki via Adrien Grand)
- LUCENE-4974: CommitIndexTask was broken if no params were set.
(Shai Erera)
- LUCENE-4986: Fixed case where a newly opened near-real-time reader
fails to reflect a delete from IndexWriter.tryDeleteDocument.
(Reg, Mike McCandless)
- LUCENE-4994: Fix PatternKeywordMarkerFilter to have public constructor.
(Uwe Schindler)
- LUCENE-4993: Fix BeiderMorseFilter to preserve custom attributes when
inserting tokens with position increment 0.
(Uwe Schindler)
- LUCENE-4991: Fix handling of synonyms in classic QueryParser.getFieldQuery for
terms not separated by whitespace. PositionIncrementAttribute was ignored, so with
default AND synonyms wrongly became mandatory clauses, and with OR, the
coordination factor was wrong.
(李威, Robert Muir)
- LUCENE-5002: IndexWriter#deleteAll() caused a deadlock in DWPT / DWSC if a
DWPT was flushing concurrently while deleteAll() aborted all DWPT. The IW
should never wait on DWPT via the flush control while holding on to the IW
lock.
(Simon Willnauer)
- Optimizations (1)
- LUCENE-4938: Don't use an unnecessarily large priority queue in IndexSearcher
methods that take top-N.
(Uwe Schindler, Mike McCandless, Robert Muir)
- Changes in backwards compatibility policy (8)
- LUCENE-4810: EdgeNGramTokenFilter no longer increments position for
multiple ngrams derived from the same input token.
(Walter Underwood via Mike McCandless)
- LUCENE-4822: KeywordMarkerFilter is now an abstract class. Subclasses
need to implement #isKeyword() in order to mark terms as keywords.
The existing functionality has been factored out into a new
SetKeywordMarkerFilter class.
(Simon Willnauer, Uwe Schindler)
- LUCENE-4642: Remove Tokenizer's and subclasses' ctors taking
AttributeSource.
(Renaud Delbru, Uwe Schindler, Steve Rowe)
- LUCENE-4833: IndexWriterConfig used to use LogByteSizeMergePolicy when
calling setMergePolicy(null) although the default merge policy is
TieredMergePolicy. IndexWriterConfig setters now throw an exception when
passed null if null is not a valid value.
(Adrien Grand)
- LUCENE-4849: Made ParallelTaxonomyArrays abstract with a concrete
implementation for DirectoryTaxonomyWriter/Reader. Also moved it under
o.a.l.facet.taxonomy.
(Shai Erera)
- LUCENE-4876: IndexDeletionPolicy is now an abstract class instead of an
interface. IndexDeletionPolicy, MergeScheduler and InfoStream now implement
Cloneable.
(Adrien Grand)
- LUCENE-4874: FilterAtomicReader and related classes (FilterTerms,
FilterDocsEnum, ...) don't forward anymore to the filtered instance when the
method has a default implementation through other abstract methods.
(Adrien Grand, Robert Muir)
- LUCENE-4642, LUCENE-4877: Implementors of TokenizerFactory, TokenFilterFactory,
and CharFilterFactory now need to provide at least one constructor taking
Map<String,String> to be able to be loaded by the SPI framework (e.g., from Solr).
In addition, TokenizerFactory needs to implement the abstract
create(AttributeFactory,Reader) method.
(Renaud Delbru, Uwe Schindler, Steve Rowe, Robert Muir)
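For the LUCENE-4642 / LUCENE-4877 entry above, the reason a Map<String,String>
constructor matters can be sketched as follows: a loader that only knows a
Class object can look up that one constructor reflectively. Both classes
below are hypothetical stand-ins, and the "maxTokenLength" parameter and its
default are invented for illustration; this is not how Lucene's SPI loader is
actually written.

```java
import java.util.Map;

// Hypothetical stand-in for an analysis factory; not a Lucene class.
class LowerCaseSketchFactory {
    final int maxTokenLength;

    // The single-argument Map<String,String> constructor is what a reflective
    // loader can look up without knowing the concrete factory type.
    public LowerCaseSketchFactory(Map<String, String> args) {
        String max = args.get("maxTokenLength"); // invented parameter name
        this.maxTokenLength = (max == null) ? 255 : Integer.parseInt(max);
    }
}

// Hypothetical loader; a sketch of the lookup, not Lucene's implementation.
final class FactoryLoaderSketch {
    static Object newFactory(Class<?> clazz, Map<String, String> args) {
        try {
            // Fails if the class has no public constructor taking a Map.
            return clazz.getConstructor(Map.class).newInstance(args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "No usable Map<String,String> constructor on " + clazz.getName(), e);
        }
    }
}
```

A factory that only offers a no-argument constructor would fail the lookup in
newFactory, which is the situation the entry asks implementors to avoid.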
- API Changes (3)
- LUCENE-4896: Made PassageFormatter abstract in PostingsHighlighter, made
members of DefaultPassageFormatter protected.
(Luca Cavanna via Robert Muir)
- LUCENE-4844: Removed TaxonomyReader.getParent(); use
TaxonomyReader.getParallelArrays().parents() instead.
(Shai Erera)
- LUCENE-4742: Renamed spatial 'Node' to 'Cell', along with any method names
and variables using this terminology.
(David Smiley)
- New Features (34)
- LUCENE-4815: DrillSideways now allows more than one FacetRequest per
dimension
(Mike McCandless)
- LUCENE-3918: IndexSorter has been ported to the 4.3 API and now supports
sorting documents by a numeric DocValues field, or reversing the order of
the documents in the index. Additionally, apps can implement their own
sort criteria.
(Anat Hashavit, Shai Erera)
- LUCENE-4817: Added KeywordRepeatFilter, which emits each token twice,
once as a keyword and once as an ordinary token, allowing stemmers to emit
a stemmed version along with the un-stemmed version.
(Simon Willnauer)
- LUCENE-4822: PatternKeywordTokenFilter can mark tokens as keywords based
on regular expressions.
(Simon Willnauer, Uwe Schindler)
- LUCENE-4821: AnalyzingSuggester now uses the ending offset to
determine whether the last token was finished or not, so that a
query "i " will no longer suggest "Isla de Muerta" for example.
(Mike McCandless)
- LUCENE-4642: Add create(AttributeFactory) to TokenizerFactory and
subclasses with ctors taking AttributeFactory.
(Renaud Delbru, Uwe Schindler, Steve Rowe)
- LUCENE-4820: Add payloads to Analyzing/FuzzySuggester, to record an
arbitrary byte[] per suggestion
(Mike McCandless)
- LUCENE-4816: Add WholeBreakIterator to PostingsHighlighter
for treating the entire content as a single Passage.
(Robert Muir, Mike McCandless)
- LUCENE-4827: Add additional ctor to PostingsHighlighter PassageScorer
to provide bm25 k1,b,avgdl parameters.
(Robert Muir)
- LUCENE-4607: Add DocIdSetIterator.cost() and Spans.cost() for optimizing
scoring.
(Simon Willnauer, Robert Muir)
- LUCENE-4795: Add SortedSetDocValuesFacetFields and
SortedSetDocValuesAccumulator, to compute topK facet counts from a
field's SortedSetDocValues. This method only supports flat
(dim/label) facets, is a bit (~25%) slower, has added cost
per-IndexReader-open to compute its ordinal map, but it requires no
taxonomy index and it tie-breaks facet labels in an understandable
(by Unicode sort order) way.
(Robert Muir, Mike McCandless)
- LUCENE-4843: Add LimitTokenPositionFilter: don't emit tokens with
positions that exceed the configured limit.
(Steve Rowe)
- LUCENE-4832: Add ToParentBlockJoinCollector.getTopGroupsWithAllChildDocs, to retrieve
all children in each group.
(Aleksey Aleev via Mike McCandless)
- LUCENE-4846: PostingsHighlighter subclasses can override where the
String values come from (it still defaults to pulling from stored
fields).
(Robert Muir, Mike McCandless)
- LUCENE-4853: Add PostingsHighlighter.highlightFields method that
takes int[] docIDs instead of TopDocs.
(Robert Muir, Mike McCandless)
- LUCENE-4856: If there are no matches for a given field, return the
first maxPassages sentences
(Robert Muir, Mike McCandless)
- LUCENE-4859: IndexReader now exposes Terms statistics: getDocCount,
getSumDocFreq, getSumTotalTermFreq.
(Shai Erera)
- LUCENE-4862: It is now possible to terminate collection of a single
IndexReader leaf by throwing a CollectionTerminatedException in
Collector.collect.
(Adrien Grand, Shai Erera)
- LUCENE-4752: New SortingMergePolicy (in lucene/misc) that sorts documents
before merging segments.
(Adrien Grand, Shai Erera, David Smiley)
- LUCENE-4860: Customize scoring and formatting per-field in
PostingsHighlighter by subclassing and overriding the getFormatter
and/or getScorer methods. This also changes Passage.getMatchTerms()
to return BytesRef[] instead of Term[].
(Robert Muir, Mike McCandless)
- LUCENE-4839: Added SorterTemplate.timSort, an O(n log n) stable sort algorithm
that performs well on partially sorted data.
(Adrien Grand)
- LUCENE-4644: Added support for the "IsWithin" spatial predicate for
RecursivePrefixTreeStrategy. It's for matching non-point indexed shapes; if
you only have points (1/doc) then "Intersects" is equivalent and faster.
See the javadocs.
(David Smiley)
- LUCENE-4861: Make BreakIterator per-field in PostingsHighlighter. This means
you can override getBreakIterator(String field) to use different mechanisms
for e.g. title vs. body fields.
(Mike McCandless, Robert Muir)
- LUCENE-4645: Added support for the "Contains" spatial predicate for
RecursivePrefixTreeStrategy.
(David Smiley)
- LUCENE-4898: DirectoryReader.openIfChanged now allows opening a reader
on an IndexCommit starting from a near-real-time reader (previously
this would throw IllegalArgumentException).
(Mike McCandless)
- LUCENE-4905: Made the maxPassages parameter per-field in PostingsHighlighter.
(Robert Muir)
- LUCENE-4897: Added TaxonomyReader.getChildren for traversing a category's
children.
(Shai Erera)
- LUCENE-4902: Added FilterDirectoryReader to allow easy filtering of a
DirectoryReader's subreaders.
(Alan Woodward, Adrien Grand, Uwe Schindler)
- LUCENE-4858: Added EarlyTerminatingSortingCollector to be used in conjunction
with SortingMergePolicy, which allows early termination of queries on sorted
indexes when the sort order matches the index order.
(Adrien Grand, Shai Erera)
- LUCENE-4904: Added descending sort order to NumericDocValuesSorter.
(Shai Erera)
- LUCENE-3786: Added SearcherTaxonomyManager, to manage access to both
IndexSearcher and DirectoryTaxonomyReader for near-real-time
faceting.
(Shai Erera, Mike McCandless)
- LUCENE-4915: DrillSideways now allows drilling down on fields that
are not faceted.
(Mike McCandless)
- LUCENE-4895: Added support for the "IsDisjointTo" spatial predicate for
RecursivePrefixTreeStrategy.
(David Smiley)
- LUCENE-4774: Added FieldComparator that allows sorting parent documents based on
fields on the child / nested document level.
(Martijn van Groningen)
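The per-leaf termination described in LUCENE-4862 above can be sketched as a
search loop that treats one specific exception as "done with this leaf" and
moves on to the next. Every type below is a hypothetical stand-in (LeafTerminated
plays the role of CollectionTerminatedException, int[] arrays play the role of
leaves); this is an illustration of the control flow, not IndexSearcher code.

```java
import java.util.ArrayList;
import java.util.List;

// All types here are hypothetical stand-ins, not Lucene classes.
final class LeafTerminationSketch {
    // Plays the role of CollectionTerminatedException.
    static final class LeafTerminated extends RuntimeException {}

    interface LeafCollector {
        void startLeaf();
        void collect(int doc);
    }

    // Visits every leaf; a LeafTerminated thrown by collect() ends only the
    // current leaf, and the loop continues with the next one.
    static void search(int[][] leaves, LeafCollector collector) {
        for (int[] leaf : leaves) {
            collector.startLeaf();
            try {
                for (int doc : leaf) {
                    collector.collect(doc);
                }
            } catch (LeafTerminated e) {
                // Expected control flow: skip the rest of this leaf.
            }
        }
    }

    // Example collector: keeps at most perLeaf docs from each leaf.
    static List<Integer> firstPerLeaf(int[][] leaves, int perLeaf) {
        List<Integer> hits = new ArrayList<>();
        search(leaves, new LeafCollector() {
            private int collectedInLeaf;

            @Override
            public void startLeaf() {
                collectedInLeaf = 0;
            }

            @Override
            public void collect(int doc) {
                if (collectedInLeaf == perLeaf) {
                    throw new LeafTerminated();
                }
                hits.add(doc);
                collectedInLeaf++;
            }
        });
        return hits;
    }
}
```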
- Optimizations (7)
- LUCENE-4839: SorterTemplate.merge can now be overridden in order to replace
the default implementation which merges in-place by a faster implementation
that could require fewer swaps at the expense of some extra memory.
ArrayUtil and CollectionUtil override it so that their mergeSort and timSort
methods are faster but only require up to 1% of extra memory.
(Adrien Grand)
- LUCENE-4571: Speed up BooleanQuery with minNrShouldMatch by using
skipping.
(Stefan Pohl via Robert Muir)
- LUCENE-4863: StemmerOverrideFilter now uses an FST to represent its overrides
in memory.
(Simon Willnauer)
- LUCENE-4889: UnicodeUtil.codePointCount implementation replaced with a
non-array-lookup version.
(Dawid Weiss)
- LUCENE-4923: Speed up BooleanQuery's processing of in-order disjunctions.
(Robert Muir)
- LUCENE-4926: Speed up DisjunctionMaxQuery.
(Robert Muir)
- LUCENE-4930: Reduce contention in older/buggy JVMs when using
AttributeSource#addAttribute() because java.lang.ref.ReferenceQueue#poll()
is implemented using synchronization.
(Christian Ziech, Karl Wright, Uwe Schindler)
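As an illustration of what a "non-array-lookup" code point count (LUCENE-4889
above) can look like, the sketch below counts UTF-8 code points with a bit test
instead of a table indexed by lead byte. It assumes well-formed input and is
only one possible approach; it is not claimed to be the code that went into
UnicodeUtil, and the class name is hypothetical.

```java
// Hypothetical helper; assumes the input is well-formed UTF-8.
final class Utf8CountSketch {
    // Counts code points by counting every byte that is NOT a continuation
    // byte (bit pattern 10xxxxxx). No lookup table is involved.
    static int codePointCount(byte[] utf8, int offset, int length) {
        int count = 0;
        for (int i = offset; i < offset + length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }
}
```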
- Bug Fixes (18)
- LUCENE-4868: SumScoreFacetsAggregator used an incorrect index into
the scores array.
(Shai Erera)
- LUCENE-4882: FacetsAccumulator did not allow counting the ROOT category
(i.e. counting dimensions).
(Shai Erera)
- LUCENE-4876: IndexWriterConfig.clone() now clones its MergeScheduler,
IndexDeletionPolicy and InfoStream in order to make an IndexWriterConfig and
its clone fully independent.
(Adrien Grand)
- LUCENE-4893: Facet counts were multiplied as many times as
FacetsCollector.getFacetResults() was called.
(Shai Erera)
- LUCENE-4888: Fixed SloppyPhraseScorer, MultiDocs(AndPositions)Enum and
MultiSpansWrapper which happened to sometimes call DocIdSetIterator.advance
with target<=current (in this case the behavior of advance is undefined).
(Adrien Grand)
- LUCENE-4899: FastVectorHighlighter failed with StringIndexOutOfBoundsException
if a single highlight phrase or term was greater than the fragCharSize producing
negative string offsets.
(Simon Willnauer)
- LUCENE-4877: Throw exception for invalid arguments in analysis factories.
(Steve Rowe, Uwe Schindler, Robert Muir)
- LUCENE-4914: SpatialPrefixTree's Node/Cell.reset() forgot to reset the 'leaf'
flag. It affects spatial RecursivePrefixTreeStrategy on non-point indexed
shapes, as of Lucene 4.2.
(David Smiley)
- LUCENE-4913: FacetResultNode.ordinal was always 0 when all children
are returned.
(Mike McCandless)
- LUCENE-4918: Highlighter closed the given IndexReader if QueryScorer
was used with an external IndexReader.
(Simon Willnauer, Sirvan Yahyaei)
- LUCENE-4880: Fix MemoryIndex to consume empty terms from the token stream,
consistent with IndexWriter. Previously it discarded them.
(Timothy Allison via Robert Muir)
- LUCENE-4885: FacetsAccumulator did not set the correct value for
FacetResult.numValidDescendants.
(Mike McCandless, Shai Erera)
- LUCENE-4925: Fixed IndexSearcher.search when the argument list contains a Sort
and one of the sort fields is the relevance score. Only IndexSearchers created
with an ExecutorService are concerned.
(Adrien Grand)
- LUCENE-4738, LUCENE-2727, LUCENE-2812: Simplified
DirectoryReader.indexExists so that it's more robust to transient
IOExceptions (e.g. due to issues like file descriptor exhaustion),
but this will also cause it to err towards returning true for
example if the directory contains a corrupted index or an incomplete
initial commit. In addition, IndexWriter with OpenMode.CREATE will
now succeed even if the directory contains a corrupted index.
(Billow Gao, Robert Muir, Mike McCandless)
- LUCENE-4928: Stored fields and term vectors could become super slow in case
of tiny documents (a few bytes). This is especially problematic when switching
codecs since bulk-merge strategies can't be applied and the same chunk of
documents can end up being decompressed thousands of times. A hard limit on
the number of documents per chunk has been added to fix this issue.
(Robert Muir, Adrien Grand)
- LUCENE-4934: Fix minor equals/hashcode problems in facet/DrillDownQuery,
BoostingQuery, MoreLikeThisQuery, FuzzyLikeThisQuery, and block join queries.
(Robert Muir, Uwe Schindler)
- LUCENE-4504: Fix broken sort comparator in ValueSource.getSortField,
used when sorting by a function query.
(Tom Shally via Robert Muir)
- LUCENE-4937: Fix incorrect sorting of float/double values (+/-0, NaN).
(Robert Muir, Uwe Schindler)
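Why +/-0 and NaN are troublesome for sorting (LUCENE-4937 above) can be shown
with plain JDK calls. The sketch below contrasts a naive comparison built from
< and > with Double.compare, which defines a total order. The class name is
hypothetical and the code does not reproduce the actual Lucene fix.

```java
import java.util.Arrays;

// Hypothetical helper contrasting two ways to compare doubles.
final class DoubleOrderSketch {
    // Naive comparison: reports -0.0 and 0.0 as equal, and reports NaN as
    // "equal" to every value, which breaks the comparator contract.
    static int naiveCompare(double a, double b) {
        return a < b ? -1 : (a > b ? 1 : 0);
    }

    // Total order: -0.0 sorts before 0.0 and NaN sorts after everything else.
    static int totalOrderCompare(double a, double b) {
        return Double.compare(a, b);
    }

    // Returns a sorted copy using the total order.
    static Double[] sorted(Double... values) {
        Double[] copy = values.clone();
        Arrays.sort(copy, DoubleOrderSketch::totalOrderCompare);
        return copy;
    }
}
```

With the total order, sorting NaN, 0.0, -0.0 and -1.0 yields -1.0, -0.0, 0.0,
NaN regardless of the input order.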
- Documentation (1)
- LUCENE-4841: Added example SimpleSortedSetFacetsExample to show how
to use the new SortedSetDocValues backed facet implementation.
(Shai Erera, Mike McCandless)
- Build (1)
- LUCENE-4879: Upgrade randomized testing to version 2.0.9:
Filter stack traces on console output.
(Dawid Weiss, Robert Muir)
- Bug Fixes (9)
- LUCENE-4713: The SPI components used to load custom codecs or analysis
components were fixed to also scan the Lucene ClassLoader in addition
to the context ClassLoader, so Lucene is always able to find its own
codecs. The special case of a null context ClassLoader is now also
supported.
(Christian Kohlschütter, Uwe Schindler)
- LUCENE-4819: seekExact(BytesRef, boolean) did not work correctly with
Sorted[Set]DocValuesTermsEnum.
(Robert Muir)
- LUCENE-4826: PostingsHighlighter was not returning the top N best
scoring passages.
(Robert Muir, Mike McCandless)
- LUCENE-4854: Fix DocTermOrds.getOrdTermsEnum() to not return negative
ord on initial next().
(Robert Muir)
- LUCENE-4836: Fix SimpleRateLimiter#pause to return the actual time spent
sleeping instead of the wakeup timestamp in nanoseconds.
(Simon Willnauer)
- LUCENE-4828: BooleanQuery no longer extracts terms from its MUST_NOT
clauses.
(Mike McCandless)
- SOLR-4589: Fixed CPU spikes and poor performance in lazy field loading
of multivalued fields.
(hossman)
- LUCENE-4870: Fix bug where an entire index might be deleted by the IndexWriter
due to false detection of whether an index exists in the directory when
OpenMode.CREATE_OR_APPEND is used. This might also affect applications that set
the open mode manually using DirectoryReader#indexExists.
(Simon Willnauer)
- LUCENE-4878: Override getRegexpQuery in MultiFieldQueryParser to prevent
NullPointerException when regular expression syntax is used with
MultiFieldQueryParser.
(Simon Willnauer, Adam Rauch)
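The distinction in LUCENE-4836 above (time spent sleeping versus the wake-up
timestamp) can be sketched with System.nanoTime(). The helper below is
hypothetical and is not the SimpleRateLimiter code; it only shows that the
return value is a duration measured from the start of the call.

```java
// Hypothetical helper, not Lucene's SimpleRateLimiter.
final class PauseSketch {
    // Sleeps until targetNS (a System.nanoTime() value) has been reached and
    // returns how long this call waited, in nanoseconds. Returning the last
    // nanoTime() reading instead would hand back a timestamp, not a duration.
    static long pauseUntil(long targetNS) {
        long startNS = System.nanoTime();
        long curNS = startNS;
        while (targetNS - curNS > 0) {
            long remainingNS = targetNS - curNS;
            try {
                Thread.sleep(remainingNS / 1_000_000, (int) (remainingNS % 1_000_000));
            } catch (InterruptedException e) {
                // Preserve the interrupt and stop waiting.
                Thread.currentThread().interrupt();
                break;
            }
            curNS = System.nanoTime();
        }
        return System.nanoTime() - startNS;
    }
}
```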
- Optimizations (3)
- LUCENE-4819: Added Sorted[Set]DocValues.termsEnum(), and optimized the
default codec for improved enumeration performance.
(Robert Muir)
- LUCENE-4854: Speed up TermsEnum of FieldCache.getDocTermOrds.
(Robert Muir)
- LUCENE-4857: Don't unnecessarily copy stem override map in
StemmerOverrideFilter.
(Simon Willnauer)
- Changes in backwards compatibility policy (12)
- LUCENE-4602: FacetFields now stores facet ordinals in a DocValues field,
rather than a payload. This requires rebuilding existing indexes, or doing a
one-time migration using FacetsPayloadMigratingReader. Since DocValues
support in-memory caching, CategoryListCache was removed too.
(Shai Erera, Michael McCandless)
- LUCENE-4697: FacetResultNode is now a concrete class with public members
(instead of getter methods).
(Shai Erera)
- LUCENE-4600: FacetsCollector is now an abstract class with two
implementations: StandardFacetsCollector (the old version of
FacetsCollector) and CountingFacetsCollector. FacetsCollector.create()
returns the most optimized collector for the given parameters.
(Shai Erera, Michael McCandless)
- LUCENE-4700: OrdinalPolicy is now per CategoryListParams, and is no longer
an interface, but rather an enum with values NO_PARENTS and ALL_PARENTS.
PathPolicy was removed; extend FacetFields and DrillDownStream
to control which categories are added as drill-down terms.
(Shai Erera)
- LUCENE-4547: DocValues improvements:
  - Simplified codec API: codecs are now only responsible for encoding and
    decoding docvalues, they do not need to do buffering or RAM accounting.
  - Per-Field support: added PerFieldDocValuesFormat, which allows you to
    use a different DocValuesFormat per field (like postings).
  - Unified with FieldCache api: DocValues can be accessed via FieldCache API,
    so it works automatically with grouping/join/sort/function queries, etc.
  - Simplified types: There are only 3 types (NUMERIC, BINARY, SORTED), so it's
    not necessary to specify for example that all of your binary values have
    the same length. Instead it's easy for the Codec API to optimize encoding
    based on any properties of the content.
(Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
- LUCENE-4757: Cleanup and refactoring of FacetsAccumulator, FacetRequest,
FacetsAggregator and FacetResultsHandler API. If your application did
FacetsCollector.create(), you should not be affected, but if you wrote
an Aggregator, then you should migrate it to the per-segment
FacetsAggregator. You can still use StandardFacetsAccumulator, which works
with the old API (for now).
(Shai Erera)
- LUCENE-4761: Facet packages reorganized. Should be easy to fix your import
statements, if you use an IDE such as Eclipse.
(Shai Erera)
- LUCENE-4750: Convert DrillDown to DrillDownQuery, so you can initialize it
and add drill-down categories to it.
(Michael McCandless, Shai Erera)
- LUCENE-4759: Removed FacetRequest.SortBy; result categories are always
sorted by value, while ties are broken by category ordinal.
(Shai Erera)
- LUCENE-4772: Facet associations moved to new FacetsAggregator API. You
should override FacetsAccumulator and return the relevant aggregator,
for aggregating the association values.
(Shai Erera)
- LUCENE-4748: A FacetRequest on a non-existent field now returns an
empty FacetResult instead of skipping it.
(Shai Erera, Mike McCandless)
- LUCENE-4806: The default category delimiter character was changed
from U+F749 to U+001F, since the latter uses 1 byte vs 3 bytes for
the former. Existing facet indices must be reindexed.
(Robert Muir, Shai Erera, Mike McCandless)
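The byte counts quoted in LUCENE-4806 above follow directly from UTF-8
encoding and can be checked with the JDK. The helper below is a hypothetical
sketch; joining path components with the delimiter is only an assumed layout
used to show how the saving adds up, not a description of how facet terms are
actually stored.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for measuring UTF-8 sizes.
final class DelimiterSizeSketch {
    // Number of UTF-8 bytes needed for a single char.
    static int utf8Length(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-8 size of the components joined by the delimiter (assumed layout).
    static int encodedLength(char delimiter, String... components) {
        return String.join(String.valueOf(delimiter), components)
            .getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For a three-component path of one-byte labels, the two delimiters cost 2 bytes
with U+001F and 6 bytes with U+F749.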
- Optimizations (11)
- LUCENE-4687: BloomFilterPostingsFormat now lazily initializes delegate
TermsEnum only if needed to do a seek or get a DocsEnum.
(Simon Willnauer)
- LUCENE-4677, LUCENE-4682: Unpacked FSTs now use vInt to encode the node target,
to reduce their size.
(Mike McCandless)
- LUCENE-4678: FST now uses a paged byte[] structure instead of a
single byte[] internally, to avoid large memory spikes during
building
(James Dyer, Mike McCandless)
- LUCENE-3298: FST can now be larger than 2.1 GB / 2.1 B nodes.
(James Dyer, Mike McCandless)
- LUCENE-4690: Performance improvements and non-hashing versions
of NumericUtils.*ToPrefixCoded()
(yonik)
- LUCENE-4715: CategoryListParams.getOrdinalPolicy can now return a
different OrdinalPolicy per dimension, to better tune how you index
facets. Also added OrdinalPolicy.ALL_BUT_DIMENSION.
(Shai Erera, Michael McCandless)
- LUCENE-4740: Don't track clones of MMapIndexInput if unmapping
is disabled. This reduces GC overhead.
(Kristofer Karlsson, Uwe Schindler)
- LUCENE-4733: The default Lucene 4.2 codec now uses a more compact
TermVectorsFormat (Lucene42TermVectorsFormat) based on
CompressingTermVectorsFormat.
(Adrien Grand)
- LUCENE-3729: The default Lucene 4.2 codec now uses a more compact
DocValuesFormat (Lucene42DocValuesFormat). Sorted values are stored in an
FST, Numerics and Ordinals use a number of strategies (delta-compression,
table-compression, etc.), and memory addresses use MonotonicBlockPackedWriter.
(Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
- LUCENE-4792: Reduction of the memory required to build the doc ID maps used
when merging segments.
(Adrien Grand)
- LUCENE-4794: Spatial RecursivePrefixTreeStrategy's search filter: Skip calls
to termsEnum.seek() when the next term is known to follow the current cell.
(David Smiley)
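To see why a vInt encoding (LUCENE-4677 / LUCENE-4682 above) reduces size,
the sketch below writes an int using 7 data bits per byte, low-order bits
first, with the high bit of each byte marking that more bytes follow. The
class is a hypothetical stand-alone illustration of the variable-length
format, not the FST code itself.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-alone vInt encoder/decoder.
final class VIntSketch {
    // Small values take 1 byte; any int takes at most 5 bytes.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(5);
        while ((value & ~0x7F) != 0) {
            // Emit the low 7 bits and set the continuation bit.
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reads one vInt from the start of the array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this vInt
            }
        }
        return value;
    }
}
```

A target below 128 is stored in one byte instead of a fixed four, which is
where the size reduction comes from when most targets are small.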
- New Features (16)
- LUCENE-4686: New specialized DGapVInt8IntEncoder for facets (now the
default).
(Shai Erera)
- LUCENE-4703: Add simple PrintTaxonomyStats tool to see summary
information about the facets taxonomy index.
(Mike McCandless)
- LUCENE-4599: New oal.codecs.compressing.CompressingTermVectorsFormat which
compresses term vectors into chunks of documents similarly to
CompressingStoredFieldsFormat.
(Adrien Grand)
- LUCENE-4695: Added LiveFieldValues utility class, for getting the
current (live, real-time) value for any indexed doc/field. The
class buffers recently indexed doc/field values until a new
near-real-time reader is opened that contains those changes.
(Robert Muir, Mike McCandless)
- LUCENE-4723: Add AnalyzerFactoryTask to benchmark, and enable analyzer
creation via the resulting factories using NewAnalyzerTask.
(Steve Rowe)
- LUCENE-4728: Unknown and not explicitly mapped queries are now rewritten
against the highlighting IndexReader to obtain primitive queries before
discarding the query entirely. WeightedSpanTermExtractor now builds a
MemoryIndex only once even if multiple fields are highlighted.
(Simon Willnauer)
- LUCENE-4035: Added ICUCollationDocValuesField, more efficient
support for Locale-sensitive sort and range queries for
single-valued fields.
(Robert Muir)
- LUCENE-4547: Added MonotonicBlockPacked(Reader/Writer), which provide
efficient random access to large amounts of monotonically increasing
positive values (e.g. file offsets). Each block stores the minimum value
and the average gap, and values are encoded as signed deviations from
the expected value.
(Adrien Grand)
- LUCENE-4547: Added AppendingLongBuffer, an append-only buffer that packs
signed long values in memory and provides an efficient iterator API.
(Adrien Grand)
- LUCENE-4540: It is now possible for a codec to represent norms with
less than 8 bits per value. For performance reasons this is not done
by default, but you can customize your codec (e.g. pass PackedInts.DEFAULT
to Lucene42DocValuesConsumer) if you want to make this tradeoff.
(Adrien Grand, Robert Muir)
- LUCENE-4764: A new Facet42Codec and Facet42DocValuesFormat provide
faster but more RAM-consuming facet performance.
(Shai Erera, Mike McCandless)
- LUCENE-4769: Added OrdinalsCache and CachedOrdsCountingFacetsAggregator
which uses the cache to obtain a document's ordinals. This aggregator
is faster than others, but consumes much more RAM.
(Michael McCandless, Shai Erera)
- LUCENE-4778: Add a getter for the delegate in RateLimitedDirectoryWrapper.
(Mark Miller)
- LUCENE-4765: Add a multi-valued docvalues type (SORTED_SET). This is equivalent
to building a FieldCache.getDocTermOrds at index-time.
(Robert Muir)
- LUCENE-4780: Add MonotonicAppendingLongBuffer: an append-only buffer for
monotonically increasing values.
(Adrien Grand)
- LUCENE-4748: Added DrillSideways utility class for computing both
drill-down and drill-sideways counts for a DrillDownQuery.
(Mike McCandless)
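The encoding described for MonotonicBlockPacked(Reader/Writer) in LUCENE-4547
above (minimum value, average gap, signed deviations from the expected value)
can be sketched for a single block as follows. The class is hypothetical,
takes the first value of a non-decreasing block as its minimum, and keeps the
deviations in a plain long[] where a real implementation would bit-pack them.

```java
// Hypothetical single-block sketch; not the Lucene implementation.
final class MonotonicBlockSketch {
    final long min;          // first (smallest) value of the block
    final float avgGap;      // average increase per position
    final long[] deviations; // signed distance from the expected value

    MonotonicBlockSketch(long[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty block");
        }
        min = values[0];
        avgGap = values.length == 1
            ? 0f
            : (float) (values[values.length - 1] - min) / (values.length - 1);
        deviations = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            deviations[i] = values[i] - expected(i);
        }
    }

    // Value predicted by the straight line through min with slope avgGap.
    private long expected(int index) {
        return min + (long) (avgGap * index);
    }

    // Random access: prediction plus stored correction.
    long get(int index) {
        return expected(index) + deviations[index];
    }

    // The largest absolute deviation bounds the bits needed per value.
    long maxAbsDeviation() {
        long max = 0;
        for (long d : deviations) {
            max = Math.max(max, Math.abs(d));
        }
        return max;
    }
}
```

For values such as file offsets that grow at a roughly steady rate, the
deviations stay small even when the values themselves are large, so few bits
per value are needed.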
- API Changes (4)
- LUCENE-4709: FacetResultNode no longer has a residue field.
(Shai Erera)
- LUCENE-4716: DrillDown.query now takes Occur, allowing you to specify whether
categories should be OR'ed or AND'ed.
(Shai Erera)
- LUCENE-4695: ReferenceManager.RefreshListener.afterRefresh now takes
a boolean indicating whether a new reference was in fact opened, and
a new beforeRefresh method notifies you when a refresh attempt is
starting.
(Robert Muir, Mike McCandless)
- LUCENE-4794: Spatial RecursivePrefixTreeFilter replaced by
IntersectsPrefixTreeFilter and some extensible base classes.
(David Smiley)
- Bug Fixes (17)
- LUCENE-4705: Pass on FilterStrategy in FilteredQuery if the filtered query is
rewritten.
(Simon Willnauer)
- LUCENE-4712: MemoryIndex#normValues() threw an NPE if the field didn't exist.
(Simon Willnauer, Ricky Pritchett)
- LUCENE-4550: Shapes wider than 180 degrees would use too much accuracy for the
PrefixTree based SpatialStrategy. For a pathological case of nearly 360
degrees and barely any height, it would generate so many indexed terms
(> 500k) that it could even cause an OutOfMemoryError. Fixed.
(David Smiley)
- LUCENE-4704: Make join queries override hashcode and equals methods.
(Martijn van Groningen)
- LUCENE-4724: Fix bug in CategoryPath which allowed passing null or empty
string components. This is forbidden now (throws an exception). Note that if
you have a taxonomy index created with such strings, you should rebuild it.
(Michael McCandless, Shai Erera)
- LUCENE-4732: Fixed TermsEnum.seekCeil/seekExact on term vectors.
(Adrien Grand, Robert Muir)
- LUCENE-4739: Fixed bugs that prevented FSTs larger than ~1.1 GB from
being saved and loaded.
(Adrien Grand, Mike McCandless)
- LUCENE-4717: Fixed bug where Lucene40DocValuesFormat would sometimes write
an extra unused ordinal for sorted types. The bug is detected and corrected
on-the-fly for old indexes.
(Robert Muir)
- LUCENE-4547: Fixed bug where Lucene40DocValuesFormat was unable to encode
segments that would exceed 2GB total data. This could happen in some surprising
cases, for example if you had an index with more than 260M documents and a
VAR_INT field.
(Simon Willnauer, Adrien Grand, Mike McCandless, Robert Muir)
- LUCENE-4775: Remove SegmentInfo.sizeInBytes() and make
MergePolicy.OneMerge.totalBytesSize thread safe.
(Josh Bronson via Robert Muir, Mike McCandless)
- LUCENE-4770: If spatial's TermQueryPrefixTreeStrategy was used to search
indexed non-point shapes, then there was an edge case where a query should
find a shape but it didn't. The fix is the removal of an optimization that
simplifies some leaf cells into a parent. The index data for such a field is
now ~20% larger. This optimization is still done for the query shape, and for
indexed data for RecursivePrefixTreeStrategy. Furthermore, this optimization
is enhanced to roll up beyond the bottom cell level.
(David Smiley, Florian Schilling)
- LUCENE-4790: Fix FieldCacheImpl.getDocTermOrds to not bake deletes into the
cached datastructure. Otherwise this can cause inconsistencies with readers
at different points in time.
(Robert Muir)
- LUCENE-4791: A conjunction of terms (ConjunctionTermScorer) scanned on
the lowest frequency term instead of skipping, leading to potentially
large performance impacts for many non-random or non-uniform
term distributions.
(John Wang, yonik)
- LUCENE-4798: PostingsHighlighter's formatter sometimes didn't highlight
matched terms.
(Robert Muir)
- LUCENE-4796, SOLR-4373: Fix concurrency issue in NamedSPILoader and
AnalysisSPILoader when doing reload (e.g. from Solr).
(Uwe Schindler, Hossman)
- LUCENE-4802: Don't compute norms for drill-down facet fields.
(Mike McCandless)
- LUCENE-4804: PostingsHighlighter sometimes applied terms to the wrong passage,
if they started exactly on a passage boundary.
(Robert Muir)
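The difference between scanning and skipping in a conjunction (LUCENE-4791
above) can be sketched on two sorted doc ID arrays. In the hypothetical helper
below, a binary search stands in for a postings skip list: whichever side is
behind jumps straight to the other side's current doc instead of stepping one
doc at a time. The arrays are assumed to be sorted and free of duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a two-term conjunction; not ConjunctionTermScorer.
final class ConjunctionSketch {
    // Index of the first doc >= target at or after 'from', or docs.length.
    static int advance(int[] docs, int from, int target) {
        int idx = Arrays.binarySearch(docs, from, docs.length, target);
        return idx >= 0 ? idx : -idx - 1;
    }

    // Leapfrog intersection: the side that is behind skips to the other
    // side's current doc.
    static List<Integer> intersect(int[] rare, int[] common) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < rare.length && j < common.length) {
            if (rare[i] == common[j]) {
                hits.add(rare[i]);
                i++;
                j++;
            } else if (rare[i] < common[j]) {
                i = advance(rare, i + 1, common[j]);
            } else {
                j = advance(common, j + 1, rare[i]);
            }
        }
        return hits;
    }
}
```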
- Documentation (2)
- LUCENE-4718: Fixed documentation of oal.queryparser.classic.
(Hayden Muhl via Adrien Grand)
- LUCENE-4784, LUCENE-4785, LUCENE-4786: Fixed references to deprecated classes
SinkTokenizer, ValueSourceQuery and RangeQuery.
(Hao Zhong via Adrien Grand)
- Build (4)
- LUCENE-4654: Test duration statistics from multiple test runs should be
reused.
(Dawid Weiss)
- LUCENE-4636: Upgrade ivy to 2.3.0
(Shawn Heisey via Robert Muir)
- LUCENE-4570: Use the Policeman Forbidden API checker, released separately
from Lucene and downloaded via Ivy.
(Uwe Schindler, Robert Muir)
- LUCENE-4758: 'ant jar', 'ant compile', and 'ant compile-test' should
recurse.
(Steve Rowe)
- Changes in backwards compatibility policy (16)
- LUCENE-4514: Scorer's freq() method returns an integer value indicating
the number of times the scorer matches the current document. Previously
this was only sometimes the case; in some cases it returned a (meaningless)
floating point value. Scorer now extends DocsEnum so it has attributes().
(Robert Muir)
- LUCENE-4543: TFIDFSimilarity's index-time computeNorm is now final to
match the fact that its query-time norm usage requires a FIXED_8 encoding.
Override lengthNorm and/or encode/decodeNormValue to change the specifics,
like Lucene 3.x.
(Robert Muir)
- LUCENE-3441: The facet module now supports NRT. As a result, the following
changes were made:
  - DirectoryTaxonomyReader has a new constructor which takes a
    DirectoryTaxonomyWriter. You should use that constructor in order to get
    the NRT support (or the old one for non-NRT).
  - TaxonomyReader.refresh() was removed in favor of the static
    TaxonomyReader.openIfChanged method. Similar to DirectoryReader, the
    method either returns null if no changes were made to the taxonomy, or a
    new TR instance otherwise. Instead of calling refresh(), you should write
    similar code to how you reopen a regular DirectoryReader.
  - TaxonomyReader.openIfChanged (previously refresh()) no longer throws
    InconsistentTaxonomyException, and supports recreate.
    InconsistentTaxonomyException was removed.
  - ChildrenArrays was pulled out of TaxonomyReader into a top-level class.
  - TaxonomyReader was made an abstract class (instead of an interface), with
    methods such as close() and reference counting management pulled from
    DirectoryTaxonomyReader, and made final. The rest of the methods remained
    abstract.
(Shai Erera, Gilad Barkai)
- LUCENE-4576: Remove CachingWrapperFilter(Filter, boolean). This recacheDeletes
option gave less than 1% speedup at the expense of cache churn (filters were
invalidated on reopen if even a single delete was posted against the segment).
(Robert Muir)
- LUCENE-4575: Replace IndexWriter's commit/prepareCommit versions that take
commitData with setCommitData(). That allows committing changes to IndexWriter
even if the commitData is the only thing that changes.
(Shai Erera, Michael McCandless)
- LUCENE-4565: TaxonomyReader.getParentArray and .getChildrenArrays consolidated
into one getParallelTaxonomyArrays(). You can obtain the 3 arrays that the
previous two methods returned by calling parents(), children() or siblings()
on the returned ParallelTaxonomyArrays.
(Shai Erera)
- LUCENE-4585: Spatial PrefixTree based Strategies (either TermQuery or
RecursivePrefix based) MAY want to re-index if used for point data. If a
re-index is not done, then an indexed point is ~1/2 the smallest grid cell
larger and as such is slightly more likely to match a query shape.
(David Smiley)
- LUCENE-4604: DefaultOrdinalPolicy removed in favor of OrdinalPolicy.ALL_PARENTS.
Same for DefaultPathPolicy (now PathPolicy.ALL_CATEGORIES). In addition, you
can use OrdinalPolicy.NO_PARENTS to never write any parent category ordinal
to the fulltree posting payload (but note that you need a special
FacetsAccumulator - see javadocs).
(Shai Erera)
- LUCENE-4594: Spatial PrefixTreeStrategy no longer indexes center points of
non-point shapes. If you want to call makeDistanceValueSource() based on
shape centers, you need to do this yourself in another spatial field.
(David Smiley)
- LUCENE-4615: Replace IntArrayAllocator and FloatArrayAllocator by ArraysPool.
FacetArrays no longer takes those allocators; if you need to reuse the arrays,
you should use ReusingFacetArrays.
(Shai Erera, Gilad Barkai)
- LUCENE-4621: FacetIndexingParams is now a concrete class (instead of DefaultFIP).
Also, the entire IndexingParams chain is now immutable. If you need to override
a setting, you should extend the relevant class.
Additionally, FacetSearchParams is now immutable, and requires all FacetRequests
to be specified at initialization time.
(Shai Erera)
- LUCENE-4647: CategoryDocumentBuilder and EnhancementsDocumentBuilder are replaced
by FacetFields and AssociationsFacetFields respectively. CategoryEnhancement and
AssociationEnhancement were removed in favor of a simplified CategoryAssociation
interface, with CategoryIntAssociation and CategoryFloatAssociation
implementations.
NOTE: indexes that contain category enhancements/associations are not supported
by the new code and should be recreated.
(Shai Erera)
- LUCENE-4659: Massive cleanup to CategoryPath API. Additionally, CategoryPath is
now immutable, so you don't need to clone() it.
(Shai Erera)
- LUCENE-4670: StoredFieldsWriter and TermVectorsWriter have new finish* callbacks
which are called after a doc/field/term has been completely added.
(Adrien Grand, Robert Muir)
- LUCENE-4620: IntEncoder/Decoder were changed to do bulk encoding/decoding. As a
result, a few other classes such as Aggregator and CategoryListIterator were
changed to handle bulk category ordinals.
(Shai Erera)
- LUCENE-4683: CategoryListIterator and Aggregator are now per-segment. As such
their implementations no longer take a top-level IndexReader in the constructor
but rather implement a setNextReader.
(Shai Erera)
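The reopen contract described in LUCENE-3441 above (null when nothing changed,
a new instance otherwise) leads to a calling pattern like the one sketched
below. Reader, its version field, and both methods are hypothetical stand-ins
used only to show the null check; they are not TaxonomyReader or
DirectoryReader.

```java
// Hypothetical types that mimic the openIfChanged contract; not Lucene classes.
final class ReopenSketch {
    static final class Reader {
        final long version;

        Reader(long version) {
            this.version = version;
        }
    }

    // Returns null when nothing changed, or a new Reader otherwise.
    static Reader openIfChanged(Reader old, long latestVersion) {
        return latestVersion == old.version ? null : new Reader(latestVersion);
    }

    // Typical caller: keep the current instance when null comes back.
    static Reader maybeReopen(Reader current, long latestVersion) {
        Reader changed = openIfChanged(current, latestVersion);
        if (changed == null) {
            return current;
        }
        // A real caller would also release the old reader at this point.
        return changed;
    }
}
```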
- New Features (14)
- LUCENE-4226: New experimental StoredFieldsFormat that compresses chunks of
documents together in order to improve the compression ratio.
(Adrien Grand)
- LUCENE-4426: New ValueSource implementations (in lucene/queries) for
DocValues fields.
(Adrien Grand)
- LUCENE-4410: FilteredQuery now exposes a FilterStrategy that controls
how filters are applied during query execution.
(Simon Willnauer)
- LUCENE-4404: New ListOfOutputs (in lucene/misc) for FSTs wraps
another Outputs implementation, allowing you to store more than one
output for a single input. UpToTwoPositiveIntsOutputs was moved
from lucene/core to lucene/misc.
(Mike McCandless)
- LUCENE-3842: New AnalyzingSuggester, for doing auto-suggest
using an analyzer. This can create powerful suggesters: if the analyzer
removes stop words then "ghost chr..." could suggest "The Ghost of
Christmas Past"; if SynonymFilter is used to map wifi and wireless
network to hotspot, then "wirele..." could suggest "wifi router";
token normalization such as stemming, accent removal, etc. would allow
the suggester to ignore such variations.
(Robert Muir, Sudarshan Gaikaiwari, Mike McCandless)
- LUCENE-4446: Lucene 4.1 has a new default index format (Lucene41Codec)
that incorporates the previously experimental "Block" postings format
for better search performance.
(Han Jiang, Adrien Grand, Robert Muir, Mike McCandless)
- LUCENE-3846: New FuzzySuggester, like AnalyzingSuggester except it
also finds completions allowing for fuzzy edits in the input string.
(Robert Muir, Simon Willnauer, Mike McCandless)
- LUCENE-4515: MemoryIndex now supports adding the same field multiple
times.
(Simon Willnauer)
- LUCENE-4489: Added consumeAllTokens option to LimitTokenCountFilter
(hossman, Robert Muir)
- LUCENE-4566: Add NRT/SearcherManager.RefreshListener/addListener to
be notified whenever a new searcher is opened.
(selckin via Shai Erera, Mike McCandless)
- SOLR-4123: Add per-script customizability to ICUTokenizerFactory via
rule files in the ICU RuleBasedBreakIterator format.
(Shawn Heisey, Robert Muir, Steve Rowe)
- LUCENE-4590: Added WriteEnwikiLineDocTask - a benchmark task for writing
Wikipedia category pages and non-category pages into separate line files.
extractWikipedia.alg was changed to use this task, so now it creates two
files.
(Doron Cohen)
- LUCENE-4290: Added PostingsHighlighter to the highlighter module. It uses
offsets from the postings lists to highlight documents.
(Robert Muir)
- LUCENE-4628: Added CommonTermsQuery that executes high-frequency terms
in an optional sub-query to prevent slow queries due to "common" terms
like stopwords.
(Simon Willnauer)
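The idea behind LUCENE-3842 (matching on the analyzed form while suggesting the original text) can be illustrated without Lucene. The following is a minimal sketch, not how AnalyzingSuggester is implemented: the class name `AnalyzedPrefixSuggestSketch`, the toy stop-word list, and the whitespace tokenization are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not Lucene code.
final class AnalyzedPrefixSuggestSketch {
    // Toy stop-word list, assumed for this example.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "of", "a", "an"));

    // Toy "analysis": lowercase, split on whitespace, drop stop words.
    static String analyze(String text) {
        StringBuilder sb = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    // Returns the original entries whose analyzed form starts with the
    // analyzed form of what the user typed.
    static List<String> suggest(List<String> entries, String typed) {
        String prefix = analyze(typed);
        List<String> matches = new ArrayList<>();
        for (String entry : entries) {
            if (analyze(entry).startsWith(prefix)) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> titles =
            Arrays.asList("The Ghost of Christmas Past", "A Christmas Carol");
        System.out.println(suggest(titles, "ghost chr"));
    }
}
```

Because both the indexed titles and the typed text pass through the same toy analysis, the leading "The" and the inner "of" no longer prevent a prefix match.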
- API Changes (11)
- LUCENE-4399: Deprecated AppendingCodec. Lucene's term dictionaries
no longer seek when writing.
(Adrien Grand, Robert Muir)
- LUCENE-4479: Rename TokenSources.getTokenStream(IndexReader, int, String)
to TokenSources.getTokenStreamWithOffsets, and return null on failure
rather than throwing IllegalArgumentException.
(Alan Woodward)
- LUCENE-4472: MergePolicy now accepts a MergeTrigger that provides
information about what triggered the merge, e.g. a segment merge
or a full flush.
(Simon Willnauer)
- LUCENE-4415: TermsFilter is now immutable. All terms need to be provided
as constructor arguments.
(Simon Willnauer)
- LUCENE-4520: ValueSource.getSortField no longer throws IOExceptions
(Alan Woodward)
- LUCENE-4537: RateLimiter is now separated from FSDirectory and exposed via
RateLimitingDirectoryWrapper. Any Directory can now be rate-limited.
(Simon Willnauer)
- LUCENE-4591: CompressingStoredFields{Writer,Reader} now accept a segment
suffix as a constructor parameter.
(Renaud Delbru via Adrien Grand)
- LUCENE-4605: Added DocsEnum.FLAG_NONE which can be passed instead of 0 as
the flag to .docs() and .docsAndPositions().
(Shai Erera)
- LUCENE-4617: Remove FST.pack() method. Previously to make a packed FST,
you had to make a Builder with willPackFST=true (telling it you will later pack it),
create your fst with finish(), and then call pack() to get another FST.
Instead just pass true for doPackFST to Builder and finish() returns a packed FST.
(Robert Muir)
- LUCENE-4663: Deprecate IndexSearcher.document(int, Set). This was not intended
to be final, nor named document(). Use IndexSearcher.doc(int, Set) instead.
(Robert Muir)
- LUCENE-4684: Made DirectSpellChecker extendable.
(Martijn van Groningen)
- Bug Fixes (31)
- LUCENE-1822: BaseFragListBuilder hard-coded 6 char margin is too naive.
(Alex Vigdor, Arcadius Ahouansou, Koji Sekiguchi)
- LUCENE-4468: Fix rareish integer overflows in Lucene41 postings
format.
(Robert Muir)
- LUCENE-4486: Add support for ConstantScoreQuery in Highlighter.
(Simon Willnauer)
- LUCENE-4485: When CheckIndex counts terms, terms/docs pairs and tokens,
these counts now all exclude deleted documents.
(Mike McCandless)
- LUCENE-4479: Highlighter works correctly for fields with term vector
positions, but no offsets.
(Alan Woodward)
- SOLR-3906: JapaneseReadingFormFilter in romaji mode will return
romaji even for out-of-vocabulary kana cases (e.g. half-width forms).
(Robert Muir)
- LUCENE-4511: TermsFilter might return wrong results if a field is not
indexed or doesn't exist in the index.
(Simon Willnauer)
- LUCENE-4521: IndexWriter.tryDeleteDocument could return true
(successfully deleting the document) but then on IndexWriter
close/commit fail to write the new deletions, if no other changes
happened in the IndexWriter instance.
(Ivan Vasilev via Mike McCandless)
- LUCENE-4513: Fixed that deleted nested docs are scored into the
parent doc when using ToParentBlockJoinQuery.
(Martijn van Groningen)
- LUCENE-4534: Fixed WFSTCompletionLookup and Analyzing/FuzzySuggester
to allow 0 byte values in the lookup keys.
(Mike McCandless)
- LUCENE-4532: DirectoryTaxonomyWriter used a timestamp to denote taxonomy
index re-creation, which could cause a bug if machine clocks were
not synced. Instead, it now tracks an 'epoch' version, which is incremented
whenever the taxonomy is re-created or replaced.
(Shai Erera)
- LUCENE-4544: Fixed off-by-1 in ConcurrentMergeScheduler that would
allow 1+maxMergeCount merge threads to be created, instead of just
maxMergeCount.
(Radim Kolar, Mike McCandless)
- LUCENE-4567: Fixed NullPointerException in analyzing, fuzzy, and
WFST suggesters when no suggestions were added
(selckin via Mike McCandless)
- LUCENE-4568: Fixed integer overflow in
PagedBytes.PagedBytesData{In,Out}put.getPosition.
(Adrien Grand)
- LUCENE-4581: GroupingSearch.setAllGroups(true) was failing to
actually compute allMatchingGroups
(dizh@neusoft.com via Mike McCandless)
- LUCENE-4009: Improve TermsFilter.toString
(Tim Costermans via Chris Male, Mike McCandless)
- LUCENE-4588: Benchmark's EnwikiContentSource was discarding last wiki
document and had leaking threads in 'forever' mode.
(Doron Cohen)
- LUCENE-4585: Spatial RecursivePrefixTreeFilter had some bugs that only
occurred when shapes were indexed. In what appears to be rare circumstances,
documents with shapes near a query shape were erroneously considered a match.
In addition, it wasn't possible to index a shape representing the entire
globe.
- LUCENE-4595: EnwikiContentSource had a thread safety problem (NPE) in
'forever' mode
(Doron Cohen)
- LUCENE-4587: fix WordBreakSpellChecker to not throw AIOOBE when presented
with 2-char codepoints, and to correctly break/combine terms containing
non-latin characters.
(James Dyer, Andreas Hubold)
- LUCENE-4596: fix a concurrency bug in DirectoryTaxonomyWriter.
(Shai Erera)
- LUCENE-4594: Spatial PrefixTreeStrategy would index center-points in addition
to the shape to index if it was non-point, in the same field. But sometimes
the center-point isn't actually in the shape (consider a LineString), and for
highly precise shapes it could cause makeDistanceValueSource's cache to load
parts of the shape's boundary erroneously too. So center points aren't
indexed any more; you should use another spatial field.
(David Smiley)
- LUCENE-4629: IndexWriter failed to delete documents if a document block was
indexed and the Iterator threw an exception. Documents were only rolled back
if the actual indexing process failed.
(Simon Willnauer)
- LUCENE-4608: Handle large number of requested fragments better.
(Martijn van Groningen)
- LUCENE-4633: DirectoryTaxonomyWriter.replaceTaxonomy did not refresh its
internal reader, which could cause an existing category to be added twice.
(Shai Erera)
- LUCENE-4461: If you added the same FacetRequest more than once, you would get
inconsistent results.
(Gilad Barkai via Shai Erera)
- LUCENE-4656: Fix regression in IndexWriter to work with empty TokenStreams
that have no TermToBytesRefAttribute (commonly provided by CharTermAttribute),
e.g., oal.analysis.miscellaneous.EmptyTokenStream.
(Uwe Schindler, Adrien Grand, Robert Muir)
- LUCENE-4660: ConcurrentMergeScheduler was taking too long to
un-pause incoming threads it had paused when too many merges were
queued up.
(Mike McCandless)
- LUCENE-4662: Add missing elided articles and prepositions to FrenchAnalyzer's
DEFAULT_ARTICLES list passed to ElisionFilter.
(David Leunen via Steve Rowe)
- LUCENE-4671: Fix CharsRef.subSequence method.
(Tim Smith via Robert Muir)
- LUCENE-4465: Let ConstantScoreQuery's Scorer return its child scorer.
(selckin via Uwe Schindler)
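The kind of integer overflow mentioned in LUCENE-4568 typically arises when a position is computed from a block index and a block size using int arithmetic. The sketch below illustrates that general pattern only; the class and method names are hypothetical and are not taken from PagedBytes.

```java
// Hypothetical illustration of an int-overflow bug and its usual fix;
// not the actual PagedBytes code.
final class PagedPositionSketch {
    // Buggy variant: the multiplication happens in int and can wrap
    // around before the result is widened to long.
    static long positionWithIntMath(int blockIndex, int blockSize, int offsetInBlock) {
        return blockIndex * blockSize + offsetInBlock;
    }

    // Fixed variant: widen to long before multiplying.
    static long positionWithLongMath(int blockIndex, int blockSize, int offsetInBlock) {
        return (long) blockIndex * blockSize + offsetInBlock;
    }

    public static void main(String[] args) {
        int blockIndex = 1 << 16; // 65536 blocks
        int blockSize = 1 << 15;  // 32768 bytes per block
        System.out.println(positionWithIntMath(blockIndex, blockSize, 5));
        System.out.println(positionWithLongMath(blockIndex, blockSize, 5));
    }
}
```

With these inputs the product is 2^31, which no longer fits in an int, so only the long-based variant yields a non-negative position.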
- Changes in Runtime Behavior (2)
- LUCENE-4586: Change default ResultMode of FacetRequest to PER_NODE_IN_TREE.
This only affects requests with depth>1. If you execute such requests and
rely on the facet results being returned flat (i.e. no hierarchy), you should
set the ResultMode to GLOBAL_FLAT.
(Shai Erera, Gilad Barkai)
- LUCENE-1822: Improves the text window selection by recalculating the starting margin
once all phrases in the fragment have been identified in FastVectorHighlighter. This
way if a single word is matched in a fragment, it will appear in the middle of the highlight,
instead of 6 characters from the beginning. This way one can also guarantee that
the entirety of a short text is represented in a fragment by specifying a large
enough fragCharSize.
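The difference between a fixed 6-character margin and a recalculated, centered margin (LUCENE-1822) can be shown with simple arithmetic. This is a sketch under stated assumptions, not the FastVectorHighlighter code: the class `FragmentWindowSketch`, its method names, and the exact centering formula are hypothetical.

```java
// Hypothetical illustration of fragment start selection; the real
// highlighter logic may differ in detail.
final class FragmentWindowSketch {
    // Old behavior as described above: the fragment starts a fixed
    // 6 characters before the match.
    static int fixedMarginStart(int matchStart) {
        return Math.max(0, matchStart - 6);
    }

    // Centered behavior: split the spare room (fragment size minus
    // match length) evenly before and after the match.
    static int centeredStart(int matchStart, int matchEnd, int fragCharSize) {
        int matchLength = matchEnd - matchStart;
        int spare = Math.max(0, fragCharSize - matchLength);
        return Math.max(0, matchStart - spare / 2);
    }
}
```

For a 5-character match at offset 50 and a fragment size of 25, the fixed margin starts the fragment at offset 44, while the centered variant starts it at offset 40 so the match sits in the middle. When fragCharSize exceeds the text length, the start is clamped to 0, which is why a large enough fragCharSize covers a short text entirely.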
- Optimizations (16)
- LUCENE-2221: oal.util.BitUtil was modified to use Long.bitCount and
Long.numberOfTrailingZeros (which are intrinsics since Java 6u18) instead of
pure java bit twiddling routines in order to improve performance on modern
JVMs/hardware.
(Dawid Weiss, Adrien Grand)
- LUCENE-4509: Enable stored fields compression by default in the Lucene 4.1
default codec.
(Adrien Grand)
- LUCENE-4536: PackedInts on-disk format is now byte-aligned (it used to be
long-aligned), saving up to 7 bytes per array of values.
(Adrien Grand, Mike McCandless)
- LUCENE-4512: Additional memory savings for CompressingStoredFieldsFormat.
(Adrien Grand, Robert Muir)
- LUCENE-4443: Lucene41PostingsFormat no longer writes unnecessary offsets
into the skipdata.
(Robert Muir)
- LUCENE-4459: Improve WeakIdentityMap.keyIterator() to remove GCed keys
from backing map early instead of waiting for reap(). This makes test
failures in TestWeakIdentityMap disappear, too.
(Uwe Schindler, Mike McCandless, Robert Muir)
- LUCENE-4473: Lucene41PostingsFormat encodes offsets more efficiently
for low frequency terms (< 128 occurrences).
(Robert Muir)
- LUCENE-4462: DocumentsWriter now flushes deletes, segment infos and builds
CFS files if necessary during segment flush and not during publishing. The latter
was a single threaded process while now all IO and CPU heavy computation is done
concurrently in DocumentsWriterPerThread.
(Simon Willnauer)
- LUCENE-4496: Optimize Lucene41PostingsFormat when requesting a subset of
the postings data (via flags to TermsEnum.docs/docsAndPositions) to use
ForUtil.skipBlock.
(Robert Muir)
- LUCENE-4497: Don't write PosVIntCount to the positions file in
Lucene41PostingsFormat, as it's always totalTermFreq % BLOCK_SIZE.
(Robert Muir)
- LUCENE-4498: In Lucene41PostingsFormat, when a term appears in only one document,
instead of writing a file pointer to a VIntBlock containing the doc id, just
write the doc id.
(Mike McCandless, Robert Muir)
- LUCENE-4515: MemoryIndex now uses Byte/IntBlockPool internally to hold terms and
posting lists. All index data is represented as consecutive byte/int arrays to
reduce GC cost and memory overhead.
(Simon Willnauer)
- LUCENE-4538: DocValues now caches direct sources in a ThreadLocal exposed via SourceCache.
Users of this API can now simply obtain an instance via DocValues#getDirectSource per thread.
(Simon Willnauer)
- LUCENE-4580: DrillDown.query variants return a ConstantScoreQuery with boost set to 0.0f
so that document scores are not affected by running a drill-down query.
(Shai Erera)
- LUCENE-4598: PayloadIterator no longer uses top-level IndexReader to iterate on the
posting's payload.
(Shai Erera, Michael McCandless)
- LUCENE-4661: Drop default maxThreadCount to 1 and maxMergeCount to 2
in ConcurrentMergeScheduler, for faster merge performance on
spinning-magnet drives
(Mike McCandless)
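To make the LUCENE-2221 entry concrete, the sketch below compares hand-written bit-twiddling loops with the JDK methods Long.bitCount and Long.numberOfTrailingZeros. The loop implementations and the class name `BitTwiddleSketch` are illustrative assumptions; they are not the routines that were removed from BitUtil.

```java
// Hypothetical pure-Java routines, shown next to the JDK methods that
// can replace them.
final class BitTwiddleSketch {
    // Population count: clears the lowest set bit on each iteration.
    static int popCountLoop(long x) {
        int count = 0;
        while (x != 0) {
            x &= x - 1;
            count++;
        }
        return count;
    }

    // Number of trailing zeros: shifts right until the lowest bit is set.
    // Returns 64 for zero, matching Long.numberOfTrailingZeros.
    static int trailingZerosLoop(long x) {
        if (x == 0) {
            return 64;
        }
        int count = 0;
        while ((x & 1L) == 0) {
            x >>>= 1;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long word = 0b1011000L;
        System.out.println(popCountLoop(word) == Long.bitCount(word));
        System.out.println(trailingZerosLoop(word) == Long.numberOfTrailingZeros(word));
    }
}
```

Both pairs compute the same values; the JDK methods simply give the JVM the opportunity to use a single hardware instruction where one is available.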
- Documentation (1)
- LUCENE-4483: Refer to BytesRef.deepCopyOf in Term's constructor that takes BytesRef.
(Paul Elschot via Robert Muir)
- Build (6)
- LUCENE-4650: Upgrade randomized testing to version 2.0.8: make the
test framework more robust under low memory conditions.
(Dawid Weiss)
- LUCENE-4603: Upgrade randomized testing to version 2.0.5: print forked
JVM PIDs on heartbeat from hung tests
(Dawid Weiss)
- Upgrade randomized testing to version 2.0.4: avoid hangs on shutdown
hooks hanging forever by calling Runtime.halt() in addition to
Runtime.exit() after a short delay to allow graceful shutdown
(Dawid Weiss)
- LUCENE-4451: Memory leak per unique thread caused by
RandomizedContext.contexts static map. Upgrade randomized testing
to version 2.0.2
(Mike McCandless, Dawid Weiss)
- LUCENE-4589: Upgraded benchmark module's Nekohtml dependency to version
1.9.17, removing the workaround in Lucene's HTML parser for the
Turkish locale.
(Uwe Schindler)
- LUCENE-4601: Fix ivy availability check to use typefound, so it works
if called from another build file.
(Ryan Ernst via Robert Muir)
- Changes in backwards compatibility policy (2)
- LUCENE-4392: Class org.apache.lucene.util.SortedVIntList has been removed.
(Adrien Grand)
- LUCENE-4393: RollingCharBuffer has been moved to the o.a.l.analysis.util
package of lucene-analysis-common.
(Adrien Grand)
- New Features (5)
- LUCENE-1888: Added the option to store payloads in the term
vectors (IndexableFieldType.storeTermVectorPayloads()). Note
that you must store term vector positions to store payloads.
(Robert Muir)
- LUCENE-3892: Add a new BlockPostingsFormat that bulk-encodes docs,
freqs and positions in large (size 128) packed-int blocks for faster
search performance. This was from Han Jiang's 2012 Google Summer of
Code project
(Han Jiang, Adrien Grand, Robert Muir, Mike McCandless)
- LUCENE-4323: Added support for an absolute maximum CFS segment size
(in MiB) to LogMergePolicy and TieredMergePolicy.
(Alexey Lef via Uwe Schindler)
- LUCENE-4339: Allow deletes against 3.x segments for easier upgrading.
Lucene3x Codec is still otherwise read-only; you should not set it
as the default Codec on IndexWriter, because it cannot write new segments.
(Mike McCandless, Robert Muir)
- SOLR-3441: ElisionFilterFactory is now MultiTermAware
(Jack Krupansky via hossman)
- API Changes (15)
- LUCENE-4391, LUCENE-4440: All methods of Lucene40Codec but
getPostingsFormatForField are now final. To reuse functionality
of Lucene40, you should extend FilterCodec and delegate to Lucene40
instead of extending Lucene40Codec.
(Adrien Grand, Shai Erera, Robert Muir, Uwe Schindler)
- LUCENE-4299: Added Terms.hasPositions() and Terms.hasOffsets().
Previously you had no real way to know that a term vector field
had positions or offsets, since this can be configured on a
per-field-per-document basis.
(Robert Muir)
- Removed DocsAndPositionsEnum.hasPayload() and simplified the
contract of getPayload(). It returns null if there is no payload,
otherwise returns the current payload. You can now call it multiple
times per position if you want.
(Robert Muir)
- Removed FieldsEnum. Fields API instead implements Iterable<String>
and exposes Iterator, so you can iterate over field names with
for (String field : fields) instead.
(Robert Muir)
- LUCENE-4152: added IndexReader.leaves(), which lets you enumerate
the leaf atomic reader contexts for all readers in the tree.
(Uwe Schindler, Robert Muir)
- LUCENE-4304: removed PayloadProcessorProvider. If you want to change
payloads (or other things) when merging indexes, it's recommended
to just use a FilterAtomicReader + IndexWriter.addIndexes. See the
OrdinalMappingAtomicReader and TaxonomyMergeUtils in the facets
module if you want an example of this.
(Mike McCandless, Uwe Schindler, Shai Erera, Robert Muir)
- LUCENE-4304: Make CompositeReader.getSequentialSubReaders()
protected. To get atomic leaves of any IndexReader use the new method
leaves() (LUCENE-4152), which lists AtomicReaderContexts including
the doc base of each leaf.
(Uwe Schindler, Robert Muir)
- LUCENE-4307: Renamed IndexReader.getTopReaderContext to
IndexReader.getContext.
(Robert Muir)
- LUCENE-4316: Deprecate Fields.getUniqueTermCount and remove it from
AtomicReader. If you really want the unique term count across all
fields, just sum up Terms.size() across those fields. This method
only exists so that this statistic can be accessed for Lucene 3.x
segments, which don't support Terms.size().
(Uwe Schindler, Robert Muir)
- LUCENE-4321: Change CharFilter to extend Reader directly, as FilterReader
overdelegates (read(), read(char[], int, int), skip, etc). This made it
hard to implement CharFilters that were correct. Instead only close() is
delegated by default: read(char[], int, int) and correct(int) are abstract
so that it's obvious which methods you should implement. The protected
inner Reader is 'input' like CharFilter in the 3.x series, instead of 'in'.
(Dawid Weiss, Uwe Schindler, Robert Muir)
- LUCENE-3309: The expert FieldSelector API, used to load only certain
fields in a stored document, has been replaced with the simpler
StoredFieldVisitor API.
(Mike McCandless)
- LUCENE-4343: Made Tokenizer.setReader final. This is a setter that should
not be overridden by subclasses: per-stream initialization should happen
in reset().
(Robert Muir)
- LUCENE-4377: Remove IndexInput.copyBytes(IndexOutput, long).
Use DataOutput.copyBytes(DataInput, long) instead.
(Mike McCandless, Robert Muir)
- LUCENE-4355: Simplify AtomicReader's sugar methods such as termDocsEnum,
termPositionsEnum, docFreq, and totalTermFreq to only take Term as a
parameter. If you want to do expert things such as pass a different
Bits as liveDocs, then use the flex apis (fields(), terms(), etc) directly.
(Mike McCandless, Robert Muir)
- LUCENE-4425: clarify documentation of StoredFieldVisitor.binaryValue
and simplify the api to binaryField(FieldInfo, byte[]).
(Adrien Grand, Robert Muir)
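The FieldsEnum removal above relies on a standard Java pattern: a class that implements Iterable<String> can be used directly in a for-each loop. The sketch below shows that pattern with a hypothetical stand-in class, `FieldNamesSketch`; it is not the Lucene Fields class and carries none of its other methods.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Fields-like class that exposes its field
// names through Iterable<String>.
final class FieldNamesSketch implements Iterable<String> {
    private final List<String> names;

    FieldNamesSketch(String... names) {
        this.names = Arrays.asList(names);
    }

    @Override
    public Iterator<String> iterator() {
        return names.iterator();
    }

    // Iterates with the for-each form mentioned in the entry above.
    static String join(Iterable<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(field);
        }
        return sb.toString();
    }
}
```

A caller can then write `FieldNamesSketch.join(new FieldNamesSketch("body", "title"))` or loop over the instance directly, without a separate enumeration class.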
- Bug Fixes (17)
- LUCENE-4423: DocumentStoredFieldVisitor.binaryField ignored offset and
length.
(Adrien Grand)
- LUCENE-4297: BooleanScorer2 would multiply the coord() factor
twice for conjunctions: for most users this is no problem, but
if you had a customized Similarity that returned something other
than 1 when overlap == maxOverlap (always the case for conjunctions),
then the score would be incorrect.
(Pascal Chollet, Robert Muir)
- LUCENE-4298: MultiFields.getTermDocsEnum(IndexReader, Bits, String, BytesRef)
did not work at all; it would recurse infinitely.
(Alberto Paro via Robert Muir)
- LUCENE-4300: BooleanQuery's rewrite was not always safe: if you
had a custom Similarity where coord(1,1) != 1F, then the rewritten
query would be scored differently.
(Robert Muir)
- Don't allow negatives in the positions file. If you have an index
from 2.4.0 or earlier with such negative positions, and you already
upgraded to 3.x, then to Lucene 4.0-ALPHA or -BETA, you should run
CheckIndex. If it fails, then you need to upgrade again to 4.0
(Robert Muir)
- LUCENE-4303: PhoneticFilterFactory and SnowballPorterFilterFactory load their
encoders / stemmers via the ResourceLoader now instead of Class.forName().
Solr users should no longer have to embed these in the Solr war.
(David Smiley)
- SOLR-3737: StempelPolishStemFilterFactory loaded its stemmer table incorrectly.
Also, ensure immutability and use only one instance of this table in RAM (lazy
loaded) since it's quite large.
(sausarkar, Steven Rowe, Robert Muir)
- LUCENE-4310: MappingCharFilter was failing to match input strings
containing non-BMP Unicode characters.
(Dawid Weiss, Robert Muir, Mike McCandless)
- LUCENE-4224: Add an in-order scorer to query-time joining; the
out-of-order scorer now throws a UOE.
(Martijn van Groningen, Robert Muir)
- LUCENE-4333: Fixed NPE in TermGroupFacetCollector when faceting on mv fields.
(Jesse MacVicar, Martijn van Groningen)
- LUCENE-4218: Document.get(String) and Field.stringValue() again return
values for numeric fields, like Lucene 3.x and consistent with the documentation.
(Jamie, Uwe Schindler, Robert Muir)
- NRTCachingDirectory was always caching a newly flushed segment in
RAM, instead of checking the estimated size of the segment
to decide whether to cache it.
(Mike McCandless)
- LUCENE-3720: fix memory-consumption issues with BeiderMorseFilter.
(Thomas Neidhart via Robert Muir)
- LUCENE-4401: Fix bug where DisjunctionSumScorer would sometimes call score()
on a subscorer that had already returned NO_MORE_DOCS.
(Liu Chao, Robert Muir)
- LUCENE-4411: when sampling is enabled for a FacetRequest, its depth
parameter is reset to the default (1), even if set otherwise.
(Gilad Barkai via Shai Erera)
- LUCENE-4455: Fix bug in SegmentInfoPerCommit.sizeInBytes() that was
returning 2X the true size, inefficiently. Also fixed bug in
CheckIndex that would report no deletions when a segment has
deletions, and vice versa.
(Uwe Schindler, Robert Muir, Mike McCandless)
- LUCENE-4456: Fixed double-counting sizeInBytes for a segment
(affects how merge policies pick merges); fixed CheckIndex's
incorrect reporting of whether a segment has deletions; fixed case
where on abort Lucene could remove files it didn't create; fixed
many cases where IndexWriter could leave leftover files (on
exception in various places, on reuse of a segment name after crash
and recovery).
(Uwe Schindler, Robert Muir, Mike McCandless)
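The LUCENE-4297 entry above notes that applying coord() twice only matters when the factor differs from 1. The arithmetic can be checked with a small sketch; the class `CoordSketch` and its methods are hypothetical and do not reproduce BooleanScorer2.

```java
// Hypothetical illustration of a coordination factor applied once
// versus twice; not the BooleanScorer2 code.
final class CoordSketch {
    // Intended scoring: the summed clause scores times the coord factor.
    static float scoreWithCoordOnce(float sumOfClauseScores, float coord) {
        return sumOfClauseScores * coord;
    }

    // Buggy scoring: the coord factor is multiplied in a second time.
    static float scoreWithCoordTwice(float sumOfClauseScores, float coord) {
        return sumOfClauseScores * coord * coord;
    }
}
```

With a coord factor of 1 both methods agree, which is why most users saw no problem. With a custom factor such as 0.5 and a clause sum of 2, the intended score is 1.0 while the doubled application gives 0.5.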
- Optimizations (4)
- LUCENE-4322: Decrease lucene-core JAR size. The core JAR size had increased a
lot because of generated code introduced in LUCENE-4161 and LUCENE-3892.
(Adrien Grand)
- LUCENE-4317: Improve reuse of internal TokenStreams and StringReader
in oal.document.Field.
(Uwe Schindler, Chris Male, Robert Muir)
- LUCENE-4327: Support out-of-order scoring in FilteredQuery for higher
performance.
(Mike McCandless, Robert Muir)
- LUCENE-4364: Optimize MMapDirectory to not make a mapping per-cfs-slice,
instead one map per .cfs file. This reduces the total number of maps.
Additionally factor out a (package-private) generic
ByteBufferIndexInput from MMapDirectory.
(Uwe Schindler, Robert Muir)
- Build (6)
- LUCENE-4406, LUCENE-4407: Upgrade to randomizedtesting 2.0.1.
Workaround for broken test output XMLs due to non-XML text unicode
chars in strings. Added printing of failed tests at the end of a
test run
(Dawid Weiss)
- LUCENE-4252: Detect/Fail tests when they leak RAM in static fields
(Robert Muir, Dawid Weiss)
- LUCENE-4360: Support running the same test suite multiple times in
parallel
(Dawid Weiss)
- LUCENE-3985: Upgrade to randomizedtesting 2.0.0. Added support for
thread leak detection. Added support for suite timeouts.
(Dawid Weiss)
- LUCENE-4354: Corrected maven dependencies to be consistent with
the licenses/ folder and the binary release. Some had different
versions or additional unnecessary dependencies.
(selckin via Robert Muir)
- LUCENE-4340: Move all non-default codec, postings format and terms
dictionary implementations to lucene/codecs.
(Adrien Grand)
- Documentation (1)
- LUCENE-4302: Fix facet userguide to have HTML loose doctype like
all other javadocs.
(Karl Nicholas via Uwe Schindler)
- New features (10)
- LUCENE-4249: Changed the explanation of the PayloadTermWeight to use the
underlying PayloadFunction's explanation as the explanation
for the payload score.
(Scott Smerchek via Robert Muir)
- LUCENE-4069: Added BloomFilteringPostingsFormat for use with low-frequency terms
such as primary keys
(Mark Harwood, Mike McCandless)
- LUCENE-4201: Added JapaneseIterationMarkCharFilter to normalize Japanese
iteration marks.
(Robert Muir, Christian Moen)
- LUCENE-3832: Added BasicAutomata.makeStringUnion method to efficiently
create automata from a fixed collection of UTF-8 encoded BytesRef
(Dawid Weiss, Robert Muir)
- LUCENE-4153: Added option to fast vector highlighting via BaseFragmentsBuilder to
respect field boundaries in the case of highlighting for multivalued fields.
(Martijn van Groningen)
- LUCENE-4227: Added DirectPostingsFormat, to hold all postings in
memory as uncompressed simple arrays. This uses a tremendous amount
of RAM but gives good search performance gains.
(Mike McCandless)
- LUCENE-2510, LUCENE-4044: Migrated Solr's Tokenizer-, TokenFilter-, and
CharFilterFactories to the lucene-analysis module. The API is still
experimental.
(Chris Male, Robert Muir, Uwe Schindler)
- LUCENE-4230: When pulling a DocsAndPositionsEnum you can now
specify whether or not you require payloads (in addition to
offsets); turning one or both off may allow some codec
implementations to optimize the enum implementation.
(Robert Muir, Mike McCandless)
- LUCENE-4203: Add IndexWriter.tryDeleteDocument(AtomicReader reader,
int docID), to attempt deletion by docID as long as the provided
reader is an NRT reader, and the segment has not yet been merged
away.
(Mike McCandless)
- LUCENE-4286: Added option to CJKBigramFilter to always also output
unigrams. This can be used for a unigram+bigram approach, or at
index-time only for better support of short queries.
(Tom Burton-West, Robert Muir)
- API Changes (12)
- LUCENE-4138: update of morfologik (Polish morphological analyzer) to 1.5.3.
The tag attribute class has been renamed to MorphosyntacticTagsAttribute and
has a different API (carries a list of tags instead of a compound tag). Upgrade
of embedded morfologik dictionaries to version 1.9.
(Dawid Weiss)
- LUCENE-4178: set 'tokenized' to true on FieldType by default, so that if you
make a custom FieldType and set indexed = true, it's analyzed by the analyzer.
(Robert Muir)
- LUCENE-4220: Removed the buggy JavaCC-based HTML parser in the benchmark
module and replaced it with NekoHTML. The HTMLParser interface was cleaned up
while changing method signatures.
(Uwe Schindler, Robert Muir)
- LUCENE-2191: Rename Tokenizer.reset(Reader) to Tokenizer.setReader(Reader).
The purpose of this method was always to set a new Reader on the Tokenizer,
reusing the object. But the name was often confused with TokenStream.reset().
(Robert Muir)
- LUCENE-4228: Refactored CharFilter to extend java.io.FilterReader. CharFilters
filter another reader and you override correct() for offset correction.
(Robert Muir)
- LUCENE-4240: Analyzer api now just takes fieldName for getOffsetGap. If the
field is not analyzed (e.g. StringField), then the analyzer is not invoked
at all. If you want to tweak things like positionIncrementGap and offsetGap,
analyze the field with KeywordTokenizer instead.
(Grant Ingersoll, Robert Muir)
- LUCENE-4250: Pass fieldName to the PayloadFunction explain method, so it
parallels with docScore and the default implementation is correct.
(Robert Muir)
- LUCENE-3747: Support Unicode 6.1.0.
(Steve Rowe)
- LUCENE-3884: Moved ElisionFilter out of org.apache.lucene.analysis.fr
package into org.apache.lucene.analysis.util.
(Robert Muir)
- LUCENE-4230: When pulling a DocsAndPositionsEnum you now pass an int
flags instead of the previous boolean needOffsets. Currently
recognized flags are DocsAndPositionsEnum.FLAG_PAYLOADS and
DocsAndPositionsEnum.FLAG_OFFSETS
(Robert Muir, Mike McCandless)
- LUCENE-4273: When pulling a DocsEnum, you can pass an int flags
instead of the previous boolean needsFreqs; consistent with the changes
for DocsAndPositionsEnum in LUCENE-4230. Currently the only flag
is DocsEnum.FLAG_FREQS.
(Robert Muir, Mike McCandless)
- LUCENE-3616: TextField(String, Reader, Store) was reduced to TextField(String, Reader),
as the Store parameter didn't make sense: if you supplied Store.YES, you would only
receive an exception anyway.
(Robert Muir)
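The move from boolean parameters to an int flags argument (LUCENE-4230, LUCENE-4273) follows the usual bit-flag pattern. The sketch below shows that pattern in isolation. The constant names mirror those mentioned above, but their numeric values and the class `PostingsFlagsSketch` are assumptions for this example, not the values defined in Lucene.

```java
// Hypothetical bit-flag sketch; the numeric values are assumed.
final class PostingsFlagsSketch {
    static final int FLAG_OFFSETS = 0x1;  // assumed value
    static final int FLAG_PAYLOADS = 0x2; // assumed value

    // True if the caller asked for offsets.
    static boolean wantsOffsets(int flags) {
        return (flags & FLAG_OFFSETS) != 0;
    }

    // True if the caller asked for payloads.
    static boolean wantsPayloads(int flags) {
        return (flags & FLAG_PAYLOADS) != 0;
    }
}
```

A caller combines flags with bitwise OR, for example `FLAG_OFFSETS | FLAG_PAYLOADS`, and passes 0 to request neither. New flags can be added later without changing method signatures, which a growing list of boolean parameters would not allow.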
- Optimizations (5)
- LUCENE-4171: Performance improvements to Packed64.
(Toke Eskildsen via Adrien Grand)
- LUCENE-4184: Performance improvements to the aligned packed bits impl.
(Toke Eskildsen, Adrien Grand)
- LUCENE-4235: Remove enforcing of Filter rewrite for NRQ queries.
(Uwe Schindler)
- LUCENE-4279: Regenerated snowball Stemmers from snowball r554,
making them substantially more lightweight. Behavior is unchanged.
(Robert Muir)
- LUCENE-4291: Reduced internal buffer size for JFlex-based tokenizers
such as StandardTokenizer from 32kb to 8kb.
(Raintung Li, Steven Rowe, Robert Muir)
- Bug Fixes (13)
- LUCENE-4109: BooleanQueries are not parsed correctly with the
flexible query parser.
(Karsten Rauch via Robert Muir)
- LUCENE-4176: Fix AnalyzingQueryParser to analyze range endpoints as bytes,
so that it works correctly with Analyzers that produce binary non-UTF-8 terms
such as CollationAnalyzer.
(Nattapong Sirilappanich via Robert Muir)
- LUCENE-4209: Fix FSTCompletionLookup to close its sorter, so that it won't
leave temp files behind in /tmp. Fix SortedTermFreqIteratorWrapper to not
leave temp files behind in /tmp on Windows. Fix Sort to not leave
temp files behind when /tmp is a separate volume.
(Uwe Schindler, Robert Muir)
- LUCENE-4221: Fix overeager CheckIndex validation for term vector offsets.
(Robert Muir)
- LUCENE-4222: TieredMergePolicy.getFloorSegmentMB was returning the
size in bytes not MB
(Chris Fuller via Mike McCandless)
- LUCENE-3505: Fix bug (Lucene 4.0alpha only) where boolean conjunctions
were sometimes scored incorrectly. Conjunctions of only term queries where
at least one term omitted term frequencies (IndexOptions.DOCS_ONLY) would
be scored as if all terms omitted term frequencies.
(Robert Muir)
- LUCENE-2686, LUCENE-3505: Fixed BooleanQuery scorers to return correct
freq(). Added support for scorer navigation API (Scorer.getChildren) to
all queries. Made Scorer.freq() abstract.
(Koji Sekiguchi, Mike McCandless, Robert Muir)
- LUCENE-4234: Exception when FacetsCollector is used with ScoreFacetRequest,
and the number of matching documents is too large.
(Gilad Barkai via Shai Erera)
- LUCENE-4245: Make IndexWriter#close() and MergeScheduler#close()
non-interruptible.
(Mark Miller, Uwe Schindler)
- LUCENE-4190: restrict allowed filenames that a codec may create to
the patterns recognized by IndexFileNames. This also fixes
IndexWriter to only delete files matching this pattern from an index
directory, to reduce risk when the wrong index path is accidentally
passed to IndexWriter
(Robert Muir, Mike McCandless)
- LUCENE-4277: Fix IndexWriter deadlock during rollback if flushable DWPT
instances are already checked out and queued up but not yet flushed.
(Simon Willnauer)
- LUCENE-4282: Automaton FuzzyQuery didn't always deliver all results.
(Johannes Christen, Uwe Schindler, Robert Muir)
- LUCENE-4289: Fix minor idf inconsistencies/inefficiencies in highlighter.
(Robert Muir)
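The unit mix-up described in LUCENE-4222 is easy to reproduce when a value is set in MB but stored in bytes. The sketch below is a hypothetical class, `FloorSegmentSketch`, written under the assumption that the size is held internally in bytes; it is not the TieredMergePolicy source.

```java
// Hypothetical illustration of an MB/bytes getter bug and its fix.
final class FloorSegmentSketch {
    private long floorSegmentBytes; // assumed internal representation

    // Accepts a size in MB and stores it in bytes.
    void setFloorSegmentMB(double mb) {
        floorSegmentBytes = (long) (mb * 1024 * 1024);
    }

    // Buggy getter: returns the stored byte count as if it were MB.
    double getFloorSegmentMBBuggy() {
        return floorSegmentBytes;
    }

    // Fixed getter: converts bytes back to MB.
    double getFloorSegmentMB() {
        return floorSegmentBytes / (1024.0 * 1024.0);
    }
}
```

After `setFloorSegmentMB(2.0)`, the fixed getter returns 2.0 while the buggy one returns 2097152.0, so a round trip through the setter and getter only preserves the value with the conversion in place.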
- Changes in Runtime Behavior (2)
- LUCENE-4109: Enable position increments in the flexible queryparser by default.
(Karsten Rauch via Robert Muir)
- LUCENE-3616: Field throws exception if you try to set a boost on an
unindexed field or one that omits norms.
(Robert Muir)
- Build (7)
- LUCENE-4094: Support overriding file.encoding on forked test JVMs
(force via -Drandomized.file.encoding=XXX).
(Dawid Weiss)
- LUCENE-4189: Test output should include timestamps (start/end for each
test/ suite). Added -Dtests.timestamps=[off by default].
(Dawid Weiss)
- LUCENE-4110: Report long periods of forked jvm inactivity (hung tests/ suites).
Added -Dtests.heartbeat=[seconds] with the default of 60 seconds.
(Dawid Weiss)
- LUCENE-4160: Added a property to quit the tests after a given
number of failures has occurred. This is useful in combination
with -Dtests.iters=N (you can start N iterations and wait for M
failures, in particular M = 1). -Dtests.maxfailures=M. Alternatively,
specify -Dtests.failfast=true to skip all tests after the first failure.
(Dawid Weiss)
- LUCENE-4115: JAR resolution/ cleanup should be done automatically for ant
clean/ eclipse/ resolve
(Dawid Weiss)
- LUCENE-4199, LUCENE-4202, LUCENE-4206: Add a new target "check-forbidden-apis"
that parses all generated .class files for use of APIs that use the default
charset, default locale, or default timezone and fails the build if violations
are found. This ensures that Lucene / Solr is independent of local configuration
options.
(Uwe Schindler, Robert Muir, Dawid Weiss)
- LUCENE-4217: Add the possibility to run tests with Atlassian Clover
loaded from IVY. A development License solely for Apache code was added in
the tools/ folder, but is not included in releases.
(Uwe Schindler)
- Documentation (1)
- LUCENE-4195: Added package documentation and examples for
org.apache.lucene.codecs
(Alan Woodward via Robert Muir)
- More information about this release, including any errata related to the
release notes, upgrade instructions, or other changes may be found online at:
https://wiki.apache.org/lucene-java/Lucene4.0
- For "contrib" changes prior to 4.0, please see:
http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_6_0/lucene/contrib/CHANGES.txt
- Changes in backwards compatibility policy (42)
- LUCENE-1458, LUCENE-2111, LUCENE-2354: Changes from flexible indexing:
-
On upgrading to 4.0, if you do not fully reindex your documents,
Lucene will emulate the new flex API on top of the old index,
incurring some performance cost (up to ~10% slowdown, typically).
To prevent this slowdown, use oal.index.IndexUpgrader
to upgrade your indexes to latest file format (LUCENE-3082).
-
Mixed flex/pre-flex indexes are perfectly fine -- the two
emulation layers (flex API on pre-flex index, and pre-flex API on
flex index) will remap the access as required. So on upgrading to
4.0 you can start indexing new documents into an existing index.
To get optimal performance, use oal.index.IndexUpgrader
to upgrade your indexes to latest file format (LUCENE-3082).
-
The postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum)
have been removed in favor of the new flexible
indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum,
DocsEnum, DocsAndPositionsEnum). One big difference is that field
and terms are now enumerated separately: a TermsEnum provides a
BytesRef (wraps a byte[]) per term within a single field, not a
Term. Another is that when asking for a Docs/AndPositionsEnum, you
now specify the skipDocs explicitly (typically this will be the
deleted docs, but in general you can provide any Bits).
-
The term vectors APIs (TermFreqVector, TermPositionVector,
TermVectorMapper) have been removed in favor of the above
flexible indexing APIs, presenting a single-document inverted
index of the document from the term vectors.
-
MultiReader ctor now throws IOException
-
Directory.copy/Directory.copyTo now copies all files (not just
index files), since what is and isn't an index file is now
dependent on the codecs used.
-
UnicodeUtil now uses BytesRef for UTF-8 output, and some method
signatures have changed to CharSequence. These are internal APIs
and subject to change suddenly.
-
Positional queries (PhraseQuery, *SpanQuery) will now throw an
exception if you use them on a field that omits positions during
indexing (previously they silently returned no results).
-
FieldCache.{Byte,Short,Int,Long,Float,Double}Parser's API has
changed -- each parse method now takes a BytesRef instead of a
String. If you have an existing Parser, a simple way to fix it is to
invoke BytesRef.utf8ToString, and pass that String to your
existing parser. This will work, but performance would be better
if you could fix your parser to instead operate directly on the
byte[] in the BytesRef.
-
The internal (experimental) API of NumericUtils changed completely
from String to BytesRef. Client code should never use this class,
so the change would normally not affect you. If you used some of
the methods to inspect terms or create TermQueries out of
prefix encoded terms, change to use BytesRef. Please note:
Do not use TermQueries to search for single numeric terms.
The recommended way is to create a corresponding NumericRangeQuery
with upper and lower bound equal and included. TermQueries do not
score correctly, so the constant score mode of NRQ is the only
correct way to handle single value queries.
-
NumericTokenStream now works directly on byte[] terms. If you
plug a TokenFilter on top of this stream, you will likely get
an IllegalArgumentException, because the NTS does not support
TermAttribute/CharTermAttribute. If you want to further filter
or attach Payloads to NTS, use the new NumericTermAttribute.
(Mike McCandless, Robert Muir, Uwe Schindler, Mark Miller, Michael Busch)
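The FieldCache parser migration described above can be sketched without Lucene on the classpath. Utf8Slice below is a hypothetical stand-in for a BytesRef-like slice (a byte[] plus offset and length) and IntParsers is likewise made up for illustration; neither is a Lucene class. The first method shows the simple fix (decode to a String and reuse the old logic), the second shows the faster approach of reading the bytes directly.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a BytesRef-like slice; this is not the Lucene class.
final class Utf8Slice {
  final byte[] bytes;
  final int offset;
  final int length;

  Utf8Slice(byte[] bytes, int offset, int length) {
    this.bytes = bytes;
    this.offset = offset;
    this.length = length;
  }

  // Convenience factory: encode a String as UTF-8 and wrap the whole array.
  static Utf8Slice of(String s) {
    byte[] b = s.getBytes(StandardCharsets.UTF_8);
    return new Utf8Slice(b, 0, b.length);
  }

  // Decode only the referenced slice as UTF-8.
  String utf8ToString() {
    return new String(bytes, offset, length, StandardCharsets.UTF_8);
  }
}

// Hypothetical parsers illustrating the two migration options.
final class IntParsers {
  // Simple migration: decode to a String and reuse String-based parsing.
  static int parseViaString(Utf8Slice term) {
    return Integer.parseInt(term.utf8ToString());
  }

  // Faster variant: read ASCII decimal digits straight from the byte[] slice,
  // avoiding the intermediate String. Accepts an optional leading '-'.
  // This sketch does not check for overflow.
  static int parseDirect(Utf8Slice term) {
    if (term.length == 0) {
      throw new NumberFormatException("empty term");
    }
    int i = term.offset;
    int end = term.offset + term.length;
    boolean negative = term.bytes[i] == '-';
    if (negative) {
      i++;
    }
    if (i == end) {
      throw new NumberFormatException("no digits");
    }
    int value = 0;
    for (; i < end; i++) {
      int digit = term.bytes[i] - '0';
      if (digit < 0 || digit > 9) {
        throw new NumberFormatException("not a digit");
      }
      value = value * 10 + digit;
    }
    return negative ? -value : value;
  }
}
```

Both variants return the same value for well-formed input; the direct one simply skips the String allocation on every call.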
- LUCENE-2858, LUCENE-3733: IndexReader was refactored into abstract
AtomicReader, CompositeReader, and DirectoryReader. To open Directory-based
indexes use DirectoryReader.open(); the corresponding method in
IndexReader is now deprecated for easier migration. Only DirectoryReader
supports commits, versions, and reopening with openIfChanged(). Terms,
postings, docvalues, and norms can from now on only be retrieved using
AtomicReader; DirectoryReader and MultiReader extend CompositeReader,
only offering stored fields and access to the sub-readers (which may be
composite or atomic). SlowCompositeReaderWrapper (LUCENE-2597) can be
used to emulate atomic readers on top of composites.
Please review MIGRATE.txt for information on how to migrate old code.
(Uwe Schindler, Robert Muir, Mike McCandless)
- LUCENE-2265: FuzzyQuery and WildcardQuery now operate on Unicode codepoints,
not Unicode code units. For example, a Wildcard "?" represents any Unicode
character. Furthermore, the rest of the automaton package and RegexpQuery use
true Unicode codepoint representation.
(Robert Muir, Mike McCandless)
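To see why the distinction matters, consider a character outside the Basic Multilingual Plane, which Java stores as two UTF-16 code units but which is a single code point. The sketch below is plain JDK code with hypothetical names, not Lucene's automaton implementation; it only shows what a single-character wildcard has to consume under the code point interpretation.

```java
// Hypothetical illustration, not Lucene code.
final class CodePointDemo {
  // Number of UTF-16 code units (what String.length() reports).
  static int codeUnits(String s) {
    return s.length();
  }

  // Number of Unicode code points.
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  // True if text equals prefix + exactly one code point + suffix, i.e. the
  // pattern prefix?suffix with '?' matching one whole code point.
  static boolean matchesOneCodePoint(String prefix, String suffix, String text) {
    if (!text.startsWith(prefix) || !text.endsWith(suffix)) {
      return false;
    }
    int middleStart = prefix.length();
    int middleEnd = text.length() - suffix.length();
    if (middleEnd <= middleStart) {
      return false;
    }
    return text.codePointCount(middleStart, middleEnd) == 1;
  }
}
```

With U+1D11E (written in Java as "\uD834\uDD1E"), codeUnits reports 2 while codePoints reports 1, so a pattern such as a?b matches "a" + U+1D11E + "b" only when "?" is interpreted per code point.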
- LUCENE-2380: The String-based FieldCache methods (getStrings,
getStringIndex) have been replaced with BytesRef-based equivalents
(getTerms, getTermsIndex). Also, the sort values (returned in
FieldDoc.fields) when sorting by SortField.STRING or
SortField.STRING_VAL are now BytesRef instances. See MIGRATE.txt
for more details.
(yonik, Mike McCandless)
- LUCENE-2480: Though not a change in backwards compatibility policy, pre-3.0
indexes are no longer supported. You should upgrade to 3.x first, then run
optimize(), or reindex.
(Shai Erera, Earwin Burrfoot)
- LUCENE-2484: Removed deprecated TermAttribute. Use CharTermAttribute
and TermToBytesRefAttribute instead.
(Uwe Schindler)
- LUCENE-2600: Remove IndexReader.isDeleted in favor of
AtomicReader.getDeletedDocs().
(Mike McCandless)
- LUCENE-2667: FuzzyQuery's defaults have changed for more performant
behavior: the minimum similarity is 2 edit distances from the word,
and the priority queue size is 50. To support this, FuzzyQuery now allows
specifying unscaled edit distances (foobar~2). If your application depends
upon the old defaults of 0.5 (scaled) minimum similarity and Integer.MAX_VALUE
priority queue size, you can use FuzzyQuery(Term, float, int, int) to specify
those explicitly.
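For readers unfamiliar with unscaled edit distances, the sketch below computes a plain Levenshtein distance and checks it against a maximum, which is roughly what foobar~2 expresses. This is an assumption-laden illustration with hypothetical names: it works on UTF-16 chars, counts only insertions, deletions, and substitutions, and is not how FuzzyQuery is implemented internally.

```java
// Hypothetical illustration, not Lucene code.
final class EditDistance {
  // Classic dynamic-programming Levenshtein distance over UTF-16 chars:
  // insertions, deletions, and substitutions each cost 1.
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    // Distance from the empty prefix of a to each prefix of b.
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
        curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
      }
      // Reuse the two rows instead of allocating a full matrix.
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // True if candidate is within maxEdits edits of term.
  static boolean withinEdits(String term, String candidate, int maxEdits) {
    return levenshtein(term, candidate) <= maxEdits;
  }
}
```

Under this measure "foxbaz" is within 2 edits of "foobar", while "sitting" is 3 edits from "kitten" and would fall outside a maximum of 2.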
- LUCENE-2674: MultiTermQuery.TermCollector.collect now accepts the
TermsEnum as well.
(Robert Muir, Mike McCandless)
- LUCENE-588: WildcardQuery and QueryParser now allow escaping with
the '\' character. Previously this was impossible (you could not escape */?,
for example). If your code somehow depends on the old behavior, you will
need to change it (e.g. using "\\" to escape '\' itself).
(Sunil Kamath, Terry Yang via Robert Muir)
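A minimal sketch of what escaping a literal term for a wildcard pattern might look like follows. The helper is hypothetical and not part of Lucene; it covers only the two wildcard metacharacters and the backslash itself, and QueryParser syntax has additional special characters that it does not handle.

```java
// Hypothetical helper, not part of Lucene.
final class WildcardEscaper {
  // Prefix each wildcard metacharacter ('*', '?') and the escape character
  // itself ('\') with a backslash so it is treated as a literal.
  static String escape(String literal) {
    StringBuilder sb = new StringBuilder(literal.length() + 4);
    for (int i = 0; i < literal.length(); i++) {
      char c = literal.charAt(i);
      if (c == '*' || c == '?' || c == '\\') {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }
}
```

Note that in Java source a single backslash in a string literal is itself written as "\\", so the escaped form of a lone backslash appears as "\\\\" in code.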
- LUCENE-2837: Collapsed Searcher, Searchable into IndexSearcher;
removed contrib/remote and MultiSearcher; absorbed
ParallelMultiSearcher into IndexSearcher as an optional
ExecutorService passed to its ctor.
(Mike McCandless)
- LUCENE-2908, LUCENE-4037: Removed serialization code from lucene classes.
It is recommended that you serialize user search needs at a higher level
in your application.
(Robert Muir, Benson Margulies)
- LUCENE-2831: Changed Weight#scorer, Weight#explain & Filter#getDocIdSet to
operate on an AtomicReaderContext instead of directly on IndexReader to enable
searches to be aware of IndexSearcher's context.
(Simon Willnauer)
- LUCENE-2839: Scorer#score(Collector,int,int) is now public because it is
called from other classes and part of public API.
(Uwe Schindler)
- LUCENE-2865: Weight#scorer(AtomicReaderContext, boolean, boolean) now accepts
a ScorerContext struct instead of booleans.
(Simon Willnauer)
- LUCENE-2882: Cut over SpanQuery#getSpans to AtomicReaderContext to enforce
per segment semantics on SpanQuery & Spans.
(Simon Willnauer)
- LUCENE-2236: Similarity can now be configured on a per-field basis. See the
migration notes in MIGRATE.txt for more details.
(Robert Muir, Doron Cohen)
- LUCENE-2315: AttributeSource's methods for accessing attributes are now final,
else it's easy to corrupt the internal states.
(Uwe Schindler)
- LUCENE-2814: The IndexWriter.flush method no longer takes "boolean
flushDocStores" argument, as we now always flush doc stores (index
files holding stored fields and term vectors) while flushing a
segment.
(Mike McCandless)
- LUCENE-2548: Field names (eg in Term, FieldInfo) are no longer
interned.
(Mike McCandless)
- LUCENE-2883: The contents of o.a.l.search.function have been consolidated into
the queries module and can be found at o.a.l.queries.function. See
MIGRATE.txt for more information.
(Chris Male)
- LUCENE-2392, LUCENE-3299: Decoupled vector space scoring from
Query/Weight/Scorer. If you extended Similarity directly before, you should
extend TFIDFSimilarity instead. Similarity is now a lower-level API to
implement other scoring algorithms. See MIGRATE.txt for more details.
(David Nemeskey, Simon Willnauer, Mike McCandless, Robert Muir)
- LUCENE-3330: The expert visitor API in Scorer has been simplified and
extended to support arbitrary relationships. To navigate to a scorer's
children, call Scorer.getChildren().
(Robert Muir)
- LUCENE-2308: Field is now instantiated with an instance of IndexableFieldType,
of which there is a core implementation FieldType. Most properties
describing a Field have been moved to IndexableFieldType. See MIGRATE.txt
for more details.
(Nikola Tankovic, Mike McCandless, Chris Male)
- LUCENE-3396: ReusableAnalyzerBase.TokenStreamComponents.reset(Reader) now
returns void instead of boolean. If a Component cannot be reset, it should
throw an Exception.
(Chris Male)
- LUCENE-3396: ReusableAnalyzerBase has been renamed to Analyzer. All Analyzer
implementations must now use Analyzer.TokenStreamComponents, rather than
overriding .tokenStream() and .reusableTokenStream() (which are now final).
(Chris Male)
- LUCENE-3346: Analyzer.reusableTokenStream() has been renamed to tokenStream()
with the old tokenStream() method removed. Consequently it is now mandatory
for all Analyzers to support reusability.
(Chris Male)
- LUCENE-3473: AtomicReader.getUniqueTermCount() no longer throws UOE when
it cannot be easily determined. Instead, it returns -1 to be consistent with
this behavior across other index statistics.
(Robert Muir)
- LUCENE-1536: The abstract FilteredDocIdSet.match() method is no longer
allowed to throw IOException. This change was required to make it conform
to the Bits interface. This method should never do I/O for performance reasons.
(Mike McCandless, Uwe Schindler, Robert Muir, Chris Male, Yonik Seeley,
Jason Rutherglen, Paul Elschot)
- LUCENE-3559: The methods "docFreq" and "maxDoc" on IndexSearcher were removed,
as these are no longer used by the scoring system. See MIGRATE.txt for more
details.
(Robert Muir)
- LUCENE-3533: Removed SpanFilters, they created large lists of objects and
did not scale.
(Robert Muir)
- LUCENE-3606: IndexReader and subclasses were made read-only. It is no longer
possible to delete or undelete documents using IndexReader; you have to use
IndexWriter now. As deleting by internal Lucene docID is no longer possible,
this requires adding a unique identifier field to your index. Deleting/
relying upon Lucene docIDs is not recommended anyway, because they can
change. Consequently commit() was removed and DirectoryReader.open(),
openIfChanged() no longer take readOnly booleans or IndexDeletionPolicy
instances. Furthermore, IndexReader.setNorm() was removed. If you need
customized norm values, the recommended way to do this is by modifying
Similarity to use an external byte[] or one of the new DocValues
fields (LUCENE-3108). Alternatively, to dynamically change norms (boost
*and* length norm) at query time, wrap your AtomicReader using
FilterAtomicReader, overriding FilterAtomicReader.norms(). To persist the
changes on disk, copy the FilterAtomicReader to a new index using
IndexWriter.addIndexes().
(Uwe Schindler, Robert Muir)
- LUCENE-3640: Removed IndexSearcher.close(). Because IndexSearcher no longer
takes a Directory and no longer "manages" IndexReaders, it was a no-op.
(Robert Muir)
- LUCENE-3684: Add offsets into DocsAndPositionsEnum, and a new
FieldInfo.IndexOption: DOCS_AND_POSITIONS_AND_OFFSETS.
(Robert Muir, Mike McCandless)
- LUCENE-2858, LUCENE-3770: FilterIndexReader was renamed to
FilterAtomicReader and now extends AtomicReader. If you want to filter
composite readers like DirectoryReader or MultiReader, filter their
atomic leaves and build a new CompositeReader (e.g. MultiReader) around
them.
(Uwe Schindler, Robert Muir)
- LUCENE-3736: ParallelReader was split into ParallelAtomicReader
and ParallelCompositeReader. Lucene 3.x's ParallelReader is now
ParallelAtomicReader; but the new composite variant has improved performance
as it works on the atomic subreaders. It requires that all parallel
composite readers have the same subreader structure. If you cannot provide this,
you can use SlowCompositeReaderWrapper to make all parallel readers atomic
and use ParallelAtomicReader.
(Uwe Schindler, Mike McCandless, Robert Muir)