Lucene™ Core News
6 May 2013 - Lucene Core 4.3.0 Available
The Lucene PMC is pleased to announce the release of Apache Lucene 4.3
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release contains a handful of bug fixes and optimizations, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html
See the CHANGES.txt file included with the release for a full list of details.
Lucene 4.3.0 Release Highlights:
- Significant performance improvements for minShouldMatch BooleanQuery due to skipping, resulting in queries that are up to 4000% faster.
- A new SortingAtomicReader, which allows sorting an index based on a sort criterion (e.g. a numeric DocValues field), as well as SortingMergePolicy, which sorts documents before segments are merged.
- DocIdSetIterator and Scorer now have a cost API that provides an upper bound on the number of documents the iterator might match. This API allows optimizing query execution and how filters are applied.
- AnalyzingSuggester and FuzzySuggester now allow recording an arbitrary byte[] as a payload. The suggesters also use an ending offset to determine whether the last token was finished or not, so that, for example, a query "i " will no longer suggest "Isla de Muerta".
- The Lucene Spatial Module can now search for indexed shapes by Within, Contains, and Disjoint relationships, in addition to the typical Intersects.
- PostingsHighlighter now allows custom passage scores and per-field BreakIterators, and has been detached from TopDocs. Additionally, subclasses can override where the string values for highlighting are pulled from, as an alternative to stored fields.
- New SearcherTaxonomyManager manages near-real-time reopens of both IndexSearcher and TaxonomyReader (for faceting).
- Added a new facet method to the facet module to compute facet counts using SortedSetDocValuesField, without a separate taxonomy index.
- The DrillSideways class, for computing sideways facet counts, is now more flexible: it allows more than one FacetRequest per dimension and allows drilling down on dimensions that do not have a facet request.
- Various bugfixes and optimizations since the 4.2.1 release.
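As a rough illustration of how a cost estimate such as the one mentioned above can be used, the sketch below intersects several doc-id iterators and lets the cheapest one drive the loop. The CostSketch and ArrayIterator types are hypothetical stand-ins written for this example; they are not Lucene classes, and Lucene's own conjunction logic may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: these types are hypothetical stand-ins, not Lucene classes.
final class CostSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Iterates over a sorted array of non-negative doc ids. */
    static final class ArrayIterator {
        private final int[] docs;
        private int pos = 0;

        ArrayIterator(int... sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Upper bound on the number of documents this iterator can match. */
        long cost() {
            return docs.length;
        }

        /** Positions on the first doc id >= target and returns it, or NO_MORE_DOCS. */
        int advance(int target) {
            while (pos < docs.length && docs[pos] < target) {
                pos++;
            }
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /** Intersects the iterators, driving the loop with the lowest-cost one. */
    static List<Integer> intersect(List<ArrayIterator> iterators) {
        List<Integer> hits = new ArrayList<>();
        if (iterators.isEmpty()) {
            return hits;
        }
        List<ArrayIterator> byCost = new ArrayList<>(iterators);
        byCost.sort(Comparator.comparingLong(ArrayIterator::cost));
        ArrayIterator lead = byCost.get(0);
        int doc = lead.advance(0);
        while (doc != NO_MORE_DOCS) {
            boolean inAll = true;
            for (int i = 1; i < byCost.size() && inAll; i++) {
                inAll = byCost.get(i).advance(doc) == doc;
            }
            if (inAll) {
                hits.add(doc);
            }
            doc = lead.advance(doc + 1);
        }
        return hits;
    }
}
```

Leading with the lowest-cost iterator bounds the number of outer-loop steps by the smallest estimate, which is the kind of decision an upper-bound cost makes possible.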
3 April 2013 - Lucene Core 4.2.1 Available
The Lucene PMC is pleased to announce the release of Apache Lucene 4.2.1
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release contains a handful of bug fixes and optimizations, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html
See the CHANGES.txt file included with the release for a full list of details.
Lucene 4.2.1 Release Highlights:
- Lucene 4.2.1 includes 9 bug fixes and 3 optimizations, including a fix for a serious bug that could result in the loss of an index.
11 March 2013 - Lucene Core 4.2.0 Available
The Lucene PMC is pleased to announce the release of Apache Lucene 4.2
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html
See the CHANGES.txt file included with the release for a full list of details.
Lucene 4.2 Release Highlights:
- Lucene 4.2 has a new default codec (Lucene42Codec) with a more efficient docvalues format (sorted bytes in FST, less addressing overhead, improved numeric compression) and smaller term vectors (LZ4-compressed terms dictionaries and payloads, delta-encoded positions and offsets using blocks of packed integers).
- The external and codec APIs for doc values, and their implementations, have been simplified: the codec is no longer responsible for buffering doc values; the numerous types have been consolidated down to only three (NUMERIC, BINARY, SORTED); PerFieldDocValuesFormat lets you set a different format for each field; and the doc values and FieldCache APIs were unified.
- Significant refactoring and performance enhancements to the facet module, resulting in an overall ~3.8X speedup in one case (single Date field faceting).
- DrillDownQuery in the facet module now supports multi-select.
- A new DrillSideways class enables computing facet labels and counts for both hits and near-misses in a single query. See http://blog.mikemccandless.com/2013/02/drill-sideways-faceting-with-lucene.html
- An additional docvalues type (SORTED_SET) was added that supports multiple values.
- FSTs are a bit smaller, and the FST package supports FSTs over 2GB in size.
- A new LiveFieldValues class lets you get live or real-time values for any indexed doc / field. See http://blog.mikemccandless.com/2013/01/getting-real-time-field-values-in-lucene.html
- Added a new classification module.
- Various bugfixes and optimizations since the 4.1 release.
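To make the idea of delta-encoded positions stored in packed blocks more concrete, here is a minimal sketch. The DeltaSketch class and its methods are hypothetical names for this example and do not reflect the actual Lucene42Codec file format; the sketch only shows why storing gaps instead of absolute positions lets a fixed-width block use fewer bits per value.

```java
// Illustrative only: a hypothetical helper, not the Lucene42Codec implementation.
final class DeltaSketch {
    /** Replaces each sorted, non-negative position with its distance from the previous one. */
    static int[] encode(int[] sortedPositions) {
        int[] deltas = new int[sortedPositions.length];
        int previous = 0;
        for (int i = 0; i < sortedPositions.length; i++) {
            deltas[i] = sortedPositions[i] - previous;
            previous = sortedPositions[i];
        }
        return deltas;
    }

    /** Rebuilds the absolute positions with a running sum over the deltas. */
    static int[] decode(int[] deltas) {
        int[] positions = new int[deltas.length];
        int sum = 0;
        for (int i = 0; i < deltas.length; i++) {
            sum += deltas[i];
            positions[i] = sum;
        }
        return positions;
    }

    /** Bits needed per value if the whole block is packed at one fixed width. */
    static int bitsPerValue(int[] nonNegativeValues) {
        int max = 0;
        for (int value : nonNegativeValues) {
            max = Math.max(max, value);
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(max));
    }
}
```

For the positions 3, 7, 8, 20 the deltas are 3, 4, 1, 12, so a packed block needs 4 bits per value instead of 5.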
22 January 2013 - Lucene Core 4.1.0 Available
The Lucene PMC is pleased to announce the release of Apache Lucene 4.1
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html
See the CHANGES.txt file included with the release for a full list of details.
Lucene 4.1 Release Highlights:
- Lucene 4.1 has a new default codec (Lucene41Codec) based on the previously-experimental "Block" indexing format for improved performance, but also incorporating the functionality of "Appending" and "Pulsing".
- The default codec incorporates the optimization of Pulsing: terms that appear in only one document (such as primary key/id fields) just store the document id in the term dictionary instead of a pointer to this document id in a separate file.
- The default codec incorporates an efficient compressed stored fields implementation that compresses chunks of documents together with LZ4. (see http://blog.jpountz.net/post/33247161884/efficient-compressed-stored-fields-with-lucene)
- Lucene no longer seeks when writing files (all fields are written in an append-only way). This means it works by default with append-only streams, HDFS, etc.
- New suggest implementations: AnalyzingSuggester, where the underlying form (computed from a Lucene Analyzer) used for suggestions is separate from the returned text (see http://blog.mikemccandless.com/2012/09/lucenes-new-analyzing-suggester.html), and FuzzySuggester, which additionally allows for inexact matching on the input.
- Near-realtime support was added to the facet module. (see http://shaierera.blogspot.com/2012/11/lucene-facets-part-1.html)
- New Highlighter (postingshighlighter) added to the highlighter module. (see http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html)
- Added FilterStrategy to FilteredQuery for more flexibility in filtered query execution.
- Added CommonTermsQuery to speed up queries with very high-frequency terms. Term frequencies are efficiently detected at query time; no index-time preparation is required.
- Several bugfixes and optimizations since the 4.0 release.
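The Pulsing optimization described above can be pictured with a small in-memory model. In the sketch below, which uses hypothetical names (PulsingSketch, Entry) and is not the Lucene41Codec implementation, a term that occurs in a single document keeps its doc id directly in the dictionary entry, while other terms store an offset into a shared postings list that stands in for the separate file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a hypothetical in-memory model, not the Lucene41Codec format.
final class PulsingSketch {
    /** Dictionary entry: a doc id stored inline, or an offset and length into the postings list. */
    static final class Entry {
        final boolean inline;
        final int value;  // the doc id when inline, otherwise the start offset in postings
        final int length; // number of doc ids for the term

        Entry(boolean inline, int value, int length) {
            this.inline = inline;
            this.value = value;
            this.length = length;
        }
    }

    final Map<String, Entry> dictionary = new TreeMap<>();
    final List<Integer> postings = new ArrayList<>(); // stands in for the separate postings file

    /** Adds a term; single-document terms are inlined into the dictionary. */
    void addTerm(String term, int... docIds) {
        if (docIds.length == 1) {
            dictionary.put(term, new Entry(true, docIds[0], 1));
        } else {
            dictionary.put(term, new Entry(false, postings.size(), docIds.length));
            for (int docId : docIds) {
                postings.add(docId);
            }
        }
    }

    /** Returns the doc ids for a term; inlined terms never touch the postings list. */
    List<Integer> docs(String term) {
        List<Integer> result = new ArrayList<>();
        Entry entry = dictionary.get(term);
        if (entry == null) {
            return result;
        }
        if (entry.inline) {
            result.add(entry.value);
        } else {
            result.addAll(postings.subList(entry.value, entry.value + entry.length));
        }
        return result;
    }
}
```

A primary-key lookup in this model is resolved from the dictionary entry alone, which is the saving the release note describes.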
25 December 2012 - Lucene Core 3.6.2 Available
The Lucene PMC is pleased to announce the release of Apache Lucene 3.6.2.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release is a bug fix release for version 3.6.1. It contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-3x-redir.html.
See the CHANGES.txt file included with the release for a full list of details.
Lucene 3.6.2 Release Highlights:
- Fixed ArrayIndexOutOfBoundsException when the in-memory terms index requires more than 2.1GB of RAM (billions of terms).
- Fixed a bug in contrib/queryparser's parsing of boolean queries.
- Fixed BooleanScorer2 to return the correct freq() when using the scorer visitor API.
- Fixed IndexWriter RAM accounting bug that would cause it to flush too early when using many different field names.
- Several other minor bugfixes: scoring bugs when using a custom coord(), a rare IndexWriter thread-safety issue, and fixes to the faceting and highlighting modules.
12 October 2012 - Lucene Core 4.0 Available
The Lucene PMC is pleased to announce the release of Apache Lucene 4.0
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html
See the CHANGES.txt file included with the release for a full list of details.
Noteworthy changes since 4.0-BETA:
- A new "Block" PostingsFormat offering improved search performance and index compression. This will likely become the default format in a future release.
- All non-default codec implementations were moved to a separate codecs module. Just add lucene-codecs-4.0.0.jar to your classpath to test these out.
- Payloads can be optionally stored on the term vectors.
- Many bugfixes and optimizations.
13 August 2012 - Lucene Core 4.0-BETA
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html
See the CHANGES.txt file included with the release for a full list of details.
Highlights of changes since 4.0-alpha:
- IndexWriter.tryDeleteDocument can sometimes delete by document ID, for higher performance in some applications.
- New experimental postings formats: BloomFilteringPostingsFormat uses a bloom filter to sometimes avoid disk seeks when looking up terms; DirectPostingsFormat holds all postings as simple byte[] and int[] for very fast performance at the cost of very high RAM consumption.
- CJK analysis improvements: JapaneseIterationMarkCharFilter normalizes Japanese iteration marks, and unigram+bigram support was added to CJKBigramFilter.
- Improvements to the Scorer navigation API (Scorer.getChildren) to support all queries, useful for determining which portions of the query matched.
- Analysis improvements: factories for creating Tokenizer, TokenFilter, and CharFilter have been moved from Solr to Lucene's analysis module, and StandardTokenizer and the Snowball filters have less memory overhead.
- Improved highlighting for multi-valued fields.
- Various other API changes, optimizations and bug fixes.
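The way a bloom filter can help skip term lookups, as mentioned for BloomFilteringPostingsFormat above, can be sketched in a few lines. The BloomSketch class below is a hypothetical, simplified filter written for illustration; its hash scheme and sizing are assumptions and are unrelated to what Lucene actually uses.

```java
import java.util.BitSet;

// Illustrative only: a hypothetical, simplified filter, not BloomFilteringPostingsFormat.
final class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Records a term by setting one bit per hash function. */
    void add(String term) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(term, i));
        }
    }

    /** False means the term was definitely never added; true means it may have been. */
    boolean mightContain(String term) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(term, i))) {
                return false;
            }
        }
        return true;
    }

    /** Derives the i-th bit index from two base hashes (double hashing). */
    private int index(String term, int i) {
        int h1 = term.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, size);
    }
}
```

A caller would consult mightContain before the term dictionary and skip the lookup on a false answer; a true answer still requires the real lookup, since false positives are possible.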
Please read CHANGES.txt and MIGRATE.txt for a full list of new features and notes on upgrading. In particular, the new APIs are not compatible with previous versions of Lucene; however, file format backwards compatibility is provided for indexes from the 3.0 series and the 4.0-alpha release.
This is a beta release for early adopters. The guarantee for this beta release is that the index format will be the 4.0 index format, supported through the 5.x series of Apache Lucene, unless there is a critical bug (e.g. that would cause index corruption) that would prevent this.
Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html)
22 July 2012 - Apache Lucene 3.6.1
The Lucene PMC is pleased to announce the release of Apache Lucene 3.6.1.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release is a bug fix release for version 3.6.0. It contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-3x-redir.html
See the CHANGES.txt file included with the release for a full list of details.
Lucene 3.6.1 Release Highlights:
- The concurrency of MMapIndexInput.clone() was improved, fixing a performance regression relative to Lucene 3.5.0.
- MappingCharFilter was fixed to return correct final token positions.
- QueryParser now supports +/- operators with any amount of whitespace.
- DisjunctionMaxScorer now implements visitSubScorers().
- Changed the visibility of Scorer#visitSubScorers() to public, since otherwise it is impossible to implement Scorers outside the Lucene package. This is a small backwards break, affecting a few users who implemented custom Scorers.
- Various analyzer bugs were fixed: Kuromoji no longer produces an invalid token graph due to UNK with punctuation being decompounded, an invalid position length in SynonymFilter, loading of Hunspell dictionaries that use aliasing, and consistent closing of streams when loading Hunspell affix files.
- Various bugs in FST components were fixed: the offline sorter's minimum buffer size, an integer overflow in the sorter, and FSTCompletionLookup failing to close its sorter.
- Fixed a synchronization bug in handling taxonomies in the facet module.
- Various minor bugs were fixed: BytesRef/CharsRef copy methods with nonzero offsets and a subSequence off-by-one error, and TieredMergePolicy returning a wrongly scaled floor segment setting.
3 July 2012 - Lucene Core 4.0-ALPHA
The Lucene PMC is pleased to announce the release of Apache Lucene 4.0-alpha
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html
See the CHANGES.txt file included with the release for a full list of details.
Lucene 4.0-alpha Release Highlights:
- The index formats for terms, postings lists, stored fields, term vectors, etc. are pluggable via the Codec API. You can select from the provided implementations or customize the index format with your own Codec to meet your needs.
- Similarity has been decoupled from the vector space model (TF/IDF). Additional models such as BM25, Divergence from Randomness, Language Models, and Information-based models are provided (see http://www.lucidimagination.com/blog/2011/09/12/flexible-ranking-in-lucene-4).
- Added support for per-document values (DocValues). DocValues can be used for custom scoring factors (accessible via Similarity), for pre-sorted Sort values, and more.
- When indexing via multiple threads, each IndexWriter thread now flushes its own segment to disk concurrently, resulting in substantial performance improvements (see http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html).
- Per-document normalization factors ("norms") are no longer limited to a single byte. Similarity implementations can use any DocValues type to store norms.
- Added index statistics such as the number of tokens for a term or field, number of postings for a field, and number of documents with a posting for a field: these support additional scoring models (see http://blog.mikemccandless.com/2012/03/new-index-statistics-in-lucene-40.html).
- Implemented a new default term dictionary/index (BlockTree) that indexes shared prefixes instead of every n'th term. This is not only more time- and space-efficient, but can also sometimes avoid going to disk at all for terms that do not exist. Alternative term dictionary implementations are provided and pluggable via the Codec API.
- Indexed terms are no longer UTF-16 char sequences; instead, terms can be any binary value encoded as byte arrays. By default, text terms are now encoded as UTF-8 bytes. Sort order of terms is now defined by their binary value, which is identical to UTF-8 sort order.
- Substantially faster performance when using a Filter during searching.
- File-system based directories can rate-limit the IO (MB/sec) of merge threads, to reduce IO contention between merging and searching threads.
- Added a number of alternative Codecs and components for different use-cases: "Appending" works with append-only filesystems (such as Hadoop DFS), "Memory" writes the entire terms+postings as an FST read into RAM (see http://blog.mikemccandless.com/2011/06/primary-key-lookups-are-28x-faster-with.html), "Pulsing" inlines the postings for low-frequency terms into the term dictionary (see http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html), "SimpleText" writes all files in plain-text for easy debugging/transparency (see http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html), among others.
- Term offsets can be optionally encoded into the postings lists and can be retrieved per-position.
- A new AutomatonQuery returns all documents containing any term matching a provided finite-state automaton (see http://www.slideshare.net/otisg/finite-state-queries-in-lucene).
- FuzzyQuery is 100-200 times faster than in past releases (see http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html).
- A new spell checker, DirectSpellChecker, finds possible corrections directly against the main search index without requiring a separate index.
- Various in-memory data structures such as the term dictionary and FieldCache are represented more efficiently with less object overhead (see http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html).
- All search logic is now required to work per segment; IndexReader was therefore refactored to differentiate between atomic and composite readers (see http://blog.thetaphi.de/2012/02/is-your-indexreader-atomic-major.html).
- Lucene 4.0 provides a modular API, consolidating components such as Analyzers and Queries that were previously scattered across Lucene core, contrib, and Solr. These modules also include additional functionality such as UIMA analyzer integration and a completely reworked spatial search implementation.
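One practical consequence of the binary term order noted above is that text terms compared as UTF-8 bytes can sort differently from Java's String.compareTo, which compares UTF-16 code units. The sketch below, using a hypothetical TermOrderSketch helper rather than any Lucene class, compares two strings by their unsigned UTF-8 bytes so the difference can be observed for supplementary characters.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a hypothetical helper, not a Lucene class.
final class TermOrderSketch {
    /** Compares two strings by their UTF-8 encoding, byte by byte, treating bytes as unsigned. */
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int shared = Math.min(x.length, y.length);
        for (int i = 0; i < shared; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // When one encoding is a prefix of the other, the shorter one sorts first.
        return x.length - y.length;
    }
}
```

For example, U+FF5E sorts before U+1F600 under compareUtf8, but after it under String.compareTo, because the supplementary character is stored in UTF-16 as a surrogate pair starting at 0xD83D. Code that relied on String ordering of terms may need to account for this when upgrading.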
Please read CHANGES.txt and MIGRATE.txt for a full list of new features and notes on upgrading. In particular, the new APIs are not compatible with previous versions of Lucene; however, file format backwards compatibility is provided for indexes from the 3.0 series.
This is an alpha release for early adopters. The guarantee for this alpha release is that the index format will be the 4.0 index format, supported through the 5.x series of Apache Lucene, unless there is a critical bug (e.g. that would cause index corruption) that would prevent this.
Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html)
