To build PyLucene, JCC needs to be built first. Sources for JCC are
included with the PyLucene sources. Instructions for building and
installing JCC are here.
PyLucene closely tracks Java Lucene releases. It aims to
support the entire Lucene API.
PyLucene also includes a number of Lucene contrib packages: the
Snowball analyzer and stemmers, the highlighter package, analyzers
for languages other than English, regular expression queries,
specialized queries such as 'more like this', and more.
This document only covers the pythonic extensions to Lucene offered
by PyLucene as well as some differences between the Java and Python
APIs. For the documentation on Java Lucene APIs,
see here.
To help with debugging and to support some Lucene APIs, PyLucene also
exposes some Java runtime APIs.
Samples
The best way to learn PyLucene is to look at the many samples
included with the PyLucene source release or on the web at:
A large number of samples are shipped with PyLucene. Most notably,
all the samples published in the Lucene in Action book that did not
depend on a third party Java library for which there was no obvious
Python equivalent were ported to Python and PyLucene.
Lucene in Action is a great companion to learning
Lucene. Having all the samples available in Python should make it
even easier for Python developers.
Lucene in Action was written by Erik Hatcher and Otis
Gospodnetic, both part of the Java Lucene development team, and is
available from
Manning Publications.
Threading support with attachCurrentThread
PyLucene APIs can be used directly from the main thread and from
threads created by the Java runtime. Before they can be used from any
other thread, such as one created with Python's threading module, the
attachCurrentThread() method must be called from that thread on the
JCCEnv object returned by the initVM() or getVMEnv() functions.
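The requirement above can be sketched as a small helper. This is only an illustration: the helper name run_in_attached_thread is hypothetical, and env is assumed to be the JCCEnv object obtained from initVM() or getVMEnv().

```python
import threading


def run_in_attached_thread(env, work):
    """Run work() on a new Python thread that is first attached to the JVM.

    env is assumed to be the JCCEnv object returned by lucene.initVM()
    or lucene.getVMEnv(); work is any callable that uses PyLucene APIs.
    This helper is a hypothetical sketch, not part of PyLucene.
    """
    results = []

    def target():
        # Attach this Python-created thread before touching any PyLucene API.
        env.attachCurrentThread()
        results.append(work())

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    # If work() raised, nothing was appended and this lookup fails.
    return results[0]
```

With PyLucene installed, a call would look like run_in_attached_thread(lucene.getVMEnv(), search_function), where search_function is whatever callable needs to run off the main thread.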
Exception handling with lucene.JavaError
Java exceptions are caught at the language barrier and reported to
Python by raising a JavaError instance whose args tuple contains the
actual Java Exception instance.
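As a rough sketch of how the wrapped Java exception might be retrieved, the helpers below read the first element of the args tuple. The function names are hypothetical, and java_error_class is assumed to be lucene.JavaError.

```python
def java_exception_from(error):
    """Return the Java exception carried in a JavaError's args tuple, if any."""
    return error.args[0] if error.args else None


def call_catching(java_error_class, func, *args):
    """Call func(*args) and return (result, None) or (None, java_exception).

    java_error_class is assumed to be lucene.JavaError; both helpers are
    hypothetical and exist only to illustrate the args tuple access.
    """
    try:
        return func(*args), None
    except java_error_class as error:
        # The Java exception instance travels in the Python exception's args.
        return None, java_exception_from(error)
```

With PyLucene installed, call_catching(lucene.JavaError, parser.parse, text) would return the parsed query, or the Java exception raised while parsing; parser and text stand for objects of your own.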
Handling Java arrays
Java arrays are returned to Python in a JArray
wrapper instance that implements the Python sequence protocol. It
is possible to change array elements but not to change the array
size.
A few Lucene APIs take array arguments and expect values to be
returned in them. To call such an API and be able to retrieve the
array values after the call, a Java array needs to be instantiated
first.
For example, accessing termDocs:
termDocs = reader.termDocs(Term("isbn", isbn))
docs = JArray('int')(1) # allocate an int[1] array
freq = JArray('int')(1) # allocate an int[1] array
if termDocs.read(docs, freq) == 1:
    bits.set(docs[0]) # access the array's first element
In addition to 'int', the 'JArray'
function accepts 'object', 'string',
'bool', 'byte', 'char',
'double', 'float', 'long'
and 'short' to create an array of the corresponding
type. The JArray('object') constructor takes a second
argument denoting the class of the object elements. This argument
is optional and defaults to Object.
To convert a char array to a Python string use a
''.join(array) construct.
Instead of an integer denoting the size of the desired Java array,
a sequence of objects of the expected element type may be passed
in to the array constructor.
For example:
# create a Java array of double from the [1.5, 2.5] list
JArray('double')([1.5, 2.5])
All methods that expect an array also accept a sequence of Python
objects of the expected element type. If no values are expected
from the array arguments after the call, it is hence not necessary
to instantiate a Java array to make such calls.
See JCC for more
information about handling arrays.
Differences between the Java Lucene and PyLucene APIs
The PyLucene API exposes all Java Lucene classes in a flat namespace
in the lucene module. For example, the Java import statement
import org.apache.lucene.index.IndexReader; corresponds to the
Python import statement from lucene import IndexReader.
Downcasting is a common operation in Java but not a concept in
Python. Because the wrapper objects implement exactly the
APIs of the declared type of the wrapped object, all classes
implement two class methods called instance_ and cast_ that
verify and cast an instance respectively.
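A minimal sketch of combining the two class methods follows. The helper name downcast is hypothetical, and wrapper_class is assumed to be any PyLucene wrapper class.

```python
def downcast(wrapper_class, obj):
    """Return obj viewed as wrapper_class, or None if it is not an instance.

    wrapper_class is assumed to be a PyLucene wrapper class, which provides
    the instance_ and cast_ class methods. This helper is a hypothetical
    sketch, not part of PyLucene.
    """
    if wrapper_class.instance_(obj):      # verify the runtime type first
        return wrapper_class.cast_(obj)   # then obtain the narrower wrapper
    return None
```

For instance, downcast(Field, obj) would return a Field wrapper when obj wraps a Java Field, and None otherwise.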
Pythonic extensions to the Java Lucene APIs
Java is a very verbose language. Python, on the other hand, offers
many syntactically attractive constructs for iteration, property
access, etc. As the Java Lucene samples from the Lucene in
Action book were ported to Python, PyLucene received a number
of pythonic extensions listed here:
Iterating search hits is a very common operation. Hits instances
are iterable in Python. Two values are returned for each
iteration, the zero-based number of the document in the Hits
instance and the document instance itself.
The Java loop:
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(hits.score(i) + " : " + doc.get("title"));
}
can be written in Python:
for hit in hits:
    hit = Hit.cast_(hit)
    print hit.getScore(), ':', hit.getDocument()['title']
If the next() method of the iterator returned by Hits.iterator() were
declared to return Hit instead of Object, the above
cast_() call would not be necessary.
The same Java loop can also be written:
for i in xrange(len(hits)):
    print hits.score(i), ':', hits[i]['title']
Hits instances partially implement the Python 'sequence'
protocol.
The Java expressions:
hits.length()
doc = hits.doc(i)
are better written in Python:
len(hits)
doc = hits[i]
Document instances have fields whose values can be accessed
through the mapping protocol.
The Java expression:
doc.get("title")
is better written in Python:
doc['title']
Document instances can be iterated over for their fields.
The Java loop:
Enumeration fields = doc.getFields();
while (fields.hasMoreElements()) {
    Field field = (Field) fields.nextElement();
    ...
}
is better written in Python:
for field in doc.getFields():
    field = Field.cast_(field)
    ...
Once JCC heeds Java 1.5 type parameters and once Java Lucene
makes use of them, such casting should become unnecessary.
Extending Java Lucene classes from Python
Many areas of the Lucene API expect the programmer to provide
their own implementation or specialization of a feature where
the default is inappropriate. For example, text analyzers and
tokenizers are an area where many parameters and environmental
or cultural factors call for customization.
PyLucene enables this by providing Java extension points listed
below that serve as proxies for Java to call back into the
Python implementations of these customizations.
These extension points are simple Java classes for which JCC
generates the native C++ implementations. It is easy to add
more such extension classes to the 'java' directory of the
PyLucene source tree.
To learn more about this topic, please refer to the JCC
documentation.
Please refer to the classes in the 'java' tree for currently
available extension points. Examples of uses of these extension
points are to be found in PyLucene's unit tests and the Lucene in
Action samples.
News
12 April 2012 - Lucene Core 3.6.0 and Solr 3.6.0 Available
The Lucene PMC is pleased to announce the availability
of Apache Lucene 3.6.0 and Apache Solr 3.6.0.
In addition to Java 5 and Java 6, this release now has
full Java 7 support (minimum JDK 7u1 required).
TypeTokenFilter filters tokens based on their TypeAttribute.
Fixed offset bugs in a number of CharFilters, Tokenizers and TokenFilters
that could lead to exceptions during highlighting.
Added phonetic encoders: Metaphone, Soundex, Caverphone,
Beider-Morse, etc.
CJKBigramFilter and CJKWidthFilter replace CJKTokenizer.
Kuromoji morphological analyzer tokenizes Japanese text, producing
both compound words and their segmentation.
Static index pruning (Carmel pruning) removes postings with low
within-document term frequency.
QueryParser now interprets '*' as an open end for range
queries.
FieldValueFilter excludes documents missing the specified field.
CheckIndex and IndexUpgrader allow you to specify the
specific FSDirectory implementation to use with the new -dir-impl
command-line option.
FSTs can now do reverse lookup (by output) in certain cases and
can be packed to reduce their size. There is now a method to
retrieve top N shortest paths from a start node in an FST.
New WFSTCompletionLookup suggester supports finer-grained
ranking for suggestions.
FST based suggesters now use an offline (disk-based) sort, instead
of in-memory sort, when pre-sorting the suggestions.
ToChildBlockJoinQuery joins in the opposite direction (parent down
to child documents).
New query-time joining is more flexible (but less performant) than
index-time joins.
Added HTMLStripCharFilter to strip HTML markup.
Security fix: Better prevention of virtual machine SIGSEGVs when
using MMapDirectory: Code using cloned IndexInputs of already
closed indexes could possibly crash the VM, allowing DoS attacks on
your application.
Many bug fixes.
Highlights of the Solr release include:
New SolrJ client connector using Apache Http Components http client
(SOLR-2020)
Many analyzer factories are now "multi term query aware" allowing for things
like field type aware lowercasing when building prefix & wildcard queries.
(SOLR-2438)
New Kuromoji morphological analyzer tokenizes Japanese text, producing
both compound words and their segmentation. (SOLR-3056)
Range Faceting (Dates & Numbers) is now supported in distributed search
(SOLR-1709)
HTMLStripCharFilter has been completely re-implemented, fixing many bugs
and greatly improving the performance (LUCENE-3690)
StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)
New LFU Cache option for use in Solr's internal caches. (SOLR-2906)
Memory performance improvements to all FST based suggesters (SOLR-2888)
New WFSTLookupFactory suggester supports finer-grained ranking for
suggestions. (LUCENE-3714)
New options for configuring the amount of concurrency used in distributed
searches (SOLR-3221)
Many bug fixes
27 November 2011 - Lucene Core 3.5.0 and Solr 3.5.0 Available
The Lucene PMC is pleased to announce the availability
of Apache Lucene 3.5.0 and Apache Solr 3.5.0.
Very substantial (3-5X) reduction in the RAM required to hold the
terms index on opening an IndexReader. (LUCENE-2205)
Added IndexSearcher.searchAfter which returns results after a
specified ScoreDoc (e.g. last document on the previous page) to
support deep paging use cases.
(LUCENE-2215)
Added SearcherManager to manage sharing and reopening IndexSearchers
across multiple search threads. Underlying IndexReader instances are
safely closed if not referenced anymore.
(LUCENE-3445,
LUCENE-3558)
Added SearcherLifetimeManager which safely provides a consistent
view of the index across multiple requests (e.g. paging/drilldown).
(LUCENE-3558,
LUCENE-3486)
Renamed IndexWriter.optimize to forceMerge to discourage use of
this method since it is horribly costly and rarely justified anymore.
(LUCENE-3454)
Added NGramPhraseQuery that speeds up phrase queries 30-50% when
n-gram analysis is used. (LUCENE-3426)
Added a new reopen API (IndexReader.openIfChanged) that returns
null instead of the old reader if there are no changes in the index.
(LUCENE-3464)
Improvements to vector highlighting: support for more queries
such as wildcards and boundary analysis for generated snippets.
(LUCENE-1824,
LUCENE-1889)
IndexSearcher and IndexReader now perform additional checks to
throw AlreadyClosedExceptions if searches are performed on a
closed IndexReader. Performing searches on an already closed reader can
cause JVM crashes when invalid memory mapped files are referenced.
Several bugfixes, including a bug where closing an NRT reader
after the writer was closed was incorrectly invoking the
DeletionPolicy. See CHANGES.txt entries for full details.
Highlights of the Solr release include:
Bug fixes and improvements from Apache Lucene 3.5.0, including a
very substantial (3-5X) reduction in the RAM required to hold the terms
index on opening an IndexReader.
(LUCENE-2205)
Added support for Hunspell stemmer TokenFilter supporting
stemming for 99 languages.
(SOLR-2769)
A new contrib module "langid" adds language identification
capabilities as an Update Processor, using Tika's
LanguageIdentifier or Cybozu language-detection library
(SOLR-1979)
Numeric types including Trie and date types now support
sortMissingFirst/Last.
(SOLR-2881)
Added hl.q parameter. It is optional and if it is specified,
it overrides q parameter in Highlighter.
(SOLR-1926)
Several minor bugfixes like date parsing for years from 0001-1000, ignored
configurations when using QueryAnalyzer with
SpellCheckComponent and many more.
See CHANGES.txt entries for full details.
26 October 2011 - Java 7u1 fixes index corruption and crash bugs in Apache Lucene Core and Apache Solr
Oracle released Java 7u1 on October 19.
According to the release notes and tests done by the Lucene committers, all bugs reported on July 28 are fixed in this release,
so code using Porter stemmer no longer crashes with SIGSEGV. We were not able to experience any index corruption anymore,
so it is safe to use Java 7u1 with Lucene Core and Solr.
On the same day, Oracle released Java 6u29,
fixing the same problems occurring with Java 6 if the JVM switches -XX:+AggressiveOpts
or -XX:+OptimizeStringConcat were used. Of course, you should not use experimental JVM options like
-XX:+AggressiveOpts in production environments! We recommend that everybody upgrade to this latest version, 6u29.
In case you upgrade to Java 7, remember that you may have to reindex, as the Unicode
version shipped with Java 7 changed and tokenization behaves differently
(e.g. lowercasing). For more information, read JRE_VERSION_MIGRATION.txt
in your distribution package!
14 September 2011 - Lucene Core 3.4.0 and Solr 3.4.0 Available
The Lucene PMC is pleased to announce the availability
of Apache Lucene 3.4.0 and Apache Solr 3.4.0.
If you are already using Apache Lucene 3.1, 3.2 or 3.3, we strongly recommend you upgrade to 3.4.0 because of the index corruption bug on OS or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0.
Highlights of the Lucene release include:
Fixed a major bug (LUCENE-3418) whereby a Lucene index could
easily become corrupted if the OS or computer crashed or lost
power.
Added a new faceting module (contrib/facet) for computing facet
counts (both hierarchical and non-hierarchical) at search
time (LUCENE-3079).
Added a new join module (contrib/join), enabling indexing and
searching of nested (parent/child) documents using
BlockJoinQuery/Collector (LUCENE-3171).
It is now possible to index documents with term frequencies
included but without positions (LUCENE-2048); previously
omitTermFreqAndPositions always omitted both.
The modular QueryParser (contrib/queryparser) can now create
NumericRangeQuery.
Added SynonymFilter, in contrib/analyzers, to apply multi-word
synonyms during indexing or querying, including parsers to read
the wordnet and solr synonym formats (LUCENE-3233).
You can now control how documents that don't have a value on the
sort field should sort (LUCENE-3390), using SortField.setMissingValue.
Fixed a case where term vectors could be silently deleted from the
index after addIndexes (LUCENE-3402).
Highlights of the Solr release include:
SolrJ client can now parse grouped and range facets results
(SOLR-2523).
A new XsltUpdateRequestHandler allows posting XML that's
transformed by a provided XSLT into a valid Solr document
(SOLR-2630).
Post-group faceting option (group.truncate) can now compute
facet counts for only the highest ranking documents per-group.
(SOLR-2665).
Add commitWithin update request parameter to all update handlers
that were previously missing it. This tells Solr to commit the
change within the specified amount of time (SOLR-2540).
New parameter hl.phraseLimit speeds up FastVectorHighlighter
(LUCENE-3234).
The query cache and filter cache can now be disabled per request.
See this wiki page
(SOLR-2429).
Improved memory usage, build time, and performance of
SynonymFilterFactory (LUCENE-3233).
Added omitPositions to the schema, so you can omit position
information while still indexing term frequencies (LUCENE-2048).
Various fixes for multi-threaded DataImportHandler.
28 July 2011 - WARNING: Index corruption and crashes in Apache Lucene Core / Apache Solr with Java 7
Oracle released Java 7 today.
Unfortunately it contains hotspot compiler optimizations, which miscompile some loops.
This can affect code of several Apache projects. Sometimes JVMs only crash, but in several cases,
results calculated can be incorrect, leading to bugs in applications
(see Hotspot bugs 7070134,
7044738,
7068051).
Apache Lucene Core and Apache Solr are two Apache projects,
which are affected by these bugs, namely all versions released until today.
Solr users with the default configuration will have
Java crashing with SIGSEGV as soon as they start to index documents, as one
affected part is the well-known Porter stemmer
(see LUCENE-3335).
Other loops in Lucene may be miscompiled, too, leading to index corruption
(especially on Lucene trunk with pulsing codec; other loops may be
affected, too - LUCENE-3346).
These problems were detected only 5 days before the official Java 7 release,
so Oracle had no time to fix those bugs, affecting also many more applications.
In response to our questions, they proposed to include the fixes into service
release u2 (eventually into service release u1, see
this mail).
This means you cannot use Apache Lucene/Solr with Java 7 releases before Update 2!
If you do, please don't open bug reports, it is not the committers' fault!
At least disable loop optimizations using the -XX:-UseLoopPredicate JVM option
to not risk index corruptions.
Please note: Java 6 users are also affected if they use one of these
JVM options, which are not enabled by default: -XX:+OptimizeStringConcat
or -XX:+AggressiveOpts.
It is strongly recommended not to use any hotspot optimization switches in any Java
version without extensive testing!
In case you upgrade to Java 7, remember that you may have to reindex, as the Unicode
version shipped with Java 7 changed and tokenization behaves differently
(e.g. lowercasing). For more information, read JRE_VERSION_MIGRATION.txt
in your distribution package!
1 July 2011 - Lucene Core 3.3 and Solr 3.3 Available
The Lucene PMC is pleased to announce the availability
of Apache Lucene 3.3 and Apache Solr 3.3.
The spellchecker module now includes suggest/auto-complete functionality,
with three implementations: Jaspell, Ternary Trie, and Finite State.
Support for merging results from multiple shards, for both "normal"
search results (TopDocs.merge) as well as grouped results using the
grouping module (SearchGroup.merge, TopGroups.merge).
An optimized implementation of KStem, a less aggressive stemmer
for English.
Single-pass grouping implementation based on block document indexing.
Improvements to MMapDirectory (now also the default implementation
returned by FSDirectory.open on 64-bit Linux).
NRTManager simplifies handling near-real-time search with multiple
search threads, allowing the application to control which indexing
changes must be visible to which search requests.
TwoPhaseCommitTool facilitates performing a multi-resource
two-phased commit, including IndexWriter.
The default merge policy, TieredMergePolicy, has a new method
(set/getReclaimDeletesWeight) to control how aggressively it
targets segments with deletions, and is now more aggressive than
before by default.
PKIndexSplitter tool splits an index by a mid-point term.
Highlights of the Solr release include:
Grouping / Field Collapsing
A new, automaton-based suggest/autocomplete implementation offering an
order of magnitude smaller RAM consumption.
KStemFilterFactory, an optimized implementation of a less aggressive
stemmer for English.
Solr defaults to a new, more efficient merge policy (TieredMergePolicy).
See http://s.apache.org/merging for more information.
Important bugfixes, including extremely high RAM usage in spellchecking.
Bugfixes and improvements from Apache Lucene 3.3.
4 June 2011 - Lucene Core 3.2 and Solr 3.2 Available
The Lucene PMC is pleased to announce the availability of Apache Lucene 3.2 and Apache Solr 3.2.
A new grouping module, under lucene/contrib/grouping, enables
search results to be grouped by a single-valued indexed field.
A new IndexUpgrader tool fully converts an old index to the
current format.
A new Directory implementation, NRTCachingDirectory, caches small
segments in RAM, to reduce the I/O load for applications with fast
NRT reopen rates.
A new Collector implementation, CachingCollector, is able to
gather search hits (document IDs and optionally also scores) and
then replay them. This is useful for Collectors that require two
or more passes to produce results.
Index a document block using IndexWriter's new addDocuments or
updateDocuments methods. These experimental APIs ensure that the
block of documents will forever remain contiguous in the index,
enabling interesting future features like grouping and joins.
A new default merge policy, TieredMergePolicy, which is more
efficient due to being able to merge non-contiguous segments.
See http://s.apache.org/merging for details.
NumericField is now returned correctly when you load a stored
document (previously you received a normal Field back, with the
numeric value converted to a string).
Deleted terms are now applied during flushing to the newly flushed
segment, which is more efficient than having to later initialize a
reader for that segment.
Highlights of the Solr release include:
Ability to specify overwrite and commitWithin as request parameters when
using the JSON update format.
TermQParserPlugin, useful when generating filter queries from terms
returned from field faceting or the terms component.
DebugComponent now supports using a NamedList to model Explanation objects
in its responses instead of Explanation.toString.
Improvements to the UIMA and Carrot2 integrations.
Highlighting performance improvements.
A test-framework jar for easy testing of Solr extensions.
Bugfixes and improvements from Apache Lucene 3.2.
31 March 2011 - Lucene Core 3.1 and Solr 3.1 Available
The Lucene PMC is pleased to announce the availability of Apache Lucene 3.1 and Apache Solr 3.1.
The version number for Solr 3.1 was chosen to reflect the merge of
development with Lucene, which is currently also on 3.1. Going
forward, we expect the Solr version to be the same as the Lucene
version. Solr 3.1 contains Lucene 3.1 and is the release after Solr 1.4.1.
Numerous performance improvements: faster exact PhraseQuery; merging
favors segments with deletions; primary key lookup is faster;
IndexWriter.addIndexes(Directory[]) uses file copy instead of
merging; various Directory performance improvements; compound file
is dynamically turned off for large segments; fully deleted segments
are dropped on commit; faster snowball analyzers (in contrib);
ConcurrentMergeScheduler is more careful about setting priority of
merge threads.
ReusableAnalyzerBase makes it easier to reuse TokenStreams
correctly.
Improved Analysis capabilities: Improved Unicode support, including
Unicode 4, more friendly term handling (CharTermAttribute), easier
object reuse and better support for protected words in lossy token
filters (e.g. stemmers).
ConstantScoreQuery now allows directly wrapping a Query.
IndexWriter is now configured with a new separate builder API,
IndexWriterConfig. You can now control IndexWriter's previously
fixed internal thread limit by calling setMaxThreadStates.
IndexWriter.getReader is replaced by IndexReader.open(IndexWriter).
In addition you can now specify whether deletes should be resolved
when you open an NRT reader.
MultiSearcher is deprecated; ParallelMultiSearcher has been
absorbed directly into IndexSearcher.
On 64bit Windows and Solaris JVMs, MMapDirectory is now the
default implementation (returned by FSDirectory.open).
MMapDirectory also enables unmapping if the JVM supports it.
New TotalHitCountCollector just counts total number of hits.
ReaderFinishedListener API enables external caches to evict entries
once a segment is finished.
Highlights of the Solr release include:
Numeric range facets (similar to date faceting).
New spatial search, including spatial filtering, boosting and sorting capabilities.
Example Velocity driven search UI at http://localhost:8983/solr/browse
A new termvector-based highlighter.
Extended dismax (edismax) query parser, which addresses some
missing features in the dismax query parser along with some
extensions.
Several more components now support distributed mode:
TermsComponent, SpellCheckComponent.
A new Auto Suggest component.
Ability to sort by functions.
JSON document indexing.
CSV response format.
Apache UIMA integration for metadata extraction.
Leverages Lucene 3.1 and its inherent optimizations and bug fixes
as well as new analysis capabilities.
Numerous improvements, bug fixes, and optimizations.
The Apache Software Foundation
The Apache Software Foundation provides support for the Apache community of open-source software projects. The Apache projects are defined by collaborative consensus based processes, an open, pragmatic software license and a desire to create high quality software that leads the way in its field. Apache Lucene, Apache Solr, Apache PyLucene, Apache Open Relevance Project and their respective logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.