Warning
Before calling any PyLucene API that requires the Java VM, start it by
calling initVM(classpath, ...). This function is described
below, under "JCC's runtime API functions".
Installing JCC
JCC is a Python extension written in Python and C++. It requires a
Java Runtime Environment (JRE) to operate as it uses Java's
reflection APIs to do its work. It is built and installed
via distutils or setuptools.
See installation for more information and operating system specific notes.
Invoking JCC
JCC is installed as a package and how to invoke it depends on the Python version used:
- python 2.7:
  python -m jcc
- python 2.6:
  python -m jcc.__main__
- python 2.5:
  python -m jcc
- python 2.4:
  - no setuptools:
    python site-packages/jcc/__init__.py
  - with setuptools:
    python site-packages/<jcc egg directory>/jcc/__init__.py
- python 2.3:
  python site-packages/<jcc egg directory>/jcc/__init__.py
Generating C++ and Python wrappers with JCC
JCC started as a C++ code generator for hiding the gory details of accessing methods and fields on Java classes via the Java Native Interface (JNI). These C++ wrappers make it possible to access a Java object as if it were a regular C++ object, very much like GCJ's CNI interface.
It then became apparent that JCC could also generate the C++ wrappers for making these classes available to Python. Every class that gets thus wrapped becomes a CPython type.
JCC generates wrappers for all public classes that are requested by
name on the command line or via the --jar command line
argument. It generates wrapper methods for all public methods and
fields on these classes whose return type and parameter types are
found in one of the following ways:
- the type is one of the requested classes
- the type is one of the requested classes' superclass or implemented interfaces
- the type is available from one of the packages listed via the
  --package command line argument
Overloaded methods are supported and are selected at runtime on the basis of the type and number of arguments passed in.
JCC does not generate wrappers for methods or fields which don't satisfy these requirements. Thus, JCC can avoid generating code for runaway transitive closures of type dependencies.
JCC generates property accessors for a property
called field when it finds Java methods
named setField(value),
getField() or
isField().
The C++ wrappers are declared in a C++ namespace structure that mirrors the Java classes' Java packages. The Python types are declared in a flat namespace at the top level of the resulting Python extension module.
JCC's command-line arguments are best illustrated via the PyLucene example:
There are limits to both how many files can fit on the command line and how large a C++ file the C++ compiler can handle. By default, JCC generates one large C++ file containing the source code for all wrapper classes.
Using the --files command line argument, this behaviour
can be tuned to work around various limits. For example:
- to break up the large wrapper class file into about 2 files:
  --files 2
- to break up the large wrapper class file into about 10 files:
  --files 10
- to generate one C++ file per Java class wrapped:
  --files separate
The --prefix and --root arguments are
passed through to distutils' setup().
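As a minimal sketch of how the options described above fit together, the following Python helper assembles a JCC command line as an argument list without executing it. The helper name jcc_command, the jar name and the module name are hypothetical. The --python and --build options are JCC options that this section does not describe, and the python -m jcc form applies to the Python versions listed under "Invoking JCC".

```python
import sys


def jcc_command(jars, python_name, packages=(), files=None,
                build=True, install=True):
    """Return a JCC command line as a list of arguments (nothing is run)."""
    argv = [sys.executable, "-m", "jcc"]
    for jar in jars:
        argv += ["--jar", jar]              # wrap all public classes in the jar
    for package in packages:
        argv += ["--package", package]      # packages whose types may be used
    argv += ["--python", python_name]       # assumption: names the extension module
    if files is not None:
        argv += ["--files", str(files)]     # a number, or 'separate'
    if build:
        argv.append("--build")              # assumption: compile the wrappers
    if install:
        argv.append("--install")
    return argv


# hypothetical jar and module names, for illustration only
command = jcc_command(["example.jar"], "example",
                      packages=["java.lang"], files=2)
print(" ".join(command[1:]))
```

The resulting list could be handed to subprocess.call() once the jar files and a JCC installation are actually available.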
Classpath considerations
When generating wrappers for Python, the JAR files passed to JCC
via --jar are copied into the resulting Python extension
egg as resources and added to the extension
module's CLASSPATH variable. Classes or JAR files that
are required by the classes contained in the argument JAR files need
to be made findable via JCC's --classpath command line
argument. At runtime, these need to be appended to the
extension's CLASSPATH variable before starting the VM
with initVM(CLASSPATH).
To have such required jar files also automatically copied into the
resulting Python extension egg and added to the classpath at build
and runtime, use the --include option. This option
works like the --jar option except that no wrappers are
generated for the classes contained in them unless they're
explicitly named on the command line.
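When the --include option is not used, the extra jar files have to be appended to the extension's CLASSPATH by hand before the VM is started. A minimal sketch of that step follows, assuming the classpath string uses the platform's path separator (os.pathsep). The helper name extend_classpath and the jar names are hypothetical.

```python
import os


def extend_classpath(base, extra_entries):
    """Append extra jar files or directories to an existing classpath string."""
    entries = [base] if base else []
    entries.extend(extra_entries)
    return os.pathsep.join(entries)


# 'lucene.jar' stands in for the CLASSPATH value exported by an extension module
classpath = extend_classpath("lucene.jar", ["deps/extra.jar"])
print(classpath)

# With a JCC-built module installed, the call would look like:
#   module.initVM(extend_classpath(module.CLASSPATH, ["deps/extra.jar"]))
```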
When more than one JCC-built extension module is going to be used in
the same Python VM and these extension modules share Java classes,
only one extension module should be generated with wrappers for these
shared classes. The other extension modules must be built by importing
the one with the shared classes by using the --import
command line parameter. This ensures that only one copy of the
wrappers for the shared classes is generated and that they are
compatible among all extension modules sharing them.
Using distutils vs setuptools
By default, when building a Python extension,
if setuptools is found to be installed, it is used
over distutils. If you want to force the use
of distutils over setuptools, use
the --use-distutils command line argument.
Distributing an egg
The --bdist option can be used to ask JCC to
invoke distutils with bdist
or setuptools
with bdist_egg. If setuptools is used,
the resulting egg has to be installed with the
easy_install
installer which is normally part of a Python installation that
includes setuptools.
JCC's runtime API functions
JCC includes a small runtime component that is compiled into any Python extension it produces.
This runtime component makes it possible to manage the Java VM from Python. Because a Java VM can be configured with a myriad of options, it is not automatically started when the resulting Python extension module is loaded into the Python interpreter.
Instead, the initVM() function must be called from the
main thread before using any of the wrapped classes. It takes the
following keyword arguments:
- classpath
  A string containing one or more directories or jar files for the Java VM to search for classes. Every Python extension produced by JCC exports a CLASSPATH variable that is hardcoded to the jar files that it was produced from. A copy of each jar file is installed as a resource file with the extension when JCC is invoked with the --install command line argument. This parameter is optional and defaults to the CLASSPATH string exported by the module initVM is imported from.
      >>> import lucene
      >>> lucene.initVM(classpath=lucene.CLASSPATH)
- initialheap
  The initial amount of Java heap to start the Java VM with. This argument is a string that follows the same syntax as the similar -Xms java command line argument.
      >>> import lucene
      >>> lucene.initVM(initialheap='32m')
      >>> lucene.Runtime.getRuntime().totalMemory()
      33357824L
- maxheap
  The maximum amount of Java heap that could become available to the Java VM. This argument is a string that follows the same syntax as the similar -Xmx java command line argument.
- maxstack
  The maximum amount of stack space that is available to the Java VM. This argument is a string that follows the same syntax as the similar -Xss java command line argument.
- vmargs
  A string of comma separated additional options to pass to the VM startup routine. These are passed through as-is. For example:
      >>> import lucene
      >>> lucene.initVM(vmargs='-Xcheck:jni,-verbose:jni,-verbose:gc')
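The keyword arguments above can be collected in one place before the VM is started. The sketch below is only an illustration: the helper name vm_options is hypothetical, and it only builds a dictionary using the keyword names listed above, joining the additional VM options with commas as vmargs expects.

```python
def vm_options(initialheap=None, maxheap=None, maxstack=None, vmargs=()):
    """Collect initVM() keyword arguments, skipping the ones left unset."""
    options = {}
    if initialheap is not None:
        options["initialheap"] = initialheap    # same syntax as -Xms
    if maxheap is not None:
        options["maxheap"] = maxheap            # same syntax as -Xmx
    if maxstack is not None:
        options["maxstack"] = maxstack          # same syntax as -Xss
    if vmargs:
        options["vmargs"] = ",".join(vmargs)    # comma separated, passed as-is
    return options


options = vm_options(initialheap="32m", maxheap="512m",
                     vmargs=["-Xcheck:jni", "-verbose:gc"])
print(sorted(options.items()))

# With PyLucene installed, the VM would then be started from the main thread:
#   lucene.initVM(**options)
```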
The initVM() and getVMEnv() functions
return a JCCEnv object that has a few utility methods on it:
- attachCurrentThread(name, asDaemon)
  Before a thread created in Python or elsewhere but not in the Java VM can be used with the Java VM, this method needs to be invoked. The two arguments it takes are optional and self-explanatory.
- detachCurrentThread()
  The opposite of attachCurrentThread(). This method should be used with extreme caution as Python's and the Java VM's garbage collectors may use a thread detached too early, causing a system crash. The utility of this method seems dubious at the moment.
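A typical use of attachCurrentThread() is at the very start of a Python thread's target function. The following sketch shows the pattern. The helper run_in_attached_thread is hypothetical, and FakeVMEnv is a stand-in used only so the example runs without a Java VM; in real use, vm_env would be the object returned by initVM() or getVMEnv().

```python
import threading


def run_in_attached_thread(vm_env, work):
    """Run work() in a new Python thread after attaching it to the Java VM."""
    def target():
        vm_env.attachCurrentThread()    # both arguments are optional
        work()
    thread = threading.Thread(target=target)
    thread.start()
    return thread


class FakeVMEnv(object):
    """Stand-in for the JCCEnv object; not part of JCC."""

    def __init__(self):
        self.attached = []

    def attachCurrentThread(self):
        # record which thread asked to be attached
        self.attached.append(threading.current_thread().name)


env = FakeVMEnv()
results = []
worker = run_in_attached_thread(env, lambda: results.append("done"))
worker.join()
print(len(env.attached), results)
```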
There are several differences between JNI's findClass()
and Java's Class.forName():
- className is a '/' separated string of names
- the class loaders are different, findClass() may find classes that Class.forName() won't.
For example:
    >>> from lucene import *
    >>> initVM(CLASSPATH)
    >>> findClass('org/apache/lucene/document/Document')
    <Class: class org.apache.lucene.document.Document>
    >>> Class.forName('org.apache.lucene.document.Document')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    lucene.JavaError: java.lang.ClassNotFoundException: org/apache/lucene/document/Document
    >>> Class.forName('java.lang.Object')
    <Class: class java.lang.Object>
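Since findClass() expects a '/' separated name while Java code usually spells class names with dots, a small conversion helper can be convenient. This is a sketch only; the function name jni_class_name is hypothetical and not part of JCC.

```python
def jni_class_name(dotted_name):
    """Convert a dotted Java class name to the '/' separated form."""
    return dotted_name.replace(".", "/")


name = jni_class_name("org.apache.lucene.document.Document")
print(name)  # org/apache/lucene/document/Document

# With PyLucene installed and the VM started:
#   findClass(jni_class_name("org.apache.lucene.document.Document"))
```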
Type casting and instance checks
Many Java APIs are declared to return types that are less specific
than the types actually returned. In Java 1.5, this is worked around
with type parameters. JCC generates code to heed type parameters
unless the --no-generics command line parameter is used. See the next section for
details on Java generics support.
In C++, casting the object into its actual type is supported via the regular C casting operator.
In Python each wrapped class has a class method
called cast_ that implements the same functionality.
Similarly, each wrapped class has a class method
called instance_ that tests whether the wrapped Java
instance is of the given type. For example:
    if BooleanQuery.instance_(query):
        booleanQuery = BooleanQuery.cast_(query)
        print booleanQuery.getClauses()
Handling generic classes
Java 1.5 added support for parameterized types. JCC generates code
to heed type parameters unless the --no-generics
command line parameter is used. Java type parameters are erased at
compile time, so at runtime the same class is used for all its
parameterizations. Similarly, JCC wrapper objects all use the same
class but store type parameterizations on instances and make them
accessible as a tuple via the parameters_ property.
For example, an ArrayList<Document> instance
has (<type 'Document'>,)
for parameters_, and its get() method uses
that type parameter to wrap its return values.
To allocate an instance of a generic Java class with specific type
parameters use the of_() method. This method accepts
one or more Python wrapper classes to use as type parameters. For
example, java.util.ArrayList<E> is declared to
accept one type parameter. Its wrapper's of_() method
hence accepts one parameter, a Python class, to use as type
parameter for the return type of its get() method, among
others:
    >>> a = ArrayList().of_(Document)
    >>> a
    <ArrayList: []>
    >>> a.parameters_
    (<type 'Document'>,)
    >>> a.add(Document())
    True
    >>> a.get(0)
    <Document: Document<>>
The use of type parameters is, of course, optional. A generic Java
class can still be used as before, without type parameters.
Downcasting from Object is then necessary:
    >>> a = ArrayList()
    >>> a
    <ArrayList: []>
    >>> a.parameters_
    (None,)
    >>> a.add(Document())
    True
    >>> a.get(0)
    <Object: Document<>>
    >>> Document.cast_(a.get(0))
    <Document: Document<>>
Handling arrays
Java arrays are wrapped with a C++ JArray
template. The [] operator is available for read
access. This template, JArray<T>, accommodates all
Java primitive types, jstring, jobject and
wrapper class arrays.
Java arrays are returned to Python in a JArray wrapper
instance that implements the Python sequence protocol. It is
possible to change an array's elements but not to change an array's
size.
To convert a char array to a Python string use
a ''.join(array) construct.
Any Java method expecting an array can be called with the corresponding sequence object from Python.
To instantiate a Java array from Python, use one of the following forms:
    >>> array = JArray('int')(size)
    # the resulting Java int array is initialized with zeroes
    >>> array = JArray('int')(sequence)
    # the sequence must only contain ints
    # the resulting Java int array contains the ints in the sequence
Instead of 'int', you may also use one
of 'object', 'string', 'bool',
'byte', 'char', 'double',
'float', 'long' and 'short'
to create an array of the corresponding type.
Because there is only one wrapper class for object arrays,
the JArray('object') type's constructor takes a second
argument denoting the class of the object elements. This argument is
optional and defaults to Object.
As with the Object types, the JArray types
also include a cast_ method. This method becomes useful
when the array returned to Python is wrapped as a
plain Object. This is the case, for example, with
nested arrays since there is no distinct Python type for every
different Java object array class - all Java object arrays are
wrapped by JArray('object'). For example:
    # cast obj to an array of ints
    >>> JArray('int').cast_(obj)
    # cast obj to an array of Document
    >>> JArray('object').cast_(obj, Document)
In both cases, the Java type of obj must be compatible with the array type it is being cast to.
    # using nested array:
    >>> d = JArray('object')(1, Document)
    >>> d[0] = Document()
    >>> d
    JArray<object>[<Document: Document<>>]
    >>> d[0]
    <Document: Document<>>
    >>> a = JArray('object')(2)
    >>> a[0] = d
    >>> a[1] = JArray('int')([0, 1, 2])
    >>> a
    JArray<object>[<Object: [Lorg.apache.lucene.document.Document;@694f12>, <Object: [I@234265>]
    >>> a[0]
    <Object: [Lorg.apache.lucene.document.Document;@694f12>
    >>> a[1]
    <Object: [I@234265>
    >>> JArray('object').cast_(a[0])[0]
    <Object: Document<>>
    >>> JArray('object').cast_(a[0], Document)[0]
    <Document: Document<>>
    >>> JArray('int').cast_(a[1])
    JArray<int>[0, 1, 2]
    >>> JArray('int').cast_(a[1])[0]
    0
To verify that a Java object is of a given array type, use
the instance_() method available on the array
type. This is not the same as verifying that it is assignable with
elements of a given type. For example, using the arrays created
above:
    # is d array of Object ? are d's elements of type Object ?
    >>> JArray('object').instance_(d)
    True
    # can it receive Object instances ?
    >>> JArray('object').assignable_(d)
    False
    # is it array of Document ? are d's elements of type Document ?
    >>> JArray('object').instance_(d, Document)
    True
    # is it array of Class ? are d's elements of type Class ?
    >>> JArray('object').instance_(d, Class)
    False
    # can it receive Document instances ?
    >>> JArray('object').assignable_(d, Document)
    True
Exception reporting
Exceptions that occur in the Java VM and that escape to C++ are
reported as a javaError C++ exception. When using
Python wrappers, the C++ exceptions are handled and reported with
Python exceptions. When using C++ only, failure to handle the
exception in your C++ code will cause the process to crash.
Exceptions that occur in the Java VM and that escape to the Python
VM are reported with a JavaError Python exception
object. The getJavaException() method can be called
on JavaError objects to obtain the original Java
exception object wrapped as any other Java object. This Java object
can be used to obtain a Java stack trace for the error, for example.
Exceptions that occur in the Python VM and that escape to the Java
VM, as for example can happen in Python extensions (see topic below)
are reported to the Java VM as a RuntimeException or as
a PythonException when using shared
mode. See installation
instructions for more information about shared mode.
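On the Python side, handling a JavaError typically means catching it and asking for the wrapped Java exception. The sketch below illustrates the pattern with stand-in classes so that it runs without a Java VM: FakeJavaError and FakeJavaException are hypothetical, and only getJavaException() comes from the description above. The sketch assumes the wrapped exception exposes Java's toString() like other wrapped objects.

```python
def java_error_summary(error):
    """Return a one-line description of the Java exception behind a JavaError."""
    java_exception = error.getJavaException()
    # assumption: the wrapper exposes java.lang.Object's toString();
    # printStackTrace() could be called here instead to dump the Java stack
    return java_exception.toString()


class FakeJavaException(object):
    """Stand-in for a wrapped java.lang.Throwable; not part of JCC."""

    def toString(self):
        return "java.lang.IllegalStateException: demo"


class FakeJavaError(Exception):
    """Stand-in for the JavaError exception exported by an extension module."""

    def getJavaException(self):
        return FakeJavaException()


try:
    raise FakeJavaError()
except FakeJavaError as error:
    summary = java_error_summary(error)

print(summary)
```

With a real extension module, the except clause would name the module's own JavaError class, for example lucene.JavaError.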
Writing Java class extensions in Python
JCC makes it relatively easy to extend a Java class from
Python. This is done via an intermediary class written in Java that
implements a special method called pythonExtension()
and that declares a number of native methods that are to be
implemented by the actual Python extension.
When JCC sees these special extension Java classes it generates the C++ code implementing the native methods they declare. These native methods call the corresponding Python method implementations passing in parameters and returning the result to the Java VM caller.
For example, to implement a Lucene analyzer in Python, one would first implement such an extension class in Java:
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import java.io.Reader;

    public class PythonAnalyzer extends Analyzer {
        private long pythonObject;

        public PythonAnalyzer() { }

        public void pythonExtension(long pythonObject) { this.pythonObject = pythonObject; }
        public long pythonExtension() { return this.pythonObject; }

        public void finalize() throws Throwable { pythonDecRef(); }

        public native void pythonDecRef();
        public native TokenStream tokenStream(String fieldName, Reader reader);
    }
The pythonExtension() methods are what make this class
recognized as an extension class by JCC. They should be included
verbatim as above along with the declaration of
the pythonObject instance variable.
The implementation of the native pythonDecRef() method
is generated by JCC and is necessary because it seems
that finalize() cannot itself be native. Since an
extension class wraps the Python instance object it's going to be
calling methods on, its ref count needs to be decremented when this
Java wrapper class disappears. A declaration
for pythonDecRef() and a finalize()
implementation should always be included verbatim as above.
Really, the only non-boilerplate user input is the constructor of the
class and the other native methods, tokenStream() in
the example above.
The corresponding Python class(es) are implemented as follows:
When an __init__() is declared, super()
must be called or else the Java wrapper class will not know about
the Python instance it needs to invoke.
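The shape of such a Python class can be sketched as follows. The PythonAnalyzer class defined here is only a stand-in written in Python so that the example runs without a Java VM; the real base class is the wrapper JCC generates for the Java class above. The subclass name SimpleAnalyzer and the java_side_initialized attribute are hypothetical.

```python
class PythonAnalyzer(object):
    """Stand-in for the JCC-generated wrapper of the Java PythonAnalyzer."""

    def __init__(self):
        # the generated wrapper would register the Python instance with the
        # Java object here; this flag only marks that the step happened
        self.java_side_initialized = True


class SimpleAnalyzer(PythonAnalyzer):

    def __init__(self, lowercase=True):
        # required: without this call the Java side never learns about
        # the Python instance it needs to invoke
        super(SimpleAnalyzer, self).__init__()
        self.lowercase = lowercase

    def tokenStream(self, fieldName, reader):
        # implements the native tokenStream() method declared in Java;
        # a real implementation would return a TokenStream instance
        raise NotImplementedError("return a TokenStream here")


analyzer = SimpleAnalyzer()
print(analyzer.java_side_initialized)
```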
When a Java extension class declares native methods for which there
are public or protected equivalents available on the parent class,
JCC generates code that makes it possible to
call super() on these methods from Python as well.
There are a number of extension examples available in PyLucene's test suite and samples.
Embedding a Python VM in a Java VM
Using the same techniques used when writing a Python extension of a Java class, JCC may also be used to embed a Python VM in a Java VM. Following are the steps and constraints to follow to achieve this:
- JCC must be built in shared mode. See installation instructions for more information about shared mode. Note that for this use on Mac OS X, JCC must also be built with the link flags "-framework", "Python" in the LFLAGS value.
- As described in the previous section, define one or more Java classes to be "extended" from Python to provide the implementations of the native methods declared on them. Instances of these classes implement the bridges into the Python VM from Java.
- The org.apache.jcc.PythonVM Java class is going to be used from the Java VM's main thread to initialize the embedded Python VM. This class is installed inside the JCC egg under the jcc/classes directory and the full path to this directory must be on the Java CLASSPATH.
- The JCC egg directory contains the JCC shared runtime library - not the JCC Python extension shared library - but a library called libjcc.dylib on Mac OS X, libjcc.so on Linux or jcc.dll on Windows. This directory must be added to the Java VM's shared library path via the -Djava.library.path command line parameter.
- In the Java VM's main thread, initialize the Python VM by calling its static start() method, passing it a Python program name string and optional start-up arguments in a string array that will be made accessible in Python via sys.argv. Note that the program name string is purely informational, and is not used by the start() code other than to initialize that Python variable. This method returns the singleton PythonVM instance to be used in this Java VM. start() may be called multiple times; it will always return the same singleton instance. This instance may also be retrieved at any later time via the static get() method defined on the org.apache.jcc.PythonVM class.
- Any Java VM thread that is going to be calling into the Python VM should start by acquiring a reference to the Python thread state object by calling the acquireThreadState() method on the Python VM instance. It should then release the Python thread state before terminating by calling releaseThreadState(). Calling these methods is optional but strongly recommended as it ensures that Python is not creating and throwing away a thread state every time the Python VM is entered and exited from a given Java VM thread.
- Any Java VM thread may instantiate a Python object for which an extension class was defined in Java as described in the previous section by calling the instantiate() method on the PythonVM instance. This method takes two string parameters, the name of the Python module and the name of the Python class to import and instantiate from it. The __init__() constructor on this class must be callable without any parameters and, if defined, must call super() in order to initialize the Java side. The instantiate() method is declared to return java.lang.Object but the return value is actually an instance of the Java extension class used and must be downcast to it.
Pythonic protocols
When generating wrappers for Python, JCC attempts to detect which classes can be made iterable:
- When a class is declared to implement java.lang.Iterable, JCC makes it iterable from Python.
- When a Java class declares a method called next() with no arguments returning an object type, this class is made iterable. Its next() method is assumed to terminate iteration by returning null.
JCC generates a Python mapping get method for a class when requested
to do so via the --mapping command line option which
takes two arguments, the class to generate the mapping get for and
the Java method to use. The method is specified with its name
followed by ':' and its Java
signature.
For example, System.getProperties()['java.class.path'] is
made possible by:
JCC generates Python sequence length and get methods for a class
when requested to do so via the --sequence command line
option which takes three arguments, the class to generate the
sequence length and get for and the two Java methods to use. The
methods are specified with their name followed by ':' and their Java
signature. For example:
is made possible by:
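The method arguments taken by --mapping and --sequence combine a method name, a ':' and a Java signature. As a minimal sketch, assuming that signature is written in the JVM's method descriptor notation, the helper below builds such strings from Java type names. The names type_descriptor and method_spec are hypothetical. The getProperty example assumes that java.util.Properties.getProperty(String) is the method behind the System.getProperties()['java.class.path'] lookup mentioned above, and the length example is purely illustrative.

```python
# JVM descriptor letters for the Java primitive types and void
_PRIMITIVES = {
    "boolean": "Z", "byte": "B", "char": "C", "short": "S", "int": "I",
    "long": "J", "float": "F", "double": "D", "void": "V",
}


def type_descriptor(java_type):
    """Return the JVM descriptor for a Java type name such as 'int[]'."""
    dimensions = 0
    while java_type.endswith("[]"):
        java_type = java_type[:-2]
        dimensions += 1
    if java_type in _PRIMITIVES:
        base = _PRIMITIVES[java_type]
    else:
        # object types are written L<slash separated class name>;
        base = "L" + java_type.replace(".", "/") + ";"
    return "[" * dimensions + base


def method_spec(name, parameter_types, return_type):
    """Return 'name:(parameters)return' for use on the JCC command line."""
    parameters = "".join(type_descriptor(t) for t in parameter_types)
    return "%s:(%s)%s" % (name, parameters, type_descriptor(return_type))


print(method_spec("getProperty", ["java.lang.String"], "java.lang.String"))
print(method_spec("length", [], "int"))
```

On a shell command line these strings generally need to be quoted, since they contain parentheses and semicolons.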
