public class MemoryIndex extends Object
Overview
This class is a replacement for a large subset of RAMDirectory functionality. It is designed to enable maximum efficiency for on-the-fly matchmaking that combines structured and fuzzy fulltext search in realtime streaming applications, such as Nux XQuery based XML message queues, publish-subscribe systems for blogs/newsfeeds, text chat, data acquisition and distribution systems, application-level routers, firewalls, classifiers, etc.
Rather than targeting fulltext search of infrequent queries over huge persistent
data archives (historic search), this class targets fulltext search of huge
numbers of queries over comparatively small transient realtime data (prospective
search).
For example, as in:

float score = search(String text, Query query)
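Concretely, the prospective-search pattern can be sketched as follows: one reusable MemoryIndex is built per incoming message and matched against many standing queries. This is only a sketch; class names assume a recent Lucene release (StandardAnalyzer stands in for any Analyzer), and all field names, message texts, and subscription queries are made up.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ProspectiveSearch {
    public static void main(String[] args) {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Standing subscriptions: many queries, evaluated against each message.
        List<Query> subscriptions = Arrays.asList(
                new TermQuery(new Term("content", "salmon")),
                new TermQuery(new Term("content", "firewall")));

        // Transient realtime data: one small message at a time.
        MemoryIndex index = new MemoryIndex();  // reused across messages
        for (String message : Arrays.asList("salmon fishing season opens",
                                            "router firmware update")) {
            index.addField("content", message, analyzer);
            for (Query subscription : subscriptions) {
                float score = index.search(subscription);
                if (score > 0.0f) {
                    System.out.println("match: " + subscription + " on: " + message);
                }
            }
            index.reset();  // recycle internal buffers for the next message
        }
    }
}
```

Reusing one instance via reset() rather than allocating a fresh MemoryIndex per message avoids repeated buffer allocation in a tight streaming loop.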
Each instance can hold at most one Lucene "document", with a document containing zero or more "fields", each field having a name and a fulltext value. The fulltext value is tokenized (split and transformed) into zero or more index terms (aka words) on addField(), according to the policy implemented by an Analyzer. For example, Lucene analyzers can split on whitespace, normalize to lower case for case insensitivity, ignore common terms with little discriminatory value such as "he", "in", "and" (stop words), reduce the terms to their natural linguistic root form such as "fishing" being reduced to "fish" (stemming), resolve synonyms/inflexions/thesauri (upon indexing and/or querying), etc. For details, see Lucene Analyzer Intro.
Arbitrary Lucene queries can be run against this class - see Lucene Query Syntax as well as Query Parser Rules. Note that a Lucene query selects on the field names and associated (indexed) tokenized terms, not on the original fulltext(s) - the latter are not stored but rather thrown away immediately after tokenization.
For some interesting background information on search technology, see Bob Wyman's Prospective Search, Jim Gray's A Call to Arms - Custom subscriptions, and Tim Bray's On Search, the Series.
Example Usage
Analyzer analyzer = new SimpleAnalyzer(version);
MemoryIndex index = new MemoryIndex();
index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
index.addField("author", "Tales of James", analyzer);
QueryParser parser = new QueryParser(version, "content", analyzer);
float score = index.search(parser.parse("+author:james +salmon~ +fish* manual~"));
if (score > 0.0f) {
    System.out.println("it's a match");
} else {
    System.out.println("no match found");
}
System.out.println("indexData=" + index.toString());
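On recent Lucene releases the Version argument has been dropped from analyzers and query parsers; assuming StandardAnalyzer and the classic QueryParser (from the lucene-queryparser module), the same example reads:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;

public class MemoryIndexExample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        MemoryIndex index = new MemoryIndex();
        index.addField("content", "Readings about Salmons and other select Alaska fishing Manuals", analyzer);
        index.addField("author", "Tales of James", analyzer);

        // Parse an arbitrary query and score it against the single in-memory document.
        QueryParser parser = new QueryParser("content", analyzer);
        float score = index.search(parser.parse("+author:james +salmon~ +fish* manual~"));
        if (score > 0.0f) {
            System.out.println("it's a match");
        } else {
            System.out.println("no match found");
        }
    }
}
```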
Example XQuery Usage
(: An XQuery that finds all books authored by James that have something to do with
   "salmon fishing manuals", sorted by relevance :)
declare namespace lucene = "java:nux.xom.pool.FullTextUtil";
declare variable $query := "+salmon~ +fish* manual~"; (: any arbitrary Lucene query can go here :)

for $book in /books/book[author="James" and lucene:match(abstract, $query) > 0.0]
let $score := lucene:match($book/abstract, $query)
order by $score descending
return $book
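Fields need not go through an Analyzer at all: keywordTokenStream (described below) indexes keywords "as is", which is handy for tags, IDs, or other structured values alongside analyzed fulltext. A minimal sketch; the "tags" field name and keyword values are made up:

```java
import java.util.Arrays;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.TermQuery;

public class KeywordExample {
    public static void main(String[] args) {
        MemoryIndex index = new MemoryIndex();
        // Each keyword becomes exactly one token, "as is": no lowercasing, no stemming.
        index.addField("tags", index.keywordTokenStream(Arrays.asList("Alaska", "Fishing")));

        // Exact-case term lookup matches; a lowercased term does not,
        // since no transforming analysis was applied.
        float hit = index.search(new TermQuery(new Term("tags", "Alaska")));
        float miss = index.search(new TermQuery(new Term("tags", "alaska")));
        System.out.println("hit=" + hit + " miss=" + miss);
    }
}
```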
Thread safety guarantees
MemoryIndex is not normally thread-safe for adds or queries. However, queries are thread-safe after freeze() has been called.
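A minimal sketch of the freeze-then-query pattern, assuming a recent Lucene release (StandardAnalyzer stands in for any Analyzer; field name and text are made up):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class FreezeExample {
    public static void main(String[] args) throws InterruptedException {
        MemoryIndex index = new MemoryIndex();
        index.addField("content", "salmon fishing manuals", new StandardAnalyzer());

        // Single-threaded phase: all adds must happen before freeze().
        index.freeze();

        // Multi-threaded phase: concurrent queries are now safe.
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            threads.add(new Thread(() -> {
                Query query = new TermQuery(new Term("content", "salmon"));
                float score = index.search(query);
                System.out.println(Thread.currentThread().getName() + " score=" + score);
            }));
        }
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
    }
}
```

Note that after freeze() no further addField or reset calls are permitted on that instance; build a new MemoryIndex for the next document.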
Performance Notes
Internally there's a new data structure geared towards efficient indexing and searching, plus the necessary support code to seamlessly plug into the Lucene framework.
This class performs very well for very small texts (e.g. 10 chars) as well as for large texts (e.g. 10 MB) and everything in between. Typically, it is about 10-100 times faster than RAMDirectory. Note that RAMDirectory has particularly large efficiency overheads for small to medium sized texts, both in time and space. Indexing a field with N tokens takes O(N) in the best case, and O(N log N) in the worst case. Memory consumption is probably larger than for RAMDirectory.
Example throughput of many simple term queries over a single MemoryIndex: ~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary.
If you're curious about the whereabouts of bottlenecks, run Java 1.5 with the non-perturbing '-server -agentlib:hprof=cpu=samples,depth=10' flags, then study the trace log and correlate its hotspot trailer with its call stack headers (see hprof tracing).
Constructors

MemoryIndex()
Constructs an empty instance that will not store offsets or payloads.

MemoryIndex(boolean storeOffsets)
Constructs an empty instance that can optionally store the start and end character offset of each token term in the text.

MemoryIndex(boolean storeOffsets, boolean storePayloads)
Constructs an empty instance with the option of storing offsets and payloads.
Methods

void addField(IndexableField field, Analyzer analyzer)
Adds a lucene IndexableField to the MemoryIndex using the provided analyzer.

void addField(String fieldName, String text, Analyzer analyzer)
Convenience method; Tokenizes the given field text and adds the resulting terms to the index; Equivalent to adding an indexed non-keyword Lucene Field that is tokenized, not stored, termVectorStored with positions (or termVectorStored with positions and offsets).

void addField(String fieldName, TokenStream stream)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field.

void addField(String fieldName, TokenStream stream, int positionIncrementGap)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field.

void addField(String fieldName, TokenStream tokenStream, int positionIncrementGap, int offsetGap)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field.

IndexSearcher createSearcher()
Creates and returns a searcher that can be used to execute arbitrary Lucene queries and to collect the resulting query results as hits.

void freeze()
Prepares the MemoryIndex for querying in a non-lazy way.

static MemoryIndex fromDocument(Iterable<? extends IndexableField> document, Analyzer analyzer)
Builds a MemoryIndex from a lucene Document using an analyzer.

static MemoryIndex fromDocument(Iterable<? extends IndexableField> document, Analyzer analyzer, boolean storeOffsets, boolean storePayloads)
Builds a MemoryIndex from a lucene Document using an analyzer.

static MemoryIndex fromDocument(Iterable<? extends IndexableField> document, Analyzer analyzer, boolean storeOffsets, boolean storePayloads, long maxReusedBytes)
Builds a MemoryIndex from a lucene Document using an analyzer.

<T> TokenStream keywordTokenStream(Collection<T> keywords)
Convenience method; Creates and returns a token stream that generates a token for each keyword in the given collection, "as is", without any transforming text analysis.

void reset()
Resets the MemoryIndex to its initial state and recycles all internal buffers.

float search(Query query)
Convenience method that efficiently returns the relevance score by matching this index against the given Lucene query expression.

void setSimilarity(Similarity similarity)
Set the Similarity to be used for calculating field norms.

String toStringDebug()
Returns a String representation of the index data for debugging purposes.
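For instance, the fromDocument factory can build the index directly from an existing Document, which implements Iterable<IndexableField>. A minimal sketch; field names and values are made up:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.TermQuery;

public class FromDocumentExample {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(new TextField("content", "Readings about Salmons", Field.Store.NO));
        doc.add(new TextField("author", "Tales of James", Field.Store.NO));

        // Document implements Iterable<IndexableField>, so it can be passed directly.
        MemoryIndex index = MemoryIndex.fromDocument(doc, new StandardAnalyzer());
        float score = index.search(new TermQuery(new Term("author", "james")));
        System.out.println("score=" + score);
    }
}
```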
public MemoryIndex()

public MemoryIndex(boolean storeOffsets)
Parameters:
storeOffsets - whether or not to store the start and end character offset of each token term in the text

public MemoryIndex(boolean storeOffsets, boolean storePayloads)
Parameters:
storeOffsets - store term offsets at each position
storePayloads - store term payloads at each position

public void addField(String fieldName, String text, Analyzer analyzer)
Convenience method; Tokenizes the given field text and adds the resulting terms to the index; Equivalent to adding an indexed non-keyword Lucene Field that is tokenized, not stored, termVectorStored with positions (or termVectorStored with positions and offsets).
Parameters:
fieldName - a name to be associated with the text
text - the text to tokenize and index.
analyzer - the analyzer to use for tokenization

public static MemoryIndex fromDocument(Iterable<? extends IndexableField> document, Analyzer analyzer)
Builds a MemoryIndex from a lucene Document using an analyzer.
Parameters:
document - the document to index
analyzer - the analyzer to use

public static MemoryIndex fromDocument(Iterable<? extends IndexableField> document, Analyzer analyzer, boolean storeOffsets, boolean storePayloads)
Builds a MemoryIndex from a lucene Document using an analyzer.
Parameters:
document - the document to index
analyzer - the analyzer to use
storeOffsets - true if offsets should be stored
storePayloads - true if payloads should be stored

public static MemoryIndex fromDocument(Iterable<? extends IndexableField> document, Analyzer analyzer, boolean storeOffsets, boolean storePayloads, long maxReusedBytes)
Builds a MemoryIndex from a lucene Document using an analyzer.
Parameters:
document - the document to index
analyzer - the analyzer to use
storeOffsets - true if offsets should be stored
storePayloads - true if payloads should be stored
maxReusedBytes - the number of bytes that should remain in the internal memory pools after reset() is called

public <T> TokenStream keywordTokenStream(Collection<T> keywords)
Convenience method; Creates and returns a token stream that generates a token for each keyword in the given collection, "as is", without any transforming text analysis. The resulting stream can be fed into addField(String, TokenStream), perhaps wrapped into another TokenFilter, as desired.
Parameters:
keywords - the keywords to generate tokens for

public void addField(IndexableField field, Analyzer analyzer)
Adds a lucene IndexableField to the MemoryIndex using the provided analyzer. Also stores doc values based on IndexableFieldType.docValuesType() if set.
Parameters:
field - the field to add
analyzer - the analyzer to use for term analysis

public void addField(String fieldName, TokenStream stream)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field. Finally closes the token stream. Note that untokenized keywords can be added with this method via keywordTokenStream(Collection), the Lucene KeywordTokenizer or similar utilities.
Parameters:
fieldName - a name to be associated with the text
stream - the token stream to retrieve tokens from.

public void addField(String fieldName, TokenStream stream, int positionIncrementGap)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field. Finally closes the token stream. Note that untokenized keywords can be added with this method via keywordTokenStream(Collection), the Lucene KeywordTokenizer or similar utilities.
Parameters:
fieldName - a name to be associated with the text
stream - the token stream to retrieve tokens from.
positionIncrementGap - the position increment gap if fields with the same name are added more than once

public void addField(String fieldName, TokenStream tokenStream, int positionIncrementGap, int offsetGap)
Iterates over the given token stream and adds the resulting terms to the index; Equivalent to adding a tokenized, indexed, termVectorStored, unstored, Lucene Field. Finally closes the token stream. Note that untokenized keywords can be added with this method via keywordTokenStream(Collection), the Lucene KeywordTokenizer or similar utilities.
Parameters:
fieldName - a name to be associated with the text
tokenStream - the token stream to retrieve tokens from. It's guaranteed to be closed no matter what.
positionIncrementGap - the position increment gap if fields with the same name are added more than once
offsetGap - the offset gap if fields with the same name are added more than once

public void setSimilarity(Similarity similarity)
public IndexSearcher createSearcher()

public void freeze()
After calling this you can query the MemoryIndex from multiple threads, but you cannot subsequently add new data.

public float search(Query query)
Parameters:
query - an arbitrary Lucene query to run against this index

public String toStringDebug()

public void reset()
Resets the MemoryIndex to its initial state and recycles all internal buffers.

Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.