Package org.apache.lucene.monitor

Monitoring framework

This package contains classes to allow the monitoring of a stream of documents with a set of queries.

To use, instantiate a Monitor object and register queries with it via Monitor.register(org.apache.lucene.monitor.MonitorQuery...). Then match documents against it, either individually via Monitor.match(org.apache.lucene.document.Document, org.apache.lucene.monitor.MatcherFactory) or in batches via Monitor.match(org.apache.lucene.document.Document[], org.apache.lucene.monitor.MatcherFactory).
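The steps above can be sketched as follows. This is a minimal illustration rather than canonical usage: it assumes lucene-core and lucene-monitor are on the classpath, and the class name MonitorUsageSketch, the field name "text" and the query ids are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.monitor.MatchingQueries;
import org.apache.lucene.monitor.Monitor;
import org.apache.lucene.monitor.MonitorQuery;
import org.apache.lucene.monitor.QueryMatch;
import org.apache.lucene.search.TermQuery;

class MonitorUsageSketch {

  /** Registers two term queries and returns how many of them match one small document. */
  static int countMatches() {
    // The analyzer passed to the Monitor is used to analyze incoming documents.
    try (Monitor monitor = new Monitor(new StandardAnalyzer())) {
      // Ids and the "text" field are illustrative assumptions.
      monitor.register(
          new MonitorQuery("fox-query", new TermQuery(new Term("text", "fox"))),
          new MonitorQuery("dog-query", new TermQuery(new Term("text", "dog"))));

      Document doc = new Document();
      doc.add(new TextField("text", "the quick brown fox", Field.Store.NO));

      // SIMPLE_MATCHER only reports which query ids matched.
      MatchingQueries<QueryMatch> result = monitor.match(doc, QueryMatch.SIMPLE_MATCHER);
      return result.getMatchCount();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("matching queries: " + countMatches());
  }
}
```

Only the query on "fox" should match here, since the document does not contain "dog".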

Matcher types

A number of matcher types are included. Matchers can be wrapped in PartitionMatcher or ParallelMatcher to increase performance in low-concurrency systems.

Pre-filtering of queries

Monitoring is done efficiently by extracting minimal sets of terms from queries, and using these to build a query index. When a document is passed to Monitor.match(org.apache.lucene.document.Document, org.apache.lucene.monitor.MatcherFactory), it is converted into a small index, and the terms dictionary from that index is then used to build a disjunction query to run against the query index. Queries that match this disjunction are then run against the document. In this way, the Monitor can avoid running queries that have no chance of matching. The process of extracting terms and building document disjunctions is handled by a Presearcher.
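The pre-filtering idea can be modelled in a few lines of plain Java. The sketch below is a toy and does not use or mirror the Lucene classes: the name PresearchSketch is hypothetical, queries are simplified to conjunctions of required terms, and documents are split on whitespace instead of being analyzed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class PresearchSketch {
  // Query index: extracted term -> ids of the queries indexed under that term.
  private final Map<String, Set<String>> queryIndex = new HashMap<>();
  // Full queries, modelled as conjunctions: every term must occur in the document.
  private final Map<String, Set<String>> queries = new HashMap<>();
  // Counts how many queries were actually run against documents.
  int fullEvaluations = 0;

  void register(String id, String... requiredTerms) {
    Set<String> terms = new HashSet<>(Arrays.asList(requiredTerms));
    queries.put(id, terms);
    // Any single term of a conjunction is a valid pre-filter; this toy keeps the longest.
    String chosen = "";
    for (String t : terms) {
      if (t.length() > chosen.length()) {
        chosen = t;
      }
    }
    queryIndex.computeIfAbsent(chosen, k -> new HashSet<>()).add(id);
  }

  Set<String> match(String documentText) {
    Set<String> docTerms =
        new HashSet<>(Arrays.asList(documentText.toLowerCase().split("\\s+")));
    // "Disjunction" over the document's terms: a query is a candidate if its
    // indexed term occurs anywhere in the document.
    Set<String> candidates = new HashSet<>();
    for (String term : docTerms) {
      Set<String> ids = queryIndex.get(term);
      if (ids != null) {
        candidates.addAll(ids);
      }
    }
    // Only the candidates are evaluated in full.
    Set<String> matches = new TreeSet<>();
    for (String id : candidates) {
      fullEvaluations++;
      if (docTerms.containsAll(queries.get(id))) {
        matches.add(id);
      }
    }
    return matches;
  }
}
```

A query indexed under a term that is absent from the document is never evaluated. A candidate can still fail the full check, because the pre-filter only rules out queries that cannot match.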

In addition, extra per-field filtering can be specified by passing a set of keyword fields to filter on. When queries are registered with the monitor, field-value pairs can be added as optional metadata for each query, and these can then be used to restrict which queries a document is checked against. For example, you can specify a language that each query should apply to, and documents containing a value in their language field would only be checked against queries that have that same value in their language metadata. Note that when matching documents in batches, all documents in the batch must have the same values in their filter fields.
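The language example can be modelled as below. This is a plain-Java toy, not the Lucene implementation, and FilterFieldSketch and its methods are hypothetical names. It assumes that a document with no value in a filter field is not restricted on that field, and that a query lacking metadata for a field is skipped when the document does carry a value there.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FilterFieldSketch {
  // Query id -> field-value metadata supplied when the query was registered.
  private final Map<String, Map<String, String>> metadataById = new HashMap<>();

  void register(String queryId, Map<String, String> metadata) {
    metadataById.put(queryId, new HashMap<>(metadata));
  }

  /**
   * Returns the ids of queries a document may be checked against. For every
   * filter field the document has a value in, the query metadata must carry
   * the same value (assumption of this sketch, see the text above).
   */
  Set<String> eligibleQueries(Map<String, String> documentFilterValues) {
    Set<String> eligible = new TreeSet<>();
    for (Map.Entry<String, Map<String, String>> query : metadataById.entrySet()) {
      boolean allFieldsAgree = true;
      for (Map.Entry<String, String> field : documentFilterValues.entrySet()) {
        String queryValue = query.getValue().get(field.getKey());
        if (!field.getValue().equals(queryValue)) {
          allFieldsAgree = false;
          break;
        }
      }
      if (allFieldsAgree) {
        eligible.add(query.getKey());
      }
    }
    return eligible;
  }
}
```

In this model, a document with language "en" is checked only against queries registered with language "en" in their metadata.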

Query analysis uses the QueryVisitor API to extract terms, which will work for all basic term-based queries shipped with Lucene. The analyzer builds a representation of the query called a QueryTree, and then selects a minimal set of terms, one of which must be present in a document for that document to match. Individual terms are weighted using a TermWeightor, which allows some selectivity when building the term set. For example, given a conjunction of terms (a boolean query with several MUST clauses, or a phrase, span or interval query), we need only extract one term. The TermWeightor can be configured in a number of ways; by default it will weight longer terms more highly.
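The selection rule can be illustrated with a small tree model. The sketch is a simplified stand-in written for this example and not the QueryTree or TermWeightor API: TermSelectionSketch and its Node interface are hypothetical, a term's weight is simply its length, and a term set is scored by its weakest term.

```java
import java.util.Set;
import java.util.TreeSet;

class TermSelectionSketch {
  /** A node of a toy query tree; extract() returns the terms to index for it. */
  interface Node {
    Set<String> extract();
  }

  static Node term(String text) {
    return () -> {
      Set<String> single = new TreeSet<String>();
      single.add(text);
      return single;
    };
  }

  /** Disjunction: any child can cause a match, so the terms of every child are kept. */
  static Node or(Node... children) {
    return () -> {
      Set<String> all = new TreeSet<String>();
      for (Node child : children) {
        all.addAll(child.extract());
      }
      return all;
    };
  }

  /** Conjunction: every child must match, so one child's terms suffice; keep the best-weighted. */
  static Node and(Node... children) {
    return () -> {
      Set<String> best = null;
      for (Node child : children) {
        Set<String> candidate = child.extract();
        if (best == null || weight(candidate) > weight(best)) {
          best = candidate;
        }
      }
      return best == null ? new TreeSet<String>() : best;
    };
  }

  /** Toy weighting: a set is as strong as its shortest term; longer terms weigh more. */
  static int weight(Set<String> terms) {
    if (terms.isEmpty()) {
      return 0;
    }
    int min = Integer.MAX_VALUE;
    for (String t : terms) {
      min = Math.min(min, t.length());
    }
    return min;
  }
}
```

For a conjunction such as and(term("the"), term("elephant")), only "elephant" is extracted. A disjunction inside a conjunction is kept whole if it is chosen, since dropping any of its terms could miss a matching document.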

For query sets that contain many conjunctions, it can be useful to extract and index different minimal term combinations. For example, a phrase query on 'the quick brown fox' could index both 'quick' and 'brown', and avoid being run against documents that contain only one of these terms. The MultipassTermFilteredPresearcher allows this sort of indexing, taking a minimum term weight so that very common terms such as 'the' can be avoided.
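The effect of indexing several terms for one conjunction can be sketched in plain Java. This toy is not how MultipassTermFilteredPresearcher is implemented: MultipassSketch is a hypothetical name, term length stands in for term weight, and each chosen term stands in for one indexing pass.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class MultipassSketch {
  /**
   * Picks up to 'passes' distinct terms whose weight (here: length) is at
   * least minWeight, heaviest first. Ties keep their order in the phrase.
   */
  static List<String> indexedTerms(List<String> phraseTerms, int passes, int minWeight) {
    List<String> sorted = new ArrayList<>(new LinkedHashSet<>(phraseTerms));
    // List.sort is stable, so equally long terms stay in phrase order.
    sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
    List<String> chosen = new ArrayList<>();
    for (String term : sorted) {
      if (chosen.size() < passes && term.length() >= minWeight) {
        chosen.add(term);
      }
    }
    return chosen;
  }

  /** A query is only a candidate if the document contains the term from every pass. */
  static boolean isCandidate(List<String> indexedTerms, Set<String> documentTerms) {
    return documentTerms.containsAll(indexedTerms);
  }
}
```

With two passes and a minimum weight of 4, the phrase 'the quick brown fox' is indexed under 'quick' and 'brown' in this model, so a document containing 'quick' but not 'brown' is never run against it. If no term reaches the minimum weight, the sketch treats every document as a candidate.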

Custom Query implementations that are based on term matching, and that implement Query.visit( will work with no extra configuration; for more complicated custom queries, you can register a CustomQueryHandler with the presearcher. Included in this package is a RegexpQueryHandler, which gives an example of a different method of indexing automaton-based queries by extracting fixed substrings from a regular expression, and then using ngram filtering to build the document disjunction.

Persistent query sets

By default, Monitor instances are ephemeral, storing their query indexes in memory. To make a persistent monitor, build a MonitorConfiguration object and call MonitorConfiguration.setIndexPath(java.nio.file.Path, org.apache.lucene.monitor.MonitorQuerySerializer) to tell the Monitor to store its query index on disk. All queries registered with this Monitor will need to have a string representation that is also stored, and can be re-parsed by the associated MonitorQuerySerializer when the index is loaded by a new Monitor instance.
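A persistent setup might look like the sketch below. It assumes lucene-core and lucene-monitor are on the classpath and uses MonitorQuerySerializer.fromParser to build the serializer from a parsing function. The parse method is a deliberately trivial stand-in for a real query parser, and the class name, field name and query id are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Collections;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.monitor.Monitor;
import org.apache.lucene.monitor.MonitorConfiguration;
import org.apache.lucene.monitor.MonitorQuery;
import org.apache.lucene.monitor.MonitorQuerySerializer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class PersistentMonitorSketch {

  /** Hypothetical toy parser: the query string is a single term on the field "text". */
  static Query parse(String queryString) {
    return new TermQuery(new Term("text", queryString));
  }

  /** Configuration that stores the query index on disk and re-parses stored query strings. */
  static MonitorConfiguration config(Path indexPath) {
    return new MonitorConfiguration()
        .setIndexPath(indexPath, MonitorQuerySerializer.fromParser(PersistentMonitorSketch::parse));
  }

  /** Registers one query, closes the monitor, reopens it and returns the reloaded query count. */
  static int registerThenReload(Path indexPath) {
    try {
      try (Monitor monitor = new Monitor(new StandardAnalyzer(), config(indexPath))) {
        // The string form ("fox") is stored alongside the query so it can be re-parsed later.
        monitor.register(
            new MonitorQuery("fox-query", parse("fox"), "fox", Collections.emptyMap()));
      }
      try (Monitor reloaded = new Monitor(new StandardAnalyzer(), config(indexPath))) {
        return reloaded.getQueryCount();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The string passed to MonitorQuery must be something the serializer's parser can turn back into an equivalent query, since that string is what gets re-parsed when the index is reopened.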