Package org.apache.lucene.demo.facet

Facet Userguide and Demo.

Table of Contents

  1. Introduction
  2. Facet Features
    1. Facet Counting
    2. Facet Associations
  3. Indexing Categories Illustrated
  4. Collecting Facets Illustrated
  5. Indexed Facet Information
    1. Category Terms Field
    2. Category List Field
  6. Taxonomy Index
  7. Facet Configuration
  8. Advanced Faceted Examples
    1. Drill-Down with Regular Facets
    2. Multiple Category Lists
    3. Sampling
  9. Concurrent Indexing and Search
  10. All demo packages and classes

Introduction

A category is an aspect of indexed documents which can be used to classify the documents. For example, in a collection of books at an online bookstore, categories of a book can be its price, author, publication date, binding type, and so on.

In faceted search, in addition to the standard set of search results, we also get facet results, which are lists of subcategories for certain categories. For example, for the price facet, we get a list of relevant price ranges; for the author facet, we get a list of relevant authors; and so on. In most UIs, when users click one of these subcategories, the search is narrowed, or drilled down, and a new search limited to this subcategory (e.g., to a specific price range or author) is performed.

Note that faceted search is more than just the ordinary fielded search. In fielded search, users can add search keywords like price:10 or author:"Mark Twain" to the query to narrow the search, but this requires knowledge of which fields are available, and which values are worth trying. This is where faceted search comes in: it provides a list of useful subcategories, which ensures that the user only drills down into useful subcategories and never into a category for which there are no results. In essence, faceted search makes it easy to navigate through the search results. The list of subcategories provided for each facet is also useful to the user in itself, even when the user never drills down. This list allows the user to see at one glance some statistics on the search results, e.g., what price ranges and which authors are most relevant to the given query.

In recent years, faceted search has become a very common UI feature in search engines, especially in e-commerce websites. Faceted search makes it easy for untrained users to find the specific item they are interested in, whereas manually adding search keywords (as in the examples above) proved too cumbersome for ordinary users, and required too much guesswork, trial-and-error, or the reading of lengthy help pages.

See http://en.wikipedia.org/wiki/Faceted_search for more information on faceted search.

Facet Features

The first faceted search capability that comes to mind is counting, but faceted search is more than facet counting. We now briefly discuss the available faceted search features.

Facet Counting

Which of the available subcategories of a facet should a UI display? A query in a book store might yield books by a hundred different authors, but normally we'd want to display only, say, ten of those.

Most available faceted search implementations use counts to determine the importance of each subcategory. These implementations go over all search results for the given query, and count how many results are in each subcategory. Finally, the subcategories with the most results can be displayed. So the user sees the price ranges, authors, and so on, for which there are most results. Often, the count is displayed next to the subcategory name, in parentheses, telling the user how many results he can expect to see if he drills down into this subcategory.
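
The counting step described above can be sketched in plain Java, without any Lucene classes. This only illustrates the idea and is not how Lucene implements it; the class name TopSubcategories, the method topN and the sample author names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: counts results per subcategory and keeps the N largest. */
class TopSubcategories {

  /**
   * @param resultLabels one subcategory label per search result (e.g. the author of each hit)
   * @param n maximum number of subcategories to return
   * @return up to n entries formatted as "label (count)", highest count first,
   *     ties broken alphabetically
   */
  static List<String> topN(List<String> resultLabels, int n) {
    // Go over all results and count how many fall into each subcategory.
    Map<String, Integer> counts = new HashMap<>();
    for (String label : resultLabels) {
      counts.merge(label, 1, Integer::sum);
    }

    // Order the subcategories by descending count.
    List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
    sorted.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });

    // Keep only the first n and render the count in parentheses next to the name.
    List<String> top = new ArrayList<>();
    for (int i = 0; i < Math.min(n, sorted.size()); i++) {
      top.add(sorted.get(i).getKey() + " (" + sorted.get(i).getValue() + ")");
    }
    return top;
  }

  public static void main(String[] args) {
    List<String> authors = List.of("Twain", "Austen", "Twain", "Dickens", "Austen", "Twain");
    System.out.println(topN(authors, 2)); // prints [Twain (3), Austen (2)]
  }
}
```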

The main API for obtaining facet counts is Facets.

Code examples can be found in SimpleFacetsExample.java; see the Collecting Facets Illustrated section for details.

Facet Associations

So far we've discussed categories as binary features, where a document either belongs to a category, or not.

While counts are useful in most situations, they are sometimes not sufficiently informative for the user, with respect to deciding which subcategory is more important to display.

For this, the facets package allows you to associate a value with a category. The search-time interpretation of the associated value is application dependent. For example, a possible interpretation is as a match level (e.g., confidence level). This value can then be used so that a document that is only weakly associated with a certain category contributes little to this category's aggregated weight.
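
To make the difference between counting and weighted aggregation concrete, here is a plain-Java sketch that sums a confidence value per category. It is a toy model rather than the facets package API; the class AssociationWeights, the method aggregate and the sample categories and values are hypothetical, and summing is only one possible interpretation of the associated value.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: aggregates per-document association values by summing them. */
class AssociationWeights {

  /**
   * @param matchingDocs for each matching document, a map from category to association value
   * @return the sum of association values per category, sorted by category name
   */
  static Map<String, Double> aggregate(List<Map<String, Double>> matchingDocs) {
    Map<String, Double> totals = new TreeMap<>();
    for (Map<String, Double> doc : matchingDocs) {
      for (Map.Entry<String, Double> e : doc.entrySet()) {
        totals.merge(e.getKey(), e.getValue(), Double::sum);
      }
    }
    return totals;
  }

  public static void main(String[] args) {
    // "travel" appears in three documents but only weakly; "humor" appears once, strongly.
    List<Map<String, Double>> docs = List.of(
        Map.of("humor", 1.0, "travel", 0.25),
        Map.of("travel", 0.25),
        Map.of("travel", 0.25));
    System.out.println(aggregate(docs)); // prints {humor=1.0, travel=0.75}
  }
}
```

By plain counting, "travel" (3 documents) would rank above "humor" (1 document); by aggregated weight the order is reversed.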

Indexing Categories Illustrated

In order to find facets at search time, they must first be added to the index at indexing time. Recall that Lucene documents are made of fields. To index document categories you use a special field, FacetField. The field requires the following parameters:

  • Facet dimension, for example author or publication date.
  • Facet path from root to a leaf for the current document. For example, for the publication date dimension the path can be <"2010", "07", "28">. Constructed this way, the path allows us to refine search or counting to all books published in the same year, or in the same year and month.

From the taxonomy point of view, the dimension is just the root element (the top, or first, element) of a category path. The dimension stands out as the most important part of the category path, such as "Location" for the category "Location/Europe/France/Paris".

After all facet fields are added to the document, you should translate them into "normal" fields for indexing and, if required, update the taxonomy index. To do that, call FacetsConfig.build(...). Before building, you might want to customize the per-dimension facets configuration; see Indexed Facet Information for details.

Indexing of each document therefore usually goes like this:

  • Create a fresh (empty) Lucene Document.
  • Parse input attributes and add appropriate index fields.
  • Add all input categories associated with the document as FacetField fields to the Lucene document.
  • Build facet fields with FacetsConfig.build(...). This actually adds the categories to the Lucene document and, if required, updates the taxonomy to contain the newly added categories (if not already there) - see more on this in the section about the Taxonomy Index below.
  • Add the document to the index. As a result, category info is also saved in the regular search index, to support facet aggregation at search time (e.g. facet counting) as well as facet drill-down. For more information on indexed facet information, see the section Indexed Facet Information below.

There is a category indexing code example in SimpleFacetsExample.index(); see the SimpleFacetsExample.java source code.
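
The indexing steps above can be sketched roughly as follows. This is a minimal sketch rather than the demo's own code, assuming Lucene 9.x with lucene-core and lucene-facet on the classpath; the class name FacetIndexingSketch, the "title" field and the sample values are illustrative. The two dimensions are marked hierarchical because their paths have more than one component.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

/** Illustrative sketch of indexing one document with facet fields. */
class FacetIndexingSketch {

  static void indexBook(IndexWriter indexWriter, DirectoryTaxonomyWriter taxoWriter,
      FacetsConfig config) throws IOException {
    // Create a fresh (empty) Lucene Document.
    Document doc = new Document();
    // Add regular index fields (the field name and value are illustrative).
    doc.add(new TextField("title", "The Adventures of Tom Sawyer", Field.Store.YES));
    // Add the categories as FacetField fields: dimension first, then the path.
    doc.add(new FacetField("Author", "American", "Mark Twain"));
    doc.add(new FacetField("Publish Date", "2010", "07", "28"));
    // build(...) translates the facet fields into normal fields and updates the
    // taxonomy if needed; the resulting document is then added to the index.
    indexWriter.addDocument(config.build(taxoWriter, doc));
  }

  public static void main(String[] args) throws IOException {
    Directory indexDir = new ByteBuffersDirectory(); // in-memory, for the sketch only
    Directory taxoDir = new ByteBuffersDirectory();

    FacetsConfig config = new FacetsConfig();
    // Multi-component paths require the dimension to be configured as hierarchical.
    config.setHierarchical("Author", true);
    config.setHierarchical("Publish Date", true);

    // try-with-resources closes in reverse order: taxonomy writer first, then index writer.
    try (IndexWriter indexWriter = new IndexWriter(indexDir, new IndexWriterConfig());
        DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir)) {
      indexBook(indexWriter, taxoWriter, config);
    }
  }
}
```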

Collecting Facets Illustrated

Facet collection takes two inputs, a set of documents and a set of facet requests:

  • Document set - a subset of the index documents, usually documents matching a user query.
  • Facet requests - facet collection specification, e.g. count a certain facet dimension.

Facets is a basic component in faceted search. It provides multiple methods to get facet results; most of these methods take a dimension and a path, which correspond to the dimensions and paths of indexed documents (see Indexing Categories Illustrated). For example, getTopChildren(10, "Publish Date", "2010", "07") can be used to return the top 10 labels for books published in July 2010.

Facets is an abstract class, open for extension. The most often used implementation is FastTaxonomyFacetCounts, which is used for counting facets.

NOTE: You must make sure that the FacetsConfig used for searching matches the one that was used for indexing.

Facet collectors collect hits for subsequent faceting. The most commonly used one is FacetsCollector. The collectors extend Collector, and as such can be passed to the search() method of Lucene's IndexSearcher. If the application also needs to collect documents (in addition to accumulating/collecting facets), you can use one of the FacetsCollector.search(...) utility methods.

There is a facets collecting code example in SimpleFacetsExample.facetsWithSearch(), see SimpleFacetsExample.java source code.
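
For orientation, here is a minimal end-to-end sketch of collecting facet counts. It is not the demo's own code; it assumes Lucene 9.x with lucene-core and lucene-facet on the classpath, and the class name FacetCollectingSketch, the "Author" dimension and the sample authors are illustrative.

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.LabelAndValue;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

/** Illustrative sketch of collecting facet counts for one dimension. */
class FacetCollectingSketch {

  static FacetResult topAuthors(IndexSearcher searcher, TaxonomyReader taxoReader,
      FacetsConfig config) throws IOException {
    FacetsCollector collector = new FacetsCollector();
    // Retrieves the top 10 hits and, along the way, records all matching documents.
    FacetsCollector.search(searcher, new MatchAllDocsQuery(), 10, collector);
    // Count facets over the collected documents, using the same config as at indexing time.
    Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, collector);
    return facets.getTopChildren(10, "Author");
  }

  public static void main(String[] args) throws IOException {
    Directory indexDir = new ByteBuffersDirectory(); // in-memory, for the sketch only
    Directory taxoDir = new ByteBuffersDirectory();
    FacetsConfig config = new FacetsConfig();

    // Index three tiny documents that carry only an Author facet.
    try (IndexWriter indexWriter = new IndexWriter(indexDir, new IndexWriterConfig());
        DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir)) {
      for (String author : new String[] {"Mark Twain", "Mark Twain", "Jane Austen"}) {
        Document doc = new Document();
        doc.add(new FacetField("Author", author));
        indexWriter.addDocument(config.build(taxoWriter, doc));
      }
    }

    try (DirectoryReader reader = DirectoryReader.open(indexDir);
        TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir)) {
      FacetResult result = topAuthors(new IndexSearcher(reader), taxoReader, config);
      // A FacetResult exposes the dimension, an aggregated value, the child count
      // and the top labels with their values.
      System.out.println(result.dim + ": value=" + result.value
          + ", childCount=" + result.childCount);
      for (LabelAndValue labelAndValue : result.labelValues) {
        System.out.println("  " + labelAndValue.label + " (" + labelAndValue.value + ")");
      }
    }
  }
}
```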

Returned FacetResult instances contain:

Indexed Facet Information

When indexing a document with facet fields (categories), information on these categories is added to the search index, in two locations:

Category Terms Field

Category terms, or drill-down terms, are added to the document that contains facet fields. These categories can be used at search time for drill-down.

FacetsConfig has a per-dimension config for the Category Terms field, e.g. you can choose not to index them at all, or to index the dimension, all sub-paths and the full path (the default config). For example, indexing a document with a dimension "author" and path <"American", "Mark Twain"> results in creating three tokens: "/author", "/author/American", and "/author/American/Mark Twain" (the character '/' here is just a human-readable separator; there is no such element in the actual index). This allows drilling down into any category in the taxonomy, and not just leaf nodes.
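
The expansion of a category path into drill-down tokens can be modeled in a few lines of plain Java. This is a toy model of the default behaviour described above, not Lucene's internal encoding; the class DrillDownTerms and the method expand are hypothetical, and '/' is used only as the human-readable separator from the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: lists the dimension, every sub-path and the full path. */
class DrillDownTerms {

  /**
   * @param dim the facet dimension, e.g. "author"
   * @param path the path components below the dimension
   * @return one human-readable token per prefix of the category path
   */
  static List<String> expand(String dim, String... path) {
    List<String> terms = new ArrayList<>();
    StringBuilder current = new StringBuilder("/").append(dim);
    terms.add(current.toString()); // the dimension itself
    for (String component : path) {
      current.append('/').append(component);
      terms.add(current.toString()); // each sub-path, ending with the full path
    }
    return terms;
  }

  public static void main(String[] args) {
    // prints [/author, /author/American, /author/American/Mark Twain]
    System.out.println(expand("author", "American", "Mark Twain"));
  }
}
```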

Category List Field

A Category List field, containing information on the categories that were added to the document, is added to each document. This can be used at search time for facet accumulation, e.g. facet counting.

If dimension is hierarchical (see FacetsConfig.DimConfig.hierarchical), the field allows counting any sub-category in the taxonomy, and not just leaf nodes, e.g. in the example above it enables a UI application to show either how many books have authors, or how many books have American authors, or how many books have Mark Twain as their (American) author.

If a separate taxonomy index is used (see Taxonomy Index for when it is not), the Category List field is built from category ordinals in order to keep the counting list compact. An ordinal is an integer number attached to a category when it is added to the taxonomy for the first time.

For ways to further alter the facet index, see the section below on Facet Configuration.

Taxonomy Index

The taxonomy is an auxiliary data structure that can be maintained side-by-side with the regular index to support faceted search operations. It contains information about all the categories that ever existed in any document in the index. Its API is open and allows simple usage, or more advanced usage for interested users.

Not all facet field types use the Taxonomy Index to store data. SortedSetDocValuesFacetField writes taxonomy data as a SortedSetDocValues field in the regular index, so if you only use these fields for taxonomy, you don't need the taxonomy index and its writer. All details in the rest of this section apply only to cases where a Taxonomy Index is created.

When FacetsConfig.build(...) is called on a document with facet fields, a corresponding node is added to the taxonomy index (unless already there). In fact, sometimes more than one node is added - each parent category is added as well, so that the taxonomy is maintained as a tree, with a virtual root.

So, for the above example, adding the category <"author", "American", "Mark Twain"> actually added three nodes: one for the dimension "/author", one for "/author/American" and one for "/author/American/Mark Twain".

An integer number, called an ordinal, is attached to each category the first time the category is added to the taxonomy. This allows for a compact representation of category list tokens in the index, for facets accumulation.
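
The way parents and ordinals are handled can be illustrated with a toy taxonomy in plain Java. This is a simplified model, not the actual taxonomy index implementation; the class ToyTaxonomy and its methods are hypothetical, and giving the virtual root ordinal 0 is a modeling choice for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: assigns an ordinal to each category the first time it is seen. */
class ToyTaxonomy {
  private final Map<String, Integer> ordinals = new LinkedHashMap<>();

  ToyTaxonomy() {
    ordinals.put("/", 0); // the virtual root of the tree
  }

  /** Adds the category and any missing parents; returns the ordinal of the full path. */
  int addCategory(String dim, String... path) {
    String key = "/" + dim;
    int ordinal = ordinalFor(key); // the dimension node
    for (String component : path) {
      key = key + "/" + component;
      ordinal = ordinalFor(key); // each parent, ending with the leaf
    }
    return ordinal;
  }

  /** Returns the existing ordinal, or assigns the next free one. Nothing is ever removed. */
  private int ordinalFor(String key) {
    Integer existing = ordinals.get(key);
    if (existing != null) {
      return existing;
    }
    int next = ordinals.size();
    ordinals.put(key, next);
    return next;
  }

  int size() {
    return ordinals.size();
  }

  public static void main(String[] args) {
    ToyTaxonomy taxonomy = new ToyTaxonomy();
    // Adds three nodes: /author, /author/American, /author/American/Mark Twain.
    System.out.println(taxonomy.addCategory("author", "American", "Mark Twain")); // prints 3
    // Only the new leaf is added; the parents already exist.
    System.out.println(taxonomy.addCategory("author", "American", "Jack London")); // prints 4
  }
}
```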

One interesting fact about the taxonomy index is worth knowing: once a category is added to the taxonomy, it is never removed, even if all related documents are removed. This differs from a regular index, where if all documents containing a certain term are removed, and their segments are merged, the term will also be removed. This might cause a performance issue: a large taxonomy means large ordinal numbers for categories, and hence large category value arrays would be maintained during accumulation. It is probably not a real problem for most applications, but be aware of it. If, for example, an application at a certain point in time removes an index entirely in order to recreate it, or removes all the documents from the index in order to re-populate it, it makes sense to take this opportunity to remove the taxonomy index as well and create a new, fresh one, without the unused categories.

Facet Configuration

Facet configuration controls how categories and facets are indexed and searched. It is not required to provide any parameters, as there are ready-to-use working defaults for everything. However, some aspects are configurable and can be modified by providing altered facet configuration parameters for search and indexing.

The most often used configuration options are:

Advanced Faceted Examples

We now provide examples for more advanced facet indexing and search, such as drilling-down on facet values and multiple category lists.

Drill-Down with Regular Facets

Drill-down allows users to focus on part of the results. Assume a commercial sport equipment site where a user is searching for a tennis racquet. The user issues the query tennis racquet and as a result is shown a page with 10 tennis racquets, by various providers, of various types and prices. In addition, the site UI shows the user a breakdown of all available racquets by price and make. The user now decides to focus on racquets made by Head, and is shown a new page, with 10 Head racquets, and a new breakdown of the results into racquet types and prices. Additionally, the application can choose to display a new breakdown, by racquet weights. This step of moving from the results (and facet statistics) of the entire (or larger) data set to a portion of it, by specifying a certain category, is what we call drill-down.

You can find a code example for drill-down in SimpleFacetsExample.drillDown(); see the SimpleFacetsExample.java source code.

Multiple Category Lists

The default is to maintain all category list information in a single field. While this will suit most applications, in some situations an application may wish to use multiple fields, for example, when the distribution of some category values is different from that of other categories and calls for using a different encoding, more efficient for the specific distribution. Another example is when most facets are rarely used while some facets are used very heavily, so an application may opt to maintain the latter in memory - and in order to keep memory footprint lower it is useful to maintain only those heavily used facets in a separate category list.

You can find full example code in MultiCategoryListsFacetsExample.java.

First we need to change the facets configuration to use different fields for different dimensions.
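
A minimal sketch of such a configuration, assuming FacetsConfig.setIndexFieldName(dimension, fieldName) from lucene-facet; the class name MultiCategoryListsConfigSketch and the field names "author" and "pubdate" are illustrative:

```java
import org.apache.lucene.facet.FacetsConfig;

/** Illustrative sketch: route two dimensions to their own category list fields. */
class MultiCategoryListsConfigSketch {

  static FacetsConfig newConfig() {
    FacetsConfig config = new FacetsConfig();
    // The index field names below are illustrative choices.
    config.setIndexFieldName("Author", "author");
    config.setIndexFieldName("Publish Date", "pubdate");
    return config;
  }
}
```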

This will cause the Author categories to be maintained in one category list field, and Publish Date facets to be maintained in another field. Note that any other category, if encountered, will still be maintained in the default field.

These non-default facets parameters should now be used both at indexing and search time, so make sure you use the same or similar FacetsConfig in both cases.

Sampling

Faceted search through a large collection of documents with large numbers of facets altogether and/or large numbers of facets per document is challenging performance-wise, in CPU, RAM, or both.

Facet sampling allows accumulating facets over a sample of the matching document set. In many cases, once the top facets are found over the sample set, exact accumulations are computed for those facets only, this time over the entire matching document set.

Sampling support is implemented in RandomSamplingFacetsCollector.
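
The two-pass idea can be sketched in plain Java as follows. This is a toy model, not the RandomSamplingFacetsCollector implementation: the class SampledFacetCounts and the method topExact are hypothetical, and the sample is taken deterministically (every k-th document) instead of randomly so that the result is reproducible.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: find top facets on a sample, then count exactly for those only. */
class SampledFacetCounts {

  /**
   * @param matchingDocLabels one facet label per matching document
   * @param sampleEvery sample every sampleEvery-th document (must be at least 1)
   * @param n number of top facets to return
   * @return exact counts for the n facets that looked most frequent in the sample
   */
  static Map<String, Integer> topExact(List<String> matchingDocLabels, int sampleEvery, int n) {
    // Pass 1: approximate counts over the sample only.
    Map<String, Integer> sampleCounts = new HashMap<>();
    for (int i = 0; i < matchingDocLabels.size(); i += sampleEvery) {
      sampleCounts.merge(matchingDocLabels.get(i), 1, Integer::sum);
    }

    // Pick the n labels with the highest sampled counts (ties broken alphabetically).
    List<String> byCount = new ArrayList<>(sampleCounts.keySet());
    byCount.sort((a, b) -> {
      int cmp = Integer.compare(sampleCounts.get(b), sampleCounts.get(a));
      return cmp != 0 ? cmp : a.compareTo(b);
    });
    List<String> candidates = byCount.subList(0, Math.min(n, byCount.size()));

    // Pass 2: exact counts over the entire matching set, for the candidates only.
    Map<String, Integer> exact = new LinkedHashMap<>();
    for (String candidate : candidates) {
      exact.put(candidate, 0);
    }
    for (String label : matchingDocLabels) {
      exact.computeIfPresent(label, (key, count) -> count + 1);
    }
    return exact;
  }

  public static void main(String[] args) {
    List<String> labels = List.of("A", "A", "B", "A", "C", "B", "A", "B", "A", "A");
    System.out.println(topExact(labels, 2, 2)); // prints {A=6, B=3}
  }
}
```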

Concurrent Indexing and Search

Sometimes, indexing is done once, and when the index is fully prepared, searching starts. However, in most real applications indexing is incremental (new data comes in once in a while, and needs to be indexed), and indexing often needs to happen while searching is continuing at full steam.

Luckily, Lucene supports multiprocessing - one process writing to an index while another is reading from it. One of the key insights behind how Lucene allows multiprocessing is Point In Time semantics. The idea is that when an IndexReader is opened, it gets a view of the index at the point in time it was opened. If an IndexWriter in a different process or thread modifies the index, the reader does not know about it until a new IndexReader is opened.

In faceted search, we complicate things somewhat by adding a second index - the taxonomy index. The taxonomy API also follows point-in-time semantics, but this is not quite enough. Some attention must be paid by the user to keep those two indexes consistently in sync.

The main index refers to category numbers defined in the taxonomy index. Therefore, it is important that we open the TaxonomyReader after opening the IndexReader. Moreover, every time an IndexReader is reopened, the TaxonomyReader needs to be reopened as well.

But there is one extra caution: whenever the application deems it has written enough information to be worth a commit, it must first commit the taxonomy, by calling commit() on the TaxonomyWriter (TwoPhaseCommit.commit()), and only after that call IndexWriter.commit(). Closing the indices should also be done in this order: first close the taxonomy, and only after that close the index.

Note that the above discussion assumes that the underlying file-system on which the index and the taxonomy are stored respects ordering: if index A is written before index B, then any reader finding a modified index B will also see a modified index A.
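
The ordering rules from this section can be summarized in a short sketch. It assumes Lucene 9.x with lucene-core and lucene-facet on the classpath; the class CommitOrderSketch and its method names are hypothetical, and error handling is omitted.

```java
import java.io.IOException;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.TaxonomyWriter;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

/** Illustrative sketch of the commit, close and open ordering described above. */
class CommitOrderSketch {

  /** Commit the taxonomy first, then the main index. */
  static void commitBoth(TaxonomyWriter taxoWriter, IndexWriter indexWriter) throws IOException {
    taxoWriter.commit();
    indexWriter.commit();
  }

  /** Close in the same order: taxonomy first, then the main index. */
  static void closeBoth(TaxonomyWriter taxoWriter, IndexWriter indexWriter) throws IOException {
    taxoWriter.close();
    indexWriter.close();
  }

  /** On the search side, open the IndexReader first and the TaxonomyReader after it. */
  static TaxonomyReader openReaders(Directory indexDir, Directory taxoDir,
      DirectoryReader[] indexReaderOut) throws IOException {
    indexReaderOut[0] = DirectoryReader.open(indexDir);
    return new DirectoryTaxonomyReader(taxoDir);
  }
}
```

Committing the taxonomy first means that every ordinal a committed index segment refers to is already present in the committed taxonomy.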

All demo packages and classes