Package org.apache.lucene.search.grouping

This module enables search result grouping with Lucene, where hits with the same value in the specified single-valued group field are grouped together. For example, if you group by the author field, then all documents with the same value in the author field fall into a single group.
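
For example, documents might be indexed with a single-valued author field like this (a minimal sketch; the title field and the values are purely illustrative):

  Document doc = new Document();
  doc.add(new Field("author", "J. R. R. Tolkien", Field.Store.YES, Field.Index.NOT_ANALYZED));
  doc.add(new Field("title", "The Hobbit", Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);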

Grouping requires a number of inputs:

  - groupField: the single-valued indexed field to group on (for example, the author field).
  - groupSort: how the groups are sorted relative to one another.
  - topNGroups: how many top groups to keep.
  - groupOffset: the offset into the top groups, useful for paging through groups.
  - docSort: how documents are sorted within each group.
  - docsPerGroup: how many top documents to keep within each group.
  - docOffset: the offset into the top documents within each group, useful for paging.

The implementation is two-pass: the first pass (TermFirstPassGroupingCollector) gathers the top groups, and the second pass (TermSecondPassGroupingCollector) gathers documents within those groups. If the search is costly to run you may want to use the CachingCollector class, which caches hits and can (quickly) replay them for the second pass. This way you only run the query once, but you pay a RAM cost to (briefly) hold all hits. Results are returned as a TopGroups instance.

This module abstracts away what defines a group and how it is collected. All grouping collectors are abstract and currently have term-based implementations. One can implement collectors that, for example, group on multiple fields.

Known limitations:

  - The two-pass grouping collectors require the group field to be a single-valued indexed field.
  - Although what defines a group is abstracted, only term-based implementations are currently provided.
  - The single-pass BlockGroupingCollector requires documents of the same group to be indexed together as a block (see below).

Typical usage for the generic two-pass collector looks like this (using the CachingCollector):

  TermFirstPassGroupingCollector c1 = new TermFirstPassGroupingCollector("author", groupSort, groupOffset+topNGroups);

  boolean cacheScores = true;
  double maxCacheRAMMB = 4.0;
  CachingCollector cachedCollector = CachingCollector.create(c1, cacheScores, maxCacheRAMMB);
  s.search(new TermQuery(new Term("content", searchTerm)), cachedCollector);

  boolean fillFields = true;
  Collection<SearchGroup<String>> topGroups = c1.getTopGroups(groupOffset, fillFields);

  if (topGroups == null) {
    // No groups matched
    return;
  }

  boolean getScores = true;
  boolean getMaxScores = true;
  TermSecondPassGroupingCollector c2 = new TermSecondPassGroupingCollector("author", topGroups, groupSort, docSort, docOffset+docsPerGroup, getScores, getMaxScores, fillFields);

  // Optionally compute the total group count
  TermAllGroupsCollector allGroupsCollector = null;
  Collector secondPassCollector = c2;
  if (requiredTotalGroupCount) {
    allGroupsCollector = new TermAllGroupsCollector("author");
    secondPassCollector = MultiCollector.wrap(c2, allGroupsCollector);
  }

  if (cachedCollector.isCached()) {
    // Cache fit within maxCacheRAMMB, so we can replay it:
    cachedCollector.replay(secondPassCollector);
  } else {
    // Cache was too large; must re-execute query:
    s.search(new TermQuery(new Term("content", searchTerm)), secondPassCollector);
  }

  TopGroups<String> groupsResult = c2.getTopGroups(docOffset);
  if (requiredTotalGroupCount) {
    groupsResult = new TopGroups<String>(groupsResult, allGroupsCollector.getGroupCount());
  }

  // Render groupsResult...
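
The returned TopGroups contains one GroupDocs entry per group; rendering it might look like this (a minimal sketch; the title field is illustrative):

  for (GroupDocs<String> group : groupsResult.groups) {
    System.out.println("group: " + group.groupValue + " (" + group.totalHits + " hits)");
    for (ScoreDoc sd : group.scoreDocs) {
      // Load each hit's stored fields to render it:
      System.out.println("  " + s.doc(sd.doc).get("title"));
    }
  }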

To use the single-pass BlockGroupingCollector, you must first ensure, at indexing time, that all docs in each group are added as a block, and that you have some way to find the last document of each group. One simple way to do this is to add a marker field to the last document of the group:

  // Create Documents from your source:
  List<Document> oneGroup = ...;
  
  Field groupEndField = new Field("groupEnd", "x", Field.Store.NO, Field.Index.NOT_ANALYZED);
  groupEndField.setOmitTermFreqAndPositions(true);
  groupEndField.setOmitNorms(true);
  oneGroup.get(oneGroup.size()-1).add(groupEndField);

  // You can also use writer.updateDocuments(); just be sure you
  // replace an entire previous doc block with this new one.  For
  // example, each group could have a "groupID" field, with the same
  // value for all docs in this group:
  writer.addDocuments(oneGroup);
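
For example, re-indexing a group that was previously added could look like this (a sketch; it assumes each doc in the group carries the same single-valued "groupID" field, as described in the comment above, and groupID is illustrative):

  // Replace the entire previous doc block for this group in one call:
  writer.updateDocuments(new Term("groupID", groupID), oneGroup);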
Then, at search time, do this up front:
  // Set this once in your app & save away for reusing across all queries:
  Filter groupEndDocs = new CachingWrapperFilter(new QueryWrapperFilter(new TermQuery(new Term("groupEnd", "x"))));
Finally, do this per search:
  // Per search:
  BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndDocs);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  TopGroups groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup, fillFields);

  // Render groupsResult...
Note that the groupValue of each GroupDocs will be null, so if you need to present this value you'll have to separately retrieve it (for example using stored fields, FieldCache, etc.).
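
For example, if the group field is also stored, one way to recover the group value (a sketch; it assumes the "author" field was stored at indexing time and that each group has at least one hit) is to read it from the first document of each group:

  for (GroupDocs<?> gd : groupsResult.groups) {
    // groupValue is null here, so load the value from a stored field instead:
    String author = s.doc(gd.scoreDocs[0].doc).get("author");
    // ... render author along with gd.scoreDocs ...
  }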

Another collector is TermAllGroupHeadsCollector, which retrieves the most relevant document of each group, also known as the group heads. This can be useful when one wants to compute grouping-based facets or statistics on the complete query result. The collector can be executed during the first or second pass.

  AbstractAllGroupHeadsCollector c = TermAllGroupHeadsCollector.create(groupField, sortWithinGroup);
  s.search(new TermQuery(new Term("content", searchTerm)), c);
  // Return all group heads as an int array of doc ids:
  int[] groupHeadsArray = c.retrieveGroupHeads();
  // Or return all group heads as a FixedBitSet:
  int maxDoc = s.maxDoc();
  FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc);
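
The returned group heads can then be used for further processing, for example (a minimal sketch; docId stands for some document id of interest):

  // One group head per matching group, so this is also the total group count:
  int totalGroupCount = groupHeadsBitSet.cardinality();
  // Test whether a particular document is the most relevant doc of its group:
  boolean isGroupHead = groupHeadsBitSet.get(docId);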