org.apache.lucene.search.grouping (Lucene 8.6.0 API)

Grouping.

This module enables search result grouping with Lucene, where hits with the same value in the specified single-valued group field are grouped together. For example, if you group by the author field, then all documents with the same value in the author field fall into a single group.

Grouping requires a number of inputs:

groupSelector: this defines how groups are created from values per-document. The grouping module ships with selectors for grouping by term, and by long and double ranges.
groupSort: how the groups are sorted. For sorting purposes, each group is "represented" by the highest-sorted document according to the groupSort within it. For example, if you specify "price" (ascending) then the first group is the one with the lowest price book within it. Or if you specify relevance group sort, then the first group is the one containing the highest scoring book.
topNGroups: how many top groups to keep. For example, 10 means the top 10 groups are computed.
groupOffset: which "slice" of top groups you want to retrieve. For example, 3 means you'll get 7 groups back (assuming topNGroups is 10). This is useful for paging, where you might show 5 groups per page.
withinGroupSort: how the documents within each group are sorted. This can be different from the group sort.
maxDocsPerGroup: how many top documents within each group to keep.
withinGroupOffset: which "slice" of top documents you want to retrieve from each group.
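
The relationship between topNGroups and groupOffset can be illustrated with a little arithmetic. The following is a minimal sketch in plain Java, not part of the grouping module; the class and method names (GroupPagingSketch, slice, topNGroupsForPage, groupOffsetForPage) are hypothetical and exist only for this illustration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not a Lucene class: shows how groupOffset slices the top groups.
final class GroupPagingSketch {

    // Returns the groups from groupOffset onwards, out of an already sorted list of top groups.
    static <T> List<T> slice(List<T> topGroups, int groupOffset) {
        if (groupOffset >= topGroups.size()) {
            return Collections.emptyList();
        }
        return topGroups.subList(groupOffset, topGroups.size());
    }

    // Number of top groups that must be computed to cover the given 0-based page.
    static int topNGroupsForPage(int page, int groupsPerPage) {
        return (page + 1) * groupsPerPage;
    }

    // Offset of the first group shown on the given 0-based page.
    static int groupOffsetForPage(int page, int groupsPerPage) {
        return page * groupsPerPage;
    }
}
```

With 10 top groups and a groupOffset of 3, slice returns 7 groups, matching the example above. For the second page at 5 groups per page, the helpers give topNGroups = 10 and groupOffset = 5.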

The implementation is two-pass: the first pass (FirstPassGroupingCollector) gathers the top groups, and the second pass (SecondPassGroupingCollector) gathers documents within those groups. If the search is costly to run you may want to use the CachingCollector class, which caches hits and can (quickly) replay them for the second pass. This way you only run the query once, but you pay a RAM cost to (briefly) hold all hits. Results are returned as a TopGroups instance.
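
To make the two-pass idea concrete, here is a simplified sketch over an in-memory list of hits. It is not how FirstPassGroupingCollector or SecondPassGroupingCollector are implemented; the Doc class, the "author" and "price" fields, and the method names are assumptions chosen for the illustration, with groups sorted by ascending price.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of two-pass grouping; not Lucene code.
final class TwoPassGroupingSketch {

    static final class Doc {
        final int id;
        final String author;
        final double price;

        Doc(int id, String author, double price) {
            this.id = id;
            this.author = author;
            this.price = price;
        }
    }

    // First pass: each group is represented by its cheapest document; keep the top N groups.
    static List<String> firstPass(List<Doc> hits, int topNGroups) {
        Map<String, Double> best = new HashMap<>();
        for (Doc d : hits) {
            best.merge(d.author, d.price, Math::min);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort(Comparator.comparing((String g) -> best.get(g))
                .thenComparing(Comparator.<String>naturalOrder()));
        return new ArrayList<>(groups.subList(0, Math.min(topNGroups, groups.size())));
    }

    // Second pass: replay the hits and keep the cheapest maxDocsPerGroup documents per top group.
    static Map<String, List<Doc>> secondPass(List<Doc> hits, List<String> topGroups, int maxDocsPerGroup) {
        Map<String, List<Doc>> result = new LinkedHashMap<>();
        for (String g : topGroups) {
            result.put(g, new ArrayList<>());
        }
        for (Doc d : hits) {
            List<Doc> docs = result.get(d.author);
            if (docs != null) {
                docs.add(d); // hits outside the top groups are ignored
            }
        }
        for (List<Doc> docs : result.values()) {
            docs.sort(Comparator.comparingDouble((Doc d) -> d.price));
            if (docs.size() > maxDocsPerGroup) {
                docs.subList(maxDocsPerGroup, docs.size()).clear();
            }
        }
        return result;
    }
}
```

The second pass iterates the same hits again, which is the step a cache of collected hits would let you replay without re-running the query.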

Groups are defined by GroupSelector implementations:

TermGroupSelector groups based on the value of a SortedDocValues field
ValueSourceGroupSelector groups based on the value of a ValueSource
DoubleRangeGroupSelector groups based on the value of a DoubleValuesSource
LongRangeGroupSelector groups based on the value of a LongValuesSource
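
Range-based grouping amounts to mapping each numeric value to a contiguous range with an inclusive minimum and an exclusive maximum. The sketch below shows one way such bucketing could work in plain Java. The min, width and max parameters, and the choice to place out-of-bounds values in open-ended ranges, are assumptions made for this illustration rather than a statement of what the range selectors do.

```java
// Hypothetical illustration of range bucketing; not a Lucene class.
final class RangeBucketSketch {

    // Returns {lo, hi} for the range containing value, with lo inclusive and hi exclusive.
    // Values below min or at/above max fall into open-ended ranges (an assumption of this sketch).
    static long[] bucket(long value, long min, long width, long max) {
        if (value < min) {
            return new long[] {Long.MIN_VALUE, min};
        }
        if (value >= max) {
            return new long[] {max, Long.MAX_VALUE};
        }
        long lo = min + ((value - min) / width) * width;
        long hi = Math.min(lo + width, max);
        return new long[] {lo, hi};
    }
}
```

Two documents fall into the same group exactly when their values map to the same {lo, hi} pair, so 25 and 29 share a group with width 10, while 30 starts the next one.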

Known limitations:

Sharding is not directly supported, though it is not too difficult to implement if you can merge the top groups and top documents per group yourself.
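
As a rough idea of what merging top groups across shards involves, the sketch below combines per-shard results in plain Java. It covers only the top-groups step, not the merge of documents within each group. The input shape (one map per shard from group value to the sort value of that shard's best document, lower being better) and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of merging per-shard top groups; not a Lucene class.
final class ShardMergeSketch {

    // Merges the shards' top groups and returns the slice [groupOffset, topNGroups).
    static List<String> mergeTopGroups(List<Map<String, Double>> shardTopGroups,
                                       int groupOffset, int topNGroups) {
        // A group may appear on several shards; keep its best (lowest) sort value.
        Map<String, Double> best = new HashMap<>();
        for (Map<String, Double> shard : shardTopGroups) {
            for (Map.Entry<String, Double> e : shard.entrySet()) {
                best.merge(e.getKey(), e.getValue(), Math::min);
            }
        }
        List<String> merged = new ArrayList<>(best.keySet());
        merged.sort(Comparator.comparing((String g) -> best.get(g))
                .thenComparing(Comparator.<String>naturalOrder()));
        int from = Math.min(groupOffset, merged.size());
        int to = Math.max(from, Math.min(topNGroups, merged.size()));
        return new ArrayList<>(merged.subList(from, to));
    }
}
```

Each shard has to report at least groupOffset + the page size of its own top groups for the merged slice to be correct, since a group that ranks low on one shard may still rank high overall.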

Typical usage of the generic two-pass grouping search, using the GroupingSearch convenience utility (optionally with caching for the second pass), looks like this:

   GroupingSearch groupingSearch = new GroupingSearch("author");
   groupingSearch.setGroupSort(groupSort);
   groupingSearch.setFillSortFields(fillFields);
 
   if (useCache) {
     // Sets cache in MB
     groupingSearch.setCachingInMB(4.0, true);
   }
 
   if (requiredTotalGroupCount) {
     groupingSearch.setAllGroups(true);
   }
 
   TermQuery query = new TermQuery(new Term("content", searchTerm));
   TopGroups<BytesRef> result = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);
 
   // Render result...
   if (requiredTotalGroupCount) {
     int totalGroupCount = result.totalGroupCount;
   }

To use the single-pass BlockGroupingCollector, you must first ensure, at indexing time, that all docs in each group are added as a block, and that you have some way to find the last document of each group. One simple way to do this is to add a marker field to the last document of each group:

   // Create Documents from your source:
   List<Document> oneGroup = ...;
   
   // StringField is indexed as a single token, without norms, frequencies or positions
   Field groupEndField = new StringField("groupEnd", "x", Field.Store.NO);
   oneGroup.get(oneGroup.size()-1).add(groupEndField);
 
   // You can also use writer.updateDocuments(); just be sure you
   // replace an entire previous doc block with this new one.  For
   // example, each group could have a "groupID" field, with the same
   // value for all docs in this group:
   writer.addDocuments(oneGroup);

Then, at search time:

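   Query groupEndDocs = new TermQuery(new Term("groupEnd", "x"));
   // BlockGroupingCollector takes the group-end query as a Weight; no scores are needed for it
   Weight groupEndWeight = s.createWeight(s.rewrite(groupEndDocs), ScoreMode.COMPLETE_NO_SCORES, 1f);
   BlockGroupingCollector c = new BlockGroupingCollector(groupSort, groupOffset+topNGroups, needsScores, groupEndWeight);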
   s.search(new TermQuery(new Term("content", searchTerm)), c);
   TopGroups groupsResult = c.getTopGroups(withinGroupSort, groupOffset, docOffset, docOffset+docsPerGroup, fillFields);
 
   // Render groupsResult...

Or alternatively use the GroupingSearch convenience utility:

   // Per search:
   GroupingSearch groupingSearch = new GroupingSearch(groupEndDocs);
   groupingSearch.setGroupSort(groupSort);
   groupingSearch.setIncludeScores(needsScores);
   TermQuery query = new TermQuery(new Term("content", searchTerm));
   TopGroups groupsResult = groupingSearch.search(indexSearcher, query, groupOffset, groupLimit);

   // Render groupsResult...

Note that the groupValue of each GroupDocs will be null, so if you need to present this value you'll have to retrieve it separately (for example, from stored fields or doc values).
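
The reason block indexing allows a single pass is that group boundaries can be read directly from document order. The sketch below illustrates just that boundary logic in plain Java, leaving out sorting and top-N selection. It is not the BlockGroupingCollector implementation, and the class name and array-based inputs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of single-pass grouping over doc blocks; not Lucene code.
final class BlockGroupingSketch {

    // hitDocIds: matching doc IDs in increasing order.
    // groupEndDocIds: IDs of the last document of every block, in increasing order.
    // Returns one list of hits for each block that contains at least one hit.
    static List<List<Integer>> group(int[] hitDocIds, int[] groupEndDocIds) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int endIdx = 0;
        for (int doc : hitDocIds) {
            // Skip past every block that ends before this hit; crossing the first
            // such boundary closes the group being built.
            while (endIdx < groupEndDocIds.length && groupEndDocIds[endIdx] < doc) {
                endIdx++;
                if (!current.isEmpty()) {
                    groups.add(current);
                    current = new ArrayList<>();
                }
            }
            current.add(doc);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Because hits arrive in doc ID order and each group occupies a contiguous ID range, no lookup of the group value is needed, which is also why the group value itself is not available in the result.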

Another collector is AllGroupHeadsCollector, which retrieves the most relevant document of each group, also known as the group head. This is useful when you want to compute group-based facets or statistics over the complete query result. The collector can be executed during the first or second phase. It can also be used through the GroupingSearch convenience utility, but if you only need the most relevant document per group it is simpler to use the collector directly, as shown below.

   TermGroupSelector grouper = new TermGroupSelector(groupField);
   AllGroupHeadsCollector c = AllGroupHeadsCollector.newCollector(grouper, sortWithinGroup);
   s.search(new TermQuery(new Term("content", searchTerm)), c);
   // Return all group heads as int array
   int[] groupHeadsArray = c.retrieveGroupHeads();
   // Return all group heads as FixedBitSet.
   int maxDoc = s.maxDoc();
   FixedBitSet groupHeadsBitSet = c.retrieveGroupHeads(maxDoc);
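
Conceptually, a group head is simply the best document seen so far for its group. The following plain-Java sketch illustrates the idea with relevance as the sort, and shows the two result shapes (array and bit set) using java.util.BitSet in place of FixedBitSet. The class, its methods and the array-based inputs are hypothetical and do not reflect the collector's internals.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of collecting group heads; not Lucene code.
final class GroupHeadsSketch {

    // groups[doc] is the group value and scores[doc] the score of hit doc; higher is better.
    // Returns the doc ID of the best hit per group, in order of first appearance of each group.
    static int[] groupHeads(String[] groups, float[] scores) {
        Map<String, Integer> heads = new LinkedHashMap<>();
        for (int doc = 0; doc < groups.length; doc++) {
            Integer current = heads.get(groups[doc]);
            if (current == null || scores[doc] > scores[current]) {
                heads.put(groups[doc], doc);
            }
        }
        int[] result = new int[heads.size()];
        int i = 0;
        for (int doc : heads.values()) {
            result[i++] = doc;
        }
        return result;
    }

    // Marks every group head in a bit set sized for maxDoc documents.
    static BitSet asBitSet(int[] heads, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : heads) {
            bits.set(doc);
        }
        return bits;
    }
}
```

The bit set form is convenient when the heads are used as a filter for a later facet or statistics computation, since membership checks by doc ID are constant time.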

Class Summary
Class	Description
AllGroupHeadsCollector<T>	This collector specializes in collecting the most relevant document (group head) for each group that matches the query.
AllGroupHeadsCollector.GroupHead<T>	Represents a group head.
AllGroupsCollector<T>	A collector that collects all groups that match the query.
BlockGroupingCollector	BlockGroupingCollector performs grouping with a single pass collector, as long as you are grouping by a doc block field, i.e. all documents sharing a given group value were indexed as a doc block using the atomic `IndexWriter.addDocuments()` or `IndexWriter.updateDocuments()` API.
CollectedSearchGroup<T>	Expert: representation of a group in `FirstPassGroupingCollector`, tracking the top doc and `FieldComparator` slot.
DistinctValuesCollector<T,R>	A second pass grouping collector that keeps track of distinct values for a specified field for the top N groups.
DistinctValuesCollector.GroupCount<T,R>	Returned by `DistinctValuesCollector.getGroups()`, representing the value and set of distinct values for the group.
DoubleRange	Represents a contiguous range of double values, with an inclusive minimum and exclusive maximum
DoubleRangeFactory	Groups double values into ranges
DoubleRangeGroupSelector	A GroupSelector implementation that groups documents by double values
FirstPassGroupingCollector<T>	FirstPassGroupingCollector is the first of two passes necessary to collect grouped hits.
GroupDocs<T>	Represents one group in the results.
GroupFacetCollector	Base class for computing grouped facets.
GroupFacetCollector.FacetEntry	Represents a facet entry with a value and a count.
GroupFacetCollector.GroupedFacetResult	The grouped facet result.
GroupFacetCollector.SegmentResult	Contains the local grouped segment counts for a particular segment.
GroupingSearch	Convenience class to perform grouping in a non distributed environment.
GroupReducer<T,C extends Collector>	Concrete implementations of this class define what to collect for individual groups during the second-pass of a grouping search.
GroupSelector<T>	Defines a group, for use by grouping collectors. A GroupSelector acts as an iterator over documents.
LongRange	Represents a contiguous range of long values, with an inclusive minimum and exclusive maximum
LongRangeFactory	Groups long values into ranges
LongRangeGroupSelector	A GroupSelector implementation that groups documents by long values
SearchGroup<T>	Represents a group that is found during the first pass search.
SecondPassGroupingCollector<T>	SecondPassGroupingCollector runs over an already collected set of groups, further applying a `GroupReducer` to each group
TermGroupFacetCollector	An implementation of `GroupFacetCollector` that computes grouped facets based on the indexed terms from DocValues.
TermGroupSelector	A GroupSelector implementation that groups via SortedDocValues
TopGroups<T>	Represents result returned by a grouping search.
TopGroupsCollector<T>	A second-pass collector that collects the TopDocs for each group, and returns them as a `TopGroups` object
ValueSourceGroupSelector	A GroupSelector that groups via a ValueSource

Enum Summary
Enum	Description
GroupSelector.State	What to do with the current value
TopGroups.ScoreMergeMode	How the GroupDocs score (if any) should be merged.
