Class DiversifiedTopDocsCollector

All Implemented Interfaces:
Collector

public abstract class DiversifiedTopDocsCollector extends TopDocsCollector<DiversifiedTopDocsCollector.ScoreDocKey>
A TopDocsCollector that controls diversity by ensuring that the final results contain no more than maxHitsPerKey hits from a common source.

An example application might be a product search in a marketplace where no more than 3 results per retailer are permitted in search results.
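The marketplace case can be made concrete with a minimal, self-contained sketch. It illustrates only the intended outcome (the best hits overall, capped per retailer) and is not how this collector is implemented: it sorts all hits first, which the collector does not need to do. The class `MarketplaceDiversityExample`, the `Hit` type, and the `diversify` method are hypothetical names introduced for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
final class MarketplaceDiversityExample {

    /** A scored search hit whose "source" key is the retailer name. */
    static final class Hit {
        final String retailer;
        final String product;
        final double score;

        Hit(String retailer, String product, double score) {
            this.retailer = retailer;
            this.product = product;
            this.score = score;
        }

        @Override
        public String toString() {
            return retailer + "/" + product;
        }
    }

    /** Returns up to numHits hits, best first, with at most maxHitsPerKey per retailer. */
    static List<Hit> diversify(List<Hit> hits, int numHits, int maxHitsPerKey) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        Map<String, Integer> acceptedPerRetailer = new HashMap<>();
        List<Hit> result = new ArrayList<>();
        for (Hit hit : sorted) {
            if (result.size() == numHits) {
                break; // enough results collected
            }
            int accepted = acceptedPerRetailer.getOrDefault(hit.retailer, 0);
            if (accepted < maxHitsPerKey) {
                acceptedPerRetailer.put(hit.retailer, accepted + 1);
                result.add(hit);
            }
            // otherwise this retailer already has its full quota: skip the hit
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("acme", "a1", 9.0),
            new Hit("acme", "a2", 8.5),
            new Hit("acme", "a3", 8.0),
            new Hit("acme", "a4", 7.5),
            new Hit("bolt", "b1", 7.0),
            new Hit("core", "c1", 6.0));
        // At most 3 results per retailer, 5 results in total.
        System.out.println(diversify(hits, 5, 3));
        // prints [acme/a1, acme/a2, acme/a3, bolt/b1, core/c1]
    }
}
```

Without the cap, the fourth "acme" product would have taken the place of "core/c1" in the top five.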

To compare behaviour with other forms of collector, a useful analogy might be the problem of making a compilation album of 1967's top hit records:

  1. A vanilla query's results might look like a "Best of the Beatles" album - high quality but not much diversity
  2. A GroupingSearch would produce the equivalent of "The 10 top-selling artists of 1967 - some killer and quite a lot of filler"
  3. A "diversified" query would be the top 20 hit records of that year - with a max of 3 Beatles hits in order to maintain diversity
This collector improves on GroupingSearch-style queries by:
  • Working in one pass over the data
  • Not requiring the client to guess how many groups are required
  • Removing low-scoring "filler" which sits at the end of each group's hits
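One way to obtain these properties in a single pass is sketched below, under stated assumptions: a global min-heap holds the current top hits, and a small min-heap per key holds that key's hits which are currently in the global heap. This is an illustrative plain-Java model, not Lucene's actual implementation; `OnePassDiversifier`, `Hit`, `collect`, and `topHits` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; not Lucene's implementation.
final class OnePassDiversifier {

    static final class Hit {
        final long key;
        final int doc;
        final float score;

        Hit(long key, int doc, float score) {
            this.key = key;
            this.doc = doc;
            this.score = score;
        }
    }

    private static final Comparator<Hit> WEAKEST_FIRST =
        (a, b) -> Float.compare(a.score, b.score);

    private final int numHits;
    private final int maxHitsPerKey;
    // All currently retained hits, weakest at the head.
    private final PriorityQueue<Hit> global = new PriorityQueue<Hit>(WEAKEST_FIRST);
    // Per key: that key's hits which are currently in the global queue.
    private final Map<Long, PriorityQueue<Hit>> perKey = new HashMap<>();

    OnePassDiversifier(int numHits, int maxHitsPerKey) {
        if (numHits < 1 || maxHitsPerKey < 1) {
            throw new IllegalArgumentException("numHits and maxHitsPerKey must be >= 1");
        }
        this.numHits = numHits;
        this.maxHitsPerKey = maxHitsPerKey;
    }

    /** Offers one hit; each document is seen exactly once. */
    void collect(long key, int doc, float score) {
        PriorityQueue<Hit> sameKey =
            perKey.computeIfAbsent(key, k -> new PriorityQueue<Hit>(WEAKEST_FIRST));
        if (sameKey.size() == maxHitsPerKey) {
            // Key quota is full: the new hit must beat this key's weakest hit.
            Hit weakestOfKey = sameKey.peek();
            if (score <= weakestOfKey.score) {
                return;
            }
            sameKey.poll();
            global.remove(weakestOfKey); // linear scan; acceptable for a sketch
        } else if (global.size() == numHits) {
            // Result set is full: the new hit must beat the weakest hit overall.
            Hit weakest = global.peek();
            if (score <= weakest.score) {
                return;
            }
            global.poll();
            perKey.get(weakest.key).remove(weakest);
        }
        Hit hit = new Hit(key, doc, score);
        sameKey.add(hit);
        global.add(hit);
    }

    /** Returns the retained hits, best first. */
    List<Hit> topHits() {
        List<Hit> result = new ArrayList<>(global);
        result.sort(WEAKEST_FIRST.reversed());
        return result;
    }

    public static void main(String[] args) {
        OnePassDiversifier collector = new OnePassDiversifier(4, 2);
        collector.collect(1L, 0, 9.0f);
        collector.collect(1L, 1, 8.0f);
        collector.collect(1L, 2, 9.5f); // replaces doc 1, the weakest hit for key 1
        collector.collect(2L, 3, 5.0f);
        collector.collect(3L, 4, 4.0f);
        collector.collect(1L, 5, 7.0f); // rejected: key 1 already holds two better hits
        collector.collect(4L, 6, 6.0f); // evicts doc 4, the weakest hit overall
        for (Hit hit : collector.topHits()) {
            System.out.println(hit.doc);
        }
        // prints 2, 0, 6, 3 (one per line)
    }
}
```

The caller specifies only the total number of hits and the per-key cap, not a number of groups, and a weak hit from a rare key is kept only if it beats the weakest hit overall.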
This is an abstract class: subclasses must provide a source of keys for documents, and those keys are used to identify documents that come from the same source.
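The division of labour between the abstract class and its subclasses can be sketched in plain Java as follows. The base class enforces the per-key cap and the subclass only answers "which source does this document belong to?". Everything here is hypothetical and simplified: `KeyedHitCounter`, `keyFor`, `accept`, and `ArrayKeyedHitCounter` are illustrative names, the `keyFor(int doc)` signature stands in for the real subclass hook, and scoring is ignored entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the abstract-key-source pattern; not the Lucene API.
abstract class KeyedHitCounter {
    private final int maxHitsPerKey;
    private final Map<Long, Integer> acceptedPerKey = new HashMap<>();

    KeyedHitCounter(int maxHitsPerKey) {
        this.maxHitsPerKey = maxHitsPerKey;
    }

    /** Subclass hook: returns the key that names the source of the given document. */
    protected abstract long keyFor(int doc);

    /** Returns false once the document's source has used up its quota. */
    final boolean accept(int doc) {
        long key = keyFor(doc);
        int accepted = acceptedPerKey.getOrDefault(key, 0);
        if (accepted >= maxHitsPerKey) {
            return false; // duplicate source: quota exhausted
        }
        acceptedPerKey.put(key, accepted + 1);
        return true;
    }
}

/** Supplies keys from a per-document array, for example a retailer id per product. */
final class ArrayKeyedHitCounter extends KeyedHitCounter {
    private final long[] keyByDoc;

    ArrayKeyedHitCounter(int maxHitsPerKey, long[] keyByDoc) {
        super(maxHitsPerKey);
        this.keyByDoc = keyByDoc.clone();
    }

    @Override
    protected long keyFor(int doc) {
        return keyByDoc[doc];
    }
}
```

For instance, with a cap of 2 and keys `{7, 7, 7, 8}` for documents 0 to 3, documents 0, 1, and 3 are accepted and document 2 is rejected. Choosing a different key source (say, brand instead of retailer) changes which documents count as duplicates without touching the base class.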
WARNING: This API is experimental and might change in incompatible ways in the next release.