Class DiversifiedTopDocsCollector

  • All Implemented Interfaces:
    Collector

    public abstract class DiversifiedTopDocsCollector
    extends TopDocsCollector<DiversifiedTopDocsCollector.ScoreDocKey>

    A TopDocsCollector that controls diversity in results by ensuring that no more than maxHitsPerKey hits from a common source are collected in the final results.

    An example application might be a product search in a marketplace where no more than 3 results per retailer are permitted in search results.
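
    The marketplace rule above can be sketched in plain Java. This is a simplified illustration of the per-key cap only, not the collector's implementation: it sorts all hits first instead of collecting them in a single streaming pass. The names DiversifySketch, Hit and diversify are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DiversifySketch {

    /** A scored hit tagged with the key of its source (hypothetical type, for illustration only). */
    static final class Hit {
        final String id;
        final String key;
        final float score;

        Hit(String id, String key, float score) {
            this.id = id;
            this.key = key;
            this.score = score;
        }
    }

    /** Returns up to numHits hit ids, best score first, keeping at most maxHitsPerKey per key. */
    static List<String> diversify(List<Hit> hits, int numHits, int maxHitsPerKey) {
        // Copy, then sort by descending score so the best hits are considered first.
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        Map<String, Integer> acceptedPerKey = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (Hit h : sorted) {
            if (result.size() == numHits) {
                break; // the result list is full
            }
            int accepted = acceptedPerKey.getOrDefault(h.key, 0);
            if (accepted < maxHitsPerKey) {
                // This source is still under its cap, so the hit is kept.
                acceptedPerKey.put(h.key, accepted + 1);
                result.add(h.id);
            }
        }
        return result;
    }
}
```

    With four hits from one retailer and one hit each from two others, calling diversify(hits, 5, 3) keeps the three best hits from the first retailer, skips its fourth, and fills the remaining places from the other two retailers.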

    To compare behaviour with other forms of collector, a useful analogy might be the problem of making a compilation album of 1967's top hit records:

    1. A vanilla query's results might look like a "Best of the Beatles" album - high quality but not much diversity
    2. A GroupingSearch would produce the equivalent of "The 10 top-selling artists of 1967 - some killer and quite a lot of filler"
    3. A "diversified" query would be the top 20 hit records of that year - with a max of 3 Beatles hits in order to maintain diversity

    This collector improves on GroupingSearch-style queries by:
    • Working in one pass over the data
    • Not requiring the client to guess how many groups are required
    • Removing low-scoring "filler" which sits at the end of each group's hits

    This is an abstract class: subclasses must provide a source of keys for documents, which is then used to identify hits that come from the same source.
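
    The division of labour between the abstract class and its subclasses might look roughly like the sketch below. It is a self-contained stand-in written under stated assumptions, not the Lucene class: KeyedCollectorSketch, ScoredDoc, keyFor, collect, topDocs and ArrayKeyedCollector are all hypothetical names, and the real key hook has its own signature, which should be taken from the Lucene Javadoc. The sketch also keeps one small heap per key for simplicity, which a production collector may avoid.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical stand-in for a diversifying collector; subclasses supply the key per document. */
abstract class KeyedCollectorSketch {

    static final class ScoredDoc {
        final int doc;
        final float score;

        ScoredDoc(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private static final Comparator<ScoredDoc> BY_SCORE = Comparator.comparingDouble(d -> d.score);

    private final int numHits;
    private final int maxHitsPerKey;
    // One min-heap per key, so the weakest hit kept for a key is cheap to find and evict.
    private final Map<Long, PriorityQueue<ScoredDoc>> perKey = new HashMap<>();

    KeyedCollectorSketch(int numHits, int maxHitsPerKey) {
        if (numHits < 0 || maxHitsPerKey < 1) {
            throw new IllegalArgumentException("numHits must be >= 0 and maxHitsPerKey >= 1");
        }
        this.numHits = numHits;
        this.maxHitsPerKey = maxHitsPerKey;
    }

    /** The only thing a subclass has to provide: the key identifying a document's source. */
    protected abstract long keyFor(int doc);

    /** Called once per matching document; keeps at most maxHitsPerKey hits for each key. */
    void collect(int doc, float score) {
        PriorityQueue<ScoredDoc> heap =
            perKey.computeIfAbsent(keyFor(doc), k -> new PriorityQueue<ScoredDoc>(BY_SCORE));
        if (heap.size() < maxHitsPerKey) {
            heap.add(new ScoredDoc(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll(); // evict the weakest hit for this key
            heap.add(new ScoredDoc(doc, score));
        }
    }

    /** Returns up to numHits document ids, best score first. */
    List<Integer> topDocs() {
        List<ScoredDoc> all = new ArrayList<>();
        for (PriorityQueue<ScoredDoc> heap : perKey.values()) {
            all.addAll(heap);
        }
        all.sort(BY_SCORE.reversed());
        List<Integer> docs = new ArrayList<>();
        for (ScoredDoc d : all.subList(0, Math.min(numHits, all.size()))) {
            docs.add(d.doc);
        }
        return docs;
    }
}

/** Example subclass: keys come from an array indexed by document id. */
final class ArrayKeyedCollector extends KeyedCollectorSketch {
    private final long[] keys;

    ArrayKeyedCollector(int numHits, int maxHitsPerKey, long[] keys) {
        super(numHits, maxHitsPerKey);
        this.keys = keys;
    }

    @Override
    protected long keyFor(int doc) {
        return keys[doc];
    }
}
```

    The subclass contains no collection logic at all. It only answers the question "which source does this document belong to?", which is the role the key source plays in the description above.
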
    WARNING: This API is experimental and might change in incompatible ways in the next release.