Class TrecContentSource

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class TrecContentSource
    extends ContentSource
    Implements a ContentSource over the TREC collection.

    Supports the following configuration parameters (on top of ContentSource):

    • work.dir - specifies the working directory. Required if "docs.dir" denotes a relative path (default=work).
    • docs.dir - specifies the directory where the TREC files reside. Can be set to a relative path if "work.dir" is also specified (default=trec).
    • trec.doc.parser - specifies the TrecDocParser class to use for parsing the content of the TREC documents (default=TrecGov2Parser).
    • html.parser - specifies the HTMLParser class to use for parsing the HTML parts of the TREC documents (default=DemoHTMLParser).
    • content.source.encoding - specifies the character encoding used to read the TREC files (default=ISO-8859-1).
    • content.source.excludeIteration - if true, the iteration number is not appended to the document name.
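
    The path and encoding defaults listed above can be illustrated with a small, self-contained sketch. It uses plain java.util.Properties rather than the benchmark's own configuration class, and the class TrecConfigSketch and its methods resolveDocsDir and resolveEncoding are hypothetical helpers written for this illustration; they are not part of Lucene, and the actual lookup inside TrecContentSource may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of Lucene: it only mirrors the defaults
// documented in the parameter list above.
class TrecConfigSketch {

    // Resolves docs.dir against work.dir when docs.dir is a relative path.
    // Defaults follow the documentation: work.dir=work, docs.dir=trec.
    static Path resolveDocsDir(Properties props) {
        Path workDir = Paths.get(props.getProperty("work.dir", "work"));
        Path docsDir = Paths.get(props.getProperty("docs.dir", "trec"));
        return docsDir.isAbsolute() ? docsDir : workDir.resolve(docsDir);
    }

    // Falls back to ISO-8859-1 when content.source.encoding is not set.
    static Charset resolveEncoding(Properties props) {
        String name = props.getProperty("content.source.encoding");
        return name == null ? StandardCharsets.ISO_8859_1 : Charset.forName(name);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("work.dir", "data");   // example value, an assumption
        props.setProperty("docs.dir", "gov2");   // relative, so resolved under work.dir
        System.out.println(resolveDocsDir(props));
        System.out.println(resolveEncoding(props));
    }
}
```

    With both work.dir and docs.dir left unset, the sketch resolves the TREC directory to "trec" under "work", which is how the two documented defaults would combine under this reading.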