Class TrecContentSource

All Implemented Interfaces:
Closeable, AutoCloseable

public class TrecContentSource extends ContentSource
Implements a ContentSource over the TREC collection.

Supports the following configuration parameters (on top of ContentSource):

  • work.dir - specifies the working directory. Required if "docs.dir" denotes a relative path (default=work).
  • docs.dir - specifies the directory where the TREC files reside. Can be set to a relative path if "work.dir" is also specified (default=trec).
  • trec.doc.parser - specifies the TrecDocParser class to use for parsing the content of the TREC documents (default=TrecGov2Parser).
  • html.parser - specifies the HTMLParser class to use for parsing the HTML parts of the TREC documents (default=DemoHTMLParser).
  • content.source.encoding - specifies the character encoding of the TREC files (default=ISO-8859-1).
  • content.source.excludeIteration - if true, the iteration number is not appended to the docname.
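
The interplay between "work.dir" and "docs.dir", and the encoding fallback, can be illustrated with a small standalone sketch. This is not the class's actual code: TrecConfigSketch, resolveDocsDir and resolveEncoding are hypothetical names, and plain java.util.Properties stands in for the benchmark configuration. It only mirrors the defaults listed above, on the assumption that a relative "docs.dir" is resolved against "work.dir".

```java
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of the library: mirrors the documented defaults.
class TrecConfigSketch {

    // Assumption: a relative docs.dir is resolved against work.dir;
    // an absolute docs.dir is used as-is.
    static Path resolveDocsDir(Properties props) {
        Path workDir = Paths.get(props.getProperty("work.dir", "work"));
        Path docsDir = Paths.get(props.getProperty("docs.dir", "trec"));
        return docsDir.isAbsolute() ? docsDir : workDir.resolve(docsDir);
    }

    // Falls back to ISO-8859-1 when content.source.encoding is not set.
    static Charset resolveEncoding(Properties props) {
        return Charset.forName(
            props.getProperty("content.source.encoding", "ISO-8859-1"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("work.dir", "bench");
        props.setProperty("docs.dir", "gov2");
        System.out.println(resolveDocsDir(props));   // relative path under "bench"
        System.out.println(resolveEncoding(props));  // documented default charset
    }
}
```

With neither property set, the sketch falls back to the "trec" directory under "work", matching the defaults listed above.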