Class TrecDocParser

java.lang.Object
org.apache.lucene.benchmark.byTask.feeds.TrecDocParser
Direct Known Subclasses:
TrecFBISParser, TrecFR94Parser, TrecFTParser, TrecGov2Parser, TrecLATimesParser, TrecParserByPath

public abstract class TrecDocParser extends Object
Parser for TREC document content. It is invoked on the document text excluding <DOC> and <DOCNO>, which are handled in TrecContentSource. Implementations are required to be stateless and hence thread-safe.
  • Field Details

  • Constructor Details

    • TrecDocParser

      public TrecDocParser()
  • Method Details

    • pathType

      public static TrecDocParser.ParsePathType pathType(Path f)
      Compute the path type of a file by inspecting the name of the file and the names of its parent directories.
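The lookup described above could be sketched roughly as follows. This is a hypothetical stand-in, not Lucene's implementation: the `PathType` enum, its constants (guessed from the names of the known subclasses), the case-insensitive exact-name match, and the `UNKNOWN` fallback are all assumptions made for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class PathTypeSketch {
    // Hypothetical enum; the constants are assumed from the subclass names, not taken from Lucene.
    enum PathType { GOV2, FBIS, FT, FR94, LATIMES, UNKNOWN }

    // Walk from the file up through its parents and return the first name that matches a type.
    static PathType pathType(Path f) {
        for (Path p = f; p != null; p = p.getParent()) {
            Path fileName = p.getFileName();
            if (fileName == null) {
                continue; // a root component has no name
            }
            String name = fileName.toString().toUpperCase(Locale.ROOT);
            for (PathType t : PathType.values()) {
                if (t != PathType.UNKNOWN && name.equals(t.name())) {
                    return t;
                }
            }
        }
        return PathType.UNKNOWN; // assumed fallback when nothing matches
    }

    public static void main(String[] args) {
        System.out.println(pathType(Paths.get("data", "fbis", "sub", "file.txt")));
        System.out.println(pathType(Paths.get("tmp", "x.txt")));
    }
}
```

In this sketch a file under a directory named `fbis` resolves to `FBIS` even when intermediate directories do not match, which is one plausible reading of "inspecting the name of the file and its parents".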
    • parse

      public abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) throws IOException
      Parse the text prepared in docBuf into a result DocData. No synchronization is required.
      Parameters:
      docData - reusable result
      name - name to set on the result
      trecSrc - calling trec content source
      docBuf - text to parse
      pathType - type of the parsed file, or null if unknown; parsers may use it to alter their behavior according to the file path type.
      Throws:
      IOException
    • stripTags

      public static String stripTags(StringBuilder buf, int start)
      Strip tags from buf: each tag is replaced by a single blank.
      Returns:
      the text obtained by stripping all tags from buf (the input StringBuilder is not modified).
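A minimal sketch of the behavior described above, written against the plain JDK rather than Lucene. It assumes that a "tag" is anything between `<` and the next `>`, and that `start` is the offset at which stripping begins, with the text before it dropped. Both are assumptions, since the description does not define them.

```java
public class StripTagsSketch {
    // Hypothetical stand-in, not Lucene's code: replace each <...> run with one blank.
    static String stripTags(StringBuilder buf, int start) {
        // substring() copies the characters, so the input StringBuilder is left untouched.
        return buf.substring(start).replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder("<DOC><P>Hello</P> world");
        // Start after the 5-character "<DOC>" prefix; brackets make the blanks visible.
        System.out.println("[" + stripTags(buf, 5) + "]");
    }
}
```

Because each tag becomes a blank rather than being deleted, words on either side of a tag stay separated, at the cost of runs of consecutive blanks where tags were adjacent to whitespace.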
    • stripTags

      public static String stripTags(String buf, int start)
      Strip tags from input.
      See Also:
      stripTags(StringBuilder, int)
    • extract

      public static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
      Extract from buf the text of interest within the specified tags.
      Parameters:
      buf - entire input text
      startTag - tag marking start of text of interest
      endTag - tag marking end of text of interest
      maxPos - if ≥ 0, sets a limit on the start position of the text of interest
      Returns:
      the text of interest, or null if not found
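The extraction described above might look something like the following plain-JDK sketch. It is not Lucene's implementation. Two points are assumptions: `maxPos` is applied only to the position of the start tag, and `noisePrefixes` (which is not documented above) is treated as a list of leading strings to drop from the extracted text.

```java
public class ExtractSketch {
    // Hypothetical stand-in: return the trimmed text between startTag and endTag, or null.
    static String extract(StringBuilder buf, String startTag, String endTag,
                          int maxPos, String[] noisePrefixes) {
        int k1 = buf.indexOf(startTag);
        if (k1 < 0 || (maxPos >= 0 && k1 >= maxPos)) {
            return null; // start tag missing, or found beyond the allowed limit
        }
        k1 += startTag.length();
        int k2 = buf.indexOf(endTag, k1);
        if (k2 < 0) {
            return null; // no closing tag after the start tag
        }
        String text = buf.substring(k1, k2).trim();
        if (noisePrefixes != null) {
            for (String noise : noisePrefixes) {
                if (text.startsWith(noise)) {
                    // Assumed meaning of noisePrefixes: skip a known leading label.
                    text = text.substring(noise.length()).trim();
                }
            }
        }
        return text;
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder(
            "<DOC><DATE>Date: 1994-01-02</DATE><TEXT>body</TEXT>");
        System.out.println(extract(buf, "<DATE>", "</DATE>", -1, new String[] {"Date:"}));
        System.out.println(extract(buf, "<TEXT>", "</TEXT>", 10, null));
    }
}
```

Returning null rather than an empty string lets a caller distinguish "tag not present" from "tag present but empty", which matches the documented return value.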