Class TrecDocParser

    • Constructor Detail

      • TrecDocParser

        public TrecDocParser()
    • Method Detail

      • parse

        public abstract DocData parse(DocData docData,
                                      String name,
                                      TrecContentSource trecSrc,
                                      StringBuilder docBuf,
                                      TrecDocParser.ParsePathType pathType)
                               throws IOException
        Parses the text prepared in docBuf into a result DocData. No synchronization is required.
        Parameters:
        docData - reusable result
        name - name that should be set to the result
        trecSrc - calling trec content source
        docBuf - text to parse
        pathType - type of the parsed file, or null if unknown; parsers may use it to alter their behavior according to the file path type.
        Throws:
        IOException
      • stripTags

        public static String stripTags(StringBuilder buf,
                                       int start)
        Strips tags from buf: each tag is replaced by a single blank.
        Returns:
        the text obtained by stripping all tags from buf (the input StringBuilder is not modified).
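
        The behavior described above can be illustrated with a minimal, self-contained sketch. It is not the Lucene implementation: the class name, the method name stripTagsSketch, and the regular expression are hypothetical, and it assumes that a tag is any text between '<' and the next '>' and that start is the offset in buf at which processing begins.

```java
import java.util.regex.Pattern;

public class StripTagsSketch {
    // Assumption: a "tag" is '<', any characters except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Hypothetical stand-in for stripTags; not the library's code. */
    static String stripTagsSketch(StringBuilder buf, int start) {
        // substring() copies the characters, so buf itself is never modified.
        // Assumption: text before 'start' is dropped from the result.
        return TAG.matcher(buf.substring(start)).replaceAll(" ");
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder("<DOC><TITLE>Hello</TITLE> world</DOC>");
        // Brackets make the leading and trailing blanks visible.
        System.out.println("[" + stripTagsSketch(buf, 0) + "]");
        // The input buffer still holds the original markup.
        System.out.println(buf);
    }
}
```

        Because every tag becomes one blank, adjacent tags produce runs of blanks; callers that need compact text would have to collapse or trim whitespace themselves.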
      • extract

        public static String extract(StringBuilder buf,
                                     String startTag,
                                     String endTag,
                                     int maxPos,
                                     String[] noisePrefixes)
        Extracts from buf the text of interest within the specified tags.
        Parameters:
        buf - entire input text
        startTag - tag marking start of text of interest
        endTag - tag marking end of text of interest
        maxPos - if ≥ 0, sets a limit on the start of the text of interest
        Returns:
        text of interest or null if not found
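
        The documented contract of extract can be sketched as follows. This is an illustration under stated assumptions, not the Lucene implementation: the class and the method extractSketch are hypothetical, maxPos is interpreted as an exclusive upper bound on the position of startTag, and the noisePrefixes parameter is left out because its behavior is not described above.

```java
public class ExtractSketch {
    /**
     * Hypothetical stand-in for extract; not the library's code.
     * Returns the text between startTag and endTag, or null if not found.
     */
    static String extractSketch(StringBuilder buf, String startTag, String endTag, int maxPos) {
        int tagAt = buf.indexOf(startTag);
        // Assumption: a non-negative maxPos rejects start tags at or beyond it.
        if (tagAt < 0 || (maxPos >= 0 && tagAt >= maxPos)) {
            return null;
        }
        int from = tagAt + startTag.length();
        int to = buf.indexOf(endTag, from);
        if (to < 0) {
            return null; // start tag without a matching end tag
        }
        return buf.substring(from, to);
    }

    public static void main(String[] args) {
        StringBuilder buf =
            new StringBuilder("<DOC><DATE>1994-01-01</DATE><TEXT>body</TEXT></DOC>");
        System.out.println(extractSketch(buf, "<DATE>", "</DATE>", -1));
        // <TEXT> begins at offset 28, beyond the limit of 10.
        System.out.println(extractSketch(buf, "<TEXT>", "</TEXT>", 10));
        // A tag that is absent from the input.
        System.out.println(extractSketch(buf, "<HEADLINE>", "</HEADLINE>", -1));
    }
}
```

        Returning null rather than an empty string lets a caller distinguish a missing field from a field that is present but empty.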