Class TrecDocParser
- java.lang.Object
- org.apache.lucene.benchmark.byTask.feeds.TrecDocParser
- Direct Known Subclasses: TrecFBISParser, TrecFR94Parser, TrecFTParser, TrecGov2Parser, TrecLATimesParser, TrecParserByPath
public abstract class TrecDocParser extends Object
Parser for TREC document content, invoked on the document text excluding <DOC> and <DOCNO>, which are handled in TrecContentSource. Implementations are required to be stateless and hence thread-safe.
-
Nested Class Summary
Modifier and Type | Class | Description
static class | TrecDocParser.ParsePathType | Types of trec parse paths.
-
Field Summary
Modifier and Type | Field | Description
static TrecDocParser.ParsePathType | DEFAULT_PATH_TYPE | trec parser type used for unknown extensions
-
Constructor Summary
Constructor | Description
TrecDocParser() |
-
Method Summary
Modifier and Type | Method | Description
static String | extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes) | Extract from buf the text of interest within specified tags.
abstract DocData | parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) | Parse the text prepared in docBuf into a result DocData; no synchronization is required.
static TrecDocParser.ParsePathType | pathType(Path f) | Compute the path type of a file by inspecting the name of the file and its parents.
static String | stripTags(StringBuilder buf, int start) | Strip tags from buf: each tag is replaced by a single blank.
static String | stripTags(String buf, int start) | Strip tags from input.
-
Field Detail
-
DEFAULT_PATH_TYPE
public static final TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
trec parser type used for unknown extensions
-
Method Detail
-
pathType
public static TrecDocParser.ParsePathType pathType(Path f)
Compute the path type of a file by inspecting the name of the file and its parents.
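The lookup described above can be pictured with a minimal standalone sketch. It does not use the Lucene classes: the enum PathKind, the map NAME_TO_KIND, and the method kindOf are hypothetical stand-ins, and the exact matching rules (case-insensitive comparison of each path element) are an assumption for illustration only.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class PathTypeSketch {
    // Hypothetical stand-in for TrecDocParser.ParsePathType; constants are illustrative.
    enum PathKind { GOV2, FBIS, UNKNOWN }

    // Plays the role of DEFAULT_PATH_TYPE: returned when nothing matches.
    static final PathKind DEFAULT_KIND = PathKind.UNKNOWN;

    // Hypothetical table from an upper-cased path element to a kind.
    static final Map<String, PathKind> NAME_TO_KIND = new HashMap<>();
    static {
        NAME_TO_KIND.put("GOV2", PathKind.GOV2);
        NAME_TO_KIND.put("FBIS", PathKind.FBIS);
    }

    // Inspect the file name first, then each parent directory in turn.
    static PathKind kindOf(Path f) {
        for (Path p = f; p != null && p.getFileName() != null; p = p.getParent()) {
            String element = p.getFileName().toString().toUpperCase(Locale.ROOT);
            PathKind kind = NAME_TO_KIND.get(element);
            if (kind != null) {
                return kind;
            }
        }
        return DEFAULT_KIND;
    }

    public static void main(String[] args) {
        System.out.println(kindOf(Paths.get("trec", "gov2", "GX000", "00.gz")));
        System.out.println(kindOf(Paths.get("data", "misc.txt")));
    }
}
```

In this sketch the first path resolves through its "gov2" ancestor directory, while the second matches nothing and falls back to the default kind.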
-
parse
public abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) throws IOException
Parse the text prepared in docBuf into a result DocData; no synchronization is required.
- Parameters:
docData - reusable result
name - name that should be set to the result
trecSrc - calling trec content source
docBuf - text to parse
pathType - type of parsed file, or null if unknown; may be used by parsers to alter their behavior according to the file path type.
- Throws:
IOException
-
stripTags
public static String stripTags(StringBuilder buf, int start)
Strip tags from buf: each tag is replaced by a single blank.
- Returns:
text obtained when stripping all tags from buf (the input StringBuilder is unmodified).
-
stripTags
public static String stripTags(String buf, int start)
Strip tags from input.
- See Also:
stripTags(StringBuilder, int)
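To make the "each tag is replaced by a single blank" behavior concrete, here is a minimal standalone sketch. It is not the Lucene implementation. It assumes a tag is any run of characters between '<' and the next '>', and that start is the offset at which processing begins; both are assumptions for illustration. The class name StripTagsSketch is hypothetical.

```java
class StripTagsSketch {
    // Assumption: a "tag" is '<', any characters other than '>', then '>'.
    static String stripTags(String buf, int start) {
        return buf.substring(start).replaceAll("<[^>]*>", " ");
    }

    // substring() copies the text, so the StringBuilder passed in is left unmodified.
    static String stripTags(StringBuilder buf, int start) {
        return stripTags(buf.substring(start), 0);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("<P>Hello <B>world</B></P>");
        // Brackets make the blanks left by the removed tags visible.
        System.out.println("[" + stripTags(sb, 0) + "]");
        System.out.println(sb);
    }
}
```

Adjacent tags each leave their own blank, so callers that need normalized whitespace would have to collapse runs of blanks themselves.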
-
extract
public static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
Extract from buf the text of interest within specified tags.
- Parameters:
buf - entire input text
startTag - tag marking start of text of interest
endTag - tag marking end of text of interest
maxPos - if ≥ 0, sets a limit on start of text of interest
- Returns:
text of interest, or null if not found
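The parameter contract above can be sketched in standalone form as follows. This is not the Lucene implementation. The class ExtractSketch is hypothetical, the result is trimmed as an illustrative choice, and since noisePrefixes is not documented on this page, the sketch assumes it lists strings to skip when they appear between the tags.

```java
class ExtractSketch {
    static String extract(StringBuilder buf, String startTag, String endTag,
                          int maxPos, String[] noisePrefixes) {
        int begin = buf.indexOf(startTag);
        // maxPos >= 0 limits where the text of interest may start.
        if (begin < 0 || (maxPos >= 0 && begin >= maxPos)) {
            return null;
        }
        begin += startTag.length();
        int end = buf.indexOf(endTag, begin);
        if (end < 0) {
            return null;
        }
        // Assumption: each noise prefix found inside the span is skipped.
        if (noisePrefixes != null) {
            for (String noise : noisePrefixes) {
                int at = buf.indexOf(noise, begin);
                if (at >= 0 && at < end) {
                    begin = at + noise.length();
                }
            }
        }
        return buf.substring(begin, end).trim();
    }

    public static void main(String[] args) {
        StringBuilder doc =
            new StringBuilder("<DATE>Date: 1994-01-01</DATE><TEXT>body</TEXT>");
        System.out.println(extract(doc, "<DATE>", "</DATE>", -1, new String[] {"Date:"}));
        System.out.println(extract(doc, "<TITLE>", "</TITLE>", -1, null));
    }
}
```

Returning null rather than an empty string lets a caller distinguish a missing tag pair from a pair that is present but empty.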
-