Class TrecDocParser
java.lang.Object
org.apache.lucene.benchmark.byTask.feeds.TrecDocParser
Direct Known Subclasses:
TrecFBISParser, TrecFR94Parser, TrecFTParser, TrecGov2Parser, TrecLATimesParser, TrecParserByPath
Parser for TREC document content. It is invoked on the document text excluding <DOC> and <DOCNO>, which
are handled in TrecContentSource. Implementations are required to be stateless and hence thread-safe.
-
Nested Class Summary
Modifier and Type: static enum
Class: TrecDocParser.ParsePathType
Description: Types of trec parse paths.
-
Field Summary
Modifier and Type: static final TrecDocParser.ParsePathType
Field: DEFAULT_PATH_TYPE
Description: trec parser type used for unknown extensions
-
Constructor Summary
TrecDocParser()
-
Method Summary
static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
  Extract from buf the text of interest within specified tags.
abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType)
  Parse the text prepared in docBuf into a result DocData; no synchronization is required.
static TrecDocParser.ParsePathType pathType
  Compute the path type of a file by inspecting the name of the file and its parents.
static String stripTags(StringBuilder buf, int start)
  Strip tags from buf: each tag is replaced by a single blank.
static String stripTags
  Strip tags from input.
-
Field Details
-
DEFAULT_PATH_TYPE
trec parser type used for unknown extensions
-
-
Constructor Details
-
TrecDocParser
public TrecDocParser()
-
-
Method Details
-
pathType
Compute the path type of a file by inspecting the name of the file and its parents.
-
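The lookup described for pathType can be pictured as a walk from the file name up through its parent directories. The following is a minimal standalone sketch of that idea, not Lucene's implementation: the Kind enum, its constants, DEFAULT_KIND, and the case-insensitive name match are all assumptions made for illustration, and the java.nio.file.Path parameter is likewise assumed.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

final class PathTypeSketch {
    // Hypothetical stand-in for TrecDocParser.ParsePathType; constants are illustrative only.
    enum Kind { GOV2, FBIS, FT, FR94, LATIMES }

    // Assumption: an arbitrary default, standing in for DEFAULT_PATH_TYPE.
    static final Kind DEFAULT_KIND = Kind.GOV2;

    // Inspect the file name first, then each parent directory name in turn.
    static Kind pathType(Path file) {
        for (Path p = file; p != null && p.getFileName() != null; p = p.getParent()) {
            String name = p.getFileName().toString().toUpperCase(Locale.ROOT);
            for (Kind k : Kind.values()) {
                if (k.name().equals(name)) {
                    return k; // first matching path element wins
                }
            }
        }
        return DEFAULT_KIND; // nothing in the path identified a type
    }

    public static void main(String[] args) {
        System.out.println(pathType(Paths.get("trec", "latimes", "la010189.gz")));
        System.out.println(pathType(Paths.get("data", "misc", "x.txt")));
    }
}
```

In this sketch the nearest matching path element decides the result, so a file nested under a recognized directory inherits that directory's type.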
parse
public abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) throws IOException
Parse the text prepared in docBuf into a result DocData. No synchronization is required.
Parameters:
docData - reusable result
name - name that should be set to the result
trecSrc - calling trec content source
docBuf - text to parse
pathType - type of parsed file, or null if unknown; may be used by parsers to alter their behavior according to the file path type.
Throws:
IOException
-
stripTags
static String stripTags(StringBuilder buf, int start)
Strip tags from buf: each tag is replaced by a single blank.
Returns:
text obtained when stripping all tags from buf (input StringBuilder is unmodified).
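To make the "each tag is replaced by a single blank" behavior concrete, here is a minimal standalone sketch that mimics it without depending on Lucene. It is not the library's code: the regular expression used to recognize a tag and the reading of start as the offset at which processing begins are both assumptions.

```java
import java.util.regex.Pattern;

final class StripTagsSketch {
    // Assumption: a tag is '<', any characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns a new String; buf.substring copies, so the input buffer is never modified.
    // Assumption: 'start' is the offset in buf from which text is taken.
    static String stripTags(StringBuilder buf, int start) {
        return TAG.matcher(buf.substring(start)).replaceAll(" ");
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder("<TEXT>Hello <b>world</b></TEXT>");
        // Brackets make the leading and trailing blanks visible.
        System.out.println("[" + stripTags(buf, 0) + "]");
    }
}
```

Because every tag becomes one blank, adjacent tags produce runs of blanks; callers that need compact text would have to trim or collapse whitespace themselves.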
-
stripTags
Strip tags from input.
See Also:
stripTags(StringBuilder buf, int start)
-
extract
public static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
Extract from buf the text of interest within specified tags.
Parameters:
buf - entire input text
startTag - tag marking start of text of interest
endTag - tag marking end of text of interest
maxPos - if ≥ 0 sets a limit on start of text of interest
Returns:
text of interest or null if not found
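The contract of extract (find the text between two tags, honor maxPos, return null when nothing is found) can be illustrated with a small standalone sketch. This is an illustration under stated assumptions rather than Lucene's implementation: noisePrefixes is not described above, so treating each entry as a leading marker to skip is an assumption, as is trimming the result and applying maxPos only to the position of the start tag.

```java
final class ExtractSketch {
    static String extract(StringBuilder buf, String startTag, String endTag,
                          int maxPos, String[] noisePrefixes) {
        int from = buf.indexOf(startTag);
        // A negative maxPos means "no limit"; otherwise the start tag must occur before it.
        if (from < 0 || (maxPos >= 0 && from >= maxPos)) {
            return null;
        }
        from += startTag.length();
        int to = buf.indexOf(endTag, from);
        if (to < 0) {
            return null; // start tag without a matching end tag
        }
        // Assumption: a noise prefix found inside the tagged span is skipped over.
        if (noisePrefixes != null) {
            for (String noise : noisePrefixes) {
                int n = buf.indexOf(noise, from);
                if (n >= 0 && n + noise.length() <= to) {
                    from = n + noise.length();
                }
            }
        }
        // Assumption: surrounding whitespace is not part of the text of interest.
        return buf.substring(from, to).trim();
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder(
            "<DOC><DATE>Date: 1994-01-02</DATE><TEXT>body</TEXT>");
        System.out.println(extract(buf, "<DATE>", "</DATE>", -1, new String[] {"Date:"}));
        System.out.println(extract(buf, "<TEXT>", "</TEXT>", 10, null));
    }
}
```

Returning null rather than an empty string lets a caller distinguish "tag absent" from "tag present but empty", which matches the documented return value.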
-