public abstract class TrecDocParser extends Object
| Modifier and Type | Class and Description |
|---|---|
| static class | TrecDocParser.ParsePathType: Types of TREC parse paths. |
| Modifier and Type | Field and Description |
|---|---|
| static TrecDocParser.ParsePathType | DEFAULT_PATH_TYPE: TREC parser type used for unknown extensions. |
| Constructor and Description |
|---|
| TrecDocParser() |
| Modifier and Type | Method and Description |
|---|---|
| static String | extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes): Extract from buf the text of interest within specified tags. |
| abstract DocData | parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType): Parse the text prepared in docBuf into a result DocData; no synchronization is required. |
| static TrecDocParser.ParsePathType | pathType(File f): Compute the path type of a file by inspecting the name of the file and its parents. |
| static String | stripTags(StringBuilder buf, int start): Strip tags from buf: each tag is replaced by a single blank. |
| static String | stripTags(String buf, int start): Strip tags from input. |
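The tag-stripping behavior described for stripTags (each tag replaced by a single blank, input left unmodified) can be illustrated with a minimal, self-contained sketch. This is not the library's code: the class name StripTagsSketch, the method name stripTagsSketch, and the choice of regular expression are assumptions made for illustration, as is the reading that only text from index start onward is returned.

```java
import java.util.regex.Pattern;

public class StripTagsSketch {
    // Assumption: a "tag" is any run of characters from '<' up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Hypothetical stand-in for stripTags(String, int); not the library code. */
    public static String stripTagsSketch(String buf, int start) {
        // Assumption: only the text from index `start` onward is considered.
        String tail = start > 0 ? buf.substring(start) : buf;
        // Each tag is replaced by a single blank.
        return TAG.matcher(tail).replaceAll(" ");
    }

    /** StringBuilder variant: reads from the builder without modifying it. */
    public static String stripTagsSketch(StringBuilder buf, int start) {
        return stripTagsSketch(buf.substring(start), 0);
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("<DOC><TITLE>Hello</TITLE> world</DOC>");
        // Skip the leading "<DOC>" (5 characters) and strip the remaining tags.
        System.out.println("[" + stripTagsSketch(doc, 5) + "]");
    }
}
```

Because each tag becomes a blank rather than being deleted, words on either side of a tag stay separated; callers that need compact text would still have to collapse whitespace themselves.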
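public static final TrecDocParser.ParsePathType DEFAULT_PATH_TYPE
TREC parser type used for unknown extensions.

public static TrecDocParser.ParsePathType pathType(File f)
Compute the path type of a file by inspecting the name of the file and its parents.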
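Since pathType is described as inspecting the name of the file and its parents, the general idea can be sketched as a walk from the file up through its ancestor directories. Everything below is hypothetical: the enum SketchPathType, its values NEWS and WEB, the DEFAULT constant, and the case-insensitive name match are illustrative assumptions and do not reflect the actual ParsePathType values or matching rules.

```java
import java.io.File;
import java.util.Locale;

public class PathTypeSketch {
    /** Hypothetical path types; the real ParsePathType values are not shown in this page. */
    public enum SketchPathType { NEWS, WEB }

    /** Hypothetical fallback, playing the role of DEFAULT_PATH_TYPE. */
    public static final SketchPathType DEFAULT = SketchPathType.WEB;

    /** Hypothetical stand-in for pathType(File); not the library code. */
    public static SketchPathType pathTypeSketch(File f) {
        // Check the file's own name first, then the name of each parent directory.
        for (File cur = f; cur != null; cur = cur.getParentFile()) {
            String name = cur.getName().toUpperCase(Locale.ROOT);
            for (SketchPathType t : SketchPathType.values()) {
                // Assumption: a path component matches when it equals the type name, ignoring case.
                if (t.name().equals(name)) {
                    return t;
                }
            }
        }
        // No component matched: fall back to the default type.
        return DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(pathTypeSketch(new File("corpus/news/2001/a.txt")));
        System.out.println(pathTypeSketch(new File("misc/a.txt")));
    }
}
```

Under these assumptions the nearest matching ancestor wins, so a file nested several levels below a recognizable directory still resolves to that directory's type.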
public abstract DocData parse(DocData docData, String name, TrecContentSource trecSrc, StringBuilder docBuf, TrecDocParser.ParsePathType pathType) throws IOException, InterruptedException
Parse the text prepared in docBuf into a result DocData; no synchronization is required.
Parameters:
docData - reusable result
name - name that should be set to the result
trecSrc - calling TREC content source
docBuf - text to parse
pathType - type of parsed file, or null if unknown; parsers may use it to alter their behavior according to the file path type
Throws:
IOException
InterruptedException

public static String stripTags(StringBuilder buf, int start)
Strip tags from buf: each tag is replaced by a single blank. The input StringBuilder is unmodified.

public static String stripTags(String buf, int start)
Strip tags from input.
See also: stripTags(StringBuilder, int)

public static String extract(StringBuilder buf, String startTag, String endTag, int maxPos, String[] noisePrefixes)
Extract from buf the text of interest within specified tags.
Parameters:
buf - entire input text
startTag - tag marking start of text of interest
endTag - tag marking end of text of interest
maxPos - if ≥ 0 sets a limit on start of text of interest
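The extraction described for extract (locate a start tag, honor the maxPos limit, and return the text up to the end tag) can be sketched as follows. This is a simplified illustration, not the library's implementation: the names ExtractSketch and extractSketch are hypothetical, noise-prefix handling is left out because this page does not describe it, and returning null on a miss, applying maxPos to the start tag's position, and trimming the result are all assumptions.

```java
public class ExtractSketch {
    /**
     * Hypothetical stand-in for extract(...); noise-prefix handling is omitted.
     * Returns null when the tags are not found (an assumption of this sketch).
     */
    public static String extractSketch(StringBuilder buf, String startTag, String endTag, int maxPos) {
        int from = buf.indexOf(startTag);
        // Assumption: when maxPos >= 0, the start tag must begin before maxPos.
        if (from < 0 || (maxPos >= 0 && from >= maxPos)) {
            return null;
        }
        // Move past the start tag to the first character of the text of interest.
        from += startTag.length();
        int to = buf.indexOf(endTag, from);
        if (to < 0) {
            return null;
        }
        // Assumption: surrounding whitespace is trimmed from the extracted text.
        return buf.substring(from, to).trim();
    }

    public static void main(String[] args) {
        StringBuilder doc = new StringBuilder("<DOC><DATE> 1994 </DATE></DOC>");
        System.out.println(extractSketch(doc, "<DATE>", "</DATE>", -1));
        System.out.println(extractSketch(doc, "<DATE>", "</DATE>", 3));
    }
}
```

A negative maxPos disables the limit in this sketch, which is one way to read "if ≥ 0 sets a limit on start of text of interest"; a limit like this lets a caller restrict the search to a document's header region without scanning the body.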