Class DemoHTMLParser
java.lang.Object
  org.apache.lucene.benchmark.byTask.feeds.DemoHTMLParser

All Implemented Interfaces: HTMLParser
public class DemoHTMLParser extends Object implements HTMLParser
Simple HTML parser, based on NekoHTML, that extracts the title, meta tags, and body text.
Nested Class Summary
Modifier and Type | Class | Description
static class | DemoHTMLParser.Parser | The actual parser to read HTML documents
Constructor Summary
Constructor | Description
DemoHTMLParser() |
Method Summary
Modifier and Type | Method | Description
DocData | parse(DocData docData, String name, Date date, Reader reader, TrecContentSource trecSrc) | Parse the input Reader and return DocData.
DocData | parse(DocData docData, String name, Date date, InputSource source, TrecContentSource trecSrc) |
Method Detail
-
parse
public DocData parse(DocData docData, String name, Date date, Reader reader, TrecContentSource trecSrc) throws IOException
Description copied from interface: HTMLParser
Parse the input Reader and return DocData. The provided name, title, and date are used for the result unless they are null, in which case an attempt is made to set them from the parsed data.
- Specified by:
parse in interface HTMLParser
- Parameters:
docData - the DocData instance reused for the result.
name - name of the result doc data.
date - date of the result doc data. If null, attempt to set by parsed data.
reader - reader of HTML text to parse.
trecSrc - the TrecContentSource used to parse dates.
- Returns:
Parsed doc data.
- Throws:
IOException - If there is a low-level I/O error.
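The null-fallback rule in the description above can be illustrated without Lucene on the classpath. The following is a minimal sketch of the documented contract only, not a copy of the DemoHTMLParser implementation; the class NullFallbackSketch and the helper providedOrParsed are hypothetical names introduced for illustration.

```java
public class NullFallbackSketch {

    /**
     * Hypothetical helper: returns the caller-supplied value unless it is
     * null, in which case the value extracted from the document is used.
     */
    public static <T> T providedOrParsed(T provided, T parsed) {
        return provided != null ? provided : parsed;
    }

    public static void main(String[] args) {
        // The caller supplied a name, so the parsed value is ignored.
        String name = providedOrParsed("doc-1", "parsed-name");
        // The caller passed null for the title, so the parsed value is used.
        String title = providedOrParsed(null, "Parsed Title");

        System.out.println(name);  // prints "doc-1"
        System.out.println(title); // prints "Parsed Title"
    }
}
```

Under this reading, passing null for a field is how a caller asks for the value found in the HTML.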
-
parse
public DocData parse(DocData docData, String name, Date date, InputSource source, TrecContentSource trecSrc) throws IOException, SAXException
- Throws:
IOException
SAXException
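This overload takes an org.xml.sax.InputSource rather than a Reader. As a minimal sketch using only JDK classes, an InputSource can be built from in-memory HTML as shown below. The class InputSourceSketch, the helper fromHtml, and the identifier "doc-1" are hypothetical; the call to parse appears only as a comment because it needs the Lucene benchmark module on the classpath and a TrecContentSource instance (trecSrc) that the caller is assumed to have.

```java
import java.io.StringReader;
import org.xml.sax.InputSource;

public class InputSourceSketch {

    /** Hypothetical helper: wraps in-memory HTML in a character-stream InputSource. */
    public static InputSource fromHtml(String html, String systemId) {
        InputSource source = new InputSource(new StringReader(html));
        // Optional identifier for the input.
        source.setSystemId(systemId);
        return source;
    }

    public static void main(String[] args) {
        InputSource source = fromHtml(
            "<html><head><title>Hi</title></head><body>text</body></html>", "doc-1");
        System.out.println(source.getSystemId()); // prints "doc-1"

        // Assumed usage with Lucene available (not compiled here):
        // DocData result = new DemoHTMLParser()
        //     .parse(new DocData(), "doc-1", null, source, trecSrc);
    }
}
```

Callers of this overload must handle SAXException in addition to IOException, as listed in its throws clause.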