org.apache.lucene.ant
Class HtmlDocument

java.lang.Object
  extended by org.apache.lucene.ant.HtmlDocument

public class HtmlDocument
extends Object

The HtmlDocument class creates a Lucene Document from an HTML document.

It does this by using the JTidy package. It can take input from a File or an InputStream.
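
The accessor contract described on this page (a title and a body text extracted from HTML input) can be illustrated with a rough, JDK-only sketch. HtmlDocumentSketch is a hypothetical stand-in written for this example: it uses regular expressions instead of JTidy, so it is not the implementation of HtmlDocument, and the real class may normalize whitespace or handle malformed markup differently. It mirrors only the InputStream constructor and the getTitle() and getBody() accessors.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for HtmlDocument; regex-based, NOT the JTidy-based class. */
public class HtmlDocumentSketch {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private final String title;
    private final String body;

    /** Reads the whole stream (assumed UTF-8 here) and extracts title and body text. */
    public HtmlDocumentSketch(InputStream is) {
        String html;
        try {
            html = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.title = firstGroup(TITLE, html).trim();
        // Replace tags with spaces, then collapse runs of whitespace.
        this.body = firstGroup(BODY, html)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** Returns the first capture group of the first match, or "" if there is none. */
    private static String firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    public String getTitle() {
        return title;
    }

    public String getBody() {
        return body;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        HtmlDocumentSketch doc = new HtmlDocumentSketch(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
        System.out.println(doc.getTitle()); // prints "Hello"
        System.out.println(doc.getBody());  // prints "Hello world"
    }
}
```

With the real class on the classpath, the equivalent calls would be new HtmlDocument(is) followed by getTitle() and getBody(), as documented below.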


Constructor Summary
HtmlDocument(File file)
          Constructs an HtmlDocument from a File.
HtmlDocument(File file, String tidyConfigFile)
          Constructs an HtmlDocument from a File, using the given Tidy config file.
HtmlDocument(InputStream is)
          Constructs an HtmlDocument from an InputStream.
 
Method Summary
static Document Document(File file)
          Creates a Lucene Document from a File.
static Document Document(File file, String tidyConfigFile)
          Creates a Lucene Document from a File, using the given Tidy config file.
 String getBody()
          Gets the bodyText attribute of the HtmlDocument object.
static Document getDocument(InputStream is)
          Creates a Lucene Document from an InputStream.
 String getTitle()
          Gets the title attribute of the HtmlDocument object.
static void main(String[] args)
          Runs HtmlDocument on the files specified on the command line.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlDocument

public HtmlDocument(File file)
             throws IOException
Constructs an HtmlDocument from a File.

Parameters:
file - the File containing the HTML to parse
Throws:
IOException - if an I/O exception occurs

HtmlDocument

public HtmlDocument(InputStream is)
Constructs an HtmlDocument from an InputStream.

Parameters:
is - the InputStream containing the HTML

HtmlDocument

public HtmlDocument(File file,
                    String tidyConfigFile)
             throws IOException
Constructs an HtmlDocument from a File, using the given Tidy config file.

Parameters:
file - the File containing the HTML to parse
tidyConfigFile - the String containing the full path to the Tidy config file
Throws:
IOException - if an I/O exception occurs

Method Detail

Document

public static Document Document(File file,
                                String tidyConfigFile)
                         throws IOException
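Creates a Lucene Document from a File, using the given Tidy config file.

Parameters:
file - the File containing the HTML to parse
tidyConfigFile - the full path to the Tidy config file
Throws:
IOException - if an I/O exception occurs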

getDocument

public static Document getDocument(InputStream is)
Creates a Lucene Document from an InputStream.

Parameters:
is - the InputStream containing the HTML

Document

public static Document Document(File file)
                         throws IOException
Creates a Lucene Document from a File.

Parameters:
file - the File containing the HTML to parse
Throws:
IOException - if an I/O exception occurs

main

public static void main(String[] args)
                 throws Exception
Runs HtmlDocument on the files specified on the command line.

Parameters:
args - Command line arguments
Throws:
Exception - if an error occurs while processing the files

getTitle

public String getTitle()
Gets the title attribute of the HtmlDocument object.

Returns:
the title value

getBody

public String getBody()
Gets the bodyText attribute of the HtmlDocument object.

Returns:
the bodyText value


Copyright © 2000-2010 Apache Software Foundation. All Rights Reserved.