org.apache.solr.handler.extraction
Class SolrContentHandler

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by org.apache.solr.handler.extraction.SolrContentHandler
All Implemented Interfaces:
ExtractingParams, ContentHandler, DTDHandler, EntityResolver, ErrorHandler

public class SolrContentHandler
extends DefaultHandler
implements ExtractingParams

The class responsible for handling Tika events and translating them into SolrInputDocuments. This class is not thread-safe.

Users may wish to subclass this class to provide their own functionality.

See Also:
SolrContentHandlerFactory, ExtractingRequestHandler, ExtractingDocumentLoader
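
As a rough illustration of how SAX events can be translated into per-field and catch-all text, here is a minimal JDK-only sketch. It is not the Solr implementation: SketchContentHandler, its constructor argument, and the parse helper are hypothetical names, and the rule that text inside a captured element goes only to that element's builder is an assumption made for the example. Like the real class, the sketch keeps mutable state and is not thread-safe.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in that mimics the shape of SolrContentHandler using only the JDK.
class SketchContentHandler extends DefaultHandler {
    // Text outside captured elements, analogous to catchAllBuilder.
    final StringBuilder catchAll = new StringBuilder();
    // One builder per captured element name, analogous to fieldBuilders.
    final Map<String, StringBuilder> fieldBuilders = new LinkedHashMap<>();
    // The builder on top of the stack receives incoming character data.
    private final Deque<StringBuilder> stack = new ArrayDeque<>();

    SketchContentHandler(String... capturedElements) {
        for (String name : capturedElements) {
            fieldBuilders.put(name, new StringBuilder());
        }
    }

    @Override
    public void startDocument() {
        stack.clear();
        stack.push(catchAll);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        StringBuilder builder = fieldBuilders.get(qName);
        if (builder != null) {
            stack.push(builder); // start capturing into this element's builder
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (fieldBuilders.containsKey(qName)) {
            stack.pop(); // stop capturing; fall back to the enclosing builder
        }
    }

    @Override
    public void characters(char[] chars, int offset, int length) {
        stack.peek().append(chars, offset, length);
    }

    // Convenience helper for the example: parse a string with the default JDK SAX parser.
    static SketchContentHandler parse(String xml, String... capturedElements) {
        SketchContentHandler handler = new SketchContentHandler(capturedElements);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

Calling SketchContentHandler.parse(xml, "h1") and then reading fieldBuilders.get("h1") and catchAll shows the split between captured and catch-all text.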

Field Summary
protected  boolean captureAttribs
           
protected  StringBuilder catchAllBuilder
           
protected  String contentFieldName
           
protected  Collection<String> dateFormats
           
protected  String defaultField
           
protected  SolrInputDocument document
           
protected  Map<String,StringBuilder> fieldBuilders
           
protected  boolean lowerNames
           
protected  org.apache.tika.metadata.Metadata metadata
           
protected  SolrParams params
           
protected  IndexSchema schema
           
protected  String unknownFieldPrefix
           
 
Fields inherited from interface org.apache.solr.handler.extraction.ExtractingParams
BOOST_PREFIX, CAPTURE_ATTRIBUTES, CAPTURE_ELEMENTS, DEFAULT_FIELD, EXTRACT_FORMAT, EXTRACT_ONLY, IGNORE_TIKA_EXCEPTION, LITERALS_OVERRIDE, LITERALS_PREFIX, LOWERNAMES, MAP_PREFIX, PASSWORD_MAP_FILE, RESOURCE_NAME, RESOURCE_PASSWORD, STREAM_TYPE, UNKNOWN_FIELD_PREFIX, XPATH_EXPRESSION
 
Constructor Summary
SolrContentHandler(org.apache.tika.metadata.Metadata metadata, SolrParams params, IndexSchema schema)
           
SolrContentHandler(org.apache.tika.metadata.Metadata metadata, SolrParams params, IndexSchema schema, Collection<String> dateFormats)
           
 
Method Summary
protected  void addCapturedContent()
          Add the per-field captured content to the Solr document.
protected  void addContent()
          Add the catch-all content to the content field.
protected  void addField(String fname, String fval, String[] vals)
           
protected  void addLiterals()
          Add the literals to the document, using params and ExtractingParams.LITERALS_PREFIX.
protected  void addMetadata()
          Add any metadata to the document, using the metadata field as the source.
 void characters(char[] chars, int offset, int length)
           
 void endElement(String uri, String localName, String qName)
           
protected  String findMappedName(String name)
          Get the mapped name for the given name, if a mapping exists.
protected  float getBoost(String name)
          Get the value of any boost factor for the mapped name.
 SolrInputDocument newDocument()
          This is called by a consumer when it is ready to deal with a new SolrInputDocument.
 void startDocument()
           
 void startElement(String uri, String localName, String qName, Attributes attributes)
           
protected  String transformValue(String val, SchemaField schFld)
          Can be used to transform input values based on their SchemaField.

This implementation only formats dates using the DateUtil.

 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endDocument, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

document

protected SolrInputDocument document

dateFormats

protected Collection<String> dateFormats

metadata

protected org.apache.tika.metadata.Metadata metadata

params

protected SolrParams params

catchAllBuilder

protected StringBuilder catchAllBuilder

schema

protected IndexSchema schema

fieldBuilders

protected Map<String,StringBuilder> fieldBuilders

captureAttribs

protected boolean captureAttribs

lowerNames

protected boolean lowerNames

contentFieldName

protected String contentFieldName

unknownFieldPrefix

protected String unknownFieldPrefix

defaultField

protected String defaultField

Constructor Detail

SolrContentHandler

public SolrContentHandler(org.apache.tika.metadata.Metadata metadata,
                          SolrParams params,
                          IndexSchema schema)

SolrContentHandler

public SolrContentHandler(org.apache.tika.metadata.Metadata metadata,
                          SolrParams params,
                          IndexSchema schema,
                          Collection<String> dateFormats)

Method Detail

newDocument

public SolrInputDocument newDocument()
This is called by a consumer when it is ready to deal with a new SolrInputDocument. Overriding classes can use this hook to add in or change whatever they deem fit for the document at that time. The base implementation adds the metadata as fields, allowing for potential remapping.

Returns:
The SolrInputDocument.
See Also:
addMetadata(), addCapturedContent(), addContent(), addLiterals()
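
The hook described above might be used roughly as follows. This is a JDK-only sketch, not Solr code: SketchHandler and StampingHandler are hypothetical classes, a Map stands in for SolrInputDocument, the two-argument addField is a simplification of the real signature, and the field name source_s is invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the handler; a Map plays the role of SolrInputDocument.
class SketchHandler {
    protected final Map<String, List<String>> document = new LinkedHashMap<>();
    protected final Map<String, String> metadata;

    SketchHandler(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    // Hook called by the consumer once it is ready for the finished document.
    public Map<String, List<String>> newDocument() {
        addMetadata();
        return document;
    }

    // Copy every metadata entry into the document as a field.
    protected void addMetadata() {
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            addField(entry.getKey(), entry.getValue());
        }
    }

    // Append a value to a (possibly multi-valued) field.
    protected void addField(String name, String value) {
        document.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
    }
}

// A subclass that uses the hook to add one extra field after the base work is done.
class StampingHandler extends SketchHandler {
    StampingHandler(Map<String, String> metadata) {
        super(metadata);
    }

    @Override
    public Map<String, List<String>> newDocument() {
        Map<String, List<String>> doc = super.newDocument();
        addField("source_s", "sketch"); // hypothetical field name and value
        return doc;
    }
}
```

Overriding newDocument() and delegating to super first keeps the base behavior intact while leaving room for extra fields.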

addCapturedContent

protected void addCapturedContent()
Add the per-field captured content to the Solr document. The default implementation uses the contents of fieldBuilders.


addContent

protected void addContent()
Add the catch-all content to the content field. The default implementation uses contentFieldName and catchAllBuilder.


addLiterals

protected void addLiterals()
Add the literals to the document, using params and ExtractingParams.LITERALS_PREFIX.
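
A prefix-based lookup of this kind could look like the sketch below. It uses a plain Map in place of SolrParams, and it assumes the prefix value is "literal."; LiteralsSketch and extractLiterals are hypothetical names, not part of the Solr API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects request parameters that carry the literals prefix.
class LiteralsSketch {
    // Assumed value of ExtractingParams.LITERALS_PREFIX.
    static final String LITERALS_PREFIX = "literal.";

    // Return field-name -> value pairs for every parameter that starts with the prefix.
    static Map<String, String> extractLiterals(Map<String, String> params) {
        Map<String, String> literals = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(LITERALS_PREFIX)) {
                literals.put(key.substring(LITERALS_PREFIX.length()), entry.getValue());
            }
        }
        return literals;
    }
}
```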


addMetadata

protected void addMetadata()
Add any metadata to the document, using the metadata field as the source.


addField

protected void addField(String fname,
                        String fval,
                        String[] vals)

startDocument

public void startDocument()
                   throws SAXException
Specified by:
startDocument in interface ContentHandler
Overrides:
startDocument in class DefaultHandler
Throws:
SAXException

startElement

public void startElement(String uri,
                         String localName,
                         String qName,
                         Attributes attributes)
                  throws SAXException
Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class DefaultHandler
Throws:
SAXException

endElement

public void endElement(String uri,
                       String localName,
                       String qName)
                throws SAXException
Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class DefaultHandler
Throws:
SAXException

characters

public void characters(char[] chars,
                       int offset,
                       int length)
                throws SAXException
Specified by:
characters in interface ContentHandler
Overrides:
characters in class DefaultHandler
Throws:
SAXException

transformValue

protected String transformValue(String val,
                                SchemaField schFld)
Can be used to transform input values based on their SchemaField.

This implementation only formats dates using the DateUtil.

Parameters:
val - The value to transform
schFld - The SchemaField
Returns:
The potentially new value.
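
To make the date-only transformation concrete, here is a JDK-only sketch. It does not use DateUtil or SchemaField: a boolean stands in for "the schema field is a date field", the candidate patterns play the role of dateFormats, and returning the input unchanged when nothing parses is an assumption. DateTransformSketch is a hypothetical name.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Collection;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper: rewrites date values into an ISO-8601 UTC string.
class DateTransformSketch {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    static String transformValue(String val, boolean isDateField, Collection<String> dateFormats) {
        if (!isDateField) {
            return val; // non-date fields pass through untouched
        }
        for (String pattern : dateFormats) {
            SimpleDateFormat in = new SimpleDateFormat(pattern, Locale.ROOT);
            in.setTimeZone(UTC);
            in.setLenient(false);
            try {
                Date parsed = in.parse(val);
                SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
                out.setTimeZone(UTC);
                return out.format(parsed);
            } catch (ParseException e) {
                // This pattern did not match; try the next one.
            }
        }
        return val; // assumption: leave the value as-is when no pattern matches
    }
}
```

A subclass that needs other per-type conversions would override transformValue and fall back to the inherited behavior for dates.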

getBoost

protected float getBoost(String name)
Get the value of any boost factor for the mapped name.

Parameters:
name - The name of the field to see if there is a boost specified
Returns:
The boost value
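
A boost lookup of this shape can be sketched with a plain Map in place of SolrParams. The prefix value "boost." and the default of 1.0 when no boost is given are assumptions for the example; BoostSketch is a hypothetical name.

```java
import java.util.Map;

// Hypothetical helper: looks up a per-field boost from request parameters.
class BoostSketch {
    // Assumed value of ExtractingParams.BOOST_PREFIX.
    static final String BOOST_PREFIX = "boost.";

    // Return the boost configured for the field, or 1.0f (assumed default) if none is set.
    static float getBoost(Map<String, String> params, String name) {
        String raw = params.get(BOOST_PREFIX + name);
        return raw == null ? 1.0f : Float.parseFloat(raw);
    }
}
```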

findMappedName

protected String findMappedName(String name)
Get the mapped name for the given name, if a mapping exists.

Parameters:
name - The name to check to see if there is a mapping
Returns:
The new name, if there is one, else name
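
The "new name if there is one, else name" contract can be sketched as a lookup with a fallback. A plain Map stands in for SolrParams, the prefix value "fmap." is an assumption, and MappingSketch is a hypothetical name.

```java
import java.util.Map;

// Hypothetical helper: resolves a field-name mapping from request parameters.
class MappingSketch {
    // Assumed value of ExtractingParams.MAP_PREFIX.
    static final String MAP_PREFIX = "fmap.";

    // Return the mapped name if a mapping parameter exists, otherwise the original name.
    static String findMappedName(Map<String, String> params, String name) {
        return params.getOrDefault(MAP_PREFIX + name, name);
    }
}
```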


Copyright © 2000-2013 Apache Software Foundation. All Rights Reserved.