Class OpenNLPExtractNamedEntitiesUpdateProcessorFactory

  • All Implemented Interfaces:
    NamedListInitializedPlugin, SolrCoreAware

    public class OpenNLPExtractNamedEntitiesUpdateProcessorFactory
    extends UpdateRequestProcessorFactory
    implements SolrCoreAware
    Extracts named entities from the values found in any matching source field into a configured dest field, using an OpenNLP NER modelFile. The source text is first tokenized using the index analyzer on the configured analyzerFieldType, which must include solr.OpenNLPTokenizerFactory as the tokenizer. E.g.:
       <fieldType name="opennlp-en-tokenization" class="solr.TextField">
         <analyzer>
           <tokenizer class="solr.OpenNLPTokenizerFactory"
                      sentenceModel="en-sent.bin"
                      tokenizerModel="en-tokenizer.bin"/>
         </analyzer>
       </fieldType>
     

    See the OpenNLP website for information on downloading pre-trained models.

    Note that in order to use model files larger than 1MB on SolrCloud, ZooKeeper server and client configuration is required: the jute.maxbuffer system property must be raised on both sides.

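    The source field(s) can be configured as either:

    • One or more <str>
    • An <arr> of <str>
    • A <lst> containing FieldMutatingUpdateProcessorFactory style selector arguments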

    The dest field can be a single <str> containing the literal name of a destination field, or it may be a <lst> specifying a regex pattern and a replacement string. If the pattern + replacement option is used, the pattern will be matched against all fields matched by the source selector, and the replacement string (including any capture groups specified from the pattern) will be evaluated using Matcher.replaceAll(String) to generate the literal name of the destination field. Additionally, an occurrence of the string "{EntityType}" in the dest field specification, or in the replacement string, will be replaced with the entity type(s) returned for each entity by the OpenNLP NER model; as a result, if the model extracts more than one entity type, then more than one dest field will be populated.
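
    The name resolution described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the processor's actual code: the class DestFieldNameSketch and the method resolveDest are hypothetical names, and the order of the two substitution steps is assumed. When the pattern does not match, Matcher.replaceAll returns the input unchanged; what the processor does in that case is not modelled here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of dest field name resolution; not part of Solr. */
public class DestFieldNameSketch {

    /** Placeholder string named in the description above. */
    public static final String ENTITY_TYPE = "{EntityType}";

    /**
     * Resolves a destination field name.
     *
     * @param sourceField name of a field matched by the source selector
     * @param pattern     regex from the dest <lst>, or null for a literal dest <str>
     * @param dest        literal dest name, or the replacement string when pattern is non-null
     * @param entityType  entity type reported by the NER model for one entity
     */
    public static String resolveDest(String sourceField, Pattern pattern,
                                     String dest, String entityType) {
        String resolved = dest;
        if (pattern != null) {
            // Capture groups such as $1 in the replacement are expanded here.
            Matcher matcher = pattern.matcher(sourceField);
            resolved = matcher.replaceAll(dest);
        }
        // Assumed order: the entity type is substituted after the regex replacement.
        return resolved.replace(ENTITY_TYPE, entityType);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^desc(.*)s$");
        System.out.println(resolveDest("descriptions", p, "key_desc$1_people", "person"));
        System.out.println(resolveDest("descs", p, "key_desc$1_people", "person"));
        System.out.println(resolveDest("summary", null, "summary_{EntityType}_s", "person"));
    }
}
```

    With a model that reports several entity types, calling resolveDest once per extracted entity yields one destination name per type, which is why more than one dest field can be populated.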

    If the resolved dest field already exists in the document, then the named entities extracted from the source fields will be added to it.

    In the example below:

    • Named entities will be extracted from the text field and added to the people_s field
    • Named entities will be extracted from both the title and subtitle fields and added into the titular_people field
    • Named entities will be extracted from any field with a name ending in _txt -- except for notes_txt -- and added into the people_s field
    • Named entities will be extracted from any field with a name beginning with "desc" and ending in "s" (e.g. "descs" and "descriptions") and added to a field prefixed with "key_", not ending in "s", and suffixed with "_people". (e.g. "key_desc_people" or "key_description_people")
    • Named entities will be extracted from the summary field and added to the summary_person_s field, assuming that the modelFile only extracts entities of type "person".
     <updateRequestProcessorChain name="multiple-extract">
       <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
         <str name="modelFile">en-test-ner-person.bin</str>
         <str name="analyzerFieldType">opennlp-en-tokenization</str>
         <str name="source">text</str>
         <str name="dest">people_s</str>
       </processor>
       <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
         <str name="modelFile">en-test-ner-person.bin</str>
         <str name="analyzerFieldType">opennlp-en-tokenization</str>
         <arr name="source">
           <str>title</str>
           <str>subtitle</str>
         </arr>
         <str name="dest">titular_people</str>
       </processor>
       <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
         <str name="modelFile">en-test-ner-person.bin</str>
         <str name="analyzerFieldType">opennlp-en-tokenization</str>
         <lst name="source">
           <str name="fieldRegex">.*_txt$</str>
           <lst name="exclude">
             <str name="fieldName">notes_txt</str>
           </lst>
         </lst>
         <str name="dest">people_s</str>
       </processor>
       <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
         <str name="modelFile">en-test-ner-person.bin</str>
         <str name="analyzerFieldType">opennlp-en-tokenization</str>
         <lst name="source">
           <str name="fieldRegex">^desc(.*)s$</str>
         </lst>
         <lst name="dest">
           <str name="pattern">^desc(.*)s$</str>
           <str name="replacement">key_desc$1_people</str>
         </lst>
       </processor>
       <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
         <str name="modelFile">en-test-ner-person.bin</str>
         <str name="analyzerFieldType">opennlp-en-tokenization</str>
         <str name="source">summary</str>
         <str name="dest">summary_{EntityType}_s</str>
       </processor>
       <processor class="solr.LogUpdateProcessorFactory" />
       <processor class="solr.RunUpdateProcessorFactory" />
     </updateRequestProcessorChain>
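
    The source selection used by the third processor in the chain above (fieldRegex with an exclude list) can be approximated in plain Java as follows. This is a rough sketch, not Solr's selector implementation: SourceSelectorSketch and select are hypothetical names, the field names in main are made up for illustration, and the use of Matcher.find() rather than a full match is an assumption.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch of fieldRegex + exclude selection; not part of Solr. */
public class SourceSelectorSketch {

    /**
     * Keeps the field names that match fieldRegex and are not listed in excludedNames.
     * Assumption: the regex is applied with find(), so anchors such as $ matter.
     */
    public static List<String> select(List<String> fieldNames, Pattern fieldRegex,
                                      Set<String> excludedNames) {
        return fieldNames.stream()
                .filter(name -> fieldRegex.matcher(name).find())
                .filter(name -> !excludedNames.contains(name))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up document field names, for illustration only.
        List<String> fields = List.of("body_txt", "notes_txt", "title", "abstract_txt");
        List<String> selected = select(fields, Pattern.compile(".*_txt$"), Set.of("notes_txt"));
        System.out.println(selected);
    }
}
```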
     
    Since:
    7.3.0