Class SimplePatternSplitTokenizerFactory


  • public class SimplePatternSplitTokenizerFactory
    extends TokenizerFactory
    Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.

    This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens from the input stream. The syntax is more limited than that of PatternTokenizer, but the tokenization is quite a bit faster. It takes two arguments:

    • "pattern" (required) is the regular expression, written in the syntax described at RegExp
    • "maxDeterminizedStates" (optional, default 10000) is the limit on the total state count of the determinized automaton computed from the regexp

    The pattern matches the characters that separate tokens, as with String.split. Matching is greedy: at any given point, the longest matching token separator is used. Empty tokens are never created.

    For example, to match tokens delimited by simple whitespace characters:

     <fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
       <analyzer>
         <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
       </analyzer>
     </fieldType>
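
    The splitting behavior described above (the pattern matches separators, and empty tokens are never emitted) can be approximated with the JDK alone. The sketch below is only an illustration: SplitSketch and its split method are hypothetical names, not part of Lucene, and it uses java.util.regex rather than Lucene's RegExp, so the pattern syntax and the matching engine differ. In particular, java.util.regex does not guarantee the longest match for alternations, although for a simple repeated character class such as the one in the example it consumes the whole separator run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the Lucene implementation.
final class SplitSketch {

    // Splits input on every match of separatorRegex and drops empty tokens.
    // Assumption: separatorRegex uses java.util.regex syntax, not Lucene RegExp.
    static List<String> split(String input, String separatorRegex) {
        Matcher m = Pattern.compile(separatorRegex).matcher(input);
        List<String> tokens = new ArrayList<>();
        int start = 0; // offset where the next token would begin
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // a zero-length match is not a separator
            }
            if (m.start() > start) {
                // Non-empty text between two separators becomes a token.
                tokens.add(input.substring(start, m.start()));
            }
            start = m.end();
        }
        if (start < input.length()) {
            // Trailing text after the last separator.
            tokens.add(input.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Leading, trailing, and repeated whitespace produce no empty tokens.
        System.out.println(split("  hello \t world\n", "[ \\t\\r\\n]+"));
    }
}
```

    Run as written, main prints [hello, world]: the leading and trailing separators yield no tokens, and each run of whitespace is consumed as a single separator.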
    Since:
    6.5.0
    See Also:
    SimplePatternSplitTokenizer
    WARNING: This API is experimental and might change in incompatible ways in the next release.
    SPI Name (Note: This is case-insensitive. e.g., if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service):
    "simplePatternSplit"