About Tokenizers

The job of a tokenizer is to break up a stream of text into tokens, where each token is (usually) a sub-sequence of the characters in the text. An analyzer is aware of the field it is configured for, but a tokenizer is not. Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a TokenStream).

Characters in the input stream may be discarded, such as whitespace or other delimiters. They may also be added to or replaced, such as mapping aliases or abbreviations to normalized forms. A token contains various metadata in addition to its text value, such as the location at which the token occurs in the field. Because a tokenizer may produce tokens that diverge from the input text, you should not assume that the text of the token is the same text that occurs in the field, or that its length is the same as the original text. It’s also possible for more than one token to have the same position or refer to the same offset in the original text. Keep this in mind if you use token metadata for things like highlighting search results in the field text.

You configure the tokenizer for a text field type in the schema with a <tokenizer> element, as a child of <analyzer>:

<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the TokenizerFactory API. This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from Tokenizer, which indicates that they produce sequences of tokens. If the tokenizer produces tokens that are usable as is, it may be the only component of the analyzer. Otherwise, the tokenizer’s output tokens will serve as input to the first filter stage in the pipeline.
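For example, a minimal sketch of an analyzer in which the tokenizer's output feeds a filter stage (the field type name text_lower is illustrative):

<fieldType name="text_lower" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>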

A TypeTokenFilterFactory is available that creates a TypeTokenFilter, which filters tokens based on their TypeAttribute; the set of types it filters on is the one returned by the factory's getStopTypes method.
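As a sketch of how it might be configured (the types file name stoptypes.txt is hypothetical; types and useWhitelist are the factory's parameters), a chain that drops tokens whose type is listed in that file could look like:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt" useWhitelist="false"/>
</analyzer>

Setting useWhitelist="true" would invert the behavior and keep only the listed types.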

For a complete list of the available tokenizers, see the section Tokenizers.

When to Use a CharFilter vs. a TokenFilter

There are several pairs of CharFilters and TokenFilters that have related (e.g., MappingCharFilter and ASCIIFoldingFilter) or nearly identical (e.g., PatternReplaceCharFilterFactory and PatternReplaceFilterFactory) functionality, and it may not always be obvious which is the best choice.
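The practical difference is where the replacement happens; as a sketch (the pattern and replacement values here are illustrative), the same regular-expression substitution can be applied either to the raw character stream or to individual tokens:

<!-- applied to the character stream, before the tokenizer sees it -->
<analyzer>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([0-9]+)-([0-9]+)" replacement="$1$2"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>

<!-- applied to each token, after tokenization -->
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="([0-9]+)-([0-9]+)" replacement="$1$2"/>
</analyzer>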

The decision about which to use depends largely on which Tokenizer you are using, and whether you need to preprocess the stream of characters.

For example, suppose you have a tokenizer such as StandardTokenizer, and although you are pretty happy with how it works overall, you want to customize how some specific characters behave. You could modify the JFlex rules and rebuild your own tokenizer, but it might be easier to simply map those characters before tokenization with a CharFilter, as in the sketch below.
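A minimal sketch of that approach, assuming a hypothetical mapping file mapping.txt that lists the character substitutions (the field type name is also illustrative):

<fieldType name="text_mapped" class="solr.TextField">
  <analyzer>
    <!-- rewrite the chosen characters before StandardTokenizer sees them -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>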