Package org.apache.lucene.analysis.miscellaneous
Miscellaneous TokenStreams.
Class Description
This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
Factory for ASCIIFoldingFilter.
A filter to apply normal capitalization rules to Tokens.
Factory for CapitalizationFilter.
Removes words that are too long or too short from the stream.
Factory for CodepointCountFilter.
Concatenates/Joins every incoming token with a separator into one output token for every path through the token stream (which is a graph).
Attribute providing access to the term builder and UTF-16 conversion.
Implementation of ConcatenateGraphFilter.BytesRefBuilderTermAttribute.
Factory for ConcatenateGraphFilter.
A TokenStream that takes an array of input TokenStreams as sources, and concatenates them together.
Allows skipping TokenFilters based on the current set of attributes.
Abstract parent class for analysis factories that create ConditionalTokenFilter instances.
Filters all tokens that cannot be parsed to a date, using the provided DateFormat.
Factory for DateRecognizerFilter.
Characters before the delimiter are the "token", the textual integer after is the term frequency.
Factory for DelimitedTermFrequencyTokenFilter.
Allows Tokens with a given combination of flags to be dropped.
Provides a filter that will drop tokens matching a set of flags.
An always exhausted token stream.
Filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.
Factory for FingerprintFilter.
Deprecated. Fix the token filters that create broken offsets in the first place.
Deprecated.
When the plain text is extracted from documents, we will often have many words hyphenated and broken into two lines.
Factory for HyphenatedWordsFilter.
A TokenFilter that only keeps tokens with text contained in the required words.
Factory for KeepWordFilter.
Marks terms as keywords via the KeywordAttribute.
Factory for KeywordMarkerFilter.
This TokenFilter emits each incoming token twice, once as keyword and once as non-keyword; in other words, once with KeywordAttribute.setKeyword(boolean) set to true and once set to false.
Factory for KeywordRepeatFilter.
Removes words that are too long or too short from the stream.
Factory for LengthFilter.
This Analyzer limits the number of tokens while indexing.
This TokenFilter limits the number of tokens while indexing.
Factory for LimitTokenCountFilter.
Lets all tokens pass through until it sees one with a start offset greater than a configured limit; that token does not pass and ends the stream.
Factory for LimitTokenOffsetFilter.
This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.
Factory for LimitTokenPositionFilter.
Marks terms as keywords via the KeywordAttribute.
This analyzer is used to facilitate scenarios where different fields require different analysis techniques.
A ConditionalTokenFilter that only applies its wrapped filters to tokens that are not contained in a protected set.
Factory for a ProtectedTermFilter.
A TokenFilter which filters out Tokens with the same position and term text as the previous token in the stream.
Factory for RemoveDuplicatesTokenFilter.
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o.
Factory for ScandinavianFoldingFilter.
This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
Factory for ScandinavianNormalizationFilter.
This Normalizer does the heavy lifting for a set of Scandinavian normalization filters, normalizing the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
List of possible foldings that can be used when configuring the filter.
Marks terms as keywords via the KeywordAttribute.
Provides the ability to override any KeywordAttribute-aware stemmer with custom dictionary-based stemming.
This builder builds an FST for the StemmerOverrideFilter.
A read-only 4-byte FST-backed map that allows fast case-insensitive key-value lookups for StemmerOverrideFilter.
Factory for StemmerOverrideFilter.
Trims leading and trailing whitespace from Tokens in the stream.
Factory for TrimFilter.
A token filter for truncating the terms to a specific length.
Factory for TruncateTokenFilter.
Adds the TypeAttribute.type() as a synonym, i.e.
Factory for TypeAsSynonymFilter.
Deprecated. Use WordDelimiterGraphFilter instead: it produces a correct token graph so that e.g.
Deprecated. Use WordDelimiterGraphFilterFactory instead: it produces a correct token graph so that e.g.
Splits words into subwords and performs optional transformations on subword groups, producing a correct token graph so that e.g.
Factory for WordDelimiterGraphFilter.
A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.
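
Two of the descriptions above are easier to picture with a concrete input: the fingerprint filter ("a concatenation of the sorted and de-duplicated set of input tokens") and the delimited term frequency filter ("characters before the delimiter are the token, the textual integer after is the term frequency"). The following is a minimal plain-Java sketch of those two behaviors as described. It does not use the Lucene API; the class MiscFilterSketch and its methods are hypothetical names introduced only for illustration. Natural String ordering, splitting at the first delimiter, and a frequency of 1 when no delimiter is present are assumptions of the sketch, not statements about the library.

```java
import java.util.List;
import java.util.TreeSet;

/** Plain-Java illustration only; none of these names are Lucene APIs. */
final class MiscFilterSketch {

    /** Sorts and de-duplicates the tokens, then joins them with the separator. */
    static String fingerprint(List<String> tokens, char separator) {
        // TreeSet keeps elements unique and in natural String order
        // (an assumption; the real filter may order terms differently).
        TreeSet<String> unique = new TreeSet<>(tokens);
        return String.join(String.valueOf(separator), unique);
    }

    /** Returns the characters before the first delimiter, or the whole input if none. */
    static String term(String raw, char delimiter) {
        int i = raw.indexOf(delimiter);
        return i < 0 ? raw : raw.substring(0, i);
    }

    /** Parses the integer after the first delimiter; assumes 1 when there is no delimiter. */
    static int termFrequency(String raw, char delimiter) {
        int i = raw.indexOf(delimiter);
        return i < 0 ? 1 : Integer.parseInt(raw.substring(i + 1));
    }

    public static void main(String[] args) {
        // Duplicate "the" collapses; remaining tokens come out sorted.
        System.out.println(fingerprint(List.of("the", "quick", "the", "brown"), ' '));
        // "lucene|3" splits into the term "lucene" and the frequency 3.
        System.out.println(term("lucene|3", '|') + " -> " + termFrequency("lucene|3", '|'));
    }
}
```

In a real analysis chain these steps operate on token attributes inside a TokenStream rather than on Strings; the sketch only shows the input-to-output mapping that the descriptions imply.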