org.apache.lucene.analysis.miscellaneous (Lucene 9.9.1 common API)

package org.apache.lucene.analysis.miscellaneous

Miscellaneous TokenStreams.

Class

Description

ASCIIFoldingFilter

This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

ASCIIFoldingFilterFactory

Factory for ASCIIFoldingFilter.

CapitalizationFilter

A filter to apply normal capitalization rules to Tokens.

CapitalizationFilterFactory

Factory for CapitalizationFilter.

CodepointCountFilter

Removes words that are too long or too short from the stream.

CodepointCountFilterFactory

Factory for CodepointCountFilter.

ConcatenateGraphFilter

Concatenates/Joins every incoming token with a separator into one output token for every path through the token stream (which is a graph).

ConcatenateGraphFilter.BytesRefBuilderTermAttribute

Attribute providing access to the term builder and UTF-16 conversion

ConcatenateGraphFilter.BytesRefBuilderTermAttributeImpl

Implementation of ConcatenateGraphFilter.BytesRefBuilderTermAttribute

ConcatenateGraphFilterFactory

Factory for ConcatenateGraphFilter.

ConcatenatingTokenStream

A TokenStream that takes an array of input TokenStreams as sources, and concatenates them together.

ConditionalTokenFilter

Allows skipping TokenFilters based on the current set of attributes.

ConditionalTokenFilterFactory

Abstract parent class for analysis factories that create ConditionalTokenFilter instances

DateRecognizerFilter

Filters all tokens that cannot be parsed to a date, using the provided DateFormat.

DateRecognizerFilterFactory

Factory for DateRecognizerFilter.

DelimitedTermFrequencyTokenFilter

Characters before the delimiter are the "token", the textual integer after is the term frequency.

DelimitedTermFrequencyTokenFilterFactory

Factory for DelimitedTermFrequencyTokenFilter.

DropIfFlaggedFilter

Allows Tokens with a given combination of flags to be dropped.

DropIfFlaggedFilterFactory

Provides a filter that will drop tokens matching a set of flags.

EmptyTokenStream

An always exhausted token stream.

FingerprintFilter

Filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.

FingerprintFilterFactory

Factory for FingerprintFilter.

FixBrokenOffsetsFilter

Deprecated.
Fix the token filters that create broken offsets in the first place.

FixBrokenOffsetsFilterFactory

Deprecated.

HyphenatedWordsFilter

When plain text is extracted from documents, many words are often hyphenated and broken across two lines; this filter puts such words back together.

HyphenatedWordsFilterFactory

Factory for HyphenatedWordsFilter.

KeepWordFilter

A TokenFilter that only keeps tokens with text contained in the required words.

KeepWordFilterFactory

Factory for KeepWordFilter.

KeywordMarkerFilter

Marks terms as keywords via the KeywordAttribute.

KeywordMarkerFilterFactory

Factory for KeywordMarkerFilter.

KeywordRepeatFilter

This TokenFilter emits each incoming token twice, once as a keyword and once as a non-keyword; in other words, once with KeywordAttribute.setKeyword(boolean) set to true and once set to false.

KeywordRepeatFilterFactory

Factory for KeywordRepeatFilter.

LengthFilter

Removes words that are too long or too short from the stream.

LengthFilterFactory

Factory for LengthFilter.

LimitTokenCountAnalyzer

This Analyzer limits the number of tokens while indexing.

LimitTokenCountFilter

This TokenFilter limits the number of tokens while indexing.

LimitTokenCountFilterFactory

Factory for LimitTokenCountFilter.

LimitTokenOffsetFilter

Lets tokens pass through while their start offset is <= a configured limit; the first token with a greater start offset is not passed and ends the stream.

LimitTokenOffsetFilterFactory

Factory for LimitTokenOffsetFilter.

LimitTokenPositionFilter

This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.

LimitTokenPositionFilterFactory

Factory for LimitTokenPositionFilter.

PatternKeywordMarkerFilter

Marks terms as keywords via the KeywordAttribute.

PerFieldAnalyzerWrapper

This analyzer is used to facilitate scenarios where different fields require different analysis techniques.

ProtectedTermFilter

A ConditionalTokenFilter that only applies its wrapped filters to tokens that are not contained in a protected set.

ProtectedTermFilterFactory

Factory for a ProtectedTermFilter

RemoveDuplicatesTokenFilter

A TokenFilter which filters out Tokens that have the same position and term text as the previous token in the stream.

RemoveDuplicatesTokenFilterFactory

Factory for RemoveDuplicatesTokenFilter.

ScandinavianFoldingFilter

This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o.

ScandinavianFoldingFilterFactory

Factory for ScandinavianFoldingFilter.

ScandinavianNormalizationFilter

This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

ScandinavianNormalizationFilterFactory

Factory for ScandinavianNormalizationFilter.

ScandinavianNormalizer

This Normalizer does the heavy lifting for a set of Scandinavian normalization filters, normalizing use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

ScandinavianNormalizer.Foldings

List of possible foldings that can be used when configuring the filter

SetKeywordMarkerFilter

Marks terms as keywords via the KeywordAttribute.

StemmerOverrideFilter

Provides the ability to override any KeywordAttribute aware stemmer with custom dictionary-based stemming.

StemmerOverrideFilter.Builder

This builder builds an FST for the StemmerOverrideFilter

StemmerOverrideFilter.StemmerOverrideMap

A read-only 4-byte FST-backed map that allows fast case-insensitive key-value lookups for StemmerOverrideFilter.

StemmerOverrideFilterFactory

Factory for StemmerOverrideFilter.

TrimFilter

Trims leading and trailing whitespace from Tokens in the stream.

TrimFilterFactory

Factory for TrimFilter.

TruncateTokenFilter

A token filter for truncating terms to a specific length.

TruncateTokenFilterFactory

Factory for TruncateTokenFilter.

TypeAsSynonymFilter

Adds the TypeAttribute.type() as a synonym.

TypeAsSynonymFilterFactory

Factory for TypeAsSynonymFilter.

WordDelimiterFilter

Deprecated.
Use WordDelimiterGraphFilter instead: it produces a correct token graph.

WordDelimiterFilterFactory

Deprecated.
Use WordDelimiterGraphFilterFactory instead: it produces a correct token graph.

WordDelimiterGraphFilter

Splits words into subwords and performs optional transformations on subword groups, producing a correct token graph.

WordDelimiterGraphFilterFactory

Factory for WordDelimiterGraphFilter.

WordDelimiterIterator

A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.
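
The FingerprintFilter entry above describes an output that is "a concatenation of the sorted and de-duplicated set of input tokens". A minimal standalone sketch of that idea, in plain Java without the Lucene API, might look like the following. The class name FingerprintSketch, the method fingerprint, and the choice of natural String ordering are assumptions made for illustration; the real filter may sort and separate tokens differently.

```java
import java.util.List;
import java.util.TreeSet;

// Illustration only: a hypothetical helper, not part of Lucene.
final class FingerprintSketch {

    // Sorts and de-duplicates the tokens, then joins them with the separator.
    // Assumption: natural String ordering stands in for the filter's sort order.
    static String fingerprint(List<String> tokens, char separator) {
        TreeSet<String> unique = new TreeSet<>(tokens); // sorted, no duplicates
        return String.join(String.valueOf(separator), unique);
    }

    public static void main(String[] args) {
        // prints "fox quick the"
        System.out.println(fingerprint(List.of("the", "quick", "the", "fox"), ' '));
    }
}
```

Because the result does not depend on token order or repetition, two inputs with the same set of words yield the same fingerprint, which is what makes this kind of output usable as a grouping key.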

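
The DelimitedTermFrequencyTokenFilter entry says that characters before the delimiter are the token and the textual integer after it is the term frequency. The sketch below shows that split in plain Java, without the Lucene API. Everything named here (DelimitedTermFrequencySketch, TermAndFrequency, parse) is hypothetical, and so are two choices made for the sketch: splitting at the first occurrence of the delimiter, and using a frequency of 1 when no delimiter is present.

```java
// Illustration only: hypothetical types, not part of Lucene.
final class DelimitedTermFrequencySketch {

    // Simple holder for the two parts of a delimited token.
    static final class TermAndFrequency {
        final String term;
        final int frequency;

        TermAndFrequency(String term, int frequency) {
            this.term = term;
            this.frequency = frequency;
        }
    }

    // Splits "term<delimiter>integer" into its term and frequency.
    // Assumption: a token without the delimiter keeps a frequency of 1.
    static TermAndFrequency parse(String token, char delimiter) {
        int at = token.indexOf(delimiter); // first occurrence of the delimiter
        if (at < 0) {
            return new TermAndFrequency(token, 1);
        }
        // Integer.parseInt throws NumberFormatException if the suffix is not an integer.
        int frequency = Integer.parseInt(token.substring(at + 1));
        return new TermAndFrequency(token.substring(0, at), frequency);
    }

    public static void main(String[] args) {
        TermAndFrequency parsed = parse("lucene|5", '|');
        // prints "lucene 5"
        System.out.println(parsed.term + " " + parsed.frequency);
    }
}
```

The '|' delimiter in the usage line is only an example value; the delimiter is a parameter of the sketch.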