All Classes and Interfaces

Class
Description
Base class for payload encoders.
Abstract parent class for analysis factories that accept a stopwords file as input.
An object representing the analysis result of a simple (non-compound) word
An object representing a prefix or a suffix applied to a word stem
Internal class used by Snowball stemmers
Strips all characters after an apostrophe (including the apostrophe itself).
Factory for ApostropheFilter.
Analyzer for Arabic.
A TokenFilter that applies ArabicNormalizer to normalize the orthography.
A TokenFilter that applies ArabicStemmer to stem Arabic words.
Factory for ArabicStemFilter.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Armenian.
This class implements the stemming algorithm defined by a snowball script.
This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if they exist.
Factory for ASCIIFoldingFilter.
Base utility class for implementing a CharFilter.
Analyzer for Basque.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Bengali.
A TokenFilter that applies BengaliNormalizer to normalize the orthography.
A TokenFilter that applies BengaliStemmer to stem Bengali words.
Factory for BengaliStemFilter.
Abstract dictionary base class.
Abstract base dictionary writer class.
Analyzer for Brazilian Portuguese language.
A TokenFilter that applies BrazilianStemmer.
Factory for BrazilianStemFilter.
Analyzer for Bulgarian.
A TokenFilter that applies BulgarianStemmer to stem Bulgarian words.
Factory for BulgarianStemFilter.
This class implements a simple byte vector with access to the underlying array.
A filter to apply normal capitalization rules to Tokens.
Analyzer for Catalan.
This class implements the stemming algorithm defined by a snowball script.
Character category data.
Functional interface to look up character class
Writes character definition file
A CharacterIterator used internally with BreakIterator
An abstract base class for simple, character-oriented tokenizers.
This class implements a simple char vector with access to the underlying array.
An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter
Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.
Factory for CJKBigramFilter.
A CharFilter that normalizes CJK width differences: folds fullwidth ASCII variants into the equivalent Basic Latin, and folds halfwidth Katakana variants into the equivalent kana.
Factory for CJKWidthCharFilter.
A TokenFilter that normalizes CJK width differences: folds fullwidth ASCII variants into the equivalent Basic Latin, and folds halfwidth Katakana variants into the equivalent kana.
Factory for CJKWidthFilter.
Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
Normalizes tokens extracted with ClassicTokenizer.
Factory for ClassicFilter.
A grammar-based tokenizer constructed with JFlex
Factory for ClassicTokenizer.
Removes words that are too long or too short from the stream.
Extension of CharTermAttributeImpl that encodes the term text as a binary Unicode collation key instead of as UTF-8 bytes.
Converts each token into its CollationKey, and then encodes the bytes as an index term.
Indexes collation keys as a single-valued SortedDocValuesField.
Construct bigrams for frequently occurring terms while indexing.
Constructs a CommonGramsFilter.
Wrap a CommonGramsFilter optimizing phrase queries by only returning single words when they are not a member of a bigram.
Base class for decomposition token filters.
Concatenates/Joins every incoming token with a separator into one output token for every path through the token stream (which is a graph).
Attribute providing access to the term builder and UTF-16 conversion
A TokenStream that takes an array of input TokenStreams as sources, and concatenates them together.
Allows skipping TokenFilters based on the current set of attributes.
Abstract parent class for analysis factories that create ConditionalTokenFilter instances
n-gram connection cost data
Writes connection costs
Utility class for parsing CSV text
A general-purpose Analyzer that can be created with a builder-style API.
Builder for CustomAnalyzer.
Factory class for a ConditionalTokenFilter
Analyzer for Czech language.
A TokenFilter that applies CzechStemmer to stem Czech words.
Factory for CzechStemFilter.
Analyzer for Danish.
This class implements the stemming algorithm defined by a snowball script.
Filters all tokens that cannot be parsed to a date, using the provided DateFormat.
Folds all Unicode digits in [:General_Category=Decimal_Number:] to Basic Latin digits (0-9).
Factory for DecimalDigitFilter.
Characters before the delimiter are the "token", those after are the boost.
Characters before the delimiter are the "token", those after are the payload.
Characters before the delimiter are the "token", the textual integer after is the term frequency.
An object representing homonym dictionary entries.
An object representing *.dic file entry with its word, flags and morphological data.
In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
High-level dictionary interface for morphological analyzers.
A TokenFilter that decomposes compound words found in many Germanic languages.
Abstract writer class to write dictionary entries.
Dl4jModelReader reads the file generated by the Deeplearning4j library and provides a Word2VecModel with normalized vectors
Allows Tokens with a given combination of flags to be dropped.
Provides a filter that will drop tokens matching a set of flags.
Analyzer for Dutch language.
This class implements the stemming algorithm defined by a snowball script.
Creates new instances of EdgeNGramTokenFilter.
Tokenizes the given token into n-grams of given size(s).
Tokenizes the input from an edge into n-grams of given size(s).
Creates new instances of EdgeNGramTokenizer.
Removes elisions from a TokenStream.
Factory for ElisionFilter.
An always exhausted token stream.
Analyzer for English.
A TokenFilter that applies EnglishMinimalStemmer to stem English words.
TokenFilter that removes possessives (trailing 's) from words.
This class implements the stemming algorithm defined by a snowball script.
Suggestion to add/edit dictionary entries to generate a given list of words created by WordFormGenerator.compress(java.util.List<java.lang.String>, java.util.Set<java.lang.String>, java.lang.Runnable).
Analyzer for Estonian.
This class implements the stemming algorithm defined by a snowball script.
Simple ResourceLoader that opens resource files from the local file system, optionally resolving against a base directory.
Filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.
Factory for FingerprintFilter.
Analyzer for Finnish.
A TokenFilter that applies FinnishLightStemmer to stem Finnish words.
This class implements the stemming algorithm defined by a snowball script.
Deprecated.
Fix the token filters that create broken offsets in the first place.
Deprecated.
A FixedShingleFilter constructs shingles (token n-grams) from a token stream.
Factory for FixedShingleFilter
Converts an incoming graph token stream, such as one from SynonymGraphFilter, into a flat form so that all nodes form a single linear chain with no side paths.
Factory for FlattenGraphFilter.
Encode a character array Float as a BytesRef.
An oracle for quickly checking that a specific part of a word can never be a valid word.
Analyzer for French language.
A TokenFilter that applies FrenchLightStemmer to stem French words.
A TokenFilter that applies FrenchMinimalStemmer to stem French words.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Galician.
A TokenFilter that applies GalicianMinimalStemmer to stem Galician words.
A TokenFilter that applies GalicianStemmer to stem Galician words.
Factory for GalicianStemFilter.
Analyzer for German language.
A TokenFilter that applies GermanLightStemmer to stem German words.
A TokenFilter that applies GermanMinimalStemmer to stem German words.
Normalizes German characters according to the heuristics of the German snowball algorithm.
A TokenFilter that stems German words.
Factory for GermanStemFilter.
This class implements the stemming algorithm defined by a snowball script.
Outputs the dot (Graphviz) string for the Viterbi lattice.
Dictionary provider
Analyzer for the Greek language.
Normalizes token text to lower case, removes some Greek diacritics, and standardizes final sigma to sigma.
A TokenFilter that applies GreekStemmer to stem Greek words.
Factory for GreekStemFilter.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Hindi.
A TokenFilter that applies HindiNormalizer to normalize the orthography.
A TokenFilter that applies HindiStemmer to stem Hindi words.
Factory for HindiStemFilter.
This class implements the stemming algorithm defined by a snowball script.
A CharFilter that wraps another Reader and attempts to strip out HTML constructs.
Factory for HTMLStripCharFilter.
Analyzer for Hungarian.
A TokenFilter that applies HungarianLightStemmer to stem Hungarian words.
This class implements the stemming algorithm defined by a snowball script.
A spell checker based on Hunspell dictionaries.
TokenFilter that uses hunspell affix rules and words to stem tokens.
TokenFilterFactory that creates instances of HunspellStemFilter.
This class represents a hyphen.
When the plain text is extracted from documents, we will often have many words hyphenated and broken into two lines.
This class represents a hyphenated word.
A TokenFilter that decomposes compound words found in many Germanic languages.
This tree structure stores the hyphenation patterns in an efficient way for fast lookup.
Does nothing other than convert the char array to a byte array using the specified encoding.
A TokenFilter that applies IndicNormalizer to normalize text in Indian Languages.
Analyzer for Indonesian (Bahasa)
A TokenFilter that applies IndonesianStemmer to stem Indonesian words.
This class implements the stemming algorithm defined by a snowball script.
Encode a character array Integer as a BytesRef.
Analyzer for Irish.
Normalises token text to lower case, handling t-prothesis and n-eclipsis (e.g., 'nAthair' becomes 'n-athair').
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Italian.
A TokenFilter that applies ItalianLightStemmer to stem Italian words.
This class implements the stemming algorithm defined by a snowball script.
A TokenFilter that only keeps tokens with text contained in the required words.
Factory for KeepWordFilter.
"Tokenizes" the entire stream as a single token.
Marks terms as keywords via the KeywordAttribute.
Factory for KeywordMarkerFilter.
This TokenFilter emits each incoming token twice, once as a keyword and once as a non-keyword; in other words, once with KeywordAttribute.setKeyword(boolean) set to true and once set to false.
Factory for KeywordRepeatFilter.
Emits the entire input as a single token.
Factory for KeywordTokenizer.
A high-performance kstem filter for English.
Factory for KStemFilter.
Analyzer for Latvian.
A TokenFilter that applies LatvianStemmer to stem Latvian words.
Factory for LatvianStemFilter.
Removes words that are too long or too short from the stream.
Factory for LengthFilter.
A LetterTokenizer is a tokenizer that divides text at non-letters.
Factory for LetterTokenizer.
This Analyzer limits the number of tokens while indexing.
This TokenFilter limits the number of tokens while indexing.
Lets all tokens pass through while their start offset is <= a configured limit; the first token with a greater start offset does not pass and ends the stream.
This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.
Analyzer for Lithuanian.
This class implements the stemming algorithm defined by a snowball script.
Normalizes token text to lower case.
Factory for LowerCaseFilter.
Simplistic CharFilter that applies the mappings contained in a NormalizeCharMap to the character stream, and corrects the resulting changes to the offsets.
Factory for MappingCharFilter.
Generate min hash tokens from an incoming stream of tokens.
High-level interface that represents morphological information in a dictionary
Analyzer for Nepali.
This class implements the stemming algorithm defined by a snowball script.
Factory for NGramTokenFilter.
A FragmentChecker based on all character n-grams possible in a certain language, keeping them in a relatively memory-efficient, but probabilistic data structure.
A callback for n-gram ranges in words
Tokenizes the input into n-grams of the given size(s).
Tokenizes the input into n-grams of the given size(s).
Factory for NGramTokenizer.
Holds a map of String input to String output, to be used with MappingCharFilter.
Builds a NormalizeCharMap.
Analyzer for Norwegian.
A TokenFilter that applies NorwegianLightStemmer to stem Norwegian words.
A TokenFilter that applies NorwegianMinimalStemmer to stem Norwegian words.
This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (ae, oe, aa) by transforming them to åÅæÆøØ.
This class implements the stemming algorithm defined by a snowball script.
Assigns a payload to a token based on the TypeAttribute
A StringBuilder that allows one to access the array.
Tokenizer for path-like hierarchies.
CaptureGroup uses Java regexes to emit multiple tokens - one for each capture group in one or more patterns.
This interface is used to connect the XML pattern file parser to the hyphenation tree.
Marks terms as keywords via the KeywordAttribute.
A SAX document handler to read and parse hyphenation patterns from a XML file.
CharFilter that uses a regular expression to match the text that is replaced with a replacement string.
A TokenFilter which applies a Pattern to each token in the stream, replacing match occurrences with the specified replacement string.
This tokenizer uses regex pattern matching to construct distinct tokens for the input stream.
Factory for PatternTokenizer.
Sets a type attribute to a parameterized value when a token is matched by any of several regex patterns.
Value holding class for pattern typing rules.
Provides a filter that will analyze tokens with the analyzer from an arbitrary field type.
Mainly for use with the DelimitedPayloadTokenFilter, converts char buffers to BytesRef.
Utility methods for encoding payloads.
This analyzer is used to facilitate scenarios where different fields require different analysis techniques.
Analyzer for Persian.
CharFilter that replaces instances of Zero-width non-joiner with an ordinary space.
Factory for PersianCharFilter.
A TokenFilter that applies PersianNormalizer to normalize the orthography.
A TokenFilter that applies PersianStemmer to stem Persian words.
Factory for PersianStemFilter.
Transforms the token stream as per the Porter stemming algorithm.
Factory for PorterStemFilter.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Portuguese.
A TokenFilter that applies PortugueseLightStemmer to stem Portuguese words.
A TokenFilter that applies PortugueseMinimalStemmer to stem Portuguese words.
A TokenFilter that applies PortugueseStemmer to stem Portuguese words.
This class implements the stemming algorithm defined by a snowball script.
A ConditionalTokenFilter that only applies its wrapped filters to tokens that are not contained in a protected set.
Factory for a ProtectedTermFilter
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.
A TokenFilter which filters out tokens that have the same position and term text as the previous token in the stream.
Tokenizer for domain-like hierarchies.
Reverse token string, for example "country" => "yrtnuoc".
Factory for ReverseStringFilter.
Acts like a forever growing char[] as you read characters into it from the provided reader, but internally it uses a circular buffer to only hold the characters that haven't been freed yet.
Analyzer for Romanian.
TokenFilter that normalizes cedilla forms to comma forms.
This class implements the stemming algorithm defined by a snowball script.
Base class for stemmers that use a set of RSLP-like stemming steps.
A basic rule, with no exceptions.
A rule with a set of whole-word exceptions.
A rule with a set of exceptional suffixes.
A step containing a list of rules.
Analyzer for Russian language.
A TokenFilter that applies RussianLightStemmer to stem Russian words.
This class implements the stemming algorithm defined by a snowball script.
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o.
This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
This Normalizer does the heavy lifting for a set of Scandinavian normalization filters, normalizing use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
List of possible foldings that can be used when configuring the filter
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.
Analyzer for Serbian.
Normalizes Serbian Cyrillic and Latin characters to "bald" Latin.
Normalizes Serbian Cyrillic to Latin.
This class implements the stemming algorithm defined by a snowball script.
Marks terms as keywords via the KeywordAttribute.
A ShingleAnalyzerWrapper wraps a ShingleFilter around another Analyzer.
A ShingleFilter constructs shingles (token n-grams) from a token stream.
Factory for ShingleFilter.
This tokenizer uses a Lucene RegExp or (expert usage) a pre-built determinized Automaton, to locate tokens.
Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.
This tokenizer uses a Lucene RegExp or (expert usage) a pre-built determinized Automaton, to locate tokens.
Factory for SimplePatternTokenizer, for matching tokens based on the provided regexp.
A filter that stems words using a Snowball-generated stemmer.
Factory for SnowballFilter, with configurable language
Base class for a snowball stemmer
Parent class of all snowball stemmers, which must implement stem
Parser for the Solr synonyms format.
Analyzer for Sorani Kurdish.
A TokenFilter that applies SoraniNormalizer to normalize the orthography.
A TokenFilter that applies SoraniStemmer to stem Sorani words.
Factory for SoraniStemFilter.
The strategy defining how a Hunspell dictionary should be loaded, with different tradeoffs.
Analyzer for Spanish.
A TokenFilter that applies SpanishLightStemmer to stem Spanish words.
Deprecated.
Deprecated.
A TokenFilter that applies SpanishPluralStemmer to stem Spanish words.
This class implements the stemming algorithm defined by a snowball script.
Provides the ability to override any KeywordAttribute aware stemmer with custom dictionary-based stemming.
This builder builds an FST for the StemmerOverrideFilter
A read-only 4-byte FST backed map that allows fast case-insensitive key value lookups for StemmerOverrideFilter
Some commonly-used stemming functions
Removes stop words from a token stream.
Factory for StopFilter.
A generator for misspelled word corrections based on Hunspell flags.
An exception thrown when Hunspell.suggest(java.lang.String) call takes too long, if TimeoutPolicy.THROW_EXCEPTION is used.
Analyzer for Swedish.
A TokenFilter that applies SwedishLightStemmer to stem Swedish words.
A TokenFilter that applies SwedishMinimalStemmer to stem Swedish words.
This class implements the stemming algorithm defined by a snowball script.
Deprecated.
Use SynonymGraphFilter instead, but be sure to also use FlattenGraphFilter at index time (not at search time) as well.
Deprecated.
Use SynonymGraphFilterFactory instead, but be sure to also use FlattenGraphFilterFactory at index time (not at search time) as well.
Applies single- or multi-token synonyms from a SynonymMap to an incoming TokenStream, producing a fully correct graph output.
Factory for SynonymGraphFilter.
A map of synonyms, keys and values are phrases.
Builds an FSTSynonymMap.
Abstraction for parsing synonym files.
Analyzer for Tamil.
This class implements the stemming algorithm defined by a snowball script.
This TokenFilter provides the ability to set aside attribute states that have already been analyzed.
TokenStream output from a tee.
Analyzer for Telugu.
A TokenFilter that applies TeluguNormalizer to normalize the orthography.
A TokenFilter that applies TeluguStemmer to stem Telugu words.
Factory for TeluguStemFilter.
Wraps a term and boost
Ternary Search Tree.
Analyzer for Thai language.
Tokenizer that uses BreakIterator to tokenize Thai text.
Factory for ThaiTokenizer.
A strategy determining what to do when Hunspell API calls take too much time
Analyzed token with morphological data.
Thin wrapper around an FST with root-arc caching.
Adds the OffsetAttribute.startOffset() and OffsetAttribute.endOffset() First 4 bytes are the start
Token type reflecting the original source of this token
Trims leading and trailing whitespace from Tokens in the stream.
Factory for TrimFilter.
A token filter for truncating the terms into a specific length.
Factory for TruncateTokenFilter.
Analyzer for Turkish.
Normalizes Turkish token text to lower case.
This class implements the stemming algorithm defined by a snowball script.
Makes the TypeAttribute a payload.
Adds the TypeAttribute.type() as a synonym, i.e.
Factory for TypeAsSynonymFilter.
Removes tokens whose types appear in a set of blocked types from a token stream.
Factory class for TypeTokenFilter.
Filters UAX29URLEmailTokenizer with LowerCaseFilter and StopFilter, using a list of English stop words.
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
This file contains Unicode properties used by various CharTokenizers.
An Analyzer that uses UnicodeWhitespaceTokenizer.
A UnicodeWhitespaceTokenizer is a tokenizer that divides text at whitespace.
Normalizes token text to UPPER CASE.
Factory for UpperCaseFilter.
Performs the Viterbi algorithm for morphological Tokenizers, which split text using a Hidden Markov Model or Conditional Random Fields.
Holds all back pointers arriving to this position.
Holds partial graph (array of positions) for calculating the minimum cost path
Viterbi subclass for n-best path calculation.
Yet another lattice data structure for keeping n-best path.
Viterbi.Position extension; this holds all forward pointers to calculate n-best path.
An Analyzer that uses WhitespaceTokenizer.
A tokenizer that divides text at whitespace characters as defined by Character.isWhitespace(int).
Factory for WhitespaceTokenizer.
Extension of StandardTokenizer that is aware of Wikipedia syntax.
Factory for WikipediaTokenizer.
Word2VecModel is a class representing the parsed Word2Vec model containing the vectors for each word in dictionary
Applies single-token synonyms from a Word2Vec trained network to an incoming TokenStream.
The Word2VecSynonymProvider generates the list of synonyms of a term.
Supplies a cache of Word2VecSynonymProvider instances, so that multiple instances of Word2VecSynonymFilterFactory do not instantiate multiple instances of the same SynonymProvider.
Deprecated.
Use WordDelimiterGraphFilter instead: it produces a correct token graph so that e.g.
Deprecated.
Use WordDelimiterGraphFilterFactory instead: it produces a correct token graph so that e.g.
Splits words into subwords and performs optional transformations on subword groups, producing a correct token graph so that e.g.
A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.
A utility class used for generating possible word forms by adding affixes to stems (WordFormGenerator.getAllWordForms(String, String, Runnable)), and suggesting stems and flags to generate the given set of words (WordFormGenerator.compress(List, Set, Runnable)).
Parser for wordnet prolog format
This class implements the stemming algorithm defined by a snowball script.
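
Several descriptions above name simple token transformations: the fingerprint filter (one token built from the sorted, de-duplicated input tokens), the shingle filters (token n-grams), and the reverse-string filter ("country" => "yrtnuoc"). The following is a minimal plain-Java sketch of those three ideas only. It does not use or reproduce the library's classes: the class name TokenSketches and its methods are hypothetical, chosen for illustration, and the sketch works on plain lists of strings rather than a TokenStream, so it ignores attributes such as offsets and positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the documented library.
class TokenSketches {

    // Fingerprint idea: sort the tokens, drop duplicates, and join the
    // result into a single output string. TreeSet provides both the
    // ordering (natural String order) and the de-duplication.
    static String fingerprint(List<String> tokens, String separator) {
        return String.join(separator, new TreeSet<>(tokens));
    }

    // Shingle idea: every run of `size` adjacent tokens becomes one
    // output token, joined by a single space.
    static List<String> shingles(List<String> tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    // Reverse-string idea: reverse the characters of a single token.
    static String reverse(String token) {
        return new StringBuilder(token).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(fingerprint(List.of("the", "quick", "the", "brown"), " "));
        System.out.println(shingles(List.of("please", "divide", "this"), 2));
        System.out.println(reverse("country"));
    }
}
```

Running main prints the fingerprint of four tokens, the two-token shingles of a three-token input, and the reversed form of "country". A real analysis chain would obtain these behaviors from the corresponding filters and their factories rather than from helper methods like these.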