All Classes and Interfaces
Class
Description
Base class for payload encoders.
Abstract parent class for analysis factories that accept a stopwords file as input.
An object representing the analysis result of a simple (non-compound) word
An object representing a prefix or a suffix applied to a word stem
Internal class used by Snowball stemmers
Strips all characters after an apostrophe (including the apostrophe itself).
Factory for ApostropheFilter.
Analyzer for Arabic.
A TokenFilter that applies ArabicNormalizer to normalize the orthography.
Factory for ArabicNormalizationFilter.
Normalizer for Arabic.
A TokenFilter that applies ArabicStemmer to stem Arabic words.
Factory for ArabicStemFilter.
Stemmer for Arabic.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Armenian.
This class implements the stemming algorithm defined by a snowball script.
This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the
first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one
exists.
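The folding idea described above can be approximated in plain Java. This is a minimal sketch, not the filter's own implementation: it swaps in a different technique (Unicode NFD decomposition followed by removal of combining marks), and the class and method names are hypothetical. Characters that have no canonical decomposition, such as "ø" or "ß", pass through unchanged here.

```java
import java.text.Normalizer;

// Hypothetical illustration class; not part of the library indexed on this page.
final class AsciiFoldingSketch {

    // Decompose accented characters (NFD), then drop the combining marks.
    // Assumption: this only covers characters with a canonical decomposition.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final letter.
        System.out.println(fold("caf\u00e9"));
    }
}
```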
Factory for ASCIIFoldingFilter.
Base utility class for implementing a CharFilter.
Analyzer for Basque.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Bengali.
A TokenFilter that applies BengaliNormalizer to normalize the orthography.
Factory for BengaliNormalizationFilter.
Normalizer for Bengali.
A TokenFilter that applies BengaliStemmer to stem Bengali words.
Factory for BengaliStemFilter.
Stemmer for Bengali.
Analyzer for Brazilian Portuguese language.
A TokenFilter that applies BrazilianStemmer.
Factory for BrazilianStemFilter.
A stemmer for Brazilian Portuguese words.
Analyzer for Bulgarian.
A TokenFilter that applies BulgarianStemmer to stem Bulgarian words.
Factory for BulgarianStemFilter.
Light Stemmer for Bulgarian.
This class implements a simple byte vector with access to the underlying array.
A filter to apply normal capitalization rules to Tokens.
Factory for CapitalizationFilter.
Analyzer for Catalan.
This class implements the stemming algorithm defined by a snowball script.
A CharacterIterator used internally for use with BreakIterator.
An abstract base class for simple, character-oriented tokenizers.
This class implements a simple char vector with access to the underlying array.
An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.
Factory for CJKBigramFilter.
A CharFilter that normalizes CJK width differences: folds fullwidth ASCII variants into the equivalent Basic Latin, and folds halfwidth Katakana variants into the equivalent kana.
Factory for CJKWidthCharFilter.
A TokenFilter that normalizes CJK width differences: folds fullwidth ASCII variants into the equivalent Basic Latin, and folds halfwidth Katakana variants into the equivalent kana.
Factory for CJKWidthFilter.
Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
Normalizes tokens extracted with ClassicTokenizer.
Factory for ClassicFilter.
A grammar-based tokenizer constructed with JFlex.
Factory for ClassicTokenizer.
Removes words that are too long or too short from the stream.
Factory for CodepointCountFilter.
Extension of CharTermAttributeImpl that encodes the term text as a binary Unicode collation key instead of as UTF-8 bytes.
Converts each token into its CollationKey, and then encodes the bytes as an index term.
Indexes collation keys as a single-valued SortedDocValuesField.
Configures KeywordTokenizer with CollationAttributeFactory.
Construct bigrams for frequently occurring terms while indexing.
Constructs a CommonGramsFilter.
Wrap a CommonGramsFilter optimizing phrase queries by only returning single words when they are
not a member of a bigram.
Construct CommonGramsQueryFilter.
Base class for decomposition token filters.
Concatenates/Joins every incoming token with a separator into one output token for every path
through the token stream (which is a graph).
Attribute providing access to the term builder and UTF-16 conversion
Implementation of ConcatenateGraphFilter.BytesRefBuilderTermAttribute
Factory for ConcatenateGraphFilter.
A TokenStream that takes an array of input TokenStreams as sources, and concatenates them
together.
Allows skipping TokenFilters based on the current set of attributes.
Abstract parent class for analysis factories that create ConditionalTokenFilter instances.
Utility class for parsing CSV text
A general-purpose Analyzer that can be created with a builder-style API.
Builder for CustomAnalyzer.
Factory class for a ConditionalTokenFilter
Analyzer for Czech language.
A TokenFilter that applies CzechStemmer to stem Czech words.
Factory for CzechStemFilter.
Light Stemmer for Czech.
Analyzer for Danish.
This class implements the stemming algorithm defined by a snowball script.
Filters all tokens that cannot be parsed to a date, using the provided DateFormat.
Factory for DateRecognizerFilter.
Folds all Unicode digits in [:General_Category=Decimal_Number:] to Basic Latin digits (0-9).
Factory for DecimalDigitFilter.
Characters before the delimiter are the "token", those after are the boost.
Factory for DelimitedBoostTokenFilter.
Characters before the delimiter are the "token", those after are the payload.
Factory for DelimitedPayloadTokenFilter.
Characters before the delimiter are the "token", the textual integer after is the term frequency.
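The delimited-token convention described here (token text, a delimiter, then a textual integer) can be sketched in plain Java. This is an illustration under stated assumptions, not the filter's code: the class and method names are hypothetical, splitting on the last occurrence of the delimiter is an assumption, and so is the default frequency of 1 when no delimiter is present.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical illustration class; not part of the library indexed on this page.
final class DelimitedFrequencySketch {

    // Splits "token<delimiter>integer" into the token text and its frequency.
    // Assumptions: split at the LAST delimiter; frequency defaults to 1 if absent.
    static Map.Entry<String, Integer> parse(String raw, char delimiter) {
        int idx = raw.lastIndexOf(delimiter);
        if (idx < 0) {
            return new SimpleEntry<>(raw, 1);
        }
        int frequency = Integer.parseInt(raw.substring(idx + 1));
        return new SimpleEntry<>(raw.substring(0, idx), frequency);
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> parsed = parse("foo|5", '|');
        System.out.println(parsed.getKey() + " -> " + parsed.getValue());
    }
}
```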
Factory for DelimitedTermFrequencyTokenFilter.
An object representing homonym dictionary entries.
An object representing *.dic file entry with its word, flags and morphological data.
In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary.
A TokenFilter that decomposes compound words found in many Germanic languages.
Factory for DictionaryCompoundWordTokenFilter.
Dl4jModelReader reads the file generated by the Deeplearning4j library and provides a
Word2VecModel with normalized vectors.
Allows Tokens with a given combination of flags to be dropped.
Provides a filter that will drop tokens matching a set of flags.
Analyzer for Dutch language.
This class implements the stemming algorithm defined by a snowball script.
Creates new instances of EdgeNGramTokenFilter.
Tokenizes the given token into n-grams of given size(s).
Tokenizes the input from an edge into n-grams of given size(s).
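As a rough illustration of edge n-grams, the sketch below emits the leading substrings of a string for every length between a minimum and a maximum. It is a plain-Java approximation with hypothetical names, and it counts UTF-16 chars rather than code points, which is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the library indexed on this page.
final class EdgeNGramSketch {

    // Returns the prefixes of `input` with lengths minGram..maxGram (inclusive),
    // capped at the input length. Lengths are measured in UTF-16 chars.
    static List<String> edgeNGrams(String input, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, input.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(input.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("lucene", 1, 3)); // prints [l, lu, luc]
    }
}
```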
Creates new instances of EdgeNGramTokenizer.
Removes elisions from a TokenStream.
Factory for ElisionFilter.
An always exhausted token stream.
Analyzer for English.
A TokenFilter that applies EnglishMinimalStemmer to stem English words.
Factory for EnglishMinimalStemFilter.
Minimal plural stemmer for English.
TokenFilter that removes possessives (trailing 's) from words.
Factory for EnglishPossessiveFilter.
This class implements the stemming algorithm defined by a snowball script.
Suggestion to add/edit dictionary entries to generate a given list of words created by WordFormGenerator.compress(java.util.List<java.lang.String>, java.util.Set<java.lang.String>, java.lang.Runnable).
Analyzer for Estonian.
This class implements the stemming algorithm defined by a snowball script.
Simple ResourceLoader that opens resource files from the local file system, optionally resolving against a base directory.
Filter outputs a single token which is a concatenation of the sorted and de-duplicated set of
input tokens.
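The fingerprint described above (sort, de-duplicate, concatenate) can be sketched in a few lines of plain Java. The class name, method name, and the choice of natural string ordering are assumptions made for illustration only.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration class; not part of the library indexed on this page.
final class FingerprintSketch {

    // A TreeSet both de-duplicates and sorts (natural String order),
    // and String.join concatenates the result with the separator.
    static String fingerprint(List<String> tokens, String separator) {
        return String.join(separator, new TreeSet<>(tokens));
    }

    public static void main(String[] args) {
        // prints "brown quick the"
        System.out.println(fingerprint(List.of("the", "quick", "the", "brown"), " "));
    }
}
```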
Factory for FingerprintFilter.
Analyzer for Finnish.
A TokenFilter that applies FinnishLightStemmer to stem Finnish words.
Factory for FinnishLightStemFilter.
Light Stemmer for Finnish.
This class implements the stemming algorithm defined by a snowball script.
Deprecated. Fix the token filters that create broken offsets in the first place.
Deprecated.
A FixedShingleFilter constructs shingles (token n-grams) from a token stream.
Factory for FixedShingleFilter.
Converts an incoming graph token stream, such as one from SynonymGraphFilter, into a flat form so that all nodes form a single linear chain with no side paths.
Factory for FlattenGraphFilter.
Encode a character array Float as a BytesRef.
An oracle for quickly checking that a specific part of a word can never be a valid word.
Analyzer for French language.
A TokenFilter that applies FrenchLightStemmer to stem French words.
Factory for FrenchLightStemFilter.
Light Stemmer for French.
A TokenFilter that applies FrenchMinimalStemmer to stem French words.
Factory for FrenchMinimalStemFilter.
Light Stemmer for French.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Galician.
A TokenFilter that applies GalicianMinimalStemmer to stem Galician words.
Factory for GalicianMinimalStemFilter.
Minimal Stemmer for Galician
A TokenFilter that applies GalicianStemmer to stem Galician words.
Factory for GalicianStemFilter.
Galician stemmer implementing "Regras do lematizador para o galego".
This class implements the stemming algorithm defined by a snowball script.
Analyzer for German language.
A TokenFilter that applies GermanLightStemmer to stem German words.
Factory for GermanLightStemFilter.
Light Stemmer for German.
A TokenFilter that applies GermanMinimalStemmer to stem German words.
Factory for GermanMinimalStemFilter.
Minimal Stemmer for German.
Normalizes German characters according to the heuristics of the German2 snowball algorithm.
Factory for GermanNormalizationFilter.
A TokenFilter that stems German words.
Factory for GermanStemFilter.
A stemmer for German words.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for the Greek language.
Normalizes token text to lower case, removes some Greek diacritics, and standardizes final sigma
to sigma.
Factory for GreekLowerCaseFilter.
A TokenFilter that applies GreekStemmer to stem Greek words.
Factory for GreekStemFilter.
A stemmer for Greek words, according to: Development of a Stemmer for the Greek Language. Georgios Ntais
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Hindi.
A TokenFilter that applies HindiNormalizer to normalize the orthography.
Factory for HindiNormalizationFilter.
Normalizer for Hindi.
A TokenFilter that applies HindiStemmer to stem Hindi words.
Factory for HindiStemFilter.
Light Stemmer for Hindi.
This class implements the stemming algorithm defined by a snowball script.
A CharFilter that wraps another Reader and attempts to strip out HTML constructs.
Factory for HTMLStripCharFilter.
Analyzer for Hungarian.
A TokenFilter that applies HungarianLightStemmer to stem Hungarian words.
Factory for HungarianLightStemFilter.
Light Stemmer for Hungarian.
This class implements the stemming algorithm defined by a snowball script.
A spell checker based on Hunspell dictionaries.
TokenFilter that uses hunspell affix rules and words to stem tokens.
TokenFilterFactory that creates instances of HunspellStemFilter.
This class represents a hyphen.
When the plain text is extracted from documents, we will often have many words hyphenated and
broken into two lines.
Factory for HyphenatedWordsFilter.
This class represents a hyphenated word.
A TokenFilter that decomposes compound words found in many Germanic languages.
Factory for HyphenationCompoundWordTokenFilter.
This tree structure stores the hyphenation patterns in an efficient way for fast lookup.
Does nothing other than convert the char array to a byte array using the specified encoding.
A TokenFilter that applies IndicNormalizer to normalize text in Indian Languages.
Factory for IndicNormalizationFilter.
Normalizes the Unicode representation of text in Indian languages.
Analyzer for Indonesian (Bahasa)
A TokenFilter that applies IndonesianStemmer to stem Indonesian words.
Factory for IndonesianStemFilter.
Stemmer for Indonesian.
This class implements the stemming algorithm defined by a snowball script.
Encode a character array Integer as a BytesRef.
Analyzer for Irish.
Normalises token text to lower case, handling t-prothesis and n-eclipsis (i.e., that 'nAthair'
should become 'n-athair')
Factory for IrishLowerCaseFilter.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Italian.
A TokenFilter that applies ItalianLightStemmer to stem Italian words.
Factory for ItalianLightStemFilter.
Light Stemmer for Italian.
This class implements the stemming algorithm defined by a snowball script.
A TokenFilter that only keeps tokens with text contained in the required words.
Factory for KeepWordFilter.
"Tokenizes" the entire stream as a single token.
Marks terms as keywords via the KeywordAttribute.
Factory for KeywordMarkerFilter.
This TokenFilter emits each incoming token twice, once as keyword and once as non-keyword; in other words, once with KeywordAttribute.setKeyword(boolean) set to true and once set to false.
Factory for KeywordRepeatFilter.
Emits the entire input as a single token.
Factory for KeywordTokenizer.
This class implements the stemming algorithm defined by a snowball script.
A high-performance kstem filter for english.
Factory for KStemFilter.
This class implements the Kstem algorithm
Analyzer for Latvian.
A TokenFilter that applies LatvianStemmer to stem Latvian words.
Factory for LatvianStemFilter.
Light stemmer for Latvian.
Removes words that are too long or too short from the stream.
Factory for LengthFilter.
A LetterTokenizer is a tokenizer that divides text at non-letters.
Factory for LetterTokenizer.
This Analyzer limits the number of tokens while indexing.
This TokenFilter limits the number of tokens while indexing.
Factory for LimitTokenCountFilter.
Lets tokens pass through while their start offset is <= a configured limit; the first token
beyond the limit does not pass and ends the stream.
Factory for LimitTokenOffsetFilter.
This TokenFilter limits its emitted tokens to those with positions that are not greater than the
configured limit.
Factory for LimitTokenPositionFilter.
Analyzer for Lithuanian.
This class implements the stemming algorithm defined by a snowball script.
This class implements the stemming algorithm defined by a snowball script.
Normalizes token text to lower case.
Factory for LowerCaseFilter.
Simplistic CharFilter that applies the mappings contained in a NormalizeCharMap to the character stream, and corrects the resulting changes to the offsets.
Factory for MappingCharFilter.
Generate min hash tokens from an incoming stream of tokens.
Analyzer for Nepali.
This class implements the stemming algorithm defined by a snowball script.
Factory for NGramTokenFilter.
A FragmentChecker based on all character n-grams possible in a certain language, keeping them in a relatively memory-efficient, but probabilistic data structure.
A callback for n-gram ranges in words
Tokenizes the input into n-grams of the given size(s).
Tokenizes the input into n-grams of the given size(s).
Factory for NGramTokenizer.
Holds a map of String input to String output, to be used with MappingCharFilter.
Builds a NormalizeCharMap.
Analyzer for Norwegian.
A TokenFilter that applies NorwegianLightStemmer to stem Norwegian words.
Factory for NorwegianLightStemFilter.
Light Stemmer for Norwegian.
A TokenFilter that applies NorwegianMinimalStemmer to stem Norwegian words.
Factory for NorwegianMinimalStemFilter.
Minimal Stemmer for Norwegian Bokmål (no-nb) and Nynorsk (no-nn)
This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded
variants (ae, oe, aa) by transforming them to åÅæÆøØ.
Factory for NorwegianNormalizationFilter.
This class implements the stemming algorithm defined by a snowball script.
Assigns a payload to a token based on the TypeAttribute
Factory for NumericPayloadTokenFilter.
A StringBuilder that allows one to access the array.
Tokenizer for path-like hierarchies.
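One way to picture tokenizing a path-like hierarchy is to emit every ancestor prefix of the path. The sketch below does that in plain Java; it is an assumption-laden illustration with hypothetical names, and it does not treat repeated or trailing delimiters specially.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the library indexed on this page.
final class PathHierarchySketch {

    // Emits the text before each delimiter (skipping a leading one),
    // followed by the full path itself.
    static List<String> prefixes(String path, char delimiter) {
        List<String> out = new ArrayList<>();
        for (int i = 1; i < path.length(); i++) {
            if (path.charAt(i) == delimiter) {
                out.add(path.substring(0, i));
            }
        }
        if (!path.isEmpty()) {
            out.add(path);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [/usr, /usr/local, /usr/local/lib]
        System.out.println(prefixes("/usr/local/lib", '/'));
    }
}
```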
Factory for PathHierarchyTokenizer.
Factory for PatternCaptureGroupTokenFilter.
CaptureGroup uses Java regexes to emit multiple tokens - one for each capture group in one or
more patterns.
This interface is used to connect the XML pattern file parser to the hyphenation tree.
Marks terms as keywords via the KeywordAttribute.
A SAX document handler to read and parse hyphenation patterns from an XML file.
CharFilter that uses a regular expression for the target of replace string.
Factory for PatternReplaceCharFilter.
A TokenFilter which applies a Pattern to each token in the stream, replacing match occurrences
with the specified replacement string.
Factory for PatternReplaceFilter.
This tokenizer uses regex pattern matching to construct distinct tokens for the input stream.
Factory for PatternTokenizer.
Set a type attribute to a parameterized value when tokens are matched by any of several regex
patterns.
Value holding class for pattern typing rules.
Provides a filter that will analyze tokens with the analyzer from an arbitrary field type.
Mainly for use with the DelimitedPayloadTokenFilter, converts char buffers to BytesRef.
Utility methods for encoding payloads.
This analyzer is used to facilitate scenarios where different fields require different analysis
techniques.
Analyzer for Persian.
CharFilter that replaces instances of Zero-width non-joiner with an ordinary space.
Factory for PersianCharFilter.
A TokenFilter that applies PersianNormalizer to normalize the orthography.
Factory for PersianNormalizationFilter.
Normalizer for Persian.
A TokenFilter that applies PersianStemmer to stem Persian words.
Factory for PersianStemFilter.
Stemmer for Persian.
Transforms the token stream as per the Porter stemming algorithm.
Factory for PorterStemFilter.
This class implements the stemming algorithm defined by a snowball script.
Analyzer for Portuguese.
A TokenFilter that applies PortugueseLightStemmer to stem Portuguese words.
Factory for PortugueseLightStemFilter.
Light Stemmer for Portuguese
A TokenFilter that applies PortugueseMinimalStemmer to stem Portuguese words.
Factory for PortugueseMinimalStemFilter.
Minimal Stemmer for Portuguese
A TokenFilter that applies PortugueseStemmer to stem Portuguese words.
Factory for PortugueseStemFilter.
Portuguese stemmer implementing the RSLP (Removedor de Sufixos da Lingua Portuguesa) algorithm.
This class implements the stemming algorithm defined by a snowball script.
A ConditionalTokenFilter that only applies its wrapped filters to tokens that are not contained
in a protected set.
Factory for a ProtectedTermFilter
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.
A TokenFilter which filters out Tokens at the same position and Term text as the previous token
in the stream.
Factory for RemoveDuplicatesTokenFilter.
Tokenizer for domain-like hierarchies.
Reverse token string, for example "country" => "yrtnuoc".
Factory for ReverseStringFilter.
Acts like a forever growing char[] as you read characters into it from the provided reader, but
internally it uses a circular buffer to only hold the characters that haven't been freed yet.
Analyzer for Romanian.
This class implements the stemming algorithm defined by a snowball script.
Base class for stemmers that use a set of RSLP-like stemming steps.
A basic rule, with no exceptions.
A rule with a set of whole-word exceptions.
A rule with a set of exceptional suffixes.
A step containing a list of rules.
Analyzer for Russian language.
A TokenFilter that applies RussianLightStemmer to stem Russian words.
Factory for RussianLightStemFilter.
Light Stemmer for Russian.
This class implements the stemming algorithm defined by a snowball script.
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o.
Factory for ScandinavianFoldingFilter.
This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded
variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
Factory for ScandinavianNormalizationFilter.
This Normalizer does the heavy lifting for a set of Scandinavian normalization filters,
normalizing use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa,
ao, ae, oe and oo) by transforming them to åÅæÆøØ.
List of possible foldings that can be used when configuring the filter
Breaks text into sentences with a BreakIterator and allows subclasses to decompose these sentences into words.
Analyzer for Serbian.
Normalizes Serbian Cyrillic and Latin characters to "bald" Latin.
Factory for SerbianNormalizationFilter.
Normalizes Serbian Cyrillic to Latin.
This class implements the stemming algorithm defined by a snowball script.
Marks terms as keywords via the KeywordAttribute.
A ShingleAnalyzerWrapper wraps a ShingleFilter around another Analyzer.
A ShingleFilter constructs shingles (token n-grams) from a token stream.
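To make "shingles (token n-grams)" concrete, the sketch below slides a fixed-size window over a token list and joins each window with a space. It is a plain-Java illustration with hypothetical names; it produces only shingles of one fixed size, and the space separator is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the library indexed on this page.
final class ShingleSketch {

    // Joins every run of `size` adjacent tokens into one shingle.
    static List<String> shingles(List<String> tokens, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be >= 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.size(); i++) {
            out.add(String.join(" ", tokens.subList(i, i + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [please divide, divide this, this sentence]
        System.out.println(shingles(List.of("please", "divide", "this", "sentence"), 2));
    }
}
```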
Factory for ShingleFilter.
Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.
Factory for SimplePatternTokenizer, for matching tokens based on the provided regexp.
A filter that stems words using a Snowball-generated stemmer.
Factory for SnowballFilter, with configurable language
Base class for a snowball stemmer
Parent class of all snowball stemmers, which must implement stem
Parser for the Solr synonyms format.
Analyzer for Sorani Kurdish.
A TokenFilter that applies SoraniNormalizer to normalize the orthography.
Factory for SoraniNormalizationFilter.
Normalizes the Unicode representation of Sorani text.
A TokenFilter that applies SoraniStemmer to stem Sorani words.
Factory for SoraniStemFilter.
Light stemmer for Sorani
The strategy defining how a Hunspell dictionary should be loaded, with different tradeoffs.
Analyzer for Spanish.
A TokenFilter that applies SpanishLightStemmer to stem Spanish words.
Factory for SpanishLightStemFilter.
Light Stemmer for Spanish
Deprecated. Use SpanishPluralStemFilter instead.
Deprecated. Use SpanishPluralStemFilterFactory instead.
Deprecated. Use SpanishPluralStemmer instead.
A TokenFilter that applies SpanishPluralStemmer to stem Spanish words.
Factory for SpanishPluralStemFilterFactory.
Plural Stemmer for Spanish
This class implements the stemming algorithm defined by a snowball script.
Provides the ability to override any KeywordAttribute aware stemmer with custom dictionary-based stemming.
This builder builds an FST for the StemmerOverrideFilter
A read-only 4-byte FST backed map that allows fast case-insensitive key value lookups for StemmerOverrideFilter
Factory for StemmerOverrideFilter.
Some commonly-used stemming functions
Removes stop words from a token stream.
Factory for StopFilter.
A generator for misspelled word corrections based on Hunspell flags.
An exception thrown when Hunspell.suggest(java.lang.String) call takes too long, if TimeoutPolicy.THROW_EXCEPTION is used.
Analyzer for Swedish.
A TokenFilter that applies SwedishLightStemmer to stem Swedish words.
Factory for SwedishLightStemFilter.
Light Stemmer for Swedish.
A TokenFilter that applies SwedishMinimalStemmer to stem Swedish words.
Factory for SwedishMinimalStemFilter.
Minimal Stemmer for Swedish.
This class implements the stemming algorithm defined by a snowball script.
Deprecated. Use SynonymGraphFilter instead, but be sure to also use FlattenGraphFilter at index time (not at search time) as well.
Deprecated. Use SynonymGraphFilterFactory instead, but be sure to also use FlattenGraphFilterFactory at index time (not at search time) as well.
Applies single- or multi-token synonyms from a SynonymMap to an incoming TokenStream, producing a fully correct graph output.
Factory for SynonymGraphFilter.
A map of synonyms, keys and values are phrases.
Builds an FSTSynonymMap.
Abstraction for parsing synonym files.
Analyzer for Tamil.
This class implements the stemming algorithm defined by a snowball script.
This TokenFilter provides the ability to set aside attribute states that have already been
analyzed.
TokenStream output from a tee.
Analyzer for Telugu.
A TokenFilter that applies TeluguNormalizer to normalize the orthography.
Factory for TeluguNormalizationFilter.
Normalizer for Telugu.
A TokenFilter that applies TeluguStemmer to stem Telugu words.
Factory for TeluguStemFilter.
Stemmer for Telugu.
Wraps a term and boost
Ternary Search Tree.
Analyzer for Thai language.
Tokenizer that uses BreakIterator to tokenize Thai text.
Factory for ThaiTokenizer.
A strategy determining what to do when Hunspell API calls take too much time
Adds the OffsetAttribute.startOffset() and OffsetAttribute.endOffset() First 4 bytes are the start
Factory for TokenOffsetPayloadTokenFilter.
Trims leading and trailing whitespace from Tokens in the stream.
Factory for TrimFilter.
A token filter for truncating the terms into a specific length.
Factory for TruncateTokenFilter.
Analyzer for Turkish.
Normalizes Turkish token text to lower case.
Factory for TurkishLowerCaseFilter.
This class implements the stemming algorithm defined by a snowball script.
Makes the TypeAttribute a payload.
Factory for TypeAsPayloadTokenFilter.
Adds the TypeAttribute.type() as a synonym, i.e.
Factory for TypeAsSynonymFilter.
Removes tokens whose types appear in a set of blocked types from a token stream.
Factory class for TypeTokenFilter.
Filters UAX29URLEmailTokenizer with LowerCaseFilter and StopFilter, using a list of English stop words.
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified
in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
Factory for UAX29URLEmailTokenizer.
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified
in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
This file contains unicode properties used by various CharTokenizers.
An Analyzer that uses UnicodeWhitespaceTokenizer.
A UnicodeWhitespaceTokenizer is a tokenizer that divides text at whitespace.
Normalizes token text to UPPER CASE.
Factory for UpperCaseFilter.
An Analyzer that uses WhitespaceTokenizer.
A tokenizer that divides text at whitespace characters as defined by Character.isWhitespace(int).
Factory for WhitespaceTokenizer.
Extension of StandardTokenizer that is aware of Wikipedia syntax.
Factory for WikipediaTokenizer.
Word2VecModel is a class representing the parsed Word2Vec model containing the vectors for each
word in dictionary
Applies single-token synonyms from a Word2Vec trained network to an incoming TokenStream.
Factory for Word2VecSynonymFilter.
The Word2VecSynonymProvider generates the list of synonyms of a term.
Supplies a cache of Word2VecSynonymProvider instances, so that multiple instances of
Word2VecSynonymFilterFactory do not instantiate multiple instances of the same SynonymProvider.
Deprecated. Use WordDelimiterGraphFilter instead: it produces a correct token graph so that e.g.
Deprecated. Use WordDelimiterGraphFilterFactory instead: it produces a correct token graph so that e.g.
Splits words into subwords and performs optional transformations on subword groups, producing a
correct token graph so that e.g.
Factory for WordDelimiterGraphFilter.
A BreakIterator-like API for iterating over subwords in text, according to
WordDelimiterGraphFilter rules.
A utility class used for generating possible word forms by adding affixes to stems (WordFormGenerator.getAllWordForms(String, String, Runnable)), and suggesting stems and flags to generate the given set of words (WordFormGenerator.compress(List, Set, Runnable)).
Parser for wordnet prolog format
This class implements the stemming algorithm defined by a snowball script.