Lucene 9.11.0 common API
Analyzers for indexing content in different languages and domains.
For an introduction to Lucene's analysis API, see the org.apache.lucene.analysis
package documentation.
This module contains concrete components (CharFilters, Tokenizers, and TokenFilters) for
analyzing different types of content. It also provides a number of Analyzers
for different languages that you can use to get started quickly. To define fully custom Analyzers
(like in the index schema of Apache Solr), this module provides CustomAnalyzer.
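As a rough illustration of the two routes described above, the following sketch runs the same text through a ready-made language Analyzer (EnglishAnalyzer) and through a CustomAnalyzer assembled with its builder. The class name AnalysisSketch, the helper methods terms and buildCustom, and the field name "body" are hypothetical and chosen only for this example. The strings "whitespace" and "lowercase" are assumed to resolve to the corresponding tokenizer and filter factories on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical example class; not part of Lucene.
public class AnalysisSketch {

  /** Runs text through an analyzer and collects the emitted terms. */
  public static List<String> terms(Analyzer analyzer, String text) throws IOException {
    List<String> out = new ArrayList<>();
    // "body" is an arbitrary field name used only for this sketch.
    try (TokenStream ts = analyzer.tokenStream("body", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                      // must be called before incrementToken()
      while (ts.incrementToken()) {
        out.add(term.toString());
      }
      ts.end();                        // finalizes end-of-stream state
    }
    return out;
  }

  /** Builds a small custom chain: whitespace tokenizer, then lowercasing. */
  public static Analyzer buildCustom() throws IOException {
    return CustomAnalyzer.builder()
        .withTokenizer("whitespace")   // factory looked up by name
        .addTokenFilter("lowercase")
        .build();
  }

  public static void main(String[] args) throws IOException {
    try (Analyzer english = new EnglishAnalyzer();
         Analyzer custom = buildCustom()) {
      // The English analyzer also removes stop words and stems terms.
      System.out.println(terms(english, "The Quick Foxes"));
      // The custom chain only splits on whitespace and lowercases.
      System.out.println(terms(custom, "The Quick Foxes"));
    }
  }
}
```

The prebuilt analyzer is the quicker starting point. The builder is useful when the chain of tokenizer and filters needs to be chosen explicitly, much as it would be in a Solr schema.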
Package
Description
Analyzer for Arabic.
Analyzer for Bulgarian.
Analyzer for Bengali Language.
Provides various convenience classes for creating boosts on Tokens.
Analyzer for Brazilian Portuguese.
Analyzer for Catalan.
Normalization of text before the tokenizer.
Analyzer for Chinese, Japanese, and Korean, which indexes bigrams.
Analyzer for Sorani Kurdish.
Fast, general-purpose grammar-based tokenizers.
Construct n-grams for frequently occurring terms and phrases.
A filter that decomposes compound words you find in many Germanic languages into the word parts.
Hyphenation code for the CompoundWordTokenFilter.
Basic, general-purpose analysis components.
A general-purpose Analyzer that can be created with a builder-style API.
Analyzer for Czech.
Analyzer for Danish.
Analyzer for German.
Analyzer for Greek.
Fast, general-purpose tokenizers for URLs and email addresses.
Analyzer for English.
Analyzer for Spanish.
Analyzer for Estonian.
Analyzer for Basque.
Analyzer for Persian.
Analyzer for Finnish.
Analyzer for French.
Analyzer for Irish.
Analyzer for Galician.
Analyzer for Hindi.
Analyzer for Hungarian.
A Java implementation of Hunspell stemming and spell-checking algorithms (Hunspell), and a stemming TokenFilter (HunspellStemFilter) based on it.
Analyzer for Armenian.
Analyzer for Indonesian.
Analyzer for Indian languages.
Analyzer for Italian.
Analyzer for Lithuanian.
Analyzer for Latvian.
MinHash filtering (for LSH).
Miscellaneous TokenStreams.
Analyzer for Nepali.
Character n-gram tokenizers and filters.
Analyzer for Dutch.
Analyzer for Norwegian.
Analysis components for path-like strings such as filenames.
Set of components for pattern-based (regex) analysis.
Provides various convenience classes for creating payloads on Tokens.
Analyzer for Portuguese.
Automatically filter high-frequency stopwords.
Filter to reverse token text.
Analyzer for Romanian.
Analyzer for Russian.
Word n-gram filters.
Analyzer for Serbian.
Analyzer for Swedish.
Analysis components for Synonyms.
Analysis components for Synonyms using a Word2Vec model.
Analyzer for Tamil.
Analyzer for Telugu Language.
Analyzer for Thai.
Analyzer for Turkish.
Utility functions for text analysis.
Tokenizer that is aware of Wikipedia syntax.
Unicode collation support.
Custom AttributeImpl for indexing collation keys as index terms.
Snowball stemmer API
Autogenerated snowball stemmer implementations.