Overview (Lucene 9.9.1 common API)

Analyzers for indexing content in different languages and domains.

For an introduction to Lucene's analysis API, see the org.apache.lucene.analysis package documentation.

This module contains concrete components (CharFilters, Tokenizers, and TokenFilters) for analyzing different types of content. It also provides a number of Analyzers for different languages that you can use to get started quickly. To define fully custom Analyzers (as in the index schema of Apache Solr), this module provides CustomAnalyzer.
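As a rough illustration of the two entry points mentioned above, the following sketch runs the same text through a ready-made language Analyzer (EnglishAnalyzer) and through a CustomAnalyzer assembled with the builder. The class name AnalyzerSketch, the helper methods terms and customAnalyzer, the field name "body", and the sample text are assumptions made for this example; the choice of the "standard" tokenizer and "lowercase" filter is likewise only one possible configuration. It assumes the Lucene core and analysis-common jars are on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerSketch {

  // Hypothetical helper: runs the analyzer over the text and collects the emitted terms.
  static List<String> terms(Analyzer analyzer, String field, String text) throws IOException {
    List<String> out = new ArrayList<>();
    try (TokenStream ts = analyzer.tokenStream(field, text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                     // required before the first incrementToken()
      while (ts.incrementToken()) {
        out.add(term.toString());
      }
      ts.end();                       // finalizes end-of-stream state
    }
    return out;
  }

  // Hypothetical helper: a minimal custom chain, standard tokenizer followed by lowercasing.
  // Components are looked up by their SPI names.
  static Analyzer customAnalyzer() throws IOException {
    return CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .build();
  }

  public static void main(String[] args) throws IOException {
    try (Analyzer english = new EnglishAnalyzer();
         Analyzer custom = customAnalyzer()) {
      // The English analyzer additionally removes stop words and stems terms.
      System.out.println(terms(english, "body", "The Quick Brown Foxes"));
      // The custom chain only splits and lowercases: [the, quick, brown, foxes]
      System.out.println(terms(custom, "body", "The Quick Brown Foxes"));
    }
  }
}
```

The builder form is convenient when the analysis chain should be driven by configuration, since each component is referenced by name and takes its parameters as strings.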

Packages

Package

Description

org.apache.lucene.analysis.ar

Analyzer for Arabic.

org.apache.lucene.analysis.bg

Analyzer for Bulgarian.

org.apache.lucene.analysis.bn

Analyzer for Bengali.

org.apache.lucene.analysis.boost

Provides various convenience classes for creating boosts on Tokens.

org.apache.lucene.analysis.br

Analyzer for Brazilian Portuguese.

org.apache.lucene.analysis.ca

Analyzer for Catalan.

org.apache.lucene.analysis.charfilter

Normalization of text before the tokenizer.

org.apache.lucene.analysis.cjk

Analyzer for Chinese, Japanese, and Korean, which indexes bigrams.

org.apache.lucene.analysis.ckb

Analyzer for Sorani Kurdish.

org.apache.lucene.analysis.classic

Fast, general-purpose grammar-based tokenizers.

org.apache.lucene.analysis.commongrams

Construct n-grams for frequently occurring terms and phrases.

org.apache.lucene.analysis.compound

A filter that decomposes compound words, which are common in many Germanic languages, into their component parts.

org.apache.lucene.analysis.compound.hyphenation

Hyphenation code for the CompoundWordTokenFilter.

org.apache.lucene.analysis.core

Basic, general-purpose analysis components.

org.apache.lucene.analysis.custom

A general-purpose Analyzer that can be created with a builder-style API.

org.apache.lucene.analysis.cz

Analyzer for Czech.

org.apache.lucene.analysis.da

Analyzer for Danish.

org.apache.lucene.analysis.de

Analyzer for German.

org.apache.lucene.analysis.el

Analyzer for Greek.

org.apache.lucene.analysis.email

Fast, general-purpose tokenizers for URLs and email addresses.

org.apache.lucene.analysis.en

Analyzer for English.

org.apache.lucene.analysis.es

Analyzer for Spanish.

org.apache.lucene.analysis.et

Analyzer for Estonian.

org.apache.lucene.analysis.eu

Analyzer for Basque.

org.apache.lucene.analysis.fa

Analyzer for Persian.

org.apache.lucene.analysis.fi

Analyzer for Finnish.

org.apache.lucene.analysis.fr

Analyzer for French.

org.apache.lucene.analysis.ga

Analyzer for Irish.

org.apache.lucene.analysis.gl

Analyzer for Galician.

org.apache.lucene.analysis.hi

Analyzer for Hindi.

org.apache.lucene.analysis.hu

Analyzer for Hungarian.

org.apache.lucene.analysis.hunspell

A Java implementation of Hunspell stemming and spell-checking algorithms (Hunspell), and a stemming TokenFilter (HunspellStemFilter) based on it.

org.apache.lucene.analysis.hy

Analyzer for Armenian.

org.apache.lucene.analysis.id

Analyzer for Indonesian.

org.apache.lucene.analysis.in

Analyzer for Indian languages.

org.apache.lucene.analysis.it

Analyzer for Italian.

org.apache.lucene.analysis.lt

Analyzer for Lithuanian.

org.apache.lucene.analysis.lv

Analyzer for Latvian.

org.apache.lucene.analysis.minhash

MinHash filtering (for LSH).

org.apache.lucene.analysis.miscellaneous

Miscellaneous TokenStreams.

org.apache.lucene.analysis.ne

Analyzer for Nepali.

org.apache.lucene.analysis.ngram

Character n-gram tokenizers and filters.

org.apache.lucene.analysis.nl

Analyzer for Dutch.

org.apache.lucene.analysis.no

Analyzer for Norwegian.

org.apache.lucene.analysis.path

Analysis components for path-like strings such as filenames.

org.apache.lucene.analysis.pattern

Set of components for pattern-based (regex) analysis.

org.apache.lucene.analysis.payloads

Provides various convenience classes for creating payloads on Tokens.

org.apache.lucene.analysis.pt

Analyzer for Portuguese.

org.apache.lucene.analysis.query

Automatically filter high-frequency stopwords.

org.apache.lucene.analysis.reverse

Filter to reverse token text.

org.apache.lucene.analysis.ro

Analyzer for Romanian.

org.apache.lucene.analysis.ru

Analyzer for Russian.

org.apache.lucene.analysis.shingle

Word n-gram filters.

org.apache.lucene.analysis.sinks

TeeSinkTokenFilter.

org.apache.lucene.analysis.snowball

TokenFilter and Analyzer implementations that use a modified version of Snowball stemmers.

org.apache.lucene.analysis.sr

Analyzer for Serbian.

org.apache.lucene.analysis.sv

Analyzer for Swedish.

org.apache.lucene.analysis.synonym

Analysis components for synonyms.

org.apache.lucene.analysis.synonym.word2vec

Analysis components for synonyms using a Word2Vec model.

org.apache.lucene.analysis.ta

Analyzer for Tamil.

org.apache.lucene.analysis.te

Analyzer for Telugu.

org.apache.lucene.analysis.th

Analyzer for Thai.

org.apache.lucene.analysis.tr

Analyzer for Turkish.

org.apache.lucene.analysis.util

Utility functions for text analysis.

org.apache.lucene.analysis.wikipedia

Tokenizer that is aware of Wikipedia syntax.

org.apache.lucene.collation

Unicode collation support.

org.apache.lucene.collation.tokenattributes

Custom AttributeImpl for indexing collation keys as index terms.

org.tartarus.snowball

Snowball stemmer API.

org.tartarus.snowball.ext

Autogenerated snowball stemmer implementations.
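
The table above lists both character n-grams (org.apache.lucene.analysis.ngram) and word n-grams (org.apache.lucene.analysis.shingle), which are easy to confuse. The sketch below shows the difference in plain Java. It is a conceptual illustration only and not Lucene's implementation: the class NGramSketch and its methods are hypothetical, and Lucene's actual tokenizers and filters have options (such as minimum and maximum sizes) that are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {

  // Character n-grams: every substring of length n inside a single token.
  // Works on UTF-16 code units, which is enough for this illustration.
  static List<String> charNGrams(String token, int n) {
    List<String> grams = new ArrayList<>();
    for (int i = 0; i + n <= token.length(); i++) {
      grams.add(token.substring(i, i + n));
    }
    return grams;
  }

  // Word n-grams (shingles): every run of n adjacent tokens, joined by a space.
  static List<String> shingles(List<String> tokens, int n) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i + n <= tokens.size(); i++) {
      out.add(String.join(" ", tokens.subList(i, i + n)));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(charNGrams("lucene", 3));
    // [luc, uce, cen, ene]
    System.out.println(shingles(List.of("please", "divide", "this"), 2));
    // [please divide, divide this]
  }
}
```

Character n-grams operate within a token and are typically used for partial or fuzzy matching, while shingles combine neighboring tokens and are typically used for phrase-like matching.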