Lucene 10.0.0 core API
Apache Lucene is a high-performance, full-featured text search engine library. Here's a simple example of how to use Lucene for indexing and searching (using JUnit to check that the results are what we expect):
    Analyzer analyzer = new StandardAnalyzer();

    Path indexPath = Files.createTempDirectory("tempIndex");
    Directory directory = FSDirectory.open(indexPath);
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    StoredFields storedFields = isearcher.storedFields();
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = storedFields.document(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();
    IOUtils.rm(indexPath);
The Lucene API is divided into several packages:
- org.apache.lucene.analysis defines an abstract Analyzer API for converting text from a Reader into a TokenStream, an enumeration of token Attributes. A TokenStream can be composed by applying TokenFilters to the output of a Tokenizer. Tokenizers and TokenFilters are strung together and applied with an Analyzer. analysis-common provides a number of Analyzer implementations, including StopAnalyzer and the grammar-based StandardAnalyzer.
- org.apache.lucene.codecs provides an abstraction over the encoding and decoding of the inverted index structure, as well as different implementations that can be chosen depending upon application needs.
- org.apache.lucene.document provides a simple Document class. A Document is simply a set of named Fields, whose values may be strings or instances of Reader.
- org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
- org.apache.lucene.search provides data structures to represent queries (e.g. TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the IndexSearcher, which turns queries into TopDocs. A number of QueryParsers are provided for producing query structures from strings or XML.
- org.apache.lucene.store defines an abstract class for storing persistent data, the Directory, which is a collection of named files written by an IndexOutput and read by an IndexInput. Multiple implementations are provided, but FSDirectory is generally recommended as it tries to use operating system disk buffer caches efficiently.
- org.apache.lucene.util contains a few handy data structures and util classes, e.g. FixedBitSet and PriorityQueue.
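The analysis package is described above as a Tokenizer whose output passes through TokenFilters, all tied together by an Analyzer. The following is a toy model of that composition in plain Java, intended only to illustrate the idea. It does not use Lucene's classes, and the names ToyAnalysis, tokenize, lowerCase, removeStopWords, and analyze are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: none of these are Lucene classes or Lucene behavior.
class ToyAnalysis {

    // Plays the role of a Tokenizer: split text on runs of characters
    // that are neither letters nor digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Plays the role of a TokenFilter: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Plays the role of a second TokenFilter: drop tokens found in a stop set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // Plays the role of an Analyzer: string the tokenizer and filters together.
    static List<String> analyze(String text, Set<String> stopWords) {
        return removeStopWords(lowerCase(tokenize(text)), stopWords);
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("is", "the", "to", "be");
        // Prints [this, text, indexed]
        System.out.println(analyze("This is the text to be indexed.", stopWords));
    }
}
```

The order of the filters matters in this sketch: lower-casing runs before stop-word removal so that a lower-case stop set also catches capitalized words.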
To use Lucene, an application should:
- Create Documents by adding Fields;
- Create an IndexWriter and add documents to it with addDocument();
- Call QueryParser.parse() to build a query from a string; and
- Create an IndexSearcher and pass the query to its search() method.
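To make the data flow behind these steps concrete, here is a toy inverted index in plain Java: adding a document records which terms occur in it, and a search looks a term up again. This is a conceptual sketch under simplifying assumptions (single-term queries, no scoring, no persistence), not how Lucene is implemented. The class ToyIndex and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: Lucene's real index structures are far more elaborate.
class ToyIndex {
    // term -> ids of the documents containing that term (the "inverted" part)
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // document id -> original text (stands in for stored fields)
    private final List<String> stored = new ArrayList<>();

    // Roughly the "add documents" steps: store the text and index its terms.
    int addDocument(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Roughly the "search" steps: return ids of documents containing the term.
    List<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hits);
    }

    // Fetch the stored text for a hit.
    String document(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("This is the text to be indexed.");
        for (int docId : index.search("text")) {
            // Prints 0: This is the text to be indexed.
            System.out.println(docId + ": " + index.document(docId));
        }
    }
}
```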
Some simple examples of code which does this are:
- IndexFiles.java creates an index for all the files contained in a directory.
- SearchFiles.java prompts for queries and searches an index.
To demonstrate these, try something like:
> java -cp lucene-core.jar:lucene-demo.jar:lucene-analysis-common.jar org.apache.lucene.demo.IndexFiles -index index -docs rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
[ ... ]
> java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analysis-common.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
[ ... thirty-four documents contain the word "chowder" ... ]
Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
[ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]
[ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]
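The "+" and "-" prefixes in the canonical form mark clauses as required and prohibited. The sketch below is a toy matcher in plain Java that shows those semantics on single terms. It is an illustration under simplifying assumptions (no phrases, no scoring, no parsing of AND/OR/NOT) and does not use Lucene's query parser; ToyBooleanMatch and matches are hypothetical names.

```java
import java.util.List;
import java.util.Set;

// Toy model only: this is not Lucene's BooleanQuery or QueryParser.
class ToyBooleanMatch {

    // Each clause is a single term, optionally prefixed with "+" (required)
    // or "-" (prohibited). Unprefixed clauses are optional.
    static boolean matches(Set<String> docTerms, List<String> clauses) {
        boolean hasRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : clauses) {
            if (clause.startsWith("+")) {
                hasRequired = true;
                if (!docTerms.contains(clause.substring(1))) {
                    return false; // a required term is missing
                }
            } else if (clause.startsWith("-")) {
                if (docTerms.contains(clause.substring(1))) {
                    return false; // a prohibited term is present
                }
            } else if (docTerms.contains(clause)) {
                anyOptionalHit = true;
            }
        }
        // With no required clauses, at least one optional clause must match;
        // a query made only of prohibited clauses matches nothing.
        return hasRequired || anyOptionalHit;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("clam", "chowder", "manhattan");
        System.out.println(matches(doc, List.of("+chowder", "+manhattan"))); // prints true
        System.out.println(matches(doc, List.of("+chowder", "-manhattan"))); // prints false
    }
}
```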
StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
[ ... ] Document for indexing and searching.
[ ... ] TopFieldCollector.