Package org.apache.lucene.document


The logical representation of a Document for indexing and searching.

The document package provides the user-level logical representation of content to be indexed and searched. The package also provides utilities for working with Documents and IndexableFields.

Document and IndexableField

A Document is a collection of IndexableFields. An IndexableField is a logical representation of a user's content that needs to be indexed or stored. IndexableFields have a number of properties that tell Lucene how to treat the content (such as indexed, tokenized, and stored). See the Field implementation of IndexableField for specifics on these properties.

Note: it is common to refer to Documents having Fields, even though technically they have IndexableFields.

Working with Documents

First and foremost, a Document is something created by the user application. It is your job to create Documents based on the content of the files you are working with in your application (Word, txt, PDF, Excel, or any other format). How this is done is completely up to you. That being said, there are many tools available in other projects that can make the process of taking a file and converting it into a Lucene Document easier.

How to index ...

Strings

TextField indexes the tokens of a String so that one can perform full-text search on it. How the input is tokenized depends on the Analyzer that is configured on the IndexWriterConfig. TextField can also optionally be stored.

KeywordField indexes whole values as a single term so that one can perform exact search on it. It also records doc values to enable sorting or faceting on this field. Finally, it also supports optionally storing the value.

If neither faceting nor sorting is required, StringField is a variant of KeywordField that does not index doc values.

Numbers

If a numeric field represents an identifier rather than a quantity and is more commonly searched on single values than on ranges of values, it is generally recommended to index its string representation via KeywordField (or StringField if doc values are not necessary).

LongField, IntField, DoubleField and FloatField index values in a points index for efficient range queries, and also create doc values for these fields for efficient sorting and faceting.

If the field is meant to be used to tune the score, FeatureField stores numeric data internally as term frequencies in a way that makes it efficient to influence scoring at search time.

Other types of structured data

It is recommended to index dates as a LongField that stores the number of milliseconds since the Unix epoch.
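The conversion from a date to the long value can be done with the JDK alone. The following is a minimal sketch using java.time; the class and method names (EpochMillisExample, fromIsoInstant, fromUtcDate) are hypothetical, and the choice of UTC for calendar dates is an assumption that an application would need to confirm for its own data.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper, not part of Lucene: turns dates into epoch milliseconds.
final class EpochMillisExample {

  /** Milliseconds since 1970-01-01T00:00:00Z for an ISO-8601 instant such as "2024-01-01T00:00:00Z". */
  static long fromIsoInstant(String isoInstant) {
    return Instant.parse(isoInstant).toEpochMilli();
  }

  /** Start of the given calendar day, interpreted in UTC (an assumption), as epoch milliseconds. */
  static long fromUtcDate(LocalDate date) {
    return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
  }

  public static void main(String[] args) {
    long millis = fromUtcDate(LocalDate.of(2024, 1, 1));
    // This long is the value that would be handed to a LongField for the date field.
    System.out.println(millis);
  }
}
```

Using one fixed time zone for every document keeps range queries and sorting on the field consistent.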

IP addresses can be indexed with InetAddressPoint, combined with a SortedDocValuesField (if the field is single-valued) or a SortedSetDocValuesField (if it is multi-valued) that stores the result of InetAddressPoint.encode(java.net.InetAddress).
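To make the stored bytes more concrete, here is a JDK-only sketch of the 16-byte form that InetAddressPoint is documented to use, where an IPv4 address such as 1.2.3.4 is treated as the IPv4-mapped IPv6 address ::FFFF:1.2.3.4. The class IpBytesExample and its methods are hypothetical stand-ins for illustration, not Lucene code; in an application the bytes would come from InetAddressPoint.encode itself.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical illustration, not Lucene's implementation.
final class IpBytesExample {

  /** Builds an IPv4 InetAddress from four octets without a DNS lookup. */
  static InetAddress ipv4(int a, int b, int c, int d) {
    try {
      return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
    } catch (UnknownHostException e) {
      // Only thrown for arrays of illegal length, which cannot happen here.
      throw new IllegalArgumentException(e);
    }
  }

  /** Returns a 16-byte form: IPv6 bytes as-is, IPv4 bytes as an IPv4-mapped IPv6 address. */
  static byte[] toSixteenBytes(InetAddress address) {
    byte[] raw = address.getAddress();
    if (raw.length == 16) {
      return raw;
    }
    byte[] mapped = new byte[16]; // bytes 0..9 stay zero
    mapped[10] = (byte) 0xff;
    mapped[11] = (byte) 0xff;
    System.arraycopy(raw, 0, mapped, 12, 4); // the four IPv4 octets go last
    return mapped;
  }
}
```

Because every address ends up with the same fixed-width shape, IPv4 and IPv6 values can live in the same field, and the same bytes can be wrapped for the doc values field.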

Dense numeric vectors

Dense numeric vectors can be indexed with KnnFloatVectorField if their dimensions are floating-point numbers, or with KnnByteVectorField if their dimensions are bytes. This enables nearest-neighbor queries at search time.

Sparse numeric vectors

To perform nearest-neighbor search on sparse vectors rather than dense vectors, each dimension of the sparse vector should be indexed as a FeatureField. Queries can then be constructed as a BooleanQuery with linear queries as BooleanClause.Occur.SHOULD clauses.
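The scoring idea behind this recipe can be sketched without Lucene. Assuming each linear query scores a document as its weight multiplied by the stored feature value, and that the scores of matching SHOULD clauses are summed, the total is the dot product of the query vector and the document vector. The class and method below are hypothetical; Lucene's real scores may differ slightly, since FeatureField does not keep feature values at full float precision.

```java
import java.util.Map;

// Hypothetical illustration of the score such a disjunction approximates; not Lucene code.
final class SparseDotProductExample {

  /**
   * Sums weight * value over the dimensions present in both vectors.
   * Each query entry plays the role of one SHOULD clause with a linear query;
   * dimensions missing from the document contribute nothing.
   */
  static double score(Map<String, Float> queryWeights, Map<String, Float> docFeatures) {
    double sum = 0.0;
    for (Map.Entry<String, Float> clause : queryWeights.entrySet()) {
      Float value = docFeatures.get(clause.getKey());
      if (value != null) {
        sum += clause.getValue() * value;
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Float> query = Map.of("a", 2f, "b", 1f);
    Map<String, Float> doc = Map.of("a", 0.5f, "c", 3f);
    // Only dimension "a" is shared, so it alone contributes to the score.
    System.out.println(score(query, doc));
  }
}
```

Ranking documents by this sum is what makes the disjunction behave like a nearest-neighbor search under dot-product similarity.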