org.apache.lucene.document (Lucene 10.2.1 core API)

package org.apache.lucene.document

The logical representation of a Document for indexing and searching.

The document package provides the user level logical representation of content to be indexed and searched. The package also provides utilities for working with Documents and IndexableFields.

Document and IndexableField

A Document is a collection of IndexableFields. An IndexableField is a logical representation of a user's content that needs to be indexed or stored. IndexableFields have a number of properties that tell Lucene how to treat the content (indexed, tokenized, stored, and so on). See the Field implementation of IndexableField for specifics on these properties.

Note: it is common to refer to Documents having Fields, even though technically they have IndexableFields.

Working with Documents

First and foremost, a Document is something created by the user application. It is your job to create Documents based on the content of the files you are working with in your application (Word, txt, PDF, Excel or any other format). How this is done is completely up to you. That being said, there are many tools available in other projects that can make the process of taking a file and converting it into a Lucene Document easier.
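As a rough illustration of this workflow, the sketch below turns already-extracted text into a Document, indexes it into an in-memory directory, and counts the documents matching a term. The class name DocumentRoundTrip, the field names "path" and "contents", and the sample strings are assumptions made for this example; extracting text from Word, PDF or similar files is out of scope here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical helper class, not part of Lucene.
final class DocumentRoundTrip {

  // Builds a Document from text that the application has already extracted.
  static Document toDocument(String path, String extractedText) {
    Document doc = new Document();
    doc.add(new StringField("path", path, Field.Store.YES));        // exact-match identifier
    doc.add(new TextField("contents", extractedText, Field.Store.NO)); // tokenized full text
    return doc;
  }

  // Indexes two sample documents in memory and counts those containing the term.
  static int indexAndCount(String term) {
    try (Directory dir = new ByteBuffersDirectory()) {
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      try (IndexWriter writer = new IndexWriter(dir, config)) {
        writer.addDocument(toDocument("notes/a.txt", "Lucene indexes documents"));
        writer.addDocument(toDocument("notes/b.txt", "Documents contain fields"));
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // StandardAnalyzer lowercases tokens, so the term must be lowercase too.
        return searcher.count(new TermQuery(new Term("contents", term)));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Because TermQuery bypasses analysis, the term passed to indexAndCount has to match the analyzed form of the token (lowercase here).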

How to index ...

Strings

TextField allows indexing tokens from a String so that one can perform full-text search on it. The way that the input is tokenized depends on the Analyzer that is configured on the IndexWriterConfig. TextField can also be optionally stored.

KeywordField indexes whole values as a single term so that one can perform exact search on it. It also records doc values to enable sorting or faceting on this field. Finally, it also supports optionally storing the value.

If neither faceting nor sorting is required, StringField is a variant of KeywordField that does not index doc values.
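The three string field types can be combined in one Document, as in the minimal sketch below. The class name StringFieldsExample, the field names and the sample values are assumptions chosen for illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KeywordField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

// Hypothetical helper class, not part of Lucene.
final class StringFieldsExample {

  static Document buildDocument() {
    Document doc = new Document();
    // Tokenized by the Analyzer configured on IndexWriterConfig; also stored.
    doc.add(new TextField("body", "The quick brown fox", Field.Store.YES));
    // Indexed as a single term, with doc values for sorting or faceting; not stored.
    doc.add(new KeywordField("category", "animals", Field.Store.NO));
    // Indexed as a single term, without doc values; stored.
    doc.add(new StringField("sku", "FOX-001", Field.Store.YES));
    return doc;
  }
}
```

At search time, KeywordField.newExactQuery("category", "animals") would be one way to match the whole value exactly.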

Numbers

If a numeric field represents an identifier rather than a quantity and is more commonly searched on single values than on ranges of values, it is generally recommended to index its string representation via KeywordField (or StringField if doc values are not necessary).

LongField, IntField, DoubleField and FloatField index values in a points index for efficient range queries, and also create doc-values for these fields for efficient sorting and faceting.
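A minimal sketch of a range query over a LongField follows. The class name NumericFieldsExample and the field name "price" are assumptions for this example; both bounds of LongField.newRangeQuery are inclusive.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical helper class, not part of Lucene.
final class NumericFieldsExample {

  // Indexes one document per price and counts those within [min, max].
  static int countInRange(long[] prices, long min, long max) {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        for (long price : prices) {
          Document doc = new Document();
          // Indexes points (range queries) and doc values (sorting, faceting).
          doc.add(new LongField("price", price, Field.Store.NO));
          writer.addDocument(doc);
        }
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        Query query = LongField.newRangeQuery("price", min, max);
        return new IndexSearcher(reader).count(query);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```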

If the field is meant to tune the score, FeatureField stores numeric data internally as term frequencies, which makes it efficient to influence scoring at search time.
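The sketch below shows one way a FeatureField might influence ranking: two documents with identical text differ only in a "pagerank" feature, and a saturation query added as an optional clause lifts the one with the larger value. The class name FeatureFieldExample, the field and feature names, and the weight and pivot values (1 and 5) are assumptions picked for illustration, not recommended settings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical helper class, not part of Lucene.
final class FeatureFieldExample {

  static Document page(String id, float pagerank) {
    Document doc = new Document();
    doc.add(new StringField("id", id, Field.Store.YES));
    doc.add(new TextField("body", "lucene search", Field.Store.NO));
    // Static scoring factor stored under field "features", feature "pagerank".
    doc.add(new FeatureField("features", "pagerank", pagerank));
    return doc;
  }

  // Returns the id of the best hit for a text query boosted by the feature.
  static String topHit() {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        writer.addDocument(page("low", 1f));
        writer.addDocument(page("high", 10f));
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST)
            // Optional clause: contributes weight * S / (S + pivot) to the score.
            .add(FeatureField.newSaturationQuery("features", "pagerank", 1f, 5f),
                BooleanClause.Occur.SHOULD)
            .build();
        TopDocs hits = searcher.search(query, 1);
        return searcher.storedFields().document(hits.scoreDocs[0].doc).get("id");
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```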

Other types of structured data

It is recommended to index dates as a LongField that stores the number of milliseconds since the epoch.
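For example, a java.time.Instant can be converted with toEpochMilli() before indexing. The class name DateFieldExample and its method names are assumptions for this sketch.

```java
import java.time.Instant;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.search.Query;

// Hypothetical helper class, not part of Lucene.
final class DateFieldExample {

  // Stores the instant as milliseconds since 1970-01-01T00:00:00Z.
  static Document withTimestamp(String field, Instant when) {
    Document doc = new Document();
    doc.add(new LongField(field, when.toEpochMilli(), Field.Store.YES));
    return doc;
  }

  // Range query over the same millisecond representation; both bounds are inclusive.
  static Query between(String field, Instant from, Instant to) {
    return LongField.newRangeQuery(field, from.toEpochMilli(), to.toEpochMilli());
  }
}
```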

IP addresses can be indexed via InetAddressPoint, in addition to a SortedDocValuesField (if the field is single-valued) or a SortedSetDocValuesField (if it is multi-valued) that stores the result of InetAddressPoint.encode(java.net.InetAddress).
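A minimal sketch of this pairing for a single-valued field, followed by a subnet query, might look as follows. The class name IpFieldExample, the field name "ip" and the sample addresses are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.InetAddressPoint;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper class, not part of Lucene.
final class IpFieldExample {

  // Parses an IP literal such as "192.168.1.10".
  static InetAddress parse(String literal) {
    try {
      return InetAddress.getByName(literal);
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException(e);
    }
  }

  static Document toDocument(InetAddress ip) {
    Document doc = new Document();
    doc.add(new InetAddressPoint("ip", ip)); // points index for exact/range/prefix queries
    // Doc values holding the same binary encoding that the point uses.
    doc.add(new SortedDocValuesField("ip", new BytesRef(InetAddressPoint.encode(ip))));
    return doc;
  }

  // Counts indexed addresses that fall inside network/prefixLength.
  static int countInSubnet(String[] ips, String network, int prefixLength) {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        for (String ip : ips) {
          writer.addDocument(toDocument(parse(ip)));
        }
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        Query query = InetAddressPoint.newPrefixQuery("ip", parse(network), prefixLength);
        return new IndexSearcher(reader).count(query);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```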

Dense numeric vectors

Dense numeric vectors can be indexed with KnnFloatVectorField if their dimensions are floating-point numbers, or with KnnByteVectorField if their dimensions are bytes. This allows searching for nearest neighbors at search time.
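The following sketch indexes three two-dimensional float vectors and retrieves the single nearest neighbor of a target vector. The class name DenseVectorExample, the field names, the vectors and the choice of Euclidean similarity are assumptions for this example; real embeddings typically have far more dimensions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical helper class, not part of Lucene.
final class DenseVectorExample {

  static Document toDocument(String id, float[] vector) {
    Document doc = new Document();
    doc.add(new StringField("id", id, Field.Store.YES));
    doc.add(new KnnFloatVectorField("embedding", vector, VectorSimilarityFunction.EUCLIDEAN));
    return doc;
  }

  // Returns the id of the indexed vector closest to the target.
  static String nearest(float[] target) {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        writer.addDocument(toDocument("a", new float[] {1.0f, 0.0f}));
        writer.addDocument(toDocument("b", new float[] {0.0f, 1.0f}));
        writer.addDocument(toDocument("c", new float[] {0.9f, 0.1f}));
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // k = 1: ask for the single nearest neighbor.
        TopDocs hits = searcher.search(new KnnFloatVectorQuery("embedding", target, 1), 1);
        return searcher.storedFields().document(hits.scoreDocs[0].doc).get("id");
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```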

Sparse numeric vectors

To perform nearest-neighbor search on sparse vectors rather than dense vectors, each dimension of the sparse vector should be indexed as a FeatureField. Queries can then be constructed as a BooleanQuery with linear queries as BooleanClause.Occur.SHOULD clauses.
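A sketch of this approach is given below: each non-zero dimension becomes a FeatureField, and the query sums one linear clause per query dimension, so a document's score is the dot product between the query weights and its stored feature values (up to the reduced precision FeatureField uses for storage). The class name SparseVectorExample, the field name "sparse", and the sample vectors are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical helper class, not part of Lucene.
final class SparseVectorExample {

  // One FeatureField per non-zero dimension; values must be positive.
  static Document toDocument(String id, Map<String, Float> vector) {
    Document doc = new Document();
    doc.add(new StringField("id", id, Field.Store.YES));
    for (Map.Entry<String, Float> dim : vector.entrySet()) {
      doc.add(new FeatureField("sparse", dim.getKey(), dim.getValue()));
    }
    return doc;
  }

  // One SHOULD clause per query dimension; clause scores add up.
  static Query toQuery(Map<String, Float> queryVector) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (Map.Entry<String, Float> dim : queryVector.entrySet()) {
      builder.add(
          FeatureField.newLinearQuery("sparse", dim.getKey(), dim.getValue()),
          BooleanClause.Occur.SHOULD);
    }
    return builder.build();
  }

  // Indexes two sample vectors and returns the id of the best match.
  static String bestMatch(Map<String, Float> queryVector) {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
        writer.addDocument(toDocument("A", Map.of("x", 2f, "y", 1f)));
        writer.addDocument(toDocument("B", Map.of("x", 1f, "z", 4f)));
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs hits = searcher.search(toQuery(queryVector), 1);
        return searcher.storedFields().document(hits.scoreDocs[0].doc).get("id");
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

With the query vector {x: 1, z: 1}, document A scores 2 and document B scores 1 + 4 = 5, so B ranks first.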

Class

Description

BinaryDocValuesField

Field that stores a per-document BytesRef value.

BinaryPoint

An indexed binary field for fast range filters.

BinaryRangeDocValues

A binary representation of a range that wraps a BinaryDocValues field.

DateTools

Provides support for converting dates to strings and vice-versa.

DateTools.Resolution

Specifies the time granularity.

Document

Documents are the unit of indexing and search.

DocumentStoredFieldVisitor

A StoredFieldVisitor that creates a Document from stored fields.

DoubleDocValuesField

Syntactic sugar for encoding doubles as NumericDocValues via Double.doubleToRawLongBits(double).

DoubleField

Field that stores a per-document double value for scoring, sorting or value retrieval and indexes the field for fast range filters.

DoublePoint

An indexed double field for fast range filters.

DoubleRange

An indexed Double Range field.

DoubleRangeDocValuesField

DocValues field for DoubleRange.

FeatureField

Field that can be used to store static scoring factors into documents.

Field

Expert: directly create a field for a document.

Field.Store

Specifies whether and how a field should be stored.

FieldType

Describes the properties of a field.

FloatDocValuesField

Syntactic sugar for encoding floats as NumericDocValues via Float.floatToRawIntBits(float).

FloatField

Field that stores a per-document float value for scoring, sorting or value retrieval and indexes the field for fast range filters.

FloatPoint

An indexed float field for fast range filters.

FloatRange

An indexed Float Range field.

FloatRangeDocValuesField

DocValues field for FloatRange.

InetAddressPoint

An indexed 128-bit InetAddress field.

InetAddressRange

An indexed InetAddress Range field.

IntField

Field that stores a per-document int value for scoring, sorting or value retrieval and indexes the field for fast range filters.

IntPoint

An indexed int field for fast range filters.

IntRange

An indexed Integer Range field.

IntRangeDocValuesField

DocValues field for IntRange.

InvertableType

Describes how an IndexableField should be inverted for indexing terms and postings.

KeywordField

Field that indexes a per-document String or BytesRef into an inverted index for fast filtering, stores values in a columnar fashion using DocValuesType.SORTED_SET doc values for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.

KnnByteVectorField

A field that contains a single byte numeric vector (or none) for each document.

KnnFloatVectorField

A field that contains a single floating-point numeric vector (or none) for each document.

LatLonDocValuesField

A per-document location field.

LatLonPoint

An indexed location field.

LatLonShape

A geo shape utility class for indexing and searching GIS geometries whose vertices are latitude, longitude values (in decimal degrees).

LatLonShapeDocValues

A concrete implementation of ShapeDocValues for storing binary doc value representation of LatLonShape geometries in a LatLonShapeDocValuesField.

LatLonShapeDocValuesField

Concrete implementation of a ShapeDocValuesField for geographic geometries.

LongField

Field that stores a per-document long value for scoring, sorting or value retrieval and indexes the field for fast range filters.

LongPoint

An indexed long field for fast range filters.

LongRange

An indexed Long Range field.

LongRangeDocValuesField

DocValues field for LongRange.

NumericDocValuesField

Field that stores a per-document long value for scoring, sorting or value retrieval.

RangeFieldQuery

Query class for searching RangeField types by a defined PointValues.Relation.

RangeFieldQuery.QueryType

Used by RangeFieldQuery to check how each internal or leaf node relates to the query.

ShapeDocValuesField

A doc values field for LatLonShape and XYShape that uses ShapeDocValues as the underlying binary doc value format.

ShapeField

A base shape utility class used for both LatLon (spherical) and XY (cartesian) shape fields.

ShapeField.DecodedTriangle

Represents an encoded triangle using ShapeField.decodeTriangle(byte[], DecodedTriangle).

ShapeField.DecodedTriangle.TYPE

Type of triangle.

ShapeField.QueryRelation

Query relation types.

ShapeField.Triangle

Polygons are decomposed into tessellated triangles using Tessellator; these triangles are encoded and inserted as separate indexed POINT fields.

SortedDocValuesField

Field that stores a per-document BytesRef value, indexed for sorting.

SortedNumericDocValuesField

Field that stores per-document long values for scoring, sorting or value retrieval.

SortedSetDocValuesField

Field that stores a set of per-document BytesRef values, indexed for faceting, grouping, joining.

StoredField

A field whose value is stored so that IndexSearcher.storedFields() and IndexReader.storedFields() will return the field and its value.

StoredValue

Abstraction around a stored value.

StoredValue.Type

Type of a StoredValue.

StringField

A field that is indexed but not tokenized: the entire String value is indexed as a single token.

TextField

A field that is indexed and tokenized, without term vectors.

XYDocValuesField

A per-document location field.

XYDocValuesPointInGeometryQuery

XYGeometry query for XYDocValuesField.

XYPointField

An indexed XY position field.

XYShape

A cartesian shape utility class for indexing and searching geometries whose vertices are unitless x, y values.

XYShapeDocValues

A concrete implementation of ShapeDocValues for storing binary doc value representation of XYShape geometries in an XYShapeDocValuesField.

XYShapeDocValuesField

Concrete implementation of a ShapeDocValuesField for cartesian geometries.
