Package org.apache.lucene.document
The logical representation of a Document for indexing and searching.
The document package provides the user-level logical representation of content to be indexed
and searched. The package also provides utilities for working with Documents and IndexableFields.
Document and IndexableField
A Document is a collection of IndexableFields. An IndexableField is a logical representation of a
user's content that needs to be indexed or stored. IndexableFields have a number of properties
that tell Lucene how to treat the content (indexed, tokenized, stored, etc.). See the Field
implementation of IndexableField for specifics on these properties.
Note: it is common to refer to Documents having Fields, even though technically they have
IndexableFields.
Working with Documents
First and foremost, a Document is something created by the user application. It is your job to
create Documents based on the content of the files you are working with in your application
(Word, txt, PDF, Excel or any other format). How this is done is completely up to you. That said,
many tools available in other projects can simplify the process of taking a file and converting it
into a Lucene Document.
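As a minimal sketch of what "creating a Document" can look like, the example below turns two strings into a Document. The class name, the helper method, and the field names "path" and "body" are hypothetical and chosen only for illustration; how the text is extracted from your files is outside Lucene's scope.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

class DocumentSketch {
  // Hypothetical helper: 'path' and 'body' stand in for whatever your
  // application has already extracted from its own files.
  static Document toDocument(String path, String body) {
    Document doc = new Document();
    // Indexed as a single term and stored, so it can be shown with search hits.
    doc.add(new StringField("path", path, Field.Store.YES));
    // Tokenized for full-text search; not stored in this sketch.
    doc.add(new TextField("body", body, Field.Store.NO));
    return doc;
  }

  public static void main(String[] args) {
    Document doc = toDocument("/tmp/notes.txt", "Lucene is a search library");
    System.out.println(doc.getFields().size() + " fields, path=" + doc.get("path"));
  }
}
```

The resulting Document would typically be handed to an IndexWriter; that step is omitted here.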
How to index ...
Strings
TextField allows indexing tokens from a String so that one can perform full-text search on it.
The way that the input is tokenized depends on the Analyzer that is configured on the
IndexWriterConfig. TextField can also be optionally stored.
KeywordField indexes whole values as a single term so that one can perform exact search on it.
It also records doc values to enable sorting or faceting on this field, and it can optionally
store the value.
If neither faceting nor sorting is required, StringField is a variant of KeywordField that does
not index doc values.
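The three string field types described above can be compared side by side. This is only a sketch: the class, method, and field names are made up for the example, and which field type suits which value depends on your own search requirements.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KeywordField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

class StringFieldsSketch {
  // Hypothetical product record; all field names are illustrative.
  static Document product(String sku, String category, String description) {
    Document doc = new Document();
    // Full-text search: tokenized by the Analyzer set on the IndexWriterConfig.
    doc.add(new TextField("description", description, Field.Store.YES));
    // Exact match on the whole value, plus doc values for sorting or faceting.
    doc.add(new KeywordField("category", category, Field.Store.NO));
    // Exact match on the whole value only; no doc values are written.
    doc.add(new StringField("sku", sku, Field.Store.YES));
    return doc;
  }

  public static void main(String[] args) {
    Document doc = product("SKU-1", "books", "An introduction to search engines");
    System.out.println(doc.getFields().size() + " fields");
  }
}
```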
Numbers
If a numeric field represents an identifier rather than a quantity and is more commonly
searched on single values than on ranges of values, it is generally recommended to index its
string representation via KeywordField (or StringField if doc values are not necessary).
LongField, IntField, DoubleField and FloatField index values in a points index for efficient
range queries, and also create doc values for these fields for efficient sorting and faceting.
If the field is meant to tune the score, FeatureField stores numeric data internally as term
frequencies, which makes it efficient to influence scoring at search time.
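The sketch below applies these recommendations to a hypothetical order record. It assumes a Lucene version in which IntField and DoubleField constructors take a Field.Store argument (older releases had two-argument constructors); the class, method, field, and feature names are illustrative only.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DoubleField;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.Query;

class NumericFieldsSketch {
  // Hypothetical order record; 'popularity' must be a positive float.
  static Document order(long orderId, int quantity, double price, float popularity) {
    Document doc = new Document();
    // An identifier looked up by exact value: index its string form.
    doc.add(new StringField("order_id", Long.toString(orderId), Field.Store.YES));
    // Quantities: points for range queries plus doc values for sorting/faceting.
    doc.add(new IntField("quantity", quantity, Field.Store.NO));
    doc.add(new DoubleField("price", price, Field.Store.NO));
    // A scoring signal, stored as a term frequency under the "features" field.
    doc.add(new FeatureField("features", "popularity", popularity));
    return doc;
  }

  // Range query over the "price" field indexed above (bounds are inclusive).
  static Query priceBetween(double low, double high) {
    return DoubleField.newRangeQuery("price", low, high);
  }

  public static void main(String[] args) {
    System.out.println(order(42L, 3, 19.99, 2.5f).getFields().size() + " fields");
    System.out.println(priceBetween(10.0, 20.0));
  }
}
```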
Other types of structured data
It is recommended to index dates as a LongField that stores the number of milliseconds since
Epoch.
IP fields can be indexed via InetAddressPoint in addition to a SortedDocValuesField (if the
field is single-valued) or SortedSetDocValuesField (if it is multi-valued) that stores the result
of InetAddressPoint.encode(java.net.InetAddress).
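As an illustration of both recommendations, the following sketch indexes a timestamp and a single-valued IP address. The class, the helper methods, and the field names are hypothetical, and the three-argument LongField constructor is assumed to be available in your Lucene version.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Instant;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.InetAddressPoint;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

class StructuredFieldsSketch {
  // Hypothetical helper: parses an IP literal such as "192.168.1.10".
  static InetAddress ip(String literal) {
    try {
      return InetAddress.getByName(literal);
    } catch (UnknownHostException e) {
      throw new IllegalArgumentException("Not a valid address: " + literal, e);
    }
  }

  static Document event(Instant when, InetAddress client) {
    Document doc = new Document();
    // Date as milliseconds since the Epoch.
    doc.add(new LongField("timestamp", when.toEpochMilli(), Field.Store.NO));
    // Points index for exact and range queries on the address.
    doc.add(new InetAddressPoint("client_ip", client));
    // Single-valued doc values holding the encoded address, for sorting.
    doc.add(new SortedDocValuesField("client_ip",
        new BytesRef(InetAddressPoint.encode(client))));
    return doc;
  }

  public static void main(String[] args) {
    Document doc = event(Instant.now(), ip("192.168.1.10"));
    System.out.println(doc.getFields().size() + " fields");
  }
}
```

For a multi-valued IP field, the same encoded bytes would go into a SortedSetDocValuesField instead.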
Dense numeric vectors
Dense numeric vectors can be indexed with KnnFloatVectorField if their dimensions are
floating-point numbers or KnnByteVectorField if their dimensions are bytes. This allows
searching for nearest neighbors at search time.
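A minimal sketch of indexing a float vector and building a matching nearest-neighbor query follows. The field name "embedding", the choice of cosine similarity, and the class and method names are assumptions made for the example; KnnFloatVectorQuery lives in org.apache.lucene.search rather than in this package.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;

class DenseVectorSketch {
  // Hypothetical field name shared by indexing and search.
  static final String FIELD = "embedding";

  static Document withEmbedding(float[] embedding) {
    Document doc = new Document();
    // One vector per document; the similarity function is fixed per field.
    doc.add(new KnnFloatVectorField(FIELD, embedding, VectorSimilarityFunction.COSINE));
    return doc;
  }

  // Query for the k documents whose vectors are closest to 'target'.
  static Query nearest(float[] target, int k) {
    return new KnnFloatVectorQuery(FIELD, target, k);
  }

  public static void main(String[] args) {
    Document doc = withEmbedding(new float[] {0.1f, 0.7f, 0.2f});
    System.out.println(doc.getFields().size() + " field");
    System.out.println(nearest(new float[] {0.1f, 0.6f, 0.3f}, 10));
  }
}
```

The query vector must have the same number of dimensions as the indexed vectors.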
Sparse numeric vectors
To perform nearest-neighbor search on sparse vectors rather than dense vectors, each dimension
of the sparse vector should be indexed as a FeatureField. Queries can then be constructed as a
BooleanQuery with linear queries as BooleanClause.Occur.SHOULD clauses.
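The approach above can be sketched as follows, representing a sparse vector as a map from dimension name to weight. The map representation, the field name "sparse", and the class and method names are assumptions for illustration. Values must be positive floats, and linear-query weights are expected to lie in the range (0, 64].

```java
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;

class SparseVectorSketch {
  // Hypothetical field that holds every dimension of the sparse vector.
  static final String FIELD = "sparse";

  // Index side: one FeatureField per non-zero dimension.
  static Document toDocument(Map<String, Float> sparseVector) {
    Document doc = new Document();
    for (Map.Entry<String, Float> dim : sparseVector.entrySet()) {
      doc.add(new FeatureField(FIELD, dim.getKey(), dim.getValue()));
    }
    return doc;
  }

  // Search side: one linear query per query dimension, combined as SHOULD
  // clauses so that the scores of matching dimensions add up.
  static BooleanQuery toQuery(Map<String, Float> queryVector) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (Map.Entry<String, Float> dim : queryVector.entrySet()) {
      builder.add(
          FeatureField.newLinearQuery(FIELD, dim.getKey(), dim.getValue()),
          BooleanClause.Occur.SHOULD);
    }
    return builder.build();
  }

  public static void main(String[] args) {
    Document doc = toDocument(Map.of("lucene", 1.5f, "search", 0.8f));
    BooleanQuery query = toQuery(Map.of("search", 2.0f, "index", 1.0f));
    System.out.println(doc.getFields().size() + " features, "
        + query.clauses().size() + " clauses");
  }
}
```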
Class descriptions
Field that stores a per-document BytesRef value.
An indexed binary field for fast range filters.
A binary representation of a range that wraps a BinaryDocValues field.
Provides support for converting dates to strings and vice-versa.
Specifies the time granularity.
Documents are the unit of indexing and search.
A StoredFieldVisitor that creates a Document from stored fields.
Syntactic sugar for encoding doubles as NumericDocValues via Double.doubleToRawLongBits(double).
Field that stores a per-document double value for scoring, sorting or value retrieval and indexes the field for fast range filters.
An indexed double field for fast range filters.
An indexed Double Range field.
DocValues field for DoubleRange.
Field that can be used to store static scoring factors into documents.
Expert: directly create a field for a document.
Specifies whether and how a field should be stored.
Describes the properties of a field.
Syntactic sugar for encoding floats as NumericDocValues via Float.floatToRawIntBits(float).
Field that stores a per-document float value for scoring, sorting or value retrieval and indexes the field for fast range filters.
An indexed float field for fast range filters.
An indexed Float Range field.
DocValues field for FloatRange.
An indexed 128-bit InetAddress field.
An indexed InetAddress Range field.
Field that stores a per-document int value for scoring, sorting or value retrieval and indexes the field for fast range filters.
An indexed int field for fast range filters.
An indexed Integer Range field.
DocValues field for IntRange.
Describes how an IndexableField should be inverted for indexing terms and postings.
Field that indexes a per-document String or BytesRef into an inverted index for fast filtering, stores values in a columnar fashion using DocValuesType.SORTED_SET doc values for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval.
A field that contains a single byte numeric vector (or none) for each document.
A field that contains a single floating-point numeric vector (or none) for each document.
A per-document location field.
An indexed location field.
A geo shape utility class for indexing and searching GIS geometries whose vertices are latitude, longitude values (in decimal degrees).
A concrete implementation of ShapeDocValues for storing binary doc value representation of LatLonShape geometries in a LatLonShapeDocValuesField.
Concrete implementation of a ShapeDocValuesField for geographic geometries.
Field that stores a per-document long value for scoring, sorting or value retrieval and indexes the field for fast range filters.
An indexed long field for fast range filters.
An indexed Long Range field.
DocValues field for LongRange.
Field that stores a per-document long value for scoring, sorting or value retrieval.
Query class for searching RangeField types by a defined PointValues.Relation.
Used by RangeFieldQuery to check how each internal or leaf node relates to the query.
A doc values field for LatLonShape and XYShape that uses ShapeDocValues as the underlying binary doc value format.
A base shape utility class used for both LatLon (spherical) and XY (cartesian) shape fields.
Represents an encoded triangle using ShapeField.decodeTriangle(byte[], DecodedTriangle).
Type of triangle.
Query relation types.
Polygons are decomposed into tessellated triangles using Tessellator; these triangles are encoded and inserted as separate indexed POINT fields.
Field that stores a per-document BytesRef value, indexed for sorting.
Field that stores per-document long values for scoring, sorting or value retrieval.
Field that stores a set of per-document BytesRef values, indexed for faceting, grouping, joining.
A field whose value is stored so that IndexSearcher.storedFields() and IndexReader.storedFields() will return the field and its value.
Abstraction around a stored value.
Type of a StoredValue.
A field that is indexed but not tokenized: the entire String value is indexed as a single token.
A field that is indexed and tokenized, without term vectors.
A per-document location field.
XYGeometry query for XYDocValuesField.
An indexed XY position field.
A cartesian shape utility class for indexing and searching geometries whose vertices are unitless x, y values.
A concrete implementation of ShapeDocValues for storing binary doc value representation of XYShape geometries in an XYShapeDocValuesField.
Concrete implementation of a ShapeDocValuesField for cartesian geometries.