public class UnifiedHighlighter extends Object
A highlighter that can get offsets from either postings (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS), term vectors (FieldType.setStoreTermVectorOffsets(boolean)), or via re-analyzing text.
This highlighter treats the single original document as the whole corpus, and then scores individual passages as if they were documents in this corpus. It uses a BreakIterator to find passages in the text; by default it breaks using getSentenceInstance(Locale.ROOT). It then iterates in parallel (merge sorting by offset) through the positions of all terms from the query, coalescing those hits that occur in a single passage into a Passage, and then scores each Passage using a separate PassageScorer. Passages are finally formatted into highlighted snippets with a PassageFormatter.
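The passage-finding step described above can be illustrated with the JDK alone. The following is a minimal sketch, not Lucene code: it splits text with BreakIterator.getSentenceInstance(Locale.ROOT), then counts term hits per sentence using a naive substring search in place of offsets read from postings or term vectors, and a plain hit count in place of a PassageScorer. The names PassageSketch, Passage, passages and best are hypothetical and exist only for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model of passage splitting and hit coalescing. Not Lucene code. */
final class PassageSketch {

    /** Hypothetical passage: a [start, end) span of the text plus its term-hit count. */
    static final class Passage {
        final int start;
        final int end;
        final int hits;

        Passage(int start, int end, int hits) {
            this.start = start;
            this.end = end;
            this.hits = hits;
        }
    }

    /** Splits text into sentences and counts how many term occurrences start in each one. */
    static List<Passage> passages(String text, List<String> terms) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);
        List<Passage> result = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            int hits = 0;
            for (String term : terms) {
                // Naive case-sensitive substring search stands in for offsets read from the index.
                for (int at = text.indexOf(term, start); at >= 0 && at < end; at = text.indexOf(term, at + 1)) {
                    hits++;
                }
            }
            result.add(new Passage(start, end, hits));
        }
        return result;
    }

    /** Picks the passage with the most hits; ties go to the earliest passage. */
    static Passage best(List<Passage> passages) {
        Passage best = passages.get(0);
        for (Passage p : passages) {
            if (p.hits > best.hits) {
                best = p;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. The highlighter scores each passage. "
                + "A passage with more hits ranks higher.";
        Passage top = best(passages(text, List.of("highlighter", "passage")));
        System.out.println(text.substring(top.start, top.end).trim());
    }
}
```

The hit count is only a placeholder for ranking. The point of the sketch is the shape of the data: one span per sentence, with the hits that fall inside it grouped together.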
You can customize the behavior by calling some of the setters, or by subclassing and overriding some methods. Some important hooks:
getBreakIterator(String): Customize how the text is divided into passages.
getScorer(String): Customize how passages are ranked.
getFormatter(String): Customize how snippets are formatted.
This is thread-safe.
Modifier and Type | Class and Description |
---|---|
static class | UnifiedHighlighter.HighlightFlag: Flags for controlling highlighting behavior. |
protected static class | UnifiedHighlighter.LimitedStoredFieldVisitor: Fetches stored fields for highlighting. |
static class | UnifiedHighlighter.OffsetSource: Source of term offsets; essential for highlighting. |
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_CACHE_CHARS_THRESHOLD |
static int | DEFAULT_MAX_LENGTH |
protected FieldInfos | fieldInfos |
protected Analyzer | indexAnalyzer |
protected static char | MULTIVAL_SEP_CHAR |
protected IndexSearcher | searcher |
protected static CharacterRunAutomaton[] | ZERO_LEN_AUTOMATA_ARRAY |
Constructor and Description |
---|
UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer indexAnalyzer): Constructs the highlighter with the given index searcher and analyzer. |
Modifier and Type | Method and Description |
---|---|
protected static Set<Term> | extractTerms(Query query): Extracts matching terms after rewriting against an empty index. |
protected static BytesRef[] | filterExtractedTerms(Predicate<String> fieldMatcher, Set<Term> queryTerms) |
protected CharacterRunAutomaton[] | getAutomata(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) |
protected BreakIterator | getBreakIterator(String field): Returns the BreakIterator to use for dividing text into passages. |
int | getCacheFieldValCharsThreshold(): Limits the amount of field value pre-fetching until this threshold is passed. |
protected FieldHighlighter | getFieldHighlighter(String field, Query query, Set<Term> allTerms, int maxPassages) |
protected FieldInfo | getFieldInfo(String field): Called by the default implementation of getOffsetSource(String). |
protected Predicate<String> | getFieldMatcher(String field): Returns the predicate to use for extracting the query part that must be highlighted. |
protected Set<UnifiedHighlighter.HighlightFlag> | getFlags(String field) |
protected PassageFormatter | getFormatter(String field): Returns the PassageFormatter to use for formatting passages into highlighted snippets. |
Analyzer | getIndexAnalyzer(): ... |
IndexSearcher | getIndexSearcher(): ... |
int | getMaxLength(): The maximum content size to process. |
protected int | getMaxNoHighlightPassages(String field): Returns the number of leading passages (as delineated by the BreakIterator) when no highlights could be found. |
protected UnifiedHighlighter.OffsetSource | getOffsetSource(String field): Determine the offset source for the specified field. |
protected FieldOffsetStrategy | getOffsetStrategy(UnifiedHighlighter.OffsetSource offsetSource, UHComponents components) |
protected UnifiedHighlighter.OffsetSource | getOptimizedOffsetSource(String field, BytesRef[] terms, PhraseHelper phraseHelper, CharacterRunAutomaton[] automata) |
protected PhraseHelper | getPhraseHelper(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) |
protected PassageScorer | getScorer(String field): Returns the PassageScorer to use for ranking passages. |
String[] | highlight(String field, Query query, TopDocs topDocs): Highlights the top passages from a single field. |
String[] | highlight(String field, Query query, TopDocs topDocs, int maxPassages): Highlights the top-N passages from a single field. |
Map<String,String[]> | highlightFields(String[] fieldsIn, Query query, int[] docidsIn, int[] maxPassagesIn): Highlights the top-N passages from multiple fields, for the provided int[] docids. |
Map<String,String[]> | highlightFields(String[] fields, Query query, TopDocs topDocs): Highlights the top passages from multiple fields. |
Map<String,String[]> | highlightFields(String[] fields, Query query, TopDocs topDocs, int[] maxPassages): Highlights the top-N passages from multiple fields. |
protected Map<String,Object[]> | highlightFieldsAsObjects(String[] fieldsIn, Query query, int[] docIdsIn, int[] maxPassagesIn): Expert: highlights the top-N passages from multiple fields, for the provided int[] docids, to custom Object as returned by the PassageFormatter. |
Object | highlightWithoutSearcher(String field, Query query, String content, int maxPassages): Highlights text passed as a parameter. |
protected List<CharSequence[]> | loadFieldValues(String[] fields, DocIdSetIterator docIter, int cacheCharsThreshold): Loads the String values for each docId by field to be highlighted. |
protected UnifiedHighlighter.LimitedStoredFieldVisitor | newLimitedStoredFieldsVisitor(String[] fields) |
protected Collection<Query> | preSpanQueryRewrite(Query query): When highlighting phrases accurately, we may need to handle custom queries that aren't supported in the WeightedSpanTermExtractor as called by the PhraseHelper. |
protected Boolean | requiresRewrite(SpanQuery spanQuery): When highlighting phrases accurately, we need to know which SpanQuery instances need to have Query.rewrite(IndexReader) called on them. |
void | setBreakIterator(Supplier<BreakIterator> breakIterator) |
void | setCacheFieldValCharsThreshold(int cacheFieldValCharsThreshold) |
void | setFieldMatcher(Predicate<String> predicate) |
void | setFormatter(PassageFormatter formatter) |
void | setHandleMultiTermQuery(boolean handleMtq) |
void | setHighlightPhrasesStrictly(boolean highlightPhrasesStrictly) |
void | setMaxLength(int maxLength) |
void | setMaxNoHighlightPassages(int defaultMaxNoHighlightPassages) |
void | setScorer(PassageScorer scorer) |
protected boolean | shouldHandleMultiTermQuery(String field): Returns whether MultiTermQuery derivatives will be highlighted. |
protected boolean | shouldHighlightPhrasesStrictly(String field): Returns whether position-sensitive queries (e.g. phrases and SpanQuery instances) should be highlighted strictly based on query matches. |
protected boolean | shouldPreferPassageRelevancyOverSpeed(String field) |
protected static final char MULTIVAL_SEP_CHAR
public static final int DEFAULT_MAX_LENGTH
public static final int DEFAULT_CACHE_CHARS_THRESHOLD
protected static final CharacterRunAutomaton[] ZERO_LEN_AUTOMATA_ARRAY
protected final IndexSearcher searcher
protected final Analyzer indexAnalyzer
protected volatile FieldInfos fieldInfos
public UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer indexAnalyzer)
Constructs the highlighter with the given index searcher and analyzer.
Parameters:
indexSearcher - Usually required, unless highlightWithoutSearcher(String, Query, String, int) is used, in which case this needs to be null.
indexAnalyzer - Required, even if in some circumstances it isn't used.
protected static Set<Term> extractTerms(Query query) throws IOException
Extracts matching terms after rewriting against an empty index.
Throws:
IOException
public void setHandleMultiTermQuery(boolean handleMtq)
public void setHighlightPhrasesStrictly(boolean highlightPhrasesStrictly)
public void setMaxLength(int maxLength)
public void setBreakIterator(Supplier<BreakIterator> breakIterator)
public void setScorer(PassageScorer scorer)
public void setFormatter(PassageFormatter formatter)
public void setMaxNoHighlightPassages(int defaultMaxNoHighlightPassages)
public void setCacheFieldValCharsThreshold(int cacheFieldValCharsThreshold)
protected boolean shouldHandleMultiTermQuery(String field)
Returns whether MultiTermQuery derivatives will be highlighted. By default it's enabled. MTQ highlighting can be expensive, particularly when using offsets in postings.
protected boolean shouldHighlightPhrasesStrictly(String field)
Returns whether position-sensitive queries (e.g. phrases and SpanQuery instances) should be highlighted strictly based on query matches (slower) versus any/all occurrences of the underlying terms. By default it's enabled, but there's no overhead if such queries aren't used.
protected boolean shouldPreferPassageRelevancyOverSpeed(String field)
protected Predicate<String> getFieldMatcher(String field)
public int getMaxLength()
protected BreakIterator getBreakIterator(String field)
Returns the BreakIterator to use for dividing text into passages. This returns BreakIterator.getSentenceInstance(Locale) by default; subclasses can override to customize.
Note: this highlighter will call BreakIterator.preceding(int) and BreakIterator.next() many times on it. The default generic JDK implementation of preceding performs poorly.
protected PassageScorer getScorer(String field)
Returns the PassageScorer to use for ranking passages. This returns a new PassageScorer by default; subclasses can override to customize.
protected PassageFormatter getFormatter(String field)
Returns the PassageFormatter to use for formatting passages into highlighted snippets. This returns a new PassageFormatter by default; subclasses can override to customize.
protected int getMaxNoHighlightPassages(String field)
Returns the number of leading passages (as delineated by the BreakIterator) when no highlights could be found. If it's less than 0 (the default) then this defaults to the maxPassages parameter given for each request. If this is 0 then the resulting highlight is null (not formatted).
public int getCacheFieldValCharsThreshold()
Limits the amount of field value pre-fetching until this threshold is passed. The highlighter internally highlights in batches of documents sized on the sum field value length (in chars) of the fields to be highlighted (bounded by getMaxLength() for each field). By setting this to 0, you can force documents to be fetched and highlighted one at a time, which you usually shouldn't do. The default is 524288 chars, which translates to about a megabyte. However, note that the highlighter sometimes ignores this and highlights one document at a time (without caching a batch of documents in advance) when it can detect there's no point in it, for example when all fields will be highlighted via re-analysis.
public IndexSearcher getIndexSearcher()
public Analyzer getIndexAnalyzer()
protected UnifiedHighlighter.OffsetSource getOffsetSource(String field)
Determine the offset source for the specified field. The default algorithm is as follows:
1. This calls getFieldInfo(String). Note this returns null if there is no searcher or if the field isn't found there.
2. If there's a field info and it has IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, then UnifiedHighlighter.OffsetSource.POSTINGS is returned.
3. If there's a field info and FieldInfo.hasVectors(), then UnifiedHighlighter.OffsetSource.TERM_VECTORS is returned (note we can't check here if the TV has offsets; if it doesn't, then an exception will get thrown down the line).
4. Fall-back: UnifiedHighlighter.OffsetSource.ANALYSIS is returned.
Note that the highlighter sometimes switches to something else based on the query, such as if you have UnifiedHighlighter.OffsetSource.POSTINGS_WITH_TERM_VECTORS but in fact don't need term vectors.
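The fallback order described above can be modelled in a few lines of plain Java. This is a sketch under stated assumptions, not the library's implementation: OffsetSourceSketch, FieldFacts and choose are hypothetical names, the local enum only mirrors the three outcomes named in the default rules, and the combined POSTINGS_WITH_TERM_VECTORS value mentioned in the note is deliberately left out.

```java
/** Toy model of the documented default offset-source decision. Not Lucene code. */
final class OffsetSourceSketch {

    /** Local stand-in for the three outcomes named in the default rules. */
    enum OffsetSource { POSTINGS, TERM_VECTORS, ANALYSIS }

    /** Hypothetical holder for the two field facts the decision looks at. */
    static final class FieldFacts {
        final boolean offsetsInPostings; // indexed with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
        final boolean hasVectors;        // what FieldInfo.hasVectors() would report

        FieldFacts(boolean offsetsInPostings, boolean hasVectors) {
            this.offsetsInPostings = offsetsInPostings;
            this.hasVectors = hasVectors;
        }
    }

    /** A null argument models "no searcher, or the field isn't found there". */
    static OffsetSource choose(FieldFacts facts) {
        if (facts != null) {
            if (facts.offsetsInPostings) {
                return OffsetSource.POSTINGS;
            }
            if (facts.hasVectors) {
                // Whether the vectors actually carry offsets cannot be checked at this point.
                return OffsetSource.TERM_VECTORS;
            }
        }
        return OffsetSource.ANALYSIS; // fall-back: re-analyze the text
    }

    public static void main(String[] args) {
        System.out.println(choose(new FieldFacts(true, false)));
        System.out.println(choose(new FieldFacts(false, true)));
        System.out.println(choose(null));
    }
}
```

A subclass that overrides getOffsetSource(String) would typically keep this order and only special-case the fields it knows about.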
protected FieldInfo getFieldInfo(String field)
Called by the default implementation of getOffsetSource(String). If there is no searcher then we simply always return null.
public String[] highlight(String field, Query query, TopDocs topDocs) throws IOException
Highlights the top passages from a single field.
Parameters:
field - field name to highlight. Must have a stored string value and also be indexed with offsets.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
Returns:
Array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first sentence for the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public String[] highlight(String field, Query query, TopDocs topDocs, int maxPassages) throws IOException
Highlights the top-N passages from a single field.
Parameters:
field - field name to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
maxPassages - The maximum number of top-N ranked passages used to form the highlighted snippets.
Returns:
Array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public Map<String,String[]> highlightFields(String[] fields, Query query, TopDocs topDocs) throws IOException
Highlights the top passages from multiple fields.
Conceptually, this behaves as a more efficient form of:
    Map<String,String[]> m = new HashMap<>();
    for (String field : fields) {
      m.put(field, highlight(field, query, topDocs));
    }
    return m;
Parameters:
fields - field names to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
Returns:
Map keyed on field name, containing the array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first sentence from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public Map<String,String[]> highlightFields(String[] fields, Query query, TopDocs topDocs, int[] maxPassages) throws IOException
Highlights the top-N passages from multiple fields.
Conceptually, this behaves as a more efficient form of:
    Map<String,String[]> m = new HashMap<>();
    for (int i = 0; i < fields.length; i++) {
      // maxPassages is per-field, so each field is paired with its own limit
      m.put(fields[i], highlight(fields[i], query, topDocs, maxPassages[i]));
    }
    return m;
Parameters:
fields - field names to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
maxPassages - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
Map keyed on field name, containing the array of formatted snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public Map<String,String[]> highlightFields(String[] fieldsIn, Query query, int[] docidsIn, int[] maxPassagesIn) throws IOException
Highlights the top-N passages from multiple fields, for the provided int[] docids.
Parameters:
fieldsIn - field names to highlight. Must have a stored string value.
query - query to highlight.
docidsIn - containing the document IDs to highlight.
maxPassagesIn - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
Map keyed on field name, containing the array of formatted snippets corresponding to the documents in docidsIn. If no highlights were found for a document, the leading passages from the field (up to that field's maxPassagesIn limit) will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
protected Map<String,Object[]> highlightFieldsAsObjects(String[] fieldsIn, Query query, int[] docIdsIn, int[] maxPassagesIn) throws IOException
Expert: highlights the top-N passages from multiple fields, for the provided int[] docids, to custom Object as returned by the PassageFormatter. Use this API to render to something other than String.
Parameters:
fieldsIn - field names to highlight. Must have a stored string value.
query - query to highlight.
docIdsIn - containing the document IDs to highlight.
maxPassagesIn - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
Map keyed on field name, containing the array of formatted snippets corresponding to the documents in docIdsIn. If no highlights were found for a document, the leading passages from the field (up to that field's maxPassagesIn limit) will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public Object highlightWithoutSearcher(String field, Query query, String content, int maxPassages) throws IOException
Highlights text passed as a parameter. This requires that the IndexSearcher provided to this highlighter is null. This use case is rarer. Naturally, the mode of operation will be UnifiedHighlighter.OffsetSource.ANALYSIS.
The result of this method is whatever the PassageFormatter returns. For the DefaultPassageFormatter, and assuming content has non-zero length, the result will be a non-null string, so it's safe to call Object.toString() on it in that case.
Parameters:
field - field name to highlight (as found in the query).
query - query to highlight.
content - text to highlight.
maxPassages - The maximum number of top-N ranked passages used to form the highlighted snippets.
Returns:
result of the PassageFormatter -- probably a String. Might be null.
Throws:
IOException - if an I/O error occurred during processing
protected FieldHighlighter getFieldHighlighter(String field, Query query, Set<Term> allTerms, int maxPassages)
protected static BytesRef[] filterExtractedTerms(Predicate<String> fieldMatcher, Set<Term> queryTerms)
protected Set<UnifiedHighlighter.HighlightFlag> getFlags(String field)
protected PhraseHelper getPhraseHelper(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags)
protected CharacterRunAutomaton[] getAutomata(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags)
protected UnifiedHighlighter.OffsetSource getOptimizedOffsetSource(String field, BytesRef[] terms, PhraseHelper phraseHelper, CharacterRunAutomaton[] automata)
protected FieldOffsetStrategy getOffsetStrategy(UnifiedHighlighter.OffsetSource offsetSource, UHComponents components)
protected Boolean requiresRewrite(SpanQuery spanQuery)
When highlighting phrases accurately, we need to know which SpanQuery instances need to have Query.rewrite(IndexReader) called on them. It helps performance to avoid it if it's not needed. This method will be invoked on all SpanQuery instances recursively. If you have custom SpanQuery queries then override this to check instanceof and provide a definitive answer. If the query isn't your custom one, simply return null to have the default rules apply, which govern the ones included in Lucene.
protected Collection<Query> preSpanQueryRewrite(Query query)
When highlighting phrases accurately, we may need to handle custom queries that aren't supported in the WeightedSpanTermExtractor as called by the PhraseHelper. Should custom query types be needed, this method should be overridden to return a collection of queries if appropriate, or null if there is nothing to do. If the query is not custom, simply returning null will allow the default rules to apply.
Parameters:
query - Query to be highlighted
protected List<CharSequence[]> loadFieldValues(String[] fields, DocIdSetIterator docIter, int cacheCharsThreshold) throws IOException
Loads the String values for each docId by field to be highlighted. This method must load fields for at least one document from the given DocIdSetIterator but need not return all of them; by default the character lengths are summed and this method will return early when cacheCharsThreshold is exceeded. Specifically, if that number is 0, then only one document is fetched no matter what. Values in the array of CharSequence will be null if no value was found.
Throws:
IOException
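The early-return rule for loadFieldValues can be illustrated without Lucene. The following is a minimal sketch that follows the description above rather than the library source: it uses one String per document instead of a CharSequence[] per field, a plain Iterator in place of a DocIdSetIterator, and the hypothetical names BatchLoadSketch and loadBatch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Toy model of threshold-bounded batch loading. Not Lucene code. */
final class BatchLoadSketch {

    /**
     * Takes values until their summed length exceeds cacheCharsThreshold.
     * At least one value is taken when any is available, and exactly one
     * when the threshold is 0.
     */
    static List<String> loadBatch(Iterator<String> docs, int cacheCharsThreshold) {
        List<String> batch = new ArrayList<>();
        long sumChars = 0;
        while (docs.hasNext()) {
            String value = docs.next(); // may be null when no value was found
            batch.add(value);
            if (value != null) {
                sumChars += value.length();
            }
            if (cacheCharsThreshold == 0 || sumChars > cacheCharsThreshold) {
                break; // threshold exceeded: return early with what has been loaded
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("aaaa", "bbbb", "cccc");
        System.out.println(loadBatch(docs.iterator(), 5).size());
        System.out.println(loadBatch(docs.iterator(), 0).size());
    }
}
```

For scale, the documented default of 524288 chars corresponds to 1,048,576 bytes at two bytes per Java char, which is where "about a megabyte" comes from.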
protected UnifiedHighlighter.LimitedStoredFieldVisitor newLimitedStoredFieldsVisitor(String[] fields)
Copyright © 2000-2019 Apache Software Foundation. All Rights Reserved.