public class UnifiedHighlighter extends Object
A highlighter that can get offsets from either postings (IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS),
term vectors (FieldType.setStoreTermVectorOffsets(boolean)),
or via re-analyzing text.
This highlighter treats the single original document as the whole corpus, and then scores individual
passages as if they were documents in this corpus. It uses a BreakIterator to find
passages in the text; by default it breaks using getSentenceInstance(Locale.ROOT). It then iterates in parallel (merge sorting by offset) through
the positions of all terms from the query, coalescing those hits that occur in a single passage
into a Passage, and then scores each Passage using a separate PassageScorer.
Passages are finally formatted into highlighted snippets with a PassageFormatter.
You can customize the behavior by calling some of the setters, or by subclassing and overriding some methods. Some important hooks:
getBreakIterator(String): Customize how the text is divided into passages.
getScorer(String): Customize how passages are ranked.
getFormatter(String): Customize how snippets are formatted.
This is thread-safe.
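The passage-then-score flow described above can be illustrated with a small JDK-only sketch. It is not Lucene code: the class and method names (PassageRankingSketch, splitIntoPassages, countHits, bestPassage) are hypothetical, and a plain hit count stands in for a real PassageScorer. Only the use of BreakIterator.getSentenceInstance(Locale.ROOT) is taken from the description.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy illustration only: not Lucene code, and every name here is hypothetical. */
final class PassageRankingSketch {

    /** Splits text into sentence passages with the JDK sentence BreakIterator. */
    static List<String> splitIntoPassages(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);
        List<String> passages = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            passages.add(text.substring(start, end));
        }
        return passages;
    }

    /** Counts case-insensitive occurrences of a term; a crude stand-in for passage scoring. */
    static int countHits(String passage, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = passage.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (int i = haystack.indexOf(needle); i >= 0; i = haystack.indexOf(needle, i + needle.length())) {
            hits++;
        }
        return hits;
    }

    /** Returns the passage with the most hits, or the first passage when nothing matches. */
    static String bestPassage(String text, String... terms) {
        String best = "";
        int bestScore = -1;
        for (String passage : splitIntoPassages(text)) {
            int score = 0;
            for (String term : terms) {
                score += countHits(passage, term);
            }
            if (score > bestScore) { // strict comparison keeps the earliest passage on ties
                best = passage;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. It supports highlighting. Passages are scored.";
        System.out.println(bestPassage(text, "highlighting").trim());
    }
}
```

Falling back to the first passage when nothing matches loosely mirrors the documented behavior of returning leading sentences when no highlights are found.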
| Modifier and Type | Class | Description |
|---|---|---|
| static class | UnifiedHighlighter.HighlightFlag | Flags for controlling highlighting behavior. |
| protected static class | UnifiedHighlighter.LimitedStoredFieldVisitor | Fetches stored fields for highlighting. |
| static class | UnifiedHighlighter.OffsetSource | Source of term offsets; essential for highlighting. |
| Modifier and Type | Field and Description |
|---|---|
| static int | DEFAULT_CACHE_CHARS_THRESHOLD |
| static int | DEFAULT_MAX_LENGTH |
| protected FieldInfos | fieldInfos |
| protected Analyzer | indexAnalyzer |
| protected static char | MULTIVAL_SEP_CHAR |
| protected IndexSearcher | searcher |
| protected static CharacterRunAutomaton[] | ZERO_LEN_AUTOMATA_ARRAY |
| Constructor | Description |
|---|---|
| UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer indexAnalyzer) | Constructs the highlighter with the given index searcher and analyzer. |
| Modifier and Type | Method | Description |
|---|---|---|
| protected static Set<Term> | extractTerms(Query query) | Calls Weight.extractTerms(Set) on an empty index for the query. |
| protected static BytesRef[] | filterExtractedTerms(Predicate<String> fieldMatcher, Set<Term> queryTerms) | |
| protected CharacterRunAutomaton[] | getAutomata(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) | |
| protected BreakIterator | getBreakIterator(String field) | Returns the BreakIterator to use for dividing text into passages. |
| int | getCacheFieldValCharsThreshold() | Limits the amount of field value pre-fetching until this threshold is passed. |
| protected FieldHighlighter | getFieldHighlighter(String field, Query query, Set<Term> allTerms, int maxPassages) | |
| protected FieldInfo | getFieldInfo(String field) | Called by the default implementation of getOffsetSource(String). |
| protected Predicate<String> | getFieldMatcher(String field) | Returns the predicate to use for extracting the query part that must be highlighted. |
| protected Set<UnifiedHighlighter.HighlightFlag> | getFlags(String field) | |
| protected PassageFormatter | getFormatter(String field) | Returns the PassageFormatter to use for formatting passages into highlighted snippets. |
| Analyzer | getIndexAnalyzer() | ... |
| IndexSearcher | getIndexSearcher() | ... |
| int | getMaxLength() | The maximum content size to process. |
| protected int | getMaxNoHighlightPassages(String field) | Returns the number of leading passages (as delineated by the BreakIterator) when no highlights could be found. |
| protected UnifiedHighlighter.OffsetSource | getOffsetSource(String field) | Determine the offset source for the specified field. |
| protected FieldOffsetStrategy | getOffsetStrategy(UnifiedHighlighter.OffsetSource offsetSource, String field, BytesRef[] terms, PhraseHelper phraseHelper, CharacterRunAutomaton[] automata, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) | |
| protected UnifiedHighlighter.OffsetSource | getOptimizedOffsetSource(String field, BytesRef[] terms, PhraseHelper phraseHelper, CharacterRunAutomaton[] automata) | |
| protected PhraseHelper | getPhraseHelper(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags) | |
| protected PassageScorer | getScorer(String field) | Returns the PassageScorer to use for ranking passages. |
| String[] | highlight(String field, Query query, TopDocs topDocs) | Highlights the top passages from a single field. |
| String[] | highlight(String field, Query query, TopDocs topDocs, int maxPassages) | Highlights the top-N passages from a single field. |
| Map<String,String[]> | highlightFields(String[] fieldsIn, Query query, int[] docidsIn, int[] maxPassagesIn) | Highlights the top-N passages from multiple fields, for the provided int[] docids. |
| Map<String,String[]> | highlightFields(String[] fields, Query query, TopDocs topDocs) | Highlights the top passages from multiple fields. |
| Map<String,String[]> | highlightFields(String[] fields, Query query, TopDocs topDocs, int[] maxPassages) | Highlights the top-N passages from multiple fields. |
| protected Map<String,Object[]> | highlightFieldsAsObjects(String[] fieldsIn, Query query, int[] docIdsIn, int[] maxPassagesIn) | Expert: highlights the top-N passages from multiple fields, for the provided int[] docids, to custom Object as returned by the PassageFormatter. |
| Object | highlightWithoutSearcher(String field, Query query, String content, int maxPassages) | Highlights text passed as a parameter. |
| protected List<CharSequence[]> | loadFieldValues(String[] fields, DocIdSetIterator docIter, int cacheCharsThreshold) | Loads the String values for each docId by field to be highlighted. |
| protected UnifiedHighlighter.LimitedStoredFieldVisitor | newLimitedStoredFieldsVisitor(String[] fields) | |
| protected Collection<Query> | preMultiTermQueryRewrite(Query query) | When dealing with multi term queries / span queries, we may need to handle custom queries that aren't supported by the default automata extraction in MultiTermHighlighting. |
| protected Collection<Query> | preSpanQueryRewrite(Query query) | When highlighting phrases accurately, we may need to handle custom queries that aren't supported in the WeightedSpanTermExtractor as called by the PhraseHelper. |
| protected Boolean | requiresRewrite(SpanQuery spanQuery) | When highlighting phrases accurately, we need to know which SpanQuery instances need to have Query.rewrite(IndexReader) called on them. |
| void | setBreakIterator(Supplier<BreakIterator> breakIterator) | |
| void | setCacheFieldValCharsThreshold(int cacheFieldValCharsThreshold) | |
| void | setFieldMatcher(Predicate<String> predicate) | |
| void | setFormatter(PassageFormatter formatter) | |
| void | setHandleMultiTermQuery(boolean handleMtq) | |
| void | setHighlightPhrasesStrictly(boolean highlightPhrasesStrictly) | |
| void | setMaxLength(int maxLength) | |
| void | setMaxNoHighlightPassages(int defaultMaxNoHighlightPassages) | |
| void | setScorer(PassageScorer scorer) | |
| protected boolean | shouldHandleMultiTermQuery(String field) | Returns whether MultiTermQuery derivatives will be highlighted. |
| protected boolean | shouldHighlightPhrasesStrictly(String field) | Returns whether position sensitive queries (e.g. SpanQuery queries) should be highlighted strictly based on query matches (slower) versus any/all occurrences of the underlying terms. |
| protected boolean | shouldPreferPassageRelevancyOverSpeed(String field) | |
protected static final char MULTIVAL_SEP_CHAR
public static final int DEFAULT_MAX_LENGTH
public static final int DEFAULT_CACHE_CHARS_THRESHOLD
protected static final CharacterRunAutomaton[] ZERO_LEN_AUTOMATA_ARRAY
protected final IndexSearcher searcher
protected final Analyzer indexAnalyzer
protected volatile FieldInfos fieldInfos
public UnifiedHighlighter(IndexSearcher indexSearcher, Analyzer indexAnalyzer)
Constructs the highlighter with the given index searcher and analyzer.
Parameters:
indexSearcher - Usually required, unless highlightWithoutSearcher(String, Query, String, int) is used, in which case this needs to be null.
indexAnalyzer - Required, even if in some circumstances it isn't used.
protected static Set<Term> extractTerms(Query query) throws IOException
Calls Weight.extractTerms(Set) on an empty index for the query.
Throws:
IOException
public void setHandleMultiTermQuery(boolean handleMtq)
public void setHighlightPhrasesStrictly(boolean highlightPhrasesStrictly)
public void setMaxLength(int maxLength)
public void setBreakIterator(Supplier<BreakIterator> breakIterator)
public void setScorer(PassageScorer scorer)
public void setFormatter(PassageFormatter formatter)
public void setMaxNoHighlightPassages(int defaultMaxNoHighlightPassages)
public void setCacheFieldValCharsThreshold(int cacheFieldValCharsThreshold)
protected boolean shouldHandleMultiTermQuery(String field)
Returns whether MultiTermQuery derivatives will be highlighted. By default it's enabled. MTQ highlighting can be expensive, particularly when using offsets in postings.
protected boolean shouldHighlightPhrasesStrictly(String field)
Returns whether position sensitive queries (e.g. SpanQuery queries) should be highlighted strictly based on query matches (slower) versus any/all occurrences of the underlying terms. By default it's enabled, but there's no overhead if such queries aren't used.
protected boolean shouldPreferPassageRelevancyOverSpeed(String field)
protected Predicate<String> getFieldMatcher(String field)
public int getMaxLength()
protected BreakIterator getBreakIterator(String field)
Returns the BreakIterator to use for dividing text into passages. This returns BreakIterator.getSentenceInstance(Locale) by default; subclasses can override to customize.
Note: this highlighter will call BreakIterator.preceding(int) and BreakIterator.next() many times on it. The default generic JDK implementation of preceding performs poorly.
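To make the note about preceding(int) and next() concrete, here is a JDK-only sketch of how those two calls can locate the passage that surrounds a character offset. The helper name passageAround is hypothetical, and the exact call pattern inside the highlighter is an assumption; the sketch only shows why both methods end up being called once per hit.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical helper, not part of Lucene. */
final class PassageBoundsSketch {

    /**
     * Returns {start, end} of the sentence containing the given character offset.
     * Assumes 0 <= offset < text.length().
     */
    static int[] passageAround(String text, int offset) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);
        // preceding(n) returns the last boundary strictly before n, so passing
        // offset + 1 keeps an offset that sits exactly on a boundary in its own passage.
        int start = sentences.preceding(offset + 1);
        if (start == BreakIterator.DONE) {
            start = 0;
        }
        // After preceding(), the iterator is positioned at start, so next() is the passage end.
        int end = sentences.next();
        if (end == BreakIterator.DONE) {
            end = text.length();
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        String text = "Lucene is a search library. It supports highlighting. Passages are scored.";
        int[] bounds = passageAround(text, text.indexOf("highlighting"));
        System.out.println(text.substring(bounds[0], bounds[1]).trim());
    }
}
```

Because this lookup runs for every matching term, a BreakIterator whose preceding is slow becomes a noticeable cost on long field values.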
protected PassageScorer getScorer(String field)
Returns the PassageScorer to use for ranking passages. This returns a new PassageScorer by default; subclasses can override to customize.
protected PassageFormatter getFormatter(String field)
Returns the PassageFormatter to use for formatting passages into highlighted snippets. This returns a new PassageFormatter by default; subclasses can override to customize.
protected int getMaxNoHighlightPassages(String field)
Returns the number of leading passages (as delineated by the BreakIterator) when no highlights could be found. If it's less than 0 (the default) then this defaults to the maxPassages parameter given for each request. If this is 0 then the resulting highlight is null (not formatted).
public int getCacheFieldValCharsThreshold()
Limits the amount of field value pre-fetching until this threshold is passed. Documents are fetched in batches sized by the summed character length of the field values to be highlighted (bounded by getMaxLength() for each field). By setting this to 0, you can force documents to be fetched and highlighted one at a time, which you usually shouldn't do. The default is 524288 chars which translates to about a megabyte. However, note that the highlighter sometimes ignores this and highlights one document at a time (without caching a bunch of documents in advance) when it can detect there's no point in it, such as when all fields will be highlighted via re-analysis.
public IndexSearcher getIndexSearcher()
public Analyzer getIndexAnalyzer()
protected UnifiedHighlighter.OffsetSource getOffsetSource(String field)
Determine the offset source for the specified field. The default rules are applied in this order:
This calls getFieldInfo(String). Note this returns null if there is no searcher or if the field isn't found there.
If the field info has IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS then UnifiedHighlighter.OffsetSource.POSTINGS is returned.
If the field info has FieldInfo.hasVectors() then UnifiedHighlighter.OffsetSource.TERM_VECTORS is returned (note we can't check here if the TV has offsets; if there aren't any then an exception will get thrown down the line).
Otherwise UnifiedHighlighter.OffsetSource.ANALYSIS is returned.
Note that the highlighter sometimes switches to something else based on the query, such as if you have UnifiedHighlighter.OffsetSource.POSTINGS_WITH_TERM_VECTORS but in fact don't need term vectors.
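The selection rules for getOffsetSource can be read as a short decision chain. The sketch below mirrors only the rules stated above, using hypothetical stand-in types (Source, FieldFacts) instead of Lucene's OffsetSource and FieldInfo. It deliberately ignores UnifiedHighlighter.OffsetSource.POSTINGS_WITH_TERM_VECTORS and the query-based switching, and treating a missing field info as a fall-through to analysis is an assumption.

```java
/** Hypothetical mirror of the documented decision rules; not the Lucene class itself. */
final class OffsetSourceSketch {

    /** Stand-in for the three offset sources named in the rules above. */
    enum Source { POSTINGS, TERM_VECTORS, ANALYSIS }

    /** Hypothetical, trimmed-down stand-in for Lucene's FieldInfo. */
    static final class FieldFacts {
        final boolean offsetsInPostings; // indexed with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
        final boolean hasVectors;        // term vectors are stored for the field

        FieldFacts(boolean offsetsInPostings, boolean hasVectors) {
            this.offsetsInPostings = offsetsInPostings;
            this.hasVectors = hasVectors;
        }
    }

    /** Applies the rules in order; a null info (no searcher, or unknown field) falls through. */
    static Source choose(FieldFacts info) {
        if (info != null && info.offsetsInPostings) {
            return Source.POSTINGS;
        }
        if (info != null && info.hasVectors) {
            return Source.TERM_VECTORS;
        }
        return Source.ANALYSIS;
    }

    public static void main(String[] args) {
        System.out.println(choose(new FieldFacts(false, true)));
    }
}
```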
protected FieldInfo getFieldInfo(String field)
Called by the default implementation of getOffsetSource(String). If there is no searcher then we simply always return null.
public String[] highlight(String field, Query query, TopDocs topDocs) throws IOException
Highlights the top passages from a single field.
Parameters:
field - field name to highlight. Must have a stored string value and also be indexed with offsets.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
Returns:
Snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first sentence for the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public String[] highlight(String field, Query query, TopDocs topDocs, int maxPassages) throws IOException
Highlights the top-N passages from a single field.
Parameters:
field - field name to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
maxPassages - The maximum number of top-N ranked passages used to form the highlighted snippets.
Returns:
Snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public Map<String,String[]> highlightFields(String[] fields, Query query, TopDocs topDocs) throws IOException
Highlights the top passages from multiple fields.
Conceptually, this behaves as a more efficient form of:
    Map<String,String[]> m = new HashMap<>();
    for (String field : fields) {
      m.put(field, highlight(field, query, topDocs));
    }
    return m;
Parameters:
fields - field names to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
Returns:
A map from field name to the snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first sentence from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public Map<String,String[]> highlightFields(String[] fields, Query query, TopDocs topDocs, int[] maxPassages) throws IOException
Highlights the top-N passages from multiple fields.
Conceptually, this behaves as a more efficient form of:
    Map<String,String[]> m = new HashMap<>();
    for (int i = 0; i < fields.length; i++) {
      m.put(fields[i], highlight(fields[i], query, topDocs, maxPassages[i]));
    }
    return m;
Parameters:
fields - field names to highlight. Must have a stored string value.
query - query to highlight.
topDocs - TopDocs containing the summary result documents to highlight.
maxPassages - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
A map from field name to the snippets corresponding to the documents in topDocs. If no highlights were found for a document, the first maxPassages sentences from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public Map<String,String[]> highlightFields(String[] fieldsIn, Query query, int[] docidsIn, int[] maxPassagesIn) throws IOException
Highlights the top-N passages from multiple fields, for the provided int[] docids.
Parameters:
fieldsIn - field names to highlight. Must have a stored string value.
query - query to highlight.
docidsIn - containing the document IDs to highlight.
maxPassagesIn - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
A map from field name to the snippets corresponding to the documents in docidsIn. If no highlights were found for a document, the first maxPassages from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
protected Map<String,Object[]> highlightFieldsAsObjects(String[] fieldsIn, Query query, int[] docIdsIn, int[] maxPassagesIn) throws IOException
Expert: highlights the top-N passages from multiple fields, for the provided int[] docids, to custom Object as returned by the PassageFormatter. Use this API to render to something other than String.
Parameters:
fieldsIn - field names to highlight. Must have a stored string value.
query - query to highlight.
docIdsIn - containing the document IDs to highlight.
maxPassagesIn - The maximum number of top-N ranked passages per-field used to form the highlighted snippets.
Returns:
A map from field name to the formatted results corresponding to the documents in docIdsIn. If no highlights were found for a document, the first maxPassages from the field will be returned.
Throws:
IOException - if an I/O error occurred during processing
IllegalArgumentException - if field was indexed without IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
public Object highlightWithoutSearcher(String field, Query query, String content, int maxPassages) throws IOException
Highlights text passed as a parameter. This requires that the IndexSearcher provided to this highlighter is null. This use-case is more rare. Naturally, the mode of operation will be UnifiedHighlighter.OffsetSource.ANALYSIS.
The result of this method is whatever the PassageFormatter returns. For the DefaultPassageFormatter and assuming content has non-zero length, the result will be a non-null string, so it's safe to call Object.toString() on it in that case.
Parameters:
field - field name to highlight (as found in the query).
query - query to highlight.
content - text to highlight.
maxPassages - The maximum number of top-N ranked passages used to form the highlighted snippets.
Returns:
The result of the PassageFormatter, probably a String. Might be null.
Throws:
IOException - if an I/O error occurred during processing
protected FieldHighlighter getFieldHighlighter(String field, Query query, Set<Term> allTerms, int maxPassages)
protected static BytesRef[] filterExtractedTerms(Predicate<String> fieldMatcher, Set<Term> queryTerms)
protected Set<UnifiedHighlighter.HighlightFlag> getFlags(String field)
protected PhraseHelper getPhraseHelper(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags)
protected CharacterRunAutomaton[] getAutomata(String field, Query query, Set<UnifiedHighlighter.HighlightFlag> highlightFlags)
protected UnifiedHighlighter.OffsetSource getOptimizedOffsetSource(String field, BytesRef[] terms, PhraseHelper phraseHelper, CharacterRunAutomaton[] automata)
protected FieldOffsetStrategy getOffsetStrategy(UnifiedHighlighter.OffsetSource offsetSource, String field, BytesRef[] terms, PhraseHelper phraseHelper, CharacterRunAutomaton[] automata, Set<UnifiedHighlighter.HighlightFlag> highlightFlags)
protected Boolean requiresRewrite(SpanQuery spanQuery)
When highlighting phrases accurately, we need to know which SpanQuery instances need to have Query.rewrite(IndexReader) called on them. It helps performance to avoid it if it's not needed. This method will be invoked on all SpanQuery instances recursively. If you have custom SpanQuery queries then override this to check instanceof and provide a definitive answer. If the query isn't your custom one, simply return null to have the default rules apply, which govern the ones included in Lucene.
protected Collection<Query> preSpanQueryRewrite(Query query)
When highlighting phrases accurately, we may need to handle custom queries that aren't supported in the WeightedSpanTermExtractor as called by the PhraseHelper. Should custom query types be needed, this method should be overridden to return a collection of queries if appropriate, or null if there is nothing to do. If the query is not custom, simply returning null will allow the default rules to apply.
Parameters:
query - Query to be highlighted
protected Collection<Query> preMultiTermQueryRewrite(Query query)
When dealing with multi term queries / span queries, we may need to handle custom queries that aren't supported by the default automata extraction in MultiTermHighlighting. This can be overridden to return a collection of queries if appropriate, or null if there is nothing to do. If the query is not custom, simply returning null will allow the default rules to apply.
Parameters:
query - Query to be highlighted
protected List<CharSequence[]> loadFieldValues(String[] fields, DocIdSetIterator docIter, int cacheCharsThreshold) throws IOException
Loads the String values for each docId by field to be highlighted. This method reads documents from the given DocIdSetIterator but need not return all of them; by default the character lengths are summed and this method will return early when cacheCharsThreshold is exceeded. Specifically if that number is 0, then only one document is fetched no matter what. Values in the array of CharSequence will be null if no value was found.
Throws:
IOException
protected UnifiedHighlighter.LimitedStoredFieldVisitor newLimitedStoredFieldsVisitor(String[] fields)
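The early-return behavior described for loadFieldValues can be pictured with a JDK-only sketch. It is a simplification under stated assumptions: a plain Iterator of strings stands in for the DocIdSetIterator plus stored-field loading, each document has a single field value, and the names FieldValueBatchSketch and loadBatch are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of threshold-based batching; not the Lucene implementation. */
final class FieldValueBatchSketch {

    /**
     * Pulls values until the summed character length exceeds the threshold.
     * Loads at least one value when any remain; a threshold of 0 stops after exactly one.
     */
    static List<String> loadBatch(Iterator<String> docValues, int cacheCharsThreshold) {
        List<String> batch = new ArrayList<>();
        int sumChars = 0;
        while (docValues.hasNext()) {
            String value = docValues.next();
            batch.add(value);
            sumChars += value.length();
            if (cacheCharsThreshold == 0 || sumChars > cacheCharsThreshold) {
                break; // leave the remaining documents on the iterator for the next batch
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        Iterator<String> docs = Arrays.asList("aaaaa", "bbbbb", "ccccc").iterator();
        System.out.println(loadBatch(docs, 8).size());
    }
}
```

Leaving unread documents on the iterator is what lets a caller invoke the loader repeatedly, highlighting one batch at a time.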
Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.