public final class SimilarityQueries extends Object
MoreLikeThis
public static Query formSimilarQuery(String body, Analyzer a, String field, Set<?> stop) throws IOException
Use the returned query with an IndexSearcher to search for similar docs.
The only caveat is that the first hit returned should be your source document, which you will
then need to ignore.
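The caveat above can be handled with a small post-processing step. The following is a minimal sketch in plain Java, not Lucene code: the class name, the method name, and the use of string document ids are all assumptions made for illustration. It filters by id rather than blindly dropping the first hit, since the source document is only expected, not guaranteed, to rank first.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Lucene.
final class SkipSourceSketch {

    // Returns the ranked hit ids without the source document, preserving rank order.
    static List<String> withoutSource(List<String> rankedIds, String sourceId) {
        List<String> similar = new ArrayList<>();
        for (String id : rankedIds) {
            if (!id.equals(sourceId)) {
                similar.add(id); // keep every hit that is not the source itself
            }
        }
        return similar;
    }
}
```

In a real application the ids would come from whatever stored field identifies your documents.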
So, if you have a code fragment like this:
Query q = SimilarityQueries.formSimilarQuery("I use Lucene to search fast. Fast searchers are good", new StandardAnalyzer(), "contents", null);
The query returned, in string form, will be '(i use lucene to search fast searchers are good)'.
The philosophy behind this method is "two documents are similar if they share lots of words". Note that behind the scenes, Lucene's scoring algorithm will tend to give two documents a higher similarity score if they share more uncommon words.
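The example output above suggests that each distinct word of the body appears once in the query ("fast" occurs twice in the input but once in the result). The sketch below imitates that behaviour in plain Java. It is not Lucene's implementation: the class and method names are hypothetical, and the lowercase-and-split tokenizer is only a crude stand-in for what an Analyzer would do.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not Lucene code.
final class UniqueTermsSketch {

    // Lowercases the body, splits on anything that is not a letter or digit,
    // drops stop words, and keeps each remaining term once in first-seen order.
    static List<String> uniqueTerms(String body, Set<String> stop) {
        Set<String> seen = new LinkedHashSet<>(); // preserves order, ignores repeats
        for (String token : body.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split can yield a leading empty string
            }
            if (stop != null && stop.contains(token)) {
                continue; // stop set is optional, as with the 'stop' parameter
            }
            seen.add(token);
        }
        return new ArrayList<>(seen);
    }

    // Renders the terms space-separated inside parentheses, like the example string form.
    static String toQueryString(List<String> terms) {
        return "(" + String.join(" ", terms) + ")";
    }
}
```

Passing a non-null stop set to `uniqueTerms` removes those words before the terms are collected, which mirrors the role of the optional `stop` argument.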
This method is fail-safe in that if a long 'body' is passed in and BooleanQuery.add() (used internally) throws BooleanQuery.TooManyClauses, the query built so far is returned as it is.
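The fail-safe behaviour described above can be pictured as a loop that stops adding clauses once the limit is hit and returns the partial result. The following is a minimal sketch under stated assumptions: `CappedQuery` and `TooManyClausesSketch` are made-up stand-ins for BooleanQuery and BooleanQuery.TooManyClauses, and the cap value is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the types here are stand-ins, not Lucene classes.
final class FailSafeSketch {

    // Stand-in for BooleanQuery.TooManyClauses.
    static final class TooManyClausesSketch extends RuntimeException {}

    // Stand-in for a boolean query with a hard cap on the number of clauses.
    static final class CappedQuery {
        private final int maxClauses;
        private final List<String> clauses = new ArrayList<>();

        CappedQuery(int maxClauses) {
            this.maxClauses = maxClauses;
        }

        void add(String term) {
            if (clauses.size() >= maxClauses) {
                throw new TooManyClausesSketch();
            }
            clauses.add(term);
        }

        List<String> clauses() {
            return clauses;
        }
    }

    // Adds terms until the cap is reached, then returns whatever was built so far.
    static CappedQuery build(List<String> terms, int maxClauses) {
        CappedQuery query = new CappedQuery(maxClauses);
        for (String term : terms) {
            try {
                query.add(term);
            } catch (TooManyClausesSketch e) {
                break; // fail-safe: keep the partial query instead of propagating
            }
        }
        return query;
    }
}
```

With a cap of 2, building from four terms yields a query holding only the first two, and no exception reaches the caller.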
body - the body of the document you want to find similar documents to
a - the analyzer to use to parse the body
field - the field you want to search on, probably something like "contents" or "body"
stop - optional set of stop words to ignore
IOException - this can't happen...