Lucene Ranking: Using Payloads
For background on payloads in Lucene, the following links are well worth reading; both give very detailed introductions:
http://www.ibm.com/developerworks/cn/opensource/os-cn-lucene-pl/
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
For example, consider the following requirement:

There are two documents with very similar content:

- Document 1: egg tomato potato bread
- Document 2: egg book potato bread

I want to search for foods, and the query keyword is egg. How can I tell which of the two documents above better matches what I want?
Looking at the two documents, everything listed in document 1 is a food, while book in document 2 is not. Given this requirement, document 1 should be more relevant than document 2 and should rank higher in the results. Following the method described in http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/, we can preprocess the documents before indexing and attach weight information to a given term in each document (measured against foods, egg in document 1 is more relevant than in document 2), i.e. add a payload to egg, for example:
- Document 1: egg|0.984 tomato potato bread
- Document 2: egg|0.356 book potato bread
Then index the documents as usual. With the PayloadTermQuery that Lucene provides, the two occurrences of the term egg can be told apart. Internally, Lucene multiplies the payload value we stored (the number after the "|" above) into tf and then computes the weight from that.
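To make this concrete, here is a minimal query sketch for this first approach (payloads embedded directly in the content field). It is not part of the original post: it assumes an index whose content field was written with a payload-aware analyzer and searched with a payload-aware Similarity such as the PayloadAnalyzer and PayloadSimilarity implemented in the steps below; the index path, field name, and the choice of MaxPayloadFunction are all illustrative.

package org.shirdrn.lucene.query.payloadquery;

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.payloads.MaxPayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.FSDirectory;

public class ContentPayloadQueryExample {

    public static void main(String[] args) throws Exception {
        // Open an index whose content field carries payloads, e.g. "egg|0.984 tomato potato bread".
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("E:\\index")), true);
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(new PayloadSimilarity()); // the Similarity implemented in step 3 below

        // MaxPayloadFunction takes the largest payload seen for the term in each document;
        // the field name and term are illustrative.
        PayloadTermQuery query = new PayloadTermQuery(
                new Term("content", "egg"), new MaxPayloadFunction());

        TopDocs topDocs = searcher.search(query, 10);
        for (ScoreDoc hit : topDocs.scoreDocs) {
            System.out.println(hit.doc + "\t" + hit.score);
        }
        searcher.close();
        reader.close();
    }
}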
Next, let us look at a variant: add a separate field to store the payload data so that the source documents themselves do not need to be modified. For example, we can preprocess the documents before indexing, classify them, and compute a payload value from how strongly each document belongs to each category.
To make use of the stored payload data, following the example above, we need to go through these steps:
Step 1: Prepare the data to be indexed
For example, add a category field to hold the category information and a content field to hold the content shown above:
Document 1:
    new Field("category", "foods|0.984 shopping|0.503", Field.Store.YES, Field.Index.ANALYZED)
    new Field("content", "egg tomato potato bread", Field.Store.YES, Field.Index.ANALYZED)

Document 2:
    new Field("category", "foods|0.356 shopping|0.791", Field.Store.YES, Field.Index.ANALYZED)
    new Field("content", "egg book potato bread", Field.Store.YES, Field.Index.ANALYZED)
Step 2: Implement an Analyzer that parses the payload data
Since the payload information is stored in the category field, where categories are separated by spaces and each category is separated from its payload by "|", our Analyzer has to be able to parse this format. Lucene provides DelimitedPayloadTokenFilter, which handles exactly this kind of delimiter-separated payload. Our implementation is shown below:
package org.shirdrn.lucene.query.payloadquery;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.PayloadEncoder;

public class PayloadAnalyzer extends Analyzer {
    private PayloadEncoder encoder;

    PayloadAnalyzer(PayloadEncoder encoder) {
        this.encoder = encoder;
    }

    @SuppressWarnings("deprecation")
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader); // splits the space-separated categories
        result = new DelimitedPayloadTokenFilter(result, '|', encoder); // then parses the payload out of each token
        return result;
    }
}
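To see what this analyzer actually produces, here is a small sanity-check sketch. It is not part of the original code and assumes the Lucene 3.x attribute API; it runs a category string through PayloadAnalyzer and prints each token together with its decoded payload.

package org.shirdrn.lucene.query.payloadquery;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

public class PayloadAnalyzerCheck {

    public static void main(String[] args) throws Exception {
        // A sketch only: feed a sample category string through the analyzer.
        Analyzer analyzer = new PayloadAnalyzer(new FloatEncoder());
        TokenStream ts = analyzer.tokenStream("category",
                new StringReader("foods|0.984 shopping|0.503"));
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        PayloadAttribute payloadAttr = ts.addAttribute(PayloadAttribute.class);
        while (ts.incrementToken()) {
            Payload payload = payloadAttr.getPayload();
            float value = payload == null ? 0f
                    : PayloadHelper.decodeFloat(payload.getData(), payload.getOffset());
            // expected output: foods -> 0.984, then shopping -> 0.503
            System.out.println(termAttr.toString() + " -> " + value);
        }
        ts.close();
    }
}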
Step 3: Implement a Similarity that scores the payload
Lucene's Similarity class provides a scorePayload method, which turns the payload value into a score contribution for the document. We override it as follows:
package org.shirdrn.lucene.query.payloadquery;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

public class PayloadSimilarity extends DefaultSimilarity {

    private static final long serialVersionUID = 1L;

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        return PayloadHelper.decodeFloat(payload, offset);
    }

}
The PayloadHelper utility class decodes the stored payload value, which then takes part in the document score computation.
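As a side note (not in the original example): when a term has several payload-carrying occurrences in a document, the per-occurrence values returned by scorePayload are combined by the PayloadFunction supplied to PayloadTermQuery at query time. The sketch below only lists the variants shipped with Lucene 3.x; the field and term names are illustrative, and step 5 uses AveragePayloadFunction.

// Illustrative only: how multiple payload occurrences per document are combined.
PayloadTermQuery avg = new PayloadTermQuery(new Term("category", "foods"), new AveragePayloadFunction()); // mean of the payloads
PayloadTermQuery min = new PayloadTermQuery(new Term("category", "foods"), new MinPayloadFunction());     // smallest payload
PayloadTermQuery max = new PayloadTermQuery(new Term("category", "foods"), new MaxPayloadFunction());     // largest payload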
Step 4: Build the index
When building the index we need the Analyzer and Similarity implemented above. The code is shown below:
package org.shirdrn.lucene.query.payloadquery;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.Similarity;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;

public class PayloadIndexing {

    private IndexWriter indexWriter = null;
    private final Analyzer analyzer = new PayloadAnalyzer(new FloatEncoder()); // use PayloadAnalyzer with a FloatEncoder
    private final Similarity similarity = new PayloadSimilarity(); // our PayloadSimilarity
    private IndexWriterConfig config = null;

    public PayloadIndexing(String indexPath) throws CorruptIndexException, LockObtainFailedException, IOException {
        File indexFile = new File(indexPath);
        config = new IndexWriterConfig(Version.LUCENE_31, analyzer);
        config.setOpenMode(OpenMode.CREATE).setSimilarity(similarity); // set the Similarity used for scoring
        indexWriter = new IndexWriter(FSDirectory.open(indexFile), config);
    }

    public void index() throws CorruptIndexException, IOException {
        Document doc1 = new Document();
        doc1.add(new Field("category", "foods|0.984 shopping|0.503", Field.Store.YES, Field.Index.ANALYZED));
        doc1.add(new Field("content", "egg tomato potato bread", Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc1);

        Document doc2 = new Document();
        doc2.add(new Field("category", "foods|0.356 shopping|0.791", Field.Store.YES, Field.Index.ANALYZED));
        doc2.add(new Field("content", "egg book potato bread", Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(doc2);

        indexWriter.close();
    }

    public static void main(String[] args) throws CorruptIndexException, IOException {
        new PayloadIndexing("E:\\index").index();
    }
}
Step 5: Query
At query time we construct PayloadTermQuery instances to run the search. The code is shown below:
package org.shirdrn.lucene.query.payloadquery;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.NIOFSDirectory;

public class PayloadSearching {

    private IndexReader indexReader;
    private IndexSearcher searcher;

    public PayloadSearching(String indexPath) throws CorruptIndexException, IOException {
        indexReader = IndexReader.open(NIOFSDirectory.open(new File(indexPath)), true);
        searcher = new IndexSearcher(indexReader);
        searcher.setSimilarity(new PayloadSimilarity()); // set our custom PayloadSimilarity
    }

    public ScoreDoc[] search(String qsr) throws ParseException, IOException {
        int hitsPerPage = 10;
        BooleanQuery bq = new BooleanQuery();
        for (String q : qsr.split(" ")) {
            bq.add(createPayloadTermQuery(q), Occur.MUST);
        }
        TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, true);
        searcher.search(bq, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        for (int i = 0; i < hits.length; i++) {
            int docId = hits[i].doc; // document id
            Explanation explanation = searcher.explain(bq, docId);
            System.out.println(explanation.toString());
        }
        return hits;
    }

    public void display(ScoreDoc[] hits, int start, int end) throws CorruptIndexException, IOException {
        end = Math.min(hits.length, end);
        for (int i = start; i < end; i++) {
            Document doc = searcher.doc(hits[i].doc);
            int docId = hits[i].doc; // document id
            float score = hits[i].score; // document score
            System.out.println(docId + "\t" + score + "\t" + doc + "\t");
        }
    }

    public void close() throws IOException {
        searcher.close();
        indexReader.close();
    }

    private PayloadTermQuery createPayloadTermQuery(String item) {
        PayloadTermQuery ptq = null;
        if (item.indexOf("^") != -1) { // "field:term^boost" syntax
            String[] a = item.split("\\^");
            String field = a[0].split(":")[0];
            String token = a[0].split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
            ptq.setBoost(Float.parseFloat(a[1].trim()));
        } else { // plain "field:term"
            String field = item.split(":")[0];
            String token = item.split(":")[1];
            ptq = new PayloadTermQuery(new Term(field, token), new AveragePayloadFunction());
        }
        return ptq;
    }

    public static void main(String[] args) throws ParseException, IOException {
        int start = 0, end = 10;
//      String queries = "category:foods^123.0 content:bread^987.0";
        String queries = "category:foods content:egg";
        PayloadSearching payloadSearcher = new PayloadSearching("E:\\index");
        payloadSearcher.display(payloadSearcher.search(queries), start, end);
        payloadSearcher.close();
    }

}
The query results show how the two documents are ranked by relevance:
0   0.3314532   Document<stored,indexed,tokenized<category:foods|0.984 shopping|0.503> stored,indexed,tokenized<content:egg tomato potato bread>>
1   0.21477573  Document<stored,indexed,tokenized<category:foods|0.356 shopping|0.791> stored,indexed,tokenized<content:egg book potato bread>>
The score explanation output looks like this:
0.3314532 = (MATCH) sum of:
  0.18281947 = (MATCH) weight(category:foods in 0), product of:
    0.70710677 = queryWeight(category:foods), product of:
      0.5945349 = idf(category:  foods=2)
      1.1893445 = queryNorm
    0.2585458 = (MATCH) fieldWeight(category:foods in 0), product of:
      0.6957931 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        0.984 = scorePayload(...)
      0.5945349 = idf(category:  foods=2)
      0.625 = fieldNorm(field=category, doc=0)
  0.14863372 = (MATCH) weight(content:egg in 0), product of:
    0.70710677 = queryWeight(content:egg), product of:
      0.5945349 = idf(content:  egg=2)
      1.1893445 = queryNorm
    0.21019982 = (MATCH) fieldWeight(content:egg in 0), product of:
      0.70710677 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        1.0 = scorePayload(...)
      0.5945349 = idf(content:  egg=2)
      0.5 = fieldNorm(field=content, doc=0)

0.21477571 = (MATCH) sum of:
  0.066142 = (MATCH) weight(category:foods in 1), product of:
    0.70710677 = queryWeight(category:foods), product of:
      0.5945349 = idf(category:  foods=2)
      1.1893445 = queryNorm
    0.09353892 = (MATCH) fieldWeight(category:foods in 1), product of:
      0.25173002 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        0.356 = scorePayload(...)
      0.5945349 = idf(category:  foods=2)
      0.625 = fieldNorm(field=category, doc=1)
  0.14863372 = (MATCH) weight(content:egg in 1), product of:
    0.70710677 = queryWeight(content:egg), product of:
      0.5945349 = idf(content:  egg=2)
      1.1893445 = queryNorm
    0.21019982 = (MATCH) fieldWeight(content:egg in 1), product of:
      0.70710677 = (MATCH) btq, product of:
        0.70710677 = tf(phraseFreq=0.5)
        1.0 = scorePayload(...)
      0.5945349 = idf(content:  egg=2)
      0.5 = fieldNorm(field=content, doc=1)
We can see that, apart from the payload value multiplied into tf, everything else in the two explanations is identical. In other words, the payload we stored contributed extra score to document 0, as intended, and pushed it up in the ranking. Without the payload, the two documents would receive the same score for this query (you can verify this by setting their payload values to the same number and re-running the test).