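Using and Comparing the Built-in Analyzers in Lucene 4.0
I have been reading about Lucene recently. The examples in the book are easy to follow and the code is easy to understand, but Lucene 4.0 as downloaded from the official site differs in places from earlier versions.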
package org.apache.lucene.demo;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

    // Sample inputs used when no command-line arguments are given.
    private static final String[] examples = {
        "The quick brown fox jumped over the lazy dog",
        "xyz&corporation - xyz@example.com"
    };

    // The built-in analyzers to compare.
    private static final Analyzer[] analyzers = {
        new WhitespaceAnalyzer(Version.LUCENE_40),
        new SimpleAnalyzer(Version.LUCENE_40),
        new KeywordAnalyzer(),
        new StopAnalyzer(Version.LUCENE_40),
        new StandardAnalyzer(Version.LUCENE_40)
    };

    /**
     * Analyzes each command-line argument, or the built-in examples if none are given.
     */
    public static void main(String[] args) throws IOException {
        String[] strings = examples;
        if (args.length > 0) {
            strings = args;
        }
        for (String text : strings) {
            analyze(text);
        }
    }

    // Runs every analyzer over the text and prints the tokens each one produces.
    public static void analyze(String text) throws IOException {
        for (Analyzer analyzer : analyzers) {
            System.out.println(" " + analyzer.getClass().getSimpleName() + " ");
            AnalyzerUtils.displayTokes(analyzer, text);
            System.out.println();
        }
    }
}

The AnalyzerUtils class is simple: it just displays some attributes of each analyzer's output.
package org.apache.lucene.demo;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerUtils {

    // Analyzes the text with the given analyzer and prints the resulting tokens.
    public static void displayTokes(Analyzer analyzer, String text) throws IOException {
        displayTokes(analyzer.tokenStream("contents", new StringReader(text)));
    }

    // Prints every term in the stream as [term].
    public static void displayTokes(TokenStream tokenStream) throws IOException {
        CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // This line is required; without it a java.lang.ArrayIndexOutOfBoundsException is thrown
        while (tokenStream.incrementToken()) {
            System.out.print("[" + termAttribute.toString() + "]");
        }
        // The TokenStream API docs also list end() and close() after the last token.
        tokenStream.end();
        tokenStream.close();
    }
}

The tokenStream.reset() call cannot be omitted. There is not much material online about using Lucene 4.0, and since I have only just started and have not yet read the source, I do not know the reason behind it. I found the call in the official API docs, tried it, and the program ran successfully.
The output also differs from what the book describes. Running the code prints:
WhitespaceAnalyzer
[The][quick][brown][fox][jumped][over][the][lazy][dog]
SimpleAnalyzer
[the][quick][brown][fox][jumped][over][the][lazy][dog]
KeywordAnalyzer
[The quick brown fox jumped over the lazy dog]
StopAnalyzer
[quick][brown][fox][jumped][over][lazy][dog]
StandardAnalyzer
[quick][brown][fox][jumped][over][lazy][dog]
WhitespaceAnalyzer
[xyz&corporation][-][xyz@example.com]
SimpleAnalyzer
[xyz][corporation][xyz][example][com]
KeywordAnalyzer
[xyz&corporation - xyz@example.com]
StopAnalyzer
[xyz][corporation][xyz][example][com]
StandardAnalyzer
[xyz][corporation][xyz][example.com]
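To make the differences in the output above easier to see, here is a rough plain-Java sketch of what the first four analyzers appear to do, judging only from that output. It is not Lucene's code: the class name AnalyzerSketch, its method names, and the short stop list are assumptions made for illustration. StandardAnalyzer is left out because its tokenization rules are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java approximation of four analyzers; NOT Lucene's implementation. */
public class AnalyzerSketch {

    // Illustrative stop list (an assumption), not Lucene's actual default set.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "in"));

    /** Roughly WhitespaceAnalyzer: split on whitespace, keep case and punctuation. */
    public static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Roughly SimpleAnalyzer: maximal runs of letters, lowercased. */
    public static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Roughly StopAnalyzer: the simple() tokens with stop words removed. */
    public static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : simple(text)) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Roughly KeywordAnalyzer: the whole input as one token. */
    public static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        String[] examples = {
            "The quick brown fox jumped over the lazy dog",
            "xyz&corporation - xyz@example.com"
        };
        for (String text : examples) {
            System.out.println("whitespace: " + whitespace(text));
            System.out.println("simple:     " + simple(text));
            System.out.println("keyword:    " + keyword(text));
            System.out.println("stop:       " + stop(text));
        }
    }
}
```

For the two sample strings, these methods yield the same token sequences as the WhitespaceAnalyzer, SimpleAnalyzer, KeywordAnalyzer and StopAnalyzer results listed above. Other inputs may differ, since the real analyzers handle more cases than this sketch does.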
The book describes the results as:

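On inspection, the main discrepancy is with StandardAnalyzer: unlike what the book describes, it does not recognize the email address as a single token.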
P.S. I am new to this and hope to exchange ideas and learn together with everyone.