Lucene04 - Analyzers

Published: 2012-07-01 13:15:00 Author: rapoo

    package com.iflytek.lucene;

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    /**
     * @author xudongwang 2012-2-4
     *
     *         Email:xdwangiflytek@gmail.com
     */
    public class AnalyzerDemo {

        /** Prints every token the given analyzer produces for the text. */
        public void analyze(Analyzer analyzer, String text) throws Exception {
            System.out.println("----------------------->Analyzer: " + analyzer.getClass());
            TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
            // addAttribute returns the attribute if present and registers it otherwise
            CharTermAttribute termAtt = tokenStream.addAttribute(CharTermAttribute.class);
            // TypeAttribute typeAtt = tokenStream.addAttribute(TypeAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                System.out.println(termAtt.toString());
                // System.out.println(typeAtt.type());
            }
            tokenStream.end();
            tokenStream.close();
        }

        public static void main(String[] args) throws Exception {
            AnalyzerDemo demo = new AnalyzerDemo();

            System.out.println("---------------->Testing English");
            String enText = "Hello, my name is wang xudong, I in iteye blog address is xdwangiflytek.iteye.com";
            System.out.println(enText);

            System.out.println("Tokenizing with StandardAnalyzer:");
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
            demo.analyze(analyzer, enText);

            System.out.println("Tokenizing with SimpleAnalyzer:");
            Analyzer analyzer2 = new SimpleAnalyzer(Version.LUCENE_35);
            demo.analyze(analyzer2, enText);
            System.out.println("The results above show that StandardAnalyzer does not split on '.', whereas SimpleAnalyzer does");
            System.out.println();

            System.out.println("---------------->Testing Chinese");
            String znText = "我叫王旭东";
            System.out.println(znText);

            System.out.println("Tokenizing with StandardAnalyzer:");
            // The result shows that it treats every single character as a keyword,
            // which is bound to be inefficient
            demo.analyze(analyzer, znText);

            System.out.println("Tokenizing with CJKAnalyzer (bigram segmentation):");
            Analyzer analyzer3 = new CJKAnalyzer(Version.LUCENE_35);
            demo.analyze(analyzer3, znText);
        }
    }

Output:

    ---------------->Testing English
    Hello, my name is wang xudong, I in iteye blog address is xdwangiflytek.iteye.com
    Tokenizing with StandardAnalyzer:
    ----------------------->Analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer
    hello
    my
    name
    wang
    xudong
    i
    iteye
    blog
    address
    xdwangiflytek.iteye.com
    Tokenizing with SimpleAnalyzer:
    ----------------------->Analyzer: class org.apache.lucene.analysis.SimpleAnalyzer
    hello
    my
    name
    is
    wang
    xudong
    i
    in
    iteye
    blog
    address
    is
    xdwangiflytek
    iteye
    com
    The results above show that StandardAnalyzer does not split on '.', whereas SimpleAnalyzer does

    ---------------->Testing Chinese
    我叫王旭东
    Tokenizing with StandardAnalyzer:
    ----------------------->Analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer
    Tokenizing with CJKAnalyzer (bigram segmentation):
    ----------------------->Analyzer: class org.apache.lucene.analysis.cjk.CJKAnalyzer
    我叫
    叫王
    王旭
    旭东
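
Two behaviours stand out in the output above: SimpleAnalyzer breaks the text at every non-letter (so the domain name falls apart at each '.'), and CJKAnalyzer emits overlapping two-character tokens. The plain-Java sketch below imitates both, just to make the idea concrete. It is not Lucene's implementation: the class `TokenizerSketch` and the methods `letterTokens` and `bigrams` are names invented for this illustration, and the bigram method assumes the input is pure CJK text with at least two characters.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical helper class, not part of Lucene.
class TokenizerSketch {

    // Splits on every non-letter character and lower-cases the letters,
    // which is roughly the behaviour seen in the SimpleAnalyzer output.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Emits every pair of adjacent characters as one token (bigram segmentation).
    // Assumption: the text contains only CJK characters; inputs shorter than
    // two characters yield an empty list in this sketch.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [xdwangiflytek, iteye, com]
        System.out.println(letterTokens("xdwangiflytek.iteye.com"));
        // prints [我叫, 叫王, 王旭, 旭东]
        System.out.println(bigrams("我叫王旭东"));
    }
}
```

Because the bigrams overlap, a text of n characters produces n - 1 tokens, and none of them is guaranteed to be a real word.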

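For Chinese, semantic segmentation still works better than any of the approaches shown above; the one I have in mind is the segmenter from the Chinese Academy of Sciences.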
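
To show why word-level segmentation gives cleaner tokens than per-character or bigram splitting, here is a toy forward-maximum-matching segmenter. This is a deliberately simpler, dictionary-based technique, not the algorithm used by the Chinese Academy of Sciences segmenter. The class `MaxMatchSketch`, the method `segment`, and the one-word dictionary are all hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical helper class, not part of Lucene
// and not the Chinese Academy of Sciences segmenter.
class MaxMatchSketch {

    // Forward maximum matching: at each position, take the longest dictionary
    // word (up to maxWordLength characters); fall back to a single character
    // when nothing in the dictionary matches.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a known word or one character long.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary (assumption): it only knows the name used in the demo text.
        Set<String> dictionary = new HashSet<>(Arrays.asList("王旭东"));
        // prints [我, 叫, 王旭东]
        System.out.println(segment("我叫王旭东", dictionary, 3));
    }
}
```

With the name in the dictionary, it comes out as a single token instead of the fragments 叫王, 王旭 and 旭东 that bigram splitting produces.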

I will cover specific Chinese analyzers in a dedicated topic later; for now this is just a brief overview.
