Lucene Basics

Published: 2012-12-18 12:43:41  Author: rapoo

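A Lucene index rests on four basic concepts: index, document, field, and term.

    An Index is a sequence of Documents.
    A Document is a sequence of Fields.
    A Field is a sequence of Terms.
    A Term is a string.

The same string occurring in two different Fields is treated as two different Terms. A Term is therefore represented by a pair of strings: the first is the name of the Field, the second is the text found in that Field. Since Term is this important, let us get to know it first.

Understanding Term

The best way is to look at how it is expressed in the source code.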

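// Abridged listing; method bodies left out of the excerpt are filled in minimally.
public final class Term implements Comparable, java.io.Serializable {
  String field;   // name of the Field the term occurs in
  String text;    // text of the term within that Field

  public Term(String fld, String txt) { this(fld, txt, true); }

  // Field names are interned so that equal names share one String instance.
  Term(String fld, String txt, boolean intern) {
    field = intern ? fld.intern() : fld;
    text = txt;
  }

  public final String field() { return field; }
  public final String text() { return text; }

  // Overrides equals(): two Terms are equal when both field and text match.
  public final boolean equals(Object o) {
    if (!(o instanceof Term)) return false;
    Term other = (Term) o;
    return field.equals(other.field) && text.equals(other.text);
  }

  // Overrides hashCode().
  public final int hashCode() { return field.hashCode() + text.hashCode(); }

  public int compareTo(Object other) { return compareTo((Term) other); }

  // Orders by field name first, then by text.
  public final int compareTo(Term other) {
    if (field.equals(other.field)) return text.compareTo(other.text);
    return field.compareTo(other.field);
  }

  final void set(String fld, String txt) { field = fld; text = txt; }

  public final String toString() { return field + ":" + text; }

  // Re-interns the field name after deserialization.
  private void readObject(java.io.ObjectInputStream in)
      throws java.io.IOException, ClassNotFoundException {
    in.defaultReadObject();
    field = field.intern();
  }
}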

From the code we can see that a Term is essentially a pair <FieldName, text>.
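To make the pair idea concrete, here is a minimal, self-contained sketch. SimpleTerm is a hypothetical stand-in written for this illustration, not Lucene's own class; it only shows that the same text in two different fields yields two different terms, and that ordering goes by field name first and text second.

```java
import java.util.Objects;

// Hypothetical stand-in for a term: a pair <fieldName, text>.
final class SimpleTerm implements Comparable<SimpleTerm> {
    final String field;
    final String text;

    SimpleTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Two terms are equal only when both the field name and the text match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SimpleTerm)) return false;
        SimpleTerm other = (SimpleTerm) o;
        return field.equals(other.field) && text.equals(other.text);
    }

    @Override
    public int hashCode() {
        return Objects.hash(field, text);
    }

    // Order by field name first, then by text within the field.
    @Override
    public int compareTo(SimpleTerm other) {
        int byField = field.compareTo(other.field);
        return byField != 0 ? byField : text.compareTo(other.text);
    }

    @Override
    public String toString() {
        return field + ":" + text;
    }
}
```

With this sketch, new SimpleTerm("title", "lucene") and new SimpleTerm("contents", "lucene") are not equal, even though their text is the same.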

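Inverted Index
To make term-based search more efficient, the index stores statistics about terms. Lucene's index belongs to the family of indexes known as inverted indexes, because for a given term it can list the documents that contain it. This is exactly the inverse of the natural relationship, in which a document lists its terms.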
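The inversion can be sketched in a few lines. The class below is a hypothetical toy, not Lucene code: it assumes documents are identified by integer numbers and that splitting on non-word characters is an acceptable stand-in for an analyzer.

```java
import java.util.Collections;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each term to the documents that contain it.
final class TinyInvertedIndex {
    // term text -> sorted set of document numbers containing that term
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Splits the text into lower-cased tokens and records docNumber under each.
    void add(int docNumber, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docNumber);
        }
    }

    // Lists the documents that contain the term (empty if none do).
    SortedSet<Integer> documentsContaining(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }
}
```

A document naturally answers "which terms do I contain?"; the map above answers the opposite question, "which documents contain this term?", which is what a term query needs.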

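Types of Field
In Lucene, a Field may be stored, in which case its text is kept in the index literally, in a non-inverted manner. A Field that is inverted is said to be indexed. A Field may be both stored and indexed. The text of a Field may be tokenized into many Terms to be indexed, or it may be used literally as a single Term. Most Fields are tokenized, but for certain identifier fields it is useful to index the whole value as one Term.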

The classes of the index package, one by one

CompoundFileReader

    Provides methods for reading .cfs files.

CompoundFileWriter

    Builds the .cfs file. Starting with Lucene 1.4, the various files described below, such as .tii and .tis, are combined into a single .cfs file.

    Its structure is as follows:

Compound (.cfs) --> FileCount, <DataOffset, FileName>^FileCount, FileData^FileCount
FileCount --> VInt
DataOffset --> Long
FileName --> String
FileData --> raw file data
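FileCount above, like many of the counts and deltas in the formats that follow, is a VInt. The article does not define VInt, so the sketch below assumes the usual Lucene definition: a non-negative integer written seven bits per byte, low-order bits first, with the high bit of each byte set when more bytes follow. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Sketch of variable-length integer (VInt) encoding and decoding.
final class VIntSketch {
    // Encodes a non-negative int, seven bits per byte, low-order bits first.
    // The high bit of a byte is set when at least one more byte follows.
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decodes a single VInt from the start of the array.
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;   // high bit clear: last byte
            shift += 7;
        }
        return value;
    }
}
```

Small values, which dominate in counts and deltas, take a single byte, which is why the formats use VInt so widely.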

DocumentWriter

    Builds the .frq, .prx and .f files.

1. FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
TermFreqs --> <TermFreq>^DocFreq
TermFreq --> DocDelta, Freq?
SkipData --> <SkipDatum>^(DocFreq/SkipInterval)
SkipDatum --> DocSkip, FreqSkip, ProxSkip
DocDelta, Freq, DocSkip, FreqSkip, ProxSkip --> VInt

2. The .prx file contains the lists of positions that each term occurs at within documents.

ProxFile (.prx) --> <TermPositions>^TermCount
TermPositions --> <Positions>^DocFreq
Positions --> <PositionDelta>^Freq
PositionDelta --> VInt
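The name PositionDelta suggests that positions are stored as differences rather than absolute values. The sketch below assumes that each delta is the gap to the previous position within the same document and that the first delta is taken relative to zero; the class and method names are hypothetical.

```java
// Sketch of delta-encoding a term's positions within one document.
final class PositionDeltas {
    // Converts ascending absolute positions into gaps between neighbours.
    static int[] toDeltas(int[] positions) {
        int[] deltas = new int[positions.length];
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            deltas[i] = positions[i] - previous;
            previous = positions[i];
        }
        return deltas;
    }

    // Rebuilds the absolute positions by summing the gaps.
    static int[] fromDeltas(int[] deltas) {
        int[] positions = new int[deltas.length];
        int running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            positions[i] = running;
        }
        return positions;
    }
}
```

Because the gaps are usually much smaller than the positions themselves, most of them fit into a single VInt byte.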

3. There is a norm file for each indexed field, with one byte for each document. The .f[0-9]* file contains, for each document, a byte that encodes a value that is multiplied into the score for hits on that field:

Norms (.f[0-9]*) --> <Byte>^SegSize

Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.

These are converted to an IEEE single float value as follows:

1. If the byte is zero, use a zero float.
2. Otherwise, set the sign bit of the float to zero;
3. add 48 to the exponent and use this as the float's exponent;
4. map the mantissa to the high-order 3 bits of the float's mantissa; and
5. set the low-order 21 bits of the float's mantissa to zero.
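The five steps can be sketched as a small decoding function. The steps do not spell out the exact bit placement, so this sketch assumes the exponent is shifted left by 24 bits and the mantissa by 21, which is the placement I believe Lucene's Similarity class uses and which leaves the low-order 21 bits zero as step 5 requires. NormDecoder is a hypothetical name.

```java
// Sketch of decoding a one-byte norm into a float.
final class NormDecoder {
    static float decode(byte b) {
        if (b == 0) return 0.0f;            // step 1: zero byte means zero float
        int mantissa = b & 0x07;            // bits 0-2
        int exponent = (b >> 3) & 0x1F;     // bits 3-7
        // Steps 2-5. Assumed placement: exponent + 48 at bit 24, mantissa at bit 21.
        // The sign bit (bit 31) and the low-order 21 bits stay zero.
        int bits = ((exponent + 48) << 24) | (mantissa << 21);
        return Float.intBitsToFloat(bits);
    }
}
```

Under these assumptions the byte 124 (exponent 15, mantissa 4) decodes to exactly 1.0f, and larger byte values decode to larger floats.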


FieldInfo

    Holds part of the information about a Field, as a 4-tuple <name, isIndexed, number, storeTermVector>.

FieldInfos

    Describes the fields of a Document and whether they are indexed. Each segment has its own FieldInfos file. Objects of this class are said to be thread-safe for multiple readers, but only one thread may add documents at a time, and no other reader or writer may access the object meanwhile. The class maintains two containers, an ArrayList and a HashMap, neither of which is synchronized, so it is not clear to me where the thread safety comes from.

    Looking at the write function, the .fnm file is laid out as follows:

FieldInfos (.fnm) --> FieldsCount, <FieldName, FieldBits>^FieldsCount
FieldsCount --> VInt
FieldName --> String
FieldBits --> Byte


FieldsReader

    Reads the .fdx and .fdt files.

FieldsWriter

    Creates two files, .fdx and .fdt.

    The field index (.fdx) contains, for each Document, a pointer (really just an integer) to its field data.

FieldIndex (.fdx) --> <FieldValuesPosition>^SegSize
FieldValuesPosition --> UInt64

    The field pointer of the n-th document is therefore found at offset n*8 in the .fdx file.
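Because every entry has the same width, looking up a document needs no searching at all. The sketch below models the .fdx file as an in-memory byte array of big-endian 8-byte values; the class name, the method names and the sample pointers are hypothetical.

```java
import java.nio.ByteBuffer;

// Sketch of a fixed-width index in the style of the .fdx file.
final class FieldIndexSketch {
    // Builds an in-memory stand-in for a .fdx file: one 8-byte pointer per document.
    static byte[] buildFdx(long[] fdtPointers) {
        ByteBuffer buf = ByteBuffer.allocate(fdtPointers.length * 8);
        for (long pointer : fdtPointers) {
            buf.putLong(pointer);
        }
        return buf.array();
    }

    // Reads the pointer of document n, which starts at byte offset n * 8.
    static long pointerFor(byte[] fdx, int n) {
        return ByteBuffer.wrap(fdx).getLong(n * 8);
    }
}
```

For example, with pointers {0, 57, 203}, the entry for document 2 occupies bytes 16 to 23 and holds 203.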

    The field data (.fdt) holds the stored fields of each document. Its contents are as follows:

FieldData (.fdt) --> <DocFieldData>^SegSize
DocFieldData --> FieldCount, <FieldNum, Bits, Value>^FieldCount
FieldCount --> VInt
FieldNum --> VInt

Lucene <= 1.4:

Bits --> Byte
Value --> String

Only the low-order bit of Bits is used. It is one for tokenized fields, and zero for non-tokenized fields.

FilterIndexReader

    Extends IndexReader and provides concrete implementations of its methods.

IndexReader

    An abstract class. It reads a Directory in which an index has been built and can return various kinds of information, such as Terms and TermPositions.

IndexWriter

    IndexWriter creates and maintains an index.

    The third argument of the IndexWriter constructor determines whether a new index is created or an existing index is opened so that new documents can be added to it.

    New documents are added with addDocument(). Once all documents have been added, call close().

    If no more documents are to be added to an index and query performance should be optimized, call optimize() before calling close().

    Structure of the deletable file:

    A file named "deletable" contains the names of files that are no longer used by the index, but which could not be deleted. This is only used on Win32, where a file may not be deleted while it is still open. On other platforms the file contains only null bytes.

Deletable --> DeletableCount, <DeletableName>^DeletableCount
DeletableCount --> UInt32
DeletableName --> String

MultipleTermPositions

    Used exclusively by PhrasePrefixQuery in the search package.

MultiReader

    Extends IndexReader. It reads several indexes and appends their contents.

SegmentInfo

    Holds some information about a segment, as a triple <segmentName, docCount, dir>.

SegmentInfos

    Extends Vector, so it is simply a vector whose elements are SegmentInfo objects. It is used to build the segments file; each index has exactly one such file. The class provides read and write methods.

    Its contents are as follows:

Segments --> Format, Version, NameCounter, SegCount, <SegName, SegSize>^SegCount
Format, NameCounter, SegCount, SegSize --> UInt32
Version --> UInt64
SegName --> String

Format is -1 in Lucene 1.4.

Version counts how often the index has been changed by adding or deleting documents.

NameCounter is used to generate names for new segment files.

SegName is the name of the segment, and is used as the file name prefix for all of the files that compose the segment's index.

SegSize is the number of documents contained in the segment index.

SegmentMergeInfo

    Records the information needed while merging a segment.

SegmentMergeQueue

    Extends PriorityQueue (ordered ascending).

SegmentMerger

    Merges several segments into a single segment. IndexWriter.addIndexes() creates an object of this class.

    If compoundFile is true, a .cfs file is created after the merge and almost all of the segment's other files are combined into it.
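The role of an ascending priority queue in a merge can be illustrated with a k-way merge of sorted term lists. This is a sketch of the general technique only, not SegmentMerger's actual code: it uses java.util.PriorityQueue rather than Lucene's own PriorityQueue class, plain strings rather than Terms, and hypothetical names throughout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of merging the sorted term lists of several segments.
final class SegmentMergeSketch {
    // One cursor per segment, positioned on that segment's current term.
    private static final class Cursor {
        final Iterator<String> terms;
        String current;

        Cursor(Iterator<String> terms) {
            this.terms = terms;
            this.current = terms.next();
        }

        // Moves to the next term; returns false when the segment is exhausted.
        boolean advance() {
            if (!terms.hasNext()) return false;
            current = terms.next();
            return true;
        }
    }

    // Merges already-sorted term lists into one sorted list without duplicates.
    static List<String> merge(List<List<String>> segments) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.current));
        for (List<String> segment : segments) {
            if (!segment.isEmpty()) queue.add(new Cursor(segment.iterator()));
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor smallest = queue.poll();   // cursor holding the smallest term
            if (merged.isEmpty()
                    || !merged.get(merged.size() - 1).equals(smallest.current)) {
                merged.add(smallest.current);
            }
            if (smallest.advance()) queue.add(smallest);
        }
        return merged;
    }
}
```

The queue always hands back the cursor with the smallest current term, so the merged output stays sorted while each input list is read only once.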

SegmentReader

    Extends IndexReader and provides many methods for reading an index.

SegmentTermDocs

    Implements TermDocs.

SegmentTermEnum

    Extends TermEnum.

SegmentTermPositions

    Implements TermPositions.

SegmentTermVector

    Implements TermFreqVector.

Term

    A Term is a <fieldName, text> pair. Fields come in several kinds, but every Field carries at least <fieldName, fieldValue>, and this is what links the two together. A Term is the unit of search. Its text need not be a word; it may also be something like a date, an email address or a URL.

TermDocs

    TermDocs is an interface for enumerating the <document, frequency> pairs of a term.

    In each <document, frequency> pair, the document part names a document that contains the term; documents are identified by document number. The frequency part gives the number of times the term occurs in that document. The pairs are ordered by document number.

TermEnum

    An abstract class for enumerating terms. Term enumerations are ordered by Term.compareTo(): every term in the enumeration is greater than all the terms that precede it.

TermFreqVector

    An interface for accessing the term vector of one Field of a document.

TermInfo

    Stores information about a Term. It can be viewed as a 5-tuple <Term, docFreq, freqPointer, proxPointer, skipOffset>.

TermInfosReader

    Not yet read in detail; to be revisited after SegmentTermEnum.

TermInfosWriter

    Builds the .tis and .tii files, which together make up the term dictionary.

1. The term infos, or .tis file.

TermInfoFile (.tis) --> TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos
TIVersion --> UInt32
TermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermInfos --> <TermInfo>^TermCount
TermInfo --> <Term, DocFreq, FreqDelta, ProxDelta, SkipDelta>
Term --> <PrefixLength, Suffix, FieldNum>
Suffix --> String
PrefixLength, DocFreq, FreqDelta, ProxDelta, SkipDelta --> VInt

This file is sorted by Term. Terms are ordered first lexicographically by the term's field name, and within that lexicographically by the term's text.

TIVersion names the version of the format of this file and is -2 in Lucene 1.4.

Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".
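The prefix sharing just described can be sketched as follows. PrefixSharing and its methods are hypothetical names; the sketch works on plain Java strings and ignores how PrefixLength and Suffix are actually serialized.

```java
// Sketch of shared-prefix compression between consecutive sorted terms.
final class PrefixSharing {
    // Number of leading characters that current shares with previous.
    static int sharedPrefixLength(String previous, String current) {
        int limit = Math.min(previous.length(), current.length());
        int i = 0;
        while (i < limit && previous.charAt(i) == current.charAt(i)) {
            i++;
        }
        return i;
    }

    // The part of current that has to be stored explicitly.
    static String suffix(String previous, String current) {
        return current.substring(sharedPrefixLength(previous, current));
    }

    // Rebuilds a term's text from the previous term, PrefixLength and Suffix.
    static String rebuild(String previous, int prefixLength, String suffix) {
        return previous.substring(0, prefixLength) + suffix;
    }
}
```

For the example in the text, sharedPrefixLength("bone", "boy") is 2 and suffix("bone", "boy") is "y"; rebuild("bone", 2, "y") gives back "boy". Since the file is sorted, neighbouring terms tend to share long prefixes, which is where the saving comes from.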

FieldNum determines the term's field, whose name is stored in the .fnm file.

DocFreq is the count of documents which contain the term.

FreqDelta determines the position of this term's TermFreqs within the .frq file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

ProxDelta determines the position of this term's TermPositions within the .prx file. In particular, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero, for the first term in the file).

SkipDelta determines the position of this term's SkipData within the .frq file. In particular, it is the number of bytes after TermFreqs that the SkipData starts. In other words, it is the length of the TermFreq data.

2. The term info index, or .tii file.

This contains every IndexInterval-th entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file.

The structure of this file is very similar to the .tis file, with the addition of one item per record, the IndexDelta.

TermInfoIndex (.tii) --> TIVersion, IndexTermCount, IndexInterval, SkipInterval, TermIndices
TIVersion --> UInt32
IndexTermCount --> UInt64
IndexInterval --> UInt32
SkipInterval --> UInt32
TermIndices --> <TermInfo, IndexDelta>^IndexTermCount
IndexDelta --> VLong

IndexDelta determines the position of this term's TermInfo within the .tis file. In particular, it is the difference between the position of this term's entry in that file and the position of the previous term's entry.
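The way a small in-memory index gives random access to a large sorted file can be sketched as below. This is an illustration under simplifying assumptions: the sorted .tis entries are modelled as a list of strings, and the index remembers list positions instead of the byte offsets that IndexDelta encodes. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of a sparse term index in the style of the .tii file.
final class TermIndexSketch {
    private final List<String> allTerms;                       // stand-in for the sorted .tis entries
    private final List<String> indexTerms = new ArrayList<>(); // every indexInterval-th term
    private final int indexInterval;

    TermIndexSketch(List<String> sortedTerms, int indexInterval) {
        this.allTerms = sortedTerms;
        this.indexInterval = indexInterval;
        for (int i = 0; i < sortedTerms.size(); i += indexInterval) {
            indexTerms.add(sortedTerms.get(i));
        }
    }

    // Returns the position of term in the full list, or -1 if it is absent.
    int find(String term) {
        // Binary search in the small index for the last index term <= term.
        int slot = Collections.binarySearch(indexTerms, term);
        if (slot < 0) slot = -slot - 2;
        if (slot < 0) return -1;          // term sorts before the first entry
        // Scan at most indexInterval entries of the full list from there.
        int start = slot * indexInterval;
        int end = Math.min(start + indexInterval, allTerms.size());
        for (int i = start; i < end; i++) {
            if (allTerms.get(i).equals(term)) return i;
        }
        return -1;
    }
}
```

The trade-off is governed by the interval: a larger interval makes the in-memory index smaller but lengthens the sequential scan after each lookup.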

TODO: document skipInterval information

    The IndexDelta is what the .tii file carries in addition to what the .tis file has.

TermPositions

    This interface extends TermDocs and enumerates the <document, frequency, <position>*> triples of a term. The document and frequency parts are the same as in TermDocs. The positions part lists the ordinal position of each occurrence of the term within a document. This triple is the posting-list representation of an inverted file.

TermPositionVector

    Extends TermFreqVector. Beyond what TermFreqVector offers, it can also report the positions at which a term occurs.

TermVectorsReader

    Reads the three files .tvd, .tvf and .tvx.

TermVectorsWriter

    Builds the .tvd, .tvf and .tvx files. Together these three files make up the term vectors.

1. The Document Index or .tvx file.

This contains, for each document, a pointer to the document data in the Document (.tvd) file.

DocumentIndex (.tvx) --> TVXVersion, <DocumentPosition>^NumDocs
TVXVersion --> Int
DocumentPosition --> UInt64

This is used to find the position of the Document in the .tvd file.

2. The Document or .tvd file.

This contains, for each document, the number of fields, a list of the fields with term vector info and finally a list of pointers to the field information in the .tvf (Term Vector Fields) file.

Document (.tvd) --> TVDVersion, <NumFields, FieldNums, FieldPositions>^NumDocs
TVDVersion --> Int
NumFields --> VInt
FieldNums --> <FieldNumDelta>^NumFields
FieldNumDelta --> VInt
FieldPositions --> <FieldPosition>^NumFields
FieldPosition --> VLong

The .tvd file is used to map out the fields that have term vectors stored and where the field information is in the .tvf file.

3. The Field or .tvf file.

This file contains, for each field that has a term vector stored, a list of the terms and their frequencies.

Field (.tvf) --> TVFVersion, <NumTerms, NumDistinct, TermFreqs>^NumFields
TVFVersion --> Int
NumTerms --> VInt
NumDistinct --> VInt -- Future Use
TermFreqs --> <TermText, TermFreq>^NumTerms
TermText --> <PrefixLength, Suffix>
PrefixLength --> VInt
Suffix --> String
TermFreq --> VInt

Term text prefixes are shared. The PrefixLength is the number of initial characters from the previous term which must be pre-pended to a term's suffix in order to form the term's text. Thus, if the previous term's text was "bone" and the term is "boy", the PrefixLength is two and the suffix is "y".

That covers every class in the index package. Now let us revisit them by writing some code.

The following program closes out this chapter.

package org.apache.lucene.index;

import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.store.*;
import org.apache.lucene.document.*;
import org.apache.lucene.demo.*;
import org.apache.lucene.search.*;
import java.io.*;

/**
 * This program tries to use as many classes of the Lucene index package as
 * possible, to show them at work.
 * Classes used from the index package:
 * DocumentWriter (users normally go through IndexWriter)
 * FieldInfo (and FieldInfos)
 * SegmentTermDocs (implements TermDocs)
 * SegmentReader (extends IndexReader; users normally go through IndexReader)
 * SegmentMerger
 * SegmentTermEnum (extends TermEnum)
 * SegmentTermPositions (implements TermPositions)
 * SegmentTermVector (implements TermFreqVector)
 */


public class TestIndexPackage
{
  // Adds a Document to the index.
  public static void indexDocument(String segment, String fileName) throws Exception
  {
    // The second argument controls whether the directory is created if it cannot be obtained.
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    Analyzer analyzer = new SimpleAnalyzer();
    // The fourth argument is the maximum number of tokens per Field.
    DocumentWriter writer = new DocumentWriter(directory, analyzer, Similarity.getDefault(), 1000);
    File file = new File(fileName);
    // FileDocument wraps the file in a Document with three fields (path, modified, contents).
    Document doc = FileDocument.Document(file);
    writer.addDocument(segment, doc);
    directory.close();
  }

  // Merges several segments into one.
  public static void merge(String segment1, String segment2, String segmentMerged) throws Exception
  {
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    SegmentReader segmentReader1 = new SegmentReader(new SegmentInfo(segment1, 1, directory));
    SegmentReader segmentReader2 = new SegmentReader(new SegmentInfo(segment2, 1, directory));
    // The third argument says whether a .cfs file is created.
    SegmentMerger segmentMerger = new SegmentMerger(directory, segmentMerged, false);
    segmentMerger.add(segmentReader1);
    segmentMerger.add(segmentReader2);
    segmentMerger.merge();
    segmentMerger.closeReaders();
    directory.close();
  }

  // Prints the entire contents of a segment, that is, of one sub-index of the Index.
  public static void printSegment(String segment) throws Exception
  {
    Directory directory = FSDirectory.getDirectory("testIndexPackage", false);
    SegmentReader segmentReader = new SegmentReader(new SegmentInfo(segment, 1, directory));
    // Display the documents.
    for (int i = 0; i < segmentReader.numDocs(); i++)
      System.out.println(segmentReader.document(i));
    TermEnum termEnum = segmentReader.terms(); // actually a SegmentTermEnum
    // Display the terms, their term positions and their termDocs.
    while (termEnum.next())
    {
      System.out.print(termEnum.term().toString());
      System.out.println(" DocumentFrequency=" + termEnum.docFreq());
      TermPositions termPositions = segmentReader.termPositions(termEnum.term());
      int i = 0;
      while (termPositions.next())
      {
        // (the original listing is cut off at this point)
      }
    }
  }
}
