读书人

倒排索引的简略实现

发布时间: 2012-07-08 17:43:43 作者: rapoo

倒排索引的简单实现
首先看一个例子:



假设有3篇文章,file1, file2, file3,文件内容如下:



file1 (单词1,单词2,单词3,单词4....)

file2 (单词a,单词b,单词c,单词d....)

file3 (单词1,单词a,单词3,单词d....)


那么建立的倒排索引就是这个样子:



单词1 (file1,file3)

单词2 (file1)

单词3 (file1,file3)

单词a (file2, file3)

....


倒排索引的概念很简单:就是将文件中的单词作为关键字,然后建立单词与文件的映射关系。当然,你还可以添加文件中单词出现的频数等信息。倒排索引是搜索引擎中一个很基本的概念,几乎所有的搜索引擎都会使用到倒排索引。



下面是我对于倒排索引的一个简单的实现。该程序对于输入的一段文字,查找出该词所出现的行号以及出现的次数。



import java.io.*;import java.util.HashMap;import java.util.Map;public class InvertedIndex {private Map<String, Map<Integer, Integer>> index;private Map<Integer, Integer> subIndex;public void createIndex(String filePath) {index = new HashMap<String, Map<Integer, Integer>>();try {File file = new File(filePath);InputStream is = new FileInputStream(file);BufferedReader read = new BufferedReader(new InputStreamReader(is));String temp = null;int line = 1;while ((temp = read.readLine()) != null) {String[] words = temp.split(" ");for (String word : words) {if (!index.containsKey(word)) {subIndex = new HashMap<Integer, Integer>();subIndex.put(line, 1);index.put(word, subIndex);} else {subIndex = index.get(word);if (subIndex.containsKey(line)) {int count = subIndex.get(line);subIndex.put(line, count+1);} else {subIndex.put(line, 1);}}}line++;}read.close();is.close();} catch (IOException e) {System.out.println("error in read file");}}public void find(String str) {String[] words = str.split(" ");for (String word : words) {StringBuilder sb = new StringBuilder();if (index.containsKey(word)) {sb.append("word: " + word + " in ");Map<Integer, Integer> temp = index.get(word);for (Map.Entry<Integer, Integer> e : temp.entrySet()) {sb.append("line " + e.getKey() + " [" + e.getValue() + "] , ");}} else {sb.append("word: " + word + " not found");}System.out.println(sb);}}public static void main(String[] args) {InvertedIndex index = new InvertedIndex();index.createIndex("news.txt");index.find("I love Shanghai today");}} 


其中,输入文件news.txt内容为:



I am eriolI live in Shanghai and I love ShanghaiI also love travellinglife in Shanghaiis beautiful



输出结果为:



word: I in line 1 [1] , line 2 [2] , line 3 [1] ,
word: love in line 2 [1] , line 3 [1] ,
word: Shanghai in line 2 [2] , line 4 [1] ,
word: today not found

读书人网 >行业软件

热点推荐