Hadoop Study Notes 6: Inverted Index (InverseIndex)

Introduction

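An inverted index picks out the words in a set of documents and records, for each word, which documents it appears in, which makes lookup by word easy. It can be built with the map-reduce approach, as follows: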

Input files and their contents:

doc1.txt: MapReduce is simple

doc2.txt: MapReduce is powerful is simple

doc3.txt: Hello MapReduce bye MapReduce

The output should then look like this:

MapReduce: doc1.txt:1;doc2.txt:1;doc3.txt:2;

is: doc1.txt:1;doc2.txt:2;

simple: doc1.txt:1;doc2.txt:1;

powerful: doc2.txt:1;

Hello: doc3.txt:1;

bye: doc3.txt:1;

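In each entry of the list, the part before the colon is the document and the part after it is the number of times the word occurs in that document; semicolons separate the documents. For example, MapReduce: doc1.txt:1;doc2.txt:1;doc3.txt:2; means that MapReduce occurs once in doc1.txt, once in doc2.txt, and twice in doc3.txt.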
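To make the target format concrete before bringing in MapReduce, here is a minimal in-memory sketch in plain Java (no Hadoop involved). The class name InMemoryInvertedIndex and the method build are made up for illustration, and splitting words on whitespace is an assumption about how the text is tokenized.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: builds the index directly in memory.
class InMemoryInvertedIndex {
    // Takes docName -> text and returns word -> "doc:count;doc:count;".
    static Map<String, String> build(Map<String, String> docs) {
        // word -> (doc -> count); TreeMap keeps both levels in sorted key order.
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Assumption: words are separated by whitespace.
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip the empty token produced by blank text
                }
                counts.computeIfAbsent(word, w -> new TreeMap<>())
                      .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        // Flatten each per-word map into the "doc:count;" posting string.
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            StringBuilder postings = new StringBuilder();
            for (Map.Entry<String, Integer> p : e.getValue().entrySet()) {
                postings.append(p.getKey()).append(':').append(p.getValue()).append(';');
            }
            index.put(e.getKey(), postings.toString());
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1.txt", "MapReduce is simple");
        docs.put("doc2.txt", "MapReduce is powerful is simple");
        docs.put("doc3.txt", "Hello MapReduce bye MapReduce");
        build(docs).forEach((word, postings) -> System.out.println(word + ": " + postings));
    }
}
```

This version only works while everything fits in one machine's memory; the MapReduce version spreads the same counting across map and reduce tasks.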

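With the idea clear, let's see how to implement it with MapReduce.

The original files are the input. After the Map phase the data has the following format: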

<MapReduce:doc1.txt, 1>

<is:doc1.txt, 1>

<simple:doc1.txt, 1>

<MapReduce:doc2.txt, 1>

<is:doc2.txt, 1>

<powerful:doc2.txt, 1>

<is:doc2.txt, 1>

<simple:doc2.txt, 1>

<Hello:doc3.txt, 1>

<MapReduce:doc3.txt, 1>

<bye:doc3.txt, 1>

<MapReduce:doc3.txt, 1>
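The map step above can be sketched in plain Java, without the Hadoop API. The class MapStageSketch and its map method are hypothetical names, and the document name is passed in as an ordinary parameter here; in a real job the mapper would have to obtain it from the input split. Whitespace tokenization is also an assumption.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a mapper; it does not use the Hadoop API.
class MapStageSketch {
    // Emits one <"word:doc", 1> pair for every word occurrence in the line.
    static List<Map.Entry<String, Integer>> map(String docName, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // Assumption: words are separated by whitespace.
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word + ":" + docName, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> pair : map("doc2.txt", "MapReduce is powerful is simple")) {
            System.out.println("<" + pair.getKey() + ", " + pair.getValue() + ">");
        }
    }
}
```

Note that a repeated word such as "is" in doc2.txt is emitted twice with the same key; nothing is summed at this stage.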

After the Combiner the data has the following format:

<MapReduce:doc1.txt, 1>

<is:doc1.txt, 1>

<simple:doc1.txt, 1>

<MapReduce:doc2.txt, 1>

<is:doc2.txt, 2>

<powerful:doc2.txt, 1>

<simple:doc2.txt, 1>

<Hello:doc3.txt, 1>

<MapReduce:doc3.txt, 2>

<bye:doc3.txt, 1>
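The combining step shown above amounts to summing the values that share the same word:doc key. Below is a minimal plain-Java sketch of that idea; CombinerStageSketch and combine are hypothetical names, and the input list stands in for the output of one map task. In Hadoop itself a combiner is an optimization that the framework may run zero or more times, so this sketch only illustrates the arithmetic.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a combiner; it does not use the Hadoop API.
class CombinerStageSketch {
    // Sums the counts of all pairs that share the same "word:doc" key.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        // LinkedHashMap keeps keys in the order they were first seen.
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("Hello:doc3.txt", 1));
        mapOutput.add(new SimpleEntry<>("MapReduce:doc3.txt", 1));
        mapOutput.add(new SimpleEntry<>("bye:doc3.txt", 1));
        mapOutput.add(new SimpleEntry<>("MapReduce:doc3.txt", 1));
        combine(mapOutput).forEach((key, count) -> System.out.println("<" + key + ", " + count + ">"));
    }
}
```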

After reduce the data becomes:

<MapReduce, doc1.txt:1;doc2.txt:1;doc3.txt:2;>

<is, doc1.txt:1;doc2.txt:2;>

<simple, doc1.txt:1;doc2.txt:1;>

<powerful, doc2.txt:1;>

<Hello, doc3.txt:1;>

<bye, doc3.txt:1;>
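The last step can also be sketched in plain Java. The combined pairs are keyed by word:doc while the final result is keyed by word alone, so the key has to be split somewhere along the way. Where that happens in the actual job is not shown here; this sketch simply does it inside one function, which is an assumption. ReduceStageSketch and reduce are hypothetical names, and the code assumes words contain no colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the final step; it does not use the Hadoop API.
class ReduceStageSketch {
    // Turns <"word:doc", count> pairs into <word, "doc:count;doc:count;">.
    static Map<String, String> reduce(Map<String, Integer> combined) {
        Map<String, StringBuilder> postings = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : combined.entrySet()) {
            // Assumption: the first colon separates the word from the document name.
            int sep = pair.getKey().indexOf(':');
            String word = pair.getKey().substring(0, sep);
            String doc = pair.getKey().substring(sep + 1);
            postings.computeIfAbsent(word, w -> new StringBuilder())
                    .append(doc).append(':').append(pair.getValue()).append(';');
        }
        Map<String, String> result = new TreeMap<>();
        postings.forEach((word, list) -> result.put(word, list.toString()));
        return result;
    }

    public static void main(String[] args) {
        // Insertion order of the input decides the order of documents in each list.
        Map<String, Integer> combined = new LinkedHashMap<>();
        combined.put("MapReduce:doc1.txt", 1);
        combined.put("is:doc1.txt", 1);
        combined.put("MapReduce:doc2.txt", 1);
        combined.put("is:doc2.txt", 2);
        combined.put("MapReduce:doc3.txt", 2);
        reduce(combined).forEach((word, list) -> System.out.println("<" + word + ", " + list + ">"));
    }
}
```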

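It is worth thinking about why it is done this way. One thing to notice: keying the Map output by word and document together lets the Combiner add up the occurrences within each document before the results are gathered per word.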

Source code

Mapper class:


Reference: 《实战Hadoop--开启通向云计算的捷径》, pp. 74-83
