
Nutch Related Frameworks Video Tutorial 18

Published: 2013-05-02 09:39:29  Author: rapoo

Lecture 18

Youku online video link (57 minutes)
Compressed ultra-HD download link

1. Prepare the data to compress

Download the URL dump from dmoz

wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz

gunzip content.rdf.u8.gz

Prepare Nutch 1.6

svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/

cp release-1.6/conf/nutch-site.xml.template release-1.6/conf/nutch-site.xml

vi release-1.6/conf/nutch-site.xml

Add:

<property>
  <name>http.agent.name</name>
  <value>nutch</value>
</property>

cd release-1.6

ant

cd ..

Use DmozParser to parse the dmoz URL dump into plain text

release-1.6/runtime/local/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > urls &

Put the URL text file on HDFS

hadoop fs -put urls urls
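
Because DmozParser is started in the background with `&`, the urls file may still be incomplete when the upload is run. A minimal sketch of a guard, assuming the parser writes one URL per line; `urls_ready` is a hypothetical helper name, not part of Nutch or Hadoop:

```shell
# Hypothetical helper (not part of Nutch or Hadoop): succeeds only if the
# file is non-empty and contains at least one line starting with http(s)://.
urls_ready() {
    [ -s "$1" ] && grep -q -E '^https?://' "$1"
}

# Intended use, left commented out because it needs a running cluster:
#   wait                                   # wait for the background DmozParser job
#   urls_ready urls && hadoop fs -put urls urls
```

The check is deliberately loose: it only confirms that the file has content that looks like URLs, not that the parser ran to completion.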

2. Inject URLs with different compression methods

Enter the Nutch home directory

cd release-1.6

Inject URLs without compression

runtime/deploy/bin/nutch inject data_no_compress/crawldb urls

Inject URLs with the default compression

vi conf/nutch-site.xml

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>

ant

runtime/deploy/bin/nutch inject data_default_compress/crawldb urls

Inject URLs with Gzip compression

vi conf/nutch-site.xml

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

ant

runtime/deploy/bin/nutch inject data_gzip_compress/crawldb urls

Inject URLs with BZip2 compression

vi conf/nutch-site.xml

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

ant

runtime/deploy/bin/nutch inject data_bzip2_compress/crawldb urls

Inject URLs with Snappy compression

vi conf/nutch-site.xml

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

ant

runtime/deploy/bin/nutch inject data_snappy_compress/crawldb urls
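
The compressed runs above differ only in the codec class, so the XML fragment can be generated rather than retyped each time. A sketch, assuming the output is pasted inside the `<configuration>` element of conf/nutch-site.xml before running ant; `compression_properties` is a hypothetical helper name:

```shell
# Hypothetical helper: print the five compression properties used above
# for the codec class given as $1.
compression_properties() {
    codec="$1"
    cat <<EOF
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>${codec}</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>${codec}</value>
</property>
EOF
}

# Example: print the fragment for the Gzip run.
compression_properties org.apache.hadoop.io.compress.GzipCodec
```

Generating the fragment keeps the map-output codec and the job-output codec in step, which is easy to get wrong when editing the file by hand.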

Effect of the compression type
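
One way to compare the runs is to look at the size of each crawldb on HDFS. A sketch, assuming the sizes in bytes have been read off by hand (for example with `hadoop fs -dus data_gzip_compress/crawldb` on Hadoop 1.x); `compression_ratio` is a hypothetical helper, and the numbers in the example call are made up for illustration, not measurements from the lecture:

```shell
# Hypothetical helper: print the compressed size as a percentage of the
# uncompressed size. $1 = uncompressed bytes, $2 = compressed bytes.
compression_ratio() {
    awk -v u="$1" -v c="$2" 'BEGIN {
        if (u <= 0) { print "n/a"; exit }
        printf "%.1f%%\n", c * 100 / u
    }'
}

# Illustrative numbers only, not results from the lecture.
compression_ratio 1000 250
```

Size is only half of the picture: the wall-clock time of each inject job is worth noting as well, since a codec that compresses better may also run slower.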

Effect of the block size

3. Configure Snappy compression for Hadoop

Download and extract:

wget https://snappy.googlecode.com/files/snappy-1.1.0.tar.gz

tar -xzvf snappy-1.1.0.tar.gz

cd snappy-1.1.0

Compile:

./configure

make

make install

Copy the library files:

scp /usr/local/lib/libsnappy* host2:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/

scp /usr/local/lib/libsnappy* host6:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/

scp /usr/local/lib/libsnappy* host8:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
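
The three copy commands differ only in the host name, so they can be produced by a loop. A sketch that prints the commands as a dry run instead of running them; the host names and paths are the ones used above, while `print_scp_commands` is a hypothetical helper name:

```shell
# Hosts and target directory are the ones used in the commands above.
HOSTS="host2 host6 host8"
NATIVE_DIR=/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64

# Hypothetical helper: print one scp command per host (dry run).
print_scp_commands() {
    for h in $HOSTS; do
        echo "scp /usr/local/lib/libsnappy* ${h}:${NATIVE_DIR}/"
    done
}

print_scp_commands
# To actually copy, pipe the output to a shell: print_scp_commands | sh
```

Printing first makes it possible to review the target paths before anything is copied to the cluster.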

On every machine in the cluster, modify the environment variables:

vi /home/hadoop/.bashrc

Append:

export LD_LIBRARY_PATH=/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64
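
The export above replaces any LD_LIBRARY_PATH that is already set, and it only helps if the Snappy library was actually copied to that machine. A sketch of two small checks, assuming a POSIX shell; `has_snappy_lib` and `prepend_ld_path` are hypothetical helper names:

```shell
# Hypothetical helper: succeed if directory $1 contains a libsnappy file.
has_snappy_lib() {
    for f in "$1"/libsnappy*; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Hypothetical helper: put directory $1 in front of an existing
# LD_LIBRARY_PATH value $2 instead of overwriting it.
prepend_ld_path() {
    if [ -n "$2" ]; then
        echo "$1:$2"
    else
        echo "$1"
    fi
}

# Possible use in .bashrc (commented out here):
#   NATIVE=/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64
#   has_snappy_lib "$NATIVE" && export LD_LIBRARY_PATH="$(prepend_ld_path "$NATIVE" "$LD_LIBRARY_PATH")"
```

The new value is only seen by shells started after the edit, so the Hadoop daemons most likely need to be restarted from such a shell before Snappy jobs will work.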
