Nutch-Related Frameworks Video Tutorial 18
Lecture 18
Youku online video link (57 minutes)
Compressed ultra-HD download link
1. Prepare the data for compression
Download the URL library from dmoz
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
Prepare Nutch 1.6
svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
cp release-1.6/conf/nutch-site.xml.template release-1.6/conf/nutch-site.xml
vi release-1.6/conf/nutch-site.xml
Add:
<property>
  <name>http.agent.name</name>
  <value>nutch</value>
</property>
cd release-1.6
ant
cd ..
Use DmozParser to parse the dmoz URL library into text
release-1.6/runtime/local/bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > urls &
Put the URL text file onto HDFS (once the background DmozParser job has finished)
hadoop fs -put urls urls
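Since the parser runs in the background, a quick check of the output file before uploading can catch a partial or empty result. The following is a minimal sketch, assuming the urls file holds one URL per line; the helper name count_urls is hypothetical and not part of Nutch or Hadoop.

```shell
# Hypothetical helper: count the lines in a URL list that look like http(s) URLs.
count_urls() {
  # grep -c prints the number of matching lines
  grep -c -E '^https?://' "$1"
}

# Usage sketch (not run here), in the same shell that started DmozParser with "&":
#   wait                      # block until the background job has finished
#   count_urls urls           # sanity-check the parsed URL list
#   hadoop fs -put urls urls  # then upload it
```

If the count looks too small, the parser may still be running or may have failed; the upload is better postponed until the number is stable.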
2. Inject URLs with different compression methods
Enter the Nutch home directory
cd release-1.6
Inject URLs without compression
runtime/deploy/bin/nutch inject data_no_compress/crawldb urls
Inject URLs with the default compression
vi conf/nutch-site.xml
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
ant
runtime/deploy/bin/nutch inject data_default_compress/crawldb urls
Inject URLs with Gzip compression
vi conf/nutch-site.xml
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
ant
runtime/deploy/bin/nutch inject data_gzip_compress/crawldb urls
Inject URLs with BZip2 compression
vi conf/nutch-site.xml
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
ant
runtime/deploy/bin/nutch inject data_bzip2_compress/crawldb urls
Inject URLs with Snappy compression
vi conf/nutch-site.xml
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
ant
runtime/deploy/bin/nutch inject data_snappy_compress/crawldb urls
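The four compressed runs above differ only in the codec class, so the property block can be generated instead of retyped. The following is a rough sketch; the helper name compression_properties is hypothetical, and the output still has to be pasted into conf/nutch-site.xml by hand before running ant.

```shell
# Hypothetical helper: print the five compression properties for one codec class.
# $1 = fully qualified codec class, e.g. org.apache.hadoop.io.compress.GzipCodec
compression_properties() {
  cat <<EOF
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>$1</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>$1</value>
</property>
EOF
}

# Usage sketch (not run here):
#   compression_properties org.apache.hadoop.io.compress.SnappyCodec
```

Generating the block this way keeps the map-output codec and the job-output codec in step, which is easy to get wrong when editing the file five times.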
Effect of the compression type
Effect of the block size
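One way to compare the runs is to express each crawldb's size as a percentage of the uncompressed one. The following is a minimal sketch; the helper name size_percent is hypothetical, and the byte counts are assumed to be read by hand, for example from the summary printed by hadoop fs -dus for each data_*_compress/crawldb directory.

```shell
# Hypothetical helper: compressed size as a percentage of the uncompressed size.
# $1 = uncompressed bytes, $2 = compressed bytes
size_percent() {
  awk -v raw="$1" -v packed="$2" 'BEGIN {
    # Avoid dividing by zero when the uncompressed size is unknown or empty.
    if (raw == 0) { print "n/a"; exit }
    printf "%.1f\n", packed * 100 / raw
  }'
}

# Usage sketch (not run here), with byte counts noted from the cluster:
#   size_percent <bytes of data_no_compress/crawldb> <bytes of data_gzip_compress/crawldb>
```

The same call can be repeated for each codec and for different block sizes to see which setting matters more for this data.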
3. Configure Snappy compression for Hadoop
Download and extract:
wget https://snappy.googlecode.com/files/snappy-1.1.0.tar.gz
tar -xzvf snappy-1.1.0.tar.gz
cd snappy-1.1.0
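The directory to enter has to match the version of the archive that was downloaded. A small sketch that derives the name from the archive file, assuming the tarball unpacks into a directory named after itself; the helper name archive_dir is hypothetical.

```shell
# Hypothetical helper: derive the extracted directory name from a .tar.gz file name.
archive_dir() {
  # basename strips any leading path and the given suffix
  basename "$1" .tar.gz
}

# Usage sketch (not run here):
#   tarball=snappy-1.1.0.tar.gz
#   tar -xzvf "$tarball" && cd "$(archive_dir "$tarball")"
```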
Build:
./configure
make
make install
Copy the library files:
scp /usr/local/lib/libsnappy* host2:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* host6:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
scp /usr/local/lib/libsnappy* host8:/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/
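With more hosts, the copy commands above can be produced by a loop. The following is a minimal sketch that only prints the commands so they can be reviewed first; the helper name print_scp_commands is hypothetical, and the source and destination paths are the ones used above.

```shell
# Hypothetical helper: print one scp command per host given as an argument.
print_scp_commands() {
  dest="/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64/"
  for host in "$@"; do
    # The glob stays unexpanded here because it is inside double quotes.
    echo "scp /usr/local/lib/libsnappy* ${host}:${dest}"
  done
}

# Usage sketch (not run here):
#   print_scp_commands host2 host6 host8        # review the commands
#   print_scp_commands host2 host6 host8 | sh   # then execute them
```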
Modify the environment variables on every machine in the cluster:
vi /home/hadoop/.bashrc
Append:
export LD_LIBRARY_PATH=/home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64
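Before running a Snappy job it may help to confirm on each machine that the library files actually reached the native directory. A minimal sketch; the helper name has_snappy_lib is hypothetical, and it checks only that a libsnappy file exists, not that Hadoop can load it.

```shell
# Hypothetical helper: succeed if the directory holds at least one libsnappy file.
has_snappy_lib() {
  for f in "$1"/libsnappy*; do
    # If the glob matched nothing, $f is the literal pattern and -e fails.
    [ -e "$f" ] && return 0
  done
  return 1
}

# Usage sketch (not run here):
#   has_snappy_lib /home/hadoop/hadoop-1.1.2/lib/native/Linux-amd64-64 && echo "snappy library found"
```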