Nutch-Related Frameworks Video Tutorial 3
Lecture 3
Tudou online video link (53 minutes)
Ultra-HD original download link
Compressed HD download link
1. What exactly is stored in each folder and file under data, Nutch's storage directory?
2. Commands:
crawldb
bin/nutch | grep read
bin/nutch readdb data/crawldb -stats
bin/nutch readdb data/crawldb -dump data/crawldb/crawldb_dump
bin/nutch readdb data/crawldb -url http://4008209999.tianyaclub.com/
bin/nutch readdb data/crawldb -topN 10 data/crawldb/crawldb_topN
bin/nutch readdb data/crawldb -topN 10 data/crawldb/crawldb_topN_m 1
segments
crawl_generate:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
crawl_fetch:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nocontent -nogenerate -noparse -noparsedata -noparsetext
content:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -noparse -noparsedata -noparsetext
crawl_parse:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparsedata -noparsetext
parse_data:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparse -noparsetext
parse_text:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump -nofetch -nogenerate -nocontent -noparse -noparsedata
All:
bin/nutch readseg -dump data/segments/20130325042858 data/segments/20130325042858_dump
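The per-part dump commands above differ only in which -no* flags they pass: each one suppresses every segment part except the one being inspected. A minimal sketch of a helper that builds that flag list is given here. The function name readseg_only_flags is hypothetical (it is not part of Nutch), and the script only prints the resulting command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the -no* flags that make
# `readseg -dump` skip every segment part except the one named in $1.
readseg_only_flags() {
    keep=$1
    flags=""
    # Pairs of "segment part:flag that suppresses it", as used in these notes.
    for pair in crawl_generate:-nogenerate crawl_fetch:-nofetch \
                content:-nocontent crawl_parse:-noparse \
                parse_data:-noparsedata parse_text:-noparsetext; do
        part=${pair%%:*}
        flag=${pair#*:}
        if [ "$part" != "$keep" ]; then
            flags="$flags $flag"
        fi
    done
    # Strip the leading space before printing.
    echo "${flags# }"
}

# Dry run: print (do not execute) the dump command for the content part.
seg=data/segments/20130325042858
echo "bin/nutch readseg -dump $seg ${seg}_dump $(readseg_only_flags content)"
```

Passing a name that is not one of the six parts yields all six flags, so nothing would be dumped; checking the argument first would be a reasonable addition.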
segments
bin/nutch readseg -list -dir data/segments
bin/nutch readseg -list data/segments/20130325043023
bin/nutch readseg -get data/segments/20130325042858 http://blog.tianya.cn/
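As a quick way to approach question 1 for the segments folder, the sketch below reports which of the six parts named in these notes exist in each segment directory. It assumes the data layout used above (data/segments/<timestamp>/<part>); the function name list_segment_parts is hypothetical, and a segment that has only been generated, not fetched or parsed, would presumably show fewer parts.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): for each segment under
# <data_dir>/segments, print which of the six parts are present.
list_segment_parts() {
    data_dir=$1
    for seg in "$data_dir"/segments/*/; do
        # If there are no segments the glob stays unexpanded; skip it.
        [ -d "$seg" ] || continue
        seg=${seg%/}
        found=""
        for part in crawl_generate crawl_fetch content \
                    crawl_parse parse_data parse_text; do
            if [ -d "$seg/$part" ]; then
                found="$found $part"
            fi
        done
        echo "$(basename "$seg"):$found"
    done
}

# Prints one line per segment, or nothing if data/segments is missing.
list_segment_parts data
```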
linkdb
bin/nutch readlinkdb data/linkdb -url http://4008209999.tianyaclub.com/
bin/nutch readlinkdb data/linkdb -dump data/linkdb_dump