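Crawl workflow: summary
This section uses the results of the earlier crawl to examine what happens at each stage. Blue marks values that are unchanged but worth noting; red marks values that changed between stages.

injector: only two seed URLs (the csdn data is not listed here)
http://www.163.com/    Version: 7    # 7 is the record version used by the current Nutch
Status: 1 (db_unfetched)    # see CrawlDatum.STATUS_DB_UNFETCHED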
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0    # seed URLs get a score of 1.0
Signature: null    # MD5 digest of the page; null because the page has not been fetched yet
Metadata:
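
To compare records across stages it can help to turn one dumped record into a dictionary. The following is a minimal sketch, not part of Nutch: `parse_record` and `SAMPLE` are hypothetical names, and the parser assumes the plain "key: value" layout shown in the dump above.

```python
# Sketch only: parse one dumped CrawlDatum record (as printed above) into a dict.
# parse_record and SAMPLE are illustrative names, not Nutch APIs.

SAMPLE = """
http://www.163.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata:
"""


def parse_record(text):
    """Return a dict with the fields of one dumped record."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # First line has the form "<url>    Version: <n>".
    url, _, version = lines[0].partition("Version:")
    record = {"url": url.strip(), "version": int(version)}
    # Remaining lines are "key: value"; split on the first colon only,
    # because the time values contain colons themselves.
    for ln in lines[1:]:
        key, _, value = ln.partition(":")
        record[key.strip()] = value.strip()
    # "1 (db_unfetched)" -> numeric code and symbolic name.
    code, _, name = record["Status"].partition(" ")
    record["status_code"] = int(code)
    record["status_name"] = name.strip("()")
    return record


if __name__ == "__main__":
    rec = parse_record(SAMPLE)
    print(rec["url"], rec["status_code"], rec["status_name"])
```

Parsing the same URL's record from each stage this way makes it easy to see which fields changed, for example the status moving from 1 to 2.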
generator: again only two URLs
http://www.163.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Mon Jul 04 14:57:19 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1309887693964
fetcher:
-------
crawl_fetch:
http://www.163.com/    Version: 7
Status: 33 (fetch_success)
Fetch time: Sat Jul 09 15:14:02 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1309933252318_pst_: success(1), lastModified=0
crawl_parse:
http://www.163.com/    Version: 7
Status: 65 (signature)
Fetch time: Sat Jul 09 15:14:08 CST 2011
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 0 seconds (0 days)
Score: 1.0
Signature: 989844cdb45e225db2b2731315cb5342
Metadata:
// other cases
http://www.163.com/rss/    Version: 7
Status: 67 (linked)
Fetch time: Sat Jul 09 15:14:08 CST 2011    // for URLs not yet fetched, the parse time is recorded
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01
Signature: null
Metadata:
-------
updatedb (crawldb; as the dump shows, it stores all historical URLs, i.e. the global link map):
http://www.163.com/    Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Aug 08 15:14:02 CST 2011    // fetch time advanced by one retry interval (30 days), so the URL is not fetched again before then
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: 989844cdb45e225db2b2731315cb5342    // same as in crawl_parse, i.e. unchanged; this is the MD5 of the whole HTML page
Metadata: _pst_: success(1), lastModified=0
// other cases look the same as after the injector stage, ready for the generator
http://www.163.com/rss/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jul 12 23:49:27 CST 2011    // for URLs not yet fetched, set to the time updatedb ran
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.01
Signature: null
Metadata:
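
The new fetch time of the fetched page can be checked by hand: it is the fetch time recorded in crawl_fetch plus the retry interval. A minimal sketch of that arithmetic, using only values from the dumps above (`next_fetch_time` is a hypothetical helper, not a Nutch function):

```python
# Sketch only: reproduce the "one month later" fetch time seen in crawldb.
from datetime import datetime, timedelta


def next_fetch_time(fetched_at, retry_interval_seconds):
    """Fetch time plus retry interval, as observed in the crawldb record."""
    return fetched_at + timedelta(seconds=retry_interval_seconds)


if __name__ == "__main__":
    fetched_at = datetime(2011, 7, 9, 15, 14, 2)  # crawl_fetch: Sat Jul 09 15:14:02 2011
    retry_interval = 2592000                      # seconds, i.e. 30 days
    print(next_fetch_time(fetched_at, retry_interval))
```

The result is 2011-08-08 15:14:02, which matches the Fetch time shown for http://www.163.com/ after updatedb.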
** For how already-fetched URLs are kept from being fetched again, see updatedb
** Only the following steps modify the data under crawldb/current:
* injector
* generator, but only when the generate.update.crawldb parameter is true
* updatedb
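
The effect of the updated fetch times can be illustrated with a small selection sketch. It assumes that the generator only picks entries whose fetch time is not in the future; that rule is an assumption made for illustration, and `is_due` and `select_due` are hypothetical names rather than Nutch APIs. The two entries are taken from the crawldb dump above.

```python
# Sketch only: which crawldb entries would be due at a given moment,
# ASSUMING an entry is selected when its fetch time is not in the future.
from datetime import datetime


def is_due(fetch_time, now):
    """An entry is due once its scheduled fetch time has been reached."""
    return fetch_time <= now


def select_due(crawldb, now):
    """Return the due URLs in sorted order."""
    return sorted(url for url, fetch_time in crawldb.items() if is_due(fetch_time, now))


# Fetch times as they appear in crawldb after updatedb.
CRAWLDB = {
    "http://www.163.com/": datetime(2011, 8, 8, 15, 14, 2),       # db_fetched
    "http://www.163.com/rss/": datetime(2011, 7, 12, 23, 49, 27),  # db_unfetched
}

if __name__ == "__main__":
    print(select_due(CRAWLDB, datetime(2011, 7, 13, 0, 0, 0)))
    print(select_due(CRAWLDB, datetime(2011, 8, 9, 0, 0, 0)))
```

Under this assumption, a run on July 13 would select only http://www.163.com/rss/, while the already-fetched home page becomes due again only after August 8.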