Customizing Your Own QueueAssignmentPolicy for Heritrix
                HostnameQueueAssignmentPolicy.class.getName() + " " +
                IPQueueAssignmentPolicy.class.getName() + " " +
                BucketQueueAssignmentPolicy.class.getName() + " " +
                SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                TopmostAssignedSurtQueueAssignmentPolicy.class.getName());
Add our own ELFHashQueueAssignmentPolicy class to the list, so that it becomes:
String queueStr = System.getProperty(AbstractFrontier.class.getName() +
                "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
                ELFHashQueueAssignmentPolicy.class.getName() + " " +
                //HostnameQueueAssignmentPolicy.class.getName() + " " +
                IPQueueAssignmentPolicy.class.getName() + " " +
                BucketQueueAssignmentPolicy.class.getName() + " " +
                SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                TopmostAssignedSurtQueueAssignmentPolicy.class.getName());
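To make the role of that string clearer, here is a minimal standalone sketch of the same pattern: look up a system property, fall back to a built-in space-separated list of class names, and split the result into candidates. The class PolicyListSketch, the resolve() helper, and the splitting regex are my own illustrative assumptions and are not copied from the Heritrix source. My understanding is that the first entry in the list is the one offered as the default, which is why ELFHashQueueAssignmentPolicy is placed first, but treat that as an assumption too.

```java
import java.util.Arrays;
import java.util.List;

class PolicyListSketch {
    // Property key as it appears in heritrix.properties.
    static final String KEY =
        "org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy";

    // Shortened fallback list, used only when the system property is not set.
    static final String FALLBACK =
        "my.ELFHashQueueAssignmentPolicy "
        + "org.archive.crawler.frontier.IPQueueAssignmentPolicy";

    // Hypothetical helper: resolve the property and split it on whitespace.
    static List<String> resolve() {
        String raw = System.getProperty(KEY, FALLBACK);
        return Arrays.asList(raw.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        List<String> policies = resolve();
        System.out.println("first entry: " + policies.get(0));
        System.out.println("all entries: " + policies);
    }
}
```

Note that the hard-coded list is only a fallback: if the property is set elsewhere (for example through heritrix.properties), that value wins, which is why the properties file has to be changed as well.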
Step 2: In org.archive.crawler.frontier.AdaptiveRevisitFrontier, find:
protected final static String DEFAULT_QUEUE_ASSIGNMENT_POLICY = HostnameQueueAssignmentPolicy.class.getName();
Change it to:
protected final static String DEFAULT_QUEUE_ASSIGNMENT_POLICY = ELFHashQueueAssignmentPolicy.class.getName();
Then scroll further down to the constructor public AdaptiveRevisitFrontier(String name, String description) and locate this section:
String queueStr = System.getProperty(AbstractFrontier.class.getName() +
                    "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
                    HostnameQueueAssignmentPolicy.class.getName() + " " +
                    IPQueueAssignmentPolicy.class.getName() + " " +
                    BucketQueueAssignmentPolicy.class.getName() + " " +
                    SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                    TopmostAssignedSurtQueueAssignmentPolicy.class.getName());
Change it to:
String queueStr = System.getProperty(AbstractFrontier.class.getName() +
                    "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
                    ELFHashQueueAssignmentPolicy.class.getName() + " " +
                    //HostnameQueueAssignmentPolicy.class.getName() + " " +
                    IPQueueAssignmentPolicy.class.getName() + " " +
                    BucketQueueAssignmentPolicy.class.getName() + " " +
                    SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                    TopmostAssignedSurtQueueAssignmentPolicy.class.getName());
Step 3: In the heritrix.properties file, find:
org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
    org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
    org.archive.crawler.frontier.IPQueueAssignmentPolicy \
    org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
    org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy \
    org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy
Add our ELFHashQueueAssignmentPolicy class here as well, so that it becomes:
org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
    my.ELFHashQueueAssignmentPolicy \
    org.archive.crawler.frontier.IPQueueAssignmentPolicy \
    org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
    org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy \
    org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy
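The trailing backslashes in the properties entry are line continuations, so the whole entry is a single whitespace-separated value. The sketch below checks this with java.util.Properties on a shortened copy of the entry. The class name PropertiesContinuationSketch, the SAMPLE text, and the parse() helper are hypothetical and exist only for this illustration. It shows how the standard Java properties format reads such an entry, not how Heritrix itself consumes the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesContinuationSketch {
    static final String KEY =
        "org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy";

    // Shortened copy of the entry; "\\\n" is a backslash followed by a newline.
    static final String SAMPLE =
        KEY + " = \\\n"
        + "    my.ELFHashQueueAssignmentPolicy \\\n"
        + "    org.archive.crawler.frontier.IPQueueAssignmentPolicy\n";

    // Load the text as a properties file and split the value on whitespace.
    static String[] parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(KEY).trim().split("\\s+");
    }

    public static void main(String[] args) {
        for (String className : parse(SAMPLE)) {
            System.out.println(className);
        }
    }
}
```

One practical consequence: every line except the last needs the trailing backslash, otherwise the value is cut off at that line.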
With these changes, Heritrix uses ELFHashQueueAssignmentPolicy by default to assign URLs to queues when crawling. In my own tests, crawl throughput did improve considerably.
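For readers who have not seen it, the idea behind a policy with this name is to hash each URL with the classic ELF hash and use the result modulo a fixed number of buckets as the queue key. The following is a minimal sketch of that idea only. ElfHashBucketSketch, queueKeyFor(), and the bucket count of 100 are assumptions made for illustration, and the code deliberately does not extend any Heritrix class, so it should not be read as the actual ELFHashQueueAssignmentPolicy implementation.

```java
class ElfHashBucketSketch {
    // Hypothetical number of queues; pick whatever suits your crawl.
    static final int BUCKETS = 100;

    // Classic ELF hash using 32-bit int arithmetic.
    // The top four bits are folded back in and cleared on every step,
    // so the result is always non-negative.
    static int elfHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 4) + s.charAt(i);
            int g = h & 0xF0000000;
            if (g != 0) {
                h ^= g >>> 24;
            }
            h &= ~g;
        }
        return h;
    }

    // Map a URL string to one of BUCKETS queue keys.
    static String queueKeyFor(String url) {
        return Integer.toString(elfHash(url) % BUCKETS);
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/a.html",
            "http://example.com/b.html",
            "http://example.com/c/d.html"
        };
        for (String url : urls) {
            System.out.println(queueKeyFor(url) + "  " + url);
        }
    }
}
```

Because the key depends on the whole URL rather than the hostname, URLs from one site land in many queues, which is presumably where the speedup comes from. The flip side is that the crawler no longer throttles itself per host, so use this with care against sites you do not control.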
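Finally, I have to mention that even with the changes above, things still go wrong sometimes: the whole job finishes, but only a few KB have been crawled and the mirror directory is never created. I searched online and found one user suggesting the following:
-------------------- Begin quote --------------------
(1) In Settings, under the frontier section, change max retries to 100 (the cause may be too few entry points)
(2) Replace the hostname in the URL with its IP address (I checked the log and sometimes saw a lot of 404 errors, so I tried switching to the IP address directly, and sure enough it worked, haha)
-------------------- End quote --------------------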
If anyone knows the actual cause, I would be grateful if you let me know.
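Comment 1, lookqlp, 2012-07-21: Which version of Heritrix is this? 1.x?
Comment 2, lookqlp, 2012-07-21: With version 3.1 you should only need to change a configuration file. I'll try it when I get a chance.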