
Customizing Your Own QueueAssignment for Heritrix

Published: 2012-12-28 10:29:05 Author: rapoo

Customizing Your Own QueueAssignmentPolicy for Heritrix
                HostnameQueueAssignmentPolicy.class.getName() + " " +
                IPQueueAssignmentPolicy.class.getName() + " " +
                BucketQueueAssignmentPolicy.class.getName() + " " +
                SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                TopmostAssignedSurtQueueAssignmentPolicy.class.getName());

Add our own ELFHashQueueAssignmentPolicy class to the list, so that it becomes:

String queueStr = System.getProperty(AbstractFrontier.class.getName() +
                "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
                ELFHashQueueAssignmentPolicy.class.getName() + " " +
                //HostnameQueueAssignmentPolicy.class.getName() + " " +
                IPQueueAssignmentPolicy.class.getName() + " " +
                BucketQueueAssignmentPolicy.class.getName() + " " +
                SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                TopmostAssignedSurtQueueAssignmentPolicy.class.getName());


Step 2: In org.archive.crawler.frontier.AdaptiveRevisitFrontier, find:

protected final static String DEFAULT_QUEUE_ASSIGNMENT_POLICY = HostnameQueueAssignmentPolicy.class.getName();

Change it to:

protected final static String DEFAULT_QUEUE_ASSIGNMENT_POLICY = ELFHashQueueAssignmentPolicy.class.getName();

Then, further down, find the constructor public AdaptiveRevisitFrontier(String name, String description) and locate this segment inside it:

String queueStr = System.getProperty(AbstractFrontier.class.getName() +
                    "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
                    HostnameQueueAssignmentPolicy.class.getName() + " " +
                    IPQueueAssignmentPolicy.class.getName() + " " +
                    BucketQueueAssignmentPolicy.class.getName() + " " +
                    SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                    TopmostAssignedSurtQueueAssignmentPolicy.class.getName());

Change it to:

String queueStr = System.getProperty(AbstractFrontier.class.getName() +
                    "." + ATTR_QUEUE_ASSIGNMENT_POLICY,
                    ELFHashQueueAssignmentPolicy.class.getName() + " " +
                    //HostnameQueueAssignmentPolicy.class.getName() + " " +
                    IPQueueAssignmentPolicy.class.getName() + " " +
                    BucketQueueAssignmentPolicy.class.getName() + " " +
                    SurtAuthorityQueueAssignmentPolicy.class.getName() + " " +
                    TopmostAssignedSurtQueueAssignmentPolicy.class.getName());


Step 3:

In the heritrix.properties file, find:

org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
    org.archive.crawler.frontier.HostnameQueueAssignmentPolicy \
    org.archive.crawler.frontier.IPQueueAssignmentPolicy \
    org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
    org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy \
    org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy

Add our ELFHashQueueAssignmentPolicy class here as well, so that it becomes:

org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy = \
    my.ELFHashQueueAssignmentPolicy \
    org.archive.crawler.frontier.IPQueueAssignmentPolicy \
    org.archive.crawler.frontier.BucketQueueAssignmentPolicy \
    org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy \
    org.archive.crawler.frontier.TopmostAssignedSurtQueueAssignmentPolicy
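
Why edit both the Java source and heritrix.properties? The snippets above read the policy list with System.getProperty(key, default), so a value supplied through the properties file takes precedence over the list compiled into the class. The standalone sketch below reproduces only that lookup. The class name PolicyListSketch and the method resolvePolicies are made up for illustration, and the idea that the first entry ends up as the default choice is my reading of the steps in this post, not something checked against the Heritrix source.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Heritrix: mimics the lookup shown in the post.
class PolicyListSketch {
    // Same key as in heritrix.properties: class name + "." + attribute name.
    static final String KEY =
        "org.archive.crawler.frontier.AbstractFrontier.queue-assignment-policy";

    // Returns the policy class names in order. The system property wins;
    // the compiled-in default is used only when the property is absent.
    static List<String> resolvePolicies(String compiledInDefault) {
        String raw = System.getProperty(KEY, compiledInDefault);
        return Arrays.asList(raw.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String compiledIn =
            "org.archive.crawler.frontier.HostnameQueueAssignmentPolicy "
            + "org.archive.crawler.frontier.IPQueueAssignmentPolicy";

        // No property set: the compiled-in list is used.
        System.clearProperty(KEY);
        System.out.println(resolvePolicies(compiledIn).get(0));

        // Property set (as heritrix.properties would do): it overrides the code.
        System.setProperty(KEY, "my.ELFHashQueueAssignmentPolicy "
            + "org.archive.crawler.frontier.IPQueueAssignmentPolicy");
        System.out.println(resolvePolicies(compiledIn).get(0));
    }
}
```

If this reading is right, changing only the Java default would have no effect while the properties file still lists HostnameQueueAssignmentPolicy first, which would explain why Step 3 is needed.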


With these changes, Heritrix uses ELFHashQueueAssignmentPolicy by default to assign URLs to queues when crawling. In my tests, crawl efficiency did improve considerably.
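
The ELFHashQueueAssignmentPolicy class itself does not appear in this part of the post. As a rough, standalone sketch of the idea its name suggests, the code below applies the classic ELF string hash to a URL and takes the result modulo a fixed number of queues to get a queue key. The class name ElfHashSketch, the method queueKey, the example URL, and the queue count of 50 are all illustrative assumptions, not the author's code; the real class would presumably extend Heritrix's QueueAssignmentPolicy and return such a key for each URI.

```java
// Hypothetical standalone sketch; it does not depend on Heritrix.
class ElfHashSketch {
    // Classic ELF hash over the characters of a string.
    // The top four bits are folded back in and cleared on every step,
    // so the result always fits in 28 bits and is never negative.
    static int elfHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 4) + s.charAt(i);
            int g = h & 0xF0000000;
            if (g != 0) {
                h ^= (g >>> 24);
            }
            h &= ~g;
        }
        return h;
    }

    // Maps a URL to one of queueCount queue names ("0" .. queueCount - 1).
    // Hashing the whole URL, rather than the host name, spreads URLs from
    // a single site across many queues.
    static String queueKey(String url, int queueCount) {
        if (queueCount <= 0) {
            throw new IllegalArgumentException("queueCount must be positive");
        }
        return Integer.toString(elfHash(url) % queueCount);
    }

    public static void main(String[] args) {
        // Example URL and queue count are assumptions for the demo.
        System.out.println(queueKey("http://www.example.com/index.html", 50));
    }
}
```

With a host-based policy, all URLs of one site share a single queue; a hash of the full URL distributes them over many queues, which is a plausible reason for the speedup reported above.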


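Finally, I have to mention that even with the changes above, things sometimes still go wrong: the whole job finishes, but only a few KB have been crawled and the mirror directory is never created. I searched online, and one user suggested the following: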

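-------------------- Quoted text --------------------
(1) In Settings, under the frontier item, change max retries to 100 (there may be too few entry points).
(2) Change the URL to an IP address (I looked at the log, which sometimes shows a lot of 404 errors, so I tried switching straight to the IP address, and it worked, haha).

-------------------- Quoted text --------------------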


If any expert knows the cause, I would be glad to hear it.


#1 lookqlp 2012-07-21: Which Heritrix version is this? 1.x?
#2 lookqlp 2012-07-21: With version 3.1 you should only need to change a configuration file. I'll try it later.
