
Web scraping with scrapy

Published: 2012-08-26 16:48:06  Author: rapoo


I have recently been using scrapy for web scraping. For a Pythoner it is very convenient to use, and the detailed documentation is here: http://doc.scrapy.org/en/0.14/index.html

To scrape page information with scrapy, you first need to create a project: scrapy startproject myproject

Once the project is created there is a myproject/myproject subdirectory containing items.py (where you define the things you want to scrape) and pipelines.py (for processing the scraped data, e.g. saving it to a database or elsewhere), plus a spiders folder where the crawler scripts are written.
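For reference, the default project template produces roughly the following layout (the exact files may vary slightly between Scrapy versions):

myproject/
    scrapy.cfg          # project configuration file
    myproject/
        __init__.py
        items.py        # item definitions
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/
            __init__.py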

Here, scraping book information from a website is used as the example.

items.py is as follows:


from scrapy.item import Item, Field

class BookItem(Item):
    # define the fields for your item here like:
    name = Field()
    publisher = Field()
    publish_date = Field()
    price = Field()


Everything we want to scrape is defined above: the name, publisher, publication date, and price.
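An Item behaves much like a Python dict, so the fields declared above are simply set and read by key. A minimal sketch (the values here are made up):

from myproject.items import BookItem

item = BookItem()
item['name'] = u'Some Book'      # fields are assigned like dict entries
item['price'] = u'29.00'
print item['name']               # -> Some Book
print item.get('publisher')      # unset fields come back as None via get()
# assigning a field that is not declared in BookItem raises KeyError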

Next comes the spider that actually crawls the site for the information.

spiders/book.py is as follows:


from urlparse import urljoin
import simplejson

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import BookItem

class BookSpider(CrawlSpider):
    name = 'bookspider'
    allowed_domains = ['test.com']
    start_urls = [
        "http://test_url.com",   # the page where crawling starts (fictitious URL; replace it for real use)
    ]
    rules = (
        # URLs matching this pattern are only followed for further links, not parsed for content
        # (fictitious URL; replace it for real use; note the literal '?' has to be escaped in the regex)
        Rule(SgmlLinkExtractor(allow=(r'http://test_url/test\?page_index=\d+'))),
        # URLs matching this pattern have their content parsed (fictitious URL; replace it for real use)
        Rule(SgmlLinkExtractor(allow=(r'http://test_url/test\?product_id=\d+')), callback="parse_item"),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = BookItem()
        # the XPath expressions below are placeholders; adjust them to the markup
        # of the page you are actually scraping
        item['name'] = hxs.select('//div[@class="book_name"]/text()').extract()
        item['publisher'] = hxs.select('//div[@class="book_publisher"]/text()').extract()
        item['publish_date'] = hxs.select('//div[@class="book_publish_date"]/text()').extract()
        item['price'] = hxs.select('//div[@class="book_price"]/text()').extract()
        return item

The scraped items are then handled by a pipeline. pipelines.py is as follows:

from scrapy import log
#from scrapy.core.exceptions import DropItem
from twisted.enterprise import adbapi
from scrapy.http import Request
from scrapy.exceptions import DropItem
from scrapy.contrib.pipeline.images import ImagesPipeline
import time
import MySQLdb
import MySQLdb.cursors

class MySQLStorePipeline(object):
    def __init__(self):
        # asynchronous MySQL connection pool provided by Twisted
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                db = 'test',
                user = 'user',
                passwd = '******',
                cursorclass = MySQLdb.cursors.DictCursor,
                charset = 'utf8',
                use_unicode = False
        )

    def process_item(self, item, spider):
        # run the insert in a pooled thread and return the item unchanged
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # only insert items that actually have a name
        if item.get('name'):
            tx.execute(
                "insert into book (name, publisher, publish_date, price) \
                 values (%s, %s, %s, %s)",
                (item['name'], item['publisher'], item['publish_date'],
                 item['price'])
            )

    def handle_error(self, e):
        # minimal error handler: just log the failed insert
        log.err(e)
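The pipeline above assumes a book table already exists in the test database. The post does not show its schema, so the one-off creation script below is only a sketch under that assumption (the column types are guesses; adjust them to your data):

import MySQLdb

# one-off helper: create the table the pipeline inserts into
conn = MySQLdb.connect(db='test', user='user', passwd='******', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS book (
        id INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255),
        publisher VARCHAR(255),
        publish_date VARCHAR(64),
        price VARCHAR(32)
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
cur.close()
conn.close()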

Once that is done, register the pipeline in settings.py:


ITEM_PIPELINES = ['myproject.pipelines.MySQLStorePipeline']

Finally, run scrapy crawl bookspider and the crawl starts.
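If you just want to sanity-check the scraped items without the MySQL pipeline, Scrapy's feed exports can write them straight to a file instead, e.g. something like scrapy crawl bookspider -o books.json -t json (check the exact options against the documentation for your Scrapy version).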


