[Python] Crawling every image on 妹子图 (mzitu.com)

Source: Internet  Editor: 程序博客网  Date: 2024/06/10 21:12

Environment

  • Python 3.5
  • Scrapy 1.4.0

Code

  • items.py
# -*- coding: utf-8 -*-
import scrapy


class MeiziSpiderItem(scrapy.Item):
    image_url = scrapy.Field()   # real URL(s) of the image
    refer_url = scrapy.Field()   # page URL to send as the Referer when downloading the image
  • spiders/meizi.py
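# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule  # current module path; scrapy.spider is the old, deprecated one
from scrapy.linkextractors import LinkExtractor

from MeiziSpider.items import MeiziSpiderItem


class MeiziSpider(CrawlSpider):
    name = 'meizi'
    allowed_domains = ['www.mzitu.com']
    start_urls = ['http://www.mzitu.com/']
    # Raw strings keep \d intact; dots are escaped so they match literally.
    rules = (
        # Album pages: /<album id>
        Rule(LinkExtractor(allow=(r'http://www\.mzitu\.com/\d{1,6}',)), callback='parse_item', follow=True),
        # Pages inside an album: /<album id>/<page number>
        Rule(LinkExtractor(allow=(r'http://www\.mzitu\.com/\d{1,6}/\d{1,3}',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        img_item = MeiziSpiderItem()
        # extract() returns a list of URLs, one per matching <img>
        img_item['image_url'] = response.css(".main-image p a img::attr(src)").extract()
        # Keep the page URL so the pipeline can send it as the Referer
        img_item['refer_url'] = response.url
        yield img_item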
  • pipelines.py
# -*- coding: utf-8 -*-
from scrapy.pipelines.images import ImagesPipeline  # replaces the deprecated scrapy.contrib.pipeline.images
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Request every image with a browser User-Agent and the originating page as the Referer
        for image_url in item['image_url']:
            default_headers = {
                'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
                'referer': item['refer_url'],
            }
            yield Request(image_url, headers=default_headers)

    def item_completed(self, results, item, info):
        # Pass the item on unchanged once its downloads have finished
        return item
  • settings.py
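...
import os  # needed by the path handling below

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'MeiziSpider.pipelines.MyImagesPipeline': 5,
}

IMAGES_URLS_FIELD = "image_url"  # item field (defined in items.py) that holds the scraped image URLs

# Where to save the images locally
project_dir = os.path.abspath(os.path.dirname(__file__))  # absolute path of the directory containing settings.py
IMAGES_STORE = os.path.join(project_dir, 'images')  # images are stored in an "images" subdirectory
...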
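
A note on the two rules in spiders/meizi.py: LinkExtractor applies its allow patterns with a regex search, and CrawlSpider hands each link to the first rule that matches it. The sketch below uses only the standard re module to check which pattern a URL hits first. It suggests that the first pattern already matches the per-page URLs, so the second rule is probably never reached (harmless here, since both rules share the same callback). The helper name first_matching_rule and the sample URLs are made up for illustration and are not part of the project.

```python
import re

# The two allow patterns from the spider, written as raw strings with escaped dots.
ALBUM = re.compile(r'http://www\.mzitu\.com/\d{1,6}')
ALBUM_PAGE = re.compile(r'http://www\.mzitu\.com/\d{1,6}/\d{1,3}')


def first_matching_rule(url):
    """Return the name of the first pattern found in url, or None if neither matches."""
    for name, pattern in (('album', ALBUM), ('album_page', ALBUM_PAGE)):
        # search() is used, not fullmatch(), so a pattern can match a prefix of the URL
        if pattern.search(url):
            return name
    return None


# Hypothetical URLs, for illustration only
for url in ('http://www.mzitu.com/12345',
            'http://www.mzitu.com/12345/2',
            'http://www.mzitu.com/tag/abc'):
    print(url, '->', first_matching_rule(url))
```

If the two kinds of pages ever needed different callbacks, anchoring the first pattern with a trailing `$` would be one way to keep them apart.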

Results

  • In total, a little over 3 GB of images were downloaded, which took 1.5 hours
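
To find the downloaded files afterwards, it helps to know how they are named. As far as I know, Scrapy's stock ImagesPipeline saves each image under IMAGES_STORE as full/<SHA-1 of the image URL>.jpg, as long as file_path is not overridden (MyImagesPipeline above does not override it). The following is a rough sketch of that naming scheme, not Scrapy's actual code. The function name default_image_path, the store directory, and the image URL are all assumptions for illustration.

```python
import hashlib
import os


def default_image_path(images_store, image_url):
    """Approximate the default on-disk location: <store>/full/<sha1 of URL>.jpg"""
    image_guid = hashlib.sha1(image_url.encode('utf-8')).hexdigest()
    return os.path.join(images_store, 'full', image_guid + '.jpg')


# Hypothetical store directory and image URL, for illustration only
print(default_image_path('MeiziSpider/images', 'http://example.com/a.jpg'))
```

Because the file name is derived from the URL, the same image URL always maps to the same file.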