Scrapy: running a single script to crawl the GIF files from jandan's 无聊图 (boring pictures) board

Creating a full Scrapy project is cumbersome. Putting everything in one script file and running it with python spider.py is simpler.

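It is actually easy: the spider class is derived exactly as it would be in a generated project. The only addition is the import from scrapy.crawler import CrawlerProcess.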

At the end, create a process object, register the spider with it, and call start.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class gifitem(scrapy.Item):
    # FilesPipeline reads the URLs to download from the file_urls field.
    file_urls = scrapy.Field()

class Meizitu(scrapy.Spider):
    name = "pic"
    allowed_domains = ["jandan"]
    start_urls = (
        '/',
    )

    def is_gif_url(self, url):
        # Look only at the text after the last dot, i.e. the file extension.
        ext = url[url.rfind("."):]
        return "gif" in ext

    def parse(self, response):
        print("visit url---------->", response.url)
        a = response.selector.xpath('//div[@class="cp-pagenavi"]')

        # Following the pagination links is switched off by "and 0" while debugging.
        if len(a) > 0 and 0:  # debug
            b = a[0]
            urls = b.xpath('.//a/@href').extract()
            for url in urls:
                yield scrapy.Request(url, self.parse)

        # Collect every link on the page that points at a GIF file.
        gifs = response.selector.xpath('//a/@href').extract()
        urls = []
        for gif in gifs:
            if self.is_gif_url(gif):
                urls.append(gif)
        # Yield one item per page, and only when GIF links were found.
        if len(urls) > 0:
            gif_item = gifitem()
            gif_item['file_urls'] = urls
            yield gif_item

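setting = get_project_settings()
# Enable the built-in FilesPipeline so the URLs in file_urls are downloaded.
setting.set("ITEM_PIPELINES", {
    'scrapy.pipelines.files.FilesPipeline': 1
})
# Raw string, so the backslash in the Windows path is not read as an escape.
setting.set("FILES_STORE", r"J:\pic_download")
process = CrawlerProcess(setting)
process.crawl(Meizitu)
process.start()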
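
The GIF check used by the spider above can also be tried on its own, outside Scrapy. The following is a minimal sketch, not the code from the script: it assumes a stricter rule than the original (the extension must be exactly .gif, ignoring case, and any query string or fragment is dropped first). The helper name filter_gif_urls and the example.com URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def is_gif_url(url):
    """Return True when the path part of url ends with a .gif extension."""
    # urlparse separates the path from any query string or fragment,
    # so "?v=2" after the file name does not hide the extension.
    path = urlparse(url).path
    # splitext returns (root, ext); compare the extension case-insensitively.
    ext = posixpath.splitext(path)[1]
    return ext.lower() == ".gif"


def filter_gif_urls(hrefs):
    """Keep only the hrefs that point at GIF files, preserving their order."""
    return [href for href in hrefs if is_gif_url(href)]


if __name__ == "__main__":
    # Hypothetical links, standing in for what the XPath query returns.
    sample = [
        "http://example.com/pic/a.gif",
        "http://example.com/gif/page.html",
        "http://example.com/pic/b.GIF?x=1",
    ]
    print(filter_gif_urls(sample))
```

Compared with testing whether "gif" appears in the extension, this variant rejects extensions such as .gifts and still accepts URLs that carry a query string after the file name.
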
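
One thing to note here is the setting object: it can be used to configure Scrapy settings from inside the script, such as the item pipeline and other properties. See the Scrapy documentation for details. jandan has some simple anti-crawling protection, and overriding the user agent is enough to get past it. Crawling all of the GIFs, however, will probably require a middleware. More to learn.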
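
As a rough sketch of the user agent and middleware ideas mentioned above, and not something taken from the original script: the class below follows the process_request(request, spider) hook of Scrapy's downloader middlewares and sets a User-Agent header on each request. The class name, the user-agent strings, the priority 400, and the "__main__" import path (which assumes the class is defined in the script that is run directly) are all assumptions made for illustration.

```python
import random

# Hypothetical user-agent strings, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: choose a User-Agent for each request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Called for every outgoing request; returning None lets the
        # request continue through the remaining middlewares.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None


# Extra settings that could be applied with setting.set(key, value) before
# the CrawlerProcess is created. USER_AGENT alone gives a fixed override;
# the middleware entry is only needed for per-request rotation.
# Assumption: the class lives in the script being run, hence "__main__".
EXTRA_SETTINGS = {
    "USER_AGENT": USER_AGENTS[0],
    "DOWNLOADER_MIDDLEWARES": {"__main__.RandomUserAgentMiddleware": 400},
}
```

In the script, each entry could be applied with a loop such as for key, value in EXTRA_SETTINGS.items(): setting.set(key, value), placed before CrawlerProcess(setting) is called.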