Simulating a Sina Weibo login with Scrapy

Hi all, simulating a login with Scrapy is really easy. Everything below assumes you already have Scrapy installed.

  First, analyze Sina Weibo's login flow. A packet-capture tool produced the screenshot below (not reproduced here):

Generally speaking, logging in just means POSTing data to the server. If the site has a captcha, you need captcha recognition, which is computer-vision territory; Scrapy can't do that. Sina Weibo is a bit special. Sina is a big company and, as you'd expect, won't simply let you POST data straight away: before the POST there is a GET request that fetches some parameters from the server. So the first thing we do is write that GET request.

Step 1: use Scrapy's command-line tool to generate a template spider.

E:\workspace\TribuneSpider\src>scrapy genspider -t crawl weibo weibo.com
Created spider 'weibo' using template 'crawl' in module:
  src.spiders.weibo

 As shown above; the last two lines are Scrapy's own message. This generates the following template code:

#! -*- encoding:utf-8 -*-
import re

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from src.items import SrcItem


class WeiboSpider(CrawlSpider):
    '''
    An example of simulating a Sina Weibo login with Scrapy;
    I hope it is of some help.
    '''
    name = 'weibo'
    allowed_domains = ['weibo.com']
    start_urls = ['http://www.weibo.com/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        i = SrcItem()
        #i['domain_id'] = hxs.select('//input[@id="sid"]/@value').extract()
        #i['name'] = hxs.select('//div[@id="name"]').extract()
        #i['description'] = hxs.select('//div[@id="description"]').extract()
        return i

Step 2: adjust the settings. For WeiboSpider, which inherits from CrawlSpider, the name and allowed_domains attributes are required, but start_urls can be dropped and generated by another method instead. Next, modify the code to fetch what Sina's first URL returns. Beforehand, we need to know what format this URL normally returns:

sinaSSOController.preloginCallBack({"retcode":0,"servertime":1314019864,"nonce":"J1F9XN"})

  The response comes back in the format above, and the login POST will need these three fields.
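As a quick sanity check outside the spider, the two fields we care about can be pulled out of that callback string with a regular expression. A minimal sketch, using the same pattern the spider code below uses (the sample body is copied from the capture above):

# -*- coding: utf-8 -*-
import re

# Sample prelogin response body, copied from the capture above.
body = 'sinaSSOController.preloginCallBack({"retcode":0,"servertime":1314019864,"nonce":"J1F9XN"})'

# Same pattern as in the spider: capture servertime and nonce.
servertime, nonce = re.findall('{"retcode":0,"servertime":(.*?),"nonce":"(.*?)"}', body, re.I)[0]
print servertime  # 1314019864
print nonce       # J1F9XN

The spider does exactly this extraction in its callback. The modified code: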

#! -*- encoding:utf-8 -*-
import re
import time

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

from src.items import SrcItem


class WeiboSpider(CrawlSpider):
    '''
    An example of simulating a Sina Weibo login with Scrapy;
    I hope it is of some help.
    '''
    name = 'weibo'
    allowed_domains = ['weibo.com', 'sina.com.cn']

    def start_requests(self):
        username = '********'
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=%s&client=ssologin.js(v1.3.14)&_=%s' % \
            (username, str(time.time()).replace('.', ''))
        print url
        return [Request(url=url, method='get', callback=self.parse_item)]

    def parse_item(self, response):
        print response.body
        hxs = HtmlXPathSelector(response)
        i = SrcItem()
        #i['domain_id'] = hxs.select('//input[@id="sid"]/@value').extract()
        #i['name'] = hxs.select('//div[@id="name"]').extract()
        #i['description'] = hxs.select('//div[@id="description"]').extract()
        return i

  Code notes:

1. Scrapy has an offsite mechanism that decides whether to follow other domains, so sina.com.cn has to be added to allowed_domains.

2. Analyzing the GET URL shows that it needs your username plus a timestamp for the current request. time.time() gives the usual 13-digit timestamp, except for the decimal point in the middle, which we strip out (see the sketch after this list).

3. callback specifies the callback function. This touches Scrapy's internal execution machinery, which runs fairly deep, so I won't expand on it here. parse_item then prints response.body, the body of the returned response. The response object has many attributes, e.g. response.request, response.url, response.headers, which you can inspect yourself with dir(response).
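A tiny sketch of the timestamp trick from point 2, plain Python with nothing Scrapy-specific:

import time

# time.time() returns something like 1314019864.12; Sina's prelogin
# URL just wants the digits, so strip out the decimal point.
ts = str(time.time()).replace('.', '')
print ts  # e.g. '131401986412'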

Now run the spider and see whether it works. The output:

E:\workspace\TribuneSpider\src>scrapy crawl weibo
2011-08-22 22:02:47+0800 [scrapy] INFO: Scrapy 0.12.0.2539 started (bot: src)
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, CloseSpider
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Enabled item pipelines: SrcPipeline
http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=*******&client=ssologin.js(v1.3.14)&_=131402176767
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-08-22 22:02:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-08-22 22:02:47+0800 [weibo] INFO: Spider opened
2011-08-22 22:02:47+0800 [weibo] DEBUG: Crawled (200) <GET http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=******&client=ssologin.js(v1.3.14)&_=131402176767> (referer: None)
sinaSSOController.preloginCallBack({"retcode":0,"servertime":1314021767,"nonce":"0G2Q3S"})
2011-08-22 22:02:47+0800 [weibo] DEBUG: Scraped SrcItem() in <http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=********&client=ssologin.js(v1.3.14)&_=131402176767>
2011-08-22 22:02:47+0800 [weibo] INFO: Passed SrcItem()
2011-08-22 22:02:47+0800 [weibo] INFO: Closing spider (finished)
2011-08-22 22:02:47+0800 [weibo] INFO: Spider closed (finished)

Do you see the sinaSSOController.preloginCallBack&user=********&client=ssologin.js(v1.3.14)&_=131402176767 part in there? Bingo.


Next, let's add something: the POST request.

1. Analyze the data in the POST request:

  (The packet-capture screenshot of the POST form fields is omitted; the blog platform was painfully slow at uploading images. All of the fields appear in the formdata dict below.)

It's really quite simple: just POST this whole pile of values to the POST URL. What could be easier?

Now add the implementation code:

#! -*- encoding:utf-8 -*-
import re
import os
import time

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.http import FormRequest

from src.items import SrcItem


class WeiboSpider(CrawlSpider):
    '''
    An example of simulating a Sina Weibo login with Scrapy;
    I hope it is of some help.
    '''
    name = 'weibo'
    allowed_domains = ['weibo.com', 'sina.com.cn']

    def start_requests(self):
        username = '*******'
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=%s&client=ssologin.js(v1.3.14)&_=%s' % \
            (username, str(time.time()).replace('.', ''))
        print url
        return [Request(url=url, method='get', callback=self.post_message)]

    def post_message(self, response):
        # Pull servertime and nonce out of the prelogin callback.
        serverdata = re.findall('{"retcode":0,"servertime":(.*?),"nonce":"(.*?)"}', response.body, re.I)[0]
        print serverdata
        servertime = serverdata[0]
        print servertime
        nonce = serverdata[1]
        print nonce
        formdata = {"entry": 'miniblog',
                    "gateway": '1',
                    "from": "",
                    "savestate": '7',
                    "useticket": '1',
                    "ssosimplelogin": '1',
                    "username": '**********',
                    "service": 'miniblog',
                    "servertime": servertime,
                    "nonce": nonce,
                    "pwencode": 'wsse',
                    "password": '*********',
                    "encoding": 'utf-8',
                    "url": 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
                    "returntype": 'META'}
        return [FormRequest(url='http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.3.14)',
                            formdata=formdata, callback=self.parse_item)]

    def parse_item(self, response):
        with open('%s%s%s' % (os.getcwd(), os.sep, 'logged.html'), 'wb') as f:
            f.write(response.body)

  Next, verify by checking the contents of logged.html:

<html><head><script language='javascript'>parent.sinaSSOController.feedBackUrlCallBack({"result":true,"userinfo":{"uniqueid":"1700208252","userid":"×××××××××","displayname":"×××××××××","userdomain":"×××××××××"}});</script></head><body></body></html>null
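By the way, you can also test for the success flag programmatically rather than eyeballing the file. A minimal sketch of my own, not part of the original spider:

def login_succeeded(body):
    # The SSO callback carries "result":true on success.
    return '"result":true' in body

# e.g. inside parse_item:
#     if login_succeeded(response.body):
#         print 'login ok'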

  There's a result:true in there. Did it succeed? Not so fast; let's add one more piece of code (in fact, the content above is what the third link in the packet capture returns):

#! -*- encoding:utf-8 -*-
import re
import os
import time

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.http import FormRequest

from src.items import SrcItem


class WeiboSpider(CrawlSpider):
    '''
    An example of simulating a Sina Weibo login with Scrapy;
    I hope it is of some help. This is the complete code.
    '''
    name = 'weibo'
    allowed_domains = ['weibo.com', 'sina.com.cn']

    def start_requests(self):
        # Step 1: GET the prelogin URL to obtain servertime and nonce.
        username = '********'
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=miniblog&callback=sinaSSOController.preloginCallBack&user=%s&client=ssologin.js(v1.3.14)&_=%s' % \
            (username, str(time.time()).replace('.', ''))
        print url
        return [Request(url=url, method='get', callback=self.post_message)]

    def post_message(self, response):
        # Step 2: pull servertime and nonce out of the callback and POST the login form.
        serverdata = re.findall('{"retcode":0,"servertime":(.*?),"nonce":"(.*?)"}', response.body, re.I)[0]
        print serverdata
        servertime = serverdata[0]
        print servertime
        nonce = serverdata[1]
        print nonce
        formdata = {"entry": 'miniblog',
                    "gateway": '1',
                    "from": "",
                    "savestate": '7',
                    "useticket": '1',
                    "ssosimplelogin": '1',
                    "username": '*******',
                    "service": 'miniblog',
                    "servertime": servertime,
                    "nonce": nonce,
                    "pwencode": 'wsse',
                    "password": '*******',
                    "encoding": 'utf-8',
                    "url": 'http://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack',
                    "returntype": 'META'}
        return [FormRequest(url='http://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.3.14)',
                            formdata=formdata, callback=self.check_page)]

    def check_page(self, response):
        # Step 3: with the login cookies set, re-request the weibo.com homepage.
        url = 'http://weibo.com/'
        request = response.request.replace(url=url, method='get', callback=self.parse_item)
        return request

    def parse_item(self, response):
        with open('%s%s%s' % (os.getcwd(), os.sep, 'logged.html'), 'wb') as f:
            f.write(response.body)

  Check the output file by opening it in a browser:

Success, friends. And how much code did that really take? Oh, one thing I forgot to mention: Scrapy's FormRequest does have a dead-simple way of locating and submitting a form, namely FormRequest.from_response(), but Sina's login is unusual and the form can't be found that way, so we used the approach above instead.
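For reference, on sites where the login form does appear in the HTML, the usual from_response() pattern looks roughly like this sketch (the field names and the after_login callback are placeholders of mine, not Weibo's):

from scrapy.http import FormRequest

def parse_login_page(self, response):
    # from_response() locates a <form> in the page, pre-fills its
    # hidden fields, and we override only the credentials.
    return FormRequest.from_response(
        response,
        formdata={'username': 'me', 'password': 'secret'},
        callback=self.after_login)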