Python Crawler Notes (Crawling from Internal Links to External Sites)
Source: Internet  Editor: 程序博客网  Time: 2024/06/09 23:01
#!/usr/bin/env python
# coding=utf-8
import random
import re

from bs4 import BeautifulSoup

try:  # Python 2
    from urllib2 import urlopen
    from urlparse import urljoin
except ImportError:  # Python 3
    from urllib.request import urlopen
    from urllib.parse import urljoin

random.seed()  # seed from the operating system or the current time


# Retrieves a list of all internal links found on a page
def getInternalLinks(bsObj, includeUrl):
    internalLinks = []
    # Finds all links that begin with "/" or that contain the site's own domain
    for link in bsObj.findAll("a", href=re.compile("^(/|.*" + re.escape(includeUrl) + ")")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in internalLinks:
                internalLinks.append(link.attrs['href'])
    return internalLinks


# Retrieves a list of all external links found on a page
def getExternalLinks(bsObj, excludeUrl):
    externalLinks = []
    # Finds all links that start with "http" or "www" and that
    # do not contain the current domain
    for link in bsObj.findAll("a", href=re.compile("^(http|www)((?!" + re.escape(excludeUrl) + ").)*$")):
        if link.attrs['href'] is not None:
            if link.attrs['href'] not in externalLinks:
                externalLinks.append(link.attrs['href'])
    return externalLinks


# Splits a URL into its parts; element 0 is the domain
def splitAddress(address):
    addressParts = re.sub(r"^https?://", "", address).split("/")
    return addressParts


def getRandomExternalLink(startingPage):
    html = urlopen(startingPage)
    bsObj = BeautifulSoup(html, "html.parser")
    domain = splitAddress(startingPage)[0]
    externalLinks = getExternalLinks(bsObj, domain)
    if len(externalLinks) == 0:
        # No external links on this page: fall back to a random internal link.
        # Internal links may be relative, so resolve them against the current page.
        internalLinks = getInternalLinks(bsObj, domain)
        return urljoin(startingPage, random.choice(internalLinks))
    else:
        return random.choice(externalLinks)


# Hops from link to link until a request fails or the recursion limit is reached
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print("Random external link is: " + externalLink)
    followExternalOnly(externalLink)


followExternalOnly("http://oreilly.com")
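The two regular expressions decide which links the crawler treats as internal and which as external, so it helps to try them on a few hrefs without touching the network. The following is only a minimal sketch: `classify_href` is a hypothetical helper that is not part of the original script, the sample hrefs are made up, and the domain is escaped with `re.escape` as an added assumption. It uses `search`, which is how BeautifulSoup applies a compiled pattern passed as an attribute filter.

```python
import re


def classify_href(href, domain):
    """Return 'internal', 'external', or 'other' for a single href.

    Hypothetical helper: it mirrors the patterns used by getInternalLinks
    and getExternalLinks, checking the internal pattern first.
    """
    # Internal: starts with "/" or contains the domain anywhere in the string
    internal = re.compile("^(/|.*" + re.escape(domain) + ")")
    # External: starts with "http" or "www" and never contains the domain
    external = re.compile("^(http|www)((?!" + re.escape(domain) + ").)*$")
    if internal.search(href):
        return "internal"
    if external.search(href):
        return "external"
    return "other"


if __name__ == "__main__":
    # Made-up sample hrefs, for illustration only
    samples = [
        "/about",
        "http://oreilly.com/catalog",
        "http://www.example.org/page",
        "www.example.org",
        "mailto:someone@example.org",
        "http://example.org/?ref=oreilly.com",
    ]
    for href in samples:
        print(href + " -> " + classify_href(href, "oreilly.com"))
```

Note that the last sample is classified as internal: the internal pattern matches any href that contains the domain anywhere, including in a query string on another site. A stricter check would probably compare the parsed host name instead of searching the whole string.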