BeautifulSoup parses only part of a document

Source: Internet | Published by: gre模考软件 | Editor: 程序博客网 | Date: 2024/06/10 01:30

First, I recommend reading:
http://m.blog.csdn.net/blog/muzixiaozi/39960219
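My problem was very similar to the one described there. Also on Python 2.7, I called BeautifulSoup to parse a page's source and found that the parsed result had lost the latter part of the document (a bit less than half), keeping only the first part.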
The official Beautiful Soup 4.4 documentation provides a diagnostic function:
    from bs4.diagnose import diagnose
    data = open("bad.html").read()
    diagnose(data)
    # The commented lines below are the diagnostic output for your Python environment.
    # Diagnostic running on Beautiful Soup 4.2.0
    # Python version 2.7.3 (default, Aug 1 2012, 05:16:07)
    # I noticed that html5lib is not installed. Installing it may help.
    # Found lxml version 2.3.2.0
    #
    # Trying to parse your data with html.parser
    # Here's what html.parser did with the document:
    # ...
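Below is part of my code, written after running the diagnosis. I tried lxml first, since it is the fastest parser and is also fairly tolerant of malformed markup. Unfortunately, it still could not parse my file completely.

    from bs4 import BeautifulSoup      # import that the original snippet left out
    from bs4.diagnose import diagnose

    filePath = "D:\\test1.html"
    # First attempt with lxml, which still truncated the output:
    # html_soup = BeautifulSoup(open(filePath), 'lxml')
    html_soup = BeautifulSoup(open(filePath), 'html5lib')
    data = open(filePath).read()
    # Uncomment to print the diagnostic report for this file:
    # diagnose(data)
    print(html_soup.get_text())        # the parenthesized form also runs on Python 2.7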
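One way to confirm that the parser is what truncates the output is to run the same markup through each installed parser and compare how much text each one extracts. The following is only a minimal sketch under stated assumptions: the helper name `text_length_by_parser` and the sample markup are hypothetical and not part of the original code, and bs4 must be installed. Parsers that are not installed are reported as None.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def text_length_by_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: length of extracted text, or None if unavailable}.

    Hypothetical helper for illustration only.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is not installed.
            results[name] = None
            continue
        results[name] = len(soup.get_text())
    return results


if __name__ == "__main__":
    # Made-up sample markup; replace it with the contents of the problem file.
    sample = "<html><body><p>first half</p><p>second half</p></body></html>"
    for parser, length in text_length_by_parser(sample).items():
        print("%s: %s" % (parser, length))
```

A parser whose count is clearly smaller than the others has probably dropped part of the document.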
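The diagnosis suggested that I install html5lib, because out of the box Python only ships its built-in html.parser. html5lib is the most tolerant of the parsers and handles irregular HTML documents well, but it is very slow, because it parses the page the way a web browser does and produces valid HTML5. I downloaded the html5lib package, unpacked it, changed into the unpacked directory, and ran python setup.py install. I then started Python and ran import html5lib to check that the installation had worked. Sure enough, it raised an error: the six package was not installed. After I ran easy_install six, the error no longer appeared. With that, the problem that had bothered me for several days was solved, so I am noting it down here.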
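The installation check described above can also be scripted, so that a missing dependency such as six shows up directly. This is a rough sketch using only the standard library; the function name `check_modules` and the default module list are assumptions chosen for illustration.

```python
import importlib


def check_modules(names=("bs4", "html5lib", "six", "lxml")):
    """Return {module_name: True if it can be imported, else False}.

    Hypothetical helper for illustration only.
    """
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            # Covers both a missing module and a module whose own dependency is missing.
            status[name] = False
    return status


if __name__ == "__main__":
    for module, ok in sorted(check_modules().items()):
        print("%s: %s" % (module, "installed" if ok else "MISSING"))
```

Any module reported as MISSING can then be installed with whichever tool the environment uses, for example easy_install as above.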
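Beautiful Soup 4.4 differs considerably from version 3.0 and supports three parsers (html.parser, lxml, and html5lib). For details, see the blog post linked above and the official documentation.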

0 0