Other Testing Frameworks [Solved] Two questions about an exercise based on raymond's Python crawler article

shela2009 · 2017-08-02 · Last reply by shela2009 on 2017-08-03 · 3992 views

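This is about @raymond's article: https://testerhome.com/topics/4637
The original was written in 2016, and the code in it no longer finds the same tags on the current page, so I rewrote it.
After finishing I had two questions. The original thread is old and raymond may no longer be following it, so I'm posting a new one. If anyone else knows the answers, I'd appreciate the help!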

# -*- coding: utf8 -*-
import requests
from bs4 import BeautifulSoup as bs

request_url = 'https://testerhome.com/topics/4621'
# response = requests.get(request_url).text
response = bs(requests.get(request_url).text, 'lxml')  # returns a BeautifulSoup object parsed with lxml
print("Article title:", response.h1.text)  # article title
# print("Article category:", response.find('a', {"class": "node"}).text)  # article category
print("Article category:", response.h1.span.text)  # article category; both approaches find it
print("Article author:", response.find('a', {"class": "user-name"}).text)  # author name
print("---------- All comments on the article ----------")
for i in response.find_all('div', {"class": "reply"}):
    print(i.p.text)
print("---------- Finished fetching all comments ----------")

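Question 1: When I read the article title via text, the result is 其他测试框架 Python 网络爬虫 (一), that is, the category name followed by the title. I only want Python 网络爬虫 (一) and don't know how to get it. Excerpt of the page source: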

<div class="navbar-topic-title">
  <a href="#" class="topic-title pull-left" title="Python 网络爬虫  (一) " data-type="top">
    <h1><span class="node">其他测试框架</span> Python 网络爬虫  (一) </h1>
  </a>
</div>

Question 2: The comments I get back stop at floor 39. Part of the comment output and the error:

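#36 @fing520 urllib has its pros and cons. The good part is that it explains crawlers and how HTTP works in enough detail; the bad part is that it is too long-winded...
When is the next part coming out? Waiting eagerly
The last output reports this error: File "", line 1, in <mo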
UnicodeEncodeError: 'gbk' codec
5: illegal multibyte sequence
Traceback (most recent call last):
  File "C:/pyspace/pachong/test_pachong.py", line 14, in <module>
    print(i.p.text)
AttributeError: 'NoneType' object has no attribute 'text'
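
The truncated UnicodeEncodeError fragment above names the 'gbk' codec. A likely cause (an assumption, since the full traceback is not shown) is that print() is writing to a Windows console whose encoding is gbk, and one comment contains a character that gbk cannot represent. The sketch below simulates such a console with io.TextIOWrapper; the helper name make_console and the sample text are hypothetical.

```python
import io


def make_console(errors):
    """Hypothetical stand-in for a console whose encoding is gbk."""
    return io.TextIOWrapper(io.BytesIO(), encoding="gbk", errors=errors)


# Made-up sample text: the emoji has no gbk representation.
text = "评论 \U0001F600"

# With the default strict error handling, printing raises UnicodeEncodeError.
strict = make_console("strict")
try:
    print(text, file=strict)
    strict.flush()
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace", the unencodable character becomes "?" instead.
tolerant = make_console("replace")
print(text, file=tolerant)
tolerant.flush()
shown = tolerant.buffer.getvalue().decode("gbk")
```

On Python 3.7 and later, sys.stdout.reconfigure(errors="replace") applies the same tolerance to the real console, and setting the PYTHONIOENCODING environment variable is another commonly used option.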

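Looking at the page, what comes after floor 39 (5: illegal multibyte sequence) is a tag with class "reply reply-system" containing "raymond 在 [该话题已被删除] 中提及了此贴" (raymond mentioned this post in [a topic that has since been deleted]). Its class is different, though, so shouldn't it simply be skipped? Why does it raise an error?
To make it easier to inspect, I reformatted the page source as follows: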

---------------------------------------------- Solution (manual divider) -----------------------------------------------------

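Question 1:
Article title: name=response.find('a',{"class":"topic-title pull-left"})["title"]
Question 2:
The deleted-topic notice has class="reply reply-system", which means it also matches reply. That entry has no p tag, so i.p can be None, and only the entries where it is not None are the comments we want. The fix is to add an if check for None.
The final code after these changes: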

# -*- coding: utf8 -*-
import requests
from bs4 import BeautifulSoup as bs

request_url = 'https://testerhome.com/topics/4621'
response = bs(requests.get(request_url).text, 'lxml')  # returns a BeautifulSoup object parsed with lxml
name = response.find('a', {"class": "topic-title pull-left"})["title"]
print("Article name:", name)
print("Article category:", response.h1.span.text)  # article category
print("Article author:", response.find('a', {"class": "user-name"}).text)  # author name
print("---------- All comments on the article ----------")
for i in response.find_all('div', {"class": "reply"}):
    # System notices such as class="reply reply-system" have no <p>, so skip them
    if i.p is not None:
        print(i.p.text)
print("---------- Finished fetching all comments ----------")
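
Why the reply-system entry matched can be reproduced offline. The sketch below uses simplified, hypothetical markup modelled on the description above (the real page structure may differ) and the built-in html.parser instead of lxml, so it needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two user comments and one system notice.
SAMPLE_HTML = """
<div class="reply"><p>first comment</p></div>
<div class="reply reply-system">raymond mentioned this topic</div>
<div class="reply"><p>second comment</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching for one class name matches every tag whose class list contains it,
# so the "reply reply-system" div is returned as well.
matched = soup.find_all("div", {"class": "reply"})
matched_classes = [div["class"] for div in matched]

# Option 1: skip entries that have no <p> child instead of dereferencing None.
comments = [div.p.text for div in matched if div.p is not None]

# Option 2: drop the system notices by inspecting the class list directly.
user_replies = [div for div in matched if "reply-system" not in div["class"]]
```

Here matched holds all three divs, while comments and user_replies each keep only the two user comments. Checking i.p for None, as in the final code above, is the more tolerant of the two options because it also skips any other entry that lacks a p tag.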

3 likes

9 replies

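onemorecd #5 · 2017-08-03

Question 1: you can take what you want from the tag's title attribute. If you insist on getting it from the h1, I think the only option is to fetch the whole text and then strip the unwanted part with a regex?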

<a href="#" class="topic-title pull-left" title="Python 网络爬虫  (一) " data-type="top">

Question 2: finding the comments with the rule response.find_all('div',{"class":"reply"}) isn't quite right. Take another look at the markup and its patterns.

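shela2009 #2 · 2017-08-03 Author

In reply to onemorecd:

Question 1: I didn't find anything fully satisfactory in title, so in the end I used print("Article name:", response.head.title.text)
Question 2: What is wrong with this rule? It follows the page structure, doesn't it? The comments up to floor 39 all came out, and it only got stuck on that notice in the middle. I thought the notice would not match and would be skipped, yet it raised an error.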

for i in response.find_all('div',{"class":"reply"}):
    print(i.p.text)

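riklu #7 · 2017-08-03

Question 1: you can get it with the approach below:

shela2009 #8 · 2017-08-03 Author

In reply to riklu:

Ah... that certainly works, but it only handles articles in this category, or in categories whose names are six characters long. It doesn't generalize to every article.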
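
One way to keep using h1 without depending on the length of the category name is to read only the text nodes that sit directly under h1, which leaves out the nested span. A minimal sketch, assuming the markup matches the excerpt quoted in the question, and parsed with html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Markup copied from the excerpt in the question (trimmed to the relevant part).
SAMPLE_HTML = """
<a href="#" class="topic-title pull-left" title="Python 网络爬虫  (一) ">
  <h1><span class="node">其他测试框架</span> Python 网络爬虫  (一) </h1>
</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
h1 = soup.h1

# recursive=False limits the search to direct children of <h1>,
# so the text inside <span class="node"> is not included.
direct_text = "".join(h1.find_all(string=True, recursive=False))

# Collapse runs of whitespace and trim both ends.
title = " ".join(direct_text.split())

# The category is still available from the span, whatever its length.
category = h1.span.text
```

This avoids both slicing by a fixed number of characters and a regex, at the cost of assuming the title is always plain text directly inside h1.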

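riklu #5 · 2017-08-03

Question 2: I just tried it, and it looks like i.p is None at that comment. Add the check if i.p != None: and then print i.p.text.

shela2009 #6 · 2017-08-03 Author

In reply to riklu:

Thank you very much! That does produce the result!
At first I assumed i.p could never be None. Then I noticed that the deleted-topic notice has class="reply reply-system", so it does match reply, and since no p was found inside it, the error was raised.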

shela2009 #3 · 2017-08-03 Author

In reply to onemorecd:

Question 2 is solved now! Thanks anyway.

YueChen #9 · 2017-08-03

Question 1:

print("Article title:", response.find('a', {"class": "topic-title pull-left"}).get('title'))  # article title
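
The two attribute lookups used in this thread, ["title"] and .get('title'), return the same value when the attribute exists and differ only when it is missing. A small sketch with hypothetical one-line markup, parsed with html.parser; the data-missing attribute name is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the <a> tag quoted earlier in the thread.
SAMPLE_HTML = '<a href="#" class="topic-title pull-left" title="Python 网络爬虫  (一) ">x</a>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
link = soup.find("a", class_="topic-title")

# Both forms read the attribute; strip() removes the trailing space in the value.
title_via_index = link["title"].strip()
title_via_get = link.get("title", "").strip()

# For an attribute that does not exist, get() returns None ...
missing = link.get("data-missing")

# ... while indexing raises KeyError.
try:
    link["data-missing"]
    raised_key_error = False
except KeyError:
    raised_key_error = True
```

Using .get() with a default is the safer choice when scraping many pages whose markup may vary, while indexing fails loudly when the page no longer looks as expected.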

shela2009 #1 · 2017-08-03 Author

In reply to YueChen:

Thank you so much!
