Two ways to write a Python web scraper

Scraping steps (using a novel site as the example)
Step 1: fetch the novel's index page;
Step 2: extract each chapter's title and a reachable URL;
Step 3: fetch the text of each chapter;
Step 4: write the chapter titles and contents to a file
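
These four steps map naturally onto four small functions. The sketch below is only a roadmap; the function names are illustrative and do not appear in the scripts that follow, which inline these steps:

import requests
import re

def fetch_index(url):                  # Step 1: fetch the novel's index page
    return requests.get(url).text

def parse_chapters(index_html):        # Step 2: extract (url, title) pairs for each chapter
    return re.findall(r'<dd><a href ="(.*?)">(.*?)</a></dd>', index_html)

def fetch_chapter(chapter_url):        # Step 3: fetch one chapter's page
    return requests.get(chapter_url).text

def save_chapter(f, title, content):   # Step 4: write the title and content to an open file
    f.write(title + '\n' + content + '\n\n')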
Preparation
Language: Python 3.x
Libraries: requests (third-party) and re (from the standard library)
Browser: Chrome
IDE: PyCharm or VS Code
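
Of these, only requests needs to be installed (pip install requests); re ships with Python. Before writing the full script, a quick check that the environment works and that the index page used below is reachable might look like this:

import requests

# A status code of 200 means the novel's index page is reachable
resp = requests.get('http://www.biqukan.com/1_1094/')
print(resp.status_code)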

Two ways of writing the scraper are introduced below.

Method 1: procedural
import requests 
import re 

# Get the URL of every chapter from the novel's index page
url = 'http://www.biqukan.com/1_1094/'
req = requests.get(url).text
webs = re.findall(r'<dd><a href ="(.*?)">(.*?)</a></dd>',req)[12:]  # skip the first 12 "latest chapter" links, which duplicate entries in the full list

# Open (or create) the output file
f = open('一念永恒1.txt','w', encoding='utf-8')

# Build a full, reachable URL and the title for each chapter
for web in webs:
    novel_title = web[1]
    novel_urls = web[0]
    if 'http' not in novel_urls:          # relative link: prepend the site domain
        novel_urls = 'http://www.biqukan.com%s' % web[0]

    # Fetch the chapter's content and strip markup and boilerplate
    html = requests.get(novel_urls).text
    novel_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>',html)[0]
    novel_content = novel_content.replace('&nbsp;','')
    novel_content = novel_content.replace('<br />','')
    novel_content = novel_content.replace('【感谢大家一直以来的支持,这次起-点515粉丝节的作家荣耀堂和作品总选举,希望都能支持一把。另外粉丝节还有些红包礼包的,领一领,把订阅继续下去!','')
    novel_content = novel_content.replace('请记住本书首发域名:www.biqukan.com。笔趣阁手机版阅读网址:m.biqukan.com','')
    novel_content = novel_content.replace(novel_urls,'')

    # Write the title and content to the file
    f.write(novel_title + '\n')
    f.write(novel_content + '\n\n')

f.close()
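
The script above assumes the site answers plain requests and that its HTML still matches the regular expressions. If the default client is blocked or a request hangs, a common variation (sketched here, not part of the original script) is to send a browser-like User-Agent header, set a timeout, and catch request errors:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}   # example value only; identifies the client as a browser
try:
    resp = requests.get('http://www.biqukan.com/1_1094/', headers=headers, timeout=10)
    resp.raise_for_status()               # raise an exception on HTTP errors such as 404 or 503
    req = resp.text
except requests.RequestException as e:
    print('request failed:', e)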

Method 2: object-oriented
import requests
import re

# Class that downloads a novel
class NovelDownloader(object):

    def __init__(self, url):
        self.url = url

    # 1. Get the URL and title of every chapter
    def get_urls(self):
        req = requests.get(self.url).text
        novel_urls = re.findall(r'<dd><a href ="(.*?)">(.*?)</a></dd>',req)[12:]  # skip the first 12 "latest chapter" links
        return novel_urls

    # 2. Fetch and clean one chapter's content
    def get_content(self, novel_url):
        html = requests.get(novel_url).text
        novel_content = re.findall(r'<div id="content" class="showtxt">(.*?)</div>',html)[0]
        novel_content = novel_content.replace('&nbsp;','')
        novel_content = novel_content.replace('<br />','')
        novel_content = novel_content.replace('【感谢大家一直以来的支持,这次起-点515粉丝节的作家荣耀堂和作品总选举,希望都能支持一把。另外粉丝节还有些红包礼包的,领一领,把订阅继续下去!','')
        novel_content = novel_content.replace('请记住本书首发域名:www.biqukan.com。笔趣阁手机版阅读网址:m.biqukan.com','')
        novel_content = novel_content.replace(novel_url,'')
        return novel_content

    # 3. Write every chapter title and its content to the file
    def write_novel(self):
        chapters = self.get_urls()
        with open('一念永恒.txt','w', encoding='utf-8') as f:
            for web in chapters:
                novel_title = web[1]
                novel_url = web[0]
                if 'http' not in novel_url:          # relative link: prepend the site domain
                    novel_url = 'http://www.biqukan.com%s' % web[0]
                novel_content = self.get_content(novel_url)
                print(novel_url)
                f.write(novel_title + '\n')
                f.write(novel_content + '\n\n')

# Example usage
if __name__ == '__main__':
    url = 'http://www.biqukan.com/1_1094/'
    downloader = NovelDownloader(url)
    downloader.write_novel()

There is also a function-based approach, which is not shown here. The object-oriented version looks more complex, but it pays off later in maintenance and extension, so it is worth taking the time to master.
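
As a small illustration of that maintainability point (a hypothetical sketch, not part of the original script): because fetching the chapter list, extracting content, and writing the file live in separate methods, a change in the chapter-page layout only requires overriding get_content(); the div id below is an assumed example, not a real site's markup.

import requests
import re

class OtherSiteDownloader(NovelDownloader):

    # Only the content extraction changes; get_urls() and write_novel() are inherited.
    def get_content(self, novel_url):
        html = requests.get(novel_url).text
        return re.findall(r'<div id="content">(.*?)</div>', html, re.S)[0]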

