Introduction to BeautifulSoup

Li Hongwei · 2017-03-31 · last reply by Li Hongwei on 2017-04-01 · 2081 views

This post is essentially a condensed reorganization of the official Chinese documentation, restructured to make the material easier to follow.

Installation

Install BeautifulSoup (note the package name: beautifulsoup4 is the current bs4 series; plain beautifulsoup installs the obsolete 3.x release)

pip install beautifulsoup4

Install a parser

The default parser is html.parser, which ships with Python and needs no installation. Other parsers:

Parser       Usage                               Installation
lxml HTML    BeautifulSoup(markup, 'lxml')       pip install lxml
lxml XML     BeautifulSoup(markup, 'xml')        pip install lxml
html5lib     BeautifulSoup(markup, 'html5lib')   pip install html5lib
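The choice of parser matters for invalid markup, because each parser repairs broken HTML differently. A minimal sketch of the divergence, runnable with only the bundled html.parser (the html5lib output in the comment is taken from the official docs):

```python
from bs4 import BeautifulSoup

# Broken fragment: <a> is never closed, and </p> was never opened.
broken = "<a></p>"

# html.parser keeps the fragment minimal and simply drops the stray </p>.
print(BeautifulSoup(broken, "html.parser"))
# <a></a>

# html5lib, by contrast, repairs the document the way a browser would:
# BeautifulSoup(broken, "html5lib") produces
# <html><head></head><body><a></a></body></html>
```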

Objects

BeautifulSoup

BeautifulSoup parses an HTML document into a tree; the BeautifulSoup object acts as the root node of that tree.

from bs4 import BeautifulSoup

markup = """
<html>
<head><title>sisters</title></head>
<body>
<div class="sister">Elise</div>
</body>
</html>
"""

soup = BeautifulSoup(markup, 'html5lib')  # parse the HTML into a soup object
type(soup)
# <class 'bs4.BeautifulSoup'>
print(soup.prettify())  # print the document indented in standard format

Tag

A Tag object corresponds to a tag in the original HTML or XML document. Its name attribute gives the tag name, and attrs gives the tag's attributes as a dictionary.

from bs4 import BeautifulSoup

markup = '<p class="section" id="test">test content</p>'
soup = BeautifulSoup(markup, 'html5lib')
tag = soup.p
type(tag)
# <class 'bs4.element.Tag'>
print(tag.name)
# p
print(tag.attrs)
# {'class': ['section'], 'id': 'test'}
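Beyond tag.attrs, individual attributes can be read with dict-style indexing. A small sketch (html.parser is used here so the example needs no extra install; note that bs4 treats class as a multi-valued attribute and returns it as a list):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="section" id="test">test content</p>', 'html.parser')
tag = soup.p

print(tag['id'])           # dict-style access: test
print(tag['class'])        # class is multi-valued: ['section']
print(tag.get('missing'))  # .get() returns None instead of raising KeyError
```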

Strings are usually contained inside a Tag; BeautifulSoup wraps them in the NavigableString class.

print(tag.string)
# test content
type(tag.string)
# <class 'bs4.element.NavigableString'>

Comment

A Comment object is a special kind of NavigableString.

from bs4 import BeautifulSoup

markup = '<b><!--comment string--></b>'
soup = BeautifulSoup(markup, 'html5lib')
comment = soup.b.string
print(comment)
# comment string
type(comment)
# <class 'bs4.element.Comment'>

Traversal

Getting a tag by name

If a tag's name is unique in the document, the tag can be fetched directly as an attribute of the soup; if several tags share a name, the first one in document order is returned. Attribute access can be chained to reach nested tags.

from bs4 import BeautifulSoup

markup = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(markup, 'html5lib')
print(soup.head)
# <head><title>The Dormouse's story</title></head>
print(soup.a)
# <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
print(soup.body.b)
# <b>The Dormouse's story</b>

Child nodes

tag.contents returns the tag's direct children as a list (strings do not have a contents attribute)
tag.children returns the direct children as a generator
tag.descendants returns all descendants as a generator (depth-first traversal)
tag.string returns the tag's string when the tag contains exactly one string node; with multiple strings it returns None
tag.strings returns all strings under the tag as a generator
tag.stripped_strings is the same as strings, but with whitespace stripped and whitespace-only strings skipped

head = soup.head
print(head.contents)
# [<title>The Dormouse's story</title>]
title = soup.title
print(title.contents)
# ["The Dormouse's story"]

for child in head.children:
    print(child)
# <title>The Dormouse's story</title>

for descendant in head.descendants:
    print(descendant)
# <title>The Dormouse's story</title>
# The Dormouse's story

print(title.string)
# The Dormouse's story

print(soup.body.string)
# None

for string in soup.body.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'

for string in soup.body.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
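When you just want all the text under a tag as one string rather than a generator of pieces, bs4 also provides get_text(), which concatenates every string under the tag and accepts an optional separator and strip flag. A small sketch (using html.parser so it runs without extra installs):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b>!</p>', 'html.parser')

print(soup.p.get_text())                 # Hello world!
print(soup.p.get_text('|', strip=True))  # Hello|world|!
```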

Parent nodes

tag.parent returns the tag's parent node
tag.parents iterates over all of the tag's ancestors

title = soup.title
print(title.parent)
# <head><title>The Dormouse's story</title></head>
print(soup.parent)
# None

title_string = title.string
for parent in title_string.parents:
    print(parent.name)
# title
# head
# html
# [document]

Sibling nodes

tag.next_sibling returns the next sibling node
tag.previous_sibling returns the previous sibling node
tag.next_siblings iterates over all following siblings
tag.previous_siblings iterates over all preceding siblings

from bs4 import BeautifulSoup

markup = """<a><b>text1</b><c>text2</c><d>text3</d></a>"""

soup = BeautifulSoup(markup, 'html5lib')
b = soup.b
print(b.next_sibling)
# <c>text2</c>
for sibling in b.next_siblings:
    print(sibling)
# <c>text2</c>
# <d>text3</d>

d = soup.d
print(d.previous_sibling)
# <c>text2</c>
for sibling in d.previous_siblings:
    print(sibling)
# <c>text2</c>
# <b>text1</b>

Back and forth

tag.next_element returns the next node in a depth-first traversal of the tree
tag.previous_element returns the previous node in that traversal
tag.next_elements iterates over all following nodes
tag.previous_elements iterates over all preceding nodes

title = soup.title
print(title.next_element)
# The Dormouse's story

link = soup.find('a', id='link3')
print(link)
# <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
print(repr(link.previous_element))
# ' and\n'
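next_element is easy to confuse with next_sibling: siblings sit at the same level of the tree, while next_element is simply the next node in depth-first order and may be the current tag's own child. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a><b>text1</b><c>text2</c></a>', 'html.parser')
b = soup.b

print(b.next_sibling)        # <c>text2</c> -- next node at the same level
print(repr(b.next_element))  # 'text1'      -- depth-first: <b>'s own child string
```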

Searching

Filters

Filters appear throughout the search API: they can be used as the name argument, in tag attributes, in strings, or in combination. A filter can be any of the following types:

  • A string (matches tags whose name equals the string exactly; find returns the first match in document order)
from bs4 import BeautifulSoup

markup = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(markup, 'html5lib')
print(soup.find('b'))
# <b>The Dormouse's story</b>
  • A regular expression (matches tag names against the pattern)
import re

pattern = re.compile(r'b')
for tag in soup.find_all(pattern):
    print(tag.name)
# body
# b
  • A list (matches any tag whose name is in the list)
print(soup.find_all(['a', 'b']))
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
  • True (matches every tag in the document; string nodes are not returned)
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p
  • A function (used as a boolean filter; it takes a single tag argument and returns True to indicate a match)
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

find_all()

find_all(name, attrs, recursive, text, **kwargs)

  • name (find tags by name; accepts any of the filter types above)
print(soup.find_all('title'))
# [<title>The Dormouse's story</title>]
  • recursive (pass recursive=False to search only the tag's direct children instead of all descendants)
print(soup.html.find_all('title'))
# [<title>The Dormouse's story</title>]
print(soup.html.find_all('title', recursive=False))
# []
  • text (search the document's string content; accepts the same filter types as name)
print(soup.find_all(text='Elsie'))
# ['Elsie']
print(soup.find_all(text=['Tillie', 'Elsie', 'Lacie']))
# ['Elsie', 'Lacie', 'Tillie']
print(soup.find_all(text=re.compile('Dormouse')))
# ["The Dormouse's story", "The Dormouse's story"]
  • limit (cap the number of results returned)
print(soup.find_all('a', limit=2))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
  • **kwargs
print(soup.find_all(id='link2'))  # search every tag's id attribute for a match
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

print(soup.find_all(href=re.compile('elsie')))  # search every tag's href attribute for a match
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

print(soup.find_all('a', id='link1', href=re.compile('elsie')))  # multiple keyword arguments can be combined
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

print(soup.find_all('a', class_='sister', id='link2'))  # class is a Python keyword, so search the class attribute as class_
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup = BeautifulSoup('<a data-href="http://example.com/elsie">Elise</a>', 'html5lib')
soup.find_all(data-href=re.compile('elsie'))  # some attribute names cannot be used as keyword arguments; pass them via the attrs dict instead
# SyntaxError: keyword can't be an expression
print(soup.find_all(attrs={'data-href': re.compile('elsie')}))
# [<a data-href="http://example.com/elsie">Elise</a>]

find()

find is nearly identical to find_all; the only difference is that find_all returns every matching tag, while find returns only the first match. Both search all descendants of the current node, effectively iterating over the .descendants attribute.

print(soup.find_all('title', limit=1))  # find() and find_all(..., limit=1) locate the same element
# [<title>The Dormouse's story</title>]

print(soup.find('title'))
# <title>The Dormouse's story</title>
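One practical consequence of this difference: when nothing matches, find_all returns an empty list while find returns None, so code that chains off find needs a guard. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hello</p>', 'html.parser')

print(soup.find_all('table'))  # no match: []
print(soup.find('table'))      # no match: None

table = soup.find('table')
if table is not None:          # guard before touching .name, .attrs, etc.
    print(table.name)
```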

find_parents() && find_parent()

find_parents and find_parent search the current node's ancestors, iterating over the .parents attribute.

a_string = soup.find(text='Lacie')
print(a_string)
# Lacie

print(a_string.find_parents('a'))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

print(a_string.find_parent('p'))
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>

print(a_string.find_parents('p', class_='title'))
# []

find_next_siblings() && find_next_sibling()

find_next_siblings and find_next_sibling search the siblings that follow the current node, iterating over the .next_siblings attribute.

first_link = soup.a
print(first_link)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(first_link.find_next_siblings('a'))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

first_story_paragraph = soup.find('p', class_='story')
print(first_story_paragraph.find_next_sibling('p'))
# <p class="story">...</p>

find_previous_siblings() && find_previous_sibling()

find_previous_siblings and find_previous_sibling search the siblings that precede the current node, iterating over the .previous_siblings attribute.

last_link = soup.find('a', id='link3')
print(last_link)
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

print(last_link.find_previous_siblings('a'))
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

first_story_paragraph = soup.find('p', class_='story')
print(first_story_paragraph.find_previous_sibling('p'))
# <p class="title"><b>The Dormouse's story</b></p>

find_all_next() && find_next()

find_all_next and find_next search the nodes that come after the current node, iterating over the .next_elements attribute.

first_link = soup.a
print(first_link)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(first_link.find_all_next(text=True))
# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
#  ';\nand they lived at the bottom of a well.', '\n\n', '...', '\n']

print(first_link.find_next('p'))
# <p class="story">...</p>

find_all_previous() && find_previous()

find_all_previous and find_previous search the nodes that come before the current node, iterating over the .previous_elements attribute.

first_link = soup.a
print(first_link)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(first_link.find_all_previous('p'))
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]

print(first_link.find_previous('title'))
# <title>The Dormouse's story</title>