Web Crawler Notes [8]: Parsing Web Page Content with the BeautifulSoup Library

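Beautiful Soup is a Python library for extracting data from HTML or XML files. Working with the parser of your choice, it lets you navigate, search, and modify the document's parse tree. Beautiful Soup can save you hours or even days of work.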

Parsing HTML with the html.parser parser and extracting content

Let's start with a simple example of parsing a web page with BeautifulSoup. In this example, BeautifulSoup builds an object from the HTML document.

from bs4 import BeautifulSoup

htmlDoc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
bs = BeautifulSoup(htmlDoc, 'html.parser')
print(type(bs))
print(bs)
Output:

<class 'bs4.BeautifulSoup'>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In the example above, we created a BeautifulSoup object named bs and specified "html.parser" as the document parser.
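
The parser name is just the second argument to the constructor, so it can be swapped. The sketch below assumes you would prefer lxml when it is installed and fall back to the standard-library html.parser otherwise; make_soup is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
title_text = soup.b.string
print(type(soup).__name__, title_text)
```

Different parsers may repair broken markup differently, so it is safer to name the parser explicitly than to rely on whichever one happens to be installed.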

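Common objects in BeautifulSoup

To parse a document with BeautifulSoup, you first create a BeautifulSoup object. It is a complex tree structure with many methods for searching and modifying the document. The commonly used objects are:

  • BeautifulSoup object: represents the entire document; it behaves much like a Tag object;
  • Tag object: corresponds to a tag in the original XML or HTML document. A Tag has many methods and attributes; the most important are:
    • name: the tag name, for example body or a
    • attrs: the tag's attributes, as a dictionary
    • in fact, the BeautifulSoup object that represents an XML or HTML document is itself a special kind of Tag
  • NavigableString object: the text enclosed in a tag:
    • obtained through the tag's .string attribute
    • .string can reach through several levels of nested tags, as long as each tag has exactly one child
  • Comment object: a special type of NavigableString
# BeautifulSoup example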

from bs4 import BeautifulSoup

htmlDoc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
</html>
"""

bsoup = BeautifulSoup(htmlDoc, 'html.parser')
print('---'*20)

# Use Tag objects to get tags and their contents
print(bsoup.head)
print(bsoup.title)
print(bsoup.body)
print('---'*20)

# Get the first a tag
tagA = bsoup.a
print(type(tagA))
print(tagA)
print('---'*20)

# Get all a tags (find_all is the current name; findAll is a legacy alias)
a = bsoup.find_all(name=bsoup.a.name)
print(a)
print('---'*20)

# NavigableString object
ns = bsoup.p.string
print(ns)
print(type(ns))
print('---'*20)

# Comment object
cmt = bsoup.div.string
print(cmt)
print(type(cmt))
Output:

------------------------------------------------------------
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
------------------------------------------------------------
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
------------------------------------------------------------
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
------------------------------------------------------------
The Dormouse's story
<class 'bs4.element.NavigableString'>
------------------------------------------------------------
 This is a comment 
<class 'bs4.element.Comment'>
------------------------------------------------------------
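
One detail the run above does not show is what .string returns for a tag with several children. The following is a minimal sketch on a shortened document (the markup is made up for illustration): .string gives None in that case, while get_text() joins all the text inside the tag. It also checks that a Comment really is a NavigableString.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

doc = ('<p class="story">Once <a id="link1">Elsie</a> and '
       '<a id="link2">Lacie</a></p><div><!-- note --></div>')
soup = BeautifulSoup(doc, "html.parser")

multi = soup.p.string        # None: <p> has more than one child
joined = soup.p.get_text()   # concatenates every string inside <p>

comment = soup.div.string    # the only child of <div> is a comment
is_comment = isinstance(comment, Comment)
is_nav = isinstance(comment, NavigableString)

print(multi, repr(joined), is_comment, is_nav)
```

Because a comment is returned by .string like ordinary text, checking the type is the reliable way to tell the two apart.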

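Getting tag attributes

The name, attrs, and other attributes of a Tag object make it very convenient to retrieve this information. If several tags share the same name, accessing a tag by name (for example bsoup.a) returns only the first one in document order.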

from bs4 import BeautifulSoup

htmlDoc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
</html>
"""
bsoup = BeautifulSoup(htmlDoc, 'html.parser')
tagA = bsoup.a
print(tagA.name)
print(tagA.attrs)
print(tagA.text)
print('---'*20)
print(tagA.parent)
Output:

a
{'id': 'link1', 'href': 'http://example.com/elsie', 'class': ['sister']}
Elsie
------------------------------------------------------------
<p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
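
Individual attributes can also be read directly from the tag. This is a small sketch on a one-tag document (the extra class value big is added only for illustration): indexing works like a dictionary, get() returns None for a missing attribute instead of raising KeyError, and class comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

doc = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(doc, "html.parser")
tag = soup.a

href = tag["href"]          # dictionary-style access; KeyError if absent
classes = tag["class"]      # multi-valued attribute, returned as a list
missing = tag.get("title")  # None, because <a> has no title attribute
has_id = tag.has_attr("id")

print(href, classes, missing, has_id)
```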

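Traversing the document's tags

BeautifulSoup offers three ways to traverse Tag objects, each with its own attributes:

  • Downward traversal (from root to leaves)
    • contents: a list holding all direct children of the tag;
    • children: an iterator over the direct children; like .contents, it is used to loop over child nodes;
    • descendants: an iterator over all descendant nodes, used to loop over them.
  • Upward traversal (from leaves to root)
    • parent: the node's parent tag;
    • parents: an iterator over the node's ancestors, used to loop over them.
  • Sideways traversal (among siblings under the same parent)
    • next_sibling: the next sibling node in document order (it may be a string rather than a tag);
    • previous_sibling: the previous sibling node in document order;
    • next_siblings: an iterator over all following sibling nodes in document order;
    • previous_siblings: an iterator over all preceding sibling nodes in document order.
"""BeautifulSoup tag traversal example"""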
from bs4 import BeautifulSoup

htmlDoc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
</html>
"""
bsoup = BeautifulSoup(htmlDoc,'html.parser')
print(bsoup.head)

# Attributes for downward traversal
print(bsoup.head.contents)
print('---'*20)
print(bsoup.body.contents)
print('---'*20)
print(bsoup.body.children)
print([_ for _ in bsoup.body.children])
print('---'*20)
print(bsoup.body.descendants)
print([_.name for _ in bsoup.body.descendants])

# Attributes for upward traversal
print('---'*20)
print(bsoup.b)
print(bsoup.b.parent)
print('---'*20)
print([_ for _ in bsoup.b.parents])

# Attributes for sideways traversal
print('---'*20)
print(bsoup.a)
print(bsoup.a.next_sibling)
print(bsoup.a.previous_sibling)
print('---'*20)
print([_.name for _ in bsoup.a.next_siblings])
print([_.name for _ in bsoup.a.previous_siblings])
Output:

<head><title>The Dormouse's story</title></head>
[<title>The Dormouse's story</title>]
------------------------------------------------------------
['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n', <div><!-- This is a comment --></div>, '\n']
------------------------------------------------------------
<list_iterator object at 0x000001EFE086DDD8>
['\n', <p class="title"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n', <div><!-- This is a comment --></div>, '\n']
------------------------------------------------------------
<generator object descendants at 0x000001EFDF5E1C50>
[None, 'p', 'b', None, None, 'p', None, 'a', None, None, 'a', None, None, 'a', None, None, None, 'p', None, None, 'div', None, None]
------------------------------------------------------------
<b>The Dormouse's story</b>
<p class="title"><b>The Dormouse's story</b></p>
------------------------------------------------------------
[<p class="title"><b>The Dormouse's story</b></p>, <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
<div><!-- This is a comment --></div>
</body>, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
</html>, 
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
        <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
<p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
</html>
]
------------------------------------------------------------
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
,
        
Once upon a time there were three little sisters; and their names were
        
------------------------------------------------------------
[None, 'a', None, 'a', None]
[None]
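
The None entries in the name lists above come from text nodes (mostly newlines) that sit between the tags. If only tags are wanted, one option is to filter by type, or to use find_next_sibling(), which skips strings. A minimal sketch on a shortened document made up for illustration:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = """<body>
<p class="title">T</p>
<p class="story">S</p>
<div></div>
</body>"""
soup = BeautifulSoup(doc, "html.parser")

all_children = list(soup.body.children)  # tags and '\n' strings mixed
tag_children = [c.name for c in soup.body.children if isinstance(c, Tag)]

raw_next = soup.p.next_sibling           # the '\n' after the first <p>
next_tag = soup.p.find_next_sibling()    # the next sibling that is a tag

print(len(all_children), tag_children, repr(raw_next), next_tag)
```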

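Using BeautifulSoup to display the document structure more clearly

The prettify() method renders the HTML or XML structure with one node per line and indentation by depth, which makes the document easier to read.

"""BeautifulSoup prettify example"""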
from bs4 import BeautifulSoup

htmlDoc = """
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p><p class="story">...</p><div><!-- This is a comment --></div></body></html>
"""

bsoup = BeautifulSoup(htmlDoc,'html.parser')
print(bsoup.prettify())
Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
  <div>
   <!-- This is a comment -->
  </div>
 </body>
</html>
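
For comparison, str() serializes the tree without adding any whitespace, while prettify() inserts line breaks and indentation. A small sketch on a one-line fragment:

```python
from bs4 import BeautifulSoup

compact = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(compact, "html.parser")

as_parsed = str(soup)     # serialized as parsed, no extra whitespace
pretty = soup.prettify()  # line breaks and indentation added

print(as_parsed)
print(pretty)
```

Since prettify() adds whitespace around text, it is best used for reading and debugging rather than for producing markup that will be parsed again.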

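BeautifulSoup methods for searching the document

  • find(): takes the same filters as find_all(), but returns only the first match, or None if nothing matches
  • find_all(name, attrs, recursive, string, **kwargs)
    • purpose: finds all tags that match the conditions and returns them as a list
    • name: the tag name to search for (a string, a list, a regular expression, or True)
# BeautifulSoup search example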
from bs4 import BeautifulSoup
import re

htmlDoc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
<div><!-- This is a comment --></div>
</body>
</html>
"""
bsoup = BeautifulSoup(htmlDoc,'html.parser')

# find_all with the tag name as the filter
print([(_.name, _.string) for _ in bsoup.find_all('a')])
print('---'*20)

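# Note: the second positional argument is attrs; a string there is matched
# against the CSS class, so this looks for <a class="b"> and returns [].
# To match either tag name, pass a list: bsoup.find_all(['a', 'b']).
print([(_.name, _.string) for _ in bsoup.find_all('a','b')])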
print('---'*20)

print([(_.name, _.string) for _ in bsoup.find_all(True)])
print('---'*20)

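# A regular expression is matched from the start of the tag name,
# so this finds names that begin with "b" (body and b).
print([(_.name, _.string) for _ in bsoup.find_all(re.compile('b',re.I))])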
print('---'*20)

# find_all with both the tag name and attrs (here the CSS class) as filters
print([(_.name, _.string) for _ in bsoup.find_all('a','sister')])
print('---'*20)

# find_all with the id attribute as the filter
print([(_.name, _.string) for _ in bsoup.find_all(id = re.compile('link',re.I))])
print('---'*20)


Output:

[('a', 'Elsie'), ('a', 'Lacie'), ('a', 'Tillie')]
------------------------------------------------------------
[]
------------------------------------------------------------
[('html', None), ('head', "The Dormouse's story"), ('title', "The Dormouse's story"), ('body', None), ('p', "The Dormouse's story"), ('b', "The Dormouse's story"), ('p', None), ('a', 'Elsie'), ('a', 'Lacie'), ('a', 'Tillie'), ('p', '...'), ('div', ' This is a comment ')]
------------------------------------------------------------
[('body', None), ('b', "The Dormouse's story")]
------------------------------------------------------------
[('a', 'Elsie'), ('a', 'Lacie'), ('a', 'Tillie')]
------------------------------------------------------------
[('a', 'Elsie'), ('a', 'Lacie'), ('a', 'Tillie')]
------------------------------------------------------------
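
A few more search variants, as a sketch on a shortened version of the same document: find() for a single result, a list of names to match either tag, limit to cap the number of results, class_ and attrs for attribute filters, and a regular expression on href.

```python
import re
from bs4 import BeautifulSoup

doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

first = soup.find("a")                               # first match only
absent = soup.find("table")                          # None when nothing matches
names = [t.name for t in soup.find_all(["a", "b"])]  # match either tag name
two = soup.find_all("a", limit=2)                    # stop after two results
by_class = soup.find_all("a", class_="sister")       # class_ avoids the keyword
by_attrs = soup.find_all("a", attrs={"id": "link2"})
by_href = soup.find_all(href=re.compile(r"/(elsie|tillie)$"))

print(first["id"], absent, names)
print([t["id"] for t in two], len(by_class))
print([t.string for t in by_attrs], [t["id"] for t in by_href])
```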