Introduction to Beautiful Soup


Preface

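Beautiful Soup is a Python library for extracting data from HTML or XML files. It parses the markup into a tree structure, which makes it easy to retrieve the attributes of a given tag. A tag's class or id value can be passed as an argument to fetch that tag's data directly. In short, it turns HTML or XML source into a structured form so that its nodes, tags, and attributes can be processed further.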
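
As a minimal sketch of the class and id lookup described above, the snippet below parses a small HTML string and fetches tags by id and by class. The markup, the id "main", and the class "item" are made up for illustration, and the snippet assumes beautifulsoup4 is already installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
doc = """
<div id="main">
  <p class="item">first</p>
  <p class="item">second</p>
  <p class="other">third</p>
</div>
"""

soup = BeautifulSoup(doc, "html.parser")

# Look up a single tag by its id attribute
main = soup.find(id="main")

# Look up every tag with a given class; class_ avoids the Python keyword "class"
items = soup.find_all(class_="item")
item_texts = [tag.get_text() for tag in items]

print(main.name)
print(item_texts)
```

find returns the first match (or None when nothing matches), while find_all always returns a list-like result, so the two are not interchangeable.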

1. Installing and importing Beautiful Soup

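# Install (run in a shell)
pip install beautifulsoup4
# Check that the installation succeeded (run in a shell)
pip list
# In Python: import from the beautifulsoup4 package, whose module name is bs4
from bs4 import BeautifulSoup
# bs4 can be thought of as a library for parsing, traversing, and maintaining the "tag tree"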
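
Besides pip list, the installation can also be checked from Python itself. The following is only a quick sanity check, not something from the original post: it prints the installed version and parses a trivial document.

```python
import bs4
from bs4 import BeautifulSoup

# Version string of the installed bs4 package
version = bs4.__version__
print(version)

# Parsing a trivial document confirms that the import actually works
soup = BeautifulSoup("<p>ok</p>", "html.parser")
text = soup.p.get_text()
print(text)
```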

2. Using Beautiful Soup

# Import the bs4 module
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body> <p class='title'><b>The Dormouse's story</b></p>
<p class='story'>Once upon a time there were three little sisters;
and their names were <a class='sister' id='link1'>Elsie</a>,
<a class='sister' id='link2'>Lacie</a> and <a class='sister' id='link3'>Tillie</a>;
and they lived at the bottom of a well.</p> <p class='story'>...</p> </body></html>
"""
# html: the content to parse, i.e. the page source as a string
# HTML parser: html.parser
soup = BeautifulSoup(html, 'html.parser')
# Print the document re-indented so that its structure is easy to read
print(soup.prettify())
# Print the title tag
print(soup.title)
# Print the name of the title tag
print(soup.title.name)
# Print the first p tag
print(soup.p)
# Find all p tags
print(soup.find_all('p'))
# Print the class attribute of the first p tag (returned as a list)
print(soup.p['class'])
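
Building on the calls in this section, the sketch below loops over the a tags of a small fragment and reads their attributes and text. The fragment is self-contained and hypothetical: the href values are placeholders chosen for illustration, not URLs from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the href values are placeholders for illustration
fragment = """
<p class="story">
  <a href="http://example.com/a" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/b" class="sister" id="link2">Lacie</a>
</p>
"""

soup = BeautifulSoup(fragment, "html.parser")

links = []
for a in soup.find_all("a"):
    # Tag.get returns None instead of raising KeyError when the attribute is absent
    links.append((a.get("id"), a.get("href"), a.get_text()))

# select() accepts a CSS selector; here: a tags with class "sister"
selected_ids = [a["id"] for a in soup.select("a.sister")]

# .attrs exposes all attributes of a tag as a dict
first_attrs = soup.a.attrs

print(links)
print(selected_ids)
print(first_attrs)
```

Indexing with a["id"] raises KeyError when the attribute is missing, so a.get("id") is the safer choice when the markup is not under your control.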

Note: for more about Beautiful Soup, see the reference articles below.

References:
Beautiful Soup Tutorial.
Getting Started with the Beautiful Soup Library (bs4).
A Brief Introduction to the Python bs4 Library.