Learning the beautifulsoup module



Installation: pip3 install beautifulsoup4

from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>Title</title>
</head>
<body>
<a href="dfghrt">Chat on the internet</a>
<p>I have received a e-mail from an old friend yesterday. she asked how my summer holiday is.had I ever gone tosomewhere for pleasure ! and also ,told me that she wanted to go around for fun ,but ,unfortunately,there is notime! <a href="dref">what a pity !</a>so ,if I would get to work, how will my life be? what type of jobs shouldI choose? uh, maybe I think a lot! how time flies! I have reached school for almost half a month.came to readarticles day by day.</p>
<hr>
<div class="story"><p id="we11">Homely &comely appearance! want to make a beating plan! buy some herbs, face-mask and so on . surfing theinternet ,I chance meet a stranger ,he has a beatiful net-mane,it give me a good sence,so I made him into myfriends-list,and then i found ,that he has doctor degree. and he is of great knowlege! uh ,and a confident man! butunluckly ,he divorced .maybe in our country ,divorcement is normal .but I feel unsafety. I must get to listen "4+1"oral english now ,continue it later!</p></div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features='html.parser')

1. name: the tag name

tag = soup.find(name='a')
print(tag,tag.name)

Output: <a href="dfghrt">Chat on the internet</a> a

2. attrs: the tag's attributes

print(tag.attrs)
tag.attrs = {'k1': '123'}   # replace all attributes at once
tag.attrs['k2'] = '456'     # add one attribute
print(soup)

Partial output:

<body>
<a k1="123" k2="456">Chat on the internet</a>
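Besides reassigning tag.attrs, a tag can be indexed like a dictionary. The sketch below uses a made-up one-line document (not the html_doc above) to show reading, adding and deleting single attributes; note that class comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
soup = BeautifulSoup('<a href="dfghrt" class="x y">Chat</a>', "html.parser")
tag = soup.find("a")

attrs_before = dict(tag.attrs)
print(attrs_before)      # {'href': 'dfghrt', 'class': ['x', 'y']}

tag["k2"] = "456"        # add (or overwrite) a single attribute
del tag["href"]          # delete a single attribute

result = str(tag)
print(result)            # <a class="x y" k2="456">Chat</a>
```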

3. children: all direct children

body = soup.find('body')
l1 = body.children            # an iterator
print(list(l1))

The children include the newline text nodes that sit between the tags.
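If only the tags are wanted, one common approach is to filter the children by type. This is a minimal sketch on a made-up document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: the newlines between tags become text nodes
html = '<body>\n<a href="x">link</a>\n<p>text</p>\n</body>'
soup = BeautifulSoup(html, features="html.parser")
body = soup.find("body")

all_children = list(body.children)                        # tags and text nodes
tag_names = [c.name for c in all_children if isinstance(c, Tag)]

print(len(all_children))   # 5: three newline strings plus <a> and <p>
print(tag_names)           # ['a', 'p']
```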

4. descendants: all descendants (children, grandchildren and so on)

l2 = body.descendants
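The difference from children is easiest to see on a nested fragment. The sketch below assumes a small made-up document: children stops at the first level, while descendants walks the whole subtree, text nodes included.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical nested fragment for illustration
soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = soup.find("div")

children = list(div.children)         # only <p>
descendants = list(div.descendants)   # <p>, 'Hello ', <b>, 'world'
descendant_tags = [d.name for d in descendants if isinstance(d, Tag)]

print(len(children), len(descendants))  # 1 4
print(descendant_tags)                  # ['p', 'b']
```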

5. clear: empty a tag's contents while keeping the tag itself

tag = soup.find('body')
tag.clear()
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>
<body></body>
</html>

6. decompose: recursively destroy everything inside a tag and remove the selected tag itself as well

body = soup.find('body')
body.decompose()
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>

</html>

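7. extract: remove a tag together with everything inside it and return the removed tag (the removal works as above, so this behaves like a cut)

body = soup.find('body')
v = body.extract()
print(v)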
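To contrast the two removal methods, here is a small sketch on a made-up fragment: the tag returned by extract is still usable afterwards, whereas decompose simply destroys the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
soup = BeautifulSoup('<div><p id="a">one</p><p id="b">two</p></div>', "html.parser")

removed = soup.find("p", id="a").extract()   # cut: the tag is returned
removed_text = removed.get_text()
print(removed_text)                          # one
print(removed.parent)                        # None, it is detached from the tree

soup.find("p", id="b").decompose()           # delete: nothing useful is returned
remaining = str(soup)
print(remaining)                             # <div></div>
```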

8. decode: convert to a string (including the current tag); decode_contents does the same without the current tag

body = soup.find('body')
val = body.decode()
print(type(body),type(val))

9. encode: convert to bytes (including the current tag); encode_contents does the same without the current tag

body = soup.find('body')
val = body.encode()
print(type(body), type(val))
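The four conversions can be compared side by side. A minimal sketch on a made-up paragraph, relying on the default UTF-8 encoding of encode:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
soup = BeautifulSoup("<p>Hi <b>there</b></p>", "html.parser")
p = soup.find("p")

as_str = p.decode()                  # str, includes the <p> tag itself
inner_str = p.decode_contents()      # str, only what is inside <p>
as_bytes = p.encode()                # bytes, includes the <p> tag itself
inner_bytes = p.encode_contents()    # bytes, only what is inside <p>

print(as_str)       # <p>Hi <b>there</b></p>
print(inner_str)    # Hi <b>there</b>
print(as_bytes)     # b'<p>Hi <b>there</b></p>'
print(inner_bytes)  # b'Hi <b>there</b>'
```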

10. find: get the first matching tag

tag = soup.find(name='p', attrs={'id': 'we11'}, recursive=True)

get: read an attribute of a tag

tag = soup.find('a')
val = tag.get('href')
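get behaves like dict.get, which matters when an attribute may be missing. A short sketch on a made-up link; the attribute name title is only there to demonstrate the missing case.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
soup = BeautifulSoup('<a href="dfghrt">Chat on the internet</a>', "html.parser")
tag = soup.find("a")

href = tag.get("href")              # 'dfghrt'
missing = tag.get("title")          # None, no exception
fallback = tag.get("title", "n/a")  # 'n/a'

# Indexing raises KeyError when the attribute does not exist
try:
    tag["title"]
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, raised)  # dfghrt None n/a True
```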

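11. find_all: get all matching tags, returned as a list

tags = soup.find_all(name='a', limit=1)      # limit the number of results

tags = soup.find_all(name=['a', 'div'])      # match any tag name in the list

import re
rep = re.compile('^H')
tags = soup.find_all(text=rep, limit=1)      # match on node text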
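The three variants can be tried on a smaller document where the results are easy to check. This sketch uses a made-up fragment and the string argument, which is the current name for text in recent bs4 releases.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
html = '<div><a href="1">Home</a><p>Hello</p><a href="2">About</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_only = soup.find_all("a", limit=1)                   # stops after one match
names = [t.name for t in soup.find_all(["a", "div"])]      # document order
texts = [str(s) for s in soup.find_all(string=re.compile("^H"))]

print(len(first_only))  # 1
print(names)            # ['div', 'a', 'a']
print(texts)            # ['Home', 'Hello']
```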

12. has_attr: check whether the tag has the given attribute

tag = soup.find('a')
val = tag.has_attr('href')

13. get_text: get the text inside a tag

tag = soup.find('a')
val = tag.get_text()

14. index: the position of a child within its parent tag

tag = soup.find('body')
val = tag.index(tag.find('p'))

15. is_empty_element: whether the tag is an empty or self-closing element (hr, br, input, img, meta, link, frame and so on)

tag = soup.find('hr')
val = tag.is_empty_element
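Note that a tag with no content is not automatically an empty element. A minimal sketch, assuming the html.parser backend and a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p></p> has no content but is not a void element
soup = BeautifulSoup("<div><hr><p></p><br/></div>", "html.parser")

flags = {t.name: t.is_empty_element for t in soup.div.find_all(True)}
print(flags)  # {'hr': True, 'p': False, 'br': True}
```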

16. Related tags

# tag.next
# tag.next_element
# tag.next_elements
# tag.next_sibling
# tag.next_siblings

# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings

# tag.parent
# tag.parents
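A few of these properties in action, as a sketch on a made-up fragment. The key distinction is that the _element variants follow parse order (so they can return text nodes), while the _sibling variants stay on the same level.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
soup = BeautifulSoup("<div><a>one</a><p>two</p></div>", "html.parser")
a = soup.find("a")
p = soup.find("p")

next_el = a.next_element                  # the text node 'one' inside <a>
next_sib = a.next_sibling                 # the <p> tag on the same level
prev_sib = p.previous_sibling             # back to the <a> tag
parent_names = [t.name for t in p.parents]

print(repr(str(next_el)))  # 'one'
print(next_sib)            # <p>two</p>
print(prev_sib is a)       # True
print(parent_names)        # ['div', '[document]']
```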

17. Searching among a tag's related tags

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)
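These methods take the same filters as find and find_all but search relative to a starting tag. A sketch on a made-up fragment shows how the sibling search differs from the "all next" search:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two <p> siblings of <h2>, one <p> outside the <div>
html = "<div><h2>T</h2><p>a</p><p>b</p></div><p>c</p>"
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

first_next = h2.find_next("p").get_text()
siblings = [p.get_text() for p in h2.find_next_siblings("p")]  # same level only
all_next = [p.get_text() for p in h2.find_all_next("p")]       # rest of document
parent_name = h2.find_parent("div").name

print(first_next)   # a
print(siblings)     # ['a', 'b']
print(all_next)     # ['a', 'b', 'c']
print(parent_name)  # div
```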

18. select, select_one: CSS selectors

tag = soup.select("div p")

select matches with a CSS selector and returns a list; select_one returns only the first match.
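The sketch below runs a few selectors against a made-up fragment that mirrors the story div above. It assumes a bs4 version that ships with the soupsieve selector engine (4.7 or later).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
html = '<div class="story"><p id="we11">first</p><p>second</p></div><p>outside</p>'
soup = BeautifulSoup(html, "html.parser")

in_div = [p.get_text() for p in soup.select("div p")]   # descendant selector
first = soup.select_one("div p").get_text()             # first match only
by_id = soup.select("#we11")                            # id selector, still a list
no_match = soup.select_one("span")                      # None when nothing matches

print(in_div)      # ['first', 'second']
print(first)       # first
print(len(by_id))  # 1
print(no_match)    # None
```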

19. wrap: wrap the current tag inside the given tag

from bs4.element import Tag
tag1 = Tag(name='div',attrs={'color':'red'})
tag1.string = 'newtag'
tag2 = soup.find('a')
val = tag2.wrap(tag1)
print(soup)

20. unwrap: remove the current tag but keep what it contains

tag = soup.find('div')
val = tag.unwrap()
print(soup)

val is the tag that was removed.
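wrap and unwrap are inverses of each other, which a round trip makes visible. This sketch uses a made-up fragment and soup.new_tag as an alternative way of creating the wrapper.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
soup = BeautifulSoup('<p><a href="x">link</a></p>', "html.parser")
a = soup.find("a")

wrapper = a.wrap(soup.new_tag("div"))   # wrap returns the new wrapper tag
wrapped = str(soup)
print(wrapped)                          # <p><div><a href="x">link</a></div></p>

removed = wrapper.unwrap()              # unwrap returns the tag it removed
unwrapped = str(soup)
print(unwrapped)                        # <p><a href="x">link</a></p>
print(removed.name)                     # div
```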

21. Tag content

tag = soup.find('p')
print(tag.string)
tag.string = 'newcontents'
print(soup)

# tag = soup.find('body')
# val = tag.stripped_strings
# print(next(val))
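One detail worth knowing is that string only returns a value when the tag has a single child; otherwise it is None. A sketch on a made-up fragment, together with stripped_strings:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
soup = BeautifulSoup("<div><p> one </p><p>two <b>three</b></p></div>", "html.parser")
first_p, second_p = soup.find_all("p")

single = first_p.string        # ' one ', the only child is a text node
multiple = second_p.string     # None, the tag has more than one child
stripped = list(soup.div.stripped_strings)   # all text, whitespace trimmed

print(repr(str(single)))  # ' one '
print(multiple)           # None
print(stripped)           # ['one', 'two', 'three']
```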

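22. append: add a tag at the end of the current tag's contents

insert: insert a tag at a given position inside the current tag

insert_after, insert_before: insert a tag right after or right before the current tag

replace_with: replace the current tag with another one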
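The five methods above can be sketched on a small list. The fragment and the helper make are made up for this illustration; new tags are created with soup.new_tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
soup = BeautifulSoup("<ul><li>b</li></ul>", "html.parser")
ul = soup.find("ul")

def make(name, text):
    """Illustrative helper: build a new tag holding the given text."""
    tag = soup.new_tag(name)
    tag.string = text
    return tag

ul.append(make("li", "c"))           # goes last inside <ul>
ul.insert(0, make("li", "a"))        # goes to position 0 inside <ul>
ul.insert_before(make("h1", "T"))    # placed just before <ul>
ul.insert_after(make("p", "end"))    # placed just after <ul>
ul.find_all("li")[1].replace_with(make("li", "B"))   # swap <li>b</li>

result = str(soup)
print(result)
# <h1>T</h1><ul><li>a</li><li>B</li><li>c</li></ul><p>end</p>
```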

23. CSS selectors

soup.select('ul li')

Published: 2024-02-03 02:16:17

Article link: https://www.4u4v.net/it/170689777547975.html
