Learning the beautifulsoup module

Install the module: pip3 install beautifulsoup4

from bs4 import BeautifulSoup

html_doc = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>Title</title>
</head>
<body>
<a href="dfghrt">Chat on the internet</a>
<p>I have received a e-mail from an old friend yesterday. she asked how my summer holiday is.had I ever gone tosomewhere for pleasure ! and also ,told me that she wanted to go around for fun ,but ,unfortunately,there is notime! <a href="dref">what a pity !</a>so ,if I would get to work, how will my life be? what type of jobs shouldI choose? uh, maybe I think a lot! how time flies! I have reached school for almost half a month.came to readarticles day by day.</p>
<hr>
<div class="story"><p id="we11">Homely &comely appearance! want to make a beating plan! buy some herbs, face-mask and so on . surfing theinternet ,I chance meet a stranger ,he has a beatiful net-mane,it give me a good sence,so I made him into myfriends-list,and then i found ,that he has doctor degree. and he is of great knowlege! uh ,and a confident man! butunluckly ,he divorced .maybe in our country ,divorcement is normal .but I feel unsafety. I must get to listen "4+1"oral english now ,continue it later!</p></div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features='html.parser')

1. name: the tag name

tag = soup.find(name='a')
print(tag,tag.name)

Output: <a href="dfghrt">Chat on the internet</a> a

2. attrs: tag attributes

print(tag.attrs)
tag.attrs = {'k1': '123'}   # replace all attributes at once
tag.attrs['k2'] = '456'     # add a new attribute
print(soup)

Partial output:

<body>
<a k1="123" k2="456">Chat on the internet</a>
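Besides replacing the whole attrs dictionary, a tag can also be indexed like a dictionary. The following is a minimal sketch on a shortened, made-up link (the class values "x y" are an assumption added for illustration); it reads, adds, and deletes single attributes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="dfghrt" class="x y">Chat</a>', 'html.parser')
tag = soup.find('a')

href = tag['href']        # read one attribute
classes = tag['class']    # class is multi-valued, so it comes back as a list
tag['k2'] = '456'         # add (or overwrite) one attribute
del tag['href']           # remove one attribute
print(tag)
```

Because class can hold several values, BeautifulSoup returns it as a list rather than a single string.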

3. children: all direct child nodes

body = soup.find('body')
l1 = body.children            # an iterator
print(list(l1))

The children include the newline text nodes between the tags.
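Since the newline strings show up among the children, a common follow-up is to keep only the real tags. A small sketch, assuming a shortened version of the sample document:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<a href='x'>link</a>\n<p>text</p>\n<hr>\n</body>"
soup = BeautifulSoup(html, 'html.parser')
body = soup.find('body')

all_children = list(body.children)   # tags and '\n' strings mixed together
# Keep only Tag objects; the newline strings are NavigableString instances.
tag_children = [c for c in body.children if isinstance(c, Tag)]
names = [c.name for c in tag_children]
print(names)
```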

4. descendants: all descendant nodes (children, grandchildren, and so on)

l2 = body.descendants
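The difference between children and descendants is easiest to see side by side. This sketch uses a small nested fragment (the inner b tag is an assumption added for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<body><div class="story"><p id="we11">Homely <b>and</b> comely</p></div></body>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.find('body')

# children stops at the first level below body
child_names = [c.name for c in body.children if isinstance(c, Tag)]
# descendants walks the whole subtree in document order
descendant_names = [d.name for d in body.descendants if isinstance(d, Tag)]
print(child_names, descendant_names)
```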

5. clear: empty the tag's contents but keep the tag itself

tag = soup.find('body')
tag.clear()
print(soup)

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>
<body></body>
</html> 

6. decompose: recursively delete everything inside the selected tag, together with the selected tag itself

body = soup.find('body')
body.decompose()
print(soup)

 

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Title</title>
</head>

</html>

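7. extract: recursively remove the tag and everything inside it, and return what was removed (same deletion effect as decompose; it works like a cut)

body = soup.find('body')
v = body.extract()
print(v)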
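The three removal methods in sections 5 to 7 can be compared on the same input. Note that each call below starts from a freshly parsed document: once body has been decomposed or extracted, a later soup.find('body') on the same soup returns None. The helper fresh_soup is a hypothetical name used only for this sketch.

```python
from bs4 import BeautifulSoup

HTML = '<html><head><title>Title</title></head><body><p>hello</p></body></html>'

def fresh_soup():
    """Parse the sample again so every method sees an untouched tree."""
    return BeautifulSoup(HTML, 'html.parser')

soup = fresh_soup()
soup.find('body').clear()            # body stays, its contents go
after_clear = str(soup)

soup = fresh_soup()
soup.find('body').decompose()        # body and its contents are destroyed
after_decompose = str(soup)

soup = fresh_soup()
removed = soup.find('body').extract()   # body is cut out and handed back
after_extract = str(soup)

print(after_clear)
print(after_decompose)
print(removed)
```

extract is the one to use when the removed subtree is still needed, for example to insert it somewhere else.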

8. decode: convert to a string (including the current tag); decode_contents: the same, without the current tag

body = soup.find('body')
val = body.decode()
print(type(body),type(val))

<class 'bs4.element.Tag'> <class 'str'>

9. encode: convert to bytes (including the current tag); encode_contents: the same, without the current tag

body = soup.find('body')
val = body.encode()
print(type(body), type(val))

<class 'bs4.element.Tag'> <class 'bytes'>
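To make the "with tag" versus "without tag" distinction in sections 8 and 9 concrete, here is a short sketch on a one-paragraph body:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>hi</p></body>', 'html.parser')
body = soup.find('body')

with_tag = body.decode()               # str, includes <body>...</body>
without_tag = body.decode_contents()   # str, only what is inside body
as_bytes = body.encode()               # bytes, UTF-8 unless told otherwise
inner_bytes = body.encode_contents()   # bytes, only what is inside body
print(with_tag, without_tag, as_bytes, inner_bytes)
```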

10. find: return the first matching tag

tag = soup.find(name='p', attrs={'id': 'we11'}, recursive=True)

get: read a tag attribute

tag = soup.find('a')
val = tag.get('href')
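Two details are worth knowing when combining find and get: find returns None when nothing matches, and get returns None (or a supplied default) for a missing attribute, whereas indexing with square brackets raises KeyError. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="dfghrt">Chat on the internet</a>', 'html.parser')

no_match = soup.find('table')          # no such tag in the document
tag = soup.find('a')
href = tag.get('href')
missing = tag.get('title')             # attribute absent
fallback = tag.get('title', 'n/a')     # attribute absent, default supplied

try:
    tag['title']                       # dictionary-style access is strict
    raised = False
except KeyError:
    raised = True
print(href, missing, fallback, raised)
```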

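11. find_all: return all matching tags as a list

tags = soup.find_all(name='a', limit=1)        # limit the number of results
tags = soup.find_all(name=['a', 'div'])        # match any tag name in the list
import re
rep = re.compile('^H')
tags = soup.find_all(text=rep, limit=1)        # match on node text (newer bs4 releases name this argument string)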
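find_all accepts several other kinds of filters as well. The sketch below, on a trimmed version of the sample document, shows filtering by class, by an attribute regular expression, by a function, and by node text (using the string argument, which assumes bs4 4.4 or later):

```python
import re
from bs4 import BeautifulSoup

html = '''
<a href="dfghrt">Chat on the internet</a>
<p>first <a href="dref">what a pity !</a></p>
<div class="story"><p id="we11">Homely appearance</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

hrefs = [a['href'] for a in soup.find_all('a')]
story = soup.find_all('div', class_='story')            # class_ avoids the Python keyword
d_links = soup.find_all('a', href=re.compile(r'^d'))    # attribute value matched by regex
with_id = soup.find_all(lambda t: t.has_attr('id'))     # any callable taking a tag
h_text = soup.find_all(string=re.compile(r'^H'))        # returns strings, not tags
print(hrefs, len(story), len(d_links), with_id, h_text)
```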

12. has_attr: check whether the tag has the given attribute

tag = soup.find('a')
val = tag.has_attr('href')

13. get_text: get the text inside the tag

tag = soup.find('a')
val = tag.get_text()
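get_text also takes a separator and a strip flag, which helps when the tag contains nested tags and stray whitespace. A short sketch with a made-up paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>  I have <a href="dref">what a pity !</a>  so  </p>', 'html.parser')
tag = soup.find('p')

raw = tag.get_text()                             # pieces concatenated as they are
clean = tag.get_text(separator=' ', strip=True)  # each piece stripped, then joined
print(repr(raw))
print(repr(clean))
```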

14. index: find the position of a child within the tag

tag = soup.find('body')
val = tag.index(tag.find('p'))
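The position returned by index counts every child node, including text nodes such as newlines, so it is often larger than the tag's position among tags alone. A small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body>\n<a>x</a>\n<p>y</p></body>", 'html.parser')
body = soup.find('body')

# body.contents is ['\n', <a>, '\n', <p>], so the p tag sits at position 3
idx = body.index(body.find('p'))
same_node = body.contents[idx]
print(idx, same_node)
```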

15. is_empty_element: whether the tag is a void (self-closing) element such as hr, br, input, img, meta, link, frame

tag = soup.find('hr')
val = tag.is_empty_element
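A tag that merely has no content is not reported as an empty element; only void elements are. The following sketch contrasts the two cases:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><hr><p></p><br/></div>', 'html.parser')

hr_void = soup.find('hr').is_empty_element   # void element
br_void = soup.find('br').is_empty_element   # void element
p_void = soup.find('p').is_empty_element     # no content, but p is not a void element
print(hr_void, br_void, p_void)
print(soup)
```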

16. Related nodes

# tag.next
# tag.next_element
# tag.next_elements
# tag.next_sibling
# tag.next_siblings
#
# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings
#
# tag.parent
# tag.parents
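A few of these navigation attributes in action, as a rough sketch on a shortened document. The point to notice is that the sibling and element attributes return whatever node comes next, which is frequently a newline string rather than a tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<body>\n<a href='dfghrt'>Chat</a>\n<p>text</p>\n</body>"
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

next_node = a.next_sibling        # the '\n' string right after </a>
next_elem = a.next_element        # the string inside the a tag
parent_name = a.parent.name
ancestors = [p.name for p in a.parents]   # from the nearest parent outwards
sibling_tags = [s.name for s in a.next_siblings if isinstance(s, Tag)]
print(repr(next_node), repr(next_elem), parent_name, ancestors, sibling_tags)
```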

17. Searching among a tag's related nodes

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)
#
# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)
#
# tag.find_parent(...)
# tag.find_parents(...)
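Unlike the plain navigation attributes, these search methods skip text nodes and accept the same filters as find. A sketch, assuming a trimmed copy of the sample document:

```python
from bs4 import BeautifulSoup

html = ("<body>\n<a href='dfghrt'>Chat</a>\n<p>one</p>\n<hr>\n"
        "<div class='story'><p id='we11'>two</p></div>\n</body>")
soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

next_sib = a.find_next_sibling()               # first sibling that is a tag
later_p_siblings = a.find_next_siblings('p')   # p tags on the same level only
all_later_p = a.find_all_next('p')             # every later p, at any depth
inner = soup.find(id='we11')
enclosing_div = inner.find_parent('div')       # nearest div ancestor
print(next_sib, later_p_siblings, all_later_p, enclosing_div)
```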

18. select, select_one: CSS selectors

tag = soup.select("div p")

select matches using a CSS selector and returns a list; select_one returns only the first matching object.
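A few more selector forms, as a sketch on a trimmed version of the sample document. It assumes a bs4 installation that includes its CSS selector dependency (soupsieve), which pip normally installs alongside beautifulsoup4.

```python
from bs4 import BeautifulSoup

html = '''
<a href="dfghrt">Chat</a>
<p>plain</p>
<div class="story"><p id="we11">Homely</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')

nested = soup.select('div p')               # p anywhere inside a div
by_class = soup.select('div.story > p')     # p directly under div.story
by_id = soup.select_one('#we11')            # single element, by id
by_attr = soup.select('a[href^="df"]')      # href starting with "df"
nothing = soup.select_one('ul li')          # no match
print(nested, by_class, by_id, by_attr, nothing)
```

select returns an empty list when nothing matches, while select_one returns None.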

19. wrap: wrap the current tag inside the given tag

from bs4.element import Tag
tag1 = Tag(name='div',attrs={'color':'red'})
tag1.string = 'newtag'
tag2 = soup.find('a')
val = tag2.wrap(tag1)
print(soup)
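An alternative way to build the wrapper is soup.new_tag, which creates the tag through the soup object instead of instantiating Tag directly. A minimal sketch (the class value "box" is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><a href="dfghrt">Chat</a></body>', 'html.parser')

wrapper = soup.new_tag('div')
wrapper['class'] = 'box'
a = soup.find('a')
returned = a.wrap(wrapper)     # wrap returns the wrapper, now holding the a tag
result = str(soup)
print(result)
```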

20. unwrap: remove the current tag but keep what it wraps

tag = soup.find('div')
val = tag.unwrap()
print(soup)

val is the tag that was removed.
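The effect on the tree, and what comes back from the call, can be checked with a short sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="story"><p id="we11">Homely</p></div>', 'html.parser')
div = soup.find('div')

removed = div.unwrap()     # the div is taken out, its children stay in place
result = str(soup)
print(result)
print(removed)             # the removed div, now without children
```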

21. Tag contents

tag = soup.find('p')
print(tag.string)
tag.string = 'newcontents'
print(soup)

# tag = soup.find('body')
# val = tag.stripped_strings
# print(next(val))
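One thing that is easy to trip over: string is only set when the tag has a single child. For a tag with mixed content, such as a paragraph containing both text and a link, it is None. The sketch below shows both cases along with stripped_strings:

```python
from bs4 import BeautifulSoup

html = '<body>\n<p>only text</p>\n<p>mixed <a href="dref">link</a> tail</p>\n</body>'
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.find_all('p')

single = first.string      # one text child, so the text is returned
mixed = second.string      # several children, so there is no single string
pieces = list(soup.find('body').stripped_strings)   # whitespace-only strings are skipped

first.string = 'newcontents'    # replaces whatever the tag contained
print(single, mixed, pieces, first)
```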

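22. append: add a tag at the end of the current tag's contents

insert: insert a tag at a given position inside the current tag

insert_after, insert_before: insert a tag right after or right before the current tag

replace_with: replace the current tag with another one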
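The original gives no code for these methods, so here is a rough sketch that applies all of them in turn. The helper make and the tag names and texts it creates are hypothetical, chosen only to make the final order easy to read.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p id="we11">Homely</p></body>', 'html.parser')
body = soup.find('body')
p = soup.find('p')

def make(name, text):
    """Hypothetical helper: build a new tag holding a single string."""
    tag = soup.new_tag(name)
    tag.string = text
    return tag

body.append(make('span', 'last'))       # goes to the end of body
body.insert(0, make('h1', 'first'))     # goes to position 0 of body
p.insert_before(make('i', 'before'))    # sibling placed just before p
p.insert_after(make('b', 'after'))      # sibling placed just after p
p.replace_with(make('em', 'swapped'))   # p itself is swapped out
result = str(body)
print(result)
```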

23. CSS selectors

soup.select('ul li')

Published: 2024-02-03 02:16:17

Article link: https://www.4u4v.net/it/170689777547975.html
