soup findall class

Using Beautiful Soup: Quickly Extracting Individual Paragraphs from an HTML Document

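Beautiful Soup is a Python library for parsing HTML and XML documents. Its find_all method returns every tag that matches the given filters. The class_ keyword argument narrows the results to tags that carry the specified CSS class.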

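1. Basic usage of Beautiful Soup

First, import the BeautifulSoup class from the bs4 package. Then call BeautifulSoup() with the HTML document as the first argument and the name of a parser as the second, commonly 'html.parser'. The resulting object exposes the methods used to search and manipulate the document.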
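The steps above can be sketched as follows. This is a minimal example that assumes the bs4 package is installed; the HTML string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string      # text inside the <title> tag
first_paragraph = soup.find("p")    # first <p> tag, or None if there is none

print(title_text)
print(first_paragraph.get_text())
```

'html.parser' ships with Python, so it needs no extra installation; other parsers such as lxml can be substituted if they are installed.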

2. Using the find_all method

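The find_all method returns every tag that matches the given filters. Its signature is:

soup.find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)

Parameters:

name: the tag name to look for.
attrs: an optional dictionary of attribute names and values that a tag must have.
recursive: whether to search all descendants (the default) or only direct children.
string: an optional filter on the text inside a tag.
limit: an optional maximum number of results to return.
**kwargs: keyword filters on attributes. The most common is class_, which filters by CSS class; the trailing underscore is needed because class is a reserved word in Python.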
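To make the parameters more concrete, here is a small sketch that exercises several of them. The HTML snippet, the data-role attribute, and the variable names are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <p class="note">first</p>
  <p class="note">second</p>
  <span data-role="tag">third</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# name + class_ keyword: all <p> tags with the class "note".
notes = soup.find_all("p", class_="note")

# limit: stop after the first match.
first_only = soup.find_all("p", limit=1)

# attrs: a dictionary is handy for attribute names that are not valid
# Python keyword arguments, such as data-role.
tagged = soup.find_all(attrs={"data-role": "tag"})

# recursive=False: look only at direct children of the object being searched.
# The <p> tags sit inside <div>, so nothing is found at the top level.
top_level = soup.find_all("p", recursive=False)

print([p.get_text() for p in notes])
print(len(first_only), len(tagged), len(top_level))
```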

Take a simple HTML document as an example:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document Title</title>
</head>
<body>
    <h1>Heading 1</h1>
    <p class="text1">This is a paragraph with class text1.</p>
    <p class="text2">This is another paragraph with class text2.</p>
    <p class="text3">This is a third paragraph with class text3.</p>
</body>
</html>

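We can use find_all to collect every paragraph tag whose class starts with text. Note that class_='text' would return nothing here, because a plain string is compared against whole class names and no tag has the class text. A compiled regular expression matches the prefix instead:

import re

from bs4 import BeautifulSoup

html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document Title</title>
</head>
<body>
    <h1>Heading 1</h1>
    <p class="text1">This is a paragraph with class text1.</p>
    <p class="text2">This is another paragraph with class text2.</p>
    <p class="text3">This is a third paragraph with class text3.</p>
</body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Match every <p> whose class name begins with "text"
paragraphs = soup.find_all('p', class_=re.compile(r'^text'))

for p in paragraphs:
    print(p.text)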

Output:

This is a paragraph with class text1.
This is another paragraph with class text2.
This is a third paragraph with class text3.
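One detail worth checking by hand is that a plain string passed to class_ is compared against whole class names, not prefixes. The sketch below uses a shortened stand-in for the sample document; the variable names are illustrative.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="text1">one</p>
<p class="text2">two</p>
<p class="text3">three</p>
"""
soup = BeautifulSoup(html, "html.parser")

# No tag has the class "text", so an exact match finds nothing.
exact_miss = soup.find_all("p", class_="text")

# Exactly one tag has the class "text1".
exact_hit = soup.find_all("p", class_="text1")

# A compiled regular expression can match on the prefix instead.
by_prefix = soup.find_all("p", class_=re.compile(r"^text"))

print(len(exact_miss), len(exact_hit), len(by_prefix))
```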

3. Advanced usage of find_all

In the example above, we used find_all to locate the paragraph tags by class. Several other ways of filtering are worth exploring.

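find_all has no startswith parameter. To match class names that begin with a given prefix, pass a function to class_ instead. The function receives a class name, or None for a tag that has no class:

paragraphs = soup.find_all('p', class_=lambda c: c is not None and c.startswith('text'))

Output:

This is a paragraph with class text1.
This is another paragraph with class text2.
This is a third paragraph with class text3.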
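As an alternative to find_all, a CSS attribute selector can express the same prefix match. The sketch below assumes the select method is available (it relies on the soupsieve package, which is normally installed alongside bs4); the HTML and names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p class="text1">one</p>
<p class="intro">two</p>
<p class="text3">three</p>
"""
soup = BeautifulSoup(html, "html.parser")

# [class^="text"] matches when the raw class attribute string starts with "text".
prefixed = soup.select('p[class^="text"]')

print([p.get_text() for p in prefixed])
```

The selector compares the whole attribute string, so a tag written as class="big text1" would not match; the function or regular expression filters on class_ check each class name separately.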

To match tags that have any one of several class names, pass a list to class_ (find_all has no any parameter):

paragraphs = soup.find_all('p', class_=['text1', 'text2'])

Output:

This is a paragraph with class text1.
This is another paragraph with class text2.
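Class matching also works when a tag carries more than one class. The following sketch, with made-up HTML and a hypothetical highlight class, shows a list filter next to a single-name filter.

```python
from bs4 import BeautifulSoup

html = """
<p class="text1 highlight">one</p>
<p class="text2">two</p>
<p class="text3">three</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A list means "any of these class names".
either = soup.find_all("p", class_=["text1", "text2"])

# A single name still matches a tag that carries several classes.
highlighted = soup.find_all("p", class_="highlight")

print([p.get_text() for p in either])
print(highlighted[0]["class"])  # the class attribute is exposed as a list
```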

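find_all has no not_in parameter either. To exclude certain class names, pass a function that rejects them:

excluded = {'text1', 'text3'}
paragraphs = soup.find_all('p', class_=lambda c: c is not None and c not in excluded)

Output:

This is another paragraph with class text2.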
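Exclusion can also be written as a CSS selector using :not(). This is a sketch under the assumption that select (backed by soupsieve) is available; the HTML is a shortened stand-in for the sample document.

```python
from bs4 import BeautifulSoup

html = """
<p class="text1">one</p>
<p class="text2">two</p>
<p class="text3">three</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Select <p> tags that carry neither the class text1 nor the class text3.
remaining = soup.select("p:not(.text1):not(.text3)")

print([p.get_text() for p in remaining])
```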

Author: 呼如林