Parsing HTML to retrieve terms

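I built a crawler, so I now have a pile of crawled URLs. I need to build an index from them, using a vector space model or at least a list of all the terms in the HTML.

Take this random page: https://www.centralpark.com/things-to-do/central-park-zoo/polar-bears/

How do I parse all the terms out of that page? I don't understand whether I should grab the text between specific tags or do something else, or which library I should use. I'm completely lost. This is what I was told to do with the HTML:

"You can use an HTML parser found online, but in principle you can use the text in the HTML body... or the text between tags such as <p></p> and <h2></h2>."

Any help parsing the HTML above is appreciated.

Edit: I'm trying BeautifulSoup:

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.centralpark.com/things-to-do/central-park-zoo/polar-bears/'

# opening up connection
uClient = uReq(my_url)
page_html = uClient.read()
# close connection
uClient.close()

page_soup = soup(page_html, features="html.parser")
print(page_soup.p)

How do I put all the text elements into a list? For example, from

<p>This is p</p>
<p>This is another p</p>
<h1>This is h1</h1>
maybe some other text tags

to

List = ['This is p', 'This is another p', 'This is h1', ...]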


2 Answers

www说


Good, you're making progress!

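I recommend that you pip install requests and use it. You will find it has a much more convenient API than urllib. (Also, plain soup is the usual name for the variable you called page_soup.)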
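As a rough illustration of that suggestion, the download step might look like the sketch below. The function name fetch_soup and the 10-second timeout are assumptions made for this example, not something from the question.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed as a BeautifulSoup object.

    fetch_soup is a hypothetical helper name; the timeout value is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling soup = fetch_soup(my_url) would then stand in for the urlopen, read, and close steps in the question.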

> How do I put all the text elements into a list?

It's as simple as this:

print(list(page_soup.find_all('p')))

That explains why so many people like BeautifulSoup so much.
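Note that find_all('p') returns tag objects rather than plain strings. If a list of strings like the one in the question is wanted, and eventually a list of terms for an index, one possible sketch is shown here. The sample HTML, the TEXT_TAGS list, and the helper names text_elements and terms are all assumptions made for illustration, and the tokenizer is deliberately naive (lowercase ASCII letters and digits only).

```python
import re

from bs4 import BeautifulSoup

# Small stand-in document; the real input would be the downloaded page.
SAMPLE_HTML = """
<html><body>
<h1>This is h1</h1>
<p>This is p</p>
<p>This is <strong>another</strong> p</p>
</body></html>
"""

# Assumed set of tags to treat as "text elements"; adjust as needed.
TEXT_TAGS = ["h1", "h2", "h3", "p", "li"]


def text_elements(html, tags=TEXT_TAGS):
    """Return the visible text of each matching tag, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for element in soup.find_all(tags):
        # get_text flattens nested tags such as <strong> into one string.
        text = element.get_text(" ", strip=True)
        if text:
            texts.append(text)
    return texts


def terms(texts):
    """Split each text into lowercase alphanumeric tokens (naive tokenizer)."""
    found = []
    for text in texts:
        found.extend(re.findall(r"[a-z0-9]+", text.lower()))
    return found


texts = text_elements(SAMPLE_HTML)
print(texts)
print(terms(texts))
```

If the chosen tags can nest inside each other (for example a p inside an li), the same text would be collected twice, so the tag list may need trimming for a real page.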

This will show excerpts from the page:

paragraphs = page_soup.find_all('p')
for p in paragraphs:
    print(str(p)[:40])

There are no longer any

Polar Bear (Ursus Ma

Zoo collection includes:</str

Found in the wild: A

See Them at the Central Park

Description: The mal

Zoo Bear Habitat: Th

What do they eat: T

Life span: 25 to 30

Threats: Global warm

Fun Facts: A newborn

It is important to note that p is not a string. It is a searchable object, just like the soup it came from. You may want to find spans within it.
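To make that point concrete, here is a minimal sketch of searching inside a single p. The markup is made up for the example and is not copied from the page.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = "<p><strong>Label:</strong> some <span>nested</span> text</p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
# p is a bs4 Tag, not a str, so it supports the same searches as the soup.
label = p.find("strong")
label_text = label.get_text(strip=True) if label else None
span_texts = [span.get_text(strip=True) for span in p.find_all("span")]
# Flatten the whole paragraph, nested tags included, into one string.
full_text = p.get_text(" ", strip=True)

print(type(p).__name__, label_text, span_texts, full_text)
```

Calling str(p), as in the excerpt loop above, gives the markup including tags, whereas get_text gives only the text.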

2021-09-11
