Extracting specific page links from <a href> tags with BeautifulSoup

I am using BeautifulSoup to extract all the links from this page: http://kern.humdrum.org/search?s=t&keyword=Haydn

This is how I get all of the links:

    # -*- coding: utf-8 -*-
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    my_url = 'http://kern.humdrum.org/search?s=t&keyword=Haydn'

    # open the connection and grab the page
    uClient = uReq(my_url)
    # put all the content in a variable
    page_html = uClient.read()
    # close the internet connection
    uClient.close()
    # parse the HTML
    page_soup = soup(page_html, "html.parser")
    # grab all of the links
    containers = page_soup.findAll('a', href=True)
    #print(type(containers))
    for container in containers:
        link = container
        #start_index = link.index('href="')
        print(link)
        print("---")
        #print(start_index)

In my output (not reproduced here), note that it returns a lot of links, but I really only want the ones that have ">Something" text in them (for example ">Allegro", "Allegro vivace", and so on). I am having a hard time producing output of the following kind:

"Allegro - http://kern.ccarh.org/cgi-bin/ksdata?location=users/craig/classical/beethoven/piano/sonata&file=sonata01-1.krn&format=info"

In other words, at this point I have a pile of anchor tags (roughly 1000). Many of them are just "junk", and roughly 350 are the ones I want to extract. All of these tags look almost identical; the only difference is that the ones I need end with ">Some name</a>". I want to extract the link only from the anchor tags that have this feature.


3 Answers

守着星空守着你

1799 contribution points, more than 8 upvotes

The best and simplest way is to use the text attribute when you print the link, like this: print(link.text)
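As a rough sketch of how that could be combined with the href to get the "Allegro - URL" style of output the question asks for: the snippet below keeps only anchors whose visible text is non-empty and pairs that text with the link. The `sample_html` markup, the example.org URLs, and the `labelled_links` helper name are hypothetical stand-ins for illustration; the real page may need a stricter filter than "has any text".

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real search results page.
sample_html = """
<a href="http://example.org/sonata01-1">Allegro</a>
<a href="http://example.org/icon"><img src="icon.gif"></a>
<a href="http://example.org/sonata01-2">Allegro vivace</a>
<a name="top">Anchor without an href</a>
"""


def labelled_links(html):
    """Return (text, href) pairs for anchors that have both an href and visible text."""
    page_soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in page_soup.find_all("a", href=True):
        # get_text(strip=True) is empty for anchors that only wrap an image.
        text = link.get_text(strip=True)
        if text:
            results.append((text, link["href"]))
    return results


for text, href in labelled_links(sample_html):
    print(f"{text} - {href}")
```

With the question's code, the same idea would mean passing `page_html` to the helper instead of the sample string.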

Posted 2021-08-14
