3 Answers
Contributed 1872 experience points, received more than 3 upvotes
I guess you want something like this:
from bs4 import BeautifulSoup

html = '''<a href="/title/tt0110912/" title="Quentin Tarantino">
Pulp Fiction
</a>
<a href="/title/tt0137523/" title="David Fincher">
Fight Club
</a>
<a href="blablabla" title="Yet to Release">
Yet to Release
</a>
<a href="something" title="Movies">
Coming soon
</a>
'''

soup = BeautifulSoup(html, 'html.parser')
titles = []
# Select only the <a> tags whose href contains "/title/"
for a in soup.select('a[href*="/title/"]'):
    if a.text:
        # Replace the newlines around the link text with spaces
        titles.append(a.text.replace('\n', " "))
print(titles)
Output:
[' Pulp Fiction ', ' Fight Club ']
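Contributed 1877 experience points, received more than 6 upvotes
1.) To get all <a> tags whose href starts with "/title/", you can use the CSS selector a[href^="/title/"]
2.) To strip the surrounding whitespace from the text inside each tag, you can use .get_text() with the parameter strip=True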
from bs4 import BeautifulSoup

# html_text is the HTML string from the question
soup = BeautifulSoup(html_text, 'html.parser')
out = [a.get_text(strip=True) for a in soup.select('a[href^="/title/"]')]
print(out)
Prints:
['Pulp Fiction', 'Fight Club']
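The two answers use slightly different attribute selectors (*= versus ^=), and the difference only shows up when an href contains "/title/" somewhere other than at the start. Here is a minimal sketch of that, assuming a made-up sample_html that is not from the question (the "/list/title/123/" link and the variable names are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not from the question: the second link
# contains "/title/" but does not start with it.
sample_html = '''
<a href="/title/tt0110912/">Pulp Fiction</a>
<a href="/list/title/123/">Some list</a>
<a href="something">Coming soon</a>
'''

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# *= matches when the string appears anywhere in the attribute value
contains = [a.get_text(strip=True)
            for a in sample_soup.select('a[href*="/title/"]')]

# ^= matches only when the attribute value starts with the string
starts_with = [a.get_text(strip=True)
               for a in sample_soup.select('a[href^="/title/"]')]

print(contains)     # ['Pulp Fiction', 'Some list']
print(starts_with)  # ['Pulp Fiction']
```

For the HTML in the question both selectors give the same result, so which one to use probably depends on whether links like the second one above should count.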