I want to collect links from a site (https://www.vanglaini.org/) that look like /hmarchhak/102217 and print them as https://www.vanglaini.org/hmarchhak/102217. Please help. This is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')
for article in soup.find_all('article'):
    headline = article.a.text
    summary = article.p.text
    link = article.a.href
    print(headline)
    print(summary)
    print(link)
print()
1 Answer
慕无忌1623718
Contributed 1744 experience points, received 4+ likes
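Unless I'm missing something, the headline and summary appear to be the same text. With bs4 4.7.1+ you can use the :has pseudo-class to make sure each article contains an element with an href attribute. This seems to filter out the article tags that are not part of the main body, which I suspect is what you are actually after.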
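from bs4 import BeautifulSoup as bs
import re
import requests

base = 'https://www.vanglaini.org'
r = requests.get(base)
soup = bs(r.content, 'lxml')

# :has([href]) keeps only <article> elements that contain an element with an href
for article in soup.select('article:has([href])'):
    headline = article.h5.text.strip()
    # replace runs of newlines or carriage returns in the summary with a space
    summary = re.sub(r'\n+|\r+', ' ', article.p.text.strip())
    # the href is site-relative (e.g. /hmarchhak/102217), so prefix the base URL
    link = f"{base}{article.a['href']}"
    print(headline)
    print(summary)
    print(link)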
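As an alternative to joining the strings by hand, the link could be built with `urljoin` from the standard library, which also leaves an already absolute href untouched. This is only a minimal sketch: the helper name `to_absolute` is hypothetical, and whether the site ever emits absolute hrefs is an assumption.

```python
from urllib.parse import urljoin

BASE = "https://www.vanglaini.org/"


def to_absolute(href, base=BASE):
    """Resolve a possibly relative href against the site's base URL.

    `to_absolute` is a hypothetical helper name used only for illustration.
    """
    return urljoin(base, href)


# A site-relative href, as in the question
print(to_absolute("/hmarchhak/102217"))

# An href that is already absolute is returned unchanged
print(to_absolute("https://www.vanglaini.org/hmarchhak/102217"))
```

Inside the loop this would replace the f-string with `link = to_absolute(article.a['href'])`.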