2 Answers
Contributor with 1829 experience points, 7+ likes
Use BeautifulSoup to visit every news article on the listing page and pull its image:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}

with requests.Session() as session:
    # Merge the browser-like User-Agent into the session's default headers
    session.headers.update(headers)
    soup = BeautifulSoup(session.get("https://phys.org/earth-news/").text, "lxml")
    # Collect the article links from the listing page
    news_list = [news_div.get("href") for news_div in soup.select('.news-link')]
    for url in news_list:
        # Fetch each article and look for its main image container
        soup = BeautifulSoup(session.get(url).text, "lxml")
        img = soup.select_one(".article-img")
        if img:
            print(url, img.select_one('img').get("src"))
        else:
            print(url, "This news doesn't contain image")
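One weak spot in the loop above is that img.select_one('img') returns None if the .article-img container exists but holds no <img> tag, and the .get("src") call would then raise. Below is a minimal sketch of a more defensive lookup. The helper name extract_article_image and the sample markup are hypothetical (the markup is not copied from phys.org), and it uses the built-in "html.parser" so it does not depend on lxml:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_article_image(html, base_url):
    """Return the absolute URL of the first <img> inside .article-img, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns None when nothing matches, so one descendant selector
    # covers both "no .article-img" and "no <img> inside it".
    img = soup.select_one(".article-img img")
    if img is None:
        return None
    src = img.get("src")
    if not src:
        return None
    # urljoin leaves absolute URLs unchanged and resolves relative ones.
    return urljoin(base_url, src)


# Hypothetical sample markup for illustration only.
with_image = '<div class="article-img"><img src="/img/flood.jpg"></div>'
without_image = '<div class="article-img"><p>No picture here</p></div>'

print(extract_article_image(with_image, "https://example.com/news/1.html"))
print(extract_article_image(without_image, "https://example.com/news/1.html"))
```

In the loop, this could stand in for the if/else: print the returned URL when it is not None, and the "doesn't contain image" message otherwise.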
Contributor with 1780 experience points, 1+ likes
Use BeautifulSoup to extract the image links straight from the listing page:
import requests
from bs4 import BeautifulSoup

url = 'https://phys.org/earth-news/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
# Thumbnails are lazy-loaded, so the real URL sits in data-src rather than src
for img in soup.select('.sorted-article img[data-src]'):
    # Swap the 175u thumbnail size segment for the larger 800 variant
    print(img['data-src'].replace('/175u/', '/800/'))
Prints:
https://scx1.b-cdn.net/csz/news/800/2020/biofuels.jpg
https://scx1.b-cdn.net/csz/news/800/2020/waterscarcity.jpg
https://scx1.b-cdn.net/csz/news/800/2020/soilerosion.jpg
https://scx1.b-cdn.net/csz/news/800/2020/hydropowerdam.jpg
https://scx1.b-cdn.net/csz/news/800/2019/flood.jpg
https://scx1.b-cdn.net/csz/news/800/2018/1-emissions.jpg
https://scx1.b-cdn.net/csz/news/800/2020/globalforest.jpg
https://scx1.b-cdn.net/csz/news/800/2020/fleeingthecl.jpg
https://scx1.b-cdn.net/csz/news/800/2020/watersecurity.jpg
https://scx1.b-cdn.net/csz/news/800/2019/2-water.jpg
https://scx1.b-cdn.net/csz/news/800/2020/japaneseexpe.jpg
https://scx1.b-cdn.net/csz/news/800/2020/6-scientistsco.jpg
https://scx1.b-cdn.net/csz/news/800/2020/housescollap.jpg
https://scx1.b-cdn.net/csz/news/800/2020/soil.jpg
https://scx1.b-cdn.net/csz/news/800/2020/32-researcherst.jpg
https://scx1.b-cdn.net/csz/news/800/2020/2-nasatracking.jpg
https://scx1.b-cdn.net/csz/news/800/2020/thelargersec.jpg
https://scx1.b-cdn.net/csz/news/800/2020/4-nasasterrasa.jpg
https://scx1.b-cdn.net/csz/news/800/2020/howtorecycle.jpg
https://scx1.b-cdn.net/csz/news/800/2020/newtoolstrac.jpg
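The .replace('/175u/', '/800/') call works on the whole URL string. As a rough sketch, the same swap can be limited to exact path segments using only the standard library. The function name resize_thumbnail is hypothetical, and the input URL is reconstructed from the first output line plus the replace call in the answer. Whether the CDN actually serves every image at the 800 size is an assumption:

```python
from urllib.parse import urlsplit, urlunsplit


def resize_thumbnail(url, old="175u", new="800"):
    """Swap one path segment of a thumbnail URL, leaving the rest untouched."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Replace only exact segment matches, so "175u" inside a filename
    # or query string is not rewritten.
    segments = [new if seg == old else seg for seg in segments]
    return urlunsplit(parts._replace(path="/".join(segments)))


# Reconstructed input: the first printed URL with the original size segment.
thumb = "https://scx1.b-cdn.net/csz/news/175u/2020/biofuels.jpg"
print(resize_thumbnail(thumb))
```

URLs without a 175u segment come back unchanged, so the function is safe to apply to every data-src value.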