3 Answers

Contributor with 1848 experience points, more than 10 upvotes
The following works:
import requests
from bs4 import BeautifulSoup

# A browser-like User-Agent avoids the 403 the site returns for the default one.
response = requests.get('https://occ.ca/our-publications/', headers={'User-Agent': 'Mozilla'})
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Each publication sits in a div with the class "publicationoverlay".
    pdfs = soup.find_all('div', {"class": "publicationoverlay"})
    links = [pdf.find('a').attrs['href'] for pdf in pdfs]
    print(links)
Output:
['https://occ.ca/wp-content/uploads/The-Great-Mosaic-Reviving-Ontarios-Regional-Economies.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter-in-support-of-the-OPG-Pickering-Nuclear-Nomination.pdf', 'https://occ.ca/wp-content/uploads/OCC-Beverage-Alcohol-Report.pdf', 'https://occ.ca/wp-content/uploads/Industrial-Electricity-Rates.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter_Strategic-Approach-to-Alcohol-Sales.pdf', 'https://occ.ca/wp-content/uploads/OCC-Submission-Modernizing-Ontarios-Environmental-Assessment-Program.pdf', 'https://occ.ca/wp-content/uploads/OCC-Letter-on-Ticket-Sales-Act.pdf', 'https://occ.ca/wp-content/uploads/2018-2019-Policy-Report-Card.pdf', 'https://occ.ca/wp-content/uploads/Letter-on-Right-to-Repair-May-1.pdf', 'https://occ.ca/wp-content/uploads/Federal-Carbon-Tax-Transparency-Act-2019-OCC.pdf', 'https://occ.ca/wp-content/uploads/Waste-and-Litter-Submission-_-Final.pdf', 'https://occ.ca/wp-content/uploads/Supporting-Ontarios-Budding-Cannabis-Industry.pdf']
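The list comprehension above raises an AttributeError if any matching div lacks an <a> tag. A minimal sketch of a more defensive extraction follows. It runs against an inline sample rather than the live page, and both the sample markup and the helper name extract_pdf_links are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the structure the answer relies on.
SAMPLE_HTML = """
<div class="publicationoverlay"><a href="https://example.com/a.pdf">A</a></div>
<div class="publicationoverlay"><span>no link here</span></div>
<div class="publicationoverlay"><a href="https://example.com/b.pdf">B</a></div>
<div class="other"><a href="https://example.com/c.pdf">C</a></div>
"""


def extract_pdf_links(html):
    """Return the href of the first <a> inside each publicationoverlay div."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for div in soup.find_all("div", class_="publicationoverlay"):
        anchor = div.find("a", href=True)  # None when the div has no link
        if anchor is not None:
            links.append(anchor["href"])
    return links


print(extract_pdf_links(SAMPLE_HTML))
# ['https://example.com/a.pdf', 'https://example.com/b.pdf']
```

Divs without a link are skipped instead of stopping the whole run, which may be preferable if the page layout changes.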

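Contributor with 1827 experience points, more than 9 upvotes
Without a User-Agent header, the page returns 403 (Forbidden) along with an error page. If you add a User-Agent header, it returns 200 (OK) along with the page you need: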
requests.get(url, headers={'User-Agent': 'Mozilla'})
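To see what the header argument changes, the request can be prepared without being sent. This is a rough sketch, assuming only that requests is installed; the helper name build_request is hypothetical, and nothing here contacts the server.

```python
import requests

URL = "https://occ.ca/our-publications/"


def build_request(url, user_agent=None):
    """Prepare (but do not send) a GET request, optionally overriding User-Agent."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    session = requests.Session()
    # prepare_request merges the session's default headers with ours.
    return session.prepare_request(requests.Request("GET", url, headers=headers))


default_req = build_request(URL)
custom_req = build_request(URL, "Mozilla")

print(default_req.headers["User-Agent"])  # python-requests/<installed version>
print(custom_req.headers["User-Agent"])   # Mozilla
```

The per-request header replaces the library default, which is the value the site appears to reject.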

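Contributor with 1859 experience points, more than 6 upvotes
That is because your original request received a 403 Forbidden response. By default, Python requests sends headers like the following (Content-Length and Content-Type appear only when the request carries a body):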
{
'User-Agent': 'python-requests/2.21.0',
'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*',
'Connection': 'keep-alive',
'Content-Length': '40',
'Content-Type': 'application/json'
}
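Some websites block requests that carry headers like these, in particular the default python-requests User-Agent, so you get a 403 HTTP error.
source = requests.get(url, headers={'User-Agent': 'Mozilla'})
Adding this header resolves the problem, and you will get the content you need.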
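The headers your own installation sends can be checked directly. The sketch below assumes requests is installed and uses a placeholder URL (https://example.com/api) and payload that are not from the original question. It prepares the request locally and sends nothing.

```python
import requests

# Headers that requests attaches to every request by default.
defaults = requests.utils.default_headers()
print(dict(defaults))

# Body-related headers appear only once a request carries a payload.
body_req = requests.Request(
    "POST", "https://example.com/api", json={"q": "x"}  # placeholder URL and data
).prepare()
print(body_req.headers["Content-Type"])  # application/json
print("Content-Length" in body_req.headers)  # True
```

The User-Agent value printed in the first dictionary includes the installed requests version, so it may differ from the 2.21.0 shown above.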