Getting rid of 'urllib.error.HTTPError:' from urlReq(url)

aluckdog 2021-12-21 16:48:06
Hey guys, what's up? :) I'm trying to scrape a website using some URL parameters. With url1, url2 and url3 it works properly and prints the regular output I want (the HTML):

import bs4
from urllib.request import urlopen as urlReq
from bs4 import BeautifulSoup as soup

# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

# opening up connection on each url, grabbing the page
uClient = urlReq(url4)
page_html = uClient.read()
uClient.close()

# parsing the downloaded html
page_soup = soup(page_html, "html.parser")

# print the html
print(page_soup.body.prettify())

But when I try url4,

url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

it gives me the error below. What am I doing wrong?

- Maybe it has something to do with cookies? But then why does it work for the other URLs?
- Maybe they are simply blocking scraping attempts?
- How can I use multiple parameters in the URL without getting this error?

urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Temporarily

Thanks in advance for your help!
Cheers, Alan

What I have tried:

I tried the requests library:

import requests
url = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'
r = requests.get(url)
html = r.text
print(html)

which returns:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /sale
on this server.</p>
</body></html>
[Finished in 0.375s]

1 Answer

繁花不似锦

If you use the requests package and add a User-Agent to the headers, all 4 links appear to return a 200 response. So try adding a User-Agent header:


headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}


import requests
from bs4 import BeautifulSoup as soup

# create urls
url1 = 'https://en.titolo.ch/sale'
url2 = 'https://en.titolo.ch/sale?limit=108'
url3 = 'https://en.titolo.ch/sale?category_styles=29838_21212'
url4 = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

url_list = [url1, url2, url3, url4]

for url in url_list:
    # opening up connection on each url, grabbing the page
    response = requests.get(url, headers=headers)
    print(response.status_code)

Output:

200
200
200
200
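
If you would rather stay with urllib, as in the question, a similar status check can be sketched as follows. This is only an illustration under assumptions: the helper name `status_of` and its `opener` parameter are made up here, and I have not verified that the site answers urllib the same way it answers requests. The point is that `urllib` raises `HTTPError` for error responses (including the redirect loop from the question), and the exception carries the status code in `err.code`.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Same browser-like User-Agent as in the answer above.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}


def status_of(url, opener=urlopen):
    """Return the HTTP status code for url (hypothetical helper).

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    request = Request(url, headers=HEADERS)
    try:
        with opener(request) as response:
            return response.status
    except HTTPError as err:
        # urllib raises for 4xx/5xx and for redirect loops; the code is still available.
        return err.code


# Usage (needs network access):
# for url in url_list:
#     print(status_of(url))
```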

So:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

url = 'https://en.titolo.ch/sale?category_styles=31066&limit=108'

r = requests.get(url, headers=headers)
html = r.text
print(html)
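
On the "multiple parameters" part of the question: judging by the result above, the `&` in the URL does not seem to be the problem, the missing User-Agent does. Still, building the query string from a dict is less error-prone than concatenating it by hand. A minimal sketch with the standard library, where `build_request` and `fetch` are hypothetical helper names and the parameter values are the ones from the question:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = 'https://en.titolo.ch/sale'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}


def build_request(params=None):
    """Build a urllib Request for BASE_URL with optional query parameters."""
    query = urlencode(params or {})
    url = f"{BASE_URL}?{query}" if query else BASE_URL
    return Request(url, headers=HEADERS)


def fetch(request):
    """Download the page body as bytes (needs network access)."""
    with urlopen(request) as response:
        return response.read()


req = build_request({'category_styles': '31066', 'limit': 108})
print(req.full_url)

# Usage (needs network access):
# page_html = fetch(req)
```

The resulting `page_html` can be passed to `soup(page_html, "html.parser")` exactly as in the question's original code.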


Answered 2021-12-21