
Unable to suppress some errors raised by process_exception

Python

慕莱坞森 2023-08-22 14:45:28

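I'm trying not to show some errors thrown by process_exception within RetryMiddleware in Scrapy: the errors the script hits once the maximum retry limit is exceeded. I use proxies within the middleware. The odd thing is that the exceptions the script throws are already in the EXCEPTIONS_TO_RETRY list. It is perfectly fine for the script to exceed the maximum number of retries now and then without any success. However, I don't want to see that error even when it occurs, which means suppressing or bypassing it.

The error looks like this:

    Traceback (most recent call last):
      File "middleware.py", line 43, in process_request
        defer.returnValue((yield download_func(request=request,spider=spider)))
    twisted.internet.error.TCPTimedOutError: TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond..

This is what process_exception within RetryMiddleware looks like:

    from twisted.internet import defer
    from twisted.internet.error import (
        ConnectError, ConnectionDone, ConnectionLost, ConnectionRefusedError,
        DNSLookupError, TCPTimedOutError, TimeoutError,
    )
    from twisted.web.client import ResponseFailed
    from scrapy.core.downloader.handlers.http11 import TunnelError

    class RetryMiddleware(object):
        cus_retry = 3
        EXCEPTIONS_TO_RETRY = (defer.TimeoutError, TimeoutError, DNSLookupError,
                               ConnectionRefusedError, ConnectionDone, ConnectError,
                               ConnectionLost, TCPTimedOutError, TunnelError, ResponseFailed)

        def process_exception(self, request, exception, spider):
            if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                    and not request.meta.get('dont_retry', False):
                return self._retry(request, exception, spider)

        def _retry(self, request, reason, spider):
            retries = request.meta.get('cus_retry', 0) + 1
            if retries <= self.cus_retry:
                r = request.copy()
                r.meta['cus_retry'] = retries
                r.meta['proxy'] = f'https://{ip:port}'  # placeholder as posted; substitute a real proxy
                r.dont_filter = True
                return r
            else:
                # Returns None here, so the exception keeps propagating
                print("done retrying")

How can I get rid of the errors listed in EXCEPTIONS_TO_RETRY?

PS: The script hits the error when the maximum retry limit is reached, no matter which site I choose.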


3 Answers

缥缈止盈

Contributed 2041 experience points · received 4+ likes

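Try fixing the code of the scraper itself. Sometimes a faulty parse function can cause the kind of error you describe. Once I fixed my code, the error went away.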

2023-08-22

白衣染霜花

Contributed 1796 experience points · received 10+ likes

When the maximum number of retries is reached, an errback method such as parse_error() should handle any error that surfaces in the spider:

    def start_requests(self):
        for start_url in self.start_urls:
            yield scrapy.Request(start_url, errback=self.parse_error, callback=self.parse, dont_filter=True)

    def parse_error(self, failure):
        # print(repr(failure))
        pass

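However, I would like to suggest a completely different approach here. If you go the following route, you don't need any custom middleware at all. Everything, including the retry logic, already lives in the spider.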

    import scrapy

    class mySpider(scrapy.Spider):
        name = "myspider"
        start_urls = [
            "some url",
        ]
        proxies = []  # list of proxies here
        max_retries = 5
        retry_urls = {}  # maps each failed URL to its retry count

        def parse_error(self, failure):
            proxy = f'https://{ip:port}'  # placeholder as posted; substitute a real proxy
            retry_url = failure.request.url
            if retry_url not in self.retry_urls:
                self.retry_urls[retry_url] = 1
            else:
                self.retry_urls[retry_url] += 1

            if self.retry_urls[retry_url] <= self.max_retries:
                # Re-issue the request with the same errback so further failures come back here
                yield scrapy.Request(retry_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)
            else:
                print("gave up retrying")

        def start_requests(self):
            for start_url in self.start_urls:
                proxy = f'https://{ip:port}'  # placeholder as posted
                yield scrapy.Request(start_url, callback=self.parse, meta={"proxy": proxy, "download_timeout": 10}, errback=self.parse_error, dont_filter=True)

        def parse(self, response):
            # The selector was left out in the original; css() needs one to run
            for item in response.css().getall():
                print(item)
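The per-URL counting in parse_error above can be looked at in isolation. The following is only a rough sketch using the standard library, not part of the original answer: the class name RetryBudget, the method next_attempt, and the proxy addresses are all made up for illustration. It keeps the counter on an instance rather than in a class-level dict, and it rotates through a proxy list instead of reusing a single placeholder.

```python
import itertools
from collections import defaultdict


class RetryBudget:
    """Hypothetical helper: counts retries per URL and rotates proxies."""

    def __init__(self, proxies, max_retries=5):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self.max_retries = max_retries
        self._counts = defaultdict(int)           # url -> retries used so far
        self._proxies = itertools.cycle(proxies)  # endless round-robin over proxies

    def next_attempt(self, url):
        """Return the proxy for the next retry of url, or None to give up."""
        self._counts[url] += 1
        if self._counts[url] > self.max_retries:
            return None
        return next(self._proxies)


# Illustrative addresses only (assumption, not from the thread)
budget = RetryBudget(["https://10.0.0.1:8080", "https://10.0.0.2:8080"], max_retries=2)
attempts = [budget.next_attempt("https://example.com/") for _ in range(3)]
print(attempts)
```

Inside parse_error, one would presumably call something like budget.next_attempt(failure.request.url) and yield a new scrapy.Request only when the result is not None.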

Don't forget to add the following lines to the spider to get the result described above:

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            # None disables the built-in retry middleware
            'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        }
    }
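As a possible alternative to setting the middleware entry to None, Scrapy also has a RETRY_ENABLED setting that switches the stock RetryMiddleware off. A minimal sketch, assuming the dict is placed in the spider class body as custom_settings:

```python
# Assumption: this dict lives in the spider class body as `custom_settings`.
custom_settings = {
    # Built-in Scrapy setting; False turns the stock RetryMiddleware off,
    # leaving retries entirely to the spider's own errback logic.
    "RETRY_ENABLED": False,
}
```

Either form should leave the errback in the spider as the only place where retries are decided.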

By the way, I'm using Scrapy 2.3.0.

2023-08-22

慕运维8079593

Contributed 1876 experience points · received 5+ likes

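Maybe the problem is not on your side and the third-party website is at fault. Perhaps there is a connection error on their server, or perhaps the server is secured so that nobody can access it.

The error itself says that the connected party is down or not responding properly, so first check whether the third-party site is working at the time of the request. Try contacting them if you can.

As the error message says, the fault is not on your side but on the other party's.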
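If the timeouts really are caused by the remote side and are expected, what remains of the question is keeping them out of the log. One option is a standard logging filter that drops records carrying certain exception types. This is only a sketch: the class name DropExceptionTypes and the demo logger are invented for illustration, and the built-in TimeoutError stands in for twisted's TCPTimedOutError so that the example runs without Scrapy installed.

```python
import logging


class DropExceptionTypes(logging.Filter):
    """Hypothetical filter: discard records whose exception is in exc_types."""

    def __init__(self, *exc_types):
        super().__init__()
        self.exc_types = exc_types

    def filter(self, record):
        exc_info = record.exc_info
        # With a traceback attached, exc_info is a (type, value, traceback) tuple
        if isinstance(exc_info, tuple) and exc_info[0] is not None:
            return not issubclass(exc_info[0], self.exc_types)
        return True  # records without exception info pass through


class ListHandler(logging.Handler):
    """Collects messages in a list so the demo result can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


demo_logger = logging.getLogger("demo.scraper")  # stand-in logger name
demo_logger.propagate = False
demo_logger.setLevel(logging.ERROR)
handler = ListHandler()
demo_logger.addHandler(handler)
demo_logger.addFilter(DropExceptionTypes(TimeoutError))

try:
    raise TimeoutError("TCP connection timed out")
except TimeoutError:
    demo_logger.error("Error downloading", exc_info=True)  # filtered out

try:
    raise ValueError("unrelated problem")
except ValueError:
    demo_logger.error("Something else", exc_info=True)  # kept

print(handler.messages)
```

In a Scrapy project the filter would presumably be attached to the logger that emits the download errors, which is assumed here to be "scrapy.core.scraper", with TCPTimedOutError passed in place of TimeoutError. The logger name is worth verifying against the log lines you actually see.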

2023-08-22
