I set up a spider_idle signal handler to feed the spider another batch of URLs. This seems to work fine at first, but then the Crawled (200)... messages appear less and less often and eventually stop. I have 115 test URLs to hand out, yet Scrapy reports Crawled 38 pages.... The spider's code and the Scrapy log are below. In general, I am implementing a two-stage crawl: the first pass only downloads the URLs into a urls.jl file, and the second pass scrapes those URLs. I am now working on the code for that second spider.

```python
import json
import scrapy
import logging
from scrapy import signals
from scrapy.http.request import Request
from scrapy.exceptions import DontCloseSpider


class A2ndexample_comSpider(scrapy.Spider):
    name = '2nd_example_com'
    allowed_domains = ['www.example.com']

    def parse(self, response):
        pass

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler):
        self.crawler = crawler
        # read from file
        self.urls = []
        with open('urls.jl', 'r') as f:
            for line in f:
                self.urls.append(json.loads(line))
        # How many urls to return from start_requests()
        self.batch_size = 5

    def start_requests(self):
        for i in range(self.batch_size):
            if 0 == len(self.urls):
                return
            url = self.urls.pop(0)
            yield Request(url["URL"])

    def idle_consume(self):
        # Every time the spider is about to close, check our urls
        # buffer to see if we have something left to crawl
        reqs = self.start_requests()
        if not reqs:
            return
        logging.info('Consuming batch... [left: %d])' % len(self.urls))
        for req in reqs:
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
```

I expected the spider to crawl all 115 URLs, not just 38. Also, if it no longer wants to crawl and the signal-handler function has not raised DontCloseSpider, shouldn't it at least close then?
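For reference, a first-stage spider that produces urls.jl in the format this second spider reads could look roughly like the sketch below. The spider name, start URL, and link-selection rule here are hypothetical, not taken from the original project; it just relies on Scrapy's JSON-lines feed export (e.g. `scrapy crawl 1st_example_com -o urls.jl`) to write one {"URL": ...} record per line:

```python
import scrapy


class A1stexample_comSpider(scrapy.Spider):
    # Hypothetical first-stage spider: it only collects URLs.
    # Running it with `scrapy crawl 1st_example_com -o urls.jl` lets the
    # JSON-lines feed exporter write one {"URL": ...} object per line.
    name = '1st_example_com'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/']

    def parse(self, response):
        # Yield every discovered link as an item; the feed exporter
        # serializes each item as a separate JSON-lines record.
        for href in response.css('a::attr(href)').getall():
            yield {'URL': response.urljoin(href)}
```

With that assumption, urls.jl holds one JSON object per line, e.g. {"URL": "https://www.example.com/some-page"}, which is exactly what url["URL"] in start_requests() reads back.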