
How to crawl in a desired order or synchronously in Scrapy?

Python

MYYA 2022-05-19 14:32:17

The problem

I'm trying to create a spider that crawls a store, scrapes every product, and outputs the results to a JSON file. That means going into each category on the main page and scraping every product (just name and price). Each product category page uses infinite scrolling.

My problem is that each time I make a request after scraping the first page of a category, I get the items from the next category instead of the next batch of items from the same category, and the output ends up a mess.

What I've already tried

I've tried changing the settings to force concurrent requests to one, and setting a different priority for each request. I found out about asynchronous scraping, but I can't figure out how to create requests in order.

Code

    import scrapy
    from scrapper_pccom.items import ScrapperPccomItem

    class PccomSpider(scrapy.Spider):
        name = 'pccom'
        allowed_domains = ['pccomponentes.com']
        start_urls = ['https://www.pccomponentes.com/componentes']

        #Scrapes links for every category from main page
        def parse(self, response):
            categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
            prio = 20
            for category in categories:
                url = response.urljoin(category.extract())
                yield scrapy.Request(url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})
                prio = prio - 1

        #Scrapes products from every page of each category
        def parse_item_list(self, response, prio):
            products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
            for product in products:
                item = ScrapperPccomItem()
                item['name'] = product.xpath('@data-name').extract()
                item['price'] = product.xpath('@data-price').extract()
                yield item

            #URL of the next page
            next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
            if next_page:
                next_url = response.urljoin(next_page)
                yield scrapy.Request(next_url, self.parse_item_list, priority=prio, cb_kwargs={'prio': prio})

Output vs. expected

What it does: Cat 1 page 1 > Cat 2 page 1 > Cat 3 page 1 > ...

What I want it to do: Cat 1 page 1 > Cat 1 page 2 > Cat 1 page 3 > ... > Cat 2 page 1


1 Answer

千巷猫影


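This is simple.

Collect all the category links in a list, all_categories. Don't request every link at once. Request only the first category link, and once every page of that category has been scraped, send a request for the next category link.

Here is the code. I haven't run it, so it may contain some syntax errors, but the logic is what you need.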

    import scrapy
    from scrapper_pccom.items import ScrapperPccomItem

    class PccomSpider(scrapy.Spider):
        name = 'pccom'
        allowed_domains = ['pccomponentes.com']
        start_urls = ['https://www.pccomponentes.com/componentes']

        #Category URLs that have not been requested yet
        all_categories = []

        #Returns a request for the next pending category, or None when none are left
        def yield_category(self):
            if self.all_categories:
                #pop(0) takes the categories in the order they appear on the main page
                url = self.all_categories.pop(0)
                print("Scraping category %s " % (url))
                return scrapy.Request(url, self.parse_item_list)
            else:
                print("all done")
                return None

        #Scrapes links for every category from main page
        def parse(self, response):
            categories = response.xpath('//a[contains(@class,"enlace-secundario")]/@href')
            self.all_categories = [response.urljoin(category.extract()) for category in categories]
            #Request only the first category; the rest wait in all_categories
            request = self.yield_category()
            if request:
                yield request

        #Scrapes products from every page of each category
        #(no prio argument: the requests above do not pass cb_kwargs)
        def parse_item_list(self, response):
            products = response.xpath('//article[contains(@class,"tarjeta-articulo")]')
            for product in products:
                item = ScrapperPccomItem()
                item['name'] = product.xpath('@data-name').extract()
                item['price'] = product.xpath('@data-price').extract()
                yield item

            #URL of the next page
            next_page = response.xpath('//div[@id="pager"]//li[contains(@class,"c-paginator__next")]//a/@href').extract_first()
            if next_page:
                next_url = response.urljoin(next_page)
                yield scrapy.Request(next_url, self.parse_item_list)
            else:
                print("All pages of this category scraped, now scraping next category")
                request = self.yield_category()
                if request:
                    yield request
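To check the ordering logic without hitting the site, here is a minimal sketch that simulates the same idea in plain Python, with no Scrapy involved. Everything in it is a hypothetical stand-in: FAKE_SITE plays the shop, FakeRequest plays scrapy.Request, and run() is a toy first-in-first-out scheduler rather than the real Scrapy engine. It only shows that having a single request outstanding at a time yields the order the question asks for.

```python
from collections import deque

# Hypothetical shop: category name -> list of pages, each page a list of products.
FAKE_SITE = {
    "cat1": [["cpu-a", "cpu-b"], ["cpu-c"]],
    "cat2": [["gpu-a"], ["gpu-b"], ["gpu-c"]],
    "cat3": [["ram-a"]],
}


class FakeRequest:
    """Stand-in for a request: a (category, page index) target plus a callback."""

    def __init__(self, category, page, callback):
        self.category = category
        self.page = page
        self.callback = callback


class SequentialCrawl:
    def __init__(self, site):
        self.site = site
        self.pending = deque()  # categories not requested yet

    def next_category(self):
        """Return a request for the first page of the next category, or None."""
        if not self.pending:
            return None
        return FakeRequest(self.pending.popleft(), 0, self.parse_item_list)

    def parse(self):
        """Queue every category, but request only the first one."""
        self.pending.extend(self.site)
        request = self.next_category()
        if request is not None:
            yield request

    def parse_item_list(self, request):
        pages = self.site[request.category]
        for product in pages[request.page]:
            yield (request.category, request.page + 1, product)
        if request.page + 1 < len(pages):
            # More pages left: stay in the same category.
            yield FakeRequest(request.category, request.page + 1, self.parse_item_list)
        else:
            # Category finished: only now move on to the next one.
            following = self.next_category()
            if following is not None:
                yield following


def run(crawl):
    """Toy FIFO scheduler: run callbacks, collect items, queue new requests."""
    items = []
    queue = deque(crawl.parse())
    while queue:
        request = queue.popleft()
        for result in request.callback(request):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = run(SequentialCrawl(FAKE_SITE))
for category, page, product in items:
    print(category, page, product)
```

Because each callback yields at most one follow-up request, the queue never holds more than one entry, so the order does not depend on priorities or concurrency settings.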

Answered 2022-05-19
