I'm trying to scrape Yellow Pages by category, so I load the categories from a text file and feed them to start_urls. The problem I'm facing is how to save the output separately for each category. Below is the code I'm trying to get working:

CATEGORIES = []
with open('Catergories.txt', 'r') as f:
    data = f.readlines()
    for category in data:
        CATEGORIES.append(category.strip())

The file is opened in settings.py, and the resulting list is what the spider works through. The spider:

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.utils.project import get_project_settings

from ..items import YellowItem

settings = get_project_settings()


class YpSpider(CrawlSpider):
    categories = settings.get('CATEGORIES')
    name = 'yp'
    allowed_domains = ['yellowpages.com']
    start_urls = ['https://www.yellowpages.com/search?search_terms={0}'
                  '&geo_location_terms=New%20York%2C%20NY'.format(*categories)]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="business-name"]', allow=''),
             callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next ajax-page"]', allow=''),
             follow=True),
    )

    def parse_item(self, response):
        categories = settings.get('CATEGORIES')
        print(categories)
        item = YellowItem()
        # for data in response.xpath('//section[@class="info"]'):
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['phone'] = response.xpath('//p[@class="phone"]/text()').extract_first()
        item['street_address'] = response.xpath('//h2[@class="address"]/text()').extract_first()
        email = response.xpath('//a[@class="email-business"]/@href').extract_first()
        try:
            item['email'] = email.replace("mailto:", '')
        except AttributeError:
            pass
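Note that '...{0}...'.format(*categories) substitutes only the first category into a single URL, so only one category ever gets searched. A minimal sketch of building one start URL per category instead, assuming categories is the list loaded in settings.py, could look like this:

# hypothetical variant: one search URL per category from Catergories.txt
BASE_URL = ('https://www.yellowpages.com/search'
            '?search_terms={0}&geo_location_terms=New%20York%2C%20NY')
start_urls = [BASE_URL.format(category) for category in categories]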
2 Answers
繁华开满天机
I would do this in post-processing instead. Export all items to a single .csv file that includes a category field. I think you're not approaching this the right way and are over-complicating it. Not sure if this works, but it's worth a try :)
import csv

CATEGORY_COLUMN = 0  # assumed index of the category field in each row

with open('parent.csv', 'r') as file:
    reader = csv.reader(file)
    for row in reader:
        # append each row to a CSV file named after its category
        with open('{}.csv'.format(row[CATEGORY_COLUMN]), 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(row)
You could also run this code from the spider_closed signal:
https://docs.scrapy.org/en/latest/topics/signals.html#scrapy.signals.spider_closed
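A minimal sketch of wiring that post-processing step to spider_closed, following the pattern in the Scrapy docs linked above — split_by_category is a hypothetical helper wrapping the CSV-splitting loop from this answer, and parent.csv is assumed to be the combined feed export:

from scrapy import signals


class YpSpider(CrawlSpider):
    # ... name, start_urls and rules as in the question ...

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # run the per-category split once the crawl has finished
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # split_by_category is assumed to wrap the csv loop shown above
        split_by_category('parent.csv')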
青春有我
dict.items() returns an iterable in which each item looks like a tuple (key, value). To get rid of this error, remove the iter() call and unpack each item, e.g.:

for category, exporter in self.exporter.items():
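For context, that line suggests the pipeline keeps one CSV exporter per category in a dict called self.exporter. A minimal sketch of such a per-category exporter pipeline — the class name and the category item field are assumptions, not the asker's actual code — might look like this:

from scrapy.exporters import CsvItemExporter


class PerCategoryCsvPipeline:
    """Writes each scraped item to a CSV file named after its category."""

    def open_spider(self, spider):
        self.files = {}     # category -> open file handle
        self.exporter = {}  # category -> CsvItemExporter

    def close_spider(self, spider):
        # unpack each (key, value) pair rather than iterating iter(...)
        for category, exporter in self.exporter.items():
            exporter.finish_exporting()
            self.files[category].close()

    def process_item(self, item, spider):
        # 'category' is an assumed item field; YellowItem would need it
        # added for this to work
        category = item.get('category', 'uncategorized')
        if category not in self.exporter:
            f = open('{}.csv'.format(category), 'wb')
            self.files[category] = f
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.exporter[category] = exporter
        self.exporter[category].export_item(item)
        return item

Registering the pipeline under ITEM_PIPELINES in settings.py would activate it.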