
Single Scrapy project vs. multiple projects

Python

慕仙森 2022-05-24 17:21:27

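I am facing a dilemma about how to store all of my spiders. These spiders will be invoked from the command line and fed from stdin. I also plan to have a subset of these spiders return single-item results through scrapyrt on a separate web server. I will need to create spiders across many different projects with different item models. They will all have similar settings (for example, using the same proxy).

My question is: what is the best way to structure my Scrapy project(s)?

1. Put all spiders in the same repository. This provides an easy way to create base classes for item loaders and item pipelines.
2. Group the spiders for each project I am working on into separate repositories. This has the advantage of letting the items be the focal point of each project, and no project grows too large. The downside is that common code, settings, spider monitors (spidermon), and base classes cannot be shared. Despite some duplication, this feels the cleanest.
3. Package only the spiders I plan to use non-real-time in the NiFi repository and the real-time spiders in another repository. The advantage is that I keep the spiders with the projects that actually use them, but it still centralizes/convolutes which spiders are used with which projects.

It feels like the right answer is #2. Spiders related to a particular program should live in their own Scrapy project, just as when you create a web service for project A you would not say "oh, I can throw all of project B's service endpoints into the same service because that is where all my services will live", even though some settings may be duplicated. Arguably, some shared code/classes could be shared through a separate package.

What do you think? How do you all structure your Scrapy projects to maximize reusability? Where do you draw the line between the same project and separate projects? Is it based on your item model or on the data source?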


2 Answers

慕慕森


Jakob, in the Google Group thread titled "Single Scrapy Project vs. Multiple Projects for various Sources", recommends:

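Whether spiders should go into the same project depends mainly on the type of data they scrape, not on where the data comes from.
Say you are scraping user profiles from all of your target sites. You might then have an item pipeline that cleans and validates user avatars and exports them into your "avatars" database. It makes sense to put all the spiders into the same project. After all, they all use the same pipelines, because the data always has the same shape no matter where it was scraped from. On the other hand, if you are scraping questions from Stack Overflow, user profiles from Wikipedia, and issues from Github, and you validate/process/export each of these data types differently, it makes more sense to put the spiders into separate projects.
In other words, if your spiders have common dependencies (for example, they share item definitions/pipelines/middlewares), they probably belong in the same project; if each of them has its own specific dependencies, they probably belong in separate projects.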
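As a rough sketch of the "same shape, same pipeline" case: a Scrapy item pipeline is an ordinary class with a `process_item(item, spider)` method, so one cleaning step can serve every spider that emits user profiles. The class name `AvatarCleanupPipeline`, the `avatar_url` field, and the sample values are assumptions invented for illustration, and plain dicts stand in for items so the snippet runs without Scrapy.

```python
from urllib.parse import urlsplit


class AvatarCleanupPipeline:
    """Hypothetical shared pipeline: normalizes the avatar URL of any profile item."""

    def process_item(self, item, spider):
        url = (item.get("avatar_url") or "").strip()
        parts = urlsplit(url)
        # Keep only well-formed http(s) URLs; otherwise store None
        if parts.scheme in ("http", "https") and parts.netloc:
            item["avatar_url"] = url
        else:
            item["avatar_url"] = None
        return item


pipeline = AvatarCleanupPipeline()
# The spider argument is unused in this sketch, so None is passed as a placeholder
from_site_a = pipeline.process_item(
    {"user": "alice", "avatar_url": " https://example.com/a.png "}, spider=None
)
from_site_b = pipeline.process_item(
    {"user": "bob", "avatar_url": "not a url"}, spider=None
)
print(from_site_a["avatar_url"])
print(from_site_b["avatar_url"])
```

Because the pipeline only looks at the item's fields, it does not matter which spider (or which site) produced the item, which is the argument for keeping such spiders in one project.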

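Pablo Hoffman, one of the developers of Scrapy, responded in another thread, "Scrapy spider vs project":

...it is recommended to keep all spiders in the same project to improve code reusability (common code, helper functions, etc.).
We sometimes use prefixes on spider names, such as film_spider1, film_spider2, actor_spider1, actor_spider2, and so on. Sometimes we also write spiders that scrape multiple item types, because that makes more sense when the scraped pages overlap heavily.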

Answered 2022-05-24

胡说叔叔


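First, a note: I write paths like '/path' because I am an Ubuntu user. If you are on Windows, adjust them accordingly. That is only a difference between file systems.

Lightweight example

Say you want to scrape two or more different websites. The first is a swimwear retail site. The second is about the weather. You want to scrape both because you want to observe the link between swimsuit prices and the weather, in order to predict lower purchase prices.

Note that in pipelines.py I use Mongo collections, because that is what I use and I do not need SQL for the moment. If you do not know Mongo, think of a collection as the equivalent of a table in a relational database.

The Scrapy project could look like this: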

spiderswebsites.py, where you can write as many spiders as you want.

import scrapy

from ..items import SwimItem, WeatherItem

# If you sometimes have trouble importing from the parent directory, you can do:
# import sys
# sys.path.append('/path/parentDirectory')


class SwimSpider(scrapy.Spider):
    name = "swimsuit"
    start_urls = ['https://www.swimsuit.com']

    def parse(self, response):
        # '//' searches the whole document, not only children of the root
        price = response.xpath('//span[@class="price"]/text()').extract()
        model = response.xpath('//span[@class="model"]/text()').extract()
        ...  # and so on
        item = SwimItem()  # the class must be called -> ()
        item['price'] = price
        item['model'] = model
        ...  # and so on
        return item


class WeatherSpider(scrapy.Spider):
    name = "weather"
    start_urls = ['https://www.weather.com']

    def parse(self, response):
        temperature = response.xpath('//span[@class="temp"]/text()').extract()
        cloud = response.xpath('//span[@class="cloud_perc"]/text()').extract()
        ...  # and so on
        item = WeatherItem()  # the class must be called -> ()
        item['temperature'] = temperature
        item['cloud'] = cloud
        ...  # and so on
        return item

items.py, where you can define as many item schemas as you want.

import scrapy


class SwimItem(scrapy.Item):
    price = scrapy.Field()
    stock = scrapy.Field()
    ...
    model = scrapy.Field()


class WeatherItem(scrapy.Item):
    temperature = scrapy.Field()
    cloud = scrapy.Field()
    ...
    pressure = scrapy.Field()

pipelines.py, where I use Mongo

import pymongo

from .items import SwimItem, WeatherItem


class ScrapePipeline:

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod  # alternative constructor: Scrapy calls it with the running crawler
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGODB_URL'),
            mongo_db=crawler.settings.get('MONGODB_DB', 'default-test')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Choose the collection from the item type (the check is on the item, not the spider)
        if isinstance(item, SwimItem):
            collection_name = 'swimwebsite'
        elif isinstance(item, WeatherItem):
            collection_name = 'weatherwebsite'
        else:
            return item  # unknown item type: pass it through unchanged
        self.db[collection_name].insert_one(dict(item))
        return item  # hand the item on to any later pipeline
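One way to keep the routing step readable as the number of item types grows is a lookup table from item class to collection name. The following is only a sketch: `SwimItem` and `WeatherItem` are redefined here as plain dict subclasses, and `COLLECTIONS` and `collection_for` are hypothetical names, so that the snippet runs without Scrapy or MongoDB.

```python
class SwimItem(dict):
    """Stand-in for the scrapy.Item subclass of the same name."""


class WeatherItem(dict):
    """Stand-in for the scrapy.Item subclass of the same name."""


# Collection names taken from the pipeline above
COLLECTIONS = {
    SwimItem: "swimwebsite",
    WeatherItem: "weatherwebsite",
}


def collection_for(item):
    """Return the collection name for an item, or None for unknown types."""
    for item_class, name in COLLECTIONS.items():
        if isinstance(item, item_class):
            return name
    return None


swim_target = collection_for(SwimItem(price="19.99"))
weather_target = collection_for(WeatherItem(temperature=12))
unknown_target = collection_for({"other": 1})
print(swim_target, weather_target, unknown_target)
```

With this shape, adding a third item type means adding one entry to the table rather than another `elif` branch in `process_item`.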

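So, looking at my example project, you can see that the project does not depend on the item schema at all, because you can use several kinds of items in the same project. The advantage of the pattern above is that you keep one shared configuration in settings.py. But do not forget that you can customize the command that runs a spider. If you want a spider to run slightly differently from the default settings, you can run scrapy crawl spider -s DOWNLOAD_DELAY=35 instead of the 25 you wrote in settings.py.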
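To make the override order concrete, here is a small model of how a setting is resolved: values passed with `-s` on the command line win over a spider's `custom_settings`, which win over settings.py, which wins over Scrapy's built-in defaults. This uses `collections.ChainMap` purely as an illustration of the lookup order; it is not how Scrapy implements it, and the dictionary names are made up for the sketch.

```python
from collections import ChainMap

default_settings = {"DOWNLOAD_DELAY": 0}    # Scrapy's built-in default
project_settings = {"DOWNLOAD_DELAY": 25}   # the value written in settings.py
spider_settings = {}                        # what a spider's custom_settings would hold
command_line = {"DOWNLOAD_DELAY": 35}       # scrapy crawl spider -s DOWNLOAD_DELAY=35

# ChainMap searches its mappings left to right, so earlier ones take priority
with_override = ChainMap(command_line, spider_settings, project_settings, default_settings)
without_override = ChainMap({}, spider_settings, project_settings, default_settings)

delay_with_override = with_override["DOWNLOAD_DELAY"]
delay_without_override = without_override["DOWNLOAD_DELAY"]
print(delay_with_override, delay_without_override)
```

If a difference should apply to one spider every time rather than to one run, putting it in that spider's `custom_settings` attribute avoids having to remember the `-s` flag.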

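Functional programming

My example here is deliberately light. In practice, you are rarely interested in the raw data. You need many processing steps, which means many lines of code. To improve the readability of your code, you can put functions in modules. But beware of spaghetti code.

functions.py, a custom module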

import re


def cloud_temp(response):  # for WeatherSpider
    """Return a tuple (cloud percentage, temperature) as integers."""
    temperature = response.xpath('//span[@class="temp"]/text()').get()  # a str such as " 12°C"
    cloud = response.xpath('//span[@class="cloud_perc"]/text()').get()  # a str such as "30%"
    # processing: we want to store both values as integers
    temperature = int(re.search(r'[0-9]+', temperature).group())  # an int such as 12
    cloud = int(re.search(r'[0-9]+', cloud).group())  # an int such as 30
    return (cloud, temperature)
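The parsing step can be pulled out one level further into a function that takes only a string, which makes it testable without any Scrapy response at all. A minimal sketch, assuming a hypothetical helper name `first_int`; it reuses the same `[0-9]+` pattern and adds handling for a missing node (where the selector returns None) or text without digits.

```python
import re


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    if text is None:
        return None
    match = re.search(r"[0-9]+", text)
    return int(match.group()) if match else None


temperature = first_int(" 12°C")
cloud = first_int("30%")
missing = first_int(None)
no_digits = first_int("n/a")
print(temperature, cloud, missing, no_digits)
```

Like the pattern it is based on, this ignores a leading minus sign, so sub-zero temperatures would need a different pattern.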

In spiders.py, this gives:

import scrapy

from ..items import SwimItem, WeatherItem
from ..functions import cloud_temp  # assumes functions.py sits next to items.py

...


class WeatherSpider(scrapy.Spider):
    name = "weather"
    start_urls = ['https://www.weather.com']

    def parse(self, response):
        cloud, temperature = cloud_temp(response)  # this is shorter than the previous version
        ...  # and so on
        item = WeatherItem()  # the class must be called -> ()
        item['temperature'] = temperature
        item['cloud'] = cloud
        ...  # and so on
        return item

It is also a considerable improvement for debugging. Say I want to run a scrapy shell session.

$ scrapy shell https://www.weather.com
...
# I check whether the directory containing my `functions.py` module is on sys.path.
>>> import sys
>>> sys.path  # returns a list of paths
>>> # if the directory is not present
>>> sys.path.insert(0, '/path/directory')
>>> # now I can import my module in this session and test in the shell while I edit functions.py itself
>>> from functions import *
>>> cloud_temp(response)  # checking that it returns what I want
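One caveat when editing a module while the shell stays open: Python caches imported modules, so an edited function is normally picked up only after `importlib.reload`. The sketch below demonstrates this with a throwaway module written to a temporary directory; `demo_functions` and `parse_percent` are hypothetical names used only for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_functions.py"
    # First version of the helper: returns the text unchanged
    module_path.write_text("def parse_percent(text):\n    return text\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    import demo_functions

    before_edit = demo_functions.parse_percent("30%")

    # "Edit" the file, as you would in your editor while the shell is open
    module_path.write_text(
        "def parse_percent(text):\n    return int(text.rstrip('%'))\n"
    )
    importlib.reload(demo_functions)  # re-executes the module from the edited source
    after_edit = demo_functions.parse_percent("30%")

    sys.path.remove(tmp)

print(repr(before_edit), repr(after_edit))
```

In a session that used `from functions import *`, the equivalent would be `import functions`, then `importlib.reload(functions)`, then repeating the star import so the names in the shell point at the reloaded functions.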

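This is more comfortable than copying and pasting a piece of code. And because Python is a great language for functional programming, you should take advantage of it. That is why I told you: "More generally, any pattern is valid if you limit the number of lines, improve readability, and limit bugs." The more readable the code is, the more you limit bugs. The fewer lines you write (for example, by avoiding copy-pasting the same processing for different variables), the fewer bugs you introduce, because when you fix a function, you fix everything that depends on it.

So for now, if you are not very comfortable with functional programming, I can understand making several projects for different item schemas. Work with your current skills, improve them, and then improve your code over time.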

Answered 2022-05-24
