I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found two resources that explain how to do this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/

I can't figure out where I should put my spider code and how to call it from the main function. Please help. Here is the sample code:

# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# Here (http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')  # Must be at the top before other imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue


class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)
3 Answers
LEATH
All the other answers refer to Scrapy v0.x. According to the updated documentation, Scrapy 1.0 requires:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
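To make this concrete for the original question (where the spider code goes and how to get the scraped items back in the script), here is a minimal, self-contained sketch built on the Scrapy 1.0+ API shown above. The spider name, start URL, CSS selectors, file name and the collect_item helper are illustrative placeholders, not part of the original answer; create_crawler and the item_scraped signal are documented Scrapy APIs, but on very old 1.x releases the signal wiring may need adjusting.

# run_spider.py -- a minimal sketch, assuming a reasonably recent Scrapy (1.x+).
import scrapy
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    # Placeholder spider: scrapes the quote texts from quotes.toscrape.com
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yield one dict item per quote block on the page
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}


if __name__ == '__main__':
    scraped_items = []

    def collect_item(item, response, spider):
        # item_scraped fires for every item that passes through the pipelines
        scraped_items.append(item)

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    crawler = process.create_crawler(QuotesSpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks here until the crawl is finished

    print('%d items scraped' % len(scraped_items))

Running python run_spider.py should print the number of collected items once the crawl finishes, without needing scrapyd or the scrapy command-line tool; if you only need the items written to a file, setting FEED_URI/FEED_FORMAT in the CrawlerProcess settings is an alternative to the signal handler.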