I have the following spider:

```python
import os

import pandas as pd
import scrapy

# FOLDER, LIST and LinkCheckerItem are defined elsewhere in my project


class Downloader(scrapy.Spider):
    name = "sor_spider"
    download_folder = FOLDER

    def get_links(self):
        # Read the URLs from the "Value" column of the Excel file
        df = pd.read_excel(LIST)
        return df["Value"]

    def start_requests(self):
        urls = self.get_links()
        for url in urls.iteritems():
            index = {"index": url[0]}
            yield scrapy.Request(url=url[1], callback=self.download_file,
                                 errback=self.errback_httpbin, meta=index,
                                 dont_filter=True)

    def download_file(self, response):
        url = response.url
        index = response.meta["index"]
        content_type = response.headers['Content-Type']
        # Save the body under the row index, with no extension
        download_path = os.path.join(self.download_folder, r"{}".format(str(index)))
        with open(download_path, "wb") as f:
            f.write(response.body)
        yield LinkCheckerItem(index=response.meta["index"], url=url, code="downloaded")

    def errback_httpbin(self, failure):
        yield LinkCheckerItem(index=failure.request.meta["index"],
                              url=failure.request.url, code="error")
```

It should:

- read the Excel file (LIST) that contains the links
- go to each link and download the file to FOLDER
- log the result in a LinkCheckerItem (which I am exporting to CSV)

That generally works fine, but my list contains different types of files: zip, pdf, doc and so on. These are examples of the links in my LIST:

https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=2c5fb68702294531afd03041e877ca84
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1173293
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1263289
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=eb9f06d2b837401eba9c66c8bf5be813
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=952317
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=1042224
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1160005
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=925955
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166563

I would like it to save each file with its original extension, whatever that is, the same way my browser does when it pops up the save-file dialog. I tried using response.headers["Content-Type"] to figure out the type, but in this case it is always application/octet-stream. What should I do?
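Since Content-Type is useless here, one direction I have been considering is the Content-Disposition header, which download endpoints like these may send with the original filename. This is only a minimal sketch of how download_file could use it; that the header is actually present for these portals, and its exact format, are assumptions I have not verified:

```python
import os
import re


def download_file(self, response):
    index = response.meta["index"]
    # Content-Disposition typically looks like: attachment; filename="report 2017.pdf"
    # (assumption: these servers send it; Scrapy returns header values as bytes)
    disposition = response.headers.get("Content-Disposition", b"").decode("utf-8", errors="ignore")
    match = re.search(r'filename="?([^";]+)"?', disposition)
    # Keep only the extension of the original filename, e.g. ".pdf" or ".zip";
    # fall back to no extension if the header is missing, as the current code does
    extension = os.path.splitext(match.group(1))[1] if match else ""
    download_path = os.path.join(self.download_folder, "{}{}".format(index, extension))
    with open(download_path, "wb") as f:
        f.write(response.body)
    yield LinkCheckerItem(index=index, url=response.url, code="downloaded")
```

Would that be a reasonable approach, or is there a better way to recover the original extension?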