I'm using Scrapy to fetch content from certain URLs on a page, similar to this question: get a list of URLs with Scrapy, then scrape the content inside those URLs. I can get the sub-URLs from my start URLs (the first def), but my second def never seems to run; the resulting file is empty. I've tested the body of the function in the Scrapy shell and it fetches the information I want, but not when I run the spider.

```python
import scrapy
from scrapy.selector import Selector
#from scrapy import Spider
from WheelsOnlineScrapper.items import Dealer
from WheelsOnlineScrapper.url_list import urls
import logging
from urlparse import urljoin

logger = logging.getLogger(__name__)


class WheelsonlinespiderSpider(scrapy.Spider):
    logger.info('Spider starting')
    name = 'wheelsonlinespider'
    rotate_user_agent = True  # lives in middleware.py and settings.py
    allowed_domains = ["https://wheelsonline.ca"]
    start_urls = urls  # this list is created in url_list.py
    logger.info('URLs retrieved')

    def parse(self, response):
        subURLs = []
        partialURLs = response.css('.directory_name::attr(href)').extract()
        for i in partialURLs:
            subURLs = urljoin('https://wheelsonline.ca/', i)
            yield scrapy.Request(subURLs, callback=self.parse_dealers)
            logger.info('Dealer ' + subURLs + ' fetched')

    def parse_dealers(self, response):
        logger.info('Beginning of page')
        dlr = Dealer()

        # Extracting the content using css selectors
        try:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first() + ' ' + response.css(".dealer_head_aux_name::text").extract_first()
        except TypeError:
            dlr['DealerName'] = response.css(".dealer_head_main_name::text").extract_first()
        dlr['MailingAddress'] = ','.join(response.css(".dealer_address_right::text").extract())
        dlr['PhoneNumber'] = response.css(".dealer_head_phone::text").extract_first()

        logger.info('Dealer fetched ' + dlr['DealerName'])
        yield dlr
        logger.info('End of page')
```
1 Answer
牛魔王的故事
Your `allowed_domains` list contains the protocol (`https`). According to the documentation, it should contain only the domain name:

```python
allowed_domains = ["wheelsonline.ca"]
```

Also, you should see a message about this in your logs:

URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://wheelsonline.ca in allowed_domains
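To see why this setting breaks the second callback, here is a minimal sketch of how an offsite check of this kind works. The `is_offsite` helper is hypothetical (a simplification, not Scrapy's actual implementation): a request is kept only if its hostname equals, or is a subdomain of, one of the allowed domains. A string with a protocol can never match a bare hostname, so every `scrapy.Request` yielded from `parse` is filtered out and `parse_dealers` is never called.

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Hypothetical, simplified version of an offsite filter:
    # keep the request only if the URL's hostname matches an allowed
    # domain exactly, or is a subdomain of one.
    host = urlparse(url).hostname or ""
    allowed = any(host == d or host.endswith("." + d) for d in allowed_domains)
    return not allowed

# With the protocol in allowed_domains, nothing ever matches:
print(is_offsite("https://wheelsonline.ca/dealer/1", ["https://wheelsonline.ca"]))  # True  (request dropped)

# With only the domain name, the request passes through:
print(is_offsite("https://wheelsonline.ca/dealer/1", ["wheelsonline.ca"]))          # False (request kept)
```

Once `allowed_domains` holds just `"wheelsonline.ca"`, the sub-URL requests are no longer treated as offsite and the `parse_dealers` callback runs.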