
Scraping stock details from Business Insider with Scrapy

Python

噜噜哒 2022-12-06 15:26:44

I am trying to extract the "Name", "Latest Price", and "%" fields for each stock from the following site: https://markets.businessinsider.com/index/components/s&p_500

However, I get no data, even though I have confirmed in the Chrome console that my XPaths work for these fields.

For reference, I have been following this guide: https://realpython.com/web-scraping-with-scrapy-and-mongodb/

items.py

from scrapy.item import Item, Field

class InvestmentItem(Item):
    ticker = Field()
    name = Field()
    px = Field()
    pct = Field()

investment_spider.py

from scrapy import Spider
from scrapy.selector import Selector
from investment.items import InvestmentItem

class InvestmentSpider(Spider):
    name = "investment"
    allowed_domains = ["markets.businessinsider.com"]
    start_urls = [
        "https://markets.businessinsider.com/index/components/s&p_500",
    ]

    def parse(self, response):
        stocks = Selector(response).xpath('//*[@id="index-list-container"]/div[2]/table/tbody/tr')
        for stock in stocks:
            item = InvestmentItem()
            item['name'] = stock.xpath('td[1]/a/text()').extract()[0]
            item['px'] = stock.xpath('td[2]/text()[1]').extract()[0]
            item['pct'] = stock.xpath('td[5]/span[2]').extract()[0]
            yield item

Console output:

...
2020-05-26 00:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/robots.txt> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markets.businessinsider.com/index/components/s&p_500> (referer: None)
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-26 00:08:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
2020-05-26 00:08:33 [scrapy.core.engine] INFO: Spider closed (finished)


2 Answers

蝴蝶不菲


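You are missing "./" at the start of your relative XPath expressions. I have also simplified your XPaths; note that the row selector no longer goes through tbody, an element that browsers such as Chrome add to the DOM but that is often absent from the raw HTML Scrapy downloads: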

def parse(self, response):
    stocks = response.xpath('//table[@class="table table-small"]/tr')
    # Skip the header row.
    for stock in stocks[1:]:
        item = InvestmentItem()
        item['name'] = stock.xpath('./td[1]/a/text()').get()
        # The price text carries surrounding whitespace, so strip it.
        item['px'] = stock.xpath('./td[2]/text()[1]').get().strip()
        item['pct'] = stock.xpath('./td[5]/span[2]/text()').get()
        yield item
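An XPath that works in the Chrome console can still match nothing in Scrapy when it was written against the browser's DOM rather than the downloaded markup. The sketch below illustrates this under stated assumptions: the markup is a hypothetical, simplified stand-in for the page, and the standard library's xml.etree.ElementTree is used in place of Scrapy's selectors so that it runs without Scrapy installed. The path that includes tbody finds no rows, while the same path without it finds all of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the downloaded page source.
# As in raw HTML served without <tbody>, rows sit directly under <table>.
RAW_HTML = """
<div id="index-list-container">
  <div></div>
  <div>
    <table class="table table-small">
      <tr><th>Name</th><th>Latest Price</th></tr>
      <tr><td><a>3M</a></td><td>146.44</td></tr>
      <tr><td><a>AO Smith</a></td><td>42.22</td></tr>
    </table>
  </div>
</div>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path in the style of a browser inspector: it expects a <tbody> element.
with_tbody = root.findall("./div[2]/table/tbody/tr")

# Same path without <tbody>, matching the markup as downloaded.
without_tbody = root.findall("./div[2]/table/tr")

# Skip the header row, then read the link text in the first cell of each row.
names = [row.find("./td[1]/a").text for row in without_tbody[1:]]

print(len(with_tbody), len(without_tbody), names)
```

To check what Scrapy actually receives for a real page, inspect response.text (or the page source in the browser) rather than the element inspector.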

2022-12-06

阿波罗的战车


XPath version

def parse(self, response):
    rows = response.xpath('//*[@id="index-list-container"]/div[2]/table/tr')
    for row in rows:
        # extract() returns a list of all matches for each field.
        yield {
            'name': row.xpath('td[1]/a/text()').extract(),
            'price': row.xpath('td[2]/text()[1]').extract(),
            'pct': row.xpath('td[5]/span[2]/text()').extract(),
            'datetime': row.xpath('td[7]/span[2]/text()').extract(),
        }

CSS version

def parse(self, response):
    table = response.css('div#index-list-container table.table-small')
    rows = table.css('tr')
    for row in rows:
        name = row.css('a::text').get()
        high_low = row.css('td:nth-child(2)::text').get()
        date_time = row.css('td:nth-child(7) span:nth-child(2) ::text').get()
        yield {
            'name': name,
            'high_low': high_low,
            'date_time': date_time,
        }

Result

{"high_low": "\r\n146.44", "name": "3M", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},

{"high_low": "\r\n42.22", "name": "AO Smith", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},

{"high_low": "\r\n91.47", "name": "Abbott Laboratories", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},

{"high_low": "\r\n92.10", "name": "AbbVie", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},

{"high_low": "\r\n193.71", "name": "Accenture", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},

{"high_low": "\r\n73.08", "name": "Activision Blizzard", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},

{"high_low": "\r\n385.26", "name": "Adobe", "date_time": "05/25/2020 08:00:00 PM UTC-0400"},

{"high_low": "\r\n133.48", "name": "Advance Auto Parts", "date_time": "05/26/2020 04:15:11 PM UTC-0400"},
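The scraped values above still carry raw formatting: the price begins with "\r\n" and the timestamp is a plain string. A minimal sketch of one way to normalize a row follows, using only the standard library. The helper name clean_row and the output keys price and quoted_at are hypothetical, and treating "," as a thousands separator is an assumption that the sample rows do not confirm.

```python
from datetime import datetime
from decimal import Decimal


def clean_row(row):
    """Normalize one scraped row (hypothetical helper, not part of Scrapy)."""
    # Strip the leading "\r\n"; removing "," assumes it is a thousands separator.
    price_text = row["high_low"].strip().replace(",", "")
    # Timestamps look like "05/26/2020 04:15:11 PM UTC-0400".
    quoted_at = datetime.strptime(row["date_time"], "%m/%d/%Y %I:%M:%S %p UTC%z")
    return {
        "name": row["name"].strip(),
        "price": Decimal(price_text),
        "quoted_at": quoted_at,
    }


sample = {
    "high_low": "\r\n146.44",
    "name": "3M",
    "date_time": "05/26/2020 04:15:11 PM UTC-0400",
}
cleaned = clean_row(sample)
print(cleaned["name"], cleaned["price"], cleaned["quoted_at"].isoformat())
```

Decimal is used here instead of float so that prices keep their exact decimal representation; a step like this could live in the spider itself or in an item pipeline.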

2022-12-06
