2 Answers
User with 1813 contribution points, over 2 upvotes
Scrapy is a good choice for this task. A very simple spider is enough to collect the required information.
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.amazon.com/dp/B07Q6H83VY']

    def parse(self, response):
        # Each review on the product page sits in its own div.review block
        for row in response.css('div.review'):
            item = {}
            item['author'] = row.css('span.a-profile-name::text').extract_first()
            # Keep only the leading number of the rating text (e.g. "4.0 out of 5 stars")
            rating = row.css('i.review-rating > span::text').extract_first(default='').strip().split(' ')[0]
            # Guard against a missing rating instead of failing on None
            item['rating'] = int(float(rating.replace(',', '.'))) if rating else None
            item['title'] = row.css('span.review-title > span::text').extract_first()
            created_date = row.css('span.review-date::text').extract_first(default='').strip()
            item['created_date'] = created_date
            # Join the non-empty text fragments of the review body
            review_content = row.css('div.reviewText ::text').extract()
            review_content = [rc.strip() for rc in review_content if rc.strip()]
            item['content'] = ', '.join(review_content)
            yield item
Sample output:
{
"author": "Jhona Diaz",
"rating": 4,
"title": "Recomendable solo si eres fan ya que si está algo caro",
"created_date": "Reviewed in Mexico on November 23, 2019",
"content": "Buena calidad y pues muy completo"
},
{
"author": "MANUEL MENDOZA OLVERA",
"rating": 5,
"title": "Perfecto Estado",
"created_date": "Reviewed in Mexico on September 28, 2019",
"content": "excelente, la edición es de caja metálica y llegó intacta"
},
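The rating step in the spider above is the part most likely to break when the page layout or locale changes, so it can help to isolate it. Below is a minimal sketch of that step as a standalone function; the name parse_rating and the sample strings are assumptions for illustration, not part of the original answer.

```python
import re
from typing import Optional


def parse_rating(text: Optional[str]) -> Optional[int]:
    """Return the leading star rating as an int, or None if none is found.

    Hypothetical helper: mirrors the spider's "take the first number,
    swap ',' for '.', truncate to int" logic, but tolerates missing text.
    """
    if not text:
        return None
    # Match a leading number such as "4", "4.0", or "4,0"
    match = re.match(r"\s*(\d+(?:[.,]\d+)?)", text)
    if match is None:
        return None
    return int(float(match.group(1).replace(",", ".")))


if __name__ == "__main__":
    print(parse_rating("4.0 out of 5 stars"))  # 4
    print(parse_rating("4,0 de 5 estrellas"))  # 4
    print(parse_rating(None))                  # None
```

To try the spider itself, one option is to save it to a file (for example test_spider.py, a hypothetical name) and run scrapy runspider test_spider.py -o reviews.json, which writes the yielded items to a JSON file.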
User with 1963 contribution points, over 6 upvotes
First, run pip install selenium
Second, to scrape JavaScript-driven websites, use the Python library dryscrape, and download PhantomJS from https://phantomjs.org/download.html
from selenium import webdriver

# Path to the PhantomJS executable downloaded in step 2.
# Note: webdriver.PhantomJS and find_element_by_id were removed in Selenium 4, so this needs Selenium 3.
driver = webdriver.PhantomJS(executable_path='C:\\Users\\nayef\\Desktop\\New folder\\phantomjs-2.1.1-windows\\bin\\phantomjs')
driver.get('https://www.amazon.com/dp/B07Q6H83VY')
# Locate the delivery message element by its id and print its text
p_element = driver.find_element_by_id('deliveryMessageMirId')
print(p_element.text)
driver.quit()
# Result:
# Arrives: Friday, Aug 7 Details
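The printed text mixes a label, the date, and the trailing "Details" link text. If only the date is needed, a small post-processing step can strip the rest. This is a rough sketch that assumes the message always has the shape shown in the result above; the function name extract_delivery_date is hypothetical.

```python
import re
from typing import Optional


def extract_delivery_date(message: str) -> Optional[str]:
    """Return the date part of an 'Arrives: ... Details' message, or None.

    Assumption: the text starts with 'Arrives:' and may end with 'Details'.
    """
    match = re.search(r"Arrives:\s*(.+?)(?:\s+Details)?\s*$", message)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_delivery_date("Arrives: Friday, Aug 7 Details"))  # Friday, Aug 7
```

In the Selenium script this would be called as extract_delivery_date(p_element.text). A None result signals that the page text no longer matches the assumed pattern.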