
Scraping Amazon reviews with Beautiful Soup


人到中年有点甜 2023-04-18 17:18:38
I need to scrape some information from this Amazon page:

https://www.amazon.com/dp/B07Q6H83VY/ref=sspa_dk_detail_6?pd_rd_i=B07Q6H83VY&pd_rd_w=n4cqh&pf_rd_p=48d372c1-f7e1-4b8b-9d02-4bd86f5158c5&pd_rd_wg=8d6Pd&pf_rd_r=AES6X22PPPPREK5DD60G&pd_rd_r=2a4ff4e6-f8ce-4d62-8106-cffd53838b9e&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEyTTZUQzQ0Q05TOVZJJmVuY3J5cHRlZElkPUEwMDU2MjE0Q05HOUFSMkdQTkhPJmVuY3J5cHRlZEFkSWQ9QTA4NTIyNzAxOVZYM1dISEVBUk1DJndpZGdldE5hbWU9c3BfZGV0YWlsJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ&th=1

Specifically, I am interested in these fields: Author | Star | Date | Title | Review

For example:

    Gi
    1.0 out of 5 stars Unacceptable Launch State for PS4
    Reviewed in the United States on September 14, 2019
    Platform: PlayStation 4 Edition: Super Deluxe Verified Purchase

I have never done this before, so I would like to know whether I can do it with Scrapy/BeautifulSoup/Selenium or whether I need an API, although the information comes from these elements:

Author under <span class="a-profile-name">Gi</span>
Rating <span class="a-icon-alt">1.0 out of 5 stars</span>
Review <div data-hook="review-collapsed" aria-expanded="false" class="a-expander-content a-expander-partial-collapse-content" style="padding-bottom: 19px;"> ...TEXT...</div>

2 Answers

慕姐8265434

Contributed 1813 experience points · received over 2 likes

Scrapy would be a good choice for this task. A very simple spider like the one below can collect the required information.


import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.amazon.com/dp/B07Q6H83VY']

    def parse(self, response):
        # Each review on the product page sits in its own div.review block.
        for row in response.css('div.review'):
            item = {}

            item['author'] = row.css('span.a-profile-name::text').extract_first()

            # "4.0 out of 5 stars" -> 4; default='' avoids calling .strip() on None.
            rating = row.css('i.review-rating > span::text').extract_first(default='').strip().split(' ')[0]
            item['rating'] = int(float(rating.replace(',', '.'))) if rating else None

            item['title'] = row.css('span.review-title > span::text').extract_first()
            item['created_date'] = row.css('span.review-date::text').extract_first(default='').strip()

            # Collect every non-empty text fragment of the review body.
            review_content = row.css('div.reviewText ::text').extract()
            review_content = [rc.strip() for rc in review_content if rc.strip()]
            item['content'] = ', '.join(review_content)

            yield item

Sample output:


    {
        "author": "Jhona Diaz",
        "rating": 4,
        "title": "Recomendable solo si eres fan ya que si está algo caro",
        "created_date": "Reviewed in Mexico on November 23, 2019",
        "content": "Buena calidad y pues muy completo"
    },
    {
        "author": "MANUEL MENDOZA OLVERA",
        "rating": 5,
        "title": "Perfecto Estado",
        "created_date": "Reviewed in Mexico on September 28, 2019",
        "content": "excelente, la edición es de caja  metálica y llegó intacta"
    },
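The spider truncates the rating to an integer and keeps the date as the raw "Reviewed in ... on ..." string. If numeric ratings and real dates are wanted, something like the following could post-process those two fields. This is only a sketch: the helper names `parse_rating` and `parse_review_date` are made up for illustration, and the date parsing assumes English strings in the format shown in the sample output.

```python
import re
from datetime import datetime


def parse_rating(text):
    """Return the first number in a rating string as a float, or None."""
    if not text:
        return None
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None
    # Some locales use a decimal comma ("4,0"), so normalise it first.
    return float(match.group(0).replace(",", "."))


def parse_review_date(text):
    """Split 'Reviewed in <country> on <Month D, YYYY>' into (country, date).

    Returns (None, None) when the text does not follow that pattern, and
    (country, None) when the date part is not an English 'Month D, YYYY'.
    """
    match = re.match(r"Reviewed in (?:the )?(.+?) on (.+)$", (text or "").strip())
    if match is None:
        return None, None
    country, raw_date = match.groups()
    try:
        parsed = datetime.strptime(raw_date, "%B %d, %Y").date()
    except ValueError:
        parsed = None
    return country, parsed


if __name__ == "__main__":
    print(parse_rating("1.0 out of 5 stars"))  # 1.0
    print(parse_review_date("Reviewed in Mexico on November 23, 2019"))
    # ('Mexico', datetime.date(2019, 11, 23))
```

Inside the spider these could replace the split-based rating line and the plain `created_date` assignment, at the cost of items that are no longer plain strings.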


Answered 2023-04-18
神不在的星期二

Contributed 1963 experience points · received over 6 likes

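First, install Selenium: pip install selenium

Second, download PhantomJS, the headless browser used below to render JavaScript-driven pages, from https://phantomjs.org/download.html. (The original answer calls it the Python library dryscrape, but the link and the code both point to PhantomJS.)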


from selenium import webdriver

# Path to the PhantomJS executable downloaded in step 2.
# Note: webdriver.PhantomJS and find_element_by_id need Selenium 3.x;
# both were removed during the Selenium 4 series.
driver = webdriver.PhantomJS(executable_path='C:\\Users\\nayef\\Desktop\\New folder\\phantomjs-2.1.1-windows\\bin\\phantomjs')
driver.get('https://www.amazon.com/dp/B07Q6H83VY')

# Read the delivery message element from the rendered product page.
p_element = driver.find_element_by_id('deliveryMessageMirId')
print(p_element.text)

driver.quit()


# result:

Arrives: Friday, Aug 7 Details
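Since the question asks about Beautiful Soup, here is a rough sketch of how the requested fields could be pulled out of HTML once a browser has rendered it (for example from `driver.page_source`). The `SAMPLE_HTML` string is a hand-written stand-in, not Amazon's real markup: it only reuses the class names and `data-hook` value quoted in the question and in the Scrapy answer, and the review body is placeholder text. The names `extract_reviews` and `_text` are hypothetical helpers.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for driver.page_source (assumption, not real Amazon HTML).
SAMPLE_HTML = """
<div class="a-section review">
  <span class="a-profile-name">Gi</span>
  <i class="review-rating"><span class="a-icon-alt">1.0 out of 5 stars</span></i>
  <span class="review-title"><span>Unacceptable Launch State for PS4</span></span>
  <span class="review-date">Reviewed in the United States on September 14, 2019</span>
  <div data-hook="review-collapsed" class="a-expander-content">Placeholder review text.</div>
</div>
"""


def _text(row, selector):
    """Return the stripped text of the first match under row, or None."""
    node = row.select_one(selector)
    return node.get_text(strip=True) if node else None


def extract_reviews(html):
    """Collect author, rating, title, date and review text per review block."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for row in soup.select("div.review"):
        reviews.append({
            "author": _text(row, "span.a-profile-name"),
            "rating": _text(row, "span.a-icon-alt"),
            "title": _text(row, "span.review-title > span"),
            "date": _text(row, "span.review-date"),
            "review": _text(row, 'div[data-hook="review-collapsed"]'),
        })
    return reviews


if __name__ == "__main__":
    for review in extract_reviews(SAMPLE_HTML):
        print(review)
```

Missing elements come back as None rather than raising, which matters on real pages where some reviews lack a title or body. The selectors would need checking against the live page, since the markup may differ from what is quoted here.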


Answered 2023-04-18