1 回答
TA贡献1111条经验 获得超0个赞
下面显示了您可以进行的一些更改。我在某些时候改用了 css 选择器。
要循环的主要结果集由 检索soup.select('.s-result-list [data-asin]')。这指定了具有类名的元素,.s-result-list其子元素具有属性data-asin。这与页面上的 60 个(当前)项目匹配。
我将 PRIME 选择交换为使用属性 = 值选择器
头现在h5即header = soup.select_one('h5')。
soup.select_one('[aria-label="Amazon Prime"]
示例代码:
import datetime
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import re
keyword = 'blue+skateboard'
driver = webdriver.Chrome()
url = 'https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords={}'
driver.get(url.format(keyword))
soup = BeautifulSoup(driver.page_source, 'lxml')
results = soup.select('.s-result-list [data-asin]')
for a, b in enumerate(results):
soup = b
header = soup.select_one('h5')
result = a + 1
title = header.text.strip()
try:
link = soup.select_one('h5 > a')
url = link['href']
url = re.sub(r'/ref=.*', '', str(url))
except:
url = "None"
if url !='/gp/slredirect/picassoRedirect.html':
ASIN = re.sub(r'.*/dp/', '', str(url))
#print(ASIN)
try:
score = soup.select_one('.a-icon-alt')
score = score.text
score = score.strip('\n')
score = re.sub(r' .*', '', str(score))
except:
score = "None"
try:
reviews = soup.select_one("href*='#customerReviews']")
reviews = reviews.text.strip()
except:
reviews = "None"
try:
PRIME = soup.select_one('[aria-label="Amazon Prime"]')
PRIME = PRIME['aria-label']
except:
PRIME = "None"
data = {keyword:[keyword,str(result),title,ASIN,score,reviews,PRIME,datetime.datetime.today().strftime("%B %d, %Y")]}
print(data)
示例输出:
添加回答
举报