Why can I only scrape the first 4 pages of results on eBay?

慕工程0101907 2023-08-22 17:09:22
I have a simple script to analyze sold-listing data on eBay (baseball trading cards). The first 4 pages seem to work fine, but on page 5 it no longer loads the desired HTML content at all, and I can't figure out why this happens:

#Import statements
import requests
import time
from bs4 import BeautifulSoup as soup
from tqdm import tqdm

#FOR DEBUG
Page_1="https://www.ebay.com/sch/213/i.html?_from=R40&LH_Sold=1&_sop=16&_pgn=1"

#Request URL working example
source=requests.get(Page_1)
time.sleep(5)
eBay_full = soup(source.text, "lxml")
Complete_container=eBay_full.find("ul",{"class":"b-list__items_nofooter"})
Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
items=[]

#For all items on page perform desired operation
for i in tqdm(Single_item):
    items.append(i.find("a", {"class": "s-item__link"})["href"].split('?')[0].split('/')[-1])
    #Works fine for Links_to_check[0] upto Links_to_check[3]

However, when I try to scrape the fifth page or any later page, the following happens:

Page_5="https://www.ebay.com/sch/213/i.html?_from=R40&LH_Sold=1&_sop=16&_pgn=5"

source=requests.get(Page_5)
time.sleep(5)
eBay_full = soup(source.text, "lxml")
Complete_container=eBay_full.find("ul",{"class":"b-list__items_nofooter"})
Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
items=[]

#For all items on page perform desired operation
for i in tqdm(Single_item):
    items.append(i.find("a", {"class": "s-item__link"})["href"].split('?')[0].split('/')[-1])

----> 5 Single_item=Complete_container.find_all("div",{"class":"s-item__wrapper clearfix"})
      6 items=[]
      7 #For all items on page perform desired operation

AttributeError: 'NoneType' object has no attribute 'find_all'

This seems to be the logical result of the ul with class b-list__items_nofooter being missing from the eBay_full soup for the later pages. But the question is why this information is missing. Scrolling through the soup, none of the items of interest appear to be present, even though the information does show up on the web page itself, as expected. Can anyone guide me?

2 Answers

Helenr

Contributed 1780 experience points · received more than 4 upvotes

Put only one browser in the headers variable, together with a current stable version number (for example Chrome/53.0.2785.143):


headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}


source = requests.get(Page_5, headers=headers, timeout=2)


Answered 2023-08-22
缥缈止盈

Contributed 2041 experience points · received more than 4 upvotes

The main problem is that eBay detects that the request is being sent by a bot/script.
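One way to check this from the script side is to inspect the response before parsing it. The sketch below is only an illustration: `diagnose_response` is a hypothetical helper, and it assumes that a blocked response either has a non-200 status or lacks the `b-list__items_nofooter` class that the question's code looks for. What eBay actually returns to a blocked client is not documented here and may differ.

```python
def diagnose_response(status_code: int, html: str,
                      marker: str = "b-list__items_nofooter") -> str:
    """Describe whether a response looks like a normal listing page.

    `marker` is the class name the question's parser depends on; it is an
    assumption that its absence signals a blocked or stripped-down page.
    """
    if status_code != 200:
        return f"HTTP {status_code}: the request was not served normally"
    if marker not in html:
        return f"HTTP 200 but '{marker}' is missing from the HTML"
    return "ok"


# Typical use with the question's variables (not executed here):
#   source = requests.get(Page_5)
#   print(diagnose_response(source.status_code, source.text))
```

Running the check before calling `find_all` turns the `AttributeError` from the question into a message that says what was actually missing.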

But how does eBay detect it? The default requests user-agent is python-requests; eBay recognizes it and appears to block requests made with that user-agent.
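The default value can be printed without sending any request. This short sketch uses `requests.utils.default_user_agent()` and a `requests.Session` to show what would be sent, then overrides it with the browser string used later in this answer; the variable names are illustrative only.

```python
import requests

# The User-Agent that requests sends when the caller supplies none.
# It has the form "python-requests/<version>".
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A Session starts with the same default in its headers mapping.
session = requests.Session()
print(session.headers["User-Agent"])

# Overriding it on the session applies to every request made through it.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
)
session.headers.update({"User-Agent": BROWSER_UA})
print(session.headers["User-Agent"])
```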

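By adding a custom user-agent we can, to some extent, fake a real user request. However, this is not completely reliable, and the headers may need to be rotated and/or combined with proxies, ideally residential ones.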
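Rotation could be sketched roughly as follows. `build_request_kwargs` is a hypothetical helper, the user-agent pool simply reuses the two strings that appear in this thread, and the proxy URL in the usage note is a placeholder. The returned dictionary uses the `headers`, `proxies` and `timeout` keyword arguments that `requests.get` accepts.

```python
import random
from typing import Optional, Sequence

# Illustrative pool: both strings are the ones quoted in the answers here.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36",
]


def build_request_kwargs(user_agents: Sequence[str],
                         proxy_urls: Sequence[str] = (),
                         rng: Optional[random.Random] = None) -> dict:
    """Pick a random user-agent (and proxy, if any) for a single request."""
    if rng is None:
        rng = random.Random()
    kwargs = {
        "headers": {"User-Agent": rng.choice(list(user_agents))},
        "timeout": 30,
    }
    if proxy_urls:
        proxy = rng.choice(list(proxy_urls))
        # requests expects a mapping from URL scheme to proxy URL.
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return kwargs
```

A call such as `requests.get(url, params=params, **build_request_kwargs(USER_AGENTS, ["http://proxy.example:8080"]))` would then use a different combination on each request. Whether this is enough to avoid blocking is not guaranteed.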

A list of user-agents is available on Whatismybrowser.

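As a side note, you can use the SelectorGadget Chrome extension to pick CSS selectors easily by clicking the desired element in your browser. It does not always work perfectly if the page relies heavily on JS (in this case it works).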

The example below shows how to extract listings from all pages.

from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    }

params = {
    '_nkw': 'baseball trading cards', # search query
    'LH_Sold': '1',                   # shows sold items
    '_pgn': 1                         # page number
    }

data = []

while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]

        data.append({
          "title" : title,
          "price" : price,
          "link" : link
        })

    # keep paginating while the page still has a "next" button
    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

# print once, after the last page has been collected
print(json.dumps(data, indent=2, ensure_ascii=False))
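To see which URL a given `params` dictionary corresponds to, an equivalent query string can be built with the standard library. This is only an illustration: `build_search_url` is a hypothetical helper, and requests performs its own encoding, so the exact string it sends may differ in detail.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ebay.com/sch/i.html"


def build_search_url(page_number: int) -> str:
    """Return the search URL for one results page (illustrative helper)."""
    params = {
        "_nkw": "baseball trading cards",  # search query
        "LH_Sold": "1",                    # sold items only
        "_pgn": page_number,               # page number
    }
    # urlencode turns spaces into "+" and joins the pairs with "&".
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url(5))
```

Only `_pgn` changes between pages, which is why the loop above can reuse a single dictionary and increment that one key.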

Example output


Extracting page: 1

----------

[

  {

    "title": "Shop on eBay",

    "price": "$20.00",

    "link": "https://ebay.com/itm/123456?hash=item28caef0a3a:g:E3kAAOSwlGJiMikD&amdata=enc%3AAQAHAAAAsJoWXGf0hxNZspTmhb8%2FTJCCurAWCHuXJ2Xi3S9cwXL6BX04zSEiVaDMCvsUbApftgXEAHGJU1ZGugZO%2FnW1U7Gb6vgoL%2BmXlqCbLkwoZfF3AUAK8YvJ5B4%2BnhFA7ID4dxpYs4jjExEnN5SR2g1mQe7QtLkmGt%2FZ%2FbH2W62cXPuKbf550ExbnBPO2QJyZTXYCuw5KVkMdFMDuoB4p3FwJKcSPzez5kyQyVjyiIq6PB2q%7Ctkp%3ABlBMULq7kqyXYA"

  },

  {

    "title": "Ken Griffey Jr. Seattle Mariners 1989 Topps Traded RC Rookie Card #41T",

    "price": "$7.20",

    "link": "https://www.ebay.com/itm/385118055958?hash=item59aad32e16:g:EwgAAOSwhgljI0Vm&amdata=enc%3AAQAHAAAAoFRRlvb50yb%2FN4cmlg5OtVDKIH0DsaMJBL3Tp67SI1dCSP1WPdZW3f16bTf4HTSUhX0g3OMmZSitEY3F3SVGg0%2FhSBF3ykE9X88Lo2EHuS2b23tA1kGiG92F9xyr73RLorcidserdH8tvUXhxmT4pJDnCfMAdfqtRzSIxcB6h4aDC1J1XvJ5IyRfYtWBGUQ60ykrA7mNlhH53cwZe5MiRSw%3D%7Ctkp%3ABk9SR7rKxt7sYA"

  },

  {

    "title": "Ken Griffey Jr. 1989 Score Traded Rookie Card Gem 10 Auto Beckett 13604418",

    "price": "$349.00",

    "link": "https://www.ebay.com/itm/353982131344?hash=item526afaac90:g:9hQAAOSwvCpiQ5FY&amdata=enc%3AAQAHAAAAoOKm1SWvHtdNVIEqtE4m5%2B453xtvR75ZimUBLL16P0WwfJy%2BJJQ2Phd9crgAacTWlsqp9HB%2Ft0McttOjmCfyL0RDQB%2FYOWQK3hxj%2FoDRmybJRipjqb0JG2%2BCa1RhI04PN3R5wpH9vvYqefwY6JuAsPqDU0SmSk6h1h%2FQr7cfJqOmdCo0cqbwPcJ8OcvAyP07txigrDyO55XqFD7CHcSmUPA%3D%7Ctkp%3ABk9SR7rKxt7sYA"

  },

  {

    "title": "Mike Jorgensen NY Mets MLB OF-1B 1972 Topps Baseball Card #16 Single Original",

    "price": "$1.19",

    "link": "https://www.ebay.com/itm/374255790865?hash=item5723622b11:g:KiwAAOSwz4ljI0G4&amdata=enc%3AAQAHAAAAoPVkKyeDZ7wbRNBwQppCcjVmLlOlY3ylPVwQyG7dfOy1UtPYhK7tRXtvn5v3M5n%2F35MS1LXLvWAioKRrMGPEPCmDoMkhdynuH3csaincrM%2F6JNwwIUFa3F%2FcylfPqnrxjJXF7cZ3ga9aCihTM6sfVJc1kzNkaBw2C2ewMyQ3ARgYpuDcUa6CMo4zBKF%2FGTj5KlZieLYywQm4dnzLCrFbtEM%3D%7Ctkp%3ABk9SR7rKxt7sYA"

  },

  # ...

]
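The links in this output carry long tracking query strings, whereas the question keeps only the item ID with `split('?')[0].split('/')[-1]`. A minimal sketch of the same extraction using `urllib.parse` follows; `item_id_from_link` is a hypothetical helper name, and it assumes the ID is the last path segment, as in the links shown above.

```python
from urllib.parse import urlsplit


def item_id_from_link(link: str) -> str:
    """Return the last path segment of a listing URL, ignoring the query."""
    path = urlsplit(link).path  # e.g. "/itm/385118055958"
    return path.rstrip("/").split("/")[-1]


print(item_id_from_link(
    "https://www.ebay.com/itm/385118055958?hash=item59aad32e16"
))
```

Applied to each `link` value collected in `data`, this gives the same kind of `items` list the question was building.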


Answered 2023-08-22