
Web scraping with a varying number of pages

Python

SMILET 2023-12-29 15:26:22

I'm trying to scrape a batch of profiles from a website. Each profile has a collection of videos, and I want to scrape the information for every video. The problem is that each profile uploads a different number of videos, so the number of pages of videos differs from profile to profile. For example, one profile has 45 pages of videos, as the HTML below shows:

    <div class="pagination "><ul><li><a class="active" href="">1</a></li><li><a href="#1">2</a></li><li><a href="#2">3</a></li><li><a href="#3">4</a></li><li><a href="#4">5</a></li><li><a href="#5">6</a></li><li><a href="#6">7</a></li><li><a href="#7">8</a></li><li><a href="#8">9</a></li><li><a href="#9">10</a></li><li><a href="#10">11</a></li><li class="no-page"><a href="#" class="ellipsis last-ellipsis">...</a><li><a href="#44" class="last-page">45</a></li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>

while another profile has 2 pages:

    <div class="pagination "><ul><li><a class="active" href="">1</a></li><li><a href="#1">2</a></li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>

My question is: how do I account for the varying number of pages? I was thinking of writing a for loop with an arbitrarily large upper bound, for example

    for i in range(0,1000):
        new_url = 'url' + str(i)

so that it covers however many pages a profile has, but I'd like to know whether there is a more efficient way to do this.


3 Answers

子衿沉夜

1828 contribution points, 3+ likes

Get the <li>...</li> elements inside <div class="pagination "><ul>.
Exclude the last one by its class, <li class="no-page">.
Parse each "href" and build your next URL destinations.
Scrape each new URL destination.
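The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, the names PaginationLinks, page_indexes and page_urls are made up for this example, and the URL template with a {page} placeholder is hypothetical, since the question does not show how a page index maps to a real URL. Because pages hidden behind the "..." entry have no link of their own, the sketch fills the range from 0 up to the largest index it finds.

```python
from html.parser import HTMLParser

# Shortened copy of the pagination markup quoted in the question.
SAMPLE = (
    '<div class="pagination "><ul>'
    '<li><a class="active" href="">1</a></li>'
    '<li><a href="#1">2</a></li>'
    '<li class="no-page"><a href="#" class="ellipsis last-ellipsis">...</a>'
    '<li><a href="#44" class="last-page">45</a></li>'
    '<li><a href="#1" class="no-page next-page">'
    '<span class="mobile-hide">Next</span>'
)


class PaginationLinks(HTMLParser):
    """Collect href values of <a> tags inside <div class="pagination">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # greater than 0 while inside the pagination div
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self.div_depth or "pagination" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth:
            # Skip the "..." and "Next" links; they are not pages themselves.
            if "no-page" in classes or "ellipsis" in classes:
                return
            self.hrefs.append(attrs.get("href") or "")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def page_indexes(html):
    """Return the sorted page indexes that have a visible link."""
    parser = PaginationLinks()
    parser.feed(html)
    # "" (the active first page) counts as index 0; "#7" becomes 7.
    return sorted({int(href.lstrip("#") or 0) for href in parser.hrefs})


def page_urls(url_template, html):
    """Build one URL per page, from index 0 up to the largest index found."""
    last = max(page_indexes(html))
    return [url_template.format(page=i) for i in range(last + 1)]


urls = page_urls("http://example.com/videos?page={page}", SAMPLE)
print(len(urls), urls[0], urls[-1])
```

Each URL in the resulting list would then be requested and scraped in turn. Whether the site actually serves pages through a query parameter, a path segment, or JavaScript driven by the "#n" fragment has to be checked against the real site.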

Answered 2023-12-29

达令说

1821 contribution points, 6+ likes

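I just want to thank everyone who took the time to answer my question. I found an answer, or at least one that worked for me, and decided to share it in case it helps someone else.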

    import requests
    from bs4 import BeautifulSoup

    url = 'insert url'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # look for the pagination class
    page = soup.find(class_='pagination')

    # list that will hold the text of every pagination link
    href = []

    # look for all 'li' tags, as the users above suggested
    links = page.find_all('li')
    for link in links:
        href.append(link.find('a', href=True).text)

    # href will now include all pages and the word Next.
    # So for instance it will look something like this: [1,2,3...,44,Next].
    # I want to get 44, which will be href[-2], and then convert that to an
    # int for a for loop. In the for loop add + 1 because range stops at
    # i-1 instead of i. For instance, if you're iterating (0,44), the last
    # value of i will be 43, which is why we + 1.
    for i in range(0, int(href[-2]) + 1):
        new_url = url + str(i)
        # request and scrape new_url here
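Relying on href[-2] works only while "Next" is always the last entry. As a hedged alternative (not the poster's method), a small helper can pick the largest numeric label and ignore entries such as "..." and "Next". The name last_page_number and the fallback of 1 when no numbers are found are assumptions made for this sketch.

```python
def last_page_number(labels):
    """Return the largest numeric label, ignoring entries like '...' or 'Next'."""
    stripped = (label.strip() for label in labels)
    numbers = [int(text) for text in stripped if text.isdecimal()]
    if not numbers:
        return 1  # assumption: no numbered links means a single page
    return max(numbers)


labels = ["1", "2", "3", "...", "45", "Next"]
print(last_page_number(labels))  # prints 45
```

In the pagination HTML quoted in the question, the link labelled 45 points to "#44", so the indexes would presumably run over range(last_page_number(labels)), that is 0 through 44. That mapping is inferred from the sample markup and should be verified against the real site.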

Answered 2023-12-29

墨色风雨

1853 contribution points, 6+ likes

The "skeleton" of the loop can look like this:

    import requests
    from bs4 import BeautifulSoup

    url = 'http://url/?page={page}'
    page = 1

    while True:
        soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

        # ...

        # do we have next page?
        next_page = soup.select_one('.next-page')

        # no, so break from the loop
        if not next_page:
            break

        page += 1

You can use an infinite while True: loop and break out of it only when there is no next page, that is, when the last page has no tag with class="next-page".
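To try the stopping condition without hitting a real site, the loop can be exercised against in-memory pages. The following is a minimal sketch under stated assumptions: fetch is a hypothetical callable that returns a page's HTML as a string, FAKE_SITE stands in for real HTTP responses, detection uses the standard library's html.parser rather than BeautifulSoup, and the max_pages cap is an addition that is not part of the skeleton in this answer.

```python
from html.parser import HTMLParser


class NextPageFinder(HTMLParser):
    """Set .found to True if any tag carries the class "next-page"."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "next-page" in classes:
            self.found = True


def has_next_page(html):
    finder = NextPageFinder()
    finder.feed(html)
    return finder.found


def crawl(fetch, max_pages=1000):
    """Call fetch(page) for page = 1, 2, ... until a page has no next link.

    max_pages is a safety cap so an unexpected page layout cannot loop forever.
    """
    pages = []
    page = 1
    while page <= max_pages:
        html = fetch(page)
        pages.append(html)
        if not has_next_page(html):
            break
        page += 1
    return pages


# Stand-in for real HTTP requests: three pages, the last one without "Next".
FAKE_SITE = {
    1: '<a href="#1" class="no-page next-page">Next</a>',
    2: '<a href="#2" class="no-page next-page">Next</a>',
    3: '<p>last page</p>',
}

print(len(crawl(FAKE_SITE.__getitem__)))  # prints 3
```

With the requests library, fetch could be something like lambda page: requests.get(url.format(page=page)).text, assuming the site really exposes pages through a {page} placeholder in the URL.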

Answered 2023-12-29
