3 Answers
Contributor with 1828 experience points, over 3 upvotes received
Get the <li>...</li> elements inside <div class="pagination "><ul>.
Exclude the last one by its class, <li class="no-page">.
Parse each "href" and build your next URL destination.
Scrape each new URL destination.
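The steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the sample HTML, the base URL http://example.com/list, and the helper name `pagination_urls` are all assumptions; only the `pagination` container and the `no-page` class come from the answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described in the answer.
SAMPLE_HTML = """
<div class="pagination">
  <ul>
    <li><a href="?page=1">1</a></li>
    <li><a href="?page=2">2</a></li>
    <li><a href="?page=3">3</a></li>
    <li class="no-page"><a href="?page=2">Next</a></li>
  </ul>
</div>
"""


def pagination_urls(html, base_url):
    """Return absolute URLs for every pagination link except 'no-page' items."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    # Step 1: take the <li> elements inside the pagination container.
    for li in soup.select("div.pagination ul li"):
        # Step 2: skip the item marked with class "no-page".
        if "no-page" in li.get("class", []):
            continue
        # Step 3: parse the href and build the next URL destination.
        anchor = li.find("a", href=True)
        if anchor is not None:
            urls.append(urljoin(base_url, anchor["href"]))
    return urls


# Assumed base URL, for illustration only.
urls = pagination_urls(SAMPLE_HTML, "http://example.com/list")
for page_url in urls:
    print(page_url)
```

For the last step, each URL in `urls` could then be fetched with requests.get and parsed the same way.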
Contributor with 1821 experience points, over 6 upvotes received
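I just want to thank everyone who took the time to answer my question. I found an answer, or at least one that worked for me, and decided to share it in case it helps someone else.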
import requests
from bs4 import BeautifulSoup

url = 'insert url'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# look for the pagination class
page = soup.find(class_='pagination')
# create a list to hold the text of every page link
href = []
# look for all 'li' tags, as the users above suggested
links = page.find_all('li')
for link in links:
    anchor = link.find('a', href=True)
    # skip any 'li' that has no link inside it
    if anchor is not None:
        href.append(anchor.text)
'''
href will now include all page numbers and the word Next.
So for instance it will look something like this: [1, 2, 3, ..., 44, Next].
I want to get 44, which will be href[-2], and then convert that to an int for
a for loop. In the for loop, add 1 because range() stops one short of its end
value. For instance, if you're iterating range(0, 44), the last value of i
will be 43, which is why we add 1.
'''
# the page list above starts at 1, so the loop starts there as well
for i in range(1, int(href[-2]) + 1):
    new_url = url + str(i)
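To make the off-by-one reasoning concrete, here is a small sketch that needs no network access. It assumes pages are numbered from 1; the list `labels` stands in for the scraped link texts, and `page_urls` and the `?page=` URL pattern are hypothetical names chosen for illustration.

```python
def page_urls(base_url, labels):
    """Build one URL per page from pagination labels such as ['1', '2', 'Next']."""
    # The last label is the word "Next", so the final page number sits at [-2].
    last_page = int(labels[-2])
    # range() stops one short of its second argument, hence the + 1.
    return [base_url + str(i) for i in range(1, last_page + 1)]


# Stand-in for the link texts collected from the pagination block.
labels = ["1", "2", "3", "4", "Next"]
urls = page_urls("http://example.com/?page=", labels)
print(urls[0])
print(urls[-1])
```

Without the + 1, the list would end at page 3 and the last page would never be visited.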
Contributor with 1853 experience points, over 6 upvotes received
The "skeleton" of the loop could look like this:
import requests
from bs4 import BeautifulSoup

url = 'http://url/?page={page}'
page = 1

while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')

    # ...

    # do we have next page?
    next_page = soup.select_one('.next-page')

    # no, so break from the loop
    if not next_page:
        break

    page += 1
You can use an infinite loop (while True:) and break out of it only when there is no next page, that is, when the last page has no tag with class="next-page".
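One way to try this loop without hitting a real site is to pass the page fetcher in as a parameter. The sketch below is only an illustration under stated assumptions: `iter_pages`, `fake_fetch`, and the in-memory `PAGES` dictionary are hypothetical; in real use the fetcher would wrap requests.get(url.format(page=page)).content.

```python
from bs4 import BeautifulSoup

# Hypothetical three-page site held in memory; only the last page lacks
# an element with class "next-page".
PAGES = {
    1: '<p>page 1</p><a class="next-page" href="?page=2">Next</a>',
    2: '<p>page 2</p><a class="next-page" href="?page=3">Next</a>',
    3: '<p>page 3</p>',
}


def fake_fetch(page):
    """Stand-in for requests.get(url.format(page=page)).content."""
    return PAGES[page]


def iter_pages(fetch, start=1):
    """Yield (page_number, soup) until a page has no '.next-page' element."""
    page = start
    while True:
        soup = BeautifulSoup(fetch(page), "html.parser")
        yield page, soup
        # No next-page element means this was the last page.
        if soup.select_one(".next-page") is None:
            break
        page += 1


visited = [(number, soup.p.get_text()) for number, soup in iter_pages(fake_fetch)]
print(visited)
```

Writing the loop as a generator keeps the stop condition in one place, so the caller only deals with the parsed pages.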