2 Answers
Contributor with 1883 experience points, 3+ upvotes
This script prints all of the actor's roles:
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/name/nm4043618/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

seen = set()
# In each filmography row, the role name is the text right after the <br>
for row in soup.select('#filmo-head-actor + div .filmo-row > br'):
    role = row.find_next(text=True).strip()
    # Print each distinct role once, in page order
    if role not in seen:
        seen.add(role)
        print(role)
Prints:
Peter Parker / Spider-Man
Nathan Drake
Todd Hewitt
Nico Walker
Arvin Russell
Ian Lightfoot (voice)
Jip (voice)
Walter (voice)
Samuel Insull
Brother Diarmuid - The Novice
Jack Fawcett
Bradley Baker
Thomas Nickerson
Tom
Gregory Cromwell
Former Billy (Encore) (uncredited)
Isaac
Eddie (voice)
Boy
Lucas
Shô (UK version, voice)
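To see what the selector is doing without making a network request, here is a minimal sketch that runs the same loop against a small inline fragment. The fragment, the film titles, and the role names are hypothetical placeholders shaped like the markup quoted in this thread; the live page may be structured differently. It uses `string=True`, the newer name for the `text=True` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down fragment mimicking the filmography markup.
html = """
<div id="filmo-head-actor">Actor</div>
<div class="filmo-category-section">
  <div class="filmo-row odd">
    <b><a href="/title/tt0000001/">Film A</a></b>
    <br/>
    Role A
  </div>
  <div class="filmo-row even">
    <b><a href="/title/tt0000002/">Film B</a></b>
    <br/>
    Role A
  </div>
  <div class="filmo-row odd">
    <b><a href="/title/tt0000003/">Film C</a></b>
    <br/>
    Role C (voice)
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

roles = []
seen = set()
# "#filmo-head-actor + div" is the div immediately following the header;
# ".filmo-row > br" picks the <br> that is a direct child of each row.
for br in soup.select("#filmo-head-actor + div .filmo-row > br"):
    # The first text node after the <br> holds the role name.
    role = br.find_next(string=True).strip()
    if role not in seen:  # skip roles that were already collected
        seen.add(role)
        roles.append(role)

print(roles)  # ['Role A', 'Role C (voice)']
```

The repeated "Role A" is collected only once, which is the job of the `seen` set in the script above.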
Edit: To get the roles into a DataFrame, you can do this:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.imdb.com/name/nm4043618/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

seen = set()
all_data = []
# Collect each distinct role once, in page order
for row in soup.select("#filmo-head-actor + div .filmo-row > br"):
    role = row.find_next(text=True).strip()
    if role not in seen:
        seen.add(role)
        all_data.append(role)

# One column named "Role", one row per distinct role
df = pd.DataFrame(all_data, columns=["Role"])
print(df)
Prints:
Role
0 Peter Parker / Spider-Man
1 Nathan Drake
2 Todd Hewitt
3 Nico Walker
4 Arvin Russell
5 Ian Lightfoot (voice)
6 Jip (voice)
7 Walter (voice)
8 Samuel Insull
9 Brother Diarmuid - The Novice
10 Jack Fawcett
11 Bradley Baker
12 Thomas Nickerson
13 Tom
14 Gregory Cromwell
15 Former Billy (Encore) (uncredited)
16 Isaac
17 Eddie (voice)
18 Boy
19 Lucas
20 Shô (UK version, voice)
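As an alternative to tracking a `seen` set by hand, pandas can drop the repeats once the list is built. This is only a sketch: the `roles` list below is placeholder data standing in for the scraped values.

```python
import pandas as pd

# Placeholder data standing in for the scraped roles (assumption).
roles = ["Role A", "Role B", "Role A", "Role C"]

# drop_duplicates keeps the first occurrence of each value, so the
# original order is preserved; reset_index renumbers the rows from 0.
df = (
    pd.DataFrame(roles, columns=["Role"])
    .drop_duplicates()
    .reset_index(drop=True)
)
print(df)
```

The result has three rows (Role A, Role B, Role C) in their original order.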
Contributor with 1876 experience points, 6+ upvotes
Try:
from bs4 import BeautifulSoup
html = '''<html>
<div class="filmo-row odd" id="actor-tt10872600">
<span class="year_column">
2021
</span>
<b><a href="/title/tt10872600/">Untitled Spider-Man Sequel</a></b>
(<a class="in_production" href="https://pro.imdb.com/title/tt10872600?rf=cons_nm_filmo">announced</a>)
<br/>
Peter Parker / Spider-Man
</div>, <div class="filmo-row even" id="actor-tt1464335">
<span class="year_column">
2021
</span>
<b><a href="/title/tt1464335/">Uncharted</a></b>
(<a class="in_production" href="https://pro.imdb.com/title/tt1464335?rf=cons_nm_filmo">filming</a>)
<br/>
Nathan Drake
</div>, <div class="filmo-row odd" id="actor-tt2076822">
<span class="year_column">
2021
</span>
<b><a href="/title/tt2076822/">Chaos Walking</a></b>
(<a class="in_production" href="https://pro.imdb.com/title/tt2076822?rf=cons_nm_filmo">post-production</a>)
<br/>
Todd Hewitt
</div>, <div class="filmo-row even" id="actor-tt9130508">
<span class="year_column">
2020/I
</span>
<b><a href="/title/tt9130508/">Cherry</a></b>
(<a class="in_production" href="https://pro.imdb.com/title/tt9130508?rf=cons_nm_filmo">post-production</a>)
<br/>
Nico Walker
</div>, <div class="filmo-row odd" id="actor-tt7395114">
<span class="year_column">
2020
</span>
<b><a href="/title/tt7395114/">The Devil All the Time</a></b>
(<a class="in_production" href="https://pro.imdb.com/title/tt7395114?rf=cons_nm_filmo">completed</a>)
<br/>
Arvin Russell
</div>, <div class="filmo-row even" id="actor-tt7146812">
<span class="year_column">
2020/I
</span>
<b><a href="/title/tt7146812/">Onward</a></b>
<br/>
Ian Lightfoot (voice)
</div>, <div class="filmo-row odd" id="actor-tt6673612">
<span class="year_column">
2020
</span>
<b><a href="/title/tt6673612/">Dolittle</a></b>
<br/>
Jip (voice)
</div>
'''
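soup = BeautifulSoup(html, 'html.parser')
# Note: this selector matches only rows with class "odd"
divs = soup.select('div.filmo-row.odd')
for div in divs:
    # Direct text children of the row only (no text from nested tags)
    text = div.find_all(text=True, recursive=False)
    # Drop whitespace and stray parentheses; what remains is the role
    print(*[t.strip() for t in text if len(t) > 3])
Prints: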
Peter Parker / Spider-Man
Todd Hewitt
Arvin Russell
Jip (voice)
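The output above lists only four roles because the selector `div.filmo-row.odd` skips the rows with class `even`. A minimal sketch that covers every row, assuming the same row shape as the HTML above (the fragment, titles, and role names here are hypothetical placeholders):

```python
from bs4 import BeautifulSoup

# Hypothetical two-row fragment in the same shape as the HTML above.
html = """
<div class="filmo-row odd" id="actor-tt0000001">
<span class="year_column">2021</span>
<b><a href="/title/tt0000001/">Film A</a></b>
(<a class="in_production" href="#">filming</a>)
<br/>
Role A
</div>
<div class="filmo-row even" id="actor-tt0000002">
<span class="year_column">2020</span>
<b><a href="/title/tt0000002/">Film B</a></b>
<br/>
Role B (voice)
</div>
"""

soup = BeautifulSoup(html, "html.parser")

roles = []
for div in soup.select("div.filmo-row"):  # matches both odd and even rows
    # Direct text children only, with whitespace-only nodes dropped.
    texts = [
        t.strip()
        for t in div.find_all(string=True, recursive=False)
        if t.strip()
    ]
    # The role is the last direct text node in the row (after the <br>).
    if texts:
        roles.append(texts[-1])

print(roles)  # ['Role A', 'Role B (voice)']
```

Taking the last non-empty text node avoids relying on a length threshold to filter out the stray parentheses around the production status link.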