2 Answers
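Contributor with 1828 experience points and more than 6 upvotes
You don't need Selenium. What you can do (and you identified this correctly) is pull out the HTML comments and then parse the tables from inside them.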
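import requests
from io import StringIO
from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd

url = 'https://www.pro-football-reference.com/teams/crd/2017_roster.htm'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The tables sit inside HTML comments, so collect every comment node
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in each:
        try:
            # StringIO: recent pandas versions deprecate passing literal HTML strings
            tables.append(pd.read_html(StringIO(each))[0])
        except ValueError as e:
            # Raised when the comment holds no parseable table
            print(e)
            continue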
Output:
print(tables[0].head().to_string())
    No.           Player   Age  Pos   G   GS     Wt    Ht  College/Univ  BirthDate   Yrs   AV                            Drafted (tm/rnd/yr)      Salary
0  54.0  Bryson Albright  23.0  NaN   7  0.0  245.0   6-5    Miami (OH)  3/15/1994     1  0.0                                            NaN    $246,177
1  36.0    Budda Baker*+  21.0   ss  16  7.0  195.0  5-10    Washington  1/10/1996  Rook  9.0     Arizona Cardinals / 2nd / 36th pick / 2017    $465,000
2  64.0    Khalif Barnes  35.0  NaN   3  0.0  320.0   6-6    Washington  4/21/1982    12  0.0  Jacksonville Jaguars / 2nd / 52nd pick / 2005    $176,471
3  41.0   Antoine Bethea  33.0   db  15  6.0  206.0  5-11        Howard  7/27/1984    11  4.0   Indianapolis Colts / 6th / 207th pick / 2006  $2,000,000
4  28.0    Justin Bethel  27.0  rcb  16  6.0  200.0   6-0  Presbyterian  6/17/1990     5  3.0    Arizona Cardinals / 6th / 177th pick / 2012  $2,000,000
....
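If only one table is needed, it may be more convenient to look it up by its id rather than by position in the list. The following is a minimal sketch that runs offline against a made-up HTML snippet. The helper names find_commented_table and table_rows, the SAMPLE_HTML markup, and the table id "roster" are all assumptions for illustration, not something confirmed for the real page.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(html, table_id):
    """Return the <table> with the given id, or None.

    Looks in the live markup first, then inside HTML comments.
    (Hypothetical helper, not part of any library.)
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if "<table" not in comment:
            continue
        # Re-parse the comment body as HTML so its tags become real elements
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


def table_rows(table):
    """Flatten a table element into a list of rows of cell text."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


# Made-up markup that mimics a table hidden inside a comment (assumption)
SAMPLE_HTML = """
<div id="all_roster">
<!--
<table id="roster">
<tr><th>No.</th><th>Player</th></tr>
<tr><td>36</td><td>Budda Baker</td></tr>
</table>
-->
</div>
"""

rows = table_rows(find_commented_table(SAMPLE_HTML, "roster"))
print(rows)  # [['No.', 'Player'], ['36', 'Budda Baker']]
```

For the real page you would pass response.text instead of SAMPLE_HTML and check the page source for the actual id of the table you want.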
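Contributor with 1744 experience points and more than 4 upvotes
The tag you are trying to scrape is generated dynamically by JavaScript. You are most likely using requests to fetch the HTML. Unfortunately, requests does not run JavaScript; it only returns the raw HTML text that the server sends. BeautifulSoup cannot find the tag because it was never created as an element in the document your scraper received.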
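A quick way to see this for yourself, without a browser, is to compare what a parser finds in the raw text with what it would find after a script has run. The sketch below uses only the standard library. The class TableProbe, the function probe, and the two sample strings are hypothetical, and it assumes the table markup is shipped inside an HTML comment that a script later uncomments.

```python
from html.parser import HTMLParser


class TableProbe(HTMLParser):
    """Counts live <table> elements and collects comments holding table markup."""

    def __init__(self):
        super().__init__()
        self.live_tables = 0
        self.commented_tables = []

    def handle_starttag(self, tag, attrs):
        # Only called for real elements, never for text inside a comment
        if tag == "table":
            self.live_tables += 1

    def handle_comment(self, data):
        if "<table" in data:
            self.commented_tables.append(data)


def probe(html):
    """Return (number of live tables, number of comments containing a table)."""
    parser = TableProbe()
    parser.feed(html)
    parser.close()
    return parser.live_tables, len(parser.commented_tables)


# What the server sends (assumed): the table is wrapped in a comment
RAW = '<div><!-- <table id="roster"><tr><td>36</td></tr></table> --></div>'
# What the browser shows after a script has uncommented it (assumed)
RENDERED = '<div><table id="roster"><tr><td>36</td></tr></table></div>'

print(probe(RAW))       # (0, 1)
print(probe(RENDERED))  # (1, 0)
```

In the raw text the parser sees no table element at all, only a comment, which is why a direct search for the tag comes back empty.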