Problem cleaning up a CSV table when web scraping

肥皂起泡泡 2023-04-11 15:43:56
I am trying to scrape some data from a table. I get the results I expect, but I cannot find a way to save them to a clean CSV file. The code is below, followed by the output I get and the output I want. Any suggestions?

from bs4 import BeautifulSoup
import urllib.request  # web access
import csv
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"
page = urllib.request.urlopen(url)  # connect to website
try:
    page = urllib.request.urlopen(url)
except:
    print("Ups!")

soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})
for li in content_lis:
    con = li.get_text("#",strip=True).split("\n")[0]
    print(con)

I get this nice output:

Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil
Senoculus barroanus#Chickering, 1941#|#| Panama
Senoculus bucolicus#Chickering, 1941#|#| Panama

But I need something like this (CSV separated by semicolons or tabs):

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama

How can I remove the "|" characters and the extra spaces? Any suggestions?

2 Answers

幕布斯6054654

Contributed 1876 experience points, received more than 7 likes

Try this:

from bs4 import BeautifulSoup
import urllib.request  # web access
import re

url = "https://wsc.nmbe.ch/family/87/Senoculidae"

try:
    page = urllib.request.urlopen(url)  # connect to the website
except OSError:
    # Stop here: without a page there is nothing to parse
    raise SystemExit("Ups! Could not open the page.")

soup = BeautifulSoup(page, 'html.parser')

regex = re.compile('^speciesTitle')
content_lis = soup.find_all('div', attrs={'class': regex})

file = ''
for cl in content_lis:
    a = cl.select_one('div a strong i')        # first field (species name in the output)
    b = cl.find(string=True, recursive=False)  # first text node directly inside the div
    c = cl.select_one('span')                  # span holding the last field
    cc = re.findall(r"\w+", c.text)[0]         # first word only; multi-word values are truncated
    file += f'{a.get_text(strip=True)};{b.strip()};{cc}\n'

# utf-8 so that names such as "Mello-Leitão" are written correctly
with open('file.csv', 'w', encoding='utf-8') as f:
    f.write(file)

This saves a file containing:

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama
Senoculus cambridgei;Mello-Leitão, 1927;Brazil
Senoculus canaliculatus;F. O. Pickard-Cambridge, 1902;Mexico
Senoculus carminatus;Mello-Leitão, 1927;Brazil
Senoculus darwini;(Holmberg, 1883);Argentina
Senoculus fimbriatus;Mello-Leitão, 1927;Brazil
Senoculus gracilis;(Keyserling, 1879);Guyana
Senoculus guianensis;Caporiacco, 1947;j
Senoculus iricolor;(Simon, 1880);Brazil
Senoculus maronicus;Taczanowski, 1872;French
and so on...

Note that the code keeps only the first word of the last field, so multi-word values are cut short (as in "French"), and the "j" on the Senoculus guianensis line is simply the first word found in that span.
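The answer above builds the file contents by concatenating strings. As an alternative, here is a minimal sketch using Python's standard csv module, which inserts the delimiter and quotes any field that happens to contain it. The helper name rows_to_csv_text is hypothetical (it is not part of the original answer), and the sample rows are copied from the output above.

```python
import csv
import io


def rows_to_csv_text(rows, delimiter=";"):
    """Serialize an iterable of field sequences to delimited text."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output identical on every platform
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Sample rows taken from the output shown in the answer
rows = [
    ("Senoculus albidus", "(F. O. Pickard-Cambridge, 1897)", "Brazil"),
    ("Senoculus barroanus", "Chickering, 1941", "Panama"),
]

print(rows_to_csv_text(rows), end="")                  # semicolon-separated
print(rows_to_csv_text(rows, delimiter="\t"), end="")  # tab-separated
```

To write straight to disk instead, the same csv.writer can wrap a file opened with open('file.csv', 'w', newline='', encoding='utf-8'); newline='' is what the csv documentation recommends so that line endings are not translated twice.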


Answered 2023-04-11
慕哥6287543

Contributed 1831 experience points, received more than 10 likes

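This code is based on your sample data:

lst = [
    'Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil',
    'Senoculus barroanus#Chickering, 1941#|#| Panama',
    'Senoculus bucolicus#Chickering, 1941#|#| Panama',
]

# Drop the '|' characters, then split each line into fields on '#'
lst2 = [s.replace('|', "").split('#') for s in lst]

# Strip each field, join with ';', and collapse the ';;' left by the empty field
lst3 = []
for s in lst2:
    lst3.append(';'.join([sx.strip() for sx in s]).replace(';;', ';'))

for s in lst3:
    print(s)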

Output:

Senoculus albidus;(F. O. Pickard-Cambridge, 1897);Brazil
Senoculus barroanus;Chickering, 1941;Panama
Senoculus bucolicus;Chickering, 1941;Panama
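A variation on the same idea, offered only as a sketch: instead of joining everything and then collapsing ';;', filter out the empty tokens before joining. A single replace(';;', ';') pass leaves ';;' behind when three or more separators occur in a row, whereas filtering works for any number of empty tokens. The function name clean_record is hypothetical, and the input format (fields joined by '#', with stray '|' tokens) is assumed from the question's sample output.

```python
def clean_record(raw, delimiter=";"):
    """Turn 'name#author#|#| country' into 'name;author;country'."""
    fields = []
    for token in raw.split("#"):
        token = token.strip("| \t")  # drop pipes and surrounding whitespace
        if token:                    # skip tokens that were only '|' or blank
            fields.append(token)
    return delimiter.join(fields)


# Sample lines copied from the question
samples = [
    "Senoculus albidus#(F. O. Pickard-Cambridge, 1897)#|#| Brazil",
    "Senoculus barroanus#Chickering, 1941#|#| Panama",
    "Senoculus bucolicus#Chickering, 1941#|#| Panama",
]

for line in samples:
    print(clean_record(line))
```

Passing delimiter="\t" gives the tab-separated form the question also mentions.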

--- Update based on the asker's comment ---

Add one line to your final loop:

for li in content_lis:
    con = li.get_text("#",strip=True).split("\n")[0]
    con = ';'.join(sx.strip() for sx in con.replace('|',"").split('#')).replace(';;',';') # add this line
    print(con)


Answered 2023-04-11