I am trying to scrape multiple tables from 30 similar links using Python

DIEA 2023-06-20 10:16:43
I have 10 company links:

https://www.zaubacorp.com/company/ASHRAFI-MEDIA-NETWORK-PRIVATE-LIMITED/U22120GJ2019PTC111757
https://www.zaubacorp.com/company/METTLE-PUBLICATIONS-PRIVATE-LIMITED/U22120MH2019PTC329729
https://www.zaubacorp.com/company/PRINTSCAPE-INDIA-PRIVATE-LIMITED/U22120MH2020PTC335354
https://www.zaubacorp.com/company/CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665
https://www.zaubacorp.com/company/BHOOKA-NANGA-FILMS-PRIVATE-LIMITED/U22130DL2019PTC353194
https://www.zaubacorp.com/company/WHITE-CAMERA-SCHOOL-OF-PHOTOGRAPHY-PRIVATE-LIMITED/U22130JH2019PTC013311
https://www.zaubacorp.com/company/RLE-PRODUCTIONS-PRIVATE-LIMITED/U22130KL2019PTC059208
https://www.zaubacorp.com/company/CATALIZADOR-MEDIA-PRIVATE-LIMITED/U22130KL2019PTC059793
https://www.zaubacorp.com/company/TRIPPLED-MEDIAWORKS-OPC-PRIVATE-LIMITED/U22130MH2019OPC333171
https://www.zaubacorp.com/company/KRYSTAL-CINEMAZ-PRIVATE-LIMITED/U22130MH2019PTC330391

I am trying to scrape the tables from these links and save the data in well-formatted CSV columns. The tables I want are "Company Details", "Share Capital & Number of Employees", "Listing and Annual Compliance Details", "Contact Details", and "Director Details". If a table has no data or is missing a column, I want that column to be blank in the output CSV file. I wrote some code but cannot get any output. What am I doing wrong here? Please help.

import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import csv
import lxml

url_file = "Zaubalinks.txt"
with open(url_file, "r") as url:
    url_pages = url.read()

# we need to split each urls into lists to make it iterable
pages = url_pages.split("\n")  # Split by lines using \n

# now we run a for loop to visit the urls one by one
data = []
for single_page in pages:
    r = requests.get(single_page)
    soup = BeautifulSoup(r.content, 'html5lib')
    table = soup.find_all('table')  # finds all tables
    table_top = pd.read_html(str(table))[0]  # the top table

2 Answers

UYOU

1878 reputation points, 4+ upvotes

from io import StringIO

import pandas as pd
import requests

# A list (rather than a set) keeps the companies in a fixed order in the output.
companies = [
    'ASHRAFI-MEDIA-NETWORK-PRIVATE-LIMITED/U22120GJ2019PTC111757',
    'METTLE-PUBLICATIONS-PRIVATE-LIMITED/U22120MH2019PTC329729',
    'PRINTSCAPE-INDIA-PRIVATE-LIMITED/U22120MH2020PTC335354',
    'CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665',
    'BHOOKA-NANGA-FILMS-PRIVATE-LIMITED/U22130DL2019PTC353194',
    'WHITE-CAMERA-SCHOOL-OF-PHOTOGRAPHY-PRIVATE-LIMITED/U22130JH2019PTC013311',
    'RLE-PRODUCTIONS-PRIVATE-LIMITED/U22130KL2019PTC059208',
    'CATALIZADOR-MEDIA-PRIVATE-LIMITED/U22130KL2019PTC059793',
    'TRIPPLED-MEDIAWORKS-OPC-PRIVATE-LIMITED/U22130MH2019OPC333171',
    'KRYSTAL-CINEMAZ-PRIVATE-LIMITED/U22130MH2019PTC330391',
]


def main(url):
    # One session reuses the same connection for every request.
    with requests.Session() as req:
        goal = []
        for company in companies:
            r = req.get(url.format(company))
            # Wrap the HTML in StringIO; recent pandas versions deprecate
            # passing literal HTML to read_html.
            df = pd.read_html(StringIO(r.text))
            # Transpose tables 0, 3 and 4 and place them side by side.
            target = pd.concat([df[x].T for x in [0, 3, 4]], axis=1)
            goal.append(target)
        new = pd.concat(goal)
        new.to_csv("data.csv")


main("https://www.zaubacorp.com/company/{}")
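The question also asks for blank cells when a table or column is missing. One possible way to get that, sketched here under assumptions, is to turn each table into a dict of label/value pairs and let pandas align the columns. The helper names `table_to_record` and `build_rows` are hypothetical, and the sketch assumes each wanted table is parsed as two columns (label, value) with no header row. Tables with any other shape, such as a multi-column directors table, are skipped here and would need their own handling.

```python
import pandas as pd


def table_to_record(df):
    """Turn a two-column label/value table into a {label: value} dict.

    Assumption: the first column holds labels and the second holds values.
    Tables with any other shape contribute no columns.
    """
    if df is None or df.shape[1] != 2:
        return {}
    keys = df.iloc[:, 0].astype(str).str.strip()
    return dict(zip(keys, df.iloc[:, 1]))


def build_rows(pages):
    """Build one row per company from {company: [DataFrame, ...]}."""
    rows = []
    for company, tables in pages.items():
        row = {"company": company}
        for df in tables:
            row.update(table_to_record(df))
        rows.append(row)
    # The DataFrame constructor takes the union of all keys;
    # keys missing for a company become NaN in that row.
    return pd.DataFrame(rows)
```

Calling `build_rows(pages).to_csv("data.csv", index=False)` then writes the NaN cells as empty fields, since `to_csv` uses an empty string for missing values by default.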


Answered 2023-06-20
largeQ

2039 reputation points, 7+ upvotes

Fortunately, it looks like you can get there with a simpler approach. Taking one of the links at random, it would look like this:

import pandas as pd

url = 'https://www.zaubacorp.com/company/CHARVAKA-TELEVISION-NETWORK-PRIVATE-LIMITED/U22121KA2019PTC126665'

tables = pd.read_html(url)

From there, your tables are in tables[0], tables[3], tables[4], tables[15], and so on. Just use a for loop to go through all the URLs.
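The for loop could be sketched roughly as follows. This is only an illustration: the function name `collect_tables` and its `fetch` parameter are hypothetical, and the indices are the ones mentioned in this answer, which may differ on pages with another layout. Pages that fail to load or contain no tables yield an empty result instead of stopping the run.

```python
import pandas as pd

# Positions of the wanted tables, taken from the answer text.
# Assumption: every company page uses the same layout.
WANTED = [0, 3, 4, 15]


def collect_tables(urls, fetch=pd.read_html, wanted=WANTED):
    """Return {url: {index: DataFrame}} for every URL in `urls`.

    `fetch` is any callable mapping a URL to a list of DataFrames.
    It defaults to pd.read_html and can be swapped out for offline testing.
    """
    result = {}
    for url in urls:
        try:
            tables = fetch(url)
        except (ValueError, OSError):
            # ValueError: read_html found no tables on the page.
            # OSError: network or HTTP failure.
            tables = []
        # Keep only the indices that actually exist on this page.
        result[url] = {i: tables[i] for i in wanted if i < len(tables)}
    return result
```

For example, `collect_tables(urls)` with the ten links from the question returns one dict per URL, and a missing index simply does not appear as a key.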


Answered 2023-06-20