
How do I load a data text file with many comment lines into pandas?

Python

皈依舞 2023-09-26 15:09:56

I am trying to read a delimited text file into a dataframe in Python. When I use pd.read_table, the delimiter is not recognized. If I explicitly set sep = ' ', I get the error: Error tokenizing data. C error. Notably, the defaults work when I use np.loadtxt().

Example:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comment = '%', header = None)

   0
0  1850 1 -0.777 0.412 NaN NaN...
1  1850 2 -0.239 0.458 NaN NaN...
2  1850 3 -0.426 0.447 NaN NaN...
3  1850 4 -0.680 0.367 NaN NaN...
4  1850 5 -0.687 0.298 NaN NaN...

If I set sep = ' ', I get a different error:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comment = '%', header = None, sep = ' ')

ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58

Looking up this error, people suggest using header = None (already done) and setting sep explicitly, but setting sep is exactly what triggers the problem here: Python Pandas Error tokenizing data. I looked at line 78 and could not see anything wrong with it. If I set error_bad_lines=False, I get an empty df, which suggests that every entry is considered bad.

Notably, this works when I use np.loadtxt():

pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', comments = '%'))

        0    1       2      3    4    5    6    7    8    9   10   11
0  1850.0  1.0  -0.777  0.412  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
1  1850.0  2.0  -0.239  0.458  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
2  1850.0  3.0  -0.426  0.447  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3  1850.0  4.0  -0.680  0.367  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
4  1850.0  5.0  -0.687  0.298  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

This suggests to me that there is nothing wrong with the file, and that the problem lies in how I am calling pd.read_table(). I looked through the np.loadtxt() documentation hoping to set sep to the same value, but it only shows delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).

I would like to import this directly as a pd.DataFrame and set the column names, rather than importing it as a matrix and then converting it to a pd.DataFrame.

What am I doing wrong?


2 Answers

慕娘9325324


This one is fairly tricky. Try the snippet below:

import pandas as pd

url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'

df = pd.read_csv(url,
                 sep=r'\s+',     # any run of whitespace is one delimiter
                 comment='%',    # skip the '%' comment lines
                 # all 12 columns; index 6 was missing, which left 11 columns for 12 names
                 usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',
                        'A.Unc.', '5y.Anomaly', '5y.Unc.', '10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))
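To see why the whitespace regex matters, here is a small offline sketch. The SAMPLE text is made up to mimic the file's layout ('%' comment lines, columns separated by runs of spaces) and is not the real data file. pd.read_table defaults to a tab separator, so each row ends up in a single column. A single-space separator would treat every space as its own delimiter, which is probably what produced the "Expected 2 fields ... saw 58" error in the question.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the file's layout (not the real file).
SAMPLE = """% Global Average Temperature Anomaly (sample)
%
% Year, Month, Anomaly, Unc.
  1850     1    -0.777     0.412
  1850     2    -0.239     0.458
  1850     3    -0.426     0.447
"""

# Default separator is a tab, so every data row is read as one column.
tab_df = pd.read_table(io.StringIO(SAMPLE), comment='%', header=None)

# A regex separator treats any run of whitespace as a single delimiter.
ws_df = pd.read_csv(io.StringIO(SAMPLE), sep=r'\s+', comment='%',
                    header=None, names=['Year', 'Month', 'Anomaly', 'Unc.'])

print(tab_df.shape)  # one column
print(ws_df.shape)   # four columns
```

The same two keyword arguments, sep=r'\s+' and comment='%', are the ones doing the work in the snippet above.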

Answered 2023-09-26

料青山看我应如是


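The problem is that the file starts with 77 lines of comment text, e.g. 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'.

Two of those lines are the column headers.

After that comes a block of data, followed by two more header lines and a second set of data: 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'.

This solution splits the two tables in the file into separate dataframes.
It is not as neat as the other answer, but the data is correctly separated into different dataframes.
The headers are a pain. It is probably easier to write a custom header by hand and skip the code that extracts the headers from the text.
The important point is that the air-temperature and water-temperature tables are kept separate.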

import requests
import pandas as pd

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list of lines, dropping the leading '% ' comment marker
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# specify the data from the ranges in the file
# (these line numbers are specific to this file's layout)
air_header1 = data[74].split()  # period labels: Monthly, Annual, ...
air_header2 = [v.strip() for v in data[75].split(',')]  # Year, Month, Anomaly, Unc., ...

# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[i // 2]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]

h2o_header1 = data[2129].split()  # period labels for the second table
h2o_header2 = [v.strip() for v in data[2130].split(',')]

# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[i // 2]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
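The hard-coded line numbers above will break if the file ever grows. As a rough alternative sketch (not part of the original answer), the tables can be located by grouping consecutive non-comment lines. The SAMPLE text and the helper names is_data and split_tables are made up for illustration. The sketch assumes no blank or comment lines appear inside a table.

```python
import io
from itertools import groupby

import pandas as pd

# Hypothetical sample with two tables separated by '%' comment blocks.
SAMPLE = """% Table one (sample)
% Year, Month, Anomaly, Unc.
  1850     1    -0.777     0.412
  1850     2    -0.239     0.458
%
% Table two (sample)
% Year, Month, Anomaly, Unc.
  1850     1    -0.724     0.370
  1850     2    -0.221     0.430
"""


def is_data(line):
    """A data line is non-blank and does not start with the comment marker."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith('%')


def split_tables(text, names):
    """Return one DataFrame per run of consecutive data lines."""
    tables = []
    for keep, block in groupby(text.splitlines(), key=is_data):
        if keep:
            tables.append(pd.read_csv(io.StringIO('\n'.join(block)),
                                      sep=r'\s+', header=None, names=names))
    return tables


air, h2o = split_tables(SAMPLE, ['Year', 'Month', 'Anomaly', 'Unc.'])
```

With the real file, the text would come from response.text and names would be the full 12-column header list. Since read_csv does the parsing here, the columns come out numeric rather than as strings.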

Without the header-parsing code

Using a manually written header list simplifies the code.

import pandas as pd
import requests

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
           'Annual_Anomaly', 'Annual_Unc.',
           'Five-year_Anomaly', 'Five-year_Unc.',
           'Ten-year_Anomaly', 'Ten-year_Unc.',
           'Twenty-year_Anomaly', 'Twenty-year_Unc.']

# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)

air

   Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0  1850 1 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1  1850 2 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2  1850 3 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN

h2o

   Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0  1850 1 -0.724 0.370 NaN NaN NaN NaN NaN NaN NaN NaN
1  1850 2 -0.221 0.430 NaN NaN NaN NaN NaN NaN NaN NaN
2  1850 3 -0.443 0.419 NaN NaN NaN NaN NaN NaN NaN NaN

Answered 2023-09-26
