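Try this. I updated it to include logic that works out the number of rows automatically. Essentially, I extract the maximum value of the original DataFrame's index (the row number), which is contained in the big string.
Suppose we start with a DataFrame that has been converted to a string, using the example you provided: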
import pandas as pd

df = pd.DataFrame(columns=["really long name that goes on for a while", "another really long string", "c"]*6,
                  data=[["some really long data",2,3]*6,[4,5,6]*6,[7,8,9]*6])
string = str(df)
First, let's extract the column names:
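import re
import numpy as np

lst = re.split('\n', string)
# The last data row before the first blank line carries the largest index label
num_rows = int(lst[lst.index('') - 1].split()[0]) + 1
col_names = []
lst = [i for i in lst if i != '']
# Each block is one header line followed by num_rows data lines
for i in range(0, len(lst), num_rows + 1):
    col_names.append(lst[i])
new_col_names = []
for i in col_names:
    # Split on runs of two or more spaces so multi-word names stay intact
    new_col_names.append(re.split(r'\s{2,}', i))
final_col_names = []
for i in new_col_names:
    final_col_names += i
# Drop empty pieces and the line-continuation backslash
final_col_names = [i for i in final_col_names if i != '']
final_col_names = [i for i in final_col_names if i != '\\']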
Next, let's get the data:
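# Remove the header lines so that only data rows remain
for i in col_names:
    lst.remove(i)
new_lst = [re.split(r'\s{2,}', i) for i in lst]
# Drop the index label (first piece) and the trailing empty piece
new_lst = [i[1:-1] for i in new_lst]
newer_lst = []
for i in range(num_rows):
    sub_lst = []
    # Row i of each block sits num_rows lines after row i of the previous block
    for j in range(i, len(new_lst), num_rows):
        sub_lst += new_lst[j]
    newer_lst.append(sub_lst)
reshaped = np.reshape(newer_lst, (num_rows, len(final_col_names)))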
Finally, we can build the reconstructed DataFrame from the data and the column names:
fixed_df = pd.DataFrame(data=reshaped, columns = final_col_names)
My code runs several loops, so this approach may take a while if your original DataFrame has hundreds of thousands of rows.