Reading a delimited file with wrapped lines

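Apologies if there is already an obvious answer to this. I have a very large file that poses some challenges for parsing. I receive these files from outside my organization, so I cannot change their format.

First, the file is space-delimited, but the fields that make up the "columns" of data can span multiple lines. For example, a row that should hold 25 columns of data might be written in the file as:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

As you can see, I can't rely on each set of data being on a single line, but I can rely on each set having the same number of columns.

To make matters worse, the file follows a "definition: data" type format, where the first 3 or so lines describe the data (including a field that tells me how many rows there are) and the next N lines are the data. It then goes back to the 3-line format to describe the next set of data. This means I can't just set up a reader for the N-column format and let it run to EOF.

I'm afraid Python's built-in file reading will get very ugly here, but I can't find anything in csv or numpy that works. Any suggestions?

Edit: as an example of a different solution, we have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns, so we do the following:

data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);

This reads the data no matter how it is wrapped, while keeping the file handle open so the next section can be processed later. It is done this way because the file is so large that reading it all at once could use too much RAM.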
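To illustrate the wrapping problem described above, here is a minimal Python sketch. The sample values are made up for illustration and are not taken from the real files. It only shows that splitting on arbitrary whitespace recovers the same items however the lines are wrapped, which is roughly the behavior the textscan call gets by treating spaces, '\r' and '\n' as one delimiter.

```python
# Made-up sample values: the same six items, wrapped in two different ways.
one_line = "1 2 3 4 5 6\n"
wrapped = "1 2 3\n4\n5 6\n"

# str.split() with no argument splits on any run of whitespace,
# so spaces, "\r" and "\n" all act as a single delimiter.
items_one_line = one_line.split()
items_wrapped = wrapped.split()

# Both layouts produce the same flat list of items.
same = items_one_line == items_wrapped
```

Because the line breaks carry no information, the only thing a reader needs to know is how many items belong to each section.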

1 Answer

慕尼黑5688855

Here is a sketch of how you could proceed (edit: with some modifications):

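# Store the parsed data for each section here
datasections = []

with open("testfile.txt", "r") as file:
    while True:
        # Read the three header lines that describe the next section
        l1 = file.readline()
        if l1 == "":  # end of file (or use another end condition)
            break
        l2 = file.readline()
        l3 = file.readline()

        # ASSUMPTION: the header layout is not given in the question. As a
        # placeholder, the third header line is taken to hold "nrows ncols";
        # adapt this line to extract the values from the real l1, l2, l3.
        nrows, ncols = (int(x) for x in l3.split()[:2])

        # Keep reading lines until all nrows * ncols items are collected
        current_row = []
        while len(current_row) < nrows * ncols:
            line = file.readline()
            if line == "":
                raise ValueError("file ended in the middle of a data section")
            # Isolate the items using str.split() and append them
            current_row.extend(line.split())

        # Break current_row into rows after every ncols-th item
        rows = [current_row[i:i + ncols] for i in range(0, nrows * ncols, ncols)]

        # Store this section's rows in datasections
        datasections.append(rows)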
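A possible variation on the sketch above is to treat the file as a stream of whitespace-separated items and take exactly nrows * ncols of them per section, leaving the file handle open for the next header. The following is only a sketch under stated assumptions: the helper names tokens and read_section are hypothetical, the sample data is made up, and the header is assumed to be three lines with "nrows ncols" on the third, which the question does not confirm. It also assumes each section's last item ends a line, so the next header starts on a fresh line.

```python
import io
from itertools import islice


def tokens(handle):
    """Yield whitespace-separated items one at a time, ignoring line breaks."""
    for line in handle:
        yield from line.split()


def read_section(stream, nrows, ncols):
    """Take nrows * ncols items from the token stream and shape them into rows."""
    flat = [float(item) for item in islice(stream, nrows * ncols)]
    if len(flat) != nrows * ncols:
        raise ValueError("input ended in the middle of a data section")
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]


# Made-up sample standing in for the real file: two sections, each with a
# hypothetical 3-line header whose third line is "nrows ncols".
sample = io.StringIO(
    "section A\nsome description\n2 3\n"
    "1 2\n3 4 5\n6\n"
    "section B\nsome description\n1 2\n"
    "7 8\n"
)

stream = tokens(sample)
sections = []
while True:
    header = [sample.readline() for _ in range(3)]
    if header[0] == "":  # end of input
        break
    nrows, ncols = map(int, header[2].split())  # assumed header layout
    sections.append(read_section(stream, nrows, ncols))
```

Only one line of text is held in memory at a time besides the section being built, which should help with the RAM concern for very large files. With a real file, replace the io.StringIO object with the handle returned by open().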

2021-06-15
