Preventing a python3 process from being "Killed" for running out of memory

I am trying to concatenate two large numeric matrices. The first, features, is an np.array of shape 1238, 72. The other is loaded from a .json file, as shown in the second line below, and has shape 1238, 768. I need to load, concatenate, reindex, split into folds, and save each fold in its own folder. The problem is that I get Killed at the very first step (reading the .json content into bert).

with open(bert_dir+"/output4layers.json", "r+") as f:
    bert = [json.loads(l)['features'][0]['layers'][0]['values'] for l in f.readlines()]
bert_post_data = np.concatenate((features,bert), axis=1)
del bert
bert_post_data = [bert_post_data[i] for i in index_shuf]
bert_folds = np.array_split(bert_post_data, num_folds)
for i in range(num_folds):
    print("saving bert fold ",str(i), bert_folds[i].shape)
    fold_dir = data_dir+"/folds/"+str(i)
    save_p(fold_dir+"/bert", bert_folds[i])

Is there a way to do this efficiently? I mean, there must be a better way... pandas, a json lib? Thanks for your time and attention.


2 Answers

幕布斯7119047

Contributed 1794 experience points, received over 8 likes

Try:

bert = [json.loads(line)['features'][0]['layers'][0]['values'] for line in f]

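Iterating over f directly, instead of calling f.readlines(), means you at least avoid reading the whole file into memory at once. That said, if the file is very large, you will still have to process further what you store in bert.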
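A minimal sketch of taking this one step further, assuming each line of the file holds one record with the nesting shown in the question, and that the row and column counts are known up front (1238 and 768 there). The function name load_bert_rows and the float32 dtype are illustrative assumptions, not part of the original code. Each row is written straight into a preallocated NumPy array, so the list of Python float lists is never built:

```python
import json

import numpy as np


def load_bert_rows(path, n_rows, n_cols, dtype=np.float32):
    """Stream a JSON-lines file into a preallocated array, one row per line.

    Hypothetical helper: assumes every line has the structure
    {"features": [{"layers": [{"values": [...]}]}]} used in the question.
    """
    out = np.empty((n_rows, n_cols), dtype=dtype)
    rows_read = 0
    with open(path, "r") as f:
        for i, line in enumerate(f):
            # Only one decoded record is alive at a time; the rest of the
            # record (other tokens and layers) is dropped right away.
            out[i] = json.loads(line)["features"][0]["layers"][0]["values"]
            rows_read = i + 1
    if rows_read != n_rows:
        raise ValueError(f"expected {n_rows} lines, read {rows_read}")
    return out
```

With bert returned as an array, np.concatenate((features, bert), axis=1)[index_shuf] would perform the concatenation and the reordering without going through an intermediate Python list.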

2022-10-25

尚方宝剑之说

Contributed 1788 experience points, received over 4 likes

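I found this solution while searching for a similar problem. It is not the most upvoted answer to that particular question, but in my opinion it is better than any of the others.

The idea is simple: instead of keeping a list of strings (one per line of the document), keep a list of file positions, one pointing at the start of each line. When you want a line's contents, you just seek to the remembered position. A class such as LineSeekableFile comes in handy for this.

The only catch is that you need to keep the file object (not the whole file!) open for the entire process.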

class LineSeekableFile:
    def __init__(self, seekable):
        self.fin = seekable
        self.line_map = list()  # Map from line index -> file position.
        self.line_map.append(0)
        while seekable.readline():
            self.line_map.append(seekable.tell())

    def __getitem__(self, index):
        # NOTE: This assumes that you're not reading the file sequentially.
        # For that, just use 'for line in file'.
        self.fin.seek(self.line_map[index])
        return self.fin.readline()

Then access it:

b_file = bert_dir+"/output4layers.json"
fin = open(b_file, "rt")
BertSeekFile = LineSeekableFile(fin)
b_line = BertSeekFile[idx]  # uses the __getitem__ method
fin.close()
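The same offset idea can also drive the shuffle-and-split step from the question, so that only one fold's rows are in memory at a time. The following is a sketch under assumptions, not the original poster's code: the helper names build_line_offsets, iter_rows and fold_slices are hypothetical, the file is assumed to hold one JSON record per line, and it is opened in binary mode so that offsets are plain byte positions.

```python
import json


def build_line_offsets(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, "rb") as f:
        pos = f.tell()
        line = f.readline()
        while line:
            offsets.append(pos)
            pos = f.tell()
            line = f.readline()
    return offsets


def iter_rows(path, offsets, line_indices):
    """Yield the 'values' list of each requested line, in the given order."""
    with open(path, "rb") as f:
        for idx in line_indices:
            f.seek(offsets[idx])
            record = json.loads(f.readline())
            # Nesting assumed from the question's list comprehension.
            yield record["features"][0]["layers"][0]["values"]


def fold_slices(n_items, num_folds):
    """Yield (start, stop) pairs; the first n_items % num_folds folds get one extra item."""
    base, extra = divmod(n_items, num_folds)
    start = 0
    for k in range(num_folds):
        size = base + (1 if k < extra else 0)
        yield start, start + size
        start += size
```

A possible use, with index_shuf and num_folds as in the question: compute offsets once, then for each (start, stop) from fold_slices(len(index_shuf), num_folds) collect list(iter_rows(path, offsets, index_shuf[start:stop])), join it with the matching rows of features, and save that fold before moving on to the next.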

2022-10-25
