How to filter overlapping lines in a large file in Python

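I am trying to filter overlapping lines in a large file in Python. The overlap threshold is set to 25%. In other words, for any two lines that remain, the number of elements in their intersection must be less than 0.25 times the number of elements in their union. If the ratio exceeds 0.25, one of the two lines is deleted. So, if I have a large file with 1,000,000 lines in total, whose first 5 lines are as follows:

c6 c24 c32 c54 c67
c6 c24 c32 c51 c68 c78
c6 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75
c6 c32 c53 c67

The first and second lines have 3 elements in common (c6, c24, c32), and their union has 8 elements (c6, c24, c32, c54, c67, c51, c68, c78). The overlap is 3/8 = 0.375 > 0.25, so the second line is deleted; the same goes for the third and fifth lines. The final answer is the first and fourth lines:

c6 c24 c32 c54 c67
c6 c32 c55 c63 c85 c94 c75

The pseudocode is as follows:

csv_file = default_storage.open(self.filepath, 'r')
new_object = CSVImport(csvfile=csv_file.read(), model=Contact, modelspy=".", mappings="1=first_name,2=mobile,3=email,4=last_name,5=43,6=16")
new_object.run()

How can I solve this in Python? Thanks!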
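The overlap measure described above (intersection size divided by union size) is commonly known as the Jaccard index. As a minimal sketch, the helper below reproduces the arithmetic for the sample lines; the function name `overlap` and the variable names are made up for illustration, and returning 0.0 for two empty sets is an assumption.

```python
def overlap(a, b):
    """Return len(a & b) / len(a | b) for two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


line1 = set("c6 c24 c32 c54 c67".split())
line2 = set("c6 c24 c32 c51 c68 c78".split())
line4 = set("c6 c32 c55 c63 c85 c94 c75".split())

print(overlap(line1, line2))  # 0.375, above the 0.25 threshold
print(overlap(line1, line4))  # 0.2, below the threshold
```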


1 Answer

BIG阳


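The tricky part is that you have to modify the list you are iterating over while still keeping track of two indices. One way to do this is to iterate backwards, because deleting an item whose index is equal to or greater than the indices you are tracking does not affect them.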
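To see why iterating backwards is safe, here is a small illustration on a plain list (the data and the name `items` are made up for the example): deleting at the current index only shifts items that have already been visited.

```python
items = [1, 2, 2, 3, 2, 4]

# Walk from the last index down to 0 and delete every 2.
for index in range(len(items) - 1, -1, -1):
    if items[index] == 2:
        # Only elements to the right of `index` shift left,
        # and those have already been visited.
        del items[index]

print(items)  # [1, 3, 4]
```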

The code below is untested, but it should give you the idea:

# Read every non-empty line as a set of its whitespace-separated elements.
with open("file.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj if line.strip()]

# Iterate backwards so that deleting sets[second_index] never shifts
# an index that is still to be visited.
for first_index in range(len(sets) - 2, -1, -1):
    for second_index in range(len(sets) - 1, first_index, -1):
        union = sets[first_index] | sets[second_index]
        intersection = sets[first_index] & sets[second_index]
        if len(intersection) / float(len(union)) > 0.25:
            del sets[second_index]

with open("output.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        output = " ".join(sorted(set_, key=lambda x: int(x[1:])))
        fileobj.write("{0}\n".format(output))

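Since it is obvious how the elements of each line are ordered, we can regenerate each line by sorting. If the order were custom in some way, we would have to couple each line as read with its set, so that we could write back exactly the line that was read instead of regenerating it.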
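As a rough sketch of that coupling, each original line can be stored next to its set so the kept lines are returned verbatim. The function name `filter_overlapping`, its `threshold` parameter, and the `sample` data are hypothetical; the scan is the same backward deletion used in the code above, operating on a list of strings instead of files.

```python
def filter_overlapping(lines, threshold=0.25):
    """For each pair whose overlap exceeds threshold, delete the later line."""
    # Keep the original text next to its set so nothing has to be regenerated.
    pairs = [(line, set(line.split())) for line in lines if line.strip()]
    for first in range(len(pairs) - 2, -1, -1):
        for second in range(len(pairs) - 1, first, -1):
            a, b = pairs[first][1], pairs[second][1]
            if len(a & b) / len(a | b) > threshold:
                del pairs[second]
    return [line for line, _ in pairs]


sample = [
    "c6 c24 c32 c54 c67",
    "c6 c24 c32 c51 c68 c78",
    "c6 c32 c54 c67",
    "c6 c32 c55 c63 c85 c94 c75",
    "c6 c32 c53 c67",
]
print(filter_overlapping(sample))
# ['c6 c24 c32 c54 c67', 'c6 c32 c55 c63 c85 c94 c75']
```

Note that the comparison is quadratic in the number of lines, so for a file with a million lines it may be slow when most lines are kept.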

Replied 2021-04-06
