How to filter the overlap of row pairs in a big file in Python

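I am trying to filter overlapping rows out of a big file in Python. Rows are handled in pairs, and the overlap threshold between one pair of rows and another pair is 25%. In other words, the condition is a*b/(c+d-a*b) > 0.25, where a is the number of elements shared by the 1st and 3rd rows, b is the number of elements shared by the 2nd and 4th rows, c is the number of elements in the 1st row times the number in the 2nd row, and d is the number of elements in the 3rd row times the number in the 4th row. If the overlap degree is greater than 0.25, the 3rd and 4th rows are deleted. So if I have a big file with 1,000,000 rows in total, the first 6 rows look like this:

    c6 c24 c32 c54 c67
    k6 k12 k33 k63 k62
    c6 c24 c32 c51 c68 c78
    k6 k12 k24 k63
    c6 c32 c24 c63 c67 c67 c75 c75
    k6 k12 k33 k63

For the first and second pairs, a = 3 (c6, c24, c32), b = 3 (k6, k12, k63), c = 25 and d = 24, so a*b/(c+d-a*b) = 9/40 < 0.25 and the 3rd and 4th rows are not deleted. Next, the overlap degree of the first and third pairs is 5*4/(25+28-5*4) = 0.61 > 0.25, so the third pair is deleted. The final answer is the first and second pairs:

    c6 c24 c32 c54 c67
    k6 k12 k33 k63 k62
    c6 c24 c32 c51 c68 c78
    k6 k12 k24 k63

The pseudocode is as follows:

    for i=1:(n-1)   # n is a half of the number of rows of the big file
        for j=(i+1):n
            if overlap degrees of the ith two rows and jth two rows is more than 0.25
                delete the jth two rows from the big file
            end
        end
    end

My Python code is below, but it is wrong. How can I fix it?

    with open("iuputfile.txt") as fileobj:
        sets = [set(line.split()) for line in fileobj]
        for first_index in range(len(sets) - 4, -2, -2):
            c=len(sets[first_index])*len(sets[first_index+1])
            for second_index in range(len(sets)-2 , first_index, -2):
                d=len(sets[second_index])*len(sets[second_index+1])
                ab = len(sets[first_index] | sets[second_index])*len(sets[first_index+1] | sets[second_index+1])
                if (ab/(c+d-ab))>0.25:
                    del sets[second_index]
                    del sets[second_index+1]
    with open("outputfile.txt", "w") as fileobj:
        for set_ in sets:
            # order of the set is undefined, so we need to sort each set
            output = " ".join(set_)
            fileobj.write("{0}\n".format(output))

A similar question can be found at https://stackoverflow.com/questions/17321275/. How can I modify this code to solve the problem in Python? Thanks!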
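
A few things in the posted code look like the cause. `|` is set union, whereas a and b are defined as sizes of intersections (`&`). After `del sets[second_index]` the remaining rows shift down by one, so `del sets[second_index+1]` removes the wrong row and can raise an IndexError at the end of the list. The loops also run from the end of the file, while the pseudocode compares each pair against earlier ones. The sketch below is one possible fix, not the only one. It assumes Python 3, whitespace-separated tokens, and this reading of the pseudocode: a pair is kept only if its overlap degree with every previously kept pair is at most 0.25. The names `overlap_degree`, `filter_pairs`, and `filter_file` are made up for illustration.

```python
def overlap_degree(pair_a, pair_b):
    """Overlap degree a*b / (c + d - a*b) for two pairs of token sets."""
    a = len(pair_a[0] & pair_b[0])       # tokens shared by the two first rows
    b = len(pair_a[1] & pair_b[1])       # tokens shared by the two second rows
    c = len(pair_a[0]) * len(pair_a[1])  # size product of the first pair
    d = len(pair_b[0]) * len(pair_b[1])  # size product of the second pair
    denominator = c + d - a * b
    # Guard against empty rows, where the denominator can be zero.
    return a * b / denominator if denominator else 0.0


def filter_pairs(lines, threshold=0.25):
    """Return the rows of every pair that survives the greedy filter."""
    kept_rows = []  # original text of surviving rows, in input order
    kept_sets = []  # matching (set, set) tuples, one per surviving pair
    rows = (line.rstrip("\n") for line in lines)
    # zip(rows, rows) pulls two consecutive rows at a time from the same
    # iterator; an unpaired trailing row is silently ignored.
    for first, second in zip(rows, rows):
        candidate = (set(first.split()), set(second.split()))
        if all(overlap_degree(k, candidate) <= threshold for k in kept_sets):
            kept_sets.append(candidate)
            kept_rows.extend([first, second])
    return kept_rows


def filter_file(in_path, out_path, threshold=0.25):
    """Stream in_path through filter_pairs and write the result to out_path."""
    with open(in_path) as src:
        kept = filter_pairs(src, threshold)
    with open(out_path, "w") as dst:
        dst.writelines(row + "\n" for row in kept)
```

Calling `filter_file("iuputfile.txt", "outputfile.txt")` would use the file names from the question. Rows are written back exactly as read, so token order and duplicates are preserved. Only the kept pairs are held in memory, but each new pair is still compared against all kept pairs, so the worst case remains quadratic in the number of pairs.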
