2 回答
TA贡献1802条经验 获得超10个赞
如果我明白你在做什么;在每一列中找到最常见的字符(?),你可以这样做:
def most_common(col, exclude_char='N'):
col = list(filter((exclude_char).__ne__, col))
return max(set(col), key=col.count)
sequences = []
with open('DNAinput.txt', 'r') as file:
for line in file:
if line[0] == '>':
continue
else:
sequences.append(line.strip())
m = max([len(v) for v in sequences])
matrix = [list(v) for v in sequences]
for seq in matrix:
seq.extend(list('N' * (m - len(seq))))
transposed_matrix = [[matrix[j][i] for j in range(len(matrix))] for i in range(m)]
for column in transposed_matrix:
print(most_common(column))
这通过以下方式工作:
打开您的文件并将其读入list如下所示:
# This is the `sequences` list
['GATCA', 'AATC', 'AATA', 'ACTA']
获取最长DNA序列的长度:
# m = max([len(v) for v in sequences])
5
从这些序列创建一个矩阵(列表列表):
# matrix = [list(v) for v in sequences]
[['G', 'A', 'T', 'C', 'A'],
['A', 'A', 'T', 'C'],
['A', 'A', 'T', 'A'],
['A', 'C', 'T', 'A']]
填充矩阵,使所有序列的长度相同:
# for seq in matrix:
# seq.extend(list('N' * (m - len(seq))))
[['G', 'A', 'T', 'C', 'A'],
['A', 'A', 'T', 'C', 'N'],
['A', 'A', 'T', 'A', 'N'],
['A', 'C', 'T', 'A', 'N']]
转置矩阵,使列移动top -> bottom(不是left -> right)。这会将来自相同位置的所有字符一起放入一个列表中。
# [[matrix[j][i] for j in range(len(matrix))] for i in range(m)]
[['G', 'A', 'A', 'A'],
['A', 'A', 'A', 'C'],
['T', 'T', 'T', 'T'],
['C', 'C', 'A', 'A'],
['A', 'N', 'N', 'N']]
最后,迭代转置矩阵中的每个列表,并most_common以子列表作为输入调用:
# for column in transposed_matrix:
# print(most_common(column))
A
A
T
C
A
这种方法有一些注意事项;首先,如果most_common单个位置中有相同数量的核苷酸,我所包含的函数将返回第一个值(参见位置四,这可能是A或C)。此外,该most_common功能可能需要成倍的时间要比使用Counter从集合。
由于这些原因,我强烈建议使用以下脚本代替collections安装时包含在 python 中的脚本。
from collections import Counter
sequences = []
with open('DNAinput.txt', 'r') as file:
for line in file:
if line[0] == '>':
continue
else:
sequences.append(line.strip())
m = max([len(v) for v in sequences])
matrix = [list(v) for v in sequences]
for seq in matrix:
seq.extend(list('N' * (m - len(seq))))
transposed_matrix = [[matrix[j][i] for j in range(len(matrix))] for i in range(m)]
for column in transposed_matrix:
print(Counter(column).most_common(5))
添加回答
举报