2 Answers

Contributed 1851 experience points · received more than 4 upvotes
You wrote:
    if (parent_go_id in parent_list):
        go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)
The correct version is:
    if (parent_go_id in parent_list):
        go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)[go_id]
After this change, it produces:
    {
        'GO:0003674': {
            'GO:0003824': {},
            'GO:0005198': {},
            'GO:0005488': {
                'GO:0005515': {},
                'GO:0005549': {
                    'GO:0005550': {}
                }
            }
        }
    }
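To see why the extra [go_id] matters, here is a minimal sketch. It assumes that generate_go_tree returns a dict keyed by the ID of the term it was given; the original function is not shown in this answer, so fake_generate_go_tree below is a hypothetical stand-in, not the asker's code.

```python
# Hypothetical stand-in: the real generate_go_tree is not shown in this answer.
# Assumption: it returns {term_id: subtree} for the term it is given.
def fake_generate_go_tree(go_id):
    return {go_id: {}}

parent_id, go_id = 'GO:0005549', 'GO:0005550'

# Without the index, the child's ID appears twice in the nesting
wrong = {parent_id: {}}
wrong[parent_id][go_id] = fake_generate_go_tree(go_id)

# With [go_id], only the subtree below the child is stored
right = {parent_id: {}}
right[parent_id][go_id] = fake_generate_go_tree(go_id)[go_id]

print(wrong)
print(right)
```

If that assumption holds, the first assignment wraps the child in a second layer keyed by its own ID, which is the extra nesting the fix removes.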
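But I would suggest a completely different approach: create a class that parses the terms and builds the dependency tree as it goes.
For convenience, I derived it from dict, so you can write term.id instead of term['id']: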
    class Term(dict):
        # Expose dict keys as attributes, so term.id works like term['id']
        __getattr__ = dict.__getitem__
        __setattr__ = dict.__setitem__
        __delattr__ = dict.__delitem__

        registry = {}  # maps every id and alt_id to its Term
        # 'is_obsolete' is listed so that is_valid() can actually see it
        single_valued = 'id name namespace alt_id def comment synonym is_a is_obsolete'.split()
        multi_valued = 'subset xref'.split()

        def __init__(self, text):
            self.children = []
            self.parent = None
            for line in text.splitlines():
                if ': ' not in line:
                    continue
                key, val = line.split(': ', 1)
                if key in Term.single_valued:
                    self[key] = val
                elif key in Term.multi_valued:
                    if key not in self:
                        self[key] = [val]
                    else:
                        self[key].append(val)
                else:
                    print('unclear property: %s' % line)
            if 'id' in self:
                Term.registry[self.id] = self
            if 'alt_id' in self:
                Term.registry[self.alt_id] = self
            if 'is_a' in self:
                # 'is_a: GO:0003674 ! molecular_function' -> 'GO:0003674'
                key = self.is_a.split(' ! ', 1)[0]
                if key in Term.registry:
                    Term.registry[key].children.append(self)
                    self.parent = Term.registry[key]

        def is_top(self):
            return self.parent is None

        def is_valid(self):
            # Use .get(): self.id would raise KeyError for a block without an id
            return self.get('is_obsolete') != 'true' and self.get('id') is not None
Now you can read the file in one go:
    with open('tiny_go.obo', 'rt') as f:
        contents = f.read()
    terms = [Term(text) for text in contents.split('\n\n')]
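As a rough illustration of what the blank-line split hands to each term, here is a small self-contained sketch. The two-term sample text and the parse_stanza helper are made up for this example (a real OBO file has a header block and more fields); only the split on '\n\n', the split on ': ', and the split on ' ! ' mirror the code above.

```python
# Hypothetical two-term sample in OBO style; real files carry more fields.
sample = """[Term]
id: GO:0003674
name: molecular_function

[Term]
id: GO:0005488
name: binding
is_a: GO:0003674 ! molecular_function
"""

def parse_stanza(text):
    # Keep only 'key: value' lines; the '[Term]' marker has no ': ' and is skipped
    pairs = (line.split(': ', 1) for line in text.splitlines() if ': ' in line)
    return dict(pairs)

stanzas = [parse_stanza(block) for block in sample.split('\n\n')]
# The parent ID is the part of 'is_a' before the ' ! ' comment
parent_id = stanzas[1]['is_a'].split(' ! ', 1)[0]
print(stanzas[0]['name'], parent_id)
```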
Recursing over the tree then becomes easy. For example, a simple "print" function that outputs only the non-obsolete nodes:
    def print_tree(terms, indent=''):
        valid_terms = [term for term in terms if term.is_valid()]
        for term in valid_terms:
            print(indent + 'Term %s - %s' % (term.id, term.name))
            print_tree(term.children, indent + ' ')

    top_terms = [term for term in terms if term.is_top()]
    print_tree(top_terms)
This prints:
    Term GO:0003674 - molecular_function
     Term GO:0003824 - catalytic activity
     Term GO:0005198 - structural molecule activity
     Term GO:0005488 - binding
      Term GO:0005515 - protein binding
      Term GO:0005549 - odorant binding
       Term GO:0005550 - pheromone binding
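You can also do things like Term.registry['GO:0005549'].parent.name, which gives "binding".
I'll leave generating the nested dicts of GO IDs (as in your own example) as an exercise, but you may not even need them, since Term.registry is already very similar.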
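One possible way to do that exercise, as a sketch only: recurse over each term's children list and key the result by ID. The to_nested helper is a name invented here, and SimpleNamespace objects stand in for Term instances so the snippet runs on its own; anything with an id and a children attribute would work the same way.

```python
from types import SimpleNamespace

def to_nested(term):
    # Map each child's ID to the nested dict of its own children
    return {child.id: to_nested(child) for child in term.children}

# Hypothetical stand-ins for Term objects: anything with .id and .children works
pheromone = SimpleNamespace(id='GO:0005550', children=[])
odorant = SimpleNamespace(id='GO:0005549', children=[pheromone])
binding = SimpleNamespace(id='GO:0005488', children=[odorant])

nested = {binding.id: to_nested(binding)}
print(nested)
```

To skip obsolete terms, the comprehension could additionally filter on child.is_valid().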

Contributed 2051 experience points · received more than 10 upvotes
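You can use recursion for a shorter solution:
    import itertools, re, json

    # Read the non-empty lines, stripping the trailing newline
    with open('filename.txt') as f:
        content = list(filter(None, [i.strip('\n') for i in f]))
    # Split the lines into groups separated by '[Term]' markers
    entries = [[a, list(b)] for a, b in itertools.groupby(content, key=lambda x: x == '[Term]')]
    # Build one dict per term, reducing 'is_a' to the bare parent GO ID
    terms = [(lambda x: x if 'is_a' not in x else {**x, 'is_a': re.findall(r'^GO:\d+', x['is_a'])[0]})(dict(i.split(': ', 1) for i in b)) for a, b in entries if not a]
    # Put the root (the term without 'is_a') first
    terms = sorted(terms, key=lambda x: 'is_a' in x)

    def tree(d, _start):
        # Collect the direct children of _start, then recurse into each
        t = [i for i in d if i.get('is_a') == _start]
        return {} if not t else {i['id']: tree(d, i['id']) for i in t}

    print(json.dumps({terms[0]['id']: tree(terms, terms[0]['id'])}, indent=4))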
Output:
    {
        "GO:0003674": {
            "GO:0003824": {},
            "GO:0005198": {},
            "GO:0005488": {
                "GO:0005515": {},
                "GO:0005549": {
                    "GO:0005550": {}
                }
            }
        }
    }
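This also works when a parent entry is not defined before its children. For example, when the parent is moved three positions below its original place, the same result is still produced (see the file):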
print(json.dumps({terms[0]['id']:tree(terms, terms[0]['id'])}, indent=4))
Output:
    {
        "GO:0003674": {
            "GO:0003824": {},
            "GO:0005198": {},
            "GO:0005488": {
                "GO:0005515": {},
                "GO:0005549": {
                    "GO:0005550": {}
                }
            }
        }
    }
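The order independence can be checked without any file. The sketch below uses a hand-written list of terms (hypothetical data, already reduced to 'id' and a parent 'is_a') and compares the tree built from the list in its given order against the tree built from the reversed list.

```python
def tree(d, _start):
    # Same recursion as above: find direct children, then descend
    t = [i for i in d if i.get('is_a') == _start]
    return {i['id']: tree(d, i['id']) for i in t}

# Hypothetical in-memory terms, already reduced to 'id' and parent 'is_a'
terms = [
    {'id': 'GO:0003674'},
    {'id': 'GO:0005488', 'is_a': 'GO:0003674'},
    {'id': 'GO:0005549', 'is_a': 'GO:0005488'},
    {'id': 'GO:0005550', 'is_a': 'GO:0005549'},
]

in_order = tree(terms, 'GO:0003674')
reversed_order = tree(list(reversed(terms)), 'GO:0003674')
print(in_order == reversed_order)
```

This works because tree scans the whole list for every node instead of relying on parents having been seen earlier. The cost is one full pass per node, which is fine for small files but grows quadratically with the number of terms.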