
How can I generate a recursive tree-like dictionary from a flat file (Gene Ontology OBO file)?

慕无忌1623718 2021-12-08 10:32:33
I am trying to write code that parses a Gene Ontology (GO) OBO file and pushes the GO term IDs (e.g. GO:0003824) into a tree-like nested dictionary. The hierarchy in the OBO file is expressed with "is_a" tags, each of which names one parent of a GO term. A GO term may have multiple parents, while the topmost GO terms in the hierarchy have none. A small example of a GO OBO file is shown below:

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function
alt_id: GO:0005554
def: "A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities. Function in this sense denotes an action, or activity, that a gene product (or a complex) performs. These actions are described from two distinct but related perspectives: (1) biochemical activity, and (2) role as a component in a larger system/process." [GOC:pdt]
comment: Note that, in addition to forming the root of the molecular function ontology, this term is recommended for use for the annotation of gene products whose molecular function is unknown. When this term is used for annotation, it indicates that no information was available about the molecular function of the gene product annotated as of the date the annotation was made; the evidence code "no data" (ND), is used to indicate this. Despite its name, this is not a type of 'function' in the sense typically defined by upper ontologies such as Basic Formal Ontology (BFO). It is instead a BFO:process carried out by a single gene product or complex.
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_generic
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
synonym: "molecular function" EXACT []

2 answers

繁花不似锦

You wrote:

if (parent_go_id in parent_list):
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)

The correct version is:

if (parent_go_id in parent_list):
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)[go_id]

After this change, it produces:

{
    'GO:0003674': {
        'GO:0003824': {},
        'GO:0005198': {},
        'GO:0005488': {
            'GO:0005515': {},
            'GO:0005549': {
                'GO:0005550': {}
            }
        }
    }
}
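To see why the extra `[go_id]` matters, here is a minimal sketch. The full `generate_go_tree` from the question is not shown on this page, so `build` below is a hypothetical stand-in; the only assumption is that, like the original, it returns a dict keyed by the term's own ID.

```python
def build(go_id, children):
    """Hypothetical stand-in: returns {go_id: {child_id: subtree, ...}}."""
    return {go_id: {c: build(c, children)[c] for c in children.get(go_id, [])}}

# Illustrative parent -> children mapping (not parsed from a real file).
children = {"GO:0005488": ["GO:0005515"]}

# Without indexing, the child's ID shows up twice along the path.
wrong = {"GO:0005488": {"GO:0005515": build("GO:0005515", children)}}
# Indexing with the child's own ID unwraps the extra level.
right = {"GO:0005488": {"GO:0005515": build("GO:0005515", children)["GO:0005515"]}}

print(wrong)  # {'GO:0005488': {'GO:0005515': {'GO:0005515': {}}}}
print(right)  # {'GO:0005488': {'GO:0005515': {}}}
```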

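But I would suggest a completely different approach: create a class that parses the terms and builds the dependency tree as it does so.

For convenience, I derive it from dict, so you can write term.id instead of term['id']: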


class Term(dict):
    # Expose dict keys as attributes: term.id is term['id'].
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

    registry = {}  # maps every id and alt_id to its Term
    # 'is_obsolete' must be listed here, otherwise is_valid() never sees it.
    single_valued = 'id name namespace alt_id def comment synonym is_a is_obsolete'.split()
    multi_valued = 'subset xref'.split()

    def __init__(self, text):
        self.children = []
        self.parent = None

        for line in text.splitlines():
            if ': ' not in line:
                continue  # e.g. the "[Term]" header line
            key, val = line.split(': ', 1)
            if key in Term.single_valued:
                self[key] = val
            elif key in Term.multi_valued:
                self.setdefault(key, []).append(val)
            else:
                print('unclear property: %s' % line)

        if 'id' in self:
            Term.registry[self.id] = self
        if 'alt_id' in self:
            Term.registry[self.alt_id] = self

        if 'is_a' in self:
            # "GO:0005488 ! binding" -> "GO:0005488"
            key = self.is_a.split(' ! ', 1)[0]
            # Only links up if the parent was parsed earlier in the file.
            if key in Term.registry:
                Term.registry[key].children.append(self)
                self.parent = Term.registry[key]

    def is_top(self):
        return self.parent is None

    def is_valid(self):
        # 'id' in self avoids the KeyError that self.id raises on a missing key.
        return self.get('is_obsolete') != 'true' and 'id' in self
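Note that the class above keeps a single `is_a` value per term, while the question says a GO term may have several parents. A minimal sketch of collecting every parent ID from one stanza follows; `parse_parents` is a hypothetical helper, and the IDs and names in the sample stanza are made up for illustration.

```python
def parse_parents(stanza):
    """Return every parent ID named by an is_a line in one [Term] stanza."""
    parents = []
    for line in stanza.splitlines():
        if line.startswith("is_a: "):
            # "is_a: GO:0005488 ! binding" -> "GO:0005488"
            value = line[len("is_a: "):]
            parents.append(value.split(" ! ", 1)[0].strip())
    return parents

# Made-up stanza with two parents, for illustration only.
stanza = """[Term]
id: GO:0000001
name: example term
is_a: GO:0000002 ! first parent
is_a: GO:0000003 ! second parent"""

print(parse_parents(stanza))  # ['GO:0000002', 'GO:0000003']
```

With a list of parents per term, a term can then be appended to the `children` of each parent instead of just one.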

Now you can read the file in one go:

with open('tiny_go.obo', 'rt') as f:
    contents = f.read()

terms = [Term(text) for text in contents.split('\n\n')]
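Splitting on blank lines also yields the file header and any non-term stanzas such as `[Typedef]`, if the file contains them. A small sketch of keeping only the `[Term]` blocks before handing them to a parser is shown here; `term_stanzas` is a hypothetical helper and the sample text is a shortened, made-up file.

```python
def term_stanzas(text):
    """Yield only the blank-line-separated blocks that start with [Term]."""
    for block in text.split("\n\n"):
        block = block.strip()
        if block.startswith("[Term]"):
            yield block

# Shortened sample: a header line, one term, one typedef.
sample = """format-version: 1.2

[Term]
id: GO:0003674
name: molecular_function

[Typedef]
id: part_of
name: part of
"""

print(list(term_stanzas(sample)))
# ['[Term]\nid: GO:0003674\nname: molecular_function']
```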

And recursing over the tree becomes easy. For example, a simple "print" function that outputs only non-obsolete nodes:

def print_tree(terms, indent=''):
    valid_terms = [term for term in terms if term.is_valid()]
    for term in valid_terms:
        print(indent + 'Term %s - %s' % (term.id, term.name))
        print_tree(term.children, indent + '  ')

top_terms = [term for term in terms if term.is_top()]

print_tree(top_terms)

This prints:

Term GO:0003674 - molecular_function
  Term GO:0003824 - catalytic activity
  Term GO:0005198 - structural molecule activity
  Term GO:0005488 - binding
    Term GO:0005515 - protein binding
    Term GO:0005549 - odorant binding
      Term GO:0005550 - pheromone binding

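You can also do things like Term.registry['GO:0005549'].parent.name, which gives "binding".

I'll leave generating nested dicts of GO IDs (as in your own example) as an exercise, but you may not even need it, since Term.registry is already very similar to that.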
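One possible take on that exercise, as a sketch only: the function needs nothing more than an `id` and a `children` list on each object, so `SimpleNamespace` stand-ins are used here instead of real `Term` instances. `to_nested_dict` is a hypothetical name.

```python
from types import SimpleNamespace

def to_nested_dict(terms):
    """Map each term's id to the nested dict built from its children."""
    return {t.id: to_nested_dict(t.children) for t in terms}

# Stand-ins for Term objects; only .id and .children are assumed.
leaf = SimpleNamespace(id="GO:0005550", children=[])
mid = SimpleNamespace(id="GO:0005549", children=[leaf])
root = SimpleNamespace(id="GO:0005488", children=[mid])

print(to_nested_dict([root]))
# {'GO:0005488': {'GO:0005549': {'GO:0005550': {}}}}
```

Called with the list of top-level terms, this should give the same shape as the nested dictionary in the question. Unlike `print_tree`, this sketch does not filter out obsolete terms.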


侃侃无极

You can use recursion for a shorter solution:


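import itertools, re, json

# Read the non-empty lines of the OBO file.
with open('filename.txt') as f:
    content = list(filter(None, [i.strip('\n') for i in f]))
# Group the lines into runs: '[Term]' markers versus the tag lines between them.
entries = [[a, list(b)] for a, b in itertools.groupby(content, key=lambda x: x == '[Term]')]
# One dict per term; reduce 'is_a' to the bare parent ID.
terms = [(lambda x: x if 'is_a' not in x else {**x, 'is_a': re.findall(r'^GO:\d+', x['is_a'])[0]})(dict(i.split(': ', 1) for i in b)) for a, b in entries if not a]
# Sort so that the root (the term without 'is_a') comes first.
terms = sorted(terms, key=lambda x: 'is_a' in x)

def tree(d, _start):
  # Children are the terms whose parent ID equals _start.
  t = [i for i in d if i.get('is_a') == _start]
  return {} if not t else {i['id']: tree(d, i['id']) for i in t}

print(json.dumps({terms[0]['id']: tree(terms, terms[0]['id'])}, indent=4))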

Output:

{
    "GO:0003674": {
        "GO:0003824": {},
        "GO:0005198": {},
        "GO:0005488": {
            "GO:0005515": {},
            "GO:0005549": {
                "GO:0005550": {}
            }
        }
    }
}

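This also works when a parent term is not defined before its children. For example, with the parent moved three positions below its original place, the same result is still produced (see the file):

print(json.dumps({terms[0]['id']:tree(terms, terms[0]['id'])}, indent=4))

Output:

{
    "GO:0003674": {
        "GO:0003824": {},
        "GO:0005198": {},
        "GO:0005488": {
            "GO:0005515": {},
            "GO:0005549": {
                "GO:0005550": {}
            }
        }
    }
}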
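The `tree` function above rescans the whole term list for every node. For larger files, a variation that first indexes children by parent may be worth considering. The sketch below is not part of the answer: `build_tree` and `parents_by_id` are hypothetical names, and the input is assumed to map each term ID to a list of parent IDs, so a term with several parents appears under each of them. The sample relationships are taken from the output above.

```python
from collections import defaultdict

def build_tree(parents_by_id):
    """parents_by_id maps term ID -> list of parent IDs (empty for roots)."""
    # Index children by parent once, so input order does not matter.
    children = defaultdict(list)
    for term_id, parents in parents_by_id.items():
        for parent in parents:
            children[parent].append(term_id)

    def subtree(term_id):
        # Assumes the is_a relation has no cycles; a cycle would recurse forever.
        return {child: subtree(child) for child in children[term_id]}

    roots = [t for t, parents in parents_by_id.items() if not parents]
    return {root: subtree(root) for root in roots}

parents_by_id = {
    "GO:0005550": ["GO:0005549"],  # child listed before its parent
    "GO:0005549": ["GO:0005488"],
    "GO:0005515": ["GO:0005488"],
    "GO:0005488": ["GO:0003674"],
    "GO:0003674": [],
}

print(build_tree(parents_by_id))
```

The result has the same nesting as the output shown above for these five terms, even though the children are listed before their parents.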

