
Parsing nested and complex XML in Python

Python

尚方宝剑之说 2023-10-26 10:34:06

I am trying to parse a fairly complex XML file and store its contents in a dataframe. I tried xml.etree.ElementTree and managed to retrieve some elements, but somehow I retrieved them multiple times, as if there were more objects. I am trying to extract the following: category, created, last_updated, accession type, name type identifier, and name type synonym (as a list).

<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
  <comment-list>
    <comment category="Monoclonal antibody target"> Cronartium ribicola antigens </comment>
    <comment category="Monoclonal antibody isotype"> IgM, kappa </comment>
  </comment-list>
  <species-list>
    <cv-term terminology="NCBI-Taxonomy" accession="10090">Mus musculus</cv-term>
  </species-list>
  <derived-from>
    <cv-term terminology="Cellosaurus" accession="CVCL_4032">P3X63Ag8.653</cv-term>
  </derived-from>
  <reference-list>
    <reference resource-internal-ref="Patent=US5616470"/>
  </reference-list>
  <xref-list>
    <xref database="CLO" category="Ontologies" accession="CLO_0001018">
      <url><![CDATA[https://www.ebi.ac.uk/ols/ontologies/clo/terms?iri=http://purl.obolibrary.org/obo/CLO_0001018]]></url>
    </xref>
    <xref database="ATCC" category="Cell line collections" accession="HB-12029">
      <url><![CDATA[https://www.atcc.org/Products/All/HB-12029.aspx]]></url>
    </xref>
    <xref database="Wikidata" category="Other" accession="Q54422073">
      <url><![CDATA[https://www.wikidata.org/wiki/Q54422073]]></url>
    </xref>
  </xref-list>
</cell-line>
</cellosaurus>


3 Answers

蓝山帝景

Contributed 1843 experience points, received more than 7 likes

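Your question is a little unclear, since in some cases you want to parse tag attributes and in others the text inside a tag.

My understanding is as follows. You need these values:

The value of the category attribute of the cell-line tag.
The value of the created attribute of the cell-line tag.
The value of the last_updated attribute of the cell-line tag.
The value of the type attribute of the accession tag.
The text of the name tag whose type attribute is identifier.
The text of the name tags whose type attribute is synonym.

These values can be extracted from the XML file with the xml.etree.ElementTree module. In particular, note the use of the findall and iter methods of the Element class.

Assuming the XML is in a file named cellosaurus.xml, the following snippet should do the trick.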

import xml.etree.ElementTree as et

def main():
    tree = et.parse('cellosaurus.xml')
    root = tree.getroot()
    results = []
    for element in root.findall('.//cell-line'):
        key_values = {}
        for key in ['category', 'created', 'last_updated']:
            # get() returns None if an entry lacks the attribute.
            key_values[key] = element.attrib.get(key)
        synonyms = []
        for child in element.iter():
            if child.tag == 'accession':
                key_values['accession type'] = child.attrib.get('type')
            elif child.tag == 'name' and child.attrib.get('type') == 'identifier':
                key_values['name type identifier'] = child.text
            elif child.tag == 'name' and child.attrib.get('type') == 'synonym':
                # Collect every synonym instead of overwriting the previous one.
                synonyms.append(child.text)
        key_values['name type synonym'] = synonyms
        results.append([
            # Use dict.get in case a particular entry lacks
            # some of the required values.
            key_values.get('category'),
            key_values.get('created'),
            key_values.get('last_updated'),
            key_values.get('accession type'),
            key_values.get('name type identifier'),
            key_values.get('name type synonym'),
        ])
    print(results)

if __name__ == '__main__':
    main()
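As a variation on the same idea, here is a minimal sketch that relies on the limited XPath support in xml.etree.ElementTree to select the name elements by their type attribute, so all synonyms end up in one list. The function name extract_cell_lines and the SAMPLE constant are assumptions made for illustration; SAMPLE is a trimmed copy of the entry from the question, and the element paths assume the layout shown there.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample entry from the question (illustrative only).
SAMPLE = """<cellosaurus>
<cell-line category="Hybridoma" created="2012-06-06" last_updated="2020-03-12" entry_version="6">
  <accession-list>
    <accession type="primary">CVCL_B375</accession>
  </accession-list>
  <name-list>
    <name type="identifier">#490</name>
    <name type="synonym">490</name>
    <name type="synonym">Mab 7</name>
    <name type="synonym">Mab7</name>
  </name-list>
</cell-line>
</cellosaurus>"""


def extract_cell_lines(xml_text):
    """Return one dict per <cell-line>, with synonyms collected into a list."""
    root = ET.fromstring(xml_text)
    rows = []
    for cell in root.iter("cell-line"):
        # find() returns None when the element is missing.
        accession = cell.find("accession-list/accession")
        rows.append({
            "category": cell.get("category"),
            "created": cell.get("created"),
            "last_updated": cell.get("last_updated"),
            "accession type": accession.get("type") if accession is not None else None,
            # findtext() returns None when no identifier name exists.
            "name type identifier": cell.findtext("name-list/name[@type='identifier']"),
            # findall() returns every matching element, so no synonym is lost.
            "name type synonym": [
                n.text for n in cell.findall("name-list/name[@type='synonym']")
            ],
        })
    return rows


rows = extract_cell_lines(SAMPLE)
print(rows)
```

Each cell-line produces exactly one dict, which may help if the duplicated results in the question came from iterating over nested elements more than once.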

2023-10-26

狐的传说

Contributed 1804 experience points, received more than 3 likes

In my opinion, the easiest way to parse XML is with lxml.

from lxml import etree

data = """[your xml above]"""

doc = etree.XML(data)

for att in doc.xpath('//cell-line'):
    print(att.attrib['category'])
    print(att.attrib['last_updated'])
    print(att.xpath('.//accession/@type')[0])
    print(att.xpath('.//name[@type="identifier"]/text()')[0])
    print(att.xpath('.//name[@type="synonym"]/text()'))

Output:

Hybridoma

2020-03-12

primary

#490

['490', 'Mab 7', 'Mab7']

You can then assign the output to variables, append it to a list, and so on.
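To make "append it to a list" concrete, here is a small sketch of one way to turn collected values into a table using only the standard library. The records list is hand-written from the output shown above rather than produced by lxml here, and the function name records_to_csv and the "; " separator are assumptions for illustration. Since the question asks for a dataframe: pandas.DataFrame(records) accepts the same list of dicts if pandas is installed.

```python
import csv
import io

# Hypothetical records, copied by hand from the printed output above.
records = [
    {
        "category": "Hybridoma",
        "last_updated": "2020-03-12",
        "accession type": "primary",
        "name type identifier": "#490",
        "name type synonym": ["490", "Mab 7", "Mab7"],
    }
]


def records_to_csv(records, list_sep="; "):
    """Serialize a list of dicts to CSV text, joining list values into one cell."""
    # Column order follows the keys of the first record.
    fieldnames = list(records[0].keys()) if records else []
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Lists (such as the synonyms) are flattened into a single string.
        flat = {
            key: list_sep.join(value) if isinstance(value, list) else value
            for key, value in record.items()
        }
        writer.writerow(flat)
    return buffer.getvalue()


csv_text = records_to_csv(records)
print(csv_text)
```

Joining the synonyms into one cell keeps one row per cell line; a separate row per synonym would be another reasonable layout, depending on how the table is used later.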

2023-10-26

呼唤远方

Contributed 1856 experience points, received more than 11 likes

Another approach. I recently compared several XML parsing libraries and found this one easy to use. I recommend it.

from simplified_scrapy import SimplifiedDoc, utils

xml = '''your xml above'''
# xml = utils.getFileContent('your file name.xml')

results = []
doc = SimplifiedDoc(xml)
for ele in doc.selects('cell-line'):
    key_values = {}
    for k in ele:
        if k not in ['tag', 'html']:
            key_values[k] = ele[k]
    key_values['name type identifier'] = ele.select('name@type="identifier">text()')
    key_values['name type synonym'] = ele.selects('name@type="synonym">text()')
    results.append(key_values)

print(results)

Result:

[{'category': 'Hybridoma', 'created': '2012-06-06', 'last_updated': '2020-03-12', 'entry_version': '6', 'name type identifier': '#490', 'name type synonym': ['490', 'Mab 7', 'Mab7']}]

2023-10-26
