How to parse text from a tag that contains "<?>"

My goal is to get the text: 27. The method according to claim 23 wherein...

How do I go about retrieving the text inside a tag that contains <?. From a Google search, I believe these are called PHP short tags. I am using lxml with XPath, and it just does not seem to register them as tags or nodes. I tried itertext(), but that did not work well either.

<claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>


2 Answers

UYOU


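Here is a snippet that uses XPath to reach the deepest "valid" tag, then uses getchildren() and tail from there to get down to the actual text. The <?...?> markers are XML processing instructions, so lxml treats them as special child nodes rather than as tags, and the text that follows one is stored in its tail.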

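# "import lxml" alone does not load the etree submodule.
from lxml import etree

xml = """ <claim id="CLM-00027" num="00027">
<claim-text> <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys. <?insert-end id="REI-00005" ?></claim-text>
</claim>"""

# Parse the snippet; <claim> becomes the root element.
root = etree.fromstring(xml)

# XPath stops at <claim-text>, the deepest real element.
e = root.xpath("/claim/claim-text")

# The first child is the <?insert-start ...?> processing instruction;
# the text after it is stored in that node's .tail.
# (getchildren() is deprecated in lxml; e[0][0] is equivalent.)
res = e[0].getchildren()[0].tail

print(res)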

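Output:

27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.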
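If indexing into the children feels fragile, XPath's string() function might be a simpler route. The following is only a minimal sketch, assuming a shortened copy of the sample from the question; the names claim_text and pi_targets are illustrative and not part of the original answer.

```python
from lxml import etree

# Shortened copy of the sample from the question (assumption: same structure).
xml = (
    '<claim id="CLM-00027" num="00027">'
    '<claim-text> <?insert-start id="REI-00005" date="20191203" ?>'
    '27. The method according to claim 23 wherein the amorphous metal is selected.'
    ' <?insert-end id="REI-00005" ?></claim-text>'
    '</claim>'
)

root = etree.fromstring(xml)

# string() concatenates every text node under the element; processing
# instructions are not text nodes, so their content is left out.
claim_text = root.xpath("string(/claim/claim-text)").strip()
print(claim_text)

# The processing instructions are still reachable as child nodes.
pi_targets = [
    child.target
    for child in root.find("claim-text")
    if child.tag is etree.ProcessingInstruction
]
print(pi_targets)
```

This avoids depending on which child position the instruction happens to occupy, at the cost of merging all text under claim-text into one string.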

Posted 2023-02-12

守着一只汪


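Access the specific child node by index. xml.etree.ElementTree discards processing instructions by default, so the claim text ends up in the .text of <claim-text>.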

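from xml.etree import ElementTree as ET

# Parse the file; the root element is <claim>.
tree = ET.parse('path_to_your.xml')
root = tree.getroot()

# root[0] is <claim-text>, the first child of <claim>.
print(root[0].text)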

Output:

27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.
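To see why the index lookup works, it may help to compare the default parser with one that keeps processing instructions. This is a rough sketch, assuming Python 3.8 or later (for the insert_pis option of TreeBuilder) and a shortened inline copy of the sample rather than a file; the variable names are illustrative.

```python
from xml.etree import ElementTree as ET

# Shortened copy of the sample from the question (assumption: same structure).
xml = (
    '<claim id="CLM-00027" num="00027">'
    '<claim-text> <?insert-start id="REI-00005" date="20191203" ?>'
    '27. The method according to claim 23 wherein the amorphous metal is selected.'
    ' <?insert-end id="REI-00005" ?></claim-text>'
    '</claim>'
)

# Default parser: processing instructions are discarded, so the
# surrounding text is merged into the .text of <claim-text>.
root = ET.fromstring(xml)
default_text = root[0].text.strip()
print(default_text)

# Python 3.8+: ask the tree builder to keep processing instructions
# as child nodes instead of dropping them.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
root_with_pis = ET.fromstring(xml, parser=parser)
claim_text = root_with_pis[0]

# With the instructions kept, the claim text moves to the tail of the
# first instruction node, much like in the lxml answer.
tail_text = claim_text[0].tail.strip()
print(tail_text)
```

Both prints should show the same sentence; only the place where the text is stored differs between the two parser setups.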

Posted 2023-02-12
