
Strip some tags and rename them

Python

慕森卡 2022-06-02 15:32:20

Using the lxml library, I have this XML document and want to strip some tags and rename others.

doc.xml:

<html>
  <body>
    <h5>Fruits</h5>
    <div>This is some <span attr="foo">Text</span>.</div>
    <div>Some <span>more</span> text.</div>
    <h5>Vegetables</h5>
    <div>Yet another line <span attr="bar">of</span> text.</div>
    <div>This span will get <span attr="foo">removed</span> as well.</div>
    <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
    <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
  </body>
</html>

Instead of html and body, everything should be wrapped in a single p tag. Each h5 should then wrap the div elements that follow it, with the heading text moved into a title attribute. My question is how to get from the format above to the format below using lxml:

<p><h5 title='Fruits'> <div>This is some <span attr='foo'>Test</span>.</div><div>Some<span>more</span>text.</div></h5><h5 title='Vegetables'><div>Yet another line <span attr='bar'>of</span>text.</div>....</h5></p>

This is what I tried with lxml to strip the tags:

tree = etree.tostring(doc.xml)
tree1 = lxml.html.fromstring(tree)
etree.strip_tags(tree1, 'body')

Does anyone have any ideas on this?


2 Answers

皈依舞

Contributed 1851 experience points, received more than 3 likes

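Create a new document that contains only a <p> tag.
Iterate over the descendants of the <body> tag in the original document.

If you encounter an <h5> tag, add an <h5> tag to the <p> tag
and add the tags that follow it as descendants of that <h5>.
In this way the tags from the original document are added to the new document as descendants of its <p> tag.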
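The steps above can be sketched roughly as follows. This is only one reading of the answer, not the answerer's own code: it walks the direct children of <body>, and the names SOURCE and regroup are made up for illustration. The input is a shortened copy of doc.xml. The import falls back to the standard library's ElementTree, which offers the same calls used here, in case lxml is not installed.

```python
import copy

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the stdlib module is an acceptable stand-in for this sketch.
    import xml.etree.ElementTree as etree

# Shortened copy of doc.xml from the question (hypothetical inline sample).
SOURCE = """<html><body>
<h5>Fruits</h5>
<div>This is some <span attr="foo">Text</span>.</div>
<div>Some <span>more</span> text.</div>
<h5>Vegetables</h5>
<div>Yet another line <span attr="bar">of</span> text.</div>
</body></html>"""


def regroup(html_root):
    """Build a new <p> whose <h5 title="..."> children wrap the <div>s that follow each heading."""
    new_root = etree.Element("p")
    current = None  # the new <h5> that is currently collecting <div> elements
    for child in html_root.find("body"):
        if child.tag == "h5":
            # Start a new group; the heading text becomes the title attribute.
            current = etree.SubElement(new_root, "h5", title=(child.text or "").strip())
        elif child.tag == "div" and current is not None:
            # Copy rather than move, so the original tree stays untouched.
            clone = copy.deepcopy(child)
            clone.tail = None  # drop the whitespace that followed the <div>
            current.append(clone)
    return new_root


result = regroup(etree.fromstring(SOURCE))
print(etree.tostring(result, encoding="unicode"))
```

Any <div> that appears before the first <h5> is skipped in this sketch; whether that is the desired behaviour is not stated in the question.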

Replied 2022-06-02

至尊宝的传说

Contributed 1789 experience points, received more than 10 likes

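Here is an XSLT solution using lxml. It hands the processing off to the underlying libxml2/libxslt libraries. I have added comments to the transformation stylesheet: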

from lxml import etree

xsl = etree.XML('''
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" />
  <xsl:strip-space elements="*"/>

  <!-- Replace html/body with a single p wrapper -->
  <xsl:template match="/">
    <p>
      <xsl:apply-templates select="html/body"/>
    </p>
  </xsl:template>

  <xsl:template match="body">
    <xsl:apply-templates />
  </xsl:template>

  <!-- Rebuild each h5 with its text as the title attribute -->
  <xsl:template match="h5">
    <xsl:variable name="title" select="."/>
    <h5>
      <xsl:attribute name="title">
        <xsl:value-of select="$title" />
      </xsl:attribute>
      <!-- Copy every following div whose nearest preceding h5 has this title -->
      <xsl:for-each select="following-sibling::div[preceding-sibling::h5[1] = $title]">
        <xsl:copy-of select="." />
      </xsl:for-each>
    </h5>
  </xsl:template>

  <!-- Suppress divs at body level; they are copied by the h5 template -->
  <xsl:template match="div" />
</xsl:stylesheet>
''')

transform = etree.XSLT(xsl)

# Binary mode lets the parser detect the document encoding itself
with open("doc.xml", "rb") as f:
    print(transform(etree.parse(f)), end='')

If the stylesheet is saved in a file named doc.xsl, the same result can be obtained with the xsltproc utility (shipped with libxslt):

xsltproc doc.xsl doc.xml

Result:

<?xml version="1.0"?>
<p>
  <h5 title="Fruits">
    <div>This is some <span attr="foo">Text</span>.</div>
    <div>Some <span>more</span> text.</div>
  </h5>
  <h5 title="Vegetables">
    <div>Yet another line <span attr="bar">of</span> text.</div>
    <div>This span will get <span attr="foo">removed</span> as well.</div>
    <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
    <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
  </h5>
</p>
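One caveat worth noting: the stylesheet groups each div by comparing the text of its nearest preceding h5 with the current heading, so two headings with identical text would each collect the other's div elements as well. A possible variant, offered only as a sketch and not as part of the original answer, compares node identity with generate-id() instead. The names XSL and DOC and the sample headings are made up for illustration.

```python
from lxml import etree

# Variant stylesheet (an assumption, not the answerer's code): a div belongs
# to an h5 when that h5 is the very node that most closely precedes it.
XSL = """
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <p>
      <xsl:apply-templates select="html/body/h5"/>
    </p>
  </xsl:template>
  <xsl:template match="h5">
    <h5 title="{normalize-space(.)}">
      <xsl:copy-of select="following-sibling::div[
          generate-id(preceding-sibling::h5[1]) = generate-id(current())]"/>
    </h5>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical input with two headings that share the same text.
DOC = """<html><body>
<h5>Misc</h5><div>first</div>
<h5>Misc</h5><div>second</div><div>third</div>
</body></html>"""

transform = etree.XSLT(etree.XML(XSL))
result = transform(etree.XML(DOC))

# Each rebuilt h5 should hold only the divs that directly followed it.
for h5 in result.getroot():
    print(h5.get("title"), [div.text for div in h5])
```

For the doc.xml in the question, where the headings differ, both approaches should group the div elements the same way.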

Replied 2022-06-02
