Is there a way to create an XML element tree?

慕的地6264312 2024-01-24 16:19:33
I am currently writing XSDs and DTDs to validate some XML files, and I am writing them by hand because I have had very bad experiences with XSD generators such as Oxygen. I already have a sample XML file to work from, but it is huge: one element, for example, has 4312 child elements. So I would like to build an XML tree that contains only the unique tags and attributes, so that I don't have to wade through repeated elements while reading the XML to write an XSD.

To illustrate, take this XML (provided by W3):

<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food some_attribute="1.0">
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>
        Two of our famous Belgian Waffles with plenty of real maple syrup
        </description>
        <calories>650</calories>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>
        Light Belgian waffles covered with strawberries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>$8.95</price>
        <description>
        Belgian waffles covered with assorted fresh berries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>French Toast</name>
        <price>$4.50</price>
        <description>
        Thick slices made from our homemade sourdough bread
        </description>
        <calories>600</calories>
        <some_complex_type_element_1>
            <some_simple_type_element_1>Text.</some_simple_type_element_1>
        </some_complex_type_element_1>
    </food>
    <food>
        <name>Homestyle Breakfast</name>
        <price>$6.95</price>
        <description>
        Two eggs, bacon or sausage, toast, and our ever-popular hash browns
        </description>
        <calories>950</calories>
        <some_simple_type_element_2>Text.</some_simple_type_element_2>
    </food>
</breakfast_menu>

As you can see, there are 4 kinds of unique elements under the root element:

  • element 1 (has an attribute)
  • elements 2 and 3
  • element 4 (contains another complexType element)
  • element 5 (contains another simpleType element)

What I want is some kind of tree representation of this XML that contains only the unique elements and no text.

1 Answer

小唯快跑啊

See whether this meets your needs.


from simplified_scrapy import SimplifiedDoc, utils

xml = '''
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food some_attribute="1.0">
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>
        Two of our famous Belgian Waffles with plenty of real maple syrup
        </description>
        <calories>650</calories>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>
        Light Belgian waffles covered with strawberries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>Berry-Berry Belgian Waffles</name>
        <price>$8.95</price>
        <description>
        Belgian waffles covered with assorted fresh berries and whipped cream
        </description>
        <calories>900</calories>
    </food>
    <food>
        <name>French Toast</name>
        <price>$4.50</price>
        <description>
        Thick slices made from our homemade sourdough bread
        </description>
        <calories>600</calories>
        <some_complex_type_element_1>
            <some_simple_type_element_1>Text.</some_simple_type_element_1>
        </some_complex_type_element_1>
    </food>
    <food>
        <name>Homestyle Breakfast</name>
        <price>$6.95</price>
        <description>
        Two eggs, bacon or sausage, toast, and our ever-popular hash browns
        </description>
        <calories>950</calories>
        <some_simple_type_element_2>Text.</some_simple_type_element_2>
    </food>
</breakfast_menu>
'''


def loop(node):
    # Collect every key except 'tag' and 'html' and blank its value
    para = {}
    for k in node:
        if k == 'tag' or k == 'html':
            continue
        para[k] = ''
    if para:
        node.setAttrs(para)  # Keep the attribute names, clear their values
    children = node.children
    if children:
        for c in children:
            loop(c)
    elif node.text:
        node.setContent('')  # Clear the text of leaf elements

doc = SimplifiedDoc(xml)
# Strip text and attribute values from the whole tree
loop(doc.breakfast_menu)

# Keep the first child with each distinct structure, delete later duplicates
dicNode = {}
for node in doc.breakfast_menu.children:
    key = node.outerHtml
    if dicNode.get(key):
        node.remove()  # Delete duplicate
    else:
        dicNode[key] = True

print(doc.html)

Result:


<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
    <food some_attribute="">
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
        <some_complex_type_element_1>
            <some_simple_type_element_1></some_simple_type_element_1>
        </some_complex_type_element_1>
    </food>
    <food>
        <name></name>
        <price></price>
        <description></description>
        <calories></calories>
        <some_simple_type_element_2></some_simple_type_element_2>
    </food>
</breakfast_menu>
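
If installing simplified_scrapy is not an option, the same idea can be sketched with Python's standard library alone. This is a minimal sketch, not the library's method: the helper names `strip_values` and `unique_skeleton` and the shortened sample are made up for illustration. Like the code above, it only removes duplicates among the direct children of the root.

```python
import xml.etree.ElementTree as ET


def strip_values(elem):
    """Blank text, tail and attribute values in place; keep tags and attribute names."""
    elem.text = None
    elem.tail = None
    for name in list(elem.attrib):
        elem.set(name, "")
    for child in elem:
        strip_values(child)


def unique_skeleton(xml_text):
    """Parse xml_text, strip all values, and drop structurally duplicate root children."""
    root = ET.fromstring(xml_text)
    strip_values(root)
    seen = set()
    for child in list(root):  # iterate over a copy because we remove while looping
        key = ET.tostring(child, encoding="unicode")  # serialized structure as the key
        if key in seen:
            root.remove(child)
        else:
            seen.add(key)
    return root


# Hypothetical, shortened sample in the spirit of the breakfast menu
sample = (
    '<breakfast_menu>'
    '<food some_attribute="1.0"><name>Belgian Waffles</name><price>$5.95</price></food>'
    '<food><name>French Toast</name><price>$4.50</price></food>'
    '<food><name>Homestyle Breakfast</name><price>$6.95</price></food>'
    '<food><name>Eggs</name><extra><inner>Text.</inner></extra></food>'
    '</breakfast_menu>'
)

skeleton = unique_skeleton(sample)
if hasattr(ET, "indent"):  # ET.indent exists from Python 3.9 onward
    ET.indent(skeleton)
print(ET.tostring(skeleton, encoding="unicode"))
```

Because the key is the serialized element, two elements count as duplicates only if their tags, attribute names and child order all match, which is the same criterion the `outerHtml` comparison above relies on.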

For large files, try the following approach.


from simplified_scrapy import SimplifiedDoc, utils
from simplified_scrapy.core.regex_helper import replaceReg

filePath = 'test.xml'
doc = SimplifiedDoc()
doc.loadFile(filePath, lineByline=True)

utils.appendFile('dest.xml', '<?xml version="1.0" encoding="UTF-8"?><breakfast_menu>')
dicNode = {}
for node in doc.getIterable('food'):
    key = node.outerHtml
    key = replaceReg(key, '>[^>]*?<', '><')  # Drop text and whitespace between tags
    key = replaceReg(key, '"[^"]*?"', '""')  # Blank the attribute values
    if not dicNode.get(key):
        dicNode[key] = True
        utils.appendFile('dest.xml', key)  # Write each distinct structure once

utils.appendFile('dest.xml', '</breakfast_menu>')
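
A streaming variant can also be sketched with the standard library's `xml.etree.ElementTree.iterparse`, which avoids loading the whole file. This is a rough sketch under assumptions, not a drop-in replacement: the helpers `signature`, `build` and `unique_records` are hypothetical names, the repeated records are assumed to be direct children of the root, and an in-memory sample stands in for `test.xml`. Unlike the regex approach, it compares structure only, so attribute order does not matter.

```python
import io
import xml.etree.ElementTree as ET


def signature(elem):
    """Structure-only key: tag, sorted attribute names, child signatures in order."""
    return (elem.tag, tuple(sorted(elem.attrib)), tuple(signature(c) for c in elem))


def build(sig):
    """Rebuild a value-free element from a signature."""
    tag, attr_names, children = sig
    elem = ET.Element(tag, {name: "" for name in attr_names})
    elem.extend([build(child) for child in children])
    return elem


def unique_records(source, record_tag):
    """Yield one value-free element per structurally distinct record."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    seen = set()
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            key = signature(elem)
            if key not in seen:
                seen.add(key)
                yield build(key)
            root.clear()  # discard processed records so memory stays bounded


# Hypothetical in-memory sample; pass a file path to iterparse for a real file
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu>
  <food some_attribute="1.0"><name>Belgian Waffles</name><calories>650</calories></food>
  <food><name>French Toast</name><calories>600</calories></food>
  <food><name>Homestyle Breakfast</name><calories>950</calories></food>
  <food><name>Eggs</name><calories>950</calories><note>Text.</note></food>
</breakfast_menu>"""

unique = list(unique_records(io.BytesIO(sample), "food"))
for elem in unique:
    print(ET.tostring(elem, encoding="unicode"))
```

Each yielded element can be written to an output file between an opening and closing root tag, in the same way the code above appends to `dest.xml`.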

