Parsing a sitemap XML in Python 3.x

My XML is structured as follows:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>hello world 1</loc>
    <image:image>
      <image:loc>this is image loc 1</image:loc>
      <image:title>this is image title 1</image:title>
    </image:image>
    <lastmod>2019-06-19</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.25</priority>
  </url>
  <url>
    <loc>hello world 2</loc>
    <image:image>
      <image:loc>this is image loc 2</image:loc>
      <image:title>this is image title 2</image:title>
    </image:image>
    <lastmod>2020-03-19</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.25</priority>
  </url>
</urlset>

I only want to get:

hello world 1
hello world 2

My Python code is:

import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()
for url in root.findall('url'):
    loc = url.find('loc').text
    print(loc)

Unfortunately, it prints nothing. But when I change the XML to

<urlset>
  <url>
    <loc>hello world 1</loc>
    <lastmod>2019-06-19</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.25</priority>
  </url>
  <url>
    <loc>hello world 2</loc>
    <lastmod>2020-03-19</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.25</priority>
  </url>
</urlset>

it gives me the correct result:

hello world 1
hello world 2

What should I do to get the correct result without changing the XML? Editing a file of more than 10,000 lines makes no sense.


2 Answers

牛魔王的故事


An (inelegant) fix for your code is:

import xml.etree.ElementTree as ET

tree = ET.parse('test.xml')
root = tree.getroot()

# In find/findall, prefix namespaced tags with the full namespace in braces
for url in root.findall('{http://www.sitemaps.org/schemas/sitemap/0.9}url'):
    loc = url.find('{http://www.sitemaps.org/schemas/sitemap/0.9}loc').text
    print(loc)

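This is needed because the XML declares a default namespace, so every tag name has to be qualified with that namespace when you search for it. For details on using namespaces with the find and findall methods, see "Parse XML namespace with Element Tree findall".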
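A less verbose variant of the same fix, as a minimal sketch: `find`, `findall` and `findtext` in the standard library accept a prefix-to-URI mapping, so the braces do not have to be repeated. The prefix `sm`, the helper `page_locs` and the trimmed inline sample are my own illustrative choices, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's sitemap (assumption: same namespaces, fewer fields)
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>hello world 1</loc>
    <image:image>
      <image:loc>this is image loc 1</image:loc>
    </image:image>
  </url>
  <url>
    <loc>hello world 2</loc>
  </url>
</urlset>"""

# The prefix "sm" is arbitrary; only the URI has to match the document.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def page_locs(root):
    """Return the text of each <loc> that is a direct child of a <url>."""
    return [url.findtext("sm:loc", namespaces=NS)
            for url in root.findall("sm:url", NS)]


root = ET.fromstring(SITEMAP_XML)

# The parsed tag carries the namespace, which is why findall('url') finds nothing.
print(root.tag)            # {http://www.sitemaps.org/schemas/sitemap/0.9}urlset
print(root.findall("url"))  # []
print(page_locs(root))      # ['hello world 1', 'hello world 2']
```

For the real file, `ET.parse('test.xml').getroot()` would take the place of `ET.fromstring(...)`.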

Answered 2022-12-06

繁星点点滴滴


If you would rather not deal with namespaces, here is a simpler and more elegant solution than the accepted answer, using a generic XPath query:

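import lxml.etree

tree = lxml.etree.parse('test.xml')

# local-name() ignores namespaces; anchoring on <url> skips the nested <image:loc>,
# which a bare //*[local-name()='loc'] would also return
for url in tree.xpath("//*[local-name()='url']/*[local-name()='loc']/text()"):
    print(url)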
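If installing lxml is not an option, the standard library offers a roughly equivalent namespace-agnostic search. This is a sketch that assumes Python 3.8 or later, where `xml.etree.ElementTree` accepts the `{*}` wildcard in `find`/`findall`; the inline sample and variable names are mine.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's sitemap (assumption: same namespaces, fewer fields)
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>hello world 1</loc>
    <image:image>
      <image:loc>this is image loc 1</image:loc>
    </image:image>
  </url>
  <url>
    <loc>hello world 2</loc>
  </url>
</urlset>"""

root = ET.fromstring(sample)

# {*}name matches "name" in any namespace (Python 3.8+).
# Restricting the path to url/loc keeps the nested image:loc out of the result.
page_locs = [el.text for el in root.findall("{*}url/{*}loc")]

# Searching every descendant named loc also picks up image:loc.
all_locs = [el.text for el in root.findall(".//{*}loc")]

print(page_locs)
print(all_locs)
```

The second query is there only to show the pitfall: matching on the local name alone cannot tell the page `loc` from the image `loc`, so the path has to do that job.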

If you prefer to work with the XML namespaces explicitly, do it like this:

import lxml.etree

tree = lxml.etree.parse('test.xml')

# The prefix name is arbitrary; only the URI has to match the document
namespaces = {
    'sitemapindex': 'http://www.sitemaps.org/schemas/sitemap/0.9',
}

for url in tree.xpath("//sitemapindex:loc/text()", namespaces=namespaces):
    print(url)

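If you prefer to load the XML data directly from memory rather than from a file, you can use lxml.etree.fromstring instead of lxml.etree.parse. It returns the root element, which supports the same xpath calls.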
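A minimal sketch of the in-memory variant. To keep it runnable without lxml installed, it falls back to the standard library and uses `findall` with a namespace mapping, which both libraries support, instead of `xpath`. The sample bytes and the prefix `sm` are assumptions for illustration.

```python
try:
    import lxml.etree as etree              # preferred when available
except ImportError:
    import xml.etree.ElementTree as etree   # stdlib fallback, same calls used below

# Pass bytes rather than str: lxml rejects a str that carries an encoding declaration.
xml_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>hello world 1</loc></url>
  <url><loc>hello world 2</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# fromstring returns the root element directly (no tree.getroot() step).
root = etree.fromstring(xml_bytes)
locs = [el.text for el in root.findall("sm:url/sm:loc", namespaces=NS)]
print(locs)
```

This is handy when the sitemap comes from an HTTP response body, since the raw bytes can be handed over as they are.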

Answered 2022-12-06
