2 回答
TA贡献2021条经验 获得超8个赞
这是可行的——不是很优雅,但是有效。我扩展了您的示例 html,引入了一些更多有问题的节点:
test = """
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<table>
<tr>
<th>test</th>
<td>abc</td>
</tr>
<tr>
<th>test1</th>
<td>abc</td>
<td>abc</td>
</tr>
<tr>
<th>test2</th>
<td>abc</td>
</tr>
<tr>
<a>test3</a>
<td>abcd</td>
</tr>
<tr>
<td>test4</td>
<td>abcd</td>
</tr>
</table>
</body> """
import lxml.html
doc = lxml.html.fromstring(test)
good_tags = ['th','td']
targs = doc.xpath('//tr')
for targ in targs:
tr = targ.xpath('.//*')
if len(tr)==2 and (tr[0].tag != tr[1].tag) and tr[0].tag in good_tags and tr[1].tag in good_tags:
print(lxml.html.tostring(targ).decode())
输出:
<tr>
<th>test</th>
<td>abc</td>
</tr>
<tr>
<th>test2</th>
<td>abc</td>
</tr>
- 2 回答
- 0 关注
- 104 浏览
添加回答
举报