1 Answer
You can use BeautifulSoup to strip the tags out of all the strings. For example:
from bs4 import BeautifulSoup

lst = [{'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 1},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 2},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 3},
       {'field1': [],
        'field2': {'field2_1': 'string1',
                   'field2_2': 'string2',
                   'field2_3': 'string3'},
        'field3': '<html> strings4 <html>',
        'id': 4}]


def flatten(d):
    # Recursively yield every string found in nested dicts and lists.
    # Non-string leaves (such as the integer id) are skipped.
    if isinstance(d, dict):
        for v in d.values():
            yield from flatten(v)
    elif isinstance(d, list):
        for v in d:
            yield from flatten(v)
    elif isinstance(d, str):
        yield d


out = {}
for d in lst:
    # Join all strings of one record, parse the result, and keep only the text nodes.
    soup = BeautifulSoup(' '.join(flatten(d)), 'html.parser')
    out[d['id']] = ' '.join(map(str.strip, soup.find_all(string=True)))

print(out)
This prints:
{1: 'string1 string2 string3 strings4', 2: 'string1 string2 string3 strings4', 3: 'string1 string2 string3 strings4', 4: 'string1 string2 string3 strings4'}
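If installing BeautifulSoup is not an option, the tag-stripping step could probably be done with the standard library's `html.parser` instead. The following is only a rough sketch of that alternative, not part of the answer above: `TextCollector` and `strip_tags` are hypothetical names made up for this illustration, and the parser is less forgiving of unusual markup than BeautifulSoup.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps the text between tags and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(markup):
    """Return the text of `markup` with tags removed, pieces joined by one space."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return ' '.join(parser.chunks)


print(strip_tags('string1 string2 string3 <html> strings4 <html>'))
```

Under these assumptions, `strip_tags(' '.join(flatten(d)))` would take the place of the BeautifulSoup call inside the loop.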