2 Answers
Answered by a user with 2003 experience points, 2+ upvotes
If all of your <a> tags have the same structure, you can use:
from bs4 import BeautifulSoup
import pandas as pd
page = '''<blockquote>
<a name="title"><p><B>Title</b> <table frame="hsides" border="1" cellspacing="0" cellpadding="2" bordercolor="darkblue"><tr><td><font face="arial" size="2" color="#0000CC"><b><I>Subtitle</I>: Top Text.</b></font></td></tr></table> Body Text.</blockquote>
'''
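soup = BeautifulSoup(page, "html.parser")

rows = []
for link in soup.find_all('a'):
    paragraph = link.find('p')
    # The first <b> holds the title, the second <b> holds the subtitle line.
    title = link.find('b').text
    subtitle = link.find_all('b')[1].text
    # Keep only the direct text children of <p>, i.e. the text outside <b> and <table>.
    other = ''.join(paragraph.find_all(text=True, recursive=False))
    rows.append({'col1': title, 'col2': subtitle, 'col3': other})

# Build the DataFrame with one row per <a> tag.
df = pd.DataFrame(rows)
print(df)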
Output:
col1 col2 col3
0 Title Subtitle: Top Text. Body Text.
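The `recursive=False` argument is what isolates the body text here: `find_all` then returns only the strings that are direct children of `<p>` and skips text nested inside `<b>` or `<table>`. A minimal sketch of that behaviour, using a simplified, made-up snippet rather than the original page, and the `string=` keyword (the newer name for `text=` in Beautiful Soup 4.4.0 and later):

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: two nested tags plus trailing direct text.
html = '<p><b>Title</b><span>Nested</span> Body Text.</p>'
soup = BeautifulSoup(html, 'html.parser')
paragraph = soup.find('p')

# Every string anywhere under <p>, nested ones included.
all_strings = paragraph.find_all(string=True)

# Only the strings that are immediate children of <p>.
direct_strings = paragraph.find_all(string=True, recursive=False)

# Join and trim surrounding whitespace to get a clean body value.
body = ''.join(direct_strings).strip()
print([str(s) for s in all_strings])
print(body)
```

Stripping the joined string is optional; it only removes the leading whitespace left over between tags.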
Answered by a user with 1817 experience points, 6+ upvotes
Using only the HTML snippet you shared:
from bs4 import BeautifulSoup
content = '<a name="title"><p><B>Title</b> ' \
'<table frame="hsides" border="1" cellspacing="0" cellpadding="2" bordercolor="darkblue">' \
'<tr><td><font face="arial" size="2" color="#0000CC"><b><I>Subtitle</I>: Top Text.</b></font>' \
'</td></tr></table> Body Text.'
soup = BeautifulSoup(content, 'html.parser')
articles = soup.find_all('a')
for article in articles:
    paragraph = article.find('p')
    print({
        'title': article.find('b').text,
        'subtitle': article.select('table i')[0].text,
        'body': ''.join(paragraph.find_all(text=True, recursive=False))
    })
Since the question is mainly about BeautifulSoup rather than Pandas, I think a dictionary is enough; you can put it into a DataFrame or another data structure yourself.
Result:
{'title': 'Title', 'subtitle': 'Subtitle', 'body': ' Body Text.'}
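If a DataFrame is wanted after all, one option is to collect one dictionary per `<a>` tag in a list and pass that list to `pd.DataFrame`. This is only a sketch, assuming every `<a>` has the same structure as the snippet above; the `extract` helper, the whitespace stripping, and the trimmed table attributes are my own additions for illustration:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Same structure as the shared snippet, with tag attributes trimmed for brevity.
content = ('<a name="title"><p><B>Title</b> '
           '<table><tr><td><font><b><I>Subtitle</I>: Top Text.</b></font>'
           '</td></tr></table> Body Text.')


def extract(article):
    """Hypothetical helper: turn one <a> tag into a flat dictionary."""
    paragraph = article.find('p')
    return {
        'title': article.find('b').get_text(strip=True),
        'subtitle': article.select('table i')[0].get_text(strip=True),
        # Direct text children of <p> only, with surrounding whitespace removed.
        'body': ''.join(paragraph.find_all(string=True, recursive=False)).strip(),
    }


soup = BeautifulSoup(content, 'html.parser')
rows = [extract(article) for article in soup.find_all('a')]

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(rows)
print(df)
```

A list of dictionaries keeps the parsing step independent of Pandas, so the same `rows` could just as well be written out as JSON or CSV.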