3 Answers

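Contributor with 1816 experience points · 6+ upvotes
In this example we can use CSS selectors. Assuming you are using BeautifulSoup 4.7+, CSS selector support is provided by the Soup Sieve library. We first use the CSS Level 4 :has() selector to find <p> tags that have a direct child <b> tag, then use Soup Sieve's non-standard :contains selector to make sure the <b> tag contains Description:. Then we simply print the content of every element that matches these criteria, stripping leading and trailing whitespace and removing Description:. Keep in mind that there are several ways to do this; this is just the one I chose to illustrate the approach: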
import bs4

# Markup copied from the page; each field sits in its own <p> with a <b> label.
markup = """
<div class="col-sm-6">
<p><b>Book Title:</b>
<A HREF="book_detail.cfm?ID=2449">The Holy Bible containing the Old and New Testaments, according to the authorised version. With illustrations by Gustave Doré</a></p>
<p><b>Author:</b> Doré, Gustave, 1832-1883</p>
<p><b>Image Title:</b> Baptism of Jesus</p>
<p><b>Scripture Reference:</b><ul><li>John 1 (<a href='search.cfm?biblicalbook=John&biblicalbookchapter=1'>further images</a> / <a rel='shadowbox;height=500;width=600' href='http://www.commonenglishbible.com/explore/passage-lookup/?query=John+1'>scripture text</a>)</li></ul></p>
<p><b>Description:</b> John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.</p>
<p><A HREF="book_list.cfm?ID=2449">Click here
</a> for additional images available from this book.</p>
<p>For information on licensing this image, please send an email, including a link to the image, to
<a href="mailto:dia@emory.edu?subject=Licensing%20Image%20From%20DIA - 17250">dia@emory.edu</a></p>
</div>
"""

soup = bs4.BeautifulSoup(markup, "html.parser")
# Select every <p> whose direct child <b> contains the text "Description:".
for el in soup.select('p:has(> b:contains("Description:"))'):
    # strip() removes surrounding whitespace; replace() drops the label itself.
    print(el.get_text().strip().replace('Description: ', ''))

Output:
John the Baptist baptizes Jesus in the Jordan River; the Holy Spirit appears overhead in the form of a dove. The artist, Gustave Doré (1832-1883), has placed his signature at the lower left of the woodcut, and the engraver’s signature, A. Ligny, is located at the lower right.
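
The same selector idea can be wrapped in a small helper so that any labelled field can be looked up, not only the description. This is a minimal sketch rather than part of the original answer: the `field_text` helper and the trimmed-down `SAMPLE` markup are hypothetical, and it assumes Soup Sieve 2.1 or later, where `:-soup-contains()` is the newer spelling of `:contains()`.

```python
import bs4

# Hypothetical, shortened markup modelled on the page discussed above.
SAMPLE = """
<div class="col-sm-6">
<p><b>Author:</b> Doré, Gustave, 1832-1883</p>
<p><b>Image Title:</b> Baptism of Jesus</p>
<p><b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.</p>
</div>
"""


def field_text(soup, label):
    """Return the text of the first <p> whose direct child <b> contains `label`.

    Returns None when no such paragraph exists.
    """
    # Assumes Soup Sieve 2.1+; on older versions use :contains() instead.
    el = soup.select_one(f'p:has(> b:-soup-contains("{label}"))')
    if el is None:
        return None
    text = el.get_text().strip()
    # Drop the leading label, then any whitespace left behind it.
    if text.startswith(label):
        text = text[len(label):].strip()
    return text


soup = bs4.BeautifulSoup(SAMPLE, "html.parser")
print(field_text(soup, "Description:"))
print(field_text(soup, "Author:"))
```

Returning None for a missing label keeps the caller from crashing if the page layout changes, which a bare `select(...)[0]` would not.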

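Contributor with 1876 experience points · 7+ upvotes
import urllib.request
from bs4 import BeautifulSoup

url = "http://pitts.emory.edu/dia/image_details.cfm?ID=17250"
f = urllib.request.urlopen(url)
soup = BeautifulSoup(f, 'html.parser')
# Locate the <b>Description:</b> label and keep a reference to its parent element.
parent = soup.find("b", text="Description:").parent
# Remove the label so that only the description text remains in the parent.
parent.find("b", text="Description:").decompose()
print(parent.get_text().strip())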
I added BeautifulSoup and removed the "Description:" label.
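
The find-then-decompose approach can also be tried without a network request. The following is a rough sketch under stated assumptions: the `SAMPLE` markup and the `description_without_label` helper are hypothetical, and it uses the `string=` argument, which is the newer name for `text=` in Beautiful Soup 4.4+.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup standing in for the live page.
SAMPLE = """<div>
<p><b>Image Title:</b> Baptism of Jesus</p>
<p><b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.</p>
</div>"""


def description_without_label(html):
    """Return the description text with its <b> label removed, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match the <b> tag whose only content is exactly "Description:".
    label = soup.find("b", string="Description:")
    if label is None:
        return None
    parent = label.parent   # keep the parent before destroying the label
    label.decompose()       # remove the <b> tag from the tree
    return parent.get_text().strip()


result = description_without_label(SAMPLE)
print(result)
```

Holding on to `label` avoids searching the tree a second time, and the None check covers pages where the label is absent.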

Contributor with 1865 experience points · 7+ upvotes
I used the <p> tags as an index and picked index [4]. I'm only a beginner, but it worked.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://pitts.emory.edu/dia/image_details.cfm?ID=17250")
soup = BeautifulSoup(html, 'html.parser')
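# The description is the fifth <p> on the page (index 4); this breaks if the layout changes.
page = soup.find_all('p')[4].getText()
print(page.strip())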
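
Because positional indexing depends on the page layout, a bounds check is a cheap safeguard. This is only an illustrative sketch: `SAMPLE` is hypothetical markup with the same paragraph order as described above, and `nth_paragraph_text` is a made-up helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with five <p> tags in the order described in this thread.
SAMPLE = """<div>
<p><b>Book Title:</b> The Holy Bible</p>
<p><b>Author:</b> Doré, Gustave, 1832-1883</p>
<p><b>Image Title:</b> Baptism of Jesus</p>
<p><b>Scripture Reference:</b> John 1</p>
<p><b>Description:</b> John the Baptist baptizes Jesus in the Jordan River.</p>
</div>"""


def nth_paragraph_text(html, index):
    """Return the stripped text of the <p> at `index`, or None if out of range."""
    paragraphs = BeautifulSoup(html, "html.parser").find_all("p")
    if not 0 <= index < len(paragraphs):
        return None
    return paragraphs[index].get_text().strip()


print(nth_paragraph_text(SAMPLE, 4))
```

Note that the label is still part of the returned text here; it would need to be stripped separately, as in the other answers.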