4 Answers
Contributed 1744 experience points, received more than 4 likes
An option using CSS selectors with Scrapy:
address = response.css("span.address-line1::text, span.address-line2::text, span[itemprop=addressLocality]::text, span[itemprop=addressRegion]::text, span[itemprop=postalCode]::text").extract()  # returns a list of strings
if address:
    # the original snippet is cut off after `", ".`; str.join is the presumable intent
    address = ", ".join(address)
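The list returned by the selector can contain empty or whitespace-padded strings when a span is missing or padded in the markup. A minimal sketch of a cleanup step before joining, assuming the input is a plain list of strings like the one `.extract()` returns; the helper name `join_address_parts` is hypothetical and not part of Scrapy:

```python
def join_address_parts(parts):
    """Join address fragments with ', ' after dropping blanks.

    `parts` is assumed to be a list of strings (possibly containing
    None or whitespace-only entries), as a text selector might return.
    """
    # Strip surrounding whitespace and skip entries that end up empty.
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return ", ".join(cleaned)


if __name__ == "__main__":
    sample = ["5835 Post Rd.", " Suite 217 ", "", "East Greenwich", "RI", "02818"]
    print(join_address_parts(sample))
```

With this approach every fragment is separated by a comma, including the state and the postal code; if a different layout is wanted, the fragments have to be joined individually.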
Contributed 1909 experience points, received more than 7 likes
A quick-and-dirty solution using a single-line XPath:
concat(//span[@class='address-line1']/text(),' ',//span[@class='address-line2']/text(),' ',//span[@itemprop='addressLocality']/text(),', ',//span[@itemprop='addressRegion']/text(),//span[@itemprop='postalCode']/text())
Output:
"5835 Post Rd. Suite 217 East Greenwich, RI02818"
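The output reads "RI02818" because the `concat()` call puts no separator between the `addressRegion` and `postalCode` arguments. The sketch below emulates the same concatenation by hand with a space added between the two. It assumes the sample markup posted in this thread, which happens to be well-formed XML, and uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath (no `concat()`); the helper names `first_text` and `format_address` are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<span class="office-address" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">
<span class="address-line1">5835 Post Rd.</span>
<span class="address-line2">Suite 217</span>
</span>
<span class="city-state-zip">
<span itemprop="addressLocality">East Greenwich</span>, <span itemprop="addressRegion">RI</span> <span itemprop="postalCode">02818</span>
</span>
</span>"""


def first_text(root, path):
    """Return the stripped text of the first match for `path`, or ''."""
    node = root.find(path)
    if node is None or node.text is None:
        return ""
    return node.text.strip()


def format_address(markup):
    """Look up each address field separately and assemble one line."""
    root = ET.fromstring(markup)
    line1 = first_text(root, ".//span[@class='address-line1']")
    line2 = first_text(root, ".//span[@class='address-line2']")
    city = first_text(root, ".//span[@itemprop='addressLocality']")
    region = first_text(root, ".//span[@itemprop='addressRegion']")
    postal = first_text(root, ".//span[@itemprop='postalCode']")
    # Unlike the concat() above, a space separates region and postal code.
    return f"{line1} {line2} {city}, {region} {postal}"


if __name__ == "__main__":
    print(format_address(SAMPLE))
```

Real-world HTML is often not well-formed XML, in which case an HTML-aware parser is needed instead of `ElementTree`.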
Contributed 1921 experience points, received more than 9 likes
Here is a more future-proof idea, since ids/classes can change over time:
from re import sub
from bs4 import BeautifulSoup as bs
teststr = """<span class="office-address" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">
<span class="address-line1">5835 Post Rd.</span>
<span class="address-line2">Suite 217</span>
</span>
<span class="city-state-zip">
<span itemprop="addressLocality">East Greenwich</span>, <span itemprop="addressRegion">RI</span> <span itemprop="postalCode">02818</span>
</span>
</span>"""
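# Drop all tags and keep only the text, without leading/trailing whitespace
r = bs(teststr, "lxml").getText().strip()
# Turn every line break into a comma separator
r = sub(r"\n", ", ", r)
# Collapse runs of commas and spaces into a single ", "
r = sub(r"[, ]{2,}", ", ", r)
print(r)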
Result:
5835 Post Rd., Suite 217, East Greenwich, RI 02818
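The same text-flattening idea can be sketched without BeautifulSoup or lxml, in case those packages are not installed. This variant is an assumption of mine rather than part of the answer above: it collects text nodes with the standard-library `html.parser.HTMLParser` and then applies the same two regular-expression substitutions. The names `TextCollector` and `flatten_address` are hypothetical.

```python
import re
from html.parser import HTMLParser

SAMPLE = """<span class="office-address" itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">
<span class="address-line1">5835 Post Rd.</span>
<span class="address-line2">Suite 217</span>
</span>
<span class="city-state-zip">
<span itemprop="addressLocality">East Greenwich</span>, <span itemprop="addressRegion">RI</span> <span itemprop="postalCode">02818</span>
</span>
</span>"""


class TextCollector(HTMLParser):
    """Collect every text node and ignore all tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def flatten_address(markup):
    """Reduce the markup to one comma-separated address line."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.chunks).strip()
    # Line breaks become separators, then repeated separators collapse.
    text = re.sub(r"\n", ", ", text)
    return re.sub(r"[, ]{2,}", ", ", text)


if __name__ == "__main__":
    print(flatten_address(SAMPLE))
```

Because neither version depends on class names or `itemprop` attributes, both keep working if those change, but they also pick up any unrelated text that appears inside the outer span.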