3 回答
TA贡献2039条经验 获得超7个赞
所需内容在页面源代码中即可获取。持续使用相同的 user-agent 发请求会被网站拦截,所以我用 fake_useragent 为每次请求随机提供不同的 user-agent。只要不过于频繁地请求,这个办法是有效的。
工作解决方案:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from fake_useragent import UserAgent
# Listing page for Dallas, TX real-estate agents (single page; no pagination here).
URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'
def get_info(s, link, user_agent=None):
    """Scrape agent profile links from a century21.com listing page.

    Prints each ``(profileUrl, profileName)`` pair and also returns them
    as a list, so callers no longer have to parse stdout.

    s: a requests.Session — its ``User-Agent`` header is mutated in place.
    link: absolute URL of the listing page; relative profile hrefs are
        resolved against it with ``urljoin``.
    user_agent: optional ``fake_useragent.UserAgent``. When omitted a fresh
        one is created, fixing the original's hidden dependency on a global
        ``ua`` that was only defined under the ``__main__`` guard (importing
        the module and calling this function raised NameError).
    """
    if user_agent is None:
        user_agent = UserAgent()  # was: implicit module global `ua`
    # Rotate the user-agent on every call so repeated requests are less
    # likely to be blocked for reusing the same identity.
    s.headers["User-Agent"] = user_agent.random
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    results = []
    for item in soup.select(".media__content a[itemprop='url']"):
        profileUrl = urljoin(link, item.get("href"))
        profileName = item.select_one("span[itemprop='name']").get_text()
        print(profileUrl, profileName)
        results.append((profileUrl, profileName))
    return results
if __name__ == '__main__':
    # One UserAgent shared at module scope; get_info reads it as the
    # global `ua` when no explicit instance is handed in.
    ua = UserAgent()
    with requests.Session() as session:
        get_info(session, URL)
部分输出:
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a Stewart Kipness
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Andrea-Anglin-Bulin-2631495a Andrea Anglin Bulin
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Betty-DeVinney-2631507a Betty DeVinney
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Sabra-Waldman-2657945a Sabra Waldman
https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/Russell-Berry-2631447a Russell Berry
TA贡献1866条经验 获得超5个赞
看起来你也可以自己拼出 url(虽然直接从页面里抓取现成的链接似乎更省事)
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'

# Browser-like headers copied from a real Chrome session.
# NOTE(review): the hardcoded JSESSIONID / website_user_id cookie values
# are presumably stale — confirm whether the site responds without them.
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'cache-control': 'max-age=0',
    'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}

r = requests.get(URL, headers=headers)
soup = bs(r.content, 'lxml')

ids = []
names = []
urls = []
# Rebuild each agent's profile URL from its id and hyphenated name.
# NOTE(review): the office segment below is hardcoded, so agents from
# other offices get the Judge-Fite prefix — verify against live data.
for card in soup.select('.media'):
    agent = card.select_one('[data-agent-id]')
    if agent is None:
        continue  # media card without an agent id — skip it
    agent_id = agent['data-agent-id']
    agent_name = card.select_one('[itemprop=name]').text.replace(' ', '-')
    ids.append(agent_id)
    names.append(agent_name)
    urls.append(
        'https://www.century21.com/CENTURY-21-Judge-Fite-Company-14501c/'
        + agent_name + '-' + agent_id + 'a'
    )

results = list(zip(names, urls))
print(results)
TA贡献1831条经验 获得超10个赞
页面内容并不是由 JavaScript 动态渲染的,你的代码在我这里可以正常运行。你只是在获取 profileUrl 和处理 NoneType 异常时遇到了一点问题:应当定位到 a 标签再从中提取数据。
你应该试试这个:
import requests
from bs4 import BeautifulSoup
# Listing page for Dallas, TX real-estate agents.
URL = 'https://www.century21.com/real-estate-agents/Dallas,TX'
# Browser-like request headers copied from a real Chrome session.
# NOTE(review): the hardcoded JSESSIONID / website_user_id cookie values
# are presumably stale — confirm whether the request works without them.
headers = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
'cache-control': 'max-age=0',
'cookie': 'JSESSIONID=8BF2F6FB5603A416DCFBAB8A3BB5A79E.app09-c21-id8; website_user_id=1255553501;',
'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
def get_info(link):
    """Return a list of ``{'profileUrl', 'profileName'}`` dicts scraped
    from the listing page at *link*.

    Only ``.media__content`` cards that actually contain an ``a`` tag are
    kept, which avoids NoneType errors on cards without a profile link.
    Note the returned hrefs are site-relative, exactly as they appear in
    the page source.
    """
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    profiles = []
    for card in soup.select(".media__content"):
        anchor = card.find('a')
        if not anchor:
            continue  # card has no profile link — skip it
        profiles.append({
            'profileUrl': anchor.get('href'),
            'profileName': anchor.get_text(),
        })
    return profiles
if __name__ == '__main__':
    # Dump every scraped record plus a count of how many were found.
    records = get_info(URL)
    print(records)
    print(len(records))
输出:
[{'profileName': 'Stewart Kipness',
'profileUrl': '/CENTURY-21-Judge-Fite-Company-14501c/Stewart-Kipness-2657107a'},
....,
{'profileName': 'Courtney Melkus',
'profileUrl': '/CENTURY-21-Realty-Advisors-47551c/Courtney-Melkus-7389925a'}]
941
添加回答
举报