2 Answers
Contributor with 1815 experience points, 10+ likes
The full job description is stored inside the page as a JavaScript variable. You can extract it with selenium or with the re module:
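import re
import requests
from bs4 import BeautifulSoup

url = 'https://djinni.co/jobs2/144172-data-scientist'
html_data = requests.get(url).text

# The full description sits in an inline script as fullDescription:"...",
match = re.search(r'fullDescription:"(.*?)",', html_data)
# Turn the literal \r\n escape sequences into real newlines
full_desc = match.group(1).replace(r'\r\n', '\n') if match else ''
# The short teaser is ordinary HTML, so BeautifulSoup can read it directly
short_desc = BeautifulSoup(html_data, 'html.parser').select_one('.job-teaser').get_text()

print(short_desc)
print('-' * 80)
print(full_desc)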
Prints:
Together Networks is looking for an experienced Data Scientist to join our Agile team. Together Networks is a worldwide leader in the online dating niche with millions of users across more than 45 countries. Our brands are BeNaughty, CheekyLovers, Flirt, Click&Flirt, Flirt Spielchen.
--------------------------------------------------------------------------------
What you get to deal with:
- Active collaboration with stakeholders throughout the organization;
- User experience modelling;
- Advanced segmentation;
- User behavior analytics;
- Anomaly detection, fraud detection;
- Looking for bottlenecks;
- Churn prediction.
You need to have (required):
- Master's or PHD in Statistics, Mathematics, Computer Science or another quantitative field;
- 2+ years of experience manipulating data sets and building statistical models;
- Strong knowledge in a wide range of machine learning methods and algorithms for classification, regression, clustering, and others;
- Knowledge and experience in statistical and data mining techniques;
- Experience using statistical computer languages (Python, SLQ, etc.) to manipulate data and draw insights from large data sets.
- Knowledge of a variety of machine learning techniques and their real-world advantages\u002Fdrawbacks;
- Experience visualizing\u002Fpresenting insights for stakeholders;
- Independent, creative thinking, and ability to learn fast.
Would be a great plus:
- Experience dealing with end to end machine learning projects: data exploration, feature engineering\u002Fdefinition, model building, production, maintenance;
- Experience in data visualization with Tableau;
- Experience in dating, game dev, social projects.
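The printed description still contains JavaScript escapes such as \u002F, because the regex captures the raw body of the string literal. One possible way to decode them is to parse the captured text as a JSON string. This is only a sketch: the helper name extract_full_description and the sample string are made up for illustration, and it assumes the literal uses only escapes that JSON also accepts.

```python
import json
import re


def extract_full_description(html: str) -> str:
    """Pull fullDescription out of the page source and decode its escapes."""
    match = re.search(r'fullDescription:"(.*?)",', html)
    if match is None:
        raise ValueError('fullDescription not found in page source')
    # Re-wrap the captured body in quotes and parse it as a JSON string,
    # which decodes \u002F, \r, \n and similar escapes in one step.
    text = json.loads('"' + match.group(1) + '"')
    # Normalize Windows line endings after decoding
    return text.replace('\r\n', '\n')


# Hypothetical snippet imitating the inline script on the page
sample = r'window.job = {fullDescription:"- Churn prediction.\r\n- advantages\u002Fdrawbacks;",title:"x"}'
print(extract_full_description(sample))
```

If the real literal contains escapes that JSON rejects (for example \x41 or \'), json.loads raises an error and a different decoder would be needed.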
Contributor with 1790 experience points, 9+ likes
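This is a classic mistake in web scraping.
You probably looked at the source of the rendered HTML in your browser and tried to get the p element inside the job-description-wrapper div.
However, if you load only the page itself (the first request the browser makes) and look at its content, you will see that the paragraph is not there initially. A script fills in its content afterwards, but this happens so quickly that as a user you hardly notice it.
Check this output: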
print(requests.get(url='https://djinni.co/jobs2/144172-data-scientist').text)
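Rather than reading the whole dump by eye, the same check can be made programmatically. The sketch below uses two made-up HTML strings as stand-ins for the raw response and the browser-rendered page. It also assumes job-description-wrapper is a CSS class, which the page itself would need to confirm.

```python
from bs4 import BeautifulSoup


def has_rendered_paragraph(html: str, selector: str = '.job-description-wrapper p') -> bool:
    """Return True if the selector matches an element with non-empty text."""
    element = BeautifulSoup(html, 'html.parser').select_one(selector)
    return element is not None and bool(element.get_text(strip=True))


# Hypothetical stand-ins: what requests receives vs. what the browser shows
raw_html = '<div class="job-description-wrapper"></div><script>/* fills the div later */</script>'
rendered_html = '<div class="job-description-wrapper"><p>What you get to deal with:</p></div>'

print(has_rendered_paragraph(raw_html))       # False
print(has_rendered_paragraph(rendered_html))  # True
```

Passing requests.get(url).text to the same function shows whether the paragraph exists before any script has run.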
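The missing content in that initial response is what causes the problem. How to solve it is another matter. One approach is to use a headless browser from Python: it runs the page's scripts after loading, and you read the content you need only once the page has finished loading everything. A tool such as selenium can do this.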
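A minimal sketch of the headless-browser approach with Selenium 4 might look like the following. It assumes Chrome is installed locally. The function name fetch_rendered_text and the selector in the usage comment are illustrative and not taken from the page.

```python
def fetch_rendered_text(url: str, css_selector: str, timeout: float = 10.0) -> str:
    """Load url in headless Chrome and return the text of the first match."""
    # Imported here so this module still imports when Selenium is not installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    # Recent Chrome versions accept --headless=new; older ones use --headless
    options.add_argument('--headless=new')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the script-generated element is visible on the page
        element = WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return element.text
    finally:
        # Always close the browser, even if the wait times out
        driver.quit()


# Example call (assumes job-description-wrapper is a class name):
# print(fetch_rendered_text('https://djinni.co/jobs2/144172-data-scientist',
#                           '.job-description-wrapper p'))
```

Waiting explicitly for the element, rather than sleeping for a fixed time, lets the call return as soon as the script has filled in the content. If the element never appears, the wait raises a TimeoutException.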