Looking at the page, I see that the link is always two levels above the description. You can then use the find_parent() function to get the <a> tag of each job you found.
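As a minimal sketch of how find_parent() walks up from a matched tag (the HTML fragment below is made up, it just mirrors the structure described):

from bs4 import BeautifulSoup

# Invented fragment: the <a> sits two levels above the <p> description.
html = '<a href="ETender.aspx?id=123&action=show"><div><p>Big Data Analyst</p></div></a>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.find('p')
print(p.find_parent('a').get('href'))  # -> ETender.aspx?id=123&action=show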
In your code you have:
jobs = soup.find_all('p',text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords)))
Then add right after it:
for i in jobs:
    print(i.find_parent('a').get('href'))
This will print the links. Note that these are relative links, not absolute ones. You should prepend the site root to get the full page URL. For example, if you find a link like ETender.aspx?id=ed60009c-8d64-4759-a722-872e21cf9ea7&action=show, you have to add https://www.auftrag.at/ at the beginning, giving the final link: https://www.auftrag.at/ETender.aspx?id=ed60009c-8d64-4759-a722-872e21cf9ea7&action=show
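For example, using the link above (a quick sketch; both lines print the same absolute URL):

from urllib.parse import urljoin

base = 'https://www.auftrag.at/'
href = 'ETender.aspx?id=ed60009c-8d64-4759-a722-872e21cf9ea7&action=show'

# Either prepend the root by hand...
print(base + href)
# ...or let the standard library resolve it, which also copes with hrefs
# that start with '/' or are already absolute:
print(urljoin(base, href))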
If you need them, you can add the links to a list, just as you do with the job descriptions. The full code (without saving the links to the CSV) would be:
import re, requests, time, csv, subprocess
from bs4 import BeautifulSoup

def get_jobs(url):
    keywords = ["KI", "AI", "Big Data", "Data", "data", "big data", "Analytics", "analytics", "digitalisierung", "ML",
                "Machine Learning", "Daten", "Datenexperte", "Datensicherheitsexperte"]
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}
    html = requests.get(url, headers=headers, timeout=5)
    time.sleep(2)
    soup = BeautifulSoup(html.text, 'html.parser')
    jobs = soup.find_all('p', text=re.compile(r'\b(?:%s)\b' % '|'.join(keywords)))
    # links = jobs.find_all('a')
    jobs_found = []
    links = []
    for word in jobs:
        jobs_found.append(word)
        links.append(word.find_parent('a').get('href'))
    with open("jobs.csv", 'a', encoding='utf-8', newline='') as toWrite:
        writer = csv.writer(toWrite)
        # One row per job; convert each Tag to its text before writing.
        writer.writerows([job.get_text(strip=True)] for job in jobs_found)
    # subprocess.call('./Autopilot3.py')
    print("Matched Jobs have been collected.")
    return soup, jobs

soup, jobs = get_jobs('https://www.auftrag.at//tenders.aspx')
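If you also want to keep the links in the CSV (one row per match; the two-column layout is my own assumption), replace the with open(...) block with something like:

with open("jobs.csv", 'a', encoding='utf-8', newline='') as toWrite:
    writer = csv.writer(toWrite)
    # Description text in the first column, its (relative) link in the second.
    writer.writerows(zip((j.get_text(strip=True) for j in jobs_found), links))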
If you want to store the full URL instead, just change the line:
links.append(word.find_parent('a').get('href'))
to:
links.append("//".join(["//".join(url.split("//")[:2]),word.find_parent('a').get('href')]))