How to parse information with Python from a web page that uses PHP and JavaScript

吃鸡游戏 2021-11-30 18:33:07
I am trying to get all the events, and the other metadata of those events, from this web page: https://alando-palais.de/events. My problem is that the result (HTML) does not contain the information I'm looking for. I suspect it is "hidden" behind some PHP script, at this URL: https://alando-palais.de/wp/wp-admin/admin-ajax.php. Any idea how to wait until the page is fully loaded, or what kind of approach I have to use to get the event information? This is my script so far :-):

from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.parse import urlparse, urljoin
import re
import requests

if __name__ == '__main__':
    target_url = 'https://alando-palais.de/events'
    #target_url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'
    soup = BeautifulSoup(requests.get(target_url).text, 'html.parser')
    print(soup)
    links = soup.find_all('a', href=True)
    for x, link in enumerate(links):
        print(x, link['href'])
#    for image in images:
#        print(urljoin(target_url, image))

The expected output would be something like:

date: 08.03.2019
title: Penthouse Club Special: Maiwai & Friends
img: https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg

This is part of the result that holds this information:

<div class="vc_gitem-zone vc_gitem-zone-b vc_custom_1547045488900 originalbild vc-gitem-zone-height-mode-auto vc_gitem-is-link" style="background-image: url(https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg) !important;">
    <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends" title="Penthouse Club Special: Maiwai &#038; Friends" class="vc_gitem-link vc-zone-link"></a>
    <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg" class="vc_gitem-zone-img" alt="">
    <div class="vc_gitem-zone-mini">
        <div class="vc_gitem_row vc_row vc_gitem-row-position-top">
            <div class="vc_col-sm-6 vc_gitem-col vc_gitem-col-align-left">
                <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019 </div>
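For reference, once a response actually contains that fragment, the fields in the expected output can be read out with BeautifulSoup. A minimal sketch, run against an abbreviated copy of the snippet above rather than the live site:

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the HTML fragment shown above
html = '''
<div class="vc_gitem-zone vc_gitem-zone-b vc_gitem-is-link"
     style="background-image: url(https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg) !important;">
  <a href="https://alando-palais.de/event/penthouse-club-special-maiwai-friends"
     title="Penthouse Club Special: Maiwai &#038; Friends"
     class="vc_gitem-link vc-zone-link"></a>
  <img src="https://alando-palais.de/wp/wp-content/uploads/2019/02/0803_MaiwaiFriends-500x281.jpg"
       class="vc_gitem-zone-img" alt="">
  <div class="vc_gitem-post-meta-field-Datum eventdatum vc_gitem-align-left"> 08.03.2019 </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
date = soup.select_one('.eventdatum').get_text(strip=True)   # "08.03.2019"
link = soup.select_one('a.vc_gitem-link')                    # carries title and href
img = soup.select_one('img.vc_gitem-zone-img')['src']

print('date:', date)
print('title:', link['title'])   # bs4 decodes the &#038; entity to "&"
print('img:', img)
```

The problem is therefore not the parsing but getting a response that contains this markup at all, which is what the answers below address.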

2 Answers

慕容3067478


You can mimic the XHR POST that the page issues:


from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://alando-palais.de/wp/wp-admin/admin-ajax.php'

data = {
    'action': 'vc_get_vc_grid_data',
    'vc_action': 'vc_get_vc_grid_data',
    'tag': 'vc_basic_grid',
    'data[visible_pages]': 5,
    'data[page_id]': 30,
    'data[style]': 'all',
    'data[action]': 'vc_get_vc_grid_data',
    'data[shortcode_id]': '1551112413477-5fbaaae1-0622-2',
    'data[tag]': 'vc_basic_grid',
    'vc_post_id': '30',
    '_vcnonce': 'cc8cc954a4'
}

res = requests.post(url, data=data)
soup = BeautifulSoup(res.content, 'lxml')

dates = [item.text.strip() for item in soup.select('.vc_gitem-zone[style*="https://alando-palais.de"]')]
textInfo = [item for item in soup.select('.vc_gitem-link')][::2]
imageLinks = [item['src'].strip() for item in soup.select('img')]

titles = []
links = []
for item in textInfo:
    titles.append(item['title'])
    links.append(item['href'])

results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)), columns=['title', 'date', 'link', 'imageLink'])
print(results)
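Note that the `_vcnonce` value is short-lived, so the hard-coded 'cc8cc954a4' will eventually stop working. One option (an assumption about the markup, not verified against the live site) is to fetch the events page first and pull the current nonce out of the page source with a regex:

```python
import re

# Hypothetical fragment of the events page source; the exact attribute name
# that carries the nonce may differ on the live site. The value here is the
# one captured in the answer above.
page_source = '<div class="vc_grid" data-vc-public-nonce="cc8cc954a4"></div>'

# Look for a quoted hex token following the word "nonce"
match = re.search(r'nonce["\']?\s*[:=]?\s*["\']([0-9a-f]{8,12})["\']', page_source)
nonce = match.group(1) if match else None
print(nonce)
```

With `requests`, you would do the GET and the POST in one `requests.Session()` so cookies carry over, and put the extracted value into `data['_vcnonce']`.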

Or with Selenium:


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

url = 'https://alando-palais.de/events#'
driver = webdriver.Chrome()
driver.get(url)

dates = [item.text.strip() for item in WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".vc_gitem-zone[style*='https://alando-palais.de']"))
) if len(item.text)]
textInfo = [item for item in driver.find_elements_by_css_selector('.vc_gitem-link')][::2]
textInfo = textInfo[: int(len(textInfo) / 2)]
imageLinks = [item.get_attribute('src').strip() for item in driver.find_elements_by_css_selector('a + img')][::2]

titles = []
links = []
for item in textInfo:
    titles.append(item.get_attribute('title'))
    links.append(item.get_attribute('href'))

results = pd.DataFrame(list(zip(titles, dates, links, imageLinks)), columns=['title', 'date', 'link', 'imageLink'])
print(results)

driver.quit()
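In either version, the `date` column of the resulting DataFrame holds strings. If you want real datetimes, the DD.MM.YYYY values (format assumed from the expected output in the question) can be converted with pandas:

```python
import pandas as pd

# Hypothetical scraped values in the DD.MM.YYYY format shown in the question
results = pd.DataFrame({'date': ['08.03.2019', '15.03.2019']})

# Parse day-first German-style dates into proper Timestamps
results['date'] = pd.to_datetime(results['date'], format='%d.%m.%Y')
print(results['date'].tolist())
```

This makes sorting and filtering by date straightforward, e.g. `results[results['date'] >= '2019-03-10']`.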


智慧大石


My best suggestion is to use Selenium to get around all the server-side restrictions.


Edited:


from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://alando-palais.de/events")
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

