
Puppeteer: get an array of hrefs, then iterate over each href and the hrefs on that page

JavaScript

智慧大石 2022-10-27 14:11:11

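I am trying to scrape data with puppeteer in node.js. At the moment I want to write a script that scrapes all the data in one section of well.ca.

This is the method/logic I am trying to implement in node.js:

1 - Go to the Medicine & Health section of the site.
2 - Use a DOM selector to get an array of hrefs from .panel-body-content, via panel-body-content a[href], to scrape the subsections.
3 - Iterate over each link (subsection) with a for loop.
4 - For each subsection link, get another array of hrefs, one per product, by grabbing the href inside each element with the class value col-lg-5ths col-md-3 col-sm-4 col-xs-6, via .col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href].
5 - Loop over each product within the subsection.
6 - Scrape the data for each product.

So far I have written most of the above:

const puppeteer = require('puppeteer');

const chromeOptions = {
  headless: false,
  defaultViewport: null,
};

(async function main() {
  const browser = await puppeteer.launch(chromeOptions);
  try {
    const page = await browser.newPage();
    await page.goto("https://well.ca/categories/medicine-health_2.html");
    console.log("::::::: OPEN WELL ::::::::::");

    // href attribute
    const hrefs1 = await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.panel-body-content a[href]'),
        a => a.getAttribute('href')
      )
    );
    console.log(hrefs1);

    const urls = hrefs1
    for (let i = 0; i < urls.length; i++) {
      const url = urls[i];
      await page.goto(url);
    }

    const hrefs2 = await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href]'),
        a => a.getAttribute('href')
      )
    );

When I try to get an array of the hrefs for each product, the array comes back empty.

How can I add a nested for loop that gets an array of all the product hrefs in each subsection and then visits each product link?

What is the correct DOM selector for getting all the hrefs inside the class .col-lg-5ths col-md-3 col-sm-4 col-xs-6 with the id product_grid_link?

If I want to add a subsequent loop that uses the product hrefs in each subsection to get information from each product, how would I embed that in the code?

Any help would be much appreciated.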


1 Answer

心有法竹


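Some of the links appear to be duplicated, so it is better to collect all the links to the final pages, deduplicate the list, and only then scrape the final pages. (You can also save the final-page links to a file for later use.) The script below collects 5395 links (after deduplication).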
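Before the full script, here is a minimal sketch of just the deduplicate-and-save step described above, with no browser involved. The helper names `dedupeLinks`, `saveLinks`, and `loadLinks`, the file name `product-links.json`, and the example.com URLs are all assumptions made for illustration; they are not part of the original answer.

```javascript
'use strict';

const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: drop repeated URLs while keeping first-seen order.
function dedupeLinks(links) {
  return [...new Set(links)];
}

// Hypothetical helper: write the links to disk as a JSON array.
function saveLinks(file, links) {
  fs.writeFileSync(file, JSON.stringify(links, null, 2), 'utf8');
}

// Hypothetical helper: read the links back for a later scraping run.
function loadLinks(file) {
  return JSON.parse(fs.readFileSync(file, 'utf8'));
}

// Made-up URLs standing in for the links collected from the category pages.
const collected = [
  'https://example.com/products/a.html',
  'https://example.com/products/b.html',
  'https://example.com/products/a.html',
];

const unique = dedupeLinks(collected);
const file = path.join(os.tmpdir(), 'product-links.json'); // assumed location
saveLinks(file, unique);

console.log(loadLinks(file).length);
```

Keeping the saved list as a plain JSON array means a later run could load it and go straight to the product pages without crawling the category pages again.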

'use strict';

const puppeteer = require('puppeteer');

(async function main() {
  try {
    const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
    // Reuse the tab that the browser opens on launch.
    const [page] = await browser.pages();

    await page.goto('https://well.ca/categories/medicine-health_2.html');

    // Collect the subsection links; a.href is the resolved absolute URL.
    const hrefsCategoriesDeduped = new Set(await page.evaluate(
      () => Array.from(
        document.querySelectorAll('.panel-body-content a[href]'),
        a => a.href
      )
    ));

    const hrefsPages = [];

    for (const url of hrefsCategoriesDeduped) {
      await page.goto(url);
      // The class names are chained without spaces because they all sit on the same element.
      hrefsPages.push(...await page.evaluate(
        () => Array.from(
          document.querySelectorAll('.col-lg-5ths.col-md-3.col-sm-4.col-xs-6 a[href]'),
          a => a.href
        )
      ));
    }

    const hrefsPagesDeduped = new Set(hrefsPages);

    // hrefsPagesDeduped can be converted back to an array
    // and saved in a JSON file now if needed.

    for (const url of hrefsPagesDeduped) {
      await page.goto(url);
      // Scrape the page.
    }

    await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
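The key difference from the code in the question is the product selector. In CSS a space is the descendant combinator, so `.col-lg-5ths col-md-3 col-sm-4 col-xs-6 a[href]` looks for elements named `col-md-3`, `col-sm-4`, and `col-xs-6` nested inside one another, which do not exist, and the result is an empty array. When several classes sit on the same element, they are chained with dots and no spaces. As a small sketch, a hypothetical helper (`classesToSelector` is a name invented here, and it assumes simple class names that need no CSS escaping) can build that compound selector from the class attribute value:

```javascript
'use strict';

// Hypothetical helper: turn a class attribute value such as "a b c"
// into the compound selector ".a.b.c" that matches one element
// carrying all of those classes.
// Assumption: the class names contain no characters that need CSS escaping.
function classesToSelector(classAttr) {
  return classAttr
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map(name => '.' + name)
    .join('');
}

const productCell = classesToSelector('col-lg-5ths col-md-3 col-sm-4 col-xs-6');

// Links inside each product cell.
console.log(productCell + ' a[href]');
```

The resulting string is the same selector the script above passes to `document.querySelectorAll`.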

Answered 2022-10-27
