
Web crawler stops at the first page

慕丝7291255 2022-09-12 20:31:04

I am developing a web crawler that should work like this:

1. Go to a website and collect all links on that site.
2. Download all images (starting from the start page).
3. If there are no images left on the current page, go to the next link found in step 1 and repeat steps 2 and 3 until there are no links/images left.

The code below seems to work to some extent: when I try it on some websites, a few images get downloaded. (I don't understand which images I'm getting, though, because I can't find them on the website; it seems the crawler doesn't start from the site's start page.) After some images (~25-500) the crawler finishes and stops, with no error; it simply stops. I tried this on several websites, and each time it stopped after some images. I think the crawler is somehow ignoring step 3.

    package main

    import (
        "fmt"
        "io"
        "log"
        "net/http"
        "os"
        "strconv"
        "strings"

        "github.com/PuerkitoBio/goquery"
    )

    var (
        currWebsite  string = "https://www.youtube.com"
        imageCount   int    = 0
        crawlWebsite string
    )

    func processElement(index int, element *goquery.Selection) {
        href, exists := element.Attr("href")
        if exists && strings.HasPrefix(href, "http") {
            crawlWebsite = href
            response, err := http.Get(crawlWebsite)
            if err != nil {
                log.Fatalf("error on current website")
            }
            defer response.Body.Close()

            document, err := goquery.NewDocumentFromReader(response.Body)
            if err != nil {
                log.Fatal("Error loading HTTP response body.", err)
            }

            document.Find("img").Each(func(index int, element *goquery.Selection) {
                imgSrc, exists := element.Attr("src")
                if strings.HasPrefix(imgSrc, "http") && exists {
                    fileName := fmt.Sprintf("./images/img" + strconv.Itoa(imageCount) + ".jpg")
                    currWebsite := fmt.Sprint(imgSrc)
                    fmt.Println("[+]", currWebsite)
                    DownloadFile(fileName, currWebsite)
                    imageCount++
                }
            })
        }
    }

    func main() {
        err := os.MkdirAll("./images/", 0777)
        if err != nil {
            log.Fatalln("error on creating directory")
        }

        response, err := http.Get(currWebsite)
        if err != nil {
            log.Fatalln("error on searching website")
        }
        defer response.Body.Close()

        document, err := goquery.NewDocumentFromReader(response.Body)
        if err != nil {
            log.Fatalln("Error loading HTTP response body. ", err)
        }


1 Answer

函数式编程


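"(I don't understand which images I'm getting, though, because I can't find them on the website; it seems the crawler doesn't start from the site's start page.)"

Yes, you are right. Your code does not download images from the start page, because the only thing it takes from the start page is the anchor elements; it then calls processElement() for each anchor element found there: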

    response, err := http.Get(currWebsite)
    if err != nil {
        log.Fatalln("error on searching website")
    }
    defer response.Body.Close()

    document, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        log.Fatalln("Error loading HTTP response body. ", err)
    }

    document.Find("a").Each(processElement) // Here

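To download all images from the start page, define another function, processUrl(), that does the work of finding the img elements and downloading the images. In processElement() you then only need to read the href link and call processUrl() on it: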

    // processElement is called for every <a> element; it only extracts
    // the link and hands it to processUrl.
    func processElement(index int, element *goquery.Selection) {
        href, exists := element.Attr("href")
        if exists && strings.HasPrefix(href, "http") {
            crawlWebsite = href
            processUrl(crawlWebsite)
        }
    }

    // processUrl fetches one page and downloads every image on it
    // whose src is an absolute http(s) URL.
    func processUrl(crawlWebsite string) {
        response, err := http.Get(crawlWebsite)
        if err != nil {
            log.Fatalf("error on current website")
        }
        defer response.Body.Close()

        document, err := goquery.NewDocumentFromReader(response.Body)
        if err != nil {
            log.Fatal("Error loading HTTP response body.", err)
        }

        document.Find("img").Each(func(index int, element *goquery.Selection) {
            imgSrc, exists := element.Attr("src")
            if strings.HasPrefix(imgSrc, "http") && exists {
                fileName := "./images/img" + strconv.Itoa(imageCount) + ".jpg"
                currWebsite := fmt.Sprint(imgSrc)
                fmt.Println("[+]", currWebsite)
                DownloadFile(fileName, currWebsite)
                imageCount++
            }
        })
    }
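DownloadFile is called above but its definition is not part of the code shown here. As a rough sketch only, it could look like the following; the signature (file path first, URL second, returning an error) is an assumption inferred from the call site, not the asker's actual implementation.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// DownloadFile fetches url and writes the response body to filepath.
// The signature is an assumption based on how the function is called
// in the thread; the original definition is not shown.
func DownloadFile(filepath string, url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// Do not save error pages (404, 500, ...) as if they were images.
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s for %s", resp.Status, url)
	}

	out, err := os.Create(filepath)
	if err != nil {
		return err
	}
	defer out.Close()

	// Stream the body to disk instead of loading it into memory.
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// Usage: go run main.go <output file> <image URL>
	if len(os.Args) != 3 {
		fmt.Println("usage: download <output file> <image URL>")
		return
	}
	if err := DownloadFile(os.Args[1], os.Args[2]); err != nil {
		fmt.Println("download failed:", err)
	}
}
```

Returning the error rather than calling log.Fatal lets the caller decide whether one failed image should end the whole crawl.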

Now just crawl the images from the start page before processing all the links:

    func main() {
        ...
        document, err := goquery.NewDocumentFromReader(response.Body)
        if err != nil {
            log.Fatalln("Error loading HTTP response body. ", err)
        }

        // First crawl images from start page url
        processUrl(currWebsite)
        document.Find("a").Each(processElement)
    }
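Note that this version still follows only the links found on the start page, one level deep, and skips every href that does not begin with "http". If the goal is step 3 of the question (keep following links until none are left), one possible approach is a queue of pages to visit plus a set of pages already seen. The sketch below is an illustration under assumptions, not part of the original answer: crawl, resolveLink, the links callback and the maxPages cap are all hypothetical names. In a real crawler the callback would do what processUrl() does and return the href values collected with document.Find("a"); here an in-memory map stands in for the website so the example runs without network access.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink turns an href found on page base into an absolute URL.
// It reports false for links that are not http(s), such as mailto:.
func resolveLink(base *url.URL, href string) (string, bool) {
	ref, err := url.Parse(href)
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	abs.Fragment = "" // "#section" links point at the same page
	return abs.String(), true
}

// crawl visits pages breadth-first starting at start, at most maxPages
// of them. links is a hypothetical callback: it processes one page
// (for example by downloading its images) and returns the raw href
// values found on it. The result lists pages in the order visited.
func crawl(start string, maxPages int, links func(pageURL string) ([]string, error)) []string {
	visited := map[string]bool{start: true}
	queue := []string{start}
	var order []string

	for len(queue) > 0 && len(order) < maxPages {
		page := queue[0]
		queue = queue[1:]
		order = append(order, page)

		hrefs, err := links(page)
		if err != nil {
			continue // skip a broken page instead of aborting the crawl
		}
		base, err := url.Parse(page)
		if err != nil {
			continue
		}
		for _, href := range hrefs {
			next, ok := resolveLink(base, href)
			if ok && !visited[next] {
				visited[next] = true
				queue = append(queue, next)
			}
		}
	}
	return order
}

func main() {
	// In-memory stand-in for a website (made-up URLs).
	site := map[string][]string{
		"https://example.com/":  {"/a", "b", "mailto:someone@example.com"},
		"https://example.com/a": {"/", "/b#top"},
		"https://example.com/b": {"/a"},
	}
	links := func(pageURL string) ([]string, error) {
		return site[pageURL], nil
	}
	for _, page := range crawl("https://example.com/", 10, links) {
		fmt.Println(page)
	}
}
```

Run as written, this prints the three example pages once each, start page first. The visited set is what keeps pages that link to each other from being fetched repeatedly, and the maxPages cap keeps a crawl of a large site from running indefinitely.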

Answered 2022-09-12
