XPath does not retrieve some content

I'm a beginner trying to write a crawler that collects some statistics from a forum. Here is my code:

<?php
$ch = curl_init();
$timeout = 0; // set to zero for no timeout
curl_setopt($ch, CURLOPT_URL, 'http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);

$posts = $xpath->query("//div[@class='who-post']/a");
//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$dates = $xpath->query("//div[@class='date-post']");
//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$contents = $xpath->query("//div[@class='message text-enrichi-fmobile text-crop-fmobile']/p");
//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");

$i = 0;
foreach ($posts as $post) {
    $nodes = $post->childNodes;
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        $tab[$i]['author'] = $value;
        $i++;
    }
}

$i = 0;
foreach ($dates as $date) {
    $nodes = $date->childNodes;
    foreach ($nodes as $node) {
        $value = trim($node->nodeValue);
        $tab[$i]['date'] = $value;
        $i++;
    }
}

$i = 0;
foreach ($contents as $content) {
    $nodes = $content->childNodes;
    foreach ($nodes as $node) {
        $value = $node->nodeValue;
        echo $value;
        $tab[$i]['content'] = trim($value);
        $i++;
    }
}
?>
<h1>Participants</h1>
<pre>
<?php print_r($tab); ?>
</pre>

As you can see, the code does not retrieve some of the content. For example, I am trying to retrieve the content of this page: http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm

The second post there is an image, and my code does not work for it. I also think I have made some mistakes, and I find my code ugly. Can you help me?


1 Answer

收到一只叮咚


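You can simply select the posts first, then fetch each piece of data inside a post separately, using:

DOMXPath::evaluate combined with normalize-space to retrieve plain text,
DOMXPath::query combined with DOMDocument::saveHTML to retrieve the message paragraphs.

Code: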

$xpath = new DOMXPath($dom);

// Select every post container first.
$postsElements = $xpath->query('//*[@class="post"]');
$posts = [];

foreach ($postsElements as $postElement) {
    // Expressions starting with "." are evaluated relative to the current post.
    $author = $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $postElement);
    $date = $xpath->evaluate('normalize-space(.//*[@class="date-post"])', $postElement);

    // Keep the paragraph markup so that non-text content such as images survives.
    $message = '';
    foreach ($xpath->query('.//*[contains(@class, "message")]/p', $postElement) as $messageParagraphElement) {
        $message .= $dom->saveHTML($messageParagraphElement);
    }

    $posts[] = (object)compact('author', 'date', 'message');
}

print_r($posts);
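To see why saving the markup matters for the image post, here is a minimal self-contained sketch. The sample HTML is made up for illustration and only mimics the class names used above; the real page may be structured differently. It compares `nodeValue` with `DOMDocument::saveHTML` on a text paragraph and on a paragraph that holds only an image.

```php
<?php
// Hypothetical sample markup (an assumption, not copied from the forum).
$html = <<<'HTML'
<div class="post">
  <div class="who-post"><a href="#"> alice </a></div>
  <div class="message text-enrichi-fmobile"><p>Hello <b>world</b></p></div>
</div>
<div class="post">
  <div class="who-post"><a href="#">bob</a></div>
  <div class="message text-enrichi-fmobile"><p><img src="pic.png" alt="pic"></p></div>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query('//*[@class="post"]') as $post) {
    // First message paragraph of the current post.
    $paragraph = $xpath->query('.//*[contains(@class, "message")]/p', $post)->item(0);
    $rows[] = [
        'author' => $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $post),
        'text'   => trim($paragraph->nodeValue),  // text content only
        'html'   => $dom->saveHTML($paragraph),   // serialized markup of the paragraph
    ];
}

print_r($rows);
```

For the second post, `text` comes out empty because an `img` element has no text content, while `html` still contains the `<img>` tag. That is presumably why the original loop over `nodeValue` appeared to skip the image post.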

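Unrelated note: scraping a website's HTML is not illegal in itself, but you should avoid displaying their data in your own application or website without their consent. Also, this code may break at any time if they decide to change their HTML structure or CSS class names.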
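One way to notice such a breakage early is to fail loudly when an expression stops matching, rather than silently collecting empty results. The sketch below assumes a hypothetical helper named `queryOrFail`; it is not part of the answer above, and the sample markup is invented.

```php
<?php
// Hypothetical helper: throws when an XPath expression is invalid or matches nothing.
function queryOrFail(DOMXPath $xpath, string $expression, ?DOMNode $context = null): DOMNodeList
{
    $nodes = $xpath->query($expression, $context);
    if ($nodes === false || $nodes->length === 0) {
        throw new RuntimeException("No match for XPath: $expression");
    }
    return $nodes;
}

// Invented sample markup standing in for the downloaded page.
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML('<div class="post"><div class="who-post">alice</div></div>');
$xpath = new DOMXPath($dom);

// The class name still exists, so this succeeds.
$found = queryOrFail($xpath, '//*[@class="post"]')->length;

// Simulates the site renaming its class: the helper throws instead of returning nothing.
try {
    queryOrFail($xpath, '//*[@class="renamed-post"]');
    $failed = false;
} catch (RuntimeException $e) {
    $failed = true;
}
```

Whether an empty result should count as an error depends on the page: a thread with no replies could legitimately match nothing, so a check like this fits best on selectors that are expected to match on every page.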

2022-07-29
