[404 detection] Handling and detecting 404 "page not found" pages returned by cURL

Thu, 02 Apr 2009 10:26:00 +0000

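Background: an outsourced developer wrote an image scraper. When a request returned a 404, the scraper still saved nginx's error output into the .jpg file, so reading the file back later gave an image that could not be displayed, because its content was actually the 404 page.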

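When fetching a page with curl, success is usually judged by what curl_exec returns. However, I found that some sites respond with a 404 error but still send a page body, and curl fetches that "page not found" content as well. Judging whether the fetch succeeded from the result of curl_exec alone is therefore misleading, as in the following code: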

$url = 'http://www.cq.xinhuanet.com/house/2008-11/24/content_14996426.htm-';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_ENCODING, "gzip, deflate");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; CIBA; InfoPath.1; .NET CLR 2.0.50727)");
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow Location redirects automatically
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // timeout in seconds
curl_setopt($ch, CURLOPT_HEADER, 1);
//curl_setopt($ch, CURLOPT_NOBODY, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

$contents = curl_exec($ch);
curl_close($ch);

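// Misleading check: a 404 page with a body is non-empty, so it is treated as a success.
if (false !== $contents && !empty($contents)) {
    echo $contents;
} else {
    echo "Failed to fetch the page!";
}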

I checked the manual and found that curl also has a curl_getinfo function. The HTTP status code is what should be checked:

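$contents = curl_exec($ch);
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch); // close the handle before any early exit
// Fail if the transfer itself failed, or on 4xx (client error) / 5xx (server error).
if (false === $contents || $http_code >= 400) {
    echo "Request failed!";
    exit;
} else {
    echo $contents;
}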
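
For the image-scraping case from the background, the same status check can be wrapped in a small helper so that a 404 page is never written into a .jpg file. This is only a minimal sketch: the names is_http_success() and fetch_image() and the example URL and path are made up for illustration, and it assumes that only 2xx responses should be saved.

```php
<?php
// Sketch only: is_http_success() and fetch_image() are hypothetical helper names.

/** Returns true only for 2xx status codes. */
function is_http_success(int $httpCode): bool
{
    return $httpCode >= 200 && $httpCode < 300;
}

/**
 * Downloads $url and returns the response body, or null when the
 * transfer fails or the final HTTP status is not 2xx.
 */
function fetch_image(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // seconds
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    // CURLOPT_HEADER is left off so response headers are not mixed into the saved file.

    $body = curl_exec($ch);
    $httpCode = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // A status of 0 (no response at all) also fails the 2xx check.
    if ($body === false || !is_http_success($httpCode)) {
        return null;
    }
    return $body;
}

// Hypothetical usage (placeholder URL and path):
// $data = fetch_image('http://example.com/photo.jpg');
// if ($data !== null) {
//     file_put_contents('/tmp/photo.jpg', $data);
// }
```

Returning null instead of the error page means the caller has to decide explicitly what to do on failure, rather than saving whatever came back.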

Update: found another one online:

Added: 2014-01-15

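向东博客 (Xiangdong's Blog): focused on web applications and the beauty of architecture --- the beauty of architecture lies in perfecting every detail | the beauty of an application lies in curing the problem at hand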