PHP扩展curl和正则表达式轻松采集文章

2月 20

›PHP相关技术

分享到：微信新浪微博 QQ好友 QQ空间腾讯微博复制网址

　　采集已经不是什么新名词了，很多站长为了省事，也局限于人力的缺乏，使用程序来给自己的网站添砖加瓦，比如本人的个人网站www.xxx.com也采集了大量的新闻，那么如果实现呢?今天我们运用PHP来实现这个功能。

　　谈到采集，我们不得不说两个东西，第一个是如何获取远程网站的源代码，这个可以通过php的一个扩展curl来获取，另一个是如果去匹配你需要的信息，这个的解决办法是正则表达式。

　　Windows下开启curl的方法如下：

　　1、拷贝PHP目录中的libeay32.dll， ssleay32.dll， php5ts.dll， php_curl.dll文件到 system32 目录。

　　2、修改php.ini：配置好 extension_dir ，去掉 extension = php_curl.dll 前面的分号。

　　3、重起apache。

　　Linux下开启curl的方法如下：

　　进入安装原php 的源码目录，

　cd ext
　　cd curl
　　phpize
　　./configure –with-curl =DIR
　　make

　　就会在PHPDIR/ext/curl /moudles/下生成curl .so的文件。

　　复制curl .so文件到extensions的配置目录，修改php .ini就好了。

　　然后你就可以利用curl来获取到指定url的网页源码了，这里给大家一个封装好的函数：

　　以下为引用的内容：

　function getWebcontent($url){
　　$ch = curl_init();
　　$timeout = 10;
　　curl_setopt($ch, CURLOPT_URL, $url);
　　curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
　　curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
　　curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
　　$contents = trim(curl_exec($ch));
　　curl_close($ch);
　　return $contents;
　　}

　　接下来就应该说到php中的正则表达式了：

　　1.中括号

　　[0-9]匹配0-9
　　[a-z]匹配a-z小写字母
　　[A-Z]匹配A-Z大写字母
　　[a-zA-Z]匹配所有大小写字母

　　可以使用ascii来制定更多

　　2.量词

　　以下为引用的内容：

　 p+匹配至少一个含p的字符串
　　p*陪陪任何包含0个或多个p的字符串
　　p?匹配任何包含0个或一个p的字符串
　　p{2}匹配包含2个p的序列的字符串
　　p{2,3}匹配任何包含2个或3个的字符串
　　p$匹配任何以p结尾的字符串
　　^p匹配任何以p开头的字符串
　　[^a-zA-Z]匹配任何不包含a-zA-Z的字符串
　　p.p匹配任何包含p、接下来是任何字符、再接下来有又是p的字符串
　　^.{2}$匹配任何值包含2个字符的字符串
　　(.*)b>匹配任何被>包围的字符串
　　p(hp)*匹配任何一个包含p,后面是多个或0个hp的字符串

　　3.预定义字符范围

　　以下为引用的内容：

　 [:alpha:]同[a-zA-Z]
　　[:alnum:]同[a-zA-Z0-9]
　　[:cntrl:]匹配控制字符，比如制表符，反斜杠，退格符
　　[:digit:]同[0-9]
　　[:graph:]所有ASCII33~166范围内可以打印的字符
　　[:lower:]同[a-z]
　　[:punct:]标点符号
　　[:upper:]同[A-Z]
　　[:space:]空白字符，可以是空格、水平制表符、换行、换页、回车
　　[:xdigit:]十六进制符同[a-fA-F0-9]

　　废话不多说，直接上我的源码吧，有什么不懂的可以上百度查查。

　　以下为引用的内容：

      <?php
　   header(“Content-type: text/html; charset=utf-8”);
　　getinfo(“http://rss.sina.com.cn/rollnews/news/gn_total.js”,1);
　　getinfo(“http://rss.sina.com.cn/rollnews/news/gj_total.js”,2);
　　getinfo(“http://rss.sina.com.cn/rollnews/news/sh_total.js”,3);
　　getinfo(“http://rss.sina.com.cn/rollnews/sports/sports_total.js”,4);
　　getinfo(“http://rss.sina.com.cn/rollnews/tech/tech1_total.js”,5);
　　getinfo(“http://rss.sina.com.cn/rollnews/finance/finance1_news_total.js”,6);
　　getinfo(“http://rss.sina.com.cn/rollnews/ent/ent_total.js”,7);
　　getinfo(“http://rss.sina.com.cn/rollnews/jczs/jczs_total.js”,8);
　　function getinfo($infourl,$catid)
　　{
　　$pagecontent=getwebcontent($infourl);
　　preg_match_all(“/title:”(.*?)”/”, $pagecontent, $match);
　　$titlearr=$match[1];
　　preg_match_all(“/link:”(.*?)”/”, $pagecontent, $match);
　　$urlarr=$match[1];
　　for ($i=1;$i
　　echo “go {$titlearr[$i-1]}n”;
　　$title=iconv(“gbk”,”utf-8″,$titlearr[$i-1]);
　　$content=iconv(“gbk”,”utf-8″,getnewscontent($urlarr[$i]));
　　$content=MySQL_escape_string($content);
　　if(!insertdb($title,$content,$catid)) break;
　　}
　　}
　　function insertdb($title,$content,$catid){
　　将数据写入你的库
　　}
　　function getnewscontent($newsurl){
    $newscontent=getwebcontent($newsurl);
    preg_match_all(“/<div class=”blkContainerSblkCon” id=”artibody”>([sS]*?)<!– publish_helper_end –>/”,$newscontent,$match);
    $content=preg_replace(“/<a.*?</a>/si”,””,$match[1][0]);
    $content=preg_replace(“/<div style=”overflow:hidden;zoom:1;” class=”otherContent_01″>.*?</div>/si”,””,$content);
    $content=preg_replace(“/<div class=”blk-video”>.*?<div class=”clearcl”></div>/si”,””,$content);
    $content=str_replace(“<div style=”clear:both;height:0;visibility:hiddden;overflow:hidden;”></div>”,””,$content);
    return $content;
}
function getwebcontent($url){
    $ch = curl_init();
    $timeout = 10;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
    $contents = trim(curl_exec($ch));
    curl_close($ch);
    return $contents;
}
?>

然后如何实现比较实时的同步呢，这可以利用windows下的任务计划或linux下的crontab 了，定时（比如十分钟）执行这个程序，这样，你就不再愁网站没有内容了。

curl, PHP, 扩展, 文章, 正则表达式, 采集

一	二	三	四	五	六	日
« 12月
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

博客水木

PHP扩展curl和正则表达式轻松采集文章

评论

回复

腾讯新闻

3D滚动云标签

全球网站点击统计

百度站内搜索

文章分类

友情链接

心路历程

扫描关注我微信