﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-beautykingdom-随笔分类-SearchEngine</title><link>http://www.cppblog.com/beautykingdom/category/13101.html</link><description /><language>zh-cn</language><lastBuildDate>Thu, 18 Feb 2010 13:55:42 GMT</lastBuildDate><pubDate>Thu, 18 Feb 2010 13:55:42 GMT</pubDate><ttl>60</ttl><item><title>How to Write a Web Spider</title><link>http://www.cppblog.com/beautykingdom/archive/2010/02/18/108046.html</link><dc:creator>chatler</dc:creator><author>chatler</author><pubDate>Thu, 18 Feb 2010 13:54:00 GMT</pubDate><guid>http://www.cppblog.com/beautykingdom/archive/2010/02/18/108046.html</guid><wfw:comment>http://www.cppblog.com/beautykingdom/comments/108046.html</wfw:comment><comments>http://www.cppblog.com/beautykingdom/archive/2010/02/18/108046.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/beautykingdom/comments/commentRss/108046.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/beautykingdom/services/trackbacks/108046.html</trackback:ping><description><![CDATA[<p><a onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Web_spider?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Web_spider" target=_blank><u><font color=#0000ff>Here</font></u></a> is the Wikipedia entry on web crawlers. A web crawler, also called a web spider or web robot, is a program that automatically fetches pages from the Internet. A common use is checking whether all the links on your site are still valid. A more advanced use is to store the relevant data from the fetched pages, which can become the basis of a search engine.</p>
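<p>Technically, fetching web pages is not very hard. The hard part is analyzing and organizing them, which takes a program with some lightweight intelligence and a great deal of mathematical computation. Here is a simple workflow:</p>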
<p><span id=more-27></span></p>
<p>Here we only discuss how to write a program that fetches web pages.</p>
<p>First, let's see how to open a web page from the command line.</p>
<p style="TEXT-ALIGN: left; PADDING-LEFT: 30px">telnet somesite.com 80<br>GET /index.html HTTP/1.0<br>(press Enter twice)</p>
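<p>The telnet session above can be reproduced with a plain TCP socket. The following is a minimal sketch rather than a robust HTTP client. The helper names <code>build_get_request</code>, <code>split_response</code>, and <code>fetch_raw</code> are made up for this illustration, and the <code>Host</code> header is an addition that the telnet example does not send.</p>

```ruby
require "socket"

# Build the request typed into telnet above. HTTP lines end in CRLF, and the
# empty line (the second Enter) marks the end of the headers.
# Assumption: a Host header is added; HTTP/1.0 does not require it, but
# servers that host several sites often need it.
def build_get_request(host, path = "/index.html")
  "GET #{path} HTTP/1.0\r\nHost: #{host}\r\n\r\n"
end

# Split a raw HTTP response into status line, header lines, and body.
def split_response(raw)
  head, body = raw.to_s.split("\r\n\r\n", 2)
  status_line, *header_lines = head.to_s.split("\r\n")
  [status_line, header_lines, body.to_s]
end

# Open a TCP connection, send the request, and read until the server closes
# the connection, which is how an HTTP/1.0 response normally ends.
def fetch_raw(host, path = "/index.html", port = 80)
  socket = TCPSocket.new(host, port)
  socket.write(build_get_request(host, path))
  socket.read
ensure
  socket.close if socket
end

# Usage (needs network access, so it is left commented out):
#   status, headers, body = split_response(fetch_raw("somesite.com"))
#   puts status
```

<p>Keeping the request builder separate from the network call makes it possible to check the exact bytes being sent without contacting a server.</p>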
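<p style="TEXT-ALIGN: left">Telnet is used here to show that this is really just socket programming over the HTTP protocol, using a method such as GET to retrieve the page. After that you need to parse the HTML, and possibly JavaScript as well: more and more pages use Ajax, and much of their content is loaded through it, so simply parsing the HTML file will be far from enough in the future. Here we only show a very simple crawler, simple enough to serve only as an example. The pseudocode for this example is:</p>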
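<pre>fetch the page
for each link in all links on the current page
{
    if (this link is one we want &amp;&amp; this link has not been visited)
    {
        process this link
        mark this link as visited
    }
}</pre>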
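<p>The loop above can be tried without any network access by running it against an in-memory link graph. This is only a sketch: <code>crawl</code>, <code>fetch_links</code>, <code>wanted</code>, and the <code>SITE</code> table are hypothetical names, and it uses an explicit queue (breadth-first) instead of the nested processing the pseudocode suggests.</p>

```ruby
require "set"

# Crawl a link graph breadth-first, visiting each wanted link exactly once.
# fetch_links: callable returning the links found on a page (a stand-in for
#              "fetch the page and parse its links").
# wanted:      callable deciding whether a link is worth following.
def crawl(start, fetch_links, wanted)
  visited = Set.new([start])
  queue = [start]
  order = []
  until queue.empty?
    page = queue.shift
    order.push(page)
    fetch_links.call(page).each do |link|
      # Follow a link only if we want it AND have not seen it before.
      next unless wanted.call(link)
      next unless visited.add?(link) # add? returns nil when already present
      queue.push(link)
    end
  end
  order
end

# A tiny fake site: each page maps to the links it contains.
SITE = {
  "/" => ["/a", "/b", "/logo.jpg"],
  "/a" => ["/", "/b"],
  "/b" => ["/a", "/c"],
  "/c" => []
}

pages = crawl("/",
              lambda { |page| SITE.fetch(page, []) },
              lambda { |link| !link.end_with?(".jpg") })
p pages # prints ["/", "/a", "/b", "/c"]
```

<p>The pages link back to each other, yet the crawl terminates because a link is queued only the first time it is seen.</p>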
<pre class=ruby>require "rubygems"
require "mechanize"

# Current Mechanize releases expose the class as Mechanize;
# older ones used WWW::Mechanize.
class Crawler &lt; Mechanize
  attr_accessor :callback

  # Actions the callback can return for a link.
  INDEX = 0     # fetch the page and follow its links
  DOWNLOAD = 1  # save the target to a local file
  PASS = 2      # skip the link

  def initialize
    super
    init
    @first = true
    self.user_agent_alias = "Windows IE 6"
  end

  def init
    @visited = []
  end

  # Mark a link as visited without fetching it.
  def remember(link)
    @visited &lt;&lt; link
  end

  # Fetch a page and crawl every link on it that has not been visited yet.
  def perform_index(link)
    page = get(link)
    if page.is_a?(Mechanize::Page)
      links = page.links.map { |l| l.href } - @visited
      links.each { |alink| start(alink) }
    end
  end

  def start(link)
    return if link.nil?
    return if @visited.include?(link)
    action = @callback.call(link)
    # Always index the starting page, and do it only once.
    if @first
      @first = false
      action = INDEX
    end
    case action
    when INDEX
      perform_index(link)
    when DOWNLOAD
      page = get(link)
      page.save_as(File.basename(link)) if page
    when PASS
      puts "passing on #{link}"
    end
  end

  # Fetch a URL and record it as visited; returns nil if the fetch fails.
  def get(site)
    puts "getting #{site}"
    @visited &lt;&lt; site
    super(site)
  rescue StandardError
    puts "error getting #{site}"
    nil
  end
end</pre>
<p>The code above needs little explanation; try it for yourself. Here is how to use it:</p>
<pre class=ruby>require_relative "crawler"

x = Crawler.new
# Decide what to do with each link based on its file extension.
callback = lambda do |link|
  if link =~ /\.(zip|rar|gz|pdf|doc)/
    x.remember(link)
    return Crawler::PASS
  elsif link =~ /\.(jpg|jpeg)/
    return Crawler::DOWNLOAD
  end
  return Crawler::INDEX
end
x.callback = callback
x.start("http://somesite.com")</pre>
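<p>One detail the crawler above leaves open is that <code>href</code> values are often relative, and that the same page can show up with different <code>#fragment</code> suffixes, which defeats the visited check. The sketch below shows one possible way to normalize links with Ruby's standard <code>uri</code> library before comparing them. The function name <code>absolute_link</code> is hypothetical and not part of the original crawler.</p>

```ruby
require "uri"

# Resolve an href found on the page at base_url into an absolute URL without
# a fragment. Returns nil for links that are not http(s) or cannot be parsed.
def absolute_link(base_url, href)
  return nil if href.nil? || href.strip.empty?
  uri = URI.join(base_url, href.strip)
  return nil unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
  uri.fragment = nil
  uri.to_s
rescue URI::Error
  nil
end

base = "http://somesite.com/docs/index.html"
puts absolute_link(base, "../files/a.zip") # prints http://somesite.com/files/a.zip
puts absolute_link(base, "page.html#top")  # prints http://somesite.com/docs/page.html
```

<p>Returning <code>nil</code> for unusable links fits the crawler's <code>start</code> method, which already ignores <code>nil</code>.</p>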
<p>Below are some open-source projects related to web crawlers:</p>
<ul>
    <li><a class="external text" title=http://arachnode.net onclick="pageTracker._trackPageview('/outgoing/arachnode.net/?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://arachnode.net/" rel=nofollow target=_blank><strong><u><font color=#0000ff>arachnode.net</font></u></strong></a> is a .NET crawler written in C# using SQL 2005 and <a title=Lucene onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Lucene?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Lucene" target=_blank><u><font color=#0000ff>Lucene</font></u></a> and is released under the <a title="GNU General Public License" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/GNU_General_Public_License?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/GNU_General_Public_License" target=_blank><u><font color=#0000ff>GNU General Public License</font></u></a>.
    <li><strong><a title=DataparkSearch onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/DataparkSearch?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/DataparkSearch" target=_blank><u><font color=#0000ff>DataparkSearch</font></u></a></strong> is a crawler and search engine released under the <a title="GNU General Public License" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/GNU_General_Public_License?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/GNU_General_Public_License" target=_blank><u><font color=#0000ff>GNU General Public License</font></u></a>.
    <li><strong><a title=Wget onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Wget?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Wget" target=_blank><u><font color=#0000ff>GNU Wget</font></u></a></strong> is a <a class=mw-redirect title="Command line interface" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Command_line_interface?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Command_line_interface" target=_blank><u><font color=#0000ff>command-line</font></u></a>-operated crawler written in <a title="C (programming language)" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/C_28programming_language_29?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/C_%28programming_language%29" target=_blank><u><font color=#0000ff>C</font></u></a> and released under the <a title="GNU General Public License" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/GNU_General_Public_License?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/GNU_General_Public_License" target=_blank><u><font color=#0000ff>GPL</font></u></a>. It is typically used to mirror Web and FTP sites.
    <li><strong><a title="Grub (search engine)" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Grub_28search_engine_29?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Grub_%28search_engine%29" target=_blank><u><font color=#0000ff>GRUB</font></u></a></strong> is an open source distributed search crawler that Wikia Search ( <a class="external free" title=http://wikiasearch.com onclick="pageTracker._trackPageview('/outgoing/wikiasearch.com/?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://wikiasearch.com/" rel=nofollow target=_blank><u><font color=#0000ff>http://wikiasearch.com</font></u></a> ) uses to crawl the web.
    <li><strong><a title=Heritrix onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Heritrix?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Heritrix" target=_blank><u><font color=#0000ff>Heritrix</font></u></a></strong> is the <a title="Internet Archive" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Internet_Archive?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Internet_Archive" target=_blank><u><font color=#0000ff>Internet Archive</font></u></a>&#8217;s archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in <a title="Java (programming language)" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Java_28programming_language_29?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Java_%28programming_language%29" target=_blank><u><font color=#0000ff>Java</font></u></a>.
    <li><strong><a class=mw-redirect title=Ht-//dig onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Ht-//dig?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Ht-//dig" target=_blank><u><font color=#0000ff>ht://Dig</font></u></a></strong> includes a Web crawler in its indexing engine.
    <li><strong><a title=HTTrack onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/HTTrack?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/HTTrack" target=_blank><u><font color=#0000ff>HTTrack</font></u></a></strong> uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in <a title="C (programming language)" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/C_28programming_language_29?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/C_%28programming_language%29" target=_blank><u><font color=#0000ff>C</font></u></a> and released under the <a title="GNU General Public License" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/GNU_General_Public_License?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/GNU_General_Public_License" target=_blank><u><font color=#0000ff>GPL</font></u></a>.
    <li><strong><a title="ICDL crawling" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/ICDL_crawling?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/ICDL_crawling" target=_blank><u><font color=#0000ff>ICDL Crawler</font></u></a></strong> is a <a title=Cross-platform onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Cross-platform?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Cross-platform" target=_blank><u><font color=#0000ff>cross-platform</font></u></a> web crawler written in <a title=C++ onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/C_2B_2B?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/C%2B%2B" target=_blank><u><font color=#0000ff>C++</font></u></a> and intended to crawl Web sites based on <a title="Website Parse Template" onclick="pageTracker._trackPageview('/outgoing/en.wikipedia.org/wiki/Website_Parse_Template?referer=http%3A%2F%2Fcoolshell.cn%2F%3Fp%3D1695');" href="http://en.wikipedia.org/wiki/Website_Parse_Template" target=_blank><u><font color=#0000ff>Website Parse Template</font></u></a>s.</li>
</ul>
<p>from:<br><a href="http://coolshell.cn/?p=27">http://coolshell.cn/?p=27</a></p>
<img src ="http://www.cppblog.com/beautykingdom/aggbug/108046.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/beautykingdom/" target="_blank">chatler</a> 2010-02-18 21:54 <a href="http://www.cppblog.com/beautykingdom/archive/2010/02/18/108046.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>