﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-蓝色理想-随笔分类-搜索引擎</title><link>http://www.cppblog.com/merlinfang/category/6294.html</link><description /><language>zh-cn</language><lastBuildDate>Tue, 19 Aug 2008 06:22:22 GMT</lastBuildDate><pubDate>Tue, 19 Aug 2008 06:22:22 GMT</pubDate><ttl>60</ttl><item><title>url格式规范</title><link>http://www.cppblog.com/merlinfang/archive/2008/08/19/59292.html</link><dc:creator>merlinfang</dc:creator><author>merlinfang</author><pubDate>Mon, 18 Aug 2008 16:13:00 GMT</pubDate><guid>http://www.cppblog.com/merlinfang/archive/2008/08/19/59292.html</guid><wfw:comment>http://www.cppblog.com/merlinfang/comments/59292.html</wfw:comment><comments>http://www.cppblog.com/merlinfang/archive/2008/08/19/59292.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/merlinfang/comments/commentRss/59292.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/merlinfang/services/trackbacks/59292.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: &nbsp;&nbsp;<a href='http://www.cppblog.com/merlinfang/archive/2008/08/19/59292.html'>阅读全文</a><img src ="http://www.cppblog.com/merlinfang/aggbug/59292.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/merlinfang/" target="_blank">merlinfang</a> 2008-08-19 00:13 <a href="http://www.cppblog.com/merlinfang/archive/2008/08/19/59292.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>增量搜集</title><link>http://www.cppblog.com/merlinfang/archive/2008/05/22/50801.html</link><dc:creator>merlinfang</dc:creator><author>merlinfang</author><pubDate>Thu, 22 May 2008 
14:23:00 GMT</pubDate><guid>http://www.cppblog.com/merlinfang/archive/2008/05/22/50801.html</guid><wfw:comment>http://www.cppblog.com/merlinfang/comments/50801.html</wfw:comment><comments>http://www.cppblog.com/merlinfang/archive/2008/05/22/50801.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/merlinfang/comments/commentRss/50801.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/merlinfang/services/trackbacks/50801.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: &nbsp;&nbsp;<a href='http://www.cppblog.com/merlinfang/archive/2008/05/22/50801.html'>阅读全文</a><img src ="http://www.cppblog.com/merlinfang/aggbug/50801.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/merlinfang/" target="_blank">merlinfang</a> 2008-05-22 22:23 <a href="http://www.cppblog.com/merlinfang/archive/2008/05/22/50801.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>网页净化</title><link>http://www.cppblog.com/merlinfang/archive/2008/03/09/44045.html</link><dc:creator>merlinfang</dc:creator><author>merlinfang</author><pubDate>Sun, 09 Mar 2008 14:52:00 GMT</pubDate><guid>http://www.cppblog.com/merlinfang/archive/2008/03/09/44045.html</guid><wfw:comment>http://www.cppblog.com/merlinfang/comments/44045.html</wfw:comment><comments>http://www.cppblog.com/merlinfang/archive/2008/03/09/44045.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/merlinfang/comments/commentRss/44045.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/merlinfang/services/trackbacks/44045.html</trackback:ping><description><![CDATA[<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; When we browse web pages, we often see a lot of navigation bars, advertisements, copyright notices, surveys, and the like that have nothing to do with the content we care about. Occasionally they bring a pleasant surprise, but most of the time they are just annoying.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
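Floating ads are arguably the worst offenders, although plugins that hide them already exist. But there are many, many more irrelevant ads, especially the kind that infect your machine when you click them; shouldn't someone build a plugin that keeps those out of our sight too?</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I have been reading about search engines lately. A search engine needs the same kind of processing when it analyzes pages; it is called web page cleaning.<br><br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Though I do wonder whether websites would still be able to land any advertisers once everyone has such a plugin installed...</p>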
<p>&nbsp;</p>
<img src ="http://www.cppblog.com/merlinfang/aggbug/44045.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/merlinfang/" target="_blank">merlinfang</a> 2008-03-09 22:52 <a href="http://www.cppblog.com/merlinfang/archive/2008/03/09/44045.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>搜索引擎-网页预处理(2)</title><link>http://www.cppblog.com/merlinfang/archive/2008/03/05/43777.html</link><dc:creator>merlinfang</dc:creator><author>merlinfang</author><pubDate>Wed, 05 Mar 2008 15:10:00 GMT</pubDate><guid>http://www.cppblog.com/merlinfang/archive/2008/03/05/43777.html</guid><wfw:comment>http://www.cppblog.com/merlinfang/comments/43777.html</wfw:comment><comments>http://www.cppblog.com/merlinfang/archive/2008/03/05/43777.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/merlinfang/comments/commentRss/43777.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/merlinfang/services/trackbacks/43777.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; 摘要: &nbsp;&nbsp;<a href='http://www.cppblog.com/merlinfang/archive/2008/03/05/43777.html'>阅读全文</a><img src ="http://www.cppblog.com/merlinfang/aggbug/43777.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/merlinfang/" target="_blank">merlinfang</a> 2008-03-05 23:10 <a href="http://www.cppblog.com/merlinfang/archive/2008/03/05/43777.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>搜索引擎-网页搜集(1)</title><link>http://www.cppblog.com/merlinfang/archive/2008/03/04/43705.html</link><dc:creator>merlinfang</dc:creator><author>merlinfang</author><pubDate>Tue, 04 Mar 2008 13:52:00 
GMT</pubDate><guid>http://www.cppblog.com/merlinfang/archive/2008/03/04/43705.html</guid><wfw:comment>http://www.cppblog.com/merlinfang/comments/43705.html</wfw:comment><comments>http://www.cppblog.com/merlinfang/archive/2008/03/04/43705.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.cppblog.com/merlinfang/comments/commentRss/43705.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/merlinfang/services/trackbacks/43705.html</trackback:ping><description><![CDATA[<p>I have been studying search engines recently, so here are some notes.</p>
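<p>A search engine generally consists of three modules: web page crawling, preprocessing, and query serving.<br></p>
<p>Pages are crawled ahead of time; crawling at query time is clearly not feasible. Crawling ahead of time comes in two forms: periodic crawling and incremental crawling. A periodic crawl collects everything from scratch, so one refresh usually takes a long time and the result is already somewhat stale when it finishes, but it is certainly the simpler one to implement. An incremental crawl is a full crawl only the first time; after that it just applies updates (adding new pages, deleting expired ones, and refreshing changed ones), which is much more complex to implement. In practice the two complement each other: news search has to be refreshed promptly, while some academic sites are rarely updated.<br></p>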
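<p>The three kinds of update that an incremental crawl applies (new, deleted, and changed pages) might be sketched roughly as below. This is only an illustration: the names <code>digest</code> and <code>incremental_update</code> are made up for this note, the stored index is assumed to map each URL to a content digest, and a URL missing from the latest fetch is assumed to mean the page has expired.</p>

```python
import hashlib


def digest(content: str) -> str:
    """Return a fingerprint of a page body (SHA-256 chosen for illustration)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def incremental_update(index: dict, fetched: dict) -> dict:
    """Bring `index` (URL -> digest) in line with `fetched` (URL -> page body).

    Returns the URLs that were added, updated, or deleted in this round.
    """
    changes = {"added": [], "updated": [], "deleted": []}

    # New pages and pages whose content changed since the last round.
    for url, body in fetched.items():
        fingerprint = digest(body)
        if url not in index:
            changes["added"].append(url)
        elif index[url] != fingerprint:
            changes["updated"].append(url)
        index[url] = fingerprint

    # Assumption: a URL that is no longer fetched has expired.
    for url in list(index):
        if url not in fetched:
            changes["deleted"].append(url)
            del index[url]

    return changes
```

<p>The first call on an empty index reports every page as added, which corresponds to the initial full crawl; later calls report only the differences.</p>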
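<p>Problems that web page crawling has to solve:<br>(1) Many kinds of pages (HTML, ASP, JavaScript) in many languages and character encodings (ASCII, UTF-8)</p>
<p>(2) Diverse network resources (files, images, documents, audio, video, etc.)<br>(3) Crawl strategy (depth-first or breadth-first)<br>(4) Concurrent crawling (avoid sending a large number of requests to the same site at the same moment, otherwise it turns into a DoS attack)</p>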
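<p>(5) Avoiding duplicate crawling<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Keep track of unvisited URLs, visited URLs, and content digests of the pages</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The mapping between domain names and IP addresses (the same site may be reachable under several domain names or IP addresses)&nbsp; <br>(6) Judging how important a page is</p>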
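<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1) The page has a high in-degree, meaning many other pages link to it;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2) The page's parent pages have a high in-degree;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3) The page is mirrored often, which suggests its content is popular and therefore important;<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4) The page sits at a shallow directory depth, so users can reach it easily.</p>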
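<p>Points (3) to (5) above can be tied together in one small sketch: a breadth-first frontier, a set of seen URLs plus content digests to avoid duplicates, and a per-host delay so that one site is never hit in bursts. Everything here is an assumption made for illustration: <code>FAKE_WEB</code> is an in-memory stand-in for the network, the clock is simulated rather than real, and <code>crawl</code> and <code>min_delay</code> are hypothetical names.</p>

```python
import hashlib
from collections import deque
from urllib.parse import urlsplit

# Hypothetical in-memory "web": URL -> (page body, outgoing links).
FAKE_WEB = {
    "http://a.example/": ("home", ["http://a.example/x", "http://b.example/"]),
    "http://a.example/x": ("article", ["http://a.example/"]),
    "http://b.example/": ("article", []),  # same body as http://a.example/x
}


def crawl(seed, web, min_delay=2.0):
    """Breadth-first crawl of `web` from `seed` on a simulated clock.

    Returns (schedule, duplicates): schedule lists (fetch_time, url) pairs,
    duplicates lists URLs whose body matched an already stored page.
    """
    frontier = deque([seed])   # FIFO queue gives breadth-first order
    seen_urls = {seed}         # URLs already queued or fetched
    seen_digests = set()       # fingerprints of stored page bodies
    next_allowed = {}          # host -> earliest time of its next fetch
    clock = 0.0
    schedule, duplicates = [], []

    while frontier:
        url = frontier.popleft()
        host = urlsplit(url).netloc
        # Politeness: wait until this host may be contacted again.
        clock = max(clock, next_allowed.get(host, 0.0))
        next_allowed[host] = clock + min_delay
        schedule.append((clock, url))

        body, links = web.get(url, ("", []))
        fingerprint = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if fingerprint in seen_digests:
            duplicates.append(url)   # mirrored content, not stored again
        else:
            seen_digests.add(fingerprint)

        for link in links:
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return schedule, duplicates
```

<p>Replacing <code>frontier.popleft()</code> with <code>frontier.pop()</code> would turn the same loop into a depth-first crawl. A real crawler would additionally have to deal with the domain name and IP mapping issue, which this sketch ignores.</p>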
<img src ="http://www.cppblog.com/merlinfang/aggbug/43705.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/merlinfang/" target="_blank">merlinfang</a> 2008-03-04 21:52 <a href="http://www.cppblog.com/merlinfang/archive/2008/03/04/43705.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>