﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-Just do it for fun-随笔分类-搜索引擎</title><link>http://www.cppblog.com/zzfmars/category/14823.html</link><description>Open Source Spirit    

</description><language>zh-cn</language><lastBuildDate>Sun, 19 Jun 2011 06:34:54 GMT</lastBuildDate><pubDate>Sun, 19 Jun 2011 06:34:54 GMT</pubDate><ttl>60</ttl><item><title>Lucene Beginner Notes 5 -- Analyzers: Using a Chinese Analyzer, Extending the Dictionary, Stop Words</title><link>http://www.cppblog.com/zzfmars/archive/2011/04/17/144401.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Sun, 17 Apr 2011 11:25:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2011/04/17/144401.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/144401.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2011/04/17/144401.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/144401.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/144401.html</trackback:ping><description><![CDATA[<div style="BORDER-BOTTOM: #cccccc 1px solid; BORDER-LEFT: #cccccc 1px solid; PADDING-BOTTOM: 4px; BACKGROUND-COLOR: #eeeeee; PADDING-LEFT: 4px; WIDTH: 98%; PADDING-RIGHT: 5px; FONT-SIZE: 13px; WORD-BREAK: break-all; BORDER-TOP: #cccccc 1px solid; BORDER-RIGHT: #cccccc 1px solid; PADDING-TOP: 4px"><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><span style="COLOR: #000000">1</span><span style="COLOR: #000000">.&nbsp;Common Chinese analyzers include the 极易 segmenter (MMAnalyzer), the&nbsp;</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">Paoding</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">&nbsp;analyzer (PaodingAnalyzer), IKAnalyzer, and others. Of these, MMAnalyzer and PaodingAnalyzer do not support Lucene 3.0 or later.<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;They are all used in much the same way. When constructing the analyzer, write:<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Analyzer&nbsp;analyzer&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;[My]Analyzer();&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"></span><span style="COLOR: #000000">2</span><span style="COLOR: #000000">.&nbsp;Only IKAnalyzer is demonstrated here, because it is currently the only one of these that supports Lucene 3.</span><span style="COLOR: #000000">0</span><span style="COLOR: #000000">&nbsp;and later.&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;First, add IKAnalyzer3.</span><span style="COLOR: #000000">2</span><span style="COLOR: #000000">.0Stable.jar&nbsp;to the classpath.<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"></span><span style="COLOR: #000000">3</span><span style="COLOR: #000000">.&nbsp;Sample code<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img id=Codehighlighter1_362_1560_Open_Image 
onclick="this.style.display='none'; Codehighlighter1_362_1560_Open_Text.style.display='none'; Codehighlighter1_362_1560_Closed_Image.style.display='inline'; Codehighlighter1_362_1560_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_362_1560_Closed_Image onclick="this.style.display='none'; Codehighlighter1_362_1560_Closed_Text.style.display='none'; Codehighlighter1_362_1560_Open_Image.style.display='inline'; Codehighlighter1_362_1560_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedBlock.gif"></span><span style="COLOR: #0000ff">public</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">class</span><span style="COLOR: #000000">&nbsp;AnalyzerTest&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_362_1560_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_362_1560_Open_Text><span style="COLOR: #000000">{&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;@Test&nbsp;&nbsp;<br><img id=Codehighlighter1_425_806_Open_Image onclick="this.style.display='none'; Codehighlighter1_425_806_Open_Text.style.display='none'; Codehighlighter1_425_806_Closed_Image.style.display='inline'; Codehighlighter1_425_806_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_425_806_Closed_Image onclick="this.style.display='none'; Codehighlighter1_425_806_Closed_Text.style.display='none'; Codehighlighter1_425_806_Open_Image.style.display='inline'; 
Codehighlighter1_425_806_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">public</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">void</span><span style="COLOR: #000000">&nbsp;test()&nbsp;</span><span style="COLOR: #0000ff">throws</span><span style="COLOR: #000000">&nbsp;Exception&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_425_806_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_425_806_Open_Text><span style="COLOR: #000000">{&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;text&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">An&nbsp;IndexWriter&nbsp;creates&nbsp;and&nbsp;maintains&nbsp;an&nbsp;index.</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">;&nbsp;&nbsp;&nbsp;<br><img id=Codehighlighter1_525_540_Open_Image onclick="this.style.display='none'; Codehighlighter1_525_540_Open_Text.style.display='none'; Codehighlighter1_525_540_Closed_Image.style.display='inline'; Codehighlighter1_525_540_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_525_540_Closed_Image onclick="this.style.display='none'; Codehighlighter1_525_540_Closed_Text.style.display='none'; Codehighlighter1_525_540_Open_Image.style.display='inline'; 
Codehighlighter1_525_540_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_525_540_Closed_Text>/**/</span><span id=Codehighlighter1_525_540_Open_Text><span style="COLOR: #008000">/*</span><span style="COLOR: #008000">&nbsp;Standard analyzer: splits Chinese text into single characters&nbsp;</span><span style="COLOR: #008000">*/</span></span><span style="COLOR: #000000">&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Analyzer&nbsp;analyzer&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;StandardAnalyzer(Version.LUCENE_30);&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;testAnalyzer(analyzer,&nbsp;text);&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;text2&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">测试中文环境下的信息检索</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">;&nbsp;&nbsp;&nbsp;<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;testAnalyzer(</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;IKAnalyzer(),&nbsp;text2);&nbsp;</span><span style="COLOR: #008000">//</span><span style="COLOR: #008000">&nbsp;IKAnalyzer: dictionary-based tokenization&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif"></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;<br><img id=Codehighlighter1_823_969_Open_Image onclick="this.style.display='none'; Codehighlighter1_823_969_Open_Text.style.display='none'; Codehighlighter1_823_969_Closed_Image.style.display='inline'; Codehighlighter1_823_969_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_823_969_Closed_Image onclick="this.style.display='none'; Codehighlighter1_823_969_Closed_Text.style.display='none'; Codehighlighter1_823_969_Open_Image.style.display='inline'; Codehighlighter1_823_969_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_823_969_Closed_Text>/**&nbsp;*/</span><span id=Codehighlighter1_823_969_Open_Text><span style="COLOR: #008000">/**</span><span style="COLOR: #008000">&nbsp;&nbsp;<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*&nbsp;Tokenizes the given text with the given analyzer and prints each term.&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*&nbsp;</span><span style="COLOR: #808080">@param</span><span style="COLOR: #008000">&nbsp;analyzer&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*&nbsp;</span><span style="COLOR: #808080">@param</span><span style="COLOR: #008000">&nbsp;text&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*&nbsp;</span><span style="COLOR: #808080">@throws</span><span style="COLOR: #008000">&nbsp;Exception&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000">*/</span></span><span style="COLOR: #000000">&nbsp;&nbsp;<br><img id=Codehighlighter1_1055_1555_Open_Image onclick="this.style.display='none'; Codehighlighter1_1055_1555_Open_Text.style.display='none'; Codehighlighter1_1055_1555_Closed_Image.style.display='inline'; Codehighlighter1_1055_1555_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_1055_1555_Closed_Image onclick="this.style.display='none'; Codehighlighter1_1055_1555_Closed_Text.style.display='none'; Codehighlighter1_1055_1555_Open_Image.style.display='inline'; Codehighlighter1_1055_1555_Open_Text.style.display='inline';" align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">private</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">void</span><span style="COLOR: #000000">&nbsp;testAnalyzer(Analyzer&nbsp;analyzer,&nbsp;String&nbsp;text)&nbsp;</span><span style="COLOR: #0000ff">throws</span><span style="COLOR: #000000">&nbsp;Exception&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_1055_1555_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_1055_1555_Open_Text><span style="COLOR: #000000">{&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">Analyzer in use: </span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #000000">+</span><span style="COLOR: #000000">&nbsp;analyzer.getClass());&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;TokenStream&nbsp;tokenStream&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;analyzer.tokenStream(</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">content</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">,&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: 
#000000">&nbsp;StringReader(text));&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tokenStream.addAttribute(TermAttribute.</span><span style="COLOR: #0000ff">class</span><span style="COLOR: #000000">);&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;<br><img id=Codehighlighter1_1357_1543_Open_Image onclick="this.style.display='none'; Codehighlighter1_1357_1543_Open_Text.style.display='none'; Codehighlighter1_1357_1543_Closed_Image.style.display='inline'; Codehighlighter1_1357_1543_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_1357_1543_Closed_Image onclick="this.style.display='none'; Codehighlighter1_1357_1543_Closed_Text.style.display='none'; Codehighlighter1_1357_1543_Open_Image.style.display='inline'; Codehighlighter1_1357_1543_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">while</span><span style="COLOR: #000000">&nbsp;(tokenStream.incrementToken())&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_1357_1543_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_1357_1543_Open_Text><span style="COLOR: #000000">{&nbsp;&nbsp;&nbsp;<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;TermAttribute&nbsp;termAttribute&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;tokenStream.getAttribute(TermAttribute.</span><span style="COLOR: #0000ff">class</span><span style="COLOR: #000000">);&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(termAttribute.term());&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockEnd.gif">}</span></span><span style="COLOR: #000000">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">3.&nbsp;Extending the dictionary: we often need a custom dictionary. For example, we may want the analyzer to recognize a company name such as "XXX 公司" (XXX Company) and emit it as a single token.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;IKAnalyzer makes this easy.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;Create a file named IKAnalyzer.cfg.xml:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&lt;?xml&nbsp;version="1.0"&nbsp;encoding="UTF-8"?&gt;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&lt;!DOCTYPE&nbsp;properties&nbsp;SYSTEM&nbsp;"http://java.sun.com/dtd/properties.dtd"&gt;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&lt;properties&gt;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;!--&nbsp;1. The dictionary file must be UTF-8 encoded. 2. Write one word per line.&nbsp;--&gt;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;!--&nbsp;Users can configure their own extension dictionary here&nbsp;--&gt;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;entry&nbsp;key="ext_dict"&gt;/mydict.dic&lt;/entry&gt;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&lt;/properties&gt;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Explanation:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;entry&nbsp;key="ext_dict"&gt;/mydict.dic&lt;/entry&gt; registers a custom extension dictionary named mydict.dic.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;So we create a text file named mydict.dic (the .dic extension is not required).
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Put the following in this file:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;北京XXXX科技有限公司
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This adds one entry to the dictionary.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To add several entries, put each one on its own line:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;term one
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;term two
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;term three
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src="http://www.cppblog.com/Images/dot.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Note that this file must be saved in UTF-8 encoding.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;<br><img align=top 
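src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The UTF-8 requirement is easy to miss when the dictionary is typed into an editor that defaults to another encoding. As a minimal sketch, the file can instead be written from Java with the charset stated explicitly. The class name DictWriter, the method writeDict, and the output path are assumptions made for illustration and are not part of IKAnalyzer; the sample entries are written as Unicode escapes only so that the source file itself stays ASCII.</span><pre>
```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class DictWriter {
    // Writes one dictionary entry per line, always encoded as UTF-8.
    static void writeDict(Path file, String[] words) throws IOException {
        Files.write(file, Arrays.asList(words), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical location; the config above expects the file at the classpath root.
        Path dict = Paths.get("mydict.dic");
        String[] words = {
            "\u8BCD\u6C47\u4E00", // "term one"
            "\u8BCD\u6C47\u4E8C", // "term two"
            "\u8BCD\u6C47\u4E09"  // "term three"
        };
        writeDict(dict, words);
        int count = Files.readAllLines(dict, StandardCharsets.UTF_8).size();
        System.out.println(count + " entries written");
    }
}
```
</pre><span style="COLOR: #000000">Passing the charset explicitly, rather than relying on the platform default, is the point of the sketch.<br><img align=top 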
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">4.&nbsp;Stop words:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;Some words occur very frequently in text but contribute almost nothing to the information it carries, for example "a", "an", "the", and "of" in English or "的", "了", and "着" in Chinese, as well as punctuation marks. Such words are called stop words.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;After tokenization, stop words are usually filtered out and are not indexed. At search time, stop words in the user's query are filtered out as well, because the query string goes through the same tokenization.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;Excluding stop words speeds up indexing and reduces the size of the index files.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;Custom stop words are just as easy to set up in IKAnalyzer. The procedure is similar to configuring the extension dictionary: add the following entry to IKAnalyzer.cfg.xml:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;entry&nbsp;key="ext_stopwords"&gt;/ext_stopword.dic&lt;/entry&gt;&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;This entry likewise points to a text file, /ext_stopword.dic (any file extension will do), in the following format:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;也
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;了
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;仍
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;从
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<img src="http://www.cppblog.com/Images/dot.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">This article is from a CSDN blog; please credit the source when reposting: http://blog.csdn.net/wenlin56/archive/2010/12/13/6074124.aspx</span></div>
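<div>The effect of a stop-word list can be sketched in a few lines of plain Java. This is only an illustration of the idea, not IKAnalyzer's implementation: the class StopWordFilter, its filter method, and the sample tokens are all assumptions made for the example.</div><pre>
```java
import java.util.Arrays;

public class StopWordFilter {
    // Returns the tokens that are not stop words, joined by single spaces.
    static String filter(String[] tokens, String[] stopWords) {
        StringBuilder kept = new StringBuilder();
        for (String token : tokens) {
            if (Arrays.asList(stopWords).contains(token.toLowerCase())) {
                continue; // stop word: dropped, so it never reaches the index
            }
            if (kept.length() != 0) {
                kept.append(' ');
            }
            kept.append(token);
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String[] stopWords = {"a", "an", "the", "of"};
        String[] tokens = {"the", "design", "of", "a", "search", "engine"};
        System.out.println(filter(tokens, stopWords)); // prints "design search engine"
    }
}
```
</pre><div>Running the same filter over the query string is what keeps indexing and searching consistent: a stop word that was never indexed is also never searched for.</div>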
<img src ="http://www.cppblog.com/zzfmars/aggbug/144401.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2011-04-17 19:25 <a href="http://www.cppblog.com/zzfmars/archive/2011/04/17/144401.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Open-Source Web Page Parsing Project</title><link>http://www.cppblog.com/zzfmars/archive/2011/04/17/144369.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Sun, 17 Apr 2011 00:36:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2011/04/17/144369.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/144369.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2011/04/17/144369.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/144369.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/144369.html</trackback:ping><description><![CDATA[<a href="http://htmlparser.sourceforge.net/">http://htmlparser.sourceforge.net/</a><br>
<img src ="http://www.cppblog.com/zzfmars/aggbug/144369.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2011-04-17 08:36 <a href="http://www.cppblog.com/zzfmars/archive/2011/04/17/144369.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Implementing a Java Search Engine, Part 2: Web Page Preprocessing</title><link>http://www.cppblog.com/zzfmars/archive/2011/04/16/144357.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Sat, 16 Apr 2011 12:36:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2011/04/16/144357.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/144357.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2011/04/16/144357.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/144357.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/144357.html</trackback:ping><description><![CDATA[<div style="BORDER-BOTTOM: #cccccc 1px solid; BORDER-LEFT: #cccccc 1px solid; PADDING-BOTTOM: 4px; BACKGROUND-COLOR: #eeeeee; PADDING-LEFT: 4px; WIDTH: 98%; PADDING-RIGHT: 5px; FONT-SIZE: 13px; WORD-BREAK: break-all; BORDER-TOP: #cccccc 1px solid; BORDER-RIGHT: #cccccc 1px solid; PADDING-TOP: 4px"><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><span style="COLOR: #000000">In the previous part, you learned how to write a spider program that crawls web pages. The crawl produced a raw page repository stored in a fixed format, and that repository is the input for this second part, web page preprocessing. The main goal of preprocessing is to turn the raw pages, step by step, into a form that is easy to search. The sections below walk through the design and implementation of the preprocessing stage.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Overall structure of the preprocessing module
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The overall structure of the preprocessing module is as follows:<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Figure&nbsp;1.&nbsp;Overall structure of the preprocessing module
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The pages collected by the spider are stored in a well-structured format, but there is one drawback: a page cannot be located directly by its URL. The first step is therefore to build an index of the pages, so that the page data for any URL can be fetched easily from the raw page repository. Next comes the processing of the page data itself: for each page we first extract the main body text, then tokenize it, and finally build a forward index and an inverted index from the tokens. That completes preprocessing. Some of these terms may be unfamiliar; the sections that follow describe each step in detail, with figures or examples.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Building the page index repository
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The raw page repository is stored in a fixed format, which makes indexing straightforward. The listing below shows one page record:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Listing&nbsp;1.&nbsp;A page record in the raw page repository
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000">//&nbsp;previous records</span><span style="COLOR: #000000">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;version:1.0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000">//&nbsp;record header</span><span style="COLOR: #000000">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;url:http://ast.nlsde.buaa.edu.cn/&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;date:Mon&nbsp;Apr&nbsp;05&nbsp;14:22:53&nbsp;CST&nbsp;2010&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;IP:218.241.236.72&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;length:3981&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&lt;!</span><span style="COLOR: 
#000000">DOCTYPE&nbsp;&#8230;&#8230;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000">//&nbsp;record data</span><span style="COLOR: #000000">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&lt;html&gt;&nbsp;&#8230;&#8230;&nbsp;&lt;/html&gt;&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000">//&nbsp;following records</span><span style="COLOR: #000000">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">We locate a page record in the repository with a (repository name, offset) pair. Because the data volume is large, this index information has to be persisted somewhere; dySE stores it in a database. The database is MySQL, and the SQL-Front tool provides a convenient graphical interface to it. A single table holds the information, with the columns url, content, offset, and raws. url is the URL of the record; once the index database is built, URLs are how we look up the pages we need. raws and offset hold the repository name and the offset, and together they uniquely identify a record. content is a digest of the page content: pages tend to be large, so storing their full content in the database is impractical, and we store the MD5 digest of the content instead. The column acts as a checksum. In practice, after fetching a page by URL, we compute the MD5 digest of what we read and compare it with the value in content. If the two match, the page was retrieved correctly; if not, something went wrong during retrieval.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">A brief note on installing MySQL and connecting to it from Java:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Three components are recommended: MySQL, MySQL-Front, and mysql-connector-java-5.1.7-bin.jar, all of which can be downloaded from the web. Note that the MySQL and MySQL-Front versions must match: MySQL 5.0 with MySQL-Front 3.2, or MySQL 5.1 with MySQL-Front 4.1. Do not mix these pairs; download by the matching version numbers, or you will get the error &#8220;&#8216;&nbsp;10.000000&nbsp;&#8217;&nbsp;ist&nbsp;kein&nbsp;gUltiger&nbsp;Integerwert&nbsp;&#8221; (German for "is not a valid integer value").
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">To add mysql-connector-java-5.1.7-bin.jar to an Eclipse project: open Eclipse, right-click the name of the project that needs the JAR, choose Properties, select Java Build Path, open the Libraries tab on the right, click Add External JARs, then pick the JAR and confirm.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">You can then test the MySQL connection from code; see the testMySql.java program that accompanies this article. It is omitted here for brevity.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Database access is best wrapped in a class of its own, so that other classes get a uniform interface and never have to open connections explicitly. This also avoids creating large numbers of connections and wasting resources. See DBConnection.java for the code. Its main operations are opening a connection, executing SQL statements, and returning the results.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">With the database operations covered, we can now build the page index repository. Note that the first record has offset 0, so the offset of the current record is always known before the record is processed; processing a record serves to compute the offset of the next one. Suppose the current record starts at offset, positioned just before the first header field. Reading the record's header and data gives its length, so offset + length is the offset of the next record. The end of the header and the end of the record are both marked by a blank line. The pseudocode is as follows:
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Listing&nbsp;2.&nbsp;Building the page index repository
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">For&nbsp;each&nbsp;record&nbsp;in&nbsp;Raws&nbsp;</span><span style="COLOR: #0000ff">do</span><span style="COLOR: #000000">&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">begin&nbsp;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;read the header and data of record, and extract the URL from the header;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;compute the MD5 digest of the data in record;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;insert a row into the database: URL, current offset, MD5 digest of the data, Raws;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;add the length of the header and data to the current offset to get the offset of the next record;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">end;
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">You may be wondering what the MD5 digest algorithm is and what it is good for. MD5 (Message-Digest Algorithm 5) is a hash function widely used in computing to check message integrity. Its typical use is to produce a 128-bit digest of a message, usually written as a string of 32 hexadecimal digits, so that changes to the message can be detected. (MD5 is no longer considered collision-resistant, but it is sufficient for detecting accidental corruption, which is how it is used here.) For example, suppose the MD5 digest of some page data is 00902914CFE6CD1A959C31C076F49EA8. If the page data is changed in any way, the recomputed digest changes too, so the MD5 digest can be treated as a fingerprint of the data. Storing the digest therefore lets us verify later that a retrieved page matches the original.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">In brief, MD5 processes its input in 512-bit blocks, each divided into sixteen 32-bit sub-blocks. After a series of processing steps, the output consists of four 32-bit words, which are concatenated into a 128-bit hash value. That "series of processing steps" is the computation itself; it has many steps but is not difficult, nor is it hard to implement. You can use an existing Java implementation from the web or the MD5 class in the source download that accompanies this tutorial. For our purposes it is enough to know what MD5 does and how to use it; there is no need to understand every step in depth.
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Extracting the body text
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">PageGetter
<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Before extracting body text, we need a simple utility class that looks up an entry in the database and fetches the page from the raw page repository. dySE implements this in originalPageGetter.java. Given a URL, the class queries the database for the repository name and offset of the page data, then reads the page content starting at that offset, again treating the blank line between records as the end-of-content marker. After reading, it computes the MD5 digest of the content and checks it against the stored digest. To apply the offset, the BufferedReader class provides skip(</span><span style="COLOR: #0000ff">long</span><span style="COLOR: #000000">&nbsp;n), which skips up to n characters counted from the current position; with it we can move straight to the record we need.<br><img align=top 
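src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The digest check described above can be sketched with the JDK's own java.security.MessageDigest rather than a hand-written MD5 class. The class name PageDigest and the methods md5Hex and matches are hypothetical names chosen for this sketch, and encoding the page text as UTF-8 before hashing is an assumption; the MD5 class shipped with dySE may differ.</span><pre>
```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageDigest {
    // Returns the MD5 digest of the text as 32 uppercase hexadecimal digits.
    static String md5Hex(String text) throws NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02X", b)); // two hex digits per byte
        }
        return hex.toString();
    }

    // True when the content read back from the repository matches the stored digest.
    static boolean matches(String pageContent, String storedDigest)
            throws NoSuchAlgorithmException {
        return md5Hex(pageContent).equalsIgnoreCase(storedDigest);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String page = "hello";
        String stored = md5Hex(page);                    // what the index table would hold
        System.out.println(stored);
        System.out.println(matches(page, stored));       // unchanged page
        System.out.println(matches(page + "!", stored)); // modified page
    }
}
```
</pre><span style="COLOR: #000000">Comparing with equalsIgnoreCase means it does not matter whether the stored digest was written in upper or lower case.<br><img align=top 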
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">清单&nbsp;</span><span style="COLOR: #000000">3</span><span style="COLOR: #000000">.&nbsp;获取原始网页库中内容<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;</span><span style="COLOR: #0000ff">public</span><span style="COLOR: #000000">&nbsp;String&nbsp;getContent(String&nbsp;fileName,&nbsp;</span><span style="COLOR: #0000ff">int</span><span style="COLOR: #000000">&nbsp;offset)&nbsp;<br><img id=Codehighlighter1_3538_3906_Open_Image onclick="this.style.display='none'; Codehighlighter1_3538_3906_Open_Text.style.display='none'; Codehighlighter1_3538_3906_Closed_Image.style.display='inline'; Codehighlighter1_3538_3906_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_3538_3906_Closed_Image onclick="this.style.display='none'; Codehighlighter1_3538_3906_Closed_Text.style.display='none'; Codehighlighter1_3538_3906_Open_Image.style.display='inline'; Codehighlighter1_3538_3906_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedBlock.gif">&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_3538_3906_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_3538_3906_Open_Text><span style="COLOR: #000000">{&nbsp;<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;content&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #000000">""</span><span style="COLOR: #000000">;&nbsp;<br><img id=Codehighlighter1_3577_3833_Open_Image onclick="this.style.display='none'; Codehighlighter1_3577_3833_Open_Text.style.display='none'; Codehighlighter1_3577_3833_Closed_Image.style.display='inline'; Codehighlighter1_3577_3833_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_3577_3833_Closed_Image onclick="this.style.display='none'; Codehighlighter1_3577_3833_Closed_Text.style.display='none'; Codehighlighter1_3577_3833_Open_Image.style.display='inline'; Codehighlighter1_3577_3833_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">try</span><span style="COLOR: #000000">&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_3577_3833_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_3577_3833_Open_Text><span style="COLOR: #000000">{&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;FileReader&nbsp;fileReader&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;FileReader(fileName);&nbsp;<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;BufferedReader&nbsp;bfReader&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;BufferedReader(fileReader);&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;bfReader.skip(offset);&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;readRawHead(bfReader);&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;content&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;readRawContent(bfReader);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img id=Codehighlighter1_3855_3876_Open_Image onclick="this.style.display='none'; Codehighlighter1_3855_3876_Open_Text.style.display='none'; Codehighlighter1_3855_3876_Closed_Image.style.display='inline'; Codehighlighter1_3855_3876_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_3855_3876_Closed_Image onclick="this.style.display='none'; Codehighlighter1_3855_3876_Closed_Text.style.display='none'; Codehighlighter1_3855_3876_Open_Image.style.display='inline'; Codehighlighter1_3855_3876_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">catch</span><span style="COLOR: #000000">&nbsp;(Exception&nbsp;e)&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: 
#808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_3855_3876_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_3855_3876_Open_Text><span style="COLOR: #000000">{e.printStackTrace();}</span></span><span style="COLOR: #000000">&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">return</span><span style="COLOR: #000000">&nbsp;content;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockEnd.gif">&nbsp;}</span></span><span style="COLOR: #000000">&nbsp;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The code above omits the implementations of readRawHead and readRawContent. Both are basic I/O operations; see the accompanying source code.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Extracting the body text<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Once we have the data of a single page, we can move on to processing it. The first step is to extract the body text, that is, to strip the markup from the page. This is done mainly with regular expressions: we match HTML tags, delete every match, and what remains is the text of the page. To save space, only the filtering of script tags is shown here:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Listing 4. Tag filtering<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img id=Codehighlighter1_4196_4688_Open_Image onclick="this.style.display='none'; 
Codehighlighter1_4196_4688_Open_Text.style.display='none'; Codehighlighter1_4196_4688_Closed_Image.style.display='inline'; Codehighlighter1_4196_4688_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_4196_4688_Closed_Image onclick="this.style.display='none'; Codehighlighter1_4196_4688_Closed_Text.style.display='none'; Codehighlighter1_4196_4688_Open_Image.style.display='inline'; Codehighlighter1_4196_4688_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedBlock.gif">&nbsp;</span><span style="COLOR: #0000ff">public</span><span style="COLOR: #000000">&nbsp;String&nbsp;html2Text(String&nbsp;inputString)&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_4196_4688_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_4196_4688_Open_Text><span style="COLOR: #000000">{&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;htmlStr&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;inputString;&nbsp;</span><span style="COLOR: #008000">//</span><span style="COLOR: #008000">&nbsp;含&nbsp;html&nbsp;标签的字符串&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif"></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Pattern&nbsp;p_script;&nbsp;&nbsp;&nbsp;&nbsp;Matcher&nbsp;m_script;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img id=Codehighlighter1_4321_4608_Open_Image onclick="this.style.display='none'; 
Codehighlighter1_4321_4608_Open_Text.style.display='none'; Codehighlighter1_4321_4608_Closed_Image.style.display='inline'; Codehighlighter1_4321_4608_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_4321_4608_Closed_Image onclick="this.style.display='none'; Codehighlighter1_4321_4608_Closed_Text.style.display='none'; Codehighlighter1_4321_4608_Open_Image.style.display='inline'; Codehighlighter1_4321_4608_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">try</span><span style="COLOR: #000000">&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_4321_4608_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_4321_4608_Open_Text><span style="COLOR: #000000">{&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;regEx_script&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">&lt;script[^&gt;]*?&gt;[\\s\\S]*?&lt;/script&gt;</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;p_script&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;Pattern.compile(regEx_script,Pattern.CASE_INSENSITIVE);&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;m_script&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;p_script.matcher(htmlStr);&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;htmlStr&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;m_script.replaceAll(</span><span style="COLOR: #000000">""</span><span style="COLOR: #000000">);&nbsp;</span><span style="COLOR: #008000">//</span><span style="COLOR: #008000">&nbsp;过滤&nbsp;script&nbsp;标签&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000"><br><img id=Codehighlighter1_4628_4649_Open_Image onclick="this.style.display='none'; Codehighlighter1_4628_4649_Open_Text.style.display='none'; Codehighlighter1_4628_4649_Closed_Image.style.display='inline'; Codehighlighter1_4628_4649_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_4628_4649_Closed_Image onclick="this.style.display='none'; Codehighlighter1_4628_4649_Closed_Text.style.display='none'; Codehighlighter1_4628_4649_Open_Image.style.display='inline'; Codehighlighter1_4628_4649_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif"></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #0000ff">catch</span><span style="COLOR: #000000">(Exception&nbsp;e)&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_4628_4649_Closed_Text><img 
src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_4628_4649_Open_Text><span style="COLOR: #000000">{e.printStackTrace();}</span></span><span style="COLOR: #000000">&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">return</span><span style="COLOR: #000000">&nbsp;htmlStr;</span><span style="COLOR: #008000">//</span><span style="COLOR: #008000">&nbsp;return the plain text&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockEnd.gif"></span><span style="COLOR: #000000">&nbsp;}</span></span><span style="COLOR: #000000"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">After a series of such tag filters we are left with the body text of the page, which is ready for the next step, word segmentation.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Word segmentation<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Chinese word segmentation splits a sequence of Chinese characters into individual words so that a computer can recognize them automatically. There are three main approaches: the first is based on string matching, the second on semantic understanding, and the third on statistics. Because the second and third need large amounts of data to support them, we use string matching.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">String-matching segmentation, also called mechanical segmentation, follows some strategy to match the character string being analyzed against the entries of a "sufficiently large" machine dictionary; if a string is found in the dictionary, the match succeeds and a word has been recognized. By scan direction, the methods divide into forward and backward matching; by which length is tried first, they divide into maximum (longest) and minimum (shortest) matching. The commonly used mechanical segmentation methods are:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Forward maximum matching (scanning from left to right);<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Backward maximum matching (scanning from right to left);<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Minimum segmentation (cutting each sentence into as few words as possible);<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Bidirectional maximum matching (two scans, one from left to right and one from right to left);<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">We use forward maximum matching. The algorithm takes a Chinese sentence S and a maximum word length N as input; n starts at N, or at the length of S if S is shorter:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">1. Look up the first n characters of S in the dictionary. If they match, or if n is 1, go to step 3; otherwise go to step 2;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">2. Set n = n - 1 and go to step 1;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">3. Add the first n characters of S to the segmentation result and remove them from S. If S is now empty, go to step 4; otherwise reset n as above and go to step 1;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">4. The algorithm ends.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Note that at the start of step 3, n greater than 1 means that a dictionary word was matched. When n is 1, we accept the single character as a word by default, so in either case step 3 can split off the first n characters as one word. Stop words also need to be filtered. Stop words are Chinese function words such as 的, 了, 和 and 么, which search engines ignore, so the segmentation result is passed through a stop-word list once more to remove them.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">You may wonder where the segmentation dictionary and the stop-word dictionary come from. The stop-word dictionary is easy: Chinese has a limited number of stop words, so you can take a list from the web and build your own. The segmentation dictionary is harder. There are many well-known Chinese segmentation programs online, but few of them publish their dictionaries, so we provide the dictionaries used in dySE. At run time the dictionary can be loaded into a set, which makes lookups convenient.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Segmentation quality has a decisive effect on search precision. Good segmentation strategies are often combinations of several simple algorithms, so you may also want to try implementing bidirectional maximum matching to improve accuracy. When a phrase is ambiguous, the word frequencies stored in the dictionary can decide which segmentation is better.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Inverted index<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">This section covers the last two steps of the preprocessing module: building the forward index and building the inverted index. With the segmentation results we can build a forward index, which maps each page to its segmentation result, as shown in the figure below:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Figure 2. Forward index<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Figure 3. Inverted index<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">At the start of this article we built the page index, which maps a URL directly to the location of its data in the raw page store. The forward index now maps the URL of a page to the segmentation result of that page. A forward index may seem of little practical use for the queries to come, because a query goes from keywords to pages and the forward index cannot be searched by word. In fact, the purpose of building the forward index is to invert it. Where the forward index maps a page to its words, the inverted index maps a word to the pages that contain it. Figure 3 above shows the inverted index that corresponds to Figure 2.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Next we look at how to derive the inverted index from the forward index. The algorithm is:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">1. For page i, get its word list List;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">2. For each word in List, check whether the inverted index already contains it. If not, insert the word as a new key of the inverted index and add page i to its value; if it is already present, simply add page i to its value;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">3. If any pages remain to be processed, go to step 1; otherwise stop.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The algorithm is not hard to implement; the main decision is the choice of data structure. dySE stores both the forward and the inverted index in a HashMap. The forward index is keyed by the URL string of the page and its values are word lists; the inverted index is keyed by word and its values are lists of URL strings. One possible optimization is to keep two tables that assign integer IDs to words and to URLs, so that the values in the indexes can be lists of integers, which saves space.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">A first experiment<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">We do not yet have a proper query page or results page, but that does not stop us from running a first experiment with the search engine. Once the inverted index is built, we obtain an instance of it in the program, define a search string, look that string up in the inverted index, and return the list of URLs that the word points to.<br><img align=top 
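src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The inversion steps and the lookup experiment described above can be sketched as follows. This is a minimal sketch under stated assumptions, not dySE's actual code: the class InvertedIndexSketch, the methods invert and search, and the example.com URLs are hypothetical, and both indexes are plain HashMap-style maps from strings to string lists as the text describes.<br>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InvertedIndexSketch {
    /** Inverts a forward index (URL to words) into an inverted index (word to URLs). */
    public static Map<String, List<String>> invert(Map<String, List<String>> forward) {
        Map<String, List<String>> inverted = new HashMap<>();
        for (Map.Entry<String, List<String>> page : forward.entrySet()) {
            for (String word : page.getValue()) {
                // Create the URL list the first time a word is seen.
                List<String> urls = inverted.computeIfAbsent(word, k -> new ArrayList<>());
                // A word may occur several times in one page; record the URL only once.
                if (!urls.contains(page.getKey())) {
                    urls.add(page.getKey());
                }
            }
        }
        return inverted;
    }

    /** Returns the URLs recorded for a word, or an empty list if the word is unknown. */
    public static List<String> search(Map<String, List<String>> inverted, String word) {
        return inverted.getOrDefault(word, Collections.emptyList());
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the pages in insertion order for a predictable result.
        Map<String, List<String>> forward = new LinkedHashMap<>();
        forward.put("http://example.com/a", Arrays.asList("search", "engine", "java"));
        forward.put("http://example.com/b", Arrays.asList("java", "crawler", "java"));
        Map<String, List<String>> inverted = invert(forward);
        System.out.println(search(inverted, "java"));    // [http://example.com/a, http://example.com/b]
        System.out.println(search(inverted, "python"));  // []
    }
}
```

Replacing the string lists with lists of integer IDs, as suggested above, would change only the value type of the two maps.<br><img align=top 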
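src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Summary<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Page preprocessing is the core of a search engine. The page index makes it easier to fetch page data from the raw page store, and body-text extraction is the basis for all later steps. Text processing proper begins with segmentation, whose quality and efficiency largely determine the precision of the search engine and deserve close attention. The inverted index is a "word to list of pages" mapping built from the segmentation results. It is the key data structure of web search, and the speed of the search engine depends closely on how the inverted index is built and how it is searched.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">What's next<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">In Part 3 of this series you will learn how to build a web page for entering queries, how to search the inverted index to return the results, and how to rank the resulting pages.<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"></span></div>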
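<div>As a supplement to the word segmentation section, forward maximum matching with stop-word filtering can be sketched as follows. This is a minimal sketch, not dySE's implementation: the class ForwardMaxMatch and the method segment are hypothetical, the dictionary and the stop-word list are assumed to be held in sets, and Latin letters stand in for Chinese characters so that the example stays ASCII-only.<br>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
    /**
     * Forward maximum matching: at each position take the longest dictionary
     * word of at most maxLen characters; if nothing longer matches, a single
     * character is accepted as a word. Stop words are dropped from the result.
     */
    public static List<String> segment(String sentence, Set<String> dict,
                                       Set<String> stopWords, int maxLen) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < sentence.length()) {
            // Start from the maximum length, capped by what is left of the sentence.
            int n = Math.min(maxLen, sentence.length() - pos);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (n > 1 && !dict.contains(sentence.substring(pos, pos + n))) {
                n--;
            }
            String word = sentence.substring(pos, pos + n);
            if (!stopWords.contains(word)) {
                words.add(word);
            }
            pos += n;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("the", "search", "engine", "searchengine"));
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        System.out.println(segment("thesearchengine", dict, stop, 12)); // [searchengine]
        System.out.println(segment("thesearchengine", dict, stop, 6));  // [search, engine]
    }
}
```

The two calls differ only in the maximum word length: with 12 the longest entry wins, while with 6 the same text is split into two shorter dictionary words.</div>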
<br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2011-04-16 20:36</div>]]></description></item><item><title>Implementing a Search Engine in Java, Part 1: The Web Crawler</title><link>http://www.cppblog.com/zzfmars/archive/2011/04/16/144356.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Sat, 16 Apr 2011 12:35:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2011/04/16/144356.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/144356.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2011/04/16/144356.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/144356.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/144356.html</trackback:ping><description><![CDATA[<div style="BORDER-BOTTOM: #cccccc 1px solid; BORDER-LEFT: #cccccc 1px solid; PADDING-BOTTOM: 4px; BACKGROUND-COLOR: #eeeeee; PADDING-LEFT: 4px; WIDTH: 98%; PADDING-RIGHT: 5px; FONT-SIZE: 13px; WORD-BREAK: break-all; BORDER-TOP: #cccccc 1px solid; BORDER-RIGHT: #cccccc 1px solid; PADDING-TOP: 4px"><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><span style="COLOR: #000000">Imagine how cool it would be to write your own search engine: type a keyword, click Search, and get the results you want. What else could it do? Perhaps your own website needs a site search, or you want to search the documents on your hard disk. Best of all, can you not picture all those IT companies beckoning to you? If that appeals to you, let's go!<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">First, why Java rather than C/C++ or another language? Java ships with many basic packages and classes for network programming, such as the URL class, the InetAddress class, and regular expressions. They give our search engine a solid foundation and let us concentrate on the engine itself instead of being distracted by implementing these building blocks.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">This three-part series shows step by step how to design and implement a search engine. Part 1 explains how a search engine works and how it is structured, and then implements its first module, the web crawler, which collects pages. Part 2 covers the preprocessing module, that is, how the collected pages are cleaned up, segmented into words, and indexed. Part 3 covers the query service: building the query page, returning the results, and implementing page snapshots.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Overall structure of dySE<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Before looking at how the modules are implemented, you need to understand the overall structure of dySE and how data flows through it. The three parts of the search engine are in fact independent and run separately; they are connected mainly in that the output of each part is the raw input of the next. The figure below shows the relationship:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Figure 1. The three-stage workflow of a search engine<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Before describing the overall structure, we borrow the narrative approach of the book Computer Networking: A Top-Down Approach Featuring the Internet and walk through the workflow of a search engine from the point of view of an ordinary user.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">A top-down description of how the search engine runs:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The user submits a query word or phrase P through the browser, and the search engine returns a list L of matching pages;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">This raises two questions: how is the query matched, and where does the list of pages come from and how is it ordered? The segmenter splits the query P into words &lt;p1,p2 &#8230; pn&gt; and removes the stop words (such as 的, 了 and 啊). The inverted index maintained by the system tells us which pages each word pi occurs in, and the set of pages in which all of &lt;p1,p2 &#8230; pn&gt; occur serves as the initial result. The relevance of each page in that set to the query words is then computed to obtain a page ranking (Page Rank), and sorting by rank gives the final list of pages;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Taking the segmenter and the ranking formula as given, where do the inverted index and the raw page set come from? As the data flow above shows, the raw pages are fetched by the crawler (spider) and saved locally. The inverted index, a mapping from words to pages, is built from the forward index, a mapping from pages to words obtained by analyzing the content of each page and segmenting it. Inverting the forward index gives the inverted index;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">What does page analysis actually do? The raw pages collected by the crawler contain much besides the text, such as HTML forms and junk like advertisements. Page analysis removes these and extracts the body text as the base data for later steps.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">From this analysis we arrive at the overall structure of the search engine shown below:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Figure 2. Overall structure of the search engine<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The crawler fetches pages from the Internet and stores them locally as the raw page store. The page analyzer then extracts the main content of each page and hands it to the segmenter, and the indexer builds the forward and inverted indexes from the segmented result, which gives us the index database. When a user submits a query, the segmenter splits the query, the retriever looks the words up in the index database, and the results are returned to the user.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Whatever their scale, search engines are built from these same main parts, with no great differences; what makes one better than another is mainly how each part is implemented internally.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">With this overview of the search engine in place, we turn to the design and implementation of the crawler module in dySE.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Design of the Spider<br><img 
align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Collecting pages is like traversing a graph: the pages are the nodes and the hyperlinks in them are the edges. Following the hyperlinks of one page yields the addresses of other pages, which can be collected in turn. Graph traversal can be breadth-first or depth-first, and so can page collection. In summary, the Spider collects pages as follows: take a target page address from the initial URL set, receive the page data over a network connection, add the data to the page store, parse the page for further URLs, and put those into the set of unvisited URLs for later collection. The figure below shows the process:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Figure 3. Spider workflow<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Implementation of the Spider<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The page collector: Gather<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Given a URL, the page collector fetches the page data for that URL. It uses Java's URLConnection class to open a network connection to the page and reads the data through an I/O stream. BufferedReader supplies a buffer that makes reading more efficient, along with the line-oriented readLine() method. The code is as follows (exception handling omitted):<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Listing 1. Fetching page data<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">URL&nbsp;url&nbsp;</span><span style="COLOR: 
#000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;URL("http://www.xxx.com");&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">URLConnection&nbsp;conn&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;url.openConnection();&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">BufferedReader&nbsp;reader&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;BufferedReader(</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;InputStreamReader(conn.getInputStream()));&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">StringBuilder&nbsp;document&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;StringBuilder();&nbsp;</span><span style="COLOR: #008000">//</span><span style="COLOR: #008000">&nbsp;assumption: buffer that collects the page text</span><span style="COLOR: #000000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">String&nbsp;line&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">null</span><span style="COLOR: #000000">;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"></span><span style="COLOR: #0000ff">while</span><span style="COLOR: #000000">((line&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;reader.readLine())&nbsp;</span><span style="COLOR: #000000">!=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">null</span><span style="COLOR: #000000">)&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;document.append(line&nbsp;</span><span style="COLOR: #000000">+</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">\n</span><span style="COLOR: 
#000000">"</span><span style="COLOR: #000000">);&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The advantage of Java here is that you do not have to handle the low-level connection yourself. Readers who enjoy or are proficient in Java network programming can skip the approach above and implement the URL class and the related operations themselves, which is good practice.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Page processing<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Each collected page is processed in two ways. It is stored in the page store as raw data for later steps, and it is parsed to extract the URLs it links to, which are put into the URL pool to await collection.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Pages must be saved in a fixed format so that the data can be batch-processed later. The storage format described here is simplified from the one used by Peking University's Tianwang search engine:<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The page store consists of records, each holding the data of one page; records are appended sequentially;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">A record consists of a header, the data, and blank lines, in this order: header + blank line + data + blank line;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The header consists of several attributes: version, date, IP address, and data length. Each is written as an attribute name and an attribute value separated by a colon, one attribute per line;<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The data is the page data itself.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">The collection date is recorded because the content of many sites changes constantly, for example the home pages of large portals. Page data that was not crawled the same day may well be out of date, so the date is needed to identify it.<br><img align=top
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">URL extraction takes two steps: first the URLs are detected, then they are normalized. Two steps are needed mainly because some sites use relative paths in their links, which cause errors if they are not normalized. Detection is done mainly with regular expressions: define a string as the pattern, compile it with Pattern, and then use the Matcher class to find the matching strings. The code is as follows:<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">清单&nbsp;</span><span style="COLOR: #000000">2</span><span style="COLOR: #000000">.&nbsp;URL&nbsp;识别<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img id=Codehighlighter1_3045_3887_Open_Image onclick="this.style.display='none'; Codehighlighter1_3045_3887_Open_Text.style.display='none'; Codehighlighter1_3045_3887_Closed_Image.style.display='inline'; Codehighlighter1_3045_3887_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_3045_3887_Closed_Image onclick="this.style.display='none'; Codehighlighter1_3045_3887_Closed_Text.style.display='none'; Codehighlighter1_3045_3887_Open_Image.style.display='inline'; Codehighlighter1_3045_3887_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedBlock.gif"></span><span style="COLOR: #0000ff">public</span><span style="COLOR: #000000">&nbsp;ArrayList</span><span style="COLOR: #000000">&lt;</span><span style="COLOR: #000000">URL</span><span style="COLOR: #000000">&gt;</span><span style="COLOR: #000000">&nbsp;urlDetector(String&nbsp;htmlDoc)</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_3045_3887_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_3045_3887_Open_Text><span style="COLOR: #000000">{<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">final</span><span style="COLOR: 
#000000">&nbsp;String&nbsp;patternString&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">&lt;[a|A]\\s+href=([^&gt;]*\\s*&gt;)</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;Pattern&nbsp;pattern&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;Pattern.compile(patternString,Pattern.CASE_INSENSITIVE);&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;ArrayList</span><span style="COLOR: #000000">&lt;</span><span style="COLOR: #000000">URL</span><span style="COLOR: #000000">&gt;</span><span style="COLOR: #000000">&nbsp;allURLs&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;ArrayList</span><span style="COLOR: #000000">&lt;</span><span style="COLOR: #000000">URL</span><span style="COLOR: #000000">&gt;</span><span style="COLOR: #000000">();<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;Matcher&nbsp;matcher&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;pattern.matcher(htmlDoc);<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;String&nbsp;tempURL;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000">//</span><span style="COLOR: #008000">初次匹配到的url是形如：&lt;a&nbsp;href="</span><span style="COLOR: #008000; TEXT-DECORATION: underline">http://bbs.life.xxx.com.cn/</span><span style="COLOR: 
#008000">"&nbsp;target="_blank"&gt;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000">//</span><span style="COLOR: #008000">为此，需要进行下一步的处理，把真正的url抽取出来，<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #008000">//</span><span style="COLOR: #008000">可以对于前两个"之间的部分进行记录得到url</span><span style="COLOR: #008000"><br><img id=Codehighlighter1_3485_3861_Open_Image onclick="this.style.display='none'; Codehighlighter1_3485_3861_Open_Text.style.display='none'; Codehighlighter1_3485_3861_Closed_Image.style.display='inline'; Codehighlighter1_3485_3861_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_3485_3861_Closed_Image onclick="this.style.display='none'; Codehighlighter1_3485_3861_Closed_Text.style.display='none'; Codehighlighter1_3485_3861_Open_Image.style.display='inline'; Codehighlighter1_3485_3861_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif"></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">while</span><span style="COLOR: #000000">(matcher.find())</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_3485_3861_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_3485_3861_Open_Text><span style="COLOR: #000000">{<br><img id=Codehighlighter1_3499_3778_Open_Image onclick="this.style.display='none'; Codehighlighter1_3499_3778_Open_Text.style.display='none'; Codehighlighter1_3499_3778_Closed_Image.style.display='inline'; 
Codehighlighter1_3499_3778_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_3499_3778_Closed_Image onclick="this.style.display='none'; Codehighlighter1_3499_3778_Closed_Text.style.display='none'; Codehighlighter1_3499_3778_Open_Image.style.display='inline'; Codehighlighter1_3499_3778_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">try</span><span style="COLOR: #000000">&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_3499_3778_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_3499_3778_Open_Text><span style="COLOR: #000000">{<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tempURL&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;matcher.group();&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tempURL&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;tempURL.substring(tempURL.indexOf(</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">\</span><span style="COLOR: #000000">""</span><span style="COLOR: #000000">)+1);&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #000000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif"></span><span 
style="COLOR: #000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">if</span><span style="COLOR: #000000">(</span><span style="COLOR: #000000">!</span><span style="COLOR: #000000">tempURL.contains(</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">\</span><span style="COLOR: #000000">""</span><span style="COLOR: #000000">))</span><span style="COLOR: #000000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif"></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">continue</span><span style="COLOR: #000000">;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tempURL&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;tempURL.substring(</span><span style="COLOR: #000000">0</span><span style="COLOR: #000000">,&nbsp;tempURL.indexOf(</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">\</span><span style="COLOR: #000000">""</span><span style="COLOR: #000000">));&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #000000"><br><img id=Codehighlighter1_3812_3855_Open_Image onclick="this.style.display='none'; Codehighlighter1_3812_3855_Open_Text.style.display='none'; Codehighlighter1_3812_3855_Closed_Image.style.display='inline'; Codehighlighter1_3812_3855_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_3812_3855_Closed_Image onclick="this.style.display='none'; Codehighlighter1_3812_3855_Closed_Text.style.display='none'; Codehighlighter1_3812_3855_Open_Image.style.display='inline'; 
Codehighlighter1_3812_3855_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif"></span><span style="COLOR: #000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">catch</span><span style="COLOR: #000000">&nbsp;(MalformedURLException&nbsp;e)&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_3812_3855_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_3812_3855_Open_Text><span style="COLOR: #000000">{<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;e.printStackTrace();<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #000000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif">&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #000000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">return</span><span style="COLOR: #000000">&nbsp;allURLs;&nbsp;&nbsp;&nbsp;&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockEnd.gif">}</span></span><span style="COLOR: #000000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">按照&#8220;</span><span style="COLOR: #000000">&lt;</span><span style="COLOR: #000000">[a</span><span style="COLOR: #000000">|</span><span style="COLOR: 
#000000">A]\\s</span><span style="COLOR: #000000">+</span><span style="COLOR: #000000">href</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">([</span><span style="COLOR: #000000">^&gt;</span><span style="COLOR: #000000">]</span><span style="COLOR: #000000">*</span><span style="COLOR: #000000">\\s</span><span style="COLOR: #000000">*&gt;</span><span style="COLOR: #000000">)&#8221;这个正则表达式可以匹配出&nbsp;URL&nbsp;所在的整个标签，形如&#8220;</span><span style="COLOR: #000000">&lt;</span><span style="COLOR: #000000">a&nbsp;href</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">http://bbs.life.xxx.com.cn/</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">&nbsp;target</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">_blank</span><span style="COLOR: #000000">"</span><span style="COLOR: #000000">&gt;</span><span style="COLOR: #000000">&#8221;，所以在循环获得整个标签之后，需要进一步提取出真正的&nbsp;URL，我们可以通过截取标签中前两个引号中间的内容来获得这段内容。如此之后，我们可以得到一个初步的属于该网页的&nbsp;URL&nbsp;集合。<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">接下来我们进行第二步操作，URL&nbsp;的整理，即对之前获得的整个页面中&nbsp;URL&nbsp;集合进行筛选和整合。整合主要是针对网页地址是相对链接的部分，由于我们可以很容易的获得当前网页的&nbsp;URL，所以，相对链接只需要在当前网页的&nbsp;URL&nbsp;上添加相对链接的字段即可组成完整的&nbsp;URL，从而完成整合。另一方面，在页面中包含的全面&nbsp;URL&nbsp;中，有一些网页比如广告网页是我们不想爬取的，或者不重要的，这里我们主要针对于页面中的广告进行一个简单处理。一般网站的广告连接都有相应的显示表达，比如连接中含有&#8220;ad&#8221;等表达时，可以将该链接的优先级降低，这样就可以一定程度的避免广告链接的爬取。<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">经过这两步操作时候，可以把该网页的收集到的&nbsp;URL&nbsp;放入&nbsp;URL&nbsp;池中，接下来我们处理爬虫的&nbsp;URL&nbsp;的派分问题。<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Dispatcher&nbsp;分配器<br><img align=top 
src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">分配器管理&nbsp;URL，负责保存着&nbsp;URL&nbsp;池并且在&nbsp;Gather&nbsp;取得某一个网页之后派分新的&nbsp;URL，还要避免网页的重复收集。分配器采用设计模式中的单例模式编码，负责提供给&nbsp;Gather&nbsp;新的&nbsp;URL，因为涉及到之后的多线程改写，所以单例模式显得尤为重要。<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">重复收集是指物理上存在的一个网页，在没有更新的前提下，被&nbsp;Gather&nbsp;重复访问，造成资源的浪费，主要原因是没有清楚的记录已经访问的&nbsp;URL&nbsp;而无法辨别。所以，Dispatcher&nbsp;维护两个列表&nbsp;,&#8220;已访问表&#8221;，和&#8220;未访问表&#8221;。每个&nbsp;URL&nbsp;对应的页面被抓取之后，该&nbsp;URL&nbsp;放入已访问表中，而从该页面提取出来的&nbsp;URL&nbsp;则放入未访问表中；当&nbsp;Gather&nbsp;向&nbsp;Dispatcher&nbsp;请求&nbsp;URL&nbsp;的时候，先验证该&nbsp;URL&nbsp;是否在已访问表中，然后再给&nbsp;Gather&nbsp;进行作业。<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Spider&nbsp;启动多个&nbsp;Gather&nbsp;线程<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">现在&nbsp;Internet&nbsp;中的网页数量数以亿计，而单独的一个&nbsp;Gather&nbsp;来进行网页收集显然效率不足，所以我们需要利用多线程的方法来提高效率。Gather&nbsp;的功能是收集网页，我们可以通过&nbsp;Spider&nbsp;类来开启多个&nbsp;Gather&nbsp;线程，从而达到多线程的目的。代码如下：<br><img id=Codehighlighter1_4977_5008_Open_Image onclick="this.style.display='none'; Codehighlighter1_4977_5008_Open_Text.style.display='none'; Codehighlighter1_4977_5008_Closed_Image.style.display='inline'; Codehighlighter1_4977_5008_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_4977_5008_Closed_Image onclick="this.style.display='none'; Codehighlighter1_4977_5008_Closed_Text.style.display='none'; Codehighlighter1_4977_5008_Open_Image.style.display='inline'; Codehighlighter1_4977_5008_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedBlock.gif"></span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px 
solid" id=Codehighlighter1_4977_5008_Closed_Text>/**&nbsp;*/</span><span id=Codehighlighter1_4977_5008_Open_Text><span style="COLOR: #008000">/**</span><span style="COLOR: #008000">&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">*&nbsp;启动线程&nbsp;gather，然后开始收集网页资料<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockEnd.gif"></span><span style="COLOR: #008000">*/</span></span><span style="COLOR: #000000">&nbsp;<br><img id=Codehighlighter1_5031_5210_Open_Image onclick="this.style.display='none'; Codehighlighter1_5031_5210_Open_Text.style.display='none'; Codehighlighter1_5031_5210_Closed_Image.style.display='inline'; Codehighlighter1_5031_5210_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_5031_5210_Closed_Image onclick="this.style.display='none'; Codehighlighter1_5031_5210_Closed_Text.style.display='none'; Codehighlighter1_5031_5210_Open_Image.style.display='inline'; Codehighlighter1_5031_5210_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedBlock.gif"></span><span style="COLOR: #0000ff">public</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">void</span><span style="COLOR: #000000">&nbsp;start()&nbsp;</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_5031_5210_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_5031_5210_Open_Text><span style="COLOR: #000000">{&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;Dispatcher&nbsp;disp&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: 
#000000">&nbsp;Dispatcher.getInstance();&nbsp;<br><img id=Codehighlighter1_5121_5208_Open_Image onclick="this.style.display='none'; Codehighlighter1_5121_5208_Open_Text.style.display='none'; Codehighlighter1_5121_5208_Closed_Image.style.display='inline'; Codehighlighter1_5121_5208_Closed_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockStart.gif"><img style="DISPLAY: none" id=Codehighlighter1_5121_5208_Closed_Image onclick="this.style.display='none'; Codehighlighter1_5121_5208_Closed_Text.style.display='none'; Codehighlighter1_5121_5208_Open_Image.style.display='inline'; Codehighlighter1_5121_5208_Open_Text.style.display='inline';" align=top src="http://www.cppblog.com/Images/OutliningIndicators/ContractedSubBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="COLOR: #0000ff">for</span><span style="COLOR: #000000">(</span><span style="COLOR: #0000ff">int</span><span style="COLOR: #000000">&nbsp;i&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #000000">0</span><span style="COLOR: #000000">;&nbsp;i&nbsp;</span><span style="COLOR: #000000">&lt;</span><span style="COLOR: #000000">&nbsp;gatherNum;&nbsp;i</span><span style="COLOR: #000000">++</span><span style="COLOR: #000000">)</span><span style="BORDER-BOTTOM: #808080 1px solid; BORDER-LEFT: #808080 1px solid; BACKGROUND-COLOR: #ffffff; DISPLAY: none; BORDER-TOP: #808080 1px solid; BORDER-RIGHT: #808080 1px solid" id=Codehighlighter1_5121_5208_Closed_Text><img src="http://www.cppblog.com/Images/dot.gif"></span><span id=Codehighlighter1_5121_5208_Open_Text><span style="COLOR: #000000">{&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Thread&nbsp;gather&nbsp;</span><span style="COLOR: #000000">=</span><span style="COLOR: #000000">&nbsp;</span><span style="COLOR: #0000ff">new</span><span style="COLOR: 
#000000">&nbsp;Thread(</span><span style="COLOR: #0000ff">new</span><span style="COLOR: #000000">&nbsp;Gather(disp));&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/InBlock.gif">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;gather.start();&nbsp;<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedSubBlockEnd.gif">&nbsp;&nbsp;&nbsp;&nbsp;}</span></span><span style="COLOR: #000000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/ExpandedBlockEnd.gif">}</span></span><span style="COLOR: #000000"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif"><br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">在开启线程之后，网页收集器开始作业的运作，并在一个作业完成之后，向&nbsp;Dispatcher&nbsp;申请下一个作业，因为有了多线程的&nbsp;Gather，为了避免线程不安全，需要对&nbsp;Dispatcher&nbsp;进行互斥访问，在其函数之中添加&nbsp;</span><span style="COLOR: #0000ff">synchronized</span><span style="COLOR: #000000">&nbsp;关键词，从而达到线程的安全访问。<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">回页首<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">小结<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">Spider&nbsp;是整个搜索引擎的基础，为后续的操作提供原始网页资料，所以了解&nbsp;Spider&nbsp;的编写以及网页库的组成结构为后续预处理模块打下基础。同时&nbsp;Spider&nbsp;稍加修改之后也可以单独用于某类具体信息的搜集，比如某个网站的图片爬取等。<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">回页首<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">后续内容<br><img align=top src="http://www.cppblog.com/Images/OutliningIndicators/None.gif">在本系列的第&nbsp;</span><span style="COLOR: #000000">2</span><span style="COLOR: #000000">&nbsp;部分中，您将了解到爬虫获取的网页库如何被预处理模块逐步提取内容信息，通过分词并建成倒排索引；而在第&nbsp;</span><span style="COLOR: #000000">3</span><span style="COLOR: #000000">&nbsp;部分中，您将了解到，如何编写网页来提供查询服务，并且如何显示的返回的结果和完成快照的功能。</span></div>
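<p>The URL cleanup step described in this article turns relative links into complete URLs. The following is a minimal sketch of that step, assuming the standard java.net.URI class is used for the resolution. The class name UrlResolver, the method name resolve, and the sample addresses are illustrative choices and do not come from the original article.</p>

```java
import java.net.URI;

// Hypothetical helper (not part of the original article): resolves a link
// found in a page against the URL of that page.
public class UrlResolver {

    /**
     * Returns the absolute form of link, resolved against pageUrl,
     * or null when the link is malformed or is not an http(s) link.
     */
    public static String resolve(String pageUrl, String link) {
        try {
            URI base = URI.create(pageUrl);
            // resolve() handles "x.html", "../x.html", "/x.html" and absolute links
            URI resolved = base.resolve(link.trim()).normalize();
            String scheme = resolved.getScheme();
            boolean isHttp = "http".equalsIgnoreCase(scheme)
                    || "https".equalsIgnoreCase(scheme);
            return isHttp ? resolved.toString() : null;
        } catch (IllegalArgumentException e) {
            // URI.create and resolve(String) reject strings that are not valid URIs
            return null;
        }
    }

    public static void main(String[] args) {
        String page = "http://www.xxx.com/news/index.html"; // sample page URL
        System.out.println(resolve(page, "today.html"));
        System.out.println(resolve(page, "../about.html"));
        System.out.println(resolve(page, "/ad/banner.html"));
        System.out.println(resolve(page, "http://bbs.life.xxx.com.cn/"));
        System.out.println(resolve(page, "javascript:void(0)")); // not http(s), so null
    }
}
```

<p>Resolving against the page URL, rather than concatenating strings, also covers links that start with "/" or "../". Links that come back as null can simply be skipped before the rest are handed to the URL pool.</p>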
<img src ="http://www.cppblog.com/zzfmars/aggbug/144356.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2011-04-16 20:35 <a href="http://www.cppblog.com/zzfmars/archive/2011/04/16/144356.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Downloading a web page with Java</title><link>http://www.cppblog.com/zzfmars/archive/2011/04/13/144148.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Wed, 13 Apr 2011 12:42:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2011/04/13/144148.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/144148.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2011/04/13/144148.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/144148.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/144148.html</trackback:ping><description><![CDATA[<p>&nbsp;</p>
<div style="BORDER-BOTTOM: #cccccc 1px solid; BORDER-LEFT: #cccccc 1px solid; PADDING-BOTTOM: 4px; BACKGROUND-COLOR: #eeeeee; PADDING-LEFT: 4px; WIDTH: 98%; PADDING-RIGHT: 5px; FONT-SIZE: 13px; WORD-BREAK: break-all; BORDER-TOP: #cccccc 1px solid; BORDER-RIGHT: #cccccc 1px solid; PADDING-TOP: 4px">// Fetch the source of a given web page<br>
package kevin;<br>
<br>
import java.io.*; // Java input/output<br>
import java.net.*; // Java networking<br>
<br>
public class fei {<br>
&nbsp;&nbsp;&nbsp;&nbsp;public static void main(String[] args) throws IOException {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;URL url = new URL("http://www.baidu.com"); // create a URL instance<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// Decode as UTF-8 (an assumption about the page) instead of the platform default;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;// try-with-resources closes the stream when reading is done<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;String s;<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;while ((s = br.readLine()) != null) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;System.out.println(s); // readLine() strips the line break, so print it back<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br>
&nbsp;&nbsp;&nbsp;&nbsp;}<br>
}<br>
</div>
<p>&nbsp;</p>
<img src ="http://www.cppblog.com/zzfmars/aggbug/144148.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2011-04-13 20:42 <a href="http://www.cppblog.com/zzfmars/archive/2011/04/13/144148.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A detailed guide to setting up Apache + PHP + MySQL on XP</title><link>http://www.cppblog.com/zzfmars/archive/2011/04/10/143865.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Sun, 10 Apr 2011 04:14:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2011/04/10/143865.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/143865.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2011/04/10/143865.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/143865.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/143865.html</trackback:ping><description><![CDATA[<a href="http://www.php100.com/html/webkaifa/apache/2009/0418/1188.html">http://www.php100.com/html/webkaifa/apache/2009/0418/1188.html</a>
<img src ="http://www.cppblog.com/zzfmars/aggbug/143865.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2011-04-10 12:14 <a href="http://www.cppblog.com/zzfmars/archive/2011/04/10/143865.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>MonoDevelop</title><link>http://www.cppblog.com/zzfmars/archive/2010/10/22/130844.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Thu, 21 Oct 2010 23:29:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2010/10/22/130844.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/130844.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2010/10/22/130844.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/130844.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/130844.html</trackback:ping><description><![CDATA[<span class=Apple-style-span style="WORD-SPACING: 0px; FONT: medium Simsun; TEXT-TRANSFORM: none; COLOR: rgb(0,0,0); TEXT-INDENT: 0px; WHITE-SPACE: normal; LETTER-SPACING: normal; BORDER-COLLAPSE: separate; orphans: 2; widows: 2; webkit-border-horizontal-spacing: 0px; webkit-border-vertical-spacing: 0px; webkit-text-decorations-in-effect: none; webkit-text-size-adjust: auto; webkit-text-stroke-width: 0px"><span class=Apple-style-span style="FONT-SIZE: 14px; LINE-HEIGHT: 24px; FONT-FAMILY: arial, 宋体, sans-serif">MonoDevelop supports development in C# and other .NET languages, letting developers build desktop software and ASP.NET web applications quickly on Linux and Mac OS X. It also makes it easy to port .NET applications written in Visual Studio to Linux and Mac OS X, so developers only need to maintain a single code base, because GTK# is cross-platform.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Some people may be wary of Microsoft's .NET environment, yet GNOME, an open desktop environment, has long included Mono, the open-source implementation of the .NET runtime, in its default support.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
GNOME's &#8220;Tomboy Notes&#8221; is written in C#, as is F-Spot, the photo manager from Novell, and so is the well-known indexing and search tool Beagle.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
If Mono attracts more developers, surely that is a good thing?
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
As for the latest release, MonoDevelop 1.0, it is a very capable integrated development environment with the following features:
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Code completion.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Parameter information.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Information tooltips.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
On-the-fly error checking.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Code navigation.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Smart indexing.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Automatic XML tag generation.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Code templates.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Class and member selector.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Unit testing.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Packaging and deployment.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Version control.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Visual Studio support.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Internationalization support.
<div class=spctrl style="OVERFLOW-Y: hidden; FONT-SIZE: 12px; OVERFLOW-X: hidden; LINE-HEIGHT: 14px; HEIGHT: 14px"></div>
Best of all, if you use C#, you also get an integrated GTK# visual designer. So far this is the only IDE for the GNOME environment with an integrated visual designer; even Anjuta does not offer one.<br><br>Official website: <a href="http://monodevelop.com/">http://monodevelop.com/</a></span></span>
<img src ="http://www.cppblog.com/zzfmars/aggbug/130844.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2010-10-22 07:29 <a href="http://www.cppblog.com/zzfmars/archive/2010/10/22/130844.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>heritrix1.14.4</title><link>http://www.cppblog.com/zzfmars/archive/2010/10/18/130323.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Mon, 18 Oct 2010 12:31:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2010/10/18/130323.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/130323.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2010/10/18/130323.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/130323.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/130323.html</trackback:ping><description><![CDATA[What is the most practical way to use it?<br>---------------------------------------------- 
<img src ="http://www.cppblog.com/zzfmars/aggbug/130323.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2010-10-18 20:31 <a href="http://www.cppblog.com/zzfmars/archive/2010/10/18/130323.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>tomcatPlugin download link</title><link>http://www.cppblog.com/zzfmars/archive/2010/10/17/130188.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Sun, 17 Oct 2010 02:04:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2010/10/17/130188.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/130188.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2010/10/17/130188.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/130188.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/130188.html</trackback:ping><description><![CDATA[<p>I am teaching myself Java and J2EE and needed the tomcatPlugin plug-in for Eclipse, which connects Eclipse to Tomcat.</p>
<p>Many guides give the download address as http://www.sysdeo.com/eclipse/tomcatPlugn . Annoyingly, that URL now points to <a href="http://www.sqli.com/"><u><font color=#800080>www.sqli.com</font></u></a>, and because my foreign-language skills are poor I could not find a download there.</p>
<p>Searching for "tomcatPluginV32 download" turned up a copy on CSDN. I really dislike that downloading open-source software from CSDN requires logging in and spending points. Most of the other results were the same dead link as above.</p>
<p>In the end I searched for just "tomcatPlugin" and found the official site: <a href="http://www.eclipsetotale.com/tomcatPlugin.html"><u><font color=#800080>http://www.eclipsetotale.com/tomcatPlugin.html</font></u></a></p>
<p>I also found the official download link: <a href="http://www.eclipsetotale.com/tomcatPlugin/tomcatPluginV321.zip"><u><font color=#0000ff>http://www.eclipsetotale.com/tomcatPlugin/tomcatPluginV321.zip</font></u></a></p>
<p>&nbsp;</p>
<img src ="http://www.cppblog.com/zzfmars/aggbug/130188.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2010-10-17 10:04 <a href="http://www.cppblog.com/zzfmars/archive/2010/10/17/130188.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>How do you configure Heritrix-1.14.1?</title><link>http://www.cppblog.com/zzfmars/archive/2010/10/07/128956.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Thu, 07 Oct 2010 14:24:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2010/10/07/128956.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/128956.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2010/10/07/128956.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/128956.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/128956.html</trackback:ping><description><![CDATA[<span class=Apple-style-span style="WORD-SPACING: 0px; FONT: medium Simsun; TEXT-TRANSFORM: none; COLOR: rgb(0,0,0); TEXT-INDENT: 0px; WHITE-SPACE: normal; LETTER-SPACING: normal; BORDER-COLLAPSE: separate; orphans: 2; widows: 2; webkit-border-horizontal-spacing: 0px; webkit-border-vertical-spacing: 0px; webkit-text-decorations-in-effect: none; webkit-text-size-adjust: auto; webkit-text-stroke-width: 0px"><span class=Apple-style-span style="FONT-SIZE: 14px; LINE-HEIGHT: 22px; FONT-FAMILY: Arial">
<pre style="PADDING-RIGHT: 0px; PADDING-LEFT: 0px; FONT-WEIGHT: normal; FONT-SIZE: 14px; PADDING-BOTTOM: 0px; MARGIN: 0px; LINE-HEIGHT: 22px; PADDING-TOP: 0px; ZOOM: 1; FONT-FAMILY: Arial; WORD-WRAP: break-word">1. Download heritrix-1.14.1.zip and heritrix-1.14.1.src and unpack both, then unpack heritrix-1.14.1.jar.
2. In Eclipse, create a Java project named, for example, heritrix. Go to the project directory (mine is F:\workspace\myeclipse\heritrix) and delete the src folder.
3. Copy lib, webapps and heritrix-1.14.1 from the unpacked heritrix-1.14.1.zip folder into F:\workspace\myeclipse\heritrix, then delete the org and st folders under F:\workspace\myeclipse\heritrix\heritrix-1.14.1.
Copy the org and st folders from heritrix-1.14.1\src\java in the unpacked heritrix-1.14.1.src folder into F:\workspace\myeclipse\heritrix\heritrix-1.14.1\
4. Rename the heritrix-1.14.1 folder to src.
5. In src\heritrix.properties, change heritrix.cmdline.admin = to heritrix.cmdline.admin = admin:sun. This sets your user name and password; any values will do, separated by a colon.
6. Refresh the project and add every JAR under lib to it: right-click the heritrix project, choose Properties, Java Build Path, Libraries, Add JARs, and select all the JARs under the project's lib folder.
7. Run the org.archive.crawler.Heritrix class, then enter <a style="COLOR: rgb(38,28,220)" href="http://localhost:8080/" target=_blank><u>http://localhost:8080/</u></a> in the browser's address bar.
Done! It really is that simple. </pre>
</span></span>Reposted from: <a href="http://zhidao.baidu.com/question/72080439.html">http://zhidao.baidu.com/question/72080439.html</a> 
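<p>Step 5 is easy to get wrong, so here is a minimal Java sketch that checks the setting before you start Heritrix. It assumes only the user:password shape described in that step; the class and method names are made up for illustration and are not part of Heritrix.</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper, not part of Heritrix: checks the admin setting from step 5. */
final class AdminSettingCheck {

    /** Returns {user, password} if the value looks like user:password, otherwise null. */
    static String[] parseAdmin(Properties props) {
        String value = props.getProperty("heritrix.cmdline.admin", "").trim();
        int colon = value.indexOf(':');
        // Both the user name and the password must be non-empty.
        if (colon <= 0 || colon == value.length() - 1) {
            return null;
        }
        return new String[] { value.substring(0, colon), value.substring(colon + 1) };
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // In a real project you would load src\heritrix.properties instead of this string.
        props.load(new StringReader("heritrix.cmdline.admin = admin:sun\n"));
        String[] admin = parseAdmin(props);
        System.out.println(admin == null ? "admin login not set" : "admin user = " + admin[0]);
    }
}
```

<p>If the value is still empty, as it is in the file before the edit, parseAdmin returns null and you know step 5 was skipped.</p>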
<img src ="http://www.cppblog.com/zzfmars/aggbug/128956.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2010-10-07 22:24 <a href="http://www.cppblog.com/zzfmars/archive/2010/10/07/128956.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>A quick note</title><link>http://www.cppblog.com/zzfmars/archive/2010/10/07/128928.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Thu, 07 Oct 2010 08:03:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2010/10/07/128928.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/128928.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2010/10/07/128928.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/128928.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/128928.html</trackback:ping><description><![CDATA[To make things easy to look up later, I will record my search process in question-and-answer form.<br><br><br><br><br><br>---------------------------------------
<img src ="http://www.cppblog.com/zzfmars/aggbug/128928.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2010-10-07 16:03 <a href="http://www.cppblog.com/zzfmars/archive/2010/10/07/128928.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Spider overview</title><link>http://www.cppblog.com/zzfmars/archive/2010/09/16/126793.html</link><dc:creator>Kevin_Zhang</dc:creator><author>Kevin_Zhang</author><pubDate>Thu, 16 Sep 2010 11:29:00 GMT</pubDate><guid>http://www.cppblog.com/zzfmars/archive/2010/09/16/126793.html</guid><wfw:comment>http://www.cppblog.com/zzfmars/comments/126793.html</wfw:comment><comments>http://www.cppblog.com/zzfmars/archive/2010/09/16/126793.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/zzfmars/comments/commentRss/126793.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/zzfmars/services/trackbacks/126793.html</trackback:ping><description><![CDATA[<h2>Spider overview</h2>
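<p style="FONT-SIZE: 14pt">A spider is a web crawler. The term has a narrow and a broad definition. In the narrow sense it is a software program that follows the standard HTTP protocol and traverses the World Wide Web information space by means of hyperlinks and Web document retrieval. In the broad sense, any software that can retrieve Web documents over HTTP is called a web crawler.</p>
<p style="FONT-SIZE: 14pt">A spider is a powerful program that fetches web pages automatically. It downloads pages from the World Wide Web on behalf of a search engine and is one of the engine's key components. It visits a site by requesting the HTML documents on it. It traverses the Web, moving from one site to the next, building an index automatically and adding it to the page database. When a crawler enters a hypertext document, it uses the markup structure of HTML to search for information and to obtain the URLs that point to other hypertext documents, so it can crawl and search the network automatically without any user intervention.</p>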
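<p style="FONT-SIZE: 14pt">To make "using the markup structure of HTML to obtain URLs" concrete, here is a minimal Java sketch. It is an illustration under stated assumptions, not how any particular crawler does it: the class name LinkExtractor is made up, and the regular expression is deliberately naive (a production crawler would use a real HTML parser).</p>

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: pulls href values out of HTML and resolves them to absolute URLs. */
final class LinkExtractor {

    // Naive pattern for href="..." or href='...'; it does not understand comments or scripts.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Relative links such as "page2.html" are resolved against the page's own URL.
                URI target = base.resolve(m.group(1).trim());
                String scheme = target.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(target.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed link: skip it rather than abort the whole page.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"page2.html\">next</a> <a href='/about'>about</a>"
                + " <a href=\"mailto:x@example.com\">mail</a>";
        System.out.println(extract("http://example.com/docs/index.html", html));
    }
}
```

<p style="FONT-SIZE: 14pt">Only http and https targets are kept, so the mailto: link in the example is dropped. The URLs returned here are what a spider would feed into its queues.</p>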
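<h3>The spider's queues</h3>
<p style="FONT-SIZE: 14pt">(1) Waiting queue: newly discovered URLs are added to this queue, where they wait to be processed by the spider;</p>
<p style="FONT-SIZE: 14pt">(2) Processing queue: URLs about to be processed are moved to this queue. To avoid processing the same URL more than once, a URL is moved to the completed queue, or to the error queue if something went wrong, once it has been processed.</p>
<p style="FONT-SIZE: 14pt">(3) Error queue: if an error occurs while downloading the page, the URL is added to the error queue.</p>
<p style="FONT-SIZE: 14pt">(4) Completed queue: if the page is processed without error, the URL is added to the completed queue.</p>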
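<p style="FONT-SIZE: 14pt">The four queues can be sketched in a few lines of Java. This is one possible arrangement, not a reference implementation: the class CrawlQueues, its field names and the download predicate are all assumptions made for the example, and the download step is a stand-in that does no networking.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Illustrative sketch of the waiting, processing, error and completed queues. */
final class CrawlQueues {
    final Deque<String> waiting = new ArrayDeque<>();  // newly discovered URLs
    final Set<String> processing = new HashSet<>();    // URLs being handled right now
    final List<String> errors = new ArrayList<>();     // URLs whose download failed
    final List<String> completed = new ArrayList<>();  // URLs handled successfully
    private final Set<String> seen = new HashSet<>();  // every URL ever enqueued

    /** Adds a URL to the waiting queue unless it has been enqueued before. */
    boolean enqueue(String url) {
        if (!seen.add(url)) {
            return false;
        }
        waiting.addLast(url);
        return true;
    }

    /** Moves one URL through processing into completed or errors; false when idle. */
    boolean processNext(Predicate<String> download) {
        String url = waiting.pollFirst();
        if (url == null) {
            return false;
        }
        processing.add(url);
        boolean ok;
        try {
            ok = download.test(url);
        } catch (RuntimeException e) {
            ok = false;
        }
        processing.remove(url);
        (ok ? completed : errors).add(url);
        return true;
    }

    public static void main(String[] args) {
        CrawlQueues q = new CrawlQueues();
        q.enqueue("http://example.com/a");
        q.enqueue("http://example.com/b");
        q.enqueue("http://example.com/a"); // duplicate, ignored
        // Stand-in downloader: pretend every URL ending in "b" fails.
        while (q.processNext(url -> !url.endsWith("b"))) { }
        System.out.println("completed=" + q.completed + " errors=" + q.errors);
    }
}
```

<p style="FONT-SIZE: 14pt">The seen set is what keeps a URL from being processed twice: once a URL has been enqueued it can never re-enter the waiting queue, whichever final queue it ends up in.</p>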
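<h3>Crawler search strategies</h3>
<p style="FONT-SIZE: 14pt">When fetching pages, crawlers today generally follow one of two strategies: topic-free search, or specialized intelligent search focused on a particular topic. The former mainly comprises breadth-first and depth-first search. In breadth-first search the crawler first fetches every page linked from the start page, then picks one of those linked pages and fetches every page linked from it in turn. This is the most common approach, because it lets the crawler work in parallel and so fetch faster. In depth-first search the crawler starts at the start page and follows one link after another; only when it has finished that chain does it move on to the next start page and continue following links. An advantage of this method is that the crawler is easier to design. Most web crawlers use breadth-first search or some refinement of it.</p>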
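<p style="FONT-SIZE: 14pt">The difference between the two orders is easiest to see on a tiny link graph. The Java sketch below is only an illustration: the pages A to D and the links map are invented, and no page is actually downloaded.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative comparison of breadth-first and depth-first visiting order. */
final class TraversalOrder {

    /** Visits pages level by level; a FIFO queue holds the pages still to visit. */
    static List<String> breadthFirst(Map<String, List<String>> links, String seed) {
        List<String> order = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.addLast(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String page = frontier.pollFirst();
            order.add(page);
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                if (seen.add(next)) {
                    frontier.addLast(next);
                }
            }
        }
        return order;
    }

    /** Follows one chain of links to its end before backtracking. */
    static List<String> depthFirst(Map<String, List<String>> links, String seed) {
        List<String> order = new ArrayList<>();
        visit(links, seed, new HashSet<>(), order);
        return order;
    }

    private static void visit(Map<String, List<String>> links, String page,
                              Set<String> seen, List<String> order) {
        if (!seen.add(page)) {
            return; // already visited
        }
        order.add(page);
        for (String next : links.getOrDefault(page, Collections.emptyList())) {
            visit(links, next, seen, order);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("D"));
        links.put("C", Arrays.asList("D"));
        System.out.println("breadth-first: " + breadthFirst(links, "A")); // [A, B, C, D]
        System.out.println("depth-first:   " + depthFirst(links, "A"));   // [A, B, D, C]
    }
}
```

<p style="FONT-SIZE: 14pt">In the breadth-first version the pages waiting in the queue are independent of one another, which is why that strategy is easy to hand out to several worker threads.</p>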
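<p style="FONT-SIZE: 14pt">In a specialized search engine, the crawler's job is to fetch Web pages and decide the order in which links are visited. It usually starts from a &#8220;seed set&#8221; (such as user queries, seed links, or seed pages) and visits pages and extracts links iteratively. During the search, unvisited links are held in a queue called the &#8220;search frontier&#8221; (Spider Frontier), and the crawler chooses the next link to visit according to the &#8220;importance&#8221; of the links in the frontier. How to evaluate and predict a link's &#8220;importance&#8221; (or value) is the key factor that determines the crawler's search strategy.</p>
<p style="FONT-SIZE: 14pt">Crawler designs vary widely, but in the end the differences come down to the criteria used to evaluate the value of a link.</p>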
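<p style="FONT-SIZE: 14pt">A search frontier ordered by link "importance" can be sketched with a priority queue. The Java below is a hedged illustration: SearchFrontier and its scorer are hypothetical names, and the keyword-based score in main is just a placeholder for whatever link-value criterion a real crawler uses.</p>

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Illustrative search frontier: the highest-scoring unvisited link comes out first. */
final class SearchFrontier {

    private static final class Entry {
        final String url;
        final double score;
        final long order; // insertion order, used to break ties

        Entry(String url, double score, long order) {
            this.url = url;
            this.score = score;
            this.order = order;
        }
    }

    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Entry e) -> e.score)
                    .reversed()
                    .thenComparingLong(e -> e.order));
    private final Set<String> seen = new HashSet<>();
    private final ToDoubleFunction<String> scorer;
    private long counter = 0;

    SearchFrontier(ToDoubleFunction<String> scorer) {
        this.scorer = scorer;
    }

    /** Scores and stores a link the first time it is seen; later duplicates are ignored. */
    void add(String url) {
        if (seen.add(url)) {
            queue.add(new Entry(url, scorer.applyAsDouble(url), counter++));
        }
    }

    /** Returns the most important remaining link, or null when the frontier is empty. */
    String next() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }

    public static void main(String[] args) {
        // Placeholder criterion: links that mention "lucene" count as more valuable.
        SearchFrontier frontier = new SearchFrontier(url -> url.contains("lucene") ? 1.0 : 0.0);
        frontier.add("http://example.com/about");
        frontier.add("http://example.com/lucene/intro");
        frontier.add("http://example.com/lucene/faq");
        for (String url = frontier.next(); url != null; url = frontier.next()) {
            System.out.println(url);
        }
    }
}
```

<p style="FONT-SIZE: 14pt">Swapping in a different scorer changes the crawl order without touching the rest of the crawler, which is the sense in which designs differ mainly in their link-value criterion.</p>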
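<h2>Common open-source web crawlers: introduction and comparison</h2>
<h3>Nutch</h3>
<p style="FONT-SIZE: 14pt">Language: Java</p>
<p lang=EN-US style="FONT-SIZE: 14pt">http://lucene.apache.org/nutch/</p>
<p style="FONT-SIZE: 14pt">Overview:</p>
<p style="FONT-SIZE: 14pt">An Apache subproject, under the Lucene project.</p>
<p style="FONT-SIZE: 14pt">Nutch is a complete, Google-like web search engine solution built on Lucene. Its Hadoop-based distributed processing model underpins the system's performance, its Eclipse-like plug-in mechanism makes the system customizable, and it is easy to integrate into your own applications.</p>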
<p lang=EN-US style="FONT-SIZE: 14pt"></p>
<h3>Larbin</h3>
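<p style="FONT-SIZE: 14pt">Language: C++</p>
<p lang=EN-US style="FONT-SIZE: 14pt">http://larbin.sourceforge.net/index-eng.html</p>
<p style="FONT-SIZE: 14pt">Overview</p>
<p style="FONT-SIZE: 14pt">larbin is an open-source web crawler/spider developed single-handedly by a young Frenchman, S&#233;bastien Ailleret. It aims to follow the URLs on pages to extend the crawl, ultimately providing a broad data source for search engines.</p>
<p style="FONT-SIZE: 14pt">larbin is only a crawler: it just fetches pages, and parsing them is left to the user. It also does not handle storing pages in a database or building an index.</p>
<p style="FONT-SIZE: 14pt">larbin was designed from the start to be simple yet highly configurable. As a result, a simple larbin crawler can fetch 5 million pages a day, which is very efficient.</p>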
<p lang=EN-US style="FONT-SIZE: 14pt"></p>
<h3>Heritrix</h3>
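<p style="FONT-SIZE: 14pt">Language: Java</p>
<p lang=EN-US style="FONT-SIZE: 14pt"><a href="http://crawler.archive.org/">http://crawler.archive.org/</a></p>
<p style="FONT-SIZE: 14pt">Overview</p>
<p style="FONT-SIZE: 14pt">Comparison with Nutch</p>
<p style="FONT-SIZE: 14pt">Heritrix and Nutch are both open-source Java frameworks. Heritrix is an open-source product hosted on SourceForge, and Nutch is an Apache subproject. Both are called web crawlers/spiders, and they work on much the same principle: they traverse a site's resources in depth and fetch them to local storage. Both do so by analyzing every valid URI on the site and submitting HTTP requests to obtain the responses, producing local files and the corresponding log information.</p>
<p style="FONT-SIZE: 14pt">Heritrix is an "archival crawler": it is meant to obtain a complete, accurate, deep copy of a site's content, including images and other non-text content. It fetches and stores the content as it finds it, turning nothing away and making no changes to the pages. Re-crawling the same URL does not replace the earlier copy. The crawler is started, monitored, and adjusted through a Web user interface, which allows flexible definition of the URLs to fetch.</p>
<p style="FONT-SIZE: 14pt">Differences between the two:</p>
<p style="FONT-SIZE: 14pt">Nutch fetches and stores only indexable content. Heritrix takes everything, striving to preserve pages as they originally appeared.</p>
<p style="FONT-SIZE: 14pt">Nutch can trim content or convert its format.</p>
<p style="FONT-SIZE: 14pt">Nutch stores content in a database-optimized format to ease later indexing, and a refresh replaces the old content. Heritrix instead adds (appends) the new content.</p>
<p style="FONT-SIZE: 14pt">Nutch is run and controlled from the command line. Heritrix has a Web administration interface.</p>
<p style="FONT-SIZE: 14pt">Nutch is not very customizable, although this has improved somewhat. Heritrix exposes more controllable parameters.</p>
<p style="FONT-SIZE: 14pt">Heritrix offers fewer features than Nutch and feels rather like a whole-site downloader. It neither indexes nor parses, and it does not even handle repeated crawls of the same URL very well.</p>
<p style="FONT-SIZE: 14pt">Heritrix is powerful, but configuring it is somewhat tedious.</p>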
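<h3>Comparison of the three</h3>
<p style="FONT-SIZE: 14pt">1. In terms of functionality, Heritrix and Larbin are similar: each is a pure web crawler that provides mirror downloads of sites. Nutch is a web search engine framework, and crawling pages is only part of what it does.</p>
<p style="FONT-SIZE: 14pt">2. In terms of distributed processing, Nutch supports it, while the other two do not yet appear to.</p>
<p style="FONT-SIZE: 14pt">3. In terms of how crawled pages are stored, Heritrix and Larbin both save the crawled content in its original form, whereas Nutch saves content into segments in its own format.</p>
<p style="FONT-SIZE: 14pt">4. In terms of processing crawled content, Heritrix and Larbin save it as-is without processing, whereas Nutch processes the text, including link analysis, body-text extraction, and index building (a Lucene index).</p>
<p style="FONT-SIZE: 14pt">5. In terms of crawl efficiency, Larbin is more efficient, because it is implemented in C++ and does a single job.</p>
<p style="FONT-SIZE: 14pt" align=center>Table: Comparison of the three crawlers</p>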
<div style="FONT-SIZE: 14pt">
<table cellSpacing=0 cellPadding=0 border=0>
    <tbody>
        <tr>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p lang=EN-US style="FONT-SIZE: 14pt">crawler</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">Language</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">Single-purpose</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=115>
            <p style="FONT-SIZE: 14pt">Distributed crawling</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">Efficiency</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">Mirror storage</p>
            </td>
        </tr>
        <tr>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p lang=EN-US style="FONT-SIZE: 14pt">Nutch</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p lang=EN-US style="FONT-SIZE: 14pt">Java</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">&#215;</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=115>
            <p style="FONT-SIZE: 14pt">&#8730;</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">Low</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">&#215;</p>
            </td>
        </tr>
        <tr>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p lang=EN-US style="FONT-SIZE: 14pt">Larbin</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p lang=EN-US style="FONT-SIZE: 14pt">C++</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">&#8730;</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=115>
            <p style="FONT-SIZE: 14pt">&#215;</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">High</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">&#8730;</p>
            </td>
        </tr>
        <tr>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p lang=EN-US style="FONT-SIZE: 14pt">Heritrix</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p lang=EN-US style="FONT-SIZE: 14pt">Java</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">&#8730;</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=115>
            <p style="FONT-SIZE: 14pt">&#215;</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">Medium</p>
            </td>
            <td style="FONT-SIZE: 14pt" vAlign=top width=72>
            <p style="FONT-SIZE: 14pt">&#8730;</p>
            </td>
        </tr>
    </tbody>
</table>
</div>
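<h3>Other web crawlers:</h3>
<p lang=EN-US style="FONT-SIZE: 14pt">Heritrix <br>Heritrix is an open-source, extensible web crawler project. It is designed to strictly respect the exclusion directives in robots.txt files and META robots tags.<br><a href="http://crawler.archive.org/">http://crawler.archive.org/</a><br><br>WebSPHINX <br>WebSPHINX is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library.<br><a href="http://www.cs.cmu.edu/~rcm/websphinx/">http://www.cs.cmu.edu/~rcm/websphinx/</a><br><br>WebLech <br>WebLech is a powerful tool for downloading and mirroring web sites. It supports downloading sites according to functional requirements and mimics the behavior of a standard web browser as closely as possible. WebLech has a functional console and is multithreaded.<br><a href="http://weblech.sourceforge.net/">http://weblech.sourceforge.net/</a><br><br>Arale <br>Arale is designed mainly for personal use and, unlike other crawlers, does not focus on page indexing. It can download entire web sites or selected resources from a site, and it can also map dynamic pages to static ones.<br><a href="http://web.tiscali.it/_flat/arale.jsp.html">http://web.tiscali.it/_flat/arale.jsp.html</a><br><br>J-Spider <br>J-Spider is a fully configurable and customizable web spider engine. You can use it to check a site for errors (internal server errors and the like), check internal and external links, analyze a site's structure (it can create a site map), and download entire web sites. You can also write a JSpider plug-in to add whatever functionality you need.<br><a href="http://j-spider.sourceforge.net/">http://j-spider.sourceforge.net/</a><br><br>spindle <br>spindle is a web indexing/search tool built on the Lucene toolkit. It includes an HTTP spider for building indexes and a search class for searching them. The spindle project provides a JSP tag library so that JSP-based sites can add search without developing any Java classes.<br><a href="http://www.bitmechanic.com/projects/spindle/">http://www.bitmechanic.com/projects/spindle/</a><br><br>Arachnid <br>Arachnid is a Java-based web spider framework. It includes a simple HTML parser that can analyze an input stream containing HTML content. By subclassing Arachnid you can develop a simple web spider and add a few lines of code to be called after each page of a web site has been parsed. The Arachnid download includes two example spider applications that demonstrate how to use the framework.<br><a href="http://arachnid.sourceforge.net/">http://arachnid.sourceforge.net/</a><br><br>LARM <br>LARM gives users of the Jakarta Lucene search engine framework a pure-Java search solution. It contains methods for indexing files and database tables, and a crawler for indexing web sites.<br><a href="http://larm.sourceforge.net/">http://larm.sourceforge.net/</a><br><br>JoBo <br>JoBo is a simple tool for downloading entire web sites. It is essentially a web spider. Its main advantages over other download tools are that it can fill in forms automatically (for example, to log in) and use cookies to handle sessions. JoBo also has flexible download rules (based on a page's URL, size, MIME type, and so on) for limiting what is downloaded.<br><a 
href="http://www.matuschek.net/software/jobo/index.html">http://www.matuschek.net/software/jobo/index.html</a><br><br>snoics-reptile <br>snoics-reptile is a site-mirroring tool written in pure Java. Starting from the entry URLs given in its configuration file, it fetches to local storage every resource of the site that a browser could retrieve with GET, including web pages and files of all kinds, such as images, Flash, mp3, zip, rar, and exe files. It can download an entire site to disk while preserving the original site structure exactly. Placing the downloaded site on a web server (such as Apache) yields a complete mirror.<br><a href="http://www.blogjava.net/snoics">http://www.blogjava.net/snoics</a><br><br></p>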
<p lang=EN-US style="FONT-SIZE: 14pt"><br>Web-Harvest <br>Web-Harvest is an open-source Java tool for web data extraction. It collects specified Web pages and extracts useful data from them, mainly using technologies such as XSLT, XQuery, and regular expressions to operate on text/XML.<br><a href="http://web-harvest.sourceforge.net/">http://web-harvest.sourceforge.net</a><br><br>spiderpy<br>spiderpy is an open-source web crawler written in Python. It lets users collect files and search web sites, and has a configurable interface.<br><a href="http://pyspider.sourceforge.net/">http://pyspider.sourceforge.net/</a><br><br>The Spider Web Network Xoops Mod Team <br>Spider Web Network Xoops Mod is a module for Xoops, implemented entirely in PHP.<br><a href="http://www.tswn.com/">http://www.tswn.com/</a><br><br>larbin<br>larbin is a C++ web crawler with an easy-to-use interface, though it only runs on Linux. On an ordinary PC, larbin can crawl 5 million pages a day (given a good network connection, of course).<br><a href="http://larbin.sourceforge.net/index-eng.html">http://larbin.sourceforge.net/index-eng.html</a></p>
<p lang=EN-US style="FONT-SIZE: 14pt"></p>
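<h3>Problems crawlers face</h3>
<p style="FONT-SIZE: 14pt">1. robots.txt </p>
<p style="FONT-SIZE: 14pt">robots.txt is a plain-text file in which a site administrator can declare the parts of the site that robots should not visit, or specify that search engines should index only certain content.</p>
<p style="FONT-SIZE: 14pt">When a search robot (sometimes called a search spider) visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the robot determines which areas it may visit from the file's contents; if it does not, the robot simply crawls along the links.</p>
<p style="FONT-SIZE: 14pt">In addition, robots.txt must be placed in the site's root directory, and the file name must be entirely lowercase.</p>
<p lang=EN-US style="FONT-SIZE: 14pt">2. Some kinds of pages are hard to crawl, for example pages loaded through JavaScript calls and pages that require registration to access.</p>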
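<p style="FONT-SIZE: 14pt">For point 1, the robots.txt check can be sketched as follows. This Java example is a simplified illustration, not a complete implementation of the robots exclusion rules: the class RobotsRules is made up, it reads only the Disallow lines that follow a "User-agent: *" line, and it ignores Allow lines, wildcards, and rules addressed to specific robots.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified, illustrative robots.txt check for the "User-agent: *" rules only. */
final class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects the Disallow prefixes that apply to all robots. */
    static RobotsRules parse(String robotsTxt) {
        RobotsRules rules = new RobotsRules();
        boolean appliesToAll = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw;
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop comments
            }
            line = line.trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or unrecognized line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                appliesToAll = value.equals("*");
            } else if (field.equals("disallow") && appliesToAll && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    /** A path is allowed unless it starts with one of the disallowed prefixes. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\n"
                + "Disallow: /private/\n"
                + "Disallow: /tmp/ # scratch space\n"
                + "\n"
                + "User-agent: BadBot\n"
                + "Disallow: /\n";
        RobotsRules rules = RobotsRules.parse(robotsTxt);
        System.out.println("/index.html allowed: " + rules.isAllowed("/index.html"));
        System.out.println("/private/a.html allowed: " + rules.isAllowed("/private/a.html"));
    }
}
```

<p style="FONT-SIZE: 14pt">A crawler would fetch /robots.txt from the site root once, parse it like this, and consult isAllowed before putting any URL from that site into its waiting queue. If the file is missing, everything is allowed, which matches the behavior described above.</p>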
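<h3>Related research on web crawlers</h3>
<p style="FONT-SIZE: 14pt" align=left>Some kinds of pages are hard to crawl, for example pages loaded through JavaScript calls and pages that require registration to access. Crawling such pages is classed as deep-web mining. These pages fall into the following categories: (1) dynamic pages produced by filling in a form that queries a back-end online database; (2) pages that are not indexed because no hyperlinks point to them; (3) pages that require registration or are otherwise access-restricted; (4) accessible non-web-page files. The article by Zeng Weihui et al. surveys this class of problems. The article by Wang Ying et al. proposes collecting dynamic pages with an embedded JavaScript engine.</p>
<p lang=EN-US style="FONT-SIZE: 14pt">1. Some non-static Web 2.0 sites, such as forums, generate their content dynamically and hold huge amounts of data, which makes them hard to crawl. At SIGIR 2008, Yida Wang et al. proposed a method for crawling forums.</p>
<p lang=EN-US style="FONT-SIZE: 14pt">2. Some sites restrict crawling. In 2006, Analia G. Lourenco and Orlando O. Belo proposed using query logs to limit crawler activity and so reduce server load.</p>
<p lang=EN-US style="FONT-SIZE: 14pt">3. The number of pages on the Web is so large that crawl time and efficiency must be considered. Junghoo Cho et al. at UCLA proposed using parallel crawlers.</p>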
<img src ="http://www.cppblog.com/zzfmars/aggbug/126793.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/zzfmars/" target="_blank">Kevin_Zhang</a> 2010-09-16 19:29 <a href="http://www.cppblog.com/zzfmars/archive/2010/09/16/126793.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>