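Original article: http://my.chinaunix.net/space.php?uid=24488136&do=blog&id=64821
While browsing a bookstore I came across a shelf of books on search engines. I flipped through a few and found them quite interesting, so when I got home I searched Baidu and Google for how to build a search engine myself. The open-source nutch-1.0 looked good, so I learned how to set it up. I ran into a few problems along the way, but they were quickly solved.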

Environment:

    Linux **-desktop 2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 20:26:08 UTC 2010 i686 GNU/Linux
    Ubuntu 10.04
    
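1. Install the JDK
Ubuntu 10.04 comes with its own JDK (called OpenJDK), so I simply used that one. It can be installed from the Synaptic Package Manager. After installation, /usr/lib/jvm contains the three entries below. You can of course download the latest official JDK instead.

    ├── default-java -> java-6-openjdk
    ├── java-1.6.0-openjdk -> java-6-openjdk
    └── java-6-openjdk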
    
2. Install and configure Tomcat
On Ubuntu 10.04 the packaged Tomcat version is tomcat6. I also installed the management package, tomcat6-admin:

    apt-get install tomcat6 tomcat6-admin
    
After installing Tomcat, run /etc/init.d/tomcat6 start to start the server. Open "http://localhost:8080" in a browser; if the page says "It works", the Tomcat server is running.

    It works !

    If you're seeing this page via a web browser, it means you've setup Tomcat successfully. Congratulations!

    This is the default Tomcat home page. It can be found on the local filesystem at: /var/lib/tomcat6/webapps/ROOT/index.html

    Tomcat6 veterans might be pleased to learn that this system instance of Tomcat is installed with CATALINA_HOME in /usr/share/tomcat6 and CATALINA_BASE in /var/lib/tomcat6, following the rules from /usr/share/doc/tomcat6-common/RUNNING.txt.gz.

    You might consider installing the following packages, if you haven't already done so:

    tomcat6-docs: This package installs a web application that allows to browse the Tomcat 6 documentation locally. Once installed, you can access it by clicking here.
    tomcat6-examples: This package installs a web application that allows to access the Tomcat 6 Servlet and JSP examples. Once installed, you can access it by clicking here.
    tomcat6-admin: This package installs two web applications that can help managing this Tomcat instance. Once installed, you can access the manager webapp and the host-manager webapp.

    NOTE: For security reasons, using the manager webapp is restricted to users with role "manager". The host-manager webapp is restricted to users with role "admin". Users are defined in /etc/tomcat6/tomcat-users.xml.
    
A user has to be configured before you can enter the management interface. Edit /var/lib/tomcat6/conf/tomcat-users.xml. For security reasons I removed the default tomcat user and added my own, for example hinutch, with a password, for example 3838438:

    <?xml version='1.0' encoding='utf-8'?>
    <tomcat-users>
    <role rolename="manager"/>
    <role rolename="admin"/>
    <user username="hinutch" password="3838438" roles="admin,manager"/>
    </tomcat-users>
    
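You can now log in to the management interface. If that does not work, restart the Tomcat service with /etc/init.d/tomcat6 restart.
The management interface looks like this:

    [Screenshot: Tomcat Web Application Manager]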
    
3. Download and configure Nutch
Download nutch-1.0.tar.gz from http://www.apache.org/dyn/closer.cgi/nutch/. The download listing looked like this:

    apache-nutch-1.2-bin.zip         25-Sep-2010 05:38  164M
    apache-nutch-1.2-bin.zip.asc     25-Sep-2010 05:37  203
    apache-nutch-1.2-src.tar.gz      25-Sep-2010 05:37   50M  GZIP compressed document
    apache-nutch-1.2-src.tar.gz.asc  25-Sep-2010 05:37  203   GZIP compressed document
    apache-nutch-1.2-src.zip         25-Sep-2010 05:37   51M
    apache-nutch-1.2-src.zip.asc     25-Sep-2010 05:37  203
    nutch-0.9.tar.gz                 05-Apr-2007 10:17   68M  GZIP compressed document
    nutch-0.9.tar.gz.asc             05-Apr-2007 10:17  186   GZIP compressed document
    nutch-1.0.tar.gz                 28-Mar-2009 04:12   83M  GZIP compressed document
    nutch-1.0.tar.gz.asc             28-Mar-2009 04:12  197   GZIP compressed document
After extracting the archive, my directory looks like this:

    ├── bin
    ├── build.xml
    ├── CHANGES.txt
    ├── conf
    ├── crawled
    ├── default.properties
    ├── docs
    ├── KEYS
    ├── lib
    ├── LICENSE.txt
    ├── logs
    ├── NOTICE.txt
    ├── nutch-1.0.jar
    ├── nutch-1.0.job
    ├── nutch-1.0.war
    ├── plugins
    ├── README.txt
    ├── src
    ├── url.txt (I created this one myself)
    └── webapps
    
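First, create a text file in the Nutch root directory and name it "url.txt" (any name will do). It holds the addresses of the sites you want to crawl. My root directory is /home/**/nutch-1.0; I created url.txt there with this content:

    http://bbs.chinaunix.net/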
    
Next, update the filter configuration. Open "conf/crawl-urlfilter.txt" and change the accepted host so that it matches the address in url.txt:

    # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
    -.*(/[^/]+)/[^/]+\1/[^/]+\1/

    # accept hosts in MY.DOMAIN.NAME
    +^http://bbs.chinaunix.net/
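To make the filter rules a little more concrete: each rule is a regular expression prefixed with + (accept) or - (skip), and the first rule that matches a URL decides its fate. The sketch below imitates that behaviour in plain shell. The function name url_allowed is my own invention, and grep -E only approximates the Java regular expressions Nutch actually uses, so treat it as a rough way to sanity-check a rule file rather than as Nutch's real logic.

```shell
#!/bin/sh
# Hypothetical helper: emulate "first matching rule wins" for a filter file.
# Rules are treated as POSIX extended regexes, which only approximates
# the Java regex syntax that Nutch itself uses.
url_allowed() {
    rules_file=$1
    url=$2
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        sign=${rule%"${rule#?}"}          # first character: + or -
        pattern=${rule#?}                 # the rest is the regex
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            [ "$sign" = "+" ] && return 0
            return 1
        fi
    done < "$rules_file"
    return 1                              # no rule matched: reject
}
```

Run from the Nutch root directory, url_allowed conf/crawl-urlfilter.txt http://bbs.chinaunix.net/ should succeed if the accept line was edited correctly.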
    
Then open nutch-site.xml and edit it as follows (the http.agent.name value can be any name you like):

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <configuration>
    <property>
    <name>http.agent.name</name>
    <value>my nutch agent</value>
    </property>
    <property>
    <name>http.agent.version</name>
    <value>1.0</value>
    </property>
    </configuration>
    
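Now run the crawler to fetch the pages. In /home/**/nutch-1.0 (the Nutch root directory), enter the following command:

    ./bin/nutch crawl url.txt -dir crawled -depth 4 -topN 100 -threads 4

    -dir crawled    where the downloaded data is stored; the directory is created automatically if it does not exist
    -depth 4        crawl to a depth of 4
    -topN 100       fetch at most the top 100 matching pages at each level
    -threads 4      number of fetcher threads to start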
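One way to get a feel for these options: with -depth 4 and -topN 100 the crawl runs at most four rounds of at most 100 pages each, so it can never fetch more than 400 pages. Here is a small sketch that keeps the options in variables and prints that budget next to the command. The variable and function names are made up for illustration; only the nutch flags come from the command above.

```shell
#!/bin/sh
# Sketch: keep the crawl options in variables and report the page budget.
# Variable and function names are illustrative, not part of Nutch.
SEEDS=url.txt
OUT_DIR=crawled
DEPTH=4
TOP_N=100
THREADS=4

# Each of the DEPTH rounds fetches at most TOP_N pages.
max_pages() {
    echo $(( $1 * $2 ))
}

# Assemble the same command line that is typed by hand above.
crawl_command() {
    echo "./bin/nutch crawl $SEEDS -dir $OUT_DIR -depth $DEPTH -topN $TOP_N -threads $THREADS"
}

echo "upper bound on fetched pages: $(max_pages "$DEPTH" "$TOP_N")"
echo "command: $(crawl_command)"
```

For a first test run it may be worth lowering DEPTH and TOP_N so the crawl finishes quickly.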
    
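The crawler prints a lot of output while it runs. When the crawl has finished, you will find that the crawled directory has been created, with several directories inside:

    ├── crawldb
    ├── index
    ├── indexes
    ├── linkdb
    └── segments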
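If a crawl is interrupted, some of these directories may be missing, and searching will not work later on. A quick check along the following lines can catch that early. The function name check_crawl_dir is hypothetical, and the list of subdirectories is simply the five entries shown above.

```shell
#!/bin/sh
# Hypothetical helper: report which expected subdirectories are missing.
# Returns 0 when all five are present, 1 otherwise.
check_crawl_dir() {
    crawl_dir=$1
    missing=0
    for sub in crawldb index indexes linkdb segments; do
        if [ ! -d "$crawl_dir/$sub" ]; then
            echo "missing: $crawl_dir/$sub"
            missing=1
        fi
    done
    return "$missing"
}
```

Run from the Nutch root directory, check_crawl_dir crawled prints nothing and succeeds when the crawl output is complete.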
    
4. Deploy Nutch in Tomcat
Copy nutch-1.0.war from the Nutch root directory into /var/lib/tomcat6/webapps, then visit http://localhost:8080 and Tomcat will unpack it.

    root@**-desktop:/var/lib/tomcat6/webapps# ls
    nutch-1.0  nutch-1.0.war  ROOT
    
The nutch-1.0 folder contains:

    ├── anchors.jsp
    ├── ca
    ├── cached.jsp
    ├── cluster.jsp
    ├── de
    ├── en
    ├── es
    ├── explain.jsp
    ├── fi
    ├── fr
    ├── hu
    ├── img
    ├── include
    ├── index.jsp
    ├── it
    ├── jp
    ├── META-INF
    ├── more.jsp
    ├── ms
    ├── nl
    ├── pl
    ├── pt
    ├── refine-query-init.jsp
    ├── refine-query.jsp
    ├── search.jsp
    ├── sh
    ├── sr
    ├── sv
    ├── text.jsp
    ├── th
    ├── WEB-INF (the contents of this folder need to be modified)
    └── zh
    
Edit WEB-INF/classes/nutch-site.xml in this directory as follows:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <nutch-conf>
    <property>
    <name>searcher.dir</name>
    <value>/home/**/nutch-1.0/crawled</value>
    </property>
    </nutch-conf>

The value above must be changed to the crawler's download directory.
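Since a wrong searcher.dir is an easy mistake to make, here is a small sketch that reads the value back out of the file and checks that the directory exists. Both function names are hypothetical, and the sed expression assumes the <value> line comes directly after the <name> line, as it does in the file above.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> that follows <name>searcher.dir</name>.
# Assumes the value sits on the line right after the name.
searcher_dir() {
    sed -n '/<name>searcher\.dir<\/name>/{
        n
        s:.*<value>\(.*\)</value>.*:\1:p
    }' "$1"
}

# Succeeds only if a value is configured and that directory exists.
check_searcher_dir() {
    dir=$(searcher_dir "$1")
    [ -n "$dir" ] && [ -d "$dir" ]
}
```

For example, check_searcher_dir /var/lib/tomcat6/webapps/nutch-1.0/WEB-INF/classes/nutch-site.xml || echo "searcher.dir is wrong" would flag a path that does not exist.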
5. Search with Nutch
Open http://localhost:8080/nutch-1.0 in a browser and the search page appears. Type what you are looking for into the search box, for example: linux, and you get:

    Results 1-1 (1 result in total):
    论坛首页 - 中国最大的Linux/Unix技术社区 - IT人的网上社区 - bbs.ChinaUnix.net
    ... Unix操作系统 ← Linux论坛 RSS订阅 ... by CU管理员 Linux时代首页 Linux ...
    http://bbs.chinaunix.net/ (cached) (explain) (anchors)
That completes the whole process.
------------------------------------------------
| Problems encountered along the way           |
------------------------------------------------
1. JAVA_HOME cannot be found
Solution: edit /etc/environment and add JAVA_HOME:

    PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
    JAVA_HOME="/usr/lib/jvm/java-6-openjdk"
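Changes to /etc/environment are only picked up at the next login, so it can help to check what the current shell actually sees. A minimal sketch, assuming that a usable JAVA_HOME is a directory containing an executable bin/java (the function name java_home_ok is my own):

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given directory looks like a Java home,
# meaning it is non-empty and contains an executable bin/java.
java_home_ok() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if java_home_ok "${JAVA_HOME:-}"; then
    echo "JAVA_HOME is usable: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no bin/java"
fi
```

If the second message appears after editing the file, log out and back in, then try again.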
    
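2. Pages were crawled, but searching returns nothing
Solution: besides the changes above, one more file needs attention: /home/**/nutch-1.0/conf/nutch-default.xml. Find the section below and change it accordingly (the value must be the path where the crawled data is stored):

    <!-- searcher properties -->
    <property>
    <name>searcher.dir</name>
    <value>/home/**/nutch-1.0/crawled</value>
    <description>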
    
Sometimes results still do not show up, in which case you also need to run:

    /etc/init.d/tomcat6 restart
    
Haha, that's all!!!

posted on 2011-05-04 13:34 by 漂漂, category: linux