﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-mysileng-Posts by Category-Massive Data Processing</title><link>http://www.cppblog.com/mysileng/category/20179.html</link><description /><language>zh-cn</language><lastBuildDate>Tue, 25 Jun 2013 03:45:12 GMT</lastBuildDate><pubDate>Tue, 25 Jun 2013 03:45:12 GMT</pubDate><ttl>60</ttl><item><title>Massive Data Processing Series (9): External Sorting (Repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194634.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:30:00 GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194634.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194634.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194634.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194634.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194634.html</trackback:ping><description><![CDATA[<div><span style="color: #333333; font-family: 微软雅黑, arial, verdana; line-height: 25px; "><h1>【Introduction】</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; 
">In data structures courses we learn quite a few sorting algorithms: bubble sort, heapsort, quicksort, merge sort, and so on. These methods share one trait: every operation happens in memory and the algorithm itself needs no I/O, which makes them fast overall. The catch is that once the data to be sorted grows too large to fit in memory, they are no longer up to the job. At that point you need a different kind of sorting algorithm, known as &#8220;external sorting&#8221;.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">In general, a machine reads main memory far faster than external storage (RAM access is roughly 250,000 times faster than disk access), but memory capacity is much smaller than external storage capacity. When the data cannot all fit in memory at once, external sorting is required. This is the defining characteristic of external sorting.</p><h1>【What Is External Sorting】</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">External sorting is essentially an application of divide and conquer (<a href="http://en.wikipedia.org/wiki/Divide_and_conquer_algorithm" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><em style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; 
border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Divide and conquer</em>&nbsp;algorithm</a>), an algorithm design approach: split a large problem into several relatively independent subproblems, solve each subproblem, and then merge the subproblem answers to obtain the answer to the original problem.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Here is a classic example of external sorting, the two-way external merge sort. Suppose we have a large file holding N items to be sorted, and the data does not fit in memory. The sort proceeds as follows:</p><ol style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 2.5em; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; list-style-position: initial; list-style-image: initial; color: #373737; line-height: 24px; "><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Split the large file into files of size m (where m is smaller than the available memory).</li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 
0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Read these small files into memory one at a time, sort each one in memory with any sorting algorithm, and write out the sorted files F1, F2, &#8230;, Fn. (This can actually be combined with step 1, which saves one round of I/O.)</li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Read two already-sorted files Fi and Fi+1 block by block. Because both files are already sorted, the merge step of merge sort can combine them into a single sorted file, which is then written back out. (It is like merging two ordered columns of people into one.)</li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Repeat step 3 until only one file remains.</li></ol><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">That is the basic idea of two-way external merge sort. The number of merge passes over external storage (I/O) it needs is about log(2,N/m), the base-2 logarithm of the number of initial sorted files N/m, in addition to the pass that creates those files. At this point the performance bottleneck is no longer the time complexity of the in-memory sort but the amount of I/O exchanged between memory and external storage. Let me add a note here on how the cost of different operations compares:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 
0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Cost, from most to least expensive: network reads &gt; disk file I/O &gt; database reads &gt; memory reads</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">This is something of a golden rule of program performance, and it is worth keeping in mind whenever you write performance-sensitive code.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Back to the topic: two-way merge sort performs rather poorly. That is why multi-way merge sort exists. It needs about log(b,&nbsp;N/m) merge passes, where b is the number of files merged at a time. For details, see:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; 
line-height: 24px; "><a href="http://zh.wikipedia.org/wiki/%E5%A4%96%E6%8E%92%E5%BA%8F" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">http://zh.wikipedia.org/wiki/%E5%A4%96%E6%8E%92%E5%BA%8F</a></p><h1>【Practice Exercise】</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Taobao's browsing logs for its users contain tens of millions or even hundreds of millions of records (with duplicates). Find the users who share the same browsing interests.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">When reposting, please credit the source: <a href="http://diducoder.com/mass-data-topic-9-external-sort.html" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; 
border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">http://diducoder.com/mass-data-topic-9-external-sort.html&nbsp;</a></p></span></div><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:30</div>]]></description></item><item><title>Massive Data Processing Series (8): Inverted Index (the Cornerstone of Search Engines) (Repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194633.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:29:00 GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194633.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194633.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194633.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194633.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194633.html</trackback:ping><description><![CDATA[<div><span style="color: #333333; font-family: 微软雅黑, arial, verdana; line-height: 25px; "><h1>Introduction:</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; 
">In today's age of information explosion, search engines let us find what we are looking for quickly and conveniently. Any discussion of search engines has to mention the VSM model, and any discussion of VSM has to cover the inverted index. It is no exaggeration to say that the inverted index is the cornerstone of search engines.</p><h1>The VSM Retrieval Model</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">VSM stands for Vector Space Model, one of the models used in IR (Information Retrieval). Because it is simple, intuitive, and efficient, it is widely used in search engine architectures. Google, back in 1998, reportedly began its rapid expansion on the strength of a model like this. Enough preamble; let's see what VSM actually is.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Before we start, I will assume you have some familiarity with vectors from linear algebra. A vector is a quantity with both magnitude and direction, usually drawn as a directed line segment. Vector operations include addition, subtraction, scalar multiplication, inner product, distance, norm, and angle.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Document: a complete unit of information. In a search engine system, this means an individual web page.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; 
margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Term (index term): the basic building block of a document. In English it can be thought of as a single word, and in Chinese as a single word or phrase.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Query: a user's input, usually made up of several Terms.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">So, to sum up in one sentence what a search engine does: for the Query a user enters, find the most similar Documents and return them to the user. This is exactly the problem that IR models address:</p><blockquote style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 3em; margin-bottom: 0px; margin-left: 3em; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: italic; font-family: Georgia, 'Bitstream Charter', serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><p style="padding-top: 
0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">An information retrieval model is a framework and method for representing queries and documents and then computing the similarity between them.</p></blockquote><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Here is a simple example:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Suppose there are two Documents, &#8220;春风来了，春天的脚步近了&#8221; (&#8220;The spring breeze has come; the footsteps of spring draw near&#8221;) and &#8220;春风不度玉门关&#8221; (&#8220;The spring breeze does not pass Yumen Pass&#8221;), and the Query is &#8220;春风&#8221; (&#8220;spring breeze&#8221;). Intuitively the first seems more relevant to the query, because it contains the character 春 (&#8220;spring&#8221;) twice. But that is only intuition. How do we quantify it? Computer science is a rigorous discipline, after all ^_^. This is where the Term and the VSM model introduced above come in.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; 
outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">First we need to fix the dimension of the vectors. This requires a dictionary, and the size of the dictionary is the dimension of the vectors. In this example the dictionary is <span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Consolas, Monaco, 'Courier New', Courier, monospace; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; line-height: 18px; white-space: pre; ">{春风,来了,春天, 的,脚步,近了,不度,玉门关} </span>, and the document vectors and the query vector are shown in the figure below:</p><div id="attachment_294"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 459px; "><a href="http://www.51projob.com/uploads/allimg/120824/11332B9E-0.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-294"="" title="VSM model example" 
src="http://www.51projob.com/uploads/allimg/120824/11332B9E-0.jpg" alt="VSM model example" width="449" height="145" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">VSM model example</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">PS: For simplicity, the word segmentation here is very coarse-grained.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; 
border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Once the Query and the Documents are all represented as vectors, we can compute which document is more similar to the user's query. A quick calculation shows that D1 and D2 both have an inner product of 1 with the Query, which is awkward. Of course, with finer-grained segmentation the result would look different, so segmentation granularity also affects query results (chiefly recall and precision).</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The example above is deliberately simple. It computes document similarity with the most primitive method, the inner product, and it considers only the term frequency (TF) factor while ignoring inverse document frequency (IDF). In practice the cosine of the angle between the vectors is more commonly used, and there are many more weighting factors; Google is rumored to use more than 100.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />The well-known Lucene project is built on the VSM model. The core VSM formula is shown below (it is derived from the cosine method; the derivation is omitted here).</p><div id="attachment_269"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 580px; "><a href="http://www.51projob.com/uploads/allimg/120824/1133264b2-1.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; 
margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-269"="" title="VSM model formula" src="http://www.51projob.com/uploads/allimg/120824/1133264b2-1.jpg" alt="VSM model formula" width="570" height="264" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">VSM model formula</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; 
vertical-align: baseline; color: #373737; line-height: 24px; ">As the example shows, once the vector dimension grows (for Chinese it is typically 300,000 to 450,000) and the number of documents grows (usually to a massive scale), computing relevance even once becomes very expensive. How do we solve this? Remember that the topic of this installment is the inverted index, and our protagonist finally takes the stage!</p><h1>Inverted Index</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The inverted index closely resembles the Hash structure mentioned earlier in this series. The following is taken from Wikipedia:</p><blockquote style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 3em; margin-bottom: 0px; margin-left: 3em; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: italic; font-family: Georgia, 'Bitstream Charter', serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: 
initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Inverted index</strong> (Chinese: 倒排索引), also often called a <strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">reverse index</strong>, <strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">postings file</strong>, or <strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">inverted file</strong>, is an <a title="index" href="http://zh.wikipedia.org/wiki/%E7%B4%A2%E5%BC%95" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; 
vertical-align: baseline; ">index</a> method that stores, for full-text search, the storage locations of a word within a document or a set of documents as a <a title="mapping" href="http://zh.wikipedia.org/wiki/%E6%98%A0%E5%B0%84" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; 
margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">映射</a>。它是<a title="文档检索系统" href="http://zh.wikipedia.org/w/index.php?title=%E6%96%87%E6%A1%A3%E6%A3%80%E7%B4%A2%E7%B3%BB%E7%BB%9F&amp;action=edit&amp;redlink=1" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">文档检索系统</a>中最常用的<a title="数据结构" href="http://zh.wikipedia.org/wiki/%E6%95%B0%E6%8D%AE%E7%BB%93%E6%9E%84" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">数据结构</a>。</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; 
vertical-align: baseline; ">There are two forms of inverted index:</p><ul style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 2.5em; list-style-type: square; list-style-position: initial; list-style-image: initial; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">A record-level inverted index (or inverted file index) contains, for each word, a <a title="列表" href="http://zh.wikipedia.org/wiki/%E5%88%97%E8%A1%A8" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">list</a> of the documents that reference it.</li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; 
">A word-level inverted index (or full inverted index) additionally contains the positions of each word within a document.</li></ul><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">The latter form offers more <a title="兼容性" href="http://zh.wikipedia.org/w/index.php?title=%E5%85%BC%E5%AE%B9%E6%80%A7&amp;action=edit&amp;redlink=1" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">functionality</a> (such as <a title="短语搜索" href="http://zh.wikipedia.org/w/index.php?title=%E7%9F%AD%E8%AF%AD%E6%90%9C%E7%B4%A2&amp;action=edit&amp;redlink=1" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">phrase search</a>), but takes more time and space to build.</p></blockquote><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; 
border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">From the definition above, an inverted index consists of a dictionary index and a list for every term. The dictionary index contains all Terms (roughly, the words that occur in the documents), and the list that follows each entry stores information about that term (the IDs of the documents it appears in, and possibly its positions within each document). As before, a simple example illustrates the structure.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Suppose we want to index three documents (in practice the number of documents is enormous):</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Document 1 (D1): 中国移动互联网发展迅速 (China's mobile Internet is developing rapidly)</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Document 2 (D2): 移动互联网未来的潜力巨大 (The mobile Internet has enormous future potential)</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; 
margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Document 3 (D3): 中华民族是个勤劳的民族 (The Chinese nation is a hardworking nation)</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The dictionary of terms in these documents is then: {中国, 移动, 互联网, 发展, 迅速, 未来, 的, 潜力, 巨大, 中华, 民族, 是, 个, 勤劳}</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The resulting index is shown in the figure below:</p><div id="attachment_295"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 381px; "><a 
href="http://www.51projob.com/uploads/allimg/120824/11332632B-2.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-295"="" title="倒排索引" src="http://www.51projob.com/uploads/allimg/120824/11332632B-2.jpg" alt="倒排索引" width="371" height="484" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Inverted index</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; 
border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The index above stores two pieces of information for each term: the document ID and the number of occurrences. Once the index is built, we can start querying. Suppose the query is &#8220;中国移动&#8221;. We first segment it into the term set {中国, 移动}, look both terms up in the inverted index, and compute the distance between the query and each of D1, D2, and D3. Notice that once the inverted lists are built, there is no need to scan the whole document collection again: we go straight to &#8220;中国&#8221; and &#8220;移动&#8221; in the dictionary and traverse the lists behind them to do the computation.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">We now have a basic picture of the inverted index structure, but real applications raise further problems (mostly caused by the sheer volume of data). I list a few of them here with possible solutions, as a starting point for discussion:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">1. How should the index table on the left be built, and what is the most efficient way to do it?</p><blockquote style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 3em; margin-bottom: 0px; margin-left: 3em; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: italic; font-family: Georgia, 'Bitstream Charter', serif; outline-width: 0px; outline-style: initial; 
outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Some will answer without a second thought: the index on the left should of course be a hash structure, so that dictionary entries can be located quickly. But that raises new questions. How should the hash function be chosen? Hashes also have collisions, while an inverted list does not seem to tolerate them. In fact, although an inverted list and a hash table look very much alike, the two differ considerably. Here we can borrow the Bitmap idea mentioned earlier: each Term (word) maps to one slot (not a single bit, of course), and the mapping is one-to-one. How can that be achieved? Text processing uses many encodings, and for Chinese the GBK encoding covers essentially every character in use. Each character's GBK code is fixed, so the &#8220;ID&#8221; of a Term is fixed as well, which makes fast lookup possible. Note: obtaining the GBK code of a character is very fast and can be treated as an O(1) operation.</p></blockquote><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">2. How can entries be added to, deleted from, and updated in the index quickly?</p><blockquote style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 3em; margin-bottom: 0px; margin-left: 3em; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: italic; font-family: Georgia, 'Bitstream Charter', serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; 
padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Experienced programmers know that in most systems &#8220;adding&#8221; is far cheaper than &#8220;subtracting&#8221;, and search engines are no exception. So when a document has to be deleted from the inverted lists, it is not physically removed; it is only marked as deleted. That keeps the cost of the subtraction low.</p></blockquote><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">3. How should such a massive number of documents be stored? Is there a backup strategy?</p><blockquote style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 3em; margin-bottom: 0px; margin-left: 3em; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: italic; font-family: Georgia, 'Bitstream Charter', serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: 
baseline; ">Of course, a single machine cannot hold them all, so distributed storage is used. For backups, keeping three copies is generally enough.</p></blockquote><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">That wraps up the inverted index. Corrections are welcome. Thanks.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Please be fair and credit the source when reposting: <a href="http://diducoder.com/mass-data-topic-8-inverted-index.html" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">http://diducoder.com/mass-data-topic-8-inverted-index.html</a></p></span></div><img src ="http://www.cppblog.com/mysileng/aggbug/194633.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:29 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194633.html#Feedback" target="_blank" 
style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Massive Data Processing Series (7): Database Indexes and Optimization (Repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194632.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:28:00 GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194632.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194632.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194632.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194632.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194632.html</trackback:ping><description><![CDATA[<div><span style="color: #333333; font-family: 微软雅黑, arial, verdana; line-height: 25px; "><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">An index is a structure that sorts the values of one or more columns of a database table. With an index, specific information in the table can be accessed quickly.</p><h1>Database Indexes</h1><h2>What Is an Index</h2><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">A database index is like the table of contents at the front of a book: it speeds up database queries.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; 
padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Take a query such as: select * from table1 where id=44. Without an index, the whole table has to be scanned until the row whose ID equals 44 is found. With an index (it must be an index built on the ID column), 44 is looked up directly in the index (that is, in the ID column), which yields the location of the row, and the row is found. In short, an index is used to locate data.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Indexes come in two kinds, clustered and non-clustered. A clustered index follows the physical storage order of the data, whereas a non-clustered index does not. Clustered indexes speed up multi-row retrieval, while non-clustered indexes are fast for single-row retrieval.</p><h2>Overview</h2><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The purpose of building an index is to speed up searching or sorting the records in a table.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; 
outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Indexing a table comes at a cost: first, it increases the storage space the database needs; second, inserting and modifying data takes more time (because the indexes have to change as well).</p><div id="attachment_153"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 634px; "><a href="http://www.51projob.com/uploads/allimg/120824/11325K120-0.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-153"="" title="B树索引-Sql Server索引方式" src="http://www.51projob.com/uploads/allimg/120824/11325K120-0.jpg" alt="B树索引-Sql Server索引方式" width="624" height="429" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: 
#eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">B-tree index: how SQL Server indexes data</p></div><h2>Why Create Indexes</h2><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Creating indexes can greatly improve system performance.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 30px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">First, creating a unique index guarantees the uniqueness of every row in the table.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Second, it can greatly speed up data retrieval, which is the main reason for creating indexes.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; 
" />Third, it can speed up joins between tables, which is especially valuable for enforcing referential integrity.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Fourth, when data is retrieved with grouping and sorting clauses, it can likewise significantly reduce the time the query spends on grouping and sorting.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Fifth, with indexes in place, the query optimizer can be put to use during query processing, improving system performance.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Some may ask: if indexes have so many advantages, why not create one on every column of the table? Because adding indexes also has a number of drawbacks.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 30px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">First, creating and maintaining indexes takes time, and that time grows with the volume of data.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Second, indexes take up physical space: besides the space occupied by the table data, every index needs some physical space of its own, and a clustered index needs even more.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Third, whenever data in the table is inserted, deleted, or modified, the indexes have to be maintained dynamically as well, which slows down data maintenance.</p><h2>Where to Create Indexes</h2><p style="padding-top: 0px; 
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Indexes are built on particular columns of a database table. When creating indexes, consider which columns should be indexed and which should not. In general, indexes should be created on the following columns:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 30px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">On columns that are searched frequently, to speed up searches;<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />On primary key columns, to enforce the uniqueness of the column and to organize how the data in the table is arranged;<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />On columns frequently used in joins, which are mostly foreign keys, to speed up joins;<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />On columns frequently searched by range, because the index is sorted and the specified range is contiguous;<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />On columns that frequently need sorting, because the index is already sorted and queries can use that order to shorten sort time;<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />On columns frequently used in WHERE clauses, to speed up evaluation of the conditions.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 
0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Likewise, some columns should not be indexed. In general, columns that should not be indexed have the following characteristics:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">First, columns that are rarely used or referenced in queries should not be indexed. Since these columns are rarely used, an index on them does not improve query speed. On the contrary, the extra index slows down maintenance and increases space requirements.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Second, columns with only a few distinct values should not be indexed either. Because such a column takes very few values, for example the gender column of a personnel table, the rows in a query's result set make up a large proportion of the rows in the table, so a large share of the table has to be searched anyway. Adding an index does not noticeably speed up retrieval.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; 
vertical-align: baseline; color: #373737; line-height: 24px; ">Third, do not index columns of the text, image, or bit data types: their values are either very large or take very few distinct values, neither of which suits an index.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Fourth, do not create indexes when modification performance matters far more than retrieval performance. The two pull in opposite directions: adding indexes improves retrieval but slows down modification, and removing indexes does the reverse. So when modifications far outnumber retrievals, indexes should not be created.</p><h1>Database optimization</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Beyond indexes, with the LAMP stack as popular as it is today, database performance tuning (MySQL in particular) is another hot topic in massive data processing. Drawing on my own experience, here are a few aspects of MySQL optimization.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">First, design the database so that it takes full advantage of indexes. How to build indexes, which kinds to build, and which columns to build them on were covered above and are not repeated here. The other design principle is to perform as few write operations (inserts, updates, deletes, and so on) as possible and to keep queries as simple as possible, as shown below:</p><div id="attachment_157"  aligncenter"="" style="padding-top: 9px; 
padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 633px; "><a href="http://www.51projob.com/uploads/allimg/120824/11325JN6-1.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-157"="" title="Database design" src="http://www.51projob.com/uploads/allimg/120824/11325JN6-1.jpg" alt="Database design" width="623" height="295" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; 
border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Database design</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Second, configuring caches is essential. Caching effectively reduces the number of reads the database must perform, which relieves pressure on the database server; in a sense it is an indirect remedy, in the spirit of &#8220;besieging Wei to rescue Zhao&#8221;. The configurable caches include the index cache (key_buffer_size), the sort buffer (sort_buffer_size), the query cache (query_cache_size), and the table descriptor cache (table_cache), as shown in the figure below:</p><div id="attachment_158"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 632px; "><a href="http://www.51projob.com/uploads/allimg/120824/11325H5S-2.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; 
outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-158"="" title="Cache configuration" src="http://www.51projob.com/uploads/allimg/120824/11325H5S-2.jpg" alt="Cache configuration" width="622" height="278" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Cache configuration</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; 
">Third, table splitting, another popular database optimization. There are two ways to split a table: horizontally and vertically. Horizontal splitting is the more practical of the two. As the name suggests, it distributes records across different tables while keeping every record complete (after a vertical split, each record is no longer complete). For example, to split an original table of 100 records into 2 tables, the simplest and most common method is to split by ID modulo: here, records with IDs 1, 3, 5, 7, ... go into one table and records with IDs 2, 4, 6, 8, ... go into the other. Horizontal splitting reduces the load of each query, but it also breaks the integrity of the original table, so it is a poor fit for a table that needs many aggregate (statistical) operations. A very typical use is user data: each user's data tends to be large, yet different users' data has little to do with one another, which makes it well suited to horizontal splitting. Finally, remember that splitting tables adds a burden to queries, so decide at design time whether splitting really suits the database:</p><div id="attachment_160"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 632px; "><a href="http://www.51projob.com/uploads/allimg/120824/11325J1R-3.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-160"="" title="Table splitting" src="http://www.51projob.com/uploads/allimg/120824/11325J1R-3.jpg" alt="Table splitting" width="622" height="291" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; 
border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Table splitting</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Fourth, log analysis. After a database has been running for a long time it accumulates a large volume of logs, and these contain a lot of useful information. Analyzing the logs reveals where the system's performance bottlenecks are, which in turn points toward further optimizations.</p><div id="attachment_161"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 637px; "><a href="http://www.51projob.com/uploads/allimg/120824/11325I915-4.jpg" 
style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-161"="" title="Performance analysis" src="http://www.51projob.com/uploads/allimg/120824/11325I915-4.jpg" alt="Performance analysis" width="627" height="243" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Performance analysis</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; 
border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Everything above concerns tuning a single MySQL server. With the explosion of information, however, a single database server can no longer meet our needs, which is why multi-node, distributed database networks emerged. Their general structure is as follows:</p><div id="attachment_163"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; color: #373737; line-height: 24px; width: 562px; "><a href="http://www.51projob.com/uploads/allimg/120824/11325I2U-5.png" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-163"="" title="Distributed database structure" src="http://www.51projob.com/uploads/allimg/120824/11325I2U-5.png" alt="Distributed database structure" width="552" height="690" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; 
border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Distributed database structure</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The key technique behind this kind of distributed cluster is &#8220;synchronous replication&#8221;... (To be continued...)</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Please play fair and credit the source when reposting: <a href="http://diducoder.com/mass-data-topic-7-index-and-optimize.html" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; 
border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">http://diducoder.com/mass-data-topic-7-index-and-</a></p></span></div><img src ="http://www.cppblog.com/mysileng/aggbug/194632.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:28 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194632.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Massive Data Processing Series (6): Two-Level Bucket Partitioning (repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194631.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:26:00 GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194631.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194631.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194631.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194631.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194631.html</trackback:ping><description><![CDATA[<div><span style="color: #373737; font-size: 15px; font-weight: 300; line-height: 24px; font-family: 微软雅黑; "><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; 
"><span style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #0000a0; "><strong style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; font-weight: bold; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">[What is two-level bucket partitioning]<br /></strong></span>In fact, two-level bucket partitioning is less a data structure than an algorithm design idea. When a data set is too large to process directly, we can divide it into small units and then process those units according to some strategy to reach the goal.</p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; "><span style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; 
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #0000a0; "><strong style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; font-weight: bold; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">[Applicable scope]<br /></strong></span>Finding the k-th largest element, the median, and non-repeated or repeated numbers.</p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; "><span style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #0000a0; "><strong style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; font-weight: bold; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; 
padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">[Basic principle and key points]<br /></strong></span>Because the range of element values is too large for a direct-address table, we partition repeatedly to narrow the range step by step until the work fits within an acceptable range. The range can be narrowed any number of times; two levels is just one example. Divide and conquer is the underlying idea (except that here we &#8220;divide without conquering&#8221;).</p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; "><span style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #0000a0; "><strong style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; font-weight: bold; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">[Extension]<br /></strong></span>The same idea also applies when small-range data must be used to construct large-range data; the only difference is that the process runs in reverse.</p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: 
inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; "><strong style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: bold; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; "><span style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #0000a0; ">[Problem examples]<br /></span><span style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #0000ff; ">1). Count the non-repeated integers among 250 million integers, when memory cannot hold all 250 million of them.</span></strong></p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; 
font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">This resembles the pigeonhole principle. The integers can take 2^32 possible values, so we can divide those 2^32 values into 2^8 regions (for example, with one file per region), distribute the data into the regions, and then process each region directly with a bitmap. In other words, as long as there is enough disk space, the problem is easy to solve. It can of course also be solved with the BitMap method covered earlier; all roads lead to Rome.</p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; "><span style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #0000ff; "><strong style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: bold; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; 
">2). Find the median of 500 million ints.</strong></span></p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">This example is even more clear-cut than the previous one. First divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From the counts we can tell which region contains the median, and also which rank within that region the median has. A second scan then only needs to count the numbers that fall in that region.</p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">In fact, if the values are int64 rather than int, three rounds of such partitioning bring the problem down to an acceptable size. First divide the int64 range into 2^24 regions and determine which region holds the target and its rank within that region; then divide that region into 2^20 subregions and determine the subregion and the rank within it; each subregion then spans only 2^20 values, so a direct-address table can be used for the final count.</p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; "><strong style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: bold; 
font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; "><span style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #0000ff; ">3). Given a random number generator that produces values from 0 to 30000, use it to build a list of winning lottery numbers over the range 0 to 350000; the list must contain 20000 winning numbers.</span></strong></p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">This problem is the reverse of the previous two: a generator for 0 to 30,000 must produce random numbers from 0 to 350,000. We can divide the range 0 to 350,000 into 12 intervals (35/3, rounded up), each no longer than 30,000, use the given generator to produce a number within an interval, and then add that interval's base offset. How many numbers should each interval produce? The formula is: interval length * density of winning numbers, which here is 30000 * (20000/350000), or about 1714 for a full-length interval. One last point: the problem carries an implicit condition. These are lottery numbers, so the generated numbers must not contain duplicates, which is another reason I used the two-level bucket partitioning idea.</p><p style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 
0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; ">Please be considerate and credit the source when reposting: <a href="http://diducoder.com/mass-data-topic-6-multi-dividing.html" style="border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-style: inherit; font-weight: inherit; font-family: 微软雅黑; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; outline-width: 0px; outline-style: initial; outline-color: initial; padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; vertical-align: baseline; color: #1982d1; text-decoration: none; ">http://diducoder.com/mass-data-topic-6-multi-dividing.html</a></p></span></div><img src ="http://www.cppblog.com/mysileng/aggbug/194631.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:26 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194631.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Massive Data Processing Series (5): Heaps (repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194628.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:24:00 GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194628.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194628.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194628.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194628.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194628.html</trackback:ping><description><![CDATA[<div><span style="color: #333333; font-family: 微软雅黑, arial, verdana; line-height: 25px; "><p style="padding-top: 0px; padding-right: 
0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【What Is a Heap】</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " /></span>Concept: a heap is a special kind of binary tree with the following two properties:<br style="padding-top: 0px; 
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />1) The value of every node is greater than the values of its children (if every node is instead smaller than its children, the heap is called a min-heap).<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />2) The tree is completely balanced, and the leaves on the last level are all packed as far left as possible (that is, it is a complete binary tree).<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />These two properties define a max-heap. The figure below shows a heap stored in an array:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><a href="http://www.51projob.com/uploads/allimg/120824/1132312458-0.png" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img size-full=""  wp-image-102"="" title="Max heap" src="http://www.51projob.com/uploads/allimg/120824/1132312458-0.png" alt="" width="413" height="475" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: 
initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #dddddd; border-right-color: #dddddd; border-bottom-color: #dddddd; border-left-color: #dddddd; clear: both; display: block; max-width: 97.5%; height: auto; width: auto; " /></a></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Next, the binary heap. A binary heap is a complete binary tree in which, for every subtree, the keys of the left and right children (if any) are no smaller than the key of the subtree's root; this is the min-heap ordering. The figure above is in fact a binary heap.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">As you have probably noticed, with this ordering the smallest element is always the first element of the array. So how do we enqueue into a binary heap, which acts as a priority queue? See the figure:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><a href="http://www.51projob.com/uploads/allimg/120824/1132315122-1.png" style="padding-top: 0px; padding-right: 0px; 
padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img size-full=""  wp-image-103"="" title="ds_binary_heap_insert" src="http://www.51projob.com/uploads/allimg/120824/1132315122-1.png" alt="" width="246" height="480" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #dddddd; border-right-color: #dddddd; border-bottom-color: #dddddd; border-left-color: #dddddd; clear: both; display: block; max-width: 97.5%; height: auto; width: auto; " /></a></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Suppose we want to enqueue an element with key 2 into this binary heap. We simply append it to the end of the array and then move it up as far as possible, swapping it with its parent while it is smaller than the parent, until it can move no further. After this O(log n) operation, the binary heap is still a valid binary heap.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; 
border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">And how do we dequeue? That is not hard either. See the figure:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><a href="http://www.51projob.com/uploads/allimg/120824/1132315252-2.png" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img size-full=""  wp-image-105"="" title="ds_binary_heap_dequeue" src="http://www.51projob.com/uploads/allimg/120824/1132315252-2.png" alt="" width="533" height="480" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #dddddd; border-right-color: #dddddd; border-bottom-color: #dddddd; border-left-color: #dddddd; clear: 
both; display: block; max-width: 97.5%; height: auto; width: auto; " /></a><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Dequeuing always removes the first element of the array, which leaves a hole at its former position. We move this hole down to a leaf (at each step promoting the smaller child into it), then place the last element of the array into the hole and move it up as far as it can go. This operation is also O(log n).</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; 
border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Applicable Scope】</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Finding the n largest items in a massive data set, where n is small enough for the heap to fit in memory.</span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; 
border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Basic Principles and Key Points】</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Use a max-heap to find the n smallest elements and a min-heap to find the n largest. For example, to find the n smallest, compare each incoming element with the largest element in the max-heap; if it is smaller, it replaces that largest element. The n elements left at the end are the n smallest. This suits large data sets where n is small: a single scan yields all n elements, which is very efficient.</span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; 
margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Extensions】</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Dual heaps: a max-heap combined with a min-heap can be used to maintain the median.</span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; 
margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Example Problems】</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />1) Find the 100 largest numbers among 1,000,000 numbers.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />A min-heap holding 100 elements is enough.</span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Please be considerate and credit the source when reposting: <a href="http://diducoder.com/mass-data-topic-5-heap.html" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; 
border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">http://diducoder.com/mass-data-topic-5-heap.html</a></span></p></span></div><img src ="http://www.cppblog.com/mysileng/aggbug/194628.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:24 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194628.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Massive Data Processing Series (4): Bit-map (repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194627.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:24:00 GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194627.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194627.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194627.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194627.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194627.html</trackback:ping><description><![CDATA[<div><span style="color: #333333; font-family: 微软雅黑, arial, verdana; line-height: 25px; "><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 
0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【What Is a Bit-map】</strong></span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">A bit-map uses a single bit to mark the Value associated with an element, while the Key is the element itself. Because data is stored at the granularity of a bit, it saves a great deal of storage space.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; 
">If that still does not make clear what a bit-map is, consider a concrete example. Suppose we want to sort 5 elements in the range 0-7, namely (4, 7, 2, 5, 3), assuming they contain no duplicates. We can use a bit-map to sort them. Representing 8 numbers takes only 8 bits (1 byte), so we first allocate 1 byte and set all of its bits to 0, as shown below:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><a href="http://www.51projob.com/uploads/allimg/120824/11320K2b-0.png" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img size-full=""  wp-image-109"="" title="image_thumb" src="http://www.51projob.com/uploads/allimg/120824/11320K2b-0.png" alt="" width="244" height="63" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #dddddd; border-right-color: #dddddd; border-bottom-color: #dddddd; border-left-color: #dddddd; clear: both; display: block; max-width: 97.5%; height: auto; width: auto; " /></a></p><p style="padding-top: 0px; padding-right: 
0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Then traverse the 5 elements. The first element is 4, so set the bit corresponding to 4 to 1. For an element i and a byte pointer p, this can be done with *(p+(i/8)) |= (0x01&lt;&lt;(i%8)). Of course, this operation depends on the big-endian versus little-endian convention; big-endian is assumed here. Because numbering starts from zero, this sets the fifth bit to 1, as shown below:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><a href="http://www.51projob.com/uploads/allimg/120824/11320L640-1.png" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img size-full=""  wp-image-108"="" title="image" src="http://www.51projob.com/uploads/allimg/120824/11320L640-1.png" alt="" width="303" height="74" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; 
border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #dddddd; border-right-color: #dddddd; border-bottom-color: #dddddd; border-left-color: #dddddd; clear: both; display: block; max-width: 97.5%; height: auto; width: auto; " /></a></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Next, process the second element, 7, and set the eighth bit to 1. Then process the third element, and so on until every element has been handled and its corresponding bit set to 1. At that point the bits in memory look like this:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><a href="http://www.51projob.com/uploads/allimg/120824/11320K344-2.png" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img size-full=""  wp-image-110"="" title="image_thumb_2" 
src="http://www.51projob.com/uploads/allimg/120824/11320K344-2.png" alt="" width="244" height="65" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #dddddd; border-right-color: #dddddd; border-bottom-color: #dddddd; border-left-color: #dddddd; clear: both; display: block; max-width: 97.5%; height: auto; width: auto; " /></a></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Now we scan the bit region once and output the index of every bit that is set to 1 (2, 3, 4, 5, 7), which achieves the sort. The code below shows one use of a bit-map: sorting.</p><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-top-color: #cccccc; border-right-color: #cccccc; border-bottom-color: #cccccc; border-left-color: #cccccc; font-size: 13px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; background-color: #f8f8f8; "><pre style="padding-top: 5px; padding-right: 5px; padding-bottom: 5px; 
padding-left: 5px; margin-top: 5px; margin-right: 8px; margin-bottom: 5px; margin-left: 8px; font-family: verdana, arial, helvetica, sans-serif; font-size: 13px; width: 681px; overflow-x: auto; overflow-y: auto; background-image: initial; background-attachment: initial; background-origin: initial; background-clip: initial; background-color: #f4f4f4; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; line-height: 1.5; color: #000000; background-position: initial initial; background-repeat: initial initial; ">#include &lt;stdio.h&gt;   // printf
#include &lt;string.h&gt;  // memset

// Each byte holds 8 bits.
#define BYTESIZE 8

// Set bit number posi (counted from 0) in the buffer p.
void SetBit(unsigned char *p, int posi)
{
    p[posi / BYTESIZE] |= (unsigned char)(0x01 &lt;&lt; (posi % BYTESIZE));
}

void BitMapSortDemo()
{
    // For simplicity, negative numbers are not handled.
    int num[] = {3,5,2,10,6,12,8,14,9};
    const int count = sizeof(num) / sizeof(num[0]);

    // BufferLen is determined by the largest value to be sorted.
    // The largest value here is 14, so 2 bytes (16 bits) are enough.
    const int BufferLen = 2;
    unsigned char *pBuffer = new unsigned char[BufferLen];

    // Clear every bit first; otherwise the result is unpredictable.
    memset(pBuffer, 0, BufferLen);
    for (int i = 0; i &lt; count; i++)
    {
        // Set the bit that corresponds to each value.
        SetBit(pBuffer, num[i]);
    }

    // Output the sorted result.
    for (int i = 0; i &lt; BufferLen; i++)       // one byte per iteration
    {
        for (int j = 0; j &lt; BYTESIZE; j++)    // each bit of that byte
        {
            // Test bit j of byte i with the mask (0x01 &lt;&lt; j).
            if (pBuffer[i] &amp; (0x01 &lt;&lt; j))
            {
                printf("%d ", i * BYTESIZE + j);
            }
        }
    }
    printf("\n");

    // The buffer is indexed rather than advanced, so the pointer is still valid here.
    delete[] pBuffer;
}

int main()
{
    BitMapSortDemo();
    return 0;
}</pre></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 
微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Applicability】</strong></span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Fast lookup, duplicate detection, and deletion of data. As a rule of thumb, the range of the data should be less than 10 times the range of int.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: 
baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Basic principle and key points】</strong></span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Use a bit array to record whether each element is present, for example 8-digit phone numbers.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 
0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Extensions】</strong></span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">A Bloom filter can be seen as an extension of the bit-map.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 
0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Example problems】</strong></span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">1) A file contains a number of phone numbers, each 8 digits long. Count how many distinct numbers it contains.</strong></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; 
border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">An 8-digit number is at most 99,999,999, so the values 0 to 99,999,999 need 100,000,000 bits, one bit per value. That is 12,500,000 bytes, about 12.5 MB (roughly 11.9 MiB), so about 12.5 MB of memory is enough to represent every 8-digit phone number.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">2) Among 250 million integers, find how many appear exactly once, given that memory cannot hold all 250 million integers.</strong></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Extend the bit-map so that each number uses 2 bits: 0 means not seen, 1 means seen once, and 2 means seen twice or more. While scanning the numbers, if the value at the corresponding position is 0, set it to 1; if it is 1, set it to 2; if it is 2, leave it unchanged. After the scan, the number of positions whose value is 1 is the answer. Alternatively, instead of a 2-bit representation, two ordinary bit-maps can simulate the 2-bit map; the idea is the same.</p><p style="padding-top: 0px; padding-right: 0px; 
padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Please be fair and credit the source when reposting: <a href="http://diducoder.com/mass-data-4-bitmap.html" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">http://diducoder.com/mass-data-4-bitmap.html</a></p></span></div><img src ="http://www.cppblog.com/mysileng/aggbug/194627.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:24 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194627.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Massive Data Processing Series (3): Hash (repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194626.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:19:00 
GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194626.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194626.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194626.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194626.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194626.html</trackback:ping><description><![CDATA[<div><span style="font-family: Verdana, Arial, Tahoma; font-size: 12px; line-height: normal; "><h2><div><span style="color: #333333; font-family: 微软雅黑, arial, verdana; font-size: 14px; line-height: 25px; font-weight: normal; "><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: large; font-style: inherit; font-weight: 
inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【What is a hash】</span></strong></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Hash is usually translated into Chinese as &#8220;散列&#8221; and sometimes transliterated as &#8220;哈希&#8221;. A hash takes an input of arbitrary length (also called the pre-image) and, through a hash algorithm, transforms it into a fixed-length output; that output is the hash value. The transformation is a compressing map: the space of hash values is usually much smaller than the space of inputs, so different inputs may hash to the same output, and the input cannot be uniquely determined from the hash value. Put simply, it is a function that compresses a message of arbitrary length into a fixed-length <a href="http://baike.baidu.com/view/2396437.htm" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">message digest</a>.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">In information security, hash functions are widely used in cryptographic algorithms: they turn messages of different lengths into scrambled fixed-length codes (128 bits for MD5, for example), and these codes are called hash values. 
Put another way, hashing establishes a mapping between the content of the data and the address where the data is stored.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Arrays are easy to address but hard to insert into and delete from; linked lists are hard to address but easy to insert into and delete from. Can we combine the two and build a data structure that is easy to address and also easy to insert into and delete from? Yes: that is the hash table. A hash table can be implemented in several ways; what follows explains the most common one, separate chaining, which can be thought of as an &#8220;array of linked lists&#8221;, as shown in the figure:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><a href="http://www.51projob.com/uploads/allimg/120824/1131391914-0.png" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img size-full=""  wp-image-112"="" title="image_thumb_3" src="http://www.51projob.com/uploads/allimg/120824/1131391914-0.png" alt="" width="535" height="448" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0.4em; margin-right: auto; 
margin-bottom: 1.625em; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #dddddd; border-right-color: #dddddd; border-bottom-color: #dddddd; border-left-color: #dddddd; clear: both; display: block; max-width: 97.5%; height: auto; width: auto; " /></a></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The left side is clearly an array. Each slot of the array holds a pointer to the head of a linked list, which may be empty or may contain many elements. We assign each element to a list according to some feature of the element, and we use the same feature to find the right list and then locate the element within it.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The method that turns an element's feature into an array index is the hash method. There is more than one, of course; three common ones are listed below.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; 
outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">1. Division hashing</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />The most intuitive method, and the one used in the figure above. Formula:<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />index = value % 16<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Anyone who has studied assembly knows that the modulus is obtained through a division operation, hence the name &#8220;division hashing&#8221;.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; 
outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">2. Square hashing</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Computing index is a very frequent operation, and multiplication is cheaper than division (on today's CPUs we would probably not notice the difference), so we replace the division with a multiplication and a shift. Formula:<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />index = (value * value) &gt;&gt; 28<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Here value is a 32-bit unsigned integer, so shifting the 32-bit product right by 28 keeps its top 4 bits and index falls in the range 0 to 15. If the values are fairly evenly distributed this method gives good results, but for the elements in the figure I drew above every index comes out as 0, which is a complete failure. You may also wonder: if value is large, won't value * value overflow? It will, but this multiplication does not care about overflow, because we are not after the product itself, only the index (in C or C++, use an unsigned type so that the wraparound is well defined).</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; 
border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: medium; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 16px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">3. Fibonacci hashing</strong></span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The weakness of square hashing is obvious, so can we find an ideal multiplier instead of using value itself as the multiplier? Yes, we can.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; 
border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">1. For 16-bit integers, the multiplier is 40503<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />2. For 32-bit integers, the multiplier is 2654435769<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />3. For 64-bit integers, the multiplier is 11400714819323198485</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">How are these &#8220;ideal multipliers&#8221; obtained? They come from the golden ratio: each one is approximately 2 to the power of the word size divided by the golden ratio (about 1.618), and the classic expression of the golden ratio is the famous Fibonacci sequence, whose consecutive terms have ratios that approach it. If you are interested, search online for keywords such as &#8220;Fibonacci sequence&#8221;; my maths is limited and I cannot explain clearly why this works. The values of the Fibonacci sequence are also said to match the ratios of the orbital radii of the eight planets of the solar system surprisingly well, which is rather curious.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">For the 32-bit integers we usually deal with, the formula is:<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />index = (value * 2654435769) &gt;&gt; 
28</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">With Fibonacci hashing, the figure I drew above becomes:</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><a href="http://www.51projob.com/uploads/allimg/120824/11313a045-1.png" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img size-full=""  wp-image-114"="" title="image_thumb_4" src="http://www.51projob.com/uploads/allimg/120824/11313a045-1.png" alt="" width="437" height="473" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; 
border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #dddddd; border-right-color: #dddddd; border-bottom-color: #dddddd; border-left-color: #dddddd; clear: both; display: block; max-width: 97.5%; height: auto; width: auto; " /></a><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Clearly, after switching to Fibonacci hashing the elements are spread far better than with the original modulo hashing.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: large; font-style: inherit; font-weight: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Applicability】</span></strong></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; 
margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">A basic data structure for fast lookup and deletion; it usually requires that the entire data set fit in memory.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: large; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 18px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Basic Principles and Key Points】</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " 
/>Choosing the hash function: strings, integers and permutations each have their own suitable hashing methods.<br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />Collision handling: one approach is open hashing, also called separate chaining; the other is closed hashing, also called open addressing.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: large; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 18px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Extensions】</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />The d in d-left hashing means "multiple". To simplify, consider 2-left hashing first. 2-left hashing splits a hash table into two halves of equal length, T1 and T2, and gives each its own hash function, h1 and h2. To store a new key, both hash functions are 
evaluated, yielding two addresses, h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which already holds more (colliding) keys, and store the new key at the less loaded position. If both are equally loaded, for example both are empty or both hold one key, the new key goes into the left subtable T1, which is where the name 2-left comes from. Looking up a key requires hashing twice and probing both positions.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: large; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 18px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">【Problem Examples】</strong></span><br style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; " />1) From massive log data, extract the IP that visited Baidu the most times on a given day.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; 
border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">The number of distinct IPs is bounded, at most 2^32 for IPv4, so we can hash the IPs directly into an in-memory table and then count the occurrences of each.</p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 24px; ">Please be fair and credit the source when reposting:&nbsp;<a href="http://diducoder.com/mass-data-topic-3-hash.html" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">http://diducoder.com/mass-data-topic-3-hash.html</a></p></span></div></h2></span></div><img src ="http://www.cppblog.com/mysileng/aggbug/194626.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:19 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194626.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Massive Data Processing Series (2): Bloom 
Filter (Repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194625.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:03:00 GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194625.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194625.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194625.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194625.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194625.html</trackback:ping><description><![CDATA[<div><span style="color: #333333; font-family: 微软雅黑, arial, verdana; line-height: 25px; "><h1>【What Is a Bloom Filter】</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">A Bloom filter is a highly space-efficient probabilistic data structure. It uses a bit array to represent a set compactly and can test whether an element belongs to the set. This efficiency comes at a cost: when testing membership, it may mistake an element that is not in the set for one that is (a false positive). Bloom filters are therefore unsuitable for applications that require &#8220;zero error&#8221;. Where a low error rate is tolerable, a Bloom filter trades a small number of errors for a large saving in storage.&nbsp;For readers new to the topic, here is a detailed introduction to the <a 
href="http://blog.csdn.net/jiaomeng/archive/2007/01/27/1495500.aspx" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Bloom Filter</a>.</span></p><h1>【Applicability】</h1><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">It can be used to implement a data dictionary, to detect duplicate data, or to compute set intersections.</span></p><h1>【Basic Principles and Key Points】</h1><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; 
color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">The principle is simple: a bit array plus k independent hash functions. A Bloom filter provides two basic operations, adding an element to the set and testing whether an element belongs to the set. The following explains</span><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "> how each one works:</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Adding an element to the set: first hash the element with the k hash functions to obtain k hash 
indexes, then set the bits at those k indexes in the set's bit array to 1. The two figures below illustrate the process.</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><div id="attachment_224"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; width: 304px; "><a href="http://www.51projob.com/uploads/allimg/120824/1130591102-0.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-224"="" title="Initial state of the Bloom filter bit array (the set)" src="http://www.51projob.com/uploads/allimg/120824/1130591102-0.jpg" alt="Initial state of the Bloom filter bit array (the set)" width="294" height="47" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: 
solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Initial state of the Bloom filter bit array (the set)</p></div></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; ">Insert two elements, X1 and X2:</div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><div id="attachment_225"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; 
padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; width: 304px; "><a href="http://www.51projob.com/uploads/allimg/120824/1130593R2-1.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-225"="" title="Bloom filter: inserting elements" src="http://www.51projob.com/uploads/allimg/120824/1130593R2-1.jpg" alt="Bloom filter: inserting elements" width="294" height="71" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 
0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Bloom filter: inserting elements</p></div></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Testing whether an element belongs to the set: hash the element with the same hash functions to obtain its hash indexes, then check whether the bits at all of those indexes are 1. If they are, the element is reported as belonging to the set; otherwise it does not belong <span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; text-decoration: underline; ">【this is not the whole story; please read on】</span>. The figure shows the test for elements Y1 and Y2.</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; 
border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><div id="attachment_226"  aligncenter"="" style="padding-top: 9px; padding-right: 9px; padding-bottom: 9px; padding-left: 9px; margin-top: 0.4em; margin-right: auto; margin-bottom: 1.625em; margin-left: auto; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; clear: both; background-color: #eeeeee; max-width: 96%; width: 306px; "><a href="http://www.51projob.com/uploads/allimg/120824/11305a3M-2.jpg" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><img wp-image-226"="" title="Bloom filter: testing whether an element belongs to the set" src="http://www.51projob.com/uploads/allimg/120824/11305a3M-2.jpg" alt="Bloom filter: testing whether an element belongs to the set" width="296" height="67" style="padding-top: 6px; padding-right: 6px; padding-bottom: 6px; padding-left: 6px; margin-top: 0px; margin-right: auto; margin-bottom: 0px; margin-left: auto; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-width: initial; border-color: initial; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-color: #eeeeee; border-right-color: #eeeeee; 
border-bottom-color: #eeeeee; border-left-color: #eeeeee; max-width: 98%; height: auto; width: auto; display: block; " /></a><p style="padding-top: 10px; padding-right: 0px; padding-bottom: 5px; padding-left: 40px; margin-top: 0px; margin-right: 0px; margin-bottom: 0.6em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 12px; font-style: inherit; font-family: Georgia, serif; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #666666; position: relative; ">Bloom filter: testing whether an element belongs to the set</p></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 1.625em; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 13px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">As the figure shows, one of y1's three hash indexes points to a bit that is not 1, so y1 is not in the set; all of y2's hash indexes point to bits that are 1, so y2 is reported as being in the set.</span></p></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; 
outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><h1>【Limitations of the Bloom Filter】</h1></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; font-family: 'Courier New'; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Clearly, the lookup described above does not guarantee a 100% correct result. Nor does the structure support deleting a key that has already been inserted, because the bits set for that key may be shared with other keys. A simple improvement is the counting Bloom filter, which replaces the bit array with an array of counters and thereby supports deletion.</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; 
border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Another important question is how to choose the bit array size m and the number of hash functions k from the number of input elements n. The error rate is minimized when k=(ln2)*(m/n). For an error rate no greater than E, m must be at least n*lg(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit array must remain 0; m should therefore be &gt;= n*lg(1/E)*lg(e), which is roughly 1.44 times n*lg(1/E) (lg denotes the base-2 logarithm).</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">For example, with an error rate of 0.01, lg(1/E) is about 6.64, so m should be about 1.44*6.64, or roughly 9.6, times n, and k=(ln2)*(m/n) comes to about 7.</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 
0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Note that m and n are in different units: m is measured in bits, while n is a count of elements (more precisely, of distinct elements). A single element is usually many bits long, so a Bloom filter usually saves memory.</span></div><h1>【Extensions】</h1><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">A Bloom filter maps the elements of a set into a bit array and uses whether all k mapped bits (k being the number of hash functions) are 1 to indicate whether an element is in the set. A counting Bloom filter (CBF) extends each bit of the array into a counter, which makes element deletion possible. A Spectral Bloom Filter (SBF) associates the counters with the number of occurrences of each element, using the minimum of an element's counters to approximate its frequency.</span></div><h1>【Problem Examples】</h1><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span 
style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">You are given two files, A and B, each holding 5 billion URLs of 64 bytes each, and a memory limit of 4G. Find the URLs common to A and B. What if there are three files, or even n files?</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Let us work out the memory budget for this problem. 4G = 2^32 bytes, about 4 billion bytes, which times 8 is about 34 billion bits. With n = 5 billion and an error rate of 0.01, the filter needs about n*lg(1/E)*lg(e), or roughly 9.6 bits per element, which is about 48 billion bits. The 34 billion bits available are not far short of that, so the error rate would rise somewhat. Also, if these URLs map one-to-one to IPs, they can be converted to IPs, which makes the problem much simpler.</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 15px; font-family: 微软雅黑; outline-width: 0px; outline-style: initial; outline-color: 
initial; vertical-align: baseline; color: #373737; line-height: 25px; "><span style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-size: 14px; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">Please credit the source when reposting: <a href="http://diducoder.com/mass-data-topic-2-bloom-filter.html" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #1982d1; text-decoration: none; border-top-width: 0px; border-right-width: 0px; border-bottom-width: 0px; border-left-width: 0px; border-style: initial; border-color: initial; font-style: inherit; outline-width: 0px; outline-style: initial; outline-color: initial; vertical-align: baseline; ">http://diducoder.com/mass-data-topic-2-bloom-filter.html</a></span></div></span></div><img src ="http://www.cppblog.com/mysileng/aggbug/194625.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:03 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194625.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Massive Data Processing Series (1) (repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194624.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 12:02:00 
GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194624.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194624.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194624.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194624.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194624.html</trackback:ping><description><![CDATA[<div><span style="color: #333333; font-family: 微软雅黑, arial, verdana; line-height: 25px; "><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #333333; font-family: Georgia, 'Times New Roman', Times, san-serif; line-height: 25px; text-align: left; "><span>Questions about very large data sets come up often in interviews and written tests, especially at companies that handle massive data, such as Baidu, Google, and Tencent.</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #333333; font-family: Georgia, 'Times New Roman', Times, san-serif; line-height: 25px; text-align: left; "><span>The methods below are my general summary of techniques for processing massive data. They may not cover every problem, but they can handle the great majority of those you will meet. Most of the problems below come directly from company interview and written-test questions. The methods are not necessarily optimal, and if you have a better approach, I would be glad to discuss it.</span></div><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 12px; margin-right: auto; margin-bottom: 12px; margin-left: auto; line-height: 25px; color: #333333; font-family: Georgia, 'Times New Roman', Times, san-serif; text-align: left; ">&nbsp;</p><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #333333; font-family: Georgia, 'Times New Roman', Times, san-serif; line-height: 25px; text-align: left; "><span>Starting from the methods used to solve such problems, this post opens a series of topics on massive data. The series is planned to cover 
the following topics.</span></div><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; color: #333333; font-family: Georgia, 'Times New Roman', Times, san-serif; line-height: 25px; text-align: left; "><ol style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 50px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><a href="http://diducoder.com/mass-data-topic-2-bloom-filter.html" target="_blank" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #3d81ee; text-decoration: none; outline-style: none; "><span>Bloom Filter</span></a></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><a href="http://diducoder.com/mass-data-topic-3-hash.html" target="_blank" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #3d81ee; text-decoration: none; outline-style: none; "><span>Hash</span></a></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><a href="http://diducoder.com/mass-data-4-bitmap.html" target="_blank" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #3d81ee; text-decoration: none; outline-style: none; 
"><span>Bit-Map</span></a></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><a href="http://diducoder.com/mass-data-topic-5-heap.html" target="_blank" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #3d81ee; text-decoration: none; outline-style: none; "><span>Heap</span></a></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><a href="http://diducoder.com/mass-data-topic-6-multi-dividing.html" target="_blank" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #3d81ee; text-decoration: none; outline-style: none; "><span>Two-level bucket partitioning</span></a></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><a href="http://diducoder.com/mass-data-topic-7-index-and-optimize.html" target="_blank" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; color: #3d81ee; text-decoration: none; outline-style: none; "><span>Database indexes</span></a></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><a href="http://diducoder.com/mass-data-topic-8-inverted-index.html" target="_blank" style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; 
margin-bottom: 0px; margin-left: 0px; color: #3d81ee; text-decoration: none; outline-style: none; "><span>Inverted Index</span></a></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><span>External sorting</span></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><span>Trie</span></li><li style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; list-style-type: decimal; "><span>MapReduce</span></li></ol><div style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 12px; margin-right: auto; margin-bottom: 12px; margin-left: auto; line-height: 1.8; "><span>Building on these techniques, worked examples are then used to analyze solutions to massive data processing problems.</span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 12px; margin-right: auto; margin-bottom: 12px; margin-left: auto; line-height: 1.8; "><span>In fact, many similar interview questions posted in this community can be answered with these methods, for example </span><font class="Apple-style-span" color="#3D81EE"><span class="Apple-style-span" style="border-bottom-width: 1px; border-bottom-style: dashed;">Baidu's top-K popular queries problem</span></font><span>, or finding the IP address with the most visits on a given day.</span></p><p style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 12px; margin-right: auto; margin-bottom: 12px; margin-left: auto; line-height: 1.8; "><span>Study this class of problems thoroughly and you will be well prepared for interviews at companies like Baidu and Tencent!</span></p></div></div></span></div><img src ="http://www.cppblog.com/mysileng/aggbug/194624.html" width = "1" height = "1" /><br><br><div align=right><a 
style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 20:02 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194624.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>How to Quickly Crush 99% of Massive Data Processing Interview Questions (repost)</title><link>http://www.cppblog.com/mysileng/archive/2012/11/05/194623.html</link><dc:creator>鑫龙</dc:creator><author>鑫龙</author><pubDate>Mon, 05 Nov 2012 11:58:00 GMT</pubDate><guid>http://www.cppblog.com/mysileng/archive/2012/11/05/194623.html</guid><wfw:comment>http://www.cppblog.com/mysileng/comments/194623.html</wfw:comment><comments>http://www.cppblog.com/mysileng/archive/2012/11/05/194623.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/mysileng/comments/commentRss/194623.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/mysileng/services/trackbacks/194623.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp; Summary: How to Quickly Crush 99% of Massive Data Processing Interview Questions. Preface.&nbsp;&nbsp; Generally speaking, titles containing words like &#8220;crush&#8221; or &#8220;the most complete/strongest ever&#8221; can hardly escape the suspicion of sensationalism. Still, if readers finish this article and gain nothing from it, I will gladly accept that charge :-). This article can also be read as a general, more abstract summary of another article, &#8220;Ten Massive Data Processing Interview Questions and a Summary of Ten Methods&#8221;.&nbsp; &nbsp; ...&nbsp;&nbsp;<a href='http://www.cppblog.com/mysileng/archive/2012/11/05/194623.html'>Read the full article</a><img src ="http://www.cppblog.com/mysileng/aggbug/194623.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/mysileng/" target="_blank">鑫龙</a> 2012-11-05 19:58 <a href="http://www.cppblog.com/mysileng/archive/2012/11/05/194623.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>