﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++ Blog - A Lab Homebody's Own Little Plot - Post Category - KFS Analysis</title><link>http://www.cppblog.com/whspecial/category/20716.html</link><description /><language>zh-cn</language><lastBuildDate>Wed, 23 Oct 2013 17:06:58 GMT</lastBuildDate><pubDate>Wed, 23 Oct 2013 17:06:58 GMT</pubDate><ttl>60</ttl><item><title>KFS Code Analysis 2 (Meta Server Metadata Persistence)</title><link>http://www.cppblog.com/whspecial/archive/2013/10/24/203894.html</link><dc:creator>whspecial</dc:creator><author>whspecial</author><pubDate>Wed, 23 Oct 2013 17:03:00 GMT</pubDate><guid>http://www.cppblog.com/whspecial/archive/2013/10/24/203894.html</guid><wfw:comment>http://www.cppblog.com/whspecial/comments/203894.html</wfw:comment><comments>http://www.cppblog.com/whspecial/archive/2013/10/24/203894.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/whspecial/comments/commentRss/203894.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/whspecial/services/trackbacks/203894.html</trackback:ping><description><![CDATA[<p style="line-height:150%"><span style="font-size: 9pt; line-height: 150%; font-family: 宋体;">KFS</span><span style="font-size: 9pt; line-height: 150%; font-family: 宋体;"> persists its metadata through a combination of checkpoints and an operation log. A checkpoint, as the name suggests, stores the in-memory state at a particular point in time, while the operation log records every operation that modifies the metadata.</span></p>  <h1><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;">Why use a </span><span style="font-size:12.0pt;">checkpoint+log</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> design<br /></span><span style="font-weight: normal;"><span style="font-size: 9pt; line-height: 150%; font-family: 宋体;">(1)<span style="font-size: 7pt; line-height: normal; 
font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp; </span></span><span style="font-size: 9pt; line-height: 150%; font-family: 宋体;">Metadata must be persisted; otherwise it is lost after a power failure or a manual restart.<br /></span><span style="font-size: 9pt; line-height: 150%; font-family: 宋体;">(2)<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp; </span></span><span style="font-size: 9pt; line-height: 150%; font-family: 宋体;">It enables fast restarts: the in-memory state can be rebuilt quickly from the most recent checkpoint, and replaying the log written after that checkpoint restores the complete state.<br /><br /></span></span></h1>  <h1><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;">How the </span><span style="font-size:12.0pt;">checkpoint</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> and the </span><span style="font-size:12.0pt;">log</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> are read and written<br /><br /></span><strong style="line-height: 150%; font-size: 14px;">MetaServer</strong><strong style="line-height: 150%; font-size: 14px;"><span style="font-family:宋体;"> in-memory rebuild at startup:</span></strong></h1>  <p style="line-height:150%">Startup.cc<span style="font-family:宋体;"> calls the </span>rebuild<span style="font-family:宋体;"> function:</span></p>  <p style="margin-left:36.0pt;text-indent:-36.0pt;line-height:150%;">(1)<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-family:宋体;">If a </span>checkpoint<span style="font-family:宋体;"> already exists, the in-memory tree is rebuilt from that </span>checkpoint<span style="font-family:宋体;">; otherwise a new in-memory tree is created.</span></p>  <p style="margin-left:36.0pt;text-indent:-36.0pt;line-height:150%;">(2)<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-family:宋体;">Everything written to the </span>operation log<span style="font-family:宋体;"> after that </span>checkpoint<span style="font-family:宋体;"> is then replayed in memory.</span><br /><br /></p>  <p 
style="line-height:150%"><strong>MetaServer</strong><strong><span style="font-family:宋体;"> writes a new </span>checkpoint</strong><strong><span style="font-family:宋体;"> while running:<br /></span></strong></p>  <p style="line-height:150%"><span style="font-family:宋体;">This is done by the </span>main<span style="font-family:宋体;"> function of </span>logcompactor_main.cc<span style="font-family:宋体;">, which appears to run as a separate process; my guess is that the </span>MetaServer<span style="font-family:宋体;"> process launches it periodically.</span></p>  <p style="margin-left:36.0pt;text-indent:-36.0pt;line-height:150%;">(1)<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-family:宋体;">The in-memory state is built from the old </span>checkpoint.</p>  <p style="margin-left:36.0pt;text-indent:-36.0pt;line-height:150%;">(2)<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-family:宋体;">The </span>op log<span style="font-family:宋体;"> entries written after it are replayed in memory.</span></p>  <p style="margin-left:36.0pt;text-indent:-36.0pt;line-height:150%;">(3)<span style="font-size: 7pt; line-height: normal; font-family: 'Times New Roman';">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span><span style="font-family:宋体;">The resulting in-memory state is written out as a new </span>checkpoint.<br /><br /></p>  <p style="line-height:150%"><strong>MetaServer</strong><strong><span style="font-family:宋体;"> writes new entries to the </span>log</strong><strong><span style="font-family:宋体;"> while running:</span></strong></p>  <p style="line-height:150%"><span style="font-family:宋体;">New entries are written by </span>logger.cc<span style="font-family:宋体;">. From reading the code, it appears that every operation that modifies metadata has its </span>op log<span style="font-family:宋体;"> entry written to disk. That is not fast, but it is fairly reliable. (I once wrote a logging library myself that alternated writes between two buffers, which is somewhat more efficient.)</span></p><img src ="http://www.cppblog.com/whspecial/aggbug/203894.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/whspecial/" target="_blank">whspecial</a> 
2013-10-24 01:03 <a href="http://www.cppblog.com/whspecial/archive/2013/10/24/203894.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>KFS Code Analysis 1 (Meta Server In-Memory Structure)</title><link>http://www.cppblog.com/whspecial/archive/2013/10/23/203879.html</link><dc:creator>whspecial</dc:creator><author>whspecial</author><pubDate>Tue, 22 Oct 2013 17:36:00 GMT</pubDate><guid>http://www.cppblog.com/whspecial/archive/2013/10/23/203879.html</guid><wfw:comment>http://www.cppblog.com/whspecial/comments/203879.html</wfw:comment><comments>http://www.cppblog.com/whspecial/archive/2013/10/23/203879.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/whspecial/comments/commentRss/203879.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/whspecial/services/trackbacks/203879.html</trackback:ping><description><![CDATA[<p><span style="font-family: Verdana; font-size: 10pt;">KFS here refers to the </span><span style="font-family: Verdana; font-size: 10pt; line-height: 18px; background-color: #ffffff;">Kosmos distributed file system, whose code is hosted at </span><span style="font-family: Verdana; font-size: 10pt;"><a href="http://sourceforge.net/projects/kosmosfs/">http://sourceforge.net/projects/kosmosfs/</a>. I plan to write several related posts as a reference for later readers.</span><span style="font-family: Verdana; font-size: 10pt;"><br /></span><span style="font-family: 宋体;"><br />The Meta server in KFS keeps its metadata mainly in a B+ tree held in memory. A detailed analysis follows:</span></p>  <h1><span style="font-size:12.0pt;">B-</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;">tree and </span><span style="font-size:12.0pt;">B+</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> tree definitions</span></h1>  <p style="line-height:150%"><span style="font-family:宋体;">For the definitions of these trees it is best to consult a classic text such as Introduction to Algorithms, since some of the information on the web is not very accurate. For convenience, here is a link anyway:</span></p>  <p style="line-height:150%"><a 
href="http://www.cnblogs.com/oldhorse/archive/2009/11/16/1604009.html">http://www.cnblogs.com/oldhorse/archive/2009/11/16/1604009.html</a></p>  <h1><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;">Why does </span><span style="font-size:12.0pt;">KFS</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> use a </span><span style="font-size:12.0pt;">B+</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> tree rather than a </span><span style="font-size:12.0pt;">B</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;">-tree?</span></h1>  <p style="line-height:150%"><span style="font-family:宋体;">This is my own understanding:</span></p>  <p style="line-height:150%"><span style="font-family:宋体;">A </span>B<span style="font-family:宋体;">-tree can satisfy a lookup at a non-leaf node, which shortens the average search path somewhat. The advantage of a </span>B+<span style="font-family:宋体;"> tree in this application is that every node carries a pointer to the </span>next<span style="font-family:宋体;"> node at the same level, which suits range queries and traversals well. A file system request such as running </span>ls<span style="font-family:宋体;"> on a subdirectory can be served fairly efficiently with a </span>B+<span style="font-family:宋体;"> tree.</span></p>  <h1><span style="font-size:12.0pt;">KFS</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> </span><span style="font-size:12.0pt;">B+</span><span style="font-size:12.0pt; font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> tree class diagram<br /></span></h1><p style="line-height: 150%;"><span style="font-size: 11.0pt;line-height:150%"><img src="http://www.cppblog.com/images/cppblog_com/whspecial/image1.png" width="480" height="184" alt="" /><br />MetaNode</span><span style="font-size:11.0pt;line-height:150%;font-family: 宋体;">: </span>base class for both internal and leaf nodes</p>  <p style="line-height:150%"><span style="font-size: 11.0pt;line-height:150%">Meta</span><span style="font-size:11.0pt;line-height:150%;font-family: 宋体;">: </span>base class for data objects (leaf nodes)</p>  <p style="line-height:150%">Node<span style="font-family:宋体;">: </span>an internal node in the KFS search tree</p>  <p 
style="line-height:150%">MetaChunkInfo<span style="font-family:宋体;">: </span>chunk information for a given file offset</p>  <p style="line-height:150%">MetaDentry<span style="font-family:宋体;">: </span>directory entry, mapping a file name to a file id</p>  <p style="line-height:150%">MetaFattr<span style="font-family:宋体;">: </span>file or directory attributes</p>  <h1><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;">Introduction to each node type<br /></span></h1>  <p style="line-height:150%"><strong><span style="font-family:宋体;">(</span>1</strong><strong><span style="font-family:宋体;">) </span>Meta</strong><span style="font-family:宋体;"> is the base class of the leaf nodes; its most important member variable is </span>fid.<br /></p><p style="line-height:150%"><span style="font-family:宋体;">There are three leaf node types: </span>MetaChunkInfo<span style="font-family:宋体;">, </span>MetaDentry<span style="font-family:宋体;">, and </span>MetaFattr.<br /><br /></p>  <p style="line-height:150%"><strong><span style="font-family:宋体;">(</span>2</strong><strong><span style="font-family:宋体;">) </span>MetaDentry</strong><strong><span style="font-family:宋体;">: </span></strong><span style="font-family:宋体;">maps a file name to an </span>fid<span style="font-family:宋体;">. Every file or directory has one </span>MetaDentry.</p><p style="line-height:150%"><span style="font-family:宋体;">Member variables:</span></p>  <p style="line-height:150%">dir<span style="font-family:宋体;">: the parent directory's </span>fid</p>  <p style="line-height:150%">name<span style="font-family:宋体;">: the name of the </span>dentry<span style="font-family:宋体;">, which is simply the file name<br /><br /></span></p>  <p style="line-height:150%"><strong><span style="font-family:宋体;">(</span>3</strong><strong><span style="font-family:宋体;">) </span>MetaFattr</strong><strong><span style="font-family: 宋体;">: </span></strong><span style="font-family:宋体;">maps an </span>fid<span style="font-family:宋体;"> to file attributes. Every file or directory has one </span>MetaFattr<span style="font-family:宋体;">.<br /></span></p><p style="line-height:150%"><span style="font-family:宋体;">Member variables:</span></p> 
 <p style="line-height:150%">Type<span style="font-family:宋体;">: file or directory</span></p>  <p style="line-height:150%">numReplicas<span style="font-family:宋体;">: number of replicas of the file</span></p>  <p style="line-height:150%">mtime<span style="font-family:宋体;">: modification time</span></p>  <p style="line-height:150%">ctime<span style="font-family:宋体;">: attribute change time</span></p>  <p style="line-height:150%">crtime<span style="font-family:宋体;">: file creation time</span></p>  <p style="line-height:150%">chunkcount<span style="font-family:宋体;">: number of chunks that make up the file</span></p>  <p style="line-height:150%">filesize<span style="font-family:宋体;">: file size</span></p>  <p style="line-height:150%">nextChunkOffset<span style="font-family:宋体;">: offset within the file of the last chunk</span></p>  <p style="line-height:150%">mode_t mode<span style="font-family:宋体;">: file permissions (the </span>rwx<span style="font-family:宋体;"> bits)</span></p>  <p style="line-height:150%">key<span style="font-family:宋体;">: built from </span>KFS_FATTR<span style="font-family:宋体;"> and </span>fid<span style="font-family:宋体;">, so the node holding a file's attributes can be found directly from its </span>fid<span style="font-family:宋体;">.<br /></span><br /> <strong><span style="font-family:宋体;">(</span>4</strong><strong><span style="font-family:宋体;">) </span>MetaChunkInfo</strong><strong><span style="font-family: 宋体;">: </span></strong><span style="font-family:宋体;">records the information for one </span>chunk<span style="font-family:宋体;"> of a file. A file made up of several chunks needs several </span>MetaChunkInfo<span style="font-family:宋体;"> nodes.<br /></span></p><p style="line-height:150%"><span style="font-family:宋体;">Member variables:</span></p>  <p style="line-height:150%">offset<span style="font-family:宋体;">: the offset of the </span>chunk<span style="font-family:宋体;"> within the file, since a file may consist of several chunks</span></p>  <p style="line-height:150%">chunkId<span style="font-family:宋体;">: the </span>chunk<span style="font-family:宋体;">'s </span>id</p>  <p 
style="line-height:150%">chunkVersion<span style="font-family:宋体;">: the </span>chunk<span style="font-family:宋体;">'s </span>version<span style="font-family:宋体;"> value<br /><br /></span></p>  <p style="line-height:150%"><strong><span style="font-family:宋体;">(</span>5</strong><strong><span style="font-family:宋体;">) </span>Node</strong><strong><span style="font-family: 宋体;">: </span></strong><span style="font-family:宋体;">implements the internal nodes of the </span>B+<span style="font-family:宋体;"> tree. These nodes serve only as an index; the actual metadata is stored in the leaf nodes at the bottom of the tree.<br /></span></p><p style="line-height:150%"><span style="font-family:宋体;">Member variables:</span></p>  <p style="line-height:150%">NKEY = 32<span style="font-family:宋体;">: the maximum number of keys per node, which is also the maximum number of children; a node that would exceed this value is split</span></p>  <p style="line-height:150%">NSPLIT = NKEY / 2<span style="font-family:宋体;">: the number of keys in each node after a split (16 when NKEY is 32)</span></p>  <p style="line-height:150%">NFEWEST = NKEY - NSPLIT<span style="font-family:宋体;">: the minimum number of keys per node (also 16 when NKEY is 32); when a node falls below this value, two nodes are merged</span></p>  <p style="line-height:150%">count<span style="font-family:宋体;">: the number of keys the node actually holds</span></p>  <p style="line-height:150%">Key childKey[NKEY]<span style="font-family:宋体;">: the list of keys stored in the node</span></p>  <p style="line-height:150%">MetaNode *childNode[NKEY]<span style="font-family:宋体;">: the list of pointers to the node's children</span></p>  <p style="line-height:150%">Node *next<span style="font-family:宋体;">: pointer to the next node at the same level</span></p>  <p style="line-height:150%"><span style="font-family:宋体;">In effect every internal node has order </span>32<span style="font-family:宋体;"> and can have up to </span>32<span style="font-family:宋体;"> children, while each leaf node stores a single </span>key<span style="font-family:宋体;">.</span></p>  <h1><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;">How are the three leaf node types arranged in the </span><span style="font-size:12.0pt;">B+</span><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;"> tree?</span></h1>  <p style="line-height:150%"><span style="font-family:宋体;">As one would expect, nodes of the same type are clustered together. The ordering function therefore compares the node type first and then compares the member variables within the node. </span>MetaDentry<span 
style="font-family:宋体;"> is ordered by </span>dir<span style="font-family:宋体;"> (the parent directory's </span>id<span style="font-family:宋体;">), </span>MetaFattr<span style="font-family:宋体;"> by </span>fid<span style="font-family:宋体;">, and </span>MetaChunkInfo<span style="font-family:宋体;"> by </span>id<span style="font-family:宋体;"> and </span>chunkId<span style="font-family:宋体;">.</span></p>  <h1><span style="font-size:12.0pt;font-family:宋体;Times New Roman&quot;;Times New Roman&quot;">A loosely related thought</span></h1>  <p style="line-height:150%"><span style="font-size: 10pt; line-height: 150%; font-family: 宋体;">Looking at the three leaf node types above, we can see that chunk location information is not stored in the B+ tree. It is kept separately in a Map data structure and is not persisted by the meta server; instead, each chunk server reports it to the meta server when it starts up. The reason for not persisting it can be understood as follows:</span></p>  <p style="line-height:150%"><span style="font-size: 10pt; line-height: 150%; font-family: 宋体;">Only the chunk server can ultimately determine whether a chunk is on its disks. Errors on a chunk server can make chunks vanish spontaneously (for example, a disk fails or becomes inaccessible), or an operator may rename a chunk server, so it is more dependable to have the chunk server report this information.</span></p><img src ="http://www.cppblog.com/whspecial/aggbug/203879.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/whspecial/" target="_blank">whspecial</a> 2013-10-23 01:36 <a href="http://www.cppblog.com/whspecial/archive/2013/10/23/203879.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>