﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-学海无涯-随笔分类-原创</title><link>http://www.cppblog.com/hzh416/category/19771.html</link><description>在每天的学习中不断成长</description><language>zh-cn</language><lastBuildDate>Thu, 30 Aug 2012 08:33:45 GMT</lastBuildDate><pubDate>Thu, 30 Aug 2012 08:33:45 GMT</pubDate><ttl>60</ttl><item><title>计算主题映射概率（二）计算方法</title><link>http://www.cppblog.com/hzh416/archive/2012/08/07/186494.html</link><dc:creator>小豪</dc:creator><author>小豪</author><pubDate>Tue, 07 Aug 2012 02:24:00 GMT</pubDate><guid>http://www.cppblog.com/hzh416/archive/2012/08/07/186494.html</guid><wfw:comment>http://www.cppblog.com/hzh416/comments/186494.html</wfw:comment><comments>http://www.cppblog.com/hzh416/archive/2012/08/07/186494.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/hzh416/comments/commentRss/186494.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/hzh416/services/trackbacks/186494.html</trackback:ping><description><![CDATA[&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;这部分是开始计算主题映射的概率，之前由于对这个过程比较模糊，因此浪费了许多时间，当后来对整个计算过程思路清晰时，整个代码写出来也就水到渠成了。<br />
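So the first step is to explain how the topic mapping probability is computed. Let the source side be e and the target side be f, and take one example (to keep the arithmetic simple, assume each sentence has three topics on the source side and three on the target side; the real data has 100 on each side).<br />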
<div style="text-align: center;"><img src="http://www.cppblog.com/images/cppblog_com/hzh416/QQ截图20120807101120.jpg" alt="" align="left" /></div>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
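In the figure above, e1, e2 and e3 stand for the source-language words, and the number next to each one is its topic probability. Likewise, f1, f2 and f3 stand for the target-language words, and the numbers next to them are the corresponding topic probabilities.<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;First compute the co-occurrence counts between every source entry and every target entry, namely P(e1,f1),&nbsp;P(e1,f2),&nbsp;P(e1,f3),&nbsp;P(e2,f1),&nbsp;P(e2,f2),&nbsp;P(e2,f3),&nbsp;P(e3,f1),&nbsp;P(e3,f2),&nbsp;P(e3,f3), nine in total. Taking P(e1,f1) as the example: P(e1,f1) = e1 * f1 * (number of alignment links) = 0.2 * 0.1 * 3 = 0.06.<br />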
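The per-sentence step described above can be sketched in C++ roughly as follows. This is only a minimal sketch under stated assumptions, not the code used for the experiments: the names Matrix, sentence_cooccurrence and accumulate are made up for illustration, and one alignment-link count per sentence is assumed, as in the 0.2 * 0.1 * 3 example.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias: counts[i][j] belongs to the pair (e_i, f_j).
using Matrix = std::vector<std::vector<double>>;

// Weighted co-occurrence of every (source i, target j) pair in one sentence:
// src[i] * tgt[j] * link_count, the same product as in the P(e1,f1) example.
Matrix sentence_cooccurrence(const std::vector<double>& src,
                             const std::vector<double>& tgt,
                             int link_count) {
    assert(link_count >= 0);
    Matrix counts(src.size(), std::vector<double>(tgt.size(), 0.0));
    for (std::size_t i = 0; i < src.size(); ++i) {
        for (std::size_t j = 0; j < tgt.size(); ++j) {
            counts[i][j] = src[i] * tgt[j] * link_count;
        }
    }
    return counts;
}

// Adds one sentence's counts into the running corpus-wide totals.
void accumulate(Matrix& total, const Matrix& sentence) {
    assert(total.size() == sentence.size());
    for (std::size_t i = 0; i < total.size(); ++i) {
        assert(total[i].size() == sentence[i].size());
        for (std::size_t j = 0; j < total[i].size(); ++j) {
            total[i][j] += sentence[i][j];
        }
    }
}
```

With 0.2 as the first source value, 0.1 as the first target value and 3 links, counts[0][0] comes out as 0.2 * 0.1 * 3 (up to floating-point rounding). Calling accumulate once per sentence gives the corpus-wide totals.<br />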
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Next, compute these nine counts for each of the n sentences and add them up pair by pair across all sentences. This gives the corpus-wide co-occurrence counts of e and f: P(e1,f1),&nbsp;P(e1,f2),&nbsp;P(e1,f3),&nbsp;P(e2,f1),&nbsp;P(e2,f2),&nbsp;P(e2,f3),&nbsp;P(e3,f1),&nbsp;P(e3,f2),&nbsp;P(e3,f3).<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;From these nine totals, compute the total count of each of e1, e2, e3, f1, f2 and f3 by summing over the other side. For example, P(e1)=P(e1,f1)+P(e1,f2)+P(e1,f3), and similarly P(f2)=P(e1,f2)+P(e2,f2)+P(e3,f2).<br />
Now the mapping probabilities can be computed: P(e1|f1),&nbsp;P(e1|f2),&nbsp;P(e1|f3),&nbsp;P(e2|f1),&nbsp;P(e2|f2),&nbsp;P(e2|f3),&nbsp;P(e3|f1),&nbsp;P(e3|f2),&nbsp;P(e3|f3). They follow from the conditional probability formula; taking P(e1|f1) as the example, P(e1|f1) = P(e1,f1)/P(f1).<br />
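The marginal and conditional steps above can be sketched as one small C++ function. This is a hedged illustration only: Matrix and conditional_given_target are hypothetical names, and rows are assumed to index e while columns index f.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias: joint[i][j] is the accumulated count of (e_i, f_j).
using Matrix = std::vector<std::vector<double>>;

// Returns cond with cond[i][j] = P(e_i | f_j) = joint[i][j] / P(f_j),
// where P(f_j) is the column sum joint[0][j] + joint[1][j] + ...
Matrix conditional_given_target(const Matrix& joint) {
    assert(!joint.empty());
    const std::size_t rows = joint.size();
    const std::size_t cols = joint[0].size();

    // Marginal totals of each f_j (sum over all e_i).
    std::vector<double> f_total(cols, 0.0);
    for (const auto& row : joint) {
        assert(row.size() == cols);
        for (std::size_t j = 0; j < cols; ++j) {
            f_total[j] += row[j];
        }
    }

    Matrix cond(rows, std::vector<double>(cols, 0.0));
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // Leave the entry at 0 when f_j never occurred.
            if (f_total[j] > 0.0) {
                cond[i][j] = joint[i][j] / f_total[j];
            }
        }
    }
    return cond;
}
```

Each column of the result sums to 1 whenever the corresponding f occurred at all. The opposite direction, P(f|e), would divide by row sums instead.<br />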
We can arrange these mapping probabilities into a source-side mapping matrix and a target-side mapping matrix:<br />
<div style="text-align: left;"><img src="http://www.cppblog.com/images/cppblog_com/hzh416/QQ截图20120807101816.jpg" alt="" />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<img src="http://www.cppblog.com/images/cppblog_com/hzh416/QQ截图20120807101746.jpg" width="362" height="115" alt="" /> &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div>
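Then treat each sentence's source-side topic distribution as a vector {P(f1),P(f2),P(f3)} and multiply it by the source-side mapping matrix (an ordinary matrix-vector product). The result is the source-to-target mapped topic distribution P(e1),P(e2),P(e3). The target-to-source mapped topic distribution is obtained in the same way.<br />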
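The multiplication described above is a plain matrix-vector product. A minimal C++ sketch, assuming the matrix is stored as cond[i][j] = P(e_i | f_j) and using the hypothetical names Matrix and map_distribution:<br />

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias: cond[i][j] = P(e_i | f_j).
using Matrix = std::vector<std::vector<double>>;

// mapped[i] = sum over j of P(e_i | f_j) * dist[j].
std::vector<double> map_distribution(const Matrix& cond,
                                     const std::vector<double>& dist) {
    std::vector<double> mapped(cond.size(), 0.0);
    for (std::size_t i = 0; i < cond.size(); ++i) {
        assert(cond[i].size() == dist.size());
        for (std::size_t j = 0; j < dist.size(); ++j) {
            mapped[i] += cond[i][j] * dist[j];
        }
    }
    return mapped;
}
```

If every column of cond sums to 1 and dist sums to 1, the mapped vector also sums to 1, so it is again a distribution.<br />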
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Finally, insert the mapped topic distributions back into the original corpus.<br />
<img src="http://www.cppblog.com/images/cppblog_com/hzh416/QQ截图20120807102212.jpg" width="1314" height="615" alt="" /><br />
In the figure above, line 9 is the computed source-to-target mapped topic distribution, and line 11 is the target-to-source mapped topic distribution.<br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hzh416/" target="_blank">小豪</a> 2012-08-07 10:24 <a href="http://www.cppblog.com/hzh416/archive/2012/08/07/186494.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Computing Topic Mapping Probabilities (1): Reading the Document Topic Distributions</title><link>http://www.cppblog.com/hzh416/archive/2012/08/06/186475.html</link><dc:creator>小豪</dc:creator><author>小豪</author><pubDate>Mon, 06 Aug 2012 11:31:00 GMT</pubDate><guid>http://www.cppblog.com/hzh416/archive/2012/08/06/186475.html</guid><wfw:comment>http://www.cppblog.com/hzh416/comments/186475.html</wfw:comment><comments>http://www.cppblog.com/hzh416/archive/2012/08/06/186475.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/hzh416/comments/commentRss/186475.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/hzh416/services/trackbacks/186475.html</trackback:ping><description><![CDATA[The corpus used here contains 10,934 documents. All sentences within one document are assumed to share the same topic distribution, so each document corresponds to exactly one topic distribution. Before the topic mapping probabilities can be computed, the corpus therefore has to be preprocessed: the topic distributions must first be read into it. Documents are delimited by &lt;doc&gt;&lt;/doc&gt;.<br />The original corpus file looks like this:<br /><img src="http://www.cppblog.com/images/cppblog_com/hzh416/QQ截图20120806164959.jpg" width="1325" height="378" alt="" /><br />These are the first two sentences of the first document. After the topic distributions have been read in, the document looks like this:<br /><img src="http://www.cppblog.com/images/cppblog_com/hzh416/未命名.jpg" alt="" /><br />Two parts have been added, &lt;src_topic&gt; and &lt;tgt_topic&gt;. The former is the topic distribution of the source language and the latter that of the target language. Both are read from a given file, whose format is:<br /><img src="http://www.cppblog.com/images/cppblog_com/hzh416/QQ截图20120806191828.jpg" width="1285" height="288" alt="" /><br />These are the first and second topic distributions of the source language; each one contains 100 topic probabilities. All that remains is to copy each distribution into every sentence of the corresponding document in the corpus.&nbsp;&nbsp;&nbsp;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The idea of the algorithm is simple: find each document, locate each sentence inside it, and insert the topic distribution right after the sentence's alignment information. The code that locates each sentence comes first:<br /><div style="font-size: 13px; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-top-color: #cccccc; border-right-color: #cccccc; border-bottom-color: #cccccc; border-left-color: #cccccc; border-image: initial; padding-right: 5px; padding-bottom: 4px; padding-left: 4px; padding-top: 4px; width: 98%; word-break: break-all; background-color: #eeeeee; ">string&nbsp;read_bead(string&nbsp;bead,string&nbsp;topic)<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;string&nbsp;str,str_lag,result;<br />&nbsp;&nbsp;&nbsp;&nbsp;istringstream&nbsp;input(bead,istringstream::in);<br />&nbsp;&nbsp;&nbsp;&nbsp;size_t&nbsp;y=0;<br />&nbsp;&nbsp;&nbsp;&nbsp;while(getline(input,str))<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;str_lag.append(str);<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;str_lag.push_back('\n');<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y=str_lag.find("&lt;/bead&gt;");//&nbsp;the&nbsp;&lt;/bead&gt;&nbsp;tag&nbsp;marks&nbsp;the&nbsp;end&nbsp;of&nbsp;a&nbsp;sentence<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if&nbsp;(y!=string::npos)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;result+=read_topic(str_lag,topic);//&nbsp;insert&nbsp;the&nbsp;topic&nbsp;distribution&nbsp;into&nbsp;this&nbsp;sentence<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;str_lag.clear();<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;result=result+"&lt;/doc&gt;";<br />&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;result;<br />}</div>Once a sentence has been found, the topic distribution is inserted after its alignment information:<br /><div style="font-size: 13px; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-top-color: #cccccc; border-right-color: #cccccc; border-bottom-color: #cccccc; border-left-color: #cccccc; border-image: initial; padding-right: 5px; padding-bottom: 4px; padding-left: 4px; padding-top: 4px; width: 98%; word-break: break-all; background-color: #eeeeee; ">string&nbsp;read_topic(string&nbsp;bead,string&nbsp;topic)<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;const&nbsp;string&nbsp;tag="&lt;/aligment&gt;";//&nbsp;tag&nbsp;name&nbsp;spelled&nbsp;as&nbsp;in&nbsp;the&nbsp;original&nbsp;code<br />&nbsp;&nbsp;&nbsp;&nbsp;size_t&nbsp;x=bead.find(tag);<br />&nbsp;&nbsp;&nbsp;&nbsp;if&nbsp;(x==string::npos)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;bead;//&nbsp;no&nbsp;alignment&nbsp;block:&nbsp;leave&nbsp;the&nbsp;sentence&nbsp;unchanged<br />&nbsp;&nbsp;&nbsp;&nbsp;bead.insert(x+tag.size()+1,topic);//&nbsp;tag&nbsp;length&nbsp;11,&nbsp;plus&nbsp;1&nbsp;for&nbsp;the&nbsp;character&nbsp;after&nbsp;it&nbsp;(x+12&nbsp;originally)<br />&nbsp;&nbsp;&nbsp;&nbsp;return&nbsp;bead;<br />}</div>Here the topic string has been read from the topic distribution file beforehand.<br />&nbsp; &nbsp; &nbsp; This completes the preprocessing of the corpus. The next step is to compute the topic mapping probabilities.<br /><br />References:<br />1. A Topic Similarity Model for HPB, Xinyan Xiao, ACL 2012<br /><div>2. Hidden Topic Markov Model</div><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hzh416/" target="_blank">小豪</a> 2012-08-06 19:31 <a href="http://www.cppblog.com/hzh416/archive/2012/08/06/186475.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Common Linux Commands (Continuously Updated)</title><link>http://www.cppblog.com/hzh416/archive/2012/08/06/186470.html</link><dc:creator>小豪</dc:creator><author>小豪</author><pubDate>Mon, 06 Aug 2012 08:16:00 GMT</pubDate><guid>http://www.cppblog.com/hzh416/archive/2012/08/06/186470.html</guid><wfw:comment>http://www.cppblog.com/hzh416/comments/186470.html</wfw:comment><comments>http://www.cppblog.com/hzh416/archive/2012/08/06/186470.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/hzh416/comments/commentRss/186470.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/hzh416/services/trackbacks/186470.html</trackback:ping><description><![CDATA[Since our large jobs all have to run on servers, being fluent with a handful of Linux commands matters. The commonly used ones are recorded below.<br /><p><strong><span>1. pwd </span></strong></p>
<span>pwd is one of the most basic and most frequently used commands; it prints the directory the user is currently in.<br /></span><strong>2. cd&nbsp;</strong><br /><p>cd changes the current working directory; its usage is essentially the same as the cd command under DOS.</p>

<p>(1) cd .. moves up to the parent directory.</p>

<p>(2) cd - returns to the previous working directory.</p>

<p>(3) cd ~ goes to the user's home directory.</p>

<p>(4) cd XXX enters the subdirectory XXX.</p><p><strong>3. ls</strong></p>

ls, like the dir command under DOS, lists the contents of the current directory.<br /><p><strong>4. cp</strong></p>

<p>cp copies files or directories.</p>

<p>cp can copy several files at once. For example, $ cp *.txt *.doc *.bak /home copies every file in the current directory whose extension is txt, doc, or bak into /home. To copy a whole directory together with all of its subdirectories, use cp -R.</p><p><strong>5. mv</strong></p>

<p>mv moves and renames files.<br />Example 1: $ mv example.txt /home moves example.txt from the current directory into /home.</p>

<p>Example 2: $ mv example.txt sample.txt renames example.txt to sample.txt.</p>

<p>Like cp, mv can move several files at once.</p><p><strong>6. mkdir</strong></p>

<p>This command is simple: it creates directories and is used almost exactly like the md command under DOS.</p>

<p>-m: sets the access permissions of the new directory; they can also be set afterwards with chmod.</p>

<p>-p: the argument may be a path. If some directories along the path do not exist yet, this option makes the system create them automatically, so several directories can be created in one go. For example: $ mkdir -p DIRC/hello.<br /></p><strong><span>7. Compressing and extracting tar.gz files<br /></span></strong><span>Extract: tar zxvf FileName.tar.gz&nbsp;</span><br /><span>Compress: tar zcvf FileName.tar.gz DirName</span>&nbsp;<br /><p>For details on creating and extracting tar, bz, gz and other archives under Linux, see: <a href="http://www.bitscn.com/os/linux/200802/127470.html">http://www.bitscn.com/os/linux/200802/127470.html</a></p><p><strong>8. iconv</strong></p>  <p>Converts the character encoding of a text file.</p>  <p>Example: iconv -f gbk -t utf8 filename1 &gt; filename2 converts filename1 from GBK to UTF-8 and saves the result as filename2.</p><p><strong>9. chmod</strong></p>

<p>Usage: chmod [-cfvR] [--help] [--version] mode file...</p>

<p>Description: Linux/Unix file access permissions are divided into three classes: the file's owner, its group, and others. chmod controls how a file may be accessed by other people.</p>

<p>Parameters:</p>

<p>mode: the permission string, in the form [ugoa...][[+-=][rwxX]...][,...], where u is the owner of the file, g is the members of the file's group, o is everyone else, and a is all three.</p>

<p>+ adds a permission, - removes a permission, and = sets the permissions to exactly the ones given.</p>

<p>r is read, w is write, x is execute; X grants execute only if the file is a directory or already has execute permission set.</p>

<p>-c: report a change only when the file's permissions were actually changed</p>

<p>-f: do not print an error message when a file's permissions cannot be changed</p>

<p>-v: print details of every permission change</p>

<p>-R: apply the same permission change to all files and subdirectories under the given directory (that is, change them one by one recursively)</p>

<p>Examples: make file1.txt readable by everyone: chmod ugo+r file1.txt.</p>

<p>The same change, written with a: chmod a+r file1.txt.</p>

<p>Make file1.txt and file2.txt writable by their owner and group but not by others: chmod ug+w,o-w file1.txt file2.txt.</p>

<p>Grant execute permission on ex1.py to its owner (the permissions of other users are left unchanged): chmod u+x ex1.py.</p>

<p>Make all files and subdirectories under the current directory readable by anyone: chmod -R a+r *.</p>

<p><br />chmod also accepts permissions written as numbers, for example chmod 777 file</p>

<p>Syntax: chmod abc file</p>

<p>where a, b and c are one digit each, giving the permissions of User, Group and Other respectively.</p>

<p>r=4, w=2, x=1</p>

<p>For rwx the digit is 4+2+1=7;</p>

<p>for rw- the digit is 4+2=6;</p>

<p>for r-x the digit is 4+1=5.</p>
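<p>The digit arithmetic above (r=4, w=2, x=1, summed per class) can be checked with a small C++ sketch. The function names triplet_to_digit and symbolic_to_numeric are invented for this illustration; they are not part of chmod or of any library.</p>

```cpp
#include <cassert>
#include <string>

// Converts one three-character group such as "r-x" into its digit,
// adding 4 for r, 2 for w and 1 for x.
int triplet_to_digit(const std::string& t) {
    assert(t.size() == 3);
    int digit = 0;
    if (t[0] == 'r') digit += 4;
    if (t[1] == 'w') digit += 2;
    if (t[2] == 'x') digit += 1;
    return digit;
}

// Converts a nine-character string such as "rwxr-xr-x" (user, group, other)
// into the three-digit number that would be passed to chmod, here 755.
int symbolic_to_numeric(const std::string& perms) {
    assert(perms.size() == 9);
    return triplet_to_digit(perms.substr(0, 3)) * 100 +
           triplet_to_digit(perms.substr(3, 3)) * 10 +
           triplet_to_digit(perms.substr(6, 3));
}
```

<p>For instance, triplet_to_digit("r-x") gives 4+1=5, and symbolic_to_numeric("rwxrwx--x") gives 771.</p>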

<p>Examples:</p>

<p>chmod a=rwx file has the same effect as chmod 777 file, and chmod ug=rwx,o=x file has the same effect as chmod 771 file. chmod 4755 filename additionally sets the setuid bit, so the program runs with the privileges of the file's owner (root privileges if root owns the file).</p><p><strong>10. head</strong></p>

<p><span lang="EN-US">head &lt;filename&gt;:</span></p>

<p>head shows the first few lines of a file, 10 lines by default. To see more, pass a number as an option, for example&nbsp;head -20 filename.txt.</p>

<p><strong>11. tail</strong></p>

<p>tail is the opposite of head: it shows the last few lines of a file, the last 10 by default. To see more, again pass a number as an option, for example tail -20 filename.txt.</p>
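<p>What head and tail compute can be illustrated with a short C++ sketch. It is only an illustration of the idea, not how the real utilities are implemented; first_lines and last_lines are hypothetical names, and the text is passed in as a string rather than read from a file.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <sstream>
#include <string>
#include <vector>

// Like head: returns at most the first n lines of text.
std::vector<std::string> first_lines(const std::string& text, std::size_t n) {
    std::istringstream in(text);
    std::vector<std::string> lines;
    std::string line;
    while (lines.size() < n && std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}

// Like tail: keeps a sliding window holding only the last n lines seen.
std::vector<std::string> last_lines(const std::string& text, std::size_t n) {
    std::istringstream in(text);
    std::deque<std::string> window;
    std::string line;
    while (std::getline(in, line)) {
        window.push_back(line);
        if (window.size() > n) {
            window.pop_front();  // drop the oldest line
        }
        assert(window.size() <= n);
    }
    return std::vector<std::string>(window.begin(), window.end());
}
```

<p>The sliding window means last_lines never holds more than n lines in memory, however long the input is.</p>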

<p><strong>12. more</strong></p>

<p>Function: displays a text file on the terminal one screen at a time.</p>

<p>Syntax: more [-options] file</p>

<p>Description: the command shows one screen of text, stops when the screen is full, and prints --More-- at the bottom of the terminal together with the percentage of the text shown so far. Press Enter or Space to continue.</p>

<p>The options of more have the following meanings:</p>

<p>-p clears the screen before showing the next screen.</p>

<p>-c does essentially the same as -p.</p>

<p>-d shows a friendlier prompt at the bottom of each screen:</p>

<p>--More-- (XX%) [Press space to continue, 'q' to quit.]</p>

<p>In addition, if the user types an invalid command, an error message is shown instead of simply ringing the terminal bell.</p>

<p>-l does not treat the form-feed character specially. Without this option, more pauses after displaying a line that contains a form feed and waits for a command.</p>

<p>-s squeezes consecutive blank lines in the file into a single blank line.</p>

<p>Commands while running</p>

<p>While more is running, the user can use more's own commands to choose what is displayed. After each screen, more stops and waits for a command. The list below gives some commonly used commands; the complete list can be viewed by pressing h inside more. To run a command, first type the value of i (a count) and then the command key; otherwise the command runs with its default value.</p>

<p>i Space: if i is given, show the next i lines; otherwise show the next full screen.</p>

<p>i Enter: if i is given, show the next i lines; otherwise show the next line.</p>

<p>iD: if i is given, show the next i lines; otherwise scroll down half a screen (usually 11 lines).</p>

<p>id: same as iD.</p>

<p>iz: like "i Space", except that i also becomes the default number of lines for every following screen.</p>

<p>is: skip the next i lines, then show a full screen. The default is 1.</p>

<p>if: skip the next i screens, then show a full screen. The default is 1.</p>

<p>iB: skip back i screens (toward the start of the file), then show a full screen. The default is 1.</p>

<p>b: same as iB.</p>

<p>': go back to the place where the last search started.</p>

<p>q or Q: quit more.</p>

<p>=: show the current line number.</p>

<p>v: start /usr/bin/vi at the current line to edit the file.</p>

<p>h: show help for the commands.</p>

<p>i/pattern: search for the i-th line matching the pattern. The default is 1.</p>

<p>in: search for the i-th next occurrence of the last pattern used. The default is 1.</p>

<p>! or :!: run a command in a subshell.</p>

<p>i:n: when several file names were given on the command line, go to the i-th next file; if i is too large (out of range), show the last file in the list.</p>

<p>i:p: when several file names were given on the command line, go back to the i-th previous file; if i is too large (out of range), show the first file.</p>

<p>:f: show the name of the current file and the current line number.</p>

<p>.: repeat the previous command.</p>

<p><strong>13. sed</strong></p>

<p>1. sed -n '2p' filename</p>

<p>Prints the second line of the file.</p>

<p>2. sed -n '1,3p' filename</p>

<p>Prints lines 1 to 3 of the file.</p>

<p>3. sed -n '/Neave/p' filename</p>

<p>Prints the lines that match Neave (substring match).</p>

<p>4. sed -n '4,/The/p' filename</p>

<p>Prints from line 4 up to and including the first later line that matches The.</p>

<p>5. sed -n '1,$p' filename</p>

<p>Prints the whole file; $ stands for the last line.</p>

<p>6. sed -n '/.*ing/p' filename</p>

<p>Prints lines in which some run of characters is followed by ing, for example lines containing a word that ends in ing (the dot must not be omitted).</p>

<p>7. sed -n '/music/=' filename, or sed -e '/music/=' filename</p>

<p>Prints the line numbers of the matching lines. With -e (that is, without -n) sed prints the whole file and puts the line number before each matching line; with -n it prints only the line numbers.</p>

<p>8. sed -n -e '/music/p' -e '/music/=' filename</p>

<p>Prints the matching lines and their line numbers; each line number appears below the line's content.</p>

<p>9. sed '/company/a\Then suddenly it happened' filename</p>

<p>Selects the lines containing company and appends Then suddenly it happened as a new line after each of them (written here in the one-line form accepted by GNU sed). Note: the file itself is not changed; all work happens in a buffer, so redirect the output to a file if it should be saved.</p>

<p>10. sed '/company/i\Then suddenly it happened' filename</p>

<p>Same as 9, except that the text is inserted before the matching line.</p>

<p>11. sed '/company/c\Then suddenly it happened' filename</p>

<p>Replaces each line that matches company with Then suddenly it happened.</p>

<p>12. sed '1d' filename (likewise '1,3d', '$d', '/Neave/d')</p>

<p>Deletes the first line (respectively lines 1 to 3, the last line, and the lines matching Neave).</p>

<p>13. [address[,address]] s/pattern-to-find/replacement-pattern/[g p w n]</p>

<p>s tells sed that this is a substitution: it searches for pattern-to-find and, on success, replaces it with replacement-pattern.</p>

<p>The substitution flags are:</p>

<p>g: by default only the first occurrence of the pattern on each line is replaced; with g every occurrence is replaced.</p>

<p>p: prints the lines on which a substitution was made. It is normally combined with the -n option, which suppresses sed's default output, so that only the substituted lines are printed.</p>

<p>w filename: writes the output to the given file (note that only the lines on which a substitution was made are written, not the whole content).</p>

<p>14. sed 's/nurse/hello &amp;/' filename</p>

<p>Adds 'hello ' in front of 'nurse'; &amp; in the replacement stands for the matched text.</p>
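<p>The difference between replacing only the first match and replacing every match (the g flag), and the role of the matched-text placeholder, can be reproduced in C++ with std::regex_replace. This is a sketch for illustration, not a reimplementation of sed: substitute_first and substitute_all are hypothetical names, the default ECMAScript regex grammar is used instead of sed's, and the placeholder for the matched text is written $&amp; in std::regex where sed writes &amp;.</p>

```cpp
#include <regex>
#include <string>

// Like sed 's/pattern/replacement/': replaces only the first match in line.
std::string substitute_first(const std::string& line,
                             const std::string& pattern,
                             const std::string& replacement) {
    return std::regex_replace(line, std::regex(pattern), replacement,
                              std::regex_constants::format_first_only);
}

// Like sed 's/pattern/replacement/g': replaces every match in line.
std::string substitute_all(const std::string& line,
                           const std::string& pattern,
                           const std::string& replacement) {
    return std::regex_replace(line, std::regex(pattern), replacement);
}
```

<p>With the line "nurse nurse", the pattern "nurse" and the replacement "hello $&amp;", substitute_first yields "hello nurse nurse" and substitute_all yields "hello nurse hello nurse".</p>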

<p><span lang="EN-US">15. sed '/company/r append.txt' filename </span></p>

<p>Inserts the contents of the file append.txt starting on the line after each line that matches company.</p>

<p><span lang="EN-US">16. sed '/company/'q filename </span></p>

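<p>Quits sed as soon as company is matched for the first time.</p><strong><span>14. ln<br /></span></strong><p>This is a very important Linux command, so make sure you are familiar with it. It creates a link to a file at another location. Its most common option is -s, used as: ln -s source-file target-file.&nbsp;When the same file is needed in several directories, there is no need to keep an identical copy in each of them. Put the file in one fixed directory and link to it from the other directories with ln, so that disk space is not used up repeatedly.</p><p>Example: ln -s /bin/<a href="http://www.linuxso.com/command/less.html" target="_blank"><u>less</u></a>&nbsp;/usr/local/bin/less&nbsp;<br />-s stands for symbolic.&nbsp;<br />There are three points to note:&nbsp;<br />First, all links to a file stay in sync: whichever one you modify the file through, the change is visible through all the others;&nbsp;<br />Second, ln can create soft links and hard links. A soft link, ln -s ** **, only creates a small file at the chosen location that records the path of the source, so it takes almost no disk space. A hard link, ln ** ** without the -s option, creates another name for the same file data, so it is listed with the same size as the source file although the data is stored only once. With either kind of link the content stays in sync.&nbsp;<br />Third, soft links can cross partitions, but hard links only work within the same partition. If you look at a directory with <a href="http://www.linuxso.com/command/ls.html" target="_blank"><u>ls</u></a> and some file or folder has a different color from the rest (blue on my machine), it is one created with ln -s; ls -l shows the path the link points to.</p><p><strong>15. rm</strong></p>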

<p>rmdir empty-directory&nbsp; deletes an empty directory</p>

<p>rm file file&nbsp;&nbsp;&nbsp; deletes one or more files</p>

<p>rm -rf non-empty-directory&nbsp;&nbsp;&nbsp;&nbsp; recursively deletes a non-empty directory and everything under it</p><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hzh416/" target="_blank">小豪</a> 2012-08-06 16:16 <a href="http://www.cppblog.com/hzh416/archive/2012/08/06/186470.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Counting How Often Source-Language Rules Satisfy Alignment Consistency (2): Writing the Actual Code</title><link>http://www.cppblog.com/hzh416/archive/2012/08/06/186439.html</link><dc:creator>小豪</dc:creator><author>小豪</author><pubDate>Mon, 06 Aug 2012 04:17:00 GMT</pubDate><guid>http://www.cppblog.com/hzh416/archive/2012/08/06/186439.html</guid><wfw:comment>http://www.cppblog.com/hzh416/comments/186439.html</wfw:comment><comments>http://www.cppblog.com/hzh416/archive/2012/08/06/186439.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/hzh416/comments/commentRss/186439.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/hzh416/services/trackbacks/186439.html</trackback:ping><description><![CDATA[&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">I actually wrote two versions of this code. The first was written only to get the functionality working, with no thought for algorithmic complexity or running time. Because the statistics are gathered over a corpus of one million entries, the first version ran for two to three hours without producing a result. After asking a senior labmate for advice, I wrote an improved second version. It still takes fairly long, but it does produce results, and I hope to write better algorithms as I keep learning. &nbsp;</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">The overall idea of the algorithm is again simple: traverse the whole document, split out its sentences, and process each sentence separately. For a single sentence, first enumerate all source-language rules in it while collecting their alignment information into a map. Then check whether each rule satisfies alignment consistency, store the total number of occurrences and the number of consistent occurrences in two maps, and finally output the results. Now for the actual code.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " /><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">The source text and the alignment information are each one continuous string whose parts are separated by spaces, so I first wrote a small function that splits out each part for later use:</span>&nbsp;<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #008080; ">&nbsp;1</span>&nbsp;inline&nbsp;vector&lt;<span style="color: #0000FF; ">string</span>&gt;&nbsp;split_word(<span style="color: #0000FF; ">string</span>&nbsp;str,<span style="color: #0000FF; ">string</span>&nbsp;sym)<br /><span style="color: #008080; ">&nbsp;2</span>&nbsp;{<br /><span style="color: #008080; ">&nbsp;3</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;str+=sym;<br /><span style="color: #008080; ">&nbsp;4</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;vector&nbsp;&lt;<span style="color: #0000FF; ">string</span>&gt;&nbsp;result;<br /><span style="color: #008080; ">&nbsp;5</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;size_t&nbsp;pos;<br /><span style="color: #008080; ">&nbsp;6</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;size=str.size();<br /><span style="color: #008080; ">&nbsp;7</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>(<span style="color: #0000FF; ">int</span>&nbsp;i=0;&nbsp;i&lt;size;&nbsp;i++)<br /><span style="color: #008080; ">&nbsp;8</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br /><span style="color: #008080; ">&nbsp;9</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pos=str.find(sym,i);<br /><span style="color: #008080; ">10</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>(pos&lt;size)<br /><span style="color: #008080; ">11</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br /><span style="color: #008080; ">12</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">string</span>&nbsp;sub_string=str.substr(i,pos-i);<br /><span style="color: #008080; ">13</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>(sub_string.length()!=0)<br /><span style="color: #008080; ">14</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br /><span style="color: #008080; ">15</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;result.push_back(sub_string);<br /><span style="color: #008080; ">16</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br /><span style="color: #008080; ">17</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;i=pos+sym.size()-1;<br /><span style="color: #008080; ">18</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br /><span style="color: #008080; ">19</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br /><span style="color: #008080; ">20</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;result;<br /><span style="color: #008080; ">21</span>&nbsp;}</div>&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">inline is used here because I had read that inline functions suit small functions that are called frequently and can help running speed. str is the whole string to be split, and sym is the delimiter to split on, for example a space. Line 3 appends one more sym to str to make splitting easier, because each piece is obtained by finding the position of the next sym and taking the substring between the starting position and that position.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Next, the source-to-target alignments and the target-to-source alignments are stored in two maps. Because one-to-many alignments can occur, map&lt;int,vector&lt;int&gt; &gt; is used to hold multiple alignment links.</span>&nbsp;<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #008080; ">&nbsp;1</span>&nbsp;<span style="color: #0000FF; ">void</span>&nbsp;get_alignment_relationship(<span style="color: #0000FF; ">string</span>&nbsp;alignment,&nbsp;map&lt;<span style="color: #0000FF; ">int</span>,vector&lt;<span style="color: #0000FF; ">int</span>&gt;&nbsp;&gt;&nbsp;&amp;stt_alignment,&nbsp;map&lt;<span style="color: #0000FF; ">int</span>,vector&lt;<span style="color: #0000FF; ">int</span>&gt;&nbsp;&gt;&nbsp;&amp;tts_alignment)<br /><span style="color: #008080; ">&nbsp;2</span>&nbsp;{<br /><span style="color: #008080; ">&nbsp;3</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;vector&lt;<span style="color: #0000FF; ">string</span>&gt;alignment_element&nbsp;=&nbsp;split_word(alignment,"&nbsp;");<br /><span style="color: #008080; ">&nbsp;4</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;assert&nbsp;(alignment_element.size()&gt;=0);<br /><span style="color: #008080; ">&nbsp;5</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>&nbsp;(<span style="color: #0000FF; 
">int</span>&nbsp;i=0;&nbsp;i&lt;alignment_element.size();&nbsp;i++)<br /><span style="color: #008080; ">&nbsp;6</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br /><span style="color: #008080; ">&nbsp;7</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;vector&lt;<span style="color: #0000FF; ">string</span>&gt;s_t_index=&nbsp;split_word(alignment_element[i],"-");<br /><span style="color: #008080; ">&nbsp;8</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;s_index&nbsp;=&nbsp;atoi(s_t_index[0].c_str());<br /><span style="color: #008080; ">&nbsp;9</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">int</span>&nbsp;t_index&nbsp;=&nbsp;atoi(s_t_index[1].c_str());<br /><span style="color: #008080; ">10</span>&nbsp;<br /><span style="color: #008080; ">11</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;stt_alignment[s_index].push_back(t_index);<br /><span style="color: #008080; ">12</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tts_alignment[t_index].push_back(s_index);<br /><span style="color: #008080; ">13</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br /><span style="color: #008080; ">14</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;map&lt;<span style="color: #0000FF; ">int</span>,vector&lt;<span style="color: #0000FF; ">int</span>&gt;&nbsp;&gt;::iterator&nbsp;it1,it2;<br /><span style="color: #008080; ">15</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;it1=stt_alignment.begin();<br /><span style="color: #008080; ">16</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;stt_alignment.erase(it1);<br /><span style="color: #008080; ">17</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;it2=tts_alignment.begin();<br /><span style="color: #008080; ">18</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tts_alignment.erase(it2);<br /><span style="color: #008080; ">19</span>&nbsp;}</div><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; 
background-color: #eeeeee; font-size: 13px; ">stt_alignment</span><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">&nbsp;stands for source to target, the alignment from the source language to the target language; conversely, </span><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #eeeeee; font-size: 13px; ">tts_alignment</span><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">&nbsp;is the alignment from the target language to the source language.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " /><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Once the alignment relations are available, alignment consistency is checked by counting alignment links:</span>&nbsp;<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all"><span style="color: #008080; ">&nbsp;1</span>&nbsp;inline&nbsp;<span style="color: #0000FF; ">bool</span>&nbsp;is_fit_alignment(map&lt;<span style="color: #0000FF; ">int</span>,vector&lt;<span style="color: #0000FF; ">int</span>&gt;&nbsp;&gt;&nbsp;stt_alignment,&nbsp;map&lt;<span style="color: #0000FF; ">int</span>,vector&lt;<span style="color: #0000FF; ">int</span>&gt;&nbsp;&gt;&nbsp;tts_alignment,&nbsp;size_t&nbsp;s_begin,&nbsp;size_t&nbsp;s_end)<br /><span style="color: #008080; ">&nbsp;2</span>&nbsp;{<br /><span style="color: #008080; ">&nbsp;3</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span 
style="color: #0000FF; ">int</span>&nbsp;src_size=0,tgt_size=0;<br /><span style="color: #008080; ">&nbsp;4</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;map&lt;<span style="color: #0000FF; ">int</span>,<span style="color: #0000FF; ">int</span>&gt;&nbsp;tgtcount;<br /><span style="color: #008080; ">&nbsp;5</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;map&lt;<span style="color: #0000FF; ">int</span>,<span style="color: #0000FF; ">int</span>&gt;::iterator&nbsp;iter;<br /><span style="color: #008080; ">&nbsp;6</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>&nbsp;(<span style="color: #0000FF; ">int</span>&nbsp;x=s_begin;x&lt;s_end;x++)<br /><span style="color: #008080; ">&nbsp;7</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br /><span style="color: #008080; ">&nbsp;8</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;src_size+=stt_alignment[x].size();<br /><span style="color: #008080; ">&nbsp;9</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>&nbsp;(size_t&nbsp;a=0;a&lt;stt_alignment[x].size();a++)<br /><span style="color: #008080; ">10</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br /><span style="color: #008080; ">11</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tgtcount[stt_alignment[x][a]]++;<br /><span style="color: #008080; ">12</span>&nbsp;<br /><span style="color: #008080; ">13</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br /><span style="color: #008080; ">14</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}&nbsp;&nbsp;&nbsp;&nbsp;<br /><span style="color: #008080; ">15</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">for</span>(iter=tgtcount.begin();iter!=tgtcount.end();iter++)&nbsp;<br /><span style="color: #008080; ">16</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{&nbsp;<br /><span style="color: #008080; 
">17</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tgt_size+=tts_alignment[iter-&gt;first].size();<br /><span style="color: #008080; ">18</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br /><span style="color: #008080; ">19</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(src_size==tgt_size&nbsp;&amp;&amp;&nbsp;src_size!=0)<br /><span style="color: #008080; ">20</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;<span style="color: #0000FF; ">true</span>;<br /><span style="color: #008080; ">21</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;<span style="color: #0000FF; ">false</span>;<br /><span style="color: #008080; ">22</span>&nbsp;}</div><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">This bool function decides whether alignment consistency is satisfied.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " /><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Those are the main functions. The biggest thing I took away from writing this code concerns STL containers: I previously did not know how to get one back out of a function, so whenever I needed a map or a vector I had to do the work in main. I now know that a function can hand a container back to its caller through a reference parameter (as get_alignment_relationship does with its two map arguments) or simply return it by value (as split_word does). This will make future code much easier to write. Working on this code also gave me a deeper understanding of how the corpus is structured and how to process it, which will help a lot when I write other natural language processing code.</span>&nbsp;<br /><img src ="http://www.cppblog.com/hzh416/aggbug/186439.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hzh416/" target="_blank">小豪</a> 2012-08-06 12:17 <a href="http://www.cppblog.com/hzh416/archive/2012/08/06/186439.html#Feedback" target="_blank" 
style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Counting How Often Source-Language Rules Satisfy Alignment Consistency (1): Concepts</title><link>http://www.cppblog.com/hzh416/archive/2012/08/06/186437.html</link><dc:creator>小豪</dc:creator><author>小豪</author><pubDate>Mon, 06 Aug 2012 04:14:00 GMT</pubDate><guid>http://www.cppblog.com/hzh416/archive/2012/08/06/186437.html</guid><wfw:comment>http://www.cppblog.com/hzh416/comments/186437.html</wfw:comment><comments>http://www.cppblog.com/hzh416/archive/2012/08/06/186437.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/hzh416/comments/commentRss/186437.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/hzh416/services/trackbacks/186437.html</trackback:ping><description><![CDATA[&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">This is the second program I have written that felt genuinely challenging. The tasks my advisor assigns get harder each time, but I learn quite a lot from them.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">The task this time is to count, over the corpus, how many times each source-language rule occurs in total and how many of those occurrences satisfy alignment consistency.</span><br style="font-family: 
'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">The first difficulty was that I was not clear on what a source-language rule is or what satisfying alignment consistency means, so I will start by introducing these two concepts with an example:</span>&nbsp;<br /><div style="text-align: center;"><img src="http://www.cppblog.com/images/cppblog_com/hzh416/QQ截图20120806110315.jpg" alt="" /></div><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">&nbsp;In this sentence pair, the Chinese on top is the source language, the English below is the target language, and the lines between them are the alignment links. In the corpus the sentence is represented as:</span>&nbsp;<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">1&nbsp;&lt;bead&nbsp;id="1"&gt;<br />2&nbsp;&lt;srcword&gt;&lt;s&gt;&nbsp;是&nbsp;不&nbsp;能&nbsp;忘记&nbsp;的&nbsp;。&nbsp;&lt;/s&gt;&lt;/srcword&gt;<br />3&nbsp;&lt;tgtword&gt;&lt;s&gt;&nbsp;was&nbsp;not&nbsp;to&nbsp;be&nbsp;forgotten&nbsp;.&nbsp;&lt;/s&gt;&lt;/tgtword&gt;<br 
/>4&nbsp;&lt;alignment&gt;0-0&nbsp;1-1&nbsp;2-2&nbsp;3-2&nbsp;4-4&nbsp;4-5&nbsp;6-6&nbsp;7-7&lt;/alignment&gt;<br />5&nbsp;&lt;/bead&gt;</div><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">&nbsp;</span>&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Everything inside bead is one whole sentence pair. &lt;s&gt; and &lt;/s&gt; mark the beginning and end of the sentence and are also counted in the alignment. &lt;srcword&gt; holds the source language, &lt;tgtword&gt; the target language, and &lt;alignment&gt; the alignment, where each pair i-j links source token i to target token j, counting from 0.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">First, what is a source-language rule? Source-language rules are subject to some restrictions. Their length is bounded, and here I cap it at 7. There is also a fertility constraint, which I did not take into account. Some of the details I cannot explain precisely myself, so again I will use an example. For the sentence &#8220;是不能忘记的&#8221;, the source-language rules include: 是, 是不, 是不能, 是不能忘记, 是不能忘记的; 不, 不能, 不能忘记, 不能忘记的; 能, 能忘记, 能忘记的; 忘记, 忘记的. As the example shows, the source-language rules are obtained by enumerating every contiguous word sequence in the sentence; the single word &#8220;的&#8221; does not form a rule on its own because it has no alignment link.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Next, what does satisfying alignment consistency mean? I again find the concept hard to put into words; when I asked a senior labmate, drawing a picture turned out to be the clearest explanation. If I had to state it, it would be: the alignment links of a source span and of the target words it aligns to never reach outside each other. For example, &#8220;是 不&#8221; aligns to &#8220;was not&#8221;, but &#8220;was not&#8221; aligns back to &#8220;是 不 能&#8221;; the target side reaches beyond the source span, so the rule &#8220;是 不&#8221; does not satisfy alignment consistency, whereas &#8220;是 不
能&#8221; does. In code, a simple test is to count alignment links: a source-language rule satisfies alignment consistency when the number of links from the rule to its target words equals the number of links from those target words back to the source side. For &#8220;是 不&#8221; the counts are 2 and 3 (links 1-1, 2-2 versus 1-1, 2-2, 3-2), while for &#8220;是 不 能&#8221; both are 3.</span>&nbsp;<br /><img src ="http://www.cppblog.com/hzh416/aggbug/186437.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hzh416/" target="_blank">小豪</a> 2012-08-06 12:14 <a href="http://www.cppblog.com/hzh416/archive/2012/08/06/186437.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Merging Test-Set Corpora</title><link>http://www.cppblog.com/hzh416/archive/2012/08/06/186436.html</link><dc:creator>小豪</dc:creator><author>小豪</author><pubDate>Mon, 06 Aug 2012 04:11:00 GMT</pubDate><guid>http://www.cppblog.com/hzh416/archive/2012/08/06/186436.html</guid><wfw:comment>http://www.cppblog.com/hzh416/comments/186436.html</wfw:comment><comments>http://www.cppblog.com/hzh416/archive/2012/08/06/186436.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/hzh416/comments/commentRss/186436.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/hzh416/services/trackbacks/186436.html</trackback:ping><description><![CDATA[&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">This is the first program that took me a fairly long time after a long break from writing C++. The goal is to merge three separate test sets, nist03, nist04 and nist05, so that BLEU can be computed over them together. Each test set contains the source text, the reference translations, and 4 machine translations. The end result is the merged source text, merged reference translations, and merged machine translations of the three sets.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " /><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">The idea is simple. Merging the source texts and the reference translations only requires concatenating the three separate documents into one and adjusting the format slightly. The hard part is merging the machine translations, because each source text corresponds to four machine translations.</span><br style="font-family: 'WenQuanYi 
Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">This is hard to explain in words alone, so let's make it concrete. Say the source text of nist03 is A, that of nist04 is B, and that of nist05 is C. The machine translations of A are a, b, c, d; those of B are e, f, g, h; and those of C are i, j, k, l.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">To spell out what &#8220;A corresponds to abcd&#8221; means: A is a whole document, and a, b, c, d are the 4 machine translations of that document. The machine-translation file therefore lists the 4 translations a, b, c, d in order and uses designated identifiers to tie each one back to the source text.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Once the source texts are merged, the source becomes ABC, so the machine translations must change accordingly; the three machine-translation files cannot simply be concatenated, which would give abcdefghijkl. The first machine translation of ABC is aei, the second is bfj, the third is cgk, and the fourth is dhl, so the merged machine translations must be ordered as aeibfjcgkdhl.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " /><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">I am a C++ beginner, so my implementation may be rather crude. Each machine translation begins with a &lt;DOC docid= tag that identifies the translation and ends with &lt;/DOC&gt;. Using these two markers we can pick out every machine translation and then recombine them.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; 
background-color: #ffffff; " /><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Take one of the functions as an example:</span>&nbsp;<br /><div style="font-size: 13px; border-top-width: 1px; border-right-width: 1px; border-bottom-width: 1px; border-left-width: 1px; border-top-style: solid; border-right-style: solid; border-bottom-style: solid; border-left-style: solid; border-top-color: #cccccc; border-right-color: #cccccc; border-bottom-color: #cccccc; border-left-color: #cccccc; border-image: initial; padding-right: 5px; padding-bottom: 4px; padding-left: 4px; padding-top: 4px; width: 98%; word-break: break-all; background-color: #eeeeee; "><span style="color: #008080; ">&nbsp;1</span>&nbsp;<span style="color: #0000FF; ">string</span>&nbsp;PartOne(<span style="color: #0000FF; ">string</span>&nbsp;s)<br /><span style="color: #008080; ">&nbsp;2</span>&nbsp;{<br /><span style="color: #008080; ">&nbsp;3</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;size_t&nbsp;x=0,y=1,z=0;<br /><span style="color: #008080; ">&nbsp;4</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">string</span>&nbsp;tmp;<br /><span style="color: #008080; ">&nbsp;5</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;x=s.find("&lt;DOC&nbsp;docid=\"AFC20030102.0015\"&nbsp;sysid=\"E01\"&gt;");<br /><span style="color: #008080; ">&nbsp;6</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y=s.rfind("sysid=\"E01\"");<br /><span style="color: #008080; ">&nbsp;7</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;z=s.find("&lt;/DOC&gt;",y);<br /><span style="color: #008080; ">&nbsp;8</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tmp.append(s,x,z-x+6);<br /><span style="color: #008080; ">&nbsp;9</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;tmp;<br /><span 
style="color: #008080; ">10</span>&nbsp;}</div>&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">This function extracts the first machine translation from nist03. We can see that </span><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #eeeeee; font-size: 13px; ">&lt;DOC&nbsp;docid="AFC20030102.0015"&nbsp;sysid="E01"&gt;</span><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; 
">&nbsp;is the tag that marks this translation, and the function copies everything from that tag through the &lt;/DOC&gt; that closes the last document carrying sysid="E01". E01 denotes the first translation, and likewise E02, E03 and E04 denote the second, third and fourth translations.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Because the machine translations in nist03, nist04 and nist05 are not formatted identically, all of them must be brought to one format so that the BLEU scoring step can recognize them (at first I did not unify the format, and BLEU could not be computed even after merging).</span>&nbsp;<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">&nbsp;1&nbsp;<span style="color: #0000FF; ">string</span>&nbsp;PartTwo(<span style="color: #0000FF; ">string</span>&nbsp;s)<br />&nbsp;2&nbsp;{<br />&nbsp;3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;size_t&nbsp;x=0,y=1,z=0;<br />&nbsp;4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">string</span>&nbsp;tmp;<br 
/>&nbsp;5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;x=s.find("&lt;DOC&nbsp;docid=\"PD20040202.001\"&nbsp;sysid=\"cha\"&gt;");<br />&nbsp;6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;y=s.rfind("sysid=\"cha\"");<br />&nbsp;7&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;z=s.find("&lt;/DOC&gt;",y);<br />&nbsp;8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tmp.append(s,x,z-x+6);<br />&nbsp;9&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;size_t&nbsp;pos=0;<br />10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">while</span>(1)<br />11&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />12&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;pos=tmp.find("sysid=\"cha\"",pos+5);<br />13&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">if</span>&nbsp;(<span style="color: #0000FF; ">string</span>::npos&nbsp;==&nbsp;pos)&nbsp;//&nbsp;find()&nbsp;returns&nbsp;npos&nbsp;when&nbsp;nothing&nbsp;is&nbsp;left<br />14&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">break</span>;<br />15&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">else</span><br />16&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tmp&nbsp;=&nbsp;tmp.substr(0,pos)+"sysid=\"E01\""+tmp.substr(pos+11);&nbsp;//&nbsp;sysid="cha"&nbsp;is&nbsp;11&nbsp;characters<br />17&nbsp;<br />18&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />19&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">return</span>&nbsp;tmp;<br />20&nbsp;}</div><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">&nbsp;</span>&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Take this example: it extracts the first machine translation from nist04. In the nist04 machine translations the first translation is identified by </span><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #eeeeee; font-size: 13px; ">sysid="cha"</span><span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 
'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">, so lines 10 to 18 unify the format by replacing cha with E01 (by default we use the same format as nist03 throughout). The remaining machine translations are handled in the same way.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Once all translations have been extracted and brought to one format, merging them completes the program.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Here is the step that merges all the translations:</span>&nbsp;<br /><div style="background-color:#eeeeee;font-size:13px;border:1px solid #CCCCCC;padding-right: 5px;padding-bottom: 4px;padding-left: 4px;padding-top: 4px;width: 98%;word-break:break-all">1&nbsp;<span style="color: #0000FF; ">out</span>&lt;&lt;"&lt;refset&nbsp;setid=\"mt05_chinese_eval\"&nbsp;srclang=\"zh\"&nbsp;trglang=\"en\"&gt;";<br 
/>2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">out</span>&lt;&lt;PartOne(n3)&lt;&lt;endl&lt;&lt;PartTwo(n4)&lt;&lt;endl&lt;&lt;PartThree(n5)&lt;&lt;endl<br />3&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;&lt;PartOneS(n3)&lt;&lt;endl&lt;&lt;PartTwoS(n4)&lt;&lt;endl&lt;&lt;PartThreeS(n5)&lt;&lt;endl<br />4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;&lt;PartOneT(n3)&lt;&lt;endl&lt;&lt;PartTwoT(n4)&lt;&lt;endl&lt;&lt;PartThreeT(n5)&lt;&lt;endl<br />5&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&lt;&lt;PartOneF(n3)&lt;&lt;endl&lt;&lt;PartTwoF(n4)&lt;&lt;endl&lt;&lt;PartThreeF(n5)&lt;&lt;endl;<br />6&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="color: #0000FF; ">out</span>&lt;&lt;"&lt;/refset&gt;";</div>&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">The first and last out statements write the required opening and closing tags; out writes to an output file chosen beforehand.</span><br style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; " />&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;<span style="font-family: 'WenQuanYi Micro Hei Mono', 'WenQuanYi Micro Hei', 'Microsoft Yahei Mono', 'Microsoft Yahei', sans-serif; background-color: #ffffff; ">Finally, some reflections on writing this code. Probably because I am not yet fluent in C++ and do not know many of its facilities, this seemingly simple program took me a long time, with many small mistakes fixed along the way. The biggest difficulty was not knowing a more convenient way to rewrite the format, so I used the crudest approach, which likely increases the complexity and running time. I hope to learn simpler methods as I continue studying.</span>&nbsp;<br /><br /><br /><br /><br /><img src ="http://www.cppblog.com/hzh416/aggbug/186436.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hzh416/" target="_blank">小豪</a> 2012-08-06 12:11 <a href="http://www.cppblog.com/hzh416/archive/2012/08/06/186436.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>