﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++博客-爱生活 爱技术-随笔分类-CUDA</title><link>http://www.cppblog.com/hktk/category/11855.html</link><description /><language>zh-cn</language><lastBuildDate>Sun, 20 Sep 2009 03:39:47 GMT</lastBuildDate><pubDate>Sun, 20 Sep 2009 03:39:47 GMT</pubDate><ttl>60</ttl><item><title>深入浅出谈CUDA-[第六章][GPU的硬件架构]</title><link>http://www.cppblog.com/hktk/archive/2009/09/20/96759.html</link><dc:creator>海 阔 天 空</dc:creator><author>海 阔 天 空</author><pubDate>Sun, 20 Sep 2009 02:41:00 GMT</pubDate><guid>http://www.cppblog.com/hktk/archive/2009/09/20/96759.html</guid><description><![CDATA[&nbsp;
<p align=left><strong><span>GPU Hardware Architecture</span></strong></p>
<p align=left><span>Here we give a brief overview of the architecture of the part of NVIDIA's current CUDA-capable GPUs that executes CUDA programs (essentially the shader units). The information combines what NVIDIA has published with material NVIDIA has presented at conferences and in university courses, so some details may be inaccurate. The main sources are NVIDIA's CUDA Programming Guide 1.1, NVIDIA's CUDA session at Supercomputing '07, and the UIUC CUDA course.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>GPU Basics</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>Among NVIDIA's current display chips, the ones that support CUDA are the G80 family. The G80 itself supports CUDA 1.0, while the G84, G86, G92, G94, and G96 support CUDA 1.1. In short, apart from the earliest GeForce 8800 Ultra/GTX, the 320MB/640MB GeForce 8800GTS, and the Tesla cards, which are CUDA 1.0 devices, all other GeForce 8- and 9-series cards support CUDA 1.1. See Appendix A of the CUDA Programming Guide 1.1 for details.</span></p>
<p align=left><span>In every current CUDA-capable NVIDIA chip, the shader portion is built from several <strong><span>multiprocessors</span></strong>. Each multiprocessor contains eight <strong><span>stream processors</span></strong>, organized in two groups of four, so it can effectively be viewed as two 4-wide SIMD processors. In addition, each multiprocessor has 8,192 registers and 16KB of shared memory, plus a texture cache and a constant cache, roughly as shown below:</span></p>
<p align=center><img height=256 alt="" src="http://www.cppblog.com/images/cppblog_com/hktk/CUDA_09-09-20_6_1.jpg" width=256 border=0></p>
<p align=left><span>Detailed multiprocessor information can be obtained through CUDA's cudaGetDeviceProperties() or cuDeviceGetProperties() functions. There is, however, currently no direct way to query how many multiprocessors a chip contains.</span></p>
<p align=left><span>In CUDA, most basic operations are carried out by the stream processors. Each stream processor contains an FMA (fused multiply-add) unit, which can perform one multiplication and one addition. More complex operations take longer to execute.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Execution</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>When a CUDA program executes, each stream processor corresponds to one thread and each multiprocessor corresponds to one block. As noted in earlier articles, a block often has many threads (for example, 256), far more than the number of stream processors in a multiprocessor. How can that be?</span></p>
<p align=left><span>In fact, although a multiprocessor has only eight stream processors, every operation a stream processor performs has some latency, to say nothing of memory-access latency, so CUDA executes programs in units of a <strong><span>warp</span></strong>. On current CUDA devices a warp contains 32 threads, split into two half-warps of 16 threads each. Since a stream processor's operations have a latency of at least 4 cycles, a 4-wide group of stream processors must execute at least 16 threads at a time (a half-warp) to effectively hide the latency of the various operations.</span></p>
<p align=left><span>Because a multiprocessor has little other storage, the state of every thread is kept directly in the multiprocessor's registers. Consequently, the more threads a multiprocessor runs concurrently, the more register space is needed. For example, a block of 256 threads in which each thread uses 20 registers needs 256 x 20 = 5,120 registers to hold the state of all its threads.</span></p>
<p align=left><span>Each multiprocessor in current CUDA devices has 8,192 registers, so if each thread uses 16 registers, a multiprocessor can keep at most 512 threads in flight. If more threads than that run concurrently, some data has to be stored in video memory instead, which lowers execution efficiency.</span></p>
<p align=left><em><span>Editor's note: the register file in NVIDIA's GT200 is doubled in size; 16K registers are available per multiprocessor under FP32, and 8K under FP64.</span></em></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Shared memory</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>In current CUDA devices, each multiprocessor has 16KB of shared memory, divided into 16 banks. As long as every thread accesses a different bank at the same time, there is no problem, and shared memory is as fast to access as a register. But if two (or more) threads access data in the same bank at the same time, a bank conflict occurs: those threads must take turns, and can no longer access shared memory simultaneously.</span></p>
<p align=left><span>Shared memory is split into banks in units of 4 bytes. So, given the following data:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; __shared__ int data[128];</span></p>
<p align=left><span>data[0] is in bank 0, data[1] in bank 1, data[2] in bank 2, ..., data[15] in bank 15, and data[16] wraps back to bank 0. Because a warp executes as half-warps, threads that belong to different half-warps cannot cause bank conflicts with each other.</span></p>
<p align=left><span>Therefore, if a program accesses shared memory like this:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int number = data[base + tid];</span></p>
<p align=left><span>there are no bank conflicts at all and full speed is achieved. But with the following pattern:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int number = data[base + 4 * tid];</span></p>
<p align=left><span>thread 0 and thread 4 access the same bank, as do thread 1 and thread 5, and so on, causing bank conflicts. In this example, four of the 16 threads in a half-warp hit the same bank, so shared-memory access runs at 1/4 of full speed.</span></p>
<p align=left><span>An important exception: when several threads read the very same shared-memory address, shared memory can broadcast the 32-bit value at that address to all the reading threads, so no bank conflict occurs. For example:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int number = data[3];</span></p>
<p align=left><span>causes no bank conflict, because every thread reads the data at the same address.</span></p>
<p align=left><span>Shared-memory bank conflicts can often be removed by changing how the data is laid out. For example, the following program:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; data[tid] = global_data[tid];<br>&nbsp;&nbsp;&nbsp; ...<br>&nbsp;&nbsp;&nbsp; int number = data[16 * tid];</span></p>
<p align=left><span>causes severe bank conflicts. To avoid them, the data layout can be adjusted slightly, changing the accesses to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int row = tid / 16;<br>&nbsp;&nbsp;&nbsp; int column = tid % 16;<br>&nbsp;&nbsp;&nbsp; data[row * 17 + column] = global_data[tid];<br>&nbsp;&nbsp;&nbsp; ...<br>&nbsp;&nbsp;&nbsp; int number = data[17 * tid];</span></p>
<p align=left><span>and the bank conflicts are gone.</span></p>
<p align=left><em><span>Editor's note: shared memory actually goes by other names in NVIDIA's documents, such as PDC (Parallel Data Cache) and PBSM (per-block shared memory).</span></em></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Global memory</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>Because the multiprocessors do not cache global memory (a per-multiprocessor global-memory cache would require a cache coherence protocol, greatly increasing the cache's complexity), global-memory accesses have very long latency. Moreover, as mentioned in earlier articles, global-memory accesses should be as contiguous as possible, a consequence of how DRAM is accessed.</span></p>
<p align=left><span>More precisely, global-memory accesses need to be "coalesced". Coalesced means that, besides being contiguous, the starting address must be a multiple of 16 times the size each thread accesses. For example, if every thread reads 32 bits, the address read by the first thread must be a multiple of 16 x 4 = 64 bytes.</span></p>
<p align=left><span>Threads that skip the read do not prevent the remaining threads from performing a coalesced access. For example:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; if(tid != 3) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int number = data[tid];<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>Although thread 3 reads nothing, the other threads still satisfy the coalescing conditions (assuming the address of data is a multiple of 64 bytes), so the read remains coalesced.</span></p>
<p align=left><span>On current CUDA 1.1 devices, each thread can read 32, 64, or 128 bits of memory at a time. 32-bit reads are the most efficient, 64-bit reads slightly less so, and 128-bit reads noticeably slower than 32-bit reads (though still better than non-coalesced access).</span></p>
<p align=left><span>If the amount each thread accesses is not 32, 64, or 128 bits, the accesses cannot be coalesced. For example, the following program:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; struct vec3d { float x, y, z; };<br>&nbsp;&nbsp;&nbsp; ...<br>&nbsp;&nbsp;&nbsp; __global__ void func(struct vec3d* data, float* output)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; output[tid] = data[tid].x * data[tid].x +<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; data[tid].y * data[tid].y +<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; data[tid].z * data[tid].z;<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>is not a coalesced read, because vec3d is 12 bytes, not 4, 8, or 16 bytes. One fix is the __align__(n) directive, for example:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; struct __align__(16) vec3d { float x, y, z; };</span></p>
<p align=left><span>This makes the compiler append 4 empty bytes to vec3d, padding it to 16 bytes. Another approach is to turn the structure into three separate contiguous arrays:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; __global__ void func(float* x, float* y, float* z, float* output)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; output[tid] = x[tid] * x[tid] + y[tid] * y[tid] +<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; z[tid] * z[tid];<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>If the data structure cannot be reorganized for other reasons, you can also use shared memory to rearrange it on the GPU. For example:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; __global__ void func(struct vec3d* data, float* output)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __shared__ float temp[THREAD_NUM * 3];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const float* fdata = (float*) data;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; temp[tid] = fdata[tid];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; temp[tid + THREAD_NUM] = fdata[tid + THREAD_NUM];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; temp[tid + THREAD_NUM*2] = fdata[tid + THREAD_NUM*2];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; output[tid] = temp[tid*3] * temp[tid*3] +<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; temp[tid*3+1] * temp[tid*3+1] +<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; temp[tid*3+2] * temp[tid*3+2];<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>In the example above, the data is first read from global memory into shared memory with contiguous accesses. Since shared memory does not care about access order (though watch out for bank conflicts, as in the previous section), this sidesteps the non-coalesced reads.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Texture</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>CUDA supports textures. A CUDA kernel can read texture data through the chip's texture units. The biggest differences between a texture and global memory are that a texture is read-only and that the chip has a texture cache of a certain size. Texture reads therefore reach decent efficiency without having to follow the coalescing rules. Texture reads can also use the chip's texture filtering (such as bilinear filtering) and fast type conversion, for example converting 32-bit RGBA data directly into four 32-bit floats.</span></p>
<p align=left><span>The on-chip texture cache is designed for ordinary graphics workloads, so it still works best for accesses with spatial locality rather than random access. The threads of a warp should therefore read nearby addresses for best efficiency.</span></p>
<p align=left><span>For data that can already satisfy the coalescing rules, global memory is usually faster than texture.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Arithmetic Units</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>The arithmetic unit in a stream processor is basically a floating-point fused multiply-add unit; that is, it can perform one multiplication and one addition, as in:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; a = b * c + d;</span></p>
<p align=left><span>The compiler automatically combines suitable addition and multiplication operations into a single fmad instruction.</span></p>
<p align=left><span>Besides floating-point addition and multiplication, integer addition, bitwise operations, comparisons, min, max, and type conversions (float to int and int to float) all run at full speed. Integer multiplication does not run at full speed, but 24-bit multiplication does; CUDA's built-in __mul24 and __umul24 functions perform 24-bit integer multiplication.</span></p>
<p align=left><span>Floating-point division is computed as a reciprocal followed by a multiplication, so its precision does not reach the IEEE 754 specification (the maximum error is 2 ulp). The built-in __fdividef(x,y) provides an even faster division with the same precision as ordinary division, but it yields incorrect results when 2<sup>126</sup> &lt; y &lt; 2<sup>128</sup>.</span></p>
<p align=left><span>CUDA also provides some lower-precision intrinsic functions, including __expf, __logf, __sinf, __cosf, __powf, and so on. They are faster than the standard functions but less precise. See Appendix B of the CUDA Programming Guide 1.1 for details.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Data Transfers with Host Memory</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>In CUDA the GPU cannot access host memory directly, only the video memory on the card. Data therefore has to be copied from host memory to video memory, processed, and the results copied back from video memory to host memory. These copies are limited by PCI Express speed: with a PCI Express x16 slot, PCI Express 1.0 provides 4GB/s of bandwidth in each direction, and PCI Express 2.0 provides 8GB/s; both are theoretical figures.</span></p>
<p align=left><span>When copying from ordinary host memory to video memory, CUDA first copies the data into an internal buffer, because the operating system may move ordinary memory at any time, and only then uses DMA to transfer it to video memory. To avoid this extra copy, you can use the cudaMallocHost function to obtain a block of page-locked host memory. Requesting too much page-locked memory, however, interferes with the operating system's memory management and can reduce overall system efficiency.</span></p>
<p align=left><span><a href="http://www.kimicat.com/cuda%E7%B0%A1%E4%BB%8B" target=_blank>Original article</a></span></p>
<p>&nbsp;</p>
]]></description></item><item><title>Understanding CUDA - [Chapter 5][A Second CUDA Program]</title><link>http://www.cppblog.com/hktk/archive/2009/09/20/96758.html</link><dc:creator>海 阔 天 空</dc:creator><author>海 阔 天 空</author><pubDate>Sun, 20 Sep 2009 02:39:00 GMT</pubDate><guid>http://www.cppblog.com/hktk/archive/2009/09/20/96758.html</guid><description><![CDATA[&nbsp;
<p align=left><strong><span>A Second CUDA Program</span></strong></p>
<p align=left><span>The sum-of-squares program introduced earlier seems to have little practical value. So our second CUDA program will do something with (some) actual practical value: matrix multiplication. And this time, we will use floating-point numbers.</span></p>
<p align=left><span>Matrix multiplication may be a bit of a cliché, but it is quite simple and can be used to illustrate some interesting properties of CUDA.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Matrix Multiplication</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>For simplicity we use square matrices. Basically, given two matrices A and B, the product AB = C is computed as follows:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; for(i = 0; i &lt; n; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(j = 0; j &lt; n; j++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; C[i][j] = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(k = 0; k &lt; n; k++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; C[i][j] += A[i][k] * B[k][j];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>First, we set up data generation, CUDA initialization, and so on:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int main()<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float *a, *b, *c, *d;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int n = 1000;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(!InitCUDA()) return 0;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; a = (float*) malloc(sizeof(float) * n * n);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; b = (float*) malloc(sizeof(float) * n * n);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; c = (float*) malloc(sizeof(float) * n * n);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; d = (float*) malloc(sizeof(float) * n * n);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; srand(0);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; matgen(a, n, n);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; matgen(b, n, n);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; clock_t time = matmultCUDA(a, n, b, n, c, n, n);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; matmult(a, n, b, n, d, n, n);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; compare_mat(c, n, d, n, n);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; double sec = (double) time / CLOCKS_PER_SEC;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; printf("Time used: %.2f (%.2lf GFLOPS)\n", sec,<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; 2.0 * n * n * n / (sec * 1E9));<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; return 0;<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>The InitCUDA function is the same as in the first CUDA program; see the earlier article. Here are the other functions used above:</span></p>
<p align=left><span>Generating a matrix:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; void matgen(float* a, int lda, int n)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int i, j;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = 0; i &lt; n; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(j = 0; j &lt; n; j++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; /* cast before multiplying to avoid int overflow in RAND_MAX * RAND_MAX */<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; a[i * lda + j] = (float) rand() / RAND_MAX + <br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; (float) rand() / ((double) RAND_MAX * RAND_MAX);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>This function simply fills the matrix with random numbers between 0 and 1 using the random-number generator. Note that, because C cannot declare a two-dimensional array of variable size, we index with i * lda + j.</span></p>
<p align=left><span>Performing the matrix multiplication:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; void matmult(const float* a, int lda, const float* b, int ldb, float* c, int ldc, int n)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int i, j, k;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = 0; i &lt; n; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(j = 0; j &lt; n; j++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; double t = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(k = 0; k &lt; n; k++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; t += a[i * lda + k] * b[k * ldb + j];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; c[i * ldc + j] = t;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>This is the CPU matrix multiplication used to verify that the answer is correct. Note that it accumulates the intermediate result in a double to improve precision.</span></p>
<p align=left><span>Verifying the result:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; void compare_mat(const float* a, int lda, const float* b, int ldb, int n)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float max_err = 0;<br>&nbsp; &nbsp;&nbsp;&nbsp; &nbsp; float average_err = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int i, j;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = 0; i &lt; n; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(j = 0; j &lt; n; j++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(b[i * ldb + j] != 0) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float err = fabs((a[i * lda + j] - <br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; b[i * ldb + j]) / b[i * ldb + j]);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(max_err &lt; err) max_err = err;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; average_err += err;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; printf("Max error: %g Average error: %g\n",<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; max_err, average_err / (n * n));<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>This function computes the maximum and average relative error between the two matrices and prints them.</span></p>
<p align=left><span>Finally, the CUDA part of the matrix multiplication:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; #define NUM_THREADS 256<br><br>&nbsp;&nbsp;&nbsp; clock_t matmultCUDA(const float* a, int lda, const float* b, int ldb, float* c, int ldc, int n)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float *ac, *bc, *cc;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; clock_t start, end;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; start = clock();<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;ac, sizeof(float) * n * n); <br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;bc, sizeof(float) * n * n);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;cc, sizeof(float) * n * n);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaMemcpy2D(ac, sizeof(float) * n, a, sizeof(float) * lda, sizeof(float) * n, n, cudaMemcpyHostToDevice);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaMemcpy2D(bc, sizeof(float) * n, b, sizeof(float) * ldb, sizeof(float) * n, n, cudaMemcpyHostToDevice);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int blocks = (n + NUM_THREADS - 1) / NUM_THREADS; matMultCUDA&lt;&lt;&lt;blocks * n, NUM_THREADS&gt;&gt;&gt;(ac, n, bc, n, cc, n, n);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaMemcpy2D(c, sizeof(float) * ldc, cc, sizeof(float) * n, sizeof(float) * n, n, cudaMemcpyDeviceToHost);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaFree(ac);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaFree(bc);<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaFree(cc);<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; end = clock();<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; return end - start;<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>This function is quite simple: it allocates memory for the matrices in video memory, then copies the matrix data over from host memory. However, because our matrix-multiply function accepts a pitch for each matrix (lda, ldb, and ldc), copying with the plain cudaMemcpy function would require a separate copy per row, that is, many cudaMemcpy calls, which would be very inefficient. So here we use a new function, cudaMemcpy2D. It copies a two-dimensional array and lets you specify each array's pitch, so a single function call is enough.</span></p>
<p align=left><span>The kernel that performs the computation is:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; __global__ static void matMultCUDA(const float* a, size_t lda, const float* b, size_t ldb, float* c, size_t ldc, int n)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int tid = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int bid = blockIdx.x;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int idx = bid * blockDim.x + tid;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int row = idx / n;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int column = idx % n;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int i;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(row &lt; n &amp;&amp; column &lt; n) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float t = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = 0; i &lt; n; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; t += a[row * lda + i] * b[i * ldb + column];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; c[row * ldc + column] = t;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>This function first computes, from bid and tid, the row and column this thread should handle; after checking that row and column are in range, it computes the result directly and writes it into the C matrix. A very simple function.</span></p>
<p align=left><span>Running it on a GeForce 8800GT gives the following:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; Max error: 2.01484e-006 Average error: 3.36637e-007<br>&nbsp;&nbsp;&nbsp; Time used: 1.1560 (1.73 GFLOPS)</span></p>
<p align=left><span>Two problems are apparent:</span></p>
<ol type=1>
    <li><span>The execution efficiency is clearly very poor.</span></li>
    <li><span>The maximum relative error is too high; ideally it should be below 1e-6.</span></li>
</ol>
<p align=left><span>The error is high because, on the CPU, we accumulate in a double (a 64-bit float), while on the GPU we can only use a float (32 bits). When accumulating a long series of numbers, the running total quickly grows large, so too many low-order digits of the later terms get rounded away.</span></p>
<p align=left><span>Because CUDA's floating-point addition, subtraction, and multiplication meet the precision required by IEEE 754, we can use Kahan's summation formula to improve accuracy. Change the program to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; if(row &lt; n &amp;&amp; column &lt; n) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float t = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float y = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = 0; i &lt; n; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float r;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; y -= a[row * lda + i] * b[i * ldb + column];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; r = t - y;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; y = (r - t) + y;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; t = r;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; c[row * ldc + column] = t;<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>The modified program produces:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; Max error: 1.19209e-007 Average error: 4.22751e-008<br>&nbsp;&nbsp;&nbsp; Time used: 1.1560 (1.73 GFLOPS)</span></p>
<p align=left><span>The relative error improves greatly, while the performance is essentially unchanged.</span></p>
<p align=left><span>Kahan's summation formula increases the amount of arithmetic, yet performance did not change, which suggests this kernel's main bottleneck is memory access: a great deal of the memory reading is redundant. For example, a row of matrix A is read again for every computation it participates in, which is quite wasteful. This scheme reads memory 2*n<sup>3</sup> times in total; if each row only had to be read once, that would drop to n<sup>3</sup>+n<sup>2</sup> reads.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>First Improvement</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>As in our first CUDA program, we can use shared memory to hold each row's data. However, since only the threads of the same block share shared memory, a row must now be computed entirely by the threads of a single block, and we need enough shared memory to hold a whole row. So, first change the kernel launch to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; matMultCUDA&lt;&lt;&lt;n, NUM_THREADS, sizeof(float) * n&gt;&gt;&gt;(ac, n, bc, n, cc, n, n);</span></p>
<p align=left><span>and the kernel itself to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; __global__ static void matMultCUDA(const float* a, size_t lda, const float* b, size_t ldb, float* c, size_t ldc, int n)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; extern __shared__ float data[];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int tid = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int row = blockIdx.x;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int i, j;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = tid; i &lt; n; i += blockDim.x) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; data[i] = a[row * lda + i];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __syncthreads();<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(j = tid; j &lt; n; j += blockDim.x) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; float t = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float y = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = 0; i &lt; n; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; float r;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; y -= data[i] * b[i * ldb + j];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; r = t - y;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; y = (r - t) + y;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; t = r;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; c[row * ldc + j] = t;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>The first part loads the entire row into shared memory, and the second part performs the computation, largely unchanged. The main difference is that each row is now computed by a single block.</span></p>
<p align=left><span>On a GeForce 8800GT, the result is:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; Max error: 1.19209e-007&nbsp; Average error: 4.22751e-008<br>&nbsp;&nbsp;&nbsp; Time used: 0.4220&nbsp;&nbsp; (4.74 GFLOPS)</span></p>
<p align=left><span>Clearly the computed result is unchanged, while the speed has more than doubled. Even so, the efficiency is still far from ideal: the GeForce 8800GT theoretically delivers over 300 GFLOPS. Even accounting for the extra operations required by Kahan's summation formula, this is still less than a tenth of the theoretical maximum.</span></p>
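<p align=left><span>The compensated-summation trick used in the kernel above can be illustrated on the CPU. The following is a standalone sketch (plain C++, not CUDA) of the same idea: a small addend repeatedly lost to single-precision rounding is recovered by carrying the rounding error in a second variable, mirroring the kernel's <span>t</span>/<span>y</span>/<span>r</span> update.</span></p>

```cpp
#include <cmath>

// Naive single-precision accumulation: addends smaller than half an ulp
// of the running sum are silently rounded away.
float naive_sum(int n, float x) {
    float s = 1.0f;
    for (int i = 0; i < n; i++) s += x;
    return s;
}

// Kahan compensated summation: 'comp' carries the rounding error of each
// step so that the lost low-order bits are added back on later steps.
float kahan_sum(int n, float x) {
    float s = 1.0f, comp = 0.0f;
    for (int i = 0; i < n; i++) {
        float y = x - comp;   // re-inject previously lost low-order bits
        float t = s + y;      // high-order part of the new sum
        comp = (t - s) - y;   // what was rounded away in this step
        s = t;
    }
    return s;
}
```

<p align=left><span>Adding 1e-8f to 1.0f ten million times leaves the naive sum stuck at 1.0, while the compensated sum lands near the exact value 1.1, at the cost of three extra floating-point operations per step.</span></p>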
<p align=left><span>The reason is the same as before: too many memory accesses. Although each row of matrix A no longer needs to be read repeatedly, the columns of matrix B are still read over and over.</span></p>
<p align=left><span>Another issue is less obvious: the reads of matrix B may look non-contiguous, but they are actually contiguous, because different threads read different columns, so at any instant the columns read by all threads together form one contiguous block of memory. Why, then, is the efficiency still poor? Because the GPU's memory controller reaches peak efficiency only when reads start at addresses that are a multiple of some fixed size (for example, a multiple of 16 bytes). Since the matrix width is not a multiple of 16 (we use 1000x1000 matrices here), efficiency suffers.</span></p>
<p align=left><span>To fix this, we can tweak the </span><span>cudaMalloc </span><span>call so that the allocated width becomes a suitable multiple. But what is a suitable multiple? Fortunately, we do not need to know: CUDA provides a function, </span><span>cudaMallocPitch</span><span>, that automatically allocates memory with the optimal pitch. So we change the </span><span>cudaMalloc</span><span> part to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; size_t pitch_a, pitch_b, pitch_c;<br>&nbsp;&nbsp;&nbsp; cudaMallocPitch((void**) &amp;ac, &amp;pitch_a, sizeof(float) * n, n);<br>&nbsp;&nbsp;&nbsp; cudaMallocPitch((void**) &amp;bc, &amp;pitch_b, sizeof(float) * n, n);<br>&nbsp;&nbsp;&nbsp; cudaMallocPitch((void**) &amp;cc, &amp;pitch_c, sizeof(float) * n, n);</span></p>
<p align=left><span>cudaMallocPitch </span><span>allocates memory with an appropriate pitch and returns the allocated width. When copying the matrices to device memory, we therefore use the returned pitch:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; cudaMemcpy2D(ac, pitch_a, a, sizeof(float) * lda, sizeof(float) * n, n, cudaMemcpyHostToDevice);<br>&nbsp;&nbsp;&nbsp; cudaMemcpy2D(bc, pitch_b, b, sizeof(float) * ldb, sizeof(float) * n, n, cudaMemcpyHostToDevice);</span></p>
<p align=left><span>The kernel invocation also needs updating:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; matMultCUDA&lt;&lt;&lt;n, NUM_THREADS, sizeof(float) * n&gt;&gt;&gt;(ac, pitch_a / sizeof(float), bc, pitch_b / sizeof(float), cc, pitch_c / sizeof(float), n);</span></p>
<p align=left><span>Likewise, when copying the result back to host memory, use the returned pitch:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; cudaMemcpy2D(c, sizeof(float) * ldc, cc, pitch_c, sizeof(float) * n, n, cudaMemcpyDeviceToHost);</span></p>
<p align=left><span>That completes the change; the kernel itself needs no modification.</span></p>
<p align=left><span>How much does this change help? On a GeForce 8800GT, the result is:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; Max error: 1.19209e-007&nbsp; Average error: 4.22751e-008<br>&nbsp;&nbsp;&nbsp; Time used: 0.1250&nbsp;&nbsp; (16.00 GFLOPS)</span></p>
<p align=left><span>As you can see, the execution speed again improves by more than a factor of three, just from slightly changing how the memory is allocated.</span></p>
<p align=left><span>Although the speed improved considerably, a sizable gap to the theoretical peak remains. As mentioned earlier, this approach performs <span>n<sup>3</sup>+n<sup>2</sup></span> memory reads and <span>n<sup>2</sup></span> memory writes. With <span>n = 1000</span> and 32-bit values, the total memory traffic is about 4GB. Dividing by the measured 0.125 seconds gives roughly 32GB/s, which is already close to the memory bandwidth of the GeForce 8800GT. Since our timing also includes memory allocation and data copying, the time actually spent in the kernel is even shorter (about 0.09 seconds). Clearly, this program's performance is bound by memory bandwidth.</span></p>
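<p align=left><span>The bandwidth estimate above can be reproduced with a few lines of arithmetic. This plain C++ sketch plugs in the access counts stated in the text; the 0.125-second figure is the measured total time quoted above.</span></p>

```cpp
#include <cstdint>

// Total bytes moved by the row-in-shared-memory version:
// n^3 + n^2 reads plus n^2 writes, 4 bytes (one float) each.
double traffic_bytes(std::int64_t n) {
    double nd = double(n);
    return 4.0 * (nd * nd * nd + 2.0 * nd * nd);
}

// Effective bandwidth in GB/s for a given wall-clock time.
double bandwidth_gbs(std::int64_t n, double seconds) {
    return traffic_bytes(n) / seconds / 1e9;
}
```

<p align=left><span>For n = 1000 this gives about 4GB of traffic and, at 0.125 seconds, roughly 32GB/s, matching the numbers in the text.</span></p>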
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Further Improvements</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>The previous section concluded that the matrix-multiplication program is limited by memory bandwidth. Is there a way to reduce the number of memory accesses? Of course there is, otherwise this section would not exist<span> :)</span></span></p>
<p align=left><span>To further reduce memory-bandwidth usage, note that in the previous approach, while accesses to matrix A were minimized, accesses to matrix B were not: we only loaded rows of A into shared memory, yet columns of B are also reused, and ideally should not be reloaded either. However, a column of B is used at different times than a row of A, so this cannot be done directly.</span></p>
<p align=left><span>The solution is <span>"blocking"</span>: splitting the whole matrix multiplication into many small-matrix multiplications. For example, computing elements <span>(0, 0) ~ (15, 15)</span> of matrix C can be viewed as:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; A(0~15, 0~15) * B(0~15, 0~15) + A(0~15,16~31) * B(16~31, 0~15)<br>&nbsp;&nbsp;&nbsp; + A(0~15, 32~47) * B(32~47, 0~15) + ...</span></p>
<p align=left><span>This way, we can load two small matrices into shared memory, and the small-matrix multiplication itself no longer needs any external memory access at all! If the small-matrix size is <span>k</span>, the total number of memory accesses becomes about <span>2k<sup>2</sup>(n/k)<sup>3</sup> = 2n<sup>3</sup>/k</span>.</span></p>
<p align=left><span>Since CUDA currently allows at most 512 threads per block, <span>k = 16</span> seems a fairly ideal choice (256 threads in total). For an <span>n = 1000</span> matrix, this reduces memory traffic to about 500MB, i.e. 1/8 of the previous section's. In theory this should improve efficiency eightfold (assuming no other bottleneck is hit).</span></p>
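<p align=left><span>The traffic formula can be checked numerically. This plain C++ sketch evaluates the expression derived above; <span>k</span> is the tile size.</span></p>

```cpp
// With k x k tiles, each of the (n/k)^3 tile multiplications loads two
// k x k tiles from global memory, so the element count is
// 2 * k^2 * (n/k)^3 = 2 * n^3 / k; bytes = 4 * elements (float).
double blocked_traffic_bytes(double n, double k) {
    return 4.0 * 2.0 * n * n * n / k;
}
```

<p align=left><span>For n = 1000 and k = 16 this gives 500MB, exactly 1/8 of the roughly 4GB moved by the unblocked version.</span></p>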
<p align=left><span>To make the block computation convenient, we give each block <span>16x16</span> threads and create <span>(n/16)x(n/16)</span> blocks. Change the kernel invocation to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int bx = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;<br>&nbsp;&nbsp;&nbsp; dim3 blocks(bx, bx);<br>&nbsp;&nbsp;&nbsp; dim3 threads(BLOCK_SIZE, BLOCK_SIZE);<br>&nbsp;&nbsp;&nbsp; matMultCUDA&lt;&lt;&lt;blocks, threads&gt;&gt;&gt;(ac, pitch_a / sizeof(float), bc, pitch_b / sizeof(float), cc, pitch_c / sizeof(float), n);</span></p>
<p align=left><span>BLOCK_SIZE</span><span> is defined as <span>16</span>. </span><span>dim3</span><span> is a CUDA data type representing a 3D vector. Here we use </span><span>dim3</span><span> to create blocks of <span>16x16</span> threads and a grid of <span>(n/16)x(n/16)</span> blocks.</span></p>
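<p align=left><span>The grid dimension <span>bx</span> computed above is the usual round-up integer division, sketched here in plain C++:</span></p>

```cpp
// Number of block_size-wide blocks needed to cover n columns (round up).
int grid_dim(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}
```

<p align=left><span>For n = 1000 and a block size of 16 this yields 63 blocks per dimension; the last partial block is handled by the range checks in the kernel.</span></p>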
<p align=left><span>The kernel </span><span>becomes:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; __global__ static void matMultCUDA(const float* a, size_t lda, const float* b, size_t ldb, float* c, size_t ldc, int n)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __shared__ float matA[BLOCK_SIZE][BLOCK_SIZE];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __shared__ float matB[BLOCK_SIZE][BLOCK_SIZE];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int tidc = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int tidr = threadIdx.y;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int bidc = blockIdx.x * BLOCK_SIZE;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int bidr = blockIdx.y * BLOCK_SIZE;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int i, j;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float results = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float comp = 0;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(j = 0; j &lt; n; j += BLOCK_SIZE) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; if(tidr + bidr &lt; n &amp;&amp; tidc + j &lt; n) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; matA[tidr][tidc] = a[(tidr + bidr) * lda + tidc + j];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; else {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; matA[tidr][tidc] = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(tidr + j &lt; n &amp;&amp; tidc + bidc &lt; n) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; matB[tidr][tidc] = b[(tidr + j) * ldb + tidc + bidc];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; else {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; matB[tidr][tidc] = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; 
}<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; __syncthreads();<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = 0; i &lt; BLOCK_SIZE; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float t;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; comp -= matA[tidr][i] * matB[i][tidc];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; t = results - comp;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; comp = (t - results) + comp;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; results = t;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(tidr + bidr &lt; n &amp;&amp; tidc + bidc &lt; n) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; c[(tidr + bidr) * ldc + tidc + bidc] = results;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>Note that since we now use <span>16x16</span> threads, the </span><span>threadIdx</span><span> variable provides </span><span>threadIdx.x</span><span> and </span><span>threadIdx.y</span><span>, each ranging over <span>0 ~ 15</span>. Similarly, </span><span>blockIdx.x</span><span> and </span><span>blockIdx.y</span><span> range over <span>0 ~ n/16</span>.</span></p>
<p align=left><span>Because the matrix size is not necessarily a multiple of <span>16</span>, the kernel uses <span>if</span> statements to check for out-of-range accesses.</span></p>
<p align=left><span>On a GeForce 8800GT, this version produces:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; Max error: 1.19209e-007&nbsp; Average error: 4.22751e-008<br>&nbsp;&nbsp;&nbsp; Time used: 0.0780&nbsp;&nbsp; (25.64 GFLOPS)</span></p>
<p align=left><span>The speed improved, but apparently not by the expected eightfold. As noted earlier, our timing includes memory copies and allocations, whose cost does not shrink. The actual kernel time is about 0.053 seconds (roughly 38GFLOPS), nearly twice as fast as the previous version.</span></p>
<p align=left><span>If this version is no longer bandwidth-bound, why does it fall short of the expected efficiency? Because it is now compute-bound: besides the extra operations required by Kahan's summation formula, the program also performs many multiplications to compute matrix addresses, all of which consume compute resources. The <span>if</span> statements guarding against out-of-range accesses have some impact as well.</span></p>
<p align=left><span>One way to eliminate those <span>if</span> statements is to allocate the memory as a multiple of <span>16</span> in the first place, and clear it to <span>0</span> before copying the matrices to device memory, like this:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int newn = ((n + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;<br><br>&nbsp;&nbsp;&nbsp; cudaMallocPitch((void**) &amp;ac, &amp;pitch_a, sizeof(float) * newn, newn);<br>&nbsp;&nbsp; &nbsp;cudaMallocPitch((void**) &amp;bc, &amp;pitch_b, sizeof(float) * newn, newn);<br>&nbsp;&nbsp; &nbsp;cudaMallocPitch((void**) &amp;cc, &amp;pitch_c, sizeof(float) * newn, newn);<br><br>&nbsp;&nbsp; &nbsp;cudaMemset(ac, 0, pitch_a * newn);<br>&nbsp;&nbsp; &nbsp;cudaMemset(bc, 0, pitch_b * newn);</span></p>
<p align=left><span>Now we can remove all the <span>if</span> statements from the kernel:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; __global__ static void matMultCUDA(const float* a, size_t lda, const float* b, size_t ldb, float* c, size_t ldc, int n)<br>&nbsp;&nbsp;&nbsp; {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __shared__ float matA[BLOCK_SIZE][BLOCK_SIZE];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __shared__ float matB[BLOCK_SIZE][BLOCK_SIZE];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int tidc = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int tidr = threadIdx.y;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int bidc = blockIdx.x * BLOCK_SIZE;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; const int bidr = blockIdx.y * BLOCK_SIZE;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; int i, j;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float results = 0;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; float comp = 0;<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(j = 0; j &lt; n; j += BLOCK_SIZE) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; matA[tidr][tidc] = a[(tidr + bidr) * lda + tidc + j];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; matB[tidr][tidc] = b[(tidr + j) * ldb + tidc + bidc];<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __syncthreads();<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; for(i = 0; i &lt; BLOCK_SIZE; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; float t;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; comp -= matA[tidr][i] * matB[i][tidc];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; t = results - comp;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; &nbsp;&nbsp;&nbsp; comp = (t - results) + comp;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; results = t;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp; &nbsp; &nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; 
}<br><br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; c[(tidr + bidr) * ldc + tidc + bidc] = results;<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>This version's result is:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; Max error: 1.19209e-007&nbsp; Average error: 4.22751e-008<br>&nbsp;&nbsp;&nbsp; Time used: 0.0780&nbsp;&nbsp; (25.64 GFLOPS)</span></p>
<p align=left><span>There seems to be no improvement, but the actual kernel time has in fact dropped to 0.042 seconds (roughly 48GFLOPS).</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Conclusion</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>Some readers may wonder whether making the block even larger (say <span>32x32</span>) would help. Since the final program is no longer bandwidth-bound (accessing 500MB in 0.042 seconds corresponds to only about 12GB/s), enlarging the block would not help. Moreover, a block can have at most 512 threads, so enlarging it also incurs considerable overhead, and shared memory is limited too (16384 bytes on the GeForce 8800GT), so the block size cannot be increased arbitrarily.</span></p>
<p align=left><span>The complete source of the final version can be downloaded from <a href="http://www.pcinlife.com/article_photo/hotball_cuda/second_cuda.cu">here</a>.</span></p>
<p>&nbsp;</p>
<img src ="http://www.cppblog.com/hktk/aggbug/96758.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hktk/" target="_blank">海 阔 天 空</a> 2009-09-20 10:39 <a href="http://www.cppblog.com/hktk/archive/2009/09/20/96758.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>深入浅出谈CUDA-[第四章][改良第一个CUDA程序]</title><link>http://www.cppblog.com/hktk/archive/2009/09/20/96757.html</link><dc:creator>海 阔 天 空</dc:creator><author>海 阔 天 空</author><pubDate>Sun, 20 Sep 2009 02:37:00 GMT</pubDate><guid>http://www.cppblog.com/hktk/archive/2009/09/20/96757.html</guid><description><![CDATA[&nbsp;
<p align=left><strong><span>Improving Our First CUDA Program</span></strong><span> </span></p>
<p align=left><span>In the <a href="http://www.pcinlife.com/article/graphics/2008-06-04/1212575164d532_2.html" target=_blank>previous article</a>, we wrote a program that computes the sum of squares of a large array of numbers. We also noted, however, that its efficiency was far from ideal. In fact, if all you need is the sum of squares, doing it on the CPU is much faster than on the GPU, because the computation requires little arithmetic power and is almost entirely limited by memory bandwidth; merely copying the data to device memory may already take about as long as computing directly on the CPU.</span></p>
<p align=left><span>Still, if the sum-of-squares computation is just one part of a more complex pipeline, doing it on the GPU has its advantages. And if the data already resides in device memory (for example, generated on the GPU by some algorithm), performing the computation there will indeed be faster.</span></p>
<p align=left><span>As just mentioned, the main bottleneck of this computation is memory bandwidth, and a graphics card's memory bandwidth is, in theory, quite large. So let us see how much of that bandwidth our first program actually uses.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Parallelizing the Program</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left>&nbsp;</p>
<p align=left><span>Our first program did not exploit any parallelism: the whole program ran in a single thread. On a GeForce 8800GT, the part executed on the GPU (called the <span>"<strong>kernel</strong>"</span>) takes about 640M cycles. The 8800GT's execution units run at 1.5GHz, so this corresponds to about 0.43 seconds. 1M 32-bit numbers amount to 4MB of data, so the program's effective memory bandwidth is only about 9.3MB/s. That is a terrible result.</span></p>
<p align=left><span>Why is the performance so poor? It is a consequence of the GPU architecture. In CUDA, memory copied to the graphics card in the usual way resides in <strong><span>global memory</span></strong>. Global memory is not cached, and its access latency is very high, typically hundreds of cycles. Since our program has only one thread, every read of global memory must wait until the data actually arrives and is accumulated into <span>sum</span> before the next step can proceed. That is why the performance is so bad.</span></p>
<p align=left><span>Because global memory is uncached, the way to hide its huge latency is to use a large number of threads. With many threads running concurrently, when one thread reads memory and starts waiting for the result, the GPU can immediately switch to another thread and issue the next read. Ideally, with enough threads, the global-memory latency can be hidden completely.</span></p>
<p align=left><span>How do we parallelize the sum-of-squares program? The simplest approach is to partition the numbers into groups, compute each group's sum of squares separately, and finally add up the group sums. To start, we can let the CPU perform the final summation.</span></p>
<p align=left><span>First, in <span>first_cuda.cu</span>, add a <span>#define</span> after </span><span>#define DATA_SIZE</span><span> to set the number of threads:</span></p>
<p align=left><span>#define DATA_SIZE&nbsp;&nbsp;&nbsp; 1048576<br>#define THREAD_NUM&nbsp;&nbsp; 256</span></p>
<p align=left><span>Then change the kernel to:</span></p>
<p align=left><span>__global__ static void sumOfSquares(int *num, int* result, clock_t* time)<br>{<br>&nbsp;&nbsp;&nbsp; const int tid = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; const int size = DATA_SIZE / THREAD_NUM;<br>&nbsp;&nbsp;&nbsp; int sum = 0;<br>&nbsp;&nbsp;&nbsp; int i;<br>&nbsp;&nbsp;&nbsp; clock_t start;<br>&nbsp;&nbsp;&nbsp; if(tid == 0) start = clock();<br>&nbsp;&nbsp;&nbsp; for(i = tid * size; i &lt; (tid + 1) * size; i++) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sum += num[i] * num[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; result[tid] = sum;<br>&nbsp;&nbsp;&nbsp; if(tid == 0) *time = clock() - start;<br>}</span></p>
<p align=left><span>Here, </span><span>threadIdx</span><span> is a built-in CUDA variable indicating which thread this is (counting from 0). In our example there are 256 threads, so 256 instances of </span><span>sumOfSquares</span><span> run simultaneously, each with a </span><span>threadIdx.x</span><span> between <span>0 ~ 255</span>. Using this variable, each instance computes the sum of squares over a different part of the data. We also restrict the timing code to thread 0 (i.e. </span><span>threadIdx.x</span><span> = 0</span><span>).</span></p>
<p align=left><span>Likewise, since there will now be <span>256</span> results, the memory holding </span><span>result</span><span> must also grow. Change the middle of the </span><span>main</span><span> function to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int* gpudata, *result;<br>&nbsp;&nbsp;&nbsp; clock_t* time;<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;gpudata, sizeof(int) * DATA_SIZE);<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;result, sizeof(int) * THREAD_NUM);<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;time, sizeof(clock_t));<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);<br><br>&nbsp;&nbsp;&nbsp; sumOfSquares&lt;&lt;&lt;1, THREAD_NUM, 0&gt;&gt;&gt;(gpudata, result, time);<br><br>&nbsp;&nbsp;&nbsp; int sum[THREAD_NUM];<br>&nbsp;&nbsp;&nbsp; clock_t time_used;<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(sum, result, sizeof(int) * THREAD_NUM, cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(&amp;time_used, time, sizeof(clock_t), cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaFree(gpudata);<br>&nbsp;&nbsp;&nbsp; cudaFree(result);<br>&nbsp;&nbsp;&nbsp; cudaFree(time);</span></p>
<p align=left><span>Note that when invoking </span><span>sumOfSquares</span><span> we now specify </span><span>THREAD_NUM</span><span> as the number of threads. Finally, the CPU adds up the per-group sums:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int final_sum = 0;<br>&nbsp;&nbsp;&nbsp; for(int i = 0; i &lt; THREAD_NUM; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; final_sum += sum[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; printf("sum: %d&nbsp; time: %d\n", final_sum, time_used);<br><br>&nbsp;&nbsp;&nbsp; final_sum = 0;<br>&nbsp;&nbsp;&nbsp; for(int i = 0; i &lt; DATA_SIZE; i++) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; final_sum += data[i] * data[i];<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; printf("sum (CPU): %d\n", final_sum);</span></p>
<p align=left><span>Compile and run, and verify that the result matches the original.</span></p>
<p align=left><span>This version of the program takes only about 8.3M cycles on a GeForce 8800GT, 77 times faster than the previous version! That is the effect of hiding latency with a large number of threads.</span></p>
<p align=left><span>However, computing the memory bandwidth it actually uses shows it is still far from ideal: only about 723MB/s, a long way from the GeForce 8800GT's memory bandwidth. Why?</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Memory Access Patterns</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>The memory on a graphics card is DRAM, so the most efficient access pattern is sequential. The previous program looks like it accesses memory sequentially (each thread computes the sum of squares over a contiguous chunk of numbers), but we must consider how threads actually execute. As noted earlier, when one thread is waiting on memory, the GPU switches to the next thread, so the actual execution order is roughly:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; thread 0 -&gt; thread 1 -&gt; thread 2 -&gt; ...</span></p>
<p align=left><span>Therefore, accesses that are sequential within a single thread are not sequential in actual execution. To make the actual accesses sequential, thread 0 should read the first number, thread 1 the second, and so on. So we change the kernel to:</span></p>
<p align=left><span>__global__ static void sumOfSquares(int *num, int* result, clock_t* time)<br>{<br>&nbsp;&nbsp;&nbsp; const int tid = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; int sum = 0;<br>&nbsp;&nbsp;&nbsp; int i;<br>&nbsp;&nbsp;&nbsp; clock_t start;<br>&nbsp;&nbsp;&nbsp; if(tid == 0) start = clock();<br>&nbsp;&nbsp;&nbsp; for(i = tid; i &lt; DATA_SIZE; i += THREAD_NUM) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sum += num[i] * num[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; result[tid] = sum;<br>&nbsp;&nbsp;&nbsp; if(tid == 0) *time = clock() - start;<br>}</span></p>
<p align=left><span>Compile and run, and verify that the result is the same.</span></p>
<p align=left><span>This simple change makes a big difference in practice: on a GeForce 8800GT, the program now takes 2.6M cycles, three times faster than the previous version. Yet that is still only about 2.3GB/s of bandwidth.</span></p>
<p align=left><span>The reason is that we still do not have enough threads. In theory, 256 threads can hide at most 256 cycles of latency, but global-memory latency can exceed 500 cycles. Increasing the thread count improves efficiency: for example, setting </span><span>THREAD_NUM</span><span> to <span>512</span> reduces the time to 1.95M cycles on a GeForce 8800GT. Some improvement, but still not enough. Unfortunately, a block on the GeForce 8800GT can currently have at most 512 threads, so we cannot increase it further; besides, with too many threads, the final summation on the CPU also grows.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>More Parallelism</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>We mentioned blocks above, and the <span>"number of blocks"</span> parameter appeared earlier when introducing CUDA kernel invocation. So far we have used only one block. What exactly is a block?</span></p>
<p align=left><span>In CUDA, threads can be grouped into blocks. Threads within a block share a common shared memory and can synchronize with each other; threads in different blocks cannot. Our program needs little thread synchronization, so we can simply use multiple blocks to further increase the thread count.</span></p>
<p align=left><span>First, change the </span><span>#define DATA_SIZE</span><span> section to:</span></p>
<p align=left><span>#define DATA_SIZE&nbsp;&nbsp; 1048576<br>#define BLOCK_NUM&nbsp;&nbsp; 32<br>#define THREAD_NUM&nbsp;&nbsp; 256</span></p>
<p align=left><span>This creates <span>32</span> blocks of <span>256</span> threads each, for a total of <span>32*256 = 8192</span> threads.</span></p>
<p align=left><span>Next, change the kernel to:</span></p>
<p align=left><span>__global__ static void sumOfSquares(int *num, int* result, clock_t* time)<br>{<br>&nbsp;&nbsp;&nbsp; const int tid = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; const int bid = blockIdx.x;<br>&nbsp;&nbsp;&nbsp; int sum = 0;<br>&nbsp;&nbsp;&nbsp; int i;<br>&nbsp;&nbsp;&nbsp; if(tid == 0) time[bid] = clock();<br>&nbsp;&nbsp;&nbsp; for(i = bid * THREAD_NUM + tid; i &lt; DATA_SIZE;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; i += BLOCK_NUM * THREAD_NUM) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sum += num[i] * num[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; result[bid * THREAD_NUM + tid] = sum;<br>&nbsp;&nbsp;&nbsp; if(tid == 0) time[bid + BLOCK_NUM] = clock();<br>}</span></p>
<p align=left><span>blockIdx.x</span><span>, like </span><span>threadIdx.x</span><span>, is a built-in CUDA variable; it gives the current block number. Also note that the timing code now records a start time and an end time for every block.</span></p>
<p align=left><span>Modify the </span><span>main</span><span> function as follows:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int* gpudata, *result;<br>&nbsp;&nbsp;&nbsp; clock_t* time;<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;gpudata, sizeof(int) * DATA_SIZE);<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;result, sizeof(int) * THREAD_NUM * BLOCK_NUM);<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;time, sizeof(clock_t) * BLOCK_NUM * 2);<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);<br><br>&nbsp;&nbsp;&nbsp; sumOfSquares&lt;&lt;&lt;BLOCK_NUM, THREAD_NUM, 0&gt;&gt;&gt;(gpudata, result, time);<br><br>&nbsp;&nbsp;&nbsp; int sum[THREAD_NUM * BLOCK_NUM];<br>&nbsp;&nbsp;&nbsp; clock_t time_used[BLOCK_NUM * 2];<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(sum, result, sizeof(int) * THREAD_NUM * BLOCK_NUM, cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(&amp;time_used, time, sizeof(clock_t) * BLOCK_NUM * 2, cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaFree(gpudata);<br>&nbsp;&nbsp;&nbsp; cudaFree(result);<br>&nbsp;&nbsp;&nbsp; cudaFree(time);<br><br>&nbsp;&nbsp;&nbsp; int final_sum = 0;<br>&nbsp;&nbsp;&nbsp; for(int i = 0; i &lt; THREAD_NUM * BLOCK_NUM; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; final_sum += sum[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; clock_t min_start, max_end;<br>&nbsp;&nbsp;&nbsp; min_start = time_used[0];<br>&nbsp;&nbsp;&nbsp; max_end = time_used[BLOCK_NUM];<br>&nbsp;&nbsp;&nbsp; for(int i = 1; i &lt; BLOCK_NUM; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(min_start &gt; time_used[i])<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; min_start = time_used[i];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(max_end &lt; time_used[i + BLOCK_NUM])<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; max_end = time_used[i + BLOCK_NUM];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; printf("sum: %d&nbsp; time: %d\n", final_sum, max_end - min_start);</span></p>
<p align=left><span>Essentially we just enlarge <span>result</span> and change how time is measured: subtracting the earliest block start time from the latest block end time gives the total running time.</span></p>
<p align=left><span>This version takes much less time: only about 150K cycles on a GeForce 8800GT, equivalent to roughly 40GB/s of bandwidth. However, the CPU part now takes longer, since the CPU must add up 8192 numbers. To avoid this, we can have each block first sum up the results of its own threads.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Thread Synchronization</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>As mentioned, threads within a block can share memory and synchronize. We can exploit this to have all threads in each block add up their own results. Change the kernel to:</span></p>
<p align=left><span>__global__ static void sumOfSquares(int *num, int* result, clock_t* time)<br>{<br>&nbsp;&nbsp;&nbsp; extern __shared__ int shared[];<br>&nbsp;&nbsp;&nbsp; const int tid = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; const int bid = blockIdx.x;<br>&nbsp;&nbsp;&nbsp; int i;<br>&nbsp;&nbsp;&nbsp; if(tid == 0) time[bid] = clock();<br>&nbsp;&nbsp;&nbsp; shared[tid] = 0;<br>&nbsp;&nbsp;&nbsp; for(i = bid * THREAD_NUM + tid; i &lt; DATA_SIZE;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; i += BLOCK_NUM * THREAD_NUM) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; shared[tid] += num[i] * num[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; if(tid == 0) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; for(i = 1; i &lt; THREAD_NUM; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; shared[0] += shared[i];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; result[bid] = shared[0];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; if(tid == 0) time[bid + BLOCK_NUM] = clock();<br>}</span></p>
<p align=left><span>A variable declared with </span><span>__shared__</span><span> resides in shared memory, which every thread in a block shares. It uses on-chip GPU memory, so access is very fast and latency is not a concern.</span></p>
<p align=left><span>__syncthreads()</span><span> is a CUDA intrinsic stating that all threads in the block must reach this point before any of them may continue. In our example, since we are about to add up the results computed by all threads, we need to be sure that every thread has already written its result to <span>shared[tid]</span>.</span></p>
<p align=left><span>Next, change part of the </span><span>main</span><span> function to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int* gpudata, *result;<br>&nbsp;&nbsp;&nbsp; clock_t* time;<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;gpudata, sizeof(int) * DATA_SIZE);<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;result, sizeof(int) * BLOCK_NUM);<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;time, sizeof(clock_t) * BLOCK_NUM * 2);<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);<br><br>&nbsp;&nbsp;&nbsp; sumOfSquares&lt;&lt;&lt;BLOCK_NUM, THREAD_NUM, THREAD_NUM * sizeof(int)&gt;&gt;&gt;(gpudata, result, time);<br><br>&nbsp;&nbsp;&nbsp; int sum[BLOCK_NUM];<br>&nbsp;&nbsp;&nbsp; clock_t time_used[BLOCK_NUM * 2];<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(sum, result, sizeof(int) * BLOCK_NUM, cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(&amp;time_used, time, sizeof(clock_t) * BLOCK_NUM * 2, cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaFree(gpudata);<br>&nbsp;&nbsp;&nbsp; cudaFree(result);</span><span><br></span><span>&nbsp;&nbsp;&nbsp; cudaFree(time);</span><span><br></span><span><br>&nbsp;&nbsp;&nbsp; int final_sum = 0;<br>&nbsp;&nbsp;&nbsp; for(int i = 0; i &lt; BLOCK_NUM; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; final_sum += sum[i];<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>Note that the CPU now only needs to add up </span><span>BLOCK_NUM</span><span>, i.e. 32, numbers.</span></p>
<p align=left><span>However, because this version does some extra work on the GPU, it is a little less efficient: on a GeForce 8800GT it takes about 164K cycles.</span></p>
<p align=left><span>One reason the efficiency drops is that in this version the final summation inside each block is done by thread 0 alone, which is not the most efficient approach. In principle, adding 256 numbers together can itself be parallelized; the most common way is a tree-shaped addition:</span></p>
<p align=center><img height=200 alt="" src="http://www.cppblog.com/images/cppblog_com/hktk/CUDA_09-09-20_4_1.jpg" width=200 border=0></p>
<p align=left><span>Change the <span>kernel</span> to the following:</span></p>
<p align=left><span>__global__ static void sumOfSquares(int *num, int* result, clock_t* time)<br>{<br>&nbsp;&nbsp;&nbsp; extern __shared__ int shared[];<br>&nbsp;&nbsp;&nbsp; const int tid = threadIdx.x;<br>&nbsp;&nbsp;&nbsp; const int bid = blockIdx.x;<br>&nbsp;&nbsp;&nbsp; int i;<br>&nbsp;&nbsp;&nbsp; int offset = 1, mask = 1;<br>&nbsp;&nbsp;&nbsp; if(tid == 0) time[bid] = clock();<br>&nbsp;&nbsp;&nbsp; shared[tid] = 0;<br>&nbsp;&nbsp;&nbsp; for(i = bid * THREAD_NUM + tid; i &lt; DATA_SIZE;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; i += BLOCK_NUM * THREAD_NUM) {<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; shared[tid] += num[i] * num[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; while(offset &lt; THREAD_NUM) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if((tid &amp; mask) == 0) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; shared[tid] += shared[tid + offset];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; offset += offset;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; mask = offset + mask;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; if(tid == 0) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; result[bid] = shared[0];&nbsp; &nbsp; <br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; time[bid + BLOCK_NUM] = clock();<br>&nbsp;&nbsp;&nbsp; }<br>}</span></p>
<p align=left><span>The <span>while</span> loop near the end performs the tree-shaped addition. The </span><span>main</span><span> function needs no changes.</span></p>
<p align=left><span>On a GeForce 8800GT this version takes about 140K cycles (roughly 43GB/s), which is faster than the version that did no summation on the GPU at all! That is because the no-summation version writes a large amount of data to global memory (8192 numbers), which also hurts efficiency. So this version not only reduces the work the CPU has to do, it also runs faster on the GPU.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Further Improvements</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>The tree-shaped addition in the previous version is the straightforward way to write it, but when it runs on the GPU it suffers from shared memory bank conflicts (details will come up later when we discuss the GPU architecture). The following variant avoids the problem:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; offset = THREAD_NUM / 2;<br>&nbsp;&nbsp;&nbsp; while(offset &gt; 0) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(tid &lt; offset) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; shared[tid] += shared[tid + offset];<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; offset &gt;&gt;= 1;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; }</span></p>
<p align=left><span>This also eliminates the </span><span>mask</span><span> variable, so this version runs slightly faster again: on a GeForce 8800GT it takes about 137K cycles. Of course, the difference is quite small by now. To push efficiency a little further, the tree-shaped addition can be unrolled completely:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; if(tid &lt; 128) { shared[tid] += shared[tid + 128]; }<br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; if(tid &lt; 64) { shared[tid] += shared[tid + 64]; }<br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; if(tid &lt; 32) { shared[tid] += shared[tid + 32]; }<br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; if(tid &lt; 16) { shared[tid] += shared[tid + 16]; }<br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; if(tid &lt; 8) { shared[tid] += shared[tid + 8]; }<br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; if(tid &lt; 4) { shared[tid] += shared[tid + 4]; }<br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; if(tid &lt; 2) { shared[tid] += shared[tid + 2]; }<br>&nbsp;&nbsp;&nbsp; __syncthreads();<br>&nbsp;&nbsp;&nbsp; if(tid &lt; 1) { shared[tid] += shared[tid + 1]; }<br>&nbsp;&nbsp;&nbsp; __syncthreads();</span><span> </span></p>
<p align=left><span>Naturally, this only works when <span>THREAD_NUM</span> is 256. It saves roughly another 1000 cycles (about 44GB/s). The complete program file can be downloaded <span><a href="http://www.pcinlife.com/article_photo/hotball_cuda/first_cuda.cu"><span>here</span></a></span>.</span></p>
<p>&nbsp;</p>
<img src ="http://www.cppblog.com/hktk/aggbug/96757.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hktk/" target="_blank">海 阔 天 空</a> 2009-09-20 10:37 <a href="http://www.cppblog.com/hktk/archive/2009/09/20/96757.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>深入浅出谈CUDA-[第三章][第一个CUDA程序]</title><link>http://www.cppblog.com/hktk/archive/2009/09/20/96756.html</link><dc:creator>海 阔 天 空</dc:creator><author>海 阔 天 空</author><pubDate>Sun, 20 Sep 2009 02:34:00 GMT</pubDate><guid>http://www.cppblog.com/hktk/archive/2009/09/20/96756.html</guid><description><![CDATA[&nbsp;
<p align=left><strong><span>The First CUDA Program</span></strong><span> </span></p>
<p align=left><span>CUDA </span><span>currently offers two different APIs: the Runtime API and the Driver API, each with its own appropriate uses. Since the Runtime API is easier to use, we will start with it.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Initializing CUDA</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>First, create a file named first_cuda.cu. If you are using Visual Studio, set up the project as described <span><a href="http://www.pcinlife.com/article/graphics/2008-06-04/1212575164d532_1.html"><span>here</span></a></span> first.</span></p>
<p align=left><span>To use the Runtime API you need to include </span><span>cuda_runtime.h</span><span>, so add the following at the top of the program:</span></p>
<p align=left><span>#include &lt;stdio.h&gt;<br>#include &lt;cuda_runtime.h&gt;</span></p>
<p align=left><span>Next comes an </span><span>InitCUDA</span><span> function, which calls the Runtime API functions related to initializing CUDA:</span></p>
<p align=left><span>bool InitCUDA()<br>{<br>&nbsp;&nbsp;&nbsp; int count;<br><br>&nbsp;&nbsp;&nbsp; cudaGetDeviceCount(&amp;count);<br>&nbsp;&nbsp;&nbsp; if(count == 0) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; fprintf(stderr, "There is no device.\n");<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; return false;<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; int i;<br>&nbsp;&nbsp;&nbsp; for(i = 0; i &lt; count; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; cudaDeviceProp prop;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(cudaGetDeviceProperties(&amp;prop, i) == cudaSuccess) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; if(prop.major &gt;= 1) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; break;<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; if(i == count) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; fprintf(stderr, "There is no device supporting CUDA 1.x.\n");<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; return false;<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; cudaSetDevice(i);<br><br>&nbsp;&nbsp;&nbsp; return true;<br>}</span></p>
<p align=left><span>This function first calls </span><span>cudaGetDeviceCount</span><span> to get the number of CUDA-capable devices. If the system has no CUDA-capable device, it will still return 1, and device 0 will be an emulation device that does not support features beyond CUDA 1.0. So to confirm whether the system really has a CUDA-capable device, we call </span><span>cudaGetDeviceProperties</span><span> on each device to obtain its properties and check which CUDA version it supports (prop.major and prop.minor hold the supported version numbers; for version 1.0, prop.major is 1 and prop.minor is 0).</span></p>
<p align=left><span>cudaGetDeviceProperties </span><span>returns many other details besides the supported CUDA version: the device name, memory size, maximum number of threads, clock frequency of the execution units, and so on. See NVIDIA's CUDA Programming Guide for the full list.</span></p>
<p align=left><span>Once a device supporting CUDA 1.0 or later is found, we can call </span><span>cudaSetDevice</span><span> to make it the current device.</span></p>
<p align=left><span>Finally, the main function: it simply calls the InitCUDA function above and prints an appropriate message:</span></p>
<p align=left><span>int main()<br>{<br>&nbsp;&nbsp;&nbsp; if(!InitCUDA()) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; return 0;<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; printf("CUDA initialized.\n");<br><br>&nbsp;&nbsp;&nbsp; return 0;<br>}</span></p>
<p align=left><span>The program can now be compiled with nvcc. With Visual Studio, if you followed the earlier setup, you can simply Build Project and run.</span></p>
<p align=left><span>nvcc </span><span>is the CUDA compiler driver. It splits a .cu file into the parts that run on the GPU and the parts that run on the host, and invokes the appropriate tools to compile each. The GPU parts are compiled to an intermediate representation by NVIDIA's compiler, while the host parts are compiled with the system's C++ compiler (Visual C++ on Windows, gcc on Linux).</span></p>
<p align=left><span>When run, the compiled program should print the message "CUDA initialized." if the system has a CUDA-capable device, and an error message otherwise.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Doing Computation with CUDA</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left>&nbsp;</p>
<p align=left><span>So far our program does no useful work, so let's add a simple task: computing the sum of squares of a large set of numbers.</span></p>
<p align=left><span>First, change the include section at the top of the program to:</span></p>
<p align=left><span>#include &lt;stdio.h&gt;<br>#include &lt;stdlib.h&gt;<br>#include &lt;cuda_runtime.h&gt;<br><br>#define DATA_SIZE 1048576<br><br>int data[DATA_SIZE];</span></p>
<p align=left><span>and add a new function, </span><span>GenerateNumbers</span><span>:</span></p>
<p align=left><span>void GenerateNumbers(int *number, int size)<br>{<br>&nbsp;&nbsp;&nbsp; for(int i = 0; i &lt; size; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; number[i] = rand() % 10;<br>&nbsp;&nbsp;&nbsp; }<br>}</span></p>
<p align=left><span>This function fills an array with random numbers between 0 and 9.</span></p>
<p align=left><span>Before CUDA can do any computation, the data has to be copied into graphics memory so the graphics chip can use it. So we need to allocate a suitably sized block of graphics memory and copy the generated data into it. Add the following to the </span><span>main</span><span> function:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; GenerateNumbers(data, DATA_SIZE);<br><br>&nbsp;&nbsp;&nbsp; int* gpudata, *result;<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;gpudata, sizeof(int) * DATA_SIZE);<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;result, sizeof(int));<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);</span></p>
<p align=left><span>This code calls </span><span>GenerateNumbers</span><span> to produce the random numbers, calls </span><span>cudaMalloc</span><span> to allocate graphics memory (</span><span>result</span><span> will hold the computed result and is used later), and copies the numbers to the card with </span><span>cudaMemcpy</span><span>. cudaMalloc and cudaMemcpy work much like ordinary malloc and memcpy, except that cudaMemcpy takes an extra argument indicating the direction of the copy. Here we copy from main memory to graphics memory, so we use </span><span>cudaMemcpyHostToDevice</span><span>; to copy from graphics memory back to main memory, we would use </span><span>cudaMemcpyDeviceToHost</span><span>, which we will need later.</span></p>
<p align=left><span>Next we write the code that runs on the graphics chip. In CUDA, prefixing a function with </span><span>__global__</span><span> marks it as a function to be executed on the chip. Add the following function:</span></p>
<p align=left><span>__global__ static void sumOfSquares(int *num, int* result)<br>{<br>&nbsp;&nbsp;&nbsp; int sum = 0;<br>&nbsp;&nbsp;&nbsp; int i;<br>&nbsp;&nbsp;&nbsp; for(i = 0; i &lt; DATA_SIZE; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; sum += num[i] * num[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; *result = sum;<br>}</span></p>
<p align=left><span>Code that runs on the graphics chip is subject to certain restrictions; for example, it cannot have a return value. Other restrictions will come up later.</span></p>
<p align=left><span>Next we have CUDA execute this function. In CUDA, a function is launched with the following syntax:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; function_name&lt;&lt;&lt;number of blocks, number of threads, shared memory size&gt;&gt;&gt;(arguments...);</span></p>
<p align=left><span>After the call, the result has to be copied from the graphics chip back to main memory. Add the following to the </span><span>main</span><span> function:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; sumOfSquares&lt;&lt;&lt;1, 1, 0&gt;&gt;&gt;(gpudata, result);<br><br>&nbsp;&nbsp;&nbsp; int sum;<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(&amp;sum, result, sizeof(int), cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaFree(gpudata);<br>&nbsp;&nbsp;&nbsp; cudaFree(result);<br><br>&nbsp;&nbsp;&nbsp; printf("sum: %d\n", sum);</span></p>
<p align=left><span>Since this program uses only a single thread, both the number of blocks and the number of threads are 1. We also use no shared memory, so that argument is 0. Compile and run, and you should see the result.</span></p>
<p align=left><span>To make sure the result is correct, we can add a piece of CPU code to verify it:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; sum = 0;<br>&nbsp;&nbsp;&nbsp; for(int i = 0; i &lt; DATA_SIZE; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; sum += data[i] * data[i];<br>&nbsp;&nbsp;&nbsp; }<br>&nbsp;&nbsp;&nbsp; printf("sum (CPU): %d\n", sum);</span></p>
<p align=left><span>Compile, run, and confirm that the two results match.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Measuring Execution Time</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left>&nbsp;</p>
<p align=left><span>CUDA </span><span>provides a </span><span>clock</span><span> function that returns the current timestamp (in units of the clock of the GPU's execution units), which is well suited for measuring how long a piece of code takes to run, and is very useful for optimization. To record time in our program, change the </span><span>sumOfSquares</span><span> function to:</span></p>
<p align=left><span>__global__ static void sumOfSquares(int *num, int* result, clock_t* time)<br>{<br>&nbsp;&nbsp;&nbsp; int sum = 0;<br>&nbsp;&nbsp;&nbsp; int i;<br>&nbsp;&nbsp;&nbsp; clock_t start = clock();<br>&nbsp;&nbsp;&nbsp; for(i = 0; i &lt; DATA_SIZE; i++) {<br>&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; sum += num[i] * num[i];<br>&nbsp;&nbsp;&nbsp; }<br><br>&nbsp;&nbsp;&nbsp; *result = sum;<br>&nbsp;&nbsp;&nbsp; *time = clock() - start;<br>}</span></p>
<p align=left><span>and change the middle part of the </span><span>main</span><span> function to:</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; int* gpudata, *result;<br>&nbsp;&nbsp;&nbsp; clock_t* time;<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;gpudata, sizeof(int) * DATA_SIZE);<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;result, sizeof(int));<br>&nbsp;&nbsp;&nbsp; cudaMalloc((void**) &amp;time, sizeof(clock_t));<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);</span></p>
<p align=left><span>&nbsp;&nbsp;&nbsp; sumOfSquares&lt;&lt;&lt;1, 1, 0&gt;&gt;&gt;(gpudata, result, time);<br><br>&nbsp;&nbsp;&nbsp; int sum;<br>&nbsp;&nbsp;&nbsp; clock_t time_used;<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(&amp;sum, result, sizeof(int), cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaMemcpy(&amp;time_used, time, sizeof(clock_t), cudaMemcpyDeviceToHost);<br>&nbsp;&nbsp;&nbsp; cudaFree(gpudata);<br>&nbsp;&nbsp;&nbsp; cudaFree(result);<br>&nbsp;&nbsp;&nbsp; cudaFree(time);<br><br>&nbsp;&nbsp;&nbsp; printf("sum: %d time: %d\n", sum, time_used);</span></p>
<p align=left><span>Compile and run, and you can now see how many cycles the computation took.</span></p>
<p align=left><span>If you convert this to actual wall-clock time, you may notice that the performance is quite poor. That is because our program does not exploit CUDA's main strength: parallel execution. The next article discusses how to optimize it.</span></p>
<p>&nbsp;</p>
<img src ="http://www.cppblog.com/hktk/aggbug/96756.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hktk/" target="_blank">海 阔 天 空</a> 2009-09-20 10:34 <a href="http://www.cppblog.com/hktk/archive/2009/09/20/96756.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>深入浅出谈CUDA-[第二章][CUDA Toolkit的安装]</title><link>http://www.cppblog.com/hktk/archive/2009/09/20/96755.html</link><dc:creator>海 阔 天 空</dc:creator><author>海 阔 天 空</author><pubDate>Sun, 20 Sep 2009 02:32:00 GMT</pubDate><guid>http://www.cppblog.com/hktk/archive/2009/09/20/96755.html</guid><description><![CDATA[&nbsp;
<p align=left><strong><span>Installing the </span></strong><strong><span>CUDA Toolkit</span></strong><span> </span></p>
<p align=left><span>NVIDIA's CUDA Toolkit (downloadable <span><a href="http://www.nvidia.com/object/cuda_get.html"><span>here</span></a></span>) currently supports Windows (32-bit and 64-bit versions) and many different Linux distributions.<span> </span></span></p>
<p align=left><span>The CUDA Toolkit </span><span>has to be paired with a C/C++ compiler. On Windows, only Visual Studio 7.x and Visual Studio 8 (including the free Visual C++ 2005 Express) are currently supported; Visual Studio 6 and gcc are not supported on Windows. On Linux, only gcc is supported.</span></p>
<p align=left><span>This section briefly describes how to set up and use CUDA on Windows.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Download and Installation</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>On Windows, both the CUDA Toolkit and the CUDA SDK are distributed as installers. The CUDA Toolkit contains the basic CUDA tools, while the CUDA SDK contains many sample programs and libraries. Strictly speaking, only the CUDA Toolkit is needed to write CUDA programs, but the CUDA SDK is still worth installing, because many of its samples and libraries are quite useful.</span></p>
<p align=left><span>After installation, the CUDA Toolkit resides in C:\CUDA by default, with the following subdirectories:</span></p>
<ul type=disc>
    <li><span>bin -- tool programs and dynamic link libraries</span></li>
    <li><span>doc -- documentation</span></li>
    <li><span>include -- header files</span></li>
    <li><span>lib -- library files</span></li>
    <li><span>open64 -- the Open64-based CUDA compiler</span></li>
    <li><span>src -- some source code</span></li>
</ul>
<p align=left><span>The installer also sets a few environment variables, including:</span></p>
<ul type=disc>
    <li><span>CUDA_BIN_PATH</span><span> -- directory of the tool programs, defaults to C:\CUDA\bin</span></li>
    <li><span>CUDA_INC_PATH</span><span> -- directory of the header files, defaults to C:\CUDA\include</span></li>
    <li><span>CUDA_LIB_PATH</span><span> -- directory of the library files, defaults to C:\CUDA\lib</span></li>
</ul>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Using CUDA in Visual Studio</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>CUDA</span><span>'s main tool is nvcc, which runs whatever programs are needed to compile CUDA source code into an executable (or object file). Under Visual Studio, we set up a custom build tool so that Visual Studio invokes nvcc automatically.</span></p>
<p align=left><span>Using Visual Studio 2005 as an example:</span></p>
<ol type=1>
    <li><span>First, create a Win32 Console project (remember to check <strong><span>Empty project</span></strong> under Application Settings) and add a file, for example main.cu.<span> </span></span></li>
    <li><span>Right-click main.cu and choose <strong><span>Properties</span></strong>. Select <strong><span>General</span></strong> and make sure <strong><span>Tool</span></strong> is set to <strong><span>Custom Build Tool</span></strong>.<span> </span></span></li>
    <li><span>Select Custom Build Step and use the following Command Line settings:<span> </span></span></li>
    <ul type=circle>
        <li><strong><span>Release </span></strong><strong><span>mode</span></strong><span>: <span>"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"$(CUDA_INC_PATH)" -o $(ConfigurationName)\$(InputName).obj $(InputFileName)</span></span></li>
        <li><strong><span>Debug </span></strong><strong><span>mode</span></strong><span>: <span>"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"$(CUDA_INC_PATH)" -o $(ConfigurationName)\$(InputName).obj $(InputFileName)</span></span></li>
    </ul>
    <li><span>To use the software emulation mode, add two extra configurations:<span> </span></span></li>
    <ul type=circle>
        <li><strong><span>EmuRelease </span></strong><strong><span>mode</span></strong><span>: <span>"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -deviceemu -c -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"$(CUDA_INC_PATH)" -o $(ConfigurationName)\$(InputName).obj $(InputFileName) </span></span></li>
        <li><strong><span>EmuDebug </span></strong><strong><span>mode</span></strong><span>: <span>"$(CUDA_BIN_PATH)\nvcc.exe" -ccbin "$(VCInstallDir)bin" -deviceemu -c -D_DEBUG -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/Od,/Zi,/RTC1,/MTd -I"$(CUDA_INC_PATH)" -o $(ConfigurationName)\$(InputName).obj $(InputFileName)</span></span></li>
    </ul>
    <li><span>For every configuration, add <span>$(ConfigurationName)\$(InputName).obj</span> to <strong><span>Outputs</span></strong> under <strong><span>Custom Build Step</span></strong>.<span> </span></span></li>
    <li><span>Select the project, right-click and choose <strong><span>Properties</span></strong>, then click <strong><span>Linker</span></strong>. For every configuration, change the following settings:<span> </span></span></li>
    <ul type=circle>
        <li><span>General/Enable Incremental Linking</span><span>：<span>No </span></span></li>
        <li><span>General/Additional Library Directories</span><span>：<span>$(CUDA_LIB_PATH) </span></span></li>
        <li><span>Input/Additional Dependencies</span><span>：<span>cudart.lib</span></span></li>
    </ul>
</ol>
<p align=left><span>With that in place, you should be able to edit CUDA code directly in the Visual Studio IDE, then build and run the program.</span></p>
<p>&nbsp;</p>
<img src ="http://www.cppblog.com/hktk/aggbug/96755.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hktk/" target="_blank">海 阔 天 空</a> 2009-09-20 10:32 <a href="http://www.cppblog.com/hktk/archive/2009/09/20/96755.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item><item><title>深入浅出谈CUDA-[第一章][CUDA是什么]</title><link>http://www.cppblog.com/hktk/archive/2009/09/20/96754.html</link><dc:creator>海 阔 天 空</dc:creator><author>海 阔 天 空</author><pubDate>Sun, 20 Sep 2009 02:28:00 GMT</pubDate><guid>http://www.cppblog.com/hktk/archive/2009/09/20/96754.html</guid><description><![CDATA[&nbsp;
<p align=left><span>&#8220;CUDA </span><span>is NVIDIA's GPGPU model. It is based on the C language, so you can write programs that run on the graphics chip directly in the C that most people already know, without having to learn chip-specific instructions or special structures.<span>&#8221;</span></span></p>
<p align=left><strong><span>What Is CUDA? Is It Edible?</span></strong><span> </span></p>
<p align=left><em><span>Editor's note: after the launch of NVIDIA's GeForce 8800GTX, its general-purpose computing architecture CUDA has been promoted for over a year. Quite a few papers have now been published, and commercial software products have begun to appear in areas such as video encoding/decoding, finance, geological exploration, and scientific computing, so it is time to take a deeper look. To make CUDA easier to understand, we obtained Hotball's permission to publish this article, which he recently wrote himself. It is accessible yet thorough, and includes Hotball's first-hand experience writing some simple CUDA programs, making it an excellent introduction for readers who want to learn about CUDA. PCINLIFE published it without any cuts, only converting some Taiwanese terms into mainland ones and adding a few "editor's notes".</span></em></p>
<p align=left><span>Modern graphics chips are highly programmable. Because they typically have very high memory bandwidth and large numbers of execution units, the idea arose of using them to help with certain computational work, i.e. GPGPU. <span><a href="http://www.nvidia.com/object/cuda_home.html">CUDA</a> </span>is <span><a href="http://www.nvidia.com/">NVIDIA</a> </span>'s GPGPU model.<span> </span></span></p>
<p align=left><span>NVIDIA</span><span>'s newer graphics chips, including the GeForce 8 series and later, support CUDA. NVIDIA provides the CUDA development tools (Windows and Linux versions), sample programs, documentation, and so on free of charge, downloadable from the <span><a href="http://www.nvidia.com/object/cuda_home.html">CUDA Zone</a></span>.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Strengths and Weaknesses of the GPU</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>Compared with using the CPU, computing on the graphics chip has several main advantages:</span></p>
<ol type=1>
    <li><span>Graphics chips usually have much higher memory bandwidth. For example, NVIDIA's GeForce 8800GTX has over 50GB/s of memory bandwidth, while current high-end CPUs have around 10GB/s.<span> </span></span></li>
    <li><span>Graphics chips have far more execution units. The GeForce 8800GTX, for example, has 128 "stream processors" clocked at 1.35GHz. CPUs usually run at higher clocks but have far fewer execution units.<span> </span></span></li>
    <li><span>Compared with high-end CPUs, graphics cards are inexpensive. A GeForce 8800GT with 512MB of memory currently costs about the same as a 2.4GHz quad-core CPU.</span> </li>
</ol>
<p align=left><span>Of course, using the graphics chip also has some drawbacks:</span></p>
<ol type=1>
    <li><span>Because the graphics chip has so many execution units, it offers little benefit for work that cannot be highly parallelized.<span> </span></span></li>
    <li><span>Graphics chips currently support mostly 32-bit floating point only, and usually do not fully conform to IEEE 754, so some operations may be less precise. Many current chips also lack separate integer units, so integer arithmetic is relatively slow.<span> </span></span></li>
    <li><span>Graphics chips usually lack complex flow-control hardware such as branch prediction, so heavily branching code runs relatively poorly.<span> </span></span></li>
    <li><span>GPGPU programming models are still immature, and there is no accepted standard; NVIDIA and AMD/ATI, for example, each have their own.</span> </li>
</ol>
<p align=left><span>Overall, a graphics chip behaves like a stream processor, suited to doing a large amount of identical work at once, while the CPU is more flexible and can handle more varied work simultaneously.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>The CUDA Architecture</span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>CUDA </span><span>is NVIDIA's GPGPU model. It is based on C, so you can write programs that run on the graphics chip directly in the C language most people already know, without learning chip-specific instructions or special structures.</span></p>
<p align=left><span>Under the CUDA architecture, a program has two parts: the host side and the device side. The host side runs on the CPU; the device side runs on the graphics chip. The device-side program is also called a "kernel". Typically, the host program prepares the data, copies it to the graphics card's memory, has the graphics chip execute the device-side program, and finally copies the result back from the card's memory.</span></p>
<p align=center><img height=133 alt="" src="http://www.cppblog.com/images/cppblog_com/hktk/CUDA_09-09-20_1_1.jpg" width=200 border=0></p>
<p align=left><span>Because the CPU can access graphics memory only through the PCI Express interface, such transfers are slow (PCI Express x16 has a theoretical bandwidth of 4GB/s in each direction), so they should not be done too often, or efficiency suffers.</span></p>
<p align=left><span>In the CUDA architecture, the smallest unit of execution on the graphics chip is the <strong><span>thread</span></strong>. Several threads form a <strong><span>block</span></strong>; the threads within a block can access the same piece of shared memory and can synchronize with each other quickly.</span></p>
<p align=left><span>Each block can contain only a limited number of threads, but blocks executing the same program can form a <strong><span>grid</span></strong>. Threads in different blocks cannot access the same shared memory, so they cannot communicate directly or synchronize; the degree to which threads in different blocks can cooperate is therefore low. In return, this model frees the program from worrying about how many threads the chip can actually run simultaneously: a chip with very few execution units may simply run the blocks one after another rather than in parallel. Different grids can run different programs (kernels).</span></p>
<p align=left><span>The relationship among grid, block, and thread is shown below:</span></p>
<p align=center><img height=420 alt="" src="http://www.cppblog.com/images/cppblog_com/hktk/CUDA_09-09-20_1_2.jpg" width=229 border=0></p>
<p align=left><span>Each thread has its own registers and local memory. The threads within a block share one piece of shared memory. In addition, all threads (including those in different blocks) share one global memory, constant memory, and texture memory; different grids have their own global, constant, and texture memory. The differences among these memories are discussed later.</span></p>
<table cellSpacing=0 cellPadding=0 width="100%" border=0>
    <tbody>
        <tr>
            <td>
            <p align=left><strong><span>Execution Model </span></strong></p>
            </td>
        </tr>
    </tbody>
</table>
<p align=left><span>Because of its massively parallel nature, the graphics chip approaches certain problems differently from a typical CPU. The main points:</span></p>
<ol type=1>
    <li><span>Memory access latency: a CPU usually relies on caches to reduce accesses to main memory and keep memory latency from hurting performance. Graphics chips mostly have no cache (or only a very small one) and instead hide memory latency through parallel execution: while the first thread waits for a memory read, the second thread starts running, and so on.<span> </span></span></li>
    <li><span>Branch instructions: a CPU uses branch prediction and similar techniques to reduce the pipeline bubbles caused by branches. Graphics chips mostly handle branches the same way they handle memory latency, and are usually less efficient at them.</span> </li>
</ol>
<p align=left><span>Therefore, the problems best suited to CUDA are those that can be massively parallelized, since only then can memory latency be hidden effectively and the chip's many execution units be put to use. With CUDA it is normal to have thousands of threads running at the same time; a problem that cannot be parallelized at scale will not reach its best efficiency under CUDA.</span></p>
<p>&nbsp;</p>
<img src ="http://www.cppblog.com/hktk/aggbug/96754.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/hktk/" target="_blank">海 阔 天 空</a> 2009-09-20 10:28 <a href="http://www.cppblog.com/hktk/archive/2009/09/20/96754.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>