﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++ Blog - Gotta Write A Code - Post Category - CUDA</title><link>http://www.cppblog.com/bennycen/category/17397.html</link><description /><language>zh-cn</language><lastBuildDate>Thu, 10 May 2012 19:04:15 GMT</lastBuildDate><pubDate>Thu, 10 May 2012 19:04:15 GMT</pubDate><ttl>60</ttl><item><title>A Filler Post: Matrix Multiplication with CUDA</title><link>http://www.cppblog.com/bennycen/archive/2011/07/26/151879.html</link><dc:creator>bennycen</dc:creator><author>bennycen</author><pubDate>Tue, 26 Jul 2011 09:01:00 GMT</pubDate><guid>http://www.cppblog.com/bennycen/archive/2011/07/26/151879.html</guid><wfw:comment>http://www.cppblog.com/bennycen/comments/151879.html</wfw:comment><comments>http://www.cppblog.com/bennycen/archive/2011/07/26/151879.html#Feedback</comments><slash:comments>1</slash:comments><wfw:commentRss>http://www.cppblog.com/bennycen/comments/commentRss/151879.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/bennycen/services/trackbacks/151879.html</trackback:ping><description><![CDATA[<div>I spent the last few days looking into CUDA and found that its parallel model differs somewhat from ordinary CPU multithreading, but it is still quite nice. The main idea is to split a task into blocks, split each block further into fine-grained threads, and let each thread do its own piece of work.<br />This model suits workloads such as matrix operations, where the per-element computations are only loosely coupled but the overall data set is large. That is just my first impression of CUDA.<br />The CPU version of matrix multiplication is as follows:</div><br />
<div style="border-bottom: #cccccc 1px solid; border-left: #cccccc 1px solid; padding-bottom: 4px; background-color: #eeeeee; padding-left: 4px; width: 98%; padding-right: 5px; font-size: 13px; border-top: #cccccc 1px solid; border-right: #cccccc 1px solid; padding-top: 4px"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #008000">//</span><span style="color: #008000">C&nbsp;=&nbsp;A*B</span><span style="color: #008000"><br /></span><span style="color: #0000ff">void</span><span style="color: #000000">&nbsp;MatrixMulCPU(</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_C,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">*</span><span style="color: #000000">_A,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">*</span><span style="color: #000000">_B,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_wa,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_ha,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_wb)<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;sum&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">for</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;i&nbsp;</span><span style="color: 
#000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;&nbsp;i&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">&nbsp;_ha;&nbsp;</span><span style="color: #000000">++</span><span style="color: #000000">i)<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">for</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;j&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;&nbsp;j&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">&nbsp;_wb;&nbsp;</span><span style="color: #000000">++</span><span style="color: #000000">j)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sum&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">for</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;k&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;&nbsp;k&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">&nbsp;_wa;&nbsp;</span><span style="color: #000000">++</span><span style="color: #000000">k)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sum&nbsp;</span><span style="color: 
#000000">+=</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">float</span><span style="color: #000000">)_A[i</span><span style="color: #000000">*</span><span style="color: #000000">_wa</span><span style="color: #000000">+</span><span style="color: #000000">k]</span><span style="color: #000000">*</span><span style="color: #000000">(</span><span style="color: #0000ff">float</span><span style="color: #000000">)_B[k</span><span style="color: #000000">*</span><span style="color: #000000">_wb</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;j];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;_C[i</span><span style="color: #000000">*</span><span style="color: #000000">_wb</span><span style="color: #000000">+</span><span style="color: #000000">j]&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">float</span><span style="color: #000000">)sum;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />}</span></div><br />
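<div>From the code above, C(i,j) = sum { A(i,k)*B(k,j) } for 0 &lt;= k &lt; _wa. The output elements are only loosely coupled, so we can partition the work and let each thread handle one region.<br />How should we partition it? The first idea is to have each thread compute one C(i,j), which by a rough estimate needs height_c*width_c threads, that is, ha*wb.<br />Going one step further, we cover the matrix with a grid of blocks. If each block is 16*16 threads, an 80*48 matrix can be written as 5(*16) * 3(*16), so the grid holds (height_c/16) * (width_c/16) = 5*3 blocks, and each block naturally holds 16*16 threads.<br />With the partitioning done, the kernel code is as follows:</div>Version 0:<br />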
<div style="border-bottom: #cccccc 1px solid; border-left: #cccccc 1px solid; padding-bottom: 4px; background-color: #eeeeee; padding-left: 4px; width: 98%; padding-right: 5px; font-size: 13px; border-top: #cccccc 1px solid; border-right: #cccccc 1px solid; padding-top: 4px"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #000000">__global__&nbsp;</span><span style="color: #0000ff">void</span><span style="color: #000000">&nbsp;matrix_kernel_0(</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_C,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_A,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">*</span><span style="color: #000000">_B,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_wa,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_wb)<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;sum&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000">//</span><span style="color: #008000">Row&nbsp;and&nbsp;column&nbsp;handled&nbsp;by&nbsp;this&nbsp;thread</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;row&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;blockIdx.y</span><span style="color: 
#000000">*</span><span style="color: #000000">blockDim.y&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;threadIdx.y;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;col&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;blockIdx.x</span><span style="color: #000000">*</span><span style="color: #000000">blockDim.x&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;threadIdx.x;<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000">//</span><span style="color: #008000">Thread(row,col)&nbsp;computes&nbsp;C(row,col)</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">for</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;i&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;&nbsp;i&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">&nbsp;_wa;&nbsp;</span><span style="color: #000000">++</span><span style="color: #000000">i)<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sum&nbsp;</span><span style="color: #000000">+=</span><span style="color: #000000">&nbsp;_A[row</span><span style="color: #000000">*</span><span style="color: #000000">_wa&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;i]</span><span style="color: #000000">*</span><span style="color: #000000">_B[i</span><span style="color: #000000">*</span><span style="color: #000000">_wb&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;col];<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;_C[row</span><span style="color: #000000">*</span><span 
style="color: #000000">_wb&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;col]&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;sum;<br />}</span></div><br />
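<div>A side note on the launch arithmetic: the 80*48 example divides evenly by 16, and matrix_kernel_0 has no bounds check, so it appears to rely on that. The host-side C++ sketch below shows one way the grid size and the per-thread index could be computed when the sizes do not divide evenly. The names Dim2, ceilDiv, gridFor, globalIndex and inBounds are hypothetical helpers made up for illustration; they are not part of the original code or of the CUDA API.</div><br />

```cpp
// Hypothetical helper type; it only mirrors the x/y fields of a 2D launch size.
struct Dim2 { int x; int y; };

// Number of blocks needed to cover n elements with blockSize threads each,
// rounded up so that a partial block at the edge is still launched.
int ceilDiv(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Grid size for a result matrix C that is wc elements wide and hc elements high.
Dim2 gridFor(int wc, int hc, int blockSize)
{
    Dim2 grid = { ceilDiv(wc, blockSize), ceilDiv(hc, blockSize) };
    return grid;
}

// Same arithmetic as blockIdx.x*blockDim.x + threadIdx.x in matrix_kernel_0.
int globalIndex(int blockIdx, int blockDim, int threadIdx)
{
    return blockIdx * blockDim + threadIdx;
}

// A thread owns a real element of C only if (row, col) lies inside the matrix.
// With a rounded-up grid, threads in edge blocks can fall outside and must do nothing.
// Assumes row and col are non-negative, as they are for CUDA thread indices.
bool inBounds(int row, int col, int hc, int wc)
{
    return !(row >= hc || col >= wc);
}
```

<div>For the 80*48 example, gridFor(48, 80, 16) yields a grid of 3 blocks in x and 5 in y, the same 5*3 split described earlier. For sizes that are not multiples of 16, the kernel would also need a guard equivalent to inBounds before it reads or writes.</div><br />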
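<div>Another approach: instead of having each thread compute a whole C(i,j) in one pass, note from C(i,j) = sum { A(i,k)*B(k,j) } that the sum itself can be split more finely:<br />Csub(i,j) = sum{A(i,ksub+offsetA)*B(ksub+offsetB,j)}&nbsp; 0&lt;=ksub &lt; blockSize<br />C(i,j) = sum{Csub(i,j)}<br />That is, the matrices are divided into n*n large sub-blocks, and each thread block multiplies a sub-block of A by the matching sub-block of B, one pair at a time, then adds the partial products together. The main optimization here is shared memory: each pair of sub-blocks is loaded into it once and then reused by every thread in the block.</div><br />Version 1:<br />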
<div style="border-bottom: #cccccc 1px solid; border-left: #cccccc 1px solid; padding-bottom: 4px; background-color: #eeeeee; padding-left: 4px; width: 98%; padding-right: 5px; font-size: 13px; word-break: break-all; border-top: #cccccc 1px solid; border-right: #cccccc 1px solid; padding-top: 4px"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #000000">__global__&nbsp;</span><span style="color: #0000ff">void</span><span style="color: #000000">&nbsp;matrix_kernel_1(</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_C,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_A,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">*</span><span style="color: #000000">_B,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_wa,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_wb)<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;bx&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;blockIdx.x;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;by&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;blockIdx.y;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;tx&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;threadIdx.x;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span 
style="color: #0000ff">int</span><span style="color: #000000">&nbsp;ty&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;threadIdx.y;<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000">//</span><span style="color: #008000">Part&nbsp;of&nbsp;A&nbsp;handled&nbsp;by&nbsp;this&nbsp;block</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;aBegin&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;_wa</span><span style="color: #000000">*</span><span style="color: #000000">(by</span><span style="color: #000000">*</span><span style="color: #000000">BLOCK_SIZE);</span><span style="color: #008000">//</span><span style="color: #008000">A(0,by)</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;aEnd&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;aBegin&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;_wa&nbsp;</span><span style="color: #000000">-</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">1</span><span style="color: #000000">;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;aStep&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;BLOCK_SIZE;</span><span style="color: #008000">//</span><span style="color: #008000">offsetA</span><span style="color: #008000"><br /></span><span style="color: #000000"><br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;bBegin&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;BLOCK_SIZE</span><span style="color: #000000">*</span><span style="color: 
#000000">bx;</span><span style="color: #008000">//</span><span style="color: #008000">B(bx,0)</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;bStep&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;BLOCK_SIZE</span><span style="color: #000000">*</span><span style="color: #000000">_wb;</span><span style="color: #008000">//</span><span style="color: #008000">offsetB</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;cSub&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">for</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;a&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;aBegin,b&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;bBegin;&nbsp;a&nbsp;</span><span style="color: #000000">&lt;=</span><span style="color: #000000">&nbsp;aEnd;&nbsp;a&nbsp;</span><span style="color: #000000">+=</span><span style="color: #000000">&nbsp;aStep,b&nbsp;</span><span style="color: #000000">+=</span><span style="color: #000000">&nbsp;bStep)<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;__shared__&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;As[BLOCK_SIZE][BLOCK_SIZE];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;__shared__&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;Bs[BLOCK_SIZE][BLOCK_SIZE];<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000">//</span><span style="color: #008000">Each&nbsp;thread&nbsp;copies&nbsp;one&nbsp;element&nbsp;of&nbsp;each&nbsp;sub-block</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As[ty][tx]&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;_A[a&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;_wa</span><span style="color: #000000">*</span><span style="color: #000000">ty&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;tx];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Bs[ty][tx]&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;_B[b&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;_wb</span><span style="color: #000000">*</span><span style="color: #000000">ty&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;tx];<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;__syncthreads();<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000">//</span><span style="color: #008000">Each&nbsp;thread&nbsp;accumulates&nbsp;its&nbsp;element&nbsp;of&nbsp;the&nbsp;product&nbsp;of&nbsp;sub-block&nbsp;i&nbsp;and&nbsp;sub-block&nbsp;j</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">for</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;k&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;&nbsp;k&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">&nbsp;BLOCK_SIZE;&nbsp;</span><span style="color: #000000">++</span><span style="color: #000000">k)<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cSub&nbsp;</span><span style="color: #000000">+=</span><span style="color: #000000">&nbsp;As[ty][k]</span><span style="color: #000000">*</span><span style="color: #000000">Bs[k][tx];<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;__syncthreads();<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000">//</span><span style="color: #008000">Global&nbsp;index:&nbsp;write&nbsp;the&nbsp;result&nbsp;back&nbsp;to&nbsp;global&nbsp;memory<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000">//</span><span style="color: #008000">One&nbsp;thread&nbsp;per&nbsp;element,&nbsp;one&nbsp;block&nbsp;per&nbsp;sub-block</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;cIndex&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;(by</span><span style="color: #000000">*</span><span style="color: #000000">BLOCK_SIZE&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;ty)</span><span style="color: #000000">*</span><span style="color: #000000">_wb&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;(bx</span><span style="color: #000000">*</span><span style="color: #000000">BLOCK_SIZE&nbsp;</span><span style="color: #000000">+</span><span style="color: #000000">&nbsp;tx);<br />&nbsp;&nbsp;&nbsp;&nbsp;_C[cIndex]&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;cSub;<br />}<br /></span></div><br /><br />
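<div>To make the decomposition C(i,j) = sum{Csub(i,j)} concrete, here is a minimal CPU sketch of the same idea in plain C++, without threads or shared memory. The function name matMulTiledCPU and its tile parameter are assumptions made for illustration and do not appear in the original code. Unlike matrix_kernel_1, which has no bounds checks and so appears to assume that the sizes are multiples of BLOCK_SIZE, this sketch clamps the last tile.</div><br />

```cpp
// CPU sketch of the sub-block decomposition used by matrix_kernel_1.
// matMulTiledCPU and tile are illustrative names, not part of the original post.
// All matrices are row-major: A is ha x wa, B is wa x wb, C is ha x wb.
void matMulTiledCPU(float* C, const float* A, const float* B,
                    int wa, int ha, int wb, int tile)
{
    // Start every C(row,col) at zero so partial products can be accumulated.
    for (int i = 0; i != ha * wb; ++i)
    {
        C[i] = 0.0f;
    }

    // Walk the shared dimension one tile at a time
    // (this plays the role of the loop over a and b in the kernel).
    int numTiles = (wa + tile - 1) / tile;
    for (int t = 0; t != numTiles; ++t)
    {
        int k0 = t * tile;
        // Clamp the last tile so sizes that are not multiples of tile still work.
        int kEnd = (k0 + tile > wa) ? wa : k0 + tile;

        for (int row = 0; row != ha; ++row)
        {
            for (int col = 0; col != wb; ++col)
            {
                // Csub(row,col): partial dot product over this tile only.
                float cSub = 0.0f;
                for (int k = k0; k != kEnd; ++k)
                {
                    cSub += A[row * wa + k] * B[k * wb + col];
                }
                // C(row,col) is the sum of Csub(row,col) over all tiles.
                C[row * wb + col] += cSub;
            }
        }
    }
}
```

<div>Adding up the partial products tile by tile gives the same result as the plain triple loop, apart from floating-point rounding order. That is why the kernel can work on one pair of sub-blocks at a time and keep only those two sub-blocks in shared memory.</div><br />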
<div>Finally, write a host-facing wrapper function:</div><br />
<div style="border-bottom: #cccccc 1px solid; border-left: #cccccc 1px solid; padding-bottom: 4px; background-color: #eeeeee; padding-left: 4px; width: 98%; padding-right: 5px; font-size: 13px; word-break: break-all; border-top: #cccccc 1px solid; border-right: #cccccc 1px solid; padding-top: 4px"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #0000ff">void</span><span style="color: #000000">&nbsp;matrixMulGPU(</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_C,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">*</span><span style="color: #000000">_A,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">*</span><span style="color: #000000">_B,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_wa,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_ha,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_wb)<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;d_a&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;myNewOnGPU</span><span style="color: #000000">&lt;</span><span style="color: #0000ff">float</span><span style="color: #000000">&gt;</span><span style="color: #000000">(_wa</span><span style="color: #000000">*</span><span style="color: #000000">_ha);<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span 
style="color: #000000">&nbsp;d_b&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;myNewOnGPU</span><span style="color: #000000">&lt;</span><span style="color: #0000ff">float</span><span style="color: #000000">&gt;</span><span style="color: #000000">(_wb</span><span style="color: #000000">*</span><span style="color: #000000">_wa);<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;d_c&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;myNewOnGPU</span><span style="color: #000000">&lt;</span><span style="color: #0000ff">float</span><span style="color: #000000">&gt;</span><span style="color: #000000">(_wb</span><span style="color: #000000">*</span><span style="color: #000000">_ha);<br />&nbsp;&nbsp;&nbsp;&nbsp;copyFromCPUToGPU(_A,d_a,_wa</span><span style="color: #000000">*</span><span style="color: #000000">_ha);<br />&nbsp;&nbsp;&nbsp;&nbsp;copyFromCPUToGPU(_B,d_b,_wb</span><span style="color: #000000">*</span><span style="color: #000000">_wa);<br />&nbsp;&nbsp;&nbsp;&nbsp;dim3&nbsp;threads(BLOCK_SIZE,BLOCK_SIZE);<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #008000">//</span><span style="color: #008000">Grid&nbsp;size&nbsp;from&nbsp;the&nbsp;arguments;&nbsp;assumes&nbsp;_wb&nbsp;and&nbsp;_ha&nbsp;are&nbsp;multiples&nbsp;of&nbsp;BLOCK_SIZE</span><span style="color: #008000"><br /></span><span style="color: #000000">&nbsp;&nbsp;&nbsp;&nbsp;dim3&nbsp;blocks(_wb</span><span style="color: #000000">/</span><span style="color: #000000">BLOCK_SIZE,_ha</span><span style="color: #000000">/</span><span style="color: #000000">BLOCK_SIZE);<br />&nbsp;&nbsp;&nbsp;&nbsp;matrix_kernel_0</span><span style="color: #000000">&lt;&lt;&lt;</span><span style="color: #000000">blocks,threads</span><span style="color: #000000">&gt;&gt;&gt;</span><span style="color: #000000">(d_c,d_a,d_b,_wa,_wb);<br />&nbsp;&nbsp;&nbsp;&nbsp;cudaDeviceSynchronize();<br />&nbsp;&nbsp;&nbsp;&nbsp;copyFromGPUToCPU(d_c,_C,_wb</span><span style="color: #000000">*</span><span style="color: #000000">_ha);<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;myDeleteOnGPU(d_a);<br />&nbsp;&nbsp;&nbsp;&nbsp;myDeleteOnGPU(d_b);<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;myDeleteOnGPU(d_c);<br />}</span></div><br /><br />The main function that calls it is as follows:<br />
<div style="border-bottom: #cccccc 1px solid; border-left: #cccccc 1px solid; padding-bottom: 4px; background-color: #eeeeee; padding-left: 4px; width: 98%; padding-right: 5px; font-size: 13px; word-break: break-all; border-top: #cccccc 1px solid; border-right: #cccccc 1px solid; padding-top: 4px"><!--<br /><br />Code highlighting produced by Actipro CodeHighlighter (freeware)<br />http://www.CodeHighlighter.com/<br /><br />--><span style="color: #000000">#include&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">stdio.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">cuda_runtime.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">cutil.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">cutil_inline.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">stdlib.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">time.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">math.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #0000ff">string</span><span style="color: #000000">.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: 
#000000">&lt;</span><span style="color: #000000">Windows.h</span><span style="color: #000000">&gt;</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">"</span><span style="color: #000000">CUDACommon.h</span><span style="color: #000000">"</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">"</span><span style="color: #000000">MatrixMulCPU.h</span><span style="color: #000000">"</span><span style="color: #000000"><br />#include&nbsp;</span><span style="color: #000000">"</span><span style="color: #000000">MatrixMulGPU.h</span><span style="color: #000000">"</span><span style="color: #000000"><br /><br /></span><span style="color: #0000ff">void</span><span style="color: #000000">&nbsp;randomInit(</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_data,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_size)<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">for</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;i&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;&nbsp;i&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">&nbsp;_size;&nbsp;</span><span style="color: #000000">++</span><span style="color: #000000">i)<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;_data[i]&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;rand()</span><span style="color: #000000">/</span><span style="color: #000000">(</span><span style="color: #0000ff">float</span><span style="color: #000000">)RAND_MAX;<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />}<br /><br /></span><span style="color: #0000ff">bool</span><span style="color: 
#000000">&nbsp;checkError(</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_A,</span><span style="color: #0000ff">const</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;_B,</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;_size)<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">for</span><span style="color: #000000">&nbsp;(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;i&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">&nbsp;;&nbsp;i&nbsp;</span><span style="color: #000000">&lt;</span><span style="color: #000000">&nbsp;_size;&nbsp;</span><span style="color: #000000">++</span><span style="color: #000000">i)<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">if</span><span style="color: #000000">&nbsp;(fabs(_A[i]&nbsp;</span><span style="color: #000000">-</span><span style="color: #000000">&nbsp;_B[i])&nbsp;</span><span style="color: #000000">&gt;</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">1.0e-3</span><span style="color: #000000">)<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #000000">"</span><span style="color: #000000">%f&nbsp;\t&nbsp;%f\n</span><span style="color: #000000">"</span><span style="color: #000000">,_A[i],_B[i]);<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">return</span><span style="color: 
#000000">&nbsp;</span><span style="color: #0000ff">false</span><span style="color: #000000">;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">return</span><span style="color: #000000">&nbsp;</span><span style="color: #0000ff">true</span><span style="color: #000000">;<br />}<br /><br /></span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;main(</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;argc,&nbsp;</span><span style="color: #0000ff">char</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;argv[])<br />{<br />&nbsp;&nbsp;&nbsp;&nbsp;srand(</span><span style="color: #000000">13</span><span style="color: #000000">);<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">if</span><span style="color: #000000">(</span><span style="color: #000000">!</span><span style="color: #000000">InitCUDA())&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">return</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;A&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;myNewOnCPU</span><span style="color: #000000">&lt;</span><span style="color: #0000ff">float</span><span style="color: #000000">&gt;</span><span style="color: #000000">(WA</span><span style="color: #000000">*</span><span style="color: #000000">HA);<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;B&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;myNewOnCPU</span><span 
style="color: #000000">&lt;</span><span style="color: #0000ff">float</span><span style="color: #000000">&gt;</span><span style="color: #000000">(WB</span><span style="color: #000000">*</span><span style="color: #000000">HB);<br />&nbsp;&nbsp;&nbsp;&nbsp;randomInit(A,WA</span><span style="color: #000000">*</span><span style="color: #000000">HA);<br />&nbsp;&nbsp;&nbsp;&nbsp;randomInit(B,WB</span><span style="color: #000000">*</span><span style="color: #000000">HB);<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;C&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;myNewOnCPU</span><span style="color: #000000">&lt;</span><span style="color: #0000ff">float</span><span style="color: #000000">&gt;</span><span style="color: #000000">(WC</span><span style="color: #000000">*</span><span style="color: #000000">HC);<br />&nbsp;&nbsp;&nbsp;&nbsp;memset(C,</span><span style="color: #000000">0</span><span style="color: #000000">,</span><span style="color: #0000ff">sizeof</span><span style="color: #000000">(</span><span style="color: #0000ff">float</span><span style="color: #000000">)</span><span style="color: #000000">*</span><span style="color: #000000">WC</span><span style="color: #000000">*</span><span style="color: #000000">HC);<br />&nbsp;&nbsp;&nbsp;&nbsp;<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">float</span><span style="color: #000000">*</span><span style="color: #000000">&nbsp;C2&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;myNewOnCPU</span><span style="color: #000000">&lt;</span><span style="color: #0000ff">float</span><span style="color: #000000">&gt;</span><span style="color: #000000">(WC</span><span style="color: #000000">*</span><span style="color: #000000">HC);<br />&nbsp;&nbsp;&nbsp;&nbsp;memset(C2,</span><span style="color: #000000">0</span><span style="color: 
#000000">,</span><span style="color: #0000ff">sizeof</span><span style="color: #000000">(</span><span style="color: #0000ff">float</span><span style="color: #000000">)</span><span style="color: #000000">*</span><span style="color: #000000">WC</span><span style="color: #000000">*</span><span style="color: #000000">HC);<br />&nbsp;&nbsp;&nbsp;&nbsp;<br />&nbsp;&nbsp;&nbsp;&nbsp;unsigned&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;tick1&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;GetTickCount();<br />&nbsp;&nbsp;&nbsp;&nbsp;MatrixMulCPU(C2,A,B,WA,HA,WB);<br />&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #000000">"</span><span style="color: #000000">CPU&nbsp;use&nbsp;Time&nbsp;:&nbsp;%dms\n</span><span style="color: #000000">"</span><span style="color: #000000">,GetTickCount()&nbsp;</span><span style="color: #000000">-</span><span style="color: #000000">&nbsp;tick1);<br />&nbsp;&nbsp;&nbsp;&nbsp;unsigned&nbsp;</span><span style="color: #0000ff">int</span><span style="color: #000000">&nbsp;timer&nbsp;</span><span style="color: #000000">=</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;<br />&nbsp;&nbsp;&nbsp;&nbsp;cutilCheckError(cutCreateTimer(</span><span style="color: #000000">&amp;</span><span style="color: #000000">timer));<br />&nbsp;&nbsp;&nbsp;&nbsp;cutilCheckError(cutStartTimer(timer));<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;matrixMulGPU(C,A,B,WA,HA,WB);<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;cutilCheckError(cutStopTimer(timer));<br />&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #000000">"</span><span style="color: #000000">GPU&nbsp;use&nbsp;time:&nbsp;%f&nbsp;(ms)&nbsp;\n</span><span style="color: #000000">"</span><span style="color: #000000">,&nbsp;cutGetTimerValue(timer));<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;cutilCheckError(cutDeleteTimer(timer));<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">if</span><span style="color: #000000">&nbsp;(checkError(C,C2,WC</span><span style="color: #000000">*</span><span style="color: #000000">HC))<br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #000000">"</span><span style="color: #000000">Accept\n</span><span style="color: #000000">"</span><span style="color: #000000">);<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">else</span><span style="color: #000000"><br />&nbsp;&nbsp;&nbsp;&nbsp;{<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;printf(</span><span style="color: #000000">"</span><span style="color: #000000">Wrong&nbsp;Answer\n</span><span style="color: #000000">"</span><span style="color: #000000">);<br />&nbsp;&nbsp;&nbsp;&nbsp;}<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;myDeleteOnCPU(A);<br />&nbsp;&nbsp;&nbsp;&nbsp;myDeleteOnCPU(B);<br />&nbsp;&nbsp;&nbsp;&nbsp;myDeleteOnCPU(C);<br />&nbsp;&nbsp;&nbsp;&nbsp;myDeleteOnCPU(C2);<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;</span><span style="color: #0000ff">return</span><span style="color: #000000">&nbsp;</span><span style="color: #000000">0</span><span style="color: #000000">;<br />}<br /></span></div><br />The results are as follows:<br />Version 0:<br /><br /><br /><br />Version 1:<br />
<div></div><img border="0" alt="" src="http://www.cppblog.com/images/cppblog_com/bennycen/2.jpg" width="673" height="440" /><br /><br />As the results show, the GPU implementation runs much faster than the CPU one, and version 1 outperforms version 0.<br /><br />Download the full project: <a href="/Files/bennycen/CUDAMatrixMul.rar">/Files/bennycen/CUDAMatrixMul.rar</a><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/bennycen/" target="_blank">bennycen</a> 2011-07-26 17:01 <a href="http://www.cppblog.com/bennycen/archive/2011/07/26/151879.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>