C++ Coder

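HPC (high-performance computing) architecture, implementation, compiler instruction optimization, algorithm optimization: LLVM, Clang, OpenCL, CUDA, OpenACC, C++ AMP, OpenMP, MPI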

http://devgurus.amd.com/thread/158866

Low ALUBusy and low FetchUnitBusy

This question has not been answered.

NURBS (Newbie)
NURBS, 2012-3-19 1:35 PM

Hi,

When my kernel performs badly, the APP Profiler reports very low ALUBusy and FetchUnitBusy (both under 10%).

What could the bottleneck be here? Could it be the high number of code paths?

Thanks

NURBS

Helpful answer by pesh
  • 140 views
  • Helpful answer: Re: Low ALUBusy and low FetchUnitBusy
    pesh (Newbie)
    pesh, 2012-3-26 7:07 AM (in reply to NURBS)

    Hi, NURBS!

    Can you provide information about your device? If it's an AMD APU, note that previous versions of APP Profiler had problems with performance counters.

    Also check the ALUPacking counter. If its value is low, your code is VLIW-limited and ALUBusy will be poor. In that case, try to reduce data dependencies between sequential operations; this lets the compiler pack ALU instructions into VLIW bundles more densely and make better use of the ALU resources. Try to reduce control-flow statements as well, since they affect the counters too. In your situation, perhaps you have if-statements where one branch does a fetch operation and the other does some computation? That causes part of the wavefront to do the fetch first, and only afterwards does the remainder of the wavefront do its ALU operations, so you use only part of the resources at any one time.
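The two suggestions above (fewer dependencies between consecutive operations, and less divergent control flow) can be illustrated with a rough sketch. This is plain host C++ standing in for per-work-item kernel code, not the original poster's kernel; the function names `sum_chained`, `sum_independent`, `item_branchy` and `item_hoisted` are hypothetical, and whether a given compiler actually packs or predicates the code this way is an assumption that has to be checked in the profiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One long dependency chain: each add needs the previous result, so there is
// little independent work for a VLIW compiler to pack next to it.
float sum_chained(const std::vector<float>& v) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) acc += v[i];
    return acc;
}

// Four independent accumulators: the four adds in one iteration do not depend
// on each other, which makes them candidates for packing into one bundle.
float sum_independent(const std::vector<float>& v) {
    assert(v.size() % 4 == 0);  // sketch only: no remainder loop
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (std::size_t i = 0; i < v.size(); i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    return (a0 + a1) + (a2 + a3);
}

// Branchy form: some work-items fetch, others compute, so a wavefront whose
// work-items disagree on use_table has to go through both paths in turn.
float item_branchy(const float* table, int i, bool use_table) {
    if (use_table) {
        return table[i];          // fetch path
    } else {
        return i * 0.5f + 1.0f;   // ALU path
    }
}

// Hoisted form: every work-item does the fetch and the arithmetic, then
// selects a result. Assumes table[i] is valid for every work-item.
float item_hoisted(const float* table, int i, bool use_table) {
    float fetched = table[i];
    float computed = i * 0.5f + 1.0f;
    return use_table ? fetched : computed;
}
```

The hoisted form trades an extra fetch per work-item for uniform control flow, so it is only worth trying when the branch really does diverge within a wavefront; the profiler counters mentioned above are the way to tell whether it helped.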

    • Re: Low ALUBusy and low FetchUnitBusy
      NURBS (Newbie)
      NURBS, 2012-3-26 7:57 AM (in reply to pesh)

      I have dual Radeon 6950s with either the 12.3 or the new beta driver. It seems control flow was the issue; things are much better now. Is there an equation I can use to sum the counter values to 100%, so that I can be more certain I am not getting bogus numbers?

      • Re: Low ALUBusy and low FetchUnitBusy
        pesh (Newbie)
        pesh, 2012-3-26 8:46 AM (in reply to NURBS)

        I guess not; there is no such equation. The main reason is that when a wavefront executing on a compute unit issues a fetch instruction, it goes to the fetch unit and sits there until the fetch is done. Meanwhile, other wavefronts are doing calculations, or waiting until the fetch unit becomes free so they can execute their next fetch instructions. So while some wavefronts are reading or writing memory, others can do computations; in the best case both counters can reach 100%, and the ALUFetchRatio counter will equal 1. Other important counters are FetchUnitStalled and WriteUnitStalled; try to keep them close to 0. If they are too high, many wavefronts are waiting for the fetch unit to do memory reads/writes. To improve performance, first try to use a sequential memory access pattern, then try to use local memory if your algorithm reuses data several times within a workgroup.
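To make the advice about sequential memory access more concrete, here is a small host-side C++ model, not a measurement. It counts how many distinct memory lines one wavefront touches when each work-item reads a single element at index `lane * stride`. The helper `lines_touched` is hypothetical, and the figures used below (64 work-items per wavefront, 4-byte elements, 64-byte lines) are assumptions for illustration rather than values taken from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Counts the distinct memory lines touched when `lanes` work-items each read
// one element of `elem_bytes` bytes at element index lane * stride.
// Assumes elements are aligned so that none straddles two lines.
std::size_t lines_touched(std::size_t lanes, std::size_t stride,
                          std::size_t elem_bytes, std::size_t line_bytes) {
    assert(elem_bytes > 0 && line_bytes > 0 && elem_bytes <= line_bytes);
    std::set<std::size_t> lines;
    for (std::size_t lane = 0; lane < lanes; ++lane) {
        std::size_t byte_addr = lane * stride * elem_bytes;
        lines.insert(byte_addr / line_bytes);  // line index of this read
    }
    return lines.size();
}
```

Under these assumptions, `lines_touched(64, 1, 4, 64)` gives 4 lines for a sequential pattern, while `lines_touched(64, 16, 4, 64)` gives 64, one line per work-item. The more lines a wavefront needs for the same amount of useful data, the longer it is likely to occupy the fetch unit, which is the situation the FetchUnitStalled counter is meant to expose.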

Posted on 2013-01-09 16:26 by jackdong | Views (347) | Comments (0) | Category: OpenCL
