LOG-2011-04

//============================================================

DATE:2011-4-12

TIME:01:18

ICBC.pdf: finished

//============================================================

DATE:2011-4-15

TIME:00:00

Reading “NoSQL Databases”

Reasons for using NoSQL

1. Avoidance of Unneeded Complexity

2. High Throughput

3. Horizontal Scalability and Running on Commodity Hardware

4. Avoidance of Expensive Object-Relational Mapping

5. Complexity and Cost of Setting up Database Clusters

6. Compromising Reliability for Better Performance

7. The Current “One Size Fits All” Databases Thinking Was and Is Wrong

8. The Myth of Effortless Distribution and Partitioning of Centralized Data Models

9. Movements in Programming Languages and Development Frameworks

10. Requirements of Cloud Computing

11. The RDBMS plus Caching-Layer Pattern/Workaround vs. Systems Built from Scratch with Scalability in Mind

12. Yesterday’s vs. Today’s Needs

Nosqldbs.pdf, page 19

//============================================================

DATE:2011-4-16

TIME:00:24

Reading cudaArticle 05

A multiprocessor takes four clock cycles to issue one memory instruction for a "warp"

Accessing local or global memory incurs an additional 400 to 600 clock cycles of memory latency

-----------------------------------

CUDA Memory

Registers:

- The fastest form of memory on the multiprocessor.

- Only accessible by the thread.

- Has the lifetime of the thread.

Shared Memory:

- Can be as fast as a register when there are no bank conflicts or when all threads read from the same address.

- Accessible by any thread of the block in which it was created.

- Has the lifetime of the block.

Global memory:

- Potentially 150x slower than registers or shared memory; watch out for uncoalesced reads and writes, which will be discussed in the next column.

- Accessible from either the host or the device.

- Has the lifetime of the application.

Local memory:

- A potential performance gotcha: it resides in global memory and can be 150x slower than registers or shared memory.

- Only accessible by the thread.

- Has the lifetime of the thread.

// includes, system
#include <stdio.h>
#include <stdlib.h>   // malloc, free, exit, EXIT_FAILURE
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part 2 of 2: implement the fast kernel using shared memory
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    // size is set by the third launch parameter (sharedMemSize)
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written
    // their data to shared mem
    __syncthreads();

    // write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Compute number of blocks needed based on array size
    // and desired block size (dimA must be a multiple of it)
    int numBlocks = dimA / numThreadsPerBlock;

    // Part 1 of 2: Compute the number of bytes of shared memory needed
    // This is used in the kernel invocation below
    int sharedMemSize = numThreadsPerBlock * sizeof(int);

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i) {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );

    // block until the device has completed
    // (cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize)
    cudaDeviceSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++) {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");
    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err) {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg,
                cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

//============================================================

TIME:01:16

Finish reading cudaArticle 06

//============================================================

DATE:2011-4-23

TIME:09:31

Reading the Berkeley View of Cloud Computing

Page 10: classes of utility computing

//============================================================

DATE:2011-4-24

TIME:00:16

Reading Makefile.pdf

--------------------------------------------------------------

List the macros defined by default (Makefile)

Using: make -p

$@ the name of the target

$? the prerequisites that are newer than the target

$^ all prerequisites, whether or not they are newer than the target (duplicates removed)

$+ same as $^, but duplicate names are kept, in the order listed

$< the first prerequisite

--------------------------------------------------------------

Reading the Berkeley View of Cloud Computing

Page 19: Obstacle number 5, Performance Unpredictability

//============================================================

DATE:2011-4-25

TIME:01:40

Finish reading the Berkeley View of Cloud Computing

//============================================================

DATE:2011-4-28

TIME:21:22

Coding the motion project

Visual Studio 2005 reported a stack overflow error:

“Unhandled exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.”

--------------------------------------------------------------

'motion.exe': Unloaded 'C:\WINDOWS\WinSxS\x86_Microsoft.VC80.CRT_1fc8b3b9a1e18e3b_8.0.50727.4053_x-ww_e6967989\msvcr80.dll'

'motion.exe': Unloaded 'C:\WINDOWS\system32\psapi.dll'

'motion.exe': Unloaded 'C:\WINDOWS\system32\shimeng.dll'

First-chance exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.

Unhandled exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.

The program '[2388] motion.exe: Native' has exited with code 0 (0x0).

--------------------------------------------------------------

Cause: a huge object allocated on the stack

//============================================================

DATE:2011-4-30

TIME:01:40

Coding CSE332 project 2

Adding other data-counter implementations

posted on 2011-05-03 21:57 chaogu Views (649) Comments (0)
