//============================================================
//============================================================
DATE:2011-4-12
TIME:01:18
Finished ICBC.pdf
//============================================================
//============================================================
DATE:2011-4-15
TIME:00:00
Reading the “NoSQL Database”
   Reasons for using NoSQL
1. Avoidance of Unneeded Complexity
2. High Throughput
3. Horizontal Scalability and Running on Commodity Hardware
4. Avoidance of Expensive Object-Relational Mapping
5. Complexity and Cost of Setting up Database Clusters
6. Compromising Reliability for Better Performance
7. The Current “One size fits all” Databases Thinking Was and Is Wrong
8. The Myth of Effortless Distribution and Partitioning of Centralized Data Models
9. Movements in Programming Languages and Development Frameworks
10. Requirements of Cloud Computing
11. The RDBMS plus Caching-Layer Pattern/Workaround vs. Systems Built from Scratch with Scalability in Mind
12. Yesterday’s vs. Today’s Needs
Nosqldbs.pdf -- page 19
 
//============================================================
//============================================================
DATE:2011-4-16
TIME:00:24
Reading cudaArticle 05
A multiprocessor takes four clock cycles to issue one memory instruction for a "warp"
Accessing local or global memory incurs an additional 400 to 600 clock cycles of memory latency
-----------------------------------
CUDA Memory
Registers:
- The fastest form of memory on the multiprocessor.
- Only accessible by the thread.
- Has the lifetime of the thread.
Shared memory:
- Can be as fast as a register when there are no bank conflicts or when reading from the same address.
- Accessible by any thread of the block from which it was created.
- Has the lifetime of the block.
Global memory:
- Potentially 150x slower than register or shared memory. Watch out for uncoalesced reads and writes, which will be discussed in the next column.
- Accessible from either the host or device.
- Has the lifetime of the application.
Local memory:
- A potential performance gotcha: it resides in global memory and can be 150x slower than register or shared memory.
- Only accessible by the thread.
- Has the lifetime of the thread.
 
// includes, system
#include <stdio.h>
#include <stdlib.h>  // malloc, free, exit, EXIT_FAILURE
#include <assert.h>
 
// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);
 
// Part 2 of 2: implement the fast kernel using shared memory
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];
 
    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;
 
    // Load one element per thread from device memory and store it 
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
 
    // Block until all threads in the block have written
    // their data to shared mem
    __syncthreads();
 
    // write the data from shared memory in forward order, 
    // but to the reversed block offset as before
 
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
 
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
 
////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////
int main( int argc, char** argv) 
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)
 
    // pointer for device memory
    int *d_b, *d_a;
 
    // define grid and block size
    int numThreadsPerBlock = 256;
 
    // Compute number of blocks needed based on array size
    // and desired block size
    int numBlocks = dimA / numThreadsPerBlock; 
 
    // Part 1 of 2: Compute the number of bytes of shared memory needed
    // This is used in the kernel invocation below
    int sharedMemSize = numThreadsPerBlock * sizeof(int);
 
    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );
 
    // Initialize input array on host
    for (int i = 0; i < dimA; ++i) {
        h_a[i] = i;
    }
 
    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );
 
    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );

    // block until the device has completed
    // (cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize)
    cudaDeviceSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");
 
    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );
 
    // Check for any CUDA errors
    checkCUDAError("memcpy");
 
    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++){
        assert(h_a[i] == dimA - 1 - i );
    }
 
    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);
 
    // free host memory
    free(h_a);
 
    // If the program makes it this far,
    // then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");
 
    return 0;
}
 
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err) 
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, 
                          cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }                         
}
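Note to self: the index arithmetic in reverseArrayBlock can be checked on the CPU without a GPU. The sketch below is only a host-side stand-in, not the article's code: reverseArrayBlockHost is a made-up name, the outer loop plays the role of blockIdx.x, the inner loops play the role of threadIdx.x, and a plain vector stands in for the shared-memory tile. It assumes the array length is exactly gridDim * blockDim, as in the listing above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for the reverseArrayBlock kernel.
// Assumes in.size() == gridDim * blockDim (no partial last block).
std::vector<int> reverseArrayBlockHost(const std::vector<int>& in,
                                       std::size_t gridDim,
                                       std::size_t blockDim)
{
    assert(in.size() == gridDim * blockDim);

    std::vector<int> out(in.size());
    std::vector<int> s_data(blockDim);  // stands in for the shared-memory tile

    for (std::size_t block = 0; block < gridDim; ++block) {
        // Step 1: each "thread" stores its element reversed within the tile.
        for (std::size_t t = 0; t < blockDim; ++t) {
            s_data[blockDim - 1 - t] = in[blockDim * block + t];
        }

        // On the GPU, __syncthreads() separates the two steps.

        // Step 2: write the tile in forward order to the mirrored block offset.
        const std::size_t outOffset = blockDim * (gridDim - 1 - block);
        for (std::size_t t = 0; t < blockDim; ++t) {
            out[outOffset + t] = s_data[t];
        }
    }
    return out;
}
```

Reversing inside each tile and then mirroring the block offset gives a full reversal of the array, while both the read from d_in and the write to d_out stay in forward (thread-ordered) addresses.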
 
 
//============================================================
TIME:01:16
Finished reading cudaArticle 06
 
//============================================================
DATE:2011-4-23
TIME:09:31
Reading Berkeley view on cloud computing
   Page 10: classes of utility computing
 
//============================================================
DATE:2011-4-24
TIME:00:16
Reading Makefile.pdf
 
--------------------------------------------------------------
List the macros defined by default (Makefile)
   Use: make -p
$@ name of the target
$? list of dependencies more recent than the target
$^ all dependencies, whether or not they are more recent than the target (duplicates removed)
$+ same as above, but keeps duplicate names
$< the first dependency
 
--------------------------------------------------------------
Reading Berkeley view on cloud computing
   Page 19, Obstacle 5: Performance Unpredictability
 
//============================================================
//============================================================
DATE:2011-4-25
TIME:01:40
Finished reading Berkeley view on cloud computing
 
//============================================================
//============================================================
DATE:2011-4-28
TIME:21:22
Coding the motion project 
Visual Studio 2005 reports a stack overflow error:
“Unhandled exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.”
 
--------------------------------------------------------------
'motion.exe': Unloaded 'C:\WINDOWS\WinSxS\x86_Microsoft.VC80.CRT_1fc8b3b9a1e18e3b_8.0.50727.4053_x-ww_e6967989\msvcr80.dll'
'motion.exe': Unloaded 'C:\WINDOWS\system32\psapi.dll'
'motion.exe': Unloaded 'C:\WINDOWS\system32\shimeng.dll'
First-chance exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.
Unhandled exception at 0x00439a57 in motion.exe: 0xC00000FD: Stack overflow.
The program '[2388] motion.exe: Native' has exited with code 0 (0x0).
--------------------------------------------------------------
Problem: using a huge object (likely a local variable too large for the stack)
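Possible fix, as a rough sketch: move the huge object off the stack and onto the heap. HugeObject and makeHugeObject below are made-up names, not types from the motion project, and the 16 MB size is only an assumed example. The default thread stack of a Windows executable is 1 MB, so a local variable of this size would overflow it.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical large object: 4M ints is about 16 MB, far beyond the
// default 1 MB stack.
struct HugeObject {
    static constexpr std::size_t kCount = 4 * 1024 * 1024;
    int data[kCount];
};

// Declaring `HugeObject obj;` as a local variable would place all of it on
// the stack. Allocating it on the heap leaves only a pointer on the stack.
std::unique_ptr<HugeObject> makeHugeObject()
{
    // Value-initialization (the trailing "()") zero-fills the array.
    return std::unique_ptr<HugeObject>(new HugeObject());
}
```

std::unique_ptr needs a C++11 compiler; on an older toolchain a plain new/delete pair or a std::vector member would do the same job. Raising the stack size with the linker's /STACK option is another route, but it only postpones the problem if the object keeps growing.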
 
//============================================================
//============================================================
DATE:2011-4-30
TIME:01:40
Coding CSE332 project 2
   Adding other data-counter implementations