# CUDA Programming

## CUDA Programming is pretty cool

The last bit of my CSE 415 class is CUDA programming. CUDA programming is quite a bit of fun when you get to see the performance increase.

As I understand it, CUDA, the way our class was using it, is a very good replacement for a for-loop. Since there are hundreds or thousands of CUDA cores, you can massively parallelize a program.

This comes with downsides as well. In my testing, starting a CUDA kernel took 80-200 ms depending on the GPU. That is sometimes enough to rule it out altogether. I hope this was an anomaly or something specific to the HPCC I was using, because that's quite slow. After the 'warm up', it was lightning fast, speeding up my program anywhere from 32x to 400x over the single-thread time. I even got results in the 4,000x speedup range, although that seems to have been a mistake.
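
To see why the startup cost matters, here is a back-of-the-envelope estimate. The numbers are made up for illustration (a 1 s single-thread run, a 100 ms startup, and a kernel speedup of $s = 400$); they only fall within the ranges mentioned above and are not measurements:

$$\text{effective speedup} = \frac{T_{\text{cpu}}}{T_{\text{startup}} + T_{\text{cpu}}/s} = \frac{1000\,\text{ms}}{100\,\text{ms} + 1000\,\text{ms}/400} = \frac{1000}{102.5} \approx 9.8$$

Under those assumptions, a short job sees roughly 10x rather than 400x, while a long enough job gets close to the full kernel speedup.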

The problem we were working on was temperature diffusion: basically, simulating heat moving from one end of a metal rod to the other. I believe this was massively simplified for us, but it was cool anyway.

Here is the equation:

$$u_{t+1}[i]=\frac{u_t[i+1] + u_t[i-1]}{2}$$

Basically, each position's temperature at the next time step is the average of its two neighbors' current temperatures.
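
As a quick worked example with made-up values (a hot end at 100 and a room temperature of 20, which is an assumption for illustration), the cell right next to the hot end updates like this:

$$u_{t+1}[1]=\frac{u_t[2] + u_t[0]}{2}=\frac{20 + 100}{2}=60$$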

So, the setup is two arrays of length n, with all elements set to room temperature, except the first element of the first array, which is set to a higher temperature, like 100. Each element of the second array is then calculated as the average of its two neighboring elements in the first array.

The single-thread for-loop version looks like:

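    for (i = 0; i < time; i++) {
        /* Each interior cell becomes the average of its neighbors in b. */
        for (k = 1; k <= size; k++) {
            a[k] = (b[k-1] + b[k+1]) / 2.0;
        }
        /* Copy the last cell into the extra cell at the far end. */
        a[size+1] = a[size];
        /* Swap the buffers for the next time step. */
        tempPtr = a;
        a = b;
        b = tempPtr;
    }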

Array a is calculated by averaging two elements of array b (the inner loop). Then you swap the pointers and go the other way, and the whole thing is repeated for the number of time steps specified (outer loop).
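
For anyone who wants to try the single-thread version, here is a minimal self-contained sketch in plain C. The function name `diffuse_serial`, its parameters, and the allocation code are my own illustration rather than the class code, and it assumes the hot end is written into both buffers so that it survives the pointer swap:

```c
#include <assert.h>
#include <stdlib.h>

/* Runs `steps` diffusion steps on a rod with `size` interior cells.
 * Index 0 is the hot end; index size+1 mirrors the last interior cell.
 * Returns a malloc'd array of size+2 floats (caller frees), or NULL. */
float *diffuse_serial(long size, long steps, float hot, float room)
{
    assert(size >= 1 && steps >= 0);
    float *a = malloc((size_t)(size + 2) * sizeof *a);
    float *b = malloc((size_t)(size + 2) * sizeof *b);
    if (!a || !b) { free(a); free(b); return NULL; }

    for (long k = 0; k < size + 2; k++) {
        a[k] = room;
        b[k] = room;
    }
    /* Assumption: the hot end is fixed in BOTH buffers. */
    a[0] = hot;
    b[0] = hot;

    for (long t = 0; t < steps; t++) {
        for (long k = 1; k <= size; k++) {
            a[k] = (b[k - 1] + b[k + 1]) / 2.0f;
        }
        a[size + 1] = a[size];
        float *tmp = a;   /* swap buffers for the next step */
        a = b;
        b = tmp;
    }
    free(a);
    return b;  /* after the final swap, b holds the newest values */
}
```

Calling `diffuse_serial(4, 2, 100.0f, 20.0f)` runs two steps on a four-cell rod; the caller is responsible for freeing the returned array.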

Now for CUDA, I replaced the entire inner for loop with a kernel. Then I specified how I wanted to run that kernel:

Kernel:

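    __global__
    void cuda_diffusion_kernel(float *a, float *b, long int size)
    {
        // One thread per array index.
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        // Skip index 0 (the hot end) and any threads past the end.
        if (i <= size && i > 0) {
            a[i] = (b[i-1] + b[i+1]) / 2.0f;
        }
        // The thread for the last cell also fills the extra cell at the far end.
        if (i == size) {
            a[i+1] = a[i];
        }
    }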
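
One reason this works as a drop-in replacement for the inner loop is that every thread only reads from b and only writes its own slot of a, so the order the threads run in should not matter. Here is a rough plain-C stand-in to illustrate that idea. The names `diffusion_thread` and `emulate_launch_reversed` are hypothetical, and this is a CPU emulation, not how the GPU actually schedules threads:

```c
#include <assert.h>

/* Plain-C stand-in for one GPU thread: same body as the kernel, with the
 * thread index `i` passed in instead of computed from blockIdx/threadIdx. */
static void diffusion_thread(float *a, const float *b, long size, long i)
{
    if (i <= size && i > 0) {
        a[i] = (b[i - 1] + b[i + 1]) / 2.0f;
    }
    if (i == size) {
        a[i + 1] = a[i];
    }
}

/* Emulates a launch by running every "thread" in reverse order, to show
 * that the result does not depend on execution order. */
void emulate_launch_reversed(float *a, const float *b, long size,
                             long total_threads)
{
    assert(total_threads > size);  /* the thread with i == size must exist */
    for (long i = total_threads - 1; i >= 0; i--) {
        diffusion_thread(a, b, size, i);
    }
}
```

Running the "threads" backwards gives the same array as the forward for loop, which is the property the two-array setup is there to provide.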


The modified for loop:

    for (i = 0; i < time; i++) {
        cuda_diffusion_kernel<<<grid, block>>>(d_a, d_b, size);
        // Swap the device buffers for the next time step.
        tempPtr = d_a;
        d_a = d_b;
        d_b = tempPtr;
    }

The grid is the number of blocks you need to run, and a block is a group of threads. In this case, I used the maximum number of threads per block (512 is the limit on older CUDA GPUs; newer ones allow 1024):

    int threads_per_block = 512;
    dim3 grid(blocks);
    dim3 block(threads_per_block);
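
The snippet above uses `blocks` without showing where it comes from. A common way to get it is ceiling division, sketched here in plain C. The helper name `blocks_needed` is hypothetical, and the `size + 1` is my assumption based on the kernel needing a thread for every index from 0 through size:

```c
#include <assert.h>

/* Number of blocks needed so that blocks * threads_per_block covers
 * thread indices 0..size (the kernel only does work for 1..size). */
long blocks_needed(long size, int threads_per_block)
{
    assert(size >= 0 && threads_per_block > 0);
    long threads = size + 1;  /* indices 0..size */
    /* Ceiling division: round up so no index is left without a thread. */
    return (threads + threads_per_block - 1) / threads_per_block;
}
```

For example, a rod with size 1000 and 512 threads per block needs 2 blocks, which gives 1024 threads; the extra ones are filtered out by the bounds check in the kernel.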

That was pretty much all of the code required to make this simulation run 32x to 400x faster.

Pretty cool stuff