CUDA Programming is pretty cool
The last bit of my CSE 415 class is CUDA programming. CUDA programming is quite a bit of fun when you get to see the performance increase.
As I understand it, CUDA, the way our class was using it, is a very good replacement for a for-loop whose iterations don't depend on each other. Since a GPU has hundreds or thousands of CUDA cores, each iteration can run as its own thread, so you can massively parallelize a program.
This comes with downsides as well. In my testing, starting a CUDA kernel for the first time took 80-200 ms depending on the GPU. That is sometimes enough to rule out using it altogether. I hope this was an anomaly or something specific to the HPCC I was using, because that's quite slow. After the 'warm up', it was lightning fast, speeding up my program anywhere from 32x to 400x over the single-thread time. I even got results in the 4,000x speedup range, although that seems to have been a mistake.
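One way to tell one-time startup cost apart from per-launch cost is to time them separately. The sketch below is my guess at how that could be measured, not what the class code did: `noop_kernel`, `first_call_ms`, and `avg_launch_ms` are hypothetical names, and `cudaFree(0)` is used only as a common idiom to force the CUDA runtime to initialize.

```cuda
#include <cassert>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical empty kernel: it does no work, so timing it measures launch overhead only.
__global__ void noop_kernel() {}

// Wall-clock time (ms) of the first CUDA runtime call in the process.
// cudaFree(0) frees nothing; it just forces the runtime to initialize.
double first_call_ms()
{
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Average time (ms) per launch over `launches` launches, measured with CUDA events.
float avg_launch_ms(int launches)
{
    assert(launches > 0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    noop_kernel<<<1, 1>>>();        // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < launches; i++) {
        noop_kernel<<<1, 1>>>();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);     // wait for all timed launches to finish

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / launches;
}
```

Calling `first_call_ms()` once at the start of the program and `avg_launch_ms(1000)` afterwards should show whether the delay is paid once or on every launch.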
The problem we were working on was temperature diffusion: basically, simulating heat moving from one end of a metal rod to the other. I believe this was massively simplified for us, but it was cool anyway.
Here is the equation:
$$u_{t+1}[i]=\frac{u_t[i+1] + u_t[i-1]}{2}$$
Basically, each position's temperature at the next time step is the average of its two neighbors' current temperatures.
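As a quick worked example, assume a rod at room temperature 20 (my choice of value) with the hot end held at 100, so $u_t = [100, 20, 20, \dots]$. One step updates the first interior point to
$$u_{t+1}[1]=\frac{u_t[2] + u_t[0]}{2}=\frac{20 + 100}{2}=60$$
and leaves the points further along at 20 until the heat reaches them.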
So, the setup is two arrays of length n, with all elements set to room temperature and the first element of the first array set to a higher temperature, like 100. Each element of the second array is then calculated as the average of its two neighbors in the first array.
The single-thread for-loop version looks like:
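for (i = 0; i < time; i++) {
    // one time step: each interior point becomes the average of its neighbors in b
    for (k = 1; k <= size; k++) {
        a[k] = (b[k-1] + b[k+1]) / 2.0;
    }
    // copy the last point into the extra element past the end
    a[size+1] = a[size];
    // swap pointers so the next step reads what this step wrote
    tempPtr = a;
    a = b;
    b = tempPtr;
}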
a is calculated by averaging two elements of b (the inner loop), and that is repeated for the number of time steps specified (outer loop). Then you swap pointers and go the other way.
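For checking GPU output later, it helps to have the serial version as a self-contained function. This is a sketch under a few assumptions of mine: the arrays have size + 2 elements (indices 0 through size + 1, matching the loop bounds above), room temperature is 20, and the hot end is written into both arrays so it survives the pointer swap. `diffuse_cpu` and its parameters are hypothetical names.

```cuda
#include <cassert>
#include <utility>
#include <vector>

// Serial reference for the diffusion loop (host-only code, no GPU needed).
// Returns the rod after `steps` time steps; index 0 is the fixed hot end
// and index size + 1 is the extra element that mirrors the last point.
std::vector<float> diffuse_cpu(long size, long steps,
                               float hot = 100.0f, float room = 20.0f)
{
    assert(size >= 1 && steps >= 0);

    // Assumption: both buffers start identical, with the hot end at index 0.
    std::vector<float> a(size + 2, room), b(size + 2, room);
    a[0] = hot;
    b[0] = hot;

    for (long t = 0; t < steps; t++) {
        for (long k = 1; k <= size; k++) {
            a[k] = (b[k - 1] + b[k + 1]) / 2.0f;
        }
        a[size + 1] = a[size];
        std::swap(a, b);   // b now holds the newest values
    }
    return b;
}
```

With size = 4 and two steps, this gives 100, 60, 40, 20, 20, 20: the heat has moved two cells in from the hot end.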
Now for CUDA, I replaced the entire inner for loop with a kernel. Then I specified how I wanted to run that kernel:
Kernel:
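__global__
void cuda_diffusion_kernel(float *a, float *b, long int size)
{
    // global thread index: one thread per array element
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // interior points 1..size: average the two neighbors from b
    if (i <= size && i > 0) {
        a[i] = (b[i-1] + b[i+1]) / 2.0f;
    }
    // the thread that owns the last point also copies it into the extra end element
    if (i == size) {
        a[i+1] = a[i];
    }
}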
The modified for loop:
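for (i = 0; i < time; i++) {
    // one time step on the GPU; launches on the same stream run in order
    cuda_diffusion_kernel<<<grid, block>>>(d_a, d_b, size);
    // swap device pointers so the next step reads what this step wrote
    tempPtr = d_a;
    d_a = d_b;
    d_b = tempPtr;
}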
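The loop above assumes `d_a` and `d_b` already point to GPU memory. Here is a minimal sketch of the host code around it, assuming arrays of size + 2 floats (indices 0 through size + 1). `run_diffusion_gpu`, `check`, and the 20/100 starting temperatures are hypothetical choices of mine, and the kernel is repeated so the block compiles on its own.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

__global__ void cuda_diffusion_kernel(float *a, float *b, long int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i <= size && i > 0) {
        a[i] = (b[i-1] + b[i+1]) / 2.0f;
    }
    if (i == size) {
        a[i+1] = a[i];
    }
}

// Hypothetical helper: abort with a message if a CUDA call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

std::vector<float> run_diffusion_gpu(long size, long steps)
{
    // Assumed initial state: room temperature 20, hot end 100 at index 0.
    std::vector<float> h(size + 2, 20.0f);
    h[0] = 100.0f;
    size_t bytes = h.size() * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr;
    check(cudaMalloc((void **)&d_a, bytes), "cudaMalloc d_a");
    check(cudaMalloc((void **)&d_b, bytes), "cudaMalloc d_b");
    // Copy the start state into both buffers so index 0 stays hot after swaps.
    check(cudaMemcpy(d_a, h.data(), bytes, cudaMemcpyHostToDevice), "copy to d_a");
    check(cudaMemcpy(d_b, h.data(), bytes, cudaMemcpyHostToDevice), "copy to d_b");

    int threads_per_block = 512;
    // Thread indices 0..size are needed, so round size + 1 up to whole blocks.
    int blocks = (int)((size + threads_per_block) / threads_per_block);

    for (long t = 0; t < steps; t++) {
        cuda_diffusion_kernel<<<blocks, threads_per_block>>>(d_a, d_b, size);
        float *tmp = d_a; d_a = d_b; d_b = tmp;
    }
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "kernel execution");

    // After the final swap, d_b holds the newest values.
    check(cudaMemcpy(h.data(), d_b, bytes, cudaMemcpyDeviceToHost), "copy back");
    cudaFree(d_a);
    cudaFree(d_b);
    return h;
}
```

Kernel launches return before the GPU has finished, so if you time this loop, stop the timer only after `cudaDeviceSynchronize()`.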
The grid is the number of blocks to launch, and a block is a group of threads. In this case, I used 512 threads per block, which is the maximum on older GPUs (compute capability 2.0 and later allow up to 1024):
int threads_per_block = 512;
// the kernel uses thread indices 0..size, so round size + 1 up to whole blocks
int blocks = (size + 1 + threads_per_block - 1) / threads_per_block;
dim3 grid(blocks);
dim3 block(threads_per_block);
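The block count is just a rounded-up division, and it can be done in integer arithmetic so large sizes don't lose precision in a float. One detail I am assuming from the kernel shown earlier: it touches thread indices 0 through size, so the launch needs at least size + 1 threads. `blocks_needed` and `max_threads_per_block` are hypothetical helper names; `cudaGetDeviceProperties` is the runtime call that reports the per-block limit.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Smallest number of blocks whose threads cover `n_threads` indices.
long blocks_needed(long n_threads, int threads_per_block)
{
    assert(n_threads > 0 && threads_per_block > 0);
    return (n_threads + threads_per_block - 1) / threads_per_block;
}

// Per-block thread limit reported by the given device (0 on failure).
int max_threads_per_block(int device = 0)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return 0;
    }
    return prop.maxThreadsPerBlock;
}
```

For example, with size = 1024 and 512 threads per block, `blocks_needed(size, 512)` gives 2 blocks (indices 0 to 1023), which never reaches index 1024, while `blocks_needed(size + 1, 512)` gives 3.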
That was pretty much all of the code required to make this simulation perform 32x to 400x faster.
Pretty cool stuff