GPU Programming with CUDA
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA. It enables software developers to use NVIDIA GPUs for general-purpose processing, going beyond traditional graphics rendering. This opens up possibilities for accelerating computationally intensive tasks in various fields, including scientific simulations, machine learning, and image processing. This document provides an advanced overview of CUDA programming in C++.
What is GPU Programming with CUDA?
GPU programming with CUDA leverages the massively parallel architecture of GPUs. Unlike CPUs, which are optimized for serial execution with a few powerful cores, GPUs consist of thousands of smaller cores designed for parallel execution of the same instruction on multiple data points (SIMD - Single Instruction, Multiple Data). CUDA provides a C++ extension that allows developers to write code that executes directly on the GPU.
Key Concepts:
- Host: The CPU and its associated memory (RAM). The host code is the standard C++ code that runs on the CPU.
- Device: The GPU and its associated memory (GPU memory or VRAM). The device code is the CUDA code that runs on the GPU.
- Kernel: A function that is executed in parallel by multiple threads on the GPU. Kernels are written in CUDA C++ and launched from the host.
- Thread: The smallest unit of execution on the GPU. Threads are grouped into blocks.
- Block: A group of threads that can cooperate by sharing data through shared memory and synchronizing their execution. Blocks are grouped into grids.
- Grid: A collection of blocks that execute the same kernel.
- Memory Hierarchy: CUDA utilizes a memory hierarchy consisting of global memory (accessible by all threads), shared memory (accessible by threads within a block), registers (private to each thread), and constant memory (read-only, cached).
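As a minimal sketch of how these pieces fit together (the kernel, names, and sizes below are illustrative, not part of the examples later in this document), a kernel is launched from the host as a grid of blocks, and each thread computes its own global index from the built-in variables:
// Illustrative kernel: each thread scales one element of an array.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index within the grid
    if (i < n) {                                   // guard threads beyond the array length
        data[i] *= factor;
    }
}
// Host-side launch: enough 128-thread blocks to cover n elements (sizes chosen arbitrarily).
// scale<<<(n + 127) / 128, 128>>>(d_data, 2.0f, n);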
Edge Cases and Performance Considerations:
- Data Transfer Overhead: Transferring data between the host and the device is often a significant bottleneck, so minimizing data transfers is crucial for performance. Techniques like asynchronous data transfers (using CUDA streams) can help overlap data transfers with computation; a sketch using streams follows this list.
- Thread Divergence: If threads within a warp (a group of 32 threads that execute in lockstep) take different execution paths (e.g., due to if statements), performance can degrade, because the warp must execute every path taken by any of its threads. Writing code that minimizes thread divergence is important.
- Memory Coalescing: When threads access global memory, it is most efficient when they access contiguous memory locations. This is called memory coalescing. If threads access memory in a scattered pattern, performance can suffer.
- Occupancy: Occupancy refers to the ratio of active warps to the maximum number of warps that can be resident on a Streaming Multiprocessor (SM) at any given time. Higher occupancy generally leads to better performance, but not always. Balancing occupancy with other factors, such as register usage and shared memory usage, is important.
- Kernel Launch Overhead: Launching a kernel has a certain overhead. Launching many small kernels can be less efficient than launching a single larger kernel.
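As referenced above, the following sketch shows one common way to overlap host-to-device copies with computation using cudaMemcpyAsync and two CUDA streams. It assumes h_in was allocated as pinned host memory (e.g., with cudaHostAlloc), which asynchronous copies need in order to actually overlap; the process kernel and the chunking are placeholders, not part of the examples in this document.
#include <cuda_runtime.h>
// Placeholder kernel operating on one chunk of data
__global__ void process(float *chunk, int chunkSize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < chunkSize) chunk[i] += 1.0f; // placeholder work
}
void pipeline(float *h_in, float *d_in, int n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    int half = n / 2;
    // Copy and process the first half in stream s0 and the second half in stream s1;
    // the copy in one stream can overlap with the kernel running in the other.
    cudaMemcpyAsync(d_in, h_in, half * sizeof(float), cudaMemcpyHostToDevice, s0);
    process<<<(half + 255) / 256, 256, 0, s0>>>(d_in, half);
    cudaMemcpyAsync(d_in + half, h_in + half, (n - half) * sizeof(float), cudaMemcpyHostToDevice, s1);
    process<<<(n - half + 255) / 256, 256, 0, s1>>>(d_in + half, n - half);
    cudaDeviceSynchronize(); // wait for both streams to finish
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}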
Syntax and Usage
CUDA C++ extends standard C++ with keywords and functions that allow you to write code that executes on the GPU.
- __global__: Declares a function that executes on the device (GPU) and is called from the host (CPU). It is the entry point for kernel execution.
- __device__: Declares a function that executes on the device and is called from other device code.
- __host__: Declares a function that executes on the host and is called from host code. It is typically omitted because it is the default.
- cudaMalloc(): Allocates memory on the device.
- cudaFree(): Frees memory on the device.
- cudaMemcpy(): Copies data between the host and the device, or between different memory locations on the device. It takes a cudaMemcpyKind argument that specifies the direction of the copy (e.g., cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice).
- <<<gridDim, blockDim>>>: The kernel launch configuration syntax. It specifies the number of blocks in the grid (gridDim) and the number of threads in each block (blockDim). Both can be plain integers or dim3 objects (for 1D, 2D, or 3D grids and blocks).
- threadIdx.x, threadIdx.y, threadIdx.z: Built-in variables giving the thread index within its block.
- blockIdx.x, blockIdx.y, blockIdx.z: Built-in variables giving the block index within the grid.
- blockDim.x, blockDim.y, blockDim.z: Built-in variables giving the dimensions of the block.
- gridDim.x, gridDim.y, gridDim.z: Built-in variables giving the dimensions of the grid.
- cudaDeviceSynchronize(): Blocks the host until all previously issued device work has completed. It is often used to make sure kernels have finished before the host reads their results or checks for errors.
Basic Example
This example demonstrates vector addition on the GPU.
#include <iostream>
#include <cuda_runtime.h>
// Kernel function to add two vectors
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
int main() {
    int n = 1024;
    float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    // Allocate memory on the host
    h_a = new float[n];
    h_b = new float[n];
    h_c = new float[n];
    // Initialize host vectors
    for (int i = 0; i < n; ++i) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }
    // Allocate memory on the device
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_c, n * sizeof(float));
    // Copy data from host to device
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);
    // Define grid and block dimensions
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;
    // Launch the kernel
    vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
    // Copy results from device to host (cudaMemcpy waits for the kernel to finish)
    cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    // Verify results
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_c[i] != h_a[i] + h_b[i]) {
            std::cout << "Error at index " << i << ": " << h_c[i] << " != " << h_a[i] + h_b[i] << std::endl;
            ok = false;
            break;
        }
    }
    if (ok) {
        std::cout << "Vector addition completed successfully." << std::endl;
    }
    // Free memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;
    return 0;
}
Explanation:
- Memory Allocation: The code allocates memory on both the host (h_a, h_b, h_c) and the device (d_a, d_b, d_c).
- Data Transfer: The input vectors h_a and h_b are copied from the host to the device using cudaMemcpy.
- Kernel Launch: The vectorAdd kernel is launched with a specified number of blocks and threads per block. The <<<numBlocks, blockSize>>> syntax defines the execution configuration.
- Kernel Execution: The vectorAdd kernel calculates the index i for each thread and adds the corresponding elements of the input vectors.
- Result Transfer: The result vector d_c is copied from the device back to the host.
- Verification: The code verifies that the GPU computation produced the correct results.
- Memory Deallocation: The allocated memory on both the host and device is freed.
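One common variation worth noting, not part of the example above: when n can exceed the total number of launched threads, a grid-stride loop lets a fixed-size launch cover the whole array. A minimal sketch of the same vector addition written that way:
// Grid-stride version of vectorAdd: each thread processes every
// (gridDim.x * blockDim.x)-th element, so the kernel is correct for any n
// regardless of the launch configuration.
__global__ void vectorAddStride(const float *a, const float *b, float *c, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}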
Advanced Example
This example demonstrates matrix multiplication on the GPU using shared memory for improved performance.
#include <iostream>
#include <cmath>
#include <cstdlib>
#include <cuda_runtime.h>
// Kernel function for matrix multiplication using shared memory
__global__ void matrixMulShared(float *a, float *b, float *c, int width) {
    // Block row and column
    int bx = blockIdx.x;
    int by = blockIdx.y;
    // Thread row and column within Csub
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;
    // Shared-memory tiles for the sub-matrices of A and B
    __shared__ float As[32][32]; // Assuming block size is 32x32
    __shared__ float Bs[32][32];
    // Loop over all the sub-matrices of A and B that are required to compute Csub,
    // multiplying each pair of sub-matrices together and accumulating the results
    for (int i = 0; i < width / 32; ++i) { // width must be a multiple of 32 for this example
        // Load Asub and Bsub from device memory to shared memory;
        // each thread loads one element of each sub-matrix
        As[ty][tx] = a[by * width * 32 + i * 32 + ty * width + tx];
        Bs[ty][tx] = b[i * width * 32 + bx * 32 + ty * width + tx];
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int k = 0; k < 32; ++k) {
            Cvalue += As[ty][k] * Bs[k][tx];
        }
        // Synchronize to make sure that the preceding
        // computation is done before loading new Asub and Bsub
        __syncthreads();
    }
    // Write Csub to device memory
    c[by * width * 32 + bx * 32 + ty * width + tx] = Cvalue;
}
int main() {
    int width = 128; // Matrix dimension (width x width) - must be a multiple of 32 for this example
    int size = width * width;
    float *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
    // Allocate memory on the host
    h_a = new float[size];
    h_b = new float[size];
    h_c = new float[size];
    // Initialize host matrices
    for (int i = 0; i < size; ++i) {
        h_a[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
        h_b[i] = static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
    }
    // Allocate memory on the device
    cudaMalloc((void **)&d_a, size * sizeof(float));
    cudaMalloc((void **)&d_b, size * sizeof(float));
    cudaMalloc((void **)&d_c, size * sizeof(float));
    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size * sizeof(float), cudaMemcpyHostToDevice);
    // Define grid and block dimensions
    dim3 threadsPerBlock(32, 32);
    dim3 blocksPerGrid(width / 32, width / 32); // assuming width is a multiple of 32
    // Launch the kernel
    matrixMulShared<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, width);
    // Copy results from device to host
    cudaMemcpy(h_c, d_c, size * sizeof(float), cudaMemcpyDeviceToHost);
    // Basic verification (a full check is expensive, so spot-check a few random elements)
    bool error = false;
    for (int i = 0; i < 10; ++i) {
        int row = rand() % width;
        int col = rand() % width;
        float expected = 0.0f;
        for (int k = 0; k < width; ++k) {
            expected += h_a[row * width + k] * h_b[k * width + col];
        }
        float actual = h_c[row * width + col];
        // Allow a small tolerance for float rounding and accumulation-order differences
        if (std::fabs(actual - expected) > 0.001f * width) {
            std::cout << "Error at (" << row << ", " << col << "): Expected " << expected << ", got " << actual << std::endl;
            error = true;
            break;
        }
    }
    if (!error) {
        std::cout << "Matrix multiplication completed successfully." << std::endl;
    }
    // Free memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;
    return 0;
}
Common Use Cases
- Scientific Simulations: Molecular dynamics, fluid dynamics, weather forecasting.
- Machine Learning: Training deep neural networks, image recognition, natural language processing.
- Image and Video Processing: Image filtering, video encoding/decoding, computer vision.
- Finance: Risk analysis, option pricing, portfolio optimization.
Best Practices
- Minimize Data Transfers: Transfer data between the host and device as infrequently as possible. Use asynchronous data transfers to overlap computation and communication.
- Optimize Memory Access: Ensure that threads access global memory in a coalesced manner. Use shared memory to reduce global memory accesses.
- Reduce Thread Divergence: Write code that minimizes the differences in execution paths taken by threads within a warp.
- Maximize Occupancy: Choose block sizes that maximize occupancy without sacrificing other performance factors; the occupancy API sketch after this list shows one way to pick a block size.
- Profile Your Code: Use the NVIDIA Nsight profiler to identify performance bottlenecks and optimize your code.
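For the occupancy point above, the CUDA runtime can suggest a block size for a given kernel via cudaOccupancyMaxPotentialBlockSize. A minimal sketch, assuming the vectorAdd kernel from the basic example is defined elsewhere in the program; the helper name launchWithSuggestedBlockSize is hypothetical:
#include <cuda_runtime.h>
// vectorAdd is the kernel from the basic example above
__global__ void vectorAdd(float *a, float *b, float *c, int n);
// Launch vectorAdd with a block size suggested by the occupancy API
void launchWithSuggestedBlockSize(float *d_a, float *d_b, float *d_c, int n) {
    int minGridSize = 0;
    int blockSize = 0;
    // The runtime picks a block size that maximizes theoretical occupancy for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vectorAdd, 0, 0);
    int numBlocks = (n + blockSize - 1) / blockSize;
    vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
}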
Common Pitfalls
- Incorrect Memory Management: Failing to allocate or free memory properly can lead to memory leaks or segmentation faults.
- Race Conditions: When multiple threads access the same shared memory location without proper synchronization, race conditions can occur, leading to unpredictable results.
- Deadlocks: When threads are waiting for each other to release resources, a deadlock can occur, causing the program to hang.
- Ignoring Error Codes: CUDA functions return error codes. Ignoring these error codes can make debugging difficult; a simple checking macro is sketched below.
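For the last point, a common pattern (shown here as a sketch, not a library API) is to wrap every runtime call in a checking macro and to query cudaGetLastError() after kernel launches:
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
// Illustrative error-checking macro: print the failing call's location and exit
#define CUDA_CHECK(call)                                                       \
    do {                                                                       \
        cudaError_t err = (call);                                              \
        if (err != cudaSuccess) {                                              \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                         cudaGetErrorString(err), __FILE__, __LINE__);         \
            std::exit(EXIT_FAILURE);                                           \
        }                                                                      \
    } while (0)
// Usage (with the basic example's pointers):
// CUDA_CHECK(cudaMalloc((void **)&d_a, n * sizeof(float)));
// vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
// CUDA_CHECK(cudaGetLastError());      // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize()); // catches errors raised during kernel execution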
Key Takeaways
- CUDA enables you to leverage the massively parallel architecture of GPUs for general-purpose computing.
- Understanding the CUDA programming model, including kernels, threads, blocks, and grids, is essential for writing efficient CUDA code.
- Optimizing memory access, minimizing data transfers, and reducing thread divergence are critical for achieving high performance.
- Profiling your code and using the NVIDIA Nsight profiler can help you identify and address performance bottlenecks.