C++#
"C++ is a general-purpose programming language providing a direct and efficient model of hardware combined with facilities for defining lightweight abstractions." - Bjarne Stroustrup, "The C++ Programming Language, Fourth Edition"
Getting Started#
This section illustrates how to compile and run C++ programs on the HPC systems, from a basic Hello World to CUDA and MPI+CUDA examples.
Hello World#
Begin by creating a source file named hello.cpp with the following contents:
#include <iostream>
int main(void) {
std::cout << "Hello, World!\n";
return 0;
}
Next, we must select a compiler. We can choose among GNU, Intel, and Cray compilers, depending on the system we are using (see Compilers and Toolchains). To see the available modules and versions, use module avail. For this example, we will use the g++ compiler, which is part of GNU's gcc package. We will load the default version of the compiler, which in this case is gcc 10.1:
$ module load gcc
$ module list
Currently Loaded Modules:
1) gcc/10.1.0
$ gcc --version | head -1
gcc (Spack GCC) 10.1.0
With the gcc package, the C++ compiler is provided by the g++ command. To compile the program, run:
$ g++ hello.cpp -o hello
This creates an executable named hello. Now run the program and observe the output:
$ ./hello
Hello, World!
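The same workflow applies to the other toolchains listed in Compilers and Toolchains below. For example, on a system that provides the Intel toolchain, the equivalent steps would be (illustrative; run module avail to confirm the module name on your system):
$ module load intel-oneapi-compilers
$ icpc hello.cpp -o hello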
CUDA Vector Operations Example#
This code demonstrates basic vector operations using CUDA:
- Scaling a vector using two different kernel strategies
- Adding two vectors element-wise on the GPU
- Managing memory and data transfer between host (CPU) and device (GPU)
- Verifying results on the host
Each CUDA kernel shows a different way to parallelize vector operations.
The main function walks through allocation, kernel launches, and cleanup.
Create a source file named stream.cu with the following contents:
stream.cu
#include <iostream>
#include <cassert>
// Define the size of the vectors
#define N 10240
typedef float dType;
// CUDA kernel to scale a vector (each thread handles one element)
__global__ void scaleVector(dType scale, dType *input, dType *output)
{
// Calculate global thread index
int tid = threadIdx.x + blockIdx.x*blockDim.x;
// Ensure thread index is within bounds
if (tid < N)
{
output[tid] = input[tid]*scale;
}
}
// Alternate CUDA kernel to scale a vector (each thread handles multiple elements)
__global__ void scaleVectorAlternate(dType scale, dType *input, dType *output)
{
// Calculate global thread index
int tid = threadIdx.x + blockIdx.x*blockDim.x;
// Loop over elements assigned to this thread
for (int i=tid; i<N; i=i+blockDim.x*gridDim.x)
{
output[i] = input[i]*scale;
}
}
// CUDA kernel to add two vectors element-wise
__global__ void addVectors(dType *a, dType *b, dType *c)
{
int tid = threadIdx.x + blockIdx.x*blockDim.x; // Calculate global thread index
// Each thread processes multiple elements spaced by total number of threads
for (int i=tid; i<N; i=i+blockDim.x*gridDim.x)
{
c[i] = a[i] + b[i]; // Add corresponding elements from a and b, store in c
}
}
int main()
{
// Allocate host memory for input and output vectors
dType *a = new dType[N];
dType *b = new dType[N];
// Device pointers
dType *dev_a;
dType *dev_b;
// Scaling factor
const dType scale = 2;
// Initialize input vector on host
for (int i=0; i<N; i++) a[i] = (dType)i / 2.0;
// Allocate device memory
cudaMalloc( (void**)&dev_a, N*sizeof(dType));
cudaMalloc( (void**)&dev_b, N*sizeof(dType));
// Copy input data from host to device
cudaMemcpy(dev_a, a, N*sizeof(dType), cudaMemcpyHostToDevice);
// Set CUDA kernel launch parameters
const int blockSize=128;
int nBlocks = (N-1)/blockSize + 1;
// Launch the scaleVector kernel on the device
scaleVector <<<nBlocks, blockSize>>> (scale,dev_a,dev_b);
// Copy result from device to host
cudaMemcpy(b, dev_b, N*sizeof(dType), cudaMemcpyDeviceToHost);
// Verify results on host
for (int i=0; i<N; i++) assert(a[i]*scale == b[i]);
// Zero the device output buffer before the next kernel (cudaMemset sets each byte to the given value)
cudaMemset(dev_b, 0, N*sizeof(dType));
// Change the number of blocks such that each thread handles multiple elements
nBlocks = 4;
// Launch the alternate scaleVector kernel on the device
scaleVectorAlternate <<<nBlocks, blockSize>>> (scale,dev_a,dev_b);
// Copy result from device to host
cudaMemcpy(b, dev_b, N*sizeof(dType), cudaMemcpyDeviceToHost);
// Verify results on host
for (int i=0; i<N; i++) assert(a[i]*scale == b[i]);
dType *c; // Host pointer for result vector
dType *dev_c; // Device pointer for result vector
c = (dType *) malloc(N*sizeof(dType)); // Allocate host memory for output vector
cudaMalloc( (void**)&dev_c, N*sizeof(dType)); // Allocate device memory for output vector
addVectors <<<nBlocks, blockSize>>> (dev_a, dev_b, dev_c); // Launch kernel to add vectors
// Copy result from device to host
cudaMemcpy(c, dev_c, N*sizeof(dType), cudaMemcpyDeviceToHost);
// Verify results on host
for (int i=0; i<N; i++) assert(c[i] == a[i] + b[i]); // Check if addition is correct
std::cout << "All operations completed successfully!" << std::endl;
// Free host and device memory
delete [] a; // Free host memory for input vector a
delete [] b; // Free host memory for input/output vector b
cudaFree(dev_a); // Free device memory for input vector a
cudaFree(dev_b); // Free device memory for input/output vector b
free(c); // Free host memory for output vector c
cudaFree(dev_c); // Free device memory for output vector c
return 0; // Return success
}
With the nvhpc package, the CUDA C++ compiler is provided by the nvcc command. Request an interactive session on a GPU node, load the module, and compile the program:
$ salloc -A <project_name> -t 00:30:00 --nodes=1 --ntasks-per-node=10 --gres=gpu:1 --partition=debug
$ module load nvhpc
$ nvcc -o cuda_stream stream.cu
This creates an executable named cuda_stream. Now run the program and observe the output:
$ ./cuda_stream
All operations completed successfully!
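For brevity, the example above does not check the return values of CUDA API calls or kernel launches. In practice, checking every call is strongly recommended; the following is a minimal, illustrative helper (the CUDA_CHECK name is our own, not part of the CUDA toolkit):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
// Abort with a descriptive message if a CUDA call returns an error
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                         cudaGetErrorString(err), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)
// Example usage:
//   CUDA_CHECK(cudaMalloc((void**)&dev_a, N*sizeof(dType)));
//   scaleVector <<<nBlocks, blockSize>>> (scale, dev_a, dev_b);
//   CUDA_CHECK(cudaGetLastError());       // catches kernel launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution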
MPI + CUDA 2D Diffusion Equation Example#
In this example, each MPI process computes a subdomain of a 2D array using CUDA. Processes are arranged in a 2D Cartesian grid (MPI_Cart_create). Each process evolves its local subdomain using a simple explicit diffusion scheme. Boundary data is exchanged using host buffers and MPI_Sendrecv. At the end, all subdomains are gathered to rank 0 and saved as a CSV file.
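The explicit update applied at each step is the standard five-point stencil for the 2D diffusion equation, which the kernel below implements directly:

$$u_{i,j}^{n+1} = u_{i,j}^{n} + \frac{D\,\Delta t}{\Delta x^{2}}\left(u_{i-1,j}^{n} + u_{i+1,j}^{n} + u_{i,j-1}^{n} + u_{i,j+1}^{n} - 4u_{i,j}^{n}\right)$$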
Create a source file named mpi_diffusion.cu with the following contents:
mpi_diffusion.cu
#include <iostream>
#include <fstream>
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstring>
#define N 8 // Global domain size (NxN)
#define BLOCK_SIZE 4 // Local domain size per process (BLOCK_SIZE x BLOCK_SIZE)
#define NSTEPS 10 // Number of time steps
#define DT 0.1f
#define DX 1.0f
#define DIFF 1.0f // Diffusion coefficient
typedef float dType;
// CUDA kernel: explicit 2D diffusion step
__global__ void diffusionStep(
dType* u, dType* u_new,
int n, dType dt, dType dx, dType diff,
const dType* top, const dType* bottom, const dType* left, const dType* right)
{
int i = blockIdx.y * blockDim.y + threadIdx.y; // row index in local domain
int j = blockIdx.x * blockDim.x + threadIdx.x; // column index in local domain
if (i < n && j < n) {
// Get neighbor values, using boundary arrays if on edge
dType up = (i == 0) ? top[j] : u[(i-1)*n + j]; // Top neighbor
dType down = (i == n-1) ? bottom[j] : u[(i+1)*n + j]; // Bottom neighbor
dType leftv = (j == 0) ? left[i] : u[i*n + (j-1)]; // Left neighbor
dType rightv= (j == n-1) ? right[i] : u[i*n + (j+1)]; // Right neighbor
dType center= u[i*n + j]; // Center value
// Explicit diffusion update formula
u_new[i*n + j] = center + diff * dt / (dx*dx) *
(up + down + leftv + rightv - 4.0f * center);
}
}
int main(int argc, char** argv)
{
MPI_Init(&argc, &argv); // Initialize MPI
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get process rank
MPI_Comm_size(MPI_COMM_WORLD, &size); // Get total number of processes
// Create a 2D Cartesian communicator for domain decomposition
int dims[2] = {0, 0};
MPI_Dims_create(size, 2, dims); // Let MPI decide grid shape
int periods[2] = {0, 0}; // No periodic boundaries
MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart_comm);
int coords[2];
MPI_Cart_coords(cart_comm, rank, 2, coords); // Get coordinates of this rank in the grid
// Allocate local domain arrays on host
dType *local = new dType[BLOCK_SIZE * BLOCK_SIZE]; // Current solution
dType *local_new = new dType[BLOCK_SIZE * BLOCK_SIZE]; // Next solution
// Initialize: set center of global domain to 100, rest to 0
for (int i = 0; i < BLOCK_SIZE; ++i)
for (int j = 0; j < BLOCK_SIZE; ++j)
local[i * BLOCK_SIZE + j] = (coords[0] == dims[0]/2 && coords[1] == dims[1]/2 && i == BLOCK_SIZE/2 && j == BLOCK_SIZE/2) ? 100.0f : 0.0f;
// Allocate device memory for local domains
dType *d_local, *d_local_new;
cudaMalloc(&d_local, BLOCK_SIZE * BLOCK_SIZE * sizeof(dType));
cudaMalloc(&d_local_new, BLOCK_SIZE * BLOCK_SIZE * sizeof(dType));
cudaMemcpy(d_local, local, BLOCK_SIZE * BLOCK_SIZE * sizeof(dType), cudaMemcpyHostToDevice);
// Allocate device buffers for boundaries (top, bottom, left, right)
dType *d_top, *d_bottom, *d_left, *d_right;
cudaMalloc(&d_top, BLOCK_SIZE * sizeof(dType));
cudaMalloc(&d_bottom, BLOCK_SIZE * sizeof(dType));
cudaMalloc(&d_left, BLOCK_SIZE * sizeof(dType));
cudaMalloc(&d_right, BLOCK_SIZE * sizeof(dType));
// Host buffers for MPI boundary exchange
dType h_top[BLOCK_SIZE], h_bottom[BLOCK_SIZE], h_left[BLOCK_SIZE], h_right[BLOCK_SIZE];
// Find ranks of neighboring processes in the grid
int north, south, west, east;
MPI_Cart_shift(cart_comm, 0, 1, &north, &south); // Find north/south neighbors (row direction)
MPI_Cart_shift(cart_comm, 1, 1, &west, &east); // Find west/east neighbors (col direction)
// CUDA kernel launch configuration
dim3 block(2,2), grid((BLOCK_SIZE+1)/2,(BLOCK_SIZE+1)/2);
// Main time-stepping loop
for (int step = 0; step < NSTEPS; ++step) {
// Extract boundaries from device to host for MPI exchange
cudaMemcpy(h_top, d_local, BLOCK_SIZE * sizeof(dType), cudaMemcpyDeviceToHost); // Copy top row to host
cudaMemcpy(h_bottom, d_local + (BLOCK_SIZE-1)*BLOCK_SIZE, BLOCK_SIZE * sizeof(dType), cudaMemcpyDeviceToHost); // Copy bottom row to host
for (int i = 0; i < BLOCK_SIZE; ++i) {
cudaMemcpy(&h_left[i], d_local + i*BLOCK_SIZE, sizeof(dType), cudaMemcpyDeviceToHost); // Copy left column to host
cudaMemcpy(&h_right[i], d_local + i*BLOCK_SIZE + (BLOCK_SIZE-1), sizeof(dType), cudaMemcpyDeviceToHost); // Copy right column to host
}
// Exchange boundaries with neighbors using MPI_Sendrecv_replace, which
// sends the buffer contents to the neighbor and overwrites them in place
// with the data received from that same neighbor. (MPI_Sendrecv must not
// be used with a shared buffer, since send/receive buffers may not overlap.)
// North neighbor exchange (top row)
if (north != MPI_PROC_NULL)
    MPI_Sendrecv_replace(h_top, BLOCK_SIZE, MPI_FLOAT,
                         north, 0, north, 0, cart_comm, MPI_STATUS_IGNORE);
// South neighbor exchange (bottom row)
if (south != MPI_PROC_NULL)
    MPI_Sendrecv_replace(h_bottom, BLOCK_SIZE, MPI_FLOAT,
                         south, 0, south, 0, cart_comm, MPI_STATUS_IGNORE);
// West neighbor exchange (left column)
if (west != MPI_PROC_NULL)
    MPI_Sendrecv_replace(h_left, BLOCK_SIZE, MPI_FLOAT,
                         west, 0, west, 0, cart_comm, MPI_STATUS_IGNORE);
// East neighbor exchange (right column)
if (east != MPI_PROC_NULL)
    MPI_Sendrecv_replace(h_right, BLOCK_SIZE, MPI_FLOAT,
                         east, 0, east, 0, cart_comm, MPI_STATUS_IGNORE);
// Copy received boundaries from host to device for use in kernel
cudaMemcpy(d_top, h_top, BLOCK_SIZE * sizeof(dType), cudaMemcpyHostToDevice); // Top boundary
cudaMemcpy(d_bottom, h_bottom, BLOCK_SIZE * sizeof(dType), cudaMemcpyHostToDevice); // Bottom boundary
cudaMemcpy(d_left, h_left, BLOCK_SIZE * sizeof(dType), cudaMemcpyHostToDevice); // Left boundary
cudaMemcpy(d_right, h_right, BLOCK_SIZE * sizeof(dType), cudaMemcpyHostToDevice); // Right boundary
// Perform one explicit diffusion step on GPU
diffusionStep<<<grid, block>>>(
d_local, d_local_new, BLOCK_SIZE, DT, DX, DIFF,
d_top, d_bottom, d_left, d_right);
// Swap pointers for next step (so d_local always points to current solution)
std::swap(d_local, d_local_new);
}
// Copy final local domain from device to host for gathering
cudaMemcpy(local, d_local, BLOCK_SIZE * BLOCK_SIZE * sizeof(dType), cudaMemcpyDeviceToHost);
// Gather all subdomains to rank 0 for output
int total_size = BLOCK_SIZE * BLOCK_SIZE * size;
std::vector<dType> global;
if (rank == 0)
global.resize(total_size);
// Gather all local domains into the global array on rank 0
MPI_Gather(local, BLOCK_SIZE * BLOCK_SIZE, MPI_FLOAT,
rank == 0 ? global.data() : nullptr, BLOCK_SIZE * BLOCK_SIZE, MPI_FLOAT,
0, MPI_COMM_WORLD);
// Save the full solution to CSV on rank 0
if (rank == 0) {
std::ofstream fout("diffusion_result.csv");
fout << "x,y,u\n"; // CSV header
for (int proc = 0; proc < size; ++proc) {
int cx, cy;
MPI_Cart_coords(cart_comm, proc, 2, coords); // Get process coordinates for proc
cx = coords[0]; cy = coords[1];
for (int i = 0; i < BLOCK_SIZE; ++i) {
int gi = cx * BLOCK_SIZE + i; // Global row index
for (int j = 0; j < BLOCK_SIZE; ++j) {
int gj = cy * BLOCK_SIZE + j; // Global col index
int idx = proc * BLOCK_SIZE * BLOCK_SIZE + i * BLOCK_SIZE + j; // Index in gathered array
fout << gi << "," << gj << "," << global[idx] << "\n"; // Write x,y,u
}
}
}
fout.close();
std::cout << "Saved diffusion_result.csv\n";
}
// Free device memory
cudaFree(d_local);
cudaFree(d_local_new);
cudaFree(d_top);
cudaFree(d_bottom);
cudaFree(d_left);
cudaFree(d_right);
// Free host memory
delete[] local;
delete[] local_new;
MPI_Finalize(); // Finalize MPI
return 0;
}
With the nvhpc package, the MPI+CUDA C++ compiler is provided by the mpicxx command. Request an interactive session on a GPU node, load the module, and compile the program:
$ salloc -A <project_name> -t 00:30:00 --nodes=1 --ntasks-per-node=10 --gres=gpu:2 --partition=debug
$ module load nvhpc
$ mpicxx -o cuda_mpi_diffusion mpi_diffusion.cu -lcudart -lmpi
This creates an executable named cuda_mpi_diffusion. Now run the program and observe the output:
$ srun -n 4 ./cuda_mpi_diffusion
Saved diffusion_result.csv
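Note that the example above lets every rank use the default GPU (device 0). When several ranks share a node that has multiple GPUs, each rank should select its own device before making any other CUDA calls. A minimal sketch, assuming MPI_Init has already been called (the selectDevice name is illustrative):
#include <mpi.h>
#include <cuda_runtime.h>
// Assign one GPU per rank, round-robin over the devices visible on the node
void selectDevice()
{
    // Group the ranks that share this node into one communicator
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank, n_devices;
    MPI_Comm_rank(node_comm, &local_rank);
    cudaGetDeviceCount(&n_devices);
    cudaSetDevice(local_rank % n_devices); // call before any other CUDA work
    MPI_Comm_free(&node_comm);
}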
Compilers and Toolchains#
The following is a summary of the available compilers and toolchains. Users are encouraged to run module avail to check for the most up-to-date information on a particular system.
| Toolchain | C++ Compiler | Module | Systems |
|---|---|---|---|
| gcc | g++ | gcc | All |
| Intel | icpc | intel-oneapi-compilers | Swift, Vermilion, Kestrel |
| Cray | CC | PrgEnv-cray | Kestrel |
Note that Kestrel also provides the PrgEnv-intel and PrgEnv-gnu modules, which combine the Intel or GNU compilers with Cray MPICH. Please refer to the Kestrel Programming Environments Overview for details about the programming environments available on Kestrel.
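For example, under a Cray programming environment the CC wrapper compiles C++ and links in Cray MPICH automatically, so the Hello World program could be built as follows (illustrative; the exact module workflow may differ on your system):
$ module load PrgEnv-cray
$ CC hello.cpp -o hello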
For information specific to compiling MPI applications, refer to MPI.