MPI
Cray-MPICH
Documentation: Cray-MPICH
Cray's MPICH is a high performance and widely portable implementation of the Message Passing Interface (MPI) standard.
Note Cray-MPICH is only available on Kestrel.
To use Cray-MPICH, it is recommended to use the HPE Cray compiler wrappers cc, CC, and ftn.
The wrappers will find the necessary MPI headers and libraries as well as scientific libraries provided by LibSci.
Depending on the compiler of choice, we can load a different instance of Cray-MPICH. For example, if we want to use the Intel compilers, we can load the module PrgEnv-intel, which provides an Intel build of cray-mpich that is used through cc, CC, and ftn.
We can also use the usual MPI compiler wrappers mpicc, mpicxx, and mpif90/mpifort, but the HPE Cray wrappers are recommended.
Cray-MPICH accounts for the processor architecture through the craype-x86-spr module and for the network type through craype-network-ofi.
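As a minimal sketch (source file names and rank counts are illustrative), compiling and launching an MPI code with the wrappers might look like this:
# Select a programming environment; the matching cray-mpich is loaded with it
module load PrgEnv-intel
# The wrappers supply the MPI headers and libraries, plus LibSci
cc  -o hello_c   hello.c      # C
CC  -o hello_cxx hello.cpp    # C++
ftn -o hello_f90 hello.f90    # Fortran
# Launch under Slurm
srun -N 2 -n 8 ./hello_c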
cray-mpich-abi
For codes compiled using intel-mpi or mpich, we can load the module cray-mpich-abi, an HPE-provided MPI that allows pre-compiled software to leverage MPICH benefits on Kestrel's network topology.
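As a hedged sketch (the binary name and rank count are placeholders), a code that was pre-built against Intel MPI or MPICH could be run as follows:
# Provide HPE's ABI-compatible MPICH at run time
module load cray-mpich-abi
# Launch the pre-built binary under Slurm
srun -n 64 ./app_built_with_intel_mpi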
OpenMPI
Documentation: OpenMPI
The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.
The Open MPI framework is a free and open-source communications library that many developers build against. As an open-source package with strong academic support, new ideas often appear in Open MPI before they reach commercial MPI libraries.
Note that the Slurm-integrated builds of OpenMPI do not create the mpirun or mpiexec wrapper scripts that you may be used to. Ideally you should use srun (to take advantage of Slurm integration), but you can also use OpenMPI's native job launcher orterun. Some have also had success simply symlinking mpirun to orterun.
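A short sketch of the launch options above (the executable name and rank count are placeholders):
# Preferred: Slurm-integrated launch
srun -n 8 ./my_mpi_app
# Alternative: OpenMPI's native launcher
orterun -np 8 ./my_mpi_app
# Optional workaround: provide an mpirun name via a symlink (assumes ~/bin is on your PATH)
ln -s $(which orterun) ~/bin/mpirun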
OpenMPI implements two Byte Transfer Layers for data transport between ranks in the same physical memory space: sm and vader. Both use a memory-mapped file, which by default is placed in /tmp. The node-local /tmp filesystem is quite small, and it is easy to fill this and crash or hang your job. Non-default locations of this file may be set through the OMPI_TMPDIR environment variable.
- If you are running only a few ranks per node with modest buffer space requirements, consider setting OMPI_TMPDIR to /dev/shm in your job script.
- If you are running many ranks per node, you should set OMPI_TMPDIR to /tmp/scratch, which holds at least 1 TB depending on Eagle node type. A minimal sketch of both settings appears after this list.
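The following sketch sets OMPI_TMPDIR in a job script (the application name is a placeholder; pick the location that matches your rank count and buffer needs):
# Few ranks per node with modest buffers: use RAM-backed /dev/shm
export OMPI_TMPDIR=/dev/shm
# Many ranks per node: use the larger node-local scratch space instead
# export OMPI_TMPDIR=/tmp/scratch
srun ./my_mpi_app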
Supported Versions
| Kestrel | Eagle | Swift | Vermilion |
| --- | --- | --- | --- |
| openmpi/4.1.5-gcc | openmpi/1.10.7/gcc-8.4.0 | openmpi/4.1.1-6vr2flz | openmpi/4.1.4-gcc |
| openmpi/4.1.5-intel | openmpi/3.1.6/gcc-8.4.0 | | |
| | openmpi/4.0.4/gcc-8.4.0 | | |
| | openmpi/4.1.1/gcc+cuda | | |
| | openmpi/4.1.2/gcc | | |
| | openmpi/4.1.2/intel | | |
| | openmpi/4.1.3/gcc-11.3.0-cuda-11.7 | | |
| | openmpi/4.1.0/gcc-8.4.0 | | |
IntelMPI
Documentation: IntelMPI
Intel® MPI Library is a multifabric message-passing library that implements the open source MPICH specification. Use the library to create, maintain, and test advanced, complex applications that perform better on HPC clusters based on Intel® and compatible processors.
Intel's MPI library enables tight interoperability with its processors and software development framework, and is a solid choice for most HPC applications.
Supported Versions
| Kestrel | Eagle | Swift | Vermilion |
| --- | --- | --- | --- |
| intel-oneapi-mpi/2021.10.0-intel | intel-mpi/2020.1.217 | intel-oneapi-mpi/2021.3.0-hcp2lkf | intel-oneapi-mpi/2021.7.1-intel |
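As a hedged example using the Kestrel module listed above (source and executable names are placeholders, and the appropriate wrapper compiler depends on which compiler module is also loaded):
# Load Intel MPI and compile with one of its wrapper compilers
module load intel-oneapi-mpi/2021.10.0-intel
mpiicc -o my_app my_app.c    # Intel compilers; use mpicc for GCC
# Launch under Slurm
srun -n 16 ./my_app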
MPT
Documentation: MPT
HPE's Message Passing Interface (MPI) is a component of the HPE Message Passing Toolkit (MPT), a software package that supports parallel programming across a network of computer systems through a technique known as message passing.
Hewlett Packard Enterprise (HPE), Eagle's manufacturer, also offers a very performant MPI library. It is built on HPE's Message Passing Toolkit high-performance communications component and is colloquially known as "MPT."
Supported Versions
Note: MPT is only installed on Eagle.
MPICH
Documentation: MPICH
MPICH is a high performance and widely portable implementation of the Message Passing Interface (MPI) standard.
MPICH and its derivatives form the most widely used implementations of MPI in the world. They are used exclusively on nine of the top 10 supercomputers (June 2016 ranking), including the world’s fastest supercomputer: Taihu Light.
Supported Versions
| Kestrel | Eagle | Swift | Vermilion |
| --- | --- | --- | --- |
| mpich/4.1-gcc | | mpich/3.4.2-h2s5tru | mpich/4.0.2-gcc |
| mpich/4.1-intel | | | |
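A brief sketch using the Kestrel module listed above (source and executable names are placeholders):
# Load MPICH and compile with its wrapper compiler
module load mpich/4.1-gcc
mpicc -o my_app my_app.c
# Launch under Slurm
srun -n 16 ./my_app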
Running MPI Jobs on Eagle GPUs
To run MPI (message-passing interface) jobs on the Eagle system's NVidia GPUs, the MPI library must be "CUDA-aware."
A suitable OpenMPI build has been made available via the openmpi/4.0.4/gcc+cuda module.
This module is currently in test.
Interactive Use
srun does not work with this OpenMPI build when running interactively, so please use orterun instead. However, OpenMPI is cognizant of the Slurm environment, so one should request the resources needed via salloc (for example, the number of available "slots" is determined by the number of tasks requested via salloc).
Ranks are mapped round-robin to the GPUs on a node. nvidia-smi shows, for example,
Processes:

| GPU | PID | Type | Process name | GPU Memory Usage |
| --- | --- | --- | --- | --- |
| 0 | 24625 | C | ./jacobi | 803MiB |
| 0 | 24627 | C | ./jacobi | 803MiB |
| 1 | 24626 | C | ./jacobi | 803MiB |
when oversubscribing 3 ranks onto the 2 GPUs via the commands
srun --nodes=1 --ntasks-per-node=3 --account=<allocation_id> --time=10:00 --gres=gpu:2 --pty $SHELL
...<getting node>...
orterun -np 3 ./jacobi
If more ranks are desired than were originally requested via srun, the OpenMPI flag --oversubscribe could be added to the orterun command.
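For instance, continuing the interactive session above (illustrative), four ranks could be placed onto the three requested task slots with:
orterun --oversubscribe -np 4 ./jacobi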
Batch Use
An example batch script to run 4 MPI ranks across two nodes is as follows.
batch script
#!/bin/bash --login
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --time=2:00
#SBATCH --gres=gpu:2
#SBATCH --job-name=GPU_MPItest
#SBATCH --account=<allocation_id>
#SBATCH --error=%x-%j.err
#SBATCH --output=%x-%j.out
ml use -a /nopt/nrel/apps/modules/test/modulefiles
ml gcc/8.4.0 cuda/10.2.89 openmpi/4.0.4/gcc+cuda
cd $SLURM_SUBMIT_DIR
srun ./jacobi
Multi-Process Service
To run multiple ranks per GPU, you may find it beneficial to run NVidia's Multi-Process Service (MPS). This process management service can increase GPU utilization, reduce on-GPU storage requirements, and reduce context switching. To do so, include the following functionality in your Slurm script or interactive session:
MPS setup
# Recreate a clean pipe directory for the MPS control daemon
export CUDA_MPS_PIPE_DIRECTORY=/tmp/scratch/nvidia-mps
if [ -d $CUDA_MPS_PIPE_DIRECTORY ]; then
    rm -rf $CUDA_MPS_PIPE_DIRECTORY
fi
mkdir $CUDA_MPS_PIPE_DIRECTORY

# Recreate a clean log directory for the MPS daemon
export CUDA_MPS_LOG_DIRECTORY=/tmp/scratch/nvidia-log
if [ -d $CUDA_MPS_LOG_DIRECTORY ]; then
    rm -rf $CUDA_MPS_LOG_DIRECTORY
fi
mkdir $CUDA_MPS_LOG_DIRECTORY

# Start user-space daemon
nvidia-cuda-mps-control -d

# Run OpenMPI job.
orterun ...

# To clean up afterward, shut down daemon, remove directories, and unset variables
echo quit | nvidia-cuda-mps-control
for i in `env | grep CUDA_MPS | sed 's/=.*//'`; do rm -rf ${!i}; unset $i; done
For more information on MPS, see the NVidia guide.