Getting Started
Kestrel has 132 GPU nodes. Each GPU node has 4 NVIDIA H100 GPUs (80 GB each), 128 CPU cores, and 384 GB of RAM.
GPU nodes are shared: a job may request less than a full node, leaving the remaining resources available to other jobs running concurrently.
To request use of a GPU, use the flag --gres=gpu:<quantity> with sbatch, srun, or salloc, or add it as an #SBATCH directive in your sbatch submit script, where <quantity> is a number from 1 to 4.
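For example, a minimal sbatch script sketch requesting two GPUs (the walltime is arbitrary, and the program name is a placeholder for your own executable):

    #!/bin/bash
    #SBATCH --account=<project handle>   # your project allocation handle
    #SBATCH --partition=gpu-h100         # the H100 GPU partition
    #SBATCH --time=1:00:00               # example walltime
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:2                 # request 2 of the node's 4 H100 GPUs
    srun ./my_gpu_program                # placeholder GPU-enabled executable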
Running on GPU Nodes
An example GPU job allocation command:
salloc --time=2:00:00 --reservation=<friendly user reservation> --partition=gpu-h100 --account=<project handle> --nodes=1 -n 1 --mem-per-cpu=8G --gres=gpu:h100:<# of GPUs per node>
You're automatically given access to all of the memory of the GPU or GPUs requested (80 GB per GPU). Note that you'll need to use -n to request the number of CPU cores needed, and --mem or --mem-per-cpu to request the amount of CPU memory needed. You can use --exclusive to request all of the resources on the GPU node.
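As an illustrative sketch combining these flags, the following requests one H100 GPU alongside 16 CPU cores with 4 GB of CPU memory per core (the time, core count, and memory values are arbitrary examples, not recommendations):

    salloc --time=1:00:00 --partition=gpu-h100 --account=<project handle> --nodes=1 -n 16 --mem-per-cpu=4G --gres=gpu:h100:1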
You can verify whether a GPU device is found by running nvidia-smi after landing on the node. If you receive any output other than a "No devices were found" message, there is a GPU waiting to be used by your software.
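For example, once your interactive session lands on the node (nvidia-smi -L prints one short line per visible device):

    nvidia-smi        # full status report for the allocated GPU(s)
    nvidia-smi -L     # short listing of the visible GPUs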
Warning
If you submit to the gpu-h100 partition without including --gres=gpu:<# of GPUs per node>, your job will not allocate any GPU cards to your session, preventing you from using the resource at all.
Additional Resources