Job Arrays
Job arrays are typically used when a user wants to submit many similar jobs with different inputs. Job arrays are capable of submitting hundreds, and even thousands, of similar jobs together. Here, we will describe how to submit job arrays on Slurm. More details on job arrays can be found in the Slurm documentation.
An example of a job array submission script can be found in our NREL HPC Slurm Examples directory. The job array example is titled uselist.sh, and requires doarray.py and invertc.c from the source folder.
SBATCH Directives for Job Arrays
In order to submit a job array to Slurm, the SBATCH directives at the top of your script or sbatch command line submission must contain the flag --array=<ARRAY_VALS>
, where ARRAY_VALS
is a list or range of numbers that will represent the index values of your job array. For example:
# SBATCH --array=0-12 # Submits a job array with index values between 0 and 12
...
# SBATCH --array=2,4,6,10 # Submits a job array with index values 2, 4, 6, and 10
...
# SBATCH --array=1-43:2 # Submits a job array with index values between 1 and 43 with a step size of 2
...
# SBATCH --array=1-25%5 # Submits a job array with index values between 1 and 25 and limits the number of simultaneously running tasks to 5
Submitting Job Arrays on Kestrel
To ensure that your job array is running optimally, it is recommended that job arrays are submitted on the shared partition using --partition=shared
. See more about shared partitions on Kestrel here.
Job Control
Like standard slurm jobs, job arrays have a JOB_ID, which is stored in the environment variable SLURM_ARRAY_JOB_ID
. The environment variable SLURM_ARRAY_TASK_ID
will hold information about the index of the job array.
For example, if there is a job array in the queue, the output may look like this:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST
45678_1 standard array user R 0:13 1 x1007c0s0b0n1
45678_2 standard array user R 0:13 1 x1007c0s0b0n1
45678_3 standard array user R 0:13 1 x1007c0s0b0n1
45678_4 standard array user R 0:13 1 x1007c0s0b0n1
Here, SLURM_ARRAY_JOB_ID
is 45678. The number followed by the underscore in row is the SLURM_ARRAY_TASK_ID
. This job is a job array that was submitted with --array=1-4
.
Scontrol commands can be executed on entire job arrays or specific indices of a job array.
$ scontrol suspend 45678
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST
45678_1 standard array user S 0:13 1 x1007c0s0b0n1
45678_2 standard array user S 0:13 1 x1007c0s0b0n1
45678_3 standard array user S 0:13 1 x1007c0s0b0n1
45678_4 standard array user S 0:13 1 x1007c0s0b0n1
$ scontrol resume 45678
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST
45678_1 standard array user R 0:13 1 x1007c0s0b0n1
45678_2 standard array user R 0:13 1 x1007c0s0b0n1
45678_3 standard array user R 0:13 1 x1007c0s0b0n1
45678_4 standard array user R 0:13 1 x1007c0s0b0n1
$ scontrol suspend 45678_2
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST
45678_1 standard array user R 0:13 1 x1007c0s0b0n1
45678_2 standard array user S 0:13 1 x1007c0s0b0n1
45678_3 standard array user R 0:13 1 x1007c0s0b0n1
45678_4 standard array user R 0:13 1 x1007c0s0b0n1
$ scontrol resume 45678_2
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST
45678_1 standard array user R 0:13 1 x1007c0s0b0n1
45678_2 standard array user R 0:13 1 x1007c0s0b0n1
45678_3 standard array user R 0:13 1 x1007c0s0b0n1
45678_4 standard array user R 0:13 1 x1007c0s0b0n1