Using srun to Launch Applications Under Slurm
Subjects covered
- Basics
- Pointers to Examples
- Why not just use mpiexec/mpirun?
- Simple runs
- Threaded (OpenMP) runs
- Hybrid MPI/OpenMP
- MPMD - a simple distribution
- MPMD multinode
1. Basics
Eagle uses the Slurm scheduler, and applications run on compute nodes must be launched via the scheduler. For batch runs, users write a script and submit it with the sbatch command. The script tells the scheduler what resources are required, including a limit on the run time. The script also normally contains "charging" or account information.
Here is a very basic script that just runs hostname to list the nodes allocated for a job.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:01:00
#SBATCH --account=hpcapps
srun hostname
Note that we used the srun command to launch multiple (parallel) instances of our application hostname.
This article primarily discusses options for the srun command to enable good parallel execution. In the script above we asked for two nodes (--nodes=2), with each node running a single instance of hostname (--ntasks-per-node=1). If srun is given no options on the command line, it determines the number of tasks to run from the #SBATCH directives in the header. Thus the output from the script above will be two lines: the list of nodes allocated for the job.
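To submit the script, save it to a file (the name hostname.sh below is just for illustration) and run sbatch; the output lands in a slurm-<jobid>.out file containing, for this script, one hostname per task (job ID and node names will vary):
sbatch hostname.sh
Submitted batch job 1234567
cat slurm-1234567.out
r105u33
r105u37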
2. Pointers to examples
The page https://www.nrel.gov/hpc/eagle-batch-jobs.html has information about running jobs under Slurm, including a link to example batch scripts. The page https://github.com/NREL/HPC/tree/master/slurm has many Slurm examples ranging from simple to complex. This article is based on the second page.
3. Why not just use mpiexec/mpirun?
The srun command is an integral part of the Slurm scheduling system. It "knows" the configuration of the machine and recognizes the environment variables set by the scheduler, such as the number of cores per node. mpiexec and mpirun come with the MPI compilers, and their degree of integration with the scheduler depends on the implementation and how it was installed. They may not enable the best performance for your applications, and in some cases they simply don't work correctly on Eagle. For example, when trying to run MPMD applications (different programs running on different cores) using the mpt version of mpiexec, the same program gets launched on all cores.
4. Simple runs
For our srun examples we will use two glorified "Hello World" programs, one in Fortran and the other in C. They are essentially the same program written in the two languages. They can be compiled as MPI, OpenMP, or as hybrid MPI/OpenMP. They are available from the NREL HPC
repository https://github.com/NREL/HPC.git in the slurm/source directory or by running the wget commands shown below.
wget https://raw.githubusercontent.com/NREL/HPC/master/slurm/source/fhostone.f90
wget https://raw.githubusercontent.com/NREL/HPC/master/slurm/source/mympi.f90
wget https://raw.githubusercontent.com/NREL/HPC/master/slurm/source/phostone.c
wget https://raw.githubusercontent.com/NREL/HPC/master/slurm/source/makehello -O makefile
After the files are downloaded you can build the programs using the mpt MPI compilers:
module purge
module load mpt gcc/10.1.0
make
or using the Intel MPI compilers:
module purge
module load intel-mpi gcc/10.1.0
make
You will end up with the executables:
fomp - Fortran OpenMP program
fhybrid - Fortran hybrid MPI/OpenMP program
fmpi - Fortran MPI program
comp - C OpenMP program
chybrid - C hybrid MPI/OpenMP program
cmpi - C MPI program
These programs have many options; running with the command line option -h will show them. Not all options are applicable to all versions. Run without options, the programs just print the hostname on which they were run.
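For example, to list the options for the C MPI version from within a job allocation (a hypothetical invocation; the other executables work the same way):
srun -n 1 ./cmpi -h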
We look at our simple example again. Here we ask for 2 nodes, 4 tasks per node for a total of 8 tasks.
#!/bin/bash
#SBATCH --job-name="hostname"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks=8
#SBATCH --time=00:10:00
srun ./cmpi
This will produce (sorted) output like:
r105u33
r105u33
r105u33
r105u33
r105u37
r105u37
r105u37
r105u37
In the above script we have nodes, ntasks-per-node, and ntasks. You do not need to specify all three parameters, but any values that are specified must be consistent.
- If nodes is not specified, it defaults to 1.
- If ntasks is not specified, it defaults to 1 task per node.
- You can put --ntasks-per-node and/or --ntasks on the srun line. For example, to run a total of 9 tasks, 5 on one node and 4 on the second:
#!/bin/bash
#SBATCH --job-name="hostname"
#SBATCH --nodes=2
#SBATCH --time=00:10:00
srun --ntasks=9 ./cmpi
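With the default block distribution this produces nine lines of sorted output, five copies of the first node name followed by four of the second, e.g. (node names will vary):
r105u33
r105u33
r105u33
r105u33
r105u33
r105u37
r105u37
r105u37
r105u37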
5. Threaded (OpenMP) runs
The variable used to tell the operating system how many threads to use for an OpenMP program is OMP_NUM_THREADS. In an ideal world you could just set OMP_NUM_THREADS to a value, say 36 (the number of cores on each Eagle node), and each thread would be assigned to its own core. Unfortunately, without setting additional variables you will get the requested number of threads, but the threads might not be spread across all cores. This can result in a significant slowdown. For a computationally intensive program, if two threads get mapped to the same core the runtime doubles, and if all threads end up on the same core the slowdown can actually be greater than the number of cores.
Our example programs, phostone.c and fhostone.f90, have a nice feature: if you add -F to the command line, they produce a report showing the core on which each thread runs. We are going to look at the C version of the code and compile it with both the Intel C compiler, icc, and the GNU compiler, gcc.
ml comp-intel/2020.1.217 gcc/10.1.0
gcc -fopenmp -DNOMPI phostone.c -o comp.gcc
icc -fopenmp -DNOMPI phostone.c -o comp.icc
Run the script...
#!/bin/bash
#SBATCH --job-name="hostname"
#SBATCH --cpus-per-task=36
## ask for 10 minutes
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --partition=debug
export OMP_NUM_THREADS=36
srun ./comp.gcc -F > gcc.out
srun ./comp.icc -F > icc.out
Note we have added the line #SBATCH --cpus-per-task=36. cpus-per-task should match the value of OMP_NUM_THREADS.
We now look at the sorted head of each of the output files
el3:nslurm> cat icc.out | sort -k6,6
task thread node name first task # on node core
0000 0030 r5i7n35 0000 0000 0000
0000 0001 r5i7n35 0000 0000 0001
0000 0034 r5i7n35 0000 0000 0001
0000 0002 r5i7n35 0000 0000 0002
0000 0035 r5i7n35 0000 0000 0002
0000 0032 r5i7n35 0000 0000 0003
. . .
el3:nslurm> cat gcc.out | sort -k6,6
task thread node name first task # on node core
0000 0031 r5i7n35 0000 0000 0000
0000 0001 r5i7n35 0000 0000 0001
0000 0002 r5i7n35 0000 0000 0002
0000 0034 r5i7n35 0000 0000 0002
0000 0003 r5i7n35 0000 0000 0003
0000 0004 r5i7n35 0000 0000 0004
. . .
The last column shows the core on which a thread is run. We see that there is duplication of cores, potentially leading to poor performance.
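A quick way to quantify this is to count the distinct cores actually used; 36 is the ideal value here. The report's data lines begin with a digit and the core is the sixth column, so:
grep ^0 gcc.out | awk '{print $6}' | sort -u | wc -l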
There are two sets of environment variables that can be used to map threads to cores. KMP_AFFINITY is specific to the Intel compilers. OMP_PLACES and OMP_PROC_BIND are part of the OpenMP standard and should work with any OpenMP compiler. These are documented at:
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html
https://www.openmp.org/spec-html/5.0/openmpse52.html
https://www.openmp.org/spec-html/5.0/openmpse53.html
We ran each version of our code 100 times with 5 different settings. The settings were:
- export KMP_AFFINITY=verbose,scatter
- export KMP_AFFINITY=verbose,compact
- export OMP_PLACES=cores
export OMP_PROC_BIND=spread
- export OMP_PLACES=cores
export OMP_PROC_BIND=close
- NONE
The table below shows the results of our runs. In particular, it shows the minimum number of cores used with each setting; 36 is the desired value. We see that for gcc the following settings worked well:
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
or
export OMP_PLACES=cores
export OMP_PROC_BIND=close
Setting KMP_AFFINITY did not work for gcc, but for the Intel compiler KMP_AFFINITY also gave good results.
Compiler | Setting | Worked | min cores | mean cores | max cores |
gcc | cores, close | yes | 36 | 36 | 36 |
gcc | cores, spread | yes | 36 | 36 | 36 |
gcc | KMP_AFFINITY=compact | no | 25 | 34.18 | 36 |
gcc | KMP_AFFINITY=scatter | no | 26 | 34.56 | 36 |
gcc | none | no | 28 | 34.14 | 36 |
| | | | | |
icc | cores, close | yes | 36 | 36 | 36 |
icc | cores, spread | yes | 36 | 36 | 36 |
icc | KMP_AFFINITY=compact | yes | 36 | 36 | 36 |
icc | KMP_AFFINITY=scatter | yes | 36 | 36 | 36 |
icc | none | no | 19 | 23.56 | 29 |
So our final working script for OpenMP programs could be:
#!/bin/bash
#SBATCH --job-name="hostname"
#SBATCH --cpus-per-task=36
## ask for 10 minutes
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --partition=debug
export OMP_NUM_THREADS=36
export OMP_PLACES=cores
export OMP_PROC_BIND=close
#export OMP_PROC_BIND=spread
srun ./comp.gcc -F > gcc.out
srun ./comp.icc -F > icc.out
When a job runs, the environment variable SLURM_CPUS_PER_TASK is set to the value of cpus-per-task, so you may want to
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
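Note that Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is actually specified, so a defensive variant (a suggested pattern, not from the original script) falls back to one thread:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}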
More on this in the next section.
6. Hybrid MPI/OpenMP
The next script is just an extension of the last. We now run hybrid, a combination of MPI and OpenMP. Our base example programs, fhostone.f90 and phostone.c, can be compiled in hybrid mode as well as in pure MPI and pure OpenMP.
First we look at the (sorted) output from our program run in hybrid mode with 4 tasks across two nodes and 4 threads per task.
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0000 r5i0n4 0000 0000 0000
0000 0001 r5i0n4 0000 0000 0004
0000 0002 r5i0n4 0000 0000 0009
0000 0003 r5i0n4 0000 0000 0014
0001 0000 r5i0n4 0000 0001 0018
0001 0001 r5i0n4 0000 0001 0022
0001 0002 r5i0n4 0000 0001 0027
0001 0003 r5i0n4 0000 0001 0032
0002 0000 r5i0n28 0002 0000 0000
0002 0001 r5i0n28 0002 0000 0004
0002 0002 r5i0n28 0002 0000 0009
0002 0003 r5i0n28 0002 0000 0014
0003 0000 r5i0n28 0002 0001 0018
0003 0001 r5i0n28 0002 0001 0022
0003 0002 r5i0n28 0002 0001 0027
0003 0003 r5i0n28 0002 0001 0032
total time 3.009
The first column is the MPI task number, followed by the thread, then the node. The last column is the core on which that given task/thread was run. We can count the unique combinations of node and core by piping the output file through:
grep ^0 | awk '{print $3, $6}' | sort -u | wc -l
We get 16, which is the number of tasks times the number of threads; that is, each task/thread has been assigned its own core. This will give good performance. The script below runs a fixed number of tasks (4 = 2 per node * 2 nodes) using from 1 to cpus-per-task=18 threads.
The variable SLURM_CPUS_PER_TASK is set by Slurm to the value of cpus-per-task. After the srun line we post-process the output to report core usage.
#!/bin/bash
#SBATCH --account=hpcapps
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --partition=short
#SBATCH --cpus-per-task=18
#SBATCH --ntasks=4
module purge
module load intel-mpi/2020.1.217 gcc/10.1.0
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
echo "CPT TASKS THREADS cores"
for n in `seq 1 $SLURM_CPUS_PER_TASK` ; do
request=`python -c "print($n*$SLURM_NTASKS)"`
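# total cores available: 2 nodes * 36 cores/node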
have=72
if ((request <= have)); then
export OMP_NUM_THREADS=$n
srun --ntasks-per-core=1 -n $SLURM_NTASKS ./phostone.icc -F -t 3 > out.$SLURM_NTASKS.$OMP_NUM_THREADS
# post process
cores=`cat out.$SLURM_NTASKS.$OMP_NUM_THREADS | grep ^0 | awk '{print $3, $6}' | sort -u | wc -l`
echo $SLURM_CPUS_PER_TASK " " $SLURM_NTASKS " " $OMP_NUM_THREADS " " $cores
fi
done
Our final output from this script is:
el3:stuff> cat slurm-7002718.out
CPT TASKS THREADS cores
18 4 1 4
18 4 2 8
18 4 3 12
18 4 4 16
18 4 5 20
18 4 6 24
18 4 7 28
18 4 8 32
18 4 9 36
18 4 10 40
18 4 11 44
18 4 12 48
18 4 13 52
18 4 14 56
18 4 15 60
18 4 16 64
18 4 17 68
18 4 18 72
el3:stuff>
The important lines are:
#SBATCH --cpus-per-task=18
. . .
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
. . .
srun --ntasks-per-core=1 -n $SLURM_NTASKS ./phostone.icc
We need to set cpus-per-task to tell Slurm that we are going to run multithreaded and how many cores to reserve for our threads. This should be set to the maximum number of threads per task we expect to use.
We use the OMP variables to map threads to cores. IMPORTANT: using KMP_AFFINITY will not give the desired results. It will cause all threads for a task to be mapped to a single core.
We can run this script for hybrid MPI/OpenMP programs as is or set the number of cpus-per-task and tasks on the sbatch command line. For example:
sbatch --cpus-per-task=9 --ntasks=8 simple
gives us:
el3:stuff> cat slurm-7002858.out
CPT TASKS THREADS cores
9 8 1 8
9 8 2 16
9 8 3 24
9 8 4 32
9 8 5 40
9 8 6 48
9 8 7 56
9 8 8 64
9 8 9 72
el3:stuff>
7. MPMD - a simple distribution
Here we look at launching Multiple Program Multiple Data (MPMD) runs. We use the --multi-prog option with srun. This involves creating a config_file that lists the programs we are going to run along with their task IDs. See https://computing.llnl.gov/tutorials/linux_clusters/multi-prog.html for a quick description of the config_file format.
Here we create the file on the fly but it could be done beforehand.
We have two MPI programs to run together, phostone and fhostone; they are actually the same program written in C and Fortran. In the real world, an MPMD application might run a GUI or a manager as one task, with the rest of the tasks doing computation.
The syntax for running MPMD programs is
srun --multi-prog mapfile
where mapfile is a config_file that lists the programs to run.
It is possible to pass different arguments to each program as discussed in the link above. Here we just add command line arguments for task 0.
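For reference, the config_file format also accepts comma-separated task lists and ranges (see the srun man page), so a hand-written equivalent of the mapfile our script builds below would be:
0 ./phostone -F
2,4,6 ./phostone
1,3,5,7 ./fhostone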
Our mapfile has 8 programs listed. The even tasks run phostone and the odd tasks run fhostone. Our script uses two for loops to add lines to the mapfile and then uses sed to append command line arguments to the first line.
#!/bin/bash
#SBATCH --account=hpcapps
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --partition=debug
#SBATCH --cpus-per-task=1
# create our mapfile, removing any copy left from a previous run
rm -f mapfile
app1=./phostone
for n in 0 2 4 6 ; do
echo $n $app1 >> mapfile
done
app2=./fhostone
for n in 1 3 5 7 ; do
echo $n $app2 >> mapfile
done
# add a command line option to the first line
# sed does an in-place change to the first line
# of our mapfile adding *-F*
sed -i "1 s/$/ -F /" mapfile
cat mapfile
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -n8 --multi-prog mapfile
Here is the complete output, including the mapfile and the output from our two programs. Lines with three digits in the core column were produced by the Fortran version of the program.
el3:stuff> cat *7003104*
0 ./phostone -F
2 ./phostone
4 ./phostone
6 ./phostone
1 ./fhostone
3 ./fhostone
5 ./fhostone
7 ./fhostone
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0000 r1i7n35 0000 0000 0022
0001 0000 r1i7n35 0001 0000 021
0002 0000 r1i7n35 0000 0001 0027
0003 0000 r1i7n35 0003 0000 023
0004 0000 r1i7n35 0000 0002 0020
0005 0000 r1i7n35 0005 0000 025
0006 0000 r1i7n35 0000 0003 0026
0007 0000 r1i7n35 0007 0000 019
el3:stuff>
8. MPMD multinode
Our final example again extends the previous one. We want to launch different numbers of tasks on a set of nodes and at the same time run different programs on each node. We create a mapfile to list the programs to run, as was done above. In this case, for illustration purposes, we run one copy of phostone and seven instances of fhostone.
We add to that a hostfile that lists the nodes on which to run. The hostfile has one host per MPI task.
#!/bin/bash
#SBATCH --account=hpcapps
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --partition=debug
export OMP_NUM_THREADS=1
# Create our mapfile
rm -rf mapfile
app1=./phostone
for n in 0 ; do
echo $n $app1 >> mapfile
done
app2=./fhostone
for n in 1 2 3 4 5 6 7 ; do
echo $n $app2 >> mapfile
done
# Add a command line option to the first line
# sed does an in-place change to the first line
# of our mapfile adding *-F*
sed -i "1 s/$/ -F /" mapfile
# Count of each app to run on a node
counts="1 7"
# Get a list of nodes on a single line
nodes=`scontrol show hostnames | tr '\n' ' '`
# Create our hostfile and tell slurm its name
export SLURM_HOSTFILE=hostlist
# It is possible to do this in bash but
# I think this is easier to understand
# in python. It uses the values for
# counts and nodes set above.
python - > $SLURM_HOSTFILE << EOF
c="$counts".split()
nodes="$nodes".split()
k=0
for i in c:
i=int(i)
node=nodes[k]
for j in range(0,i):
print(node)
k=k+1
EOF
srun -n 8 --multi-prog mapfile
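For comparison, here is a pure-bash version of the python heredoc above, a sketch that assumes the counts and nodes variables set earlier in the script:
k=1
for c in $counts ; do
  # pick the k-th node name from the list
  node=`echo $nodes | awk -v n=$k '{print $n}'`
  # emit that node name once per task assigned to it
  for ((j=0; j<c; j++)) ; do echo $node ; done
  k=$((k+1))
done > $SLURM_HOSTFILE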
Here is the output from our run, including the mapfile and hostlist. Notice that the first MPI task runs the C version of the program; it is the only thing running on the first node. The rest of the MPI tasks run the Fortran version on the second node.
el3:stuff> cat slurm-7003587.out | sort -k3,3 -k1,1
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0000 r102u34 0000 0000 0004
0001 0000 r102u35 0001 0000 003
0002 0000 r102u35 0002 0000 000
0003 0000 r102u35 0001 0001 006
0004 0000 r102u35 0004 0000 007
0005 0000 r102u35 0001 0002 004
0006 0000 r102u35 0002 0001 005
0007 0000 r102u35 0001 0003 002
el3:stuff> cat mapfile
0 ./phostone -F
1 ./fhostone
2 ./fhostone
3 ./fhostone
4 ./fhostone
5 ./fhostone
6 ./fhostone
7 ./fhostone
el3:stuff> cat hostlist
r102u34
r102u35
r102u35
r102u35
r102u35
r102u35
r102u35
r102u35
el3:stuff>