Using srun to Launch Applications Under Slurm
Subjects covered
- Basics
- Pointers to Examples
- Why not just use mpiexec/mpirun?
- Simple runs
- Threaded (OpenMP) runs
- Hybrid MPI/OpenMP
- MPMD - a simple distribution
- MPMD multinode
1. Basics
Eagle uses the Slurm scheduler, and applications run on compute nodes must be launched via the scheduler. For batch runs, users write a script and submit it with the sbatch command. The script tells the scheduler what resources are required, including a limit on the run time. The script also normally contains "charging" or account information.
Here is a very basic script that just runs hostname to list the nodes allocated for a job.
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:01:00
#SBATCH --account=hpcapps
srun hostname
Note that we used the srun command to launch multiple (parallel) instances of our application hostname.
This article primarily discusses options for the srun command to enable good parallel execution. In the script above we asked for two nodes (--nodes=2), with each node running a single instance of hostname (--ntasks-per-node=1). If srun is given no options on the command line, it determines the number of tasks to run from the #SBATCH directives in the header. Thus the output from the script above will be two lines: the list of nodes allocated for the job.
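To submit the script, save it to a file (the name hostname.sh below is just for illustration) and run sbatch; the output lands in a slurm-<jobid>.out file containing, for this script, one hostname per task (job ID and node names will vary):
sbatch hostname.sh
Submitted batch job 1234567
cat slurm-1234567.out
r105u33
r105u37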
2. Pointers to examples
The page https://www.nrel.gov/hpc/eagle-batch-jobs.html has information about running jobs under Slurm, including a link to example batch scripts. The page https://github.com/NREL/HPC/tree/master/slurm has many Slurm examples ranging from simple to complex. This article is based on the second page.
3. Why not just use mpiexec/mpirun?
The srun command is an integral part of the Slurm scheduling system. It "knows" the configuration of the machine and recognizes the environment variables set by the scheduler, such as the number of cores per node. mpiexec and mpirun come with the MPI compilers, and their degree of integration with the scheduler depends on the implementation and how it was installed. They may not enable the best performance for your applications, and in some cases they simply don't work correctly on Eagle. For example, when trying to run MPMD applications (different programs running on different cores) using the mpt version of mpiexec, the same program gets launched on all cores.
4. Simple runs
For our srun examples we will use two glorified "Hello World" programs, one in Fortran and the other in C. They are essentially the same program written in the two languages. They can be compiled as MPI, OpenMP, or as hybrid MPI/OpenMP. They are available from the NREL HPC
repository https://github.com/NREL/HPC.git in the slurm/source directory or by running the wget commands shown below.
wget https://raw.githubusercontent.com/NREL/HPC/master/slurm/source/fhostone.f90
wget https://raw.githubusercontent.com/NREL/HPC/master/slurm/source/mympi.f90
wget https://raw.githubusercontent.com/NREL/HPC/master/slurm/source/phostone.c
wget https://raw.githubusercontent.com/NREL/HPC/master/slurm/source/makehello -O makefile
After the files are downloaded you can build the programs using the mpt MPI compilers:
module purge
module load mpt gcc/10.1.0
make
or using the Intel MPI compilers:
module purge
module load intel-mpi gcc/10.1.0
make
You will end up with the executables:
fomp - Fortran OpenMP program
fhybrid - Fortran hybrid MPI/OpenMP program
fmpi - Fortran MPI program
comp - C OpenMP program
chybrid - C hybrid MPI/OpenMP program
cmpi - C MPI program
These programs have many options; running with the command line option -h will show them. Not all options are applicable to all versions. Run without options, the programs just print the hostname on which they were run.
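For example, to list the options for the C MPI version from within a job allocation (a hypothetical invocation; the other executables work the same way):
srun -n 1 ./cmpi -h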
We look at our simple example again. Here we ask for 2 nodes, 4 tasks per node for a total of 8 tasks.
#!/bin/bash
#SBATCH --job-name="hostname"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks=8
#SBATCH --time=00:10:00
srun ./cmpi
This will produce (sorted) output like:
r105u33
r105u33
r105u33
r105u33
r105u37
r105u37
r105u37
r105u37
In the above script we have nodes, ntasks-per-node, and ntasks. You do not need to specify all three parameters, but any values that are specified must be consistent.
- If nodes is not specified, it defaults to 1.
- If ntasks is not specified, it defaults to 1 task per node.
- You can put --ntasks-per-node and/or --ntasks on the srun line. For example, to run a total of 9 tasks, 5 on one node and 4 on the second:
#!/bin/bash
#SBATCH --job-name="hostname"
#SBATCH --nodes=2
#SBATCH --time=00:10:00
srun --ntasks=9 ./cmpi
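With the default block distribution this produces nine lines of sorted output, five copies of the first node name followed by four of the second, e.g. (node names will vary):
r105u33
r105u33
r105u33
r105u33
r105u33
r105u37
r105u37
r105u37
r105u37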
5. Threaded (OpenMP) runs
The variable used to tell the operating system how many threads to use for an OpenMP program is OMP_NUM_THREADS. In an ideal world you could just set OMP_NUM_THREADS to a value, say 36 (the number of cores on each Eagle node), and each thread would be assigned to its own core. Unfortunately, without setting additional variables you will get the requested number of threads, but the threads might not be spread across all cores. This can result in a significant slowdown. For a computationally intensive program, if two threads get mapped to the same core the runtime doubles, and if all threads end up on the same core the slowdown can actually be greater than the number of cores.
Our example programs, phostone.c and fhostone.f90, have a nice feature: if you add -F to the command line, they produce a report showing the core on which each thread runs. We are going to look at the C version of the code and compile it with both the Intel C compiler, icc, and the GNU compiler, gcc.
ml comp-intel/2020.1.217 gcc/10.1.0
gcc -fopenmp -DNOMPI phostone.c -o comp.gcc
icc -fopenmp -DNOMPI phostone.c -o comp.icc
Run the script...
#!/bin/bash
#SBATCH --job-name="hostname"
#SBATCH --cpus-per-task=36
## ask for 10 minutes
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --partition=debug
export OMP_NUM_THREADS=36
srun ./comp.gcc -F > gcc.out
srun ./comp.icc -F > icc.out
Note we have added the line #SBATCH --cpus-per-task=36. cpus-per-task should match the value of OMP_NUM_THREADS.
We now look at the sorted head of each of the output files
el3:nslurm> cat icc.out | sort -k6,6
task thread node name first task # on node core
0000 0030 r5i7n35 0000 0000 0000
0000 0001 r5i7n35 0000 0000 0001
0000 0034 r5i7n35 0000 0000 0001
0000 0002 r5i7n35 0000 0000 0002
0000 0035 r5i7n35 0000 0000 0002
0000 0032 r5i7n35 0000 0000 0003
. . .
el3:nslurm> cat gcc.out | sort -k6,6
task thread node name first task # on node core
0000 0031 r5i7n35 0000 0000 0000
0000 0001 r5i7n35 0000 0000 0001
0000 0002 r5i7n35 0000 0000 0002
0000 0034 r5i7n35 0000 0000 0002
0000 0003 r5i7n35 0000 0000 0003
0000 0004 r5i7n35 0000 0000 0004
. . .
The last column shows the core on which a thread is run. We see that there is duplication of cores, potentially leading to poor performance.
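A quick way to quantify this is to count the distinct cores actually used; 36 is the ideal value here. The report's data lines begin with a digit and the core is the sixth column, so:
grep ^0 gcc.out | awk '{print $6}' | sort -u | wc -l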
There are two sets of environment variables that can be used to map threads to cores. KMP_AFFINITY is specific to the Intel compilers. OMP_PLACES and OMP_PROC_BIND are part of the OpenMP standard and should work with any OpenMP compiler. These are documented at:
https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/openmp-support/openmp-library-support/thread-affinity-interface-linux-and-windows.html
https://www.openmp.org/spec-html/5.0/openmpse52.html
https://www.openmp.org/spec-html/5.0/openmpse53.html
We ran each version of our code 100 times with 5 different settings. The settings were:
- export KMP_AFFINITY=verbose,scatter
- export KMP_AFFINITY=verbose,compact
- export OMP_PLACES=cores
export OMP_PROC_BIND=spread
- export OMP_PLACES=cores
export OMP_PROC_BIND=close
- NONE
The table below shows the results of our runs. In particular, it shows the minimum number of cores used with each setting; 36 is the desired value. We see that for gcc the following settings worked well:
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
or
export OMP_PLACES=cores
export OMP_PROC_BIND=close
Setting KMP_AFFINITY did not work for gcc, but for the Intel compiler KMP_AFFINITY also gave good results.
Compiler | Setting | Worked | min cores | mean cores | max cores |
gcc | cores, close | yes | 36 | 36 | 36 |
gcc | cores, spread | yes | 36 | 36 | 36 |
gcc | KMP_AFFINITY=compact | no | 25 | 34.18 | 36 |
gcc | KMP_AFFINITY=scatter | no | 26 | 34.56 | 36 |
gcc | none | no | 28 | 34.14 | 36 |
| | | | | |
icc | cores, close | yes | 36 | 36 | 36 |
icc | cores, spread | yes | 36 | 36 | 36 |
icc | KMP_AFFINITY=compact | yes | 36 | 36 | 36 |
icc | KMP_AFFINITY=scatter | yes | 36 | 36 | 36 |
icc | none | no | 19 | 23.56 | 29 |
So our final working script for OpenMP programs could be:
#!/bin/bash
#SBATCH --job-name="hostname"
#SBATCH --cpus-per-task=36
## ask for 10 minutes
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --partition=debug
export OMP_NUM_THREADS=36
export OMP_PLACES=cores
export OMP_PROC_BIND=close
#export OMP_PROC_BIND=spread
srun ./comp.gcc -F > gcc.out
srun ./comp.icc -F > icc.out
When a job runs, the environment variable SLURM_CPUS_PER_TASK is set to the value of cpus-per-task, so you may want to
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
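Note that Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is actually specified, so a defensive variant (a suggested pattern, not from the original script) falls back to one thread:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}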
More on this in the next section.
6. Hybrid MPI/OpenMP
The next script is just an extension of the last. We now run hybrid, a combination of MPI and OpenMP. Our base example programs, fhostone.f90 and phostone.c, can be compiled in hybrid mode as well as in pure MPI and pure OpenMP.
First we look at the (sorted) output from our program run in hybrid mode with 4 tasks across two nodes and 4 threads per task.
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0000 r5i0n4 0000 0000 0000
0000 0001 r5i0n4 0000 0000 0004
0000 0002 r5i0n4 0000 0000 0009
0000 0003 r5i0n4 0000 0000 0014
0001 0000 r5i0n4 0000 0001 0018
0001 0001 r5i0n4 0000 0001 0022
0001 0002 r5i0n4 0000 0001 0027
0001 0003 r5i0n4 0000 0001 0032
0002 0000 r5i0n28 0002 0000 0000
0002 0001 r5i0n28 0002 0000 0004
0002 0002 r5i0n28 0002 0000 0009
0002 0003 r5i0n28 0002 0000 0014
0003 0000 r5i0n28 0002 0001 0018
0003 0001 r5i0n28 0002 0001 0022
0003 0002 r5i0n28 0002 0001 0027
0003 0003 r5i0n28 0002 0001 0032
total time 3.009
The first column is the MPI task number, followed by the thread, then the node. The last column is the core on which that given task/thread was run. We can count the unique combinations of node and core by piping the output file through:
grep ^0 | awk '{print $3, $6}' | sort -u | wc -l
We get 16, which is the number of tasks times the number of threads; that is, each task/thread has been assigned its own core. This will give good performance. The script below runs a fixed number of tasks (4 = 2 per node * 2 nodes) using from 1 to cpus-per-task=18 threads.
The variable SLURM_CPUS_PER_TASK is set by Slurm to the value of cpus-per-task. After the srun line we post-process the output to report core usage.
#!/bin/bash
#SBATCH --account=hpcapps
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --partition=short
#SBATCH --cpus-per-task=18
#SBATCH --ntasks=4
module purge
module load intel-mpi/2020.1.217 gcc/10.1.0
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
echo "CPT TASKS THREADS cores"
for n in `seq 1 $SLURM_CPUS_PER_TASK` ; do
request=`python -c "print($n*$SLURM_NTASKS)"`
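# total cores available: 2 nodes * 36 cores/node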
have=72
if ((request <= have)); then
export OMP_NUM_THREADS=$n
srun --ntasks-per-core=1 -n $SLURM_NTASKS ./phostone.icc -F -t 3 > out.$SLURM_NTASKS.$OMP_NUM_THREADS
# post process
cores=`cat out.$SLURM_NTASKS.$OMP_NUM_THREADS | grep ^0 | awk '{print $3, $6}' | sort -u | wc -l`
echo $SLURM_CPUS_PER_TASK " " $SLURM_NTASKS " " $OMP_NUM_THREADS " " $cores
fi
done
Our final output from this script is:
el3:stuff> cat slurm-7002718.out
CPT TASKS THREADS cores
18 4 1 4
18 4 2 8
18 4 3 12
18 4 4 16
18 4 5 20
18 4 6 24
18 4 7 28
18 4 8 32
18 4 9 36
18 4 10 40
18 4 11 44
18 4 12 48
18 4 13 52
18 4 14 56
18 4 15 60
18 4 16 64
18 4 17 68
18 4 18 72
el3:stuff>
The important lines are:
#SBATCH --cpus-per-task=18
. . .
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
. . .
srun --ntasks-per-core=1 -n $SLURM_NTASKS ./phostone.icc
We need to set cpus-per-task to tell Slurm that we are going to run multithreaded and how many cores to reserve for our threads. This should be set to the maximum number of threads per task we expect to use.
We use the OMP variables to map threads to cores. IMPORTANT: using KMP_AFFINITY will not give the desired results. It will cause all threads for a task to be mapped to a single core.
We can run this script for hybrid MPI/OpenMP programs as is or set the number of cpus-per-task and tasks on the sbatch command line. For example:
sbatch --cpus-per-task=9 --ntasks=8 simple
gives us:
el3:stuff> cat slurm-7002858.out
CPT TASKS THREADS cores
9 8 1 8
9 8 2 16
9 8 3 24
9 8 4 32
9 8 5 40
9 8 6 48
9 8 7 56
9 8 8 64
9 8 9 72
el3:stuff>
7. MPMD - a simple distribution
Here we look at launching Multiple Program Multiple Data (MPMD) runs. We use the --multi-prog option with srun. This involves creating a config_file that lists the programs we are going to run along with their task IDs. See https://computing.llnl.gov/tutorials/linux_clusters/multi-prog.html for a quick description of the config_file format.
Here we create the file on the fly but it could be done beforehand.
We have two MPI programs to run together, phostone and fhostone; they are actually the same program written in C and Fortran. In the real world, an MPMD application might run a GUI or a manager as one task, with the rest of the tasks doing computation.
The syntax for running MPMD programs is
srun --multi-prog mapfile
where mapfile is a config_file that lists the programs to run.
It is possible to pass different arguments to each program as discussed in the link above. Here we just add command line arguments for task 0.
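For reference, the config_file format also accepts comma-separated task lists and ranges (see the srun man page), so a hand-written equivalent of the mapfile our script builds below would be:
0 ./phostone -F
2,4,6 ./phostone
1,3,5,7 ./fhostone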
Our mapfile has 8 programs listed. The even tasks run phostone and the odd tasks run fhostone. Our script uses two for loops to add lines to the mapfile and then uses sed to append command line arguments to the first line.
#!/bin/bash
#SBATCH --account=hpcapps
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --partition=debug
#SBATCH --cpus-per-task=1
# create our mapfile, removing any copy left from a previous run
rm -f mapfile
app1=./phostone
for n in 0 2 4 6 ; do
echo $n $app1 >> mapfile
done
app2=./fhostone
for n in 1 3 5 7 ; do
echo $n $app2 >> mapfile
done
# add a command line option to the first line
# sed does an in-place change to the first line
# of our mapfile adding *-F*
sed -i "1 s/$/ -F /" mapfile
cat mapfile
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -n8 --multi-prog mapfile
Here is the complete output, including the mapfile and the output from our two programs. Lines with three digits in the core column were produced by the Fortran version of the program.
el3:stuff> cat *7003104*
0 ./phostone -F
2 ./phostone
4 ./phostone
6 ./phostone
1 ./fhostone
3 ./fhostone
5 ./fhostone
7 ./fhostone
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0000 r1i7n35 0000 0000 0022
0001 0000 r1i7n35 0001 0000 021
0002 0000 r1i7n35 0000 0001 0027
0003 0000 r1i7n35 0003 0000 023
0004 0000 r1i7n35 0000 0002 0020
0005 0000 r1i7n35 0005 0000 025
0006 0000 r1i7n35 0000 0003 0026
0007 0000 r1i7n35 0007 0000 019
el3:stuff>
8. MPMD multinode
Our final example again extends the previous one. We want to launch different numbers of tasks on a set of nodes and at the same time run different programs on each node. We create a mapfile to list the programs to run, as was done above. In this case, for illustration purposes, we run one copy of phostone and seven instances of fhostone.
We add to that a hostfile that lists the nodes on which to run. The hostfile has one host per MPI task.
#!/bin/bash
#SBATCH --account=hpcapps
#SBATCH --time=00:10:00
#SBATCH --nodes=2
#SBATCH --partition=debug
export OMP_NUM_THREADS=1
# Create our mapfile
rm -rf mapfile
app1=./phostone
for n in 0 ; do
echo $n $app1 >> mapfile
done
app2=./fhostone
for n in 1 2 3 4 5 6 7 ; do
echo $n $app2 >> mapfile
done
# Add a command line option to the first line
# sed does an in-place change to the first line
# of our mapfile adding *-F*
sed -i "1 s/$/ -F /" mapfile
# Count of each app to run on a node
counts="1 7"
# Get a list of nodes on a single line
nodes=`scontrol show hostnames | tr '\n' ' '`
# Create our hostfile and tell slurm its name
export SLURM_HOSTFILE=hostlist
# It is possible to do this in bash but
# I think this is easier to understand
# in python. It uses the values for
# counts and nodes set above.
python - > $SLURM_HOSTFILE << EOF
c="$counts".split()
nodes="$nodes".split()
k=0
for i in c:
i=int(i)
node=nodes[k]
for j in range(0,i):
print(node)
k=k+1
EOF
srun -n 8 --multi-prog mapfile
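For comparison, here is a pure-bash version of the python heredoc above, a sketch that assumes the counts and nodes variables set earlier in the script:
k=1
for c in $counts ; do
  # pick the k-th node name from the list
  node=`echo $nodes | awk -v n=$k '{print $n}'`
  # emit that node name once per task assigned to it
  for ((j=0; j<c; j++)) ; do echo $node ; done
  k=$((k+1))
done > $SLURM_HOSTFILE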
Here is the output from our run, including the mapfile and hostlist. Notice that the first MPI task runs the C version of the program; it is the only thing running on the first node. The rest of the MPI tasks run the Fortran version on the second node.
el3:stuff> cat slurm-7003587.out | sort -k3,3 -k1,1
MPI VERSION Intel(R) MPI Library 2019 Update 7 for Linux* OS
task thread node name first task # on node core
0000 0000 r102u34 0000 0000 0004
0001 0000 r102u35 0001 0000 003
0002 0000 r102u35 0002 0000 000
0003 0000 r102u35 0001 0001 006
0004 0000 r102u35 0004 0000 007
0005 0000 r102u35 0001 0002 004
0006 0000 r102u35 0002 0001 005
0007 0000 r102u35 0001 0003 002
el3:stuff> cat mapfile
0 ./phostone -F
1 ./fhostone
2 ./fhostone
3 ./fhostone
4 ./fhostone
5 ./fhostone
6 ./fhostone
7 ./fhostone
el3:stuff> cat hostlist
r102u34
r102u35
r102u35
r102u35
r102u35
r102u35
r102u35
r102u35
el3:stuff>