Slurm Hardware Control

Slurm provides advanced hardware control features that are not available in the default Torc workflow. This page describes how to use them while still benefiting from Torc’s workflow management.

In short, schedule the torc worker processes yourself with sbatch scripts that set the Slurm directives you need. Jobs started by those worker processes inherit the Slurm-defined settings.

Slurm CPU management

You may want to use Slurm’s CPU management features, for example to bind processes to specific cores or sockets.

Suppose that a compute node has two sockets with 18 cores each (36 cores total) and that each job will consume 18 cores.

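This example starts two torc worker processes through Slurm. srun binds each worker process to the cores of a different socket: the masks 0x3ffff and 0xffffc0000 select CPU IDs 0-17 and 18-35, respectively, which assumes that the node numbers the cores of each socket contiguously. Each job process run by a worker inherits that worker’s binding.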

Note

You must replace <url> and <workflow-key> below with your actual values. If you added them to your ~/.torc_settings.toml file, you can omit the -u and -k options.

#!/bin/bash
#SBATCH --account=my_account
#SBATCH --job-name=my_job
#SBATCH --time=04:00:00
#SBATCH --output=output/job_output_%j.o
#SBATCH --error=output/job_output_%j.e
#SBATCH --nodes=1

srun -c18 -n2 --cpu-bind=mask_cpu:0x3ffff,0xffffc0000 \
    torc -u <url> -k <workflow-key> hpc slurm run-jobs -o output --is-subtask --max-parallel-jobs=1

Key points:

  • Tell Slurm how many CPUs (-c) to give to each torc worker (and user job) and how many torc workers to start (-n).

  • Tell torc that the worker is a subtask (--is-subtask).

  • Tell torc that each worker should only run one job at a time (--max-parallel-jobs=1).
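
The two CPU masks passed to srun can be derived with shell arithmetic instead of being written by hand. The following is a minimal sketch that assumes socket 0 holds CPU IDs 0-17 and socket 1 holds CPU IDs 18-35; the variable names are illustrative only, and the actual numbering on your node should be checked first (for example with lscpu -e).

```shell
#!/bin/bash
# Assumption: socket 0 = CPU IDs 0-17, socket 1 = CPU IDs 18-35.
cores_per_socket=18

# A run of N set bits is (1 << N) - 1.
# Shifting it left by N moves the run onto the next socket's CPU IDs.
socket0_mask=$(printf '0x%x' $(( (1 << cores_per_socket) - 1 )))
socket1_mask=$(printf '0x%x' $(( ((1 << cores_per_socket) - 1) << cores_per_socket )))

# Prints: --cpu-bind=mask_cpu:0x3ffff,0xffffc0000
echo "--cpu-bind=mask_cpu:${socket0_mask},${socket1_mask}"
```

If your node interleaves CPU IDs across sockets, these contiguous masks would not match the socket boundaries and would need to be built differently.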

For more srun --cpu-bind options, refer to its man page (man srun or pinfo srun).
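
Before launching real work, it can help to confirm that the binding took effect. The sketch below is one possible check, not part of torc: it prints the CPU IDs that the current process is allowed to run on, read from /proc/self/status (Linux only). The function name report_binding and the script name check_binding.sh are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report which CPU IDs this process may run on.
# To check a binding, save this as check_binding.sh and run it in place of torc:
#   srun -c18 -n2 --cpu-bind=verbose,mask_cpu:0x3ffff,0xffffc0000 ./check_binding.sh
report_binding() {
    local status_file=/proc/self/status
    # SLURM_PROCID is set by srun for each task; "?" is shown outside Slurm.
    if [ -r "$status_file" ]; then
        echo "task ${SLURM_PROCID:-?}: $(grep '^Cpus_allowed_list' "$status_file")"
    else
        echo "task ${SLURM_PROCID:-?}: Cpus_allowed_list: unavailable"
    fi
}

report_binding
```

With the masks from the example above, you would expect one task to report CPU IDs 0-17 and the other 18-35. Adding verbose to --cpu-bind also makes srun itself report the binding before each task runs.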

Resource monitoring

Torc does not monitor overall node resource utilization when --is-subtask is set. You can still enable per-job process monitoring, but be aware that torc starts one monitoring subprocess for each worker process.