Skip to content

Commands to Monitor and Control Jobs#

Slurm includes a suite of command-line tools used to submit, monitor, and control jobs and the job queue.

Command Description
squeue Show the Slurm queue. Users can specify JOBID or USER.
scontrol Controls various aspects of jobs such as job suspension, re-queuing or resuming jobs and can display diagnostic info about each job.
scancel Cancel specified job(s).
sinfo View information about all Slurm nodes and partitions.
sacct Detailed information on accounting for all jobs and job steps.
sprio View priority and the factors that determine scheduling priority.

Please see man pages on the cluster for more information on each command. Also see --help or --usage flags for each.

Our Presentation on Advanced Slurm Features is also available as a resource, which has supplementary information on how to manage jobs.

Another great resource for Slurm at NREL is this repository on Github.

Usage Examples#

squeue#

The squeue command is used to view the current state of jobs in the queue.

To show your jobs:

$ squeue -u hpcuser
           JOBID    PARTITION       NAME      USER   ST       TIME      NODES   NODELIST(REASON)
          506955          gpu   wait_tes   hpcuser   PD       0:00          1      (Resources)

To show all jobs in the queue with extended information:

$ squeue -l
Thu Dec 13 12:17:31 2018
 JOBID  PARTITION NAME     USER     STATE   TIME    TIME_LIMIT   NODES  NODELIST(REASON)
 516890 standard Job007    user1    PENDING 0:00    12:00:00    1050   (Dependency)
 516891 standard Job008    user1    PENDING 0:00    12:00:00    1050   (Dependency)
 516897      gpu Job009    user2    PENDING 0:00    04:00:00       1   (Resources)
 516898 standard Job010    user3    PENDING 0:00    15:00:00      71   (Priority)
 516899 standard Job011    user3    PENDING 0:00    15:00:00      71   (Priority)
-----------------------------------------------------------------------------
 516704 standard Job001    user4    RUNNING 4:09:48 15:00:00      71    r1i0n[0-35],r1i1n[0-34]
 516702 standard Job002    user4    RUNNING 4:16:50 15:00:00      71    r1i6n35,r1i7n[0-35],r2i0n[0-33]
 516703 standard Job003    user4    RUNNING 4:16:57 15:00:00      71    r1i5n[0-35],r1i6n[0-34]
 516893 standard Job004    user4    RUNNING 7:19     3:00:00      71    r1i1n35,r1i2n[0-35],r1i3n[0-33]
 516894 standard Job005    user4    RUNNING 7:19     3:00:00      71    r4i2n[20-25],r6i6n[7-35],r6i7n[0-35]
 516895 standard Job006    user4    RUNNING 7:19     3:00:00      71    r4i2n[29-35],r4i3n[0-35],r4i4n[0-20]

To estimate when your jobs will start to run, use the squeue --start command with the JOBID.

Note that the Slurm start times are only an estimate, and are updated frequently based on the current state of the queue and the specified --time of all jobs in the queue.

$ squeue --start -j 509851,509852
 JOBID    PARTITION    NAME      USER      ST          START_TIME    NODES   SCHEDNODES   NODELIST(REASON)
 509851   short      test1.sh   hpcuser    PD                 N/A      100       (null)       (Dependency)
 509852   short      test2.sh   hpcuser    PD 2018-12-19T16:54:00        1      r1i6n35         (Priority)

Output Customization of the squeue Command#

The displayed fields in squeue can be highly customized to display the information that's most relevant for the user by using the -o or -O flags. The full list of customizable fields can be found under the entries for these flags in the man squeue command on the system.

By setting the environment variable export $SQUEUE_FORMAT, you can override the system's default squeue fields with your own. For example, if you run the following line (or place it in your ~/.bashrc or ~/.bash_aliases file to make it persistent across logins):

export SQUEUE_FORMAT="%.18i %.15P %.8q %.12a %.8p %.8j %.8u %.2t %.10M %.6D %R"

Using squeue will now provide the formatted output:

JOBID    PARTITION   QOS    ACCOUNT   PRIORITY     NAME     USER    ST     TIME    NODES NODELIST(REASON)
13141110 standard   normal  csc000    0.051768    my_job   hpcuser  R   2-04:01:17   1    r1i3n29

Or you may wish to add the %V to show the timestamp that a job was submitted, and sort by timestamp, ascending:

squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %20V %6q %12l %R" -S "V"

Example output:

             JOBID PARTITION     NAME     USER ST       TIME  NODES SUBMIT_TIME          QOS    TIME_LIMIT   NODELIST(REASON)
          13166762    bigmem    first  hpcuser PD       0:00      1 2023-08-30T14:08:11  high   2-00:00:00   (Priority)
          13166761    bigmem       P5  hpcuser PD       0:00      1 2023-08-30T14:08:11  high   2-00:00:00   (Priority)
          13166760    bigmem       P4  hpcuser PD       0:00      1 2023-08-30T14:08:11  high   2-00:00:00   (Priority)
          13166759    bigmem      Qm3  hpcuser PD       0:00      1 2023-08-30T14:08:11  high   2-00:00:00   (Priority)
          13166758    bigmem       P2  hpcuser PD       0:00      1 2023-08-30T14:08:11  high   2-00:00:00   (Priority)
          13166757    bigmem       G1  hpcuser PD       0:00      1 2023-08-30T14:08:11  high   2-00:00:00   (Priority)
          13167383    bigmem       r8  hpcuser PD       0:00      1 2023-08-30T16:25:52  high   2-00:00:00   (Priority)
          13167390  standard      P12  hpcuser PD       0:00      1 2023-08-30T16:25:55  high   2-00:00:00   (Priority)
          13167391    bigmem      P34  hpcuser PD       0:00      1 2023-08-30T16:25:55  high   2-00:00:00   (Priority)
          13167392    bigmem    qchem  hpcuser PD       0:00      1 2023-08-30T16:25:55  high   2-00:00:00   (Priority)
          13167393     debug  testrun  hpcuser PD       0:00      1 2023-08-30T16:25:55  high   2-00:00:00   (Priority)
          13167394    bigmem   latest  hpcuser PD       0:00      1 2023-08-30T16:25:55  high   2-00:00:00   (Priority)
          13182480     debug  runtest  jwright2 R      31:01      1 2023-09-01T14:49:54  normal 59:00        r3i7n35

Many other options are available in the man page.

scontrol#

To get detailed information about your job before and while it runs, you may use scontrol show job with the JOBID. For example:

$ scontrol show job 522616
JobId=522616 JobName=myscript.sh
 UserId=hpcuser(123456) GroupId=hpcuser(123456) MCS_label=N/A
 Priority=43295364 Nice=0 Account=csc000 QOS=normal
 JobState=PENDING Reason=Dependency Dependency=afterany:522615
The scontrol command can also be used to modify pending and running jobs:
$ scontrol update jobid=526501 qos=high
$ sacct -j 526501 --format=jobid,partition,state,qos
       JobID  Partition      State        QOS
------------ ---------- ---------- ----------
526501            short    RUNNING       high
526501.exte+               RUNNING
526501.0                 COMPLETED
To pause a job: scontrol hold <JOBID>

To resume a job: scontrol resume <JOBID>

To cancel and rerun: scontrol requeue <JOBID>

scancel#

Use scancel -i <jobID> for an interactive mode to confirm each job_id.step_id before performing the cancel operation. Use scancel --state=PENDING,RUNNING,SUSPENDED -u <userid> to cancel your jobs by STATE or scancel -u <userid> to cancel ALL of your jobs.

sinfo#

Use sinfo to view cluster information:

$ sinfo -o %A
NODES(A/I)
1580/514
Above, sinfo shows nodes Allocated (A) and nodes idle (I) in the entire cluster.

To see specific node information use sinfo -n <node id> to show information about a single or list of nodes. You will see the partition to which the node can allocate as well as the node STATE.

$ sinfo -n r105u33,r2i4n27
PARTITION  AVAIL   TIMELIMIT NODES  STATE  NODELIST
short      up        4:00:00     1  drain   r2i4n27
short      up        4:00:00     1   down   r105u33
standard   up     2-00:00:00     1  drain   r2i4n27
standard   up     2-00:00:00     1   down   r105u33
long       up     10-00:00:0     1  drain   r2i4n27
long       up     10-00:00:0     1   down   r105u33
bigmem     up     2-00:00:00     1   down   r105u33
gpu        up     2-00:00:00     1   down   r105u33
bigscratch up     2-00:00:00     0    n/a
ddn        up     2-00:00:00     0    n/a

sacct#

Use sacct to view accounting information about jobs AND job steps:

$ sacct -j 525198 --format=User,JobID,Jobname,partition,state,time,start,elapsed,nnodes,ncpus
     User        JobID    JobName  Partition      State  Timelimit               Start    Elapsed  NNodes    NCPUS
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ---------- ------- --------
  hpcuser 525198        acct_test      short  COMPLETED   00:01:00 2018-12-19T16:09:34   00:00:54       4      144
          525198.batch      batch             COMPLETED            2018-12-19T16:09:34   00:00:54       1       36
          525198.exte+     extern             COMPLETED            2018-12-19T16:09:34   00:00:54       4      144
          525198.0           bash             COMPLETED            2018-12-19T16:09:38   00:00:00       4        4
Use sacct -e to print a list of fields that can be specified with the --format option.

sprio#

By default, sprio returns information for all pending jobs. Options exist to display specific jobs by JOBID and USER.

$ sprio -u hpcuser
  JOBID  PARTITION     USER  PRIORITY   AGE  JOBSIZE PARTITION       QOS
 526752      short  hpcuser  43383470  3733   179737         0  43200000

Use the `-n` flag to provide a normalized priority weighting with a value between 0-1:

$ sprio -u hpcuser -n
  JOBID  PARTITION     USER    PRIORITY        AGE    JOBSIZE  PARTITION        QOS
 526752      short  hpcuser  0.01010100  0.0008642  0.0009747  0.0000000  0.1000000

The sprio command also has some options that can be used to view the entire queue by priority order. The following command will show the "long" (-l) format sprio with extended information, sorted by priority in descending order (-S -Y), and piped through the less command with line numbers shown on the far left (less -N):

sprio -S -Y -l | less -N

1           JOBID PARTITION     USER   PRIORITY       SITE        AGE      ASSOC  FAIRSHARE    JOBSIZE  PARTITION        QOS        NICE                 TRES
2        13150512 standard-  hpcuser  373290120          0    8909585          0  360472143      84743    3823650          0           0
3        13150514 standard-  hpcuser  373290070          0    8909534          0  360472143      84743    3823650          0           0

When sprio is piped through the less command for paginating, press the / key and type in a jobid or a username and press the return key to search for and jump to that jobid or username. Press / and hit return again to search for the next occurrence of your search term, or use the ? instead of / to search upwards in the list. Press q to exit.

Note that when piped through less -N, line numbers may be equated to position in the priority queue plus 1, because the top column label line of sprio is counted by less. To remove the column labels from sprio output, add the -h or --noheader flag to sprio.

The -l(--long) flag precludes using the -n for normalized priority values.

Like squeue and other Slurm commands, sprio supports the -o format flag to customize the columns that are displayed. For example:

sprio S -Y -o "%i %r %u %y"

Will show only the jobid, partition, username, and normalized priority. More details about output formatting are available in man sprio.