Gracefully shut down jobs

A common error in HPC environments is underestimating the walltime for a job. When the walltime expires, the HPC scheduler kills the job. If you don’t take precautions, you lose the work and have to start again from the beginning.

Similar to Slurm, Torc offers one procedure to help with this problem: the supports_termination flag in the job definition.

In all cases, Torc sends SIGTERM to all running job processes before the allocation expires. The lead time is 30 seconds by default and is configurable via the compute_node_expiration_buffer_seconds field in the config section of the workflow specification.
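The timing described above is simple arithmetic, sketched here in Python. The function name sigterm_time and the example timestamps are hypothetical and are not part of Torc; only the 30-second default comes from the text.

```python
from datetime import datetime, timedelta

# Default lead time stated in the text above.
DEFAULT_BUFFER_SECONDS = 30


def sigterm_time(allocation_end, buffer_seconds=DEFAULT_BUFFER_SECONDS):
    """Return the moment at which SIGTERM would be sent for a given allocation end."""
    return allocation_end - timedelta(seconds=buffer_seconds)


# Hypothetical allocation that expires at noon.
end = datetime(2024, 1, 1, 12, 0, 0)
print(sigterm_time(end))       # 2024-01-01 11:59:30
print(sigterm_time(end, 120))  # 2024-01-01 11:58:00
```

A larger buffer moves the signal earlier, which leaves the job more time to save its state but less time to do useful work.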

By default, Torc sets the job status and return code to terminated. If supports_termination is true, Torc instead waits for the processes to exit and then sets the return code to whatever each process returns.
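To make the return-code distinction concrete, the following sketch shows standard POSIX behavior using only the Python standard library. It is not Torc's implementation. The child programs, the exit code 3, and the helper terminate_and_wait are all hypothetical, and the example assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Child that installs a SIGTERM handler and exits with its own code (3 is arbitrary).
HANDLING_CHILD = """
import signal, sys, time
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(3))
print("ready", flush=True)
time.sleep(60)
"""

# Child that keeps the default SIGTERM behavior and is killed by the signal.
DEFAULT_CHILD = """
import time
print("ready", flush=True)
time.sleep(60)
"""


def terminate_and_wait(source):
    """Start a child, send it SIGTERM once it is ready, and return its return code."""
    proc = subprocess.Popen(
        [sys.executable, "-c", source], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()  # wait until the child has printed "ready"
    proc.send_signal(signal.SIGTERM)
    return_code = proc.wait(timeout=30)
    proc.stdout.close()
    return return_code
```

On POSIX, terminate_and_wait(HANDLING_CHILD) returns 3, the code chosen by the handler, while terminate_and_wait(DEFAULT_CHILD) returns a negative value that identifies the signal. A job that handles SIGTERM can therefore report a meaningful exit code of its own.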

You can leverage this feature to resume interrupted work by doing the following:

  • Register a signal handler for SIGTERM in your application.

  • In that handler, cause your code to save its current state and shut down gracefully. Return an appropriate exit code and write whatever files a new instance of your application needs to resume from where it left off.

  • Set supports_termination=true on each job.

  • Set compute_node_expiration_buffer_seconds to the amount of time your application will need to gracefully shut down.
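The application side of the steps above might look like the following minimal sketch. It is not taken from Torc. The checkpoint path, the JSON state layout, the exit codes, and the names shutdown, run, and main are all assumptions made for illustration. The handler only sets a flag, and the main loop saves state at the next safe point between units of work.

```python
import json
import os
import signal
import tempfile
import time

# Hypothetical checkpoint location; a real job would likely use shared storage.
CHECKPOINT = os.path.join(tempfile.gettempdir(), "example_job_checkpoint.json")

# Flag set by the signal handler and polled by the main loop.
shutdown = {"requested": False}


def handle_sigterm(signum, frame):
    """Only record the request; saving happens in the main loop."""
    shutdown["requested"] = True


def load_state(path=CHECKPOINT):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0}


def save_state(state, path=CHECKPOINT):
    """Write to a temporary file and rename it so a checkpoint is never half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(total_steps, do_step, path=CHECKPOINT):
    """Run the remaining steps. Return 0 when finished, 1 if interrupted."""
    state = load_state(path)
    while state["next_step"] < total_steps:
        if shutdown["requested"]:
            save_state(state, path)
            return 1  # arbitrary non-zero code meaning "interrupted, resumable"
        do_step(state["next_step"])
        state["next_step"] += 1
    save_state(state, path)
    return 0


def main():
    """Register the handler, then do (or resume) the work."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    return run(100, lambda step: time.sleep(1))
```

A script would end with sys.exit(main()) so that the exit code reaches the caller. Because the flag is checked only between steps, the shutdown buffer needs to cover at least the duration of one step plus the time required to write the checkpoint.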

With supports_termination set to true, Torc sends SIGTERM to each job process and waits for it to exit. If your job registers a handler for that signal, it can shut down gracefully so that a subsequent process can resume where it left off.

Refer to this script for a Python example of handling SIGTERM: https://github.com/NREL/torc/blob/main/torc_client/tests/scripts/sleep.py

Note

The Torc worker application on compute nodes handles SIGTERM. If you configure Slurm to terminate jobs earlier than the Torc setting would, Torc will respect the Slurm timing.