Jobs

Job Statuses

The torc worker application and the database service manage job statuses according to the rules shown below.

  • uninitialized: Initial state; it is not yet known whether the job is blocked or ready.

  • ready: The job can be submitted.

  • blocked: The job cannot start because of dependencies.

  • scheduled: The job is ready, and a compute node has been scheduled to run it (though any node with sufficient resources could run it).

  • submitted_pending: The job has been assigned to a compute node but is not yet running. This is a transient state.

  • submitted: The job is running on a compute node.

  • terminated: A compute node timeout occurred, and the job was notified to checkpoint and shut down.

  • done: The job finished. It may or may not have completed successfully.

  • canceled: A blocking job failed, so this job never ran.

  • disabled: The job cannot run or change state.

digraph job_statuses {
    "uninitialized" -> "ready";
    "uninitialized" -> "blocked";
    "uninitialized" -> "disabled";
    "disabled" -> "uninitialized";
    "blocked" -> "ready";
    "blocked" -> "canceled";
    "ready" -> "scheduled";
    "ready" -> "submitted_pending" [style = "dotted"];
    "scheduled" -> "submitted_pending" [style = "dotted"];
    "submitted_pending" -> "submitted";
    "submitted_pending" -> "canceled";
    "submitted" -> "done";
    "submitted" -> "terminated";
    "submitted" -> "canceled";
}
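
The graph can be rendered with Graphviz (for example, dot -Tsvg). The transition rules can also be expressed as a lookup table; the following Python sketch is illustrative only, and its names and structure are assumptions rather than torc's actual implementation:

# Hypothetical model of the status graph above; not torc's real code.
ALLOWED_TRANSITIONS = {
    "uninitialized": {"ready", "blocked", "disabled"},
    "disabled": {"uninitialized"},
    "blocked": {"ready", "canceled"},
    "ready": {"scheduled", "submitted_pending"},
    "scheduled": {"submitted_pending"},
    "submitted_pending": {"submitted", "canceled"},
    "submitted": {"done", "terminated", "canceled"},
}

def can_transition(current: str, new: str) -> bool:
    """Return True if the status change is allowed by the graph above."""
    # Any status can also be reset to uninitialized (see the note below).
    if new == "uninitialized":
        return True
    return new in ALLOWED_TRANSITIONS.get(current, set())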

Note

A job in any status can be reset back to uninitialized.

Scheduled Jobs

If you enable compute node scheduling for a job that is initially blocked, as discussed in Automated scheduling, torc does the following (a sketch of the flow appears after this list):

  • When all blocking conditions are satisfied, the torc database service changes the job status to ready. This is the normal behavior; in this case, however, there should not be any compute node with sufficient resources to run the job.

  • When a torc worker application finishes its work, it sends the API command prepare_jobs_for_scheduling. The database service searches for all jobs that have a ready status and a schedule_compute_nodes value set. It returns a list of those scheduling parameters and changes the status of each job to scheduled.

  • The torc worker application then runs the scheduler command with the returned IDs (e.g., torc hpc slurm schedule-nodes).

  • If another compute node with available resources happens to pick up the scheduled jobs first, the newly scheduled node will detect that there is no work to do and exit.
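
Putting these steps together, the worker-side flow looks roughly like the sketch below. The api_client and run_scheduler objects are hypothetical stand-ins; only the prepare_jobs_for_scheduling command and the torc hpc slurm schedule-nodes CLI come from the steps above.

def schedule_ready_jobs(api_client, run_scheduler) -> None:
    """Illustrative sketch of the scheduling round trip described above.

    api_client and run_scheduler are hypothetical stand-ins, not torc's
    real API.
    """
    # The database service returns scheduling parameters for every job
    # that is ready and has schedule_compute_nodes set, and marks those
    # jobs as scheduled.
    params = api_client.prepare_jobs_for_scheduling()

    # Run the scheduler command (e.g., torc hpc slurm schedule-nodes)
    # with the returned IDs to request compute nodes.
    for item in params:
        run_scheduler(item["scheduler_id"])

Marking the jobs as scheduled in the same call that returns their parameters presumably keeps two workers from requesting compute nodes for the same jobs.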