Jobs

Job Statuses

The torc worker application and the database service manage job statuses according to the rules shown below.

  • uninitialized: Initial state; it is not yet known whether the job is blocked or ready.

  • ready: The job can be submitted.

  • blocked: The job cannot start because of dependencies.

  • scheduled: The job is ready, and a compute node has been scheduled to run it (though any node with sufficient resources could run it).

  • submitted_pending: The job has been assigned to a compute node but is not yet running. This is a transient state.

  • submitted: The job is running on a compute node.

  • terminated: A compute node timeout occurred, and the job was notified to checkpoint and shut down.

  • done: The job finished. It may or may not have completed successfully.

  • canceled: A blocking job failed, so this job never ran.

  • disabled: The job cannot run or change state.

digraph job_statuses {
    "uninitialized" -> "ready";
    "uninitialized" -> "blocked";
    "uninitialized" -> "disabled";
    "disabled" -> "uninitialized";
    "blocked" -> "ready";
    "blocked" -> "canceled";
    "ready" -> "scheduled";
    "ready" -> "submitted_pending" [style = "dotted"];
    "scheduled" -> "submitted_pending" [style = "dotted"];
    "submitted_pending" -> "submitted";
    "submitted_pending" -> "canceled";
    "submitted" -> "done";
    "submitted" -> "terminated";
    "submitted" -> "canceled";
}
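
The graph can be rendered with Graphviz (for example, dot -Tsvg). The transition rules can also be expressed as a lookup table; the following Python sketch is illustrative only, and its names and structure are assumptions rather than torc's actual implementation:

# Hypothetical model of the status graph above; not torc's real code.
ALLOWED_TRANSITIONS = {
    "uninitialized": {"ready", "blocked", "disabled"},
    "disabled": {"uninitialized"},
    "blocked": {"ready", "canceled"},
    "ready": {"scheduled", "submitted_pending"},
    "scheduled": {"submitted_pending"},
    "submitted_pending": {"submitted", "canceled"},
    "submitted": {"done", "terminated", "canceled"},
}

def can_transition(current: str, new: str) -> bool:
    """Return True if the status change is allowed by the graph above."""
    # Any status can also be reset to uninitialized (see the note below).
    if new == "uninitialized":
        return True
    return new in ALLOWED_TRANSITIONS.get(current, set())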

Note

A job in any status can be reset back to uninitialized.

Scheduled Jobs

If you enable compute node scheduling for a job that is initially blocked, as discussed in Automated scheduling, torc does the following (a sketch of the flow appears after this list):

  • When all blocking conditions are satisfied, the torc database service changes the job status to ready. This is the normal behavior; in this case, however, there should not be any compute node with sufficient resources to run the job.

  • When a torc worker application finishes its work, it sends the API command prepare_jobs_for_scheduling. The database service searches for all jobs that have a ready status and a schedule_compute_nodes value set. It returns a list of those scheduling parameters and changes the status of each job to scheduled.

  • The torc worker application then runs the scheduler command with the returned IDs (e.g., torc hpc slurm schedule-nodes).

  • If another compute node with available resources happens to pick up the scheduled jobs first, the newly scheduled node will detect that there is no work to do and exit.
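
Putting these steps together, the worker-side flow looks roughly like the sketch below. The api_client and run_scheduler objects are hypothetical stand-ins; only the prepare_jobs_for_scheduling command and the torc hpc slurm schedule-nodes CLI come from the steps above.

def schedule_ready_jobs(api_client, run_scheduler) -> None:
    """Illustrative sketch of the scheduling round trip described above.

    api_client and run_scheduler are hypothetical stand-ins, not torc's
    real API.
    """
    # The database service returns scheduling parameters for every job
    # that is ready and has schedule_compute_nodes set, and marks those
    # jobs as scheduled.
    params = api_client.prepare_jobs_for_scheduling()

    # Run the scheduler command (e.g., torc hpc slurm schedule-nodes)
    # with the returned IDs to request compute nodes.
    for item in params:
        run_scheduler(item["scheduler_id"])

Marking the jobs as scheduled in the same call that returns their parameters presumably keeps two workers from requesting compute nodes for the same jobs.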