Database Errors

Torc provides some resiliency against database errors. The workflow config defines the parameter compute_node_wait_for_healthy_database_minutes, which defaults to 20 and sets how long the torc worker application waits for an unresponsive database to recover.

Database access on compute node acquisition

If the first API command issued by the torc worker application running on a compute node fails, the worker waits up to compute_node_wait_for_healthy_database_minutes minutes, polling the database once a minute. If the database becomes responsive within that time, the worker continues as normal. If the database is still unavailable when the wait expires, the worker exits (and releases the allocation if torc scheduled it).
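The wait-and-poll behavior described above can be sketched roughly as follows. This is an illustration only, not torc's actual implementation: the function name wait_for_healthy_database, the ping callable, and the injectable sleep argument are all hypothetical names introduced for this example. Only the 20-minute default and the once-a-minute polling come from the text above.

```python
import time
from typing import Callable


def wait_for_healthy_database(
    ping: Callable[[], None],
    wait_minutes: int = 20,
    poll_interval_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until ping() succeeds or the wait window is used up.

    ping is a hypothetical stand-in for any API command that raises
    an exception while the database is unavailable.
    Returns True if the database became responsive, False otherwise.
    """
    # One poll per interval, for the configured number of minutes.
    max_attempts = max(1, int(wait_minutes * 60 / poll_interval_seconds))
    for _ in range(max_attempts):
        sleep(poll_interval_seconds)
        try:
            ping()
        except Exception:
            # Still unavailable; keep polling until attempts run out.
            continue
        return True
    return False
```

A caller would continue as normal when the function returns True, and exit (releasing the allocation where applicable) when it returns False. Passing sleep as an argument is a design choice for this sketch so the loop can be exercised without real delays.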

Database access while running jobs

Similarly, if an API command issued by the torc worker application fails while it is in its run loop, the worker waits up to compute_node_wait_for_healthy_database_minutes minutes for the database to recover. If the database is still unavailable when the wait expires, the worker terminates all jobs and exits.
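As a rough sketch of the run-loop case, the following shows a failed API call being retried during the wait window, with all jobs terminated if the database never recovers. Everything here is hypothetical and for illustration: FakeJob, protected_call, and the once-a-minute retry interval inside the run loop are assumptions, not torc classes, functions, or documented behavior.

```python
import time
from typing import Callable, List


class FakeJob:
    """Hypothetical stand-in for a running job; not a torc class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.terminated = False

    def terminate(self) -> None:
        self.terminated = True


def protected_call(
    api_call: Callable[[], None],
    jobs: List[FakeJob],
    wait_minutes: int = 20,
    poll_seconds: float = 60.0,  # assumed retry interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run api_call; on failure, retry until the wait window expires.

    Returns True if the worker should keep running. Returns False
    after terminating every job, meaning the worker should exit.
    """
    try:
        api_call()
        return True
    except Exception:
        pass  # Fall through to the wait-and-retry phase.

    for _ in range(max(1, int(wait_minutes * 60 / poll_seconds))):
        sleep(poll_seconds)
        try:
            api_call()
            return True
        except Exception:
            continue

    # The database never recovered: stop all running jobs.
    for job in jobs:
        job.terminate()
    return False
```

In this sketch the run loop would wrap each protected API command in protected_call and break out of the loop when it returns False.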

All other API commands issued by the torc worker application are not protected; a failure in any of them causes the worker to exit.

Customization

You can change the value of compute_node_wait_for_healthy_database_minutes in the config section of a workflow specification file (JSON), through the Julia WorkflowBuilder scripts, or through the torc CLI command torc workflows set-compute-node-parameters.
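For the workflow specification route, a small script along these lines could set the value before the file is submitted. It is a sketch under stated assumptions: the helper set_database_wait_minutes is hypothetical, and the script assumes the parameter sits directly under a top-level "config" key, as the description above suggests. Any other fields in the specification are left untouched.

```python
import json

PARAM = "compute_node_wait_for_healthy_database_minutes"


def set_database_wait_minutes(spec: dict, minutes: int) -> dict:
    """Return a copy of spec with the wait time set in its config section.

    Assumes the parameter lives under a top-level "config" key.
    The input dictionary is not modified.
    """
    updated = dict(spec)
    config = dict(updated.get("config", {}))
    config[PARAM] = minutes
    updated["config"] = config
    return updated


def to_json(spec: dict) -> str:
    """Serialize the specification as indented JSON text."""
    return json.dumps(spec, indent=2)
```

For example, to_json(set_database_wait_minutes({}, 45)) produces a JSON document whose config section contains only the new wait time.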