# Workflows

## Create a workflow
As discussed elsewhere, there are many ways to create a workflow. The standard method is to use a workflow specification.
### Command

```
$ torc workflows create-from-json-file
```
The workflow specification is a user-facing JSON document that fully describes a workflow but is not actually stored in the database.
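For illustration, a minimal specification might look like the sketch below. The exact schema is defined by torc; treat the field names here (`jobs`, `files`, `input_files`, `output_files`, `path`) as assumptions to verify against the torc documentation. The important idea is that jobs refer to files by name:

```json
{
  "jobs": [
    {
      "name": "preprocess",
      "command": "python preprocess.py",
      "output_files": ["intermediate_data"]
    },
    {
      "name": "analyze",
      "command": "python analyze.py",
      "input_files": ["intermediate_data"]
    }
  ],
  "files": [
    {"name": "intermediate_data", "path": "output/intermediate.parquet"}
  ]
}
```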
### What happens
The torc database service does the following:
1. Creates a set of collections for the workflow. Each workflow has its own collections for jobs, files, user data, etc. Each collection is named `<collection_name>__<workflow_key>`, such as `jobs__96282097`.
2. Converts the JSON objects to torc data models, stores them in the database, and then creates edges in the workflow graph between related vertices based on the names in the specification. The edges themselves reference document keys, not names.
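To make the keys-versus-names point concrete, here is a hypothetical sketch of that conversion step. The `insert_vertex`/`insert_edge` helpers and the per-workflow edge collection names are assumptions for illustration, not torc's actual API:

```python
# Hypothetical sketch: turning name references in the spec into key-based edges.

def load_spec(db, spec: dict, workflow_key: str) -> None:
    # Insert file vertices and remember the database key assigned to each name.
    file_keys = {}
    for file_spec in spec.get("files", []):
        key = db.insert_vertex(f"files__{workflow_key}", file_spec)
        file_keys[file_spec["name"]] = key

    for job_spec in spec.get("jobs", []):
        job_key = db.insert_vertex(f"jobs__{workflow_key}", job_spec)
        # The spec refers to files by name; the stored edges use document keys.
        for name in job_spec.get("output_files", []):
            db.insert_edge(f"produces__{workflow_key}",
                           from_=job_key, to=file_keys[name])
        for name in job_spec.get("input_files", []):
            db.insert_edge(f"needs__{workflow_key}",
                           from_=job_key, to=file_keys[name])
```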
This table describes the database vertex and edge collections that torc creates.
| From Collection | To Collection | Edge | Description |
|---|---|---|---|
| jobs | files | produces | Defines the files produced by a job. |
| jobs | files | needs | Defines the files needed by a job. |
| jobs | user_data | stores | Defines the user data created and stored by a job. |
| jobs | user_data | consumes | Defines the user data consumed by a job. |
| jobs | jobs | blocks | Defines the order of execution for jobs. Can be defined by the user or derived from files and user data. |
| jobs | resource_requirements | requires | Connects a job with its resource requirements. |
| jobs | results | returned | Connects a job with its execution results. |
| jobs | job_process_stats | process_used | Connects a job with its process utilization stats. |
| compute_nodes | jobs | executed | Connects a compute node with jobs that it executed. |
| compute_nodes | compute_node_stats | node_used | Connects a compute node with its resource utilization stats. |
| jobs | schedulers | scheduled_bys | Connects a job with its compute node scheduler. |
Refer to `torc collections join --help` to see how to display these relationships.
## Start a workflow

### Command

```
$ torc workflows start
```

### What happens
1. The torc client application makes a series of API calls to check that all required input files and user data objects exist.
2. The client records the last-modified timestamp of each existing input file in the database.
3. The client calls `initialize_jobs`. The database service does the following:
   - Clears any user_data objects defined as ephemeral.
   - Adds `blocks` edges between jobs based on the `produces`/`needs` job-file edges (see the sketch after this list).
   - Adds `blocks` edges between jobs based on the `stores`/`consumes` job-user_data edges.
   - Sets job statuses as appropriate. Each job will be either `ready` or `blocked`.
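The derivation of `blocks` edges can be pictured with this illustrative sketch (not torc's actual implementation): if job A produces a file that job B needs, then A blocks B, and only jobs with no blockers start as `ready`.

```python
# Illustrative sketch of deriving blocks edges from produces/needs edges.

def derive_blocks(produces: dict, needs: dict) -> set:
    """produces/needs map a job name to the set of file names it touches."""
    producer_of = {}
    for job, files in produces.items():
        for f in files:
            producer_of[f] = job

    blocks = set()  # (blocking_job, blocked_job) pairs
    for job, files in needs.items():
        for f in files:
            producer = producer_of.get(f)
            if producer is not None and producer != job:
                blocks.add((producer, job))
    return blocks

def initial_statuses(jobs, blocks) -> dict:
    blocked_jobs = {blocked for _, blocked in blocks}
    return {j: ("blocked" if j in blocked_jobs else "ready") for j in jobs}

# Example: "analyze" needs the file that "preprocess" produces.
blocks = derive_blocks({"preprocess": {"data"}}, {"analyze": {"data"}})
assert blocks == {("preprocess", "analyze")}
assert initial_statuses(["preprocess", "analyze"], blocks) == {
    "preprocess": "ready", "analyze": "blocked"}
```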
## Schedule compute nodes
The standard way to start a workflow in an HPC environment is through the `torc hpc slurm schedule-nodes` command.
### What happens
The torc client application submits requests to the compute node scheduler (such as Slurm) to allocate nodes. It tells the scheduler to run the torc worker application upon node acquisition.
When the worker application starts, it makes database calls to ask for jobs appropriate for its hardware (CPUs, memory, etc.) and then runs those jobs.
Parallelism on each node is based on the resource requirements defined for each job. If each job only needs one CPU and 1 GB of memory, torc will start as many jobs as the node has CPUs.
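As a simplified model of that packing logic (an assumption for illustration, not torc's actual scheduling code), a worker keeps starting ready jobs until either CPUs or memory are exhausted:

```python
# Simplified model of per-node parallelism (not torc's actual scheduler code).

def jobs_to_start(ready_jobs, free_cpus: int, free_mem_gb: float) -> list:
    """Greedily pick ready jobs that fit the node's remaining resources."""
    started = []
    for job in ready_jobs:  # each job: {"name", "num_cpus", "memory_gb"}
        if job["num_cpus"] <= free_cpus and job["memory_gb"] <= free_mem_gb:
            free_cpus -= job["num_cpus"]
            free_mem_gb -= job["memory_gb"]
            started.append(job["name"])
    return started

# On a 64-CPU, 256-GB node, 64 one-CPU/1-GB jobs run concurrently.
ready = [{"name": f"job{i}", "num_cpus": 1, "memory_gb": 1.0} for i in range(100)]
assert len(jobs_to_start(ready, free_cpus=64, free_mem_gb=256.0)) == 64
```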
## Complete a job
The torc worker application manages job completions. When a user job finishes, the worker creates a results object with the return code and execution time, and passes it to the database service through the `complete_job` call. The service does the following:
1. Stores the result in the database and connects the job to that result with a `returned` edge.
2. Changes the job status to `done`.
3. Creates a hash of all critical input data for the job and stores it in the database (see the sketch after this list). The hash covers all inputs that affect the result, such as the command, input files, and user data, but not items that do not affect the result, such as the job name. If the workflow is ever restarted, torc uses this information to tell whether a successfully completed job needs to be rerun.
4. If this job was the last job blocking another job from running, the service changes that job's status to `ready`.
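One way to picture that hash is the sketch below. The exact fields and digest algorithm are assumptions for illustration; the point is that the result-affecting inputs are reduced to a canonical form whose digest can be compared on restart:

```python
import hashlib
import json

def critical_input_hash(command: str, input_file_mtimes: dict,
                        user_data_revisions: dict) -> str:
    """Digest a job's result-affecting inputs (illustrative, not torc source).

    The job name is deliberately excluded: renaming a job does not change
    its result, so it should not trigger a rerun.
    """
    canonical = json.dumps(
        {
            "command": command,
            "input_files": input_file_mtimes,   # path -> last-modified time
            "user_data": user_data_revisions,   # key -> document revision
        },
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()
```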
The torc worker application also records process utilization stats for the completed job in the database. When all jobs are finished, it records overall compute node utilization stats as well.
## Restart a workflow

### Command

```
$ torc workflows restart
```

### What happens
The main goal of a workflow restart is to rerun only the jobs that did not complete in a previous run. Jobs that finished successfully and have no changes to their dependencies do not need to be rerun.
The torc client application repeats the behavior of Start a workflow. With help from the database service, it also looks for changes to critical job input parameters, input files, and user data. It changes the job status to `uninitialized` for any job that meets one of these criteria (combined in the sketch after the list):
- The job did not complete successfully. This includes failures, timeouts, and cancellations.
- The job's input files were updated and have a new timestamp.
- User data documents consumed by the job were updated and have a new revision.
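Putting those criteria together, the rerun decision looks roughly like this (field names are hypothetical; the real checks live in the torc database service):

```python
# Sketch of the restart rerun decision (hypothetical field names).

def needs_rerun(job: dict) -> bool:
    if job["status"] != "done":
        return True  # failed, timed out, or canceled in the previous run
    if any(f["current_mtime"] != f["recorded_mtime"]
           for f in job["input_files"]):
        return True  # an input file changed since its timestamp was recorded
    if any(d["current_revision"] != d["recorded_revision"]
           for d in job["consumed_user_data"]):
        return True  # consumed user data has a new revision
    return False  # completed successfully with unchanged inputs

# Jobs for which needs_rerun() is true are set back to uninitialized.
```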
The workflow status contains a `run_id` field. The torc worker application increments the value each time it starts or restarts a workflow. This allows you to inspect results from each run.
## Cancel a workflow

### Command

```
$ torc workflows cancel
```

### What happens
The torc client application calls `put_workflows_key_cancel`. The database service does the following:
1. Sets the workflow status to `canceled`.
2. Sets all jobs that have the status `submitted` or `submitted_pending` to `canceled`.
The torc worker application on each compute node detects those status changes and terminates all running jobs.
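Conceptually, each worker's cancellation handling behaves like the polling loop below (an illustrative sketch; the `get_workflow_status` call and poll interval are assumptions, not torc's actual API):

```python
import time

# Illustrative sketch of a worker reacting to cancellation (not torc source).

def watch_for_cancellation(db, workflow_key: str, running_processes: dict):
    """Terminate all running job processes once the workflow is canceled."""
    while running_processes:
        if db.get_workflow_status(workflow_key) == "canceled":  # hypothetical call
            for process in running_processes.values():
                process.terminate()
            running_processes.clear()
            break
        time.sleep(30)  # assumed poll interval
```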
## Events
The torc applications and database service post events to the database for conditions like starting and completing workflows. User applications can post their own events.
**Note:** The torc database service will add a `timestamp` value to every event that does not already have one. It is recommended that you not add your own.
You can view the events with this command:
```
$ torc events list
```