Workflows

Create a workflow

As discussed elsewhere, there are many ways to create a workflow. The standard method is to use a workflow specification.

Command

$ torc workflows create-from-json-file

The workflow specification is a user-facing JSON document that fully describes a workflow but is not actually stored in the database.
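For illustration only, a small specification might look like the example below. The field names shown here (jobs, files, resource_requirements, input_files, output_files, and so on) are assumptions for this sketch; refer to the specification documentation for the authoritative schema.

    {
      "name": "demo",
      "jobs": [
        {"name": "preprocess", "command": "python preprocess.py",
         "output_files": ["intermediate"], "resource_requirements": "small"},
        {"name": "postprocess", "command": "python postprocess.py",
         "input_files": ["intermediate"], "resource_requirements": "small"}
      ],
      "files": [
        {"name": "intermediate", "path": "output/intermediate.csv"}
      ],
      "resource_requirements": [
        {"name": "small", "num_cpus": 1, "memory": "1g", "runtime": "P0DT1H"}
      ]
    }

Note how objects reference each other by name; as described below, torc converts those name references into graph edges.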

What happens

The torc database service does the following:

  • Creates a set of collections for one workflow. Each workflow has its own collections for jobs, files, user data, etc. Each collection name has the format <collection_name>__<workflow_key>, such as jobs__96282097.

  • Converts the JSON objects to torc data models, stores them in the database, and then creates edges in the workflow graph between related vertexes. The specification references objects by name, but the stored edges are based on document keys, not names.

This table describes the database vertex and edge collections that torc creates.

Database Vertexes and Edges

From Collection  To Collection          Edge          Description
---------------  ---------------------  ------------  ---------------------------------------------------
jobs             files                  produces      Defines the files produced by a job.
jobs             files                  needs         Defines the files needed by a job.
jobs             user_data              stores        Defines the user data created and stored by a job.
jobs             user_data              consumes      Defines the user data consumed by a job.
jobs             jobs                   blocks        Defines the order of execution for jobs. Can be defined by the user or derived from files and user data.
jobs             resource_requirements  requires      Connects a job with its resource requirements.
jobs             results                returned      Connects a job with its execution results.
jobs             job_process_stats      process_used  Connects a job with its process utilization stats.
compute_nodes    jobs                   executed      Connects a compute node with jobs that it executed.
compute_nodes    compute_node_stats     nodeUsed      Connects a compute node with its resource utilization stats.
jobs             schedulers             scheduledBys  Connects a job with its compute node scheduler.

Refer to torc collections join --help to see how to display these relationships.

Start a workflow

Command

$ torc workflows start

What happens

  • The torc client application makes a series of API calls to check that all required input files and user data objects exist.

  • The torc client application reads the last-modified timestamps of existing input files and records them in the database.

  • The torc client application calls initialize_jobs. The database service does the following:

    • Clears any user_data objects defined as ephemeral.

    • Adds blocks edges between jobs based on the produces/needs job-file edges (see the sketch after this list).

    • Adds blocks edges between jobs based on the stores/consumes job-user_data edges.

    • Sets each job's status as appropriate. Jobs will be either ready or blocked.
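The derivation of blocks edges from file relationships is simple in concept: if job A produces a file that job B needs, then A must block B. The following Python sketch illustrates the idea; it is not torc's actual implementation, and all names in it are hypothetical.

    from collections import defaultdict

    def derive_blocks_edges(produces, needs):
        """produces and needs are lists of (job, file) pairs.
        Returns (blocker, blocked) pairs: if job A produces a file that
        job B needs, then A must run before B (A blocks B)."""
        producers = defaultdict(list)
        for job, file in produces:
            producers[file].append(job)
        edges = set()
        for job, file in needs:
            for producer in producers[file]:
                if producer != job:
                    edges.add((producer, job))
        return edges

    # Example: "preprocess" produces "intermediate", which "postprocess"
    # needs, so preprocess blocks postprocess.
    print(derive_blocks_edges([("preprocess", "intermediate")],
                              [("postprocess", "intermediate")]))
    # {('preprocess', 'postprocess')}

The same logic applies to the stores/consumes edges, with user_data documents in place of files.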

Schedule compute nodes

The standard way to run a workflow's jobs in an HPC environment is through the torc hpc slurm schedule-nodes command.

What happens

The torc client application submits requests to the compute node scheduler (like Slurm) to allocate nodes. It tells the scheduler to run the torc worker application upon node acquisition.

When that application starts, it makes database calls to request jobs that match its hardware (CPUs, memory, etc.) and then runs those jobs.

Parallelism on each node is based on the resource requirements defined for each job. If each job only needs one CPU and 1 GB of memory, torc will start as many jobs as the node has CPUs.
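As a rough sketch of that packing logic (hypothetical numbers and function, not torc's code):

    # Conceptual sketch: how many jobs fit on a node at once, given
    # per-job CPU and memory requirements. Not torc's implementation.
    def max_concurrent_jobs(node_cpus, node_memory_gb, job_cpus, job_memory_gb):
        return min(node_cpus // job_cpus, int(node_memory_gb // job_memory_gb))

    # A 104-CPU, 240-GB node running 1-CPU, 1-GB jobs starts 104 at once.
    print(max_concurrent_jobs(104, 240, 1, 1))  # 104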

Complete a job

The torc worker application manages job completions. When a user job finishes, the application creates a results object containing the return code and execution time and passes it to the database service through the complete_job call. The service does the following:

  • Stores the result in the database and connects the job to that result with a returned edge.

  • Changes the job status to done.

  • Creates a hash of all critical input data for the job and stores it in the database. This includes all inputs that affect the result, such as the command, input files, and user data, but not items that do not affect the result, such as the job name. If the workflow is ever restarted, torc uses this hash to tell whether a successfully completed job needs to be rerun (see the sketch after this list).

  • If this job was the last job blocking another job from running, the service changes that job status to ready.
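A sketch of the hashing idea follows. The choice of hash algorithm and the exact set of inputs are assumptions here; torc's real implementation may differ.

    import hashlib
    import json

    def critical_input_hash(command, input_file_timestamps, user_data_revisions):
        """Hash everything that affects the job's result. The inputs and
        the hash algorithm here are assumptions, not torc's implementation."""
        payload = json.dumps(
            {
                "command": command,
                "input_files": sorted(input_file_timestamps.items()),
                "user_data": sorted(user_data_revisions.items()),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    # Unchanged inputs produce the same hash, so the job can be skipped.
    print(critical_input_hash("python analyze.py",
                              {"raw.csv": 1700000000.0},
                              {"params": "rev-2"})[:12])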

The torc worker application also records process utilization stats for the completed job in the database. When all jobs are finished, it records overall compute node utilization stats as well.

Restart a workflow

Command

$ torc workflows restart

What happens

The main goal of a workflow restart is to rerun only the jobs that did not complete in a previous run. Jobs that finished successfully and whose dependencies are unchanged do not need to be rerun.

The torc client application repeats the behavior of Start a workflow. With help from the database service, it also looks for changes to critical job input parameters, input files, and user data. It changes the job status to uninitialized for any job that meets one of these criteria (sketched in code after the list):

  • The job did not complete successfully. This includes failures, timeouts, and cancellations.

  • The job input files were updated and have a new timestamp.

  • User data documents consumed by the job were updated and have a new revision.
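Conceptually, the per-job decision reduces to a check like the following sketch; the status values and field names are hypothetical, not torc's schema.

    def needs_rerun(job, recorded_mtimes, current_mtimes,
                    recorded_revisions, current_revisions):
        """Return True if the job must run again on restart. Status values
        and field names here are hypothetical, not torc's schema."""
        if job["status"] != "done":        # failure, timeout, or cancellation
            return True
        for name, mtime in current_mtimes.items():
            if mtime != recorded_mtimes.get(name):     # input file updated
                return True
        for name, rev in current_revisions.items():
            if rev != recorded_revisions.get(name):    # user data updated
                return True
        return False

    # A finished job whose input file gained a newer timestamp is rerun.
    print(needs_rerun({"status": "done"}, {"raw.csv": 1.0}, {"raw.csv": 2.0}, {}, {}))  # True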

The workflow status contains a run_id field that torc increments each time the workflow starts or restarts. This allows you to inspect the results from each run.

Cancel a workflow

Command

$ torc workflows cancel

What happens

The torc client application calls put_workflows_key_cancel. The database service does the following:

  • Sets the workflow status to canceled.

  • Sets all jobs that have the status submitted or submitted_pending to canceled.

The torc worker application on each compute node detects those status changes and terminates all running jobs.
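A conceptual sketch of that detection loop follows; the polling approach, function names, and statuses are hypothetical, not torc's implementation.

    import subprocess
    import time

    def run_with_cancellation(cmd, is_canceled, poll_seconds=10):
        """Run a user job, polling for cancellation. The polling approach
        and all names here are hypothetical, not torc's implementation."""
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:         # job still running
            if is_canceled():              # e.g., a job-status database query
                proc.terminate()           # stop the user job
                proc.wait()
                return "canceled"
            time.sleep(poll_seconds)
        return "finished"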

Events

The torc applications and database service post events to the database for conditions like starting and completing workflows. User applications can post their own events.

Note

The torc database service will add a timestamp value to every event that does not already have one. It is recommended that you not add your own.
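For illustration, an event document might look like the following. These fields are assumptions for this sketch; the timestamp is the value the service adds when you omit it.

    {
      "category": "job",
      "type": "complete",
      "message": "Job preprocess finished with return code 0",
      "timestamp": "2024-05-01T08:30:00Z"
    }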

You can view the events with this command:

$ torc events list