Workflows

Create a workflow

As discussed elsewhere, there are many ways to create a workflow. The standard method is to use a workflow specification.

Command

$ torc workflows create-from-json-file

The workflow specification is a user-facing JSON document that fully describes a workflow but is not actually stored in the database.
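For illustration only, a small specification might look like the example below. The field names shown here (jobs, files, resource_requirements, input_files, output_files, and so on) are assumptions for this sketch; refer to the specification documentation for the authoritative schema.

    {
      "name": "demo",
      "jobs": [
        {"name": "preprocess", "command": "python preprocess.py",
         "output_files": ["intermediate"], "resource_requirements": "small"},
        {"name": "postprocess", "command": "python postprocess.py",
         "input_files": ["intermediate"], "resource_requirements": "small"}
      ],
      "files": [
        {"name": "intermediate", "path": "output/intermediate.csv"}
      ],
      "resource_requirements": [
        {"name": "small", "num_cpus": 1, "memory": "1g", "runtime": "P0DT1H"}
      ]
    }

Note how objects reference each other by name; as described below, torc converts those name references into graph edges.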

What happens

The torc database service does the following:

  • Creates a set of collections for one workflow. Each workflow has its own collections for jobs, files, user data, etc. Each collection name has the format <collection_name>__<workflow_key>, such as jobs__96282097.

  • Converts the JSON objects to torc data models, stores them in the database, and then creates edges in the workflow graph between related vertexes. The specification references objects by name, but the stored edges are based on document keys, not names.

This table describes the database vertex and edge collections that torc creates.

Database Vertexes and Edges

From Collection  To Collection          Edge          Description
---------------  ---------------------  ------------  ---------------------------------------------------
jobs             files                  produces      Defines the files produced by a job.
jobs             files                  needs         Defines the files needed by a job.
jobs             user_data              stores        Defines the user data created and stored by a job.
jobs             user_data              consumes      Defines the user data consumed by a job.
jobs             jobs                   blocks        Defines the order of execution for jobs. Can be defined by the user or derived from files and user data.
jobs             resource_requirements  requires      Connects a job with its resource requirements.
jobs             results                returned      Connects a job with its execution results.
jobs             job_process_stats      process_used  Connects a job with its process utilization stats.
compute_nodes    jobs                   executed      Connects a compute node with jobs that it executed.
compute_nodes    compute_node_stats     nodeUsed      Connects a compute node with its resource utilization stats.
jobs             schedulers             scheduledBys  Connects a job with its compute node scheduler.

Refer to torc collections join --help to see how to display these relationships.

Start a workflow

Command

$ torc workflows start

What happens

  • The torc client application makes a series of API calls to check that all required input files and user data objects exist.

  • The torc client application reads the last-modified timestamps of existing input files and records them in the database.

  • The torc client application calls initialize_jobs. The database service does the following:

    • Clears any user_data objects defined as ephemeral.

    • Adds blocks edges between jobs based on the produces/needs job-file edges (see the sketch after this list).

    • Adds blocks edges between jobs based on the stores/consumes job-user_data edges.

    • Sets each job's status as appropriate. Jobs will be either ready or blocked.
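The derivation of blocks edges from file relationships is simple in concept: if job A produces a file that job B needs, then A must block B. The following Python sketch illustrates the idea; it is not torc's actual implementation, and all names in it are hypothetical.

    from collections import defaultdict

    def derive_blocks_edges(produces, needs):
        """produces and needs are lists of (job, file) pairs.
        Returns (blocker, blocked) pairs: if job A produces a file that
        job B needs, then A must run before B (A blocks B)."""
        producers = defaultdict(list)
        for job, file in produces:
            producers[file].append(job)
        edges = set()
        for job, file in needs:
            for producer in producers[file]:
                if producer != job:
                    edges.add((producer, job))
        return edges

    # Example: "preprocess" produces "intermediate", which "postprocess"
    # needs, so preprocess blocks postprocess.
    print(derive_blocks_edges([("preprocess", "intermediate")],
                              [("postprocess", "intermediate")]))
    # {('preprocess', 'postprocess')}

The same logic applies to the stores/consumes edges, with user_data documents in place of files.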

Schedule compute nodes

The standard way to run a workflow's jobs in an HPC environment is through the torc hpc slurm schedule-nodes command.

What happens

The torc client application submits requests to the compute node scheduler (like Slurm) to allocate nodes. It tells the scheduler to run the torc worker application upon node acquisition.

When that application starts, it makes database calls to request jobs that match its hardware (CPUs, memory, etc.) and then runs those jobs.

Parallelism on each node is based on the resource requirements defined for each job. If each job only needs one CPU and 1 GB of memory, torc will start as many jobs as the node has CPUs.
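As a rough sketch of that packing logic (hypothetical numbers and function, not torc's code):

    # Conceptual sketch: how many jobs fit on a node at once, given
    # per-job CPU and memory requirements. Not torc's implementation.
    def max_concurrent_jobs(node_cpus, node_memory_gb, job_cpus, job_memory_gb):
        return min(node_cpus // job_cpus, int(node_memory_gb // job_memory_gb))

    # A 104-CPU, 240-GB node running 1-CPU, 1-GB jobs starts 104 at once.
    print(max_concurrent_jobs(104, 240, 1, 1))  # 104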

Complete a job

The torc worker application manages job completions. When a user job finishes, the application creates a results object containing the return code and execution time and passes it to the database service through the complete_job call. The service does the following:

  • Stores the result in the database and connects the job to that result with a returned edge.

  • Changes the job status to done.

  • Creates a hash of all critical input data for the job and stores it in the database. This includes all inputs that affect the result, such as the command, input files, and user data, but not items that do not affect the result, such as the job name. If the workflow is ever restarted, torc uses this hash to tell whether a successfully completed job needs to be rerun (see the sketch after this list).

  • If this job was the last job blocking another job from running, the service changes that job status to ready.
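A sketch of the hashing idea follows. The choice of hash algorithm and the exact set of inputs are assumptions here; torc's real implementation may differ.

    import hashlib
    import json

    def critical_input_hash(command, input_file_timestamps, user_data_revisions):
        """Hash everything that affects the job's result. The inputs and
        the hash algorithm here are assumptions, not torc's implementation."""
        payload = json.dumps(
            {
                "command": command,
                "input_files": sorted(input_file_timestamps.items()),
                "user_data": sorted(user_data_revisions.items()),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode()).hexdigest()

    # Unchanged inputs produce the same hash, so the job can be skipped.
    print(critical_input_hash("python analyze.py",
                              {"raw.csv": 1700000000.0},
                              {"params": "rev-2"})[:12])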

The torc worker application also records process utilization stats for the completed job in the database. When all jobs are finished, it records overall compute node utilization stats as well.

Restart a workflow

Command

$ torc workflows restart

What happens

The main goal of a workflow restart is to rerun only the jobs that did not complete in a previous run. Jobs that finished successfully and whose dependencies are unchanged do not need to be rerun.

The torc client application repeats the behavior of Start a workflow. With help from the database service, it also looks for changes to critical job input parameters, input files, and user data. It changes the job status to uninitialized for any job that meets one of these criteria (sketched in code after the list):

  • The job did not complete successfully. This includes failures, timeouts, and cancellations.

  • The job input files were updated and have a new timestamp.

  • User data documents consumed by the job were updated and have a new revision.
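Conceptually, the per-job decision reduces to a check like the following sketch; the status values and field names are hypothetical, not torc's schema.

    def needs_rerun(job, recorded_mtimes, current_mtimes,
                    recorded_revisions, current_revisions):
        """Return True if the job must run again on restart. Status values
        and field names here are hypothetical, not torc's schema."""
        if job["status"] != "done":        # failure, timeout, or cancellation
            return True
        for name, mtime in current_mtimes.items():
            if mtime != recorded_mtimes.get(name):     # input file updated
                return True
        for name, rev in current_revisions.items():
            if rev != recorded_revisions.get(name):    # user data updated
                return True
        return False

    # A finished job whose input file gained a newer timestamp is rerun.
    print(needs_rerun({"status": "done"}, {"raw.csv": 1.0}, {"raw.csv": 2.0}, {}, {}))  # True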

The workflow status contains a run_id field that torc increments each time the workflow starts or restarts. This allows you to inspect the results from each run.

Cancel a workflow

Command

$ torc workflows cancel

What happens

The torc client application calls put_workflows_key_cancel. The database service does the following:

  • Sets the workflow status to canceled.

  • Sets all jobs that have the status submitted or submitted_pending to canceled.

The torc worker application on each compute node detects those status changes and terminates all running jobs.
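A conceptual sketch of that detection loop follows; the polling approach, function names, and statuses are hypothetical, not torc's implementation.

    import subprocess
    import time

    def run_with_cancellation(cmd, is_canceled, poll_seconds=10):
        """Run a user job, polling for cancellation. The polling approach
        and all names here are hypothetical, not torc's implementation."""
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:         # job still running
            if is_canceled():              # e.g., a job-status database query
                proc.terminate()           # stop the user job
                proc.wait()
                return "canceled"
            time.sleep(poll_seconds)
        return "finished"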

Events

The torc applications and database service post events to the database for conditions like starting and completing workflows. User applications can post their own events.

Note

The torc database service will add a timestamp value to every event that does not already have one. It is recommended that you not add your own.
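For illustration, an event document might look like the following. These fields are assumptions for this sketch; the timestamp is the value the service adds when you omit it.

    {
      "category": "job",
      "type": "complete",
      "message": "Job preprocess finished with return code 0",
      "timestamp": "2024-05-01T08:30:00Z"
    }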

You can view the events with this command:

$ torc events list