Introduction

Torc is a distributed workflow orchestration system for managing computational pipelines, from simple workflows that only need to parallelize independent jobs to complex workflows with job dependencies, mixed resource requirements, multiple stages, and compute node scheduling.

Torc provides:

  • Declarative workflow definitions in YAML, JSON, JSON5, or KDL
  • Automatic dependency resolution based on file and data relationships
  • Distributed execution across local machines and HPC clusters
  • Resource management for CPU, memory, and GPU tracking
  • Fault tolerance with workflow resumption after failures
  • Change detection to automatically rerun affected jobs

Who Should Use Torc?

Torc is designed for:

  • HPC Users who need to parallelize jobs across cluster resources
  • Computational Scientists running parameter sweeps and simulations
  • Data Engineers building complex data processing pipelines
  • ML/AI Researchers managing training workflows and hyperparameter searches
  • Anyone who needs reliable, resumable workflow orchestration

Key Features

Job Parameterization

Create parameter sweeps with simple syntax:

jobs:
  - name: job_{index}
    command: bash work.sh {index}
    parameters:
      index: "1:100"

This expands to 100 jobs.
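
For illustration, the "1:100" range covers indices 1 through 100 inclusive. The hand-written equivalent of this sweep would be a loop like the following bash sketch:

# Illustration only: what the parameter expansion generates, one job per index
for index in $(seq 1 100); do
  echo "job_${index}: bash work.sh ${index}"
done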

Implicit Dependencies

Dependencies between jobs are automatically inferred from the files they consume and produce.

name: my_workflow
description: Example workflow with implicit dependencies
jobs:
  - name: preprocess
    command: "bash tests/scripts/preprocess.sh -i ${files.input.f1} -o ${files.output.f2} -o ${files.output.f3}"

  - name: work1
    command: "bash tests/scripts/work.sh -i ${files.input.f2} -o ${files.output.f4}"

  - name: work2
    command: "bash tests/scripts/work.sh -i ${files.input.f3} -o ${files.output.f5}"

  - name: postprocess
    command: "bash tests/scripts/postprocess.sh -i ${files.input.f4} -i ${files.input.f5} -o ${files.output.f6}"

# File definitions - representing the data files in the workflow
files:
  - name: f1
    path: f1.json

  - name: f2
    path: f2.json

  - name: f3
    path: f3.json

  - name: f4
    path: f4.json

  - name: f5
    path: f5.json

  - name: f6
    path: f6.json

Slurm Integration

Native support for HPC clusters:

slurm_schedulers:
  - name: big memory nodes
    partition: bigmem
    account: myproject
    walltime: 04:00:00
    num_nodes: 5

Getting Started

Torc is a distributed workflow orchestration system for managing complex computational pipelines with job dependencies, resource requirements, and distributed execution.

Torc uses a client-server architecture where a central server manages workflow state and coordination, while clients create workflows and job runners execute tasks on compute resources.

Key Components

  • Server: REST API service that manages workflow state via a SQLite database
  • Client: CLI tool and library for creating and managing workflows
  • Job Runner: Worker process that pulls ready jobs, executes them, and reports results
  • Database: Central SQLite database that stores all workflow state and coordinates distributed execution

Features

  • Declarative Workflow Specifications - Define workflows in YAML, JSON5, JSON, or KDL
  • Automatic Dependency Resolution - Dependencies inferred from file and data relationships
  • Job Parameterization - Create parameter sweeps and grid searches with simple syntax
  • Distributed Execution - Run jobs across multiple compute nodes with resource tracking
  • Slurm Integration - Native support for HPC cluster job submission
  • Workflow Resumption - Restart workflows after failures without losing progress
  • Change Detection - Automatically detect input changes and re-run affected jobs
  • Resource Management - Track CPU, memory, and GPU usage across all jobs
  • RESTful API - Complete OpenAPI-specified REST API for integration

Installation

Download precompiled binaries from the releases page.

macOS users: The precompiled binaries are not signed with an Apple Developer certificate. macOS Gatekeeper will block them by default. To allow the binaries to run, remove the quarantine attribute after downloading:

xattr -cr /path/to/torc*

Alternatively, you can right-click each binary and select “Open” to add a security exception.

Building from Source

Prerequisites

  • Rust 1.70 or later
  • SQLite 3.35 or later (usually included with Rust via sqlx)

Clone the Repository

git clone https://github.com/NREL/torc.git
cd torc

Building All Components

Note that the file .env sets the database URL to ./db/sqlite/dev.db. Change it as desired, or set the DATABASE_URL environment variable.
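
For example, you can point sqlx at a different database file by setting the variable in your shell before building. The sqlite:// URL scheme and path below are assumptions based on common sqlx conventions; check the project's .env for the exact format.

# Hypothetical override of the default database location
export DATABASE_URL=sqlite://./db/sqlite/my_custom.db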

Initialize the database:

# Install sqlx-cli if needed
cargo install sqlx-cli --no-default-features --features sqlite
sqlx database setup

Build everything (server, client, dashboard, job runners):

# Development build
cargo build --workspace

# Release build (optimized, recommended)
cargo build --workspace --release

Build individual components:

# Server
cargo build --release -p torc-server

# Client CLI
cargo build --release -p torc

# Web Dashboard
cargo build --release -p torc-dash

# Slurm job runner
cargo build --release -p torc-slurm-job-runner

Binaries will be in target/release/. We recommend adding this directory to your system path so that you can run all binaries without specifying the full path.
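
For example, in bash you can add the build output directory to your PATH for the current session:

# Run from the repository root after building
export PATH="$PWD/target/release:$PATH"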

Python Client

The Python client provides programmatic workflow management for Python users.

Prerequisites

  • Python 3.11 or later

Installation

pip install "torc @ git+https://github.com/NREL/torc.git#subdirectory=python_client"

The pytorc command will be available after installation.

Julia Client

The Julia client provides programmatic workflow management for Julia users.

Prerequisites

  • Julia 1.10 or later

Installation

Since the package is not yet registered in the Julia General registry, install it directly from GitHub:

using Pkg
Pkg.add(url="https://github.com/NREL/torc.git", subdir="julia_client/Torc")

Then use it in your code:

using Torc

For Developers

Running Tests

# Run all tests
cargo test -- --test-threads=1

# Run specific test
cargo test --test test_workflow_manager test_initialize_files_with_updated_files

# Run with debug logging
RUST_LOG=debug cargo test -- --nocapture

Setting Up the Server

Start the server:

# Development mode
cargo run -p torc-server -- run

# Production mode (release build)
./target/release/torc-server run

# Custom port
./target/release/torc-server run --port 8080

The server will start on http://localhost:8080.

When running small workflows for testing and demonstration purposes, we recommend setting the following option so that the server detects job completions more quickly than with the default interval of 60 seconds.

./target/release/torc-server run --completion-check-interval-secs 5
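
To confirm the server is responding, you can query the REST API directly. The base path below is taken from the TORC_API_URL examples later in this documentation; adjust the port and path if your deployment differs.

# List workflows via the REST API (base path may differ in your deployment)
curl -s http://localhost:8080/torc-service/v1/workflows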

Quick Start

Choose the guide that matches your environment:

Quick Start (HPC)

For HPC clusters with Slurm — Run workflows on compute nodes via Slurm.

  • Start server on login node
  • Define jobs with resource requirements (CPU, memory, runtime)
  • Submit with torc submit-slurm --account <account> workflow.yaml
  • Jobs run on compute nodes

Quick Start (Local)

For local execution — Run workflows on the current machine.

  • Ideal for testing, development, or non-HPC environments
  • Start server locally
  • Run with torc run workflow.yaml
  • Jobs run on the current machine

Quick Start (Local)

This guide walks you through creating and running your first Torc workflow with local execution. Jobs run directly on the current machine, making this ideal for testing, development, or non-HPC environments.

For running workflows on HPC clusters with Slurm, see Quick Start (HPC).

Start the Server

Start a Torc server with a local database. Setting --completion-check-interval-secs ensures job completions are processed quickly (use this for personal servers, not shared deployments).

torc-server run --database torc.db --completion-check-interval-secs 5

Test the Connection

In a new terminal, verify the client can connect:

torc workflows list

Create a Workflow

Save this as workflow.yaml:

name: hello_world
description: Simple hello world workflow

jobs:
  - name: job 1
    command: echo "Hello from torc!"
  - name: job 2
    command: echo "Hello again from torc!"

Note: Torc also accepts .json5 and .kdl workflow specifications. See Workflow Specification Formats for details.

Run the Workflow

Run jobs locally with a short poll interval for demo purposes:

torc run workflow.yaml --poll-interval 1

This creates the workflow, initializes it, and runs all jobs on the current machine.

View Results

torc results list

Or use the TUI for an interactive view:

torc tui

Example: Diamond Workflow

A workflow with fan-out and fan-in dependencies:

name: diamond_workflow
description: Example workflow with implicit dependencies

jobs:
  - name: preprocess
    command: "bash tests/scripts/preprocess.sh -i ${files.input.f1} -o ${files.output.f2} -o ${files.output.f3}"

  - name: work1
    command: "bash tests/scripts/work.sh -i ${files.input.f2} -o ${files.output.f4}"

  - name: work2
    command: "bash tests/scripts/work.sh -i ${files.input.f3} -o ${files.output.f5}"

  - name: postprocess
    command: "bash tests/scripts/postprocess.sh -i ${files.input.f4} -i ${files.input.f5} -o ${files.output.f6}"

files:
  - name: f1
    path: f1.json
  - name: f2
    path: f2.json
  - name: f3
    path: f3.json
  - name: f4
    path: f4.json
  - name: f5
    path: f5.json
  - name: f6
    path: f6.json

Dependencies are automatically inferred from file inputs/outputs:

  • work1 and work2 wait for preprocess (depend on its output files)
  • postprocess waits for both work1 and work2 to complete

More Examples

The examples directory contains many more workflow examples in YAML, JSON5, and KDL formats.

Quick Start (HPC)

This guide walks you through running your first Torc workflow on an HPC cluster with Slurm. Jobs are submitted to Slurm and run on compute nodes.

For local execution (testing, development, or non-HPC environments), see Quick Start (Local).

Prerequisites

  • Access to an HPC cluster with Slurm
  • A Slurm account/allocation for submitting jobs
  • Torc installed (see Installation)

Start the Server

On the login node, start a Torc server with a local database:

torc-server run --database torc.db --completion-check-interval-secs 5

Note: For larger deployments, your team may provide a shared Torc server. In that case, skip this step and set TORC_API_URL to the shared server address.
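
For example, in bash (the host and port here are placeholders; use the address your team provides):

export TORC_API_URL="http://shared-torc-server:8080/torc-service/v1"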

Check Your HPC Profile

Torc includes built-in profiles for common HPC systems. Check if your system is detected:

torc hpc detect

If detected, you’ll see your HPC system name. To see available partitions:

torc hpc partitions <profile-name>

Note: If your HPC system isn’t detected, see Custom HPC Profile or request built-in support.

Create a Workflow with Resource Requirements

Save this as workflow.yaml:

name: hpc_hello_world
description: Simple HPC workflow

resource_requirements:
  - name: small
    num_cpus: 4
    memory: 8g
    runtime: PT30M

jobs:
  - name: job1
    command: echo "Hello from compute node!" && hostname
    resource_requirements: small

  - name: job2
    command: echo "Hello again!" && hostname
    resource_requirements: small
    depends_on: [job1]

Key differences from local workflows:

  • resource_requirements: Define CPU, memory, and runtime needs
  • Jobs reference these requirements by name
  • Torc matches requirements to appropriate Slurm partitions

Submit the Workflow

Submit with your Slurm account:

torc submit-slurm --account <your-account> workflow.yaml

Torc will:

  1. Detect your HPC system
  2. Match job requirements to appropriate partitions
  3. Generate Slurm scheduler configurations
  4. Create and submit the workflow

Monitor Progress

Check workflow status:

torc workflows list
torc jobs list <workflow-id>

Or use the interactive TUI:

torc tui

Check Slurm queue:

squeue -u $USER

View Results

Once jobs complete:

torc results list <workflow-id>

Job output is stored in the output/ directory by default.

Example: Multi-Stage Pipeline

A more realistic workflow with different resource requirements per stage:

name: analysis_pipeline
description: Data processing pipeline

resource_requirements:
  - name: light
    num_cpus: 4
    memory: 8g
    runtime: PT30M

  - name: compute
    num_cpus: 32
    memory: 64g
    runtime: PT2H

  - name: gpu
    num_cpus: 8
    num_gpus: 1
    memory: 32g
    runtime: PT1H

jobs:
  - name: preprocess
    command: python preprocess.py
    resource_requirements: light

  - name: train
    command: python train.py
    resource_requirements: gpu
    depends_on: [preprocess]

  - name: evaluate
    command: python evaluate.py
    resource_requirements: compute
    depends_on: [train]

Torc stages resource allocation based on dependencies:

  • preprocess resources are allocated at workflow start
  • train resources are allocated when preprocess completes
  • evaluate resources are allocated when train completes

This prevents wasting allocation time on resources that aren’t needed yet.

Preview Before Submitting

For production workflows, preview the generated Slurm configuration first:

torc slurm generate --account <your-account> workflow.yaml

This shows what schedulers and actions Torc will create without submitting anything.

Explanation

This section provides understanding-oriented discussions of Torc’s key concepts and architecture. Here you’ll learn how Torc works internally, how components interact, and the design decisions behind the system.

Topics covered:

  • System architecture and component interactions
  • Database design and coordination model
  • Server API implementation
  • Client architecture and workflow management
  • Job runner execution model
  • Job state transitions and lifecycle
  • Workflow reinitialization and change detection
  • Workflow archiving for long-term workflow management
  • Dependency resolution mechanisms
  • Ready queue optimization for large workflows
  • Parallelization strategies and job allocation approaches
  • Workflow actions for automation and dynamic resource allocation

Architecture

Overview

Torc uses a client-server architecture where a central server manages workflow state and coordination, while clients create workflows and job runners execute tasks on compute resources.

Key Components:

  • Server: REST API service that manages workflow state via a SQLite database
  • Client: CLI tool and library for creating and managing workflows
  • Job Runner: Worker process that pulls ready jobs, executes them, and reports results
  • Database: Central SQLite database that stores all workflow state and coordinates distributed execution

System Diagram

graph TB
    subgraph Server["Torc Server"]
        API["REST API (Tokio 1-thread)<br/>/workflows /jobs /files<br/>/user_data /results"]
        DB["SQLite Database (WAL)<br/>• Workflow state<br/>• Job dependencies<br/>• Resource tracking<br/>• Execution results"]
        API --> DB
    end

    Client["Torc Client<br/>• Create workflows<br/>• Submit specs<br/>• Monitor"]
    Runner1["Job Runner 1<br/>(compute-01)<br/>• Poll for jobs<br/>• Execute tasks<br/>• Report results"]
    RunnerN["Job Runner N<br/>(compute-nn)<br/>• Poll for jobs<br/>• Execute tasks<br/>• Report results"]

    Client -.HTTP/REST.-> API
    Runner1 -.HTTP/REST.-> API
    RunnerN -.HTTP/REST.-> API

Client

Torc provides client libraries in multiple languages for workflow management.

Rust Client (Primary)

The Rust client provides both CLI and library interfaces:

Workflow Creation

  • Parse workflow specification files (JSON, JSON5, YAML, KDL)
  • Expand parameterized job/file specifications
  • Create all workflow components atomically via API calls
  • Handle name-to-ID resolution for dependencies

Workflow Manager

  • Start/restart/reinitialize workflow execution
  • Track file changes and update database
  • Detect changed user_data inputs
  • Validate workflow state before initialization

API Integration

  • Auto-generated client from OpenAPI spec
  • Pagination support for large result sets
  • Retry logic and error handling

Client Modes

The Rust client operates in multiple modes:

  1. CLI Mode - Command-line interface for interactive use
  2. Library Mode - Programmatic API for integration with other tools
  3. Specification Parser - Reads and expands workflow specifications
  4. API Client - HTTP client for communicating with the server

Python Client

The Python client (torc package) provides programmatic workflow management for Python users:

  • OpenAPI-generated client for full API access
  • make_api() helper for easy server connection
  • map_function_to_jobs() for mapping Python functions across parameters
  • Integration with Python data science and ML pipelines

See Creating Workflows for usage examples.

Julia Client

The Julia client (Torc.jl package) provides programmatic workflow management for Julia users:

  • OpenAPI-generated client for full API access
  • make_api() helper for easy server connection
  • send_api_command() wrapper with error handling
  • add_jobs() for batch job creation
  • map_function_to_jobs() for mapping Julia functions across parameters

See Creating Workflows for usage examples.

Job Runners

Job runners are worker processes that execute jobs on compute resources.

Job Runner Modes

  1. Local Runner (torc run) - Runs jobs on local machine with resource tracking
  2. Slurm Runner (torc-slurm-job-runner) - Submits jobs to Slurm clusters

Job Allocation Strategies

The job runner supports two different strategies for retrieving and executing jobs:

Resource-Based Allocation (Default)

Used when: --max-parallel-jobs is NOT specified

Behavior:

  • Retrieves jobs from the server via the command claim_jobs_based_on_resources
  • Server filters jobs based on available compute node resources (CPU, memory, GPU)
  • Only returns jobs that fit within the current resource capacity
  • Prevents resource over-subscription and ensures jobs have required resources
  • Defaults to requiring one CPU and 1 MB of memory for each job.

Use cases:

  • When you want parallelization based on one CPU per job.
  • When you have heterogeneous jobs with different resource requirements and want intelligent resource management.

Example 1: Run jobs at queue depth of num_cpus:

parameters:
  i: "1:100"
jobs:
  - name: "work_{i}"
    command: bash my_script.sh {i}
    use_parameters:
    - i

Example 2: Resource-based parallelization:

resource_requirements:
  - name: "work_resources"
    num_cpus: 32
    memory: "200g"
    runtime: "PT4H"
    num_nodes: 1
    
parameters:
  i: "1:100"
jobs:
  - name: "work_{i}"
    command: bash my_script.sh {i}
    resource_requirements: work_resources  
    use_parameters:
    - i

Simple Queue-Based Allocation

Used when: --max-parallel-jobs is specified

Behavior:

  • Retrieves jobs from the server via the command claim_next_jobs
  • Server returns the next N ready jobs from the queue (up to the specified limit)
  • Ignores job resource requirements completely
  • Simply limits the number of concurrent jobs

Use cases: When all jobs have similar resource needs or when the resource bottleneck is not tracked by Torc, such as network or storage I/O. This is the only way to run jobs at a queue depth higher than the number of CPUs in the worker.

Example:

torc run $WORKFLOW_ID \
  --max-parallel-jobs 10 \
  --output-dir ./results

Job Runner Workflow

The job runner executes a continuous loop with these steps:

  1. Check workflow status - Poll server to check if workflow is complete or canceled
  2. Monitor running jobs - Check status of currently executing jobs
  3. Execute workflow actions - Check for and execute any pending workflow actions, such as scheduling new Slurm allocations.
  4. Claim new jobs - Request ready jobs from server based on allocation strategy:
    • Resource-based: claim_jobs_based_on_resources
    • Queue-based: claim_next_jobs
  5. Start jobs - For each claimed job:
    • Call start_job to mark job as started in database
    • Execute job command in a non-blocking subprocess
    • Record stdout/stderr output to files
  6. Complete jobs - When running jobs finish:
    • Call complete_job with exit code and result
    • Server updates job status and automatically marks dependent jobs as ready
  7. Sleep and repeat - Wait for job completion poll interval, then repeat loop

The runner continues until the workflow is complete or canceled.

Resource Management (Resource-Based Allocation Only)

When using resource-based allocation (default), the local job runner tracks:

  • Number of CPUs in use
  • Memory allocated to running jobs
  • GPUs in use
  • Job runtime limits

When a ready job is retrieved, the runner checks if sufficient resources are available before executing it.

Job State Transitions

Jobs progress through a defined lifecycle:

stateDiagram-v2
    [*] --> uninitialized
    uninitialized --> ready: initialize_jobs called
    uninitialized --> blocked: initialize_jobs called<br/>(dependencies not met)

    ready --> pending: job runner claims job
    blocked --> ready: dependency completed
    pending --> running: job runner starts job

    running --> completed: exit code 0
    running --> failed: exit code != 0
    running --> canceled: explicit cancellation
    running --> terminated: explicit termination

    completed --> [*]
    failed --> [*]
    canceled --> [*]
    terminated --> [*]

State Descriptions

  • uninitialized (0) - Job created but dependencies not evaluated
  • blocked (1) - Waiting for dependencies to complete
  • ready (2) - All dependencies satisfied, ready for execution
  • pending (3) - Job claimed by runner
  • running (4) - Currently executing
  • completed (5) - Finished successfully (exit code 0)
  • failed (6) - Finished with error (exit code != 0)
  • canceled (7) - Explicitly canceled by user or system
  • terminated (8) - Explicitly terminated by system, such as for checkpointing before wall-time timeout
  • disabled (9) - Explicitly disabled by user

Environment Variables

When Torc executes jobs, it automatically sets several environment variables that provide context about the job and enable communication with the Torc server. These variables are available to all job commands during execution.

Variables Set During Job Execution

TORC_WORKFLOW_ID

The unique identifier of the workflow that contains this job.

  • Type: Integer (provided as string)
  • Example: "42"
  • Use case: Jobs can use this to query workflow information or to organize output files by workflow

# Example: Create a workflow-specific output directory
mkdir -p "/data/results/workflow_${TORC_WORKFLOW_ID}"
echo "Processing data..." > "/data/results/workflow_${TORC_WORKFLOW_ID}/output.txt"

TORC_JOB_ID

The unique identifier of the currently executing job.

  • Type: Integer (provided as string)
  • Example: "123"
  • Use case: Jobs can use this for logging, creating job-specific output files, or querying job metadata

# Example: Log job-specific information
echo "Job ${TORC_JOB_ID} started at $(date)" >> "/var/log/torc/job_${TORC_JOB_ID}.log"

TORC_API_URL

The URL of the Torc API server that the job runner is communicating with.

  • Type: String (URL)
  • Example: "http://localhost:8080/torc-service/v1"
  • Use case: Jobs can make API calls to the Torc server to query data, create files, update user data, or perform other operations

# Example: Query workflow information from within a job
curl -s "${TORC_API_URL}/workflows/${TORC_WORKFLOW_ID}" | jq '.name'

# Example: Create a file entry in Torc
curl -X POST "${TORC_API_URL}/files" \
  -H "Content-Type: application/json" \
  -d "{
    \"workflow_id\": ${TORC_WORKFLOW_ID},
    \"name\": \"result_${TORC_JOB_ID}\",
    \"path\": \"/data/results/output.txt\"
  }"

Complete Example

Here’s a complete example of a job that uses all three environment variables:

name: "Environment Variables Demo"
user: "demo"

jobs:
  - name: "example_job"
    command: |
      #!/bin/bash
      set -e

      echo "=== Job Environment ==="
      echo "Workflow ID: ${TORC_WORKFLOW_ID}"
      echo "Job ID: ${TORC_JOB_ID}"
      echo "API URL: ${TORC_API_URL}"

      # Create job-specific output directory
      OUTPUT_DIR="/tmp/workflow_${TORC_WORKFLOW_ID}/job_${TORC_JOB_ID}"
      mkdir -p "${OUTPUT_DIR}"

      # Do some work
      echo "Processing data..." > "${OUTPUT_DIR}/status.txt"
      date >> "${OUTPUT_DIR}/status.txt"
      echo "Job completed successfully!"

Notes

  • All environment variables are set as strings, even numeric values like workflow and job IDs
  • The TORC_API_URL includes the full base path to the API (e.g., /torc-service/v1)
  • Jobs inherit all other environment variables from the job runner process
  • These variables are available in both local and Slurm-scheduled job executions

Workflow Reinitialization

Reinitialization allows workflows to be rerun when inputs change.

Reinitialize Process

  1. Bump run_id - Increments workflow run counter.
  2. Reset workflow status - Clears previous run state.
  3. Process changed files - Detects file modifications via st_mtime.
  4. Process changed user_data - Computes input hashes and detects changes.
  5. Mark jobs for rerun - Sets affected jobs to uninitialized.
  6. Re-initialize jobs - Re-evaluates dependencies and marks jobs ready/blocked.

Input Change Detection

The process_changed_job_inputs endpoint implements hash-based change detection:

  1. For each job, compute SHA256 hash of all input parameters. Note: files are tracked by modification times, not hashes. User data records are hashed.
  2. Compare to stored hash in the database.
  3. If hash differs, mark job as uninitialized.
  4. All updates happen in a single database transaction (all-or-none).

After jobs are marked uninitialized, calling initialize_jobs re-evaluates the dependency graph:

  • Jobs with satisfied dependencies → ready
  • Jobs waiting on dependencies → blocked

Use Cases

  • Development iteration - Modify input files and re-run affected jobs
  • Parameter updates - Change configuration and re-execute
  • Failed job recovery - Fix issues and re-run without starting from scratch
  • Incremental computation - Only re-run jobs affected by changes

Workflow Archiving

Workflow archiving provides a way to hide completed or inactive workflows from default list views while preserving all workflow data and execution history. Archived workflows remain fully accessible but don’t clutter everyday workflow management operations.

Purpose and Motivation

As projects mature and accumulate workflows over time, the list of active workflows can become difficult to navigate. Archiving addresses this by:

  • Reducing visual clutter - Completed workflows no longer appear in default list views
  • Preserving historical data - All workflow data, jobs, results, and logs remain accessible
  • Improving usability - Users can focus on active workflows without losing access to past work
  • Maintaining audit trails - Archived workflows can be retrieved for analysis, debugging, or compliance

Archiving is particularly useful for:

  • Completed experiments that may need future reference
  • Successful production runs that serve as historical records
  • Development workflows that are no longer active but contain valuable examples
  • Workflows from completed projects that need to be retained for documentation

How It Works

When you archive a workflow, it’s marked with an “archived” flag. This flag controls whether the workflow appears in default list views:

  • Active workflows (not archived): Appear in standard workflows list commands
  • Archived workflows: Hidden from default lists but accessible with the --archived-only flag

The archive status is just metadata - it doesn’t affect the workflow’s data, results, or any other functionality.

Archiving Workflows

Use the workflows archive command to archive or unarchive workflows:

# Archive a specific workflow
torc workflows archive true <workflow_id>

# Archive multiple workflows
torc workflows archive true 123 456 789

# Interactive selection (prompts user to choose)
torc workflows archive true

# With JSON output
torc --format json workflows archive true <workflow_id>

The command will output confirmation messages:

Successfully archived workflow 123
Successfully archived workflow 456
Successfully archived workflow 789

Unarchiving Workflows

To restore an archived workflow to active status, use the same command with false:

# Unarchive a specific workflow
torc workflows archive false <workflow_id>

# Unarchive multiple workflows
torc workflows archive false 123 456 789

# Interactive selection
torc workflows archive false

Output:

Successfully unarchived workflow 123

Viewing Workflows

Default Behavior

By default, the workflows list command shows only non-archived workflows:

# Shows active (non-archived) workflows only
torc workflows list

# Shows active workflows for a specific user
torc workflows list --user alice

Viewing Archived Workflows

Use the --archived-only flag to see archived workflows:

# List only archived workflows for current user
torc workflows list --archived-only

# List all archived workflows for all users
torc workflows list --all-users --archived-only

Viewing All Workflows

Use the --include-archived flag to see all workflows:

torc workflows list --include-archived

Accessing Specific Workflows

You can always access a workflow directly by its ID, regardless of archive status:

# Get details of any workflow (archived or not)
torc workflows get <workflow_id>

# Check workflow status
torc workflows status <workflow_id>

Impact on Workflow Operations

Operations Restricted on Archived Workflows

Certain workflow operations are not allowed on archived workflows to prevent accidental modifications:

  • Status reset: Cannot use workflows reset-status on archived workflows
    • Error message: “Cannot reset archived workflow status. Unarchive the workflow first.”
    • To reset status, unarchive the workflow first, then reset
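
For example (a sketch; the reset-status invocation assumes the workflow ID is passed positionally, as with the other workflows subcommands):

torc workflows archive false <workflow_id>
torc workflows reset-status <workflow_id>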

Interactive Selection Behavior

When commands prompt for interactive workflow selection (when workflow ID is not specified), archived workflows are excluded by default:

# These will NOT show archived workflows in the interactive menu
torc-client workflows delete
torc-client workflows status
torc-client workflows initialize

This prevents accidentally operating on archived workflows while still allowing explicit access by ID.

Archive vs. Delete

Understanding when to archive versus delete workflows:

| Operation | Data Preserved | Reversible | Use Case |
|-----------|----------------|------------|----------|
| Archive | ✅ Yes | ✅ Yes | Completed workflows you may reference later |
| Delete | ❌ No | ❌ No | Failed experiments, test workflows, unwanted data |

Archive when:

  • Workflow completed successfully and may need future reference
  • Results should be preserved for reproducibility or compliance
  • Workflow represents a milestone or important historical run
  • You want to declutter lists but maintain data integrity

Delete when:

  • Workflow failed and results are not useful
  • Workflow was created for testing purposes only
  • Data is no longer needed and storage space is a concern
  • Workflow contains errors that would confuse future users

Common Use Cases

Completed Experiments

After completing an experiment and validating results:

# Archive the completed experiment
torc-client workflows archive true 123

# Later, if you need to reference it
torc-client workflows get 123
torc-client results list 123

Development Cleanup

Clean up development workflows while preserving examples:

# Delete test workflows
torc-client workflows delete 301 302 303

# Archive useful development examples
torc-client workflows archive true 304 305

Periodic Maintenance

Regularly archive old workflows to keep lists manageable:

# List workflows, identify completed ones
torc-client workflows list

# Archive workflows from completed projects
torc workflows archive true 401 402 403 404 405

Best Practices

When to Archive

  1. After successful completion - Archive workflows once they’ve completed successfully and been validated
  2. Project milestones - Archive workflows representing project phases or releases
  3. Regular cleanup - Establish periodic archiving of workflows older than a certain timeframe
  4. Before major changes - Archive working versions before making significant modifications

Summary

Workflow archiving provides a simple, reversible way to hide completed or inactive workflows from default views while preserving all data and functionality. It’s designed for long-term workflow management in active projects where historical data is valuable but visual clutter is undesirable.

Key points:

  • Archive workflows with: torc workflows archive true <id>
  • Unarchive workflows with: torc workflows archive false <id>
  • Archived workflows are hidden from default lists but remain fully functional
  • View archived workflows with: torc workflows list --archived-only
  • Archiving is reversible and does not affect data storage
  • Use archiving for completed workflows; use deletion for unwanted data

Dependency Resolution

Torc supports two types of dependencies. For a hands-on tutorial, see Diamond Workflow with File Dependencies.

1. Explicit Dependencies

Declared via depends_on:

jobs:
  - name: preprocess
    command: preprocess.sh
  - name: analyze
    command: analyze.sh
    depends_on:
      - preprocess

2. Implicit Dependencies

Inferred from file and user_data relationships.

File Dependencies

jobs:
  - name: preprocess
    command: process.sh
    output_files:
      - intermediate_data

  - name: analyze
    command: analyze.sh
    input_files:
      - intermediate_data  # Implicitly depends on preprocess

User Data Dependencies

User scripts ingest JSON data into Torc’s database. This is analogous to using JSON files, except that the data is stored in the database and user code must use Torc’s API.

jobs:
  - name: generate_config
    command: make_config.py
    output_user_data:
      - config

  - name: run_simulation
    command: simulate.py
    input_user_data:
      - config  # Implicitly depends on generate_config
      
user_data:
  - name: config

Resolution Process

During workflow creation, the server:

  1. Resolves all names to IDs
  2. Stores explicit dependencies in job_depends_on
  3. Stores file/user_data relationships in junction tables
  4. During initialize_jobs, queries junction tables to add implicit dependencies

Dependency Graph Evaluation

When initialize is called:

  1. All jobs start in uninitialized state
  2. Server builds complete dependency graph from explicit and implicit dependencies
  3. Jobs with no unsatisfied dependencies are marked ready
  4. Jobs waiting on dependencies are marked blocked
  5. As jobs complete, blocked jobs are re-evaluated and may become ready

Variable Substitution Syntax

In workflow specification files (YAML, JSON5, KDL), use these patterns to reference files and user data in job commands:

| Pattern | Description |
|---------|-------------|
| ${files.input.NAME} | File path this job reads (creates implicit dependency) |
| ${files.output.NAME} | File path this job writes (satisfies dependencies) |
| ${user_data.input.NAME} | User data this job reads |
| ${user_data.output.NAME} | User data this job writes |

Example:

jobs:
  - name: process
    command: "python process.py -i ${files.input.raw} -o ${files.output.result}"

See Workflow Specification Formats for complete syntax details.

Parallelization Strategies

Torc provides flexible parallelization strategies to accommodate different workflow patterns and resource allocation scenarios. Understanding these strategies helps you optimize job execution for your specific use case.

Overview

Torc supports two primary approaches to parallel job execution:

  1. Resource-aware allocation - Define per-job resource requirements and let runners intelligently select jobs that fit available resources
  2. Queue-depth parallelism - Control the number of concurrent jobs without resource tracking

The choice between these approaches depends on your workflow characteristics and execution environment.

Use Case 1: Resource-Aware Job Allocation

This strategy is ideal for heterogeneous workflows where jobs have varying resource requirements (CPU, memory, GPU, runtime). The server intelligently allocates jobs based on available compute node resources.

How It Works

When you define resource requirements for each job:

resource_requirements:
  - name: small
    num_cpus: 2
    num_gpus: 0
    memory: 4g
    runtime: PT30M

  - name: large
    num_cpus: 16
    num_gpus: 2
    memory: 128g
    runtime: PT8H

jobs:
  - name: preprocessing
    command: ./preprocess.sh
    resource_requirements: small

  - name: model_training
    command: python train.py
    resource_requirements: large

The job runner automatically detects its available resources and pulls jobs from the server that fit within them:

torc run $WORKFLOW_ID

The server’s GET /workflows/{id}/claim_jobs_based_on_resources endpoint:

  1. Receives the runner’s resource capacity
  2. Queries the ready queue for jobs that fit within those resources
  3. Returns a set of jobs that can run concurrently without over-subscription
  4. Updates job status from ready to pending atomically
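
If automatic detection does not reflect the capacity you want the runner to use, you can state it explicitly with the resource flags shown in the mixed-approach example later in this section:

torc run $WORKFLOW_ID \
  --num-cpus 16 \
  --memory-gb 64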

Job Allocation Ambiguity: Two Approaches

When you have multiple compute nodes or schedulers with different capabilities, there are two ways to handle job allocation:

Approach 1: Sort Method (Flexible but Potentially Ambiguous)

How it works:

  • Jobs do NOT specify a particular scheduler/compute node
  • The server uses a job_sort_method parameter to prioritize jobs when allocating
  • Any runner with sufficient resources can claim any ready job

Available sort methods (set the job_sort_method field in the workflow specification file, whether YAML, JSON, or KDL):

  • gpus_runtime_memory - Prioritize jobs by GPU count (desc), then runtime (desc), then memory (desc)
  • gpus_memory_runtime - Prioritize jobs by GPU count (desc), then memory (desc), then runtime (desc)
  • none - No sorting, jobs selected in queue order

Tradeoffs:

Advantages:

  • Maximum flexibility - any runner can execute any compatible job
  • Better resource utilization - if GPU runner is idle, it can pick up CPU-only jobs
  • Simpler workflow specifications - no need to explicitly map jobs to schedulers
  • Fault tolerance - if one runner fails, others can pick up its jobs

Disadvantages:

  • Ambiguity - no guarantee GPU jobs go to GPU runners
  • Potential inefficiency - high-memory jobs might land on low-memory nodes if timing is unlucky
  • Requires careful sort method selection
  • Less predictable job placement

When to use:

  • Homogeneous or mostly-homogeneous compute resources
  • Workflows where job placement flexibility is valuable
  • When you want runners to opportunistically pick up work
  • Development and testing environments

Approach 2: Scheduler ID (Deterministic but Less Flexible)

How it works:

  • Define scheduler configurations in your workflow spec
  • Assign each job a specific scheduler_id
  • Runners provide their scheduler_config_id when requesting jobs
  • Server only returns jobs matching that scheduler ID

Example workflow specification:

slurm_schedulers:
  - name: gpu_cluster
    partition: gpu
    account: myproject

  - name: highmem_cluster
    partition: highmem
    account: myproject

jobs:
  - name: model_training
    command: python train.py
    resource_requirements: large
    slurm_scheduler: gpu_cluster     # Binds to specific scheduler

  - name: large_analysis
    command: ./analyze.sh
    resource_requirements: highmem
    slurm_scheduler: highmem_cluster

Example runner invocation:

# GPU runner - only pulls jobs assigned to gpu_cluster
torc-slurm-job-runner $WORKFLOW_ID \
  --scheduler-config-id 1 \
  --num-cpus 32 \
  --num-gpus 8

# High-memory runner - only pulls jobs assigned to highmem_cluster
torc-slurm-job-runner $WORKFLOW_ID \
  --scheduler-config-id 2 \
  --num-cpus 64 \
  --memory-gb 512

Tradeoffs:

Advantages:

  • Zero ambiguity - jobs always run on intended schedulers
  • Predictable job placement
  • Prevents GPU jobs from landing on CPU-only nodes
  • Clear workflow specification - explicit job→scheduler mapping
  • Better for heterogeneous clusters (GPU vs CPU vs high-memory)

Disadvantages:

  • Less flexibility - idle runners can’t help other queues
  • Potential resource underutilization - GPU runner sits idle while CPU queue is full
  • More complex workflow specifications
  • If a scheduler fails, its jobs remain stuck until that scheduler returns

When to use:

  • Highly heterogeneous compute resources (GPU clusters, high-memory nodes, specialized hardware)
  • Production workflows requiring predictable job placement
  • Multi-cluster environments
  • When job-resource matching is critical (e.g., GPU-only codes, specific hardware requirements)
  • Slurm or HPC scheduler integrations

Choosing Between Sort Method and Scheduler ID

| Scenario | Recommended Approach | Rationale |
|----------|----------------------|-----------|
| All jobs can run anywhere | Sort method | Maximum flexibility, simplest spec |
| Some jobs need GPUs, some don’t | Scheduler ID | Prevent GPU waste on CPU jobs |
| Multi-cluster Slurm environment | Scheduler ID | Jobs must target correct clusters |
| Development/testing | Sort method | Easier to experiment |
| Production with SLAs | Scheduler ID | Predictable resource usage |
| Homogeneous compute nodes | Sort method | No benefit to restricting |
| Specialized hardware (GPUs, high-memory, FPGAs) | Scheduler ID | Match jobs to capabilities |

You can also mix approaches: Use scheduler_id for jobs with strict requirements, leave it NULL for flexible jobs.

Use Case 2: Queue-Depth Parallelism

This strategy is ideal for workflows with homogeneous resource requirements where you simply want to control the level of parallelism.

How It Works

Instead of tracking resources, you specify a maximum number of concurrent jobs:

torc run $WORKFLOW_ID \
  --max-parallel-jobs 10 \
  --output-dir ./results

or with Slurm:

torc slurm schedule-nodes $WORKFLOW_ID \
  --scheduler-config-id 1 \
  --num-hpc-jobs 4 \
  --max-parallel-jobs 8

Server behavior:

The GET /workflows/{id}/claim_next_jobs endpoint:

  1. Accepts limit parameter specifying maximum jobs to return
  2. Ignores all resource requirements
  3. Returns the next N ready jobs from the queue
  4. Updates their status from ready to pending

Runner behavior:

  • Maintains a count of running jobs
  • When count falls below max_parallel_jobs, requests more work
  • Does NOT track CPU, memory, GPU, or other resources
  • Simply enforces the concurrency limit

Ignoring Resource Consumption

This is a critical distinction: when using --max-parallel-jobs, the runner completely ignores current resource consumption.

Normal resource-aware mode:

Runner has: 32 CPUs, 128 GB memory
Job A needs: 16 CPUs, 64 GB
Job B needs: 16 CPUs, 64 GB
Job C needs: 16 CPUs, 64 GB

Runner starts Job A and Job B (resources fully allocated)
Job C waits until resources free up

Queue-depth mode with --max-parallel-jobs 3:

Runner has: 32 CPUs, 128 GB memory (IGNORED)
Job A needs: 16 CPUs, 64 GB (IGNORED)
Job B needs: 16 CPUs, 64 GB (IGNORED)
Job C needs: 16 CPUs, 64 GB (IGNORED)

Runner starts Job A, Job B, and Job C simultaneously
Total requested: 48 CPUs, 192 GB (exceeds node capacity!)
System may: swap, OOM, or throttle performance

When to Use Queue-Depth Parallelism

✅ Use queue-depth parallelism when:

  1. All jobs have similar resource requirements

    # All jobs use ~4 CPUs, ~8GB memory
    jobs:
      - name: process_file_1
        command: ./process.sh file1.txt
      - name: process_file_2
        command: ./process.sh file2.txt
      # ... 100 similar jobs
    
  2. Resource requirements are negligible compared to node capacity

    • Running 100 lightweight Python scripts on a 64-core machine
    • I/O-bound jobs that don’t consume much CPU/memory
  3. Jobs are I/O-bound or sleep frequently

    • Data download jobs
    • Jobs waiting on external services
    • Polling or monitoring tasks
  4. You want simplicity over precision

    • Quick prototypes
    • Testing workflows
    • Simple task queues
  5. Jobs self-limit their resource usage

    • Application has built-in thread pools
    • Container resource limits
    • OS-level cgroups or resource controls

❌ Avoid queue-depth parallelism when:

  1. Jobs have heterogeneous resource requirements

    • Mix of 2-CPU and 32-CPU jobs
    • Some jobs need 4GB, others need 128GB
  2. Resource contention causes failures

    • Out-of-memory errors
    • CPU thrashing
    • GPU memory exhaustion
  3. You need efficient bin-packing

    • Maximizing node utilization
    • Complex resource constraints
  4. Jobs are compute-intensive

    • CPU-bound numerical simulations
    • Large matrix operations
    • Video encoding

Queue-Depth Parallelism in Practice

Example 1: Slurm with Queue Depth

# Schedule 4 Slurm nodes, each running up to 8 concurrent jobs
torc slurm schedule-nodes $WORKFLOW_ID \
  --scheduler-config-id 1 \
  --num-hpc-jobs 4 \
  --max-parallel-jobs 8

This creates 4 Slurm job allocations. Each allocation runs a worker that:

  • Pulls up to 8 jobs at a time
  • Runs them concurrently
  • Requests more when any job completes

Total concurrency: up to 32 jobs (4 nodes × 8 jobs/node)

Example 2: Local Runner with Queue Depth

# Run up to 20 jobs concurrently on local machine
torc-job-runner $WORKFLOW_ID \
  --max-parallel-jobs 20 \
  --output-dir ./output

Example 3: Mixed Approach

You can even run multiple runners with different strategies:

# Terminal 1: Resource-aware runner for large jobs
torc run $WORKFLOW_ID \
  --num-cpus 32 \
  --memory-gb 256

# Terminal 2: Queue-depth runner for small jobs
torc run $WORKFLOW_ID \
  --max-parallel-jobs 50

The ready queue serves both runners. The resource-aware runner gets large jobs that fit its capacity, while the queue-depth runner gets small jobs for fast parallel execution.

Performance Characteristics

Resource-aware allocation:

  • Query complexity: O(jobs in ready queue)
  • Requires computing resource sums
  • Slightly slower due to filtering and sorting
  • Better resource utilization

Queue-depth allocation:

  • Query complexity: O(1) with limit
  • Simple LIMIT clause, no resource computation
  • Faster queries
  • Simpler logic

For workflows with thousands of ready jobs, queue-depth allocation has lower overhead.

Best Practices

  1. Start with resource-aware allocation for new workflows

    • Better default behavior
    • Prevents resource over-subscription
    • Easier to debug resource issues
  2. Use scheduler_id for production multi-cluster workflows

    • Explicit job placement
    • Predictable resource usage
    • Better for heterogeneous resources
  3. Use sort_method for flexible single-cluster workflows

    • Simpler specifications
    • Better resource utilization
    • Good for homogeneous resources
  4. Use queue-depth parallelism for homogeneous task queues

    • Many similar jobs
    • I/O-bound workloads
    • When simplicity matters more than precision
  5. Monitor resource usage when switching strategies

    • Check for over-subscription
    • Verify expected parallelism
    • Look for resource contention
  6. Test with small workflows first

    • Validate job allocation behavior
    • Check resource accounting
    • Ensure jobs run on intended schedulers

Summary

| Strategy | Use When | Allocation Method | Resource Tracking |
|----------|----------|-------------------|-------------------|
| Resource-aware + sort_method | Heterogeneous jobs, flexible allocation | Server filters by resources | Yes |
| Resource-aware + scheduler_id | Heterogeneous jobs, strict allocation | Server filters by resources AND scheduler | Yes |
| Queue-depth | Homogeneous jobs, simple parallelism | Server returns next N jobs | No |

Choose the strategy that best matches your workflow characteristics and execution environment. You can even mix strategies across different runners for maximum flexibility.

Workflow Actions

Workflow actions enable automatic execution of commands and resource allocation in response to workflow lifecycle events. Actions provide hooks for setup, teardown, monitoring, and dynamic resource management throughout workflow execution.

Overview

Actions consist of three components:

  1. Trigger - The condition that activates the action
  2. Action Type - The operation to perform
  3. Configuration - Parameters specific to the action

actions:
  - trigger_type: "on_workflow_start"
    action_type: "run_commands"
    commands:
      - "mkdir -p output logs"
      - "echo 'Workflow started' > logs/status.txt"

Trigger Types

Workflow Lifecycle Triggers

on_workflow_start

Executes once when the workflow is initialized.

When it fires: During initialize_jobs after jobs are transitioned from uninitialized to ready/blocked states.

Typical use cases:

  • Scheduling Slurm allocations
  • Creating directory structures
  • Copying initial data

- trigger_type: "on_workflow_start"
  action_type: "run_commands"
  commands:
    - "mkdir -p output checkpoints temp"
    - "echo 'Workflow started at $(date)' > workflow.log"

on_workflow_complete

Executes once when all jobs reach terminal states (completed, failed, or canceled).

When it fires: After the last job completes, as detected by the job runner.

Typical use cases:

  • Archiving final results
  • Uploading to remote storage
  • Cleanup of temporary files
  • Generating summary reports

- trigger_type: "on_workflow_complete"
  action_type: "run_commands"
  commands:
    - "tar -czf results.tar.gz output/"
    - "aws s3 cp results.tar.gz s3://bucket/results/"
    - "rm -rf temp/"

Job-Based Triggers

on_jobs_ready

Executes when all specified jobs transition to the “ready” state.

When it fires: When the last specified job becomes ready to execute (all dependencies satisfied).

Typical use cases:

  • Scheduling Slurm allocations
  • Starting phase-specific monitoring
  • Pre-computation setup
  • Notifications before expensive operations

- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  jobs: ["train_model_001", "train_model_002", "train_model_003"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2

Important: The action triggers only when all matching jobs are ready, not individually as each job becomes ready.

on_jobs_complete

Executes when all specified jobs reach terminal states (completed, failed, or canceled).

When it fires: When the last specified job finishes execution.

Typical use cases:

  • Scheduling Slurm allocations
  • Cleaning up intermediate files
  • Archiving phase results
  • Freeing storage space
  • Phase-specific reporting

- trigger_type: "on_jobs_complete"
  action_type: "run_commands"
  jobs: ["preprocess_1", "preprocess_2", "preprocess_3"]
  commands:
    - "echo 'Preprocessing phase complete' >> workflow.log"
    - "rm -rf raw_data/"

Worker Lifecycle Triggers

Worker lifecycle triggers are persistent by default, meaning they execute once per worker (job runner), not once per workflow.

on_worker_start

Executes when each worker (job runner) starts.

When it fires: After a job runner starts and checks for workflow actions, before claiming any jobs.

Typical use cases:

  • Worker-specific initialization
  • Setting up worker-local logging
  • Copying data to compute node local storage
  • Initializing worker-specific resources
  • Recording worker startup metrics

- trigger_type: "on_worker_start"
  action_type: "run_commands"
  persistent: true  # Each worker executes this
  commands:
    - "echo 'Worker started on $(hostname) at $(date)' >> worker.log"
    - "mkdir -p worker_temp"

on_worker_complete

Executes when each worker completes (exits the main loop).

When it fires: After a worker finishes processing jobs and before it shuts down.

Typical use cases:

  • Worker-specific cleanup
  • Uploading worker-specific logs
  • Recording worker completion metrics
  • Cleaning up worker-local resources

- trigger_type: "on_worker_complete"
  action_type: "run_commands"
  persistent: true  # Each worker executes this
  commands:
    - "echo 'Worker completed on $(hostname) at $(date)' >> worker.log"
    - "rm -rf worker_temp"

Job Selection

For on_jobs_ready and on_jobs_complete triggers, specify which jobs to monitor.

Exact Job Names

- trigger_type: "on_jobs_complete"
  action_type: "run_commands"
  jobs: ["job1", "job2", "job3"]
  commands:
    - "echo 'Specific jobs complete'"

Regular Expressions

- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  job_name_regexes: ["train_model_[0-9]+", "eval_.*"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2

Common regex patterns:

  • "train_.*" - All jobs starting with “train_”
  • "model_[0-9]+" - Jobs like “model_1”, “model_2”
  • ".*_stage1" - All jobs ending with “_stage1”
  • "job_(a|b|c)" - Jobs “job_a”, “job_b”, or “job_c”

Combining Selection Methods

You can use both together - the action triggers when all matching jobs meet the condition:

jobs: ["critical_job"]
job_name_regexes: ["batch_.*"]
# Triggers when "critical_job" AND all "batch_*" jobs are ready/complete

Action Types

run_commands

Execute shell commands sequentially on a compute node.

Configuration:

- trigger_type: "on_workflow_complete"
  action_type: "run_commands"
  commands:
    - "tar -czf results.tar.gz output/"
    - "aws s3 cp results.tar.gz s3://bucket/"

Execution details:

  • Commands run in the workflow’s output directory
  • Commands execute sequentially (one after another)
  • If a command fails, the action fails (but workflow continues)
  • Commands run on compute nodes, not the submission node
  • Uses the shell environment of the job runner process

schedule_nodes

Dynamically allocate compute resources from a Slurm scheduler.

Configuration:

- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  jobs: ["train_model_1", "train_model_2"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2
  start_one_worker_per_node: true
  max_parallel_jobs: 8

Parameters:

  • scheduler (required) - Name of Slurm scheduler configuration (must exist in slurm_schedulers)
  • scheduler_type (required) - Must be “slurm”
  • num_allocations (required) - Number of Slurm allocation requests to submit
  • start_one_worker_per_node (optional) - Start one job runner per node (default: false)
  • max_parallel_jobs (optional) - Maximum concurrent jobs per runner

Use cases:

  • Just-in-time resource allocation
  • Cost optimization (allocate only when needed)
  • Separating workflow phases with different resource requirements

Complete Examples

Refer to this example

Execution Model

Action Claiming and Execution

  1. Atomic Claiming: Actions are claimed atomically by workers to prevent duplicate execution
  2. Non-Persistent Actions: Execute once per workflow (first worker to claim executes)
  3. Persistent Actions: Can be claimed and executed by multiple workers
  4. Trigger Counting: Job-based triggers increment a counter as jobs transition; action becomes pending when count reaches threshold
  5. Immediate Availability: Worker lifecycle actions are immediately available after workflow initialization

Action Lifecycle

[Workflow Created]
    ↓
[initialize_jobs called]
    ↓
├─→ on_workflow_start actions become pending
├─→ on_worker_start actions become pending (persistent)
├─→ on_worker_complete actions become pending (persistent)
└─→ on_jobs_ready actions wait for job transitions
    ↓
[Worker Claims and Executes Actions]
    ↓
[Jobs Execute]
    ↓
[Jobs Complete]
    ↓
├─→ on_jobs_complete actions become pending when all specified jobs complete
└─→ on_workflow_complete actions become pending when all jobs complete
    ↓
[Workers Exit]
    ↓
[on_worker_complete actions execute per worker]

Important Characteristics

  1. No Rollback: Failed actions don’t affect workflow execution
  2. Compute Node Execution: Actions run on compute nodes via job runners
  3. One-Time Triggers: Non-persistent actions trigger once when conditions are first met
  4. No Inter-Action Dependencies: Actions don’t depend on other actions
  5. Concurrent Workers: Multiple workers can execute different actions simultaneously

Workflow Reinitialization

When a workflow is reinitialized (e.g., after resetting failed jobs), actions are reset to allow them to trigger again:

  1. Executed flags are cleared: All actions can be claimed and executed again
  2. Trigger counts are recalculated: For on_jobs_ready and on_jobs_complete actions, the trigger count is set based on current job states

Example scenario:

  • job1 and job2 are independent jobs
  • postprocess_job depends on both job1 and job2
  • An on_jobs_ready action triggers when postprocess_job becomes ready
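
A minimal spec sketch of this scenario (the commands and action body are illustrative):

jobs:
  - name: job1
    command: bash job1.sh
  - name: job2
    command: bash job2.sh
  - name: postprocess_job
    command: bash postprocess.sh
    depends_on: [job1, job2]

actions:
  - trigger_type: "on_jobs_ready"
    action_type: "run_commands"
    jobs: ["postprocess_job"]
    commands:
      - "echo 'postprocess_job is ready'"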

After first run completes:

  1. job1 fails, job2 succeeds
  2. User resets failed jobs and reinitializes
  3. job2 is already Completed, so it counts toward the trigger count
  4. When job1 completes in the second run, postprocess_job becomes ready
  5. The action triggers again because the trigger count reaches the required threshold

This ensures actions properly re-trigger after workflow reinitialization, even when some jobs remain in their completed state.

Limitations

  1. No Action Dependencies: Actions cannot depend on other actions completing
  2. No Conditional Execution: Actions cannot have conditional logic (use multiple actions with different job selections instead)
  3. No Action Retries: Failed actions are not automatically retried
  4. Single Action Type: Each action has one action_type (cannot combine run_commands and schedule_nodes)
  5. No Dynamic Job Selection: Job names/patterns are fixed at action creation time

For complex workflows requiring these features, consider:

  • Using job dependencies to order operations
  • Creating separate jobs for conditional logic
  • Implementing retry logic within command scripts
  • Creating multiple actions for different scenarios
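
For instance, retry logic can live inside the job command itself rather than in an action (a sketch; the script name is illustrative):

jobs:
  - name: flaky_step
    command: "bash -c 'for attempt in 1 2 3; do python run_step.py && exit 0; sleep 30; done; exit 1'"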

Slurm Workflows

This document explains how Torc simplifies running workflows on Slurm-based HPC systems. The key insight is that you don’t need to understand Slurm schedulers or workflow actions to run workflows on HPC systems—Torc handles this automatically.

The Simple Approach

Running a workflow on Slurm requires just two things:

  1. Define your jobs with resource requirements
  2. Submit with submit-slurm

That’s it. Torc will analyze your workflow, generate appropriate Slurm configurations, and submit everything for execution.

⚠️ Important: The submit-slurm command uses heuristics to auto-generate Slurm schedulers and workflow actions. For complex workflows with unusual dependency patterns, the generated configuration may not be optimal and could result in suboptimal allocation timing. Always preview the configuration first using torc slurm generate (see Previewing Generated Configuration) before submitting production workflows.

Example Workflow

Here’s a complete workflow specification that runs on Slurm:

name: data_analysis_pipeline
description: Analyze experimental data with preprocessing, training, and evaluation

resource_requirements:
  - name: light
    num_cpus: 4
    memory: 8g
    runtime: PT30M

  - name: compute
    num_cpus: 32
    memory: 64g
    runtime: PT2H

  - name: gpu
    num_cpus: 16
    num_gpus: 2
    memory: 128g
    runtime: PT4H

jobs:
  - name: preprocess
    command: python preprocess.py --input data/ --output processed/
    resource_requirements: light

  - name: train_model
    command: python train.py --data processed/ --output model/
    resource_requirements: gpu
    depends_on: [preprocess]

  - name: evaluate
    command: python evaluate.py --model model/ --output results/
    resource_requirements: compute
    depends_on: [train_model]

  - name: generate_report
    command: python report.py --results results/
    resource_requirements: light
    depends_on: [evaluate]

Submitting the Workflow

torc submit-slurm --account myproject workflow.yaml

Torc will:

  1. Detect which HPC system you’re on (e.g., NREL Kestrel)
  2. Match each job’s requirements to appropriate partitions
  3. Generate Slurm scheduler configurations
  4. Create workflow actions that stage resource allocation based on dependencies
  5. Submit the workflow for execution

How It Works

When you use submit-slurm, Torc performs intelligent analysis of your workflow:

1. Per-Job Scheduler Generation

Each job gets its own Slurm scheduler configuration based on its resource requirements. This means:

  • Jobs are matched to the most appropriate partition
  • Memory, CPU, and GPU requirements are correctly specified
  • Walltime is set to the partition’s maximum (explained below)

2. Staged Resource Allocation

Torc analyzes job dependencies and creates staged workflow actions:

  • Jobs without dependencies trigger on_workflow_start — resources are allocated immediately
  • Jobs with dependencies trigger on_jobs_ready — resources are allocated only when the job becomes ready to run

This prevents wasting allocation time on resources that aren’t needed yet. For example, in the workflow above:

  • preprocess resources are allocated at workflow start
  • train_model resources are allocated when preprocess completes
  • evaluate resources are allocated when train_model completes
  • generate_report resources are allocated when evaluate completes

3. Conservative Walltime

Torc sets the walltime to the partition’s maximum rather than your job’s estimated runtime. This provides:

  • Headroom for jobs that run slightly longer than expected
  • No additional cost since Torc workers exit when work completes
  • Protection against job termination due to tight time limits

For example, if your job requests 3 hours and matches the “short” partition (4 hours max), the allocation will request 4 hours.
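
In spec terms, a requirement like the first block below could yield a generated scheduler like the second (a sketch; the scheduler name, account, and partition limit are illustrative):

resource_requirements:
  - name: medium
    num_cpus: 16
    memory: 32g
    runtime: PT3H            # the job's own estimate

slurm_schedulers:            # what torc slurm generate might emit
  - name: medium_scheduler
    account: myproject
    mem: 32g
    nodes: 1
    walltime: "04:00:00"     # the matched partition's maximum, not PT3H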

4. HPC Profile Knowledge

Torc includes built-in knowledge of HPC systems like NREL Kestrel, including:

  • Available partitions and their resource limits
  • GPU configurations
  • Memory and CPU specifications
  • Special requirements (e.g., minimum node counts for high-bandwidth partitions)

Using an unsupported HPC? Please request built-in support so everyone benefits. You can also create a custom profile for immediate use.

Resource Requirements Specification

Resource requirements are the key to the simplified workflow. Define them once and reference them from jobs:

resource_requirements:
  - name: small
    num_cpus: 4
    num_gpus: 0
    num_nodes: 1
    memory: 8g
    runtime: PT1H

  - name: gpu_training
    num_cpus: 32
    num_gpus: 4
    num_nodes: 1
    memory: 256g
    runtime: PT8H

Fields

Field     | Description               | Example
----------|---------------------------|------------------
name      | Reference name for jobs   | "compute"
num_cpus  | CPU cores required        | 32
num_gpus  | GPUs required (0 if none) | 2
num_nodes | Nodes required            | 1
memory    | Memory with unit suffix   | "64g", "512m"
runtime   | ISO8601 duration          | "PT2H", "PT30M"

Runtime Format

Use ISO8601 duration format:

  • PT30M — 30 minutes
  • PT2H — 2 hours
  • PT1H30M — 1 hour 30 minutes
  • P1D — 1 day
  • P2DT4H — 2 days 4 hours

Job Dependencies

Define dependencies explicitly or implicitly through file/data relationships:

Explicit Dependencies

jobs:
  - name: step1
    command: ./step1.sh
    resource_requirements: small

  - name: step2
    command: ./step2.sh
    resource_requirements: small
    depends_on: [step1]

  - name: step3
    command: ./step3.sh
    resource_requirements: small
    depends_on: [step1, step2]  # Waits for both

Implicit Dependencies (via Files)

files:
  - name: raw_data
    path: /data/raw.csv
  - name: processed_data
    path: /data/processed.csv

jobs:
  - name: process
    command: python process.py
    input_files: [raw_data]
    output_files: [processed_data]
    resource_requirements: compute

  - name: analyze
    command: python analyze.py
    input_files: [processed_data]  # Creates implicit dependency on 'process'
    resource_requirements: compute

Previewing Generated Configuration

Recommended Practice: Always preview the generated configuration before submitting to Slurm, especially for complex workflows. This allows you to verify that schedulers and actions are appropriate for your workflow structure.

Viewing the Execution Plan

Before generating schedulers, visualize how your workflow will execute in stages:

torc workflows execution-plan workflow.yaml

This shows the execution stages, which jobs run at each stage, and (if schedulers are defined) when Slurm allocations are requested. See Visualizing Workflow Structure for detailed examples.

Generating Slurm Configuration

Preview what Torc will generate:

torc slurm generate --account myproject --profile kestrel workflow.yaml

This outputs the complete workflow with generated schedulers and actions:

name: data_analysis_pipeline
# ... original content ...

jobs:
  - name: preprocess
    command: python preprocess.py --input data/ --output processed/
    resource_requirements: light
    scheduler: preprocess_scheduler

  # ... more jobs ...

slurm_schedulers:
  - name: preprocess_scheduler
    account: myproject
    mem: 8g
    nodes: 1
    walltime: "04:00:00"

  - name: train_model_scheduler
    account: myproject
    mem: 128g
    nodes: 1
    gres: "gpu:2"
    walltime: "04:00:00"

  # ... more schedulers ...

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: preprocess_scheduler
    scheduler_type: slurm
    num_allocations: 1

  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    jobs: [train_model]
    scheduler: train_model_scheduler
    scheduler_type: slurm
    num_allocations: 1

  # ... more actions ...

Save the output to inspect or modify before submission:

torc slurm generate --account myproject workflow.yaml -o workflow_with_schedulers.yaml

Torc Server Considerations

The Torc server must be accessible to compute nodes. Options include:

  1. Shared server (Recommended): A team member allocates a dedicated server in the HPC environment
  2. Login node: Suitable for small workflows with few, long-running jobs

For large workflows with many short jobs, a dedicated server prevents overloading login nodes.

Best Practices

1. Focus on Resource Requirements

Spend time accurately defining resource requirements. Torc handles the rest:

resource_requirements:
  # Be specific about what each job type needs
  - name: io_heavy
    num_cpus: 4
    memory: 32g      # High memory for data loading
    runtime: PT1H

  - name: compute_heavy
    num_cpus: 64
    memory: 16g      # Less memory, more CPU
    runtime: PT4H

2. Use Meaningful Names

Name resource requirements by their purpose, not by partition:

# Good - describes the workload
resource_requirements:
  - name: data_preprocessing
  - name: model_training
  - name: inference

# Avoid - ties you to specific infrastructure
resource_requirements:
  - name: short_partition
  - name: gpu_h100

3. Group Similar Jobs

Jobs with similar requirements can share resource requirement definitions:

resource_requirements:
  - name: quick_task
    num_cpus: 2
    memory: 4g
    runtime: PT15M

jobs:
  - name: validate_input
    command: ./validate.sh
    resource_requirements: quick_task

  - name: check_output
    command: ./check.sh
    resource_requirements: quick_task
    depends_on: [main_process]

4. Test Locally First

Validate your workflow logic locally before submitting to HPC:

# Run locally (without Slurm)
torc run workflow.yaml

# Then submit to HPC
torc submit-slurm --account myproject workflow.yaml

Limitations and Caveats

The auto-generation in submit-slurm uses heuristics that work well for common workflow patterns but may not be optimal for all cases:

When Auto-Generation Works Well

  • Linear pipelines: A → B → C → D
  • Fan-out patterns: One job unblocks many (e.g., preprocess → 100 work jobs)
  • Fan-in patterns: Many jobs unblock one (e.g., 100 work jobs → postprocess)
  • Simple DAGs: Clear dependency structures with distinct resource tiers

When to Use Manual Configuration

Consider using torc slurm generate to preview and manually adjust, or define schedulers manually, when:

  • Complex dependency graphs: Multiple interleaved dependency patterns
  • Shared schedulers: You want multiple jobs to share the same Slurm allocation
  • Custom timing: Specific requirements for when allocations should be requested
  • Resource optimization: Fine-tuning to minimize allocation waste
  • Multi-node jobs: Jobs requiring coordination across multiple nodes

What Could Go Wrong

Without previewing, auto-generation might:

  1. Request allocations too early: Wasting queue time waiting for dependencies
  2. Request allocations too late: Adding latency to job startup
  3. Create suboptimal scheduler groupings: Not sharing allocations when beneficial
  4. Miss optimization opportunities: Not recognizing patterns that could share resources

Best Practice: For production workflows, always run torc slurm generate first, review the output, and submit the reviewed configuration with torc submit.

Advanced: Manual Scheduler Configuration

For advanced users who need fine-grained control, you can define schedulers and actions manually. See Working with Slurm for details.

Common reasons for manual configuration:

  • Non-standard partition requirements
  • Custom Slurm directives (e.g., --constraint)
  • Multi-node jobs with specific topology requirements
  • Reusing allocations across multiple jobs for efficiency

Troubleshooting

“No partition found for job”

Your resource requirements exceed what’s available. Check:

  • Memory doesn’t exceed partition limits
  • Runtime doesn’t exceed partition walltime
  • GPU count is available on GPU partitions

Use torc hpc partitions <profile> to see available resources.
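
As an illustration (the numbers are hypothetical; what actually matches depends on the partitions your profile defines), the first requirement below might match no partition while the second does:

resource_requirements:
  - name: unmatchable      # hypothetical: more memory than any partition offers
    num_cpus: 32
    memory: 2048g
    runtime: PT2H

  - name: matchable        # fits a typical 240 GB-per-node partition
    num_cpus: 32
    memory: 200g
    runtime: PT2H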

Jobs Not Starting

Ensure the Torc server is accessible from compute nodes:

# From a compute node
curl $TORC_API_URL/health

Wrong Partition Selected

Use torc hpc match to see which partitions match your requirements:

torc hpc match kestrel --cpus 32 --memory 64g --walltime 2h --gpus 2

See Also

Design

This section covers Torc’s internal design and implementation details. These topics are intended for developers who want to understand how Torc works internally, contribute to the project, or debug complex issues.

Contents

  • Server API Handler - Multi-threaded async web service architecture and key endpoints
  • Central Database - SQLite schema, concurrency model, and coordination mechanisms
  • Web Dashboard - Browser-based UI gateway architecture and CLI integration

For user-facing concepts and guides, see the other Explanation chapters.

Server API Handler

The server is a Rust async web service built with Tokio and uses:

  • Multi-threaded Tokio runtime for concurrent request handling
  • Modular API structure with separate modules per resource type (workflows.rs, jobs.rs, files.rs, etc.)
  • OpenAPI-generated types for consistent API contracts
  • Database-level locking (BEGIN IMMEDIATE TRANSACTION) for critical sections

Key Endpoints

The server implements these key endpoints:

  • POST /workflows - Create new workflows
  • POST /workflows/{id}/initialize_jobs - Build dependency graph and mark jobs ready
  • POST /workflows/{id}/claim_next_jobs - Thread-safe job allocation to workers
  • POST /jobs/{id}/manage_status_change - Update job status with cascade effects
  • POST /workflows/{id}/process_changed_job_inputs - Detect changed inputs and reset jobs

Thread Safety

The claim_next_jobs endpoint uses database-level write locks to prevent multiple workers from double-allocating jobs to different clients. This is critical for maintaining consistency in distributed execution.

API Organization

Each resource type (workflows, jobs, files, events, etc.) has its own module in server/src/bin/server/api/, keeping the codebase organized and maintainable. The main routing logic delegates to these specialized modules.

Central Database

The SQLite database is the heart of Torc’s coordination model. All workflow state lives in the database, enabling:

  • Stateless clients and workers - All state persists in the database
  • Multiple concurrent workers - Workers coordinate through database locks
  • Fault tolerance - Workers can crash and restart; state is preserved
  • Workflow resumption - Workflows can be stopped and restarted without losing progress

Core Database Tables

  • workflow - Top-level workflow records with name, user, description
  • workflow_status - Workflow execution state (run_id, status)
  • job - Individual computational tasks with commands and status
  • job_internal - Internal job data (input hashes for change detection)
  • job_depends_on - Explicit and implicit job dependencies
  • file - File artifacts with paths and modification times
  • user_data - JSON data artifacts for passing information between jobs
  • job_input_file, job_output_file - Job-file relationships
  • job_input_user_data, job_output_user_data - Job-user_data relationships
  • resource_requirements - CPU, memory, GPU, runtime specifications
  • compute_node - Available compute resources
  • scheduled_compute_node - Compute nodes allocated to workflows
  • local_scheduler, slurm_scheduler - Execution environment configurations
  • result - Job execution results (exit code, stdout, stderr)
  • event - Audit log of workflow events

Foreign Key Cascades

The schema uses foreign key constraints with cascading deletes. Deleting a workflow automatically removes all associated jobs, files, events, and other related records, ensuring referential integrity.

Concurrency Model

SQLite uses database-level locking with BEGIN IMMEDIATE TRANSACTION to prevent race conditions in critical sections, particularly during job allocation when multiple workers request jobs simultaneously.

Web Dashboard (torc-dash)

The torc-dash application is a web gateway that provides a browser-based UI for managing Torc workflows. It bridges a web frontend with the torc ecosystem by proxying API requests and executing CLI commands.

Architecture

flowchart LR
    Browser["Browser<br/>(Web UI)"] <--> Dashboard["torc-dash<br/>(Gateway)"]
    Dashboard <--> Server["torc-server<br/>(API)"]
    Dashboard --> CLI["torc CLI<br/>(subprocess)"]

The dashboard acts as a gateway layer that:

  1. Serves embedded static assets - HTML, CSS, and JavaScript bundled into the binary
  2. Proxies API requests - Forwards /torc-service/* requests to a remote torc-server
  3. Executes CLI commands - Runs torc CLI as subprocesses for complex operations
  4. Manages server lifecycle - Optionally spawns and manages a torc-server instance

Core Components

Embedded Static Assets

Uses the rust_embed crate to bundle all files from the static/ directory directly into the binary at compile time:

#[derive(Embed)]
#[folder = "static/"]
struct Assets;

This enables single-binary deployment with no external file dependencies.

Application State

Shared state across all request handlers:

struct AppState {
    api_url: String,           // Remote torc-server URL
    client: reqwest::Client,   // HTTP client for proxying
    torc_bin: String,          // Path to torc CLI binary
    torc_server_bin: String,   // Path to torc-server binary
    managed_server: Mutex<ManagedServer>,  // Optional embedded server state
}

Standalone Mode

When launched with --standalone, torc-dash automatically spawns a torc-server subprocess:

  1. Starts torc-server with configurable port (0 for auto-detection)
  2. Reads TORC_SERVER_PORT=<port> from stdout to discover actual port
  3. Configures API URL to point to the managed server
  4. Tracks process ID for lifecycle management

This enables single-command deployment for local development or simple production setups.

Request Routing

Static File Routes

Route     | Handler        | Purpose
----------|----------------|----------------------------------------
/         | index_handler  | Serves index.html
/static/* | static_handler | Serves embedded assets with MIME types

API Proxy

All /torc-service/* requests are transparently proxied to the remote torc-server:

Browser: GET /torc-service/v1/workflows
    ↓
torc-dash: forwards to http://localhost:8080/torc-service/v1/workflows
    ↓
torc-server: responds with workflow list
    ↓
torc-dash: returns response to browser

The proxy preserves HTTP methods (GET, POST, PUT, PATCH, DELETE), headers, and request bodies.

CLI Command Endpoints

These endpoints execute the torc CLI as subprocesses, enabling operations that require local file access or complex orchestration:

Endpoint                   | CLI Command                 | Purpose
---------------------------|-----------------------------|-------------------------------
POST /api/cli/create       | torc workflows create       | Create workflow from spec file
POST /api/cli/run          | torc workflows run          | Run workflow locally
POST /api/cli/submit       | torc workflows submit       | Submit to scheduler
POST /api/cli/initialize   | torc workflows initialize   | Initialize job dependencies
POST /api/cli/delete       | torc workflows delete       | Delete workflow
POST /api/cli/reinitialize | torc workflows reinitialize | Reinitialize workflow
POST /api/cli/reset-status | torc workflows reset-status | Reset job statuses
GET /api/cli/run-stream    | torc workflows run          | SSE streaming execution

Server Management Endpoints

Endpoint                | Purpose
------------------------|------------------------------
POST /api/server/start  | Start a managed torc-server
POST /api/server/stop   | Stop the managed server
GET /api/server/status  | Check server running status

Utility Endpoints

Endpoint                        | Purpose
--------------------------------|---------------------------------
POST /api/cli/read-file         | Read local file contents
POST /api/cli/plot-resources    | Generate resource plots from DB
POST /api/cli/list-resource-dbs | Find resource database files

Key Features

Streaming Workflow Execution

The /api/cli/run-stream endpoint uses Server-Sent Events (SSE) to provide real-time feedback:

Event: start
Data: Running workflow abc123

Event: stdout
Data: Job job_1 started

Event: status
Data: Jobs: 3 running, 7 completed (total: 10)

Event: stdout
Data: Job job_1 completed

Event: end
Data: success

Event: exit_code
Data: 0

The stream includes:

  • stdout/stderr from the torc CLI process
  • Periodic status updates fetched from the API every 3 seconds
  • Exit code when the process completes

CLI Execution Pattern

All CLI commands follow a consistent execution pattern:

async fn run_torc_command(torc_bin: &str, args: &[&str], api_url: &str) -> CliResponse {
    Command::new(torc_bin)
        .args(args)
        .env("TORC_API_URL", api_url)  // Pass server URL to CLI
        .output()
        .await
}

Returns structured JSON:

{
  "success": true,
  "stdout": "Workflow created: abc123",
  "stderr": "",
  "exit_code": 0
}

Configuration Merging

Configuration is merged from multiple sources (highest to lowest priority):

  1. CLI arguments - Command-line flags
  2. Environment variables - TORC_API_URL, TORC_BIN, etc.
  3. Configuration file - TorcConfig from ~/.torc.toml or similar

Design Rationale

Why Proxy Instead of Direct API Access?

  1. CORS avoidance - Browser same-origin policy doesn’t apply to server-side requests
  2. Authentication layer - Can add authentication/authorization without modifying torc-server
  3. Request transformation - Can modify requests/responses as needed
  4. Logging and monitoring - Centralized request logging

Why CLI Delegation?

Complex operations like workflow creation are delegated to the existing torc CLI rather than reimplementing:

  1. Code reuse - Leverages tested CLI implementation
  2. Local file access - CLI can read workflow specs from the filesystem
  3. Consistent behavior - Same behavior as command-line usage
  4. Maintenance - Single implementation to maintain

Why Standalone Mode?

  1. Single-binary deployment - One command starts everything needed
  2. Development convenience - Quick local testing without separate server
  3. Port auto-detection - Avoids port conflicts with port 0 support

How-To Guides

This section provides task-oriented guides for accomplishing specific goals with Torc. Each guide shows you how to complete a concrete task.

Topics covered:

Using Configuration Files

This guide shows how to set up and use configuration files for Torc components.

Quick Start

Create a user configuration file:

torc config init --user

Edit the file at ~/.config/torc/config.toml to set your defaults.

Configuration File Locations

Location                   | Purpose              | Priority
---------------------------|----------------------|------------
/etc/torc/config.toml      | System-wide defaults | 1 (lowest)
~/.config/torc/config.toml | User preferences     | 2
./torc.toml                | Project-specific     | 3
Environment variables      | Runtime overrides    | 4
CLI arguments              | Explicit overrides   | 5 (highest)

Available Commands

# Show configuration file locations
torc config paths

# Show effective (merged) configuration
torc config show

# Show as JSON
torc config show --format json

# Create configuration file
torc config init --user      # User config
torc config init --local     # Project config
torc config init --system    # System config (requires root)

# Validate configuration
torc config validate

Client Configuration

Common client settings:

[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"  # or "json"
log_level = "info"
username = "myuser"

[client.run]
poll_interval = 5.0
output_dir = "output"
max_parallel_jobs = 4
num_cpus = 8
memory_gb = 32.0
num_gpus = 1

Server Configuration

For torc-server:

[server]
url = "0.0.0.0"
port = 8080
threads = 4
database = "/path/to/torc.db"
auth_file = "/path/to/htpasswd"
require_auth = false
completion_check_interval_secs = 60.0
log_level = "info"
https = false

[server.logging]
log_dir = "/var/log/torc"
json_logs = false

Dashboard Configuration

For torc-dash:

[dash]
host = "127.0.0.1"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "torc"
torc_server_bin = "torc-server"
standalone = false
server_port = 0
completion_check_interval_secs = 5

Environment Variables

Use environment variables for runtime configuration. Use double underscore (__) to separate nested keys:

# Client settings
export TORC_CLIENT__API_URL="http://server:8080/torc-service/v1"
export TORC_CLIENT__FORMAT="json"

# Server settings
export TORC_SERVER__PORT="9999"
export TORC_SERVER__THREADS="8"

# Dashboard settings
export TORC_DASH__PORT="8090"

Overriding with CLI Arguments

CLI arguments always take precedence:

# Uses config file for api_url, but CLI for format
torc --format json workflows list

# CLI url overrides config file
torc --url http://other:8080/torc-service/v1 workflows list

Common Patterns

Development Environment

# ~/.config/torc/config.toml
[client]
api_url = "http://localhost:8080/torc-service/v1"
log_level = "debug"

[client.run]
poll_interval = 2.0

Team Shared Server

# ~/.config/torc/config.toml
[client]
api_url = "http://torc.internal.company.com:8080/torc-service/v1"
username = "developer"

CI/CD Pipeline

#!/bin/bash
export TORC_CLIENT__API_URL="${CI_TORC_SERVER}"
export TORC_CLIENT__FORMAT="json"

torc run workflow.yaml
result=$(torc workflows status $WORKFLOW_ID | jq -r '.status')

HPC Cluster

# Project-local torc.toml
[client]
api_url = "http://login-node:8080/torc-service/v1"

[client.run]
num_cpus = 64
memory_gb = 256.0
num_gpus = 8
output_dir = "/scratch/user/workflow_output"

Troubleshooting

Configuration not applied?

  1. Check which files are loaded: torc config validate
  2. View effective config: torc config show
  3. Verify file permissions and syntax

Environment variable not working?

Use double underscore for nesting: TORC_CLIENT__API_URL (not TORC_CLIENT_API_URL)

Invalid configuration?

Run validation: torc config validate

How to Create Workflows

This guide shows different methods for creating Torc workflows, from the most common (specification files) to more advanced approaches (CLI, API).

The easiest way to create workflows is with specification files. Torc supports YAML, JSON5, and KDL formats.

Create from a YAML File

torc workflows create workflow.yaml

Create from JSON5 or KDL

torc workflows create workflow.json5
torc workflows create workflow.kdl

Torc detects the format from the file extension.

Create and Run in One Step

For quick iteration, combine creation and execution:

# Create and run locally
torc run workflow.yaml

# Create and submit to Slurm
torc submit workflow.yaml

For format syntax and examples, see the Workflow Specification Formats reference.

Using the CLI (Step by Step)

For programmatic workflow construction or when you need fine-grained control, create workflows piece by piece using the CLI.

Step 1: Create an Empty Workflow

torc workflows new \
  --name "my_workflow" \
  --description "My test workflow"

Output:

Successfully created workflow:
  ID: 1
  Name: my_workflow
  User: dthom
  Description: My test workflow

Note the workflow ID (1) for subsequent commands.

Step 2: Add Resource Requirements

torc resource-requirements create \
  --name "small" \
  --num-cpus 1 \
  --memory "1g" \
  --runtime "PT10M" \
  1  # workflow ID

Output:

Successfully created resource requirements:
  ID: 2
  Workflow ID: 1
  Name: small

Step 3: Add Files (Optional)

torc files create \
  --name "input_file" \
  --path "/data/input.txt" \
  1  # workflow ID

Step 4: Add Jobs

torc jobs create \
  --name "process_data" \
  --command "python process.py" \
  --resource-requirements-id 2 \
  --input-file-ids 1 \
  1  # workflow ID

Step 5: Initialize and Run

# Initialize the workflow (resolves dependencies)
torc workflows initialize-jobs 1

# Run the workflow
torc run 1

Using the Python API

For complex programmatic workflow construction, use the Python client:

from torc import make_api
from torc.openapi_client import (
    WorkflowModel,
    JobModel,
    ResourceRequirementsModel,
)

# Connect to the server
api = make_api("http://localhost:8080/torc-service/v1")

# Create workflow
workflow = api.create_workflow(WorkflowModel(
    name="my_workflow",
    user="myuser",
    description="Programmatically created workflow",
))

# Add resource requirements
rr = api.create_resource_requirements(ResourceRequirementsModel(
    workflow_id=workflow.id,
    name="small",
    num_cpus=1,
    memory="1g",
    runtime="PT10M",
))

# Add jobs
api.create_job(JobModel(
    workflow_id=workflow.id,
    name="job1",
    command="echo 'Hello World'",
    resource_requirements_id=rr.id,
))

print(f"Created workflow {workflow.id}")

For more details, see the Map Python Functions tutorial.

Using the Julia API

The Julia client provides similar functionality for programmatic workflow construction:

using Torc
import Torc: APIClient

# Connect to the server
api = make_api("http://localhost:8080/torc-service/v1")

# Create workflow
workflow = send_api_command(
    api,
    APIClient.create_workflow,
    APIClient.WorkflowModel(;
        name = "my_workflow",
        user = get_user(),
        description = "Programmatically created workflow",
    ),
)

# Add resource requirements
rr = send_api_command(
    api,
    APIClient.create_resource_requirements,
    APIClient.ResourceRequirementsModel(;
        workflow_id = workflow.id,
        name = "small",
        num_cpus = 1,
        memory = "1g",
        runtime = "PT10M",
    ),
)

# Add jobs
send_api_command(
    api,
    APIClient.create_job,
    APIClient.JobModel(;
        workflow_id = workflow.id,
        name = "job1",
        command = "echo 'Hello World'",
        resource_requirements_id = rr.id,
    ),
)

println("Created workflow $(workflow.id)")

The Julia client also supports map_function_to_jobs for mapping a function across parameters, similar to the Python client.

Choosing a Method

Method              | Best For
--------------------|--------------------------------------------------------------
Specification files | Most workflows; declarative, version-controllable
CLI step-by-step    | Scripted workflows, testing individual components
Python API          | Complex dynamic workflows, integration with Python pipelines
Julia API           | Complex dynamic workflows, integration with Julia pipelines

Common Tasks

Validate a Workflow File Without Creating

Use --dry-run to validate a workflow specification without creating it on the server:

torc workflows create --dry-run workflow.yaml

Example output:

Workflow Validation Results
===========================

Workflow: my_workflow
Description: A sample workflow

Components to be created:
  Jobs: 100 (expanded from 1 parameterized job specs)
  Files: 5
  User data records: 2
  Resource requirements: 2
  Slurm schedulers: 2
  Workflow actions: 3

Submission: Ready for scheduler submission (has on_workflow_start schedule_nodes action)

Validation: PASSED

For programmatic use (e.g., in scripts or the dashboard), get JSON output:

torc -f json workflows create --dry-run workflow.yaml

What Validation Checks

The dry-run performs comprehensive validation:

Structural Checks:

  • Valid file format (YAML, JSON5, KDL, or JSON)
  • Required fields present
  • Parameter expansion (shows expanded job count vs. original spec count)

Reference Validation:

  • depends_on references existing jobs
  • depends_on_regexes patterns are valid and match at least one job
  • resource_requirements references exist
  • scheduler references exist
  • input_files and output_files reference defined files
  • input_user_data and output_user_data reference defined user data
  • All regex patterns (*_regexes fields) are valid

Duplicate Detection:

  • Duplicate job names
  • Duplicate file names
  • Duplicate user data names
  • Duplicate resource requirement names
  • Duplicate scheduler names

Dependency Analysis:

  • Circular dependency detection (reports all jobs in the cycle)

Action Validation:

  • Actions reference existing jobs and schedulers
  • schedule_nodes actions have required scheduler and scheduler_type

Scheduler Configuration:

  • Slurm scheduler node requirements are valid
  • Warns about heterogeneous schedulers without jobs_sort_method (see below)
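
For example, a spec like the following (illustrative) would fail the reference checks above, because cleanup depends on a job name that is never defined:

jobs:
  - name: process
    command: python process.py

  - name: cleanup
    command: bash cleanup.sh
    depends_on: [proces]     # typo: no job named "proces" exists

Running torc workflows create --dry-run on such a spec reports the broken reference without creating anything on the server.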

Heterogeneous Scheduler Warning

When you have multiple Slurm schedulers with different resource profiles (memory, GPUs, walltime, partition) and jobs without explicit scheduler assignments, the validation warns about potential suboptimal job-to-node matching:

Warnings (1):
  - Workflow has 3 schedulers with different memory (mem), walltime but 10 job(s)
    have no explicit scheduler assignment and jobs_sort_method is not set. The
    default sort method 'gpus_runtime_memory' will be used (jobs sorted by GPUs,
    then runtime, then memory). If this doesn't match your workload, consider
    setting jobs_sort_method explicitly to 'gpus_memory_runtime' (prioritize
    memory over runtime) or 'none' (no sorting).

This warning helps you avoid situations where:

  • Long-walltime nodes pull short-runtime jobs
  • High-memory nodes pull low-memory jobs
  • GPU nodes pull non-GPU jobs

Solutions:

  1. Set jobs_sort_method explicitly in your workflow spec
  2. Assign jobs to specific schedulers using the scheduler field on each job
  3. Accept the default gpus_runtime_memory sorting if it matches your workload
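
For example, option 2 can be expressed directly in the spec by pinning each job to a scheduler (a sketch; the names are illustrative):

jobs:
  - name: train_model
    command: python train.py
    scheduler: gpu_nodes       # explicit assignment; no sorting needed

  - name: summarize
    command: python summarize.py
    scheduler: standard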

Bypassing Validation

To create a workflow despite validation warnings:

torc workflows create --skip-checks workflow.yaml

Note: This bypasses scheduler node validation checks (which are treated as errors), but does not bypass all errors. Errors such as missing references or circular dependencies will always prevent creation.

List Available Workflows

torc workflows list

Delete a Workflow

torc workflows delete <workflow_id>

View Workflow Details

torc workflows get <workflow_id>

Defining File Dependencies

Jobs often need to read input files and produce output files. Torc can automatically infer job dependencies from these file relationships using variable substitution:

files:
  - name: raw_data
    path: /data/raw.csv
  - name: processed_data
    path: /data/processed.csv

jobs:
  - name: preprocess
    command: "python preprocess.py -o ${files.output.raw_data}"

  - name: analyze
    command: "python analyze.py -i ${files.input.raw_data} -o ${files.output.processed_data}"

Key concepts:

  • ${files.input.NAME} - References a file this job reads (creates a dependency on the job that outputs it)
  • ${files.output.NAME} - References a file this job writes (satisfies dependencies for downstream jobs)

In the example above, analyze automatically depends on preprocess because it needs raw_data as input, which preprocess produces as output.

For a complete walkthrough, see Tutorial: Diamond Workflow.

Next Steps

Working with Slurm

This guide covers advanced Slurm configuration for users who need fine-grained control over their HPC workflows.

For most users: See Slurm Workflows for the recommended approach using torc submit-slurm. You don’t need to manually configure schedulers or actions—Torc handles this automatically.

When to Use Manual Configuration

Manual Slurm configuration is useful when you need:

  • Custom Slurm directives (e.g., --constraint, --exclusive)
  • Multi-node jobs with specific topology requirements
  • Shared allocations across multiple jobs for efficiency
  • Non-standard partition configurations
  • Fine-tuned control over allocation timing

Torc Server Requirements

The Torc server must be accessible from compute nodes:

  • External server (Recommended): A team member allocates a shared server in the HPC environment. This is recommended if your operations team provides this capability.
  • Login node: Suitable for small workflows. The server runs single-threaded by default. If you have many thousands of short jobs, check with your operations team about resource limits.

Manual Scheduler Configuration

Defining Slurm Schedulers

Define schedulers in your workflow specification:

slurm_schedulers:
  - name: standard
    account: my_project
    nodes: 1
    walltime: "12:00:00"
    partition: compute
    mem: 64G

  - name: gpu_nodes
    account: my_project
    nodes: 1
    walltime: "08:00:00"
    partition: gpu
    gres: "gpu:4"
    mem: 256G

Scheduler Fields

Field           | Description                         | Required
----------------|-------------------------------------|---------
name            | Scheduler identifier                | Yes
account         | Slurm account/allocation            | Yes
nodes           | Number of nodes                     | Yes
walltime        | Time limit (HH:MM:SS or D-HH:MM:SS) | Yes
partition       | Slurm partition                     | No
mem             | Memory per node                     | No
gres            | Generic resources (e.g., GPUs)      | No
qos             | Quality of Service                  | No
ntasks_per_node | Tasks per node                      | No
tmp             | Temporary disk space                | No
extra           | Additional sbatch arguments         | No

Defining Workflow Actions

Actions trigger scheduler allocations:

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: standard
    scheduler_type: slurm
    num_allocations: 1

  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    jobs: [train_model]
    scheduler: gpu_nodes
    scheduler_type: slurm
    num_allocations: 2

Action Trigger Types

Trigger              | Description
---------------------|----------------------------------------
on_workflow_start    | Fires when workflow is submitted
on_jobs_ready        | Fires when specified jobs become ready
on_jobs_complete     | Fires when specified jobs complete
on_workflow_complete | Fires when all jobs complete

Assigning Jobs to Schedulers

Reference schedulers in job definitions:

jobs:
  - name: preprocess
    command: ./preprocess.sh
    scheduler: standard

  - name: train
    command: python train.py
    scheduler: gpu_nodes
    depends_on: [preprocess]

Scheduling Strategies

Strategy 1: Many Single-Node Allocations

Submit multiple Slurm jobs, each with its own Torc worker:

slurm_schedulers:
  - name: work_scheduler
    account: my_account
    nodes: 1
    walltime: "04:00:00"

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: work_scheduler
    scheduler_type: slurm
    num_allocations: 10

When to use:

  • Jobs have diverse resource requirements
  • Want independent time limits per job
  • Cluster has low queue wait times

Benefits:

  • Maximum scheduling flexibility
  • Independent time limits per allocation
  • Fault isolation

Drawbacks:

  • More Slurm queue overhead
  • Multiple jobs to schedule

Strategy 2: Multi-Node Allocation, One Worker Per Node

Launch multiple workers within a single allocation:

slurm_schedulers:
  - name: work_scheduler
    account: my_account
    nodes: 10
    walltime: "04:00:00"

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: work_scheduler
    scheduler_type: slurm
    num_allocations: 1
    start_one_worker_per_node: true

When to use:

  • Many jobs with similar requirements
  • Want faster queue scheduling (larger jobs often prioritized)

Benefits:

  • Single queue wait
  • Often prioritized by Slurm scheduler

Drawbacks:

  • Shared time limit for all workers
  • Less flexibility

Strategy 3: Single Worker Per Allocation

One Torc worker handles all nodes:

slurm_schedulers:
  - name: work_scheduler
    account: my_account
    nodes: 10
    walltime: "04:00:00"

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: work_scheduler
    scheduler_type: slurm
    num_allocations: 1

When to use:

  • Your application manages node coordination
  • Need full control over compute resources

Staged Allocations

For pipelines with distinct phases, stage allocations to avoid wasted resources:

slurm_schedulers:
  - name: preprocess_sched
    account: my_project
    nodes: 2
    walltime: "01:00:00"

  - name: compute_sched
    account: my_project
    nodes: 20
    walltime: "08:00:00"

  - name: postprocess_sched
    account: my_project
    nodes: 1
    walltime: "00:30:00"

actions:
  # Preprocessing starts immediately
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: preprocess_sched
    scheduler_type: slurm
    num_allocations: 1

  # Compute nodes allocated when compute jobs are ready
  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    jobs: [compute_step]
    scheduler: compute_sched
    scheduler_type: slurm
    num_allocations: 1
    start_one_worker_per_node: true

  # Postprocessing allocated when those jobs are ready
  - trigger_type: on_jobs_ready
    action_type: schedule_nodes
    jobs: [postprocess]
    scheduler: postprocess_sched
    scheduler_type: slurm
    num_allocations: 1

Note: The torc submit-slurm command handles this automatically by analyzing job dependencies.

Custom Slurm Directives

Use the extra field for additional sbatch arguments:

slurm_schedulers:
  - name: exclusive_nodes
    account: my_project
    nodes: 4
    walltime: "04:00:00"
    extra: "--exclusive --constraint=skylake"

Submitting Workflows

With Manual Configuration

# Submit workflow with pre-defined schedulers and actions
torc submit workflow.yaml

Scheduling Additional Nodes

Add more allocations to a running workflow:

torc slurm schedule-nodes -n 5 $WORKFLOW_ID

Debugging

Check Slurm Job Status

squeue -u $USER

View Torc Worker Logs

Workers log to the Slurm output file. Check:

cat slurm-<jobid>.out

Verify Server Connectivity

From a compute node:

curl $TORC_API_URL/health

See Also

Working with HPC Profiles

HPC (High-Performance Computing) profiles provide pre-configured knowledge about specific HPC systems, including their partitions, resource limits, and optimal settings. Torc uses this information to automatically match job requirements to appropriate partitions.

Overview

HPC profiles contain:

  • Partition definitions: Available queues with their resource limits (CPUs, memory, walltime, GPUs)
  • Detection rules: How to identify when you’re on a specific HPC system
  • Default settings: Account names and other system-specific defaults

Built-in profiles are available for systems like NREL’s Kestrel. You can also define custom profiles for private clusters.

Listing Available Profiles

View all known HPC profiles:

torc hpc list

Example output:

Known HPC profiles:

╭─────────┬──────────────┬────────────┬──────────╮
│ Name    │ Display Name │ Partitions │ Detected │
├─────────┼──────────────┼────────────┼──────────┤
│ kestrel │ NREL Kestrel │ 15         │ ✓        │
╰─────────┴──────────────┴────────────┴──────────╯

The “Detected” column shows if Torc recognizes you’re currently on that system.

Detecting the Current System

Torc can automatically detect which HPC system you’re on:

torc hpc detect

Detection works through environment variables. For example, NREL Kestrel is detected when NREL_CLUSTER=kestrel is set.

Viewing Profile Details

See detailed information about a specific profile:

torc hpc show kestrel

This displays:

  • Profile name and description
  • Detection method
  • Default account (if configured)
  • Number of partitions

Viewing Available Partitions

List all partitions for a profile:

torc hpc partitions kestrel

Example output:

Partitions for kestrel:

╭──────────┬─────────────┬───────────┬─────────────────┬─────────────────╮
│ Name     │ CPUs/Node   │ Mem/Node  │ Max Walltime    │ GPUs            │
├──────────┼─────────────┼───────────┼─────────────────┼─────────────────┤
│ debug    │ 104         │ 240 GB    │ 1h              │ -               │
│ short    │ 104         │ 240 GB    │ 4h              │ -               │
│ standard │ 104         │ 240 GB    │ 48h             │ -               │
│ gpu-h100 │ 2           │ 240 GB    │ 48h             │ 4 (H100)        │
│ ...      │ ...         │ ...       │ ...             │ ...             │
╰──────────┴─────────────┴───────────┴─────────────────┴─────────────────╯

Finding Matching Partitions

Find partitions that can satisfy specific resource requirements:

torc hpc match kestrel --cpus 32 --memory 64g --walltime 2h

Options:

  • --cpus <N>: Required CPU cores
  • --memory <SIZE>: Required memory (e.g., 64g, 512m)
  • --walltime <DURATION>: Required walltime (e.g., 2h, 4:00:00)
  • --gpus <N>: Required GPUs (optional)

This is useful for understanding which partitions your jobs will be assigned to.

Custom HPC Profiles

If your HPC system doesn’t have a built-in profile, you have two options:

Request Built-in Support (Recommended)

If your HPC is widely used, please open an issue requesting built-in support. Include:

  • Your HPC system name and organization
  • Partition names with resource limits (CPUs, memory, walltime, GPUs)
  • Detection method (environment variable or hostname pattern)

Built-in profiles benefit everyone using that system and are maintained by the Torc team.

Define a Custom Profile

If you need to use your HPC immediately or have a private cluster, you can define a custom profile in your configuration file. See the Custom HPC Profile Tutorial for a complete walkthrough.

Quick Example

Define custom profiles in your configuration file:

# ~/.config/torc/config.toml

[client.hpc.custom_profiles.mycluster]
display_name = "My Research Cluster"
description = "Internal research HPC system"
detect_env_var = "MY_CLUSTER=research"
default_account = "default_project"

[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false

[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false

See Configuration Reference for full configuration options.

Using Profiles with Slurm Workflows

HPC profiles are used by Slurm-related commands to automatically generate scheduler configurations. See Working with Slurm for details on:

  • torc submit-slurm - Submit workflows with auto-generated schedulers
  • torc workflows create-slurm - Create workflows with auto-generated schedulers

See Also

How to Checkpoint a Job During Wall-Time Timeout

When running jobs on HPC systems like Slurm, your job may be terminated when the allocated wall-time expires. Torc supports graceful termination, allowing jobs to save checkpoints before exiting. This guide explains how to configure Slurm and your jobs to handle wall-time timeouts gracefully.

Overview

When Slurm is about to reach wall-time, it can be configured to send a SIGTERM signal to the Torc worker process. Torc then:

  1. Sends SIGTERM to jobs with supports_termination: true
  2. Sends SIGKILL to jobs with supports_termination: false (or unset)
  3. Waits for all processes to exit
  4. Reports job status as terminated to the server

Jobs that support termination can catch SIGTERM and perform cleanup operations like saving checkpoints, flushing buffers, or releasing resources.

Enabling Graceful Termination

Configuring Slurm to Send a Signal Before Timeout

By default, Slurm does not send any signal before the job’s end time. When the wall-time limit is reached, Slurm immediately terminates all processes. To receive a warning signal before timeout, you must explicitly configure it using the --signal option in the extra field of your Slurm scheduler specification:

slurm_schedulers:
  - name: gpu_scheduler
    account: my_project
    partition: gpu
    nodes: 1
    walltime: "04:00:00"
    extra: "--signal=B:TERM@300"  # Send SIGTERM to batch script 300 seconds before timeout

The --signal option format is [B:]<sig_num>[@sig_time]:

  • B: prefix sends the signal only to the batch shell (by default, all job steps are signaled but not the batch shell itself)
  • sig_num is the signal name or number (e.g., TERM, USR1, 10)
  • sig_time is seconds before the time limit to send the signal (default: 60 if not specified)

Note: Due to Slurm’s event handling resolution, the signal may be sent up to 60 seconds earlier than specified.

To enable graceful termination for a job, set supports_termination: true in your job specification:

Configuring a Torc job to be terminated gracefully

jobs:
  - name: training_job
    command: python train.py --checkpoint-dir /scratch/checkpoints
    supports_termination: true
    resource_requirements:
      num_cpus: 4
      memory: 16g
      runtime: PT2H

Writing a Job That Handles SIGTERM

Your job script must catch SIGTERM and save its state. Here’s a Python example:

import signal
import sys
import pickle

# Global state
checkpoint_path = "/scratch/checkpoints/model.pkl"
model_state = None

def save_checkpoint():
    """Save current model state to disk."""
    print("Saving checkpoint...")
    with open(checkpoint_path, 'wb') as f:
        pickle.dump(model_state, f)
    print(f"Checkpoint saved to {checkpoint_path}")

def handle_sigterm(signum, frame):
    """Handle SIGTERM by saving checkpoint and exiting."""
    print("Received SIGTERM - saving checkpoint before exit")
    save_checkpoint()
    sys.exit(0)  # Exit cleanly after saving

# Register the signal handler
signal.signal(signal.SIGTERM, handle_sigterm)

# Main training loop
def train():
    global model_state
    for epoch in range(1000):
        # Training logic here...
        model_state = {"epoch": epoch, "weights": [...]}

        # Optionally save periodic checkpoints
        if epoch % 100 == 0:
            save_checkpoint()

if __name__ == "__main__":
    train()

Bash Script Example

For shell scripts, use trap to catch SIGTERM:

#!/bin/bash

CHECKPOINT_FILE="/scratch/checkpoints/progress.txt"

# Function to save checkpoint
save_checkpoint() {
    echo "Saving checkpoint at iteration $ITERATION"
    echo "$ITERATION" > "$CHECKPOINT_FILE"
}

# Trap SIGTERM and save checkpoint
trap 'save_checkpoint; exit 0' SIGTERM

# Load checkpoint if exists
if [ -f "$CHECKPOINT_FILE" ]; then
    ITERATION=$(cat "$CHECKPOINT_FILE")
    echo "Resuming from iteration $ITERATION"
else
    ITERATION=0
fi

# Main loop
while [ $ITERATION -lt 1000 ]; do
    # Do work...
    ITERATION=$((ITERATION + 1))
    sleep 1
done

Complete Workflow Example

name: ml_training_workflow
user: researcher

jobs:
  - name: preprocess
    command: python preprocess.py
    supports_termination: false  # Quick job, no checkpointing needed

  - name: train_model
    command: python train.py --checkpoint-dir /scratch/checkpoints
    supports_termination: true   # Long job, needs checkpointing
    depends_on:
      - preprocess
    resource_requirements:
      num_cpus: 8
      memory: 32g
      num_gpus: 1
      runtime: PT4H

  - name: evaluate
    command: python evaluate.py
    supports_termination: true
    depends_on:
      - train_model

slurm_schedulers:
  - name: gpu_scheduler
    account: my_project
    partition: gpu
    nodes: 1
    walltime: "04:00:00"
    extra: "--signal=B:TERM@300"  # Send SIGTERM to batch script 300 seconds before timeout

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: gpu_scheduler
    scheduler_type: slurm
    num_allocations: 1

Restarting After Termination

When a job is terminated due to wall-time, it will have status terminated. To continue the workflow:

  1. Re-submit the workflow to allocate new compute time:

    torc workflows submit $WORKFLOW_ID
    
  2. Reinitialize terminated jobs to make them ready again:

    torc workflows reinitialize $WORKFLOW_ID
    

Your job script should detect existing checkpoints and resume from where it left off.

Best Practices

1. Verify Checkpoint Integrity

Add validation to ensure checkpoints are complete:

def save_checkpoint():
    temp_path = checkpoint_path + ".tmp"
    with open(temp_path, 'wb') as f:
        pickle.dump(model_state, f)
    # Atomic rename ensures complete checkpoint
    os.rename(temp_path, checkpoint_path)

2. Handle Multiple Termination Signals

Some systems send multiple signals. Ensure your handler is idempotent:

checkpoint_saved = False

def handle_sigterm(signum, frame):
    global checkpoint_saved
    if not checkpoint_saved:
        save_checkpoint()
        checkpoint_saved = True
    sys.exit(0)

3. Test Locally

Test your SIGTERM handling locally before running on the cluster:

# Start your job
python train.py &
PID=$!

# Wait a bit, then send SIGTERM
sleep 10
kill -TERM $PID

# Verify checkpoint was saved
ls -la /scratch/checkpoints/

Troubleshooting

Job Killed Without Checkpointing

Symptoms: Job status is terminated but no checkpoint was saved.

Causes:

  • supports_termination not set to true
  • Signal handler not registered before training started
  • Checkpoint save took longer than the buffer time

Solutions:

  • Verify supports_termination: true in job spec
  • Register signal handlers early in your script

Checkpoint File Corrupted

Symptoms: Job fails to load checkpoint on restart.

Causes:

  • Job was killed during checkpoint write
  • Disk space exhausted

Solutions:

  • Use atomic file operations (write to temp, then rename)
  • Check available disk space before checkpointing
  • Implement checkpoint validation on load

Job Doesn’t Receive SIGTERM

Symptoms: Job runs until hard kill with no graceful shutdown.

Causes:

  • Job running in a subprocess that doesn’t forward signals
  • Container or wrapper script intercepting signals

Solutions:

  • Use exec in wrapper scripts to replace the shell
  • Configure signal forwarding in containers
  • Run the job directly without wrapper scripts

See Also

How to Monitor Resource Usage

This guide shows how to track CPU and memory usage of your workflow jobs and identify resource requirement mismatches.

Enable Resource Monitoring

Resource monitoring is enabled by default for all workflows. To explicitly configure it, add a resource_monitor section to your workflow specification:

name: "My Workflow"

resource_monitor:
  enabled: true
  granularity: "summary"       # or "time_series"
  sample_interval_seconds: 5

jobs:
  # ... your jobs

To disable monitoring when creating a workflow:

torc workflows create my_workflow.yaml --no-resource-monitoring

View Summary Metrics

For workflows using summary mode (default), view resource metrics with:

torc results list <workflow_id>

The output includes columns for peak and average CPU and memory usage.

Check for Resource Violations

Use check-resource-utilization to identify jobs that exceeded their specified requirements:

# Check latest run
torc reports check-resource-utilization <workflow_id>

# Check a specific run
torc reports check-resource-utilization <workflow_id> --run-id <run_id>

# Show all jobs, not just violations
torc reports check-resource-utilization <workflow_id> --all

Example output:

⚠ Found 3 resource over-utilization violations:

Job ID | Job Name         | Resource | Specified | Peak Used | Over-Utilization
-------|------------------|----------|-----------|-----------|------------------
15     | train_model      | Memory   | 8.00 GB   | 10.50 GB  | +31.3%
15     | train_model      | Runtime  | 2h 0m 0s  | 2h 45m 0s | +37.5%
16     | large_preprocess | CPU      | 800%      | 950.5%    | +18.8%

Adjust Resource Requirements

After identifying violations, update your workflow specification:

# Before: job used 10.5 GB but was allocated 8 GB
resource_requirements:
  - name: training
    memory: 8g
    runtime: PT2H

# After: increased with buffer
resource_requirements:
  - name: training
    memory: 12g       # 10.5 GB peak + 15% buffer
    runtime: PT3H     # 2h 45m actual + buffer

Guidelines for buffers:

  • Memory: Add 10-20% above peak usage
  • Runtime: Add 15-30% above actual duration
  • CPU: Round up to next core count

Enable Time Series Monitoring

For detailed resource analysis over time, switch to time series mode:

resource_monitor:
  granularity: "time_series"
  sample_interval_seconds: 2

This creates a SQLite database with samples at regular intervals.

Generate Resource Plots

Create interactive visualizations from time series data:

# Generate all plots
torc plot-resources output/resource_utilization/resource_metrics_*.db \
  -o plots/

# Generate plots for specific jobs
torc plot-resources output/resource_utilization/resource_metrics_*.db \
  -o plots/ \
  --job-ids 15,16

The tool generates:

  • Individual job plots showing CPU, memory, and process count over time
  • Overview plots comparing all jobs
  • Summary dashboard with bar charts

Query Time Series Data Directly

Access the SQLite database for custom analysis:

sqlite3 -table output/resource_utilization/resource_metrics_1_1.db
-- View samples for a specific job
SELECT job_id, timestamp, cpu_percent, memory_bytes, num_processes
FROM job_resource_samples
WHERE job_id = 1
ORDER BY timestamp;

-- View job metadata
SELECT * FROM job_metadata;

Troubleshooting

No metrics recorded

  • Check that monitoring wasn’t disabled with --no-resource-monitoring
  • Ensure jobs run long enough for at least one sample (default: 5 seconds)

Time series database not created

  • Verify the output directory is writable
  • Confirm granularity: "time_series" is set in the workflow spec

Missing child process metrics

  • Decrease sample_interval_seconds to catch short-lived processes

Next Steps

Terminal User Interface (TUI)

The Torc TUI provides a full-featured terminal interface for managing workflows, designed for HPC users working in terminal-over-SSH environments.

Quick Start

# Option 1: Connect to an existing server
torc-server run &   # Start server in background
torc tui            # Launch the TUI

# Option 2: Standalone mode (auto-starts server)
torc tui --standalone

# Option 3: Start TUI without server (manual connection)
torc tui            # Shows warning, use 'S' to start server

Standalone Mode

Use --standalone for single-machine development or testing:

# Basic standalone mode
torc tui --standalone

# Custom port
torc tui --standalone --port 8090

# Custom database location
torc tui --standalone --database /path/to/workflows.db

In standalone mode, the TUI automatically starts a torc-server process with the specified configuration.

Features

  • Workflow Management: Create, initialize, run, submit, cancel, reset, and delete workflows
  • Job Management: View details, logs, cancel, terminate, or retry jobs
  • Real-time Monitoring: Auto-refresh, manual refresh, color-coded status
  • Server Management: Start/stop torc-server directly from the TUI
  • File Viewing: Preview workflow files with search and navigation
  • DAG Visualization: Text-based dependency graph

Interface Overview

When the TUI starts, you’ll see:

┌─ Torc Management Console ────────────────────────────────────────┐
│ ?: help | n: new | i: init | I: reinit | R: reset | x: run ...  │
└──────────────────────────────────────────────────────────────────┘
┌─ Server ─────────────────────────────────────────────────────────┐
│ http://localhost:8080/torc-service/v1  S: start | K: stop | O: output │
└──────────────────────────────────────────────────────────────────┘
┌─ User Filter ────────────────────────────────────────────────────┐
│ Current: yourname  (press 'w' to change, 'a' for all users)     │
└──────────────────────────────────────────────────────────────────┘
┌─ Workflows [FOCUSED] ────────────────────────────────────────────┐
│ >> 1  | my-workflow    | yourname | Example workflow            │
│    2  | data-pipeline  | yourname | Data processing pipeline    │
└──────────────────────────────────────────────────────────────────┘

Basic Navigation

Key       Action
↑ / ↓     Move up/down in the current table
← / →     Switch focus between Workflows and Details panes
Tab       Switch between detail tabs (Jobs → Files → Events → Results → DAG)
Enter     Load details for selected workflow
q         Quit (or close popup/dialog)
?         Show help popup with all keybindings

Workflow Actions

Select a workflow and use these keys:

Key   Action          Description
n     New             Create workflow from spec file
i     Initialize      Set up job dependencies, mark ready jobs
I     Re-initialize   Reset and re-initialize (prompts if output files exist)
R     Reset           Reset all job statuses
x     Run             Run workflow locally (shows real-time output)
s     Submit          Submit to HPC scheduler (Slurm)
C     Cancel          Cancel running workflow
d     Delete          Delete workflow (destructive!)

All destructive actions show a confirmation dialog.

Handling Existing Output Files

When initializing or re-initializing a workflow, if existing output files are detected, the TUI will show a confirmation dialog listing the files that will be deleted. Press y to proceed with --force or n to cancel.

Job Management

Navigate to the Jobs tab (switch focus to the Details pane, then press Tab if needed) to manage individual jobs:

Key     Action
Enter   View job details
l       View job logs (stdout/stderr)
c       Cancel job
t       Terminate job
y       Retry failed job
f       Filter jobs by column

Job Status Colors

  • Green: Completed
  • Yellow: Running
  • Red: Failed
  • Magenta: Canceled/Terminated
  • Blue: Pending/Scheduled
  • Cyan: Ready
  • Gray: Blocked

Log Viewer

Press l on a job to view its logs:

Key           Action
Tab           Switch between stdout and stderr
↑ / ↓         Scroll one line
PgUp / PgDn   Scroll 20 lines
g / G         Jump to top / bottom
/             Start search
n / N         Next / previous search match
q             Close log viewer

File Viewer

Navigate to the Files tab and press Enter on a file to view its contents. The file viewer supports:

  • Files up to 1MB
  • Binary files show a hex dump preview
  • Same navigation keys as the log viewer

Server Management

The TUI can start and manage a torc-server instance:

Key   Action
S     Start torc-server
K     Stop/Kill server
O     Show server output

The server status indicator in the connection bar shows:

  • Green indicator: Server is running (managed by TUI)
  • Yellow indicator: Server was started but has stopped
  • No indicator: External server (not managed by TUI)

Connection Settings

Key   Action
u     Change server URL
w     Change user filter
a     Toggle show all users

Auto-Refresh

Press A to toggle auto-refresh (30-second interval). When enabled, the workflow list and details refresh automatically.

Configuration

The TUI respects Torc’s layered configuration system:

  1. Interactive changes in TUI (press u to change server URL)
  2. Environment variables (TORC_CLIENT__API_URL)
  3. Local config file (./torc.toml)
  4. User config file (~/.config/torc/config.toml)
  5. System config file (/etc/torc/config.toml)
  6. Default values
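
For example, to point the TUI at a non-default server for one session, you can set the documented environment variable before launching (the URL is a placeholder):

export TORC_CLIENT__API_URL="http://myserver:8080/torc-service/v1"
torc tui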

Troubleshooting

“Could not connect to server”

  1. Ensure the Torc server is running: torc-server run
  2. Check the server URL: press u to update if needed
  3. Verify network connectivity

“No log content available”

Logs may not be available if:

  • The job hasn’t run yet
  • You’re on a different machine than where jobs ran
  • The output directory is in a different location

Screen rendering issues

  • Ensure your terminal supports UTF-8 and 256 colors (see the check after this list)
  • Try resizing your terminal window
  • Press r to force a refresh
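
A quick shell check for both requirements (the values shown are typical, not required):

echo "$TERM"            # should name a 256-color-capable terminal, e.g. xterm-256color
locale | grep -i utf    # confirms a UTF-8 locale is active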

TUI vs Web Dashboard

Feature             TUI (torc tui)         Web (torc-dash)
Environment         Terminal/SSH           Web browser
Startup             Instant                ~2 seconds
Dependencies        None (single binary)   Python + packages
Workflow actions    Yes                    Yes
Job actions         Yes                    Yes
Log viewing         Yes                    Yes
DAG visualization   Text-based             Interactive graph
Resource plots      Planned                Yes

Choose the TUI for: SSH sessions, HPC environments, quick operations, low-bandwidth connections.

Choose torc-dash for: Rich visualizations, resource plots, team dashboards.

Web Dashboard (torc-dash)

The Torc Dashboard (torc-dash) provides a modern web-based interface for monitoring and managing workflows, offering an intuitive alternative to the command-line interface.

Overview

torc-dash is a Rust-based web application that allows you to:

  • Monitor workflows and jobs with real-time status updates
  • Create and run workflows by uploading specification files (YAML, JSON, JSON5, KDL)
  • Visualize workflow DAGs with interactive dependency graphs
  • Debug failed jobs with integrated log file viewer
  • Generate resource plots from time series monitoring data
  • Manage torc-server start/stop in standalone mode
  • View events with automatic polling for updates

Installation

Building from Source

torc-dash is built as part of the Torc workspace:

# Build torc-dash
cargo build --release -p torc-dash

# Binary location
./target/release/torc-dash

Prerequisites

  • A running torc-server (or use --standalone mode to auto-start one)
  • The torc CLI binary in your PATH (for workflow execution features)

Running the Dashboard

Quick Start (Standalone Mode)

The easiest way to get started is standalone mode, which automatically starts torc-server:

torc-dash --standalone

This will:

  1. Start torc-server on an automatically-detected free port
  2. Start the dashboard on http://127.0.0.1:8090
  3. Configure the dashboard to connect to the managed server

Connecting to an Existing Server

If you already have torc-server running:

# Use default API URL (http://localhost:8080/torc-service/v1)
torc-dash

# Specify custom API URL
torc-dash --api-url http://myserver:9000/torc-service/v1

# Or use environment variable
export TORC_API_URL="http://myserver:9000/torc-service/v1"
torc-dash

Command-Line Options

Options:
  -p, --port <PORT>           Dashboard port [default: 8090]
      --host <HOST>           Dashboard host [default: 127.0.0.1]
  -a, --api-url <API_URL>     Torc server API URL [default: http://localhost:8080/torc-service/v1]
      --torc-bin <PATH>       Path to torc CLI binary [default: torc]
      --torc-server-bin       Path to torc-server binary [default: torc-server]
      --standalone            Auto-start torc-server alongside dashboard
      --server-port <PORT>    Server port in standalone mode (0 = auto-detect) [default: 0]
      --database <PATH>       Database path for standalone server
      --completion-check-interval-secs <SECS>  Server polling interval [default: 5]

Features

Workflows Tab

The main workflows view provides:

  • Workflow list with ID, name, timestamp, user, and description
  • Create Workflow button to upload new workflow specifications
  • Quick actions for each workflow:
    • View details and DAG visualization
    • Initialize/reinitialize workflow
    • Run locally or submit to scheduler
    • Delete workflow

Creating Workflows

Click “Create Workflow” to open the creation dialog:

  1. Upload a file: Drag and drop or click to select a workflow specification file
    • Supports YAML, JSON, JSON5, and KDL formats
  2. Or enter a file path: Specify a path on the server filesystem
  3. Click “Create” to register the workflow

Details Tab

Explore workflow components with interactive tables:

  • Jobs: View all jobs with status, name, command, and dependencies
  • Files: Input/output files with paths and timestamps
  • User Data: Key-value data passed between jobs
  • Results: Execution results with return codes and resource metrics
  • Compute Nodes: Available compute resources
  • Resource Requirements: CPU, memory, GPU specifications
  • Schedulers: Slurm scheduler configurations

Features:

  • Workflow selector: Filter by workflow
  • Column sorting: Click headers to sort
  • Row filtering: Type in filter boxes (supports column:value syntax)
  • Auto-refresh: Toggle automatic updates

DAG Visualization

Click “View” on any workflow to see an interactive dependency graph:

  • Nodes represent jobs, colored by status
  • Edges show dependencies (file-based and explicit)
  • Zoom, pan, and click nodes for details
  • Legend shows status colors

Debugging Tab

Investigate failed jobs with the integrated debugger:

  1. Select a workflow
  2. Configure output directory (where logs are stored)
  3. Toggle “Show only failed jobs” to focus on problems
  4. Click “Generate Report” to fetch results
  5. Click any job row to view its log files:
    • stdout: Standard output from the job
    • stderr: Error output and stack traces
    • Copy file paths with one click

Events Tab

Monitor workflow activity:

  • Real-time event stream with automatic polling
  • Filter by workflow
  • View event types, timestamps, and details
  • Useful for tracking job state transitions

Resource Plots Tab

Visualize CPU and memory usage over time:

  1. Enter a base directory containing resource database files
  2. Click “Scan for Databases” to find .db files
  3. Select databases to plot
  4. Click “Generate Plots” for interactive Plotly charts

Requires workflows run with granularity: "time_series" in resource_monitor config.

Configuration Tab

Server Management

Start and stop torc-server directly from the dashboard:

  • Server Port: Port to listen on (0 = auto-detect free port)
  • Database Path: SQLite database file location
  • Completion Check Interval: How often to check for job completions
  • Log Level: Server logging verbosity

Click “Start Server” to launch, “Stop Server” to terminate.

API Configuration

  • API URL: Torc server endpoint
  • Test Connection: Verify connectivity

Settings are saved to browser local storage.

Common Usage Patterns

Running a Workflow

  1. Navigate to Workflows tab
  2. Click Create Workflow
  3. Upload your specification file
  4. Click Create
  5. Click Initialize on the new workflow
  6. Click Run Locally (or Submit for Slurm)
  7. Monitor progress in the Details tab or Events tab

Debugging a Failed Workflow

  1. Go to the Debugging tab
  2. Select the workflow
  3. Check “Show only failed jobs”
  4. Click Generate Report
  5. Click on a failed job row
  6. Review the stderr tab for error messages
  7. Check stdout for context

Monitoring Active Jobs

  1. Open Details tab
  2. Select “Jobs” and your workflow
  3. Enable Auto-refresh
  4. Watch job statuses update in real-time

Security Considerations

  1. Network Access: By default, binds to 127.0.0.1 (localhost only)
  2. Remote Access: Use --host 0.0.0.0 with caution; consider a reverse proxy with HTTPS (see the example after this list)
  3. Authentication: Torc server supports htpasswd-based authentication (see Authentication)
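
For example, a hedged invocation using the documented flags to expose the dashboard beyond localhost; in practice, put a TLS-terminating reverse proxy in front of it:

torc-dash --host 0.0.0.0 --port 8090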

Troubleshooting

Cannot Connect to Server

  • Verify torc-server is running: curl http://localhost:8080/torc-service/v1/workflows
  • Check the API URL in Configuration tab
  • In standalone mode, check server output for startup errors

Workflow Creation Fails

  • Ensure workflow specification is valid YAML/JSON/KDL
  • Check file paths are accessible from the server
  • Review browser console for error details

Resource Plots Not Showing

  • Verify workflow used granularity: "time_series" mode
  • Confirm .db files exist in the specified directory
  • Check that database files contain data (see the query below)
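
One way to confirm that a database actually contains samples, using the table name described in the resource monitoring guide (the file name is a placeholder):

sqlite3 output/resource_utilization/resource_metrics_1_1.db \
  "SELECT COUNT(*) FROM job_resource_samples;"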

Standalone Mode Server Won’t Start

  • Verify torc-server binary is in PATH or specify --torc-server-bin
  • Check if the port is already in use
  • Review console output for error messages

Architecture

torc-dash is a self-contained Rust binary with:

  • Axum web framework for HTTP server
  • Embedded static assets (HTML, CSS, JavaScript)
  • API proxy to forward requests to torc-server
  • CLI integration for workflow operations

The frontend uses vanilla JavaScript with:

  • Cytoscape.js for DAG visualization
  • Plotly.js for resource charts
  • Custom components for tables and forms

Next Steps

Visualizing Workflow Structure

Understanding how your workflow will execute—which jobs run in parallel, how dependencies create stages, and when Slurm allocations are requested—is essential for debugging and optimization. Torc provides several tools for visualizing workflow structure.

Execution Plan Command

The torc workflows execution-plan command analyzes a workflow and displays its execution stages, showing how jobs are grouped and when schedulers allocate resources.

Basic Usage

# From a specification file
torc workflows execution-plan workflow.yaml

# From an existing workflow
torc workflows execution-plan <workflow_id>

Example Output

For a workflow with two independent processing pipelines that merge at the end:

Workflow: two_subgraph_pipeline
Total Jobs: 15

▶ Stage 1: Workflow Start
  Scheduler Allocations:
    • prep_sched (slurm) - 1 allocation(s)
  Jobs Becoming Ready:
    • prep_a
    • prep_b

→ Stage 2: When jobs 'prep_a', 'prep_b' complete
  Scheduler Allocations:
    • work_a_sched (slurm) - 1 allocation(s)
    • work_b_sched (slurm) - 1 allocation(s)
  Jobs Becoming Ready:
    • work_a_{1..5}
    • work_b_{1..5}

→ Stage 3: When 10 jobs complete
  Scheduler Allocations:
    • post_a_sched (slurm) - 1 allocation(s)
    • post_b_sched (slurm) - 1 allocation(s)
  Jobs Becoming Ready:
    • post_a
    • post_b

→ Stage 4: When jobs 'post_a', 'post_b' complete
  Scheduler Allocations:
    • final_sched (slurm) - 1 allocation(s)
  Jobs Becoming Ready:
    • final

Total Stages: 4

What the Execution Plan Shows

  1. Stages: Groups of jobs that become ready at the same time based on dependency resolution
  2. Scheduler Allocations: Which Slurm schedulers request resources at each stage (for workflows with Slurm configuration)
  3. Jobs Becoming Ready: Which jobs transition to “ready” status at each stage
  4. Subgraphs: Independent branches of the workflow that can execute in parallel

Workflows Without Slurm Schedulers

For workflows without pre-defined Slurm schedulers, the execution plan shows the job stages without scheduler information:

torc workflows execution-plan workflow_no_slurm.yaml
Workflow: my_pipeline
Total Jobs: 10

▶ Stage 1: Workflow Start
  Jobs Becoming Ready:
    • preprocess

→ Stage 2: When job 'preprocess' completes
  Jobs Becoming Ready:
    • work_{1..5}

→ Stage 3: When 5 jobs complete
  Jobs Becoming Ready:
    • postprocess

Total Stages: 3

This helps you understand the workflow topology before adding Slurm configuration with torc slurm generate.

Use Cases

  • Validate workflow structure: Ensure dependencies create the expected execution order
  • Identify parallelism: See which jobs can run concurrently
  • Debug slow workflows: Find stages that serialize unnecessarily
  • Plan Slurm allocations: Understand when resources will be requested
  • Verify auto-generated schedulers: Check that torc slurm generate created appropriate staging

DAG Visualization in the Dashboard

The web dashboard provides interactive DAG (Directed Acyclic Graph) visualization.

Viewing the DAG

  1. Navigate to the Details tab
  2. Select a workflow
  3. Click View DAG in the Visualization section

DAG Types

The dashboard supports three DAG visualization types:

Type                     Description
Job Dependencies         Shows explicit and implicit dependencies between jobs
Job-File Relations       Shows how jobs connect through input/output files
Job-UserData Relations   Shows how jobs connect through user data

DAG Features

  • Color-coded nodes: Jobs are colored by status (ready, running, completed, failed, etc.)
  • Interactive: Zoom, pan, and click nodes for details
  • Layout: Automatic hierarchical layout using Dagre algorithm
  • Legend: Status color reference

TUI DAG View

The terminal UI (torc tui) also includes DAG visualization:

  1. Select a workflow
  2. Press d to toggle the DAG view
  3. Use arrow keys to navigate

Comparing Visualization Tools

Tool             Best For
execution-plan   Understanding execution stages, Slurm allocation timing
Dashboard DAG    Interactive exploration, status monitoring
TUI DAG          Quick terminal-based visualization

Example: Analyzing a Complex Workflow

Consider a workflow with preprocessing, parallel work, and aggregation:

# First, view the execution plan
torc workflows execution-plan examples/subgraphs/subgraphs_workflow.yaml

# If no schedulers, generate them
torc slurm generate --account myproject examples/subgraphs/subgraphs_workflow_no_slurm.yaml

# View the plan again to see scheduler allocations
torc workflows execution-plan examples/subgraphs/subgraphs_workflow.yaml

The execution plan helps you verify that:

  • Independent subgraphs are correctly identified
  • Stages align with your expected execution order
  • Slurm allocations are timed appropriately

See Also

Debugging Workflows

When workflows fail or produce unexpected results, Torc provides comprehensive debugging tools to help you identify and resolve issues. The primary debugging tools are:

  • torc results list: Prints a table of return codes for each job execution (non-zero means failure)
  • torc reports results: Generates a detailed JSON report containing job results and all associated log file paths
  • torc-dash Debug tab: Interactive web interface for visual debugging with log file viewer

Overview

Torc automatically captures return codes and multiple log files for each job execution:

  • Job stdout/stderr: Output from your job commands
  • Job runner logs: Internal logs from the Torc job runner
  • Slurm logs: Additional logs when using Slurm scheduler (see Debugging Slurm Workflows)

The reports results command consolidates all this information into a single JSON report, making it easy to locate and examine relevant log files for debugging.

Quick Start

View the job return codes in a table:

torc results list <workflow_id>
Results for workflow ID 2:
╭────┬────────┬───────┬────────┬─────────────┬───────────┬──────────┬────────────┬──────────────────────────┬────────╮
│ ID │ Job ID │ WF ID │ Run ID │ Return Code │ Exec Time │ Peak Mem │ Peak CPU % │ Completion Time          │ Status │
├────┼────────┼───────┼────────┼─────────────┼───────────┼──────────┼────────────┼──────────────────────────┼────────┤
│ 4  │ 6      │ 2     │ 1      │ 1           │ 1.01      │ 73.8MB   │ 21.9%      │ 2025-11-13T13:35:43.289Z │ Done   │
│ 5  │ 4      │ 2     │ 1      │ 0           │ 1.01      │ 118.1MB  │ 301.3%     │ 2025-11-13T13:35:43.393Z │ Done   │
│ 6  │ 5      │ 2     │ 1      │ 0           │ 1.01      │ 413.6MB  │ 19.9%      │ 2025-11-13T13:35:43.499Z │ Done   │
╰────┴────────┴───────┴────────┴─────────────┴───────────┴──────────┴────────────┴──────────────────────────┴────────╯

Total: 3 results

View only failed jobs:

torc results list <workflow_id> --failed
Results for workflow ID 2:
╭────┬────────┬───────┬────────┬─────────────┬───────────┬──────────┬────────────┬──────────────────────────┬────────╮
│ ID │ Job ID │ WF ID │ Run ID │ Return Code │ Exec Time │ Peak Mem │ Peak CPU % │ Completion Time          │ Status │
├────┼────────┼───────┼────────┼─────────────┼───────────┼──────────┼────────────┼──────────────────────────┼────────┤
│ 4  │ 6      │ 2     │ 1      │ 1           │ 1.01      │ 73.8MB   │ 21.9%      │ 2025-11-13T13:35:43.289Z │ Done   │
╰────┴────────┴───────┴────────┴─────────────┴───────────┴──────────┴────────────┴──────────────────────────┴────────╯

Generate a debugging report for a workflow:

# Generate report for a specific workflow
torc reports results <workflow_id>

# Specify custom output directory (default: "output")
torc reports results <workflow_id> --output-dir /path/to/output

# Include all workflow runs (default: only latest run)
torc reports results <workflow_id> --all-runs

# Interactive workflow selection (if workflow_id omitted)
torc reports results

The command outputs a comprehensive JSON report to stdout. Redirect it to a file for easier analysis:

torc reports results <workflow_id> > debug_report.json

Report Structure

Top-Level Fields

The JSON report includes workflow-level information:

{
  "workflow_id": 123,
  "workflow_name": "my_pipeline",
  "workflow_user": "researcher",
  "all_runs": false,
  "total_results": 5,
  "results": [...]
}

Fields:

  • workflow_id: Unique identifier for the workflow
  • workflow_name: Human-readable workflow name
  • workflow_user: Owner of the workflow
  • all_runs: Whether report includes all historical runs or just the latest
  • total_results: Number of job results in the report
  • results: Array of individual job result records

Job Result Records

Each entry in the results array contains detailed information about a single job execution:

{
  "job_id": 456,
  "job_name": "preprocess_data",
  "status": "Done",
  "run_id": 1,
  "return_code": 0,
  "completion_time": "2024-01-15T14:30:00.000Z",
  "exec_time_minutes": 5.2,
  "compute_node_id": 789,
  "compute_node_type": "local",
  "job_stdout": "output/job_stdio/job_456.o",
  "job_stderr": "output/job_stdio/job_456.e",
  "job_runner_log": "output/job_runner_hostname_123_1.log"
}

Core Fields:

  • job_id: Unique identifier for the job
  • job_name: Human-readable job name from workflow spec
  • status: Job status (Done, Terminated, Failed, etc.)
  • run_id: Workflow run number (increments on reinitialization)
  • return_code: Exit code from job command (0 = success)
  • completion_time: ISO 8601 timestamp when job completed
  • exec_time_minutes: Duration of job execution in minutes

Compute Node Fields:

  • compute_node_id: ID of the compute node that executed the job
  • compute_node_type: Type of compute node (“local” or “slurm”)

Log File Paths

The report includes paths to all log files associated with each job. The specific files depend on the compute node type.

Local Runner Log Files

For jobs executed by the local job runner (compute_node_type: "local"):

{
  "job_stdout": "output/job_stdio/job_456.o",
  "job_stderr": "output/job_stdio/job_456.e",
  "job_runner_log": "output/job_runner_hostname_123_1.log"
}

Log File Descriptions:

  1. job_stdout (output/job_stdio/job_<workflow_id>_<job_id>_<run_id>.o):

    • Standard output from your job command
    • Contains print statements, normal program output
    • Use for: Checking expected output, debugging logic errors
  2. job_stderr (output/job_stdio/job_<workflow_id>_<job_id>_<run_id>.e):

    • Standard error from your job command
    • Contains error messages, warnings, stack traces
    • Use for: Investigating crashes, exceptions, error messages
  3. job_runner_log (output/job_runner_<hostname>_<workflow_id>_<run_id>.log):

    • Internal Torc job runner logging
    • Shows job lifecycle events, resource monitoring, process management
    • Use for: Understanding Torc’s job execution behavior, timing issues

Log path format conventions:

  • Job stdio logs use job ID in filename
  • Runner logs use hostname, workflow ID, and run ID
  • All paths are relative to the specified --output-dir

Slurm Runner Log Files

For jobs executed via Slurm scheduler (compute_node_type: "slurm"), additional log files are available including Slurm stdout/stderr, environment logs, and dmesg logs.

See Debugging Slurm Workflows for detailed information about Slurm-specific log files and debugging tools.

Using the torc-dash Debugging Tab

The torc-dash web interface provides an interactive Debugging tab for visual debugging of workflow jobs. This is often the quickest way to investigate failed jobs without using command-line tools.

Accessing the Debugging Tab

  1. Start torc-dash (standalone mode recommended for quick setup):

    torc-dash --standalone
    
  2. Open your browser to http://localhost:8090

  3. Select a workflow from the dropdown in the sidebar

  4. Click the Debugging tab in the navigation

Features

Job Results Report

The Debug tab provides a report generator with the following options:

  • Output Directory: Specify where job logs are stored (default: output). This must match the directory used during workflow execution.

  • Include all runs: Check this to see results from all workflow runs, not just the latest. Useful for comparing job behavior across reinitializations.

  • Show only failed jobs: Filter to display only jobs with non-zero return codes. This is checked by default to help you focus on problematic jobs.

Click Generate Report to fetch job results from the server.

Job Results Table

After generating a report, the Debug tab displays an interactive table showing:

  • Job ID: Unique identifier for the job
  • Job Name: Human-readable name from the workflow spec
  • Status: Job completion status (Done, Terminated, etc.)
  • Return Code: Exit code (0 = success, non-zero = failure)
  • Execution Time: Duration in minutes
  • Run ID: Which workflow run the result is from

Click any row to select a job and view its log files.

Log File Viewer

When you select a job from the table, the Log File Viewer displays:

  • stdout tab: Standard output from the job command

    • Shows print statements and normal program output
    • Useful for checking expected behavior and debugging logic
  • stderr tab: Standard error from the job command

    • Shows error messages, warnings, and stack traces
    • Primary location for investigating crashes and exceptions

Each tab includes:

  • Copy Path button: Copy the full file path to clipboard
  • File path display: Shows where the log file is located
  • Scrollable content viewer: Dark-themed viewer for easy reading

Quick Debugging Workflow with torc-dash

  1. Open torc-dash and select your workflow from the sidebar
  2. Go to the Debugging tab
  3. Ensure “Show only failed jobs” is checked
  4. Click Generate Report
  5. Click on a failed job in the results table
  6. Review the stderr tab for error messages
  7. Check the stdout tab for context about what the job was doing

When to Use torc-dash vs CLI

Use torc-dash Debugging tab when:

  • You want a visual, interactive debugging experience
  • You need to quickly scan multiple failed jobs
  • You’re investigating jobs and want to easily switch between stdout/stderr
  • You prefer not to construct jq queries manually

Use CLI tools (torc reports results) when:

  • You need to automate failure detection in CI/CD
  • You want to save reports for archival or version control
  • You’re working on a remote server without browser access
  • You need to process results programmatically

Common Debugging Workflows

Investigating Failed Jobs

When a job fails, follow these steps:

  1. Generate the debug report:

    torc reports results <workflow_id> > debug_report.json
    
  2. Find the failed job using jq or similar tool:

    # Find jobs with non-zero return codes
    jq '.results[] | select(.return_code != 0)' debug_report.json
    
    # Find jobs with specific status
    jq '.results[] | select(.status == "Done")' debug_report.json
    
  3. Check the job’s stderr for error messages:

    # Extract stderr path for a specific job
    STDERR_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_stderr' debug_report.json)
    
    # View the error output
    cat "$STDERR_PATH"
    
  4. Review job stdout for context:

    STDOUT_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_stdout' debug_report.json)
    cat "$STDOUT_PATH"
    
  5. Check runner logs for execution issues:

    LOG_PATH=$(jq -r '.results[] | select(.job_name == "my_failing_job") | .job_runner_log' debug_report.json)
    cat "$LOG_PATH"
    

Example: Complete Debugging Session

# 1. Generate report
torc reports results 123 > report.json

# 2. Check overall success/failure counts
echo "Total jobs: $(jq '.total_results' report.json)"
echo "Failed jobs: $(jq '[.results[] | select(.return_code != 0)] | length' report.json)"

# 3. List all failed jobs with their names
jq -r '.results[] | select(.return_code != 0) | "\(.job_id): \(.job_name) (exit code: \(.return_code))"' report.json

# Output:
# 456: process_batch_2 (exit code: 1)
# 789: validate_results (exit code: 2)

# 4. Examine stderr for first failure
jq -r '.results[] | select(.job_id == 456) | .job_stderr' report.json | xargs cat

# Output might show:
# FileNotFoundError: [Errno 2] No such file or directory: 'input/batch_2.csv'

# 5. Check if job dependencies completed successfully
# (The missing file might be an output from a previous job)
jq -r '.results[] | select(.job_name == "generate_batch_2") | "\(.status) (exit code: \(.return_code))"' report.json

Debugging Across Multiple Runs

When a workflow has been reinitialized multiple times, compare runs to identify regressions:

# Generate report with all historical runs
torc reports results <workflow_id> --all-runs > full_history.json

# Compare return codes across runs for a specific job
jq -r '.results[] | select(.job_name == "flaky_job") | "Run \(.run_id): exit code \(.return_code)"' full_history.json

# Output:
# Run 1: exit code 0
# Run 2: exit code 1
# Run 3: exit code 0
# Run 4: exit code 1

# Extract stderr paths for failed runs
jq -r '.results[] | select(.job_name == "flaky_job" and .return_code != 0) | "Run \(.run_id): \(.job_stderr)"' full_history.json

Log File Missing Warnings

The reports results command automatically checks for log file existence and prints warnings to stderr if files are missing:

Warning: job stdout log file does not exist for job 456: output/job_stdio/job_456.o
Warning: job runner log file does not exist for job 456: output/job_runner_host1_123_1.log

Common causes of missing log files:

  1. Wrong output directory: Ensure --output-dir matches the directory used during workflow execution
  2. Logs not yet written: Job may still be running or failed to start
  3. Logs cleaned up: Files may have been manually deleted
  4. Path mismatch: Output directory moved or renamed after execution

Solution: Verify the output directory and ensure it matches what was passed to torc run or torc slurm schedule-nodes.

Output Directory Management

The --output-dir parameter must match the directory used during workflow execution:

Local Runner

# Execute workflow with specific output directory
torc run <workflow_id> /path/to/my_output

# Generate report using the same directory
torc reports results <workflow_id> --output-dir /path/to/my_output

Slurm Scheduler

# Submit jobs to Slurm with output directory
torc slurm schedule-nodes <workflow_id> --output-dir /path/to/my_output

# Generate report using the same directory
torc reports results <workflow_id> --output-dir /path/to/my_output

Default behavior: If --output-dir is not specified, both the runner and reports command default to ./output.

Best Practices

  1. Generate reports after each run: Create a debug report immediately after workflow execution for easier troubleshooting

  2. Archive reports with logs: Store the JSON report alongside log files for future reference

    torc reports results "$WF_ID" > "output/report_${WF_ID}_$(date +%Y%m%d_%H%M%S).json"
    
  3. Use version control: Commit debug reports for important workflow runs to track changes over time

  4. Automate failure detection: Use the report in CI/CD pipelines to automatically detect and report failures
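
    For example, a hedged snippet that fails a CI step when a report produced by torc reports results contains any non-zero return code:

    FAILED=$(jq '[.results[] | select(.return_code != 0)] | length' report.json)
    if [ "$FAILED" -gt 0 ]; then
        echo "Detected $FAILED failed job(s)" >&2
        exit 1
    fi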

  5. Check warnings: Pay attention to warnings about missing log files - they often indicate configuration issues

  6. Combine with resource monitoring: Use reports results for log files and reports check-resource-utilization for performance issues

    # Check if job failed due to resource constraints
    torc reports check-resource-utilization "$WF_ID"
    torc reports results "$WF_ID" > report.json
    
  7. Filter large reports: For workflows with many jobs, filter the report to focus on relevant jobs

    # Only include failed jobs in filtered report
    jq '{workflow_id, workflow_name, results: [.results[] | select(.return_code != 0)]}' report.json
    

Troubleshooting Common Issues

“Output directory does not exist” Error

Cause: The specified --output-dir path doesn’t exist.

Solution: Verify the directory exists and the path is correct:

ls -ld output/  # Check if directory exists
torc reports results <workflow_id> --output-dir "$(pwd)/output"

Empty Results Array

Cause: No job results exist for the workflow (jobs not yet executed or initialized).

Solution: Check workflow status and ensure jobs have been completed:

torc workflows status <workflow_id>
torc results list <workflow_id>  # Verify results exist

All Log Paths Show Warnings

Cause: Output directory mismatch between execution and report generation.

Solution: Verify the output directory used during execution:

# Check where logs actually are
find . -name "job_*.o" -o -name "job_runner_*.log"

# Use correct output directory in report
torc reports results <workflow_id> --output-dir <correct_path>

See Also

  • torc results list: View summary of job results in table format
  • torc workflows status: Check overall workflow status
  • torc reports results: Generate debug report with all log file paths
  • torc reports check-resource-utilization: Analyze resource usage and find over-utilized jobs
  • torc jobs list: View all jobs and their current status
  • torc-dash: Launch web interface with interactive Debugging tab
  • torc tui: Launch terminal UI for workflow monitoring

For Slurm-specific debugging tools (torc slurm parse-logs, torc slurm sacct, etc.), see Debugging Slurm Workflows.

Debugging Slurm Workflows

When running workflows on Slurm clusters, Torc provides additional debugging tools specifically designed for Slurm environments. This guide covers Slurm-specific debugging techniques and tools.

For general debugging concepts and tools that apply to all workflows, see Debugging Workflows.

Overview

Slurm workflows generate additional log files beyond the standard job logs:

  • Slurm stdout/stderr: Output from Slurm’s perspective (job allocation, environment setup)
  • Slurm environment logs: All SLURM environment variables captured at job runner startup
  • dmesg logs: Kernel message buffer captured when the Slurm job runner exits

These logs help diagnose issues specific to the cluster environment, such as resource allocation failures, node problems, and system-level errors.

Slurm Log File Structure

For jobs executed via Slurm scheduler (compute_node_type: "slurm"), the debug report includes these additional log paths:

{
  "job_stdout": "output/job_stdio/job_456.o",
  "job_stderr": "output/job_stdio/job_456.e",
  "job_runner_log": "output/job_runner_slurm_12345_node01_67890.log",
  "slurm_stdout": "output/slurm_output_12345.o",
  "slurm_stderr": "output/slurm_output_12345.e",
  "slurm_env_log": "output/slurm_env_12345_node01_67890.log",
  "dmesg_log": "output/dmesg_slurm_12345_node01_67890.log"
}

Log File Descriptions

  1. slurm_stdout (output/slurm_output_<slurm_job_id>.o):

    • Standard output from Slurm’s perspective
    • Includes Slurm environment setup, job allocation info
    • Use for: Debugging Slurm job submission issues
  2. slurm_stderr (output/slurm_output_<slurm_job_id>.e):

    • Standard error from Slurm’s perspective
    • Contains Slurm-specific errors (allocation failures, node issues)
    • Use for: Investigating Slurm scheduler problems
  3. slurm_env_log (output/slurm_env_<slurm_job_id>_<node_id>_<task_pid>.log):

    • All SLURM environment variables captured at job runner startup
    • Contains job allocation details, resource limits, node assignments
    • Use for: Verifying Slurm job configuration, debugging resource allocation issues
  4. dmesg log (output/dmesg_slurm_<slurm_job_id>_<node_id>_<task_pid>.log):

    • Kernel message buffer captured when the Slurm job runner exits
    • Contains system-level events: OOM killer activity, hardware errors, kernel panics
    • Use for: Investigating job failures caused by system-level issues (e.g., out-of-memory kills, hardware failures); see the grep example after this list
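
A quick way to scan the dmesg logs from the shell for OOM-related entries (the patterns are illustrative, not exhaustive):

grep -iE "oom|out of memory|killed process" output/dmesg_slurm_*.log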

Note: Slurm job runner logs include the Slurm job ID, node ID, and task PID in the filename for correlation with Slurm’s own logs.

Parsing Slurm Log Files for Errors

The torc slurm parse-logs command scans Slurm stdout/stderr log files for known error patterns and correlates them with affected Torc jobs:

# Parse logs for a specific workflow
torc slurm parse-logs <workflow_id>

# Specify custom output directory
torc slurm parse-logs <workflow_id> --output-dir /path/to/output

# Output as JSON for programmatic processing
torc slurm parse-logs <workflow_id> --format json

Detected Error Patterns

The command detects common Slurm failure patterns including:

Memory Errors:

  • out of memory, oom-kill, cannot allocate memory
  • memory cgroup out of memory, Exceeded job memory limit
  • task/cgroup: .*: Killed
  • std::bad_alloc (C++), MemoryError (Python)

Slurm-Specific Errors:

  • slurmstepd: error:, srun: error:
  • DUE TO TIME LIMIT, DUE TO PREEMPTION
  • NODE_FAIL, FAILED, CANCELLED
  • Exceeded.*step.*limit

GPU/CUDA Errors:

  • CUDA out of memory, CUDA error, GPU memory.*exceeded

Signal/Crash Errors:

  • Segmentation fault, SIGSEGV
  • Bus error, SIGBUS
  • killed by signal, core dumped

Python Errors:

  • Traceback (most recent call last)
  • ModuleNotFoundError, ImportError

File System Errors:

  • No space left on device, Disk quota exceeded
  • Read-only file system, Permission denied

Network Errors:

  • Connection refused, Connection timed out, Network is unreachable

Example Output

Table format:

Slurm Log Analysis Results
==========================

Found 2 error(s) in log files:

╭─────────────────────────────┬──────────────┬──────┬─────────────────────────────┬──────────┬──────────────────────────────╮
│ File                        │ Slurm Job ID │ Line │ Pattern                     │ Severity │ Affected Torc Jobs           │
├─────────────────────────────┼──────────────┼──────┼─────────────────────────────┼──────────┼──────────────────────────────┤
│ slurm_output_12345.e        │ 12345        │ 42   │ Out of Memory (OOM) Kill    │ critical │ process_data (ID: 456)       │
│ slurm_output_12346.e        │ 12346        │ 15   │ CUDA out of memory          │ error    │ train_model (ID: 789)        │
╰─────────────────────────────┴──────────────┴──────┴─────────────────────────────┴──────────┴──────────────────────────────╯

Viewing Slurm Accounting Data

The torc slurm sacct command displays a summary of Slurm job accounting data for all scheduled compute nodes in a workflow:

# Display sacct summary table for a workflow
torc slurm sacct <workflow_id>

# Also save full JSON files for detailed analysis
torc slurm sacct <workflow_id> --save-json --output-dir /path/to/output

# Output as JSON for programmatic processing
torc slurm sacct <workflow_id> --format json

Summary Table Fields

The command displays a summary table with key metrics:

  • Slurm Job: The Slurm job ID
  • Job Step: Name of the job step (e.g., “worker_1”, “batch”)
  • State: Job state (COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, etc.)
  • Exit Code: Exit code of the job step
  • Elapsed: Wall clock time for the job step
  • Max RSS: Maximum resident set size (memory usage)
  • CPU Time: Total CPU time consumed
  • Nodes: Compute nodes used

Example Output

Slurm Accounting Summary for Workflow 123

╭────────────┬───────────┬───────────┬───────────┬─────────┬─────────┬──────────┬─────────╮
│ Slurm Job  │ Job Step  │ State     │ Exit Code │ Elapsed │ Max RSS │ CPU Time │ Nodes   │
├────────────┼───────────┼───────────┼───────────┼─────────┼─────────┼──────────┼─────────┤
│ 12345      │ worker_1  │ COMPLETED │ 0         │ 2h 15m  │ 4.5GB   │ 4h 30m   │ node01  │
│ 12345      │ batch     │ COMPLETED │ 0         │ 2h 16m  │ 128.0MB │ 1m 30s   │ node01  │
│ 12346      │ worker_1  │ FAILED    │ 1         │ 45m 30s │ 8.2GB   │ 1h 30m   │ node02  │
╰────────────┴───────────┴───────────┴───────────┴─────────┴─────────┴──────────┴─────────╯

Total: 3 job steps

Saving Full JSON Output

Use --save-json to save full sacct JSON output to files for detailed analysis:

torc slurm sacct 123 --save-json --output-dir output
# Creates: output/sacct_12345.json, output/sacct_12346.json, etc.

Viewing Slurm Logs in torc-dash

The torc-dash web interface provides two ways to view Slurm logs:

Debugging Tab - Slurm Log Analysis

The Debugging tab includes a “Slurm Log Analysis” section:

  1. Navigate to the Debugging tab
  2. Find the Slurm Log Analysis section
  3. Enter the output directory path (default: output)
  4. Click Analyze Slurm Logs

The results show all detected errors with their Slurm job IDs, line numbers, error patterns, severity levels, and affected Torc jobs.

Debugging Tab - Slurm Accounting Data

The Debugging tab also includes a “Slurm Accounting Data” section:

  1. Navigate to the Debugging tab
  2. Find the Slurm Accounting Data section
  3. Click Collect sacct Data

This displays a summary table showing job state, exit codes, elapsed time, memory usage (Max RSS), CPU time, and nodes for all Slurm job steps. The table helps quickly identify failed jobs and resource usage patterns.

Scheduled Nodes Tab - View Slurm Logs

You can view individual Slurm job logs directly from the Details view:

  1. Select a workflow
  2. Go to the Details tab
  3. Switch to the Scheduled Nodes sub-tab
  4. Find a Slurm scheduled node in the table
  5. Click the View Logs button in the Logs column

This opens a modal with tabs for viewing the Slurm job’s stdout and stderr files.

Viewing Slurm Logs in the TUI

The torc tui terminal interface also supports Slurm log viewing:

  1. Launch the TUI: torc tui
  2. Select a workflow and press Enter to load details
  3. Press Tab to switch to the Scheduled Nodes tab
  4. Navigate to a Slurm scheduled node using arrow keys
  5. Press l to view the Slurm job’s logs

The log viewer shows:

  • stdout tab: Slurm job standard output (slurm_output_<id>.o)
  • stderr tab: Slurm job standard error (slurm_output_<id>.e)

Use Tab to switch between stdout/stderr, arrow keys to scroll, / to search, and q to close.

Debugging Slurm Job Failures

When a Slurm job fails, follow this debugging workflow:

  1. Parse logs for known errors:

    torc slurm parse-logs <workflow_id>
    
  2. If OOM or resource issues are detected, collect sacct data:

    torc slurm sacct <workflow_id>
    cat output/sacct_<slurm_job_id>.json | jq '.jobs[].steps[].tres.requested'
    
  3. View the specific Slurm log files:

    • Use torc-dash: Details → Scheduled Nodes → View Logs
    • Or use TUI: Scheduled Nodes tab → press l
    • Or directly: cat output/slurm_output_<slurm_job_id>.e
  4. Check the job’s own stderr for application errors:

    torc reports results <workflow_id> > report.json
    jq -r '.results[] | select(.return_code != 0) | .job_stderr' report.json | xargs cat
    
  5. Review dmesg logs for system-level issues:

    cat output/dmesg_slurm_<slurm_job_id>_*.log
    

Common Slurm Issues and Solutions

Out of Memory (OOM) Kills

Symptoms:

  • torc slurm parse-logs shows “Out of Memory (OOM) Kill”
  • Job exits with signal 9 (SIGKILL)
  • dmesg log shows “oom-kill” entries

Solutions:

  • Increase memory request in job specification
  • Check torc slurm sacct output for actual memory usage (Max RSS)
  • Consider splitting job into smaller chunks

Time Limit Exceeded

Symptoms:

  • torc slurm parse-logs shows “DUE TO TIME LIMIT”
  • Job state in sacct shows “TIMEOUT”

Solutions:

  • Increase runtime in job specification
  • Check if job is stuck (review stdout for progress)
  • Consider optimizing the job or splitting into phases

Node Failures

Symptoms:

  • torc slurm parse-logs shows “NODE_FAIL”
  • Job may have completed partially

Solutions:

  • Reinitialize workflow to retry failed jobs
  • Check cluster status with sinfo
  • Review dmesg logs for hardware issues

GPU/CUDA Errors

Symptoms:

  • torc slurm parse-logs shows “CUDA out of memory” or “CUDA error”

Solutions:

  • Reduce batch size or model size
  • Check GPU memory with nvidia-smi in job script
  • Ensure correct CUDA version is loaded

See Also

  • torc slurm parse-logs: Parse Slurm logs for known error patterns
  • torc slurm sacct: Collect Slurm accounting data for workflow jobs
  • torc reports results: Generate debug report with all log file paths
  • torc results list: View summary of job results in table format
  • torc-dash: Launch web interface with Slurm log viewing
  • torc tui: Launch terminal UI with Slurm log viewing

For general debugging tools and workflows, see Debugging Workflows.

Authentication

Torc supports HTTP Basic authentication to secure access to your workflow orchestration server. This guide explains how to set up and use authentication.

Overview

Torc’s authentication system provides:

  • Multi-user support via htpasswd files
  • Bcrypt password hashing for secure credential storage
  • Backward compatibility - authentication is optional by default
  • Flexible deployment - can require authentication or allow mixed access
  • CLI and environment variable support for credentials

Server-Side Setup

1. Create User Accounts

Use the torc-htpasswd utility to manage user accounts:

# Add a user (will prompt for password)
torc-htpasswd add --file /path/to/htpasswd username

# Add a user with password on command line
torc-htpasswd add --file /path/to/htpasswd --password mypassword username

# Add a user with custom bcrypt cost (higher = more secure but slower)
torc-htpasswd add --file /path/to/htpasswd --cost 14 username

# List all users
torc-htpasswd list --file /path/to/htpasswd

# Verify a password
torc-htpasswd verify --file /path/to/htpasswd username

# Remove a user
torc-htpasswd remove --file /path/to/htpasswd username

The htpasswd file format is simple:

# Torc htpasswd file
# Format: username:bcrypt_hash
alice:$2b$12$abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOP
bob:$2b$12$zyxwvutsrqponmlkjihgfedcba0987654321ZYXWVUTSRQPONMLK

2. Start Server with Authentication

# Optional authentication (backward compatible mode)
torc-server run --auth-file /path/to/htpasswd

# Required authentication (all requests must authenticate)
torc-server run --auth-file /path/to/htpasswd --require-auth

# Can also use environment variable
export TORC_AUTH_FILE=/path/to/htpasswd
torc-server run

Authentication Modes:

  • No --auth-file: Authentication disabled, all requests allowed (default)
  • --auth-file only: Authentication optional - authenticated requests are logged, unauthenticated requests allowed
  • --auth-file --require-auth: Authentication required - unauthenticated requests are rejected

3. Server Logs

The server logs authentication events:

INFO  torc_server: Loading htpasswd file from: /path/to/htpasswd
INFO  torc_server: Loaded 3 users from htpasswd file
INFO  torc_server: Authentication is REQUIRED for all requests
...
DEBUG torc::server::auth: User 'alice' authenticated successfully
WARN  torc::server::auth: Authentication failed for user 'bob'
WARN  torc::server::auth: Authentication required but no credentials provided

Client-Side Usage

Using Command-Line Flags

# Provide credentials via flags
torc --username alice --password mypassword workflows list

# Username via flag, password will be prompted
torc --username alice workflows list
Password: ****

# All commands support authentication
torc --username alice --password mypassword workflows create workflow.yaml

Using Environment Variables

# Set credentials in environment
export TORC_USERNAME=alice
export TORC_PASSWORD=mypassword

# Run commands without flags
torc workflows list
torc jobs list my-workflow-id

Mixed Approach

# Username from env, password prompted
export TORC_USERNAME=alice
torc workflows list
Password: ****

# Override env with flag
export TORC_USERNAME=alice
torc --username bob --password bobpass workflows list

Security Best Practices

1. Use HTTPS in Production

Basic authentication sends base64-encoded credentials (easily decoded). Always use HTTPS when authentication is enabled:

# Start server with HTTPS
torc-server run --https --auth-file /path/to/htpasswd --require-auth

# Client connects via HTTPS
torc --url https://torc.example.com/torc-service/v1 --username alice workflows list

2. Secure Credential Storage

Do:

  • Store htpasswd files with restrictive permissions: chmod 600 /path/to/htpasswd
  • Use environment variables for passwords in scripts
  • Use password prompting for interactive sessions
  • Rotate passwords periodically

Don’t:

  • Commit htpasswd files to version control
  • Share htpasswd files between environments
  • Pass passwords as command-line arguments in production (visible in process list)
  • Use weak passwords or low bcrypt costs

3. Bcrypt Cost Factor

The cost factor determines password hashing strength:

  • Cost 4-8: Fast but weaker (testing only)
  • Cost 10-12: Balanced (default: 12)
  • Cost 13-15: Strong (production systems)
  • Cost 16+: Very strong (high-security environments)

# Use higher cost for production
torc-htpasswd add --file prod_htpasswd --cost 14 alice

4. Audit Logging

Monitor authentication events in server logs:

# Run server with debug logging for auth events
torc-server run --log-level debug --auth-file /path/to/htpasswd

# Or use RUST_LOG for granular control
RUST_LOG=torc::server::auth=debug torc-server run --auth-file /path/to/htpasswd

Common Workflows

Development Environment

# 1. Create test user
torc-htpasswd add --file dev_htpasswd --password devpass developer

# 2. Start server (auth optional)
torc-server run --auth-file dev_htpasswd --database dev.db

# 3. Use client without auth (still works)
torc workflows list

# 4. Or with auth
torc --username developer --password devpass workflows list

Production Deployment

# 1. Create production users with strong passwords and high cost
torc-htpasswd add --file /etc/torc/htpasswd --cost 14 alice
torc-htpasswd add --file /etc/torc/htpasswd --cost 14 bob

# 2. Secure the file
chmod 600 /etc/torc/htpasswd
chown torc-server:torc-server /etc/torc/htpasswd

# 3. Start server with required auth and HTTPS
torc-server run \
  --https \
  --auth-file /etc/torc/htpasswd \
  --require-auth \
  --database /var/lib/torc/production.db

# 4. Clients must authenticate
export TORC_USERNAME=alice
torc --url https://torc.example.com/torc-service/v1 workflows list
Password: ****

CI/CD Pipeline

# Store credentials as CI secrets
# TORC_USERNAME=ci-bot
# TORC_PASSWORD=<secure-password>

# Use in pipeline
export TORC_USERNAME="${TORC_USERNAME}"
export TORC_PASSWORD="${TORC_PASSWORD}"
export TORC_API_URL=https://torc.example.com/torc-service/v1

# Run workflow
torc workflows create pipeline.yaml
torc workflows start "${WORKFLOW_ID}"

Migrating from No Auth to Required Auth

# 1. Start: No authentication
torc-server run --database prod.db

# 2. Add authentication file (optional mode)
torc-server run --auth-file /etc/torc/htpasswd --database prod.db

# 3. Monitor logs, ensure clients are authenticating
# Look for "User 'X' authenticated successfully" messages

# 4. Once all clients authenticate, enable required auth
torc-server run --auth-file /etc/torc/htpasswd --require-auth --database prod.db

Troubleshooting

“Authentication required but no credentials provided”

Cause: Server has --require-auth but client didn’t send credentials.

Solution:

# Add username and password
torc --username alice --password mypass workflows list

“Authentication failed for user ‘alice’”

Cause: Wrong password or user doesn’t exist in htpasswd file.

Solutions:

# 1. Verify user exists
torc-htpasswd list --file /path/to/htpasswd

# 2. Verify password
torc-htpasswd verify --file /path/to/htpasswd alice

# 3. Reset password
torc-htpasswd add --file /path/to/htpasswd alice

“No credentials provided, allowing anonymous access”

Cause: Server has --auth-file but not --require-auth, and client didn’t authenticate.

Solution: This is normal in optional auth mode. To require auth:

torc-server run --auth-file /path/to/htpasswd --require-auth

Password Prompting in Non-Interactive Sessions

Problem: Scripts or CI/CD fail waiting for password prompt.

Solutions:

# Use environment variable
export TORC_PASSWORD=mypassword
torc --username alice workflows list

# Or pass as flag (less secure - visible in process list)
torc --username alice --password mypassword workflows list

Advanced Topics

Multiple Environments

Maintain separate htpasswd files per environment:

# Development
torc-htpasswd add --file ~/.torc/dev_htpasswd --password devpass developer

# Staging
torc-htpasswd add --file /etc/torc/staging_htpasswd --cost 12 alice

# Production
torc-htpasswd add --file /etc/torc/prod_htpasswd --cost 14 alice

Programmatic Access

When using Torc’s Rust, Python, or Julia clients programmatically:

Rust:

#![allow(unused)]
fn main() {
use torc::client::apis::configuration::Configuration;

let mut config = Configuration::new();
config.base_path = "http://localhost:8080/torc-service/v1".to_string();
config.basic_auth = Some(("alice".to_string(), Some("password".to_string())));
}

Python:

from torc import Configuration, ApiClient

config = Configuration(
    host="http://localhost:8080/torc-service/v1",
    username="alice",
    password="password"
)
# Pass the configuration to the generated ApiClient to make authenticated calls
client = ApiClient(config)

Julia:

using Torc
using Base64
import OpenAPI

client = OpenAPI.Clients.Client(
    "http://localhost:8080/torc-service/v1";
    headers = Dict("Authorization" => "Basic " * base64encode("alice:password"))
)
api = Torc.APIClient.DefaultApi(client)

Load Balancer Considerations

When running multiple Torc servers behind a load balancer:

  • Share the same htpasswd file across all servers (via NFS, S3, etc.)
  • Or use a configuration management tool to sync htpasswd files (see the sketch below)
  • Monitor for htpasswd file changes and reload if needed
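
One simple approach for the sync option, assuming SSH access between the servers (host name and path are placeholders):

rsync -a /etc/torc/htpasswd other-server:/etc/torc/htpasswd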

Shell Completions

Torc provides shell completion scripts to make working with the CLI faster and more convenient. Completions help you discover commands, avoid typos, and speed up your workflow.

Overview

Shell completions provide:

  • Command completion - Tab-complete torc subcommands and options
  • Flag completion - Tab-complete command-line flags and their values
  • Multi-shell support - Bash, Zsh, Fish, Elvish, and PowerShell
  • Automatic updates - Completions are generated from the CLI structure

Generating Completions

Use the torc completions command to generate completion scripts for your shell:

# See available shells
torc completions --help

# Generate for a specific shell
torc completions bash
torc completions zsh
torc completions fish
torc completions elvish
torc completions powershell

Installation

Bash

User installation

# Create completions directory if it doesn't exist
mkdir -p ~/.local/share/bash-completion/completions

# Generate and install completions
torc completions bash > ~/.local/share/bash-completion/completions/torc

# Source the completion file in your current shell
source ~/.local/share/bash-completion/completions/torc

Verify installation:

# Restart your shell or source the completion file
source ~/.local/share/bash-completion/completions/torc

# Test completions
torc wor<TAB>      # Should complete to "workflows"
torc workflows <TAB>  # Should show workflow subcommands

Zsh

Option 1: User installation (recommended)

# Create completions directory in your home directory
mkdir -p ~/.zfunc

# Add to fpath in your ~/.zshrc if not already present
echo 'fpath=(~/.zfunc $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc

# Generate and install completions
torc completions zsh > ~/.zfunc/_torc

# Restart shell or source ~/.zshrc
source ~/.zshrc

Option 2: Using custom location

# Generate to a custom location
mkdir -p ~/my-completions
torc completions zsh > ~/my-completions/_torc

# Add to ~/.zshrc
echo 'fpath=(~/my-completions $fpath)' >> ~/.zshrc
echo 'autoload -Uz compinit && compinit' >> ~/.zshrc

# Restart shell
exec zsh

Troubleshooting Zsh completions:

If completions aren’t working, try rebuilding the completion cache:

# Remove completion cache
rm -f ~/.zcompdump

# Restart shell
exec zsh

Fish

# Fish automatically loads completions from ~/.config/fish/completions/
mkdir -p ~/.config/fish/completions

# Generate and install completions
torc completions fish > ~/.config/fish/completions/torc.fish

# Fish will automatically load the completions
# Test immediately (no shell restart needed)
torc wor<TAB>

Elvish

# Create completions directory
mkdir -p ~/.elvish/lib

# Generate completions
torc completions elvish > ~/.elvish/lib/torc.elv

# Add to your ~/.elvish/rc.elv
echo 'use torc' >> ~/.elvish/rc.elv

# Restart shell

PowerShell

Windows PowerShell / PowerShell Core:

# Create profile directory if it doesn't exist
New-Item -ItemType Directory -Force -Path (Split-Path -Parent $PROFILE)

# Generate completions to a file
torc completions powershell > $HOME\.config\torc_completions.ps1

# Add to your PowerShell profile
Add-Content -Path $PROFILE -Value '. $HOME\.config\torc_completions.ps1'

# Reload profile
. $PROFILE

Alternative: Append the script directly to your profile

# Generate and add directly to profile
torc completions powershell | Out-File -Append -FilePath $PROFILE

# Reload profile
. $PROFILE

Using Completions

Once installed, use Tab to trigger completions:

Command Completion

# Complete subcommands
torc <TAB>
# Shows: workflows, jobs, files, events, run, submit, tui, ...

torc work<TAB>
# Completes to: torc workflows

torc workflows <TAB>
# Shows: create, list, get, delete, submit, run, ...

Flag Completion

# Complete flags
torc --<TAB>
# Shows: --url, --username, --password, --format, --log-level, --help

torc workflows list --<TAB>
# Shows available flags for the list command

# Complete flag values (where applicable)
torc workflows list --format <TAB>
# Shows: table, json

Workflow ID Completion

# Some shells support dynamic completion
torc workflows get <TAB>
# May show available workflow IDs

Examples

Here are some common completion patterns:

# Discover available commands
torc <TAB><TAB>

# Complete command names
torc w<TAB>          # workflows
torc wo<TAB>         # workflows
torc j<TAB>          # jobs

# Navigate subcommands
torc workflows <TAB>  # create, list, get, delete, ...
torc jobs <TAB>       # list, get, update, ...

# Complete flags
torc --u<TAB>         # --url, --username
torc --url <type-url>
torc --format <TAB>   # table, json

# Complex commands
torc workflows create --<TAB>
# Shows all available flags for the create command

Updating Completions

When you update Torc to a new version, regenerate the completion scripts to get the latest commands and flags:

# Bash
torc completions bash > ~/.local/share/bash-completion/completions/torc
source ~/.local/share/bash-completion/completions/torc

# Zsh
torc completions zsh > ~/.zfunc/_torc
rm -f ~/.zcompdump && exec zsh

# Fish
torc completions fish > ~/.config/fish/completions/torc.fish
# Fish reloads automatically

# PowerShell
torc completions powershell > $HOME\.config\torc_completions.ps1
. $PROFILE

Automation

You can automate completion installation in your dotfiles or setup scripts:

Bash Setup Script

#!/bin/bash
# install-torc-completions.sh

COMPLETION_DIR="$HOME/.local/share/bash-completion/completions"
mkdir -p "$COMPLETION_DIR"

if command -v torc &> /dev/null; then
    torc completions bash > "$COMPLETION_DIR/torc"
    echo "Torc completions installed for Bash"
    echo "Run: source $COMPLETION_DIR/torc"
else
    echo "Error: torc command not found"
    exit 1
fi

Zsh Setup Script

#!/bin/zsh
# install-torc-completions.zsh

COMPLETION_DIR="$HOME/.zfunc"
mkdir -p "$COMPLETION_DIR"

if command -v torc &> /dev/null; then
    torc completions zsh > "$COMPLETION_DIR/_torc"

    # Add fpath to .zshrc if not already present
    if ! grep -q "fpath=(.*\.zfunc" ~/.zshrc; then
        echo 'fpath=(~/.zfunc $fpath)' >> ~/.zshrc
        echo 'autoload -Uz compinit && compinit' >> ~/.zshrc
    fi

    echo "Torc completions installed for Zsh"
    echo "Run: exec zsh"
else
    echo "Error: torc command not found"
    exit 1
fi

Post-Installation Check

#!/bin/bash
# verify-completions.sh

# Test if completions are working
if complete -p torc &> /dev/null; then
    echo "✓ Torc completions are installed"
else
    echo "✗ Torc completions are not installed"
    echo "Run: torc completions bash > ~/.local/share/bash-completion/completions/torc"
fi

Troubleshooting

Completions Not Working

Problem: Tab completion doesn’t show torc commands.

Solutions:

  1. Verify torc is in your PATH:

    which torc
    # Should show path to torc binary
    
  2. Check if completion file exists:

    # Bash
    ls -l ~/.local/share/bash-completion/completions/torc
    
    # Zsh
    ls -l ~/.zfunc/_torc
    
    # Fish
    ls -l ~/.config/fish/completions/torc.fish
    
  3. Verify completion is loaded:

    # Bash
    complete -p torc
    
    # Zsh
    which _torc
    
  4. Reload shell or source completion file:

    # Bash
    source ~/.local/share/bash-completion/completions/torc
    
    # Zsh
    exec zsh
    
    # Fish (automatic)
    

Outdated Completions

Problem: New commands or flags don’t show in completions.

Solution: Regenerate the completion file after updating Torc:

# Bash
torc completions bash > ~/.local/share/bash-completion/completions/torc
source ~/.local/share/bash-completion/completions/torc

# Zsh
torc completions zsh > ~/.zfunc/_torc
rm ~/.zcompdump && exec zsh

# Fish
torc completions fish > ~/.config/fish/completions/torc.fish

Permission Denied

Problem: Cannot write to system completion directory.

Solution: Use user-level completion directory or sudo:

# Use user directory (recommended)
mkdir -p ~/.local/share/bash-completion/completions
torc completions bash > ~/.local/share/bash-completion/completions/torc

# Or install system-wide (use tee so the file is written with root privileges;
# a plain "sudo ... >" redirect would run as your user and fail)
torc completions bash | sudo tee /etc/bash_completion.d/torc > /dev/null

Zsh “command not found: compdef”

Problem: Zsh completion system not initialized.

Solution: Add to your ~/.zshrc:

autoload -Uz compinit && compinit

PowerShell Execution Policy

Problem: Cannot run completion script due to execution policy.

Solution: Adjust execution policy:

# Check current policy
Get-ExecutionPolicy

# Set policy to allow local scripts
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Shell-Specific Features

Bash

  • Case-insensitive completion (if configured in .inputrc)
  • Partial matching support
  • Menu completion available
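
For example, case-insensitive completion is a standard readline setting (not Torc-specific) that you can enable once for all Bash completions:

# Make readline completion case-insensitive (applies to torc and every other command)
echo 'set completion-ignore-case on' >> ~/.inputrc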

Zsh

  • Advanced completion with descriptions
  • Correction suggestions
  • Menu selection
  • Color support for completions

Fish

  • Rich descriptions for each option
  • Real-time syntax highlighting
  • Automatic paging for long completion lists
  • Fuzzy matching support

PowerShell

  • IntelliSense-style completions
  • Parameter descriptions
  • Type-aware completions

Best Practices

  1. Keep completions updated: Regenerate after each Torc update
  2. Use version control: Include completion installation in dotfiles
  3. Automate installation: Add to setup scripts for new machines
  4. Test after updates: Verify completions work after shell or Torc updates
  5. Document in team wikis: Help teammates set up completions

Additional Resources

Server Deployment

This guide covers deploying and operating the Torc server in production environments, including logging configuration, daemonization, and service management.

Server Subcommands

The torc-server binary has two main subcommands:

torc-server run

Use torc-server run for:

  • HPC login nodes - Run the server in a tmux session while your jobs are running.
  • Development and testing - Run the server interactively in a terminal
  • Manual startup - When you want to control when the server starts and stops
  • Custom deployment - Integration with external process managers (e.g., supervisord, custom scripts)
  • Debugging - Running with verbose logging to troubleshoot issues

# Basic usage
torc-server run

# With options
torc-server run --port 8080 --database ./torc.db --log-level debug
torc-server run --completion-check-interval-secs 5

torc-server service

Use torc-server service for:

  • Production deployment - Install as a system service that starts on boot
  • Reliability - Automatic restart on failure
  • Managed lifecycle - Standard start/stop/status commands
  • Platform integration - Uses systemd (Linux), launchd (macOS), or Windows Services

# Install and start as a user service
torc-server service install --user
torc-server service start --user

# Or as a system service (requires root)
sudo torc-server service install
sudo torc-server service start

Which to choose?

  • For HPC login nodes/development/testing: Use torc-server run
  • For production servers/standalone computers: Use torc-server service install

Quick Start

User Service (Development)

For development, install as a user service (no root required):

# Install with automatic defaults (logs to ~/.torc/logs, db at ~/.torc/torc.db)
torc-server service install --user

# Start the service
torc-server service start --user

System Service (Production)

For production deployment, install as a system service:

# Install with automatic defaults (logs to /var/log/torc, db at /var/lib/torc/torc.db)
sudo torc-server service install

# Start the service
sudo torc-server service start

The service will automatically start on boot and restart on failure. Logs are automatically configured to rotate when they reach 10 MiB (keeping 5 files max). See the Service Management section for customization options.

Logging System

Torc-server uses the tracing ecosystem for structured, high-performance logging with automatic size-based file rotation.

Console Logging (Default)

By default, logs are written to stdout/stderr only:

torc-server run --log-level info

File Logging with Size-Based Rotation

Enable file logging by specifying a log directory:

torc-server run --log-dir /var/log/torc

This will:

  • Write logs to both console and file
  • Automatically rotate when log file reaches 10 MiB
  • Keep up to 5 rotated log files (torc-server.log, torc-server.log.1, …, torc-server.log.5)
  • Oldest files are automatically deleted when limit is exceeded
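
After the server has been running for a while, the log directory should contain the current file plus the rotated files described above (the listing below is illustrative):

ls -lh /var/log/torc
# torc-server.log      <- current log file
# torc-server.log.1    <- most recent rotated file
# ...
# torc-server.log.5    <- oldest file kept before deletion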

JSON Format Logs

For structured log aggregation (e.g., ELK stack, Splunk):

torc-server run --log-dir /var/log/torc --json-logs

This writes JSON-formatted logs to the file while keeping human-readable logs on console.
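
For example, JSON logs can be filtered with standard tools such as jq. The field names below (timestamp, level, target, fields.message) are the layout typically emitted by the tracing JSON formatter; inspect one line of your own log first to confirm them:

# Print ERROR-level entries from the JSON log (field names are assumptions - adjust to your output)
jq -r 'select(.level == "ERROR") | "\(.timestamp) \(.target): \(.fields.message)"' /var/log/torc/torc-server.log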

Log Levels

Control verbosity with the --log-level flag or RUST_LOG environment variable:

# Available levels: error, warn, info, debug, trace
torc-server run --log-level debug --log-dir /var/log/torc

# Or using environment variable
RUST_LOG=debug torc-server run --log-dir /var/log/torc

Environment Variables

  • TORC_LOG_DIR: Default log directory
  • RUST_LOG: Default log level

Example:

export TORC_LOG_DIR=/var/log/torc
export RUST_LOG=info
torc-server run

Daemonization (Unix/Linux Only)

Run torc-server as a background daemon:

torc-server run --daemon --log-dir /var/log/torc

Important:

  • Daemonization is only available on Unix/Linux systems
  • When running as daemon, you must use --log-dir since console output is lost
  • The daemon creates a PID file (default: /var/run/torc-server.pid)

Custom PID File Location

torc-server run --daemon --pid-file /var/run/torc/server.pid --log-dir /var/log/torc

Stopping a Daemon

# Find the PID
cat /var/run/torc-server.pid

# Kill the process
kill $(cat /var/run/torc-server.pid)

# Or forcefully
kill -9 $(cat /var/run/torc-server.pid)
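
A slightly safer variant is to send the normal termination signal first and only escalate if the process is still alive after a grace period (a sketch):

#!/bin/bash
# stop-torc-daemon.sh - graceful stop with a forced fallback
PID=$(cat /var/run/torc-server.pid)

kill "$PID" 2>/dev/null                    # request shutdown
for _ in $(seq 1 10); do
    kill -0 "$PID" 2>/dev/null || exit 0   # process is gone - done
    sleep 1
done
kill -9 "$PID"                             # still running after 10 s - force it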

Complete Example: Production Deployment

#!/bin/bash
# Production deployment script

# Create required directories
sudo mkdir -p /var/log/torc
sudo mkdir -p /var/run/torc
sudo mkdir -p /var/lib/torc

# Set permissions (adjust as needed)
sudo chown -R torc:torc /var/log/torc
sudo chown -R torc:torc /var/run/torc
sudo chown -R torc:torc /var/lib/torc

# Start server as daemon
torc-server run \
    --daemon \
    --log-dir /var/log/torc \
    --log-level info \
    --json-logs \
    --pid-file /var/run/torc/server.pid \
    --database /var/lib/torc/torc.db \
    --url 0.0.0.0 \
    --port 8080 \
    --threads 8 \
    --auth-file /etc/torc/htpasswd \
    --require-auth
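
After the script finishes, a quick sanity check (using the PID file and log directory configured above) confirms the daemon actually started:

# Verify the daemon process is alive
kill -0 "$(cat /var/run/torc/server.pid)" && echo "torc-server is running"

# Follow the log it was configured to write (JSON-formatted because of --json-logs)
tail -f /var/log/torc/torc-server.log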

Automatic Installation

The easiest way to install torc-server as a service is using the built-in service management commands.

User Service (No Root Required)

Install as a user service that runs under your user account (recommended for development):

# Install with defaults (logs to ~/.torc/logs, database at ~/.torc/torc.db)
torc-server service install --user

# Or customize the configuration
torc-server service install --user \
    --log-dir ~/custom/logs \
    --database ~/custom/torc.db \
    --url 0.0.0.0 \
    --port 8080 \
    --threads 4

# Start the user service
torc-server service start --user

# Check status
torc-server service status --user

# Stop the service
torc-server service stop --user

# Uninstall the service
torc-server service uninstall --user

User Service Defaults:

  • Log directory: ~/.torc/logs
  • Database: ~/.torc/torc.db
  • Listen address: 0.0.0.0:8080
  • Worker threads: 4

System Service (Requires Root)

Install as a system-wide service (recommended for production):

# Install with defaults
sudo torc-server service install

# Or customize the configuration
sudo torc-server service install \
    --log-dir /var/log/torc \
    --database /var/lib/torc/torc.db \
    --url 0.0.0.0 \
    --port 8080 \
    --threads 8 \
    --auth-file /etc/torc/htpasswd \
    --require-auth \
    --json-logs

# Start the system service
sudo torc-server service start

# Check status
torc-server service status

# Stop the service
sudo torc-server service stop

# Uninstall the service
sudo torc-server service uninstall

System Service Defaults:

  • Log directory: /var/log/torc
  • Database: /var/lib/torc/torc.db
  • Listen address: 0.0.0.0:8080
  • Worker threads: 4

This automatically creates the appropriate service configuration for your platform:

  • Linux: systemd service (user: ~/.config/systemd/user/, system: /etc/systemd/system/)
  • macOS: launchd service (user: ~/Library/LaunchAgents/, system: /Library/LaunchDaemons/)
  • Windows: Windows Service

Manual Systemd Service (Linux)

Alternatively, you can manually create a systemd service:

# /etc/systemd/system/torc-server.service
[Unit]
Description=Torc Workflow Orchestration Server
After=network.target

[Service]
Type=simple
User=torc
Group=torc
WorkingDirectory=/var/lib/torc
Environment="RUST_LOG=info"
Environment="TORC_LOG_DIR=/var/log/torc"
ExecStart=/usr/local/bin/torc-server run \
    --log-dir /var/log/torc \
    --json-logs \
    --database /var/lib/torc/torc.db \
    --url 0.0.0.0 \
    --port 8080 \
    --threads 8 \
    --auth-file /etc/torc/htpasswd \
    --require-auth
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Then:

sudo systemctl daemon-reload
sudo systemctl enable torc-server
sudo systemctl start torc-server
sudo systemctl status torc-server

# View logs
journalctl -u torc-server -f

Log Rotation Strategy

The server uses automatic size-based rotation with the following defaults:

  • Max file size: 10 MiB per file
  • Max files: 5 rotated files (plus the current log file)
  • Total disk usage: Maximum of ~50 MiB for all log files

When the current log file reaches 10 MiB, it is automatically rotated:

  1. torc-server.log → torc-server.log.1
  2. torc-server.log.1 → torc-server.log.2
  3. And so on…
  4. Oldest file (torc-server.log.5) is deleted

This ensures predictable disk usage without external tools like logrotate.

Timing Instrumentation

For advanced performance monitoring, enable timing instrumentation:

TORC_TIMING_ENABLED=true torc-server run --log-dir /var/log/torc

This adds detailed timing information for all instrumented functions. Note that timing instrumentation works with both console and file logging.

Troubleshooting

Daemon won’t start

  1. Check permissions on log directory:

    ls -la /var/log/torc
    
  2. Check if PID file directory exists:

    ls -la /var/run/
    
  3. Try running in foreground first:

    torc-server run --log-dir /var/log/torc
    

No log files created

  1. Verify --log-dir is specified
  2. Check directory permissions
  3. Check disk space: df -h

Logs not rotating

Log rotation happens automatically when a log file reaches 10 MiB. If you need to verify rotation is working:

  1. Check the log directory for numbered files (e.g., torc-server.log.1)
  2. Monitor disk usage - it should never exceed ~50 MiB for all log files
  3. For testing, you can generate large amounts of logs with --log-level trace
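
A couple of quick checks (sizes and file count, per the defaults above) can confirm rotation is keeping disk usage bounded:

# Total size of the log directory - should stay around the documented cap
du -sh /var/log/torc

# Watch rotated files appear as the current file passes 10 MiB
watch -n 60 'ls -lh /var/log/torc/torc-server.log*'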

Reference

This section provides information-oriented technical descriptions of Torc’s APIs, configuration, and data formats. Use this section to look up specific details about Torc’s interfaces.

Topics covered:

CLI Reference

This documentation is automatically generated from the CLI help text.

To regenerate, run:

cargo run --bin generate-cli-docs --features "client,tui,plot_resources"

Command-Line Help for torc

This document contains the help content for the torc command-line program.

Command Overview:

torc

Torc workflow orchestration system

Usage: torc [OPTIONS] <COMMAND>

Subcommands:
  • run — Run a workflow locally (create from spec file or run existing workflow by ID)
  • submit — Submit a workflow to scheduler (create from spec file or submit existing workflow by ID)
  • workflows — Workflow management commands
  • compute-nodes — Compute node management commands
  • files — File management commands
  • jobs — Job management commands
  • job-dependencies — Job dependency and relationship queries
  • resource-requirements — Resource requirements management commands
  • events — Event management commands
  • results — Result management commands
  • user-data — User data management commands
  • slurm — Slurm scheduler commands
  • reports — Generate reports and analytics
  • tui — Interactive terminal UI for managing workflows
  • plot-resources — Generate interactive HTML plots from resource monitoring data
  • completions — Generate shell completions
Options:
  • --log-level <LOG_LEVEL> — Log level (error, warn, info, debug, trace)

  • -f, --format <FORMAT> — Output format (table or json)

    Default value: table

  • --url <URL> — URL of torc server

  • --username <USERNAME> — Username for basic authentication

  • --password <PASSWORD> — Password for basic authentication (will prompt if username provided but password not)

torc run

Run a workflow locally (create from spec file or run existing workflow by ID)

Usage: torc run [OPTIONS] <WORKFLOW_SPEC_OR_ID>

Arguments:
  • <WORKFLOW_SPEC_OR_ID> — Path to workflow spec file (JSON/JSON5/YAML) or workflow ID
Options:
  • --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs to run concurrently
  • --num-cpus <NUM_CPUS> — Number of CPUs available
  • --memory-gb <MEMORY_GB> — Memory in GB
  • --num-gpus <NUM_GPUS> — Number of GPUs available
  • -p, --poll-interval <POLL_INTERVAL> — Job completion poll interval in seconds
  • -o, --output-dir <OUTPUT_DIR> — Output directory for jobs

torc submit

Submit a workflow to scheduler (create from spec file or submit existing workflow by ID)

Requires workflow to have an on_workflow_start action with schedule_nodes

Usage: torc submit [OPTIONS] <WORKFLOW_SPEC_OR_ID>

Arguments:
  • <WORKFLOW_SPEC_OR_ID> — Path to workflow spec file (JSON/JSON5/YAML) or workflow ID
Options:
  • -i, --ignore-missing-data — Ignore missing data (defaults to false)

    Default value: false

torc workflows

Workflow management commands

Usage: torc workflows <COMMAND>

Subcommands:
  • create — Create a workflow from a specification file (supports JSON, JSON5, and YAML formats)
  • new — Create a new empty workflow
  • list — List workflows
  • get — Get a specific workflow by ID
  • update — Update an existing workflow
  • cancel — Cancel a workflow and all associated Slurm jobs
  • delete — Delete one or more workflows
  • archive — Archive or unarchive one or more workflows
  • submit — Submit a workflow: initialize if needed and schedule nodes for on_workflow_start actions. This command requires the workflow to have an on_workflow_start action with schedule_nodes
  • run — Run a workflow locally on the current node
  • initialize — Initialize a workflow, including all job statuses
  • reinitialize — Reinitialize a workflow. This will reinitialize all jobs with a status of canceled, submitting, pending, or terminated. Jobs with a status of done will also be reinitialized if an input_file or user_data record has changed
  • status — Get workflow status
  • reset-status — Reset workflow and job status
  • execution-plan — Show the execution plan for a workflow specification or existing workflow

torc workflows create

Create a workflow from a specification file (supports JSON, JSON5, and YAML formats)

Usage: torc workflows create [OPTIONS] --user <USER> <FILE>

Arguments:
  • <FILE> — Path to specification file containing WorkflowSpec

    Supported formats: JSON (.json) - standard JSON format; JSON5 (.json5) - JSON with comments and trailing commas; YAML (.yaml, .yml) - human-readable YAML format

    Format is auto-detected from file extension, with fallback parsing attempted

Options:
  • -u, --user <USER> — User that owns the workflow (defaults to USER environment variable)

  • --no-resource-monitoring — Disable resource monitoring (default: enabled with summary granularity and 5s sample rate)

    Default value: false

torc workflows new

Create a new empty workflow

Usage: torc workflows new [OPTIONS] --name <NAME> --user <USER>

Options:
  • -n, --name <NAME> — Name of the workflow
  • -d, --description <DESCRIPTION> — Description of the workflow
  • -u, --user <USER> — User that owns the workflow (defaults to USER environment variable)

torc workflows list

List workflows

Usage: torc workflows list [OPTIONS]

Options:
  • -u, --user <USER> — User to filter by (defaults to USER environment variable)

  • --all-users — List workflows for all users (overrides --user)

  • -l, --limit <LIMIT> — Maximum number of workflows to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

  • --sort-by <SORT_BY> — Field to sort by

  • --reverse-sort — Reverse sort order

  • --archived-only — Show only archived workflows

    Default value: false

  • --include-archived — Include both archived and non-archived workflows

    Default value: false

torc workflows get

Get a specific workflow by ID

Usage: torc workflows get [OPTIONS] [ID]

Arguments:
  • <ID> — ID of the workflow to get (optional - will prompt if not provided)
Options:
  • -u, --user <USER> — User to filter by (defaults to USER environment variable)

torc workflows update

Update an existing workflow

Usage: torc workflows update [OPTIONS] [ID]

Arguments:
  • <ID> — ID of the workflow to update (optional - will prompt if not provided)
Options:
  • -n, --name <NAME> — Name of the workflow
  • -d, --description <DESCRIPTION> — Description of the workflow
  • --owner-user <OWNER_USER> — User that owns the workflow

torc workflows cancel

Cancel a workflow and all associated Slurm jobs

Usage: torc workflows cancel [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow to cancel (optional - will prompt if not provided)

torc workflows delete

Delete one or more workflows

Usage: torc workflows delete [OPTIONS] [IDS]...

Arguments:
  • <IDS> — IDs of workflows to remove (optional - will prompt if not provided)
Options:
  • --no-prompts — Skip confirmation prompt
  • --force — Force deletion even if workflow belongs to a different user

torc workflows archive

Archive or unarchive one or more workflows

Usage: torc workflows archive <IS_ARCHIVED> [WORKFLOW_IDS]...

Arguments:
  • <IS_ARCHIVED> — Set to true to archive, false to unarchive
  • <WORKFLOW_IDS> — IDs of workflows to archive/unarchive (if empty, will prompt for selection)

torc workflows submit

Submit a workflow: initialize if needed and schedule nodes for on_workflow_start actions. This command requires the workflow to have an on_workflow_start action with schedule_nodes

Usage: torc workflows submit [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow to submit (optional - will prompt if not provided)
Options:
  • --force — If false, fail the operation if missing data is present (defaults to false)

    Default value: false

torc workflows run

Run a workflow locally on the current node

Usage: torc workflows run [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow to run (optional - will prompt if not provided)
Options:
  • -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds for checking job completion

    Default value: 5.0

  • --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs to run (defaults to available CPUs)

  • --output-dir <OUTPUT_DIR> — Output directory for job logs and results

    Default value: output

torc workflows initialize

Initialize a workflow, including all job statuses

Usage: torc workflows initialize [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow to start (optional - will prompt if not provided)
Options:
  • --force — If false, fail the operation if missing data is present (defaults to false)

    Default value: false

  • --no-prompts — Skip confirmation prompt

  • --dry-run — Perform a dry run without making changes

torc workflows reinitialize

Reinitialize a workflow. This will reinitialize all jobs with a status of canceled, submitting, pending, or terminated. Jobs with a status of done will also be reinitialized if an input_file or user_data record has changed

Usage: torc workflows reinitialize [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow to reinitialize (optional - will prompt if not provided)
Options:
  • --force — If false, fail the operation if missing data is present (defaults to false)

    Default value: false

  • --dry-run — Perform a dry run without making changes

torc workflows status

Get workflow status

Usage: torc workflows status [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow to get status for (optional - will prompt if not provided)
Options:
  • -u, --user <USER> — User to filter by (defaults to USER environment variable)

torc workflows reset-status

Reset workflow and job status

Usage: torc workflows reset-status [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow to reset status for (optional - will prompt if not provided)
Options:
  • --failed-only — Only reset failed jobs

    Default value: false

  • -r, --restart — Restart the workflow after resetting status

    Default value: false

  • --force — Force reset even if there are active jobs (ignores running/pending jobs check)

    Default value: false

  • --no-prompts — Skip confirmation prompt

torc workflows execution-plan

Show the execution plan for a workflow specification or existing workflow

Usage: torc workflows execution-plan <SPEC_OR_ID>

Arguments:
  • <SPEC_OR_ID> — Path to specification file OR workflow ID

torc compute-nodes

Compute node management commands

Usage: torc compute-nodes <COMMAND>

Subcommands:
  • get — Get a specific compute node by ID
  • list — List compute nodes for a workflow

torc compute-nodes get

Get a specific compute node by ID

Usage: torc compute-nodes get <ID>

Arguments:
  • <ID> — ID of the compute node

torc compute-nodes list

List compute nodes for a workflow

Usage: torc compute-nodes list [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — List compute nodes for this workflow (optional - will prompt if not provided)
Options:
  • -l, --limit <LIMIT> — Maximum number of compute nodes to return

    Default value: 10000

  • -o, --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

  • -s, --sort-by <SORT_BY> — Field to sort by

  • -r, --reverse-sort — Reverse sort order

    Default value: false

torc files

File management commands

Usage: torc files <COMMAND>

Subcommands:
  • create — Create a new file
  • list — List files
  • get — Get a specific file by ID
  • update — Update an existing file
  • delete — Delete a file
  • list-required-existing — List required existing files for a workflow

torc files create

Create a new file

Usage: torc files create --name <NAME> --path <PATH> [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Create the file in this workflow
Options:
  • -n, --name <NAME> — Name of the file
  • -p, --path <PATH> — Path of the file

torc files list

List files

Usage: torc files list [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — List files for this workflow (optional - will prompt if not provided)
Options:
  • --produced-by-job-id <PRODUCED_BY_JOB_ID> — Filter by job ID that produced the files

  • -l, --limit <LIMIT> — Maximum number of files to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

  • --sort-by <SORT_BY> — Field to sort by

  • --reverse-sort — Reverse sort order

torc files get

Get a specific file by ID

Usage: torc files get <ID>

Arguments:
  • <ID> — ID of the file to get

torc files update

Update an existing file

Usage: torc files update [OPTIONS] <ID>

Arguments:
  • <ID> — ID of the file to update
Options:
  • -n, --name <NAME> — Name of the file
  • -p, --path <PATH> — Path of the file

torc files delete

Delete a file

Usage: torc files delete <ID>

Arguments:
  • <ID> — ID of the file to remove

torc files list-required-existing

List required existing files for a workflow

Usage: torc files list-required-existing [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — List required existing files for this workflow (optional - will prompt if not provided)

torc jobs

Job management commands

Usage: torc jobs <COMMAND>

Subcommands:
  • create — Create a new job
  • create-from-file — Create multiple jobs from a text file containing one command per line
  • list — List jobs
  • get — Get a specific job by ID
  • update — Update an existing job
  • delete — Delete one or more jobs
  • delete-all — Delete all jobs for a workflow
  • list-resource-requirements — List jobs with their resource requirements

torc jobs create

Create a new job

Usage: torc jobs create [OPTIONS] --name <NAME> --command <COMMAND> [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Create the job in this workflow
Options:
  • -n, --name <NAME> — Name of the job
  • -c, --command <COMMAND> — Command to execute
  • -r, --resource-requirements-id <RESOURCE_REQUIREMENTS_ID> — Resource requirements ID for this job
  • -b, --blocking-job-ids <BLOCKING_JOB_IDS> — Job IDs that block this job
  • -i, --input-file-ids <INPUT_FILE_IDS> — Input files needed by this job
  • -o, --output-file-ids <OUTPUT_FILE_IDS> — Output files produced by this job

torc jobs create-from-file

Create multiple jobs from a text file containing one command per line

This command reads a text file where each line contains a job command. Lines starting with ‘#’ are treated as comments and ignored. Empty lines are also ignored.

Jobs will be named sequentially as job1, job2, job3, etc., starting from the current job count + 1 to avoid naming conflicts.

All jobs created will share the same resource requirements, which are automatically created and assigned.

Example: torc jobs create-from-file 123 batch_jobs.txt --cpus-per-job 4 --memory-per-job 8g

Usage: torc jobs create-from-file [OPTIONS] <WORKFLOW_ID> <FILE>

Arguments:
  • <WORKFLOW_ID> — Workflow ID to create jobs for

  • <FILE> — Path to text file containing job commands (one per line)

    File format: one command per line; lines starting with # are comments (ignored); empty lines are ignored.

    Example file content:

        # Data processing jobs
        python process.py --batch 1
        python process.py --batch 2
        python process.py --batch 3

Options:
  • --cpus-per-job <CPUS_PER_JOB> — Number of CPUs per job

    Default value: 1

  • --memory-per-job <MEMORY_PER_JOB> — Memory per job (e.g., “1m”, “2g”, “16g”)

    Default value: 1m

  • --runtime-per-job <RUNTIME_PER_JOB> — Runtime per job (ISO 8601 duration format)

    Examples: P0DT1M = 1 minute, P0DT30M = 30 minutes, P0DT2H = 2 hours, P1DT0H = 1 day

    Default value: P0DT1M

torc jobs list

List jobs

Usage: torc jobs list [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — List jobs for this workflow (optional - will prompt if not provided)
Options:
  • -s, --status <STATUS> — Job status to filter by

  • --upstream-job-id <UPSTREAM_JOB_ID> — Filter by upstream job ID (jobs that depend on this job)

  • -l, --limit <LIMIT> — Maximum number of jobs to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

  • --sort-by <SORT_BY> — Field to sort by

  • --reverse-sort — Reverse sort order

  • --include-relationships — Include job relationships (depends_on_job_ids, input/output file/user_data IDs) - slower but more complete

torc jobs get

Get a specific job by ID

Usage: torc jobs get <ID>

Arguments:
  • <ID> — ID of the job to get

torc jobs update

Update an existing job

Usage: torc jobs update [OPTIONS] <ID>

Arguments:
  • <ID> — ID of the job to update
Options:
  • -n, --name <NAME> — Name of the job
  • -c, --command <COMMAND> — Command to execute

torc jobs delete

Delete one or more jobs

Usage: torc jobs delete [IDS]...

Arguments:
  • <IDS> — IDs of the jobs to remove

torc jobs delete-all

Delete all jobs for a workflow

Usage: torc jobs delete-all [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID to delete all jobs from (optional - will prompt if not provided)

torc jobs list-resource-requirements

List jobs with their resource requirements

Usage: torc jobs list-resource-requirements [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID to list jobs from (optional - will prompt if not provided)
Options:
  • -j, --job-id <JOB_ID> — Filter by specific job ID

torc job-dependencies

Job dependency and relationship queries

Usage: torc job-dependencies <COMMAND>

Subcommands:
  • job-job — List job-to-job dependencies for a workflow
  • job-file — List job-file relationships for a workflow
  • job-user-data — List job-user_data relationships for a workflow

torc job-dependencies job-job

List job-to-job dependencies for a workflow

Usage: torc job-dependencies job-job [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
  • -l, --limit <LIMIT> — Maximum number of dependencies to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

torc job-dependencies job-file

List job-file relationships for a workflow

Usage: torc job-dependencies job-file [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
  • -l, --limit <LIMIT> — Maximum number of relationships to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

torc job-dependencies job-user-data

List job-user_data relationships for a workflow

Usage: torc job-dependencies job-user-data [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — ID of the workflow (optional - will prompt if not provided)
Options:
  • -l, --limit <LIMIT> — Maximum number of relationships to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

torc resource-requirements

Resource requirements management commands

Usage: torc resource-requirements <COMMAND>

Subcommands:
  • create — Create new resource requirements
  • list — List resource requirements
  • get — Get a specific resource requirement by ID
  • update — Update existing resource requirements
  • delete — Delete resource requirements

torc resource-requirements create

Create new resource requirements

Usage: torc resource-requirements create [OPTIONS] --name <NAME> [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Create resource requirements in this workflow
Options:
  • -n, --name <NAME> — Name of the resource requirements

  • --num-cpus <NUM_CPUS> — Number of CPUs required

    Default value: 1

  • --num-gpus <NUM_GPUS> — Number of GPUs required

    Default value: 0

  • --num-nodes <NUM_NODES> — Number of nodes required

    Default value: 1

  • -m, --memory <MEMORY> — Amount of memory required (e.g., “20g”)

    Default value: 1m

  • -r, --runtime <RUNTIME> — Maximum runtime in ISO 8601 duration format (e.g., “P0DT1H”)

    Default value: P0DT1M

torc resource-requirements list

List resource requirements

Usage: torc resource-requirements list [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — List resource requirements for this workflow (optional - will prompt if not provided)
Options:
  • -l, --limit <LIMIT> — Maximum number of resource requirements to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

  • --sort-by <SORT_BY> — Field to sort by

  • --reverse-sort — Reverse sort order

torc resource-requirements get

Get a specific resource requirement by ID

Usage: torc resource-requirements get <ID>

Arguments:
  • <ID> — ID of the resource requirement to get

torc resource-requirements update

Update existing resource requirements

Usage: torc resource-requirements update [OPTIONS] <ID>

Arguments:
  • <ID> — ID of the resource requirement to update
Options:
  • -n, --name <NAME> — Name of the resource requirements
  • --num-cpus <NUM_CPUS> — Number of CPUs required
  • --num-gpus <NUM_GPUS> — Number of GPUs required
  • --num-nodes <NUM_NODES> — Number of nodes required
  • --memory <MEMORY> — Amount of memory required (e.g., “20g”)
  • --runtime <RUNTIME> — Maximum runtime (e.g., “1h”, “30m”)

torc resource-requirements delete

Delete resource requirements

Usage: torc resource-requirements delete <ID>

Arguments:
  • <ID> — ID of the resource requirement to remove

torc events

Event management commands

Usage: torc events <COMMAND>

Subcommands:
  • create — Create a new event
  • list — List events for a workflow
  • monitor — Monitor events for a workflow in real-time
  • get-latest-event — Get the latest event for a workflow
  • delete — Delete an event

torc events create

Create a new event

Usage: torc events create --data <DATA> [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Create the event in this workflow
Options:
  • -d, --data <DATA> — JSON data for the event

torc events list

List events for a workflow

Usage: torc events list [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — List events for this workflow (optional - will prompt if not provided)
Options:
  • -c, --category <CATEGORY> — Filter events by category

  • -l, --limit <LIMIT> — Maximum number of events to return

    Default value: 10000

  • -o, --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

  • -s, --sort-by <SORT_BY> — Field to sort by

  • -r, --reverse-sort — Reverse sort order

    Default value: false

torc events monitor

Monitor events for a workflow in real-time

Usage: torc events monitor [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Monitor events for this workflow (optional - will prompt if not provided)
Options:
  • -d, --duration <DURATION> — Duration to monitor in minutes (default: infinite)

  • -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds (default: 60)

    Default value: 60

  • -c, --category <CATEGORY> — Filter events by category

torc events get-latest-event

Get the latest event for a workflow

Usage: torc events get-latest-event [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Get the latest event for this workflow (optional - will prompt if not provided)

torc events delete

Delete an event

Usage: torc events delete <ID>

Arguments:
  • <ID> — ID of the event to remove

torc results

Result management commands

Usage: torc results <COMMAND>

Subcommands:
  • list — List results
  • get — Get a specific result by ID
  • delete — Delete a result

torc results list

List results

Usage: torc results list [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — List results for this workflow (optional - will prompt if not provided). By default, only lists results for the latest run of the workflow
Options:
  • -j, --job-id <JOB_ID> — List results for this job

  • -r, --run-id <RUN_ID> — List results for this run_id

  • --return-code <RETURN_CODE> — Filter by return code

  • --failed — Show only failed jobs (non-zero return code)

  • -s, --status <STATUS> — Filter by job status (uninitialized, blocked, canceled, terminated, done, ready, scheduled, running, pending, disabled)

  • -l, --limit <LIMIT> — Maximum number of results to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

  • --sort-by <SORT_BY> — Field to sort by

  • --reverse-sort — Reverse sort order

  • --all-runs — Show all historical results (default: false, only shows current results)

torc results get

Get a specific result by ID

Usage: torc results get <ID>

Arguments:
  • <ID> — ID of the result to get

torc results delete

Delete a result

Usage: torc results delete <ID>

Arguments:
  • <ID> — ID of the result to remove

torc user-data

User data management commands

Usage: torc user-data <COMMAND>

Subcommands:
  • create — Create a new user data record
  • list — List user data records
  • get — Get a specific user data record
  • update — Update a user data record
  • delete — Delete a user data record
  • delete-all — Delete all user data records for a workflow
  • list-missing — List missing user data for a workflow

torc user-data create

Create a new user data record

Usage: torc user-data create [OPTIONS] --name <NAME> [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID
Options:
  • -n, --name <NAME> — Name of the data object
  • -d, --data <DATA> — JSON data content
  • --ephemeral — Whether the data is ephemeral (cleared between runs)
  • --consumer-job-id <CONSUMER_JOB_ID> — Consumer job ID (optional)
  • --producer-job-id <PRODUCER_JOB_ID> — Producer job ID (optional)

torc user-data list

List user data records

Usage: torc user-data list [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID (if not provided, will be selected interactively)
Options:
  • -l, --limit <LIMIT> — Maximum number of records to return

    Default value: 50

  • -o, --offset <OFFSET> — Number of records to skip

    Default value: 0

  • --sort-by <SORT_BY> — Field to sort by

  • --reverse-sort — Reverse sort order

  • --name <NAME> — Filter by name

  • --is-ephemeral <IS_EPHEMERAL> — Filter by ephemeral status

    Possible values: true, false

  • --consumer-job-id <CONSUMER_JOB_ID> — Filter by consumer job ID

  • --producer-job-id <PRODUCER_JOB_ID> — Filter by producer job ID

torc user-data get

Get a specific user data record

Usage: torc user-data get <ID>

Arguments:
  • <ID> — User data record ID

torc user-data update

Update a user data record

Usage: torc user-data update [OPTIONS] <ID>

Arguments:
  • <ID> — User data record ID
Options:
  • -n, --name <NAME> — New name for the data object

  • -d, --data <DATA> — New JSON data content

  • --ephemeral <EPHEMERAL> — Update ephemeral status

    Possible values: true, false

torc user-data delete

Delete a user data record

Usage: torc user-data delete <ID>

Arguments:
  • <ID> — User data record ID

torc user-data delete-all

Delete all user data records for a workflow

Usage: torc user-data delete-all <WORKFLOW_ID>

Arguments:
  • <WORKFLOW_ID> — Workflow ID

torc user-data list-missing

List missing user data for a workflow

Usage: torc user-data list-missing <WORKFLOW_ID>

Arguments:
  • <WORKFLOW_ID> — Workflow ID

torc slurm

Slurm scheduler commands

Usage: torc slurm <COMMAND>

Subcommands:
  • create — Add a Slurm config to the database
  • update — Modify a Slurm config in the database
  • list — Show the current Slurm configs in the database
  • get — Get a specific Slurm config by ID
  • delete — Delete a Slurm config by ID
  • schedule-nodes — Schedule compute nodes using Slurm

torc slurm create

Add a Slurm config to the database

Usage: torc slurm create [OPTIONS] --name <NAME> --account <ACCOUNT> [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID
Options:
  • -n, --name <NAME> — Name of config

  • -a, --account <ACCOUNT> — HPC account

  • -g, --gres <GRES> — Request nodes that have at least this number of GPUs. Ex: ‘gpu:2’

  • -m, --mem <MEM> — Request nodes that have at least this amount of memory. Ex: ‘180G’

  • -N, --nodes <NODES> — Number of nodes to use for each job

    Default value: 1

  • -p, --partition <PARTITION> — HPC partition. Default is determined by the scheduler

  • -q, --qos <QOS> — Controls priority of the jobs

    Default value: normal

  • -t, --tmp <TMP> — Request nodes that have at least this amount of storage scratch space

  • -W, --walltime <WALLTIME> — Slurm job walltime

    Default value: 04:00:00

  • -e, --extra <EXTRA> — Add extra Slurm parameters, for example --extra='--reservation=my-reservation'

torc slurm update

Modify a Slurm config in the database

Usage: torc slurm update [OPTIONS] <SCHEDULER_ID>

Arguments:
  • <SCHEDULER_ID>
Options:
  • -N, --name <NAME> — Name of config
  • -a, --account <ACCOUNT> — HPC account
  • -g, --gres <GRES> — Request nodes that have at least this number of GPUs. Ex: ‘gpu:2’
  • -m, --mem <MEM> — Request nodes that have at least this amount of memory. Ex: ‘180G’
  • -n, --nodes <NODES> — Number of nodes to use for each job
  • -p, --partition <PARTITION> — HPC partition
  • -q, --qos <QOS> — Controls priority of the jobs
  • -t, --tmp <TMP> — Request nodes that have at least this amount of storage scratch space
  • --walltime <WALLTIME> — Slurm job walltime
  • -e, --extra <EXTRA> — Add extra Slurm parameters

torc slurm list

Show the current Slurm configs in the database

Usage: torc slurm list [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID
Options:
  • -l, --limit <LIMIT> — Maximum number of configs to return

    Default value: 10000

  • --offset <OFFSET> — Offset for pagination (0-based)

    Default value: 0

torc slurm get

Get a specific Slurm config by ID

Usage: torc slurm get <ID>

Arguments:
  • <ID> — ID of the Slurm config to get

torc slurm delete

Delete a Slurm config by ID

Usage: torc slurm delete <ID>

Arguments:
  • <ID> — ID of the Slurm config to delete

torc slurm schedule-nodes

Schedule compute nodes using Slurm

Usage: torc slurm schedule-nodes [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID
Options:
  • -j, --job-prefix <JOB_PREFIX> — Job prefix for the Slurm job names

    Default value: worker

  • --keep-submission-scripts — Keep submission scripts after job submission

    Default value: false

  • -m, --max-parallel-jobs <MAX_PARALLEL_JOBS> — Maximum number of parallel jobs

  • -n, --num-hpc-jobs <NUM_HPC_JOBS> — Number of HPC jobs to submit

    Default value: 1

  • -o, --output <OUTPUT> — Output directory for job output files

    Default value: output

  • -p, --poll-interval <POLL_INTERVAL> — Poll interval in seconds

    Default value: 60

  • --scheduler-config-id <SCHEDULER_CONFIG_ID> — Scheduler config ID

  • --start-one-worker-per-node — Start one worker per node

    Default value: false

torc reports

Generate reports and analytics

Usage: torc reports <COMMAND>

Subcommands:
  • check-resource-utilization — Check resource utilization and report jobs that exceeded their specified requirements
  • results — Generate a comprehensive JSON report of job results including all log file paths

torc reports check-resource-utilization

Check resource utilization and report jobs that exceeded their specified requirements

Usage: torc reports check-resource-utilization [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID to analyze (optional - will prompt if not provided)
Options:
  • -r, --run-id <RUN_ID> — Run ID to analyze (optional - analyzes latest run if not provided)
  • -a, --all — Show all jobs (default: only show jobs that exceeded requirements)

torc reports results

Generate a comprehensive JSON report of job results including all log file paths

Usage: torc reports results [OPTIONS] [WORKFLOW_ID]

Arguments:
  • <WORKFLOW_ID> — Workflow ID to analyze (optional - will prompt if not provided)
Options:
  • -o, --output-dir <OUTPUT_DIR> — Output directory (where job logs are stored, passed in torc run and torc submit)

    Default value: output

  • --all-runs — Include all runs for each job (default: only latest run)

torc tui

Interactive terminal UI for managing workflows

Usage: torc tui

torc plot-resources

Generate interactive HTML plots from resource monitoring data

Usage: torc plot-resources [OPTIONS] <DB_PATHS>...

Arguments:
  • <DB_PATHS> — Path to the resource metrics database file(s)
Options:
  • -o, --output-dir <OUTPUT_DIR> — Output directory for generated plots (default: current directory)

    Default value: .

  • -j, --job-ids <JOB_IDS> — Only plot specific job IDs (comma-separated)

  • -p, --prefix <PREFIX> — Prefix for output filenames

    Default value: resource_plot

  • -f, --format <FORMAT> — Output format: html or json

    Default value: html

torc completions

Generate shell completions

Usage: torc completions <SHELL>

Arguments:
  • <SHELL> — The shell to generate completions for

    Possible values: bash, elvish, fish, powershell, zsh


This document was generated automatically by clap-markdown.

Workflow Specification Formats

Torc supports three workflow specification formats: YAML, JSON5, and KDL. All formats provide the same functionality with different syntaxes to suit different preferences and use cases.

Format Overview

Feature                    YAML    JSON5   KDL
Parameter Expansion        ✓       ✓       ✗
Comments                   ✓       ✓       ✓
Trailing Commas            N/A     ✓       N/A
Human-Readable             ✓✓✓     ✓✓      ✓✓✓
Programmatic Generation    ✓       ✓✓✓     ✓
Industry Standard          ✓✓✓     ✓       ✓
Jobs, Files, Resources     ✓       ✓       ✓
User Data                  ✓       ✓       ✓
Workflow Actions           ✓       ✓       ✓
Resource Monitoring        ✓       ✓       ✓
Slurm Schedulers           ✓       ✓       ✓

YAML Format

Best for: Most workflows, especially those using parameter expansion.

File Extension: .yaml or .yml

Example:

name: data_processing_workflow
user: datauser
description: Multi-stage data processing pipeline

# File definitions
files:
  - name: raw_data
    path: /data/input/raw_data.csv
  - name: processed_data
    path: /data/output/processed_data.csv

# Resource requirements
resource_requirements:
  - name: small_job
    num_cpus: 2
    num_gpus: 0
    num_nodes: 1
    memory: 4g
    runtime: PT30M

# Jobs
jobs:
  - name: download_data
    command: wget https://example.com/data.csv -O ${files.output.raw_data}
    resource_requirements: small_job

  - name: process_data
    command: python process.py ${files.input.raw_data} -o ${files.output.processed_data}
    resource_requirements: small_job
    depends_on:
      - download_data

# Workflow actions
actions:
  - trigger_type: on_workflow_start
    action_type: run_commands
    commands:
      - mkdir -p /data/input /data/output
      - echo "Workflow started"

Advantages:

  • Most widely used configuration format
  • Excellent for complex workflows with many jobs
  • Full parameter expansion support
  • Clean, readable syntax without brackets

Disadvantages:

  • Indentation-sensitive
  • No trailing commas allowed
  • Can be verbose for deeply nested structures

JSON5 Format

Best for: Programmatic workflow generation and JSON compatibility.

File Extension: .json5

Example:

{
  name: "data_processing_workflow",
  user: "datauser",
  description: "Multi-stage data processing pipeline",

  // File definitions
  files: [
    {name: "raw_data", path: "/data/input/raw_data.csv"},
    {name: "processed_data", path: "/data/output/processed_data.csv"},
  ],

  // Resource requirements
  resource_requirements: [
    {
      name: "small_job",
      num_cpus: 2,
      num_gpus: 0,
      num_nodes: 1,
      memory: "4g",
      runtime: "PT30M",
    },
  ],

  // Jobs
  jobs: [
    {
      name: "download_data",
      command: "wget https://example.com/data.csv -O ${files.output.raw_data}",
      resource_requirements: "small_job",
    },
    {
      name: "process_data",
      command: "python process.py ${files.input.raw_data} -o ${files.output.processed_data}",
      resource_requirements: "small_job",
      depends_on: ["download_data"],
    },
  ],

  // Workflow actions
  actions: [
    {
      trigger_type: "on_workflow_start",
      action_type: "run_commands",
      commands: [
        "mkdir -p /data/input /data/output",
        "echo 'Workflow started'",
      ],
    },
  ],
}

Advantages:

  • JSON-compatible (easy programmatic manipulation)
  • Supports comments and trailing commas
  • Full parameter expansion support
  • Familiar to JavaScript/JSON users

Disadvantages:

  • More verbose than YAML
  • Requires quotes around all string values
  • More brackets and commas than YAML

KDL Format

Best for: Simple to moderate workflows with clean, modern syntax.

File Extension: .kdl

Example:

name "data_processing_workflow"
user "datauser"
description "Multi-stage data processing pipeline"

// File definitions
file "raw_data" path="/data/input/raw_data.csv"
file "processed_data" path="/data/output/processed_data.csv"

// Resource requirements
resource_requirements "small_job" {
    num_cpus 2
    num_gpus 0
    num_nodes 1
    memory "4g"
    runtime "PT30M"
}

// Jobs
job "download_data" {
    command "wget https://example.com/data.csv -O ${files.output.raw_data}"
    resource_requirements "small_job"
}

job "process_data" {
    command "python process.py ${files.input.raw_data} -o ${files.output.processed_data}"
    resource_requirements "small_job"
    depends_on_job "download_data"
}

// Workflow actions
action {
    trigger_type "on_workflow_start"
    action_type "run_commands"
    command "mkdir -p /data/input /data/output"
    command "echo 'Workflow started'"
}

Advantages:

  • Clean, minimal syntax
  • No indentation requirements
  • Modern configuration language
  • Supports all core Torc features

Disadvantages:

  • No parameter expansion support
  • Less familiar to most users
  • Boolean values use special syntax (#true, #false)

KDL-Specific Syntax Notes

  1. Boolean values: Use #true and #false (not true or false)

    resource_monitor {
        enabled #true
        generate_plots #false
    }
    
  2. Repeated child nodes: Use multiple statements

    action {
        command "echo 'First command'"
        command "echo 'Second command'"
    }
    
  3. User data: Requires child nodes for properties

    user_data "metadata" {
        is_ephemeral #true
        data "{\"key\": \"value\"}"
    }
    

Common Features Across All Formats

Variable Substitution

All formats support the same variable substitution syntax:

  • ${files.input.NAME} - Input file path
  • ${files.output.NAME} - Output file path
  • ${user_data.input.NAME} - Input user data
  • ${user_data.output.NAME} - Output user data

Supported Fields

All formats support:

  • Workflow metadata: name, user, description
  • Jobs: name, command, dependencies, resource requirements
  • Files: name, path, modification time
  • User data: name, data (JSON), ephemeral flag
  • Resource requirements: CPUs, GPUs, memory, runtime
  • Slurm schedulers: account, partition, walltime, etc.
  • Workflow actions: triggers, action types, commands
  • Resource monitoring: enabled, granularity, sampling interval

Parameter Expansion (YAML/JSON5 Only)

YAML and JSON5 support parameter expansion to generate many jobs from concise specifications:

jobs:
  - name: "process_{dataset_id}"
    command: "python process.py --id {dataset_id}"
    parameters:
      dataset_id: "1:100"  # Creates 100 jobs

KDL does not support parameter expansion. For parameterized workflows, use YAML or JSON5.

Examples Directory

The Torc repository includes comprehensive examples in all three formats:

examples/
├── yaml/     # All workflows (15 examples)
├── json/     # All workflows (15 examples)
└── kdl/      # Non-parameterized workflows (9 examples)

Compare the same workflow in different formats to choose your preference; see the examples directory for the complete collection.

Creating Workflows

All formats use the same command:

torc workflows create examples/yaml/sample_workflow.yaml
torc workflows create examples/json/sample_workflow.json5
torc workflows create examples/kdl/sample_workflow.kdl

Or use the quick execution commands:

# Create and run locally
torc run examples/yaml/sample_workflow.yaml

# Create and submit to scheduler
torc submit examples/yaml/workflow_actions_data_pipeline.yaml

Recommendations

Start with YAML if you’re unsure - it’s the most widely supported and includes full parameter expansion.

Switch to JSON5 if you need to programmatically generate workflows or prefer JSON syntax.

Try KDL if you prefer minimal syntax and don’t need parameter expansion.

All three formats are fully supported and maintained. Choose based on your workflow complexity and personal preference.

Job Parameterization

Parameterization allows creating multiple jobs/files from a single specification by expanding parameter ranges.

Parameter Formats

Integer Ranges

parameters:
  i: "1:10"        # Expands to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  i: "0:100:10"    # Expands to [0, 10, 20, 30, ..., 90, 100] (with step)

Float Ranges

parameters:
  lr: "0.0001:0.01:10"  # 10 values from 0.0001 to 0.01 (log scale)
  alpha: "0.0:1.0:0.1"  # [0.0, 0.1, 0.2, ..., 0.9, 1.0]

Lists (Integer)

parameters:
  batch_size: "[16,32,64,128]"

Lists (Float)

parameters:
  threshold: "[0.1,0.5,0.9]"

Lists (String)

parameters:
  optimizer: "['adam','sgd','rmsprop']"
  dataset: "['train','test','validation']"

Template Substitution

Use parameter values in job/file specifications with {param_name} syntax:

Basic Substitution

jobs:
  - name: job_{i}
    command: python train.py --run={i}
    parameters:
      i: "1:5"

Expands to:

jobs:
  - name: job_1
    command: python train.py --run=1
  - name: job_2
    command: python train.py --run=2
  # ... etc

Format Specifiers

Zero-padded integers:

jobs:
  - name: job_{i:03d}
    command: echo {i}
    parameters:
      i: "1:100"

Expands to: job_001, job_002, …, job_100

Float precision:

jobs:
  - name: train_lr{lr:.4f}
    command: python train.py --lr={lr}
    parameters:
      lr: "[0.0001,0.001,0.01]"

Expands to: train_lr0.0001, train_lr0.0010, train_lr0.0100

Multiple decimals:

files:
  - name: result_{threshold:.2f}
    path: /results/threshold_{threshold:.2f}.csv
    parameters:
      threshold: "0.1:1.0:0.1"

Expands to: result_0.10, result_0.20, …, result_1.00

Multi-Dimensional Parameterization

Use multiple parameters to create Cartesian products:

Example: Hyperparameter Sweep

jobs:
  - name: train_lr{lr:.4f}_bs{batch_size}
    command: |
      python train.py \
        --learning-rate={lr} \
        --batch-size={batch_size}
    parameters:
      lr: "[0.0001,0.001,0.01]"
      batch_size: "[16,32,64]"

This expands to 3 × 3 = 9 jobs:

  • train_lr0.0001_bs16
  • train_lr0.0001_bs32
  • train_lr0.0001_bs64
  • train_lr0.0010_bs16
  • … (9 total)

Example: Multi-Dataset Processing

jobs:
  - name: process_{dataset}_rep{rep:02d}
    command: python process.py --data={dataset} --replicate={rep}
    parameters:
      dataset: "['train','validation','test']"
      rep: "1:5"

This expands to 3 × 5 = 15 jobs.

Parameterized Dependencies

Parameters work in dependency specifications:

jobs:
  # Generate data for each configuration
  - name: generate_{config}
    command: python generate.py --config={config}
    output_files:
      - data_{config}
    parameters:
      config: "['A','B','C']"

  # Process each generated dataset
  - name: process_{config}
    command: python process.py --input=data_{config}.pkl
    input_files:
      - data_{config}
    depends_on:
      - generate_{config}
    parameters:
      config: "['A','B','C']"

This creates 6 jobs with proper dependencies:

  • generate_A → process_A
  • generate_B → process_B
  • generate_C → process_C

Parameterized Files and User Data

Files:

files:
  - name: model_{run_id:03d}
    path: /models/run_{run_id:03d}.pt
    parameters:
      run_id: "1:100"

User Data:

user_data:
  - name: config_{experiment}
    data:
      experiment: "{experiment}"
      learning_rate: 0.001
    parameters:
      experiment: "['baseline','ablation','full']"

Shared (Workflow-Level) Parameters

Define parameters once at the workflow level and reuse them across multiple jobs and files using use_parameters:

Basic Usage

name: hyperparameter_sweep
parameters:
  lr: "[0.0001,0.001,0.01]"
  batch_size: "[16,32,64]"
  optimizer: "['adam','sgd']"

jobs:
  # Training jobs - inherit parameters via use_parameters
  - name: train_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
    command: python train.py --lr={lr} --batch-size={batch_size} --optimizer={optimizer}
    use_parameters:
      - lr
      - batch_size
      - optimizer

  # Aggregate results - also uses shared parameters
  - name: aggregate_results
    command: python aggregate.py
    depends_on:
      - train_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
    use_parameters:
      - lr
      - batch_size
      - optimizer

files:
  - name: model_lr{lr:.4f}_bs{batch_size}_opt{optimizer}
    path: /models/model_lr{lr:.4f}_bs{batch_size}_opt{optimizer}.pt
    use_parameters:
      - lr
      - batch_size
      - optimizer

Benefits

  • DRY (Don’t Repeat Yourself) - Define parameter ranges once, use everywhere
  • Consistency - Ensures all jobs use the same parameter values
  • Maintainability - Change parameters in one place, affects all uses
  • Selective inheritance - Jobs can choose which parameters to use

Selective Parameter Inheritance

Jobs don’t have to use all workflow parameters:

parameters:
  lr: "[0.0001,0.001,0.01]"
  batch_size: "[16,32,64]"
  dataset: "['train','validation']"

jobs:
  # Only uses lr and batch_size (9 jobs)
  - name: train_lr{lr:.4f}_bs{batch_size}
    command: python train.py --lr={lr} --batch-size={batch_size}
    use_parameters:
      - lr
      - batch_size

  # Only uses dataset (2 jobs)
  - name: prepare_{dataset}
    command: python prepare.py --dataset={dataset}
    use_parameters:
      - dataset

Local Parameters Override Shared

Jobs can define local parameters that take precedence over workflow-level parameters:

parameters:
  lr: "[0.0001,0.001,0.01]"

jobs:
  # Uses workflow parameter (3 jobs)
  - name: train_lr{lr:.4f}
    command: python train.py --lr={lr}
    use_parameters:
      - lr

  # Uses local override (2 jobs instead of 3)
  - name: special_lr{lr:.4f}
    command: python special.py --lr={lr}
    parameters:
      lr: "[0.01,0.1]"  # Local override - ignores workflow's lr

KDL Syntax

parameters {
    lr "[0.0001,0.001,0.01]"
    batch_size "[16,32,64]"
}

job "train_lr{lr:.4f}_bs{batch_size}" {
    command "python train.py --lr={lr} --batch-size={batch_size}"
    use_parameters "lr" "batch_size"
}

JSON5 Syntax

{
  parameters: {
    lr: "[0.0001,0.001,0.01]",
    batch_size: "[16,32,64]"
  },
  jobs: [
    {
      name: "train_lr{lr:.4f}_bs{batch_size}",
      command: "python train.py --lr={lr} --batch-size={batch_size}",
      use_parameters: ["lr", "batch_size"]
    }
  ]
}

Parameter Modes

By default, when multiple parameters are specified, Torc generates the Cartesian product of all parameter values. You can change this behavior using parameter_mode.

Product Mode (Default)

The default mode generates all possible combinations:

jobs:
  - name: job_{a}_{b}
    command: echo {a} {b}
    parameters:
      a: "[1, 2, 3]"
      b: "['x', 'y', 'z']"
    # parameter_mode: product  # This is the default

This creates 3 × 3 = 9 jobs: job_1_x, job_1_y, job_1_z, job_2_x, etc.

Zip Mode

Use parameter_mode: zip to pair parameters element-wise (like Python’s zip() function). All parameter lists must have the same length.

jobs:
  - name: train_{dataset}_{model}
    command: python train.py --dataset={dataset} --model={model}
    parameters:
      dataset: "['cifar10', 'mnist', 'imagenet']"
      model: "['resnet', 'cnn', 'transformer']"
    parameter_mode: zip

This creates 3 jobs (not 9):

  • train_cifar10_resnet
  • train_mnist_cnn
  • train_imagenet_transformer

When to use zip mode:

  • Pre-determined parameter pairings (dataset A always uses model X)
  • Corresponding input/output file pairs
  • Parallel arrays where position matters

Error handling: If parameter lists have different lengths in zip mode, Torc will return an error:

All parameters must have the same number of values when using 'zip' mode.
Parameter 'dataset' has 3 values, but 'model' has 2 values.

KDL Syntax

job "train_{dataset}_{model}" {
    command "python train.py --dataset={dataset} --model={model}"
    parameters {
        dataset "['cifar10', 'mnist', 'imagenet']"
        model "['resnet', 'cnn', 'transformer']"
    }
    parameter_mode "zip"
}

JSON5 Syntax

{
  name: "train_{dataset}_{model}",
  command: "python train.py --dataset={dataset} --model={model}",
  parameters: {
    dataset: "['cifar10', 'mnist', 'imagenet']",
    model: "['resnet', 'cnn', 'transformer']"
  },
  parameter_mode: "zip"
}

Best Practices

  1. Use descriptive parameter names - lr not x, batch_size not b
  2. Format numbers consistently - Use :03d for run IDs, :.4f for learning rates
  3. Keep parameter counts reasonable - 3×3×3 = 27 jobs is manageable, 10×10×10 = 1000 may overwhelm the system
  4. Match parameter ranges across related jobs - Use same parameter values for generator and consumer jobs
  5. Consider parameter dependencies - Some parameter combinations may be invalid
  6. Prefer shared parameters for multi-job workflows - Use use_parameters to avoid repeating definitions
  7. Use selective inheritance - Only inherit the parameters each job actually needs
  8. Use zip mode for paired parameters - When parameters have a 1:1 correspondence, use parameter_mode: zip

Resource Requirements Reference

Technical reference for job resource specifications and allocation strategies.

Resource Requirements Fields

| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Identifier to reference from jobs |
| num_cpus | integer | No | Number of CPU cores |
| num_gpus | integer | No | Number of GPUs |
| num_nodes | integer | No | Number of compute nodes |
| memory | string | No | Memory allocation (see format below) |
| runtime | string | No | Maximum runtime (ISO 8601 duration) |

Example

resource_requirements:
  - name: small
    num_cpus: 2
    num_gpus: 0
    num_nodes: 1
    memory: 4g
    runtime: PT30M

  - name: large
    num_cpus: 16
    num_gpus: 2
    num_nodes: 1
    memory: 128g
    runtime: PT8H

Memory Format

String format with unit suffix:

| Suffix | Unit | Example |
|---|---|---|
| k | Kilobytes | 512k |
| m | Megabytes | 512m |
| g | Gigabytes | 16g |

Examples:

memory: 512m    # 512 MB
memory: 1g      # 1 GB
memory: 16g     # 16 GB

Runtime Format

ISO 8601 duration format:

| Format | Description | Example |
|---|---|---|
| PTnM | Minutes | PT30M (30 minutes) |
| PTnH | Hours | PT2H (2 hours) |
| PnD | Days | P1D (1 day) |
| PnDTnH | Days and hours | P1DT12H (1.5 days) |

Examples:

runtime: PT10M      # 10 minutes
runtime: PT4H       # 4 hours
runtime: P1D        # 1 day
runtime: P1DT12H    # 1 day, 12 hours

Job Allocation Strategies

Resource-Based Allocation (Default)

The server considers each job’s resource requirements and only returns jobs that fit within available compute node resources.

Behavior:

  • Considers CPU, memory, and GPU requirements
  • Prevents resource over-subscription
  • Enables efficient packing of heterogeneous workloads

Configuration: Run without --max-parallel-jobs:

torc run $WORKFLOW_ID

Queue-Based Allocation

The server returns the next N ready jobs regardless of resource requirements.

Behavior:

  • Ignores job resource requirements
  • Only limits concurrent job count
  • Simpler and faster (no resource calculation)

Configuration: Run with --max-parallel-jobs:

torc run $WORKFLOW_ID --max-parallel-jobs 10

Use cases:

  • Homogeneous workloads where all jobs need similar resources
  • Simple task queues
  • When resource tracking overhead is not wanted

Resource Tracking

When using resource-based allocation, the job runner tracks:

| Resource | Description |
|---|---|
| CPUs | Number of CPU cores in use |
| Memory | Total memory allocated to running jobs |
| GPUs | Number of GPUs in use |
| Nodes | Number of jobs running per node |

Jobs are only started when sufficient resources are available.
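
When running locally, the resources the runner treats as available can also be capped in configuration rather than detected from the machine. A minimal sketch using the documented [client.run] keys (values are illustrative):

# torc.toml (project-local configuration)
[client.run]
num_cpus = 16       # CPUs the runner may allocate to jobs
memory_gb = 64.0    # memory budget for resource-based allocation
num_gpus = 0        # no GPUs offered to jobs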

HPC Profiles Reference

Complete reference for HPC profile system and CLI commands.

Overview

HPC profiles contain pre-configured knowledge about High-Performance Computing systems, enabling automatic Slurm scheduler generation based on job resource requirements.

CLI Commands

torc hpc list

List all available HPC profiles.

torc hpc list [OPTIONS]

Options:

| Option | Description |
|---|---|
| -f, --format <FORMAT> | Output format: table or json |

Output columns:

  • Name: Profile identifier used in commands
  • Display Name: Human-readable name
  • Partitions: Number of configured partitions
  • Detected: Whether current system matches this profile

torc hpc detect

Detect the current HPC system.

torc hpc detect [OPTIONS]

Options:

| Option | Description |
|---|---|
| -f, --format <FORMAT> | Output format: table or json |

Returns the detected profile name, or indicates no match.


torc hpc show

Display detailed information about an HPC profile.

torc hpc show <PROFILE> [OPTIONS]

Arguments:

| Argument | Description |
|---|---|
| <PROFILE> | Profile name (e.g., kestrel) |

Options:

| Option | Description |
|---|---|
| -f, --format <FORMAT> | Output format: table or json |

torc hpc partitions

List partitions for an HPC profile.

torc hpc partitions <PROFILE> [OPTIONS]

Arguments:

| Argument | Description |
|---|---|
| <PROFILE> | Profile name (e.g., kestrel) |

Options:

| Option | Description |
|---|---|
| -f, --format <FORMAT> | Output format: table or json |

Output columns:

  • Name: Partition name
  • CPUs/Node: CPU cores per node
  • Mem/Node: Memory per node
  • Max Walltime: Maximum job duration
  • GPUs: GPU count and type (if applicable)
  • Shared: Whether partition supports shared jobs
  • Notes: Special requirements or features

torc hpc match

Find partitions matching resource requirements.

torc hpc match <PROFILE> [OPTIONS]

Arguments:

| Argument | Description |
|---|---|
| <PROFILE> | Profile name (e.g., kestrel) |

Options:

| Option | Description |
|---|---|
| --cpus <N> | Required CPU cores |
| --memory <SIZE> | Required memory (e.g., 64g, 512m) |
| --walltime <DURATION> | Required walltime (e.g., 2h, 4:00:00) |
| --gpus <N> | Required GPUs |
| -f, --format <FORMAT> | Output format: table or json |

Memory format: <number><unit> where unit is k, m, g, or t (case-insensitive).

Walltime formats:

  • HH:MM:SS (e.g., 04:00:00)
  • <N>h (e.g., 4h)
  • <N>m (e.g., 30m)
  • <N>s (e.g., 3600s)
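
For example, to ask which kestrel partitions could host a 64-core, 200 GB, 4-hour job (illustrative values, using the options above):

torc hpc match kestrel --cpus 64 --memory 200g --walltime 4h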

torc slurm generate

Generate Slurm schedulers for a workflow based on job resource requirements.

torc slurm generate [OPTIONS] --account <ACCOUNT> <WORKFLOW_FILE>

Arguments:

| Argument | Description |
|---|---|
| <WORKFLOW_FILE> | Path to workflow specification file (YAML, JSON, or JSON5) |

Options:

| Option | Description |
|---|---|
| --account <ACCOUNT> | Slurm account to use (required) |
| --profile <PROFILE> | HPC profile to use (auto-detected if not specified) |
| -o, --output <FILE> | Output file path (prints to stdout if not specified) |
| --no-actions | Don't add workflow actions for scheduling nodes |
| --force | Overwrite existing schedulers in the workflow |

Generated artifacts:

  1. Slurm schedulers: One for each unique resource requirement
  2. Job scheduler assignments: Each job linked to appropriate scheduler
  3. Workflow actions: on_workflow_start/schedule_nodes actions (unless --no-actions)

Scheduler naming: <resource_requirement_name>_scheduler
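
A typical invocation, writing the augmented specification to a new file (paths are illustrative):

torc slurm generate --account myproject \
  -o sample_workflow_with_schedulers.yaml \
  examples/yaml/sample_workflow.yaml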


Built-in Profiles

NREL Kestrel

Profile name: kestrel

Detection: Environment variable NREL_CLUSTER=kestrel

Partitions:

| Partition | CPUs | Memory | Max Walltime | GPUs | Notes |
|---|---|---|---|---|---|
| debug | 104 | 240 GB | 1h | - | Quick testing |
| short | 104 | 240 GB | 4h | - | Short jobs |
| standard | 104 | 240 GB | 48h | - | General workloads |
| long | 104 | 240 GB | 240h | - | Extended jobs |
| medmem | 104 | 480 GB | 48h | - | Medium memory |
| bigmem | 104 | 2048 GB | 48h | - | High memory |
| shared | 104 | 240 GB | 48h | - | Shared node access |
| hbw | 104 | 240 GB | 48h | - | High-bandwidth memory, min 10 nodes |
| nvme | 104 | 240 GB | 48h | - | NVMe local storage |
| gpu-h100 | 2 | 240 GB | 48h | 4x H100 | GPU compute |

Node specifications:

  • Standard nodes: 104 cores (2x Intel Xeon Sapphire Rapids), 240 GB RAM
  • GPU nodes: 4x NVIDIA H100 80GB HBM3, 128 cores, 2 TB RAM

Configuration

Custom Profiles

Don’t see your HPC? Please request built-in support so everyone benefits. See the Custom HPC Profile Tutorial for creating a profile while you wait.

Define custom profiles in your Torc configuration file:

# ~/.config/torc/config.toml

[client.hpc.custom_profiles.mycluster]
display_name = "My Cluster"
description = "Description of the cluster"
detect_env_var = "CLUSTER_NAME=mycluster"
detect_hostname = ".*\\.mycluster\\.org"
default_account = "myproject"

[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false

[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false

Profile Override

Override settings for built-in profiles:

[client.hpc.profile_overrides.kestrel]
default_account = "my_default_account"

Configuration Options

[client.hpc] Section:

| Option | Type | Description |
|---|---|---|
| profile_overrides | table | Override settings for built-in profiles |
| custom_profiles | table | Define custom HPC profiles |

Profile override options:

| Option | Type | Description |
|---|---|---|
| default_account | string | Default Slurm account for this profile |

Custom profile options:

| Option | Type | Required | Description |
|---|---|---|---|
| display_name | string | No | Human-readable name |
| description | string | No | Profile description |
| detect_env_var | string | No | Environment variable for detection (NAME=value) |
| detect_hostname | string | No | Regex pattern for hostname detection |
| default_account | string | No | Default Slurm account |
| partitions | array | Yes | List of partition configurations |

Partition options:

| Option | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Partition name |
| cpus_per_node | int | Yes | CPU cores per node |
| memory_mb | int | Yes | Memory per node in MB |
| max_walltime_secs | int | Yes | Maximum walltime in seconds |
| gpus_per_node | int | No | GPUs per node |
| gpu_type | string | No | GPU model (e.g., "H100") |
| shared | bool | No | Whether partition supports shared jobs |
| min_nodes | int | No | Minimum required nodes |
| requires_explicit_request | bool | No | Must be explicitly requested |

Resource Matching Algorithm

When generating schedulers, Torc uses this algorithm to match resource requirements to partitions:

  1. Filter by resources: Partitions must satisfy:

    • CPUs >= required CPUs
    • Memory >= required memory
    • GPUs >= required GPUs (if specified)
    • Max walltime >= required runtime
  2. Exclude debug partitions: Unless no other partition matches

  3. Prefer best fit:

    • Partitions that exactly match resource needs
    • Non-shared partitions over shared
    • Shorter max walltime over longer
  4. Handle special requirements:

    • GPU jobs only match GPU partitions
    • Respect requires_explicit_request flag
    • Honor min_nodes constraints

Generated Scheduler Format

Example generated Slurm scheduler:

slurm_schedulers:
  - name: medium_scheduler
    account: myproject
    nodes: 1
    mem: 64g
    walltime: 04:00:00
    gres: null
    partition: null  # Let Slurm choose based on resources

Corresponding workflow action:

actions:
  - trigger_type: on_workflow_start
    action_type: schedule_nodes
    scheduler: medium_scheduler
    scheduler_type: slurm
    num_allocations: 1

Runtime Format Parsing

Resource requirements use ISO 8601 duration format for runtime:

| Format | Example | Meaning |
|---|---|---|
| PTnH | PT4H | 4 hours |
| PTnM | PT30M | 30 minutes |
| PTnS | PT3600S | 3600 seconds |
| PTnHnM | PT2H30M | 2 hours 30 minutes |
| PnDTnH | P1DT12H | 1 day 12 hours |

Generated walltime uses HH:MM:SS format (e.g., 04:00:00).


See Also

Resource Monitoring Reference

Technical reference for Torc’s resource monitoring system.

Configuration Options

The resource_monitor section in workflow specifications accepts the following fields:

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | true | Enable or disable monitoring |
| granularity | string | "summary" | "summary" or "time_series" |
| sample_interval_seconds | integer | 5 | Seconds between resource samples |
| generate_plots | boolean | false | Reserved for future use |

Granularity Modes

Summary mode ("summary"):

  • Stores only peak and average values per job
  • Metrics stored in the main database results table
  • Minimal storage overhead

Time series mode ("time_series"):

  • Stores samples at regular intervals
  • Creates separate SQLite database per workflow run
  • Database location: <output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db

Sample Interval Guidelines

| Job Duration | Recommended Interval |
|---|---|
| < 1 hour | 1-2 seconds |
| 1-4 hours | 5 seconds (default) |
| > 4 hours | 10-30 seconds |
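
Putting these options together, a workflow specification might enable time-series monitoring like this (a sketch in YAML; values are illustrative):

resource_monitor:
  enabled: true
  granularity: time_series
  sample_interval_seconds: 10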

Time Series Database Schema

job_resource_samples Table

| Column | Type | Description |
|---|---|---|
| id | INTEGER | Primary key |
| job_id | INTEGER | Torc job ID |
| timestamp | REAL | Unix timestamp |
| cpu_percent | REAL | CPU utilization percentage |
| memory_bytes | INTEGER | Memory usage in bytes |
| num_processes | INTEGER | Process count including children |

job_metadata Table

| Column | Type | Description |
|---|---|---|
| job_id | INTEGER | Primary key, Torc job ID |
| job_name | TEXT | Human-readable job name |
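
Assuming the schema above, a time-series database can be inspected directly with the sqlite3 CLI. For example, to list each job's peak memory (illustrative query, using the database path template from the previous section):

sqlite3 <output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db \
  "SELECT m.job_name, MAX(s.memory_bytes) / 1e9 AS peak_memory_gb
   FROM job_resource_samples s
   JOIN job_metadata m ON m.job_id = s.job_id
   GROUP BY s.job_id;"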

Summary Metrics in Results

When using summary mode, the following fields are added to job results:

| Field | Type | Description |
|---|---|---|
| peak_cpu_percent | float | Maximum CPU percentage observed |
| avg_cpu_percent | float | Average CPU percentage |
| peak_memory_gb | float | Maximum memory in GB |
| avg_memory_gb | float | Average memory in GB |

check-resource-utilization JSON Output

When using --format json:

{
  "workflow_id": 123,
  "run_id": null,
  "total_results": 10,
  "over_utilization_count": 3,
  "violations": [
    {
      "job_id": 15,
      "job_name": "train_model",
      "resource_type": "Memory",
      "specified": "8.00 GB",
      "peak_used": "10.50 GB",
      "over_utilization": "+31.3%"
    }
  ]
}

| Field | Description |
|---|---|
| workflow_id | Workflow being analyzed |
| run_id | Specific run ID if provided, otherwise null for latest |
| total_results | Total number of completed jobs analyzed |
| over_utilization_count | Number of violations found |
| violations | Array of violation details |

Violation Object

| Field | Description |
|---|---|
| job_id | Job ID with violation |
| job_name | Human-readable job name |
| resource_type | "Memory", "CPU", or "Runtime" |
| specified | Resource requirement from workflow spec |
| peak_used | Actual peak usage observed |
| over_utilization | Percentage over/under specification |

plot-resources Output Files

| File | Description |
|---|---|
| resource_plot_job_<id>.html | Per-job timeline with CPU, memory, process count |
| resource_plot_cpu_all_jobs.html | CPU comparison across all jobs |
| resource_plot_memory_all_jobs.html | Memory comparison across all jobs |
| resource_plot_summary.html | Bar chart dashboard of peak vs average |

All plots are self-contained HTML files using Plotly.js with:

  • Interactive hover tooltips
  • Zoom and pan controls
  • Legend toggling
  • Export options (PNG, SVG)

Monitored Metrics

| Metric | Unit | Description |
|---|---|---|
| CPU percentage | % | Total CPU utilization across all cores |
| Memory usage | bytes | Resident memory consumption |
| Process count | count | Number of processes in job's process tree |

Process Tree Tracking

The monitoring system automatically tracks child processes spawned by jobs. When a job creates worker processes (e.g., Python multiprocessing), all descendants are included in the aggregated metrics.

Performance Characteristics

  • Single background monitoring thread regardless of job count
  • Typical overhead: <1% CPU even with 1-second sampling
  • Uses native OS APIs via the sysinfo crate
  • Non-blocking async design

OpenAPI Specification

The Torc server implements a REST API defined in api/openapi.yaml. All endpoints are prefixed with /torc-service/v1.

Core Endpoints

Workflows

Create Workflow

# curl
curl -X POST http://localhost:8080/torc-service/v1/workflows \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test_workflow",
    "user": "alice",
    "description": "Test workflow"
  }' | jq '.'

# nushell
http post http://localhost:8080/torc-service/v1/workflows {
  name: "test_workflow"
  user: "alice"
  description: "Test workflow"
}

Response:

{
  "id": 1,
  "name": "test_workflow",
  "user": "alice",
  "description": "Test workflow",
  "timestamp": 1699000000.0
}

List Workflows

# curl with jq
curl "http://localhost:8080/torc-service/v1/workflows?offset=0&limit=10" | jq '.workflows'

# nushell (native JSON parsing)
http get "http://localhost:8080/torc-service/v1/workflows?offset=0&limit=10" | get workflows

Get Workflow

# curl
curl http://localhost:8080/torc-service/v1/workflows/1 | jq '.'

# nushell
http get http://localhost:8080/torc-service/v1/workflows/1

Initialize Jobs

# curl
curl -X POST http://localhost:8080/torc-service/v1/workflows/1/initialize_jobs \
  -H "Content-Type: application/json" \
  -d '{"reinitialize": false, "ignore_missing_data": false}' | jq '.'

# nushell
http post http://localhost:8080/torc-service/v1/workflows/1/initialize_jobs {
  reinitialize: false
  ignore_missing_data: false
}

Jobs

Create Job

# curl
curl -X POST http://localhost:8080/torc-service/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "workflow_id": 1,
    "name": "job1",
    "command": "echo hello",
    "resource_requirements_id": 1,
    "input_file_ids": [],
    "output_file_ids": [],
    "depends_on_job_ids": []
  }' | jq '.'

List Jobs

# curl - filter by status
curl "http://localhost:8080/torc-service/v1/jobs?workflow_id=1&status=ready" \
  | jq '.jobs[] | {name, status, id}'

# nushell - filter and format
http get "http://localhost:8080/torc-service/v1/jobs?workflow_id=1"
  | get jobs
  | where status == "ready"
  | select name status id

Update Job Status

# curl
curl -X POST http://localhost:8080/torc-service/v1/jobs/1/manage_status_change \
  -H "Content-Type: application/json" \
  -d '{"target_status": "running"}' | jq '.'

Files

Create File

# curl
curl -X POST http://localhost:8080/torc-service/v1/files \
  -H "Content-Type: application/json" \
  -d '{
    "workflow_id": 1,
    "name": "input_data",
    "path": "/data/input.csv"
  }' | jq '.'

List Files

curl "http://localhost:8080/torc-service/v1/files?workflow_id=1" | jq '.files'

User Data

Create User Data

curl -X POST http://localhost:8080/torc-service/v1/user_data \
  -H "Content-Type: application/json" \
  -d '{
    "workflow_id": 1,
    "name": "config",
    "data": {"learning_rate": 0.001, "batch_size": 32}
  }' | jq '.'

Update User Data

curl -X PUT http://localhost:8080/torc-service/v1/user_data/1 \
  -H "Content-Type: application/json" \
  -d '{
    "workflow_id": 1,
    "name": "config",
    "data": {"learning_rate": 0.01, "batch_size": 64}
  }' | jq '.'

Resource Requirements

Create Resource Requirements

curl -X POST http://localhost:8080/torc-service/v1/resource_requirements \
  -H "Content-Type: application/json" \
  -d '{
    "workflow_id": 1,
    "name": "gpu_large",
    "num_cpus": 16,
    "num_gpus": 4,
    "num_nodes": 1,
    "memory": "128g",
    "runtime": "PT8H"
  }' | jq '.'

Memory Format: String with suffix: 1m (MB), 2g (GB), 512k (KB)

Runtime Format: ISO 8601 duration: PT30M (30 minutes), PT2H (2 hours), P1DT12H (1.5 days)

Compute Nodes

Create Compute Node

curl -X POST http://localhost:8080/torc-service/v1/compute_nodes \
  -H "Content-Type: application/json" \
  -d '{
    "workflow_id": 1,
    "hostname": "compute-01",
    "num_cpus": 32,
    "memory": "256g",
    "num_gpus": 8,
    "is_active": true
  }' | jq '.'

List Active Compute Nodes

curl "http://localhost:8080/torc-service/v1/compute_nodes?workflow_id=1&is_active=true" \
  | jq '.compute_nodes[] | {hostname, num_cpus, num_gpus}'

Results

Create Result

curl -X POST http://localhost:8080/torc-service/v1/results \
  -H "Content-Type: application/json" \
  -d '{
    "workflow_id": 1,
    "job_id": 1,
    "exit_code": 0,
    "stdout": "Job completed successfully",
    "stderr": ""
  }' | jq '.'

Events

List Events

curl "http://localhost:8080/torc-service/v1/events?workflow_id=1&limit=20" \
  | jq '.events[] | {timestamp, data}'

Advanced Endpoints

Prepare Next Jobs for Submission (Job Runner)

curl -X POST "http://localhost:8080/torc-service/v1/workflows/1/claim_next_jobs?num_jobs=5" \
  -H "Content-Type: application/json" \
  -d '{}' | jq '.jobs'
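
A nushell equivalent, following the same pattern as the earlier workflow examples (sketch; the empty record is the request body):

# nushell
http post "http://localhost:8080/torc-service/v1/workflows/1/claim_next_jobs?num_jobs=5" {} | get jobs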

Process Changed Job Inputs (Reinitialization)

curl -X POST "http://localhost:8080/torc-service/v1/workflows/1/process_changed_job_inputs?dry_run=true" \
  -H "Content-Type: application/json" \
  -d '{}' | jq '.reinitialized_jobs'

Configuration Reference

Complete reference for Torc configuration options.

Configuration Sources

Torc loads configuration from multiple sources in this order (later sources override earlier):

  1. Built-in defaults (lowest priority)
  2. System config: /etc/torc/config.toml
  3. User config: ~/.config/torc/config.toml (platform-dependent)
  4. Project config: ./torc.toml
  5. Environment variables: TORC_* prefix
  6. CLI arguments (highest priority)

Configuration Commands

torc config show              # Show effective configuration
torc config show --format json # Show as JSON
torc config paths             # Show configuration file locations
torc config init --user       # Create user config file
torc config init --local      # Create project config file
torc config init --system     # Create system config file
torc config validate          # Validate current configuration

Client Configuration

Settings for the torc CLI.

[client] Section

| Option | Type | Default | Description |
|---|---|---|---|
| api_url | string | http://localhost:8080/torc-service/v1 | Torc server API URL |
| format | string | table | Output format: table or json |
| log_level | string | info | Log level: error, warn, info, debug, trace |
| username | string | (none) | Username for basic authentication |

[client.run] Section

Settings for torc run command.

| Option | Type | Default | Description |
|---|---|---|---|
| poll_interval | float | 5.0 | Job completion poll interval (seconds) |
| output_dir | path | output | Output directory for job logs |
| database_poll_interval | int | 30 | Database poll interval (seconds) |
| max_parallel_jobs | int | (none) | Maximum parallel jobs (overrides resource-based) |
| num_cpus | int | (none) | Available CPUs for resource-based scheduling |
| memory_gb | float | (none) | Available memory (GB) for resource-based scheduling |
| num_gpus | int | (none) | Available GPUs for resource-based scheduling |

Example

[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "info"
username = "myuser"

[client.run]
poll_interval = 5.0
output_dir = "output"
max_parallel_jobs = 4
num_cpus = 8
memory_gb = 32.0
num_gpus = 1

[client.hpc] Section

Settings for HPC profile system (used by torc hpc and torc slurm commands).

| Option | Type | Default | Description |
|---|---|---|---|
| profile_overrides | table | {} | Override settings for built-in HPC profiles |
| custom_profiles | table | {} | Define custom HPC profiles |

[client.hpc.profile_overrides.<profile>] Section

Override settings for built-in profiles (e.g., kestrel).

| Option | Type | Default | Description |
|---|---|---|---|
| default_account | string | (none) | Default Slurm account for this profile |

[client.hpc.custom_profiles.<name>] Section

Define a custom HPC profile.

| Option | Type | Required | Description |
|---|---|---|---|
| display_name | string | No | Human-readable name |
| description | string | No | Profile description |
| detect_env_var | string | No | Environment variable for detection (NAME=value) |
| detect_hostname | string | No | Regex pattern for hostname detection |
| default_account | string | No | Default Slurm account |
| partitions | array | Yes | List of partition configurations |

[[client.hpc.custom_profiles.<name>.partitions]] Section

Define partitions for a custom profile.

| Option | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Partition name |
| cpus_per_node | int | Yes | CPU cores per node |
| memory_mb | int | Yes | Memory per node in MB |
| max_walltime_secs | int | Yes | Maximum walltime in seconds |
| gpus_per_node | int | No | GPUs per node |
| gpu_type | string | No | GPU model (e.g., "H100") |
| shared | bool | No | Whether partition supports shared jobs |
| min_nodes | int | No | Minimum required nodes |
| requires_explicit_request | bool | No | Must be explicitly requested |

HPC Example

[client.hpc.profile_overrides.kestrel]
default_account = "my_default_account"

[client.hpc.custom_profiles.mycluster]
display_name = "My Research Cluster"
description = "Internal research HPC system"
detect_env_var = "MY_CLUSTER=research"
default_account = "default_project"

[[client.hpc.custom_profiles.mycluster.partitions]]
name = "compute"
cpus_per_node = 64
memory_mb = 256000
max_walltime_secs = 172800
shared = false

[[client.hpc.custom_profiles.mycluster.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 128000
max_walltime_secs = 86400
gpus_per_node = 4
gpu_type = "A100"
shared = false

Server Configuration

Settings for torc-server.

[server] Section

| Option | Type | Default | Description |
|---|---|---|---|
| log_level | string | info | Log level |
| https | bool | false | Enable HTTPS |
| url | string | localhost | Hostname/IP to bind to |
| port | int | 8080 | Port to listen on |
| threads | int | 1 | Number of worker threads |
| database | string | (none) | SQLite database path (falls back to DATABASE_URL env) |
| auth_file | string | (none) | Path to htpasswd file |
| require_auth | bool | false | Require authentication for all requests |
| completion_check_interval_secs | float | 60.0 | Background job processing interval |

[server.logging] Section

| Option | Type | Default | Description |
|---|---|---|---|
| log_dir | path | (none) | Directory for log files (enables file logging) |
| json_logs | bool | false | Use JSON format for log files |

Example

[server]
url = "0.0.0.0"
port = 8080
threads = 4
database = "/var/lib/torc/torc.db"
auth_file = "/etc/torc/htpasswd"
require_auth = true
completion_check_interval_secs = 60.0
log_level = "info"
https = false

[server.logging]
log_dir = "/var/log/torc"
json_logs = false

Dashboard Configuration

Settings for torc-dash.

[dash] Section

| Option | Type | Default | Description |
|---|---|---|---|
| host | string | 127.0.0.1 | Hostname/IP to bind to |
| port | int | 8090 | Port to listen on |
| api_url | string | http://localhost:8080/torc-service/v1 | Torc server API URL |
| torc_bin | string | torc | Path to torc CLI binary |
| torc_server_bin | string | torc-server | Path to torc-server binary |
| standalone | bool | false | Auto-start torc-server |
| server_port | int | 0 | Server port for standalone mode (0 = auto) |
| database | string | (none) | Database path for standalone mode |
| completion_check_interval_secs | int | 5 | Completion check interval (standalone mode) |

Example

[dash]
host = "0.0.0.0"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "/usr/local/bin/torc"
torc_server_bin = "/usr/local/bin/torc-server"
standalone = true
server_port = 0
completion_check_interval_secs = 5

Environment Variables

Environment variables use double underscore (__) to separate nested keys.

Client Variables

| Variable | Maps To |
|---|---|
| TORC_CLIENT__API_URL | client.api_url |
| TORC_CLIENT__FORMAT | client.format |
| TORC_CLIENT__LOG_LEVEL | client.log_level |
| TORC_CLIENT__USERNAME | client.username |
| TORC_CLIENT__RUN__POLL_INTERVAL | client.run.poll_interval |
| TORC_CLIENT__RUN__OUTPUT_DIR | client.run.output_dir |
| TORC_CLIENT__RUN__MAX_PARALLEL_JOBS | client.run.max_parallel_jobs |
| TORC_CLIENT__RUN__NUM_CPUS | client.run.num_cpus |
| TORC_CLIENT__RUN__MEMORY_GB | client.run.memory_gb |
| TORC_CLIENT__RUN__NUM_GPUS | client.run.num_gpus |

Server Variables

| Variable | Maps To |
|---|---|
| TORC_SERVER__URL | server.url |
| TORC_SERVER__PORT | server.port |
| TORC_SERVER__THREADS | server.threads |
| TORC_SERVER__DATABASE | server.database |
| TORC_SERVER__AUTH_FILE | server.auth_file |
| TORC_SERVER__REQUIRE_AUTH | server.require_auth |
| TORC_SERVER__LOG_LEVEL | server.log_level |
| TORC_SERVER__COMPLETION_CHECK_INTERVAL_SECS | server.completion_check_interval_secs |
| TORC_SERVER__LOGGING__LOG_DIR | server.logging.log_dir |
| TORC_SERVER__LOGGING__JSON_LOGS | server.logging.json_logs |

Dashboard Variables

| Variable | Maps To |
|---|---|
| TORC_DASH__HOST | dash.host |
| TORC_DASH__PORT | dash.port |
| TORC_DASH__API_URL | dash.api_url |
| TORC_DASH__STANDALONE | dash.standalone |

Legacy Variables

These environment variables are still supported directly by clap:

| Variable | Component | Description |
|---|---|---|
| TORC_API_URL | Client | Server API URL (CLI only) |
| TORC_USERNAME | Client | Authentication username (CLI only) |
| TORC_PASSWORD | Client | Authentication password (CLI only) |
| TORC_AUTH_FILE | Server | htpasswd file path |
| TORC_LOG_DIR | Server | Log directory |
| TORC_COMPLETION_CHECK_INTERVAL_SECS | Server | Completion check interval |
| DATABASE_URL | Server | SQLite database URL |
| RUST_LOG | All | Log level filter |

Complete Example

# ~/.config/torc/config.toml

[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "info"
username = "developer"

[client.run]
poll_interval = 5.0
output_dir = "output"
database_poll_interval = 30
num_cpus = 8
memory_gb = 32.0
num_gpus = 1

[server]
log_level = "info"
https = false
url = "localhost"
port = 8080
threads = 4
database = "/var/lib/torc/torc.db"
require_auth = false
completion_check_interval_secs = 60.0

[server.logging]
log_dir = "/var/log/torc"
json_logs = false

[dash]
host = "127.0.0.1"
port = 8090
api_url = "http://localhost:8080/torc-service/v1"
torc_bin = "torc"
torc_server_bin = "torc-server"
standalone = false
server_port = 0
completion_check_interval_secs = 5

See Also

Security Reference

This document describes Torc’s security features, threat model, and best practices for secure deployments.

Authentication & Authorization

HTTP Basic Authentication

Torc uses HTTP Basic authentication with bcrypt password hashing.

Security Properties:

  • ✅ Industry-standard authentication method
  • ✅ Bcrypt hashing with configurable work factor (cost 4-31)
  • ✅ No plaintext password storage
  • ✅ Salt automatically generated per password
  • ⚠️ Credentials sent base64-encoded (requires HTTPS)

Architecture:

Client Request
    ↓
[Authorization: Basic base64(username:password)]
    ↓
Server Middleware
    ↓
Extract credentials → Verify against htpasswd file (bcrypt)
    ↓
Success: Add username to request context
Failure: Return None authorization (logged)
    ↓
API Handler (receives authorization context)

Authentication Modes

| Mode | Configuration | Behavior |
|---|---|---|
| Disabled | No --auth-file | All requests allowed, no authentication |
| Optional | --auth-file only | Valid credentials logged, invalid/missing allowed |
| Required | --auth-file --require-auth | Invalid/missing credentials rejected |

Recommendation: Use Required mode in production.
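
For illustration, the three modes correspond to these torc-server invocations (using the flags from the table above; the htpasswd path is an example):

# Disabled: no authentication configured
torc-server run

# Optional: credentials verified and logged when present
torc-server run --auth-file /etc/torc/htpasswd

# Required: reject requests without valid credentials
torc-server run --auth-file /etc/torc/htpasswd --require-auth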

Transport Security

HTTPS/TLS

When to use HTTPS:

  • Always when authentication is enabled
  • ✅ When transmitting sensitive workflow data
  • ✅ Over untrusted networks (internet, shared networks)
  • ✅ Compliance requirements (PCI-DSS, HIPAA, etc.)

Configuration:

# Server
torc-server run --https --auth-file /etc/torc/htpasswd

# Client
torc --url https://torc.example.com/torc-service/v1 workflows list

TLS Version: Torc uses the system’s OpenSSL/native-tls library. Ensure:

  • TLS 1.2 minimum (TLS 1.3 preferred)
  • Strong cipher suites enabled
  • Valid certificates from trusted CA

Network Security

Deployment Patterns:

Pattern 1: Internal Network Only

[Torc Clients] ←→ [Torc Server]
    (Trusted internal network)
  • May use HTTP if network is truly isolated
  • Still recommend HTTPS for defense in depth

Pattern 2: Load Balancer with TLS Termination

[Torc Clients] ←HTTPS→ [Load Balancer] ←HTTP→ [Torc Server]
    (Internet)              (Internal trusted network)
  • TLS terminates at load balancer
  • Internal traffic may use HTTP
  • Ensure load balancer validates certificates

Pattern 3: End-to-End TLS

[Torc Clients] ←HTTPS→ [Torc Server]
    (Internet or untrusted network)
  • Most secure pattern
  • TLS all the way to Torc server
  • Required for compliance scenarios

Credential Management

Password Requirements

Recommendations:

  • Minimum 12 characters
  • Mix of uppercase, lowercase, numbers, symbols
  • No dictionary words or common patterns
  • Unique per user and environment

Bcrypt Cost Factor:

| Cost | Hash Time | Use Case |
|---|---|---|
| 4-8 | < 100ms | Testing only |
| 10 | ~100ms | Legacy systems |
| 12 | ~250ms | Default, good for most use cases |
| 13-14 | ~500ms-1s | Production, sensitive data |
| 15+ | > 2s | High-security, infrequent logins |

Cost Selection Criteria:

  • Higher cost = more CPU, slower login
  • Balance security vs. user experience
  • Consider attack surface (internet-facing vs. internal)

Htpasswd File Security

File Permissions:

# Restrict to server process owner only
chmod 600 /etc/torc/htpasswd
chown torc-server:torc-server /etc/torc/htpasswd

Storage Best Practices:

  • ❌ Never commit to version control
  • ❌ Never share between environments
  • ✅ Store in secure configuration management (Ansible Vault, HashiCorp Vault)
  • ✅ Backup with encryption
  • ✅ Rotate regularly (quarterly recommended)

File Format Security:

# Comments allowed
username:$2b$12$hash...
  • Only bcrypt hashes accepted ($2a$, $2b$, or $2y$)
  • No plaintext passwords
  • No MD5, SHA-1, or weak hashes
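
One way to create such an entry is the Apache htpasswd utility (an assumption for illustration; any tool that emits bcrypt hashes in the format above will work):

# -B selects bcrypt, -C sets the cost factor, -c creates the file
htpasswd -B -C 12 -c /etc/torc/htpasswd alice
chmod 600 /etc/torc/htpasswd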

Client Credential Storage

Best Practices:

| Method | Security | Use Case |
|---|---|---|
| Environment variables | ⭐⭐⭐ | Scripts, automation, CI/CD |
| Password prompt | ⭐⭐⭐⭐⭐ | Interactive sessions |
| Config files |  | Not recommended |
| Command-line args | ⚠️ | Visible in process list, avoid |

Examples:

# Good: Environment variables
export TORC_USERNAME=alice
export TORC_PASSWORD=$(secret-tool lookup torc password)
torc workflows list

# Good: Password prompt
torc --username alice workflows list
Password: ****

# Acceptable: CI/CD with secrets
TORC_PASSWORD=${{ secrets.TORC_PASSWORD }} torc workflows create

# Bad: Command-line argument (visible in `ps`)
torc --username alice --password mypassword workflows list

Threat Model

Threats Mitigated

| Threat | Mitigation | Effectiveness |
|---|---|---|
| Unauthorized API access | Required authentication | ✅ High |
| Credential stuffing | Bcrypt work factor, rate limiting | ✅ Medium-High |
| Password cracking | Bcrypt (cost ≥12) | ✅ High |
| Man-in-the-middle | HTTPS/TLS | ✅ High |
| Credential theft (database) | No plaintext storage, bcrypt | ✅ High |

Threats Not Mitigated

| Threat | Impact | Recommendation |
|---|---|---|
| DDoS attacks | High | Use rate limiting, firewalls, CDN |
| SQL injection | Medium | Use parameterized queries (Torc does) |
| Insider threats | High | Audit logging, least privilege |
| Compromised client | High | Network segmentation, monitoring |
| Side-channel attacks | Low | Constant-time operations (bcrypt does) |

Attack Scenarios

Scenario 1: Compromised htpasswd file

Impact: Attacker has password hashes

Risk: Medium - Bcrypt makes cracking difficult

Mitigation:

  1. Immediately revoke all user accounts
  2. Generate new htpasswd file with fresh passwords
  3. Investigate how file was compromised
  4. Increase bcrypt cost if needed

Scenario 2: Leaked credentials in logs

Impact: Credentials in plaintext in logs

Risk: High

Prevention:

  • Never log passwords
  • Sanitize logs before sharing
  • Restrict log access

Response:

  1. Rotate affected credentials immediately
  2. Audit all log access
  3. Review code for password logging

Scenario 3: Network eavesdropping (HTTP)

Impact: Credentials intercepted in transit

Risk: Critical over untrusted networks

Prevention:

  • Always use HTTPS when authentication is enabled
  • Especially critical for internet-facing deployments

Response:

  1. Enable HTTPS immediately
  2. Rotate all credentials (assume compromised)
  3. Review access logs for suspicious activity

Audit & Monitoring

Authentication Events

Server logs authentication events:

# Successful authentication
DEBUG torc::server::auth: User 'alice' authenticated successfully

# Failed authentication (wrong password)
WARN torc::server::auth: Authentication failed for user 'alice'

# Missing credentials when required
WARN torc::server::auth: Authentication required but no credentials provided

# No authentication configured
DEBUG torc::server::auth: No authentication configured, allowing request

Metrics to track:

  1. Failed authentication attempts (per user, total)
  2. Successful authentications (per user)
  3. Requests without credentials (when auth enabled)
  4. Unusual access patterns (time, volume, endpoints)

Alerting thresholds:

  • 5+ failed attempts from same user in 5 minutes
  • 100+ failed attempts total in 1 hour
  • Authentication from unexpected IP ranges
  • Access during unusual hours (if applicable)

Log aggregation:

# Collect auth events
grep "torc::server::auth" /var/log/torc-server.log

# Count failed attempts per user
grep "Authentication failed" /var/log/torc-server.log | \
  awk '{print $(NF)}' | sort | uniq -c

# Monitor in real-time
tail -f /var/log/torc-server.log | grep "WARN.*auth"

Compliance Considerations

GDPR / Privacy

User data in htpasswd:

  • Usernames may be personal data (email addresses)
  • Password hashes are not personal data (irreversible)

Recommendations:

  • Allow users to request account deletion
  • Don’t use email addresses as usernames (use aliases)
  • Document data retention policies

PCI-DSS / SOC2

Requirements that apply:

  1. Transport encryption: Use HTTPS
  2. Access control: Enable required authentication
  3. Password complexity: Enforce strong passwords
  4. Audit logging: Enable and monitor auth logs
  5. Regular reviews: Audit user accounts quarterly

Configuration:

# PCI-DSS compliant setup
torc-server run \
  --https \
  --auth-file /etc/torc/htpasswd \
  --require-auth \
  --log-level info

Security Checklist

Server Deployment

  • HTTPS enabled in production
  • Strong TLS configuration (TLS 1.2+, strong ciphers)
  • Valid certificate from trusted CA
  • Required authentication enabled (--require-auth)
  • Htpasswd file permissions: chmod 600
  • Htpasswd file owned by server process user
  • Bcrypt cost ≥ 12 (≥14 for high-security)
  • Strong passwords enforced
  • Audit logging enabled
  • Log rotation configured
  • Firewall rules limit access
  • Server runs as non-root user
  • Regular security updates applied

Client Usage

  • HTTPS URLs used when auth enabled
  • Credentials stored in environment variables (not command-line)
  • Passwords not logged
  • Passwords not committed to version control
  • Password prompting used for interactive sessions
  • CI/CD secrets used for automation
  • Regular password rotation

Operational

  • User accounts reviewed quarterly
  • Inactive accounts disabled/removed
  • Failed login attempts monitored
  • Access logs reviewed regularly
  • Incident response plan documented
  • Backup htpasswd files encrypted
  • Disaster recovery tested

Future Enhancements

Planned security features:

  1. Token-based authentication: JWT/OAuth2 support
  2. API keys: Long-lived tokens for automation
  3. Role-based access control (RBAC): Granular permissions
  4. LDAP/Active Directory integration: Enterprise SSO
  5. Rate limiting: Prevent brute force attacks
  6. 2FA/MFA support: Multi-factor authentication
  7. Session management: Token expiration, refresh
  8. Audit trail: Detailed access logging

Resources

Tutorials

This section contains learning-oriented lessons to help you get started with Torc. Each tutorial walks through a complete example from start to finish.

Tutorials:

  1. Configuration Files - Set up configuration files for Torc components
  2. Dashboard Deployment - Deploy torc-dash for local, shared, or HPC environments
  3. Workflow Wizard - Create workflows using the dashboard’s interactive wizard
  4. Many Independent Jobs - Create a workflow with 100 parallel jobs
  5. Diamond Workflow - Fan-out and fan-in with file dependencies
  6. User Data Dependencies - Pass JSON data between jobs
  7. Simple Parameterization - Single parameter dimension sweep
  8. Advanced Parameterization - Multi-dimensional hyperparameter grid search
  9. Multi-Stage Workflows with Barriers - Scale to thousands of jobs efficiently
  10. Map Python Functions - Distribute Python functions across workers
  11. Filtering CLI Output with Nushell - Filter jobs, results, and user data with readable queries
  12. Custom HPC Profile - Create an HPC profile for unsupported clusters

Start with the Configuration Files tutorial to set up your environment, then try the Dashboard Deployment tutorial if you want to use the web interface.

Example Files

The repository includes ready-to-run example workflow specifications in YAML, JSON5, and KDL formats. These complement the tutorials and demonstrate additional patterns:

| Example | Description | Tutorial |
|---|---|---|
| diamond_workflow.yaml | Fan-out/fan-in pattern | Diamond Workflow |
| hundred_jobs_parameterized.yaml | 100 parallel jobs via parameterization | Many Jobs |
| hyperparameter_sweep.yaml | ML grid search (3×3×2 = 18 jobs) | Advanced Params |
| multi_stage_barrier_pattern.yaml | Efficient multi-stage workflow | Barriers |
| resource_monitoring_demo.yaml | CPU/memory tracking |  |
| workflow_actions_simple_slurm.yaml | Automated Slurm scheduling |  |

Browse all examples:

See the examples README for the complete list and usage instructions.

Configuration Files Tutorial

This tutorial walks you through setting up Torc configuration files to customize your workflows without specifying options on every command.

What You’ll Learn

  • How to create a configuration file
  • Configuration file locations and priority
  • Using environment variables for configuration
  • Common configuration patterns

Prerequisites

  • Torc CLI installed
  • Basic familiarity with TOML format

Step 1: Check Current Configuration

First, let’s see what configuration Torc is using:

torc config paths

Output:

Configuration file paths (in priority order):

  System:  /etc/torc/config.toml (not found)
  User:    ~/.config/torc/config.toml (not found)
  Local:   torc.toml (not found)

Environment variables (highest priority):
  Use double underscore (__) to separate nested keys:
    TORC_CLIENT__API_URL, TORC_CLIENT__FORMAT, TORC_SERVER__PORT, etc.

No configuration files found. Run 'torc config init --user' to create one.

View the effective configuration (defaults):

torc config show

Step 2: Create a User Configuration File

Create a configuration file in your home directory that applies to all your Torc usage:

torc config init --user

This creates ~/.config/torc/config.toml (Linux/macOS) or the equivalent on your platform.

Step 3: Edit the Configuration

Open the configuration file in your editor:

# Linux/macOS
$EDITOR ~/.config/torc/config.toml

# Or find the path
torc config paths

Here’s a typical user configuration:

[client]
# Connect to your team's Torc server
api_url = "http://torc-server.internal:8080/torc-service/v1"

# Default to JSON output for scripting
format = "json"

# Enable debug logging
log_level = "debug"

# Username for authentication
username = "alice"

[client.run]
# Default poll interval for local runs
poll_interval = 10.0

# Default output directory
output_dir = "workflow_output"

# Resource limits for local execution
num_cpus = 8
memory_gb = 32.0
num_gpus = 1

Step 4: Validate Your Configuration

After editing, validate the configuration:

torc config validate

Output:

Validating configuration...

Loading configuration from:
  - /home/alice/.config/torc/config.toml

Configuration is valid.

Key settings:
  client.api_url = http://torc-server.internal:8080/torc-service/v1
  client.format = json
  server.port = 8080
  dash.port = 8090

Step 5: Create a Project-Local Configuration

For project-specific settings, create a torc.toml in your project directory:

cd ~/myproject
torc config init --local

Edit torc.toml:

[client]
# Project-specific server (overrides user config)
api_url = "http://localhost:8080/torc-service/v1"

[client.run]
# Project-specific output directory
output_dir = "results"

# This project needs more memory
memory_gb = 64.0

Step 6: Understanding Priority

Configuration sources are loaded in this order (later sources override earlier):

  1. Built-in defaults (lowest priority)
  2. System config (/etc/torc/config.toml)
  3. User config (~/.config/torc/config.toml)
  4. Project-local config (./torc.toml)
  5. Environment variables (TORC_*)
  6. CLI arguments (highest priority)

Example: If you have api_url set in your user config but run:

torc --url http://other-server:8080/torc-service/v1 workflows list

The CLI argument takes precedence.

Step 7: Using Environment Variables

Environment variables are useful for CI/CD pipelines and temporary overrides.

Use double underscore (__) to separate nested keys:

# Override client.api_url
export TORC_CLIENT__API_URL="http://ci-server:8080/torc-service/v1"

# Override client.format
export TORC_CLIENT__FORMAT="json"

# Override server.port
export TORC_SERVER__PORT="9999"

# Verify
torc config show | grep api_url

Step 8: Server Configuration

If you’re running torc-server, you can configure it too:

[server]
# Bind to all interfaces
url = "0.0.0.0"
port = 8080

# Use 4 worker threads
threads = 4

# Database location
database = "/var/lib/torc/torc.db"

# Authentication
auth_file = "/etc/torc/htpasswd"
require_auth = true

# Background job processing interval
completion_check_interval_secs = 30.0

# Log level
log_level = "info"

[server.logging]
# Enable file logging
log_dir = "/var/log/torc"
json_logs = true

Step 9: Dashboard Configuration

Configure torc-dash:

[dash]
# Bind address
host = "0.0.0.0"
port = 8090

# API server to connect to
api_url = "http://localhost:8080/torc-service/v1"

# Standalone mode settings
standalone = false

Common Configuration Patterns

Development Setup

# ~/.config/torc/config.toml
[client]
api_url = "http://localhost:8080/torc-service/v1"
format = "table"
log_level = "debug"

[client.run]
poll_interval = 2.0
output_dir = "output"

Production Server

# /etc/torc/config.toml
[server]
url = "0.0.0.0"
port = 8080
threads = 8
database = "/var/lib/torc/production.db"
require_auth = true
auth_file = "/etc/torc/htpasswd"
completion_check_interval_secs = 60.0
log_level = "info"

[server.logging]
log_dir = "/var/log/torc"
json_logs = true

CI/CD Pipeline

# In CI script
export TORC_CLIENT__API_URL="${CI_TORC_SERVER_URL}"
export TORC_CLIENT__FORMAT="json"

torc run workflow.yaml

Troubleshooting

Configuration Not Loading

Check which files are being loaded:

torc config validate

Environment Variables Not Working

Remember to use double underscore (__) for nesting:

# Correct
TORC_CLIENT__API_URL=http://...

# Wrong (single underscore)
TORC_CLIENT_API_URL=http://...

View Effective Configuration

See the merged result of all configuration sources:

torc config show

Next Steps

Dashboard Deployment Tutorial

This tutorial covers three common deployment scenarios for the Torc web dashboard (torc-dash). Each scenario addresses different environments and use cases.

Prefer the terminal? If you work primarily in SSH sessions or terminal environments, consider using the Terminal UI (TUI) instead. The TUI provides the same workflow and job management capabilities without requiring a web browser or SSH tunnels.

Overview of Deployment Scenarios

| Scenario | Environment | Use Case |
|---|---|---|
| 1. Standalone | Local computer | Single-computer workflows, development, testing |
| 2. All-in-One Login Node | HPC login node | Small HPC workflows (< 100 jobs) |
| 3. Shared Server | HPC login node + dedicated server | Large-scale multi-user HPC workflows |

Prerequisites

Before starting, ensure you have:

  1. Built Torc binaries (see Installation):

    cargo build --release --workspace
    
  2. Added binaries to PATH:

    export PATH="$PATH:/path/to/torc/target/release"
    
  3. Initialized the database (if not using standalone mode):

    sqlx database setup
    

Scenario 1: Local Development (Standalone Mode)

Best for: Single-computer workflows on your laptop or workstation. Also ideal for development, testing, and learning Torc.

This is the simplest setup - everything runs on one machine with a single command. Use this when you want to run workflows entirely on your local computer without HPC resources.

Architecture

flowchart TB
    subgraph computer["Your Computer"]
        browser["Browser"]
        dash["torc-dash<br/>(web UI)"]
        server["torc-server<br/>(managed)"]
        cli["torc CLI"]
        db[("SQLite DB")]

        browser --> dash
        dash -->|"HTTP API"| server
        dash -->|"executes"| cli
        cli -->|"HTTP API"| server
        server --> db
    end

Setup

Step 1: Start the dashboard in standalone mode

torc-dash --standalone

This single command:

  • Automatically starts torc-server on a free port
  • Starts the dashboard on http://127.0.0.1:8090
  • Configures the dashboard to connect to the managed server

Step 2: Open your browser

Navigate to http://localhost:8090

Step 3: Create and run a workflow

  1. Click Create Workflow
  2. Upload a workflow specification file (YAML, JSON, or KDL)
  3. Click Create
  4. Click Initialize on the new workflow
  5. Click Run Locally to execute

Configuration Options

# Custom dashboard port
torc-dash --standalone --port 8080

# Specify database location
torc-dash --standalone --database /path/to/my.db

# Faster job completion detection
torc-dash --standalone --completion-check-interval-secs 2

# Specify binary paths (if not in PATH)
torc-dash --standalone \
  --torc-bin /path/to/torc \
  --torc-server-bin /path/to/torc-server

Stopping

Press Ctrl+C in the terminal. This stops both the dashboard and the managed server.


Scenario 2: All-in-One Login Node

Best for: Small HPC workflows (fewer than 100 jobs) where you want the complete Torc stack running on the login node, with jobs submitted to Slurm.

This is the simplest HPC setup - everything runs on the login node. It’s ideal for individual users running small HPC workflows without needing a dedicated server infrastructure.

Important: Login nodes are shared resources. The torc-dash and torc-server applications consume minimal resources when workflows are small (e.g., fewer than 100 jobs). Running them with larger workflows, especially with short completion-check intervals, may impact other users.

Architecture

flowchart TB
    subgraph local["Your Local Machine"]
        browser["Browser"]
    end

    subgraph login["Login Node"]
        dash["torc-dash<br/>(port 8090)"]
        server["torc-server<br/>(port 8080)"]
        cli["torc CLI"]
        db[("SQLite DB")]
        slurm["sbatch/squeue"]

        dash -->|"HTTP API"| server
        dash -->|"executes"| cli
        cli -->|"HTTP API"| server
        server --> db
        cli --> slurm
    end

    subgraph compute["Compute Nodes (Slurm)"]
        runner1["torc-slurm-job-runner<br/>(job 1)"]
        runner2["torc-slurm-job-runner<br/>(job 2)"]
        runnerN["torc-slurm-job-runner<br/>(job N)"]

        runner1 -->|"HTTP API"| server
        runner2 -->|"HTTP API"| server
        runnerN -->|"HTTP API"| server
    end

    browser -->|"SSH tunnel"| dash
    slurm --> compute

Setup

Step 1: Start torc-server on the login node

# Start server
torc-server run \
  --port 8080 \
  --database $SCRATCH/torc.db \
  --completion-check-interval-secs 60

Or as a background process:

nohup torc-server run \
  --port 8080 \
  --database $SCRATCH/torc.db \
  > $SCRATCH/torc-server.log 2>&1 &
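
If you start the server in the background this way, it helps to record its PID so you can stop it cleanly later. A minimal sketch using standard shell features:

# Immediately after the nohup command above, save the background PID
echo $! > $SCRATCH/torc-server.pid

# Later, stop the server
kill "$(cat $SCRATCH/torc-server.pid)"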

Step 2: Start torc-dash on the same login node

# Set API URL to local server
export TORC_API_URL="http://localhost:8080/torc-service/v1"

# Start dashboard
torc-dash --port 8090

Or in the background:

nohup torc-dash --port 8090 > $SCRATCH/torc-dash.log 2>&1 &

Step 3: Access via SSH tunnel

From your local machine:

ssh -L 8090:localhost:8090 user@login-node

Important: Use localhost in the tunnel command, not the login node’s hostname. This works because torc-dash binds to 127.0.0.1 by default.

Open http://localhost:8090 in your browser.

Submitting to Slurm

Via Dashboard:

  1. Create a workflow with Slurm scheduler configuration
  2. Click Initialize
  3. Click Submit (not “Run Locally”)

Via CLI:

export TORC_API_URL="http://localhost:8080/torc-service/v1"

# Create workflow with Slurm actions
torc workflows create my_slurm_workflow.yaml

# Submit to Slurm
torc submit <workflow_id>

Monitoring Slurm Jobs

The dashboard shows job status updates as Slurm jobs progress:

  1. Go to Details tab
  2. Select Jobs
  3. Enable Auto-refresh
  4. Watch status change from pending → running → completed

You can also monitor via:

  • Events tab for state transitions
  • Debugging tab for job logs after completion

Scenario 3: Shared Server on HPC

Best for: Large-scale multi-user HPC environments where a central torc-server runs persistently on a dedicated server, and multiple users access it via torc-dash from login nodes.

This is the most scalable setup, suitable for production deployments with many concurrent users and large workflows.

Architecture

flowchart TB
    subgraph local["Your Local Machine"]
        browser["Browser"]
    end

    subgraph login["Login Node"]
        dash["torc-dash<br/>(port 8090)"]
        cli["torc CLI"]

        dash -->|"executes"| cli
    end

    subgraph shared["Shared Server"]
        server["torc-server<br/>(port 8080)"]
        db[("SQLite DB")]

        server --> db
    end

    browser -->|"SSH tunnel"| dash
    dash -->|"HTTP API"| server
    cli -->|"HTTP API"| server

Setup

Step 1: Start torc-server on the shared server

On the shared server (e.g., a dedicated service node):

# Start server with production settings
torc-server run \
  --port 8080 \
  --database /shared/storage/torc.db \
  --completion-check-interval-secs 60

For production, consider running as a systemd service:

torc-server service install --user \
  --port 8080 \
  --database /shared/storage/torc.db
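
Once installed, you can manage the service with the usual systemd user-unit commands. The unit name torc-server below is an assumption; adjust it to whatever torc-server service install reports on your system:

# Assumed unit name "torc-server"; check the install command's output
systemctl --user status torc-server
journalctl --user -u torc-server -f   # follow the service logs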

Step 2: Start torc-dash on a login node

SSH to the login node and start the dashboard:

# Connect to the shared server
export TORC_API_URL="http://shared-server:8080/torc-service/v1"

# Start dashboard (accessible only from login node by default)
torc-dash --port 8090

Step 3: Access the dashboard via SSH tunnel

From your local machine, create an SSH tunnel:

ssh -L 8090:localhost:8090 user@login-node

Important: Use localhost in the tunnel command, not the login node’s hostname. The tunnel forwards your local port to localhost:8090 as seen from the login node, which matches where torc-dash binds (127.0.0.1:8090).

Then open http://localhost:8090 in your local browser.

Using the CLI

Users can also interact with the shared server via CLI:

# Set the API URL
export TORC_API_URL="http://shared-server:8080/torc-service/v1"

# Create and run workflows
torc workflows create my_workflow.yaml
torc workflows run <workflow_id>

Authentication

For multi-user environments, enable authentication:

# Create htpasswd file with users
torc-htpasswd create /path/to/htpasswd
torc-htpasswd add /path/to/htpasswd alice
torc-htpasswd add /path/to/htpasswd bob

# Start server with authentication
torc-server run \
  --port 8080 \
  --auth-file /path/to/htpasswd \
  --require-auth

See Authentication for details.


Comparison Summary

| Feature | Standalone | All-in-One Login Node | Shared Server |
|---|---|---|---|
| Setup complexity | Low | Medium | Medium-High |
| Multi-user support | No | Single user | Yes |
| Slurm integration | No | Yes | Yes |
| Database location | Local | Login node | Shared storage |
| Persistence | Session only | Depends on setup | Persistent |
| Best for | Single-computer workflows | Small HPC workflows (< 100 jobs) | Large-scale production |

Troubleshooting

Cannot connect to server

# Check if server is running
curl http://localhost:8080/torc-service/v1/workflows

# Check server logs
tail -f torc-server.log

SSH tunnel not working

# Verify tunnel is established
lsof -i :8090

# Check for port conflicts
netstat -tuln | grep 8090

Slurm jobs not starting

# Check Slurm queue
squeue -u $USER

# Check Slurm job logs
cat output/slurm_output_*.e

Dashboard shows “Disconnected”

  • Verify API URL in Configuration tab
  • Check network connectivity to server
  • Ensure server is running and accessible

Next Steps

Creating Workflows with the Dashboard Wizard

This tutorial walks you through creating a workflow using the interactive wizard in the Torc dashboard. The wizard provides a guided, step-by-step interface for building workflows without writing YAML or JSON files.

Learning Objectives

By the end of this tutorial, you will:

  • Create a multi-job workflow using the dashboard wizard
  • Define job dependencies visually
  • Configure Slurm schedulers for HPC execution
  • Set up workflow actions to automatically schedule nodes
  • Understand how the wizard generates workflow specifications

Prerequisites

Overview

The workflow wizard guides you through five steps:

  1. Basics - Workflow name and description
  2. Jobs - Define computational tasks
  3. Schedulers - Configure Slurm schedulers (optional)
  4. Actions - Set up automatic node scheduling (optional)
  5. Review - Preview and create the workflow

Step 1: Open the Create Workflow Modal

  1. Open the Torc dashboard in your browser
  2. Click the Create Workflow button in the top-right corner
  3. Select the Wizard tab at the top of the modal

You’ll see the wizard interface with step indicators showing your progress.

Step 2: Configure Basics

Enter the basic workflow information:

  • Workflow Name (required): A unique identifier for your workflow (e.g., data-pipeline)
  • Description (optional): A brief description of what the workflow does

Click Next to proceed.

Step 3: Add Jobs

This is where you define the computational tasks in your workflow.

Adding Your First Job

  1. Click + Add Job
  2. Fill in the job details:
    • Job Name: A unique name (e.g., preprocess)
    • Command: The shell command to execute (e.g., python preprocess.py)

Setting Dependencies

The Blocked By field lets you specify which jobs must complete before this job can run:

  1. Click the Blocked By dropdown
  2. Select one or more jobs that must complete first
  3. Hold Ctrl/Cmd to select multiple jobs

Configuring Resources

Choose a resource preset or customize:

  • Small: 1 CPU, 1GB memory
  • Medium: 8 CPUs, 50GB memory
  • GPU: 1 CPU, 10GB memory, 1 GPU
  • Custom: Specify exact requirements

Example: Three-Job Pipeline

Let’s create a simple pipeline:

Job 1: preprocess

  • Name: preprocess
  • Command: echo "Preprocessing..." && sleep 5
  • Blocked By: (none - this runs first)
  • Resources: Small

Job 2: analyze

  • Name: analyze
  • Command: echo "Analyzing..." && sleep 10
  • Blocked By: preprocess
  • Resources: Medium

Job 3: report

  • Name: report
  • Command: echo "Generating report..." && sleep 3
  • Blocked By: analyze
  • Resources: Small

Click Next when all jobs are configured.

Step 4: Configure Schedulers (Optional)

If you’re running on an HPC system with Slurm, you can define scheduler configurations here. Skip this step for local execution.

Adding a Scheduler

  1. Click + Add Scheduler

  2. Fill in the required fields:

    • Scheduler Name: A reference name (e.g., compute_scheduler)
    • Account: Your Slurm account name
  3. Configure optional settings:

    • Nodes: Number of nodes to request
    • Wall Time: Maximum runtime (HH:MM:SS format)
    • Partition: Slurm partition name
    • QoS: Quality of service level
    • GRES: GPU resources (e.g., gpu:2)
    • Memory: Memory per node (e.g., 64G)
    • Temp Storage: Local scratch space
    • Extra Slurm Options: Additional sbatch flags

Example: Basic Compute Scheduler

  • Scheduler Name: compute
  • Account: my_project
  • Nodes: 1
  • Wall Time: 02:00:00
  • Partition: standard

Assigning Jobs to Schedulers

After defining schedulers, you can assign jobs to them:

  1. Go back to the Jobs step (click Back)
  2. In each job card, find the Scheduler dropdown
  3. Select the scheduler to use for that job

Jobs without a scheduler assigned will run locally.

Click Next when scheduler configuration is complete.

Step 5: Configure Actions (Optional)

Actions automatically schedule Slurm nodes when certain events occur. This is useful for dynamic resource allocation.

Trigger Types

  • When workflow starts: Schedule nodes immediately when the workflow begins
  • When jobs become ready: Schedule nodes when specific jobs are ready to run
  • When jobs complete: Schedule nodes after specific jobs finish

Adding an Action

  1. Click + Add Action
  2. Select the Trigger type
  3. Select the Scheduler to use
  4. For job-based triggers, select which Jobs trigger the action
  5. Set the Number of Allocations (how many Slurm jobs to submit)

Example: Stage-Based Scheduling

For a workflow with setup, compute, and finalize stages:

Action 1: Setup Stage

  • Trigger: When workflow starts
  • Scheduler: setup_scheduler
  • Allocations: 1

Action 2: Compute Stage

  • Trigger: When jobs become ready
  • Jobs: compute_job1, compute_job2, compute_job3
  • Scheduler: compute_scheduler
  • Allocations: 3

Action 3: Finalize Stage

  • Trigger: When jobs become ready
  • Jobs: finalize
  • Scheduler: finalize_scheduler
  • Allocations: 1

Click Next to proceed to review.

Step 6: Review and Create

The review step shows the generated workflow specification in JSON format. This is exactly what will be submitted to the server.

Reviewing the Spec

Examine the generated specification to verify:

  • All jobs are included with correct names and commands
  • Dependencies (depends_on) match your intended workflow structure
  • Resource requirements are correctly assigned
  • Schedulers have the right configuration
  • Actions trigger on the expected events

Creating the Workflow

  1. Review the Options below the wizard:

    • Initialize workflow after creation: Builds the dependency graph (recommended)
    • Run workflow immediately: Starts execution right away
  2. Click Create to submit the workflow

If successful, you’ll see a success notification and the workflow will appear in your workflow list.

Example: Complete Diamond Workflow

Here’s how to create a diamond-pattern workflow using the wizard:

     preprocess
       /    \
    work1   work2
       \    /
    postprocess

Jobs Configuration

| Job | Command | Blocked By | Resources |
|---|---|---|---|
| preprocess | ./preprocess.sh | (none) | Small |
| work1 | ./work1.sh | preprocess | Medium |
| work2 | ./work2.sh | preprocess | Medium |
| postprocess | ./postprocess.sh | work1, work2 | Small |

Generated Spec Preview

The wizard generates a spec like this:

{
  "name": "diamond-workflow",
  "description": "Fan-out and fan-in example",
  "jobs": [
    {
      "name": "preprocess",
      "command": "./preprocess.sh",
      "resource_requirements": "res_1cpu_1g"
    },
    {
      "name": "work1",
      "command": "./work1.sh",
      "depends_on": ["preprocess"],
      "resource_requirements": "res_8cpu_50g"
    },
    {
      "name": "work2",
      "command": "./work2.sh",
      "depends_on": ["preprocess"],
      "resource_requirements": "res_8cpu_50g"
    },
    {
      "name": "postprocess",
      "command": "./postprocess.sh",
      "depends_on": ["work1", "work2"],
      "resource_requirements": "res_1cpu_1g"
    }
  ],
  "resource_requirements": [
    {"name": "res_1cpu_1g", "num_cpus": 1, "memory": "1g", "num_gpus": 0, "num_nodes": 1, "runtime": "PT1H"},
    {"name": "res_8cpu_50g", "num_cpus": 8, "memory": "50g", "num_gpus": 0, "num_nodes": 1, "runtime": "PT1H"}
  ]
}

Using Parameterized Jobs

The wizard supports job parameterization for creating multiple similar jobs:

  1. In a job card, find the Parameters field
  2. Enter parameters in the format: param_name: "value_spec"

Parameter Formats

  • Range: i: "1:10" creates jobs for i=1,2,3,…,10
  • Range with step: i: "0:100:10" creates jobs for i=0,10,20,…,100
  • List: dataset: "['train', 'test', 'validation']"

Example: Parameterized Processing

  • Job Name: process_{i}
  • Command: python process.py --index {i}
  • Parameters: i: "1:5"

This creates 5 jobs: process_1 through process_5.
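
Behind the scenes this corresponds to an ordinary parameterized job definition. A rough sketch of what the wizard generates for this example, written in YAML like the other examples in this documentation:

jobs:
  - name: process_{i}
    command: python process.py --index {i}
    parameters:
      i: "1:5"   # expands to process_1 ... process_5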

Tips and Best Practices

Job Naming

  • Use descriptive, unique names
  • Avoid spaces and special characters
  • For parameterized jobs, include the parameter in the name (e.g., job_{i})

Dependencies

  • Keep dependency chains as short as possible
  • Use the fan-out/fan-in pattern for parallelism
  • Avoid circular dependencies (the server will reject them)

Schedulers

  • Create separate schedulers for different resource needs
  • Use descriptive names that indicate the scheduler’s purpose
  • Set realistic wall times to avoid queue priority penalties

Actions

  • Use on_workflow_start for initial resource allocation
  • Use on_jobs_ready for just-in-time scheduling
  • Match allocations to the number of parallel jobs

What You Learned

In this tutorial, you learned:

  • How to navigate the five-step workflow wizard
  • How to create jobs with commands, dependencies, and resources
  • How to configure Slurm schedulers for HPC execution
  • How to set up actions for automatic node scheduling
  • How the wizard generates workflow specifications

Next Steps

Tutorial 1: Many Independent Jobs

This tutorial teaches you how to create and run a workflow with many independent parallel jobs using Torc’s parameterization feature.

Learning Objectives

By the end of this tutorial, you will:

  • Understand how to define parameterized jobs that expand into multiple instances
  • Learn how Torc executes independent jobs in parallel
  • Know how to monitor job execution and view results

Prerequisites

  • Torc server running (see Installation)
  • Basic familiarity with YAML syntax

Use Cases

This pattern is ideal for:

  • Parameter sweeps: Testing different configurations
  • Monte Carlo simulations: Running many independent trials
  • Batch processing: Processing many files with the same logic
  • Embarrassingly parallel workloads: Any task that can be split into independent units

Step 1: Start the Torc Server

First, ensure the Torc server is running:

torc-server run

By default, the server listens on port 8080, making the API URL http://localhost:8080/torc-service/v1.

If you use a custom port, set the environment variable:

export TORC_API_URL="http://localhost:8100/torc-service/v1"

Step 2: Create the Workflow Specification

Save the following as hundred_jobs.yaml:

name: hundred_jobs_parallel
description: 100 independent jobs that can run in parallel

jobs:
  - name: job_{i:03d}
    command: |
      echo "Running job {i}"
      sleep $((RANDOM % 10 + 1))
      echo "Job {i} completed"
    resource_requirements: minimal
    parameters:
      i: "1:100"

resource_requirements:
  - name: minimal
    num_cpus: 1
    num_gpus: 0
    num_nodes: 1
    memory: 1g
    runtime: PT5M

Understanding the Specification

Let’s break down the key elements:

  • name: job_{i:03d}: The {i:03d} is a parameter placeholder. The :03d format specifier means “3-digit zero-padded integer”, so jobs will be named job_001, job_002, …, job_100.

  • parameters: i: "1:100": This defines a parameter i that ranges from 1 to 100 (inclusive). Torc will create one job for each value.

  • resource_requirements: minimal: Each job uses the “minimal” resource profile defined below.

When Torc processes this specification, it expands the single job definition into 100 separate jobs, each with its own parameter value substituted.

Step 3: Run the Workflow

Create and run the workflow in one command:

torc run hundred_jobs.yaml

This command:

  1. Creates the workflow on the server
  2. Expands the parameterized job into 100 individual jobs
  3. Initializes the dependency graph (in this case, no dependencies)
  4. Starts executing jobs in parallel

You’ll see output showing the workflow ID and progress.
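
You can confirm the parameter expansion by counting the jobs, using the same JSON output pattern as the later tutorials; replace <workflow_id> with the ID reported above:

# Should print 100 once the parameterized job has been expanded
torc jobs list <workflow_id> -f json | jq '.jobs | length'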

Step 4: Monitor Execution

While the workflow runs, you can monitor progress:

# Check workflow status
torc workflows status <workflow_id>

# List jobs and their states
torc jobs list <workflow_id>

# Or use the interactive TUI
torc tui

Since all 100 jobs are independent (no dependencies between them), Torc will run as many in parallel as your system resources allow.

Step 5: View Results

After completion, check the results:

torc results list <workflow_id>

This shows return codes, execution times, and resource usage for each job.

How It Works

When you run this workflow, Torc:

  1. Expands parameters: The single job definition becomes 100 jobs (job_001 through job_100)
  2. Marks all as ready: Since there are no dependencies, all jobs start in the “ready” state
  3. Executes in parallel: The job runner claims and executes jobs based on available resources
  4. Tracks completion: Each job’s return code and metrics are recorded

The job runner respects the resource requirements you specified. With num_cpus: 1 per job, if your machine has 8 CPUs, approximately 8 jobs will run simultaneously.

What You Learned

In this tutorial, you learned how to:

  • ✅ Use parameter expansion (parameters: i: "1:100") to generate multiple jobs from one definition
  • ✅ Use format specifiers ({i:03d}) for consistent naming
  • ✅ Run independent parallel jobs with torc run
  • ✅ Monitor workflow progress and view results

Example Files

See hundred_jobs_parameterized.yaml for a ready-to-run version of this workflow.

Next Steps

Tutorial 2: Diamond Workflow with File Dependencies

This tutorial teaches you how to create workflows where job dependencies are automatically inferred from file inputs and outputs—a core concept in Torc called implicit dependencies.

Learning Objectives

By the end of this tutorial, you will:

  • Understand how Torc infers job dependencies from file relationships
  • Learn the “diamond” workflow pattern (fan-out and fan-in)
  • Know how to use file variable substitution (${files.input.*} and ${files.output.*})
  • See how jobs automatically unblock when their input files become available

Prerequisites

The Diamond Pattern

The “diamond” pattern is a common workflow structure where:

  1. One job produces multiple outputs (fan-out)
  2. Multiple jobs process those outputs in parallel
  3. A final job combines all results (fan-in)
graph TD
    Input["input.txt"] --> Preprocess["preprocess<br/>(generates intermediate files)"]
    Preprocess --> Int1["intermediate1.txt"]
    Preprocess --> Int2["intermediate2.txt"]

    Int1 --> Work1["work1<br/>(process intermediate1)"]
    Int2 --> Work2["work2<br/>(process intermediate2)"]

    Work1 --> Result1["result1.txt"]
    Work2 --> Result2["result2.txt"]

    Result1 --> Postprocess["postprocess<br/>(combines results)"]
    Result2 --> Postprocess

    Postprocess --> Output["output.txt"]

Notice that we never explicitly say “work1 depends on preprocess”—Torc figures this out automatically because work1 needs intermediate1.txt as input, and preprocess produces it as output.

Step 1: Create the Workflow Specification

Save as diamond.yaml:

name: diamond_workflow
description: Diamond workflow demonstrating fan-out and fan-in

jobs:
  - name: preprocess
    command: |
      cat ${files.input.input_file} |
      awk '{print $1}' > ${files.output.intermediate1}
      cat ${files.input.input_file} |
      awk '{print $2}' > ${files.output.intermediate2}
    resource_requirements: small

  - name: work1
    command: |
      cat ${files.input.intermediate1} |
      sort | uniq > ${files.output.result1}
    resource_requirements: medium

  - name: work2
    command: |
      cat ${files.input.intermediate2} |
      sort | uniq > ${files.output.result2}
    resource_requirements: medium

  - name: postprocess
    command: |
      paste ${files.input.result1} ${files.input.result2} > ${files.output.final_output}
    resource_requirements: small

files:
  - name: input_file
    path: /tmp/input.txt

  - name: intermediate1
    path: /tmp/intermediate1.txt

  - name: intermediate2
    path: /tmp/intermediate2.txt

  - name: result1
    path: /tmp/result1.txt

  - name: result2
    path: /tmp/result2.txt

  - name: final_output
    path: /tmp/output.txt

resource_requirements:
  - name: small
    num_cpus: 1
    num_gpus: 0
    num_nodes: 1
    memory: 1g
    runtime: PT10M

  - name: medium
    num_cpus: 4
    num_gpus: 0
    num_nodes: 1
    memory: 4g
    runtime: PT30M

Understanding File Variable Substitution

The key concept here is file variable substitution:

  • ${files.input.filename} - References a file this job reads (creates a dependency)
  • ${files.output.filename} - References a file this job writes (satisfies dependencies)

When Torc processes the workflow:

  1. It sees preprocess outputs intermediate1 and intermediate2
  2. It sees work1 inputs intermediate1 → dependency created
  3. It sees work2 inputs intermediate2 → dependency created
  4. It sees postprocess inputs result1 and result2 → dependencies created

This is more maintainable than explicit depends_on declarations because:

  • Dependencies are derived from actual data flow
  • Adding a new intermediate step automatically updates dependencies
  • The workflow specification documents the data flow

Step 2: Create Input Data

# Create test input file
echo -e "apple red\nbanana yellow\ncherry red\ndate brown" > /tmp/input.txt

Step 3: Create and Initialize the Workflow

# Create the workflow and capture the ID
WORKFLOW_ID=$(torc workflows create diamond.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"

# Ensure the input file timestamp is current
touch /tmp/input.txt

# Initialize the workflow (builds dependency graph)
torc workflows initialize-jobs $WORKFLOW_ID

The initialize-jobs command is where Torc:

  1. Analyzes file input/output relationships
  2. Builds the dependency graph
  3. Marks jobs with satisfied dependencies as “ready”

Step 4: Observe Dependency Resolution

# Check job statuses
torc jobs list $WORKFLOW_ID

Expected output:

╭────┬──────────────┬─────────┬────────╮
│ ID │ Name         │ Status  │ ...    │
├────┼──────────────┼─────────┼────────┤
│ 1  │ preprocess   │ ready   │ ...    │
│ 2  │ work1        │ blocked │ ...    │
│ 3  │ work2        │ blocked │ ...    │
│ 4  │ postprocess  │ blocked │ ...    │
╰────┴──────────────┴─────────┴────────╯

Only preprocess is ready because:

  • Its only input (input_file) already exists
  • The others are blocked waiting for files that don’t exist yet

Step 5: Run the Workflow

torc run $WORKFLOW_ID

Watch the execution unfold:

  1. preprocess runs first - Creates intermediate1.txt and intermediate2.txt
  2. work1 and work2 unblock - Their input files now exist
  3. work1 and work2 run in parallel - They have no dependency on each other
  4. postprocess unblocks - Both result1.txt and result2.txt exist
  5. postprocess runs - Creates the final output

Step 6: Verify Results

cat /tmp/output.txt

You should see the combined, sorted, unique values from both columns of the input.
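
With the sample input created in Step 2, the result is deterministic: result1 holds the sorted unique fruits and result2 the sorted unique colors, and paste pairs them line by line (the last line has an empty second column because there are four fruits but only three colors). Roughly:

apple	brown
banana	red
cherry	yellow
date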

How Implicit Dependencies Work

Torc determines job order through file relationships:

| Job | Inputs | Outputs | Blocked By |
|---|---|---|---|
| preprocess | input_file | intermediate1, intermediate2 | (nothing) |
| work1 | intermediate1 | result1 | preprocess |
| work2 | intermediate2 | result2 | preprocess |
| postprocess | result1, result2 | final_output | work1, work2 |

The dependency graph is built automatically from these relationships. If you later add a validation step between preprocess and work1, you only need to update the file references—the dependencies adjust automatically.
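
For example, a validation step inserted between preprocess and work1 only needs to read the old intermediate file and write a new one. A rough sketch of the additional entries (merged into the existing jobs: and files: lists); validate.sh and the validated1 file are hypothetical names:

jobs:
  # Hypothetical extra step: validates intermediate1 before work1 consumes it
  - name: validate1
    command: "bash validate.sh -i ${files.input.intermediate1} -o ${files.output.validated1}"
    resource_requirements: small

  # work1 now reads the validated file instead of intermediate1
  - name: work1
    command: |
      cat ${files.input.validated1} |
      sort | uniq > ${files.output.result1}
    resource_requirements: medium

files:
  - name: validated1
    path: /tmp/validated1.txt

Because the dependencies are derived from the file references, validate1 is automatically blocked by preprocess and work1 is automatically blocked by validate1 — no depends_on edits are needed.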

What You Learned

In this tutorial, you learned:

  • ✅ How to define files with files: section and reference them in jobs
  • ✅ How ${files.input.*} creates implicit dependencies
  • ✅ How ${files.output.*} satisfies dependencies for downstream jobs
  • ✅ The diamond pattern: fan-out → parallel processing → fan-in
  • ✅ How Torc automatically determines execution order from data flow

When to Use File Dependencies vs Explicit Dependencies

Use file dependencies when:

  • Jobs actually read/write files
  • Data flow defines the natural ordering
  • You want self-documenting workflows

Use explicit depends_on when:

  • Dependencies are logical, not data-based
  • Jobs communicate through side effects
  • You need precise control over ordering

Example Files

See the diamond workflow examples in all three formats:

A Python version is also available: diamond_workflow.py

Next Steps

Tutorial 3: User Data Dependencies

This tutorial teaches you how to pass structured data (JSON) between jobs using Torc’s user_data feature—an alternative to file-based dependencies that stores data directly in the database.

Learning Objectives

By the end of this tutorial, you will:

  • Understand what user_data is and when to use it instead of files
  • Learn how to define user_data entries and reference them in jobs
  • Know how to update user_data from within a job
  • See how user_data creates implicit dependencies (like files)

Prerequisites

What is User Data?

User data is Torc’s mechanism for passing small, structured data between jobs without creating actual files. The data is stored in the Torc database and can be:

  • JSON objects (configurations, parameters)
  • Arrays
  • Simple values (strings, numbers)

Like files, user_data creates implicit dependencies: a job that reads user_data will be blocked until the job that writes it completes.

User Data vs Files

| Feature | User Data | Files |
|---|---|---|
| Storage | Torc database | Filesystem |
| Size | Small (KB) | Any size |
| Format | JSON | Any format |
| Access | Via torc user-data CLI | Direct file I/O |
| Best for | Config, params, metadata | Datasets, binaries, logs |

Step 1: Create the Workflow Specification

Save as user_data_workflow.yaml:

name: config_pipeline
description: Jobs that pass configuration via user_data

jobs:
  - name: generate_config
    command: |
      echo '{"learning_rate": 0.001, "batch_size": 32, "epochs": 10}' > /tmp/config.json
      torc user-data update ${user_data.output.ml_config} \
        --data "$(cat /tmp/config.json)"
    resource_requirements: minimal

  - name: train_model
    command: |
      echo "Training with config:"
      torc user-data get ${user_data.input.ml_config} | jq '.data'
      # In a real workflow: python train.py --config="${user_data.input.ml_config}"
    resource_requirements: gpu_large

  - name: evaluate_model
    command: |
      echo "Evaluating with config:"
      torc user-data get ${user_data.input.ml_config} | jq '.data'
      # In a real workflow: python evaluate.py --config="${user_data.input.ml_config}"
    resource_requirements: gpu_small

user_data:
  - name: ml_config
    data: null  # Will be populated by generate_config job

resource_requirements:
  - name: minimal
    num_cpus: 1
    memory: 1g
    runtime: PT5M

  - name: gpu_small
    num_cpus: 4
    num_gpus: 1
    memory: 16g
    runtime: PT1H

  - name: gpu_large
    num_cpus: 8
    num_gpus: 2
    memory: 32g
    runtime: PT4H

Understanding the Specification

Key elements:

  • user_data: section - Defines data entries, similar to files:
  • data: null - Initial value; will be populated by a job
  • ${user_data.output.ml_config} - Job will write to this user_data (creates it)
  • ${user_data.input.ml_config} - Job reads from this user_data (creates dependency)

The dependency flow:

  1. generate_config outputs ml_config → runs first
  2. train_model and evaluate_model input ml_config → blocked until step 1 completes
  3. After generate_config finishes, both become ready and can run in parallel

Step 2: Create and Initialize the Workflow

# Create the workflow
WORKFLOW_ID=$(torc workflows create user_data_workflow.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"

# Initialize jobs
torc workflows initialize-jobs $WORKFLOW_ID

Step 3: Check Initial State

Before running, examine the user_data:

# Check user_data - should be null
torc user-data list $WORKFLOW_ID

Output:

╭────┬───────────┬──────┬─────────────╮
│ ID │ Name      │ Data │ Workflow ID │
├────┼───────────┼──────┼─────────────┤
│ 1  │ ml_config │ null │ 1           │
╰────┴───────────┴──────┴─────────────╯

Check job statuses:

torc jobs list $WORKFLOW_ID

You should see:

  • generate_config: ready (no input dependencies)
  • train_model: blocked (waiting for ml_config)
  • evaluate_model: blocked (waiting for ml_config)

Step 4: Run the Workflow

torc run $WORKFLOW_ID

Step 5: Observe the Data Flow

After generate_config completes, check the updated user_data:

torc user-data list $WORKFLOW_ID -f json | jq '.[] | {name, data}'

Output:

{
  "name": "ml_config",
  "data": {
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 10
  }
}

The data is now stored in the database. At this point:

  • train_model and evaluate_model unblock
  • Both can read the configuration and run in parallel

Step 6: Verify Completion

After the workflow completes:

torc results list $WORKFLOW_ID

All three jobs should show return code 0.

How User Data Dependencies Work

The mechanism is identical to file dependencies:

| Syntax | Meaning | Effect |
|---|---|---|
| ${user_data.input.name} | Job reads this data | Creates dependency on producer |
| ${user_data.output.name} | Job writes this data | Satisfies dependencies |

Torc substitutes these variables with the actual user_data ID at runtime, and the torc user-data CLI commands use that ID to read/write the data.

Accessing User Data in Your Code

From within a job, you can:

Read user_data:

# Get the full record
torc user-data get $USER_DATA_ID

# Get just the data field
torc user-data get $USER_DATA_ID | jq '.data'

# Save to a file for your application
torc user-data get $USER_DATA_ID | jq '.data' > config.json

Write user_data:

# Update with JSON data
torc user-data update $USER_DATA_ID --data '{"key": "value"}'

# Update from a file
torc user-data update $USER_DATA_ID --data "$(cat results.json)"

What You Learned

In this tutorial, you learned:

  • ✅ What user_data is: structured data stored in the Torc database
  • ✅ When to use it: configurations, parameters, metadata (not large files)
  • ✅ How to define user_data entries with the user_data: section
  • ✅ How ${user_data.input.*} and ${user_data.output.*} create dependencies
  • ✅ How to read and write user_data from within jobs

Common Patterns

Dynamic Configuration Generation

jobs:
  - name: analyze_data
    command: |
      # Analyze data and determine optimal parameters
      OPTIMAL_LR=$(python analyze.py --find-optimal-lr)
      torc user-data update ${user_data.output.optimal_params} \
        --data "{\"learning_rate\": $OPTIMAL_LR}"

Collecting Results from Multiple Jobs

jobs:
  - name: worker_{i}
    command: |
      RESULT=$(python process.py --id {i})
      torc user-data update ${user_data.output.result_{i}} --data "$RESULT"
    parameters:
      i: "1:10"

  - name: aggregate
    command: |
      # Collect all results
      for i in $(seq 1 10); do
        torc user-data get ${user_data.input.result_$i} >> all_results.json
      done
      python aggregate.py all_results.json

Next Steps

Tutorial 4: Simple Job Parameterization

This tutorial teaches you how to create parameter sweeps—generating multiple related jobs from a single job definition using Torc’s parameterization feature.

Learning Objectives

By the end of this tutorial, you will:

  • Understand how parameterization expands one job definition into many jobs
  • Learn the different parameter formats (lists, ranges)
  • Know how to use format specifiers for consistent naming
  • See how parameterization combines with file dependencies

Prerequisites

Why Parameterization?

Without parameterization, a 5-value hyperparameter sweep would require writing 5 separate job definitions. With parameterization, you write one definition and Torc expands it:

# Without parameterization: 5 separate definitions
jobs:
  - name: train_lr0.0001
    command: python train.py --lr=0.0001
  - name: train_lr0.0005
    command: python train.py --lr=0.0005
  # ... 3 more ...

# With parameterization: 1 definition
jobs:
  - name: train_lr{lr:.4f}
    command: python train.py --lr={lr}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"

Step 1: Create the Workflow Specification

Save as learning_rate_sweep.yaml:

name: lr_sweep
description: Test different learning rates

jobs:
  - name: train_lr{lr:.4f}
    command: |
      python train.py \
        --learning-rate={lr} \
        --output=/models/model_lr{lr:.4f}.pt
    resource_requirements: gpu
    output_files:
      - model_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"

  - name: evaluate_lr{lr:.4f}
    command: |
      python evaluate.py \
        --model=/models/model_lr{lr:.4f}.pt \
        --output=/results/metrics_lr{lr:.4f}.json
    resource_requirements: gpu
    input_files:
      - model_lr{lr:.4f}
    output_files:
      - metrics_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"

  - name: compare_results
    command: |
      python compare.py --input-dir=/results --output=/results/comparison.csv
    resource_requirements: minimal
    input_files:
      - metrics_lr{lr:.4f}
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"

files:
  - name: model_lr{lr:.4f}
    path: /models/model_lr{lr:.4f}.pt
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"

  - name: metrics_lr{lr:.4f}
    path: /results/metrics_lr{lr:.4f}.json
    parameters:
      lr: "[0.0001,0.0005,0.001,0.005,0.01]"

resource_requirements:
  - name: gpu
    num_cpus: 8
    num_gpus: 1
    memory: 16g
    runtime: PT2H

  - name: minimal
    num_cpus: 1
    memory: 2g
    runtime: PT10M

Understanding the Specification

Parameter Syntax:

  • {lr} - Simple substitution with the parameter value
  • {lr:.4f} - Format specifier: 4 decimal places (e.g., 0.0010 not 0.001)

Parameter Values:

  • "[0.0001,0.0005,0.001,0.005,0.01]" - A list of 5 specific values

File Parameterization: Notice that both jobs AND files have parameters:. When Torc expands:

  • Each train_lr{lr:.4f} job gets a corresponding model_lr{lr:.4f} file
  • The file dependencies are matched by parameter value

Dependency Flow:

  1. train_lr0.0001 → outputs model_lr0.0001 → unblocks evaluate_lr0.0001
  2. train_lr0.0005 → outputs model_lr0.0005 → unblocks evaluate_lr0.0005
  3. (and so on for each learning rate)
  4. All evaluate_* jobs → unblock compare_results

Step 2: Create and Initialize the Workflow

WORKFLOW_ID=$(torc workflows create learning_rate_sweep.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"

torc workflows initialize-jobs $WORKFLOW_ID

Step 3: Verify the Expansion

# Count jobs (should be 11: 5 train + 5 evaluate + 1 compare)
torc jobs list $WORKFLOW_ID -f json | jq '.jobs | length'

List the job names:

torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[].name' | sort

Output:

compare_results
evaluate_lr0.0001
evaluate_lr0.0005
evaluate_lr0.0010
evaluate_lr0.0050
evaluate_lr0.0100
train_lr0.0001
train_lr0.0005
train_lr0.0010
train_lr0.0050
train_lr0.0100

Notice:

  • One job per parameter value for train_* and evaluate_*
  • Only one compare_results job (it has the parameter for dependencies, but doesn’t expand because its name has no {lr})

Step 4: Check Dependencies

torc jobs list $WORKFLOW_ID

Expected statuses:

  • All train_* jobs: ready (no input dependencies)
  • All evaluate_* jobs: blocked (waiting for corresponding model file)
  • compare_results: blocked (waiting for all metrics files)

Step 5: Run the Workflow

torc run $WORKFLOW_ID

Execution flow:

  1. All 5 training jobs run in parallel - They have no dependencies on each other
  2. Each evaluation unblocks independently - When train_lr0.0001 finishes, evaluate_lr0.0001 can start (doesn’t wait for other training jobs)
  3. Compare runs last - Only after all 5 evaluations complete

This is more efficient than a simple two-stage workflow because evaluations can start as soon as their specific training job completes.

Parameter Format Reference

List Format

Explicit list of values:

parameters:
  lr: "[0.0001,0.0005,0.001,0.005,0.01]"  # Numbers
  opt: "['adam','sgd','rmsprop']"          # Strings (note the quotes)

Range Format

For integer or float sequences:

parameters:
  i: "1:100"        # Integers 1 to 100 (inclusive)
  i: "0:100:10"     # Integers 0, 10, 20, ..., 100 (with step)
  lr: "0.0:1.0:0.1" # Floats 0.0, 0.1, 0.2, ..., 1.0

Format Specifiers

Control how values appear in names:

| Specifier | Example Value | Result |
|---|---|---|
| {i} | 5 | 5 |
| {i:03d} | 5 | 005 |
| {lr:.4f} | 0.001 | 0.0010 |
| {lr:.2e} | 0.001 | 1.00e-03 |
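
These specifiers behave like standard printf-style formatting, so you can preview how a value will appear in a job name from the shell. This is only a rough analogy, not a Torc command:

printf '%03d\n' 5      # 005      (like {i:03d})
printf '%.4f\n' 0.001  # 0.0010   (like {lr:.4f})
printf '%.2e\n' 0.001  # 1.00e-03 (like {lr:.2e})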

How Parameterization and File Dependencies Interact

When both jobs and files are parameterized with the same parameter:

jobs:
  - name: train_{i}
    output_files: [model_{i}]
    parameters:
      i: "1:3"

  - name: eval_{i}
    input_files: [model_{i}]
    parameters:
      i: "1:3"

files:
  - name: model_{i}
    path: /models/model_{i}.pt
    parameters:
      i: "1:3"

Torc creates these relationships:

  • train_1 → model_1 → eval_1
  • train_2 → model_2 → eval_2
  • train_3 → model_3 → eval_3

Each chain is independent—eval_2 doesn’t wait for train_1.

Parameter Modes: Product vs Zip

By default, multiple parameters create a Cartesian product (all combinations). For paired parameters, use parameter_mode: zip:

jobs:
  # Default (product): 3 × 3 = 9 jobs
  - name: train_{dataset}_{model}
    command: python train.py --dataset={dataset} --model={model}
    parameters:
      dataset: "['cifar10', 'mnist', 'imagenet']"
      model: "['resnet', 'vgg', 'transformer']"

  # Zip mode: 3 paired jobs (cifar10+resnet, mnist+vgg, imagenet+transformer)
  - name: paired_{dataset}_{model}
    command: python train.py --dataset={dataset} --model={model}
    parameters:
      dataset: "['cifar10', 'mnist', 'imagenet']"
      model: "['resnet', 'vgg', 'transformer']"
    parameter_mode: zip

Use zip mode when parameters have a 1:1 correspondence (e.g., input/output file pairs, pre-determined configurations).

See Parameterization Reference for details.

What You Learned

In this tutorial, you learned:

  • ✅ How to use parameters: to expand one job definition into many
  • ✅ List format ("[a,b,c]") and range format ("1:100")
  • ✅ Format specifiers ({i:03d}, {lr:.4f}) for consistent naming
  • ✅ How parameterized files create one-to-one dependencies
  • ✅ The efficiency of parameter-matched dependencies (each chain runs independently)
  • ✅ The difference between product (default) and zip parameter modes

Next Steps

Tutorial 5: Advanced Multi-Dimensional Parameterization

This tutorial teaches you how to create multi-dimensional parameter sweeps—grid searches over multiple hyperparameters that generate all combinations automatically.

Learning Objectives

By the end of this tutorial, you will:

  • Understand how multiple parameters create a Cartesian product (all combinations)
  • Learn to structure complex workflows with data preparation, training, and aggregation stages
  • Know how to combine parameterization with explicit dependencies
  • See patterns for running large grid searches on HPC systems

Prerequisites

Multi-Dimensional Parameters: Cartesian Product

When a job has multiple parameters, Torc creates the Cartesian product—every combination of values:

parameters:
  lr: "[0.001,0.01]"   # 2 values
  bs: "[16,32]"        # 2 values

This generates 2 × 2 = 4 jobs:

  • lr=0.001, bs=16
  • lr=0.001, bs=32
  • lr=0.01, bs=16
  • lr=0.01, bs=32

With three parameters:

parameters:
  lr: "[0.0001,0.001,0.01]"  # 3 values
  bs: "[16,32,64]"            # 3 values
  opt: "['adam','sgd']"       # 2 values

This generates 3 × 3 × 2 = 18 jobs.

Step 1: Create the Workflow Specification

Save as grid_search.yaml:

name: hyperparameter_grid_search
description: 3D grid search over learning rate, batch size, and optimizer

jobs:
  # Data preparation (runs once, no parameters)
  - name: prepare_data
    command: python prepare_data.py --output=/data/processed.pkl
    resource_requirements: data_prep
    output_files:
      - training_data

  # Training jobs (one per parameter combination)
  - name: train_lr{lr:.4f}_bs{bs}_opt{opt}
    command: |
      python train.py \
        --data=/data/processed.pkl \
        --learning-rate={lr} \
        --batch-size={bs} \
        --optimizer={opt} \
        --output=/models/model_lr{lr:.4f}_bs{bs}_opt{opt}.pt \
        --metrics=/results/metrics_lr{lr:.4f}_bs{bs}_opt{opt}.json
    resource_requirements: gpu_training
    input_files:
      - training_data
    output_files:
      - model_lr{lr:.4f}_bs{bs}_opt{opt}
      - metrics_lr{lr:.4f}_bs{bs}_opt{opt}
    parameters:
      lr: "[0.0001,0.001,0.01]"
      bs: "[16,32,64]"
      opt: "['adam','sgd']"

  # Aggregate results (depends on ALL training jobs via file dependencies)
  - name: aggregate_results
    command: |
      python aggregate.py \
        --input-dir=/results \
        --output=/results/summary.csv
    resource_requirements: minimal
    input_files:
      - metrics_lr{lr:.4f}_bs{bs}_opt{opt}
    parameters:
      lr: "[0.0001,0.001,0.01]"
      bs: "[16,32,64]"
      opt: "['adam','sgd']"

  # Find best model (explicit dependency, no parameters)
  - name: select_best_model
    command: |
      python select_best.py \
        --summary=/results/summary.csv \
        --output=/results/best_config.json
    resource_requirements: minimal
    depends_on:
      - aggregate_results

files:
  - name: training_data
    path: /data/processed.pkl

  - name: model_lr{lr:.4f}_bs{bs}_opt{opt}
    path: /models/model_lr{lr:.4f}_bs{bs}_opt{opt}.pt
    parameters:
      lr: "[0.0001,0.001,0.01]"
      bs: "[16,32,64]"
      opt: "['adam','sgd']"

  - name: metrics_lr{lr:.4f}_bs{bs}_opt{opt}
    path: /results/metrics_lr{lr:.4f}_bs{bs}_opt{opt}.json
    parameters:
      lr: "[0.0001,0.001,0.01]"
      bs: "[16,32,64]"
      opt: "['adam','sgd']"

resource_requirements:
  - name: data_prep
    num_cpus: 8
    memory: 32g
    runtime: PT1H

  - name: gpu_training
    num_cpus: 8
    num_gpus: 1
    memory: 16g
    runtime: PT4H

  - name: minimal
    num_cpus: 1
    memory: 2g
    runtime: PT10M

Understanding the Structure

Four-stage workflow:

  1. prepare_data (1 job) - No parameters, runs once
  2. train_* (18 jobs) - Parameterized, all depend on prepare_data
  3. aggregate_results (1 job) - Has parameters only for file dependency matching
  4. select_best_model (1 job) - Explicit dependency on aggregate_results

Key insight: Why aggregate_results has parameters

The aggregate_results job won’t expand into multiple jobs (its name has no {}). However, it needs parameters: to match the parameterized input_files. This tells Torc: “this job depends on ALL 18 metrics files.”

Step 2: Create and Initialize the Workflow

WORKFLOW_ID=$(torc workflows create grid_search.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"

torc workflows initialize-jobs $WORKFLOW_ID

Step 3: Verify the Expansion

Count the jobs:

torc jobs list $WORKFLOW_ID -f json | jq '.jobs | length'

Expected: 21 jobs (1 prepare + 18 training + 1 aggregate + 1 select)

List the training jobs:

torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[] | select(.name | startswith("train_")) | .name' | sort

Output (18 training jobs):

train_lr0.0001_bs16_optadam
train_lr0.0001_bs16_optsgd
train_lr0.0001_bs32_optadam
train_lr0.0001_bs32_optsgd
train_lr0.0001_bs64_optadam
train_lr0.0001_bs64_optsgd
train_lr0.0010_bs16_optadam
train_lr0.0010_bs16_optsgd
train_lr0.0010_bs32_optadam
train_lr0.0010_bs32_optsgd
train_lr0.0010_bs64_optadam
train_lr0.0010_bs64_optsgd
train_lr0.0100_bs16_optadam
train_lr0.0100_bs16_optsgd
train_lr0.0100_bs32_optadam
train_lr0.0100_bs32_optsgd
train_lr0.0100_bs64_optadam
train_lr0.0100_bs64_optsgd

Step 4: Examine the Dependency Graph

torc jobs list $WORKFLOW_ID

Initial states:

  • prepare_data: ready (no dependencies)
  • All train_*: blocked (waiting for training_data file)
  • aggregate_results: blocked (waiting for all 18 metrics files)
  • select_best_model: blocked (waiting for aggregate_results)

Step 5: Run the Workflow

For local execution:

torc run $WORKFLOW_ID

Execution flow:

  1. prepare_data runs and produces training_data
  2. All 18 train_* jobs unblock and run in parallel (resource-limited)
  3. aggregate_results waits for all training jobs, then runs
  4. select_best_model runs last

Step 6: Monitor Progress

# Check status summary
torc workflows status $WORKFLOW_ID

# Watch job completion in real-time
watch -n 10 "torc jobs list-by-status $WORKFLOW_ID"

# Or use the TUI
torc tui

Step 7: Retrieve Results

After completion:

# View best configuration
cat /results/best_config.json

# View summary of all runs
cat /results/summary.csv

Scaling Considerations

Job Count Growth

Multi-dimensional parameters grow exponentially:

| Dimensions | Values per Dimension | Total Jobs |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 10 × 10 | 100 |
| 3 | 10 × 10 × 10 | 1,000 |
| 4 | 10 × 10 × 10 × 10 | 10,000 |

Dependency Count

Without barriers, dependencies also grow quickly. In this tutorial:

  • 18 training jobs each depend on 1 file = 18 dependencies
  • 1 aggregate job depends on 18 files = 18 dependencies
  • Total: ~36 dependencies

For larger sweeps (1000+ jobs), consider the barrier pattern to reduce dependencies from O(n²) to O(n).

Common Patterns

Mixing Fixed and Parameterized Jobs

jobs:
  # Fixed job (no parameters)
  - name: setup
    command: ./setup.sh

  # Parameterized jobs depend on fixed job
  - name: experiment_{i}
    command: ./run.sh {i}
    depends_on: [setup]
    parameters:
      i: "1:100"

Aggregating Parameterized Results

Use the file dependency pattern shown in this tutorial:

  - name: aggregate
    input_files:
      - result_{i}    # Matches all parameterized result files
    parameters:
      i: "1:100"      # Same parameters as producer jobs

Nested Parameter Sweeps

For workflows with multiple independent sweeps:

jobs:
  # Sweep 1
  - name: sweep1_job_{a}
    parameters:
      a: "1:10"

  # Sweep 2 (independent of sweep 1)
  - name: sweep2_job_{b}
    parameters:
      b: "1:10"

What You Learned

In this tutorial, you learned:

  • ✅ How multiple parameters create a Cartesian product of jobs
  • ✅ How to structure multi-stage workflows (prep → train → aggregate → select)
  • ✅ How to use parameters in file dependencies to collect all outputs
  • ✅ How to mix parameterized and non-parameterized jobs
  • ✅ Scaling considerations for large grid searches

Example Files

See these example files for hyperparameter sweep patterns:

Next Steps

Multi-Stage Workflows with Barriers

This tutorial teaches you how to efficiently structure workflows with multiple stages using the barrier pattern. This is essential for scaling workflows to thousands of jobs.

Learning Objectives

By the end of this tutorial, you will:

  • Understand the quadratic dependency problem in multi-stage workflows
  • Use barrier jobs to efficiently synchronize between stages
  • Scale workflows to thousands of jobs with minimal overhead
  • Know when to use barriers vs. direct dependencies

Prerequisites

The Problem: Quadratic Dependencies

Let’s start with a common but inefficient pattern. Suppose you want to:

  1. Stage 1: Run 1000 preprocessing jobs in parallel
  2. Stage 2: Run 1000 analysis jobs, but only after ALL stage 1 jobs complete
  3. Stage 3: Run a final aggregation job

Naive Approach (DON’T DO THIS!)

name: "Inefficient Multi-Stage Workflow"
description: "This creates 1,000,000 dependencies!"

jobs:
  # Stage 1: 1000 preprocessing jobs
  - name: "preprocess_{i:03d}"
    command: "python preprocess.py --id {i}"
    parameters:
      i: "0:999"

  # Stage 2: Each analysis job waits for ALL preprocessing jobs
  - name: "analyze_{i:03d}"
    command: "python analyze.py --id {i}"
    depends_on_regexes: ["^preprocess_.*"]  # ⚠️ Creates 1,000,000 dependencies!
    parameters:
      i: "0:999"

  # Stage 3: Final aggregation
  - name: "final_report"
    command: "python generate_report.py"
    depends_on_regexes: ["^analyze_.*"]  # ⚠️ Creates 1,000 more dependencies

Why This is Bad

When Torc expands this workflow:

  • Each of the 1000 analyze_* jobs gets a dependency on each of the 1000 preprocess_* jobs
  • Total dependencies: 1000 × 1000 = 1,000,000 relationships
  • Workflow creation takes minutes instead of seconds
  • Database becomes bloated with dependency records
  • Job initialization is slow

The Solution: Barrier Jobs

A barrier job is a lightweight synchronization point that:

  • Depends on all jobs from the previous stage (using a regex)
  • Is depended upon by all jobs in the next stage
  • Reduces dependencies from O(n²) to O(n)

Efficient Approach (DO THIS!)

name: "Efficient Multi-Stage Workflow"
description: "Uses barrier pattern with only ~3000 dependencies"

jobs:
  # ═══════════════════════════════════════════════════════════
  # STAGE 1: Preprocessing (1000 parallel jobs)
  # ═══════════════════════════════════════════════════════════
  - name: "preprocess_{i:03d}"
    command: "python preprocess.py --id {i} --output data/stage1_{i:03d}.json"
    resource_requirements: "medium"
    parameters:
      i: "0:999"

  # ═══════════════════════════════════════════════════════════
  # BARRIER: Wait for ALL stage 1 jobs
  # ═══════════════════════════════════════════════════════════
  - name: "barrier_stage1_complete"
    command: "echo 'Stage 1 complete: 1000 files preprocessed' && date"
    resource_requirements: "tiny"
    depends_on_regexes: ["^preprocess_.*"]  # ✓ 1000 dependencies

  # ═══════════════════════════════════════════════════════════
  # STAGE 2: Analysis (1000 parallel jobs)
  # ═══════════════════════════════════════════════════════════
  - name: "analyze_{i:03d}"
    command: "python analyze.py --input data/stage1_{i:03d}.json --output data/stage2_{i:03d}.csv"
    resource_requirements: "large"
    depends_on: ["barrier_stage1_complete"]  # ✓ 1000 dependencies (one per job)
    parameters:
      i: "0:999"

  # ═══════════════════════════════════════════════════════════
  # BARRIER: Wait for ALL stage 2 jobs
  # ═══════════════════════════════════════════════════════════
  - name: "barrier_stage2_complete"
    command: "echo 'Stage 2 complete: 1000 analyses finished' && date"
    resource_requirements: "tiny"
    depends_on_regexes: ["^analyze_.*"]  # ✓ 1000 dependencies

  # ═══════════════════════════════════════════════════════════
  # STAGE 3: Final report (single job)
  # ═══════════════════════════════════════════════════════════
  - name: "final_report"
    command: "python generate_report.py --output final_report.html"
    resource_requirements: "medium"
    depends_on: ["barrier_stage2_complete"]  # ✓ 1 dependency

resource_requirements:
  - name: "tiny"
    num_cpus: 1
    num_gpus: 0
    num_nodes: 1
    memory: "100m"
    runtime: "PT1M"

  - name: "medium"
    num_cpus: 4
    num_gpus: 0
    num_nodes: 1
    memory: "4g"
    runtime: "PT30M"

  - name: "large"
    num_cpus: 16
    num_gpus: 1
    num_nodes: 1
    memory: "32g"
    runtime: "PT2H"

Dependency Breakdown

Without barriers:

  • Stage 1 → Stage 2: 1000 × 1000 = 1,000,000 dependencies
  • Stage 2 → Stage 3: 1000 = 1,000 dependencies
  • Total: 1,001,000 dependencies

With barriers:

  • Stage 1 → Barrier 1: 1,000 dependencies
  • Barrier 1 → Stage 2: 1,000 dependencies
  • Stage 2 → Barrier 2: 1,000 dependencies
  • Barrier 2 → Stage 3: 1 dependency
  • Total: 3,001 dependencies (a 333× improvement!)

Step-by-Step: Creating Your First Barrier Workflow

Let’s create a simple 2-stage workflow.

Step 1: Create the Workflow Spec

Create barrier_demo.yaml:

name: "Barrier Pattern Demo"
description: "Simple demonstration of the barrier pattern"

jobs:
  # Stage 1: Generate 100 data files
  - name: "generate_data_{i:02d}"
    command: "echo 'Data file {i}' > output/data_{i:02d}.txt"
    parameters:
      i: "0:99"

  # Barrier: Wait for all data generation
  - name: "data_generation_complete"
    command: "echo 'All 100 data files generated' && ls -l output/ | wc -l"
    depends_on_regexes: ["^generate_data_.*"]

  # Stage 2: Process each data file
  - name: "process_data_{i:02d}"
    command: "cat output/data_{i:02d}.txt | wc -w > output/processed_{i:02d}.txt"
    depends_on: ["data_generation_complete"]
    parameters:
      i: "0:99"

  # Final barrier and report
  - name: "processing_complete"
    command: "echo 'All 100 files processed' && cat output/processed_*.txt | awk '{sum+=$1} END {print sum}'"
    depends_on_regexes: ["^process_data_.*"]

Step 2: Create the Output Directory

mkdir -p output

Step 3: Create the Workflow

torc workflows create barrier_demo.yaml

You should see output like:

Created workflow with ID: 1
- Created 100 stage 1 jobs
- Created 1 barrier job
- Created 100 stage 2 jobs
- Created 1 final barrier
Total: 202 jobs, ~300 dependencies

Compare this to 10,000 dependencies without barriers!

Step 4: Run the Workflow

torc workflows run 1

Step 5: Monitor Progress

torc tui

You’ll see:

  1. All 100 generate_data_* jobs run in parallel
  2. Once they finish, data_generation_complete executes
  3. Then all 100 process_data_* jobs run in parallel
  4. Finally, processing_complete executes

Making Effective Barrier Jobs

1. Keep Barriers Lightweight

Barriers should be quick and cheap:

✓ GOOD - Lightweight logging
- name: "stage1_complete"
  command: "echo 'Stage 1 done' && date"
  resource_requirements: "tiny"

✗ BAD - Heavy computation
- name: "stage1_complete"
  command: "python expensive_validation.py"  # Don't do this!
  resource_requirements: "large"

If you need validation, create a separate job:

# Barrier - lightweight
- name: "stage1_complete"
  command: "echo 'Stage 1 done'"
  resource_requirements: "tiny"
  depends_on_regexes: ["^stage1_.*"]

# Validation - heavier
- name: "validate_stage1"
  command: "python validate_all_outputs.py"
  resource_requirements: "medium"
  depends_on: ["stage1_complete"]

# Stage 2 depends on validation passing
- name: "stage2_job_{i}"
  command: "python stage2.py {i}"
  depends_on: ["validate_stage1"]
  parameters:
    i: "0:999"

2. Use Descriptive Names

Names should clearly indicate what stage completed:

✓ GOOD
- name: "barrier_preprocessing_complete"
- name: "barrier_training_complete"
- name: "all_simulations_finished"

✗ BAD
- name: "barrier1"
- name: "sync"
- name: "wait"

3. Add Useful Information

Make barriers informative:

- name: "preprocessing_complete"
  command: |
    echo "════════════════════════════════════════"
    echo "Preprocessing Complete: $(date)"
    echo "Files generated: $(ls output/stage1_*.json | wc -l)"
    echo "Total size: $(du -sh output/)"
    echo "Proceeding to analysis stage..."
    echo "════════════════════════════════════════"
  depends_on_regexes: ["^preprocess_.*"]

4. Be Careful with Regex Patterns

Ensure your regex matches exactly what you intend:

✓ GOOD - Anchored patterns
depends_on_regexes: ["^stage1_job_.*"]      # Matches "stage1_job_001", "stage1_job_042"
depends_on_regexes: ["^preprocess_\\d+$"]   # Matches "preprocess_0", "preprocess_999"

✗ BAD - Too broad
depends_on_regexes: ["stage1"]              # Matches "my_stage1_test" (unintended!)
depends_on_regexes: [".*"]                  # Matches EVERYTHING (disaster!)

Test your regex before deploying:

# Python regex tester
python3 -c "import re; print(re.match(r'^stage1_job_.*', 'stage1_job_001'))"

When NOT to Use Barriers

Barriers are not always the right solution:

1. One-to-One Dependencies

When each job in stage 2 only needs its corresponding stage 1 job:

# DON'T use a barrier here
jobs:
  - name: "preprocess_{i}"
    command: "preprocess.py {i}"
    output_files: ["data_{i}.json"]
    parameters:
      i: "0:99"

  # Each analysis only needs its own preprocessed file
  - name: "analyze_{i}"
    command: "analyze.py {i}"
    input_files: ["data_{i}.json"]  # ✓ Automatic dependency via files
    parameters:
      i: "0:99"

The file dependency system already handles this efficiently!

2. Specific Dependencies in DAGs

When you have a directed acyclic graph (DAG) with specific paths:

# Diamond pattern - specific dependencies
jobs:
  - name: "fetch_data"
    command: "fetch.py"

  - name: "process_weather"
    command: "process_weather.py"
    depends_on: ["fetch_data"]

  - name: "process_traffic"
    command: "process_traffic.py"
    depends_on: ["fetch_data"]

  - name: "generate_report"
    command: "report.py"
    depends_on: ["process_weather", "process_traffic"]  # ✓ Specific dependencies

Don’t force this into stages - the specific dependencies are clearer!

3. Small Workflows

For small workflows (< 100 jobs), the overhead of barriers isn’t worth it:

# Only 10 jobs - barriers not needed
jobs:
  - name: "job_{i}"
    command: "process.py {i}"
    depends_on_regexes: ["^prepare_.*"]  # This is fine for 10 jobs
    parameters:
      i: "0:9"

Scaling to Thousands of Jobs

The barrier pattern scales beautifully. Let’s compare performance:

  Stage 1 Jobs   Stage 2 Jobs   Without Barriers           With Barriers        Speedup
  100            100            10,000 deps (~1s)          200 deps (<0.1s)     10×
  1,000          1,000          1,000,000 deps (~45s)      2,000 deps (~0.5s)   90×
  10,000         10,000         100,000,000 deps (hours)   20,000 deps (~5s)    1000×+

As you can see, barriers become essential for large-scale workflows.

Complete Example

See multi_stage_barrier_pattern.yaml for a comprehensive example with:

  • 3 distinct stages (1000 → 1000 → 100 jobs)
  • Informative barrier jobs with progress logging
  • Different resource requirements per stage
  • Comments explaining the pattern

Summary

  • Use barrier jobs when all jobs in one stage must complete before any job in the next stage starts
  • Use file/data dependencies for one-to-one job relationships
  • Use specific dependencies for DAG patterns with clear paths
  • Keep barriers lightweight - just logging and simple checks
  • Use descriptive names to track workflow progress

The barrier pattern is your key to scaling Torc workflows from hundreds to thousands of jobs efficiently!

Next Steps

  • Try modifying the demo workflow to have 3 or more stages
  • Experiment with adding validation logic to barrier jobs
  • Check out Advanced Parameterization for creating complex multi-stage pipelines
  • Learn about Workflow Actions for conditional execution between stages

Map a Python function to compute nodes

This tutorial will teach you how to build a workflow from Python functions instead of CLI executables and run it on an HPC with Slurm.

Pre-requisites

This tutorial requires installation of the Python package torc-client. Until the latest version is published on pypi.org, you must clone the repository and install the package in a virtual environment. Use Python 3.11 or later.

git clone https://github.com/NREL/torc
cd torc/python_client
python -m venv .venv
source .venv/bin/activate
pip install -e .

Workflow Description

Let’s suppose that your code is in a module called simulation.py and looks something like this:

def run(job_name: str, input_params: dict) -> dict:
    """Runs one simulation on a set of input parameters.

    Parameters
    ----------
    job_name : str
        Name of the job.
    input_params : dict
        Input parameters for the simulation.

    Returns
    -------
    dict
        Result of the simulation.
    """
    return {
        "inputs": input_params,
        "result": 5,
        "output_data_path": f"/projects/my-project/{job_name}",
    }


def postprocess(results: list[dict]) -> dict:
    """Collects the results of the workers and performs post-processing.

    Parameters
    ----------
    results : list[dict]
        Results from each simulation

    Returns
    -------
    dict
        Final result
    """
    total = 0
    paths = []
    for result in results:
        assert "result" in result
        assert "output_data_path" in result
        total += result["result"]
        paths.append(result["output_data_path"])
    return {"total": total, "output_data_paths": paths}

You need to run this function on hundreds of sets of input parameters and want torc to help you scale this work on an HPC.

The recommended approach is to use torc’s Python API, as shown below. The goal is to mimic the behavior of Python’s concurrent.futures.ProcessPoolExecutor.map as closely as possible.

Similar functionality is also available with Dask.
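
For orientation, here is roughly what that map looks like with the standard library on a single machine, using the simulation.py module above. This sketch is illustrative only and is not part of the torc API; torc distributes the same pattern across compute nodes and stores inputs and results in its database.

# Single-machine analogue of the workflow built below (illustrative only).
from concurrent.futures import ProcessPoolExecutor

from simulation import postprocess, run

params = [
    {"input1": 1, "input2": 2, "input3": 3},
    {"input1": 4, "input2": 5, "input3": 6},
]
job_names = [f"job_{i}" for i in range(len(params))]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run, job_names, params))
    print(postprocess(results))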

Resource Constraints

  • Each function call needs 4 CPUs and 20 GiB of memory.
  • The function call takes 1 hour to run.

Compute nodes with 92 GiB of memory are the easiest to acquire but would only be able to run 4 jobs at a time. The 180 GiB nodes are fewer in number but would use fewer AUs because they would be able to run 8 jobs at a time.

Torc Overview

Here is what torc does to solve this problem:

  • User creates a workflow in Python.
  • User passes a callable function as well as a list of all input parameters that need to be mapped to the function.
  • For each set of input parameters, torc creates a record in the user_data table in the database, creates a job with a relationship to that record as an input, and creates a placeholder for the data that job will produce.
  • When torc runs each job, it reads the correct input parameters from the database, imports the user’s function, and then calls it with those parameters.
  • When the function completes, torc stores any returned data in the database.
  • When all workers complete, torc collects all result data from the database into a list and passes it to the postprocess function. It also stores any data returned by that function in the database.

Build the workflow

  1. Write a script to create the workflow. Note that you will need to adjust the API URL and the Slurm account.
import getpass
import os

from torc import make_api, map_function_to_jobs, setup_logging
from torc.openapi_client import (
    DefaultApi,
    ResourceRequirementsModel,
    SlurmSchedulerModel,
    WorkflowModel,
)


TORC_API_URL = os.getenv("TORC_API_URL", "http://localhost:8080/torc-service/v1")


def create_workflow(api: DefaultApi) -> WorkflowModel:
    """Create the workflow"""
    workflow = WorkflowModel(
        user=getpass.getuser(),
        name="map_function_workflow",
        description="Example workflow that maps a function across workers",
    )
    return api.create_workflow(workflow)


def build_workflow(api: DefaultApi, workflow: WorkflowModel):
    """Creates a workflow with implicit job dependencies declared through files."""
    workflow_id = workflow.id
    assert workflow_id is not None
    params = [
        {"input1": 1, "input2": 2, "input3": 3},
        {"input1": 4, "input2": 5, "input3": 6},
        {"input1": 7, "input2": 8, "input3": 9},
    ]
    assert workflow.id is not None
    rr = api.create_resource_requirements(
        ResourceRequirementsModel(
            workflow_id=workflow_id,
            name="medium",
            num_cpus=4,
            memory="20g",
            runtime="P0DT1H",
        ),
    )
    api.create_slurm_scheduler(
        SlurmSchedulerModel(
            workflow_id=workflow_id,
            name="short",
            account="my_account",
            mem="180224",
            walltime="04:00:00",
            nodes=1,
        ),
    )
    jobs = map_function_to_jobs(
        api,
        workflow_id,
        "simulation",
        "run",
        params,
        resource_requirements_id=rr.id,
        # Note that this is optional.
        postprocess_func="postprocess",
    )
    print(f"Created workflow with ID {workflow_id} {len(jobs)} jobs.")


def main():
    setup_logging()
    api = make_api(TORC_API_URL)
    workflow = create_workflow(api)
    try:
        build_workflow(api, workflow)
    except Exception:
        api.delete_workflow(workflow.id)
        raise


if __name__ == "__main__":
    main()

Requirements:

  • Your run function should raise an exception if there is a failure. If that happens, torc will record a non-zero return code for the job.

  • If you want torc to store result data in the database, return it from your run function. Note: be careful about how much result data you return. If you are using a custom database for one workflow, store as much as you want. If you are using a shared server, ensure that you are following its administrator’s policies. Consider storing large data in files and storing only the file paths in the database.

  • If you choose to define a postprocess function and want torc to store the final data in the database, return it from that function.

  • The params must be serializable in JSON format because they will be stored in the database. Basic types like numbers and strings and lists and dictionaries of those will work fine. If you need to store complex, custom types, consider these options:

    • Define data models with Pydantic. You can use their existing serialization/de-serialization methods or define custom methods (a minimal sketch follows this list).
    • Pickle your data and store the result as a string. Your run function would need to understand how to de-serialize it. Note that this has portability limitations. (Please contact the developers if you would like to see this happen automatically.)
  • Torc must be able to import simulation.py from Python. Here are some options:

    • Put the script in the current directory.
    • Install it in the environment.
    • Specify its parent directory like this: map_function_to_jobs(..., module_directory="my_module")
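
A minimal sketch of the Pydantic option mentioned above; the model and field names are illustrative and pydantic v2 is assumed:

from pydantic import BaseModel


class SimulationParams(BaseModel):
    """Illustrative typed parameters for one simulation."""

    input1: int
    input2: int
    input3: int


# Serialize to plain, JSON-friendly dicts before passing them to
# map_function_to_jobs.
typed = [SimulationParams(input1=1, input2=2, input3=3)]
params = [p.model_dump() for p in typed]

# Inside run(), rebuild the typed object from the stored dict:
#     params = SimulationParams(**input_params)
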
  2. Create the workflow.
python examples/python/map_function_across_workers.py
Created workflow 342 with 4 jobs.
  3. Run the workflow.
$ torc run 342
  4. View the result data overall or by job (if your run and postprocess functions return something). Note that listing all user-data will also return the input parameters.
$ torc -f json user-data list 342

Other jobs

You could add “normal” jobs to the workflow as well. For example, you might have preprocessing and post-processing work to do. You can add those jobs through the API. You could also add multiple rounds of mapped functions. map_function_to_jobs provides a depends_on_job_ids parameter to specify ordering. You could also define job-job relationships through files or user-data as discussed elsewhere in this documentation.
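
For example, a second round of mapped functions can be ordered after the first with depends_on_job_ids. The sketch below reuses the api, workflow_id, and params objects from the build script above and assumes each returned job exposes an id field:

# Two rounds of mapped functions, with round 2 gated on round 1 (sketch).
round1_jobs = map_function_to_jobs(api, workflow_id, "simulation", "run", params)
round2_jobs = map_function_to_jobs(
    api,
    workflow_id,
    "simulation",
    "run",
    params,
    # Run round 2 only after every round 1 job has completed.
    depends_on_job_ids=[job.id for job in round1_jobs],
)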

Tutorial 11: Filtering CLI Output with Nushell

This tutorial teaches you how to filter and analyze Torc CLI output using Nushell, a modern shell with powerful structured data capabilities.

Learning Objectives

By the end of this tutorial, you will:

  • Understand why Nushell is useful for filtering Torc output
  • Know how to filter jobs by status, name, and other fields
  • Be able to analyze results and find failures quickly
  • Create complex queries combining multiple conditions

Prerequisites

  • Torc CLI installed and configured
  • A workflow with jobs (ideally one with various statuses)

Why Nushell?

Torc’s CLI can output JSON with the -f json flag. While tools like jq can process JSON, Nushell offers a more readable, SQL-like syntax that’s easier to learn and use interactively.

Compare filtering failed jobs:

# jq (cryptic syntax)
torc jobs list 123 -f json | jq '.jobs[] | select(.status == "failed")'

# Nushell (readable, SQL-like)
torc jobs list 123 -f json | from json | get jobs | where status == "failed"

Nushell is:

  • Cross-platform: Works on Linux, macOS, and Windows
  • Readable: Uses intuitive commands like where, select, sort-by
  • Interactive: Tab completion and helpful error messages
  • Powerful: Built-in support for JSON, YAML, CSV, and more

Installing Nushell

Install Nushell from nushell.sh/book/installation:

# macOS
brew install nushell

# Windows
winget install nushell

# Linux (various methods available)
cargo install nu

After installation, run nu to start a Nushell session. You can use Nushell interactively or run individual commands with nu -c "command".

Basic Filtering

Setup: Get JSON Output

All examples assume you have a workflow ID stored in the variable $wf. Set it to your actual ID:

# In Nushell, set your workflow ID
let wf = 123

List All Jobs

torc jobs list $wf -f json | from json | get jobs

This parses the JSON and extracts the jobs array into a table.

Filter by Status

Find all failed jobs:

torc jobs list $wf -f json | from json | get jobs | where status == "failed"

Find jobs that are ready or running:

torc jobs list $wf -f json | from json | get jobs | where status in ["ready", "running"]

Filter by Name Pattern

Find jobs with “train” in the name:

torc jobs list $wf -f json | from json | get jobs | where name =~ "train"

The =~ operator performs substring/regex matching.

Combine Conditions

Find failed jobs with “process” in the name:

torc jobs list $wf -f json | from json | get jobs | where status == "failed" and name =~ "process"

Find jobs that failed or were canceled:

torc jobs list $wf -f json | from json | get jobs | where status == "failed" or status == "canceled"

Selecting and Formatting Output

Select Specific Columns

Show only name and status:

torc jobs list $wf -f json | from json | get jobs | select name status

Sort Results

Sort by name:

torc jobs list $wf -f json | from json | get jobs | sort-by name

Sort failed jobs by ID (descending):

torc jobs list $wf -f json | from json | get jobs | where status == "failed" | sort-by id -r

Count Results

Count jobs by status:

torc jobs list $wf -f json | from json | get jobs | group-by status | transpose status jobs | each { |row| { status: $row.status, count: ($row.jobs | length) } }

Or more simply, count failed jobs:

torc jobs list $wf -f json | from json | get jobs | where status == "failed" | length

Analyzing Results

Find Jobs with Non-Zero Return Codes

torc results list $wf -f json | from json | get results | where return_code != 0

Find Results with Specific Errors

torc results list $wf -f json | from json | get results | where return_code != 0 | select job_id return_code

Join Jobs with Results

Get job names for failed results:

let jobs = (torc jobs list $wf -f json | from json | get jobs)
let results = (torc results list $wf -f json | from json | get results | where return_code != 0)
$results | each { |r|
    let job = ($jobs | where id == $r.job_id | first)
    { name: $job.name, return_code: $r.return_code, job_id: $r.job_id }
}

Working with User Data

List User Data Entries

torc user-data list $wf -f json | from json | get user_data

Filter by Key

Find user data with a specific key:

torc user-data list $wf -f json | from json | get user_data | where key =~ "config"

Parse JSON Values

User data values are JSON strings. Parse and filter them:

torc user-data list $wf -f json | from json | get user_data | each { |ud|
    { key: $ud.key, value: ($ud.value | from json) }
}

Practical Examples

Example 1: Debug Failed Jobs

Find failed jobs and get their result details:

# Get failed job IDs
let failed_ids = (torc jobs list $wf -f json | from json | get jobs | where status == "failed" | get id)

# Show results for those jobs
torc results list $wf -f json | from json | get results | where job_id in $failed_ids | select job_id return_code

Example 2: Find Stuck Jobs

List jobs that are currently in the “running” state (a starting point for spotting jobs that may be stuck):

torc jobs list $wf -f json | from json | get jobs | where status == "running" | select id name

Example 3: Parameter Sweep Analysis

For a parameterized workflow, find which parameter values failed:

torc jobs list $wf -f json | from json | get jobs | where status == "failed" and name =~ "lr" | get name

Example 4: Export to CSV

Export failed jobs to CSV for further analysis:

torc jobs list $wf -f json | from json | get jobs | where status == "failed" | to csv | save failed_jobs.csv

Quick Reference

  Operation             Nushell Command
  Parse JSON            from json
  Get field             get jobs
  Filter rows           where status == "failed"
  Select columns        select name status id
  Sort                  sort-by name
  Sort descending       sort-by id -r
  Count                 length
  Substring match       where name =~ "pattern"
  Multiple conditions   where status == "failed" and name =~ "x"
  In list               where status in ["ready", "running"]
  Group by              group-by status
  Save to file          save output.json
  Convert to CSV        to csv

Tips

  1. Use nu interactively: Start a Nushell session to explore data step by step
  2. Tab completion: Nushell provides completions for commands and field names
  3. Pipeline debugging: Add | first 5 to see a sample before processing all data
  4. Save queries: Create shell aliases or scripts for common filters

What You Learned

In this tutorial, you learned:

  • Why Nushell is a great tool for filtering Torc CLI output
  • How to filter jobs by status and name patterns
  • How to analyze results and find failures
  • How to work with user data
  • Practical examples for debugging workflows

Creating a Custom HPC Profile

This tutorial walks you through creating a custom HPC profile for a cluster that Torc doesn’t have built-in support for.

Before You Start

Request Built-in Support First!

If your HPC system is widely used, consider requesting that Torc developers add it as a built-in profile. This benefits everyone using that system.

Open an issue at github.com/NREL/torc/issues with:

  • Your HPC system name and organization
  • Partition names and their resource limits (CPUs, memory, walltime, GPUs)
  • How to detect the system (environment variable or hostname pattern)
  • Any special requirements (minimum nodes, exclusive partitions, etc.)

Built-in profiles are maintained by the Torc team and stay up-to-date as systems change.

When to Create a Custom Profile

Create a custom profile when:

  • Your HPC isn’t supported and you need to use it immediately
  • You have a private or internal cluster
  • You want to test profile configurations before submitting upstream

Step 1: Gather Partition Information

First, collect information about your HPC’s partitions. On most Slurm systems:

# List all partitions
sinfo -s

# Get detailed partition info
sinfo -o "%P %c %m %l %G"

For this tutorial, let’s say your cluster “ResearchCluster” has these partitions:

  Partition   CPUs/Node   Memory    Max Walltime   GPUs
  batch       48          192 GB    72 hours       -
  short       48          192 GB    4 hours        -
  gpu         32          256 GB    48 hours       4x A100
  himem       48          1024 GB   48 hours       -

Step 2: Identify Detection Method

Determine how Torc can detect when you’re on this system. Common methods:

Environment variable (most common):

echo $CLUSTER_NAME    # e.g., "research"
echo $SLURM_CLUSTER   # e.g., "researchcluster"

Hostname pattern:

hostname              # e.g., "login01.research.edu"

For this tutorial, we’ll use the environment variable CLUSTER_NAME=research.

Step 3: Create the Configuration File

Create or edit your Torc configuration file:

# Linux
mkdir -p ~/.config/torc
nano ~/.config/torc/config.toml

# macOS
mkdir -p ~/Library/Application\ Support/torc
nano ~/Library/Application\ Support/torc/config.toml

Add your custom profile:

# Custom HPC Profile for ResearchCluster
[client.hpc.custom_profiles.research]
display_name = "Research Cluster"
description = "University Research HPC System"
detect_env_var = "CLUSTER_NAME=research"
default_account = "my_project"

# Batch partition - general purpose
[[client.hpc.custom_profiles.research.partitions]]
name = "batch"
cpus_per_node = 48
memory_mb = 192000        # 192 GB in MB
max_walltime_secs = 259200  # 72 hours in seconds
shared = false

# Short partition - quick jobs
[[client.hpc.custom_profiles.research.partitions]]
name = "short"
cpus_per_node = 48
memory_mb = 192000
max_walltime_secs = 14400   # 4 hours
shared = true               # Allows sharing nodes

# GPU partition
[[client.hpc.custom_profiles.research.partitions]]
name = "gpu"
cpus_per_node = 32
memory_mb = 256000          # 256 GB
max_walltime_secs = 172800  # 48 hours
gpus_per_node = 4
gpu_type = "A100"
shared = false

# High memory partition
[[client.hpc.custom_profiles.research.partitions]]
name = "himem"
cpus_per_node = 48
memory_mb = 1048576         # 1024 GB (1 TB)
max_walltime_secs = 172800  # 48 hours
shared = false

Step 4: Verify the Profile

Check that Torc recognizes your profile:

# List all profiles
torc hpc list

You should see your custom profile:

Known HPC profiles:

╭──────────┬──────────────────┬────────────┬──────────╮
│ Name     │ Display Name     │ Partitions │ Detected │
├──────────┼──────────────────┼────────────┼──────────┤
│ kestrel  │ NREL Kestrel     │ 15         │          │
│ research │ Research Cluster │ 4          │ ✓        │
╰──────────┴──────────────────┴────────────┴──────────╯

View the partitions:

torc hpc partitions research
Partitions for research:

╭─────────┬───────────┬───────────┬─────────────┬──────────╮
│ Name    │ CPUs/Node │ Mem/Node  │ Max Walltime│ GPUs     │
├─────────┼───────────┼───────────┼─────────────┼──────────┤
│ batch   │ 48        │ 192 GB    │ 72h         │ -        │
│ short   │ 48        │ 192 GB    │ 4h          │ -        │
│ gpu     │ 32        │ 256 GB    │ 48h         │ 4 (A100) │
│ himem   │ 48        │ 1024 GB   │ 48h         │ -        │
╰─────────┴───────────┴───────────┴─────────────┴──────────╯

Step 5: Test Partition Matching

Verify that Torc correctly matches resource requirements to partitions:

# Should match 'short' partition
torc hpc match research --cpus 8 --memory 16g --walltime 2h

# Should match 'gpu' partition
torc hpc match research --cpus 16 --memory 64g --walltime 8h --gpus 2

# Should match 'himem' partition
torc hpc match research --cpus 24 --memory 512g --walltime 24h

Step 6: Test Scheduler Generation

Create a test workflow to verify scheduler generation:

# test_workflow.yaml
name: profile_test
description: Test custom HPC profile

resource_requirements:
  - name: standard
    num_cpus: 16
    memory: 64g
    runtime: PT2H

  - name: gpu_compute
    num_cpus: 16
    num_gpus: 2
    memory: 128g
    runtime: PT8H

jobs:
  - name: preprocess
    command: echo "preprocessing"
    resource_requirements: standard

  - name: train
    command: echo "training"
    resource_requirements: gpu_compute
    depends_on: [preprocess]

Generate schedulers:

torc slurm generate --account my_project --profile research test_workflow.yaml

You should see the generated workflow with appropriate schedulers for each partition.

Step 7: Use Your Profile

Now you can submit workflows using your custom profile:

# Auto-detect the profile (if on the cluster)
torc submit-slurm --account my_project workflow.yaml

# Or explicitly specify the profile
torc submit-slurm --account my_project --hpc-profile research workflow.yaml

Advanced Configuration

Hostname-Based Detection

If your cluster doesn’t set a unique environment variable, use hostname detection:

[client.hpc.custom_profiles.research]
display_name = "Research Cluster"
detect_hostname = ".*\\.research\\.edu"  # Regex pattern

Minimum Node Requirements

Some partitions require a minimum number of nodes:

[[client.hpc.custom_profiles.research.partitions]]
name = "large_scale"
cpus_per_node = 128
memory_mb = 512000
max_walltime_secs = 172800
min_nodes = 16  # Must request at least 16 nodes

Explicit Request Partitions

Some partitions shouldn’t be auto-selected:

[[client.hpc.custom_profiles.research.partitions]]
name = "priority"
cpus_per_node = 48
memory_mb = 192000
max_walltime_secs = 86400
requires_explicit_request = true  # Only used when explicitly requested

Troubleshooting

Profile Not Detected

If torc hpc detect doesn’t find your profile:

  1. Check the environment variable or hostname:

    echo $CLUSTER_NAME
    hostname
    
  2. Verify the detection pattern in your config matches exactly

  3. Test with explicit profile specification:

    torc hpc show research
    

No Partition Found for Job

If torc slurm generate can’t find a matching partition:

  1. Check if any partition satisfies all requirements:

    torc hpc match research --cpus 32 --memory 128g --walltime 8h
    
  2. Verify memory is specified in MB in the config (not GB)

  3. Verify walltime is in seconds (not hours); a quick conversion check is sketched below
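
If you find yourself converting these values by hand, a short Python snippet (illustrative, not part of Torc) reproduces the numbers used in this tutorial’s profile:

# Convert human-readable partition limits into the units config.toml expects.
def gb_to_mb(gb: float) -> int:
    return int(gb * 1000)  # the profile above uses decimal megabytes


def hours_to_secs(hours: float) -> int:
    return int(hours * 3600)


print(gb_to_mb(192))      # 192000 -> memory_mb for the batch partition
print(hours_to_secs(72))  # 259200 -> max_walltime_secs for the batch partition
print(hours_to_secs(4))   # 14400  -> max_walltime_secs for the short partition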

Configuration File Location

Torc looks for config files in these locations:

  • Linux: ~/.config/torc/config.toml
  • macOS: ~/Library/Application Support/torc/config.toml
  • Windows: %APPDATA%\torc\config.toml

You can also use the TORC_CONFIG environment variable to specify a custom path.

Contributing Your Profile

If your HPC is used by others, please contribute it upstream:

  1. Fork the Torc repository
  2. Add your profile to src/client/hpc_profiles.rs
  3. Add tests for your profile
  4. Submit a pull request

Or simply open an issue with your partition information and we’ll add it for you.

Contributing

Contributions to Torc are welcome! This guide will help you get started.

Development Setup

  1. Fork and clone the repository:
git clone https://github.com/your-username/torc.git
cd torc
  2. Install Rust and dependencies:

Make sure you have Rust 1.70 or later installed:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  3. Install SQLx CLI:
cargo install sqlx-cli --no-default-features --features sqlite
  4. Set up the database:
# Create .env file
echo "DATABASE_URL=sqlite:torc.db" > .env

# Run migrations
sqlx migrate run
  5. Build and test:
cargo build
cargo test

Making Changes

Code Style

Run formatting and linting before committing:

# Format code
cargo fmt

# Run clippy
cargo clippy --all-targets --all-features

# Run all checks
cargo fmt --check && cargo clippy --all-targets --all-features -- -D warnings

Adding Tests

All new functionality should include tests:

# Run specific test
cargo test test_name -- --nocapture

# Run with logging
RUST_LOG=debug cargo test -- --nocapture

Database Migrations

If you need to modify the database schema:

# Create new migration
sqlx migrate add <migration_name>

# Edit the generated SQL file in migrations/

# Run migration
sqlx migrate run

# To revert
sqlx migrate revert

Submitting Changes

  1. Create a feature branch:
git checkout -b feature/my-new-feature
  2. Make your changes and commit:
git add .
git commit -m "Add feature: description"
  3. Ensure all tests pass:
cargo test
cargo fmt --check
cargo clippy --all-targets --all-features -- -D warnings
  4. Push to your fork:
git push origin feature/my-new-feature
  5. Open a Pull Request:

Go to the original repository and open a pull request with:

  • Clear description of changes
  • Reference to any related issues
  • Test results

Pull Request Guidelines

  • Keep PRs focused - One feature or fix per PR
  • Add tests - All new code should be tested
  • Update documentation - Update README.md, DOCUMENTATION.md, or inline docs as needed
  • Follow style guidelines - Run cargo fmt and cargo clippy
  • Write clear commit messages - Describe what and why, not just how

Areas for Contribution

High Priority

  • Performance optimizations for large workflows
  • Additional job runner implementations (Kubernetes, etc.)
  • Improved error messages and logging
  • Documentation improvements

Features

  • Workflow visualization tools
  • Job retry policies and error handling
  • Workflow templates and libraries
  • Integration with external systems

Testing

  • Additional integration tests
  • Performance benchmarks
  • Stress testing with large workflows

Code of Conduct

Be respectful and constructive in all interactions. We’re all here to make Torc better.

Questions?

  • Open an issue for bugs or feature requests
  • Start a discussion for questions or ideas
  • Check existing issues and discussions first

License

By contributing, you agree that your contributions will be licensed under the BSD 3-Clause License.