Parallelization Strategies

Torc provides flexible parallelization strategies to accommodate different workflow patterns and resource allocation scenarios. Understanding these strategies helps you optimize job execution for your specific use case.

Overview

Torc supports two primary approaches to parallel job execution:

  1. Resource-aware allocation - Define per-job resource requirements and let runners intelligently select jobs that fit available resources
  2. Queue-depth parallelism - Control the number of concurrent jobs without resource tracking

The choice between these approaches depends on your workflow characteristics and execution environment.

Use Case 1: Resource-Aware Job Allocation

This strategy is ideal for heterogeneous workflows where jobs have varying resource requirements (CPU, memory, GPU, runtime). The server intelligently allocates jobs based on available compute node resources.

How It Works

When you define resource requirements for each job:

resource_requirements:
  - name: small
    num_cpus: 2
    num_gpus: 0
    memory: 4g
    runtime: PT30M

  - name: large
    num_cpus: 16
    num_gpus: 2
    memory: 128g
    runtime: PT8H

jobs:
  - name: preprocessing
    command: ./preprocess.sh
    resource_requirements: small

  - name: model_training
    command: python train.py
    resource_requirements: large

The job runner automatically detects the resources available on its compute node and pulls jobs from the server that fit within them:

torc run $WORKFLOW_ID

The server’s GET /workflows/{id}/claim_jobs_based_on_resources endpoint does the following; a conceptual sketch appears after the list:

  1. Receives the runner’s resource capacity
  2. Queries the ready queue for jobs that fit within those resources
  3. Returns a set of jobs that can run concurrently without over-subscription
  4. Updates job status from ready to pending atomically
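
To make the selection step concrete, here is a minimal Python sketch of the filtering logic the list above describes. It is not torc’s actual server code: the Job fields, function name, and in-memory queue are illustrative assumptions, and runtime is ignored for brevity.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    num_cpus: int
    num_gpus: int
    memory_gb: int
    status: str = "ready"

def claim_jobs_based_on_resources(ready_jobs, num_cpus, num_gpus, memory_gb):
    """Hypothetical sketch: return ready jobs that fit the reported capacity."""
    claimed = []
    for job in ready_jobs:
        if (job.num_cpus <= num_cpus
                and job.num_gpus <= num_gpus
                and job.memory_gb <= memory_gb):
            num_cpus -= job.num_cpus        # reserve this job's share of capacity
            num_gpus -= job.num_gpus
            memory_gb -= job.memory_gb
            job.status = "pending"          # ready -> pending (atomic on the real server)
            claimed.append(job)
    return claimed

# With 32 CPUs, 2 GPUs, and 128 GB reported, only the small job is claimed:
# adding the large job would over-subscribe memory.
jobs = [Job("preprocessing", 2, 0, 4), Job("model_training", 16, 2, 128)]
print([j.name for j in claim_jobs_based_on_resources(jobs, 32, 2, 128)])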

Job Allocation Ambiguity: Two Approaches

When you have multiple compute nodes or schedulers with different capabilities, there are two ways to handle job allocation:

Approach 1: Sort Method (Flexible but Potentially Ambiguous)

How it works:

  • Jobs do NOT specify a particular scheduler/compute node
  • The server uses a job_sort_method parameter to prioritize jobs when allocating
  • Any runner with sufficient resources can claim any ready job

Available sort methods: set the job_sort_method field in the workflow specification file (YAML, JSON, or KDL) to one of the values below; a sketch of the resulting orderings follows the list.

  • gpus_runtime_memory - Prioritize jobs by GPU count (desc), then runtime (desc), then memory (desc)
  • gpus_memory_runtime - Prioritize jobs by GPU count (desc), then memory (desc), then runtime (desc)
  • none - No sorting, jobs selected in queue order
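
As a rough illustration (not torc’s internal code; the dictionary keys are assumed), the two GPU-first orderings amount to sorting the ready queue by a descending key tuple before jobs are claimed:

def sort_ready_jobs(jobs, method):
    """Hypothetical sketch of the orderings listed above."""
    if method == "gpus_runtime_memory":
        key = lambda j: (j["num_gpus"], j["runtime_seconds"], j["memory_gb"])
    elif method == "gpus_memory_runtime":
        key = lambda j: (j["num_gpus"], j["memory_gb"], j["runtime_seconds"])
    else:  # "none": keep queue order
        return list(jobs)
    return sorted(jobs, key=key, reverse=True)  # descending on every field

ready = [
    {"name": "small", "num_gpus": 0, "runtime_seconds": 1800, "memory_gb": 4},
    {"name": "large", "num_gpus": 2, "runtime_seconds": 28800, "memory_gb": 128},
]
print([j["name"] for j in sort_ready_jobs(ready, "gpus_runtime_memory")])  # ['large', 'small']

Offering the largest, longest jobs first tends to keep big jobs from waiting behind many small ones.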

Tradeoffs:

Advantages:

  • Maximum flexibility - any runner can execute any compatible job
  • Better resource utilization - if a GPU runner is idle, it can pick up CPU-only jobs
  • Simpler workflow specifications - no need to explicitly map jobs to schedulers
  • Fault tolerance - if one runner fails, others can pick up its jobs

Disadvantages:

  • Ambiguity - no guarantee GPU jobs go to GPU runners
  • Potential inefficiency - high-memory jobs might land on low-memory nodes if timing is unlucky
  • Requires careful sort method selection
  • Less predictable job placement

When to use:

  • Homogeneous or mostly-homogeneous compute resources
  • Workflows where job placement flexibility is valuable
  • When you want runners to opportunistically pick up work
  • Development and testing environments

Approach 2: Scheduler ID (Deterministic but Less Flexible)

How it works:

  • Define scheduler configurations in your workflow spec
  • Assign each job a specific scheduler_id
  • Runners provide their scheduler_config_id when requesting jobs
  • Server only returns jobs matching that scheduler ID

Example workflow specification:

slurm_schedulers:
  - name: gpu_cluster
    partition: gpu
    account: myproject

  - name: highmem_cluster
    partition: highmem
    account: myproject

jobs:
  - name: model_training
    command: python train.py
    resource_requirements: large
    slurm_scheduler: gpu_cluster     # Binds to specific scheduler

  - name: large_analysis
    command: ./analyze.sh
    resource_requirements: highmem
    slurm_scheduler: highmem_cluster

Example runner invocation:

# GPU runner - only pulls jobs assigned to gpu_cluster
torc-slurm-job-runner $WORKFLOW_ID \
  --scheduler-config-id 1 \
  --num-cpus 32 \
  --num-gpus 8

# High-memory runner - only pulls jobs assigned to highmem_cluster
torc-slurm-job-runner $WORKFLOW_ID \
  --scheduler-config-id 2 \
  --num-cpus 64 \
  --memory-gb 512

Tradeoffs:

Advantages:

  • Zero ambiguity - jobs always run on intended schedulers
  • Predictable job placement
  • Prevents GPU jobs from landing on CPU-only nodes
  • Clear workflow specification - explicit job→scheduler mapping
  • Better for heterogeneous clusters (GPU vs CPU vs high-memory)

Disadvantages:

  • Less flexibility - idle runners can’t help other queues
  • Potential resource underutilization - a GPU runner can sit idle while the CPU queue is full
  • More complex workflow specifications
  • If a scheduler fails, its jobs remain stuck until that scheduler returns

When to use:

  • Highly heterogeneous compute resources (GPU clusters, high-memory nodes, specialized hardware)
  • Production workflows requiring predictable job placement
  • Multi-cluster environments
  • When job-resource matching is critical (e.g., GPU-only codes, specific hardware requirements)
  • Slurm or HPC scheduler integrations

Choosing Between Sort Method and Scheduler ID

Scenario | Recommended Approach | Rationale
--- | --- | ---
All jobs can run anywhere | Sort method | Maximum flexibility, simplest spec
Some jobs need GPUs, some don’t | Scheduler ID | Prevent GPU waste on CPU jobs
Multi-cluster Slurm environment | Scheduler ID | Jobs must target correct clusters
Development/testing | Sort method | Easier to experiment
Production with SLAs | Scheduler ID | Predictable resource usage
Homogeneous compute nodes | Sort method | No benefit to restricting
Specialized hardware (GPUs, high-memory, FPGAs) | Scheduler ID | Match jobs to capabilities

You can also mix approaches: use scheduler_id for jobs with strict requirements and omit it (leaving it NULL) for flexible jobs.

Use Case 2: Queue-Depth Parallelism

This strategy is ideal for workflows with homogeneous resource requirements where you simply want to control the level of parallelism.

How It Works

Instead of tracking resources, you specify a maximum number of concurrent jobs:

torc run $WORKFLOW_ID \
  --max-parallel-jobs 10 \
  --output-dir ./results

or with Slurm:

torc slurm schedule-nodes $WORKFLOW_ID \
  --scheduler-config-id 1 \
  --num-hpc-jobs 4 \
  --max-parallel-jobs 8

Server behavior:

The GET /workflows/{id}/claim_next_jobs endpoint does the following; a conceptual sketch appears after the list:

  1. Accepts limit parameter specifying maximum jobs to return
  2. Ignores all resource requirements
  3. Returns the next N ready jobs from the queue
  4. Updates their status from ready to pending
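
For contrast with the resource-aware sketch earlier, the queue-depth claim is conceptually just a bounded slice of the ready queue. The names below are illustrative, not the actual server code:

def claim_next_jobs(ready_queue, limit):
    """Hypothetical sketch: take the next `limit` ready jobs, ignoring resources."""
    batch = ready_queue[:limit]
    for job in batch:
        job["status"] = "pending"    # ready -> pending
    del ready_queue[:limit]          # claimed jobs leave the ready queue
    return batch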

Runner behavior (a sketch follows the list below):

  • Maintains a count of running jobs
  • When count falls below max_parallel_jobs, requests more work
  • Does NOT track CPU, memory, GPU, or other resources
  • Simply enforces the concurrency limit
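
A minimal sketch of that loop, assuming a hypothetical client object with claim/start/poll helpers (torc’s real runner differs):

import time

def run_queue_depth(client, max_parallel_jobs):
    # All client and job methods here are hypothetical stand-ins.
    running = []                                           # jobs currently executing
    while True:
        running = [j for j in running if not j.is_done()]  # reap finished jobs
        free_slots = max_parallel_jobs - len(running)
        if free_slots > 0:
            # Ask for up to free_slots more jobs; no CPU/memory/GPU accounting.
            for job in client.claim_next_jobs(limit=free_slots):
                running.append(client.start(job))
        if not running and client.workflow_is_complete():
            break
        time.sleep(5)                                      # poll interval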

Ignoring Resource Consumption

This is a critical distinction: when using --max-parallel-jobs, the runner completely ignores current resource consumption.

Normal resource-aware mode:

Runner has: 32 CPUs, 128 GB memory
Job A needs: 16 CPUs, 64 GB
Job B needs: 16 CPUs, 64 GB
Job C needs: 16 CPUs, 64 GB

Runner starts Job A and Job B (resources fully allocated)
Job C waits until resources free up

Queue-depth mode with --max-parallel-jobs 3:

Runner has: 32 CPUs, 128 GB memory (IGNORED)
Job A needs: 16 CPUs, 64 GB (IGNORED)
Job B needs: 16 CPUs, 64 GB (IGNORED)
Job C needs: 16 CPUs, 64 GB (IGNORED)

Runner starts Job A, Job B, and Job C simultaneously
Total requested: 48 CPUs, 192 GB (exceeds node capacity!)
System may: swap, OOM, or throttle performance

When to Use Queue-Depth Parallelism

✅ Use queue-depth parallelism when:

  1. All jobs have similar resource requirements

    # All jobs use ~4 CPUs, ~8GB memory
    jobs:
      - name: process_file_1
        command: ./process.sh file1.txt
      - name: process_file_2
        command: ./process.sh file2.txt
      # ... 100 similar jobs
    
  2. Resource requirements are negligible compared to node capacity

    • Running 100 lightweight Python scripts on a 64-core machine
    • I/O-bound jobs that don’t consume much CPU/memory
  3. Jobs are I/O-bound or sleep frequently

    • Data download jobs
    • Jobs waiting on external services
    • Polling or monitoring tasks
  4. You want simplicity over precision

    • Quick prototypes
    • Testing workflows
    • Simple task queues
  5. Jobs self-limit their resource usage

    • Application has built-in thread pools
    • Container resource limits
    • OS-level cgroups or resource controls

❌ Avoid queue-depth parallelism when:

  1. Jobs have heterogeneous resource requirements

    • Mix of 2-CPU and 32-CPU jobs
    • Some jobs need 4GB, others need 128GB
  2. Resource contention causes failures

    • Out-of-memory errors
    • CPU thrashing
    • GPU memory exhaustion
  3. You need efficient bin-packing

    • Maximizing node utilization
    • Complex resource constraints
  4. Jobs are compute-intensive

    • CPU-bound numerical simulations
    • Large matrix operations
    • Video encoding

Queue-Depth Parallelism in Practice

Example 1: Slurm with Queue Depth

# Schedule 4 Slurm nodes, each running up to 8 concurrent jobs
torc slurm schedule-nodes $WORKFLOW_ID \
  --scheduler-config-id 1 \
  --num-hpc-jobs 4 \
  --max-parallel-jobs 8

This creates 4 Slurm job allocations. Each allocation runs a worker that:

  • Pulls up to 8 jobs at a time
  • Runs them concurrently
  • Requests more when any job completes

Total concurrency: up to 32 jobs (4 nodes × 8 jobs/node)

Example 2: Local Runner with Queue Depth

# Run up to 20 jobs concurrently on local machine
torc-job-runner $WORKFLOW_ID \
  --max-parallel-jobs 20 \
  --output-dir ./output

Example 3: Mixed Approach

You can even run multiple runners with different strategies:

# Terminal 1: Resource-aware runner for large jobs
torc run $WORKFLOW_ID \
  --num-cpus 32 \
  --memory-gb 256

# Terminal 2: Queue-depth runner for small jobs
torc run $WORKFLOW_ID \
  --max-parallel-jobs 50

The ready queue serves both runners. The resource-aware runner gets large jobs that fit its capacity, while the queue-depth runner gets small jobs for fast parallel execution.

Performance Characteristics

Resource-aware allocation:

  • Query complexity: O(jobs in ready queue)
  • Requires computing resource sums
  • Slightly slower due to filtering and sorting
  • Better resource utilization

Queue-depth allocation:

  • Query complexity: O(limit), bounded by the requested batch size
  • Simple LIMIT clause, no resource computation
  • Faster queries
  • Simpler logic

For workflows with thousands of ready jobs, queue-depth allocation has lower overhead.

Best Practices

  1. Start with resource-aware allocation for new workflows

    • Better default behavior
    • Prevents resource over-subscription
    • Easier to debug resource issues
  2. Use scheduler_id for production multi-cluster workflows

    • Explicit job placement
    • Predictable resource usage
    • Better for heterogeneous resources
  3. Use sort_method for flexible single-cluster workflows

    • Simpler specifications
    • Better resource utilization
    • Good for homogeneous resources
  4. Use queue-depth parallelism for homogeneous task queues

    • Many similar jobs
    • I/O-bound workloads
    • When simplicity matters more than precision
  5. Monitor resource usage when switching strategies

    • Check for over-subscription
    • Verify expected parallelism
    • Look for resource contention
  6. Test with small workflows first

    • Validate job allocation behavior
    • Check resource accounting
    • Ensure jobs run on intended schedulers

Summary

Strategy | Use When | Allocation Method | Resource Tracking
--- | --- | --- | ---
Resource-aware + sort_method | Heterogeneous jobs, flexible allocation | Server filters by resources | Yes
Resource-aware + scheduler_id | Heterogeneous jobs, strict allocation | Server filters by resources AND scheduler | Yes
Queue-depth | Homogeneous jobs, simple parallelism | Server returns next N jobs | No

Choose the strategy that best matches your workflow characteristics and execution environment. You can even mix strategies across different runners for maximum flexibility.