Parallelization Strategies
Torc provides flexible parallelization strategies to accommodate different workflow patterns and resource allocation scenarios. Understanding these strategies helps you optimize job execution for your specific use case.
Overview
Torc supports two primary approaches to parallel job execution:
- Resource-aware allocation - Define per-job resource requirements and let runners intelligently select jobs that fit available resources
- Queue-depth parallelism - Control the number of concurrent jobs without resource tracking
The choice between these approaches depends on your workflow characteristics and execution environment.
Use Case 1: Resource-Aware Job Allocation
This strategy is ideal for heterogeneous workflows where jobs have varying resource requirements (CPU, memory, GPU, runtime). The server intelligently allocates jobs based on available compute node resources.
How It Works
When you define resource requirements for each job:
resource_requirements:
  - name: small
    num_cpus: 2
    num_gpus: 0
    memory: 4g
    runtime: PT30M
  - name: large
    num_cpus: 16
    num_gpus: 2
    memory: 128g
    runtime: PT8H

jobs:
  - name: preprocessing
    command: ./preprocess.sh
    resource_requirements: small
  - name: model_training
    command: python train.py
    resource_requirements: large
The job runner automatically detects the resources available on its compute node and requests jobs from the server accordingly:
torc run $WORKFLOW_ID
The server’s GET /workflows/{id}/claim_jobs_based_on_resources endpoint:
- Receives the runner’s resource capacity
- Queries the ready queue for jobs that fit within those resources
- Returns a set of jobs that can run concurrently without over-subscription
- Updates job status from `ready` to `pending` atomically
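For illustration only, a claim request might look like the sketch below. The host/port and the query parameter names (num_cpus, memory_gb, num_gpus) are assumptions made for this example, not the documented API schema; only the endpoint path comes from the description above.

    # Hypothetical sketch of a resource-aware claim request.
    # Parameter names are assumed for illustration; only the endpoint path is from the docs above.
    curl "http://localhost:8080/workflows/$WORKFLOW_ID/claim_jobs_based_on_resources?num_cpus=32&memory_gb=128&num_gpus=0"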
Job Allocation Ambiguity: Two Approaches
When you have multiple compute nodes or schedulers with different capabilities, there are two ways to handle job allocation:
Approach 1: Sort Method (Flexible but Potentially Ambiguous)
How it works:
- Jobs do NOT specify a particular scheduler/compute node
- The server uses a `job_sort_method` parameter to prioritize jobs when allocating
- Any runner with sufficient resources can claim any ready job
Available sort methods (set the `job_sort_method` field in the workflow specification file, YAML/JSON/KDL):
- `gpus_runtime_memory` - Prioritize jobs by GPU count (desc), then runtime (desc), then memory (desc)
- `gpus_memory_runtime` - Prioritize jobs by GPU count (desc), then memory (desc), then runtime (desc)
- `none` - No sorting; jobs are selected in queue order
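As a minimal sketch, a spec that prioritizes GPU-heavy, long-running jobs could set the field as shown below. The top-level placement of `job_sort_method` is assumed for illustration, and the job reuses names from the earlier example.

    # Sketch: prioritize jobs by GPU count, then runtime, then memory.
    # Top-level placement of job_sort_method is assumed for this illustration.
    job_sort_method: gpus_runtime_memory

    jobs:
      - name: model_training
        command: python train.py
        resource_requirements: large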
Tradeoffs:
✅ Advantages:
- Maximum flexibility - any runner can execute any compatible job
- Better resource utilization - if GPU runner is idle, it can pick up CPU-only jobs
- Simpler workflow specifications - no need to explicitly map jobs to schedulers
- Fault tolerance - if one runner fails, others can pick up its jobs
❌ Disadvantages:
- Ambiguity - no guarantee GPU jobs go to GPU runners
- Potential inefficiency - high-memory jobs might land on low-memory nodes if timing is unlucky
- Requires careful sort method selection
- Less predictable job placement
When to use:
- Homogeneous or mostly-homogeneous compute resources
- Workflows where job placement flexibility is valuable
- When you want runners to opportunistically pick up work
- Development and testing environments
Approach 2: Scheduler ID (Deterministic but Less Flexible)
How it works:
- Define scheduler configurations in your workflow spec
- Assign each job a specific `scheduler_id`
- Runners provide their `scheduler_config_id` when requesting jobs
- Server only returns jobs matching that scheduler ID
Example workflow specification:
slurm_schedulers:
  - name: gpu_cluster
    partition: gpu
    account: myproject
  - name: highmem_cluster
    partition: highmem
    account: myproject

jobs:
  - name: model_training
    command: python train.py
    resource_requirements: large
    slurm_scheduler: gpu_cluster  # Binds to specific scheduler
  - name: large_analysis
    command: ./analyze.sh
    resource_requirements: highmem
    slurm_scheduler: highmem_cluster
Example runner invocation:
# GPU runner - only pulls jobs assigned to gpu_cluster
torc-slurm-job-runner $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-cpus 32 \
--num-gpus 8
# High-memory runner - only pulls jobs assigned to highmem_cluster
torc-slurm-job-runner $WORKFLOW_ID \
--scheduler-config-id 2 \
--num-cpus 64 \
--memory-gb 512
Tradeoffs:
✅ Advantages:
- Zero ambiguity - jobs always run on intended schedulers
- Predictable job placement
- Prevents GPU jobs from landing on CPU-only nodes
- Clear workflow specification - explicit job→scheduler mapping
- Better for heterogeneous clusters (GPU vs CPU vs high-memory)
❌ Disadvantages:
- Less flexibility - idle runners can’t help other queues
- Potential resource underutilization - GPU runner sits idle while CPU queue is full
- More complex workflow specifications
- If a scheduler fails, its jobs remain stuck until that scheduler returns
When to use:
- Highly heterogeneous compute resources (GPU clusters, high-memory nodes, specialized hardware)
- Production workflows requiring predictable job placement
- Multi-cluster environments
- When job-resource matching is critical (e.g., GPU-only codes, specific hardware requirements)
- Slurm or HPC scheduler integrations
Choosing Between Sort Method and Scheduler ID
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| All jobs can run anywhere | Sort method | Maximum flexibility, simplest spec |
| Some jobs need GPUs, some don’t | Scheduler ID | Prevent GPU waste on CPU jobs |
| Multi-cluster Slurm environment | Scheduler ID | Jobs must target correct clusters |
| Development/testing | Sort method | Easier to experiment |
| Production with SLAs | Scheduler ID | Predictable resource usage |
| Homogeneous compute nodes | Sort method | No benefit to restricting |
| Specialized hardware (GPUs, high-memory, FPGAs) | Scheduler ID | Match jobs to capabilities |
You can also mix approaches: use scheduler_id for jobs with strict requirements and leave it unset (NULL) for flexible jobs.
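For instance, reusing the scheduler and resource names from the earlier examples, a mixed spec might pin only the training job. This is a sketch of the idea, not a verified specification.

    slurm_schedulers:
      - name: gpu_cluster
        partition: gpu
        account: myproject

    jobs:
      - name: model_training
        command: python train.py
        resource_requirements: large
        slurm_scheduler: gpu_cluster   # strict requirement: must run on the GPU cluster
      - name: preprocessing
        command: ./preprocess.sh
        resource_requirements: small   # no scheduler binding: any runner with capacity can claim it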
Use Case 2: Queue-Depth Parallelism
This strategy is ideal for workflows with homogeneous resource requirements where you simply want to control the level of parallelism.
How It Works
Instead of tracking resources, you specify a maximum number of concurrent jobs:
torc run $WORKFLOW_ID \
--max-parallel-jobs 10 \
--output-dir ./results
or with Slurm:
torc slurm schedule-nodes $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-hpc-jobs 4 \
--max-parallel-jobs 8
Server behavior:
The GET /workflows/{id}/claim_next_jobs endpoint:
- Accepts a `limit` parameter specifying the maximum number of jobs to return
- Ignores all resource requirements
- Returns the next N ready jobs from the queue
- Updates their status from `ready` to `pending`
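As a rough sketch, a queue-depth claim could look like the request below. The host/port are placeholders; only the endpoint path and the `limit` parameter come from the description above.

    # Sketch: request up to 10 ready jobs with no resource filtering.
    # Host/port are placeholders; only the endpoint path and `limit` are from the docs above.
    curl "http://localhost:8080/workflows/$WORKFLOW_ID/claim_next_jobs?limit=10"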
Runner behavior:
- Maintains a count of running jobs
- When the count falls below `max_parallel_jobs`, requests more work
- Does NOT track CPU, memory, GPU, or other resources
- Simply enforces the concurrency limit
Ignoring Resource Consumption
This is a critical distinction: when using --max-parallel-jobs, the runner completely ignores current resource consumption.
Normal resource-aware mode:
Runner has: 32 CPUs, 128 GB memory
Job A needs: 16 CPUs, 64 GB
Job B needs: 16 CPUs, 64 GB
Job C needs: 16 CPUs, 64 GB
Runner starts Job A and Job B (resources fully allocated)
Job C waits until resources free up
Queue-depth mode with --max-parallel-jobs 3:
Runner has: 32 CPUs, 128 GB memory (IGNORED)
Job A needs: 16 CPUs, 64 GB (IGNORED)
Job B needs: 16 CPUs, 64 GB (IGNORED)
Job C needs: 16 CPUs, 64 GB (IGNORED)
Runner starts Job A, Job B, and Job C simultaneously
Total requested: 48 CPUs, 192 GB (exceeds node capacity!)
System may: swap, OOM, or throttle performance
When to Use Queue-Depth Parallelism
✅ Use queue-depth parallelism when:
- All jobs have similar resource requirements

    # All jobs use ~4 CPUs, ~8GB memory
    jobs:
      - name: process_file_1
        command: ./process.sh file1.txt
      - name: process_file_2
        command: ./process.sh file2.txt
      # ... 100 similar jobs

- Resource requirements are negligible compared to node capacity
  - Running 100 lightweight Python scripts on a 64-core machine
  - I/O-bound jobs that don't consume much CPU/memory
- Jobs are I/O-bound or sleep frequently
  - Data download jobs
  - Jobs waiting on external services
  - Polling or monitoring tasks
- You want simplicity over precision
  - Quick prototypes
  - Testing workflows
  - Simple task queues
- Jobs self-limit their resource usage (see the sketch after this list)
  - Application has built-in thread pools
  - Container resource limits
  - OS-level cgroups or resource controls
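For illustration, one way a job can self-limit (as mentioned in the last item above) is to wrap its command in a container with explicit limits. The image name and limit values here are placeholders, not part of any real workflow.

    jobs:
      - name: bounded_task
        # Placeholder image and limits: the container caps this job at 4 CPUs and 8 GB
        # regardless of how many jobs the runner starts concurrently.
        command: docker run --cpus 4 --memory 8g my-image ./process.sh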
❌ Avoid queue-depth parallelism when:
- Jobs have heterogeneous resource requirements
  - Mix of 2-CPU and 32-CPU jobs
  - Some jobs need 4GB, others need 128GB
- Resource contention causes failures
  - Out-of-memory errors
  - CPU thrashing
  - GPU memory exhaustion
- You need efficient bin-packing
  - Maximizing node utilization
  - Complex resource constraints
- Jobs are compute-intensive
  - CPU-bound numerical simulations
  - Large matrix operations
  - Video encoding
Queue-Depth Parallelism in Practice
Example 1: Slurm with Queue Depth
# Schedule 4 Slurm nodes, each running up to 8 concurrent jobs
torc slurm schedule-nodes $WORKFLOW_ID \
--scheduler-config-id 1 \
--num-hpc-jobs 4 \
--max-parallel-jobs 8
This creates 4 Slurm job allocations. Each allocation runs a worker that:
- Pulls up to 8 jobs at a time
- Runs them concurrently
- Requests more when any job completes
Total concurrency: up to 32 jobs (4 nodes × 8 jobs/node)
Example 2: Local Runner with Queue Depth
# Run up to 20 jobs concurrently on local machine
torc-job-runner $WORKFLOW_ID \
--max-parallel-jobs 20 \
--output-dir ./output
Example 3: Mixed Approach
You can even run multiple runners with different strategies:
# Terminal 1: Resource-aware runner for large jobs
torc run $WORKFLOW_ID \
--num-cpus 32 \
--memory-gb 256
# Terminal 2: Queue-depth runner for small jobs
torc run $WORKFLOW_ID \
--max-parallel-jobs 50
The ready queue serves both runners. The resource-aware runner gets large jobs that fit its capacity, while the queue-depth runner gets small jobs for fast parallel execution.
Performance Characteristics
Resource-aware allocation:
- Query complexity: O(jobs in ready queue)
- Requires computing resource sums
- Slightly slower due to filtering and sorting
- Better resource utilization
Queue-depth allocation:
- Query complexity: O(1) with limit
- Simple LIMIT clause, no resource computation
- Faster queries
- Simpler logic
For workflows with thousands of ready jobs, queue-depth allocation has lower overhead.
Best Practices
- Start with resource-aware allocation for new workflows
  - Better default behavior
  - Prevents resource over-subscription
  - Easier to debug resource issues
- Use scheduler_id for production multi-cluster workflows
  - Explicit job placement
  - Predictable resource usage
  - Better for heterogeneous resources
- Use sort_method for flexible single-cluster workflows
  - Simpler specifications
  - Better resource utilization
  - Good for homogeneous resources
- Use queue-depth parallelism for homogeneous task queues
  - Many similar jobs
  - I/O-bound workloads
  - When simplicity matters more than precision
- Monitor resource usage when switching strategies
  - Check for over-subscription
  - Verify expected parallelism
  - Look for resource contention
- Test with small workflows first
  - Validate job allocation behavior
  - Check resource accounting
  - Ensure jobs run on intended schedulers
Summary
| Strategy | Use When | Allocation Method | Resource Tracking |
|---|---|---|---|
| Resource-aware + sort_method | Heterogeneous jobs, flexible allocation | Server filters by resources | Yes |
| Resource-aware + scheduler_id | Heterogeneous jobs, strict allocation | Server filters by resources AND scheduler | Yes |
| Queue-depth | Homogeneous jobs, simple parallelism | Server returns next N jobs | No |
Choose the strategy that best matches your workflow characteristics and execution environment. You can even mix strategies across different runners for maximum flexibility.