Tutorial 5: Advanced Multi-Dimensional Parameterization
This tutorial teaches you how to create multi-dimensional parameter sweeps—grid searches over multiple hyperparameters that generate all combinations automatically.
Learning Objectives
By the end of this tutorial, you will:
- Understand how multiple parameters create a Cartesian product (all combinations)
- Learn to structure complex workflows with data preparation, training, and aggregation stages
- Know how to combine parameterization with explicit dependencies
- See patterns for running large grid searches on HPC systems
Prerequisites
- Completed Tutorial 4: Simple Parameterization
- Torc server running
- Understanding of file dependencies
Multi-Dimensional Parameters: Cartesian Product
When a job has multiple parameters, Torc creates the Cartesian product—every combination of values:
parameters:
  lr: "[0.001,0.01]"  # 2 values
  bs: "[16,32]"       # 2 values
This generates 2 × 2 = 4 jobs:
- lr=0.001, bs=16
- lr=0.001, bs=32
- lr=0.01, bs=16
- lr=0.01, bs=32
With three parameters:
parameters:
  lr: "[0.0001,0.001,0.01]"  # 3 values
  bs: "[16,32,64]"           # 3 values
  opt: "['adam','sgd']"      # 2 values
This generates 3 × 3 × 2 = 18 jobs.
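If you want to sanity-check an expansion before creating a workflow, the combinatorics are easy to reproduce outside Torc. The sketch below is illustrative only (it is not Torc's own expansion code); it mirrors the three parameter lists above and previews the job names that the train_lr{lr:.4f}_bs{bs}_opt{opt} template used in Step 1 below will produce:

# Illustrative only: reproduce the Cartesian-product expansion in plain
# Python to preview job counts and names before creating the workflow.
from itertools import product

lrs = [0.0001, 0.001, 0.01]
bss = [16, 32, 64]
opts = ["adam", "sgd"]

combos = list(product(lrs, bss, opts))
print(len(combos))  # 18

# Same format specs as the job-name template in Step 1 below:
for lr, bs, opt in combos:
    print(f"train_lr{lr:.4f}_bs{bs}_opt{opt}")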
Step 1: Create the Workflow Specification
Save as grid_search.yaml:
name: hyperparameter_grid_search
description: 3D grid search over learning rate, batch size, and optimizer

jobs:
  # Data preparation (runs once, no parameters)
  - name: prepare_data
    command: python prepare_data.py --output=/data/processed.pkl
    resource_requirements: data_prep
    output_files:
      - training_data

  # Training jobs (one per parameter combination)
  - name: train_lr{lr:.4f}_bs{bs}_opt{opt}
    command: |
      python train.py \
        --data=/data/processed.pkl \
        --learning-rate={lr} \
        --batch-size={bs} \
        --optimizer={opt} \
        --output=/models/model_lr{lr:.4f}_bs{bs}_opt{opt}.pt \
        --metrics=/results/metrics_lr{lr:.4f}_bs{bs}_opt{opt}.json
    resource_requirements: gpu_training
    input_files:
      - training_data
    output_files:
      - model_lr{lr:.4f}_bs{bs}_opt{opt}
      - metrics_lr{lr:.4f}_bs{bs}_opt{opt}
    parameters:
      lr: "[0.0001,0.001,0.01]"
      bs: "[16,32,64]"
      opt: "['adam','sgd']"

  # Aggregate results (depends on ALL training jobs via file dependencies)
  - name: aggregate_results
    command: |
      python aggregate.py \
        --input-dir=/results \
        --output=/results/summary.csv
    resource_requirements: minimal
    input_files:
      - metrics_lr{lr:.4f}_bs{bs}_opt{opt}
    parameters:
      lr: "[0.0001,0.001,0.01]"
      bs: "[16,32,64]"
      opt: "['adam','sgd']"

  # Find best model (explicit dependency, no parameters)
  - name: select_best_model
    command: |
      python select_best.py \
        --summary=/results/summary.csv \
        --output=/results/best_config.json
    resource_requirements: minimal
    depends_on:
      - aggregate_results

files:
  - name: training_data
    path: /data/processed.pkl
  - name: model_lr{lr:.4f}_bs{bs}_opt{opt}
    path: /models/model_lr{lr:.4f}_bs{bs}_opt{opt}.pt
    parameters:
      lr: "[0.0001,0.001,0.01]"
      bs: "[16,32,64]"
      opt: "['adam','sgd']"
  - name: metrics_lr{lr:.4f}_bs{bs}_opt{opt}
    path: /results/metrics_lr{lr:.4f}_bs{bs}_opt{opt}.json
    parameters:
      lr: "[0.0001,0.001,0.01]"
      bs: "[16,32,64]"
      opt: "['adam','sgd']"

resource_requirements:
  - name: data_prep
    num_cpus: 8
    memory: 32g
    runtime: PT1H
  - name: gpu_training
    num_cpus: 8
    num_gpus: 1
    memory: 16g
    runtime: PT4H
  - name: minimal
    num_cpus: 1
    memory: 2g
    runtime: PT10M
Understanding the Structure
Four-stage workflow:
1. prepare_data (1 job) - No parameters, runs once
2. train_* (18 jobs) - Parameterized, all depend on prepare_data
3. aggregate_results (1 job) - Has parameters only for file dependency matching
4. select_best_model (1 job) - Explicit dependency on aggregate_results
Key insight: Why aggregate_results has parameters
The aggregate_results job won’t expand into multiple jobs (its name has no {}). However, it needs parameters: to match the parameterized input_files. This tells Torc: “this job depends on ALL 18 metrics files.”
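Conceptually, that single parameterized input_files entry resolves to the full fan-in list. Here is a hedged illustration in plain Python (again, not Torc's implementation) of the file names it stands for:

# Illustrative: the one parameterized input_files entry on aggregate_results
# stands for all 18 concrete metrics files, one per training job.
from itertools import product

metrics_files = [
    f"metrics_lr{lr:.4f}_bs{bs}_opt{opt}"
    for lr, bs, opt in product([0.0001, 0.001, 0.01], [16, 32, 64], ["adam", "sgd"])
]
print(len(metrics_files))  # 18 blocking inputs
print(metrics_files[0])    # metrics_lr0.0001_bs16_optadam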
Step 2: Create and Initialize the Workflow
WORKFLOW_ID=$(torc workflows create grid_search.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
torc workflows initialize-jobs $WORKFLOW_ID
Step 3: Verify the Expansion
Count the jobs:
torc jobs list $WORKFLOW_ID -f json | jq '.jobs | length'
Expected: 21 jobs (1 prepare + 18 training + 1 aggregate + 1 select)
List the training jobs:
torc jobs list $WORKFLOW_ID -f json | jq -r '.jobs[] | select(.name | startswith("train_")) | .name' | sort
Output (18 training jobs):
train_lr0.0001_bs16_optadam
train_lr0.0001_bs16_optsgd
train_lr0.0001_bs32_optadam
train_lr0.0001_bs32_optsgd
train_lr0.0001_bs64_optadam
train_lr0.0001_bs64_optsgd
train_lr0.0010_bs16_optadam
train_lr0.0010_bs16_optsgd
train_lr0.0010_bs32_optadam
train_lr0.0010_bs32_optsgd
train_lr0.0010_bs64_optadam
train_lr0.0010_bs64_optsgd
train_lr0.0100_bs16_optadam
train_lr0.0100_bs16_optsgd
train_lr0.0100_bs32_optadam
train_lr0.0100_bs32_optsgd
train_lr0.0100_bs64_optadam
train_lr0.0100_bs64_optsgd
Step 4: Examine the Dependency Graph
torc jobs list $WORKFLOW_ID
Initial states:
- prepare_data: ready (no dependencies)
- All train_*: blocked (waiting for the training_data file)
- aggregate_results: blocked (waiting for all 18 metrics files)
- select_best_model: blocked (waiting for aggregate_results)
Step 5: Run the Workflow
For local execution:
torc run $WORKFLOW_ID
Execution flow:
1. prepare_data runs and produces training_data
2. All 18 train_* jobs unblock and run in parallel (resource-limited)
3. aggregate_results waits for all training jobs, then runs
4. select_best_model runs last
Step 6: Monitor Progress
# Check status summary
torc workflows status $WORKFLOW_ID
# Watch job completion in real-time
watch -n 10 'torc jobs list-by-status $WORKFLOW_ID'
# Or use the TUI
torc tui
Step 7: Retrieve Results
After completion:
# View best configuration
cat /results/best_config.json
# View summary of all runs
cat /results/summary.csv
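If you want to poke at the summary yourself, a minimal sketch follows. The column name val_loss is an assumption for illustration; use whatever columns your aggregate.py actually writes:

# Hypothetical sketch: find the best row in summary.csv by hand.
# "val_loss" is an assumed column name -- adjust to aggregate.py's output.
import csv

with open("/results/summary.csv") as f:
    rows = list(csv.DictReader(f))

best = min(rows, key=lambda row: float(row["val_loss"]))
print(best)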
Scaling Considerations
Job Count Growth
Multi-dimensional parameters grow exponentially:
| Dimensions | Calculation (10 values each) | Total Jobs |
|---|---|---|
| 1 | 10 | 10 |
| 2 | 10 × 10 | 100 |
| 3 | 10 × 10 × 10 | 1,000 |
| 4 | 10 × 10 × 10 × 10 | 10,000 |
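The arithmetic behind the table, one line per row:

# Total jobs for d dimensions with 10 values each (illustrative arithmetic).
import math

for d in range(1, 5):
    print(f"{d} dimensions -> {math.prod([10] * d):,} jobs")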
Dependency Count
Without barriers, dependencies also grow quickly. In this tutorial:
- 18 training jobs each depend on 1 file = 18 dependencies
- 1 aggregate job depends on 18 files = 18 dependencies
- Total: ~36 dependencies
For larger sweeps (1000+ jobs), consider the barrier pattern to reduce dependencies from O(n²) to O(n).
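The barrier pattern itself is covered in the next tutorial; the sketch below only shows the arithmetic behind the O(n²)-to-O(n) claim, under the assumption that n producer jobs feed n consumer jobs:

# Rough dependency-edge counts, assuming n producers feed n consumers.
n = 1000
direct = n * n   # every consumer depends on every producer: 1,000,000 edges
barrier = 2 * n  # producers -> one barrier job -> consumers: 2,000 edges
print(direct, barrier)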
Common Patterns
Mixing Fixed and Parameterized Jobs
jobs:
  # Fixed job (no parameters)
  - name: setup
    command: ./setup.sh

  # Parameterized jobs depend on fixed job
  - name: experiment_{i}
    command: ./run.sh {i}
    depends_on: [setup]
    parameters:
      i: "1:100"
Aggregating Parameterized Results
Use the file dependency pattern shown in this tutorial:
- name: aggregate
  input_files:
    - result_{i}  # Matches all parameterized result files
  parameters:
    i: "1:100"    # Same parameters as producer jobs
Nested Parameter Sweeps
For workflows with multiple independent sweeps, each job's parameters expand separately, so the example below yields 10 + 10 = 20 jobs rather than a 10 × 10 product:
jobs:
  # Sweep 1
  - name: sweep1_job_{a}
    parameters:
      a: "1:10"

  # Sweep 2 (independent of sweep 1)
  - name: sweep2_job_{b}
    parameters:
      b: "1:10"
What You Learned
In this tutorial, you learned:
- ✅ How multiple parameters create a Cartesian product of jobs
- ✅ How to structure multi-stage workflows (prep → train → aggregate → select)
- ✅ How to use parameters in file dependencies to collect all outputs
- ✅ How to mix parameterized and non-parameterized jobs
- ✅ Scaling considerations for large grid searches
Example Files
See these example files for hyperparameter sweep patterns:
- hyperparameter_sweep.yaml - Basic 3×3×2 grid search
- hyperparameter_sweep_shared_params.yaml - Grid search with shared parameter definitions
Next Steps
- Multi-Stage Workflows with Barriers - Essential for scaling to thousands of jobs
- Working with Slurm - Deploy grid searches on HPC clusters
- Resource Monitoring - Track resource usage across your sweep