# Tutorial 2: Diamond Workflow with File Dependencies
This tutorial teaches you how to create workflows where job dependencies are automatically inferred from file inputs and outputs—a core concept in Torc called implicit dependencies.
## Learning Objectives
By the end of this tutorial, you will:
- Understand how Torc infers job dependencies from file relationships
- Learn the “diamond” workflow pattern (fan-out and fan-in)
- Know how to use file variable substitution (`${files.input.*}` and `${files.output.*}`)
- See how jobs automatically unblock when their input files become available
## Prerequisites
- Completed Tutorial 1: Many Independent Jobs
- Torc server running
## The Diamond Pattern
The “diamond” pattern is a common workflow structure where:
- One job produces multiple outputs (fan-out)
- Multiple jobs process those outputs in parallel
- A final job combines all results (fan-in)
```mermaid
graph TD
    Input["input.txt"] --> Preprocess["preprocess<br/>(generates intermediate files)"]
    Preprocess --> Int1["intermediate1.txt"]
    Preprocess --> Int2["intermediate2.txt"]
    Int1 --> Work1["work1<br/>(process intermediate1)"]
    Int2 --> Work2["work2<br/>(process intermediate2)"]
    Work1 --> Result1["result1.txt"]
    Work2 --> Result2["result2.txt"]
    Result1 --> Postprocess["postprocess<br/>(combines results)"]
    Result2 --> Postprocess
    Postprocess --> Output["output.txt"]
```
Notice that we never explicitly say “`work1` depends on `preprocess`”; Torc figures this out automatically because `work1` needs `intermediate1.txt` as input, and `preprocess` produces it as output.
## Step 1: Create the Workflow Specification

Save the following as `diamond.yaml`:
```yaml
name: diamond_workflow
description: Diamond workflow demonstrating fan-out and fan-in

jobs:
  - name: preprocess
    command: |
      cat ${files.input.input_file} | awk '{print $1}' > ${files.output.intermediate1}
      cat ${files.input.input_file} | awk '{print $2}' > ${files.output.intermediate2}
    resource_requirements: small

  - name: work1
    command: |
      cat ${files.input.intermediate1} | sort | uniq > ${files.output.result1}
    resource_requirements: medium

  - name: work2
    command: |
      cat ${files.input.intermediate2} | sort | uniq > ${files.output.result2}
    resource_requirements: medium

  - name: postprocess
    command: |
      paste ${files.input.result1} ${files.input.result2} > ${files.output.final_output}
    resource_requirements: small

files:
  - name: input_file
    path: /tmp/input.txt
  - name: intermediate1
    path: /tmp/intermediate1.txt
  - name: intermediate2
    path: /tmp/intermediate2.txt
  - name: result1
    path: /tmp/result1.txt
  - name: result2
    path: /tmp/result2.txt
  - name: final_output
    path: /tmp/output.txt

resource_requirements:
  - name: small
    num_cpus: 1
    num_gpus: 0
    num_nodes: 1
    memory: 1g
    runtime: PT10M
  - name: medium
    num_cpus: 4
    num_gpus: 0
    num_nodes: 1
    memory: 4g
    runtime: PT30M
```
## Understanding File Variable Substitution
The key concept here is file variable substitution:

- `${files.input.filename}`: references a file this job reads (creates a dependency)
- `${files.output.filename}`: references a file this job writes (satisfies dependencies)
When Torc processes the workflow:

- It sees that `preprocess` outputs `intermediate1` and `intermediate2`
- It sees that `work1` inputs `intermediate1` → dependency created
- It sees that `work2` inputs `intermediate2` → dependency created
- It sees that `postprocess` inputs `result1` and `result2` → dependencies created
This is more maintainable than explicit `depends_on` declarations because:
- Dependencies are derived from actual data flow
- Adding a new intermediate step automatically updates dependencies
- The workflow specification documents the data flow
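
To make the substitution concrete, here is roughly what the `work1` command from the specification above looks like once Torc resolves the file variables to the paths defined in the `files:` section (a sketch; the exact resolved form may vary):

```bash
# work1's command after ${files.*} substitution, assuming the paths from diamond.yaml
cat /tmp/intermediate1.txt | sort | uniq > /tmp/result1.txt
```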
## Step 2: Create Input Data

```bash
# Create test input file
echo -e "apple red\nbanana yellow\ncherry red\ndate brown" > /tmp/input.txt
```
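
A quick sanity check that the input looks as expected:

```bash
cat /tmp/input.txt
# apple red
# banana yellow
# cherry red
# date brown
```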
## Step 3: Create and Initialize the Workflow

```bash
# Create the workflow and capture the ID
WORKFLOW_ID=$(torc workflows create diamond.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"

# Ensure the input file timestamp is current
touch /tmp/input.txt

# Initialize the workflow (builds dependency graph)
torc workflows initialize-jobs $WORKFLOW_ID
```
The `initialize-jobs` command is where Torc:
- Analyzes file input/output relationships
- Builds the dependency graph
- Marks jobs with satisfied dependencies as “ready”
## Step 4: Observe Dependency Resolution

```bash
# Check job statuses
torc jobs list $WORKFLOW_ID
```
Expected output:
```
╭────┬──────────────┬─────────┬────────╮
│ ID │ Name         │ Status  │ ...    │
├────┼──────────────┼─────────┼────────┤
│ 1  │ preprocess   │ ready   │ ...    │
│ 2  │ work1        │ blocked │ ...    │
│ 3  │ work2        │ blocked │ ...    │
│ 4  │ postprocess  │ blocked │ ...    │
╰────┴──────────────┴─────────┴────────╯
```
Only `preprocess` is ready because:

- Its only input (`input_file`) already exists
- The others are blocked waiting for files that don’t exist yet
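
You can confirm this by checking that none of the downstream files exist yet (these are the paths from the `files:` section of `diamond.yaml`):

```bash
# Should print nothing before the workflow runs
ls /tmp/intermediate*.txt /tmp/result*.txt /tmp/output.txt 2>/dev/null
```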
## Step 5: Run the Workflow

```bash
torc run $WORKFLOW_ID
```
Watch the execution unfold:

1. `preprocess` runs first - creates `intermediate1.txt` and `intermediate2.txt`
2. `work1` and `work2` unblock - their input files now exist
3. `work1` and `work2` run in parallel - they have no dependency on each other
4. `postprocess` unblocks - both `result1.txt` and `result2.txt` exist
5. `postprocess` runs - creates the final output
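
To watch the statuses change while the workflow runs, you can poll `torc jobs list` from a second terminal. A minimal sketch (assuming `WORKFLOW_ID` is set in that shell; the 2-second interval is arbitrary):

```bash
# Re-list job statuses every 2 seconds; stop with Ctrl-C
while true; do
    torc jobs list $WORKFLOW_ID
    sleep 2
done
```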
## Step 6: Verify Results

```bash
cat /tmp/output.txt
```
You should see the combined, sorted, unique values from both columns of the input.
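
With the sample input from Step 2, the file should look something like this (fields are tab-separated):

```
apple	brown
banana	red
cherry	yellow
date
```

The last line has an empty second field because the input contains four distinct fruits but only three distinct colors.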
## How Implicit Dependencies Work
Torc determines job order through file relationships:
| Job | Inputs | Outputs | Blocked By |
|---|---|---|---|
| preprocess | input_file | intermediate1, intermediate2 | (nothing) |
| work1 | intermediate1 | result1 | preprocess |
| work2 | intermediate2 | result2 | preprocess |
| postprocess | result1, result2 | final_output | work1, work2 |
The dependency graph is built automatically from these relationships. If you later add a validation step between `preprocess` and `work1`, you only need to update the file references; the dependencies adjust automatically.
## What You Learned
In this tutorial, you learned:

- ✅ How to define files with the `files:` section and reference them in jobs
- ✅ How `${files.input.*}` creates implicit dependencies
- ✅ How `${files.output.*}` satisfies dependencies for downstream jobs
- ✅ The diamond pattern: fan-out → parallel processing → fan-in
- ✅ How Torc automatically determines execution order from data flow
## When to Use File Dependencies vs Explicit Dependencies
Use file dependencies when:
- Jobs actually read/write files
- Data flow defines the natural ordering
- You want self-documenting workflows
Use explicit `depends_on` when:
- Dependencies are logical, not data-based
- Jobs communicate through side effects
- You need precise control over ordering
## Example Files
See the diamond workflow examples in all three formats:
A Python version is also available: `diamond_workflow.py`
## Next Steps
- Tutorial 3: User Data Dependencies - Pass JSON data between jobs without files
- Tutorial 4: Simple Parameterization - Combine file dependencies with parameter expansion