Tutorial 3: User Data Dependencies
This tutorial teaches you how to pass structured data (JSON) between jobs using Torc’s user_data feature—an alternative to file-based dependencies that stores data directly in the database.
Learning Objectives
By the end of this tutorial, you will:
- Understand what user_data is and when to use it instead of files
- Learn how to define user_data entries and reference them in jobs
- Know how to update user_data from within a job
- See how user_data creates implicit dependencies (like files)
Prerequisites
- Completed Tutorial 2: Diamond Workflow
- Torc server running
- `jq` command-line tool installed (for JSON parsing)
What is User Data?
User data is Torc’s mechanism for passing small, structured data between jobs without creating actual files. The data is stored in the Torc database and can be:
- JSON objects (configurations, parameters)
- Arrays
- Simple values (strings, numbers)
Like files, user_data creates implicit dependencies: a job that reads user_data will be blocked until the job that writes it completes.
User Data vs Files
| Feature | User Data | Files |
|---|---|---|
| Storage | Torc database | Filesystem |
| Size | Small (KB) | Any size |
| Format | JSON | Any format |
| Access | Via torc user-data CLI | Direct file I/O |
| Best for | Config, params, metadata | Datasets, binaries, logs |
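To make the contrast concrete, here is a minimal sketch of the same handoff done both ways. The /shared/params.json path is only illustrative, and $USER_DATA_ID stands for the entry's ID, which Torc substitutes into job commands as shown below in Step 1.

# File-based handoff: the producer writes a file, the consumer reads it
echo '{"batch_size": 32}' > /shared/params.json              # producer job
BATCH_SIZE=$(jq -r '.batch_size' /shared/params.json)        # consumer job

# user_data handoff: the same value travels through the Torc database
torc user-data update $USER_DATA_ID --data '{"batch_size": 32}'            # producer job
BATCH_SIZE=$(torc user-data get $USER_DATA_ID | jq -r '.data.batch_size')  # consumer job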
Step 1: Create the Workflow Specification
Save as user_data_workflow.yaml:
name: config_pipeline
description: Jobs that pass configuration via user_data

jobs:
  - name: generate_config
    command: |
      echo '{"learning_rate": 0.001, "batch_size": 32, "epochs": 10}' > /tmp/config.json
      torc user-data update ${user_data.output.ml_config} \
        --data "$(cat /tmp/config.json)"
    resource_requirements: minimal

  - name: train_model
    command: |
      echo "Training with config:"
      torc user-data get ${user_data.input.ml_config} | jq '.data'
      # In a real workflow: python train.py --config="${user_data.input.ml_config}"
    resource_requirements: gpu_large

  - name: evaluate_model
    command: |
      echo "Evaluating with config:"
      torc user-data get ${user_data.input.ml_config} | jq '.data'
      # In a real workflow: python evaluate.py --config="${user_data.input.ml_config}"
    resource_requirements: gpu_small

user_data:
  - name: ml_config
    data: null  # Will be populated by the generate_config job

resource_requirements:
  - name: minimal
    num_cpus: 1
    memory: 1g
    runtime: PT5M
  - name: gpu_small
    num_cpus: 4
    num_gpus: 1
    memory: 16g
    runtime: PT1H
  - name: gpu_large
    num_cpus: 8
    num_gpus: 2
    memory: 32g
    runtime: PT4H
Understanding the Specification
Key elements:
- `user_data:` section - Defines data entries, similar to `files:`
- `data: null` - Initial value; will be populated by a job
- `${user_data.output.ml_config}` - The job writes this user_data (and creates it)
- `${user_data.input.ml_config}` - The job reads this user_data (and creates a dependency)
The dependency flow:
1. `generate_config` outputs `ml_config` → runs first
2. `train_model` and `evaluate_model` input `ml_config` → blocked until step 1 completes
3. After `generate_config` finishes, both become ready and can run in parallel
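Sketched as a dependency graph:

generate_config ──writes──▶ ml_config ──read by──▶ train_model
                                      └──read by──▶ evaluate_model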
Step 2: Create and Initialize the Workflow
# Create the workflow
WORKFLOW_ID=$(torc workflows create user_data_workflow.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"
# Initialize jobs
torc workflows initialize-jobs $WORKFLOW_ID
Step 3: Check Initial State
Before running, examine the user_data:
# Check user_data - should be null
torc user-data list $WORKFLOW_ID
Output:
╭────┬───────────┬──────┬─────────────╮
│ ID │ Name │ Data │ Workflow ID │
├────┼───────────┼──────┼─────────────┤
│ 1 │ ml_config │ null │ 1 │
╰────┴───────────┴──────┴─────────────╯
Check job statuses:
torc jobs list $WORKFLOW_ID
You should see:
- `generate_config`: ready (no input dependencies)
- `train_model`: blocked (waiting for `ml_config`)
- `evaluate_model`: blocked (waiting for `ml_config`)
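If you prefer machine-readable output, a jq one-liner can summarize the statuses. This assumes `torc jobs list` accepts -f json like the other list commands in this tutorial, and that each record has name and status fields; adjust the filter if your output differs.

# Summarize job name and status (field names are an assumption)
torc jobs list $WORKFLOW_ID -f json | jq -r '.[] | "\(.name): \(.status)"'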
Step 4: Run the Workflow
torc run $WORKFLOW_ID
Step 5: Observe the Data Flow
After generate_config completes, check the updated user_data:
torc user-data list $WORKFLOW_ID -f json | jq '.[] | {name, data}'
Output:
{
"name": "ml_config",
"data": {
"learning_rate": 0.001,
"batch_size": 32,
"epochs": 10
}
}
The data is now stored in the database. At this point:
- `train_model` and `evaluate_model` unblock
- Both can read the configuration and run in parallel
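You can also pull a single value out of the stored configuration, for example:

# Extract one field from the stored configuration (prints 0.001 for this workflow)
torc user-data list $WORKFLOW_ID -f json | jq '.[] | select(.name == "ml_config") | .data.learning_rate'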
Step 6: Verify Completion
After the workflow completes:
torc results list $WORKFLOW_ID
All three jobs should show return code 0.
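To check this programmatically, something like the following should work, assuming `torc results list` also supports -f json and reports a return code field (the exact field name may differ in your Torc version):

# Exit non-zero if any job reported a non-zero return code (field name is an assumption)
torc results list $WORKFLOW_ID -f json | jq -e 'all(.[]; .return_code == 0)'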
How User Data Dependencies Work
The mechanism is identical to file dependencies:
| Syntax | Meaning | Effect |
|---|---|---|
| `${user_data.input.name}` | Job reads this data | Creates a dependency on the producer |
| `${user_data.output.name}` | Job writes this data | Satisfies those dependencies when the job completes |
Torc substitutes these variables with the actual user_data ID at runtime, and the torc user-data CLI commands use that ID to read/write the data.
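For example, if ml_config was assigned ID 1 (as in the listing from Step 3), the substitution looks roughly like this:

# What you write in the job command:
torc user-data get ${user_data.input.ml_config} | jq '.data'
# What actually runs after Torc substitutes the ID:
torc user-data get 1 | jq '.data'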
Accessing User Data in Your Code
From within a job, you can:
Read user_data:
# Get the full record
torc user-data get $USER_DATA_ID
# Get just the data field
torc user-data get $USER_DATA_ID | jq '.data'
# Save to a file for your application
torc user-data get $USER_DATA_ID | jq '.data' > config.json
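You can also read individual fields straight into shell variables; the train.py invocation below is only a placeholder for your own program:

# Read single values into shell variables
LEARNING_RATE=$(torc user-data get $USER_DATA_ID | jq -r '.data.learning_rate')
BATCH_SIZE=$(torc user-data get $USER_DATA_ID | jq -r '.data.batch_size')
python train.py --lr "$LEARNING_RATE" --batch-size "$BATCH_SIZE"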
Write user_data:
# Update with JSON data
torc user-data update $USER_DATA_ID --data '{"key": "value"}'
# Update from a file
torc user-data update $USER_DATA_ID --data "$(cat results.json)"
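When the payload contains shell variables, building it with jq -n avoids hand-escaping quotes (the accuracy value here is just an example):

# Build the JSON payload with jq instead of string interpolation
ACCURACY=0.94
torc user-data update $USER_DATA_ID \
  --data "$(jq -n --argjson acc "$ACCURACY" '{accuracy: $acc}')"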
What You Learned
In this tutorial, you learned:
- ✅ What user_data is: structured data stored in the Torc database
- ✅ When to use it: configurations, parameters, metadata (not large files)
- ✅ How to define user_data entries with the `user_data:` section
- ✅ How `${user_data.input.*}` and `${user_data.output.*}` create dependencies
- ✅ How to read and write user_data from within jobs
Common Patterns
Dynamic Configuration Generation
jobs:
- name: analyze_data
command: |
# Analyze data and determine optimal parameters
OPTIMAL_LR=$(python analyze.py --find-optimal-lr)
torc user-data update ${user_data.output.optimal_params} \
--data "{\"learning_rate\": $OPTIMAL_LR}"
Collecting Results from Multiple Jobs
jobs:
- name: worker_{i}
command: |
RESULT=$(python process.py --id {i})
torc user-data update ${user_data.output.result_{i}} --data "$RESULT"
parameters:
i: "1:10"
- name: aggregate
command: |
# Collect all results
for i in $(seq 1 10); do
torc user-data get ${user_data.input.result_$i} >> all_results.json
done
python aggregate.py all_results.json
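Because each torc user-data get call appends a full record, all_results.json ends up as a stream of JSON documents rather than a single array. If your aggregation script expects one array of payloads, jq's slurp mode can reshape it first (merged_results.json is a hypothetical intermediate file):

# Merge the concatenated records into a single array of .data payloads
jq -s '[.[] | .data]' all_results.json > merged_results.json
python aggregate.py merged_results.json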
Next Steps
- Tutorial 4: Simple Parameterization - Create parameter sweeps
- Tutorial 5: Advanced Parameterization - Multi-dimensional grid searches