Tutorial 3: User Data Dependencies

This tutorial teaches you how to pass structured data (JSON) between jobs using Torc’s user_data feature—an alternative to file-based dependencies that stores data directly in the database.

Learning Objectives

By the end of this tutorial, you will:

  • Understand what user_data is and when to use it instead of files
  • Learn how to define user_data entries and reference them in jobs
  • Know how to update user_data from within a job
  • See how user_data creates implicit dependencies (like files)

Prerequisites

Before starting, you should have completed the previous tutorials and have the torc CLI and jq available on your machine.

What is User Data?

User data is Torc’s mechanism for passing small, structured data between jobs without creating actual files. The data is stored in the Torc database and can be:

  • JSON objects (configurations, parameters)
  • Arrays
  • Simple values (strings, numbers)

Like files, user_data creates implicit dependencies: a job that reads user_data will be blocked until the job that writes it completes.

User Data vs Files

| Feature  | User Data                | Files                     |
|----------|--------------------------|---------------------------|
| Storage  | Torc database            | Filesystem                |
| Size     | Small (KB)               | Any size                  |
| Format   | JSON                     | Any format                |
| Access   | Via torc user-data CLI   | Direct file I/O           |
| Best for | Config, params, metadata | Datasets, binaries, logs  |

Step 1: Create the Workflow Specification

Save as user_data_workflow.yaml:

name: config_pipeline
description: Jobs that pass configuration via user_data

jobs:
  - name: generate_config
    command: |
      echo '{"learning_rate": 0.001, "batch_size": 32, "epochs": 10}' > /tmp/config.json
      torc user-data update ${user_data.output.ml_config} \
        --data "$(cat /tmp/config.json)"
    resource_requirements: minimal

  - name: train_model
    command: |
      echo "Training with config:"
      torc user-data get ${user_data.input.ml_config} | jq '.data'
      # In a real workflow: python train.py --config="${user_data.input.ml_config}"
    resource_requirements: gpu_large

  - name: evaluate_model
    command: |
      echo "Evaluating with config:"
      torc user-data get ${user_data.input.ml_config} | jq '.data'
      # In a real workflow: python evaluate.py --config="${user_data.input.ml_config}"
    resource_requirements: gpu_small

user_data:
  - name: ml_config
    data: null  # Will be populated by generate_config job

resource_requirements:
  - name: minimal
    num_cpus: 1
    memory: 1g
    runtime: PT5M

  - name: gpu_small
    num_cpus: 4
    num_gpus: 1
    memory: 16g
    runtime: PT1H

  - name: gpu_large
    num_cpus: 8
    num_gpus: 2
    memory: 32g
    runtime: PT4H

Understanding the Specification

Key elements:

  • user_data: section - Defines data entries, analogous to the files: section
  • data: null - Initial value; a job will populate it at runtime
  • ${user_data.output.ml_config} - The job writes this user_data (creates it)
  • ${user_data.input.ml_config} - The job reads this user_data (creates a dependency)

The dependency flow:

  1. generate_config outputs ml_config → runs first
  2. train_model and evaluate_model input ml_config → blocked until step 1 completes
  3. After generate_config finishes, both become ready and can run in parallel

Step 2: Create and Initialize the Workflow

# Create the workflow
WORKFLOW_ID=$(torc workflows create user_data_workflow.yaml -f json | jq -r '.id')
echo "Created workflow: $WORKFLOW_ID"

# Initialize jobs
torc workflows initialize-jobs $WORKFLOW_ID
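
When scripting these steps, a small guard catches a failed create early; note that jq -r prints the literal string null when the id field is missing:

# Abort if workflow creation failed (WORKFLOW_ID empty or "null")
if [ -z "$WORKFLOW_ID" ] || [ "$WORKFLOW_ID" = "null" ]; then
  echo "workflow creation failed" >&2
  exit 1
fi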

Step 3: Check Initial State

Before running, examine the user_data:

# Check user_data - should be null
torc user-data list $WORKFLOW_ID

Output:

╭────┬───────────┬──────┬─────────────╮
│ ID │ Name      │ Data │ Workflow ID │
├────┼───────────┼──────┼─────────────┤
│ 1  │ ml_config │ null │ 1           │
╰────┴───────────┴──────┴─────────────╯

Check job statuses:

torc jobs list $WORKFLOW_ID

You should see:

  • generate_config: ready (no input dependencies)
  • train_model: blocked (waiting for ml_config)
  • evaluate_model: blocked (waiting for ml_config)
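
If you prefer a machine-readable check, the same information can be filtered with jq. This sketch assumes torc jobs list accepts the same -f json flag as the other commands in this tutorial and that each record carries name and status fields:

# Show just the name and status of each job (field names are assumptions)
torc jobs list $WORKFLOW_ID -f json | jq '.[] | {name, status}'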

Step 4: Run the Workflow

torc run $WORKFLOW_ID

Step 5: Observe the Data Flow

After generate_config completes, check the updated user_data:

torc user-data list $WORKFLOW_ID -f json | jq '.[] | {name, data}'

Output:

{
  "name": "ml_config",
  "data": {
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 10
  }
}

The data is now stored in the database. At this point:

  • train_model and evaluate_model unblock
  • Both can read the configuration and run in parallel
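
You can also pull a single parameter out of the stored record; this sketch follows the list output structure shown above:

# Extract one field from the ml_config record
torc user-data list $WORKFLOW_ID -f json \
  | jq '.[] | select(.name == "ml_config") | .data.batch_size'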

Step 6: Verify Completion

After the workflow completes:

torc results list $WORKFLOW_ID

All three jobs should show return code 0.
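
To assert this in a script, something like the following works, assuming results list also supports -f json and that each record exposes a return_code field (both are assumptions, not confirmed by this tutorial):

# Exit nonzero if any job failed (jq -e sets the exit status)
torc results list $WORKFLOW_ID -f json | jq -e 'all(.[]; .return_code == 0)'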

How User Data Dependencies Work

The mechanism is identical to file dependencies:

| Syntax                    | Meaning              | Effect                         |
|---------------------------|----------------------|--------------------------------|
| ${user_data.input.name}   | Job reads this data  | Creates dependency on producer |
| ${user_data.output.name}  | Job writes this data | Satisfies dependencies         |

Torc substitutes these variables with the actual user_data ID at runtime, and the torc user-data CLI commands use that ID to read/write the data.
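
Concretely, if ml_config was assigned ID 1 as in the earlier listing, the command train_model actually executes looks like this:

# train_model's command after substitution (ID 1 taken from the listing above)
torc user-data get 1 | jq '.data'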

Accessing User Data in Your Code

From within a job, you can:

Read user_data:

# Get the full record
torc user-data get $USER_DATA_ID

# Get just the data field
torc user-data get $USER_DATA_ID | jq '.data'

# Save to a file for your application
torc user-data get $USER_DATA_ID | jq '.data' > config.json
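
A common follow-on is to pull individual fields into shell variables. The field names below match this tutorial's config; the train.py flags are hypothetical:

# Extract single parameters and pass them as CLI flags
LR=$(torc user-data get $USER_DATA_ID | jq -r '.data.learning_rate')
BATCH=$(torc user-data get $USER_DATA_ID | jq -r '.data.batch_size')
python train.py --learning-rate "$LR" --batch-size "$BATCH"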

Write user_data:

# Update with JSON data
torc user-data update $USER_DATA_ID --data '{"key": "value"}'

# Update from a file
torc user-data update $USER_DATA_ID --data "$(cat results.json)"
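
When the payload includes shell variables, building it with jq -n avoids hand-escaped quotes; a sketch:

# Construct the JSON payload with jq instead of manual escaping
PAYLOAD=$(jq -n --arg lr "0.001" '{learning_rate: ($lr | tonumber)}')
torc user-data update $USER_DATA_ID --data "$PAYLOAD"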

What You Learned

In this tutorial, you learned:

  • ✅ What user_data is: structured data stored in the Torc database
  • ✅ When to use it: configurations, parameters, metadata (not large files)
  • ✅ How to define user_data entries with the user_data: section
  • ✅ How ${user_data.input.*} and ${user_data.output.*} create dependencies
  • ✅ How to read and write user_data from within jobs

Common Patterns

Dynamic Configuration Generation

jobs:
  - name: analyze_data
    command: |
      # Analyze data and determine optimal parameters
      OPTIMAL_LR=$(python analyze.py --find-optimal-lr)
      torc user-data update ${user_data.output.optimal_params} \
        --data "{\"learning_rate\": $OPTIMAL_LR}"

Collecting Results from Multiple Jobs

jobs:
  - name: worker_{i}
    command: |
      RESULT=$(python process.py --id {i})
      torc user-data update ${user_data.output.result_{i}} --data "$RESULT"
    parameters:
      i: "1:10"

  - name: aggregate
    command: |
      # Collect all results
      for i in $(seq 1 10); do
        torc user-data get ${user_data.input.result_$i} >> all_results.json
      done
      python aggregate.py all_results.json
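
The loop above appends one full user_data record per worker, so all_results.json holds a stream of JSON objects rather than a single document. If your aggregation step wants one array of payloads instead, jq's slurp mode can merge the data fields first; a sketch:

# Merge the per-worker records into a single JSON array of data payloads
jq -s 'map(.data)' all_results.json > merged.json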

Next Steps