How to Monitor Resource Usage

This guide shows how to track CPU and memory usage of your workflow jobs and identify resource requirement mismatches.

Enable Resource Monitoring

Resource monitoring is enabled by default for all workflows. To explicitly configure it, add a resource_monitor section to your workflow specification:

name: "My Workflow"

resource_monitor:
  enabled: true
  granularity: "summary"       # or "time_series"
  sample_interval_seconds: 5

jobs:
  # ... your jobs

To disable monitoring when creating a workflow:

torc workflows create my_workflow.yaml --no-resource-monitoring

View Summary Metrics

For workflows using summary mode (default), view resource metrics with:

torc results list <workflow_id>

The output includes columns for peak and average CPU and memory usage.

Check for Resource Violations

Use check-resource-utilization to identify jobs that exceeded their specified requirements:

# Check latest run
torc reports check-resource-utilization <workflow_id>

# Check a specific run
torc reports check-resource-utilization <workflow_id> --run-id <run_id>

# Show all jobs, not just violations
torc reports check-resource-utilization <workflow_id> --all

Example output:

⚠ Found 3 resource over-utilization violations:

Job ID | Job Name         | Resource | Specified | Peak Used | Over-Utilization
-------|------------------|----------|-----------|-----------|------------------
15     | train_model      | Memory   | 8.00 GB   | 10.50 GB  | +31.3%
15     | train_model      | Runtime  | 2h 0m 0s  | 2h 45m 0s | +37.5%
16     | large_preprocess | CPU      | 800%      | 950.5%    | +18.8%

Adjust Resource Requirements

After identifying violations, update your workflow specification:

# Before: job used 10.5 GB but was allocated 8 GB
resource_requirements:
  - name: training
    memory: 8g
    runtime: PT2H

# After: increased with buffer
resource_requirements:
  - name: training
    memory: 12g       # 10.5 GB peak + ~14% buffer
    runtime: PT3H15M  # 2h 45m actual + ~18% buffer

Guidelines for buffers:

  • Memory: Add 10-20% above peak usage
  • Runtime: Add 15-30% above actual duration
  • CPU: Round up to the next whole core (assuming 100% equals one core, a 950.5% peak needs 10 cores)
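The buffer guidelines above can be applied mechanically to the measured peaks. The sketch below is one way to do that in Python. The helper names `with_buffer` and `cores_needed` are hypothetical, and the CPU helper assumes that 100% corresponds to one fully used core.

```python
import math


def with_buffer(value: float, buffer: float) -> float:
    """Return value increased by a fractional buffer (0.15 means +15%)."""
    if buffer < 0:
        raise ValueError("buffer must be non-negative")
    return value * (1.0 + buffer)


def cores_needed(peak_cpu_percent: float) -> int:
    """Round peak CPU usage up to a whole number of cores.

    Assumption: 100% CPU corresponds to one fully used core.
    """
    return math.ceil(peak_cpu_percent / 100.0)


# Measured peaks from the example report.
peak_memory_gb = 10.5
actual_runtime_min = 165.0   # 2h 45m
peak_cpu_percent = 950.5

low, high = with_buffer(peak_memory_gb, 0.10), with_buffer(peak_memory_gb, 0.20)
print(f"memory:  {low:.2f} to {high:.2f} GB")

low, high = with_buffer(actual_runtime_min, 0.15), with_buffer(actual_runtime_min, 0.30)
print(f"runtime: {low:.0f} to {high:.0f} minutes")

print(f"cpus:    {cores_needed(peak_cpu_percent)}")
```

The helpers return ranges instead of a single number so that you can pick a round value inside the range that suits your scheduler.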

Enable Time Series Monitoring

For detailed resource analysis over time, switch to time series mode:

resource_monitor:
  granularity: "time_series"
  sample_interval_seconds: 2

This writes samples to a SQLite database at the configured interval (every 2 seconds in this example).

Generate Resource Plots

Create interactive visualizations from time series data:

# Generate all plots
torc plot-resources output/resource_utilization/resource_metrics_*.db \
  -o plots/

# Generate plots for specific jobs
torc plot-resources output/resource_utilization/resource_metrics_*.db \
  -o plots/ \
  --job-ids 15,16

The tool generates:

  • Individual job plots showing CPU, memory, and process count over time
  • Overview plots comparing all jobs
  • Summary dashboard with bar charts

Query Time Series Data Directly

Access the SQLite database for custom analysis:

sqlite3 -table output/resource_utilization/resource_metrics_1_1.db
-- View samples for a specific job
SELECT job_id, timestamp, cpu_percent, memory_bytes, num_processes
FROM job_resource_samples
WHERE job_id = 1
ORDER BY timestamp;

-- View job metadata
SELECT * FROM job_metadata;
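The same data can be read from Python with the standard-library `sqlite3` module. The sketch below computes per-job peaks from the `job_resource_samples` table. It runs against an in-memory stand-in whose column types are assumptions based only on the column names in the query above; the helper names `peak_usage` and `make_demo_db` are hypothetical.

```python
import sqlite3

# Aggregate the samples into one row per job.
PEAK_QUERY = """
SELECT job_id,
       MAX(cpu_percent)  AS peak_cpu_percent,
       MAX(memory_bytes) AS peak_memory_bytes,
       COUNT(*)          AS num_samples
FROM job_resource_samples
GROUP BY job_id
ORDER BY job_id
"""


def peak_usage(conn: sqlite3.Connection) -> list:
    """Return (job_id, peak CPU %, peak memory bytes, sample count) per job."""
    return conn.execute(PEAK_QUERY).fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in; the column types are assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_resource_samples ("
        "job_id INTEGER, timestamp REAL, cpu_percent REAL, "
        "memory_bytes INTEGER, num_processes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO job_resource_samples VALUES (?, ?, ?, ?, ?)",
        [
            (1, 0.0, 50.0, 1_000_000, 1),
            (1, 2.0, 180.0, 3_000_000, 2),
            (2, 0.0, 95.0, 2_000_000, 1),
        ],
    )
    return conn


# For real data, replace make_demo_db() with sqlite3.connect(<path to the .db file>).
conn = make_demo_db()
for job_id, cpu, mem, n in peak_usage(conn):
    print(f"job {job_id}: peak CPU {cpu:.1f}%, peak memory {mem / 1e6:.1f} MB, {n} samples")
conn.close()
```

If the real schema differs from this stand-in, adjust the column names in `PEAK_QUERY` accordingly.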

Troubleshooting

No metrics recorded

  • Check that monitoring wasn’t disabled with --no-resource-monitoring
  • Ensure jobs run long enough for at least one sample to be taken (the default sample interval is 5 seconds)

Time series database not created

  • Verify the output directory is writable
  • Confirm granularity: "time_series" is set in the workflow spec

Missing child process metrics

  • Decrease sample_interval_seconds to catch short-lived processes
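To see why a shorter interval helps, consider a simplified model in which samples are taken at fixed ticks (0, interval, 2 × interval, ...) and a process is recorded only if it is alive at a tick. This is an illustrative assumption, not a description of Torc's actual sampling schedule, and the function name `is_observed` is hypothetical.

```python
import math


def is_observed(start: float, end: float, interval: float) -> bool:
    """Return True if a sample tick falls within the process lifetime [start, end).

    Simplified model: ticks occur at 0, interval, 2 * interval, ...
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    # First tick at or after the process start time.
    first_tick = math.ceil(start / interval) * interval
    return first_tick < end


# A child process that lives from t=6s to t=9s.
for interval in (5.0, 2.0):
    seen = is_observed(6.0, 9.0, interval)
    print(f"interval={interval:.0f}s -> observed: {seen}")
```

Under this model, any process that lives at least as long as the interval is always caught, while a shorter-lived one can fall between two ticks.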

Next Steps