
Resource Monitoring Reference

Technical reference for Torc’s resource monitoring system.

Configuration Options

The resource_monitor section in workflow specifications accepts the following fields:

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| enabled | boolean | true | Enable or disable monitoring |
| granularity | string | "summary" | "summary" or "time_series" |
| sample_interval_seconds | integer | 5 | Seconds between resource samples |
| generate_plots | boolean | false | Reserved for future use |
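For reference, a resource_monitor block with every field set explicitly might look like the sketch below. JSON syntax is assumed here; adapt the nesting to however your workflow specification file is actually structured.

```json
{
  "resource_monitor": {
    "enabled": true,
    "granularity": "time_series",
    "sample_interval_seconds": 5,
    "generate_plots": false
  }
}
```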

Granularity Modes

Summary mode ("summary"):

  • Stores only peak and average values per job
  • Metrics stored in the main database results table
  • Minimal storage overhead

Time series mode ("time_series"):

  • Stores samples at regular intervals
  • Creates separate SQLite database per workflow run
  • Database location: <output_dir>/resource_utilization/resource_metrics_<hostname>_<workflow_id>_<run_id>.db

Sample Interval Guidelines

| Job Duration | Recommended Interval |
|--------------|----------------------|
| < 1 hour | 1-2 seconds |
| 1-4 hours | 5 seconds (default) |
| > 4 hours | 10-30 seconds |
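When choosing an interval, it can help to estimate how many time-series rows a job will write. This is plain arithmetic, not a Torc API:

```python
def estimated_samples(duration_seconds: float, sample_interval_seconds: float = 5) -> int:
    """Approximate number of time-series rows one job will produce."""
    return int(duration_seconds // sample_interval_seconds)

# A 4-hour job at the default 5-second interval:
print(estimated_samples(4 * 3600))      # 2880 rows
# The same job sampled every 30 seconds:
print(estimated_samples(4 * 3600, 30))  # 480 rows
```

Longer intervals shrink the per-run SQLite database proportionally, at the cost of missing short utilization spikes.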

Time Series Database Schema

job_resource_samples Table

| Column | Type | Description |
|--------|------|-------------|
| id | INTEGER | Primary key |
| job_id | INTEGER | Torc job ID |
| timestamp | REAL | Unix timestamp |
| cpu_percent | REAL | CPU utilization percentage |
| memory_bytes | INTEGER | Memory usage in bytes |
| num_processes | INTEGER | Process count including children |

job_metadata Table

| Column | Type | Description |
|--------|------|-------------|
| job_id | INTEGER | Primary key, Torc job ID |
| job_name | TEXT | Human-readable job name |
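The two tables can be queried with ordinary SQL. This sketch builds an in-memory database with the documented columns and made-up sample values (the real file lives under the resource_utilization directory), then joins samples to job names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_resource_samples (
    id INTEGER PRIMARY KEY,
    job_id INTEGER,
    timestamp REAL,
    cpu_percent REAL,
    memory_bytes INTEGER,
    num_processes INTEGER
);
CREATE TABLE job_metadata (
    job_id INTEGER PRIMARY KEY,
    job_name TEXT
);
""")
conn.execute("INSERT INTO job_metadata VALUES (15, 'train_model')")
conn.executemany(
    "INSERT INTO job_resource_samples "
    "(job_id, timestamp, cpu_percent, memory_bytes, num_processes) "
    "VALUES (?, ?, ?, ?, ?)",
    [(15, 1700000000.0, 180.0, 6_000_000_000, 4),
     (15, 1700000005.0, 250.0, 8_500_000_000, 5)],
)

# Peak memory per job, labeled with the human-readable name
row = conn.execute("""
    SELECT m.job_name, MAX(s.memory_bytes)
    FROM job_resource_samples s JOIN job_metadata m USING (job_id)
    GROUP BY s.job_id
""").fetchone()
print(row)  # ('train_model', 8500000000)
```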

Summary Metrics in Results

When using summary mode, the following fields are added to job results:

| Field | Type | Description |
|-------|------|-------------|
| peak_cpu_percent | float | Maximum CPU percentage observed |
| avg_cpu_percent | float | Average CPU percentage |
| peak_memory_gb | float | Maximum memory in GB |
| avg_memory_gb | float | Average memory in GB |
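Taking the raw samples as inputs, the four summary fields reduce to a max/mean over the run plus a bytes-to-GB conversion. The sketch below uses made-up samples and assumes 1 GB = 1024³ bytes; Torc's exact conversion and rounding may differ:

```python
# Hypothetical samples collected over one job's run
memory_samples_bytes = [6_000_000_000, 8_500_000_000, 7_200_000_000]
cpu_samples_percent = [180.0, 250.0, 210.0]

GB = 1024 ** 3  # assumed binary gigabytes
peak_memory_gb = max(memory_samples_bytes) / GB
avg_memory_gb = sum(memory_samples_bytes) / len(memory_samples_bytes) / GB
peak_cpu_percent = max(cpu_samples_percent)
avg_cpu_percent = sum(cpu_samples_percent) / len(cpu_samples_percent)

print(f"peak_memory_gb={peak_memory_gb:.2f}, peak_cpu_percent={peak_cpu_percent}")
```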

check-resource-utilization JSON Output

When using --format json:

```json
{
  "workflow_id": 123,
  "run_id": null,
  "total_results": 10,
  "over_utilization_count": 3,
  "violations": [
    {
      "job_id": 15,
      "job_name": "train_model",
      "resource_type": "Memory",
      "specified": "8.00 GB",
      "peak_used": "10.50 GB",
      "over_utilization": "+31.3%"
    }
  ]
}
```
| Field | Description |
|-------|-------------|
| workflow_id | Workflow being analyzed |
| run_id | Specific run ID if provided, otherwise null for latest |
| total_results | Total number of completed jobs analyzed |
| over_utilization_count | Number of violations found |
| violations | Array of violation details |

Violation Object

| Field | Description |
|-------|-------------|
| job_id | Job ID with violation |
| job_name | Human-readable job name |
| resource_type | "Memory", "CPU", or "Runtime" |
| specified | Resource requirement from workflow spec |
| peak_used | Actual peak usage observed |
| over_utilization | Percentage over/under specification |
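The over_utilization figure is consistent with a simple ratio of peak usage to the specified amount. A hypothetical re-derivation of the example violation, assuming the field is computed as peak/specified - 1:

```python
def over_utilization_pct(specified_gb: float, peak_gb: float) -> float:
    """Percent over (+) or under (-) the specified requirement."""
    return (peak_gb / specified_gb - 1.0) * 100.0

# The example violation above: 10.50 GB peak against an 8.00 GB spec
print(over_utilization_pct(8.00, 10.50))  # 31.25, displayed by Torc as +31.3%
```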

plot-resources Output Files

| File | Description |
|------|-------------|
| resource_plot_job_<id>.html | Per-job timeline with CPU, memory, process count |
| resource_plot_cpu_all_jobs.html | CPU comparison across all jobs |
| resource_plot_memory_all_jobs.html | Memory comparison across all jobs |
| resource_plot_summary.html | Bar chart dashboard of peak vs average |

All plots are self-contained HTML files using Plotly.js with:

  • Interactive hover tooltips
  • Zoom and pan controls
  • Legend toggling
  • Export options (PNG, SVG)

Monitored Metrics

| Metric | Unit | Description |
|--------|------|-------------|
| CPU percentage | % | Total CPU utilization across all cores |
| Memory usage | bytes | Resident memory consumption |
| Process count | count | Number of processes in job’s process tree |

Process Tree Tracking

The monitoring system automatically tracks child processes spawned by jobs. When a job creates worker processes (e.g., Python multiprocessing), all descendants are included in the aggregated metrics.
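As an illustration of the aggregation (with made-up PIDs and byte counts, not Torc's internal code), collecting a process tree amounts to taking the transitive closure over parent links, then summing metrics across the members:

```python
# Hypothetical snapshot: pid -> (parent_pid, memory_bytes)
snapshot = {
    100: (1, 2_000_000_000),    # the job's main process
    101: (100, 1_500_000_000),  # multiprocessing worker
    102: (100, 1_500_000_000),  # multiprocessing worker
    103: (102, 500_000_000),    # grandchild spawned by a worker
    200: (1, 4_000_000_000),    # unrelated process, excluded
}

def descendants(root: int) -> set[int]:
    """Collect root plus every transitive child in the snapshot."""
    members = {root}
    changed = True
    while changed:
        changed = False
        for pid, (ppid, _) in snapshot.items():
            if ppid in members and pid not in members:
                members.add(pid)
                changed = True
    return members

tree = descendants(100)
total = sum(snapshot[pid][1] for pid in tree)
print(sorted(tree), total)  # [100, 101, 102, 103] 5500000000
```

The unrelated process is excluded even though it is alive, which is why per-job metrics stay accurate on shared nodes.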

Performance Characteristics

  • Single background monitoring thread regardless of job count
  • Typical overhead: <1% CPU even with 1-second sampling
  • Uses native OS APIs via the sysinfo crate
  • Non-blocking async design