Debugging Slurm Workflows
When running workflows on Slurm clusters, Torc provides additional debugging tools specifically designed for Slurm environments. This guide covers Slurm-specific debugging techniques and tools.
For general debugging concepts and tools that apply to all workflows, see Debugging Workflows.
Overview
Slurm workflows generate additional log files beyond the standard job logs:
- Slurm stdout/stderr: Output from Slurm’s perspective (job allocation, environment setup)
- Slurm environment logs: All SLURM environment variables captured at job runner startup
- dmesg logs: Kernel message buffer captured when the Slurm job runner exits
These logs help diagnose issues specific to the cluster environment, such as resource allocation failures, node problems, and system-level errors.
Slurm Log File Structure
For jobs executed via Slurm scheduler (compute_node_type: "slurm"), the debug report includes these additional log paths:
{
"job_stdout": "output/job_stdio/job_456.o",
"job_stderr": "output/job_stdio/job_456.e",
"job_runner_log": "output/job_runner_slurm_12345_node01_67890.log",
"slurm_stdout": "output/slurm_output_12345.o",
"slurm_stderr": "output/slurm_output_12345.e",
"slurm_env_log": "output/slurm_env_12345_node01_67890.log",
"dmesg_log": "output/dmesg_slurm_12345_node01_67890.log"
}
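If you have jq available, these paths can be pulled straight out of a debug report. A minimal sketch, assuming each entry in the report’s results array carries the fields shown above:
# Print the Slurm stderr paths for all failed jobs (field names assumed per the structure above)
torc reports results <workflow_id> > report.json
jq -r '.results[] | select(.return_code != 0) | .slurm_stderr' report.json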
Log File Descriptions
- slurm_stdout (`output/slurm_output_<slurm_job_id>.o`):
  - Standard output from Slurm’s perspective
  - Includes Slurm environment setup, job allocation info
  - Use for: Debugging Slurm job submission issues
- slurm_stderr (`output/slurm_output_<slurm_job_id>.e`):
  - Standard error from Slurm’s perspective
  - Contains Slurm-specific errors (allocation failures, node issues)
  - Use for: Investigating Slurm scheduler problems
- slurm_env_log (`output/slurm_env_<slurm_job_id>_<node_id>_<task_pid>.log`):
  - All SLURM environment variables captured at job runner startup
  - Contains job allocation details, resource limits, node assignments
  - Use for: Verifying Slurm job configuration, debugging resource allocation issues
- dmesg log (`output/dmesg_slurm_<slurm_job_id>_<node_id>_<task_pid>.log`):
  - Kernel message buffer captured when the Slurm job runner exits
  - Contains system-level events: OOM killer activity, hardware errors, kernel panics
  - Use for: Investigating job failures caused by system-level issues (e.g., out-of-memory kills, hardware failures)
Note: Slurm job runner logs include the Slurm job ID, node ID, and task PID in the filename for correlation with Slurm’s own logs.
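Because the Slurm job ID is embedded in each filename, you can cross-check a runner log against Slurm’s own accounting. For example, using standard sacct flags (field availability may vary by Slurm version):
# Look up the Slurm job referenced in dmesg_slurm_12345_node01_67890.log
sacct -j 12345 --format=JobID,State,ExitCode,MaxRSS,Elapsed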
Parsing Slurm Log Files for Errors
The torc slurm parse-logs command scans Slurm stdout/stderr log files for known error patterns and correlates them with affected Torc jobs:
# Parse logs for a specific workflow
torc slurm parse-logs <workflow_id>
# Specify custom output directory
torc slurm parse-logs <workflow_id> --output-dir /path/to/output
# Output as JSON for programmatic processing
torc slurm parse-logs <workflow_id> --format json
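The JSON output is convenient for scripting. As a sketch, assuming the output mirrors the table columns shown under Example Output below with snake_case keys (the actual schema may differ), you could keep only critical errors:
# Hypothetical jq filter over parse-logs JSON output; adjust key names to the real schema
torc slurm parse-logs <workflow_id> --format json | jq '[.[] | select(.severity == "critical")]'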
Detected Error Patterns
The command detects common Slurm failure patterns including:
Memory Errors:
`out of memory`, `oom-kill`, `cannot allocate memory`, `memory cgroup out of memory`, `Exceeded job memory limit`, `task/cgroup: .*: Killed`, `std::bad_alloc` (C++), `MemoryError` (Python)
Slurm-Specific Errors:
`slurmstepd: error:`, `srun: error:`, `DUE TO TIME LIMIT`, `DUE TO PREEMPTION`, `NODE_FAIL`, `FAILED`, `CANCELLED`, `Exceeded.*step.*limit`
GPU/CUDA Errors:
`CUDA out of memory`, `CUDA error`, `GPU memory.*exceeded`
Signal/Crash Errors:
`Segmentation fault`, `SIGSEGV`, `Bus error`, `SIGBUS`, `killed by signal`, `core dumped`
Python Errors:
`Traceback (most recent call last)`, `ModuleNotFoundError`, `ImportError`
File System Errors:
`No space left on device`, `Disk quota exceeded`, `Read-only file system`, `Permission denied`
Network Errors:
`Connection refused`, `Connection timed out`, `Network is unreachable`
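The same patterns can also be grepped manually if you want a quick scan of a log directory without running the full command:
# Scan Slurm stderr files for a few of the patterns listed above
grep -nE 'out of memory|oom-kill|DUE TO TIME LIMIT|NODE_FAIL|CUDA out of memory' output/slurm_output_*.e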
Example Output
Table format:
Slurm Log Analysis Results
==========================
Found 2 error(s) in log files:
╭─────────────────────────────┬──────────────┬──────┬─────────────────────────────┬──────────┬──────────────────────────────╮
│ File │ Slurm Job ID │ Line │ Pattern │ Severity │ Affected Torc Jobs │
├─────────────────────────────┼──────────────┼──────┼─────────────────────────────┼──────────┼──────────────────────────────┤
│ slurm_output_12345.e │ 12345 │ 42 │ Out of Memory (OOM) Kill │ critical │ process_data (ID: 456) │
│ slurm_output_12346.e │ 12346 │ 15 │ CUDA out of memory │ error │ train_model (ID: 789) │
╰─────────────────────────────┴──────────────┴──────┴─────────────────────────────┴──────────┴──────────────────────────────╯
Viewing Slurm Accounting Data
The torc slurm sacct command displays a summary of Slurm job accounting data for all scheduled compute nodes in a workflow:
# Display sacct summary table for a workflow
torc slurm sacct <workflow_id>
# Also save full JSON files for detailed analysis
torc slurm sacct <workflow_id> --save-json --output-dir /path/to/output
# Output as JSON for programmatic processing
torc slurm sacct <workflow_id> --format json
Summary Table Fields
The command displays a summary table with key metrics:
- Slurm Job: The Slurm job ID
- Job Step: Name of the job step (e.g., “worker_1”, “batch”)
- State: Job state (COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, etc.)
- Exit Code: Exit code of the job step
- Elapsed: Wall clock time for the job step
- Max RSS: Maximum resident set size (memory usage)
- CPU Time: Total CPU time consumed
- Nodes: Compute nodes used
Example Output
Slurm Accounting Summary for Workflow 123
╭────────────┬───────────┬───────────┬───────────┬─────────┬─────────┬──────────┬─────────╮
│ Slurm Job │ Job Step │ State │ Exit Code │ Elapsed │ Max RSS │ CPU Time │ Nodes │
├────────────┼───────────┼───────────┼───────────┼─────────┼─────────┼──────────┼─────────┤
│ 12345 │ worker_1 │ COMPLETED │ 0 │ 2h 15m │ 4.5GB │ 4h 30m │ node01 │
│ 12345 │ batch │ COMPLETED │ 0 │ 2h 16m │ 128.0MB │ 1m 30s │ node01 │
│ 12346 │ worker_1 │ FAILED │ 1 │ 45m 30s │ 8.2GB │ 1h 30m │ node02 │
╰────────────┴───────────┴───────────┴───────────┴─────────┴─────────┴──────────┴─────────╯
Total: 3 job steps
Saving Full JSON Output
Use --save-json to save full sacct JSON output to files for detailed analysis:
torc slurm sacct 123 --save-json --output-dir output
# Creates: output/sacct_12345.json, output/sacct_12346.json, etc.
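These files contain Slurm’s full sacct JSON. For example, the jq path used later in this guide extracts the requested resources for every job step:
# Show requested TRES for each step of Slurm job 12345
cat output/sacct_12345.json | jq '.jobs[].steps[].tres.requested'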
Viewing Slurm Logs in torc-dash
The torc-dash web interface provides two ways to view Slurm logs:
Debugging Tab - Slurm Log Analysis
The Debugging tab includes a “Slurm Log Analysis” section:
- Navigate to the Debugging tab
- Find the Slurm Log Analysis section
- Enter the output directory path (default: `output`)
- Click Analyze Slurm Logs
The results show all detected errors with their Slurm job IDs, line numbers, error patterns, severity levels, and affected Torc jobs.
Debugging Tab - Slurm Accounting Data
The Debugging tab also includes a “Slurm Accounting Data” section:
- Navigate to the Debugging tab
- Find the Slurm Accounting Data section
- Click Collect sacct Data
This displays a summary table showing job state, exit codes, elapsed time, memory usage (Max RSS), CPU time, and nodes for all Slurm job steps. The table helps quickly identify failed jobs and resource usage patterns.
Scheduled Nodes Tab - View Slurm Logs
You can view individual Slurm job logs directly from the Details view:
- Select a workflow
- Go to the Details tab
- Switch to the Scheduled Nodes sub-tab
- Find a Slurm scheduled node in the table
- Click the View Logs button in the Logs column
This opens a modal with tabs for viewing the Slurm job’s stdout and stderr files.
Viewing Slurm Logs in the TUI
The torc tui terminal interface also supports Slurm log viewing:
- Launch the TUI: `torc tui`
- Select a workflow and press Enter to load details
- Press Tab to switch to the Scheduled Nodes tab
- Navigate to a Slurm scheduled node using arrow keys
- Press `l` to view the Slurm job’s logs
The log viewer shows:
- stdout tab: Slurm job standard output (`slurm_output_<id>.o`)
- stderr tab: Slurm job standard error (`slurm_output_<id>.e`)
Use Tab to switch between stdout/stderr, arrow keys to scroll, / to search, and q to close.
Debugging Slurm Job Failures
When a Slurm job fails, follow this debugging workflow:
- Parse logs for known errors:
  torc slurm parse-logs <workflow_id>
- If OOM or resource issues are detected, collect sacct data:
  torc slurm sacct <workflow_id> --save-json
  cat output/sacct_<slurm_job_id>.json | jq '.jobs[].steps[].tres.requested'
- View the specific Slurm log files:
  - Use torc-dash: Details → Scheduled Nodes → View Logs
  - Or use TUI: Scheduled Nodes tab → press `l`
  - Or directly: cat output/slurm_output_<slurm_job_id>.e
- Check the job’s own stderr for application errors:
  torc reports results <workflow_id> > report.json
  jq -r '.results[] | select(.return_code != 0) | .job_stderr' report.json | xargs cat
- Review dmesg logs for system-level issues:
  cat output/dmesg_slurm_<slurm_job_id>_*.log
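These steps can be strung together into a small triage script. A minimal sketch using only the commands documented in this guide:
# Triage a workflow's Slurm failures in one pass
WORKFLOW_ID=<workflow_id>
# 1. Scan Slurm logs for known error patterns
torc slurm parse-logs "$WORKFLOW_ID"
# 2. Collect accounting data and save the full JSON for inspection
torc slurm sacct "$WORKFLOW_ID" --save-json --output-dir output
# 3. Dump stderr from failed Torc jobs
torc reports results "$WORKFLOW_ID" > report.json
jq -r '.results[] | select(.return_code != 0) | .job_stderr' report.json | xargs -r cat
# 4. Review kernel messages captured at job runner exit
cat output/dmesg_slurm_*.log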
Common Slurm Issues and Solutions
Out of Memory (OOM) Kills
Symptoms:
- `torc slurm parse-logs` shows “Out of Memory (OOM) Kill”
- Job exits with signal 9 (SIGKILL)
- dmesg log shows “oom-kill” entries
Solutions:
- Increase memory request in job specification
- Check `torc slurm sacct` output for actual memory usage (Max RSS)
- Consider splitting job into smaller chunks
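An OOM kill can be confirmed quickly from the captured dmesg log:
# Look for kernel OOM-killer entries recorded at job exit
grep -i 'oom-kill' output/dmesg_slurm_<slurm_job_id>_*.log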
Time Limit Exceeded
Symptoms:
- `torc slurm parse-logs` shows “DUE TO TIME LIMIT”
- Job state in sacct shows “TIMEOUT”
Solutions:
- Increase runtime in job specification
- Check if job is stuck (review stdout for progress)
- Consider optimizing the job or splitting into phases
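To see which jobs hit the limit across an entire output directory:
# List Slurm stderr files that report a time-limit kill
grep -l 'DUE TO TIME LIMIT' output/slurm_output_*.e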
Node Failures
Symptoms:
- `torc slurm parse-logs` shows “NODE_FAIL”
- Job may have completed partially
Solutions:
- Reinitialize workflow to retry failed jobs
- Check cluster status with `sinfo`
- Review dmesg logs for hardware issues
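For example, combine a log scan with a cluster health check:
# Find jobs affected by node failures, then list unhealthy nodes with reasons
grep -l 'NODE_FAIL' output/slurm_output_*.e
sinfo -R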
GPU/CUDA Errors
Symptoms:
- `torc slurm parse-logs` shows “CUDA out of memory” or “CUDA error”
Solutions:
- Reduce batch size or model size
- Check GPU memory with `nvidia-smi` in job script
- Ensure correct CUDA version is loaded
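A common pattern is to log GPU state at the top of the job script so memory pressure shows up in the Slurm stdout log; the training command below is only a placeholder:
# Record GPU memory before the workload starts
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
python train.py  # placeholder for the actual job command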
Related Commands
- `torc slurm parse-logs`: Parse Slurm logs for known error patterns
- `torc slurm sacct`: Collect Slurm accounting data for workflow jobs
- `torc reports results`: Generate debug report with all log file paths
- `torc results list`: View summary of job results in table format
- `torc-dash`: Launch web interface with Slurm log viewing
- `torc tui`: Launch terminal UI with Slurm log viewing
For general debugging tools and workflows, see Debugging Workflows.