Workflow Actions
Workflow actions enable automatic execution of commands and resource allocation in response to workflow lifecycle events. Actions provide hooks for setup, teardown, monitoring, and dynamic resource management throughout workflow execution.
Overview
Actions consist of three components:
- Trigger - The condition that activates the action
- Action Type - The operation to perform
- Configuration - Parameters specific to the action
actions:
  - trigger_type: "on_workflow_start"
    action_type: "run_commands"
    commands:
      - "mkdir -p output logs"
      - "echo 'Workflow started' > logs/status.txt"
Trigger Types
Workflow Lifecycle Triggers
on_workflow_start
Executes once when the workflow is initialized.
When it fires: During initialize_jobs after jobs are transitioned from uninitialized to ready/blocked states.
Typical use cases:
- Scheduling Slurm allocations
- Creating directory structures
- Copying initial data
- trigger_type: "on_workflow_start"
  action_type: "run_commands"
  commands:
    - "mkdir -p output checkpoints temp"
    # Double quotes inside the command so the shell expands $(date)
    - "echo \"Workflow started at $(date)\" > workflow.log"
on_workflow_complete
Executes once when all jobs reach terminal states (completed, failed, or canceled).
When it fires: After the last job completes, as detected by the job runner.
Typical use cases:
- Archiving final results
- Uploading to remote storage
- Cleanup of temporary files
- Generating summary reports
- trigger_type: "on_workflow_complete"
  action_type: "run_commands"
  commands:
    - "tar -czf results.tar.gz output/"
    - "aws s3 cp results.tar.gz s3://bucket/results/"
    - "rm -rf temp/"
Job-Based Triggers
on_jobs_ready
Executes when all specified jobs transition to the “ready” state.
When it fires: When the last specified job becomes ready to execute (all dependencies satisfied).
Typical use cases:
- Scheduling Slurm allocations
- Starting phase-specific monitoring
- Pre-computation setup
- Notifications before expensive operations
- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  jobs: ["train_model_001", "train_model_002", "train_model_003"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2
Important: The action triggers only when all matching jobs are ready, not individually as each job becomes ready.
on_jobs_complete
Executes when all specified jobs reach terminal states (completed, failed, or canceled).
When it fires: When the last specified job finishes execution.
Typical use cases:
- Scheduling Slurm allocations
- Cleaning up intermediate files
- Archiving phase results
- Freeing storage space
- Phase-specific reporting
- trigger_type: "on_jobs_complete"
  action_type: "run_commands"
  jobs: ["preprocess_1", "preprocess_2", "preprocess_3"]
  commands:
    - "echo 'Preprocessing phase complete' >> workflow.log"
    - "rm -rf raw_data/"
Worker Lifecycle Triggers
Worker lifecycle triggers are persistent by default, meaning they execute once per worker (job runner), not once per workflow.
on_worker_start
Executes when each worker (job runner) starts.
When it fires: After a job runner starts and checks for workflow actions, before claiming any jobs.
Typical use cases:
- Worker-specific initialization
- Setting up worker-local logging
- Copying data to compute node local storage
- Initializing worker-specific resources
- Recording worker startup metrics
- trigger_type: "on_worker_start"
  action_type: "run_commands"
  persistent: true  # Each worker executes this
  commands:
    # Double quotes inside the command so the shell expands $(hostname) and $(date)
    - "echo \"Worker started on $(hostname) at $(date)\" >> worker.log"
    - "mkdir -p worker_temp"
on_worker_complete
Executes when each worker completes (exits the main loop).
When it fires: After a worker finishes processing jobs and before it shuts down.
Typical use cases:
- Worker-specific cleanup
- Uploading worker-specific logs
- Recording worker completion metrics
- Cleaning up worker-local resources
- trigger_type: "on_worker_complete"
  action_type: "run_commands"
  persistent: true  # Each worker executes this
  commands:
    # Double quotes inside the command so the shell expands $(hostname) and $(date)
    - "echo \"Worker completed on $(hostname) at $(date)\" >> worker.log"
    - "rm -rf worker_temp"
Job Selection
For on_jobs_ready and on_jobs_complete triggers, specify which jobs to monitor.
Exact Job Names
- trigger_type: "on_jobs_complete"
  action_type: "run_commands"
  jobs: ["job1", "job2", "job3"]
  commands:
    - "echo 'Specific jobs complete'"
Regular Expressions
- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  job_name_regexes: ["train_model_[0-9]+", "eval_.*"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2
Common regex patterns:
- "train_.*" - All jobs starting with “train_”
- "model_[0-9]+" - Jobs like “model_1”, “model_2”
- ".*_stage1" - All jobs ending with “_stage1”
- "job_(a|b|c)" - Jobs “job_a”, “job_b”, or “job_c”
Combining Selection Methods
You can use exact names and regular expressions in the same action. The action triggers only when every job matched by either method meets the condition:
jobs: ["critical_job"]
job_name_regexes: ["batch_.*"]
# Triggers when "critical_job" AND all "batch_*" jobs are ready/complete
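The selection rule above can be illustrated with a small Python sketch. This is not the tool's implementation: `select_jobs` and `all_in_state` are hypothetical helper names, and whole-name matching (`re.fullmatch`) is an assumption, since the source does not say whether patterns are anchored.

```python
import re


def select_jobs(all_job_names, jobs=(), job_name_regexes=()):
    """Return the job names an action would watch: exact names plus regex matches.

    Assumption: a pattern must match the whole job name (re.fullmatch).
    """
    patterns = [re.compile(p) for p in job_name_regexes]
    return [
        name
        for name in all_job_names
        if name in jobs or any(p.fullmatch(name) for p in patterns)
    ]


def all_in_state(job_states, selected, state):
    """True only when every selected job is in the given state."""
    return all(job_states[name] == state for name in selected)


workflow_jobs = ["critical_job", "batch_1", "batch_2", "report"]
watched = select_jobs(
    workflow_jobs, jobs=["critical_job"], job_name_regexes=["batch_.*"]
)
print(watched)  # ['critical_job', 'batch_1', 'batch_2']

states = {"critical_job": "ready", "batch_1": "ready", "batch_2": "blocked", "report": "blocked"}
print(all_in_state(states, watched, "ready"))  # False: batch_2 is still blocked

states["batch_2"] = "ready"
print(all_in_state(states, watched, "ready"))  # True: every watched job is ready
```

The second check returns True even though "report" is still blocked, because that job is matched by neither selection method.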
Action Types
run_commands
Execute shell commands sequentially on a compute node.
Configuration:
- trigger_type: "on_workflow_complete"
  action_type: "run_commands"
  commands:
    - "tar -czf results.tar.gz output/"
    - "aws s3 cp results.tar.gz s3://bucket/"
Execution details:
- Commands run in the workflow’s output directory
- Commands execute sequentially (one after another)
- If a command fails, the action fails (but the workflow continues)
- Commands run on compute nodes, not the submission node
- Uses the shell environment of the job runner process
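As a rough illustration of the execution details above, the following Python sketch runs a list of shell commands one after another and reports failure if any command exits non-zero. It is a minimal model, not the job runner's code: the function name `run_commands` is reused only for readability, and stopping at the first failing command is an assumption (the source says only that the action fails).

```python
import subprocess


def run_commands(commands, cwd=None):
    """Run shell commands sequentially; return (ok, [(command, exit_code), ...]).

    Assumption: execution stops at the first non-zero exit code.
    """
    results = []
    for command in commands:
        completed = subprocess.run(
            command, shell=True, cwd=cwd, capture_output=True, text=True
        )
        results.append((command, completed.returncode))
        if completed.returncode != 0:
            return False, results  # the action fails; the caller carries on
    return True, results


ok, results = run_commands(["echo first", "exit 3", "echo never-runs"])
print(ok)       # False
print(results)  # [('echo first', 0), ('exit 3', 3)]
```

The caller decides what to do with a failed action; in the model described here the workflow itself keeps running.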
schedule_nodes
Dynamically allocate compute resources from a Slurm scheduler.
Configuration:
- trigger_type: "on_jobs_ready"
  action_type: "schedule_nodes"
  jobs: ["train_model_1", "train_model_2"]
  scheduler: "gpu_cluster"
  scheduler_type: "slurm"
  num_allocations: 2
  start_one_worker_per_node: true
  max_parallel_jobs: 8
Parameters:
- scheduler (required) - Name of Slurm scheduler configuration (must exist in slurm_schedulers)
- scheduler_type (required) - Must be “slurm”
- num_allocations (required) - Number of Slurm allocation requests to submit
- start_one_worker_per_node (optional) - Start one job runner per node (default: false)
- max_parallel_jobs (optional) - Maximum concurrent jobs per runner
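The parameter rules can be checked before submitting a workflow. The sketch below is a hypothetical pre-flight check written in Python, not part of the tool: `validate_schedule_nodes` is an invented name, the action is assumed to be already loaded into a dict, and the requirement that num_allocations be a positive integer is an assumption.

```python
REQUIRED_FIELDS = ("scheduler", "scheduler_type", "num_allocations")


def validate_schedule_nodes(action, slurm_schedulers):
    """Return a list of problems found in a schedule_nodes action dict."""
    problems = [
        f"missing required field: {key}"
        for key in REQUIRED_FIELDS
        if key not in action
    ]
    if "scheduler_type" in action and action["scheduler_type"] != "slurm":
        problems.append('scheduler_type must be "slurm"')
    scheduler = action.get("scheduler")
    if scheduler is not None and scheduler not in slurm_schedulers:
        problems.append(f"scheduler {scheduler!r} is not defined in slurm_schedulers")
    count = action.get("num_allocations")
    # Assumption: a useful request asks for at least one allocation.
    if count is not None and (not isinstance(count, int) or count < 1):
        problems.append("num_allocations must be a positive integer")
    return problems


action = {
    "trigger_type": "on_jobs_ready",
    "action_type": "schedule_nodes",
    "jobs": ["train_model_1", "train_model_2"],
    "scheduler": "gpu_cluster",
    "scheduler_type": "slurm",
    "num_allocations": 2,
}
print(validate_schedule_nodes(action, slurm_schedulers={"gpu_cluster"}))  # []
```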
Use cases:
- Just-in-time resource allocation
- Cost optimization (allocate only when needed)
- Separating workflow phases with different resource requirements
Complete Examples
Refer to this example
Execution Model
Action Claiming and Execution
- Atomic Claiming: Actions are claimed atomically by workers to prevent duplicate execution
- Non-Persistent Actions: Execute once per workflow (first worker to claim executes)
- Persistent Actions: Can be claimed and executed by multiple workers
- Trigger Counting: Job-based triggers increment a counter as jobs transition; the action becomes pending when the count reaches the threshold
- Immediate Availability: Worker lifecycle actions are immediately available after workflow initialization
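The difference between non-persistent and persistent claiming can be modeled in a few lines of Python. This is an in-process stand-in that uses a lock; how the real system achieves atomicity is not described in this document, and `ActionRegistry` and `try_claim` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ActionRegistry:
    """Toy model of atomic action claiming (assumed semantics, not the real API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed_by = {}  # action id -> set of worker ids that claimed it

    def try_claim(self, action_id, worker_id, persistent=False):
        # The lock makes check-and-record a single atomic step.
        with self._lock:
            claimers = self._claimed_by.setdefault(action_id, set())
            if persistent:
                if worker_id in claimers:
                    return False  # this worker already ran the action
            elif claimers:
                return False  # some worker already claimed it for the workflow
            claimers.add(worker_id)
            return True


registry = ActionRegistry()
workers = [f"worker-{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    once = list(pool.map(lambda w: registry.try_claim("setup", w), workers))
    per_worker = list(
        pool.map(lambda w: registry.try_claim("worker_init", w, persistent=True), workers)
    )

print(sum(once))        # 1: exactly one worker wins the non-persistent action
print(sum(per_worker))  # 8: every worker claims the persistent action once
```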
Action Lifecycle
[Workflow Created]
↓
[initialize_jobs called]
↓
├─→ on_workflow_start actions become pending
├─→ on_worker_start actions become pending (persistent)
├─→ on_worker_complete actions become pending (persistent)
└─→ on_jobs_ready actions wait for job transitions
↓
[Worker Claims and Executes Actions]
↓
[Jobs Execute]
↓
[Jobs Complete]
↓
├─→ on_jobs_complete actions become pending when all specified jobs complete
└─→ on_workflow_complete actions become pending when all jobs complete
↓
[Workers Exit]
↓
[on_worker_complete actions execute per worker]
Important Characteristics
- No Rollback: Failed actions don’t affect workflow execution
- Compute Node Execution: Actions run on compute nodes via job runners
- One-Time Triggers: Non-persistent actions trigger once when conditions are first met
- No Inter-Action Dependencies: Actions don’t depend on other actions
- Concurrent Workers: Multiple workers can execute different actions simultaneously
Workflow Reinitialization
When a workflow is reinitialized (e.g., after resetting failed jobs), actions are reset to allow them to trigger again:
- Executed flags are cleared: All actions can be claimed and executed again
- Trigger counts are recalculated: For on_jobs_ready and on_jobs_complete actions, the trigger count is set based on current job states
Example scenario:
- job1 and job2 are independent jobs
- postprocess_job depends on both job1 and job2
- An on_jobs_ready action triggers when postprocess_job becomes ready
After first run completes:
- job1 fails, job2 succeeds
- User resets failed jobs and reinitializes
- job2 is already Completed, so it counts toward the trigger count
- When job1 completes in the second run, postprocess_job becomes ready
- The action triggers again because the trigger count reaches the required threshold
This ensures actions properly re-trigger after workflow reinitialization, even when some jobs remain in their completed state.
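The recalculation step can be sketched in Python. To keep the model small it uses an on_jobs_complete action watching job1 and job2 rather than the on_jobs_ready scenario above. `JobsCompleteTrigger` and its methods are hypothetical names, and treating the threshold as the number of watched jobs is an assumption.

```python
TERMINAL_STATES = {"completed", "failed", "canceled"}


class JobsCompleteTrigger:
    """Toy model of trigger counting for an on_jobs_complete action."""

    def __init__(self, watched_jobs):
        self.watched_jobs = list(watched_jobs)
        self.trigger_count = 0
        self.executed = False

    @property
    def pending(self):
        # Assumption: the threshold is the number of watched jobs.
        return not self.executed and self.trigger_count >= len(self.watched_jobs)

    def on_job_terminal(self):
        """Record that one watched job reached a terminal state."""
        self.trigger_count += 1

    def reinitialize(self, job_states):
        """Clear the executed flag and recount from the current job states."""
        self.executed = False
        self.trigger_count = sum(
            1 for name in self.watched_jobs if job_states[name] in TERMINAL_STATES
        )


trigger = JobsCompleteTrigger(["job1", "job2"])

# First run: job1 fails, job2 succeeds, and the action executes.
trigger.on_job_terminal()
trigger.on_job_terminal()
trigger.executed = True

# The user resets the failed job; job2 stays completed.
trigger.reinitialize({"job1": "ready", "job2": "completed"})
print(trigger.trigger_count, trigger.pending)  # 1 False

# Second run: job1 finishes, so the count reaches the threshold again.
trigger.on_job_terminal()
print(trigger.trigger_count, trigger.pending)  # 2 True
```

Recounting from current states is what lets the already-completed job2 contribute without running again.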
Limitations
- No Action Dependencies: Actions cannot depend on other actions completing
- No Conditional Execution: Actions cannot have conditional logic (use multiple actions with different job selections instead)
- No Action Retries: Failed actions are not automatically retried
- Single Action Type: Each action has one action_type (cannot combine run_commands and schedule_nodes)
- No Dynamic Job Selection: Job names/patterns are fixed at action creation time
For complex workflows requiring these features, consider:
- Using job dependencies to order operations
- Creating separate jobs for conditional logic
- Implementing retry logic within command scripts
- Creating multiple actions for different scenarios
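Because failed actions are not retried automatically, retry logic has to live in the command itself. One way to do that is a small wrapper script that an action's command invokes. The Python sketch below is only an illustration: `run_with_retries` and its parameters are hypothetical, and the fixed delay between attempts is an arbitrary choice.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay_seconds=0.0):
    """Run a shell command until it succeeds; return the attempt number that worked.

    Raises RuntimeError if every attempt exits with a non-zero code.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command, shell=True)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"command failed after {attempts} attempts: {command}")


attempts_used = run_with_retries("exit 0")
print(attempts_used)  # 1
```

A command that never succeeds raises RuntimeError, which gives the calling script a non-zero exit and therefore marks the action as failed.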