Auto-Tune Resource Requirements
In cases where you have many jobs and are uncertain of their resource requirements, you can use torc’s auto-tune feature.
The general concept is this:

1. Create one resource requirements definition for each class of job; let's call them small, medium, and large. Make an initial guess at their requirements.
2. Run one job of each type.
3. Inspect the actual resource utilization.
4. Update the resource requirements definitions in the database.
5. Run all jobs.
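In specification-file terms, the steps below build up a file with roughly this shape. This skeleton is only a sketch for orientation; the concrete resource_requirements and jobs sections appear in the steps that follow, and the scheduler definitions are sketched after the jobs.

resource_requirements: [
  // one definition per job class: small, medium, large
],
jobs: [
  // each job names its resource_requirements and its scheduler
],
// plus Slurm scheduler definitions (see the sketch after the jobs below)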
Here is the step-by-step process.
1. Create the resource requirement definitions in the workflow specification file. Be conservative with the initial values: you don't want jobs to fail with walltime timeouts or out-of-memory errors. Runtimes use the ISO 8601 duration format (for example, P0DT30M means 30 minutes).
resource_requirements: [
  {
    name: "small",
    num_cpus: 1,
    num_gpus: 0,
    num_nodes: 1,
    memory: "10g",
    runtime: "P0DT30M"  // ISO 8601 duration: 30 minutes
  },
  {
    name: "medium",
    num_cpus: 4,
    num_gpus: 0,
    num_nodes: 1,
    memory: "20g",
    runtime: "P0DT1H"  // 1 hour
  },
  {
    name: "large",
    num_cpus: 8,
    num_gpus: 0,
    num_nodes: 1,
    memory: "40g",
    runtime: "P0DT6H"  // 6 hours
  },
],
2. Specify a resource_requirements name for each job. You may or may not need to create a different scheduler for each job type; this example assumes a different scheduler is required in order to account for the different walltimes (a sketch of such scheduler definitions follows the jobs below).
jobs: [
  {
    name: "work1",
    command: "python work.py 1",
    resource_requirements: "small",
    scheduler: "slurm_schedulers/small",
  },
  {
    name: "work2",
    command: "python work.py 2",
    resource_requirements: "small",
    scheduler: "slurm_schedulers/small",
  },
  {
    name: "work3",
    command: "python work.py 3",
    resource_requirements: "medium",
    scheduler: "slurm_schedulers/medium",
  },
  {
    name: "work4",
    command: "python work.py 4",
    resource_requirements: "medium",
    scheduler: "slurm_schedulers/medium",
  },
  {
    name: "work5",
    command: "python work.py 5",
    resource_requirements: "large",
    scheduler: "slurm_schedulers/large",
  },
  {
    name: "work6",
    command: "python work.py 6",
    resource_requirements: "large",
    scheduler: "slurm_schedulers/large",
  },
]
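The excerpt above references schedulers such as slurm_schedulers/small but omits their definitions. The following sketch shows what they might look like; the section name and field names are inferred from the torc hpc slurm list-configs table shown in a later step (name, account, nodes, walltime), so treat this as an assumption and consult the torc documentation for the exact schema.

// Hypothetical sketch: Slurm scheduler definitions matching the jobs above.
// Field names are inferred from the list-configs table shown below.
schedulers: {
  slurm_schedulers: [
    {
      name: "small",
      account: "my_account",  // your Slurm account/allocation
      nodes: 1,
      walltime: "00:30:00",   // matches the small runtime guess
    },
    {
      name: "medium",
      account: "my_account",
      nodes: 1,
      walltime: "01:00:00",
    },
    {
      name: "large",
      account: "my_account",
      nodes: 1,
      walltime: "06:00:00",
    },
  ],
},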
3. Start the workflow with the --auto-tune-resource-requirements option (abbreviated -a):
$ torc workflows start -a
4. Schedule one node for each resource requirements type. First, identify the scheduler keys:
$ torc hpc slurm list-configs
+-------------------------------------------------------------------------------------------------------------------------------------------+
| Slurm configurations in workflow 95612117 |
+-------+--------+------------+------+------+-------+-----------+--------+------+----------+----------+-------------------------------------+
| index | name | account | gres | mem | nodes | partition | qos | tmp | walltime | key | id |
+-------+--------+------------+------+------+-------+-----------+--------+------+----------+----------+-------------------------------------+
| 0 | small | my_account | None | None | 1 | None | normal | None | 00:30:00 | 95614387 | slurm_schedulers__95612117/95614387 |
| 1 | medium | my_account | None | None | 1 | None | normal | None | 01:00:00 | 95614398 | slurm_schedulers__95612117/95614398 |
| 2 | large | my_account | None | None | 1 | None | normal | None | 06:00:00 | 95614405 | slurm_schedulers__95612117/95614405 |
+-------+--------+------------+------+------+-------+-----------+--------+------+----------+----------+-------------------------------------+
$ torc hpc slurm schedule-nodes -n 1 -s 95614387
$ torc hpc slurm schedule-nodes -n 1 -s 95614398
$ torc hpc slurm schedule-nodes -n 1 -s 95614405
5. Wait for all jobs to finish.
6. Run this command to process the results and update the database:
$ torc workflows process-auto-tune-resource-requirements-results
2023-04-14 12:23:09,222 - INFO [torc.cli.workflows workflows.py:355] : Updated resource requirements. Look at current requirements with
'torc -k 95612117 -u http://localhost:8529/_db/test-workflows/torc-service resource-requirements list'
and at changes by reading the events with
'torc -k 95612117 -u http://localhost:8529/_db/test-workflows/torc-service events list -f category=resource_requirements'
7. Note the output above. You can use the suggested commands to view what torc changed. You can make more changes yourself if you'd like; refer to the torc hpc slurm modify-config command.
8. Schedule more nodes for each set of requirements. You will likely need many more nodes this time. Use the recommend-nodes command to help estimate the number of nodes:
$ torc hpc slurm recommend-nodes -s 95614387
$ torc hpc slurm recommend-nodes -s 95614398
$ torc hpc slurm recommend-nodes -s 95614405
Use the output above to choose values for X, Y, and Z below:
$ torc hpc slurm schedule-nodes -n X -s 95614387
$ torc hpc slurm schedule-nodes -n Y -s 95614398
$ torc hpc slurm schedule-nodes -n Z -s 95614405