Auto-Tune Resource Requirements

If you have many jobs and are uncertain of their resource requirements, you can use torc’s auto-tune feature.

The general concept is this:

  • Create one resource requirements definition for each class of job; let’s call them small, medium, and large. Make a guess about their requirements.

  • Run one job of each type.

  • Inspect the actual resource utilization.

  • Update the resource requirements definitions in the database.

  • Run all jobs.

Here is the step-by-step process.

  1. Create the resource requirement definitions in the workflow specification file. Be conservative with initial values. You don’t want the jobs to fail with walltime timeouts or out-of-memory errors.

resource_requirements: [
    {
      name: "small",
      num_cpus: 1,
      num_gpus: 0,
      num_nodes: 1,
      memory: "10g",
      runtime: "P0DT30M"
    },
    {
      name: "medium",
      num_cpus: 4,
      num_gpus: 0,
      num_nodes: 1,
      memory: "20g",
      runtime: "P0DT1H"
    },
    {
      name: "medium",
      num_cpus: 8,
      num_gpus: 0,
      num_nodes: 1,
      memory: "40g",
      runtime: "P0DT6H"
    },
],
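
The runtime values are ISO 8601 duration strings. A few examples for reference (the comments are just annotations):

runtime: "P0DT30M"  // 30 minutes
runtime: "P0DT1H"   // 1 hour
runtime: "P1DT12H"  // 1 day and 12 hours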
  2. Specify a resource_requirements name for each job. You may or may not need to create a different scheduler for each job type; this example assumes different schedulers are required to account for the different walltimes.

jobs: [
  {
    name: "work1",
    command: "python work.py 1",
    resource_requirements: "small",
    scheduler: "slurm_schedulers/small",
  },
  {
    name: "work2",
    command: "python work.py 2",
    resource_requirements: "small",
    scheduler: "slurm_schedulers/small",
  },
  {
    name: "work3",
    command: "python work.py 3",
    resource_requirements: "medium",
    scheduler: "slurm_schedulers/medium",
  },
  {
    name: "work4",
    command: "python work.py 4",
    resource_requirements: "medium",
    scheduler: "slurm_schedulers/medium",
  },
  {
    name: "work5",
    command: "python work.py 5",
    resource_requirements: "large",
    scheduler: "slurm_schedulers/large",
  },
  {
    name: "work6",
    command: "python work.py 6",
    resource_requirements: "large",
    scheduler: "slurm_schedulers/large",
  },
]
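
The commands above assume a script named work.py that performs the actual work. A minimal placeholder, purely for illustration (substitute your real workload):

# work.py - hypothetical stand-in for the real workload.
# Each torc job invokes it with a job index, e.g. "python work.py 3".
import sys
import time


def main() -> None:
    job_index = int(sys.argv[1])
    # Placeholder computation; replace with your application code.
    time.sleep(job_index)
    print(f"job {job_index} finished")


if __name__ == "__main__":
    main()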
  3. Start the workflow with the --auto-tune-resource-requirements option (-a for short).

$ torc workflows start -a
  4. Schedule one node for each resource requirements type. First, identify the scheduler keys.

$ torc hpc slurm list-configs

+-------------------------------------------------------------------------------------------------------------------------------------------+
|                                                 Slurm configurations in workflow 95612117                                                 |
+-------+--------+------------+------+------+-------+-----------+--------+------+----------+----------+-------------------------------------+
| index |  name  |  account   | gres | mem  | nodes | partition |  qos   | tmp  | walltime |   key    |                  id                 |
+-------+--------+------------+------+------+-------+-----------+--------+------+----------+----------+-------------------------------------+
|   0   | small  | my_account | None | None |   1   |    None   | normal | None | 00:30:00 | 95614387 | slurm_schedulers__95612117/95614387 |
|   1   | medium | my_account | None | None |   1   |    None   | normal | None | 01:00:00 | 95614398 | slurm_schedulers__95612117/95614398 |
|   2   | large  | my_account | None | None |   1   |    None   | normal | None | 06:00:00 | 95614405 | slurm_schedulers__95612117/95614405 |
+-------+--------+------------+------+------+-------+-----------+--------+------+----------+----------+-------------------------------------+
$ torc hpc slurm schedule-nodes -n 1 -s 95614387
$ torc hpc slurm schedule-nodes -n 1 -s 95614398
$ torc hpc slurm schedule-nodes -n 1 -s 95614405
  5. Wait for all jobs to finish.
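
While you wait, you can check job status from another terminal. The subcommand shown here is an assumption modeled on the list commands that appear elsewhere on this page; run torc --help to confirm the name in your version.

$ torc jobs list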

  6. Run this command to process the results and update the database.

$ torc workflows process-auto-tune-resource-requirements-results

2023-04-14 12:23:09,222 - INFO [torc.cli.workflows workflows.py:355] : Updated resource requirements. Look at current requirements with
  'torc -k 95612117 -u http://localhost:8529/_db/test-workflows/torc-service resource-requirements list'
 and at changes by reading the events with
  'torc -k 95612117 -u http://localhost:8529/_db/test-workflows/torc-service events list -f category=resource_requirements'
  7. Note the output above. You can use the suggested commands to view what torc changed, and you can make further changes if you’d like; refer to the command torc hpc slurm modify-config.
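
For example, to raise the medium scheduler’s walltime, an invocation might look like the following. The flag names here are assumptions, not confirmed options; run torc hpc slurm modify-config --help to see what your version accepts.

$ torc hpc slurm modify-config -s 95614398 --walltime 02:00:00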

  8. Schedule more nodes for each set of requirements. You will likely need many more nodes this time. Use the recommend-nodes command to help estimate how many.

$ torc hpc slurm recommend-nodes -s 95614387
$ torc hpc slurm recommend-nodes -s 95614398
$ torc hpc slurm recommend-nodes -s 95614405

Use the output above to assign numbers for X, Y, and Z below.

$ torc hpc slurm schedule-nodes -n X -s 95614387
$ torc hpc slurm schedule-nodes -n Y -s 95614398
$ torc hpc slurm schedule-nodes -n Z -s 95614405