Jobs

Input Parameters

Users can define these parameters in the workflow specification file or through CLI/API commands (a sketch of a job entry appears after this list):

  • command: String invoked by torc to run a job. It is typically an executable with a unique set of arguments and options. The executable can look up other input parameters from the database. Refer to Job Input/Output Data for a discussion of how to store input and output data for jobs.

  • cancel_on_blocking_job_failure: If this is set to true and a job that this job depends on fails, torc will cancel this job.

  • scheduler: This optional parameter lets you ensure that your desired compute-node-to-job assignments are achieved.

    The torc worker app pulls ready jobs after first sorting them in descending order by these attributes: GPUs, runtime, memory. This ensures that long jobs start first and usually prevents big-memory nodes from picking up small-memory jobs. You may, however, prefer a different priority, such as sorting by memory before runtime. Setting scheduler ensures that a compute node will pull only the jobs you want it to get. Refer to prepare_jobs_sort_method in Advanced Configuration Options for additional customization options.

  • schedule_compute_nodes: Set this to a ComputeNodeScheduleParams object to tell torc to schedule new compute nodes when this job reaches a status of ready.

  • supports_termination: Should be set to true if the job handles the signal SIGTERM. Refer to Gracefully shutdown jobs.
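
As a sketch only (the name value and the command shown are placeholders, not documented torc values), a job entry in the workflow specification file that combines several of these parameters might look like this:

{
  // Placeholder name and command, shown for illustration.
  name: "preprocess",
  command: "python preprocess.py --input-dir ./data",
  // Cancel this job if any job it depends on fails.
  cancel_on_blocking_job_failure: true,
  // The script handles SIGTERM, so torc can shut it down gracefully.
  supports_termination: true,
}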

Torc Parameters

  • key: Unique identifier of the job in the database. By default, generated by the database. Users can define their own keys, but this is not recommended in most situations.

  • status: Current status of the job in a workflow. Refer to Job Statuses.

Resource Requirements

You can store definitions of job resource requirements in the database and then associate them with jobs. This is critical because it informs torc which jobs can run in parallel on a single compute node.

The recommended way of defining these relationships is through the workflow specification (JSON5) file. One set of resource requirements looks like this:

{
  name: "large",
  num_cpus: 36,
  num_gpus: 0,
  num_nodes: 1,
  memory: "80g",
  runtime: "P0DT12H"
}

This says that any job assigned these requirements will consume 36 CPUs and 80 GB of memory on one node and will run for 12 hours (runtime uses the ISO 8601 duration format).

You assign these requirements to one or more jobs through the resource_requirements field of the job specification: resource_requirements: "large".
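
A minimal sketch of how the two pieces fit together, assuming top-level resource_requirements and jobs lists in the specification file (the list names and the job's command are illustrative assumptions):

{
  resource_requirements: [
    {
      name: "large",
      num_cpus: 36,
      num_gpus: 0,
      num_nodes: 1,
      memory: "80g",
      runtime: "P0DT12H",
    },
  ],
  jobs: [
    {
      // Placeholder command; this job consumes the "large" requirements above.
      command: "python train.py",
      resource_requirements: "large",
    },
  ],
}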

Job Input/Output Data

Torc provides a mechanism for users to store input and output data in the database. This data can be stored on a per-job basis or for the overall workflow. Torc uses the user_data collection for this.

You can store job-to-job relationships if one job stores data that will be consumed by one or more other jobs. This is analogous to the job-file-job relationships discussed elsewhere. In both cases torc will sequence execution of jobs based on these dependencies.

One way to run jobs with different parameters is to pass those parameters as command-line arguments and options. A second way is to store the input parameters in the user_data collection of the database. A common runner script can pull the parameters for each specific job at runtime.

Note

Torc sets the environment variables TORC_WORKFLOW_KEY and TORC_JOB_KEY. Scripts can use these values to retrieve data from the database.
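
For example, a common runner script could use these environment variables together with the CLI commands shown below to pull its parameters at runtime. A minimal sketch, assuming a bash runner and a placeholder work_script.py:

#!/bin/bash
# Sketch of a common runner script: pull this job's input parameters from the
# database with the torc CLI and pass them to the real work script.
# work_script.py is a placeholder; adapt the parsing to your data layout.
params=$(torc jobs list-user-data "$TORC_JOB_KEY")
echo "Running job $TORC_JOB_KEY in workflow $TORC_WORKFLOW_KEY"
python work_script.py --params "$params"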

Jobs can also store result data and metadata in the database.

Warning

The database is not currently designed to store large result data. You can store small result data or pointers to where the actual data resides.

Here is how to store and retrieve user data from torc CLI commands. Refer to Passing data between jobs for an example using the workflow specification file and API commands.

Torc CLI

Add data to the database.

$ torc user-data add -n my_val -s 92181820 -d "{key1: 'val1', key2: 'val2'}"
2023-03-29 09:45:59,678 - INFO [torc.cli.user_data user_data.py:41] : Added user_data key=92398595
$ torc jobs list-user-data 92181820
[
  {
    "_key": "92340362",
    "_rev": "_fw4IkZ----",
    "key3": "val3"
  },
  {
    "_key": "92340378",
    "_rev": "_fw4IkX----",
    "key1": "val1",
    "key2": "val2"
  }
]
$ torc user-data list
[
  {
    "_key": "92398595",
    "_rev": "_fw4IkX----",
    "key1": "val1",
    "key2": "val2"
  },
]

$ torc user-data get 92398595
{
  '_key': '92398595',
  '_rev': '_fw2IcgK---',
  'key1': 'val1',
  'key2': 'val2'
}

$ torc user-data delete 92398595 92398602
2023-03-29 09:47:56,772 - INFO [torc.cli.user_data user_data.py:54] : Deleted user_data=92398595
2023-03-29 09:47:56,799 - INFO [torc.cli.user_data user_data.py:54] : Deleted user_data=92398602

Add a placeholder item to the database. The actual data will be populated in the database by job 92340392 and then consumed by job 92340393. Torc will ensure that 92340393 cannot run until 92340392 completes.

$ torc user-data add --name output_data1 --stores 92340392 --consumes 92340393

Ephemeral data

The user_data collection offers an optional field to control ephemeral data. This is useful when you want a job to run again on every workflow restart because it creates a resource needed by other jobs. When a workflow is restarted, torc clears the data field of every user_data document whose is_ephemeral flag is true (it defaults to false).

One example of how this can be used is an Apache Spark cluster needed by a job. Suppose the cluster does not exist beforehand and needs to be created by the workflow. One way to accomplish this is to add a job that creates the cluster, create a user_data document as a placeholder for the cluster URL, declare that the cluster-creation job will store the data, and declare that the work job will consume the data. Torc will sequence the jobs so that the cluster-creation job runs first and uploads the URL; then, when the work job runs, it reads the URL and connects to the cluster.
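
As a sketch of the user_data document involved (the field names come from this page; how such an entry and its stores/consumes relationships are declared in the workflow specification file is covered in Passing data between jobs):

{
  name: "spark_cluster_url",
  // Cleared on workflow restart so the cluster-creation job runs again.
  is_ephemeral: true,
  // Populated at runtime by the cluster-creation job with the cluster URL.
  data: null,
}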