Job Submission Strategies¶
This page provides examples of how to optimize JADE for different types of use cases.
Independent, short, multi-core jobs¶
Constraints:
Each job process consumes all cores of the compute node.
Each job takes no more than 30 minutes.
There are 1000 jobs.
There are no job dependencies.
Nodes on the short queue (4-hour max walltime) have a short acquisition time.
Nodes on the standard queue have a long acquisition time.
Strategy:
Define the walltime as 4 hours (for NREL/Eagle HPC) so that it will pick the short partition.
Limit the number of parallel processes on each node (the default is the number of CPUs).
Set the per-node batch size to the number of jobs that can complete within the max walltime.
Try to use as many nodes as possible in parallel.
Command:
$ jade submit-jobs --num-parallel-processes-per-node=1 --per-node-batch-size=8 config.json
JADE will submit 125 single-node jobs to the HPC.
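For reference, the HPC parameters for this scenario could look like the following sketch, using the same hpc_config.toml format shown later on this page (the account name and job prefix are placeholders):
hpc_type = "slurm"
job_prefix = "job"
[hpc]
account = "my_account"
walltime = "4:00:00"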
Independent, short, single-core jobs¶
Same as above except that each job consumes only one core.
Case 1¶
Constraints:
A compute node has 36 cores.
Strategy:
One compute node can complete 36 jobs every 30 minutes or 288 jobs in 4 hours.
4 compute nodes are needed.
Command:
$ jade submit-jobs --per-node-batch-size=250 config.json
JADE will submit 4 single-node jobs to the HPC.
Note
If you will have hundreds of thousands of jobs and hundreds of nodes, you may experience lock contention issues. If you aren't concerned with job ordering, consider setting --no-distributed-submitter.
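For example, the flag would be added to the submission command like this (a sketch, assuming it combines with the batching options shown above):
$ jade submit-jobs --per-node-batch-size=250 --no-distributed-submitter config.json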
Case 2¶
Same as case 1, but the acquisition time is long on all queues.
Strategy:
Acquire one node on the standard queue and then run all jobs on it.
Command:
$ jade submit-jobs --per-node-batch-size=1000 config.json
Independent, single-core jobs with variable runtimes¶
Some jobs complete in 10 minutes, some take 2 hours.
Constraints:
Each job process consumes one core of the compute node.
Some jobs take 10 minutes, some take 2 hours.
There are no job dependencies.
Nodes on the short queue (4-hour max walltime) have a short acquisition time.
Nodes on the standard queue have a long acquisition time.
Strategy:
Define estimated_run_minutes for each job.
Run jade submit-jobs with --time-based-batching and --num-parallel-processes-per-node=36.
Set the walltime value to 4 hours.
JADE will build variable-sized batches based on how many jobs can complete in 4 hours on each node.
Command:
$ jade submit-jobs --num-parallel-processes-per-node=36 --time-based-batching config.json
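For illustration, a job entry in config.json with estimated_run_minutes filled in might look like this sketch (same fields as the config example later on this page; the command and runtime value are placeholders):
{
  "command": "bash my_script.sh 1",
  "job_id": 1,
  "blocked_by": [],
  "extension": "generic_command",
  "append_output_dir": false,
  "cancel_on_blocking_job_failure": false,
  "estimated_run_minutes": 10,
  "ext": {}
}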
Jobs that require different submission parameters¶
Some jobs will take less than 4 hours, and so can run on the short queue. Other jobs take longer and so need to run on the standard queue.
Strategy:
Define two instances of a SubmissionGroup.
Set the submission group for each job appropriately.
A submission group allows you to define batch parameters like per-node-batch-size as well as HPC parameters. You can customize most of these parameters for each submission group.
Here's how to modify the existing config.json file:
1. Create default submission parameters with jade config submitter-params -c short-jobs.json.
2. Customize the file as necessary.
3. Add those parameters as a submission group with jade config add-submission-group short-jobs.json short_jobs config.json.
4. Repeat steps 1-3 to create a group called long_jobs (see the consolidated command sketch after this list).
5. Edit the submission_group field for each job in config.json to be one of the group names defined above.
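Putting the commands from steps 1, 3, and 4 together, the sequence might look like this sketch (customize each parameter file between commands; the long-jobs.json filename is illustrative):
$ jade config submitter-params -c short-jobs.json
$ jade config add-submission-group short-jobs.json short_jobs config.json
$ jade config submitter-params -c long-jobs.json
$ jade config add-submission-group long-jobs.json long_jobs config.json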
Here is an example of part of a config.json file:
{
  "jobs": [
    {
      "command": "bash my_script.sh 1",
      "job_id": 1,
      "blocked_by": [],
      "extension": "generic_command",
      "append_output_dir": false,
      "cancel_on_blocking_job_failure": false,
      "estimated_run_minutes": null,
      "ext": {},
      "submission_group": "short_jobs"
    },
    {
      "command": "bash my_script.sh 2",
      "job_id": 2,
      "blocked_by": [],
      "extension": "generic_command",
      "append_output_dir": false,
      "cancel_on_blocking_job_failure": false,
      "estimated_run_minutes": null,
      "ext": {},
      "submission_group": "long_jobs"
    }
  ],
  "submission_groups": [
    {
      "name": "short_jobs",
      "submitter_params": {
        "hpc_config": {
          "hpc_type": "slurm",
          "job_prefix": "job",
          "hpc": {
            "account": "my_account",
            "walltime": "4:00:00"
          }
        },
        "per_node_batch_size": 500,
        "try_add_blocked_jobs": true,
        "time_based_batching": false
      }
    },
    {
      "name": "long_jobs",
      "submitter_params": {
        "hpc_config": {
          "hpc_type": "slurm",
          "job_prefix": "job",
          "hpc": {
            "account": "my_account",
            "walltime": "24:00:00"
          }
        },
        "per_node_batch_size": 500,
        "try_add_blocked_jobs": true,
        "time_based_batching": false
      }
    }
  ]
}
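Once the groups and per-job assignments are in place, a single submission applies each group's parameters (a sketch; no batching flags are needed on the command line because they are defined in the submission groups):
$ jade submit-jobs config.json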
Refer to Submission Group Behaviors for additional information.
Jobs that require multiple nodes¶
Note
This is an experimental feature. Please let us know your feedback.
Constraints:
A job needs 5 nodes.
One node should become a manager that starts worker processes on all nodes.
You have a script/program that can use all nodes.
Strategy:
Use JADE’s multi-node manager to run your script.
Set nodes = 5 in the hpc_config.toml file.
Set use_multi_node_manager = true for the job in config.json.
The HPC will start JADE's manager script. JADE will assign the manager role to the first node in the HPC node list. It will invoke your script, passing the runtime output directory and all node hostnames through environment variables.
Your script uses all nodes to complete your work.
Warning
Be careful if you add more jobs to the config, such as for post-processing. Put them in a different submission group if they are single-node jobs.
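For example, a hypothetical single-node post-processing job could be assigned to its own submission group (a sketch; the command and group name are illustrative, and the group itself would be defined as shown in the previous section):
{
  "command": "bash postprocess.sh",
  "job_id": 2,
  "blocked_by": [],
  "extension": "generic_command",
  "append_output_dir": true,
  "cancel_on_blocking_job_failure": false,
  "estimated_run_minutes": null,
  "ext": {},
  "submission_group": "single_node_jobs"
}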
Here is an example using a Julia script that uses the Distributed module to perform work on multiple nodes.
Contents of a script called run_jobs.jl:
using Distributed
using Random

function run_jobs(output_dir, hostnames)
    # Start one worker process on each compute node.
    machines = [(host, 1) for host in hostnames]
    addprocs(machines)
    @everywhere println("hello from $(gethostname())")
    # Spawn one task on each worker process and collect the results.
    results = [@spawnat w rand(10) for w in workers()]
    for (i, result) in enumerate(results)
        res = maximum(fetch(result))
        println("Largest value from $(hostnames[i]) = $res")
    end
end

output = ENV["JADE_OUTPUT_DIR"]
hostnames = split(ENV["JADE_COMPUTE_NODE_NAMES"], " ")
isempty(hostnames) && error("no compute node names were set in JADE_COMPUTE_NODE_NAMES")
run_jobs(output, hostnames)
JADE job definition:
{
  "command": "julia run_jobs.jl arg1 arg2",
  "job_id": 1,
  "blocked_by": [],
  "extension": "generic_command",
  "append_output_dir": true,
  "cancel_on_blocking_job_failure": false,
  "estimated_run_minutes": null,
  "use_multi_node_manager": true
}
HPC parameters:
hpc_type = "slurm"
job_prefix = "job"
[hpc]
account = "my_account"
walltime = "4:00:00"
nodes = 5
JADE will set these environment variables:
JADE_OUTPUT_DIR: the output directory passed to jade submit-jobs
JADE_COMPUTE_NODE_NAMES: all compute node names allocated by the HPC
JADE will run the user command on the manager node when the HPC allocates the nodes.
$ julia run_jobs.jl arg1 arg2