.. _submission_strategies:

*************************
Job Submission Strategies
*************************

This page provides examples of how to optimize JADE for different types of use cases.

Independent, short, multi-core jobs
===================================

**Constraints**:

- Each job process consumes all cores of the compute node.
- Each job takes no more than 30 minutes.
- There are 1000 jobs.
- There are no job dependencies.
- Nodes on the short queue (4-hour max walltime) have a short acquisition time.
- Nodes on the standard queue have a long acquisition time.

**Strategy**:

- Define the walltime as 4 hours (for the NREL Eagle HPC) so that the ``short`` partition
  will be selected.
- Limit the number of parallel processes on each node to 1 (the default is the number of CPUs).
- Set the per-node batch size to the number of jobs that can complete within the max walltime
  (see the calculation below).
- Try to use as many nodes as possible in parallel.

**Command**:

.. code-block:: bash

    $ jade submit-jobs --num-parallel-processes-per-node=1 --per-node-batch-size=8 config.json

JADE will submit 125 single-node jobs to the HPC.
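
The batch size and node count above follow directly from the constraints. Here is a minimal
sketch of the arithmetic; the variable names are illustrative only and are not JADE parameters.

.. code-block:: python

    import math

    # Values taken from the constraints above.
    walltime_minutes = 4 * 60   # short-queue walltime
    job_minutes = 30            # worst-case runtime of one job
    processes_per_node = 1      # each job consumes the whole node
    num_jobs = 1000

    # Jobs one node can finish within the walltime --> --per-node-batch-size
    per_node_batch_size = (walltime_minutes // job_minutes) * processes_per_node  # 8

    # Single-node HPC jobs that JADE must submit to cover all jobs.
    num_hpc_jobs = math.ceil(num_jobs / per_node_batch_size)  # 125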

Independent, short, single-core jobs
====================================

Same as above except that each job consumes only one core.

Case 1
------

**Constraints**:

- A compute node has 36 cores.

**Strategy**:

- One compute node can complete 36 jobs every 30 minutes, or 288 jobs in 4 hours.
- 4 compute nodes are needed.

**Command**:

.. code-block:: bash

    $ jade submit-jobs --per-node-batch-size=250 config.json

JADE will submit 4 single-node jobs to the HPC.

.. note:: If you will have hundreds of thousands of jobs and hundreds of nodes, you may
   experience lock contention issues. If you aren't concerned with job ordering, consider
   setting ``--no-distributed-submitter``.

Case 2
------

Same as Case 1, but the acquisition time is long on all queues.

**Strategy**:

- Acquire one node on the standard queue and then run all jobs on it.

**Command**:

.. code-block:: bash

    $ jade submit-jobs --per-node-batch-size=1000 config.json

Independent, single-core jobs with variable runtimes
====================================================

Some jobs complete in 10 minutes, some take 2 hours.

Case 1
------

**Constraints**:

- A compute node has 36 cores.

**Strategy**:

- Assuming an average runtime of about 30 minutes, one compute node can complete roughly
  36 jobs every 30 minutes, or 288 jobs in 4 hours.
- 4 compute nodes are needed.

**Command**:

.. code-block:: bash

    $ jade submit-jobs --per-node-batch-size=250 config.json

JADE will submit 4 single-node jobs to the HPC.

Case 2
------

**Constraints**:

- Each job process consumes one core of the compute node.
- Some jobs take 10 minutes, some take 2 hours.
- There are no job dependencies.
- Nodes on the short queue (4-hour max walltime) have a short acquisition time.
- Nodes on the standard queue have a long acquisition time.

**Strategy**:

- Define ``estimated_run_minutes`` for each job (see the sketch at the end of this section).
- Run ``jade submit-jobs`` with ``--time-based-batching`` and ``--num-parallel-processes-per-node=36``.
- Set the walltime value to 4 hours.
- JADE will build variable-sized batches based on how many jobs can complete in 4 hours on each node.

**Command**:

.. code-block:: bash

    $ jade submit-jobs --num-parallel-processes-per-node=36 --time-based-batching config.json
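
For example, you could populate ``estimated_run_minutes`` by editing ``config.json`` directly.
The following is a minimal sketch that assumes the ``generic_command`` job layout shown in the
example in the next section; ``estimate_minutes`` is a placeholder for however you estimate each
job's runtime.

.. code-block:: python

    import json

    def estimate_minutes(command: str) -> float:
        """Placeholder: return an estimated runtime in minutes for this command."""
        return 30.0

    with open("config.json") as f:
        config = json.load(f)

    # Give every job an estimate so --time-based-batching can size the batches.
    for job in config["jobs"]:
        job["estimated_run_minutes"] = estimate_minutes(job["command"])

    with open("config.json", "w") as f:
        json.dump(config, f, indent=2)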

.. _submission_group_strategy:

Jobs that require different submission parameters
==================================================

Some jobs will take less than 4 hours and so can run on the short queue. Other jobs take longer
and need to run on the standard queue.

**Strategy**:

- Define two instances of a :ref:`model_submission_group`.
- Set the submission group for each job appropriately.

A submission group allows you to define batch parameters like ``per-node-batch-size`` as well as
HPC parameters. You can customize most of these parameters for each submission group.

Here's how to modify the existing ``config.json`` file:

1. Create default submission parameters with ``jade config submitter-params -c short-jobs.json``.
2. Customize the file as necessary.
3. Add those parameters as a submission group with
   ``jade config add-submission-group short-jobs.json short_jobs config.json``.
4. Repeat steps 1-3 to create a group called ``long_jobs``.
5. Edit the ``submission_group`` field for each job in ``config.json`` to be one of the group
   names defined above.

Here is an example of part of a ``config.json`` file:

.. code-block:: json

    {
      "jobs": [
        {
          "command": "bash my_script.sh 1",
          "job_id": 1,
          "blocked_by": [],
          "extension": "generic_command",
          "append_output_dir": false,
          "cancel_on_blocking_job_failure": false,
          "estimated_run_minutes": null,
          "ext": {},
          "submission_group": "short_jobs"
        },
        {
          "command": "bash my_script.sh 2",
          "job_id": 2,
          "blocked_by": [],
          "extension": "generic_command",
          "append_output_dir": false,
          "cancel_on_blocking_job_failure": false,
          "estimated_run_minutes": null,
          "ext": {},
          "submission_group": "long_jobs"
        }
      ],
      "submission_groups": [
        {
          "name": "short_jobs",
          "submitter_params": {
            "hpc_config": {
              "hpc_type": "slurm",
              "job_prefix": "job",
              "hpc": {
                "account": "my_account",
                "walltime": "4:00:00"
              }
            },
            "per_node_batch_size": 500,
            "try_add_blocked_jobs": true,
            "time_based_batching": false
          }
        },
        {
          "name": "long_jobs",
          "submitter_params": {
            "hpc_config": {
              "hpc_type": "slurm",
              "job_prefix": "job",
              "hpc": {
                "account": "my_account",
                "walltime": "24:00:00"
              }
            },
            "per_node_batch_size": 500,
            "try_add_blocked_jobs": true,
            "time_based_batching": false
          }
        }
      ]
    }

Refer to :ref:`submission_group_behaviors` for additional information.

.. _multi_node_job_strategy:

Jobs that require multiple nodes
================================

.. note:: This is an experimental feature. Please let us know your feedback.

**Constraints**:

- A job needs 5 nodes.
- One node should become a manager that starts worker processes on all nodes.
- You have a script/program that can use all nodes.

**Strategy**: Use JADE's multi-node manager to run your script.

- Set ``nodes = 5`` in the ``hpc_config.toml`` file.
- Set ``use_multi_node_manager = true`` for the job in ``config.json``.
- The HPC will start JADE's manager script. JADE will assign the ``manager`` role to the first
  node in the HPC node list. It will invoke your script, passing the runtime output directory
  and all node hostnames through environment variables.
- Your script uses all nodes to complete your work.

.. warning:: Be careful if you add more jobs to the config, such as for post-processing.
   Put them in a different submission group if they are single-node jobs.

Here is an example using a ``Julia`` script that uses the ``Distributed`` module to perform work
on multiple nodes.

Contents of a script called ``run_jobs.jl``:

.. code-block:: julia

    using Distributed

    function run_jobs(output_dir, hostnames)
        # Start one worker process on each compute node.
        machines = [(host, 1) for host in hostnames]
        addprocs(machines)
        @everywhere println("hello from $(gethostname())")
        # Spawn work on each newly added worker process.
        results = [@spawnat w rand(10) for w in workers()]
        for (i, result) in enumerate(results)
            res = maximum(fetch(result))
            println("Largest value from $(hostnames[i]) = $res")
        end
    end

    output = ENV["JADE_OUTPUT_DIR"]
    node_names = split(ENV["JADE_COMPUTE_NODE_NAMES"], " ")
    isempty(node_names) && error("no compute node names were set in JADE_COMPUTE_NODE_NAMES")
    run_jobs(output, node_names)

**JADE job definition**:

.. code-block:: json

    {
      "command": "julia run_jobs.jl arg1 arg2",
      "job_id": 1,
      "blocked_by": [],
      "extension": "generic_command",
      "append_output_dir": true,
      "cancel_on_blocking_job_failure": false,
      "estimated_run_minutes": null,
      "use_multi_node_manager": true
    }

**HPC parameters** (``hpc_config.toml``):

.. code-block:: toml

    hpc_type = "slurm"
    job_prefix = "job"

    [hpc]
    account = "my_account"
    walltime = "4:00:00"
    nodes = 5

JADE will set these environment variables:

- ``JADE_OUTPUT_DIR``: output directory passed to ``jade submit-jobs``
- ``JADE_COMPUTE_NODE_NAMES``: all compute node names allocated by the HPC

JADE will run the user command on the manager node when the HPC allocates the nodes:

.. code-block:: bash

    $ julia run_jobs.jl arg1 arg2
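
If your multi-node entry point is written in Python rather than Julia, the same environment
variables can be consumed in the same way. This is a minimal sketch under the assumptions shown
above (the documented variable names and space-separated node names); how you distribute work
across the nodes is up to your own tooling.

.. code-block:: python

    import os
    import sys

    # Read the variables that JADE sets before running the manager-node command.
    output_dir = os.environ["JADE_OUTPUT_DIR"]
    node_names = os.environ["JADE_COMPUTE_NODE_NAMES"].split()
    if not node_names:
        sys.exit("no compute node names were set in JADE_COMPUTE_NODE_NAMES")

    print(f"output directory: {output_dir}")
    print(f"compute nodes: {node_names}")
    # From here, fan work out to the nodes with whatever tool you use (MPI, ssh, etc.).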