jade.hpc.slurm_manager.SlurmManager

class jade.hpc.slurm_manager.SlurmManager(config)

Bases: HpcManagerInterface

Manages Slurm jobs.

Methods

am_i_manager()

Return True if the current node is the manager node.

cancel_job(job_id)

Cancel job.

check_status([name, job_id])

Check the status of a job.

check_statuses()

Check the statuses of all user jobs.

check_storage_configuration()

Checks if the storage configuration is appropriate for execution.

create_cluster()

Create a Dask cluster.

create_local_cluster()

Create a Dask local cluster.

create_submission_script(name, script, ...)

Create the script to queue the jobs to the HPC.

get_config()

Get HPC configuration parameters.

get_current_job_id()

Get the job ID for the local compute node.

get_job_stats(job_id)

Get stats for job ID.

get_local_scratch()

Get path to local storage space.

get_node_id()

Return the node ID of the current system.

get_num_cpus()

Return the number of CPUs in the system.

list_active_nodes(job_id)

Return the nodes currently participating in the job.

log_environment_variables()

Logs all relevant HPC environment variables.

submit(filename)

Submit the work to the HPC queue.

Attributes

USER

am_i_manager()

Return True if the current node is the manager node.

Return type:

bool

cancel_job(job_id)

Cancel the job with the given ID.

Parameters:

job_id (str)

Returns:

return code

Return type:

int

check_status(name=None, job_id=None)

Check the status of a job. Either name or job_id must be passed. Handles transient errors for up to one minute.

Parameters:
  • name (str) – job name

  • job_id (str) – job ID

Return type:

HpcJobInfo

Raises:

ExecutionError – Raised if statuses cannot be retrieved.
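The "handles transient errors for up to one minute" behavior can be pictured as a retry loop bounded by a deadline. The following is a minimal sketch of that general technique, not JADE's actual implementation: `TransientError`, `retry_until_deadline`, the 5-second interval, and `flaky_query` are all hypothetical names and values chosen for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a temporary scheduler failure."""


def retry_until_deadline(func, timeout=60.0, interval=5.0, sleep=time.sleep):
    """Call func until it succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return func()
        except TransientError:
            # Give up once the deadline has passed; otherwise wait and retry.
            if time.monotonic() >= deadline:
                raise
            sleep(interval)


attempts = []


def flaky_query():
    """Fail twice, then succeed, to imitate a briefly unavailable scheduler."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("scheduler busy")
    return "running"


# A no-op sleep keeps the demonstration instant.
result = retry_until_deadline(flaky_query, sleep=lambda seconds: None)
```

In this sketch the last transient error is re-raised once the deadline passes; the documented method instead reports failure as ExecutionError.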

check_statuses()

Check the statuses of all user jobs. Handles transient errors for up to one minute.

Returns:

Mapping of each job_id to its HpcJobStatus.

Return type:

dict

Raises:

ExecutionError – Raised if statuses cannot be retrieved.
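Because the result is a dict keyed by job ID, callers will typically filter it by status. The sketch below shows one way to do that. The `HpcJobStatus` enum defined here is a stand-in for illustration only (the real member names may differ), and `jobs_with_status` is a hypothetical helper.

```python
from enum import Enum


class HpcJobStatus(Enum):
    """Stand-in enum for illustration; the real member names may differ."""

    QUEUED = "queued"
    RUNNING = "running"
    COMPLETE = "complete"


def jobs_with_status(statuses, wanted):
    """Return the sorted job IDs whose status equals `wanted`."""
    return sorted(job_id for job_id, status in statuses.items() if status is wanted)


# Example data shaped like the documented return value: job_id -> status.
statuses = {
    "1003": HpcJobStatus.RUNNING,
    "1001": HpcJobStatus.RUNNING,
    "1002": HpcJobStatus.QUEUED,
}
running = jobs_with_status(statuses, HpcJobStatus.RUNNING)
```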

static check_storage_configuration()

Checks if the storage configuration is appropriate for execution.

Raises:

InvalidConfiguration – Raised if the configuration is not valid.

get_config()

Get HPC configuration parameters.

Return type:

dict

get_current_job_id()

Get the job ID for the local compute node.

Return type:

str
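Slurm exports the `SLURM_JOB_ID` environment variable inside a running job, so a lookup of this kind can be sketched as below. Whether this method reads that variable is an assumption; `current_slurm_job_id` is a hypothetical helper, and the environment is passed in explicitly so the example runs outside a cluster.

```python
import os


def current_slurm_job_id(environ=os.environ):
    """Return the Slurm job ID from the environment as a string."""
    job_id = environ.get("SLURM_JOB_ID")
    if job_id is None:
        raise RuntimeError("SLURM_JOB_ID is not set; not running inside a Slurm job")
    return job_id


# Simulated environment, so no real Slurm allocation is needed.
example_id = current_slurm_job_id({"SLURM_JOB_ID": "123456"})
```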

create_cluster()

Create a Dask cluster.

Returns:

SLURM: SLURMCluster

Return type:

Dask cluster

create_local_cluster()

Create a Dask local cluster.

Return type:

dask.distributed.LocalCluster

create_submission_script(name, script, filename, path)

Create the script to queue the jobs to the HPC.

Parameters:
  • name (str) – job name

  • script (str) – script to execute on HPC

  • filename (str) – submission script filename

  • path (str) – path for stdout and stderr files
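To make the four parameters concrete, here is a minimal sketch of writing an sbatch script from them. It is not the script JADE generates: `write_submission_script` is a hypothetical helper, and the output file naming (`<name>_%j.o` and `<name>_%j.e`) is an assumption. The `#SBATCH` options and the `%j` job-ID pattern are standard Slurm features.

```python
import os
import tempfile


def write_submission_script(name, script, filename, path):
    """Write a simple sbatch script that runs `script` under job name `name`."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        # %j is replaced by Slurm with the job ID at run time.
        f"#SBATCH --output={os.path.join(path, name)}_%j.o",
        f"#SBATCH --error={os.path.join(path, name)}_%j.e",
        "",
        script,
        "",
    ]
    with open(filename, "w") as f:
        f.write("\n".join(lines))


# Write the script into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    script_file = os.path.join(tmp, "job.sh")
    write_submission_script("my_job", "srun hostname", script_file, tmp)
    with open(script_file) as f:
        content = f.read()
```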

get_job_stats(job_id)

Get stats for job ID.

Return type:

HpcJobStats

get_local_scratch()

Get path to local storage space.

Return type:

str

get_node_id()

Return the node ID of the current system.

Return type:

str

static get_num_cpus()

Return the number of CPUs in the system.

Return type:

int

list_active_nodes(job_id)

Return the nodes currently participating in the job. The order should be deterministic, so that repeated calls return the nodes in the same sequence.

Parameters:

job_id (str)

Returns:

list of node hostnames

Return type:

list
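A deterministic order matters when every node must independently reach the same conclusion from the same list, for example when picking one node to coordinate the others. The sketch below illustrates that idea under stated assumptions: sorting is one way to obtain a stable order, and electing the first node as manager is a hypothetical rule, not necessarily how `am_i_manager()` decides. The hostnames are made up.

```python
def deterministic_nodes(hostnames):
    """Return unique hostnames in sorted order, independent of input order."""
    return sorted(set(hostnames))


def is_manager(node, hostnames):
    """Hypothetical rule: the first node in the deterministic order is the manager."""
    return node == deterministic_nodes(hostnames)[0]


reported = ["r2n05", "r1n12", "r2n05", "r1n03"]
nodes = deterministic_nodes(reported)
```

With this rule, every node that evaluates `is_manager` against the same list agrees on which one is the manager.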

log_environment_variables()

Logs all relevant HPC environment variables.
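Which variables count as relevant is not stated here. As an illustration only, the sketch below treats every variable with the `SLURM_` prefix as relevant and writes each one to the standard `logging` module; `slurm_environment` and `log_slurm_environment` are hypothetical helpers.

```python
import logging
import os

logger = logging.getLogger(__name__)


def slurm_environment(environ=os.environ):
    """Return the SLURM_* variables from `environ`, sorted by name."""
    return {key: environ[key] for key in sorted(environ) if key.startswith("SLURM_")}


def log_slurm_environment(environ=os.environ):
    """Log each SLURM_* variable at INFO level."""
    for key, value in slurm_environment(environ).items():
        logger.info("%s=%s", key, value)


# Simulated environment, so the example runs outside a cluster.
sample = {"SLURM_NNODES": "2", "HOME": "/home/user", "SLURM_JOB_ID": "123456"}
selected = slurm_environment(sample)
log_slurm_environment(sample)
```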

submit(filename)

Submit the work to the HPC queue. Handles transient errors for up to one minute.

Parameters:

filename (str) – HPC script filename

Returns:

(Status, job_id, stderr)

Return type:

tuple of Status, str, str
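On success, `sbatch` prints a line of the form `Submitted batch job <id>`. The sketch below shows how a result tuple of the documented shape could be built from that output. It is an illustration under assumptions: the `Status` enum is a stand-in whose member names may differ from the real one, and `parse_sbatch_result` is a hypothetical helper that does not call `sbatch` itself.

```python
import re
from enum import Enum


class Status(Enum):
    """Stand-in enum for illustration; the real member names may differ."""

    GOOD = "good"
    ERROR = "error"


_SUBMITTED = re.compile(r"Submitted batch job (\d+)")


def parse_sbatch_result(returncode, stdout, stderr):
    """Build a (Status, job_id, stderr) tuple from captured sbatch output."""
    match = _SUBMITTED.search(stdout)
    if returncode != 0 or match is None:
        return Status.ERROR, None, stderr
    return Status.GOOD, match.group(1), stderr


# Unpack the tuple the same way a caller of submit() would.
status, job_id, stderr = parse_sbatch_result(0, "Submitted batch job 4821\n", "")
failed = parse_sbatch_result(1, "", "sbatch: error: invalid partition")
```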