gaps.cli.command.CLICommandFromFunction#
- class CLICommandFromFunction(function, name=None, add_collect=False, split_keys=None, config_preprocessor=None, skip_doc_params=None)[source]#
Bases:
AbstractBaseCLICommandConfiguration
Configure a CLI command to execute a function on multiple nodes.
This class configures a CLI command that runs a given function across multiple nodes on an HPC. The primary utility is to split the function execution spatially, meaning that individual nodes will run the function on a subset of the input points. However, this configuration also supports splitting the execution on other inputs (in lieu of or in addition to the geospatial partitioning).
- Parameters:
function (callable) – The function to run on individual nodes. This function will be used to generate all the documentation and template configuration files, so it should be thoroughly documented (using a NumPy Style Python Docstring). In particular, the "Extended Summary" and the "Parameters" sections will be pulled from the docstring.
Warning
The “Extended Summary” section may not show up properly if the short “Summary” section is missing.
This function must return the path to the output file it generates (or a list of paths if multiple output files are generated). If no output files are generated, the function must return None or an empty list. In order to avoid clashing output file names with jobs on other nodes, make sure to "request" the tag argument. This function can "request" the following arguments by including them in the function signature (gaps will automatically pass them to the function without any additional user input):
- tag : str
  Short string unique to this job run that can be used to generate unique output filenames, thereby avoiding clashing output files with jobs on other nodes. This string contains a leading underscore, so the file name can easily be generated: f"{out_file_name}{tag}.{extension}".
- command_name : str
  Name of the command being run. This is equivalent to the name input argument.
- pipeline_step : str
  Name of the pipeline step being run. This is often the same as command_name, but can be different if a pipeline contains duplicate steps.
- config_file : str
  Path to the configuration file specified by the user.
- project_dir : str
  Path to the project directory (parent directory of the configuration file).
- job_name : str
  Name of the job being run. This is typically a combination of the project directory, the command name, and a tag unique to a job. Note that the tag will not be included if you request this argument in a config preprocessing function, as the execution has not been split into multiple jobs by that point.
- out_dir : str
  Path to the output directory; typically equivalent to the project directory.
- out_fpath : str
  Suggested path to the output file. You are not required to use this argument; it is provided purely for convenience. This argument combines out_dir with job_name to yield a unique output filepath for a given node. Note that the output filename will contain the tag. Also note that this string WILL NOT contain a file ending, so that will have to be added by the node function.
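To make the "requested" arguments concrete, here is a minimal sketch of a node function. The name run_model and its output layout are hypothetical illustrations, not part of the gaps API; only the tag, out_dir, and max_workers parameter names mirror the arguments described above.

```python
from pathlib import Path

def run_model(tag, out_dir, max_workers=1):
    """Hypothetical node function; gaps fills ``tag`` and ``out_dir``.

    The leading underscore in ``tag`` keeps output file names from
    clashing with jobs running on other nodes.
    """
    out_fpath = Path(out_dir) / f"results{tag}.h5"
    # ... run the model with up to ``max_workers`` processes and
    # write the results to ``out_fpath`` here ...
    return str(out_fpath)  # gaps expects the output path back

# On one node, gaps would invoke it roughly like this:
print(run_model(tag="_j0", out_dir="/tmp/my_project"))
```

Note that the returned path embeds the tag, which is exactly what the collection machinery (see add_collect below) relies on.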
If your function is capable of multiprocessing, you should also include max_workers in the function signature. gaps will pass an integer equal to the number of processes the user wants to run on a single node for this value.
Warning
The keywords {"max_workers", "sites_per_worker", "memory_utilization_limit", "timeout", "pool_size"} are assumed to describe execution control. If you request any of these as function arguments, users of your CLI will specify them in the execution_control block of the input config file.
Note that the config parameter is not allowed as a function signature item. Please request all the required keys/inputs directly instead. This function can also request "private" arguments by including a leading underscore in the argument name. These arguments are NOT exposed to users in the documentation or template configuration files. Instead, it is expected that the config_preprocessor function fills these arguments in programmatically before the function is distributed across nodes. See the implementation of gaps.cli.collect.collect() and gaps.cli.preprocessing.preprocess_collect_config() for an example of this pattern. You can use the skip_doc_params input below to achieve the same results without the underscore syntax (helpful for public-facing functions).
name (str, optional) – Name of the command. This will be the name used to call the command on the terminal. This name does not have to match the function name. It is encouraged to use lowercase names with dashes ("-") instead of underscores ("_") to stay consistent with click's naming conventions. By default, None, which uses the function name as the command name (with minor formatting to conform to click-style commands).
add_collect (bool, optional) – Option to add a "collect-{command_name}" command immediately following this command to collect the (multiple) output files generated across nodes into a single file. The collect command will only work if the output files the previous command generates are HDF5 files with meta and time_index datasets (standard rex HDF5 file structure). If you set this option to True, your run function must return the path (as a string) to the output file it generates in order for users to be able to use "PIPELINE" as the input to the collect_pattern key in the collection config. The path returned by this function must also include the tag in the output file name in order for collection to function properly (an easy way to do this is to request tag in the function signature and name the output file generated by the function using the f"{out_file_name}{tag}.{extension}" format). By default, False.
split_keys (set | container, optional) – A set of strings identifying the names of the config keys that gaps should split the function execution on. To specify geospatial partitioning in particular, ensure that the main function has a "project_points" argument (which accepts a gaps.project_points.ProjectPoints instance) and specify "project_points" as a split argument. Users of the CLI will only need to specify the path to the project points file and a "nodes" argument in the execution control. To split execution on additional/other inputs, include them by name in this input (and ensure the run function accepts them as input). You may include tuples of strings in this iterable as well. Tuples of strings will be interpreted as combinations of keys whose values should be iterated over simultaneously. For example, specifying split_keys=[("a", "b")] and invoking with a config file where a = [1, 2] and b = [3, 4] will run the main function two times (on two nodes), first with the inputs a=1, b=3 and then with the inputs a=2, b=4. It is the responsibility of the developer using this class to ensure that the user input for all split_keys is an iterable (typically a list), and that the lengths of all "paired" keys match. To allow non-iterable user input for split keys, use the config_preprocessor argument to specify a preprocessing function that converts the user input into a list of the expected inputs. If users specify an empty list or None for a key in split_keys, then GAPs will pass None as the value for that key (i.e. if split_keys=["a"] and users specify "a": [] in their config, then the function will be called with a=None). If None, execution is not split across nodes, and a single node is always used for the function call. By default, None.
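The paired-key behavior can be sketched in plain Python. This mirrors the a/b example from the paragraph above (zip-style pairing); it is an illustration of the splitting semantics, not a call into gaps itself:

```python
# User config for a command created with split_keys=[("a", "b")]
config = {"a": [1, 2], "b": [3, 4]}

# Paired keys are iterated over simultaneously, so each node
# receives one (a, b) combination:
node_kwargs = [
    {"a": a, "b": b} for a, b in zip(config["a"], config["b"])
]
print(node_kwargs)  # [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}]
```

This is also why the lengths of all "paired" keys must match: zip-style pairing silently drops trailing values if one list is shorter.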
config_preprocessor (callable, optional) – Optional function for configuration pre-processing. The preprocessing step occurs before jobs are split across HPC nodes, and can therefore be used to calculate the split_keys input and/or validate that it conforms to the requirements laid out above. At minimum, this function should have "config" as the first parameter (which will receive the user configuration input as a dictionary) and must return the updated config dictionary. This function can also "request" the following arguments by including them in the function signature:
- command_name : str
  Name of the command being run. This is equivalent to the name input above.
- pipeline_step : str
  Name of the pipeline step being run. This is often the same as command_name, but can be different if a pipeline contains duplicate steps.
- config_file : Path
  Path to the configuration file specified by the user.
- project_dir : Path
  Path to the project directory (parent directory of the configuration file).
- job_name : str
  Name of the job being run. This is typically a combination of the project directory and the command name.
- out_dir : Path
  Path to the output directory; typically equivalent to the project directory.
- out_fpath : Path
  Suggested path to the output file. You are not required to use this argument; it is provided purely for convenience. This argument combines out_dir with job_name to yield a unique output filepath for a given node. Note that the output filename WILL NOT contain the tag, since the number of split nodes has not been determined when the config pre-processing function is called. Also note that this string WILL NOT contain a file ending, so that will have to be added by the node function.
- log_directory : Path
  Path to the log output directory (defaults to project_dir / "logs").
- verbose : bool
  Flag indicating whether the user has selected a DEBUG verbosity level for logs.
These inputs will be provided by GAPs and will not be displayed to users in the template configuration files or documentation. See gaps.cli.preprocessing.preprocess_collect_config() for an example. Note that the tag parameter is not allowed as a pre-processing function signature item (the node jobs will not have been configured before this function executes). This function can also "request" new user inputs that are not present in the signature of the main run function. In this case, the documentation for these new arguments is pulled from the config_preprocessor function. This feature can be used to request auxiliary information from the user to fill in "private" inputs to the main run function. See the implementation of gaps.cli.preprocessing.preprocess_collect_config() and gaps.cli.collect.collect() for an example of this pattern. Do not request parameters with the same names as any of your model function parameters (i.e. if res_file is a model parameter, do not request it in the preprocessing function docstring; extract it from the config dictionary instead). By default, None.
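To make the pre-processing pattern concrete, here is a hedged sketch of such a function. The config keys sites and _data_dir are hypothetical examples (not part of the gaps API); only the config and project_dir parameters mirror the signature described above.

```python
from pathlib import Path

def preprocess_config(config, project_dir):
    """Hypothetical pre-processor run before jobs are split.

    Normalizes a split key so users may pass a scalar instead of a
    list, and fills in a "private" run-function argument
    (``_data_dir``) that never appears in user-facing docs.
    """
    sites = config.get("sites", [])
    if not isinstance(sites, list):  # allow non-iterable user input
        sites = [sites]
    config["sites"] = sites
    config["_data_dir"] = str(Path(project_dir) / "data")
    return config

print(preprocess_config({"sites": 42}, "/tmp/my_project"))
```

Because this runs before the node split, the normalized "sites" list is what gaps would then partition across nodes if "sites" were listed in split_keys.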
skip_doc_params (iterable of str, optional) – Optional iterable of parameter names that should be excluded from the documentation/template configuration files. This can be useful if your pre-processing function automatically sets some parameters based on other user input. This option is an alternative to the "private" arguments discussed in the function parameter documentation above. By default, None.
Methods

Attributes

Documentation object.

True if execution is split spatially across nodes.

- property documentation#
  Documentation object.
  - Type: