reV collect

Execute the collect step from a config file.

Collect data generated across multiple nodes into a single HDF5 file.

The general structure for calling this CLI command is given below (add --help to print help info to the terminal).

reV collect [OPTIONS]
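For example, if your collect configuration is saved in a file named config_collect.json (an illustrative filename, not one required by reV), the call would look like:

reV collect -c ./config_collect.json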

Options

-c, --config_file <config_file>

Required. Path to the collect configuration file. Below is a sample template config:

{
    "execution_control": {
        "option": "local",
        "allocation": "[REQUIRED IF ON HPC]",
        "walltime": "[REQUIRED IF ON HPC]",
        "qos": "normal",
        "memory": null,
        "queue": null,
        "feature": null,
        "conda_env": null,
        "module": null,
        "sh_script": null
    },
    "log_directory": "./logs",
    "log_level": "INFO",
    "project_points": null,
    "datasets": null,
    "purge_chunks": false,
    "clobber": true,
    "collect_pattern": "PIPELINE"
}

Parameters

execution_control : dict

Dictionary containing execution control arguments. Allowed arguments are:

option:

({‘local’, ‘kestrel’, ‘eagle’, ‘awspc’, ‘slurm’, ‘peregrine’}) Hardware run option. Determines the type of job scheduler to use as well as the base AU cost. The “slurm” option is a catchall for HPC systems that use the SLURM scheduler and should only be used if the desired hardware is not listed above. If “local”, no other HPC-specific keys are required in execution_control (they are ignored if provided).

allocation:

(str) HPC project (allocation) handle.

walltime:

(int) Node walltime request in hours.

qos:

(str, optional) Quality-of-service specifier. For Kestrel users: This should be one of {‘standby’, ‘normal’, ‘high’}. Note that ‘high’ priority doubles the AU cost. By default, "normal".

memory:

(int, optional) Node memory max limit (in GB). By default, None, which uses the scheduler’s default memory limit. For Kestrel users: If you would like to use the full node memory, leave this argument unspecified (or set to None) if you are running on standard nodes. However, if you would like to use the bigmem nodes, you must specify the full upper limit of memory you would like for your job, otherwise you will be limited to the standard node memory size (250GB).

queue:

(str, optional; PBS ONLY) HPC queue to submit job to. Examples include: ‘debug’, ‘short’, ‘batch’, ‘batch-h’, ‘long’, etc. By default, None, which uses “test_queue”.

feature:

(str, optional) Additional flags for SLURM job (e.g. “-p debug”). By default, None, which does not specify any additional flags.

conda_env:

(str, optional) Name of conda environment to activate. By default, None, which does not load any environments.

module:

(str, optional) Module to load. By default, None, which does not load any modules.

sh_script:

(str, optional) Extra shell script to run before command call. By default, None, which does not run any scripts.

Only the option key is required for local execution. For execution on the HPC, the allocation and walltime keys are also required. All other options are populated with default values, as seen above.
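As an illustration, a minimal execution_control block for a SLURM-based run on Kestrel might look like the sketch below; the allocation handle and walltime shown are placeholder values, not defaults provided by reV:

    "execution_control": {
        "option": "kestrel",
        "allocation": "my_hpc_allocation",
        "walltime": 4
    }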

log_directory : str

Path to directory where logs should be written. Path can be relative and does not have to exist on disk (it will be created if missing). By default, "./logs".

log_level : {“DEBUG”, “INFO”, “WARNING”, “ERROR”}

String representation of desired logger verbosity. Suitable options are DEBUG (most verbose), INFO (moderately verbose), WARNING (only log warnings and errors), and ERROR (only log errors). By default, "INFO".

project_points : str | list, optional

This input should represent the project points that correspond to the full collection of points contained in the input HDF5 files to be collected. You may simply point to a ProjectPoints CSV file that contains the GIDs that should be collected. You may also input the GIDs as a list, though this may not be suitable for collections with a large number of points. You may also set this input to None to generate the list of GIDs automatically from the input files. By default, None.
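For example, either of the following forms would be accepted in the config file (the CSV path and GID values shown are placeholders):

    "project_points": "./project_points.csv"

    "project_points": [0, 1, 2, 3]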

datasets : list of str, optional

List of dataset names to collect into the output file. If collection is performed into multiple files (i.e. multiple input patterns), this list can contain all relevant datasets across all files (a warning will be thrown, but it is safe to ignore it). If None, all datasets from the input files are collected. By default, None.
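For example, to collect only a subset of datasets, you might specify something like the following (the dataset names shown are illustrative; use the names actually present in your input HDF5 files):

    "datasets": ["cf_mean", "cf_profile"]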

purge_chunks : bool, optional

Option to delete single-node input HDF5 files. Note that the input files will not be removed if any of the datasets they contain have not been collected, regardless of the value of this input. By default, False.

clobber : bool, optional

Flag to purge all collection output HDF5 files prior to running the collection step if they exist on disk. This helps avoid any surprising data byproducts when re-running the collection step in a project directory. By default, True.

collect_pattern : str | list | dict, optional

Unix-style /filepath/pattern*.h5 representing the files to be collected into a single output HDF5 file. If no output file path is specified (i.e. this input is a single pattern or a list of patterns), the output file path will be inferred from the pattern itself (specifically, the wildcard will be removed and the result will be the output file path). If a list of patterns is provided, each pattern will be collected into a separate output file. To specify the name of the output file(s), set this input to a dictionary where the keys are paths to the output file (including the filename itself; relative paths are allowed) and the values are patterns representing the files that should be collected into the output file. If running a collect job as part of a pipeline, this input can be set to "PIPELINE", which will parse the output of the previous step and generate the input file pattern and output file name automatically. By default, "PIPELINE".
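For instance, the dictionary form of this input might look like the sketch below; the output file paths and input patterns are illustrative placeholders:

    "collect_pattern": {
        "./outputs/collected_2012.h5": "./chunks/run_2012*.h5",
        "./outputs/collected_2013.h5": "./chunks/run_2013*.h5"
    }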

Note that you may remove any keys with a null value if you do not intend to update them yourself.