reV collect
Execute the collect step from a config file.
Collect data generated across multiple nodes into a single HDF5 file.
The general structure for calling this CLI command is given below
(add --help
to print help info to the terminal).
reV collect [OPTIONS]
Options
- -c, --config_file <config_file>
  Required. Path to the collect configuration file. Below is a sample template config (JSON):

```json
{
    "execution_control": {
        "option": "local",
        "allocation": "[REQUIRED IF ON HPC]",
        "walltime": "[REQUIRED IF ON HPC]",
        "qos": "normal",
        "memory": null,
        "queue": null,
        "feature": null,
        "conda_env": null,
        "module": null,
        "sh_script": null,
        "num_test_nodes": null
    },
    "log_directory": "./logs",
    "log_level": "INFO",
    "project_points": null,
    "datasets": null,
    "purge_chunks": false,
    "clobber": true,
    "collect_pattern": "PIPELINE"
}
```
The same template in YAML:

```yaml
execution_control:
  option: local
  allocation: '[REQUIRED IF ON HPC]'
  walltime: '[REQUIRED IF ON HPC]'
  qos: normal
  memory: null
  queue: null
  feature: null
  conda_env: null
  module: null
  sh_script: null
  num_test_nodes: null
log_directory: ./logs
log_level: INFO
project_points: null
datasets: null
purge_chunks: false
clobber: true
collect_pattern: PIPELINE
```
And in TOML (keys whose default is null are omitted, since TOML has no null value):

```toml
log_directory = "./logs"
log_level = "INFO"
purge_chunks = false
clobber = true
collect_pattern = "PIPELINE"

[execution_control]
option = "local"
allocation = "[REQUIRED IF ON HPC]"
walltime = "[REQUIRED IF ON HPC]"
qos = "normal"
```
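For illustration, a filled-in collect config for an HPC run might look something like the sketch below; the allocation handle, walltime, and the config file name are hypothetical placeholders rather than values taken from this documentation:

```yaml
# Hypothetical filled-in collect config (e.g. saved as config_collect.yaml)
# and submitted with: reV collect -c config_collect.yaml
execution_control:
  option: kestrel          # use the Kestrel HPC scheduler
  allocation: myproject    # placeholder HPC allocation handle
  walltime: 1              # node walltime request in hours
  qos: normal
log_directory: ./logs
log_level: INFO
project_points: null       # build the GID list automatically from the input files
datasets: null             # collect all datasets found in the input files
purge_chunks: false
clobber: true
collect_pattern: PIPELINE  # let the pipeline supply the file pattern and output name
```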
Parameters
- execution_control : dict
  Dictionary containing execution control arguments. Allowed arguments are:

  - option: ({‘local’, ‘kestrel’, ‘eagle’, ‘awspc’, ‘slurm’, ‘peregrine’}) Hardware run option. Determines the type of job scheduler to use as well as the base AU cost. The “slurm” option is a catchall for HPC systems that use the SLURM scheduler and should only be used if the desired hardware is not listed above. If “local”, no other HPC-specific keys are required in execution_control (they are ignored if provided).
  - allocation: (str) HPC project (allocation) handle.
  - walltime: (int) Node walltime request in hours.
  - qos: (str, optional) Quality-of-service specifier. For Kestrel users: This should be one of {‘standby’, ‘normal’, ‘high’}. Note that ‘high’ priority doubles the AU cost. By default, "normal".
  - memory: (int, optional) Node memory max limit (in GB). By default, None, which uses the scheduler’s default memory limit. For Kestrel users: If you would like to use the full node memory, leave this argument unspecified (or set to None) if you are running on standard nodes. However, if you would like to use the bigmem nodes, you must specify the full upper limit of memory you would like for your job, otherwise you will be limited to the standard node memory size (250GB).
  - queue: (str, optional; PBS ONLY) HPC queue to submit job to. Examples include: ‘debug’, ‘short’, ‘batch’, ‘batch-h’, ‘long’, etc. By default, None, which uses “test_queue”.
  - feature: (str, optional) Additional flags for SLURM job (e.g. “-p debug”). By default, None, which does not specify any additional flags.
  - conda_env: (str, optional) Name of conda environment to activate. By default, None, which does not load any environments.
  - module: (str, optional) Module to load. By default, None, which does not load any modules.
  - sh_script: (str, optional) Extra shell script to run before command call. By default, None, which does not run any scripts.
  - num_test_nodes: (str, optional) Number of nodes to submit before terminating the submission process. This can be used to test a new submission configuration without submitting all nodes (i.e. only running a handful to ensure the inputs are specified correctly and the outputs look reasonable). By default, None, which submits all node jobs.

  Only the option key is required for local execution. For execution on the HPC, the allocation and walltime keys are also required. All other options are populated with default values, as seen above. (A sketch contrasting local and HPC execution_control settings is given at the end of this section.)
- log_directory : str
  Path to directory where logs should be written. Path can be relative and does not have to exist on disk (it will be created if missing). By default, "./logs".
- log_level : {“DEBUG”, “INFO”, “WARNING”, “ERROR”}
  String representation of desired logger verbosity. Suitable options are DEBUG (most verbose), INFO (moderately verbose), WARNING (only log warnings and errors), and ERROR (only log errors). By default, "INFO".
- project_points : str | list, optional
  This input should represent the project points that correspond to the full collection of points contained in the input HDF5 files to be collected. You may simply point to a ProjectPoints csv file that contains the GID’s that should be collected. You may also input the GID’s as a list, though this may not be suitable for collections with a large number of points. You may also set this input to None to generate a list of GID’s automatically from the input files. By default, None.
- datasets : list of str, optional
  List of dataset names to collect into the output file. If collection is performed into multiple files (i.e. multiple input patterns), this list can contain all relevant datasets across all files (a warning will be thrown, but it is safe to ignore it). If None, all datasets from the input files are collected. By default, None.
- purge_chunks : bool, optional
  Option to delete single-node input HDF5 files. Note that the input files will not be removed if any of the datasets they contain have not been collected, regardless of the value of this input. By default, False.
- clobber : bool, optional
  Flag to purge all collection output HDF5 files prior to running the collection step if they exist on disk. This helps avoid any surprising data byproducts when re-running the collection step in a project directory. By default, True.
- collect_pattern : str | list | dict, optional
  Unix-style /filepath/pattern*.h5 representing the files to be collected into a single output HDF5 file. If no output file path is specified (i.e. this input is a single pattern or a list of patterns), the output file path will be inferred from the pattern itself (specifically, the wildcard will be removed and the result will be the output file path). If a list of patterns is provided, each pattern will be collected into a separate output file. To specify the name of the output file(s), set this input to a dictionary where the keys are paths to the output file (including the filename itself; relative paths are allowed) and the values are patterns representing the files that should be collected into the output file (see the sketch just after this list). If running a collect job as part of a pipeline, this input can be set to "PIPELINE", which will parse the output of the previous step and generate the input file pattern and output file name automatically. By default, "PIPELINE".
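As referenced in the collect_pattern entry above, the dictionary form might look like the following sketch; the output file names and chunk-file patterns are hypothetical:

```yaml
# Hypothetical dictionary form of collect_pattern:
# keys are output HDF5 file paths, values are Unix-style patterns
# matching the chunked files to collect into each output.
collect_pattern:
  ./outputs/collected_cf.h5: ./chunk_files/cf_profile_node*.h5
  ./outputs/collected_lcoe.h5: ./chunk_files/lcoe_node*.h5
```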
Note that you may remove any keys with a null value if you do not intend to update them yourself.
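As referenced in the execution_control entry above, the block typically takes one of two shapes; the sketch below contrasts a local run with an HPC run, with the allocation handle as a hypothetical placeholder:

```yaml
# Local execution: only "option" is required; HPC-specific keys are ignored.
execution_control:
  option: local

# HPC execution (e.g. Kestrel): "allocation" and "walltime" are also required.
# (Shown commented out so this file remains a single valid YAML document.)
# execution_control:
#   option: kestrel
#   allocation: myproject   # placeholder allocation handle
#   walltime: 4             # hours
#   qos: normal             # 'high' doubles the AU cost
```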