gaps.collection.Collector#
- class Collector(h5_file, collect_pattern, project_points, clobber=False, config_file=None, command_name='collect')[source]#
Bases: object
Collector of multiple source files into a single output file
- Parameters:
h5_file (path-like) – Path to output HDF5 file into which data will be collected.
collect_pattern (str) – Unix-style /filepath/pattern*.h5 representing a list of input files to be collected into a single HDF5 file.
project_points (str | slice | list | pandas.DataFrame | None) – Project points that correspond to the full collection of points contained in the HDF5 files to be collected. None if points list is to be ignored (i.e. collect all data in the input HDF5 files without checking that all gids are there).
clobber (bool, optional) – Flag to purge output HDF5 file if it already exists. By default, False.
config_file (str, optional) – Path to config file used to set up this collection run (if applicable). This is used to store information about the collection in the output file attrs. By default, None.
command_name (str, default="collect") – Name of the command that is being run. This is used to set the config key in the attributes of the output file. By default, "collect".
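As a rough illustration of how the collect_pattern argument is interpreted, the sketch below expands a Unix-style pattern into a sorted file list using only the standard library. The file names and temporary directory are hypothetical stand-ins, not output produced by gaps itself:

```python
import glob
import tempfile
from pathlib import Path

# Hypothetical chunked output files (names are illustrative only).
tmp = Path(tempfile.mkdtemp())
for i in range(3):
    (tmp / f"run_node{i}.h5").touch()
(tmp / "unrelated.txt").touch()

# A Unix-style pattern like the one `collect_pattern` expects.
collect_pattern = str(tmp / "run_node*.h5")

# Glob expansion yields the list of input files to be collected;
# the unrelated file is excluded by the pattern.
h5_files = sorted(glob.glob(collect_pattern))
print([Path(f).name for f in h5_files])
```

Sorting the expanded list gives a deterministic collection order regardless of filesystem enumeration order.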
Methods

add_dataset(h5_file, collect_pattern, dataset_in[, ...]) – Collect and add a dataset to a single HDF5 file
collect(dataset_in[, dataset_out, ...]) – Collect a dataset from h5_dir to h5_file
combine_meta() – Combine meta data from input files and write to out file
combine_time_index() – Combine time_index from input files and write to out file
get_dataset_shape(dataset_name) – Extract dataset shape from the first file in the collection
move_chunks([sub_dir]) – Move chunked files from a directory to a sub-directory
purge_chunks() – Remove chunked files from a directory

Attributes

gids – List of gids corresponding to all sites to combine
h5_files – List of paths to HDF5 files to be combined
- get_dataset_shape(dataset_name)[source]#
Extract dataset shape from the first file in the collection
- Parameters:
dataset_name (str) – Dataset to be collected whose shape is in question.
- Returns:
shape (tuple) – Dataset shape tuple.
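The behavior can be mimicked with plain Python. This is a minimal sketch in which chunk files are represented as dicts of nested lists (real chunks are HDF5 files read via their dataset shape attribute); the `chunks` data and `dset_shape` helper are illustrative, not part of the gaps API:

```python
# Stand-in for a list of chunked output files: each maps a dataset
# name to its data. Real chunks are HDF5 files on disk.
chunks = [
    {"cf_profile": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]},  # shape (3, 2)
    {"cf_profile": [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]},
]

def dset_shape(chunk_files, dataset_name):
    """Extract the dataset shape from the first file only."""
    data = chunk_files[0][dataset_name]
    shape = [len(data)]
    while isinstance(data[0], list):
        shape.append(len(data[0]))
        data = data[0]
    return tuple(shape)

print(dset_shape(chunks, "cf_profile"))
```

Only the first chunk is inspected, which mirrors the documented behavior: the shape is assumed consistent across all files in the collection.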
- combine_time_index()[source]#
Combine time_index from input files and write to out file
If time_index is not given in the input HDF5 files, the time_index in the output file is set to None.
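The None-fallback rule described above can be sketched with toy chunk stand-ins (plain dicts with an optional "time_index" key; the helper name is hypothetical):

```python
def combined_time_index(chunk_files):
    """Return the shared time_index, or None if any chunk lacks one."""
    indexes = [chunk.get("time_index") for chunk in chunk_files]
    if any(ti is None for ti in indexes):
        # Mirrors the documented behavior: missing time_index in any
        # input means the output time_index is set to None.
        return None
    return indexes[0]

with_ti = [{"time_index": ["2025-01-01", "2025-01-02"]},
           {"time_index": ["2025-01-01", "2025-01-02"]}]
missing_ti = [{"time_index": ["2025-01-01"]}, {}]

print(combined_time_index(with_ti))
print(combined_time_index(missing_ti))
```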
- purge_chunks()[source]#
Remove chunked files from a directory
- Warns:
gapsCollectionWarning – If some datasets have not been collected.
Warning
This function WILL NOT delete files if any datasets were not collected.
- move_chunks(sub_dir='chunk_files')[source]#
Move chunked files from a directory to a sub-directory
- Parameters:
sub_dir (path-like, optional) – Sub directory name to move chunks to. By default, “chunk_files”.
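The file-moving step can be approximated with the standard library. This sketch assumes chunk files all live in one directory and moves them into a sibling sub-directory; the helper and file names are illustrative, not the gaps implementation:

```python
import shutil
import tempfile
from pathlib import Path

def move_chunks(chunk_files, sub_dir="chunk_files"):
    """Move each chunk file into a sub-directory next to it."""
    for fpath in chunk_files:
        fpath = Path(fpath)
        dst_dir = fpath.parent / sub_dir
        dst_dir.mkdir(exist_ok=True)
        shutil.move(str(fpath), str(dst_dir / fpath.name))

tmp = Path(tempfile.mkdtemp())
files = [tmp / f"out_node{i}.h5" for i in range(2)]
for f in files:
    f.touch()

move_chunks(files)
print(sorted(p.name for p in (tmp / "chunk_files").iterdir()))
```

Moving (rather than deleting, as purge_chunks does) keeps the source data available for inspection after collection.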
- collect(dataset_in, dataset_out=None, memory_utilization_limit=0.7, pass_through=False)[source]#
Collect a dataset from h5_dir to h5_file
- Parameters:
dataset_in (str) – Name of dataset to collect. If source shape is 2D, time index will be collected as well.
dataset_out (str) – Name of dataset into which collected data is to be written. If None, the name of the output dataset is assumed to match the input dataset name. By default, None.
memory_utilization_limit (float) – Memory utilization limit (fractional). This sets how many sites will be collected at a time. By default, 0.7.
pass_through (bool) – Flag to just pass through dataset from one of the source files, assuming all of the source files have identical copies of this dataset. By default, False.
See also
Collector.add_dataset
Collect a dataset into an existing HDF5 file.
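One way to picture the memory_utilization_limit parameter is as arithmetic that converts a fractional budget into a number of sites collected per pass. The function below is a hedged sketch of that idea, not gaps' actual accounting (the constants and names are illustrative):

```python
import math

def sites_per_pass(available_bytes, memory_utilization_limit,
                   n_timesteps, itemsize):
    """Estimate how many sites fit in one collection pass."""
    # Fraction of available memory we allow the collection to use.
    budget = available_bytes * memory_utilization_limit
    # One site's 2D column occupies n_timesteps * itemsize bytes.
    per_site = n_timesteps * itemsize
    # Always collect at least one site per pass.
    return max(1, math.floor(budget / per_site))

# e.g. 8 GB available, the 0.7 default limit, one year of hourly
# float32 data (8760 steps x 4 bytes per value):
print(sites_per_pass(8 * 1024**3, 0.7, 8760, 4))
```

Lowering memory_utilization_limit shrinks the per-pass site count, trading more read/write passes for a smaller peak memory footprint.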
- classmethod add_dataset(h5_file, collect_pattern, dataset_in, dataset_out=None, memory_utilization_limit=0.7, pass_through=False)[source]#
Collect and add a dataset to a single HDF5 file
- Parameters:
h5_file (path-like) – Path to output HDF5 file into which data will be collected. Note that this file must already exist and have a valid meta.
collect_pattern (str) – Unix-style /filepath/pattern*.h5 representing a list of input files to be collected into a single HDF5 file.
dataset_in (str) – Name of dataset to collect. If source shape is 2D, time index will be collected as well.
dataset_out (str) – Name of dataset into which collected data is to be written. If None, the name of the output dataset is assumed to match the input dataset name. By default, None.
memory_utilization_limit (float) – Memory utilization limit (fractional). This sets how many sites will be collected at a time. By default, 0.7.
pass_through (bool) – Flag to just pass through dataset from one of the source files, assuming all of the source files have identical copies of this dataset. By default, False.
See also
Collector.collect
Collect a dataset into a file that does not yet exist.
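Conceptually, both collect and add_dataset stitch per-chunk site data into one dataset ordered by site gid. The toy stand-in below shows that stitching with plain Python lists; the chunk layout and the `collect_dataset` helper are hypothetical illustrations, not the gaps implementation (real data lives in HDF5 files and is ordered against the project points meta):

```python
# Each toy "chunk file" holds the gids it covers and a 1D dataset.
# Note the chunks arrive out of gid order, as parallel jobs often do.
chunks = [
    {"gids": [2, 3], "cf_mean": [0.32, 0.33]},
    {"gids": [0, 1], "cf_mean": [0.30, 0.31]},
]

def collect_dataset(chunk_files, dataset):
    """Concatenate a dataset across chunks, sorted by site gid."""
    pairs = []
    for chunk in chunk_files:
        pairs.extend(zip(chunk["gids"], chunk[dataset]))
    # Sorting by gid restores the global site order in the output.
    pairs.sort(key=lambda pair: pair[0])
    return [value for _, value in pairs]

print(collect_dataset(chunks, "cf_mean"))
```

Because output order comes from the gids rather than file discovery order, the result is the same however the chunked jobs were scheduled.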