gaps.collection.Collector#

class Collector(h5_file, collect_pattern, project_points, clobber=False, config_file=None, command_name='collect')[source]#

Bases: object

Collector of multiple source files into a single output file

Parameters:
  • h5_file (path-like) – Path to output HDF5 file into which data will be collected.

  • collect_pattern (str) – Unix-style /filepath/pattern*.h5 representing a list of input files to be collected into a single HDF5 file.

  • project_points (str | slice | list | pandas.DataFrame | None) – Project points that correspond to the full collection of points contained in the HDF5 files to be collected. None if the points list is to be ignored (i.e. collect all data in the input HDF5 files without checking that all gids are present).

  • clobber (bool, optional) – Flag to purge output HDF5 file if it already exists. By default, False.

  • config_file (str, optional) – Path to config file used to set up this collection run (if applicable). This is used to store information about the collection in the output file attrs. By default, None.

  • command_name (str, default=”collect”) – Name of the command that is being run. This is used to set the config key in the attributes of the output file. By default, "collect".
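As an illustrative sketch of how a Unix-style `collect_pattern` selects the chunked source files (the file names below are hypothetical examples, not names produced by gaps):

```python
import fnmatch

# Hypothetical output files left behind by a parallel run; the names
# are illustrative only and not part of the gaps API.
outputs = ["run_node0.h5", "run_node1.h5", "run_summary.json"]

# A Unix-style pattern such as the `collect_pattern` argument matches
# only the HDF5 chunks destined for the single combined output file.
pattern = "run_node*.h5"
matched = sorted(f for f in outputs if fnmatch.fnmatch(f, pattern))
# matched == ["run_node0.h5", "run_node1.h5"]
```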

Methods

add_dataset(h5_file, collect_pattern, dataset_in)

Collect and add a dataset to a single HDF5 file

collect(dataset_in[, dataset_out, ...])

Collect a dataset from h5_dir to h5_file

combine_meta()

Combine metadata from input files and write to the output file

combine_time_index()

Combine time_index from input files and write to the output file

get_dataset_shape(dataset_name)

Extract dataset shape from the first file in the collection

move_chunks([sub_dir])

Move chunked files from a directory to a sub-directory

purge_chunks()

Remove chunked files from a directory

Attributes

gids

List of gids corresponding to all sites to combine

h5_files

List of paths to HDF5 files to be combined

get_dataset_shape(dataset_name)[source]#

Extract dataset shape from the first file in the collection

Parameters:

dataset_name (str) – Dataset to be collected whose shape is in question.

Returns:

shape (tuple) – Dataset shape tuple.

property h5_files#

List of paths to HDF5 files to be combined

Type:

list

property gids#

List of gids corresponding to all sites to combine

Type:

list

combine_meta()[source]#

Combine metadata from input files and write to the output file
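As an illustrative sketch (not the gaps implementation), the combined meta can be thought of as the ordered union of the gids carried by each chunk file:

```python
# Hypothetical gid subsets per chunk file (names and values are
# illustrative only).
chunk_gids = {
    "chunk0.h5": [4, 5, 6],
    "chunk1.h5": [0, 1, 2, 3],
}

# The combined meta orders sites by gid across all source files.
all_gids = sorted(g for gids in chunk_gids.values() for g in gids)
# all_gids == [0, 1, 2, 3, 4, 5, 6]
```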

combine_time_index()[source]#

Combine time_index from input files and write to the output file

If time_index is not given in the input HDF5 files, the time_index in the output file is set to None.

purge_chunks()[source]#

Remove chunked files from a directory

Warns:

gapsCollectionWarning – If some datasets have not been collected.

Warning

This function WILL NOT delete files if any datasets were not collected.

move_chunks(sub_dir='chunk_files')[source]#

Move chunked files from a directory to a sub-directory

Parameters:

sub_dir (path-like, optional) – Name of the sub-directory to move chunks into. By default, “chunk_files”.
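A minimal stdlib sketch of this kind of move (illustrative only, not the gaps implementation; the file names are hypothetical):

```python
import shutil
import tempfile
from pathlib import Path

# Set up a scratch directory with two hypothetical chunk files.
workdir = Path(tempfile.mkdtemp())
for name in ("out_node0.h5", "out_node1.h5"):
    (workdir / name).touch()

# Move the chunks matching a Unix-style pattern into a sub-directory.
sub_dir = workdir / "chunk_files"
sub_dir.mkdir(exist_ok=True)
for chunk in workdir.glob("out_node*.h5"):
    shutil.move(str(chunk), sub_dir / chunk.name)

moved = sorted(p.name for p in sub_dir.iterdir())
# moved == ["out_node0.h5", "out_node1.h5"]
```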

collect(dataset_in, dataset_out=None, memory_utilization_limit=0.7, pass_through=False)[source]#

Collect a dataset from h5_dir to h5_file

Parameters:
  • dataset_in (str) – Name of dataset to collect. If source shape is 2D, time index will be collected as well.

  • dataset_out (str) – Name of dataset into which collected data is to be written. If None, the output dataset name is assumed to match the input dataset name. By default, None.

  • memory_utilization_limit (float) – Memory utilization limit (fractional). This sets how many sites will be collected at a time. By default, 0.7.

  • pass_through (bool) – Flag to pass the dataset through unchanged from one of the source files, assuming all of the source files have identical copies of this dataset. By default, False.

See also

Collector.add_dataset

Collect a dataset into an existing HDF5 file.
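The `memory_utilization_limit` parameter caps how much data is held in memory per collection pass. A minimal sketch of how a fractional limit might translate into a per-batch site count (this is an assumption about the batching arithmetic, not the actual gaps internals; `sites_per_batch` is a hypothetical helper):

```python
def sites_per_batch(available_bytes, bytes_per_site, utilization=0.7):
    """Sites that fit within the fractional memory budget (at least 1)."""
    budget = available_bytes * utilization
    return max(1, int(budget // bytes_per_site))

# e.g. with 8 GiB available and ~70 KiB per site's timeseries:
n_sites = sites_per_batch(8 * 1024**3, 70 * 1024)
```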

classmethod add_dataset(h5_file, collect_pattern, dataset_in, dataset_out=None, memory_utilization_limit=0.7, pass_through=False)[source]#

Collect and add a dataset to a single HDF5 file

Parameters:
  • h5_file (path-like) – Path to output HDF5 file into which data will be collected. Note that this file must already exist and have a valid meta.

  • collect_pattern (str) – Unix-style /filepath/pattern*.h5 representing a list of input files to be collected into a single HDF5 file.

  • dataset_in (str) – Name of dataset to collect. If source shape is 2D, time index will be collected as well.

  • dataset_out (str) – Name of dataset into which collected data is to be written. If None, the output dataset name is assumed to match the input dataset name. By default, None.

  • memory_utilization_limit (float) – Memory utilization limit (fractional). This sets how many sites will be collected at a time. By default, 0.7.

  • pass_through (bool) – Flag to pass the dataset through unchanged from one of the source files, assuming all of the source files have identical copies of this dataset. By default, False.

See also

Collector.collect

Collect a dataset into a file that does not yet exist.
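The `pass_through` flag assumes every source file carries an identical copy of the dataset. A minimal sketch of the sanity check that assumption implies (illustrative only; this is not necessarily a check gaps itself performs):

```python
# Hypothetical per-file copies of a pass-through dataset (e.g. a shared
# lookup table written identically by every parallel worker).
copies = {
    "chunk0.h5": [1.0, 2.0, 3.0],
    "chunk1.h5": [1.0, 2.0, 3.0],
}

# Pass-through is only safe when all copies agree; otherwise the
# dataset should be collected site-by-site instead.
first = next(iter(copies.values()))
safe_to_pass_through = all(c == first for c in copies.values())
```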