gaps.collection.Collector#

class Collector(h5_file, collect_pattern, project_points, clobber=False)[source]#

Bases: object

Collector of multiple source files into a single output file.

Parameters:
  • h5_file (path-like) – Path to output HDF5 file into which data will be collected.

  • collect_pattern (str) – Unix-style glob pattern (e.g. /filepath/pattern*.h5) representing the list of input files to be collected into a single HDF5 file.

  • project_points (str | slice | list | pandas.DataFrame | None) – Project points corresponding to the full collection of points contained in the HDF5 files to be collected. None if the points list should be ignored (i.e. collect all data in the input HDF5 files without checking that all gids are present).

  • clobber (bool, optional) – Flag to purge output HDF5 file if it already exists. By default, False.
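The collect_pattern is an ordinary Unix-style glob. A minimal stdlib sketch of how such a pattern selects the chunked input files (the file names below are illustrative, not produced by gaps):

```python
import glob
import os
import tempfile

# Create a scratch directory holding two chunked .h5 files plus one
# unrelated file, then expand a Unix-style pattern the way a
# collect_pattern such as /filepath/pattern*.h5 is expanded.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("run_chunk0.h5", "run_chunk1.h5", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()

    collect_pattern = os.path.join(tmp, "run_chunk*.h5")
    matched = sorted(os.path.basename(p) for p in glob.glob(collect_pattern))
    print(matched)  # only the two chunked .h5 files match; notes.txt does not
```

Only files matching the pattern become part of the collection list (see the h5_files attribute below).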

Methods

add_dataset(h5_file, collect_pattern, dataset_in)

Collect and add a dataset to a single HDF5 file.

collect(dataset_in[, dataset_out, ...])

Collect a dataset from the source HDF5 files into h5_file.

combine_meta()

Combine metadata from the input HDF5 files and write it to the output file.

combine_time_index()

Combine time_index from the input HDF5 files and write it to the output file.

get_dataset_shape(dataset_name)

Extract dataset shape from the first file in the collection list.

move_chunks([sub_dir])

Move chunked files from a directory to a sub-directory.

purge_chunks()

Remove chunked files from a directory.

Attributes

gids

List of gids corresponding to all sites to be combined.

h5_files

List of paths to HDF5 files to be combined.

get_dataset_shape(dataset_name)[source]#

Extract dataset shape from the first file in the collection list.

Parameters:

dataset_name (str) – Name of the dataset whose shape should be extracted.

Returns:

shape (tuple) – Dataset shape tuple.

property h5_files#

List of paths to HDF5 files to be combined.

Type:

list

property gids#

List of gids corresponding to all sites to be combined.

Type:

list

combine_meta()[source]#

Combine metadata from the input HDF5 files and write it to the output file.

combine_time_index()[source]#

Combine time_index from the input HDF5 files and write it to the output file.

If time_index is not given in the input HDF5 files, the time_index in the output file is set to None.

purge_chunks()[source]#

Remove chunked files from a directory.

Warns:

gapsCollectionWarning – If some datasets have not been collected.

Warning

This function WILL NOT delete files if any datasets were not collected.

move_chunks(sub_dir='chunk_files')[source]#

Move chunked files from a directory to a sub-directory.

Parameters:

sub_dir (path-like, optional) – Sub-directory into which the chunked files are moved. By default, “chunk_files”.
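The effect of move_chunks can be sketched with the standard library; this is an illustrative imitation of the behavior described above, not the gaps source:

```python
import glob
import os
import shutil
import tempfile

# Illustrative stdlib sketch: chunked files matching the collection
# pattern are moved into a sub-directory ("chunk_files" by default)
# alongside the sources. File names here are made up.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("out_chunk0.h5", "out_chunk1.h5"):
        open(os.path.join(tmp, name), "w").close()

    sub_dir = os.path.join(tmp, "chunk_files")
    os.makedirs(sub_dir, exist_ok=True)
    for path in glob.glob(os.path.join(tmp, "out_chunk*.h5")):
        shutil.move(path, sub_dir)

    moved = sorted(os.listdir(sub_dir))
    print(moved)  # both chunked files now live under chunk_files/
```

Moving rather than purging keeps the source chunks recoverable, which is useful when a collected dataset still needs to be validated.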

collect(dataset_in, dataset_out=None, memory_utilization_limit=0.7, pass_through=False)[source]#

Collect a dataset from the source HDF5 files into h5_file.

Parameters:
  • dataset_in (str) – Name of the dataset to collect. If the source shape is 2D, the time index is collected as well.

  • dataset_out (str) – Name of the dataset into which collected data is written. If None, the output dataset name is assumed to match the input dataset name. By default, None.

  • memory_utilization_limit (float) – Fraction of available memory that may be used for collection; this determines how many sites are collected per pass. By default, 0.7.

  • pass_through (bool) – Flag to pass the dataset through unmodified from one of the source files, assuming all source files contain identical copies of it. By default, False.

See also

Collector.add_dataset

Collect a dataset into an existing HDF5 file.
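How memory_utilization_limit translates into a per-pass site count can be sketched with simple arithmetic. The function name and byte figures below are illustrative, not part of the gaps API:

```python
# Hypothetical back-of-envelope sketch: memory_utilization_limit caps
# the fraction of available memory used per collection pass, which in
# turn bounds how many sites are loaded at once. Names and sizes here
# are illustrative, not taken from the gaps implementation.
def sites_per_pass(avail_mem_bytes, bytes_per_site, limit=0.7):
    """Return how many sites fit within ``limit`` * available memory."""
    budget = avail_mem_bytes * limit
    return max(1, int(budget // bytes_per_site))

# 8 GiB available, one year of hourly float32 values per site
# (8760 * 4 bytes, roughly 35 kB per site):
print(sites_per_pass(8 * 1024**3, 8760 * 4))
```

Lowering the limit trades fewer sites per pass (and more passes) for a smaller peak memory footprint.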

classmethod add_dataset(h5_file, collect_pattern, dataset_in, dataset_out=None, memory_utilization_limit=0.7, pass_through=False)[source]#

Collect and add a dataset to a single HDF5 file.

Parameters:
  • h5_file (path-like) – Path to output HDF5 file into which data will be collected. Note that this file must already exist and have a valid meta.

  • collect_pattern (str) – Unix-style glob pattern (e.g. /filepath/pattern*.h5) representing the list of input files to be collected into a single HDF5 file.

  • dataset_in (str) – Name of the dataset to collect. If the source shape is 2D, the time index is collected as well.

  • dataset_out (str) – Name of the dataset into which collected data is written. If None, the output dataset name is assumed to match the input dataset name. By default, None.

  • memory_utilization_limit (float) – Fraction of available memory that may be used for collection; this determines how many sites are collected per pass. By default, 0.7.

  • pass_through (bool) – Flag to pass the dataset through unmodified from one of the source files, assuming all source files contain identical copies of it. By default, False.

See also

Collector.collect

Collect a dataset into a file that does not yet exist.
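Pulling the pieces above together, a hedged usage sketch assembled only from the signatures documented on this page. The gaps import is real, but the dataset names ("cf_profile", "cf_mean", "lcoe_fcr") and call sequence are illustrative, and the snippet assumes chunked output files already exist on disk:

```python
# Hedged end-to-end sketch based on the documented signatures. Dataset
# names and the exact call order are illustrative assumptions, not a
# prescribed workflow.
def collect_run_outputs(out_file, pattern, project_points):
    from gaps.collection import Collector

    collector = Collector(out_file, pattern, project_points, clobber=True)
    collector.collect("cf_profile")  # 2D source: time index collected too
    collector.collect("cf_mean", dataset_out="cf_mean_annual")
    collector.move_chunks()  # tidy sources into a chunk_files/ sub-dir
    return collector.h5_files


# Alternatively, a single dataset can be added to an already-collected
# file (which must exist and have valid meta) via the classmethod:
def add_one_dataset(out_file, pattern):
    from gaps.collection import Collector

    Collector.add_dataset(out_file, pattern, "lcoe_fcr")
```

Using the classmethod avoids reconstructing a Collector when only one extra dataset needs to be appended to an existing output file.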