gaps.collection.DatasetCollector

class DatasetCollector(h5_file, source_files, gids, dataset_in, dataset_out=None, memory_utilization_limit=0.7, pass_through=False)

Bases: object

Collector for a single dataset.

Parameters:
  • h5_file (path-like) – Path to the HDF5 file into which the dataset is collected.

  • source_files (list) – List of source file paths.

  • gids (list) – List of gids to be collected.

  • dataset_in (str) – Name of the dataset to collect.

  • dataset_out (str, optional) – Name of the dataset into which the collected data is written. If None, the output dataset name is assumed to match the input dataset name. By default, None.

  • memory_utilization_limit (float, optional) – Memory utilization limit, expressed as a fraction. This sets how many sites are collected at a time. By default, 0.7.

  • pass_through (bool, optional) – Flag to simply pass the dataset through from one of the source files, on the assumption that all source files contain identical copies of it. By default, False.
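
The relationship between memory_utilization_limit and the number of sites collected at a time can be illustrated with a minimal sketch. This is not the gaps implementation, which is not shown here. The function names, the total_memory_bytes argument, and the bytes_per_site estimate are all hypothetical, chosen only to show how a fractional limit might translate into a chunk size.

```python
def sites_per_chunk(total_memory_bytes, memory_utilization_limit, bytes_per_site):
    """Return how many sites fit in the allowed share of memory (at least 1).

    Hypothetical helper: the inputs are assumptions for illustration only.
    """
    if not 0 < memory_utilization_limit <= 1:
        raise ValueError("memory_utilization_limit must be in (0, 1]")
    # Memory budget is the allowed fraction of the total.
    budget = total_memory_bytes * memory_utilization_limit
    # Always process at least one site, even if the budget is tiny.
    return max(1, int(budget // bytes_per_site))


def chunk_bounds(n_sites, chunk_size):
    """Yield (start, stop) index pairs that cover n_sites in chunk_size steps."""
    for start in range(0, n_sites, chunk_size):
        yield start, min(start + chunk_size, n_sites)
```

Under these assumptions, a lower limit yields smaller chunks and therefore more read/write passes, while a higher limit yields fewer, larger passes.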

Methods

collect_dataset(h5_file, source_files, gids, ...)

Collect a dataset from multiple source files into a single file.

Attributes

duplicate_gids

True if there are duplicate gids being collected.

gids

List of gids corresponding to all sites to be combined.

property gids

List of gids corresponding to all sites to be combined.

Type:

list

property duplicate_gids

True if there are duplicate gids being collected.

Type:

bool
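
To make the meaning of duplicate_gids concrete, here is a small sketch of how such a check could be written for a plain list of gids. It is an illustration only and does not reproduce how the property is computed internally. Both function names are hypothetical.

```python
from collections import Counter


def has_duplicate_gids(gids):
    """Return True if any gid appears more than once in the list."""
    return len(set(gids)) != len(gids)


def duplicated_gids(gids):
    """Return the sorted gids that appear more than once."""
    counts = Counter(gids)
    return sorted(gid for gid, n in counts.items() if n > 1)
```

Duplicates could arise, for example, when two source files cover overlapping sites, so a check like this is a useful sanity test before collection.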

classmethod collect_dataset(h5_file, source_files, gids, dataset_in, dataset_out=None, memory_utilization_limit=0.7, pass_through=False)

Collect a dataset from multiple source files into a single file.

Parameters:
  • h5_file (path-like) – Path to the HDF5 file into which the dataset is collected.

  • source_files (list) – List of source file paths.

  • gids (list) – List of gids to be collected.

  • dataset_in (str) – Name of the dataset to collect.

  • dataset_out (str, optional) – Name of the dataset into which the collected data is written. If None, the output dataset name is assumed to match the input dataset name. By default, None.

  • memory_utilization_limit (float, optional) – Memory utilization limit, expressed as a fraction. This sets how many sites are collected at a time. By default, 0.7.

  • pass_through (bool, optional) – Flag to simply pass the dataset through from one of the source files, on the assumption that all source files contain identical copies of it. By default, False.
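
Since collect_dataset handles one dataset per call, collecting several datasets means calling it once per dataset name. The sketch below shows one way to do that using only the documented arguments. The wrapper collect_datasets is hypothetical, and the collector is passed in as an argument so that the sketch does not depend on gaps being installed.

```python
def collect_datasets(collector, h5_file, source_files, gids, dataset_names,
                     memory_utilization_limit=0.7):
    """Collect each named dataset into h5_file, one call per dataset.

    Hypothetical wrapper: `collector` is expected to expose a
    collect_dataset classmethod with the signature documented above.
    """
    for name in dataset_names:
        collector.collect_dataset(
            h5_file,
            source_files,
            gids,
            name,
            dataset_out=None,  # None keeps the input dataset name
            memory_utilization_limit=memory_utilization_limit,
            pass_through=False,
        )
```

With gaps installed, the class would be imported with `from gaps.collection import DatasetCollector` and passed as the first argument, for example `collect_datasets(DatasetCollector, "out.h5", ["part_0.h5", "part_1.h5"], gids, ["cf_mean"])`. The file and dataset names in that call are placeholders.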