sup3r.postprocessing.collectors.h5.CollectorH5#

class CollectorH5(file_paths)[source]#

Bases: BaseCollector

Sup3r H5 file collection framework

Parameters:

file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.<ext>. Files should have non-overlapping time_index and spatial domains.
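
The two accepted forms of file_paths can be normalised to a sorted list roughly as follows. This is a minimal sketch for illustration, not sup3r's implementation; `resolve_file_paths` is a hypothetical helper name.

```python
from glob import glob


def resolve_file_paths(file_paths):
    """Return a sorted list of paths from an explicit list or a glob pattern.

    Hypothetical helper for illustration only; not part of sup3r.
    """
    if isinstance(file_paths, str):
        # A single string is treated as a unix-style search pattern.
        file_paths = glob(file_paths)
    return sorted(file_paths)
```

An explicit list is only sorted, while a pattern string is expanded first, so both forms give the collector the same deterministic file order.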

Methods

collect(file_paths, out_file, features[, ...])

Collect data files from a dir to one output file.

collect_feature(dset, target_masked_meta, ...)

Collect chunks for single feature

get_chunk_indices(file)

Get spatial and temporal chunk indices from the given file name.

get_collection_attrs(file_paths[, ...])

Get important dataset attributes from a file list to be collected.

get_coordinate_indices(target_meta, full_meta)

Get coordinate indices in meta data for given targets

get_data(file_path, feature, time_index, ...)

Retrieve a data array from a chunked file.

get_dset_attrs(feature)

Get attributes for output feature

get_flist_chunks(file_paths[, n_writes])

Group files by temporal_chunk_index, then combine these groups if n_writes is less than the number of time_chunks.

get_node_cmd(config)

Get a CLI call to collect data.

get_slices(final_time_index, final_meta, ...)

Get index slices where the new ti/meta belong in the final ti/meta.

get_target_and_masked_meta(meta[, ...])

Use combined meta for all files and target_meta_file to get mapping from the full meta to the target meta and the mapping from the target meta to the full meta, both of which are masked to remove coordinates not present in the target_meta.

get_time_dim_name(filepath)

Get the name of the time dimension in the given file

get_unique_chunk_files(file_paths)

We get files for the unique spatial and temporal extents covered by all collection files.

write_data(out_file, dsets, time_index, ...)

Write list of datasets to out_file.

classmethod get_slices(final_time_index, final_meta, new_time_index, new_meta)[source]#

Get index slices where the new ti/meta belong in the final ti/meta.

Parameters:
  • final_time_index (pd.DatetimeIndex) – Time index of the final file that new_time_index is being written to.

  • final_meta (pd.DataFrame) – Meta data of the final file that new_meta is being written to.

  • new_time_index (pd.DatetimeIndex) – Chunk time index that is a subset of the final_time_index.

  • new_meta (pd.DataFrame) – Chunk meta data that is a subset of the final_meta.

Returns:

  • row_slice (slice) – final_time_index[row_slice] = new_time_index

  • col_slice (slice) – final_meta[col_slice] = new_meta
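
The idea behind the returned slices can be sketched with plain Python lists standing in for the time index. This is an illustration under stated assumptions, not the library's code: `locate_slice` is a hypothetical helper, and it assumes the chunk is a contiguous run of the final index whose first value occurs only once.

```python
def locate_slice(final_values, new_values):
    """Return the slice of final_values that equals new_values.

    Hypothetical helper; assumes new_values is a contiguous run of
    final_values and that its first value is unique in final_values.
    """
    final_values = list(final_values)
    new_values = list(new_values)
    start = final_values.index(new_values[0])
    stop = start + len(new_values)
    if final_values[start:stop] != new_values:
        raise ValueError("new_values is not a contiguous run of final_values")
    return slice(start, stop)


# Strings stand in for timestamps to keep the sketch dependency-free.
final_time_index = [
    "2020-01-01 00:00",
    "2020-01-01 01:00",
    "2020-01-01 02:00",
    "2020-01-01 03:00",
]
new_time_index = final_time_index[2:4]
row_slice = locate_slice(final_time_index, new_time_index)
```

The same lookup applied to the meta rows would give col_slice, so a chunk can be written into the output array at `[row_slice, col_slice]`.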

get_coordinate_indices(target_meta, full_meta, threshold=0.0001)[source]#

Get coordinate indices in meta data for given targets

Parameters:
  • target_meta (pd.DataFrame) – Dataframe of coordinates to find within the full meta

  • full_meta (pd.DataFrame) – Dataframe of full set of coordinates for unfiltered dataset

  • threshold (float) – Threshold distance for finding target coordinates within full meta
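
How a distance threshold filters target coordinates can be sketched as below. This is only an illustration: the distance metric used by sup3r is not stated here, so the sketch assumes plain Euclidean distance in degrees on (latitude, longitude) pairs, and `find_coordinate_indices` is a hypothetical name.

```python
import math


def find_coordinate_indices(target_coords, full_coords, threshold=1e-4):
    """Return indices into full_coords for targets found within threshold.

    Hypothetical helper. Coordinates are (latitude, longitude) tuples and
    the distance is assumed to be Euclidean in degrees.
    """
    indices = []
    for t_lat, t_lon in target_coords:
        best_idx, best_dist = None, math.inf
        for i, (lat, lon) in enumerate(full_coords):
            dist = math.hypot(lat - t_lat, lon - t_lon)
            if dist < best_dist:
                best_idx, best_dist = i, dist
        # Targets with no neighbour inside the threshold are dropped.
        if best_dist <= threshold:
            indices.append(best_idx)
    return indices


full = [(40.0, -105.0), (40.0, -104.9), (40.1, -105.0)]
targets = [(40.1, -105.00001), (39.0, -100.0)]
matched = find_coordinate_indices(targets, full)
```

Only the first target lies within the default threshold of a point in the full set, so the second is excluded from the result.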

get_data(file_path, feature, time_index, meta, scale_factor, dtype, threshold=0.0001)[source]#

Retrieve a data array from a chunked file.

Parameters:
  • file_path (str) – H5 file to get data from

  • feature (str) – Dataset to retrieve from file_path.

  • time_index (pd.DatetimeIndex) – Time index of the final file.

  • meta (pd.DataFrame) – Meta data of the final file.

  • scale_factor (int | float) – Final destination scale factor after collection. If the data retrieval from the files to be collected has a different scale factor, the collected data will be rescaled and returned as float32.

  • dtype (np.dtype) – Final dtype to return data as

  • threshold (float) – Threshold distance for finding target coordinates within full meta

Returns:

  • f_data (Union[np.ndarray, da.core.Array]) – Data array from file_path, cast to the input dtype.

  • row_slice (slice) – final_time_index[row_slice] = new_time_index

  • col_slice (slice) – final_meta[col_slice] = new_meta

get_unique_chunk_files(file_paths)[source]#

We get files for the unique spatial and temporal extents covered by all collection files. Since the files have a suffix _{temporal_chunk_index}_{spatial_chunk_index}.h5 we just use all files with a single spatial_chunk_index for the full time index and all files with a single temporal_chunk_index for the full meta.

Returns:

  • t_files (list) – Explicit list of str file paths which, when combined, provide the entire spatial domain.

  • s_files (list) – Explicit list of str file paths which, when combined, provide the entire temporal extent.
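
The selection described above can be sketched as follows. This is an assumption-laden illustration rather than the library's code: `chunk_indices` and `unique_chunk_files` are hypothetical helpers, and the first file in sorted order is assumed to supply the reference chunk indices.

```python
import re

# Suffix format described above: _{temporal_chunk_index}_{spatial_chunk_index}.h5
SUFFIX = re.compile(r"_(\d+)_(\d+)\.h5$")


def chunk_indices(path):
    """Return (temporal_chunk_index, spatial_chunk_index) as strings."""
    match = SUFFIX.search(path)
    if match is None:
        raise ValueError(f"Unexpected file name: {path}")
    return match.group(1), match.group(2)


def unique_chunk_files(file_paths):
    """Hypothetical helper returning (t_files, s_files)."""
    file_paths = sorted(file_paths)
    first_t, first_s = chunk_indices(file_paths[0])
    # Files sharing one temporal chunk index cover the full spatial domain.
    t_files = [f for f in file_paths if chunk_indices(f)[0] == first_t]
    # Files sharing one spatial chunk index cover the full temporal extent.
    s_files = [f for f in file_paths if chunk_indices(f)[1] == first_s]
    return t_files, s_files
```

Reading only these two subsets is enough to build the full meta and the full time index without opening every chunk file.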

get_target_and_masked_meta(meta, target_meta_file=None, threshold=0.0001)[source]#

Use combined meta for all files and target_meta_file to get mapping from the full meta to the target meta and the mapping from the target meta to the full meta, both of which are masked to remove coordinates not present in the target_meta.

Parameters:
  • meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta

  • target_meta_file (str) – Path to target final meta containing coordinates to keep from the full list of coordinates present in the collected meta for the full file list.

  • threshold (float) – Threshold distance for finding target coordinates within full meta

Returns:

  • target_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta

  • masked_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected masked against target_meta

get_collection_attrs(file_paths, max_workers=None, target_meta_file=None, threshold=0.0001)[source]#

Get important dataset attributes from a file list to be collected.

Assumes the file list is chunked in time (row chunked).

Parameters:
  • file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.h5.

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.

  • target_meta_file (str) – Path to target final meta containing coordinates to keep from the full list of coordinates present in the collected meta for the full file list.

  • threshold (float) – Threshold distance for finding target coordinates within full meta

Returns:

  • time_index (pd.DatetimeIndex) – Concatenated full size datetime index from the flist that is being collected

  • target_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta

  • masked_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected masked against target_meta

  • shape (tuple) – Output (collected) dataset shape

  • global_attrs (dict) – Global attributes from the first file in file_paths (it’s assumed that all the files in file_paths have the same global file attributes).

get_flist_chunks(file_paths, n_writes=None)[source]#

Group files by temporal_chunk_index, then combine these groups if n_writes is less than the number of time_chunks. Assumes file_paths have a suffix format like _{temporal_chunk_index}_{spatial_chunk_index}.h5

Parameters:
  • file_paths (list) – List of file paths each with a suffix _{temporal_chunk_index}_{spatial_chunk_index}.h5

  • n_writes (int | None) – Number of writes to use for collection

Returns:

flist_chunks (list) – List of file list chunks. Used to split collection and writing into multiple steps.
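
The grouping step can be sketched as below. This is a minimal illustration, not sup3r's implementation: `group_flist_chunks` is a hypothetical name, and the way consecutive temporal groups are merged into n_writes chunks is an assumption.

```python
import re
from collections import defaultdict

SUFFIX = re.compile(r"_(\d+)_(\d+)\.h5$")


def group_flist_chunks(file_paths, n_writes=None):
    """Group files by temporal chunk index, optionally merging groups.

    Hypothetical helper. Consecutive temporal groups are merged so that
    n_writes chunks remain; the merge rule is an assumption.
    """
    groups = defaultdict(list)
    for path in sorted(file_paths):
        match = SUFFIX.search(path)
        if match is None:
            raise ValueError(f"Unexpected file name: {path}")
        groups[match.group(1)].append(path)
    chunks = [groups[key] for key in sorted(groups)]

    if n_writes is not None and n_writes < len(chunks):
        merged = [[] for _ in range(n_writes)]
        for i, group in enumerate(chunks):
            # Map temporal group i onto one of n_writes consecutive buckets.
            merged[i * n_writes // len(chunks)].extend(group)
        chunks = merged
    return chunks
```

With three temporal chunks and n_writes=2, the first two temporal groups end up in one write and the last group in another.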

collect_feature(dset, target_masked_meta, target_meta_file, time_index, shape, flist_chunks, out_file, threshold=0.0001, max_workers=None)[source]#

Collect chunks for single feature

Parameters:
  • dset (str) – Dataset name to collect.

  • target_masked_meta (pd.DataFrame) – Same as subset_masked_meta but instead for the entire list of files to be collected.

  • target_meta_file (str) – Path to target final meta containing coordinates to keep from the full file list collected meta. This can be but is not necessarily a subset of the full list of coordinates for all files in the file list. This is used to remove coordinates from the full file list which are not present in the target_meta. Either this full meta or a subset, depending on which coordinates are present in the data to be collected, will be the final meta for the collected output files.

  • time_index (pd.DatetimeIndex) – Concatenated datetime index for the given file paths.

  • shape (tuple) – Output (collected) dataset shape

  • flist_chunks (list) – List of file list chunks. Used to split collection and writing into multiple steps.

  • out_file (str) – File path of final output file.

  • threshold (float) – Threshold distance for finding target coordinates within full meta

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.

classmethod collect(file_paths, out_file, features, max_workers=None, log_level=None, log_file=None, target_meta_file=None, n_writes=None, overwrite=True, threshold=0.0001)[source]#

Collect data files from a dir to one output file.

Filename requirements:
  • Should end with “.h5”

Parameters:
  • file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.h5. Files resolved by this argument must be of the form *_{temporal_chunk_index}_{spatial_chunk_index}.h5.

  • out_file (str) – File path of final output file.

  • features (list) – List of dsets to collect

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.

  • log_level (str | None) – Desired log level, None will not initialize logging.

  • log_file (str | None) – Target log file. None logs to stdout.

  • write_status (bool) – Flag to write status file once complete if running from pipeline.

  • job_name (str) – Job name for status file if running from pipeline.

  • pipeline_step (str, optional) – Name of the pipeline step being run. If None, the pipeline_step will be set to "collect", mimicking old reV behavior. By default, None.

  • target_meta_file (str) – Path to target final meta containing coordinates to keep from the full file list collected meta. This can be but is not necessarily a subset of the full list of coordinates for all files in the file list. This is used to remove coordinates from the full file list which are not present in the target_meta. Either this full meta or a subset, depending on which coordinates are present in the data to be collected, will be the final meta for the collected output files.

  • n_writes (int | None) – Number of writes to split full file list into. Must be less than or equal to the number of temporal chunks if chunks have different time indices.

  • overwrite (bool) – Whether to overwrite existing output file

  • threshold (float) – Threshold distance for finding target coordinates within full meta
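
A call to collect might look like the sketch below, using only the keyword arguments from the signature above. The paths and the feature name are placeholders chosen for illustration, and `run_collection` is a hypothetical wrapper that imports sup3r lazily.

```python
# Keyword arguments taken from the collect signature above.
# Paths and the feature name are hypothetical placeholders.
collect_kwargs = {
    "file_paths": "./chunks/out_*.h5",
    "out_file": "./collected.h5",
    "features": ["windspeed_100m"],
    "max_workers": 1,       # 1 runs serial
    "n_writes": None,
    "overwrite": True,
    "threshold": 1e-4,
}


def run_collection(kwargs):
    """Run the collection; requires sup3r to be installed."""
    # Imported here so the arguments can be inspected without sup3r.
    from sup3r.postprocessing.collectors.h5 import CollectorH5

    CollectorH5.collect(**kwargs)
```

Calling `run_collection(collect_kwargs)` would then write the named features from every matching chunk file into the single out_file, assuming the chunk files follow the required suffix format.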

static get_chunk_indices(file)#

Get spatial and temporal chunk indices from the given file name.

Returns:

  • temporal_chunk_index (str) – Zero padded integer for the temporal chunk index

  • spatial_chunk_index (str) – Zero padded integer for the spatial chunk index
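
Extracting the two zero-padded indices from a file name can be sketched with a regular expression. This is an illustration, not the library's parser: `parse_chunk_indices` is a hypothetical name, and the sketch assumes an `.h5` file ending in `_{temporal_chunk_index}_{spatial_chunk_index}.h5`.

```python
import re


def parse_chunk_indices(file):
    """Return (temporal_chunk_index, spatial_chunk_index) as strings.

    Hypothetical helper; the indices keep their zero padding.
    """
    match = re.search(r"_(\d+)_(\d+)\.h5$", file)
    if match is None:
        raise ValueError(f"No chunk suffix found in: {file}")
    return match.group(1), match.group(2)


# The path below is a made-up example.
temporal, spatial = parse_chunk_indices("/data/out_000003_000012.h5")
```

Keeping the indices as strings preserves the zero padding, so sorting them lexicographically matches their numeric order.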

static get_dset_attrs(feature)#

Get attributes for output feature

Parameters:

feature (str) – Name of feature to write

Returns:

  • attrs (dict) – Dictionary of attributes for requested dset

  • dtype (str) – Data type for requested dset. Defaults to float32

classmethod get_node_cmd(config)#

Get a CLI call to collect data.

Parameters:

config (dict) – sup3r collection config with all necessary args and kwargs to run data collection.

static get_time_dim_name(filepath)#

Get the name of the time dimension in the given file

Parameters:

filepath (str) – Path to the file

Returns:

time_key (str) – Name of the time dimension in the given file

classmethod write_data(out_file, dsets, time_index, data_list, meta, global_attrs=None)#

Write list of datasets to out_file.

Parameters:
  • out_file (str) – Pre-existing H5 file output path

  • dsets (list) – List of datasets to write to out_file

  • time_index (pd.DatetimeIndex) – Pandas datetime index to use for file time_index.

  • data_list (list) – List of np.ndarray objects to write to out_file

  • meta (pd.DataFrame) – Full meta dataframe for the final output data.

  • global_attrs (dict) – Namespace of file-global attributes for the final output data.