sup3r.postprocessing.collectors.h5.CollectorH5#

class CollectorH5(file_paths)[source]#

Bases: BaseCollector

Sup3r H5 file collection framework

Parameters:

file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.<ext>. Files should have non-overlapping time_index and spatial domains.
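The two accepted forms of file_paths (an explicit list or a single glob pattern) can be illustrated with a minimal sketch. The helper `resolve_file_paths` is hypothetical and written for this example only. It is not part of sup3r, and the library may resolve paths differently.

```python
import glob
import os
import tempfile


def resolve_file_paths(file_paths):
    """Return a sorted list of paths from an explicit list or a glob pattern.

    Hypothetical helper for illustration; not the sup3r implementation.
    """
    if isinstance(file_paths, str):
        # A single string is treated as a unix-style search pattern.
        file_paths = glob.glob(file_paths)
    return sorted(file_paths)


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("chunk_000001_000000.h5", "chunk_000000_000000.h5"):
        open(os.path.join(tmp, name), "w").close()
    resolved = resolve_file_paths(os.path.join(tmp, "chunk_*.h5"))
    resolved_names = [os.path.basename(p) for p in resolved]
```

Sorting the resolved paths keeps the zero-padded chunk suffixes in temporal and spatial order.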

Methods

`collect`(file_paths, out_file, features[, ...])	Collect data files from a dir to one output file.
`collect_feature`(dset, target_masked_meta, ...)	Collect chunks for single feature
`get_chunk_indices`(file)	Get spatial and temporal chunk indices from the given file name.
`get_collection_attrs`(file_paths[, ...])	Get important dataset attributes from a file list to be collected.
`get_coordinate_indices`(target_meta, full_meta)	Get coordinate indices in meta data for given targets
`get_data`(file_path, feature, time_index, ...)	Retrieve a data array from a chunked file.
`get_dset_attrs`(feature)	Get attributes for output feature
`get_flist_chunks`(file_paths[, n_writes])	Group files by temporal_chunk_index and then combine these groups if `n_writes` is less than the number of time_chunks.
`get_node_cmd`(config)	Get a CLI call to collect data.
`get_slices`(final_time_index, final_meta, ...)	Get index slices where the new ti/meta belong in the final ti/meta.
`get_target_and_masked_meta`(meta[, ...])	Use combined meta for all files and target_meta_file to get mapping from the full meta to the target meta and the mapping from the target meta to the full meta, both of which are masked to remove coordinates not present in the target_meta.
`get_time_dim_name`(filepath)	Get the name of the time dimension in the given file
`get_unique_chunk_files`(file_paths)	We get files for the unique spatial and temporal extents covered by all collection files.
`write_data`(out_file, dsets, time_index, ...)	Write list of datasets to out_file.

classmethod get_slices(final_time_index, final_meta, new_time_index, new_meta)[source]#

Get index slices where the new ti/meta belong in the final ti/meta.

Parameters:

final_time_index (pd.DatetimeIndex) – Time index of the final file that new_time_index is being written to.
final_meta (pd.DataFrame) – Meta data of the final file that new_meta is being written to.
new_time_index (pd.DatetimeIndex) – Chunk time index that is a subset of the final_time_index.
new_meta (pd.DataFrame) – Chunk meta data that is a subset of the final_meta.

Returns:

row_slice (slice) – final_time_index[row_slice] = new_time_index
col_slice (slice) – final_meta[col_slice] = new_meta
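The relationship between a chunk and the final index can be sketched without pandas. The helper `locate_slice` below is hypothetical and assumes the chunk values form one contiguous run inside the final values. Plain lists stand in for the time index and for the meta rows (represented here by made-up gid values).

```python
def locate_slice(final_values, new_values):
    """Return the slice of final_values that equals new_values.

    Hypothetical helper; assumes new_values is a non-empty contiguous
    run inside final_values.
    """
    final_values = list(final_values)
    new_values = list(new_values)
    start = final_values.index(new_values[0])
    stop = start + len(new_values)
    if final_values[start:stop] != new_values:
        raise ValueError("new values are not a contiguous subset of final values")
    return slice(start, stop)


# Placeholder time stamps and gids, used only for this example.
final_time_index = ["2020-01-01 00:00", "2020-01-01 01:00",
                    "2020-01-01 02:00", "2020-01-01 03:00"]
new_time_index = ["2020-01-01 02:00", "2020-01-01 03:00"]
final_gids = [10, 11, 12, 13, 14]
new_gids = [11, 12, 13]

row_slice = locate_slice(final_time_index, new_time_index)
col_slice = locate_slice(final_gids, new_gids)
```

Here row_slice is slice(2, 4) and col_slice is slice(1, 4), so final_time_index[row_slice] reproduces new_time_index and final_gids[col_slice] reproduces new_gids.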

get_coordinate_indices(target_meta, full_meta, threshold=0.0001)[source]#

Get coordinate indices in meta data for given targets

Parameters:

target_meta (pd.DataFrame) – Dataframe of coordinates to find within the full meta
full_meta (pd.DataFrame) – Dataframe of full set of coordinates for unfiltered dataset
threshold (float) – Threshold distance for finding target coordinates within full meta
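As a rough illustration of threshold matching, the sketch below finds, for each target coordinate, the nearest coordinate in the full set and keeps it only if it lies within the threshold. The function `coordinate_indices`, the brute-force search, and the use of plain Euclidean distance in degrees are all assumptions made for this example. The source does not state which distance measure or search structure sup3r uses.

```python
import math


def coordinate_indices(target_coords, full_coords, threshold=0.0001):
    """Return indices into full_coords for targets matched within threshold.

    Hypothetical brute-force sketch; targets with no match are skipped.
    """
    indices = []
    for t_lat, t_lon in target_coords:
        best_idx, best_dist = None, math.inf
        for i, (lat, lon) in enumerate(full_coords):
            # Assumed distance measure: Euclidean distance in degrees.
            dist = math.hypot(lat - t_lat, lon - t_lon)
            if dist < best_dist:
                best_idx, best_dist = i, dist
        if best_dist <= threshold:
            indices.append(best_idx)
    return indices


# Placeholder coordinates, used only for this example.
full = [(39.0, -105.0), (39.0, -104.9), (39.1, -105.0), (39.1, -104.9)]
targets = [(39.1, -105.00001), (39.0, -104.9), (45.0, -100.0)]
matched = coordinate_indices(targets, full)
```

The first target is offset by about 1e-5 degrees and still matches index 2. The third target has no neighbor within the threshold and is dropped.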

get_data(file_path, feature, time_index, meta, scale_factor, dtype, threshold=0.0001)[source]#

Retrieve a data array from a chunked file.

Parameters:

file_path (str) – H5 file to get data from.
feature (str) – Dataset to retrieve from file_path.
time_index (pd.DatetimeIndex) – Time index of the final file.
meta (pd.DataFrame) – Meta data of the final file.
scale_factor (int | float) – Final destination scale factor after collection. If the data retrieval from the files to be collected has a different scale factor, the collected data will be rescaled and returned as float32.
dtype (np.dtype) – Final dtype to return data as
threshold (float) – Threshold distance for finding target coordinates within full meta

Returns:

f_data (Union[np.ndarray, da.core.Array]) – Data array from file_path, cast to the input dtype.
row_slice (slice) – final_time_index[row_slice] = new_time_index
col_slice (slice) – final_meta[col_slice] = new_meta
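The rescaling described for scale_factor can be sketched as follows. This assumes the common convention that stored values equal physical values multiplied by the scale factor. That convention and the helper `rescale` are assumptions for illustration, not confirmed by this page.

```python
def rescale(values, source_scale, dest_scale):
    """Convert stored values from one scale factor to another.

    Hypothetical helper; assumes stored = physical * scale_factor.
    """
    if source_scale == dest_scale:
        # Nothing to do, the values keep their stored form.
        return list(values)
    # Unscale to physical units, then apply the destination scale factor.
    return [v / source_scale * dest_scale for v in values]


# Placeholder data: 12.34, 12.50 and 13.01 stored with a scale factor of 100.
stored = [1234, 1250, 1301]
rescaled = rescale(stored, source_scale=100, dest_scale=10)
```

When the two scale factors differ, the result is floating point, which is consistent with the note that rescaled data is returned as float32.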

get_unique_chunk_files(file_paths)[source]#

We get files for the unique spatial and temporal extents covered by all collection files. Since the files have a suffix _{temporal_chunk_index}_{spatial_chunk_index}.h5 we just use all files with a single spatial_chunk_index for the full time index and all files with a single temporal_chunk_index for the full meta.

Returns:

t_files (list) – Explicit list of str file paths which, when combined, provide the entire spatial domain.
s_files (list) – Explicit list of str file paths which, when combined, provide the entire temporal extent.
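The selection described above can be sketched as picking every file that shares one temporal chunk index (together covering the spatial domain) and every file that shares one spatial chunk index (together covering the temporal extent). The function names and the choice of the first index in sorted order are assumptions for this example.

```python
import re

# Suffix format described on this page: _{temporal}_{spatial}.h5
SUFFIX = re.compile(r"_(\d+)_(\d+)\.h5$")


def chunk_indices(path):
    """Return (temporal_chunk_index, spatial_chunk_index) as strings."""
    match = SUFFIX.search(path)
    if match is None:
        raise ValueError(f"no chunk suffix in {path!r}")
    return match.group(1), match.group(2)


def unique_chunk_files(file_paths):
    """Hypothetical sketch: return (t_files, s_files) for a chunked file list."""
    file_paths = sorted(file_paths)
    first_t, first_s = chunk_indices(file_paths[0])
    # One temporal chunk across all spatial chunks: full spatial domain.
    t_files = [f for f in file_paths if chunk_indices(f)[0] == first_t]
    # One spatial chunk across all temporal chunks: full temporal extent.
    s_files = [f for f in file_paths if chunk_indices(f)[1] == first_s]
    return t_files, s_files


# Placeholder file names: 3 temporal chunks by 2 spatial chunks.
files = [f"out_{t:06d}_{s:06d}.h5" for t in range(3) for s in range(2)]
t_files, s_files = unique_chunk_files(files)
```

With 3 temporal and 2 spatial chunks, t_files holds 2 files and s_files holds 3, so only 5 of the 6 files need to be opened to build the full meta and time index.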

get_target_and_masked_meta(meta, target_meta_file=None, threshold=0.0001)[source]#

Use combined meta for all files and target_meta_file to get mapping from the full meta to the target meta and the mapping from the target meta to the full meta, both of which are masked to remove coordinates not present in the target_meta.

Parameters:

meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta
target_meta_file (str) – Path to target final meta containing coordinates to keep from the full list of coordinates present in the collected meta for the full file list.
threshold (float) – Threshold distance for finding target coordinates within full meta

Returns:

target_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta
masked_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected masked against target_meta

get_collection_attrs(file_paths, max_workers=None, target_meta_file=None, threshold=0.0001)[source]#

Get important dataset attributes from a file list to be collected.

Assumes the file list is chunked in time (row chunked).

Parameters:

file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.h5.
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.
target_meta_file (str) – Path to target final meta containing coordinates to keep from the full list of coordinates present in the collected meta for the full file list.
threshold (float) – Threshold distance for finding target coordinates within full meta

Returns:

time_index (pd.DatetimeIndex) – Concatenated full size datetime index from the flist that is being collected
target_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta
masked_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected masked against target_meta
shape (tuple) – Output (collected) dataset shape
global_attrs (dict) – Global attributes from the first file in file_paths (it’s assumed that all the files in file_paths have the same global file attributes).

get_flist_chunks(file_paths, n_writes=None)[source]#

Group files by temporal_chunk_index and then combine these groups if n_writes is less than the number of time_chunks. Assumes file_paths have a suffix format like _{temporal_chunk_index}_{spatial_chunk_index}.h5.

Parameters:

file_paths (list) – List of file paths each with a suffix _{temporal_chunk_index}_{spatial_chunk_index}.h5
n_writes (int | None) – Number of writes to use for collection

Returns:

flist_chunks (list) – List of file list chunks. Used to split collection and writing into multiple steps.
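A minimal sketch of the grouping: files are bucketed by temporal chunk index, and consecutive buckets are merged when n_writes is smaller than the number of temporal chunks. The function `flist_chunks` and the nearly equal split used for merging are assumptions for this example. The library may divide the groups differently.

```python
import re
from collections import defaultdict


def flist_chunks(file_paths, n_writes=None):
    """Hypothetical sketch: group files by temporal chunk, then merge groups."""
    groups = defaultdict(list)
    for path in sorted(file_paths):
        # First captured group is the temporal chunk index in the suffix.
        t_idx = re.search(r"_(\d+)_(\d+)\.h5$", path).group(1)
        groups[t_idx].append(path)
    chunks = [groups[key] for key in sorted(groups)]
    if n_writes is None or n_writes >= len(chunks):
        return chunks
    # Merge consecutive temporal groups into n_writes nearly equal parts
    # (assumes n_writes >= 1).
    merged = []
    base, extra = divmod(len(chunks), n_writes)
    start = 0
    for i in range(n_writes):
        stop = start + base + (1 if i < extra else 0)
        merged.append([f for group in chunks[start:stop] for f in group])
        start = stop
    return merged


# Placeholder file names: 4 temporal chunks by 2 spatial chunks.
files = [f"out_{t:06d}_{s:06d}.h5" for t in range(4) for s in range(2)]
per_time_chunk = flist_chunks(files)
two_writes = flist_chunks(files, n_writes=2)
```

Fewer writes means each write step holds more files in memory, so n_writes trades memory use against the number of write operations.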

collect_feature(dset, target_masked_meta, target_meta_file, time_index, shape, flist_chunks, out_file, threshold=0.0001, max_workers=None)[source]#

Collect chunks for single feature

Parameters:

dset (str) – Dataset name to collect.
target_masked_meta (pd.DataFrame) – Same as subset_masked_meta but instead for the entire list of files to be collected.
target_meta_file (str) – Path to target final meta containing coordinates to keep from the full file list collected meta. This can be but is not necessarily a subset of the full list of coordinates for all files in the file list. This is used to remove coordinates from the full file list which are not present in the target_meta. Either this full meta or a subset, depending on which coordinates are present in the data to be collected, will be the final meta for the collected output files.
time_index (pd.DatetimeIndex) – Concatenated datetime index for the given file paths.
shape (tuple) – Output (collected) dataset shape
flist_chunks (list) – List of file list chunks. Used to split collection and writing into multiple steps.
out_file (str) – File path of final output file.
threshold (float) – Threshold distance for finding target coordinates within full meta
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.

classmethod collect(file_paths, out_file, features, max_workers=None, log_level=None, log_file=None, target_meta_file=None, n_writes=None, overwrite=True, threshold=0.0001)[source]#

Collect data files from a dir to one output file.

Filename requirements:

Should end with “.h5”

Parameters:

file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.h5. Files resolved by this argument must be of the form *_{temporal_chunk_index}_{spatial_chunk_index}.h5.
out_file (str) – File path of final output file.
features (list) – List of dsets to collect
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.
log_level (str | None) – Desired log level, None will not initialize logging.
log_file (str | None) – Target log file. None logs to stdout.
write_status (bool) – Flag to write status file once complete if running from pipeline.
job_name (str) – Job name for status file if running from pipeline.
pipeline_step (str, optional) – Name of the pipeline step being run. If None, the pipeline_step will be set to "collect", mimicking old reV behavior. By default, None.
target_meta_file (str) – Path to target final meta containing coordinates to keep from the full file list collected meta. This can be but is not necessarily a subset of the full list of coordinates for all files in the file list. This is used to remove coordinates from the full file list which are not present in the target_meta. Either this full meta or a subset, depending on which coordinates are present in the data to be collected, will be the final meta for the collected output files.
n_writes (int | None) – Number of writes to split full file list into. Must be less than or equal to the number of temporal chunks if chunks have different time indices.
overwrite (bool) – Whether to overwrite existing output file
threshold (float) – Threshold distance for finding target coordinates within full meta
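A usage sketch based only on the signature documented above. The wrapper `run_collection` is hypothetical, the import path is taken from the module name in this page's title, and the file pattern, output path, and feature name in the commented call are placeholders. The import is deferred so the function can be defined without sup3r installed.

```python
def run_collection(file_pattern, out_file, features):
    """Collect chunked H5 files into one output file (requires sup3r)."""
    # Import path assumed from the module name in this page's title.
    from sup3r.postprocessing.collectors.h5 import CollectorH5

    CollectorH5.collect(
        file_pattern,
        out_file,
        features,
        max_workers=1,      # 1 runs serial
        n_writes=None,
        overwrite=True,     # replace an existing output file
        threshold=0.0001,
    )


# Hypothetical call; the pattern, output path and feature name are placeholders:
# run_collection("./chunks/out_*.h5", "./collected.h5", ["windspeed_100m"])
```

The pattern must resolve to files ending in _{temporal_chunk_index}_{spatial_chunk_index}.h5, as required by the file_paths parameter.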

static get_chunk_indices(file)#

Get spatial and temporal chunk indices from the given file name.

Returns:

temporal_chunk_index (str) – Zero padded integer for the temporal chunk index
spatial_chunk_index (str) – Zero padded integer for the spatial chunk index
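A minimal sketch of parsing the two indices from a file name, assuming the _{temporal_chunk_index}_{spatial_chunk_index}.h5 suffix described for this collector. The helper `parse_chunk_indices`, the regular expression, and the example path are assumptions for illustration. The library's own parsing may differ.

```python
import os
import re


def parse_chunk_indices(file):
    """Return (temporal_chunk_index, spatial_chunk_index) as zero-padded strings."""
    match = re.search(r"_(\d+)_(\d+)\.h5$", os.path.basename(file))
    if match is None:
        raise ValueError(f"{file!r} has no _{{temporal}}_{{spatial}}.h5 suffix")
    # Groups are returned as strings, so the zero padding is preserved.
    return match.group(1), match.group(2)


# Placeholder path, used only for this example.
t_idx, s_idx = parse_chunk_indices("/data/out_000012_000003.h5")
```

Keeping the indices as zero-padded strings means a plain lexicographic sort orders the chunks correctly.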

static get_dset_attrs(feature)#

Get attributes for output feature

Parameters:

feature (str) – Name of feature to write

Returns:

attrs (dict) – Dictionary of attributes for requested dset
dtype (str) – Data type for requested dset. Defaults to float32

classmethod get_node_cmd(config)#

Get a CLI call to collect data.

Parameters:

config (dict) – sup3r collection config with all necessary args and kwargs to run data collection.

static get_time_dim_name(filepath)#

Get the name of the time dimension in the given file

Parameters:

filepath (str) – Path to the file

Returns:

time_key (str) – Name of the time dimension in the given file

classmethod write_data(out_file, dsets, time_index, data_list, meta, global_attrs=None)#

Write list of datasets to out_file.

Parameters:

out_file (str) – Pre-existing H5 file output path
dsets (list) – list of datasets to write to out_file
time_index (pd.DatetimeIndex) – Pandas datetime index to use for file time_index.
data_list (list) – List of np.ndarray objects to write to out_file
meta (pd.DataFrame) – Full meta dataframe for the final output data.
global_attrs (dict) – Namespace of file-global attributes for the final output data.
