sup3r.postprocessing.collectors.h5.CollectorH5
- class CollectorH5(file_paths)
Bases: BaseCollector
Sup3r H5 file collection framework
- Parameters:
file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.<ext>. Files should have non-overlapping time_index and spatial domains.
Methods
collect(file_paths, out_file, features[, ...]) – Collect data files from a dir to one output file.
collect_feature(dset, target_masked_meta, ...) – Collect chunks for a single feature.
get_chunk_indices(file) – Get spatial and temporal chunk indices from the given file name.
get_collection_attrs(file_paths[, ...]) – Get important dataset attributes from a file list to be collected.
get_coordinate_indices(target_meta, full_meta) – Get coordinate indices in meta data for given targets.
get_data(file_path, feature, time_index, ...) – Retrieve a data array from a chunked file.
get_dset_attrs(feature) – Get attributes for output feature.
get_flist_chunks(file_paths[, n_writes]) – Group files by temporal_chunk_index and then combine these groups if n_writes is less than the number of time_chunks.
get_node_cmd(config) – Get a CLI call to collect data.
get_slices(final_time_index, final_meta, ...) – Get index slices where the new ti/meta belong in the final ti/meta.
get_target_and_masked_meta(meta[, ...]) – Use combined meta for all files and target_meta_file to get mapping from the full meta to the target meta and the mapping from the target meta to the full meta, both of which are masked to remove coordinates not present in the target_meta.
get_time_dim_name(filepath) – Get the name of the time dimension in the given file.
get_unique_chunk_files(file_paths) – We get files for the unique spatial and temporal extents covered by all collection files.
write_data(out_file, dsets, time_index, ...) – Write list of datasets to out_file.
- classmethod get_slices(final_time_index, final_meta, new_time_index, new_meta)
Get index slices where the new ti/meta belong in the final ti/meta.
- Parameters:
final_time_index (pd.DatetimeIndex) – Time index of the final file that new_time_index is being written to.
final_meta (pd.DataFrame) – Meta data of the final file that new_meta is being written to.
new_time_index (pd.DatetimeIndex) – Chunk time index that is a subset of the final_time_index.
new_meta (pd.DataFrame) – Chunk meta data that is a subset of the final_meta.
- Returns:
row_slice (slice) – final_time_index[row_slice] = new_time_index
col_slice (slice) – final_meta[col_slice] = new_meta
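As a rough illustration of what these return values mean, the sketch below locates a contiguous chunk inside a longer sequence and returns the slice it occupies. `locate_slice` is a hypothetical helper written for this example, not part of sup3r, and it uses plain lists of datetimes rather than a pd.DatetimeIndex. It assumes the chunk is a contiguous, ordered subset of the final index.

```python
from datetime import datetime, timedelta


def locate_slice(final, new):
    """Return the slice s such that final[s] == new.

    Hypothetical helper: assumes `new` is a contiguous, ordered run of
    values that all appear in `final`.
    """
    start = final.index(new[0])
    stop = start + len(new)
    if final[start:stop] != new:
        raise ValueError("new is not a contiguous subset of final")
    return slice(start, stop)


# Final time index: 6 hourly steps; the chunk covers positions 2, 3 and 4.
t0 = datetime(2020, 1, 1)
final_time_index = [t0 + timedelta(hours=h) for h in range(6)]
new_time_index = final_time_index[2:5]

row_slice = locate_slice(final_time_index, new_time_index)
```

The same idea applies along the spatial axis, where the chunk meta rows are located inside the final meta to give col_slice.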
- get_coordinate_indices(target_meta, full_meta, threshold=0.0001)
Get coordinate indices in meta data for given targets
- Parameters:
target_meta (pd.DataFrame) – Dataframe of coordinates to find within the full meta
full_meta (pd.DataFrame) – Dataframe of full set of coordinates for unfiltered dataset
threshold (float) – Threshold distance for finding target coordinates within full meta
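The role of the threshold can be pictured with a small pure-Python sketch. Everything here is an assumption made for illustration: `find_coordinate_indices` is a hypothetical helper, coordinates are plain (latitude, longitude) tuples instead of DataFrame rows, and a match is declared when both components differ by less than the threshold. How sup3r actually measures the distance is not stated on this page.

```python
def find_coordinate_indices(targets, full, threshold=0.0001):
    """Return the index in `full` of each target coordinate.

    Hypothetical helper: a target matches a row of `full` when latitude
    and longitude both differ by less than `threshold`. Targets with no
    match are skipped.
    """
    indices = []
    for t_lat, t_lon in targets:
        for i, (lat, lon) in enumerate(full):
            if abs(lat - t_lat) < threshold and abs(lon - t_lon) < threshold:
                indices.append(i)
                break
    return indices


full_meta = [(39.0, -105.0), (39.0, -104.9), (39.1, -105.0), (39.1, -104.9)]
# The first target is off by 1e-8 degrees in longitude, well inside the threshold.
target_meta = [(39.1, -105.00000001), (39.0, -104.9)]

matched = find_coordinate_indices(target_meta, full_meta)
```

A tolerance is useful because coordinates written by different chunk files may differ slightly in floating-point precision.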
- get_data(file_path, feature, time_index, meta, scale_factor, dtype, threshold=0.0001)
Retrieve a data array from a chunked file.
- Parameters:
file_path (str) – h5 file to get data from
feature (str) – Dataset to retrieve data from in file_path.
time_index (pd.DatetimeIndex) – Time index of the final file.
meta (pd.DataFrame) – Meta data of the final file.
scale_factor (int | float) – Final destination scale factor after collection. If the data retrieval from the files to be collected has a different scale factor, the collected data will be rescaled and returned as float32.
dtype (np.dtype) – Final dtype to return data as
threshold (float) – Threshold distance for finding target coordinates within full meta
- Returns:
f_data (Union[np.ndarray, da.core.Array]) – Data array from file_path, cast to the input dtype.
row_slice (slice) – final_time_index[row_slice] = new_time_index
col_slice (slice) – final_meta[col_slice] = new_meta
- get_unique_chunk_files(file_paths)
We get files for the unique spatial and temporal extents covered by all collection files. Since the files have a suffix _{temporal_chunk_index}_{spatial_chunk_index}.h5 we just use all files with a single spatial_chunk_index for the full time index and all files with a single temporal_chunk_index for the full meta.
- Parameters:
t_files (list) – Explicit list of str file paths which, when combined, provide the entire spatial domain.
s_files (list) – Explicit list of str file paths which, when combined, provide the entire temporal extent.
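The selection described above can be sketched with the standard library alone. `split_unique_chunk_files` is a hypothetical helper, not the sup3r implementation, and the file names are made up. The sketch assumes zero-padded integer indices in the suffix and arbitrarily picks the lowest chunk index along each axis.

```python
import re

# Assumed suffix layout: _{temporal_chunk_index}_{spatial_chunk_index}.h5
SUFFIX = re.compile(r"_(\d+)_(\d+)\.h5$")


def split_unique_chunk_files(file_paths):
    """Return (t_files, s_files) for a list of chunked file paths.

    Hypothetical helper: t_files share one temporal chunk index and
    together span the spatial domain; s_files share one spatial chunk
    index and together span the temporal extent.
    """
    parsed = []
    for path in sorted(file_paths):
        match = SUFFIX.search(path)
        if match is None:
            raise ValueError(f"unexpected file name: {path}")
        parsed.append((path, match.group(1), match.group(2)))
    # String comparison is safe here only because indices are zero padded.
    first_t = min(t for _, t, _ in parsed)
    first_s = min(s for _, _, s in parsed)
    t_files = [p for p, t, _ in parsed if t == first_t]
    s_files = [p for p, _, s in parsed if s == first_s]
    return t_files, s_files


# Two temporal chunks, three spatial chunks.
paths = [f"out_{t:06d}_{s:06d}.h5" for t in range(2) for s in range(3)]
t_files, s_files = split_unique_chunk_files(paths)
```

Reading only these two subsets is enough to build the full meta and the full time index without opening every chunk file.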
- get_target_and_masked_meta(meta, target_meta_file=None, threshold=0.0001)
Use combined meta for all files and target_meta_file to get mapping from the full meta to the target meta and the mapping from the target meta to the full meta, both of which are masked to remove coordinates not present in the target_meta.
- Parameters:
meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta
target_meta_file (str) – Path to target final meta containing coordinates to keep from the full list of coordinates present in the collected meta for the full file list.
threshold (float) – Threshold distance for finding target coordinates within full meta
- Returns:
target_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta
masked_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected masked against target_meta
- get_collection_attrs(file_paths, max_workers=None, target_meta_file=None, threshold=0.0001)
Get important dataset attributes from a file list to be collected.
Assumes the file list is chunked in time (row chunked).
- Parameters:
file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.h5.
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.
target_meta_file (str) – Path to target final meta containing coordinates to keep from the full list of coordinates present in the collected meta for the full file list.
threshold (float) – Threshold distance for finding target coordinates within full meta
- Returns:
time_index (pd.DatetimeIndex) – Concatenated full size datetime index from the flist that is being collected
target_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected or provided target meta
masked_meta (pd.DataFrame) – Concatenated full size meta data from the flist that is being collected masked against target_meta
shape (tuple) – Output (collected) dataset shape
global_attrs (dict) – Global attributes from the first file in file_paths (it’s assumed that all the files in file_paths have the same global file attributes).
- get_flist_chunks(file_paths, n_writes=None)
Group files by temporal_chunk_index and then combine these groups if n_writes is less than the number of time_chunks. Assumes file_paths have a suffix format like _{temporal_chunk_index}_{spatial_chunk_index}.h5
- Parameters:
file_paths (list) – List of file paths each with a suffix _{temporal_chunk_index}_{spatial_chunk_index}.h5
n_writes (int | None) – Number of writes to use for collection
- Returns:
flist_chunks (list) – List of file list chunks. Used to split collection and writing into multiple steps.
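A minimal sketch of the grouping step follows, using only the standard library. `group_flist_chunks` is a hypothetical stand-in rather than the sup3r code, and the way neighbouring temporal groups are merged (an even split in temporal order) is an assumption; the page only states that groups are combined when n_writes is smaller than the number of time chunks.

```python
import re
from collections import defaultdict

# Assumed suffix layout: _{temporal_chunk_index}_{spatial_chunk_index}.h5
SUFFIX = re.compile(r"_(\d+)_(\d+)\.h5$")


def group_flist_chunks(file_paths, n_writes=None):
    """Group paths by temporal chunk index, then merge into n_writes lists."""
    groups = defaultdict(list)
    for path in sorted(file_paths):
        match = SUFFIX.search(path)
        if match is None:
            raise ValueError(f"unexpected file name: {path}")
        groups[match.group(1)].append(path)
    chunks = [groups[key] for key in sorted(groups)]
    if n_writes is None or n_writes >= len(chunks):
        return chunks
    # Merge neighbouring temporal groups so that exactly n_writes lists remain.
    base, extra = divmod(len(chunks), n_writes)
    merged, start = [], 0
    for i in range(n_writes):
        stop = start + base + (1 if i < extra else 0)
        merged.append([p for group in chunks[start:stop] for p in group])
        start = stop
    return merged


# Four temporal chunks, two spatial chunks each.
paths = [f"out_{t:06d}_{s:06d}.h5" for t in range(4) for s in range(2)]
per_time_chunk = group_flist_chunks(paths)
two_writes = group_flist_chunks(paths, n_writes=2)
```

With n_writes=None each temporal chunk is written on its own; a smaller n_writes trades more memory per write for fewer write operations.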
- collect_feature(dset, target_masked_meta, target_meta_file, time_index, shape, flist_chunks, out_file, threshold=0.0001, max_workers=None)
Collect chunks for a single feature
- Parameters:
dset (str) – Dataset name to collect.
target_masked_meta (pd.DataFrame) – Same as subset_masked_meta but instead for the entire list of files to be collected.
target_meta_file (str) – Path to target final meta containing coordinates to keep from the full file list collected meta. This can be but is not necessarily a subset of the full list of coordinates for all files in the file list. This is used to remove coordinates from the full file list which are not present in the target_meta. Either this full meta or a subset, depending on which coordinates are present in the data to be collected, will be the final meta for the collected output files.
time_index (pd.DatetimeIndex) – Concatenated datetime index for the given file paths.
shape (tuple) – Output (collected) dataset shape
flist_chunks (list) – List of file list chunks. Used to split collection and writing into multiple steps.
out_file (str) – File path of final output file.
threshold (float) – Threshold distance for finding target coordinates within full meta
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.
- classmethod collect(file_paths, out_file, features, max_workers=None, log_level=None, log_file=None, target_meta_file=None, n_writes=None, overwrite=True, threshold=0.0001)
Collect data files from a dir to one output file.
- Filename requirements:
Should end with “.h5”
- Parameters:
file_paths (list | str) – Explicit list of str file paths that will be sorted and collected or a single string with unix-style /search/patt*ern.h5. Files resolved by this argument must be of the form *_{temporal_chunk_index}_{spatial_chunk_index}.h5.
out_file (str) – File path of final output file.
features (list) – List of dsets to collect
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.
log_level (str | None) – Desired log level, None will not initialize logging.
log_file (str | None) – Target log file. None logs to stdout.
write_status (bool) – Flag to write status file once complete if running from pipeline.
job_name (str) – Job name for status file if running from pipeline.
pipeline_step (str, optional) – Name of the pipeline step being run. If None, the pipeline_step will be set to "collect", mimicking old reV behavior. By default, None.
target_meta_file (str) – Path to target final meta containing coordinates to keep from the full file list collected meta. This can be but is not necessarily a subset of the full list of coordinates for all files in the file list. This is used to remove coordinates from the full file list which are not present in the target_meta. Either this full meta or a subset, depending on which coordinates are present in the data to be collected, will be the final meta for the collected output files.
n_writes (int | None) – Number of writes to split full file list into. Must be less than or equal to the number of temporal chunks if chunks have different time indices.
overwrite (bool) – Whether to overwrite existing output file
threshold (float) – Threshold distance for finding target coordinates within full meta
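The two accepted forms of file_paths (an explicit list or a unix-style search pattern) can be normalized as in the sketch below. `resolve_file_paths` is a hypothetical helper written for this example and the file names are invented; it only shows the general idea of expanding a pattern with `glob` and sorting the result, not how sup3r does it internally.

```python
import glob
import os
import tempfile


def resolve_file_paths(file_paths):
    """Return a sorted list of paths from a list or a unix-style pattern.

    Hypothetical helper mirroring the two accepted forms of `file_paths`.
    """
    if isinstance(file_paths, str):
        file_paths = glob.glob(file_paths)
    return sorted(file_paths)


# Create a few empty files in a temporary directory to search.
names = ("chunk_000001_000000.h5", "chunk_000000_000000.h5", "notes.txt")
with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        open(os.path.join(tmp, name), "w").close()
    pattern = os.path.join(tmp, "chunk_*.h5")
    resolved = [os.path.basename(p) for p in resolve_file_paths(pattern)]
```

The resolved list, an output path, and a list of dataset names correspond to the file_paths, out_file, and features arguments in the signature above.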
- static get_chunk_indices(file)
Get spatial and temporal chunk indices from the given file name.
- Returns:
temporal_chunk_index (str) – Zero padded integer for the temporal chunk index
spatial_chunk_index (str) – Zero padded integer for the spatial chunk index
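For illustration, the two indices can be pulled from a file name with a regular expression. `parse_chunk_indices` and the example path are hypothetical; the sketch assumes the `_{temporal_chunk_index}_{spatial_chunk_index}.h5` suffix described for this collector and returns the indices as zero-padded strings, matching the documented return types.

```python
import re


def parse_chunk_indices(file):
    """Return (temporal_chunk_index, spatial_chunk_index) as strings.

    Hypothetical helper: assumes the file name ends with
    _{temporal_chunk_index}_{spatial_chunk_index}.h5
    """
    match = re.search(r"_(\d+)_(\d+)\.h5$", file)
    if match is None:
        raise ValueError(f"no chunk indices found in: {file}")
    return match.group(1), match.group(2)


temporal_chunk_index, spatial_chunk_index = parse_chunk_indices(
    "/data/sup3r_out_000003_000012.h5"
)
```

Keeping the indices as strings preserves the zero padding, so sorting them lexically gives the same order as sorting them numerically.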
- static get_dset_attrs(feature)
Get attributes for output feature
- Parameters:
feature (str) – Name of feature to write
- Returns:
attrs (dict) – Dictionary of attributes for requested dset
dtype (str) – Data type for requested dset. Defaults to float32
- classmethod get_node_cmd(config)
Get a CLI call to collect data.
- Parameters:
config (dict) – sup3r collection config with all necessary args and kwargs to run data collection.
- static get_time_dim_name(filepath)
Get the name of the time dimension in the given file
- Parameters:
filepath (str) – Path to the file
- Returns:
time_key (str) – Name of the time dimension in the given file
- classmethod write_data(out_file, dsets, time_index, data_list, meta, global_attrs=None)
Write list of datasets to out_file.
- Parameters:
out_file (str) – Pre-existing H5 file output path
dsets (list) – list of datasets to write to out_file
time_index (pd.DatetimeIndex) – Pandas datetime index to use for file time_index.
data_list (list) – List of np.ndarray objects to write to out_file
meta (pd.DataFrame) – Full meta dataframe for the final output data.
global_attrs (dict) – Namespace of file-global attributes for the final output data.