sup3r.preprocessing.data_handling.dual_data_handling.DualDataHandler

class DualDataHandler(hr_handler, lr_handler, cache_pattern=None, overwrite_cache=False, regrid_workers=1, load_cached=True, shuffle_time=False, regrid_lr=True, s_enhance=1, t_enhance=1, val_split=0.0)[source]

Bases: CacheHandlingMixIn, TrainingPrepMixIn

Batch handling class for h5 data as high res (usually WTK) and netcdf data as low res (usually ERA5)

NOTE: When initializing the lr_handler it’s important to pick a shape argument that will produce a low res domain that completely overlaps with the high res domain. When the high res data is not on a regular grid (WTK uses lambert) the low res shape is not simply the high res shape divided by s_enhance. It is easiest to not provide a shape argument at all for lr_handler and to get the full domain.

Initialize data handler using hr and lr data handlers for h5 data and nc data

Parameters:

hr_handler (DataHandler) – DataHandler for high_res data
lr_handler (DataHandler) – DataHandler for low_res data
cache_pattern (str) – Pattern for files to use for saving regridded ERA data.
overwrite_cache (bool) – Whether to overwrite regrid cache
regrid_workers (int | None) – Number of workers to use for regridding routine.
load_cached (bool) – Whether to load cache to memory or wait until load_cached() is called.
shuffle_time (bool) – Whether to shuffle time indices prior to training/validation split
regrid_lr (bool) – Flag to regrid the low-res handler data to the high-res handler grid. This will take care of any minor inconsistencies in different projections. Disable this if the grids are known to be the same.
s_enhance (int) – Spatial enhancement factor
t_enhance (int) – Temporal enhancement factor
val_split (float) – Fraction of data (between 0 and 1) to reserve for validation.
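As a rough illustration of how `shuffle_time` and `val_split` could interact, the sketch below splits time indices into training and validation sets. It assumes `val_split` is a fraction between 0 and 1 (the default of 0.0 suggests this). The names `split_time_indices` and `seed` are hypothetical and not part of sup3r.

```python
import numpy as np


def split_time_indices(n_steps, val_split=0.0, shuffle_time=False, seed=0):
    """Hypothetical helper: split time indices into (training, validation)."""
    indices = np.arange(n_steps)
    if shuffle_time:
        # Shuffle before splitting so validation steps are spread over the record.
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)
    n_val = int(val_split * n_steps)
    # The first n_val indices are reserved for validation, the rest for training.
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_time_indices(10, val_split=0.2, shuffle_time=True)
```

With `shuffle_time=False` the validation set is simply the first block of time steps, which may matter for strongly seasonal data.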

Methods

`check_cached_features`(features[, ...])	Check which features have been cached and check flags to determine whether to load or extract these features again
`check_clear_data`()	Check if data was cached and free memory if load_cached is False
`get_cache_file_names`(cache_pattern[, ...])	Get names of cache files from cache_pattern and feature names
`get_data`()	Check hr and lr shapes and trim hr data if needed to match required relationship to lr shape based on enhancement factors.
`get_lr_data`()	Check if ERA data is cached.
`get_lr_regridded_data`()	Regrid low_res data for all requested noncached features.
`get_next`()	Get next high_res + low_res.
`get_regridder`()	Get regridder object
`load_cached_data`()	Load regridded low_res and high_res cache data
`load_lr_cached_data`()	Load low_res cache data
`normalize`([means, stds, max_workers])	Normalize low_res and high_res data
`parallel_load`(data, cache_files, features[, ...])	Load feature data in parallel

Attributes

`cache_files`	Get file names of regridded cache data
`cache_pattern`	Get correct cache file pattern for formatting.
`cached_features`	List of features which have been requested but have been determined not to need extraction.
`data`	Get low res data.
`feature_mem`	Number of bytes for a single feature array.
`features`	Get a list of data features including features from both the lr and hr data handlers
`grid_mem`	Get memory used by a feature at a single time step
`hr_exo_features`	Get a list of high-resolution features that are only used for training e.g., mid-network high-res topo injection.
`hr_lat_lon`	Get high_res lat lon array
`hr_out_features`	Get a list of high-resolution features that are intended to be output by the GAN.
`hr_required_shape`	Return required shape for high_res data
`hr_sample_shape`	Get hr sample shape
`lr_features`	Get a list of low-resolution features.
`lr_grid_shape`	Return grid shape for regridded low_res data
`lr_input_data`	Get low res data used as input to regridding routine
`lr_lat_lon`	Get low_res lat lon array
`lr_only_features`	Features to use for training only and not output
`lr_required_shape`	Return required shape for regridded low_res data
`lr_sample_shape`	Get lr sample shape
`means`	Get the mean values for each feature.
`noncached_features`	Get list of features needing extraction or derivation
`norm_workers`	Get upper bound on workers used for normalization.
`sample_shape`	Get lr sample shape
`shape`	Get low_res shape
`size`	Get low_res size
`stds`	Get the standard deviation values for each feature.
`try_load`	Check if we should try to load cached data
`val_data`	Get low res validation data.

get_data()[source]: Check hr and lr shapes and trim hr data if needed to match required relationship to lr shape based on enhancement factors. Then regrid lr data and split hr and lr data into training and validation sets.
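The shape relationship described for `get_data()` can be sketched as follows. This is a minimal illustration under stated assumptions, not sup3r's implementation: arrays are assumed to be ordered as (rows, cols, time, features), and `required_shapes` and `trim_hr` are hypothetical helpers.

```python
import numpy as np


def required_shapes(hr_shape, s_enhance, t_enhance):
    """Return (hr_required, lr_required) for an hr shape of (rows, cols, time, ...)."""
    rows, cols, steps = hr_shape[:3]
    # The lr shape is the largest whole number of coarse cells that fits in hr.
    lr_required = (rows // s_enhance, cols // s_enhance, steps // t_enhance)
    # The hr shape must be an exact multiple of the lr shape.
    hr_required = (lr_required[0] * s_enhance,
                   lr_required[1] * s_enhance,
                   lr_required[2] * t_enhance)
    return hr_required, lr_required


def trim_hr(hr_data, s_enhance, t_enhance):
    """Drop trailing rows, columns, and time steps that do not fit the factors."""
    (rows, cols, steps), _ = required_shapes(hr_data.shape, s_enhance, t_enhance)
    return hr_data[:rows, :cols, :steps]


hr_data = np.zeros((21, 22, 50, 2), dtype=np.float32)
trimmed = trim_hr(hr_data, s_enhance=5, t_enhance=4)
print(trimmed.shape)  # (20, 20, 48, 2)
```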

property means

Get the mean values for each feature. Mean values from the low-res data handler are prioritized because these are typically the “input” features

Returns: dict

property stds

Get the standard deviation values for each feature. Standard deviation values from the low-res data handler are prioritized because these are typically the “input” features

Returns: dict
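One way to picture the low-res priority for `means` and `stds` is a dictionary merge in which low-res entries override high-res entries for shared feature names. The helper `merge_stats` and the feature names below are hypothetical and only illustrate the idea.

```python
def merge_stats(hr_stats, lr_stats):
    """Hypothetical helper: combine per-feature statistics, low-res values win."""
    merged = dict(hr_stats)
    # Keys present in both dictionaries take the low-res value.
    merged.update(lr_stats)
    return merged


hr_means = {"windspeed_100m": 7.5, "topography": 300.0}
lr_means = {"windspeed_100m": 7.1}
means = merge_stats(hr_means, lr_means)
```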

normalize(means=None, stds=None, max_workers=None)[source]

Normalize low_res and high_res data

Parameters:

means (dict | None) – Dictionary of means for all features with keys: feature names and values: mean values. If this is None, the self.means attribute will be used. If this is not None, this DataHandler object's means attribute will be updated.
stds (dict | None) – Dictionary of standard deviation values for all features with keys: feature names and values: standard deviations. If this is None, the self.stds attribute will be used. If this is not None, this DataHandler object's stds attribute will be updated.
max_workers (None | int) – Has no effect. Used to match MixIn class signature.
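The per-feature normalization implied by the `means` and `stds` dictionaries can be sketched as below. This is an assumption-laden illustration rather than the library's code: it assumes the feature axis is last, returns a copy instead of modifying the handler in place, and uses the hypothetical names `normalize_features`, `u_100m`, and `v_100m`.

```python
import numpy as np


def normalize_features(data, features, means, stds):
    """Hypothetical helper: return (data - mean) / std for each feature channel."""
    out = data.astype(np.float32, copy=True)
    for i, name in enumerate(features):
        # Look up statistics by feature name, as in the means/stds dictionaries.
        out[..., i] = (out[..., i] - means[name]) / stds[name]
    return out


features = ["u_100m", "v_100m"]
data = np.stack([np.full((2, 2, 3), 10.0), np.full((2, 2, 3), 4.0)], axis=-1)
normed = normalize_features(data, features,
                            means={"u_100m": 10.0, "v_100m": 2.0},
                            stds={"u_100m": 2.0, "v_100m": 2.0})
```

Applying the same dictionaries to both the low-res and high-res arrays keeps shared features on a common scale.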

property features: Get a list of data features including features from both the lr and hr data handlers

property lr_only_features: Features to use for training only and not output

property lr_features: Get a list of low-resolution features. All low-resolution features are used for training.

property hr_exo_features: Get a list of high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set.

property hr_out_features: Get a list of high-resolution features that are intended to be output by the GAN. Does not include high-resolution exogenous features

property grid_mem

Get memory used by a feature at a single time step

Returns: int – Number of bytes for a single feature array at a single time step

property feature_mem

Number of bytes for a single feature array. Used to estimate max_workers.

Returns: int – Number of bytes for a single feature array
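The relationship between `grid_mem` and `feature_mem` can be illustrated with a small calculation. The sketch assumes 4-byte (float32) values, which this page does not state, and the function names simply mirror the property names for readability.

```python
import numpy as np

BYTES_PER_VALUE = 4  # assumption: data stored as float32


def grid_mem(grid_shape):
    """Bytes needed for one feature on the spatial grid at a single time step."""
    return BYTES_PER_VALUE * int(np.prod(grid_shape))


def feature_mem(grid_shape, n_time_steps):
    """Bytes needed for one feature over all time steps."""
    return grid_mem(grid_shape) * n_time_steps


print(grid_mem((10, 20)))         # 800
print(feature_mem((10, 20), 24))  # 19200
```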

property sample_shape: Get lr sample shape

property lr_sample_shape: Get lr sample shape

property hr_sample_shape: Get hr sample shape

property data: Get low res data. Same as self.lr_data but used to match property used for computing means and stdevs

property val_data: Get low res validation data. Same as self.lr_val_data but used to match property used by normalization routine.

property lr_input_data: Get low res data used as input to regridding routine

property lr_required_shape: Return required shape for regridded low_res data

property shape: Get low_res shape

property size: Get low_res size

property hr_required_shape: Return required shape for high_res data

property lr_grid_shape: Return grid shape for regridded low_res data

property lr_lat_lon: Get low_res lat lon array

property hr_lat_lon: Get high_res lat lon array

property cache_files: Get file names of regridded cache data

property noncached_features: Get list of features needing extraction or derivation

property try_load: Check if we should try to load cached data

load_lr_cached_data()[source]: Load low_res cache data

load_cached_data()[source]: Load regridded low_res and high_res cache data

check_clear_data()[source]: Check if data was cached and free memory if load_cached is False

get_lr_data()[source]: Check if ERA data is cached. If not, extract the data and regrid it. Save to cache if a cache pattern is provided.

get_regridder()[source]: Get regridder object

property cache_pattern

Get correct cache file pattern for formatting.

Returns: _cache_pattern (str) – The cache file pattern with formatting keys included.

property cached_features: List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.

static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)

Check which features have been cached and check flags to determine whether to load or extract these features again

Parameters:

features (list) – list of features to extract
cache_files (list | None) – Path to files with saved feature data
overwrite_cache (bool) – Whether to overwrite cached files
load_cached (bool) – Whether to load data from cache files

Returns:

list – List of features to extract. Might not include features which have cache files.
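A plausible reading of this check is sketched below: a feature needs extraction when no cache files are given, when its cache file is missing, or when the cache is to be overwritten. This is a guess at the decision logic, not the library's code. The helper name `features_to_extract` is hypothetical, and the sketch ignores `load_cached`.

```python
import os


def features_to_extract(features, cache_files=None, overwrite_cache=False):
    """Hypothetical helper: list the features that still need extraction."""
    if cache_files is None:
        # Nothing can be loaded from cache, so extract everything.
        return list(features)
    extract = []
    for feature, path in zip(features, cache_files):
        # Extract if the cache is stale by request or the file is missing.
        if overwrite_cache or not os.path.exists(path):
            extract.append(feature)
    return extract
```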

get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)

Get names of cache files from cache_pattern and feature names

Parameters:

cache_pattern (str) – Pattern to use for cache file names
grid_shape (tuple) – Shape of grid to use for cache file naming
time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming
target (tuple) – Target to use for cache file naming
features (list) – List of features to use for cache file naming

Returns:

list – List of cache file names
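To make the idea of a formatted cache pattern concrete, the sketch below expands a pattern once per feature. The `{feature}` formatting key, the example path, and the helper name `cache_file_names` are assumptions for illustration. The actual keys used by sup3r (and how `grid_shape`, `time_index`, and `target` enter the name) are not specified on this page and are omitted here.

```python
def cache_file_names(cache_pattern, features):
    """Hypothetical helper: one cache file name per feature."""
    # Assumes the pattern contains a '{feature}' placeholder.
    return [cache_pattern.format(feature=name) for name in features]


names = cache_file_names("./cache/lr_{feature}.pkl", ["U_100m", "V_100m"])
```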

get_lr_regridded_data()[source]: Regrid low_res data for all requested noncached features. Load cached features if available and overwrite_cache=False

property norm_workers: Get upper bound on workers used for normalization.

parallel_load(data, cache_files, features, max_workers=None)

Load feature data in parallel

Parameters:

data (ndarray) – Array to fill with cached data
cache_files (list) – List of cache files for each feature
features (list) – List of requested features
max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.
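The general pattern of filling a preallocated array from per-feature cache files in parallel can be sketched with the standard library. This is not the library's implementation: it assumes each cache file is a `.npy` array readable with `np.load` and that the feature axis is last, and it uses a thread pool as one possible choice of executor. The name `parallel_load_sketch` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import numpy as np


def parallel_load_sketch(data, cache_files, features, max_workers=None):
    """Hypothetical helper: fill data[..., i] from cache_files[i] in parallel."""
    if len(cache_files) != len(features):
        raise ValueError("Expected one cache file per requested feature")
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the feature index it should fill.
        futures = {pool.submit(np.load, path): i
                   for i, path in enumerate(cache_files)}
        for future in as_completed(futures):
            data[..., futures[future]] = future.result()
    return data
```

Passing `max_workers=None` lets the executor choose its default number of workers.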

get_next()[source]

Get next high_res + low_res. Gets random spatiotemporal sample for h5 data and then uses enhancement factors to subsample interpolated/regridded low_res data for same spatiotemporal extent.

Returns:

hr_data (ndarray) – Array of high resolution data with each feature equal in shape to hr_sample_shape
lr_data (ndarray) – Array of low resolution data with each feature equal in shape to lr_sample_shape
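The sampling described for `get_next()` can be sketched as follows: pick a random high-res window, then use the enhancement factors to derive the low-res window covering the same extent. The sketch makes several assumptions that are not confirmed by this page: arrays are ordered as (rows, cols, time, features), the high-res array is exactly the low-res array scaled by the enhancement factors, and the high-res window starts on a multiple of each factor so that it maps onto whole low-res cells. The name `sample_pair` is hypothetical.

```python
import numpy as np


def sample_pair(hr_data, lr_data, hr_sample_shape, s_enhance, t_enhance, rng):
    """Hypothetical helper: return (hr, lr) samples for the same extent."""
    factors = (s_enhance, s_enhance, t_enhance)
    hr_slices, lr_slices = [], []
    for axis, (hr_size, factor) in enumerate(zip(hr_sample_shape, factors)):
        lr_size = hr_size // factor
        # Number of valid low-res start positions along this axis.
        n_starts = lr_data.shape[axis] - lr_size + 1
        lr_start = int(rng.integers(0, n_starts))
        # The high-res start is the low-res start scaled by the factor.
        hr_start = lr_start * factor
        hr_slices.append(slice(hr_start, hr_start + hr_size))
        lr_slices.append(slice(lr_start, lr_start + lr_size))
    return hr_data[tuple(hr_slices)], lr_data[tuple(lr_slices)]


rng = np.random.default_rng(0)
hr_data = np.zeros((20, 20, 48, 2), dtype=np.float32)
lr_data = np.zeros((4, 4, 12, 3), dtype=np.float32)
hr, lr = sample_pair(hr_data, lr_data, (10, 10, 8), s_enhance=5, t_enhance=4, rng=rng)
print(hr.shape, lr.shape)  # (10, 10, 8, 2) (2, 2, 2, 3)
```

The feature axis is left untouched, so the two samples may carry different numbers of features.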