sup3r.preprocessing.data_handling.dual_data_handling.DualDataHandler

class DualDataHandler(hr_handler, lr_handler, cache_pattern=None, overwrite_cache=False, regrid_workers=1, load_cached=True, shuffle_time=False, regrid_lr=True, s_enhance=1, t_enhance=1, val_split=0.0)

Bases: CacheHandlingMixIn, TrainingPrepMixIn

Batch handling class that pairs h5 data as the high-res source (usually WTK) with netCDF data as the low-res source (usually ERA5).

NOTE: When initializing the lr_handler it’s important to pick a shape argument that will produce a low res domain that completely overlaps with the high res domain. When the high res data is not on a regular grid (WTK uses a Lambert projection) the low res shape is not simply the high res shape divided by s_enhance. The easiest option is to omit the shape argument for lr_handler entirely, so that the full low res domain is used.
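The overlap requirement in the note above can be checked approximately before building the handler. The following is a minimal sketch, not part of sup3r: `lr_covers_hr` and `regular_grid` are hypothetical helpers, and the lat/lon arrays are assumed to have shape (rows, cols, 2) with latitude first. It only compares bounding boxes, so it is a necessary check rather than a full coverage test for projected grids.

```python
import numpy as np


def regular_grid(lat0, lat1, lon0, lon1, n_lat, n_lon):
    """Build a (n_lat, n_lon, 2) lat/lon array on a regular grid (illustrative)."""
    lats = np.linspace(lat0, lat1, n_lat)
    lons = np.linspace(lon0, lon1, n_lon)
    lat2d, lon2d = np.meshgrid(lats, lons, indexing="ij")
    return np.stack([lat2d, lon2d], axis=-1)


def lr_covers_hr(lr_lat_lon, hr_lat_lon):
    """Return True if the low-res bounding box contains the high-res one."""
    lr_flat = lr_lat_lon.reshape(-1, 2)
    hr_flat = hr_lat_lon.reshape(-1, 2)
    covers_min = np.all(lr_flat.min(axis=0) <= hr_flat.min(axis=0))
    covers_max = np.all(lr_flat.max(axis=0) >= hr_flat.max(axis=0))
    return bool(covers_min and covers_max)


hr_grid = regular_grid(40.0, 41.0, -105.0, -104.0, 20, 20)
lr_full = regular_grid(39.0, 42.0, -106.0, -103.0, 13, 13)   # wider than hr
lr_small = regular_grid(40.2, 40.8, -104.8, -104.2, 4, 4)    # inside hr

full_ok = lr_covers_hr(lr_full, hr_grid)
small_ok = lr_covers_hr(lr_small, hr_grid)
```

Here `full_ok` is True and `small_ok` is False, which mirrors the advice to request the full low res domain rather than a hand-picked shape.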

Initialize data handler using hr and lr data handlers for h5 data and nc data

Parameters:
  • hr_handler (DataHandler) – DataHandler for high_res data

  • lr_handler (DataHandler) – DataHandler for low_res data

  • cache_pattern (str) – Pattern for files to use for saving regridded ERA data.

  • overwrite_cache (bool) – Whether to overwrite regrid cache

  • regrid_workers (int | None) – Number of workers to use for regridding routine.

  • load_cached (bool) – Whether to load cache to memory or wait until load_cached_data() is called.

  • shuffle_time (bool) – Whether to shuffle time indices prior to training/validation split

  • regrid_lr (bool) – Flag to regrid the low-res handler data to the high-res handler grid. This will take care of any minor inconsistencies in different projections. Disable this if the grids are known to be the same.

  • s_enhance (int) – Spatial enhancement factor

  • t_enhance (int) – Temporal enhancement factor

  • val_split (float) – Fraction of data to reserve for validation (e.g. 0.1 for 10%).
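How shuffle_time and val_split interact can be illustrated with a small sketch. This is an assumption-laden illustration and not the library's code: `split_time_indices` is a hypothetical helper, val_split is treated as a fraction between 0 and 1, and taking the validation indices from the front of the (optionally shuffled) index array is an arbitrary choice.

```python
import numpy as np


def split_time_indices(n_times, val_split=0.0, shuffle_time=False, seed=0):
    """Split range(n_times) into training and validation time indices."""
    indices = np.arange(n_times)
    if shuffle_time:
        # Shuffle before splitting so validation steps are spread over time.
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)
    n_val = int(val_split * n_times)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_time_indices(8, val_split=0.25)
train_shuf, val_shuf = split_time_indices(8, val_split=0.25, shuffle_time=True)
```

Without shuffling, the validation set is a contiguous block of time steps; with shuffling, it is a random subset of the same size.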

Methods

check_cached_features(features[, ...])

Check which features have been cached and check flags to determine whether to load or extract these features again

check_clear_data()

Check if data was cached and free memory if load_cached is False

get_cache_file_names(cache_pattern[, ...])

Get names of cache files from cache_pattern and feature names

get_data()

Check hr and lr shapes and trim hr data if needed to match required relationship to lr shape based on enhancement factors.

get_lr_data()

Check if era data is cached.

get_lr_regridded_data()

Regrid low_res data for all requested noncached features.

get_next()

Get next high_res + low_res.

get_regridder()

Get regridder object

load_cached_data()

Load regridded low_res and high_res cache data

load_lr_cached_data()

Load low_res cache data

normalize([means, stds, max_workers])

Normalize low_res and high_res data

parallel_load(data, cache_files, features[, ...])

Load feature data in parallel

Attributes

cache_files

Get file names of regridded cache data

cache_pattern

Get correct cache file pattern for formatting.

cached_features

List of features which have been requested but have been determined not to need extraction.

data

Get low res data.

feature_mem

Number of bytes for a single feature array.

features

Get a list of data features including features from both the lr and hr data handlers

grid_mem

Get memory used by a feature at a single time step

hr_exo_features

Get a list of high-resolution features that are only used for training e.g., mid-network high-res topo injection.

hr_lat_lon

Get high_res lat lon array

hr_out_features

Get a list of high-resolution features that are intended to be output by the GAN.

hr_required_shape

Return required shape for high_res data

hr_sample_shape

Get hr sample shape

lr_features

Get a list of low-resolution features.

lr_grid_shape

Return grid shape for regridded low_res data

lr_input_data

Get low res data used as input to regridding routine

lr_lat_lon

Get low_res lat lon array

lr_only_features

Features to use for training only and not output

lr_required_shape

Return required shape for regridded low_res data

lr_sample_shape

Get lr sample shape

means

Get the mean values for each feature.

noncached_features

Get list of features needing extraction or derivation

norm_workers

Get upper bound on workers used for normalization.

sample_shape

Get lr sample shape

shape

Get low_res shape

size

Get low_res size

stds

Get the standard deviation values for each feature.

try_load

Check if we should try to load cached data

val_data

Get low res validation data.

get_data()

Check hr and lr shapes and trim hr data if needed to match required relationship to lr shape based on enhancement factors. Then regrid lr data and split hr and lr data into training and validation sets.
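The shape relationship that get_data() enforces can be sketched as follows. This is one plausible reading, stated as an assumption: the low-res required shape is the high-res shape floor-divided by the enhancement factors, and the high-res data is trimmed to exact multiples of them. `required_shapes` and `trim_hr` are hypothetical names, and arrays are assumed to be laid out as (rows, cols, time, features).

```python
import numpy as np


def required_shapes(hr_shape, s_enhance, t_enhance):
    """Return (lr_required_shape, hr_required_shape) as (rows, cols, time)."""
    rows, cols, times = hr_shape[:3]
    lr_required = (rows // s_enhance, cols // s_enhance, times // t_enhance)
    hr_required = (
        lr_required[0] * s_enhance,
        lr_required[1] * s_enhance,
        lr_required[2] * t_enhance,
    )
    return lr_required, hr_required


def trim_hr(hr_data, s_enhance, t_enhance):
    """Drop trailing rows, columns, and time steps that do not fit evenly."""
    _, (rows, cols, times) = required_shapes(hr_data.shape, s_enhance, t_enhance)
    return hr_data[:rows, :cols, :times]


hr_data = np.zeros((21, 22, 50, 2), dtype=np.float32)
lr_required, hr_required = required_shapes(hr_data.shape, s_enhance=4, t_enhance=6)
trimmed = trim_hr(hr_data, s_enhance=4, t_enhance=6)
```

With a (21, 22, 50) high-res extent, s_enhance=4 and t_enhance=6, the low-res target is (5, 5, 8) and the high-res data is trimmed to (20, 20, 48).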

property means

Get the mean values for each feature. Mean values from the low-res data handler are prioritized because these are typically the “input” features

Returns:

dict

property stds

Get the standard deviation values for each feature. Standard deviation values from the low-res data handler are prioritized because these are typically the “input” features

Returns:

dict
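The priority rule for means and stds can be pictured as a dictionary merge in which low-res entries win. This is a minimal sketch under that assumption; `merge_stats` and the feature names are illustrative and not taken from the library.

```python
def merge_stats(hr_stats, lr_stats):
    """Combine per-feature statistics, letting low-res values override."""
    merged = dict(hr_stats)
    merged.update(lr_stats)  # low-res values take priority for shared features
    return merged


hr_means = {"U_100m": 5.0, "topography": 800.0}
lr_means = {"U_100m": 4.5}
means = merge_stats(hr_means, lr_means)
```

Features present only in the high-res handler (here `topography`) keep their high-res statistics, while shared features take the low-res value.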

normalize(means=None, stds=None, max_workers=None)

Normalize low_res and high_res data

Parameters:
  • means (dict | None) – Dictionary of means for all features with keys: feature names and values: mean values. If this is None, the self.means attribute will be used. If this is not None, this DataHandler object means attribute will be updated.

  • stds (dict | None) – Dictionary of standard deviation values for all features with keys: feature names and values: standard deviations. If this is None, the self.stds attribute will be used. If this is not None, this DataHandler object stds attribute will be updated.

  • max_workers (None | int) – Has no effect. Used to match MixIn class signature.
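The effect of normalizing with per-feature dictionaries can be sketched with NumPy. This is an illustration only: `normalize_features` is a hypothetical function, the feature axis is assumed to be last, and zero standard deviations are not handled.

```python
import numpy as np


def normalize_features(data, features, means, stds):
    """Return a copy of data with each feature scaled to (x - mean) / std."""
    out = data.astype(np.float32, copy=True)
    for i, name in enumerate(features):
        out[..., i] = (out[..., i] - means[name]) / stds[name]
    return out


features = ["U_100m", "temperature_2m"]  # illustrative feature names
data = np.stack(
    [np.full((2, 2, 3), 10.0), np.full((2, 2, 3), 300.0)], axis=-1
)
means = {"U_100m": 10.0, "temperature_2m": 290.0}
stds = {"U_100m": 2.0, "temperature_2m": 5.0}
normed = normalize_features(data, features, means, stds)
```

Passing the same `means` and `stds` dictionaries to both resolutions keeps the low-res inputs and high-res targets on a consistent scale.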

property features

Get a list of data features including features from both the lr and hr data handlers

property lr_only_features

Features to use for training only and not output

property lr_features

Get a list of low-resolution features. All low-resolution features are used for training.

property hr_exo_features

Get a list of high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set.

property hr_out_features

Get a list of high-resolution features that are intended to be output by the GAN. Does not include high-resolution exogenous features
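The relationship between the full high-res feature list, hr_exo_features, and hr_out_features can be sketched as a list split. `split_hr_features` is a hypothetical helper and the feature names are illustrative; the only rule taken from the text is that exogenous features sit at the end of the high-res feature set and are excluded from the output features.

```python
def split_hr_features(hr_features, hr_exo_features):
    """Return the output features, i.e. hr_features without trailing exo features."""
    n_exo = len(hr_exo_features)
    if n_exo and hr_features[-n_exo:] != list(hr_exo_features):
        raise ValueError("Exogenous features must come last in hr_features")
    return hr_features[: len(hr_features) - n_exo]


hr_features = ["U_100m", "V_100m", "topography"]
hr_out = split_hr_features(hr_features, ["topography"])
```

In this example `hr_out` contains only the two wind components; a feature list with `topography` anywhere but the end would raise a ValueError.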

property grid_mem

Get memory used by a feature at a single time step

Returns:

int – Number of bytes for a single feature array at a single time step

property feature_mem

Number of bytes for a single feature array. Used to estimate max_workers.

Returns:

int – Number of bytes for a single feature array
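As a rough illustration of the two memory estimates, the sketch below assumes float32 storage and that a feature array is the grid repeated over every time step. Both assumptions, and the function names, are illustrative rather than taken from the library source.

```python
import numpy as np


def grid_mem_bytes(grid_shape, dtype=np.float32):
    """Bytes used by one feature on the spatial grid at a single time step."""
    return int(np.prod(grid_shape)) * np.dtype(dtype).itemsize


def feature_mem_bytes(grid_shape, n_times, dtype=np.float32):
    """Bytes used by one feature over all time steps."""
    return grid_mem_bytes(grid_shape, dtype) * n_times


per_step = grid_mem_bytes((10, 20))
per_feature = feature_mem_bytes((10, 20), n_times=24)
```

An estimate like `per_feature` can be compared against available memory to bound how many features are processed concurrently.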

property sample_shape

Get lr sample shape

property lr_sample_shape

Get lr sample shape

property hr_sample_shape

Get hr sample shape

property data

Get low res data. Same as self.lr_data but used to match property used for computing means and stdevs

property val_data

Get low res validation data. Same as self.lr_val_data but used to match property used by normalization routine.

property lr_input_data

Get low res data used as input to regridding routine

property lr_required_shape

Return required shape for regridded low_res data

property shape

Get low_res shape

property size

Get low_res size

property hr_required_shape

Return required shape for high_res data

property lr_grid_shape

Return grid shape for regridded low_res data

property lr_lat_lon

Get low_res lat lon array

property hr_lat_lon

Get high_res lat lon array

property cache_files

Get file names of regridded cache data

property noncached_features

Get list of features needing extraction or derivation

property try_load

Check if we should try to load cached data

load_lr_cached_data()

Load low_res cache data

load_cached_data()

Load regridded low_res and high_res cache data

check_clear_data()

Check if data was cached and free memory if load_cached is False

get_lr_data()

Check if era data is cached. If not then extract data and regrid. Save to cache if cache pattern provided.

get_regridder()

Get regridder object

property cache_pattern

Get correct cache file pattern for formatting.

Returns:

_cache_pattern (str) – The cache file pattern with formatting keys included.

property cached_features

List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.

static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)

Check which features have been cached and check flags to determine whether to load or extract these features again

Parameters:
  • features (list) – list of features to extract

  • cache_files (list | None) – Path to files with saved feature data

  • overwrite_cache (bool) – Whether to overwrite cached files

  • load_cached (bool) – Whether to load data from cache files

Returns:

list – List of features to extract. Might not include features which have cache files.
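The decision this method describes can be sketched as a per-feature file check. This is a simplified illustration, not the library implementation: `features_to_extract` is a hypothetical function, the file names are made up, and the load_cached flag is left out on the assumption that it affects whether cached data is read into memory rather than which features need extraction.

```python
import os
import tempfile


def features_to_extract(features, cache_files=None, overwrite_cache=False):
    """Return the features that have no usable cache file."""
    if cache_files is None:
        return list(features)
    extract = []
    for feature, path in zip(features, cache_files):
        cached = os.path.exists(path)
        # Extract again if nothing is cached or the cache must be rewritten.
        if not cached or overwrite_cache:
            extract.append(feature)
    return extract


names = ["U_100m", "V_100m"]
with tempfile.TemporaryDirectory() as tmp:
    files = [os.path.join(tmp, f"cache_{name}.pkl") for name in names]
    with open(files[0], "wb") as handle:
        handle.write(b"")  # pretend the first feature is already cached
    keep_cache = features_to_extract(names, files)
    redo_all = features_to_extract(names, files, overwrite_cache=True)
no_cache = features_to_extract(names)
```

Only the feature without a cache file is returned when the cache is kept, and every feature is returned when overwrite_cache is set or no cache files are given.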

get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)

Get names of cache files from cache_pattern and feature names

Parameters:
  • cache_pattern (str) – Pattern to use for cache file names

  • grid_shape (tuple) – Shape of grid to use for cache file naming

  • time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming

  • target (tuple) – Target to use for cache file naming

  • features (list) – List of features to use for cache file naming

Returns:

list – List of cache file names
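A cache pattern is expanded into one file name per feature. The sketch below assumes a single `{feature}` placeholder; the placeholders the library actually supports (for example for grid shape, time index, or target) are not shown here, and `cache_file_names` is a hypothetical helper.

```python
def cache_file_names(cache_pattern, features):
    """Expand a pattern containing '{feature}' into one path per feature."""
    if "{feature}" not in cache_pattern:
        raise ValueError("cache_pattern must contain a '{feature}' placeholder")
    return [cache_pattern.replace("{feature}", name) for name in features]


paths = cache_file_names("./cache/lr_{feature}.pkl", ["U_100m", "V_100m"])
```

Keeping the feature name in each file name is what allows a later run to tell which features are already cached.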

get_lr_regridded_data()

Regrid low_res data for all requested noncached features. Load cached features if available and overwrite_cache=False

property norm_workers

Get upper bound on workers used for normalization.

parallel_load(data, cache_files, features, max_workers=None)

Load feature data in parallel

Parameters:
  • data (ndarray) – Array to fill with cached data

  • cache_files (list) – List of cache files for each feature

  • features (list) – List of requested features

  • max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.
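Filling a preallocated array from per-feature cache files in parallel can be sketched with a thread pool. This is an illustration under stated assumptions: the files are `.npy` arrays written with `np.save` (the library's own cache format may differ), the feature axis is last, and `parallel_load_sketch` is a hypothetical stand-in for the method.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed

import numpy as np


def parallel_load_sketch(data, cache_files, features, max_workers=None):
    """Fill data[..., i] with the array stored in cache_files[i]."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(np.load, path): i for i, path in enumerate(cache_files)
        }
        for future in as_completed(futures):
            i = futures[future]
            try:
                data[..., i] = future.result()
            except Exception as err:
                msg = f"Could not load {features[i]} from {cache_files[i]}"
                raise RuntimeError(msg) from err
    return data


features = ["U_100m", "V_100m"]
with tempfile.TemporaryDirectory() as tmp:
    cache_files = []
    for value, name in enumerate(features, start=1):
        path = os.path.join(tmp, f"{name}.npy")
        np.save(path, np.full((2, 3, 4), float(value), dtype=np.float32))
        cache_files.append(path)
    loaded = np.zeros((2, 3, 4, len(features)), dtype=np.float32)
    parallel_load_sketch(loaded, cache_files, features, max_workers=2)
```

Each worker writes to a distinct slice along the feature axis, so the threads never touch the same memory.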

get_next()

Get next high_res + low_res. Gets random spatiotemporal sample for h5 data and then uses enhancement factors to subsample interpolated/regridded low_res data for same spatiotemporal extent.

Returns:

  • hr_data (ndarray) – Array of high resolution data with each feature equal in shape to hr_sample_shape

  • lr_data (ndarray) – Array of low resolution data with each feature equal in shape to lr_sample_shape
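The paired sampling described for get_next() can be sketched with aligned slices. This is a simplified illustration and not the library's sampler: `sample_pair` is hypothetical, arrays are assumed to be (rows, cols, time, features), and the random start is drawn on the low-res grid and scaled by the enhancement factors so both samples cover the same extent.

```python
import numpy as np


def sample_pair(hr_data, lr_data, lr_sample_shape, s_enhance, t_enhance, rng):
    """Return (hr_sample, lr_sample) covering the same spatiotemporal extent."""
    rows, cols, times = lr_sample_shape
    r0 = rng.integers(0, lr_data.shape[0] - rows + 1)
    c0 = rng.integers(0, lr_data.shape[1] - cols + 1)
    t0 = rng.integers(0, lr_data.shape[2] - times + 1)
    lr_sample = lr_data[r0:r0 + rows, c0:c0 + cols, t0:t0 + times]
    hr_sample = hr_data[
        r0 * s_enhance:(r0 + rows) * s_enhance,
        c0 * s_enhance:(c0 + cols) * s_enhance,
        t0 * t_enhance:(t0 + times) * t_enhance,
    ]
    return hr_sample, lr_sample


rng = np.random.default_rng(0)
lr_data = rng.random((5, 5, 8, 1))
# Synthetic high-res data: each low-res cell repeated 4x4 in space, 6x in time.
hr_data = lr_data.repeat(4, axis=0).repeat(4, axis=1).repeat(6, axis=2)
hr_sample, lr_sample = sample_pair(hr_data, lr_data, (2, 2, 3), 4, 6, rng)
```

Because the synthetic high-res array is a block-repeated copy of the low-res array, subsampling `hr_sample` by the enhancement factors recovers `lr_sample`, which confirms the two slices are aligned.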