sup3r.preprocessing.data_handling.base.DataHandlerDC

class DataHandlerDC(file_paths, features, target=None, shape=None, max_delta=20, temporal_slice=slice(None, None, 1), hr_spatial_coarsen=None, time_roll=0, val_split=0.0, sample_shape=(10, 10, 1), raster_file=None, raster_index=None, shuffle_time=False, time_chunk_size=None, cache_pattern=None, overwrite_cache=False, overwrite_ti_cache=False, load_cached=False, lr_only_features=(), hr_exo_features=(), handle_features=None, single_ts_files=None, mask_nan=False, fill_nan=False, worker_kwargs=None, res_kwargs=None)

Bases: DataHandler

Data-centric data handler

Parameters:

file_paths (str | list) – A single source h5 wind file to extract raster data from or a list of netcdf files with identical grid. The string can be a unix-style file path which will be passed through glob.glob
features (list) – list of features to extract from the provided data
target (tuple) – (lat, lon) lower left corner of raster. Either need target+shape or raster_file.
shape (tuple) – (rows, cols) grid size. Either need target+shape or raster_file.
max_delta (int, optional) – Optional maximum limit on the raster shape that is retrieved at once. If shape is (20, 20) and max_delta=10, the full raster will be retrieved in four chunks of (10, 10). This helps adapt to non-regular grids that curve over large distances, by default 20
temporal_slice (slice) – Slice specifying extent and step of temporal extraction. e.g. slice(start, stop, time_pruning). If equal to slice(None, None, 1) the full time dimension is selected.
hr_spatial_coarsen (int | None) – Optional input to coarsen the high-resolution spatial field. This can be used if (for example) you have 2km source data, but you want the final high res prediction target to be 4km resolution, then hr_spatial_coarsen would be 2 so that the GAN is trained on aggregated 4km high-res data.
time_roll (int) – The number of places by which elements are shifted in the time axis. Can be used to convert data to different timezones. This is passed to np.roll(a, time_roll, axis=2) and happens AFTER the temporal_slice operation.
val_split (float32) – Fraction of data to store for validation
sample_shape (tuple) – Size of spatial and temporal domain used in a single high-res observation for batching
raster_file (str | None) – .txt file for raster_index array for the corresponding target and shape. If specified the raster_index will be loaded from the file if it exists or written to the file if it does not yet exist. If None and raster_index is not provided raster_index will be calculated directly. Either need target+shape, raster_file, or raster_index input.
raster_index (list) – List of tuples or slices. Used as an alternative to computing the raster index from target+shape or loading the raster index from file
shuffle_time (bool) – Whether to shuffle time indices before validation split
time_chunk_size (int) – Size of chunks to split time dimension into for parallel data extraction. If running in serial this can be set to the size of the full time index for best performance.
cache_pattern (str | None) – Pattern for files for saving feature data. e.g. file_path_{feature}.pkl. Each feature will be saved to a file with the feature name replaced in cache_pattern. If not None feature arrays will be saved here and not stored in self.data until load_cached_data is called. The cache_pattern can also include {shape}, {target}, {times} which will help ensure unique cache files for complex problems.
overwrite_cache (bool) – Whether to overwrite any previously saved cache files.
overwrite_ti_cache (bool) – Whether to overwrite any previously saved time index cache files.
load_cached (bool) – Whether to load data from cache files
lr_only_features (list | tuple) – List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.
hr_exo_features (list | tuple) – List of feature names or patterns that should be included in the high-resolution observation but not expected to be output from the generative model. An example is high-res topography that is to be injected mid-network.
handle_features (list | None) – Optional list of features which are available in the provided data. Providing this eliminates the need for an initial search of available features prior to data extraction.
single_ts_files (bool | None) – Whether input files are single time steps or not. If they are this enables some reduced computation. If None then this will be determined from file_paths directly.
mask_nan (bool) – Flag to mask out (remove) any timesteps with NaN data from the source dataset. This is False by default because it can create discontinuities in the timeseries.
fill_nan (bool) – Flag to gap-fill any NaN data from the source dataset using a nearest neighbor algorithm. This is False by default because it can hide bad datasets that should be identified by the user.
worker_kwargs (dict | None) – Dictionary of worker values. Can include max_workers, extract_workers, compute_workers, load_workers, norm_workers, and ti_workers. Each argument needs to be an integer or None.

The value of max_workers will set the value of all other worker args. If max_workers == 1 then all processes will be serialized. If max_workers == None then other worker args will use their own provided values.

extract_workers is the max number of workers to use for extracting features from source data. If None it will be estimated based on memory limits. If 1 processes will be serialized. compute_workers is the max number of workers to use for computing derived features from raw features in source data. load_workers is the max number of workers to use for loading cached feature data. norm_workers is the max number of workers to use for normalizing feature data. ti_workers is the max number of workers to use to get full time index. Useful when there are many input files each with a single time step. If this is greater than one, time indices for input files will be extracted in parallel and then concatenated to get the full time index. If input files do not all have time indices or if there are few input files this should be set to one.
res_kwargs (dict | None) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'concat_dim': 'Time', 'combine': 'nested', 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **res_kwargs)
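
The max_delta description above (a (20, 20) shape retrieved in four chunks of (10, 10)) can be illustrated with a short sketch. This only shows the chunking arithmetic and is not the sup3r implementation; the helper name chunk_slices is hypothetical.

```python
def chunk_slices(shape, max_delta):
    """Split a (rows, cols) raster into slice pairs no larger than max_delta.

    Hypothetical helper illustrating the max_delta description; not a sup3r API.
    """
    rows, cols = shape
    row_slices = [slice(r, min(r + max_delta, rows))
                  for r in range(0, rows, max_delta)]
    col_slices = [slice(c, min(c + max_delta, cols))
                  for c in range(0, cols, max_delta)]
    # One (row_slice, col_slice) pair per chunk that would be retrieved
    return [(rs, cs) for rs in row_slices for cs in col_slices]


chunks = chunk_slices((20, 20), max_delta=10)
chunk_shapes = [(rs.stop - rs.start, cs.stop - cs.start) for rs, cs in chunks]
```

With shape=(20, 20) and max_delta=10 this yields four (row, col) slice pairs, each covering a (10, 10) block.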

Methods

`cache_data`(cache_file_paths)	Cache feature data to file and delete from memory
`cap_worker_args`(max_workers)	Cap all workers args by max_workers
`check_cached_features`(features[, ...])	Check which features have been cached and check flags to determine whether to load or extract these features again
`check_clear_data`()	Check if data is cached and clear data if not load_cached
`clear_data`()	Free memory used for data arrays
`data_fill`(shifted_time_chunks[, max_workers])	Fill final data array with extracted / computed chunks
`extract_feature`(file_paths, raster_index, ...)	Extract single feature from data source
`get_cache_file_names`(cache_pattern[, ...])	Get names of cache files from cache_pattern and feature names
`get_capped_workers`(max_workers_cap, max_workers)	Get max number of workers for a given job.
`get_closest_lat_lon`(lat_lon, target)	Get closest indices to target lat lon
`get_full_domain`(file_paths)	Get target and shape for full domain
`get_handle_features`(file_paths)	Get all available features in input data
`get_input_arrays`(data, chunk_number, f, ...)	Get only arrays needed for computations
`get_inputs_recursive`(feature, handle_features)	Lookup inputs needed to compute feature.
`get_lat_lon`(file_paths, raster_index[, ...])	Get lat/lon grid for requested target and shape
`get_lat_lon_df`(target[, features])	Get timeseries for given target
`get_next`([temporal_weights, spatial_weights])	Get data for observation using weighted random observation index.
`get_node_cmd`(config)	Get a CLI call to initialize DataHandler and cache data.
`get_observation_index`([temporal_weights, ...])	Randomly gets weighted spatial sample and time sample
`get_raster_index`()	Get raster index for file data.
`get_raw_feature_list`(features, handle_features)	Lookup inputs needed to compute feature
`get_time_index`(file_paths[, max_workers])	Get raw time index for source data
`has_exact_feature`(feature, handle)	Check if exact feature is in handle
`has_multilevel_feature`(feature, handle)	Check if handle contains multilevel data for the given feature
`has_surrounding_features`(feature, handle)	Check if handle has feature values at surrounding heights.
`lats_are_descending`(lat_lon)	Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner).
`lin_bc`(bc_files[, threshold])	Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc
`load_cached_data`([with_split])	Load data from cache files and split into training and validation
`lookup`(feature, attr_name[, handle_features])	Lookup feature in feature registry
`mask_nan`()	Drop timesteps with NaN data
`normalize`([means, stds, features, max_workers])	Normalize all data features.
`parallel_compute`(data, file_paths, ...[, ...])	Compute features using parallel subprocesses
`parallel_extract`(file_paths, raster_index, ...)	Extract features using parallel subprocesses
`parallel_load`(data, cache_files, features[, ...])	Load feature data in parallel
`pop_old_data`(data, chunk_number, all_features)	Remove input feature data if no longer needed for requested features
`preflight`()	Run some preflight checks and verify that the inputs are valid
`qdm_bc`(bc_files, reference_feature[, ...])	Bias Correction using Quantile Delta Mapping
`recursive_compute`(data, feature, ...)	Compute intermediate features recursively
`run_all_data_init`()	Build base 4D data array.
`run_data_compute`()	Run the data computation / derivation from raw features to desired features.
`run_data_extraction`()	Run the raw dataset extraction process from disk to raw un-manipulated datasets.
`run_nn_fill`()	Run nn nan fill on full data array.
`serial_compute`(data, file_paths, ...)	Compute features in series
`serial_data_fill`(shifted_time_chunks)	Fill final data array in serial
`serial_extract`(file_paths, raster_index, ...)	Extract features in series
`source_handler`(file_paths, **kwargs)	Handle for source data.
`split_data`([data, val_split, shuffle_time])	Split time dimension into set of training indices and validation indices
`time_index_conflict_check`()	Check if the number of input files and the length of the time index is the same
`valid_handle_features`(features, handle_features)	Check if features are in handle
`valid_input_features`(features, handle_features)	Check if features are in handle or have compute methods

Attributes

`FEATURE_REGISTRY`
`attrs`	Get attributes of input data
`cache_files`	Cache files for storing extracted data
`cache_pattern`	Get correct cache file pattern for formatting.
`cached_features`	List of features which have been requested but have been determined not to need extraction.
`compute_workers`	Get upper bound for compute workers based on memory limits.
`derive_features`	List of features which need to be derived from other features
`extract_features`	Features to extract directly from the source handler
`extract_workers`	Get upper bound for extract workers based on memory limits.
`feature_mem`	Number of bytes for a single feature array.
`file_paths`	Get file paths for input data
`full_raw_lat_lon`	Get the full lat/lon grid without doing any latitude inversion
`grid_mem`	Get memory used by a feature at a single time step
`grid_shape`	Get shape of raster
`handle_features`	All features available in raw input
`hr_exo_features`	Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection.
`hr_out_features`	Get a list of high-resolution features that are intended to be output by the GAN.
`input_file_info`	Method to provide info about files in log output.
`invert_lat`	Whether to invert the latitude axis during data extraction.
`is_time_independent`	Get whether source data files are time independent
`lat_lon`	Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.
`latitude`	Flattened list of latitudes
`load_workers`	Get upper bound on load workers based on memory limits.
`longitude`	Flattened list of longitudes
`lr_features`	Get a list of low-resolution features.
`lr_only_features`	List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.
`means`	Get the mean values for each feature.
`meta`	Meta dataframe with coordinates.
`n_tsteps`	Get number of time steps to extract
`need_full_domain`	Check whether we need to get the full lat/lon grid to determine target and shape values
`noncached_features`	Get list of features needing extraction or derivation
`norm_workers`	Get upper bound on workers used for normalization.
`raster_index`	Raster index property
`raw_features`	Get list of features needed for computations
`raw_lat_lon`	Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.
`raw_time_index`	Time index for input data without time pruning.
`raw_tsteps`	Get number of time steps for all input files
`requested_shape`	Get requested shape for cached data
`shape`	Full data shape
`single_ts_files`	Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice
`size`	Size of data array
`source_type`	Get data type for source files.
`stds`	Get the standard deviation values for each feature.
`target`	Get lower left corner of raster
`temporal_slice`	Get temporal range to extract from full dataset
`ti_workers`	Get max number of workers for computing time index
`time_chunk_size`	Get upper bound on time chunk size based on memory limits
`time_chunks`	Get time chunks which will be extracted from source data
`time_freq_hours`	Get the time frequency in hours as a float
`time_index`	Time index for input data with time pruning.
`time_index_file`	Get time index file path
`try_load`	Check if we should try to load cache

property attrs

Get attributes of input data

Returns: dict – Dictionary of attributes

cache_data(cache_file_paths)

Cache feature data to file and delete from memory

Parameters: cache_file_paths (str | None) – Path to file for saving feature data

property cache_files: Cache files for storing extracted data

property cache_pattern

Get correct cache file pattern for formatting.

Returns: _cache_pattern (str) – The cache file pattern with formatting keys included.

property cached_features: List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.

cap_worker_args(max_workers): Cap all workers args by max_workers

static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)

Check which features have been cached and check flags to determine whether to load or extract these features again

Parameters:

features (list) – list of features to extract
cache_files (list | None) – Path to files with saved feature data
overwrite_cache (bool) – Whether to overwrite cached files
load_cached (bool) – Whether to load data from cache files

Returns:

list – List of features to extract. Might not include features which have cache files.

check_clear_data(): Check if data is cached and clear data if not load_cached

clear_data(): Free memory used for data arrays

property compute_workers: Get upper bound for compute workers based on memory limits. Used to compute derived features from source dataset.

data_fill(shifted_time_chunks, max_workers=None)

Fill final data array with extracted / computed chunks

Parameters:

shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array
max_workers (int | None) – Max number of workers to use for building final data array. If None max available workers will be used. If 1 cached data will be loaded in serial

property derive_features: List of features which need to be derived from other features

abstract classmethod extract_feature(file_paths, raster_index, feature, time_slice=slice(None, None, None), **kwargs)

Extract single feature from data source

Parameters:

file_paths (list) – path to data file
raster_index (ndarray) – Raster index array
time_slice (slice) – slice of time to extract
feature (str) – Feature to extract from data
kwargs (dict) – Keyword arguments passed to source handler

Returns:

ndarray – Data array for extracted feature (spatial_1, spatial_2, temporal)

property extract_features: Features to extract directly from the source handler

property extract_workers: Get upper bound for extract workers based on memory limits. Used to extract data from source dataset. The max number of extract workers is number of time chunks * number of features

property feature_mem

Number of bytes for a single feature array. Used to estimate max_workers.

Returns: int – Number of bytes for a single feature array

property file_paths: Get file paths for input data

property full_raw_lat_lon: Get the full lat/lon grid without doing any latitude inversion

get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)

Get names of cache files from cache_pattern and feature names

Parameters:

cache_pattern (str) – Pattern to use for cache file names
grid_shape (tuple) – Shape of grid to use for cache file naming
time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming
target (tuple) – Target to use for cache file naming
features (list) – List of features to use for cache file naming

Returns:

list – List of cache file names
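
As a rough illustration of how a cache_pattern with {feature}, {shape}, {target}, and {times} keys could expand into one file name per feature, here is a minimal sketch using str.format. The helper name and the exact string layout used for the shape, target, and times fields are assumptions and may differ from what sup3r writes.

```python
def cache_file_names(cache_pattern, features, grid_shape=None, target=None,
                     times=None):
    """Fill a cache_pattern once per feature (hypothetical helper)."""
    names = []
    for feature in features:
        fields = {"feature": feature}
        # The layouts below are assumptions made for this sketch only
        if grid_shape is not None:
            fields["shape"] = f"{grid_shape[0]}x{grid_shape[1]}"
        if target is not None:
            fields["target"] = f"{target[0]}_{target[1]}"
        if times is not None:
            fields["times"] = f"{times[0]}_{times[-1]}"
        # Unused fields are ignored by str.format
        names.append(cache_pattern.format(**fields))
    return names


names = cache_file_names("cache_{feature}_{shape}.pkl",
                         ["U_100m", "V_100m"], grid_shape=(20, 20))
```

Including the shape (or target, or times) in the pattern keeps cache files from different domains from overwriting each other.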

static get_capped_workers(max_workers_cap, max_workers)

Get max number of workers for a given job. Capped to global max workers if specified

Parameters:

max_workers_cap (int | None) – Cap for job specific max_workers
max_workers (int | None) – Job specific max_workers

Returns:

max_workers (int | None) – job specific max_workers capped by max_workers_cap if provided
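
One plausible reading of the capping rule is sketched below. Treating None as "not specified" on either side is an assumption, and the function name is hypothetical; this is not the sup3r source.

```python
def capped_workers(max_workers_cap, max_workers):
    """Return max_workers limited by max_workers_cap.

    None is treated as "not specified" (an assumption of this sketch).
    """
    if max_workers_cap is None:
        return max_workers
    if max_workers is None:
        return max_workers_cap
    return min(max_workers_cap, max_workers)


job_workers = capped_workers(max_workers_cap=2, max_workers=8)
```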

static get_closest_lat_lon(lat_lon, target)

Get closest indices to target lat lon

Parameters:

lat_lon (ndarray) – Array of lat/lon (spatial_1, spatial_2, 2) Last dimension in order of (lat, lon)
target (tuple) – (lat, lon) for target coordinate

Returns:

row (int) – row index for closest lat/lon to target lat/lon
col (int) – col index for closest lat/lon to target lat/lon
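
The lookup can be pictured as a nearest-neighbor search over the grid. The sketch below uses plain Euclidean distance in degrees on nested lists; both the distance metric and the helper name are assumptions, and the real method operates on an ndarray.

```python
import math


def closest_lat_lon(lat_lon, target):
    """Return (row, col) of the grid point nearest to target=(lat, lon).

    lat_lon is shaped (spatial_1, spatial_2, 2) with (lat, lon) last.
    """
    best = None
    for row, line in enumerate(lat_lon):
        for col, (lat, lon) in enumerate(line):
            dist = math.hypot(lat - target[0], lon - target[1])
            if best is None or dist < best[0]:
                best = (dist, row, col)
    return best[1], best[2]


# 2 x 3 grid with descending latitude, so the lower left corner is [-1][0]
grid = [
    [(41.0, -105.0), (41.0, -104.0), (41.0, -103.0)],
    [(40.0, -105.0), (40.0, -104.0), (40.0, -103.0)],
]
row, col = closest_lat_lon(grid, (40.1, -103.8))
```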

abstract classmethod get_full_domain(file_paths): Get target and shape for full domain

classmethod get_handle_features(file_paths)

Get all available features in input data

Parameters: file_paths (list) – List of input file paths
Returns: handle_features (list) – List of available input features

classmethod get_input_arrays(data, chunk_number, f, handle_features)

Get only arrays needed for computations

Parameters:

data (dict) – Dictionary of feature arrays
chunk_number (int) – time chunk for which to get input arrays
f (str) – feature to compute using input arrays
handle_features (list) – Features available in raw data

Returns:

dict – Dictionary of arrays with only needed features

classmethod get_inputs_recursive(feature, handle_features)

Lookup inputs needed to compute feature. Walk through inputs methods for each required feature to get all raw features.

Parameters:

feature (str) – Feature for which to get needed inputs for derivation
handle_features (list) – Features available in raw data

Returns:

list – List of input features
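
The recursive walk can be sketched with a small dictionary standing in for the feature registry. The registry contents, the feature names wind_power_100m and rho_100m, and the helper name are all hypothetical; the real registry resolves inputs through per-feature inputs methods.

```python
# Hypothetical registry: derived feature -> features it is computed from
INPUTS = {
    "wind_power_100m": ["windspeed_100m", "rho_100m"],
    "windspeed_100m": ["U_100m", "V_100m"],
}


def inputs_recursive(feature, handle_features, registry):
    """Walk the registry until every input is available in the raw data."""
    if feature in handle_features or feature not in registry:
        return [feature]
    raw = []
    for needed in registry[feature]:
        for name in inputs_recursive(needed, handle_features, registry):
            if name not in raw:  # keep order, drop repeats
                raw.append(name)
    return raw


raw_inputs = inputs_recursive(
    "wind_power_100m", ["U_100m", "V_100m", "rho_100m"], INPUTS)
```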

classmethod get_lat_lon(file_paths, raster_index, invert_lat=False)

Get lat/lon grid for requested target and shape

Parameters:

file_paths (list) – path to data file
raster_index (ndarray | list) – Raster index array or list of slices
invert_lat (bool) – Flag to invert data along the latitude axis. Wrf data tends to use an increasing ordering for latitude while wtk uses a decreasing ordering.

Returns:

ndarray – (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension

get_lat_lon_df(target, features=None)

Get timeseries for given target

Parameters:

target (tuple) – (lat, lon) for target coordinate
features (list | None) – Optional list of features to include in returned data. If None then all available features are returned.

Returns:

df (pd.DataFrame) – Pandas dataframe with columns for each feature and timeindex for the given target

classmethod get_node_cmd(config)

Get a CLI call to initialize DataHandler and cache data.

Parameters: config (dict) – sup3r data handler config with all necessary args and kwargs to initialize DataHandler and run data extraction.

abstract get_raster_index()

Get raster index for file data. Here we assume the list of paths in file_paths all have data with the same spatial domain. We use the first file in the list to compute the raster.

Returns: raster_index (np.ndarray) – 2D array of grid indices for H5 or list of slices for NETCDF

classmethod get_raw_feature_list(features, handle_features)

Lookup inputs needed to compute feature

Parameters:

features (list) – Features for which to get needed inputs for derivation
handle_features (list) – Features available in raw data

Returns:

list – List of input features

abstract get_time_index(file_paths, max_workers=None, **kwargs): Get raw time index for source data

property grid_mem

Get memory used by a feature at a single time step

Returns: int – Number of bytes for a single feature array at a single time step

property grid_shape

Get shape of raster

Returns: _grid_shape (tuple) – (rows, cols) grid size.

property handle_features: All features available in raw input

classmethod has_exact_feature(feature, handle)

Check if exact feature is in handle

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains exact feature or not

classmethod has_multilevel_feature(feature, handle)

Check if handle contains multilevel data for the given feature

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains multilevel data for given feature

classmethod has_surrounding_features(feature, handle)

Check if handle has feature values at surrounding heights. e.g. if feature=U_40m check if the handler has u at heights below and above 40m

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether feature has surrounding heights
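
The U_40m example above amounts to parsing the height suffix and looking for the same variable at lower and higher levels. A minimal sketch follows, assuming a "<name>_<height>m" naming convention and a plain list of available names in place of an xarray.Dataset handle; the helper name is hypothetical.

```python
import re

HEIGHT_PATTERN = re.compile(r"(.+)_(\d+)m")


def has_surrounding_heights(feature, available):
    """Check for the same variable both below and above the requested height."""
    match = HEIGHT_PATTERN.fullmatch(feature)
    if match is None:
        return False
    base, height = match.group(1), int(match.group(2))
    heights = []
    for name in available:
        other = HEIGHT_PATTERN.fullmatch(name)
        if other and other.group(1) == base:
            heights.append(int(other.group(2)))
    return any(h < height for h in heights) and any(h > height for h in heights)


bracketed = has_surrounding_heights("U_40m", ["U_10m", "U_100m", "V_10m"])
```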

property hr_exo_features: Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set. These can also be input to the model as low-res features.

property hr_out_features: Get a list of high-resolution features that are intended to be output by the GAN. Does not include high-resolution exogenous features

property input_file_info

Method to provide info about files in log output. Since NETCDF files have single time slices printing out all the file paths is just a text dump without much info.

Returns: str – message to append to log output that does not include a huge info dump of file paths

property invert_lat: Whether to invert the latitude axis during data extraction. This is to enforce a descending latitude ordering so that the lower left corner of the grid is at idx=(-1, 0) instead of idx=(0, 0)

property is_time_independent: Get whether source data files are time independent

property lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This ensures that the lower left hand corner of the domain is given by lat_lon[-1, 0]

Returns: ndarray

property latitude: Flattened list of latitudes

classmethod lats_are_descending(lat_lon)

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner)

Parameters: lat_lon (np.ndarray) – Lat/Lon array with shape (n_lats, n_lons, 2)
Returns: bool
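
The descending-latitude convention (lower left corner at index [-1, 0]) can be sketched on nested lists as below. Comparing only the first column of the first and last rows is a simplifying assumption, and both helper names are hypothetical.

```python
def lats_descending(lat_lon):
    """True when latitude decreases from the first row to the last row."""
    return lat_lon[0][0][0] > lat_lon[-1][0][0]


def enforce_descending(lat_lon):
    """Flip the row order if needed so the lower left corner is at [-1][0]."""
    return lat_lon if lats_descending(lat_lon) else lat_lon[::-1]


# Grid stored with ascending latitude: lower left corner is at [0][0]
ascending = [
    [(40.0, -105.0), (40.0, -104.0)],
    [(41.0, -105.0), (41.0, -104.0)],
]
fixed = enforce_descending(ascending)
```

After the flip, fixed[-1][0] holds the southwestern grid point, which matches the lat_lon[-1, 0] convention described for lat_lon.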

lin_bc(bc_files, threshold=0.1)

Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc

Parameters:

bc_files (list | tuple | str) – One or more filepaths to .h5 files output by MonthlyLinearCorrection or LinearCorrection. These should contain datasets named “{feature}_scalar” and “{feature}_adder” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time is length 1 for annual correction or 12 for monthly correction.
threshold (float) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
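
A linear correction of this kind is commonly applied as value * scalar + adder. The sketch below applies it to a single-pixel timeseries and selects the factor by calendar month when 12 factors are given; the formula order, the month indexing, and the helper name are assumptions rather than the sup3r source.

```python
def linear_bias_correct(values, months, scalar, adder):
    """Apply value * scalar + adder to a single-pixel timeseries.

    scalar and adder hold 1 entry (annual) or 12 entries (monthly).
    months holds the calendar month (1-12) of each value.
    """
    corrected = []
    for value, month in zip(values, months):
        idx = 0 if len(scalar) == 1 else month - 1
        corrected.append(value * scalar[idx] + adder[idx])
    return corrected


# Annual correction: one scalar and one adder for all time steps
annual = linear_bias_correct([4.0, 6.0], months=[1, 7],
                             scalar=[1.5], adder=[0.5])

# Monthly correction: only July (index 6) is rescaled here
monthly_scalar = [1.0] * 12
monthly_scalar[6] = 2.0
monthly = linear_bias_correct([4.0, 6.0], months=[1, 7],
                              scalar=monthly_scalar, adder=[0.0] * 12)
```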

load_cached_data(with_split=True)

Load data from cache files and split into training and validation

Parameters: with_split (bool) – Whether to split into training and validation data or not.

property load_workers: Get upper bound on load workers based on memory limits. Used to load cached data.

property longitude: Flattened list of longitudes

classmethod lookup(feature, attr_name, handle_features=None)

Lookup feature in feature registry

Parameters:

feature (str) – Feature to lookup in registry
attr_name (str) – Type of method to lookup. e.g. inputs or compute
handle_features (list) – List of feature names (datasets) available in the source file. If feature is found explicitly in this list, height/pressure suffixes will not be appended to the output.

Returns:

method | None – Feature registry method corresponding to feature

property lr_features: Get a list of low-resolution features. It is assumed that all features are used in the low-resolution observations. If you want to use high-res-only features, use the DualDataHandler class.

property lr_only_features: List of feature names or patt*erns that should only be included in the low-res training set and not the high-res observations.

mask_nan(): Drop timesteps with NaN data

property means

Get the mean values for each feature.

Returns: dict

property meta: Meta dataframe with coordinates.

property n_tsteps: Get number of time steps to extract

property need_full_domain: Check whether we need to get the full lat/lon grid to determine target and shape values

property noncached_features: Get list of features needing extraction or derivation

property norm_workers: Get upper bound on workers used for normalization.

normalize(means=None, stds=None, features=None, max_workers=None)

Normalize all data features.

Parameters:

means (dict | None) – Dictionary of means for all features with keys: feature names and values: mean values. If this is None, the self.means attribute will be used. If this is not None, this DataHandler object means attribute will be updated.
stds (dict | None) – Dictionary of standard deviation values for all features with keys: feature names and values: standard deviations. If this is None, the self.stds attribute will be used. If this is not None, this DataHandler object stds attribute will be updated.
features (list | None) – List of features used for indexing data array during normalization.
max_workers (None | int) – Max workers to perform normalization. If None, self.norm_workers will be used
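
The means and stds dictionaries are keyed by feature name. Assuming the usual standardization (x - mean) / std, a per-feature sketch on plain lists looks like the following; the real method works in place on the handler's data array, and the function name is hypothetical.

```python
def normalize_features(data, means, stds):
    """Return {feature: [(x - mean) / std, ...]} for every feature in data."""
    return {
        feature: [(x - means[feature]) / stds[feature] for x in values]
        for feature, values in data.items()
    }


data = {"U_100m": [2.0, 4.0, 6.0]}
normed = normalize_features(data, means={"U_100m": 4.0}, stds={"U_100m": 2.0})
```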

classmethod parallel_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features, max_workers=None)

Compute features using parallel subprocesses

Parameters:

data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
derived_features (list) – list of feature strings which need to be derived
all_features (list) – list of all features including those requiring derivation from input features
handle_features (list) – Features available in raw data
max_workers (int | None) – Number of max workers to use for computation. If equal to 1 then method is run in serial

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. Includes e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

classmethod parallel_extract(file_paths, raster_index, time_chunks, input_features, max_workers=None, **kwargs)

Extract features using parallel subprocesses

Parameters:

file_paths (list) – list of file paths
raster_index (ndarray | list) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
input_features (list) – list of input feature strings
max_workers (int | None) – Number of max workers to use for extraction. If equal to 1 then method is run in serial
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
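
The data[chunk_number][feature] layout can be sketched with one task per (time chunk, feature) pair. This example uses a thread pool and a stand-in extraction function for portability; the documented method uses parallel subprocesses and a real source handler, so treat the names fake_extract and extract_chunks as hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_extract(feature, time_slice):
    """Stand-in for real extraction: returns the requested time steps."""
    return list(range(24))[time_slice]


def extract_chunks(time_chunks, features, max_workers=None):
    """Build data[chunk_number][feature], one task per (chunk, feature)."""
    data = {i: {} for i in range(len(time_chunks))}
    if max_workers == 1:
        # Serial path, mirroring "If equal to 1 then method is run in serial"
        for i, t_slice in enumerate(time_chunks):
            for f in features:
                data[i][f] = fake_extract(f, t_slice)
        return data
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(fake_extract, f, t_slice): (i, f)
            for i, t_slice in enumerate(time_chunks)
            for f in features
        }
        for future, (i, f) in futures.items():
            data[i][f] = future.result()
    return data


time_chunks = [slice(0, 12), slice(12, 24)]
data = extract_chunks(time_chunks, ["U_100m", "V_100m"], max_workers=2)
serial = extract_chunks(time_chunks, ["U_100m", "V_100m"], max_workers=1)
```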

parallel_load(data, cache_files, features, max_workers=None)

Load feature data in parallel

Parameters:

data (ndarray) – Array to fill with cached data
cache_files (list) – List of cache files for each feature
features (list) – List of requested features
max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.

classmethod pop_old_data(data, chunk_number, all_features)

Remove input feature data if no longer needed for requested features

Parameters:

data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
chunk_number (int) – time chunk index to check
all_features (list) – list of all requested features including those requiring derivation from input features

preflight(): Run some preflight checks and verify that the inputs are valid

qdm_bc(bc_files, reference_feature, relative=True, threshold=0.1, no_trend=False)

Bias Correction using Quantile Delta Mapping

Bias correct this DataHandler’s data with Quantile Delta Mapping. The required statistical distributions should be pre-calculated using sup3r.bias.qdm.QuantileDeltaMappingCorrection.

Warning: There is no guarantee that the coefficients from bc_files match the resource processed here. Be careful choosing bc_files.

Parameters:

bc_files (list | tuple | str) – One or more filepaths to .h5 files output by bias_calc.QuantileDeltaMappingCorrection. These should contain datasets named “base_{reference_feature}_params”, “bias_{feature}_params”, and “bias_fut_{feature}_params” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time.
reference_feature (str) – Name of the feature used as (historical) reference. Dataset with name “base_{reference_feature}_params” will be retrieved from bc_files.
relative (bool, default=True) – Switch to apply QDM as a relative (use True) or absolute (use False) correction value.
threshold (float, default=0.1) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
no_trend (bool, default=False) – Option to ignore the trend component of the correction, which results in an ordinary Quantile Mapping, i.e. the bias is corrected by comparing the distribution of the biased dataset with that of a reference dataset. See params_mf of rex.utilities.bc_utils.QuantileDeltaMapping. Note that this assumes that “bias_{feature}_params” (params_mh) is the data distribution representative of the target data.
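
To make the roles of relative and no_trend concrete, the following is a minimal empirical sketch of Quantile Delta Mapping on 1D samples using only NumPy. It is not the sup3r or rex implementation, which works from the pre-calculated distribution parameters stored in bc_files. The names qdm_sketch, oh, mh and mf are hypothetical and stand for observed-historical, modeled-historical and modeled-future samples.

```python
import numpy as np


def _ecdf(sample, x):
    """Empirical non-exceedance probability of x within sample."""
    s = np.sort(sample)
    return np.interp(x, s, np.linspace(0.0, 1.0, len(s)))


def qdm_sketch(oh, mh, mf, relative=True, no_trend=False):
    """Empirical QDM on 1D samples (illustrative only, not the sup3r API)."""
    oh, mh, mf = (np.asarray(a, dtype=float) for a in (oh, mh, mf))
    if no_trend:
        # Ordinary quantile mapping: mh is assumed to be representative of
        # the data being corrected, so no future delta is carried over.
        return np.quantile(oh, _ecdf(mh, mf))
    tau = _ecdf(mf, mf)               # quantile of each value in its own period
    mh_at_tau = np.quantile(mh, tau)  # modeled-historical value at that quantile
    oh_at_tau = np.quantile(oh, tau)  # observed-historical value at that quantile
    if relative:
        # Multiplicative delta, usually chosen for strictly positive variables.
        return oh_at_tau * (mf / mh_at_tau)
    # Additive delta.
    return oh_at_tau + (mf - mh_at_tau)
```

In this sketch, an unbiased model (oh equal to mh) leaves mf unchanged in both the relative and the absolute mode, which is a quick sanity check for any QDM implementation.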

property raster_index: Raster index property

property raw_features: Get list of features needed for computations

property raw_lat_lon

Lat/lon grid for data, as an array of shape (spatial_1, spatial_2, 2) with (lat, lon) ordering in the last dimension. This returns the grid without any lat inversion.

Returns:: ndarray

property raw_time_index: Time index for input data without time pruning. This is the base time index for the raw input data.

property raw_tsteps: Get number of time steps for all input files

classmethod recursive_compute(data, feature, handle_features, file_paths, raster_index)

Compute intermediate features recursively

Parameters:

data (dict) – dictionary of feature arrays. e.g. data[feature] = array. (spatial_1, spatial_2, temporal)
feature (str) – Name of feature to compute
handle_features (list) – Features available in raw data
file_paths (list) – Paths to data files. Used if the compute method operates directly on the source handler instead of input arrays. This is the case for features whose compute methods take no input features, like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain

Returns:

ndarray – Array of computed feature data
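
The recursion can be pictured as follows: if the feature is already in data it is returned, otherwise its inputs are computed first and then combined. This is only a sketch of the idea. The registry COMPUTE_METHODS, the function name and the feature names are hypothetical, caching the intermediate result in data is a choice made for this example, and the handle_features, file_paths and raster_index arguments of the real method are omitted.

```python
import numpy as np

# Hypothetical registry: derived feature -> (input features, compute function).
COMPUTE_METHODS = {
    "windspeed": (("U", "V"), lambda u, v: np.hypot(u, v)),
    "windspeed_sq": (("windspeed",), lambda ws: ws ** 2),
}


def recursive_compute_sketch(data, feature):
    """Return the array for feature, deriving its inputs first if needed."""
    if feature in data:
        return data[feature]
    if feature not in COMPUTE_METHODS:
        raise KeyError(f"No data or compute method for {feature!r}")
    inputs, method = COMPUTE_METHODS[feature]
    args = [recursive_compute_sketch(data, name) for name in inputs]
    data[feature] = method(*args)  # keep the intermediate result for reuse
    return data[feature]


# Arrays follow the documented (spatial_1, spatial_2, temporal) layout.
data = {"U": np.full((2, 2, 3), 3.0), "V": np.full((2, 2, 3), 4.0)}
ws_sq = recursive_compute_sketch(data, "windspeed_sq")
```

Requesting windspeed_sq first derives windspeed from U and V, so the intermediate feature ends up in data as well.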

property requested_shape: Get requested shape for cached data

run_all_data_init()

Build base 4D data array. Can handle multiple files but assumes each file has the same spatial domain

Returns:: data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)

run_data_compute(): Run the data computation / derivation from raw features to desired features.

run_data_extraction(): Run the raw dataset extraction process from disk to raw un-manipulated datasets.

run_nn_fill(): Run nn nan fill on full data array.

classmethod serial_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features)

Compute features in series

Parameters:

data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
file_paths (list) – Paths to data files. Used if the compute method operates directly on the source handler instead of input arrays. This is the case for features whose compute methods take no input features, like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
derived_features (list) – list of feature strings which need to be derived
all_features (list) – list of all features including those requiring derivation from input features
handle_features (list) – Features available in raw data

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

serial_data_fill(shifted_time_chunks)

Fill final data array in serial

Parameters:: shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array

classmethod serial_extract(file_paths, raster_index, time_chunks, input_features, **kwargs)

Extract features in series

Parameters:

file_paths (list) – list of file paths
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
input_features (list) – list of input feature strings
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {‘parallel’: True, ‘chunks’: {‘south_north’: 120, ‘west_east’: 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
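
The nested layout of the returned dictionary, data[chunk_number][feature] = array, can be illustrated with in-memory stand-ins. The source dictionary and the feature names below are made up for the example; the real method reads from file_paths through the source handler.

```python
import numpy as np

# Stand-in for source data: one (spatial_1, spatial_2, temporal) array per feature.
source = {
    "U_100m": np.arange(2 * 2 * 10, dtype=float).reshape(2, 2, 10),
    "V_100m": np.ones((2, 2, 10)),
}
time_chunks = [slice(0, 4), slice(4, 8), slice(8, 10)]

# Integer keys for chunks, str keys for features.
data = {
    i: {feature: arr[..., t_slice] for feature, arr in source.items()}
    for i, t_slice in enumerate(time_chunks)
}

# Concatenating the chunks along the last (temporal) axis recovers the full array.
full_u = np.concatenate([data[i]["U_100m"] for i in sorted(data)], axis=-1)
```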

property shape

Full data shape

Returns:: shape (tuple) – Full data shape (spatial_1, spatial_2, temporal, features)

property single_ts_files: Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

property size

Size of data array

Returns:: size (int) – Number of total elements contained in data array

abstract classmethod source_handler(file_paths, **kwargs)

Handle for source data. Uses xarray, ResourceX, etc.

NOTE: xarray appears to treat open file handlers as singletons within a threadpool, so it's okay to open this source_handler without a context handler or a .close() statement.

property source_type: Get data type for source files. Either nc or h5

split_data(data=None, val_split=0.0, shuffle_time=False)

Split time dimension into set of training indices and validation indices

Parameters:

data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)
val_split (float) – Fraction of data to separate for validation.
shuffle_time (bool) – Whether to shuffle time or not.

Returns:

data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Training data fraction of initial data array. Initial data array is overwritten by this new data array.
val_data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Validation data fraction of initial data array.
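
A minimal sketch of such a split along the time axis is shown below. It assumes that the validation steps are taken from the start of the (optionally shuffled) time index; that ordering, the function name and the seed argument are assumptions of this example, not documented behavior.

```python
import numpy as np


def split_time_sketch(data, val_split=0.0, shuffle_time=False, seed=None):
    """Split a (spatial_1, spatial_2, temporal, features) array along time."""
    n_time = data.shape[2]
    indices = np.arange(n_time)
    if shuffle_time:
        np.random.default_rng(seed).shuffle(indices)
    n_val = int(val_split * n_time)
    # Assumption: the first n_val indices go to validation, the rest to training.
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return data[:, :, train_idx, :], data[:, :, val_idx, :]


full = np.random.default_rng(0).normal(size=(4, 4, 20, 2))
train, val = split_time_sketch(full, val_split=0.25, shuffle_time=True, seed=1)
```

With 20 time steps and val_split=0.25, five steps go to val and fifteen stay in train; the spatial and feature dimensions are untouched.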

property stds

Get the standard deviation values for each feature.

Returns:: dict

property target

Get lower left corner of raster

Returns:: _target (tuple) – (lat, lon) lower left corner of raster.

property temporal_slice: Get temporal range to extract from full dataset

property ti_workers: Get max number of workers for computing time index

property time_chunk_size: Get upper bound on time chunk size based on memory limits

property time_chunks

Get time chunks which will be extracted from source data

Returns:: _time_chunks (list) – List of time chunks used to split up source data time dimension so that each chunk can be extracted individually
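
As a rough illustration, a list of time chunks can be built from the number of time steps and a chunk size as below. The helper make_time_chunks is hypothetical and ignores temporal_slice; it only shows the shape of the result, a list of consecutive slices.

```python
def make_time_chunks(n_tsteps, time_chunk_size):
    """Split range(n_tsteps) into consecutive slices of at most time_chunk_size."""
    return [
        slice(start, min(start + time_chunk_size, n_tsteps))
        for start in range(0, n_tsteps, time_chunk_size)
    ]


chunks = make_time_chunks(n_tsteps=10, time_chunk_size=4)
# chunks == [slice(0, 4), slice(4, 8), slice(8, 10)]
```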

property time_freq_hours: Get the time frequency in hours as a float

property time_index: Time index for input data with time pruning. This is the raw time index with a cropped range and time step applied.

time_index_conflict_check(): Check if the number of input files and the length of the time index are the same

property time_index_file: Get time index file path

property try_load: Check if we should try to load cache

classmethod valid_handle_features(features, handle_features)

Check if features are in handle

Parameters:

features (str | list) – Raw feature names e.g. U_100m
handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle

classmethod valid_input_features(features, handle_features)

Check if features are in handle or have compute methods

Parameters:

features (str | list) – Raw feature names e.g. U_100m
handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle or has a compute method

get_observation_index(temporal_weights=None, spatial_weights=None)[source]

Randomly get a weighted spatial sample and a weighted time sample

Parameters:

temporal_weights (array) – Weights used to select time slice (n_time_chunks)
spatial_weights (array) – Weights used to select spatial chunks (n_lat_chunks * n_lon_chunks)

Returns:

observation_index (tuple) – Tuple of sampled spatial grid, time slice, and features indices. Used to get single observation like self.data[observation_index]

get_next(temporal_weights=None, spatial_weights=None)[source]

Get data for observation using weighted random observation index. Loops repeatedly over randomized time index.

Parameters:

temporal_weights (array) – Weights used to select time slice (n_time_chunks)
spatial_weights (array) – Weights used to select spatial chunks (n_lat_chunks * n_lon_chunks)

Returns:

observation (np.ndarray) – 4D array (spatial_1, spatial_2, temporal, features)
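
The weighted sampling can be sketched as follows. This is not the sup3r implementation: it assumes that the valid start positions along time and over the flattened spatial grid are split evenly into as many chunks as there are weights, that a chunk is drawn with the given probabilities, and that the start position is drawn uniformly inside that chunk. The function name and the rng argument are hypothetical.

```python
import numpy as np


def weighted_observation_index(data_shape, sample_shape, temporal_weights,
                               spatial_weights, rng):
    """Pick a (rows, cols, time, features) index tuple using chunk weights."""
    n_rows, n_cols, n_time, n_feat = data_shape
    s_rows, s_cols, s_time = sample_shape
    t_w = np.asarray(temporal_weights, dtype=float)
    s_w = np.asarray(spatial_weights, dtype=float)

    # Temporal start: weighted choice of chunk, then uniform start inside it.
    t_starts = np.array_split(np.arange(n_time - s_time + 1), len(t_w))
    t_chunk = rng.choice(len(t_w), p=t_w / t_w.sum())
    t0 = int(rng.choice(t_starts[t_chunk]))

    # Spatial start: chunks cover the flattened grid of valid start cells.
    grid = (n_rows - s_rows + 1, n_cols - s_cols + 1)
    s_starts = np.array_split(np.arange(grid[0] * grid[1]), len(s_w))
    s_chunk = rng.choice(len(s_w), p=s_w / s_w.sum())
    r0, c0 = (int(i) for i in np.unravel_index(rng.choice(s_starts[s_chunk]), grid))

    return (slice(r0, r0 + s_rows), slice(c0, c0 + s_cols),
            slice(t0, t0 + s_time), np.arange(n_feat))


rng = np.random.default_rng(0)
data = np.zeros((20, 20, 100, 3))
obs_index = weighted_observation_index(
    data.shape, (10, 10, 4), [0, 0, 0, 1], [1, 0, 0, 0], rng)
observation = data[obs_index]  # 4D array (spatial_1, spatial_2, temporal, features)
```

Putting all temporal weight on the last chunk, as in the usage above, forces every sample to start in the last quarter of the valid time range, which is the mechanism a data-centric handler can use to focus training on poorly performing periods or regions.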