sup3r.preprocessing.data_handling.nc_data_handling.DataHandlerNCforCCwithPowerLaw

class DataHandlerNCforCCwithPowerLaw(*args, nsrdb_source_fp=None, nsrdb_agg=1, nsrdb_smoothing=0, **kwargs)

Bases: DataHandlerNCforCC

Data handler for NETCDF climate change data with power-law-based extrapolation for windspeeds.
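For orientation, power-law extrapolation usually refers to the wind profile relation u(z) = u_ref * (z / z_ref) ** alpha. The sketch below is illustrative only: the helper name, the reference height, and the shear exponent of 1/7 are assumptions, not values taken from this class.

```python
import numpy as np


def power_law_extrapolate(u_ref, z_ref, z, alpha=1 / 7):
    """Extrapolate windspeed from height z_ref to height z (hypothetical helper).

    alpha is the shear exponent. 1/7 is a common textbook default and is an
    assumption here, not a value documented for this class.
    """
    u_ref = np.asarray(u_ref, dtype=float)
    return u_ref * (z / z_ref) ** alpha


# Windspeed at 10 m extrapolated to 100 m (made-up values)
u_10m = np.array([4.0, 6.5, 8.0])
u_100m = power_law_extrapolate(u_10m, z_ref=10.0, z=100.0)
```

With a positive exponent the extrapolated windspeed increases with height, and extrapolating to the reference height returns the input unchanged.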

Initialize NETCDF data handler for climate change data.

Parameters:

*args (list) – Same ordered required arguments as DataHandler parent class.
nsrdb_source_fp (str | None) – Optional NSRDB source h5 file from which to retrieve clearsky_ghi. This is combined with rsds (ghi) from the CC netcdf file to calculate the CC clearsky_ratio.
nsrdb_agg (int) – Optional number of NSRDB source pixels from which to aggregate clearsky_ghi into a single climate change netcdf pixel. This can be used if the CC.nc data is at a much coarser resolution than the source nsrdb data.
nsrdb_smoothing (float) – Optional gaussian filter smoothing factor to smooth out clearsky_ghi from high-resolution nsrdb source data. This is typically done because spatially aggregated nsrdb data is still usually rougher than CC irradiance data.
**kwargs (dict) – Same optional keyword arguments as DataHandler parent class.
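The roles of nsrdb_agg and the clearsky ratio can be pictured with plain NumPy. This is a minimal sketch under stated assumptions: the array shapes, the idea that each CC pixel is matched to a fixed set of source pixels, and all values are hypothetical, and the Gaussian smoothing controlled by nsrdb_smoothing is left out.

```python
import numpy as np

# Hypothetical inputs: 4 CC pixels, each matched to nsrdb_agg=3 NSRDB pixels,
# with 2 daily time steps. Values are made up for illustration.
nsrdb_agg = 3
cs_ghi_source = np.full((4, nsrdb_agg, 2), 300.0)
cs_ghi_source[:, 1, :] = 330.0  # one of the matched source pixels is brighter

# Aggregate the source pixels that map to each CC pixel
cs_ghi = cs_ghi_source.mean(axis=1)  # shape (4, 2)

# rsds (ghi) from the CC netcdf file, same shape as the aggregated clearsky ghi
rsds = np.full((4, 2), 155.0)

# Clearsky ratio: all-sky ghi divided by clearsky ghi, guarded against zeros
clearsky_ratio = np.divide(
    rsds, cs_ghi, out=np.zeros_like(rsds), where=cs_ghi > 0
)
```

The zero guard matters for night-time or polar-night values, where clearsky ghi can be zero and a plain division would produce inf or NaN.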

Methods

`cache_data`(cache_file_paths)	Cache feature data to file and delete from memory
`cap_worker_args`(max_workers)	Cap all workers args by max_workers
`check_cached_features`(features[, ...])	Check which features have been cached and check flags to determine whether to load or extract these features again
`check_clear_data`()	Check if data is cached and clear data if not load_cached
`clear_data`()	Free memory used for data arrays
`compute_raster_index`(file_paths, target, ...)	Get raster index for a given target and shape
`data_fill`(shifted_time_chunks[, max_workers])	Fill final data array with extracted / computed chunks
`direct_extract`(handle, feature, ...)	Extract requested feature directly from source data, rather than interpolating to a requested height or pressure level
`extract_feature`(file_paths, raster_index, ...)	Extract single feature from data source.
`get_cache_file_names`(cache_pattern[, ...])	Get names of cache files from cache_pattern and feature names
`get_capped_workers`(max_workers_cap, max_workers)	Get max number of workers for a given job.
`get_clearsky_ghi`()	Get clearsky ghi from an exogenous NSRDB source h5 file at the target CC meta data and time index.
`get_closest_lat_lon`(lat_lon, target)	Get closest indices to target lat lon
`get_file_times`(file_paths, **kwargs)	Get time index from data files
`get_full_domain`(file_paths)	Get full shape and min available lat lon.
`get_handle_features`(file_paths)	Get all available features in input data
`get_input_arrays`(data, chunk_number, f, ...)	Get only arrays needed for computations
`get_inputs_recursive`(feature, handle_features)	Lookup inputs needed to compute feature.
`get_lat_lon`(file_paths, raster_index[, ...])	Get lat/lon grid for requested target and shape
`get_lat_lon_df`(target[, features])	Get timeseries for given target
`get_next`()	Get data for observation using random observation index.
`get_node_cmd`(config)	Get a CLI call to initialize DataHandler and cache data.
`get_raster_index`()	Get raster index for file data.
`get_raw_feature_list`(features, handle_features)	Lookup inputs needed to compute feature
`get_time_index`(file_paths[, max_workers])	Get time index from data files
`has_exact_feature`(feature, handle)	Check if exact feature is in handle
`has_multilevel_feature`(feature, handle)	Check if handle contains multilevel data for the given feature
`has_surrounding_features`(feature, handle)	Check if handle has feature values at surrounding heights.
`lats_are_descending`(lat_lon)	Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner).
`lin_bc`(bc_files[, threshold])	Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc
`load_cached_data`([with_split])	Load data from cache files and split into training and validation
`lookup`(feature, attr_name[, handle_features])	Lookup feature in feature registry
`mask_nan`()	Drop timesteps with NaN data
`normalize`([means, stds, features, max_workers])	Normalize all data features.
`parallel_compute`(data, file_paths, ...[, ...])	Compute features using parallel subprocesses
`parallel_extract`(file_paths, raster_index, ...)	Extract features using parallel subprocesses
`parallel_load`(data, cache_files, features[, ...])	Load feature data in parallel
`pop_old_data`(data, chunk_number, all_features)	Remove input feature data if no longer needed for requested features
`preflight`()	Run some preflight checks and verify that the inputs are valid
`qdm_bc`(bc_files, reference_feature[, ...])	Bias Correction using Quantile Delta Mapping
`recursive_compute`(data, feature, ...)	Compute intermediate features recursively
`run_all_data_init`()	Build base 4D data array.
`run_data_compute`()	Run the data computation / derivation from raw features to desired features.
`run_data_extraction`()	Run the raw dataset extraction process from disk to raw un-manipulated datasets.
`run_nn_fill`()	Run nearest-neighbor (nn) NaN fill on full data array.
`serial_compute`(data, file_paths, ...)	Compute features in series
`serial_data_fill`(shifted_time_chunks)	Fill final data array in serial
`serial_extract`(file_paths, raster_index, ...)	Extract features in series
`source_handler`(file_paths, **kwargs)	Xarray data handler
`split_data`([data, val_split, shuffle_time])	Split time dimension into set of training indices and validation indices
`time_index_conflict_check`()	Check if the number of input files and the length of the time index is the same
`valid_handle_features`(features, handle_features)	Check if features are in handle
`valid_input_features`(features, handle_features)	Check if features are in handle or have compute methods

Attributes

`CHUNKS`	CHUNKS sets the chunk sizes to extract from the data in each dimension.
`FEATURE_REGISTRY`
`attrs`	Get attributes of input data
`cache_files`	Cache files for storing extracted data
`cache_pattern`	Get correct cache file pattern for formatting.
`cached_features`	List of features which have been requested but have been determined not to need extraction.
`compute_workers`	Get upper bound for compute workers based on memory limits.
`derive_features`	List of features which need to be derived from other features
`extract_features`	Features to extract directly from the source handler
`extract_workers`	Get upper bound for extract workers based on memory limits.
`feature_mem`	Number of bytes for a single feature array.
`file_paths`	Get file paths for input data
`full_raw_lat_lon`	Get the full lat/lon grid without doing any latitude inversion
`grid_mem`	Get memory used by a feature at a single time step
`grid_shape`	Get shape of raster
`handle_features`	All features available in raw input
`hr_exo_features`	Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection.
`hr_out_features`	Get a list of high-resolution features that are intended to be output by the GAN.
`input_file_info`	Method to provide info about files in log output.
`invert_lat`	Whether to invert the latitude axis during data extraction.
`is_time_independent`	Get whether source data files are time independent
`lat_lon`	Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.
`latitude`	Flattened list of latitudes
`load_workers`	Get upper bound on load workers based on memory limits.
`longitude`	Flattened list of longitudes
`lr_features`	Get a list of low-resolution features.
`lr_only_features`	List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.
`means`	Get the mean values for each feature.
`meta`	Meta dataframe with coordinates.
`n_tsteps`	Get number of time steps to extract
`need_full_domain`	Check whether we need to get the full lat/lon grid to determine target and shape values
`noncached_features`	Get list of features needing extraction or derivation
`norm_workers`	Get upper bound on workers used for normalization.
`raster_index`	Raster index property
`raw_features`	Get list of features needed for computations
`raw_lat_lon`	Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.
`raw_time_index`	Time index for input data without time pruning.
`raw_tsteps`	Get number of time steps for all input files
`requested_shape`	Get requested shape for cached data
`shape`	Full data shape
`single_ts_files`	Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice
`size`	Size of data array
`source_type`	Get data type for source files.
`stds`	Get the standard deviation values for each feature.
`target`	Get lower left corner of raster
`temporal_slice`	Get temporal range to extract from full dataset
`ti_workers`	Get max number of workers for computing time index
`time_chunk_size`	Get upper bound on time chunk size based on memory limits
`time_chunks`	Get time chunks which will be extracted from source data
`time_freq_hours`	Get the time frequency in hours as a float
`time_index`	Time index for input data with time pruning.
`time_index_file`	Get time index file path
`try_load`	Check if we should try to load cache

CHUNKS: ClassVar[dict] = {'lat': 20, 'lon': 20, 'time': 5}: CHUNKS sets the chunk sizes to extract from the data in each dimension. Chunk sizes that approximately match the data volume being extracted typically result in the most efficient IO.

property attrs

Get attributes of input data

Returns:: dict – Dictionary of attributes

cache_data(cache_file_paths)

Cache feature data to file and delete from memory

Parameters:: cache_file_paths (str | None) – Path to file for saving feature data

property cache_files: Cache files for storing extracted data

property cache_pattern

Get correct cache file pattern for formatting.

Returns:: _cache_pattern (str) – The cache file pattern with formatting keys included.

property cached_features: List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.

cap_worker_args(max_workers): Cap all workers args by max_workers

static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)

Check which features have been cached and check flags to determine whether to load or extract these features again

Parameters:

features (list) – list of features to extract
cache_files (list | None) – Path to files with saved feature data
overwrite_cache (bool) – Whether to overwrite cached files
load_cached (bool) – Whether to load data from cache files

Returns:

list – List of features to extract. Might not include features which have cache files.

check_clear_data(): Check if data is cached and clear data if not load_cached

clear_data(): Free memory used for data arrays

classmethod compute_raster_index(file_paths, target, grid_shape)

Get raster index for a given target and shape

Parameters:

file_paths (list) – List of input data file paths
target (tuple) – Target coordinate for lower left corner of extracted data
grid_shape (tuple) – Shape of extracted data

Returns:

list – List of slices corresponding to extracted data region

property compute_workers: Get upper bound for compute workers based on memory limits. Used to compute derived features from source dataset.

data_fill(shifted_time_chunks, max_workers=None)

Fill final data array with extracted / computed chunks

Parameters:

shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array
max_workers (int | None) – Max number of workers to use for building final data array. If None, the max available workers will be used. If 1, cached data will be loaded in serial.

property derive_features: List of features which need to be derived from other features

classmethod direct_extract(handle, feature, raster_index, time_slice)

Extract requested feature directly from source data, rather than interpolating to a requested height or pressure level

Parameters:

handle (xarray) – netcdf data object
feature (str) – Name of feature to extract directly from source handler
raster_index (list) – List of slices for raster index of spatial domain
time_slice (slice) – slice of time to extract

Returns:

fdata (ndarray) – Data array for requested feature

classmethod extract_feature(file_paths, raster_index, feature, time_slice=slice(None, None, None), **kwargs)

Extract single feature from data source. The requested feature can match exactly to one found in the source data or can have a matching prefix with a suffix specifying the height or pressure level to interpolate to. e.g. feature=U_100m -> interpolate exact match U to 100 meters.

Parameters:

file_paths (list) – path to data file
raster_index (ndarray) – Raster index array
feature (str) – Feature to extract from data
time_slice (slice) – slice of time to extract
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

ndarray – Data array for extracted feature (spatial_1, spatial_2, temporal)
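The suffix convention described above (U_100m meaning U at 100 meters) can be illustrated with a small parser. This is a sketch only: split_feature is a hypothetical helper, it handles height suffixes in meters and not pressure levels, and it is not the library's own parsing logic.

```python
import re


def split_feature(feature):
    """Split a name like 'U_100m' into ('U', 100.0); hypothetical helper.

    Returns (feature, None) when there is no height suffix.
    """
    match = re.fullmatch(r"(?P<base>.+)_(?P<height>\d+(?:\.\d+)?)m", feature)
    if match is None:
        return feature, None
    return match.group("base"), float(match.group("height"))


base, height = split_feature("U_100m")
```

A name without a suffix, such as temperature, comes back unchanged with a height of None, which corresponds to the exact-match case.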

property extract_features: Features to extract directly from the source handler

property extract_workers: Get upper bound for extract workers based on memory limits. Used to extract data from source dataset

property feature_mem

Number of bytes for a single feature array. Used to estimate max_workers.

Returns:: int – Number of bytes for a single feature array

property file_paths: Get file paths for input data

property full_raw_lat_lon: Get the full lat/lon grid without doing any latitude inversion

get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)

Get names of cache files from cache_pattern and feature names

Parameters:

cache_pattern (str) – Pattern to use for cache file names
grid_shape (tuple) – Shape of grid to use for cache file naming
time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming
target (tuple) – Target to use for cache file naming
features (list) – List of features to use for cache file naming

Returns:

list – List of cache file names

static get_capped_workers(max_workers_cap, max_workers)

Get max number of workers for a given job. Capped to global max workers if specified

Parameters:

max_workers_cap (int | None) – Cap for job specific max_workers
max_workers (int | None) – Job specific max_workers

Returns:

max_workers (int | None) – job specific max_workers capped by max_workers_cap if provided
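The capping rule can be restated as a few lines of Python. This is a hypothetical re-statement, not the library source. In particular, returning the cap when max_workers is None is an assumption about how the two None cases are resolved.

```python
def capped_workers(max_workers_cap, max_workers):
    """Hypothetical sketch of capping a job-specific worker count."""
    if max_workers_cap is None:
        # No cap provided: keep the job-specific value as is
        return max_workers
    if max_workers is None:
        # Assumption: an unspecified job value falls back to the cap
        return max_workers_cap
    return min(max_workers_cap, max_workers)
```

For example, a job asking for 16 workers under a cap of 4 would run with 4, while a job asking for 2 would keep its 2.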

get_clearsky_ghi()

Get clearsky ghi from an exogenous NSRDB source h5 file at the target CC meta data and time index.

Returns:: cs_ghi (np.ndarray) – Clearsky ghi (W/m2) from the nsrdb_source_fp h5 source file. Data shape is (lat, lon, time) where time is daily average values.

static get_closest_lat_lon(lat_lon, target)

Get closest indices to target lat lon

Parameters:

lat_lon (ndarray) – Array of lat/lon (spatial_1, spatial_2, 2) Last dimension in order of (lat, lon)
target (tuple) – (lat, lon) for target coordinate

Returns:

row (int) – row index for closest lat/lon to target lat/lon
col (int) – col index for closest lat/lon to target lat/lon
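A nearest-point lookup on a (spatial_1, spatial_2, 2) grid can be sketched with NumPy as follows. The helper name and the use of squared euclidean distance in degrees are assumptions for illustration, and the grid values are made up.

```python
import numpy as np


def closest_lat_lon(lat_lon, target):
    """Return (row, col) of the grid point nearest to target; hypothetical sketch."""
    # Squared euclidean distance in degrees between every grid point and target
    dist = ((lat_lon - np.asarray(target)) ** 2).sum(axis=-1)
    row, col = np.unravel_index(np.argmin(dist), dist.shape)
    return int(row), int(col)


# Small descending-latitude grid: the lower left corner sits at index [-1, 0]
lats = np.array([42.0, 41.0, 40.0])
lons = np.array([-106.0, -105.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
lat_lon = np.dstack([lat_grid, lon_grid])  # shape (3, 2, 2), last dim (lat, lon)

row, col = closest_lat_lon(lat_lon, target=(40.1, -105.9))
```

Here the target lies closest to the grid point (40.0, -106.0), so the lookup lands on the last row and first column.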

classmethod get_file_times(file_paths, **kwargs)

Get time index from data files

Parameters:

file_paths (list) – path to data file
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

time_index (pd.DatetimeIndex) – List of times as a DatetimeIndex

classmethod get_full_domain(file_paths)

Get full shape and min available lat lon. To simplify processing of full domain without needing to specify target and shape.

Parameters:

file_paths (list) – List of data file paths

Returns:

target (tuple) – (lat, lon) for lower left corner
lat_lon (ndarray) – Raw lat/lon array for entire domain

classmethod get_handle_features(file_paths)

Get all available features in input data

Parameters:: file_paths (list) – List of input file paths
Returns:: handle_features (list) – List of available input features

classmethod get_input_arrays(data, chunk_number, f, handle_features)

Get only arrays needed for computations

Parameters:

data (dict) – Dictionary of feature arrays
chunk_number (int) – time chunk for which to get input arrays
f (str) – feature to compute using input arrays
handle_features (list) – Features available in raw data

Returns:

dict – Dictionary of arrays with only needed features

classmethod get_inputs_recursive(feature, handle_features)

Lookup inputs needed to compute feature. Walk through inputs methods for each required feature to get all raw features.

Parameters:

feature (str) – Feature for which to get needed inputs for derivation
handle_features (list) – Features available in raw data

Returns:

list – List of input features

classmethod get_lat_lon(file_paths, raster_index, invert_lat=False)

Get lat/lon grid for requested target and shape

Parameters:

file_paths (list) – path to data file
raster_index (ndarray | list) – Raster index array or list of slices
invert_lat (bool) – Flag to invert data along the latitude axis. WRF data tends to use an increasing ordering for latitude while WTK uses a decreasing ordering.

Returns:

ndarray – (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension

get_lat_lon_df(target, features=None)

Get timeseries for given target

Parameters:

target (tuple) – (lat, lon) for target coordinate
features (list | None) – Optional list of features to include in returned data. If None then all available features are returned.

Returns:

df (pd.DataFrame) – Pandas dataframe with columns for each feature and timeindex for the given target

get_next()

Get data for observation using random observation index. Loops repeatedly over randomized time index

Returns:: observation (np.ndarray) – 4D array (spatial_1, spatial_2, temporal, features)

classmethod get_node_cmd(config)

Get a CLI call to initialize DataHandler and cache data.

Parameters:: config (dict) – sup3r data handler config with all necessary args and kwargs to initialize DataHandler and run data extraction.

get_raster_index()

Get raster index for file data. Here we assume the list of paths in file_paths all have data with the same spatial domain. We use the first file in the list to compute the raster.

Returns:: raster_index (np.ndarray) – 2D array of grid indices

classmethod get_raw_feature_list(features, handle_features)

Lookup inputs needed to compute feature

Parameters:

features (list) – Features for which to get needed inputs for derivation
handle_features (list) – Features available in raw data

Returns:

list – List of input features

classmethod get_time_index(file_paths, max_workers=None, **kwargs)

Get time index from data files

Parameters:

file_paths (list) – path to data file
max_workers (int | None) – Max number of workers to use for parallel time index building
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

time_index (pd.DatetimeIndex) – List of times as a DatetimeIndex

property grid_mem

Get memory used by a feature at a single time step

Returns:: int – Number of bytes for a single feature array at a single time step

property grid_shape

Get shape of raster

Returns:: _grid_shape (tuple) – (rows, cols) grid size.

property handle_features: All features available in raw input

classmethod has_exact_feature(feature, handle)

Check if exact feature is in handle

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains exact feature or not

classmethod has_multilevel_feature(feature, handle)

Check if handle contains multilevel data for the given feature

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains multilevel data for given feature

classmethod has_surrounding_features(feature, handle)

Check if handle has feature values at surrounding heights. e.g. if feature=U_40m check if the handler has u at heights below and above 40m

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether feature has surrounding heights
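The surrounding-heights check from the U_40m example can be pictured as below. The function name and the plain list of available heights are hypothetical, and how heights are actually read from the handle is not shown.

```python
def has_surrounding_heights(target_height, available_heights):
    """True if heights exist both below and above target_height (hypothetical)."""
    below = any(h < target_height for h in available_heights)
    above = any(h > target_height for h in available_heights)
    return below and above


# U is available at 10 m and 100 m, so 40 m is bracketed on both sides
can_interpolate = has_surrounding_heights(40, [10, 100])
```

If every available height sits on one side of the target, the check fails, since the target can no longer be bracketed.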

property hr_exo_features: Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set. These can also be input to the model as low-res features.

property hr_out_features: Get a list of high-resolution features that are intended to be output by the GAN. Does not include high-resolution exogenous features

property input_file_info

Method to provide info about files in log output. Since NETCDF files have single time slices, printing out all the file paths is just a text dump without much info.

Returns:: str – message to append to log output that does not include a huge info dump of file paths

property invert_lat: Whether to invert the latitude axis during data extraction. This is to enforce a descending latitude ordering so that the lower left corner of the grid is at idx=(-1, 0) instead of idx=(0, 0)

property is_time_independent: Get whether source data files are time independent

property lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This ensures that the lower left hand corner of the domain is given by lat_lon[-1, 0]

Returns:: ndarray

property latitude: Flattened list of latitudes

classmethod lats_are_descending(lat_lon)

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner)

Parameters:: lat_lon (np.ndarray) – Lat/Lon array with shape (n_lats, n_lons, 2)
Returns:: bool
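The ordering check and the effect of a latitude inversion can be sketched with NumPy. The helper name and the grid values are assumptions, and comparing only the first column is a simplification that presumes a regular grid.

```python
import numpy as np


def lats_descending(lat_lon):
    """Hypothetical check: first-row latitude exceeds last-row latitude."""
    return bool(lat_lon[0, 0, 0] > lat_lon[-1, 0, 0])


lats = np.array([40.0, 41.0, 42.0])  # ascending latitude ordering
lons = np.array([-106.0, -105.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
ascending = np.dstack([lat_grid, lon_grid])  # shape (n_lats, n_lons, 2)

# Inverting the latitude axis puts the lower left corner at index [-1, 0]
descending = ascending[::-1]
```

After the flip, descending[-1, 0] holds the minimum latitude and minimum longitude, matching the lower-left-corner convention described for lat_lon.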

lin_bc(bc_files, threshold=0.1)

Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc

Parameters:

bc_files (list | tuple | str) – One or more filepaths to .h5 files output by MonthlyLinearCorrection or LinearCorrection. These should contain datasets named “{feature}_scalar” and “{feature}_adder” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time is length 1 for annual correction or 12 for monthly correction.
threshold (float) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
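Applying monthly "{feature}_scalar" and "{feature}_adder" factors amounts to a per-time-step linear transform. The sketch below uses made-up arrays in place of the .h5 datasets and assumes a non-leap year of daily data. It is not the library's implementation and it skips the nearest-neighbor coordinate matching governed by threshold.

```python
import numpy as np

# Hypothetical data: (lat, lon, time) array of daily values for one non-leap year
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
month_idx = np.repeat(np.arange(12), days_in_month)  # month index per time step
data = np.full((2, 2, month_idx.size), 5.0)

# Monthly correction factors with shape (lat, lon, 12); values are made up
scalar = np.ones((2, 2, 12))
adder = np.zeros((2, 2, 12))
scalar[..., 0] = 1.2  # January scaling
adder[..., 6] = -0.5  # July offset

# Select the month-specific factor for every time step, then apply
# corrected = data * scalar + adder
corrected = data * scalar[..., month_idx] + adder[..., month_idx]
```

For an annual correction the last dimension of the factors would have length 1, and the same expression works through broadcasting without the month index.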

load_cached_data(with_split=True)

Load data from cache files and split into training and validation

Parameters:: with_split (bool) – Whether to split into training and validation data or not.

property load_workers: Get upper bound on load workers based on memory limits. Used to load cached data.

property longitude: Flattened list of longitudes

classmethod lookup(feature, attr_name, handle_features=None)

Lookup feature in feature registry

Parameters:

feature (str) – Feature to lookup in registry
attr_name (str) – Type of method to lookup. e.g. inputs or compute
handle_features (list) – List of feature names (datasets) available in the source file. If feature is found explicitly in this list, height/pressure suffixes will not be appended to the output.

Returns:

method | None – Feature registry method corresponding to feature

property lr_features: Get a list of low-resolution features. It is assumed that all features are used in the low-resolution observations. If you want to use high-res-only features, use the DualDataHandler class.

property lr_only_features: List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.

mask_nan(): Drop timesteps with NaN data

property means

Get the mean values for each feature.

Returns:: dict

property meta: Meta dataframe with coordinates.

property n_tsteps: Get number of time steps to extract

property need_full_domain: Check whether we need to get the full lat/lon grid to determine target and shape values

property noncached_features: Get list of features needing extraction or derivation

property norm_workers: Get upper bound on workers used for normalization.

normalize(means=None, stds=None, features=None, max_workers=None)

Normalize all data features.

Parameters:

means (dict | None) – Dictionary of means for all features with keys: feature names and values: mean values. If this is None, the self.means attribute will be used. If this is not None, this DataHandler object means attribute will be updated.
stds (dict | None) – Dictionary of standard deviation values for all features with keys: feature names and values: standard deviations. If this is None, the self.stds attribute will be used. If this is not None, this DataHandler object stds attribute will be updated.
features (list | None) – List of features used for indexing data array during normalization.
max_workers (None | int) – Max workers to perform normalization. If None, self.norm_workers will be used
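Per-feature normalization with means and stds dictionaries keyed by feature name can be sketched as follows. The feature names, array shape, and random data are assumptions, and the loop runs in serial rather than with workers.

```python
import numpy as np

features = ["U_100m", "V_100m"]
rng = np.random.default_rng(0)
# 4D array (spatial_1, spatial_2, temporal, features) of made-up values
data = rng.normal(loc=[8.0, -2.0], scale=[3.0, 1.5], size=(4, 4, 10, 2))

# Per-feature statistics keyed by feature name, as described for means / stds
means = {f: float(data[..., i].mean()) for i, f in enumerate(features)}
stds = {f: float(data[..., i].std()) for i, f in enumerate(features)}

# Standardize each feature channel to zero mean and unit standard deviation
normalized = np.empty_like(data)
for i, f in enumerate(features):
    normalized[..., i] = (data[..., i] - means[f]) / stds[f]
```

Passing externally computed means and stds instead, for example from a training set, applies the same transform to new data so that both share one scale.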

classmethod parallel_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features, max_workers=None)

Compute features using parallel subprocesses

Parameters:

data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
derived_features (list) – list of feature strings which need to be derived
all_features (list) – list of all features including those requiring derivation from input features
handle_features (list) – Features available in raw data
max_workers (int | None) – Number of max workers to use for computation. If equal to 1 then method is run in serial

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. Includes e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

classmethod parallel_extract(file_paths, raster_index, time_chunks, input_features, max_workers=None, **kwargs)

Extract features using parallel subprocesses

Parameters:

file_paths (list) – list of file paths
raster_index (ndarray | list) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
input_features (list) – list of input feature strings
max_workers (int | None) – Number of max workers to use for extraction. If equal to 1 then method is run in serial
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

parallel_load(data, cache_files, features, max_workers=None)

Load feature data in parallel

Parameters:

data (ndarray) – Array to fill with cached data
cache_files (list) – List of cache files for each feature
features (list) – List of requested features
max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.

classmethod pop_old_data(data, chunk_number, all_features)

Remove input feature data if no longer needed for requested features

Parameters:

data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
chunk_number (int) – time chunk index to check
all_features (list) – list of all requested features including those requiring derivation from input features

preflight(): Run some preflight checks and verify that the inputs are valid

qdm_bc(bc_files, reference_feature, relative=True, threshold=0.1, no_trend=False)

Bias Correction using Quantile Delta Mapping

Bias correct this DataHandler’s data with Quantile Delta Mapping. The required statistical distributions should be pre-calculated using sup3r.bias.qdm.QuantileDeltaMappingCorrection.

Warning: There is no guarantee that the coefficients from bc_files match the resource processed here. Be careful choosing bc_files.

Parameters:

bc_files (list | tuple | str) – One or more filepaths to .h5 files output by bias_calc.QuantileDeltaMappingCorrection. These should contain datasets named “base_{reference_feature}_params”, “bias_{feature}_params”, and “bias_fut_{feature}_params” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time.
reference_feature (str) – Name of the feature used as (historical) reference. Dataset with name “base_{reference_feature}_params” will be retrieved from bc_files.
relative (bool, default=True) – Switcher to apply QDM as a relative (use True) or absolute (use False) correction value.
threshold (float, default=0.1) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
no_trend (bool, default=False) – An option to ignore the trend component of the correction, thus resulting in an ordinary Quantile Mapping, i.e. corrects the bias by comparing the distributions of the biased dataset with a reference dataset. See params_mf of rex.utilities.bc_utils.QuantileDeltaMapping. Note that this assumes that “bias_{feature}_params” (params_mh) is the data distribution representative for the target data.

property raster_index: Raster index property

property raw_features: Get list of features needed for computations

property raw_lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This returns the grid without any lat inversion.

Returns:: ndarray

property raw_time_index: Time index for input data without time pruning. This is the base time index for the raw input data.

property raw_tsteps: Get number of time steps for all input files

classmethod recursive_compute(data, feature, handle_features, file_paths, raster_index)

Compute intermediate features recursively

Parameters:

data (dict) – dictionary of feature arrays. e.g. data[feature] = array. (spatial_1, spatial_2, temporal)
feature (str) – Name of feature to compute
handle_features (list) – Features available in raw data
file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain

Returns:

ndarray – Array of computed feature data

property requested_shape: Get requested shape for cached data

run_all_data_init()

Build base 4D data array. Can handle multiple files but assumes each file has the same spatial domain.

Returns: data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)

run_data_compute(): Run the data computation / derivation from raw features to desired features.

run_data_extraction()

Run the raw dataset extraction process from disk to raw un-manipulated datasets.

Includes a special method to extract clearsky_ghi from an exogenous NSRDB source h5 file (required to compute clearsky_ratio).

run_nn_fill(): Run nn nan fill on full data array.

classmethod serial_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features)

Compute features in series

Parameters:

data (dict) – Dictionary of feature arrays with integer keys for chunks and str keys for features, e.g. data[chunk_number][feature] = array with shape (spatial_1, spatial_2, temporal)
file_paths (list) – Paths to data files. Used if the compute method operates directly on the source handler instead of input arrays. This is the case for features without inputs methods, such as lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
derived_features (list) – list of feature strings which need to be derived
all_features (list) – list of all features including those requiring derivation from input features
handle_features (list) – Features available in raw data

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

serial_data_fill(shifted_time_chunks)

Fill final data array in serial

Parameters: shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array
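A minimal sketch of the fill step, assuming the chunked layout data[chunk_number][feature] described for serial_compute and serial_extract. The function serial_data_fill_sketch and its features and spatial_shape arguments are hypothetical; the real method operates on the handler's own attributes.

```python
import numpy as np


def serial_data_fill_sketch(chunks, shifted_time_chunks, features,
                            spatial_shape):
    """Assemble per-chunk feature arrays into one 4D array.

    chunks[chunk_number][feature] has shape (spatial_1, spatial_2, temporal)
    and shifted_time_chunks[chunk_number] is its slice in the final array.
    """
    n_time = max(t_slice.stop for t_slice in shifted_time_chunks)
    shape = (*spatial_shape, n_time, len(features))
    out = np.full(shape, np.nan, dtype=np.float32)
    for chunk_number, t_slice in enumerate(shifted_time_chunks):
        for f_idx, feature in enumerate(features):
            # Place this chunk at its location along the time axis
            out[..., t_slice, f_idx] = chunks[chunk_number][feature]
    return out
```

Initializing the array with NaN makes any time step that was never filled easy to detect afterwards.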

classmethod serial_extract(file_paths, raster_index, time_chunks, input_features, **kwargs)

Extract features in series

Parameters:

file_paths (list) – list of file paths
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
input_features (list) – list of input feature strings
kwargs (dict) – kwargs passed to source handler for data extraction, e.g. {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}}, which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

property shape

Full data shape

Returns: shape (tuple) – Full data shape (spatial_1, spatial_2, temporal, features)

property single_ts_files: Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

property size

Size of data array

Returns: size (int) – Number of total elements contained in data array

classmethod source_handler(file_paths, **kwargs)

Xarray data handler

Note that xarray appears to treat open file handlers as singletons within a threadpool, so it's okay to open this source_handler without a context manager or a .close() statement.

Parameters:

file_paths (str | list) – paths to data files
kwargs (dict) – kwargs passed to source handler for data extraction, e.g. {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}}, which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

data (xarray.Dataset)

property source_type: Get data type for source files. Either nc or h5

split_data(data=None, val_split=0.0, shuffle_time=False)

Split time dimension into set of training indices and validation indices

Parameters:

data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)
val_split (float) – Fraction of data to separate for validation.
shuffle_time (bool) – Whether to shuffle time or not.

Returns:

data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Training data fraction of initial data array. Initial data array is overwritten by this new data array.
val_data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Validation data fraction of initial data array.
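The split along the time dimension can be sketched as below. This is an illustration under stated assumptions, not the handler's code: split_time_sketch and its seed argument are hypothetical, and the sketch returns both arrays instead of overwriting the handler's data attribute.

```python
import numpy as np


def split_time_sketch(data, val_split=0.0, shuffle_time=False, seed=None):
    """Split a (spatial_1, spatial_2, temporal, features) array in time."""
    n_time = data.shape[2]
    n_val = int(val_split * n_time)
    indices = np.arange(n_time)
    if shuffle_time:
        # Randomize which time steps go to validation
        indices = np.random.default_rng(seed).permutation(n_time)
    val_indices, train_indices = indices[:n_val], indices[n_val:]
    return data[:, :, train_indices, :], data[:, :, val_indices, :]
```

With shuffle_time=False the validation set is a contiguous block of time steps, which preserves temporal ordering within both subsets.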

property stds

Get the standard deviation values for each feature.

Returns: dict

property target

Get lower left corner of raster

Returns: _target (tuple) – (lat, lon) lower left corner of raster.

property temporal_slice: Get temporal range to extract from full dataset

property ti_workers: Get max number of workers for computing time index

property time_chunk_size: Get upper bound on time chunk size based on memory limits

property time_chunks

Get time chunks which will be extracted from source data

Returns: _time_chunks (list) – List of time chunks used to split up source data time dimension so that each chunk can be extracted individually
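As an illustration of what such a list of time chunks looks like, the sketch below splits a time dimension into consecutive slices. The helper time_chunks_sketch and its arguments are hypothetical; in the handler the chunk size is bounded by time_chunk_size.

```python
def time_chunks_sketch(n_tsteps, chunk_size):
    """Split range(n_tsteps) into consecutive slices of at most chunk_size."""
    return [slice(start, min(start + chunk_size, n_tsteps))
            for start in range(0, n_tsteps, chunk_size)]
```

Each slice can then be used to index the time axis of the source data so that chunks are extracted one at a time.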

property time_freq_hours: Get the time frequency in hours as a float

property time_index: Time index for input data with time pruning. This is the raw time index with a cropped range and time step applied.

time_index_conflict_check(): Check if the number of input files and the length of the time index is the same

property time_index_file: Get time index file path

property try_load: Check if we should try to load cache

classmethod valid_handle_features(features, handle_features)

Check if features are in handle

Parameters:

features (str | list) – Raw feature names e.g. U_100m
handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle
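A basename check of this kind can be sketched as follows. The suffix convention assumed here (a trailing height or pressure level such as "_100m" or "_500pa") and the helpers feature_basename and valid_handle_features_sketch are hypothetical and may differ from how sup3r parses feature names.

```python
import re


def feature_basename(feature):
    """Strip an assumed trailing level suffix, e.g. "U_100m" -> "U"."""
    return re.sub(r"_\d+(m|pa)$", "", feature)


def valid_handle_features_sketch(features, handle_features):
    """Check that every feature, or its basename, is in handle_features."""
    if isinstance(features, str):
        features = [features]
    return all(f in handle_features or feature_basename(f) in handle_features
               for f in features)
```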

classmethod valid_input_features(features, handle_features)

Check if features are in handle or have compute methods

Parameters:

features (str | list) – Raw feature names e.g. U_100m
handle_features (list) – Features available in raw data

Returns:

bool – Whether each feature basename is in handle or has a compute method