sup3r.preprocessing.data_handling.nc_data_handling.DataHandlerNCforCCwithPowerLaw

class DataHandlerNCforCCwithPowerLaw(*args, nsrdb_source_fp=None, nsrdb_agg=1, nsrdb_smoothing=0, **kwargs)

Bases: DataHandlerNCforCC

Data Handler for NETCDF climate change data with power-law-based extrapolation for windspeeds.

Initialize NETCDF data handler for climate change data.

Parameters:
  • *args (list) – Same ordered required arguments as DataHandler parent class.

  • nsrdb_source_fp (str | None) – Optional NSRDB source h5 file from which to retrieve clearsky_ghi. This is combined with rsds (ghi) from the CC netcdf file to calculate the CC clearsky_ratio.

  • nsrdb_agg (int) – Optional number of NSRDB source pixels whose clearsky_ghi is aggregated to a single climate change netcdf pixel. This can be used if the CC .nc data is at a much coarser resolution than the source NSRDB data.

  • nsrdb_smoothing (float) – Optional gaussian filter smoothing factor to smooth out clearsky_ghi from high-resolution nsrdb source data. This is typically done because spatially aggregated nsrdb data is still usually rougher than CC irradiance data.

  • **kwargs (dict) – Same optional keyword arguments as DataHandler parent class.
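The power-law extrapolation named in the class description can be sketched as follows. This is a minimal illustration, not the class's implementation: the function name, the 10 m reference height, and the shear exponent alpha = 0.2 are assumptions chosen for the example.

```python
def power_law_windspeed(ws_ref, z_ref, z_target, alpha=0.2):
    """Extrapolate wind speed from height z_ref to z_target (meters).

    alpha is the shear exponent; 0.2 is an assumed illustrative value.
    """
    if z_ref <= 0 or z_target <= 0:
        raise ValueError("heights must be positive")
    return ws_ref * (z_target / z_ref) ** alpha


# 5 m/s near the surface (10 m) extrapolated to 100 m
ws_100m = power_law_windspeed(5.0, z_ref=10.0, z_target=100.0)
print(round(ws_100m, 3))
```

A larger alpha gives a stronger increase of wind speed with height, so the exponent should be chosen to suit the terrain and dataset.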

Methods

cache_data(cache_file_paths)

Cache feature data to file and delete from memory

cap_worker_args(max_workers)

Cap all workers args by max_workers

check_cached_features(features[, ...])

Check which features have been cached and check flags to determine whether to load or extract these features again

check_clear_data()

Check if data is cached and clear data if not load_cached

clear_data()

Free memory used for data arrays

compute_raster_index(file_paths, target, ...)

Get raster index for a given target and shape

data_fill(shifted_time_chunks[, max_workers])

Fill final data array with extracted / computed chunks

direct_extract(handle, feature, ...)

Extract requested feature directly from source data, rather than interpolating to a requested height or pressure level

extract_feature(file_paths, raster_index, ...)

Extract single feature from data source.

get_cache_file_names(cache_pattern[, ...])

Get names of cache files from cache_pattern and feature names

get_capped_workers(max_workers_cap, max_workers)

Get max number of workers for a given job.

get_clearsky_ghi()

Get clearsky ghi from an exogenous NSRDB source h5 file at the target CC meta data and time index.

get_closest_lat_lon(lat_lon, target)

Get closest indices to target lat lon

get_file_times(file_paths, **kwargs)

Get time index from data files

get_full_domain(file_paths)

Get full shape and min available lat lon.

get_handle_features(file_paths)

Get all available features in input data

get_input_arrays(data, chunk_number, f, ...)

Get only arrays needed for computations

get_inputs_recursive(feature, handle_features)

Lookup inputs needed to compute feature.

get_lat_lon(file_paths, raster_index[, ...])

Get lat/lon grid for requested target and shape

get_lat_lon_df(target[, features])

Get timeseries for given target

get_next()

Get data for observation using random observation index.

get_node_cmd(config)

Get a CLI call to initialize DataHandler and cache data.

get_raster_index()

Get raster index for file data.

get_raw_feature_list(features, handle_features)

Lookup inputs needed to compute feature

get_time_index(file_paths[, max_workers])

Get time index from data files

has_exact_feature(feature, handle)

Check if exact feature is in handle

has_multilevel_feature(feature, handle)

Check if handle contains multilevel data for the given feature

has_surrounding_features(feature, handle)

Check if handle has feature values at surrounding heights.

lats_are_descending(lat_lon)

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner).

lin_bc(bc_files[, threshold])

Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc

load_cached_data([with_split])

Load data from cache files and split into training and validation

lookup(feature, attr_name[, handle_features])

Lookup feature in feature registry

mask_nan()

Drop timesteps with NaN data

normalize([means, stds, features, max_workers])

Normalize all data features.

parallel_compute(data, file_paths, ...[, ...])

Compute features using parallel subprocesses

parallel_extract(file_paths, raster_index, ...)

Extract features using parallel subprocesses

parallel_load(data, cache_files, features[, ...])

Load feature data in parallel

pop_old_data(data, chunk_number, all_features)

Remove input feature data if no longer needed for requested features

preflight()

Run some preflight checks and verify that the inputs are valid

qdm_bc(bc_files, reference_feature[, ...])

Bias Correction using Quantile Delta Mapping

recursive_compute(data, feature, ...)

Compute intermediate features recursively

run_all_data_init()

Build base 4D data array.

run_data_compute()

Run the data computation / derivation from raw features to desired features.

run_data_extraction()

Run the raw dataset extraction process from disk to raw un-manipulated datasets.

run_nn_fill()

Run nn nan fill on full data array.

serial_compute(data, file_paths, ...)

Compute features in series

serial_data_fill(shifted_time_chunks)

Fill final data array in serial

serial_extract(file_paths, raster_index, ...)

Extract features in series

source_handler(file_paths, **kwargs)

Xarray data handler

split_data([data, val_split, shuffle_time])

Split time dimension into set of training indices and validation indices

time_index_conflict_check()

Check if the number of input files and the length of the time index are the same

valid_handle_features(features, handle_features)

Check if features are in handle

valid_input_features(features, handle_features)

Check if features are in handle or have compute methods

Attributes

CHUNKS

CHUNKS sets the chunk sizes to extract from the data in each dimension.

FEATURE_REGISTRY

attrs

Get attributes of input data

cache_files

Cache files for storing extracted data

cache_pattern

Get correct cache file pattern for formatting.

cached_features

List of features which have been requested but have been determined not to need extraction.

compute_workers

Get upper bound for compute workers based on memory limits.

derive_features

List of features which need to be derived from other features

extract_features

Features to extract directly from the source handler

extract_workers

Get upper bound for extract workers based on memory limits.

feature_mem

Number of bytes for a single feature array.

file_paths

Get file paths for input data

full_raw_lat_lon

Get the full lat/lon grid without doing any latitude inversion

grid_mem

Get memory used by a feature at a single time step

grid_shape

Get shape of raster

handle_features

All features available in raw input

hr_exo_features

Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection.

hr_out_features

Get a list of high-resolution features that are intended to be output by the GAN.

input_file_info

Method to provide info about files in log output.

invert_lat

Whether to invert the latitude axis during data extraction.

is_time_independent

Get whether source data files are time independent

lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.

latitude

Flattened list of latitudes

load_workers

Get upper bound on load workers based on memory limits.

longitude

Flattened list of longitudes

lr_features

Get a list of low-resolution features.

lr_only_features

List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.

means

Get the mean values for each feature.

meta

Meta dataframe with coordinates.

n_tsteps

Get number of time steps to extract

need_full_domain

Check whether we need to get the full lat/lon grid to determine target and shape values

noncached_features

Get list of features needing extraction or derivation

norm_workers

Get upper bound on workers used for normalization.

raster_index

Raster index property

raw_features

Get list of features needed for computations

raw_lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.

raw_time_index

Time index for input data without time pruning.

raw_tsteps

Get number of time steps for all input files

requested_shape

Get requested shape for cached data

shape

Full data shape

single_ts_files

Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

size

Size of data array

source_type

Get data type for source files.

stds

Get the standard deviation values for each feature.

target

Get lower left corner of raster

temporal_slice

Get temporal range to extract from full dataset

ti_workers

Get max number of workers for computing time index

time_chunk_size

Get upper bound on time chunk size based on memory limits

time_chunks

Get time chunks which will be extracted from source data

time_freq_hours

Get the time frequency in hours as a float

time_index

Time index for input data with time pruning.

time_index_file

Get time index file path

try_load

Check if we should try to load cache

CHUNKS: ClassVar[dict] = {'lat': 20, 'lon': 20, 'time': 5}

CHUNKS sets the chunk sizes to extract from the data in each dimension. Chunk sizes that approximately match the data volume being extracted typically result in the most efficient IO.

property attrs

Get attributes of input data

Returns:

dict – Dictionary of attributes

cache_data(cache_file_paths)

Cache feature data to file and delete from memory

Parameters:

cache_file_paths (str | None) – Path to file for saving feature data

property cache_files

Cache files for storing extracted data

property cache_pattern

Get correct cache file pattern for formatting.

Returns:

_cache_pattern (str) – The cache file pattern with formatting keys included.

property cached_features

List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.

cap_worker_args(max_workers)

Cap all workers args by max_workers

static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)

Check which features have been cached and check flags to determine whether to load or extract these features again

Parameters:
  • features (list) – list of features to extract

  • cache_files (list | None) – Path to files with saved feature data

  • overwrite_cache (bool) – Whether to overwrite cached files

  • load_cached (bool) – Whether to load data from cache files

Returns:

list – List of features to extract. Might not include features which have cache files.
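The decision described above can be sketched as follows. This is a simplified, hypothetical helper rather than the library's logic: it assumes one cache file per feature and skips a feature only when its cache file exists and overwrite_cache is False. The load_cached flag is left out of the sketch.

```python
import os
import tempfile


def features_to_extract(features, cache_files=None, overwrite_cache=False):
    """Return the features that still need extraction.

    Hypothetical helper: a feature is skipped only when its cache file
    exists and overwrite_cache is False.
    """
    if cache_files is None or overwrite_cache:
        return list(features)
    return [
        feature
        for feature, path in zip(features, cache_files)
        if not os.path.exists(path)
    ]


with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "cache_U_100m.pkl")
    missing = os.path.join(tmp, "cache_V_100m.pkl")
    open(cached, "wb").close()  # pretend U_100m was cached earlier
    names = ["U_100m", "V_100m"]
    result = features_to_extract(names, [cached, missing])
    forced = features_to_extract(names, [cached, missing], overwrite_cache=True)
```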

check_clear_data()

Check if data is cached and clear data if not load_cached

clear_data()

Free memory used for data arrays

classmethod compute_raster_index(file_paths, target, grid_shape)

Get raster index for a given target and shape

Parameters:
  • file_paths (list) – List of input data file paths

  • target (tuple) – Target coordinate for lower left corner of extracted data

  • grid_shape (tuple) – Shape of extracted data

Returns:

list – List of slices corresponding to extracted data region
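A list of slices of this kind could be built as in the sketch below. It assumes a regular grid described by 1D coordinate lists sorted in ascending order, which may not match the ordering of a given dataset. The function name raster_slices is hypothetical.

```python
def raster_slices(lats, lons, target, grid_shape):
    """Build [row_slice, col_slice] for a regular grid.

    Assumes 1D coordinate lists sorted in ascending order, so the lower
    left corner is the smallest selected row and column index.
    """
    # Index of the grid point nearest to the requested lower left corner
    row = min(range(len(lats)), key=lambda i: abs(lats[i] - target[0]))
    col = min(range(len(lons)), key=lambda j: abs(lons[j] - target[1]))
    n_rows, n_cols = grid_shape
    if row + n_rows > len(lats) or col + n_cols > len(lons):
        raise ValueError("requested grid extends past the data domain")
    return [slice(row, row + n_rows), slice(col, col + n_cols)]


lats = [30.0, 30.5, 31.0, 31.5, 32.0]
lons = [-110.0, -109.5, -109.0, -108.5]
index = raster_slices(lats, lons, target=(30.6, -109.4), grid_shape=(3, 2))
selected_lats = lats[index[0]]
```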

property compute_workers

Get upper bound for compute workers based on memory limits. Used to compute derived features from source dataset.

data_fill(shifted_time_chunks, max_workers=None)

Fill final data array with extracted / computed chunks

Parameters:
  • shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array

  • max_workers (int | None) – Max number of workers to use for building final data array. If None, the max available workers will be used. If 1, cached data will be loaded in serial.
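The role of the time slices can be illustrated with a small serial sketch. The helper names and the use of plain lists in place of arrays are assumptions made for the example.

```python
def make_time_chunks(n_tsteps, chunk_size):
    """Split range(n_tsteps) into consecutive slices of at most chunk_size."""
    return [
        slice(start, min(start + chunk_size, n_tsteps))
        for start in range(0, n_tsteps, chunk_size)
    ]


def fill_from_chunks(chunks, time_slices, n_tsteps):
    """Place each extracted chunk at its location in the final time axis."""
    final = [None] * n_tsteps
    for chunk, time_slice in zip(chunks, time_slices):
        final[time_slice] = chunk
    return final


time_slices = make_time_chunks(7, 3)
filled = fill_from_chunks([[1, 2, 3], [4, 5, 6], [7]], time_slices, 7)
```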

property derive_features

List of features which need to be derived from other features

classmethod direct_extract(handle, feature, raster_index, time_slice)

Extract requested feature directly from source data, rather than interpolating to a requested height or pressure level

Parameters:
  • handle (xarray) – netcdf data object

  • feature (str) – Name of feature to extract directly from source handler

  • raster_index (list) – List of slices for raster index of spatial domain

  • time_slice (slice) – slice of time to extract

Returns:

fdata (ndarray) – Data array for requested feature

classmethod extract_feature(file_paths, raster_index, feature, time_slice=slice(None, None, None), **kwargs)

Extract single feature from data source. The requested feature can match exactly to one found in the source data or can have a matching prefix with a suffix specifying the height or pressure level to interpolate to. e.g. feature=U_100m -> interpolate exact match U to 100 meters.

Parameters:
  • file_paths (list) – path to data file

  • raster_index (ndarray) – Raster index array

  • feature (str) – Feature to extract from data

  • time_slice (slice) – slice of time to extract

  • kwargs (dict) – kwargs passed to source handler for data extraction, e.g. {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}}, which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

ndarray – Data array for extracted feature (spatial_1, spatial_2, temporal)
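The feature naming convention described above (a base name plus a height suffix such as U_100m) can be parsed with a short sketch like the one below. The helper name is hypothetical and only the meter suffix is handled. Pressure-level suffixes are not covered.

```python
import re


def split_feature_height(feature):
    """Split 'U_100m' into ('U', 100.0).

    Returns (feature, None) when there is no height suffix.
    """
    pattern = r"(?P<name>.+)_(?P<height>\d+(?:\.\d+)?)m"
    match = re.fullmatch(pattern, feature)
    if match is None:
        return feature, None
    return match.group("name"), float(match.group("height"))


name, height = split_feature_height("U_100m")
```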

property extract_features

Features to extract directly from the source handler

property extract_workers

Get upper bound for extract workers based on memory limits. Used to extract data from source dataset

property feature_mem

Number of bytes for a single feature array. Used to estimate max_workers.

Returns:

int – Number of bytes for a single feature array

property file_paths

Get file paths for input data

property full_raw_lat_lon

Get the full lat/lon grid without doing any latitude inversion

get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)

Get names of cache files from cache_pattern and feature names

Parameters:
  • cache_pattern (str) – Pattern to use for cache file names

  • grid_shape (tuple) – Shape of grid to use for cache file naming

  • time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming

  • target (tuple) – Target to use for cache file naming

  • features (list) – List of features to use for cache file naming

Returns:

list – List of cache file names

static get_capped_workers(max_workers_cap, max_workers)

Get max number of workers for a given job. Capped to global max workers if specified

Parameters:
  • max_workers_cap (int | None) – Cap for job specific max_workers

  • max_workers (int | None) – Job specific max_workers

Returns:

max_workers (int | None) – job specific max_workers capped by max_workers_cap if provided
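One plausible reading of this capping rule is sketched below. The handling of None on either side is an assumption based on the description, and the function name is hypothetical.

```python
def capped_workers(max_workers_cap, max_workers):
    """Cap a job-specific worker count by an optional global cap."""
    if max_workers_cap is None:
        return max_workers
    if max_workers is None:
        return max_workers_cap
    return min(max_workers_cap, max_workers)
```

For example, a job asking for 4 workers under a cap of 2 would get 2, while a job with no cap keeps its own setting.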

get_clearsky_ghi()

Get clearsky ghi from an exogenous NSRDB source h5 file at the target CC meta data and time index.

Returns:

cs_ghi (np.ndarray) – Clearsky ghi (W/m2) from the nsrdb_source_fp h5 source file. Data shape is (lat, lon, time) where time is daily average values.
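The daily averaging mentioned in the return value can be sketched for a single pixel as follows. This is an illustration only: it assumes the source series covers whole days at a fixed number of steps per day, and the helper name is hypothetical.

```python
def daily_average(values, steps_per_day):
    """Average a sub-daily series into one value per day."""
    if len(values) % steps_per_day != 0:
        raise ValueError("series must cover whole days")
    return [
        sum(values[start:start + steps_per_day]) / steps_per_day
        for start in range(0, len(values), steps_per_day)
    ]


# Two days of 6-hourly clearsky GHI (W/m2) at one pixel
six_hourly = [0, 100, 200, 100, 0, 200, 400, 200]
daily = daily_average(six_hourly, steps_per_day=4)
```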

static get_closest_lat_lon(lat_lon, target)

Get closest indices to target lat lon

Parameters:
  • lat_lon (ndarray) – Array of lat/lon (spatial_1, spatial_2, 2) Last dimension in order of (lat, lon)

  • target (tuple) – (lat, lon) for target coordinate

Returns:

  • row (int) – row index for closest lat/lon to target lat/lon

  • col (int) – col index for closest lat/lon to target lat/lon
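A brute-force version of this nearest-point search is sketched below using nested lists in place of an ndarray. Measuring distance in degrees is an assumed simplification, and the function name is hypothetical.

```python
def closest_lat_lon(lat_lon, target):
    """Return (row, col) of the grid point nearest to target=(lat, lon).

    lat_lon is a nested list shaped (spatial_1, spatial_2, 2) with
    (lat, lon) in the last dimension.
    """
    best = None
    for row, grid_row in enumerate(lat_lon):
        for col, (lat, lon) in enumerate(grid_row):
            dist = (lat - target[0]) ** 2 + (lon - target[1]) ** 2
            if best is None or dist < best[0]:
                best = (dist, row, col)
    return best[1], best[2]


grid = [
    [(31.0, -110.0), (31.0, -109.0)],
    [(30.0, -110.0), (30.0, -109.0)],
]
row, col = closest_lat_lon(grid, target=(30.1, -109.2))
```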

classmethod get_file_times(file_paths, **kwargs)

Get time index from data files

Parameters:
  • file_paths (list) – path to data file

  • kwargs (dict) – kwargs passed to source handler for data extraction, e.g. {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}}, which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

time_index (pd.DatetimeIndex) – List of times as a DatetimeIndex

classmethod get_full_domain(file_paths)

Get full shape and min available lat lon. To simplify processing of full domain without needing to specify target and shape.

Parameters:

file_paths (list) – List of data file paths

Returns:

  • target (tuple) – (lat, lon) for lower left corner

  • lat_lon (ndarray) – Raw lat/lon array for entire domain

classmethod get_handle_features(file_paths)

Get all available features in input data

Parameters:

file_paths (list) – List of input file paths

Returns:

handle_features (list) – List of available input features

classmethod get_input_arrays(data, chunk_number, f, handle_features)

Get only arrays needed for computations

Parameters:
  • data (dict) – Dictionary of feature arrays

  • chunk_number (int) – time chunk for which to get input arrays

  • f (str) – feature to compute using input arrays

  • handle_features (list) – Features available in raw data

Returns:

dict – Dictionary of arrays with only needed features

classmethod get_inputs_recursive(feature, handle_features)

Lookup inputs needed to compute feature. Walk through inputs methods for each required feature to get all raw features.

Parameters:
  • feature (str) – Feature for which to get needed inputs for derivation

  • handle_features (list) – Features available in raw data

Returns:

list – List of input features
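The recursive walk can be illustrated with a toy registry. The registry contents, the feature names in it, and the helper name are all hypothetical. The real feature registry maps features to classes with inputs methods rather than to plain lists.

```python
# Hypothetical registry: derived feature -> features it is computed from
REGISTRY = {
    "windspeed_100m": ["U_100m", "V_100m"],
    "power_density_100m": ["windspeed_100m", "rho_100m"],
}


def inputs_recursive(feature, handle_features, registry=REGISTRY):
    """Resolve a feature down to the raw features it depends on."""
    if feature in handle_features or feature not in registry:
        return [feature]
    raw = []
    for dependency in registry[feature]:
        for name in inputs_recursive(dependency, handle_features, registry):
            if name not in raw:  # keep order, drop repeats
                raw.append(name)
    return raw


available = ["U_100m", "V_100m", "rho_100m"]
needed = inputs_recursive("power_density_100m", available)
```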

classmethod get_lat_lon(file_paths, raster_index, invert_lat=False)

Get lat/lon grid for requested target and shape

Parameters:
  • file_paths (list) – path to data file

  • raster_index (ndarray | list) – Raster index array or list of slices

  • invert_lat (bool) – Flag to invert data along the latitude axis. WRF data tends to use an increasing ordering for latitude while WTK uses a decreasing ordering.

Returns:

ndarray – (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension

get_lat_lon_df(target, features=None)

Get timeseries for given target

Parameters:
  • target (tuple) – (lat, lon) for target coordinate

  • features (list | None) – Optional list of features to include in returned data. If None then all available features are returned.

Returns:

df (pd.DataFrame) – Pandas dataframe with columns for each feature and timeindex for the given target

get_next()

Get data for observation using random observation index. Loops repeatedly over randomized time index

Returns:

observation (np.ndarray) – 4D array (spatial_1, spatial_2, temporal, features)

classmethod get_node_cmd(config)

Get a CLI call to initialize DataHandler and cache data.

Parameters:

config (dict) – sup3r data handler config with all necessary args and kwargs to initialize DataHandler and run data extraction.

get_raster_index()

Get raster index for file data. Here we assume the list of paths in file_paths all have data with the same spatial domain. We use the first file in the list to compute the raster.

Returns:

raster_index (np.ndarray) – 2D array of grid indices

classmethod get_raw_feature_list(features, handle_features)

Lookup inputs needed to compute feature

Parameters:
  • features (list) – Features for which to get needed inputs for derivation

  • handle_features (list) – Features available in raw data

Returns:

list – List of input features

classmethod get_time_index(file_paths, max_workers=None, **kwargs)

Get time index from data files

Parameters:
  • file_paths (list) – path to data file

  • max_workers (int | None) – Max number of workers to use for parallel time index building

  • kwargs (dict) – kwargs passed to source handler for data extraction, e.g. {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}}, which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

time_index (pd.DatetimeIndex) – List of times as a DatetimeIndex

property grid_mem

Get memory used by a feature at a single time step

Returns:

int – Number of bytes for a single feature array at a single time step

property grid_shape

Get shape of raster

Returns:

_grid_shape (tuple) – (rows, cols) grid size.

property handle_features

All features available in raw input

classmethod has_exact_feature(feature, handle)

Check if exact feature is in handle

Parameters:
  • feature (str) – Raw feature name e.g. U_100m

  • handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains exact feature or not

classmethod has_multilevel_feature(feature, handle)

Check if handle contains multilevel data for the given feature

Parameters:
  • feature (str) – Raw feature name e.g. U_100m

  • handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains multilevel data for given feature

classmethod has_surrounding_features(feature, handle)

Check if handle has feature values at surrounding heights. e.g. if feature=U_40m check if the handler has u at heights below and above 40m

Parameters:
  • feature (str) – Raw feature name e.g. U_100m

  • handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether feature has surrounding heights
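The idea of surrounding heights can be sketched together with a simple linear interpolation between them. Both helper names are hypothetical, and linear interpolation is an assumed choice for the illustration.

```python
def has_surrounding_heights(height, available_heights):
    """True if data exists both below and above the requested height."""
    below = any(h < height for h in available_heights)
    above = any(h > height for h in available_heights)
    return below and above


def interpolate_to_height(height, profile):
    """Linearly interpolate a {height_m: value} profile to height."""
    if height in profile:
        return profile[height]
    if not has_surrounding_heights(height, profile):
        raise ValueError("no data on both sides of the requested height")
    lower = max(h for h in profile if h < height)
    upper = min(h for h in profile if h > height)
    weight = (height - lower) / (upper - lower)
    return profile[lower] + weight * (profile[upper] - profile[lower])


u_profile = {10: 4.0, 100: 7.0}  # U component (m/s) at 10 m and 100 m
u_40m = interpolate_to_height(40, u_profile)
```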

property hr_exo_features

Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set. These can also be input to the model as low-res features.

property hr_out_features

Get a list of high-resolution features that are intended to be output by the GAN. Does not include high-resolution exogenous features

property input_file_info

Method to provide info about files in log output. Since NETCDF files have single time slices, printing out all the file paths is just a text dump without much info.

Returns:

str – message to append to log output that does not include a huge info dump of file paths

property invert_lat

Whether to invert the latitude axis during data extraction. This is to enforce a descending latitude ordering so that the lower left corner of the grid is at idx=(-1, 0) instead of idx=(0, 0)

property is_time_independent

Get whether source data files are time independent

property lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This ensures that the lower left hand corner of the domain is given by lat_lon[-1, 0]

Returns:

ndarray

property latitude

Flattened list of latitudes

classmethod lats_are_descending(lat_lon)

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner)

Parameters:

lat_lon (np.ndarray) – Lat/Lon array with shape (n_lats, n_lons, 2)

Returns:

bool
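A check of this kind, and the row flip it would trigger, can be sketched with nested lists. The helper names are hypothetical, and the sketch assumes latitude varies along the first axis only.

```python
def latitudes_descend(lat_lon):
    """True when the first row holds the northernmost latitudes."""
    return lat_lon[0][0][0] > lat_lon[-1][0][0]


def with_descending_lats(lat_lon):
    """Flip the row order if needed so the lower left is at [-1][0]."""
    return lat_lon if latitudes_descend(lat_lon) else lat_lon[::-1]


ascending = [
    [(30.0, -110.0), (30.0, -109.0)],
    [(31.0, -110.0), (31.0, -109.0)],
]
grid = with_descending_lats(ascending)
lower_left = grid[-1][0]
```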

lin_bc(bc_files, threshold=0.1)

Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc

Parameters:
  • bc_files (list | tuple | str) – One or more filepaths to .h5 files output by MonthlyLinearCorrection or LinearCorrection. These should contain datasets named “{feature}_scalar” and “{feature}_adder” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time is length 1 for annual correction or 12 for monthly correction.

  • threshold (float) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
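For a single pixel, applying scalar and adder factors of this shape could look like the sketch below. It is an illustration under stated assumptions: the correction is taken to be value * scalar + adder, factors of length 12 are indexed by calendar month, and the helper name is hypothetical.

```python
def linear_bias_correct(values, months, scalar, adder):
    """Apply value * scalar + adder at one pixel.

    months holds the calendar month (1-12) of each time step. scalar and
    adder have length 1 (annual) or 12 (monthly).
    """
    corrected = []
    for value, month in zip(values, months):
        idx = month - 1 if len(scalar) == 12 else 0
        corrected.append(value * scalar[idx] + adder[idx])
    return corrected


monthly_scalar = [2.0, 0.5] + [1.0] * 10
monthly_adder = [1.0, 0.0] + [0.0] * 10
monthly = linear_bias_correct([10.0, 20.0], [1, 2], monthly_scalar, monthly_adder)
annual = linear_bias_correct([10.0, 20.0], [1, 2], [2.0], [-1.0])
```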

load_cached_data(with_split=True)

Load data from cache files and split into training and validation

Parameters:

with_split (bool) – Whether to split into training and validation data or not.

property load_workers

Get upper bound on load workers based on memory limits. Used to load cached data.

property longitude

Flattened list of longitudes

classmethod lookup(feature, attr_name, handle_features=None)

Lookup feature in feature registry

Parameters:
  • feature (str) – Feature to lookup in registry

  • attr_name (str) – Type of method to lookup. e.g. inputs or compute

  • handle_features (list) – List of feature names (datasets) available in the source file. If feature is found explicitly in this list, height/pressure suffixes will not be appended to the output.

Returns:

method | None – Feature registry method corresponding to feature

property lr_features

Get a list of low-resolution features. It is assumed that all features are used in the low-resolution observations. If you want to use high-res-only features, use the DualDataHandler class.

property lr_only_features

List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.

mask_nan()

Drop timesteps with NaN data

property means

Get the mean values for each feature.

Returns:

dict

property meta

Meta dataframe with coordinates.

property n_tsteps

Get number of time steps to extract

property need_full_domain

Check whether we need to get the full lat/lon grid to determine target and shape values

property noncached_features

Get list of features needing extraction or derivation

property norm_workers

Get upper bound on workers used for normalization.

normalize(means=None, stds=None, features=None, max_workers=None)

Normalize all data features.

Parameters:
  • means (dict | None) – Dictionary of means for all features with keys: feature names and values: mean values. If this is None, the self.means attribute will be used. If this is not None, this DataHandler object's means attribute will be updated.

  • stds (dict | None) – Dictionary of standard deviation values for all features with keys: feature names and values: standard deviations. If this is None, the self.stds attribute will be used. If this is not None, this DataHandler object's stds attribute will be updated.

  • features (list | None) – List of features used for indexing data array during normalization.

  • max_workers (None | int) – Max workers to perform normalization. If None, self.norm_workers will be used
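The per-feature normalization can be sketched with dictionaries keyed by feature name, mirroring the means and stds parameters. The helper name and the use of flat lists in place of the 4D data array are assumptions made for the example.

```python
def normalize_features(data, means, stds):
    """Return (value - mean) / std for every feature in data."""
    normalized = {}
    for feature, values in data.items():
        if stds[feature] == 0:
            raise ValueError(f"zero standard deviation for {feature}")
        normalized[feature] = [
            (value - means[feature]) / stds[feature] for value in values
        ]
    return normalized


data = {"U_100m": [2.0, 4.0, 6.0]}
scaled = normalize_features(data, means={"U_100m": 4.0}, stds={"U_100m": 2.0})
```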

classmethod parallel_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features, max_workers=None)

Compute features using parallel subprocesses

Parameters:
  • data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

  • file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.

  • raster_index (ndarray) – raster index for spatial domain

  • time_chunks (list) – List of slices to chunk data feature extraction along time dimension

  • derived_features (list) – list of feature strings which need to be derived

  • all_features (list) – list of all features including those requiring derivation from input features

  • handle_features (list) – Features available in raw data

  • max_workers (int | None) – Number of max workers to use for computation. If equal to 1 then method is run in serial

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. Includes e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
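The data[chunk_number][feature] layout and the serial fallback for max_workers=1 can be illustrated as below. This sketch uses a thread pool from the standard library rather than subprocesses, derives a single hypothetical feature (windspeed from U and V), and uses flat lists in place of 3D arrays.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def windspeed(chunk):
    """Derive windspeed from the U and V components of one chunk."""
    return [math.hypot(u, v) for u, v in zip(chunk["U_100m"], chunk["V_100m"])]


def compute_chunks(data, max_workers=None):
    """Add 'windspeed_100m' to every chunk in data[chunk_number][feature]."""
    if max_workers == 1:
        results = {number: windspeed(chunk) for number, chunk in data.items()}
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {
                number: pool.submit(windspeed, chunk)
                for number, chunk in data.items()
            }
            results = {number: fut.result() for number, fut in futures.items()}
    for number, values in results.items():
        data[number]["windspeed_100m"] = values
    return data


data = {
    0: {"U_100m": [3.0, 0.0], "V_100m": [4.0, 2.0]},
    1: {"U_100m": [6.0], "V_100m": [8.0]},
}
data = compute_chunks(data, max_workers=2)
```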

classmethod parallel_extract(file_paths, raster_index, time_chunks, input_features, max_workers=None, **kwargs)

Extract features using parallel subprocesses

Parameters:
  • file_paths (list) – list of file paths

  • raster_index (ndarray | list) – raster index for spatial domain

  • time_chunks (list) – List of slices to chunk data feature extraction along time dimension

  • input_features (list) – list of input feature strings

  • max_workers (int | None) – Number of max workers to use for extraction. If equal to 1 then method is run in serial

  • kwargs (dict) – kwargs passed to source handler for data extraction, e.g. {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}}, which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

parallel_load(data, cache_files, features, max_workers=None)

Load feature data in parallel

Parameters:
  • data (ndarray) – Array to fill with cached data

  • cache_files (list) – List of cache files for each feature

  • features (list) – List of requested features

  • max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.

classmethod pop_old_data(data, chunk_number, all_features)

Remove input feature data if no longer needed for requested features

Parameters:
  • data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

  • chunk_number (int) – time chunk index to check

  • all_features (list) – list of all requested features including those requiring derivation from input features

preflight()

Run some preflight checks and verify that the inputs are valid

qdm_bc(bc_files, reference_feature, relative=True, threshold=0.1, no_trend=False)

Bias Correction using Quantile Delta Mapping

Bias correct this DataHandler’s data with Quantile Delta Mapping. The required statistical distributions should be pre-calculated using sup3r.bias.qdm.QuantileDeltaMappingCorrection.

Warning: There is no guarantee that the coefficients from bc_files match the resource processed here. Be careful choosing bc_files.

Parameters:
  • bc_files (list | tuple | str) – One or more filepaths to .h5 files output by bias_calc.QuantileDeltaMappingCorrection. These should contain datasets named “base_{reference_feature}_params”, “bias_{feature}_params”, and “bias_fut_{feature}_params” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time.

  • reference_feature (str) – Name of the feature used as (historical) reference. Dataset with name “base_{reference_feature}_params” will be retrieved from bc_files.

  • relative (bool, default=True) – Switcher to apply QDM as a relative (use True) or absolute (use False) correction value.

  • threshold (float, default=0.1) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.

  • no_trend (bool, default=False) – An option to ignore the trend component of the correction, thus resulting in an ordinary Quantile Mapping, i.e. corrects the bias by comparing the distributions of the biased dataset with a reference dataset. See params_mf of rex.utilities.bc_utils.QuantileDeltaMapping. Note that this assumes that “bias_{feature}_params” (params_mh) is the data distribution representative for the target data.

property raster_index

Raster index property

property raw_features

Get list of features needed for computations

property raw_lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This returns the grid without any lat inversion.

Returns:

ndarray

property raw_time_index

Time index for input data without time pruning. This is the base time index for the raw input data.

property raw_tsteps

Get number of time steps for all input files

classmethod recursive_compute(data, feature, handle_features, file_paths, raster_index)

Compute intermediate features recursively

Parameters:
  • data (dict) – dictionary of feature arrays. e.g. data[feature] = array. (spatial_1, spatial_2, temporal)

  • feature (str) – Name of feature to compute

  • handle_features (list) – Features available in raw data

  • file_paths (list) – Paths to data files. Used if the compute method operates directly on the source handler instead of on input arrays. This is the case for features without an inputs method, such as lat_lon and topography.

  • raster_index (ndarray) – raster index for spatial domain

Returns:

ndarray – Array of computed feature data

property requested_shape

Get requested shape for cached data

run_all_data_init()

Build base 4D data array. Can handle multiple files but assumes each file has the same spatial domain

Returns:

data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)

run_data_compute()

Run the data computation / derivation from raw features to desired features.

run_data_extraction()

Run the raw dataset extraction process from disk to raw un-manipulated datasets.

Includes a special method to extract clearsky_ghi from an exogenous NSRDB source h5 file (required to compute clearsky_ratio).

run_nn_fill()

Run nearest neighbor (nn) NaN fill on the full data array.

classmethod serial_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features)

Compute features in series

Parameters:
  • data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

  • file_paths (list) – Paths to data files. Used if the compute method operates directly on the source handler instead of on input arrays. This is the case for features without an inputs method, such as lat_lon and topography.

  • raster_index (ndarray) – raster index for spatial domain

  • time_chunks (list) – List of slices to chunk data feature extraction along time dimension

  • derived_features (list) – list of feature strings which need to be derived

  • all_features (list) – list of all features including those requiring derivation from input features

  • handle_features (list) – Features available in raw data

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

serial_data_fill(shifted_time_chunks)

Fill final data array in serial

Parameters:

shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array
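
The chunked layout used by the extract, compute, and fill methods can be pictured with a small sketch. This is an illustration only, not the sup3r implementation: the sizes, feature names, and fill values are made up, and the time chunks start at index 0, so the shifted chunks equal the original ones.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration
n_lat, n_lon, n_time = 4, 5, 10
features = ["U_100m", "V_100m"]
chunk_size = 4

# Time slices that split the time dimension into chunks
time_chunks = [slice(start, min(start + chunk_size, n_time))
               for start in range(0, n_time, chunk_size)]

# chunked[chunk_number][feature] = array (spatial_1, spatial_2, temporal)
chunked = {}
for chunk_number, t_slice in enumerate(time_chunks):
    n_steps = t_slice.stop - t_slice.start
    chunked[chunk_number] = {
        feature: np.full((n_lat, n_lon, n_steps), float(chunk_number))
        for feature in features
    }

# Fill the final 4D array (spatial_1, spatial_2, temporal, features),
# writing each chunk into its own slice of the time axis
data = np.zeros((n_lat, n_lon, n_time, len(features)), dtype=np.float32)
for chunk_number, t_slice in enumerate(time_chunks):
    for f_idx, feature in enumerate(features):
        data[:, :, t_slice, f_idx] = chunked[chunk_number][feature]
```

Each chunk is filled with its own chunk number here, which makes it easy to see where it lands along the time axis of the final array.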

classmethod serial_extract(file_paths, raster_index, time_chunks, input_features, **kwargs)

Extract features in series

Parameters:
  • file_paths (list) – list of file paths

  • raster_index (ndarray) – raster index for spatial domain

  • time_chunks (list) – List of slices to chunk data feature extraction along time dimension

  • input_features (list) – list of input feature strings

  • kwargs (dict) – kwargs passed to source handler for data extraction, e.g. {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}}, which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

property shape

Full data shape

Returns:

shape (tuple) – Full data shape (spatial_1, spatial_2, temporal, features)

property single_ts_files

Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

property size

Size of data array

Returns:

size (int) – Number of total elements contained in data array

classmethod source_handler(file_paths, **kwargs)

Xarray data handler

Note that xarray appears to treat open file handlers as singletons within a threadpool, so it's okay to open this source_handler without a context manager or a .close() statement.

Parameters:
  • file_paths (str | list) – paths to data files

  • kwargs (dict) – kwargs passed to source handler for data extraction, e.g. {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}}, which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

data (xarray.Dataset)

property source_type

Get data type for source files. Either nc or h5

split_data(data=None, val_split=0.0, shuffle_time=False)

Split time dimension into set of training indices and validation indices

Parameters:
  • data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)

  • val_split (float) – Fraction of data to separate for validation.

  • shuffle_time (bool) – Whether to shuffle time or not.

Returns:

  • data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Training data fraction of initial data array. Initial data array is overwritten by this new data array.

  • val_data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Validation data fraction of initial data array.
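
A minimal sketch of such a time-dimension split, using NumPy only. The helper name split_time, the seed argument, and the choice to take the validation steps from the front of the (optionally shuffled) index are assumptions made for illustration. This is not the sup3r implementation.

```python
import numpy as np


def split_time(data, val_split=0.0, shuffle_time=False, seed=0):
    """Hypothetical helper: split a 4D array along its time axis (axis 2)."""
    n_time = data.shape[2]
    indices = np.arange(n_time)
    if shuffle_time:
        # Shuffle time indices in place before splitting
        np.random.default_rng(seed).shuffle(indices)
    n_val = int(val_split * n_time)
    val_indices, train_indices = indices[:n_val], indices[n_val:]
    return data[:, :, train_indices, :], data[:, :, val_indices, :]


# Toy array with shape (spatial_1, spatial_2, temporal, features)
data = np.arange(2 * 3 * 10 * 1, dtype=float).reshape(2, 3, 10, 1)
train, val = split_time(data, val_split=0.2)
```

With val_split=0.2 and 10 time steps, the sketch keeps 8 steps for training and 2 for validation. The spatial and feature dimensions are unchanged.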

property stds

Get the standard deviation values for each feature.

Returns:

dict

property target

Get lower left corner of raster

Returns:

_target (tuple) – (lat, lon) lower left corner of raster.

property temporal_slice

Get temporal range to extract from full dataset

property ti_workers

Get max number of workers for computing time index

property time_chunk_size

Get upper bound on time chunk size based on memory limits

property time_chunks

Get time chunks which will be extracted from source data

Returns:

_time_chunks (list) – List of time chunks used to split up source data time dimension so that each chunk can be extracted individually

property time_freq_hours

Get the time frequency in hours as a float

property time_index

Time index for input data with time pruning. This is the raw time index with a cropped range and time step applied.

time_index_conflict_check()

Check if the number of input files and the length of the time index are the same

property time_index_file

Get time index file path

property try_load

Check if we should try to load cache

classmethod valid_handle_features(features, handle_features)

Check if features are in handle

Parameters:
  • features (str | list) – Raw feature names e.g. U_100m

  • handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle

classmethod valid_input_features(features, handle_features)

Check if features are in handle or have compute methods

Parameters:
  • features (str | list) – Raw feature names e.g. U_100m

  • handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle or has a compute method
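
The two feature checks above can be sketched as follows. All names here are hypothetical, and the suffix pattern used to derive a basename (a trailing height such as _100m or a pressure level ending in pa) is an assumption. The library's own basename logic may differ.

```python
import re


def basename(feature):
    """Strip a trailing height or pressure suffix, e.g. U_100m -> U.

    The suffix pattern is an assumption made for this sketch.
    """
    return re.sub(r"_\d+(m|pa)$", "", feature)


def in_handle(features, handle_features):
    """True if every feature basename is available in the raw data."""
    if isinstance(features, str):
        features = [features]
    return all(basename(f) in handle_features for f in features)


def in_handle_or_computable(features, handle_features, computable):
    """True if every feature is in the handle or can be computed.

    `computable` is a hypothetical set of derivable feature basenames.
    """
    if isinstance(features, str):
        features = [features]
    return all(basename(f) in handle_features or basename(f) in computable
               for f in features)


handle = ["U", "V"]
requested = ["U_100m", "windspeed_100m"]
raw_ok = in_handle(requested, handle)
derived_ok = in_handle_or_computable(requested, handle, {"windspeed"})
```

Here windspeed_100m is absent from the raw data, so the first check fails, while the second passes because windspeed is listed as computable.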