sup3r.preprocessing.data_handling.h5_data_handling.DataHandlerH5WindCC

class DataHandlerH5WindCC(*args, **kwargs)

Bases: DataHandlerH5

Special data handling and batch sampling for h5 wtk or nsrdb data for climate change applications

Parameters:
  • *args (list) – Same positional args as DataHandlerH5

  • **kwargs (dict) – Same keyword args as DataHandlerH5

Methods

cache_data(cache_file_paths)

Cache feature data to file and delete from memory

cap_worker_args(max_workers)

Cap all workers args by max_workers

check_cached_features(features[, ...])

Check which features have been cached and check flags to determine whether to load or extract these features again

check_clear_data()

Check if data is cached and clear data if not load_cached

clear_data()

Free memory used for data arrays

data_fill(shifted_time_chunks[, max_workers])

Fill final data array with extracted / computed chunks

extract_feature(file_paths, raster_index, ...)

Extract single feature from data source

get_cache_file_names(cache_pattern[, ...])

Get names of cache files from cache_pattern and feature names

get_capped_workers(max_workers_cap, max_workers)

Get max number of workers for a given job.

get_closest_lat_lon(lat_lon, target)

Get closest indices to target lat lon

get_full_domain(file_paths)

Get target and shape for largest domain possible

get_handle_features(file_paths)

Get all available features in input data

get_input_arrays(data, chunk_number, f, ...)

Get only arrays needed for computations

get_inputs_recursive(feature, handle_features)

Lookup inputs needed to compute feature.

get_lat_lon(file_paths, raster_index[, ...])

Get lat/lon grid for requested target and shape

get_lat_lon_df(target[, features])

Get timeseries for given target

get_next()

Get data for observation using random observation index.

get_node_cmd(config)

Get a CLI call to initialize DataHandler and cache data.

get_observation_index()

Randomly gets spatial sample and time sample

get_raster_index()

Get raster index for file data.

get_raw_feature_list(features, handle_features)

Lookup inputs needed to compute feature

get_time_index(file_paths[, max_workers])

Get time index from data files

has_exact_feature(feature, handle)

Check if exact feature is in handle

has_multilevel_feature(feature, handle)

Check if multilevel data for feature is in handle

has_surrounding_features(feature, handle)

Check if handle has feature values at surrounding heights.

lats_are_descending(lat_lon)

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner).

lin_bc(bc_files[, threshold])

Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc

load_cached_data([with_split])

Load data from cache files and split into training and validation

lookup(feature, attr_name[, handle_features])

Lookup feature in feature registry

mask_nan()

Drop timesteps with NaN data

normalize([means, stds, features, max_workers])

Normalize all data features.

parallel_compute(data, file_paths, ...[, ...])

Compute features using parallel subprocesses

parallel_extract(file_paths, raster_index, ...)

Extract features using parallel subprocesses

parallel_load(data, cache_files, features[, ...])

Load feature data in parallel

pop_old_data(data, chunk_number, all_features)

Remove input feature data if no longer needed for requested features

preflight()

Run some preflight checks and verify that the inputs are valid

qdm_bc(bc_files, reference_feature[, ...])

Bias Correction using Quantile Delta Mapping

recursive_compute(data, feature, ...)

Compute intermediate features recursively

run_all_data_init()

Build base 4D data array.

run_daily_averages()

Calculate daily average data and store as attribute.

run_data_compute()

Run the data computation / derivation from raw features to desired features.

run_data_extraction()

Run the raw dataset extraction process from disk to raw un-manipulated datasets.

run_nn_fill()

Run nn nan fill on full data array.

serial_compute(data, file_paths, ...)

Compute features in series

serial_data_fill(shifted_time_chunks)

Fill final data array in serial

serial_extract(file_paths, raster_index, ...)

Extract features in series

source_handler(file_paths, **kwargs)

Rex data handler

split_data([data, val_split, shuffle_time])

Split time dimension into set of training indices and validation indices.

time_index_conflict_check()

Check if the number of input files and the length of the time index is the same

valid_handle_features(features, handle_features)

Check if features are in handle

valid_input_features(features, handle_features)

Check if features are in handle or have compute methods

Attributes

FEATURE_REGISTRY

attrs

Get attributes of input data

cache_files

Cache files for storing extracted data

cache_pattern

Get correct cache file pattern for formatting.

cached_features

List of features which have been requested but have been determined not to need extraction.

compute_workers

Get upper bound for compute workers based on memory limits.

derive_features

List of features which need to be derived from other features

extract_features

Features to extract directly from the source handler

extract_workers

Get upper bound for extract workers based on memory limits.

feature_mem

Number of bytes for a single feature array.

file_paths

Get file paths for input data

full_raw_lat_lon

Get the full lat/lon grid without doing any latitude inversion

grid_mem

Get memory used by a feature at a single time step

grid_shape

Get shape of raster

handle_features

All features available in raw input

hr_exo_features

Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection.

hr_out_features

Get a list of high-resolution features that are intended to be output by the GAN.

input_file_info

Method to provide info about files in log output.

invert_lat

Whether to invert the latitude axis during data extraction.

is_time_independent

Get whether source data files are time independent

lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.

latitude

Flattened list of latitudes

load_workers

Get upper bound on load workers based on memory limits.

longitude

Flattened list of longitudes

lr_features

Get a list of low-resolution features.

lr_only_features

List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.

means

Get the mean values for each feature.

meta

Meta dataframe with coordinates.

n_tsteps

Get number of time steps to extract

need_full_domain

Check whether we need to get the full lat/lon grid to determine target and shape values

noncached_features

Get list of features needing extraction or derivation

norm_workers

Get upper bound on workers used for normalization.

raster_index

Raster index property

raw_features

Get list of features needed for computations

raw_lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.

raw_time_index

Time index for input data without time pruning.

raw_tsteps

Get number of time steps for all input files

requested_shape

Get requested shape for cached data

shape

Full data shape

single_ts_files

Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

size

Size of data array

source_type

Get data type for source files.

stds

Get the standard deviation values for each feature.

target

Get lower left corner of raster

temporal_slice

Get temporal range to extract from full dataset

ti_workers

Get max number of workers for computing time index

time_chunk_size

Get upper bound on time chunk size based on memory limits

time_chunks

Get time chunks which will be extracted from source data

time_freq_hours

Get the time frequency in hours as a float

time_index

Time index for input data with time pruning.

time_index_file

Get time index file path

try_load

Check if we should try to load cache

REX_HANDLER

alias of MultiFileWindX

run_daily_averages()

Calculate daily average data and store as attribute.

get_observation_index()

Randomly gets spatial sample and time sample

Returns:

  • obs_ind_hourly (tuple) – Tuple of sampled spatial grid, time slice, and features indices. Used to get single observation like self.data[observation_index]. This is for hourly high-res data slicing.

  • obs_ind_daily (tuple) – Same as obs_ind_hourly but the temporal index (i=2) is a slice of the daily data (self.daily_data) with day integers.
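
The relationship between the two returned index tuples can be sketched as follows. This is a minimal illustration, not the sup3r implementation: the helper name `sample_observation_index`, its arguments, and the choice to align the hourly window to whole days are all assumptions made for the example.

```python
import random


def sample_observation_index(data_shape, sample_shape, n_features, rng=None):
    """Hypothetical helper: sample one spatial window and one whole-day window.

    data_shape is (rows, cols, n_hours) and sample_shape is
    (spatial_1, spatial_2, temporal_hours).
    """
    rng = rng or random.Random()
    rows, cols, n_hours = data_shape
    s1, s2, t_hours = sample_shape
    if t_hours % 24 != 0:
        raise ValueError("temporal sample length must be a multiple of 24")

    t_days = t_hours // 24
    n_days = n_hours // 24
    # random.randint is inclusive on both ends
    row0 = rng.randint(0, rows - s1)
    col0 = rng.randint(0, cols - s2)
    day0 = rng.randint(0, n_days - t_days)

    spatial = (slice(row0, row0 + s1), slice(col0, col0 + s2))
    features = list(range(n_features))
    # Same spatial window and features; only the temporal entry (i=2) differs.
    obs_ind_hourly = spatial + (slice(day0 * 24, (day0 + t_days) * 24), features)
    obs_ind_daily = spatial + (slice(day0, day0 + t_days), features)
    return obs_ind_hourly, obs_ind_daily


hourly, daily = sample_observation_index(
    (20, 20, 240), (10, 10, 48), n_features=3, rng=random.Random(0)
)
```

In this sketch the hourly slice always spans 24 times as many steps as the daily slice and starts at the matching midnight.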

get_next()

Get data for observation using random observation index. Loops repeatedly over randomized time index

Returns:

  • obs_hourly (np.ndarray) – 4D array (spatial_1, spatial_2, temporal_hourly, features)

  • obs_daily_avg (np.ndarray) – 4D array but the temporal axis is temporal_hourly//24 (spatial_1, spatial_2, temporal_daily, features)
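
The shape relationship between the two returned arrays can be illustrated with plain NumPy. The sketch below assumes a simple arithmetic mean over each block of 24 hourly steps; the helper name `daily_average` is hypothetical, and the class itself may aggregate some features differently.

```python
import numpy as np


def daily_average(obs_hourly):
    """Average a (spatial_1, spatial_2, temporal_hourly, features) array by day."""
    s1, s2, n_hours, n_feat = obs_hourly.shape
    if n_hours % 24 != 0:
        raise ValueError("temporal_hourly must be a multiple of 24")
    # Split the time axis into (day, hour_of_day) and average over hour_of_day.
    return obs_hourly.reshape(s1, s2, n_hours // 24, 24, n_feat).mean(axis=3)


# Two days of hourly data whose value equals the hour counter 0..47.
obs_hourly = np.broadcast_to(
    np.arange(48, dtype=float)[None, None, :, None], (2, 2, 48, 1)
).copy()
obs_daily_avg = daily_average(obs_hourly)  # shape (2, 2, 2, 1)
```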

split_data(data=None, val_split=0.0, shuffle_time=False)

Split time dimension into set of training indices and validation indices. For NSRDB it makes sure that the splits happen at midnight.

Parameters:
  • data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)

  • val_split (float) – Fraction of data to separate for validation.

  • shuffle_time (bool) – No effect. Used to fit base class function signature.

Returns:

  • data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Training data fraction of initial data array. Initial data array is overwritten by this new data array.

  • val_data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Validation data fraction of initial data array.
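
One way to keep a split on a midnight boundary is to round the validation fraction to a whole number of days. The sketch below is an assumption-laden illustration: the helper `midnight_split_index` is hypothetical, the data is assumed to be hourly and to start at midnight, and taking the validation days from the end of the record is a choice made for the example rather than documented behavior.

```python
def midnight_split_index(n_hours, val_split):
    """Hypothetical helper: hourly index that separates training from validation."""
    n_days = n_hours // 24
    n_val_days = int(round(n_days * val_split))
    # Multiplying by 24 guarantees the boundary falls on a midnight.
    return (n_days - n_val_days) * 24


split = midnight_split_index(n_hours=240, val_split=0.2)
time_steps = list(range(240))
train_steps, val_steps = time_steps[:split], time_steps[split:]
```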

property attrs

Get attributes of input data

Returns:

dict – Dictionary of attributes

cache_data(cache_file_paths)

Cache feature data to file and delete from memory

Parameters:

cache_file_paths (str | None) – Path to file for saving feature data

property cache_files

Cache files for storing extracted data

property cache_pattern

Get correct cache file pattern for formatting.

Returns:

_cache_pattern (str) – The cache file pattern with formatting keys included.

property cached_features

List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.

cap_worker_args(max_workers)

Cap all workers args by max_workers

static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)

Check which features have been cached and check flags to determine whether to load or extract these features again

Parameters:
  • features (list) – list of features to extract

  • cache_files (list | None) – Path to files with saved feature data

  • overwrite_cache (bool) – Whether to overwrite cached files

  • load_cached (bool) – Whether to load data from cache files

Returns:

list – List of features to extract. Might not include features which have cache files.
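
The decision described here can be sketched roughly as below. This is not the library's code: the helper name is hypothetical, it assumes one cache file per feature in matching order, and it ignores the `load_cached` flag, which in this sketch would only affect whether the cached data is read back afterwards.

```python
import os


def features_needing_extraction(features, cache_files=None, overwrite_cache=False):
    """Hypothetical helper: keep only features without a reusable cache file."""
    if cache_files is None or overwrite_cache:
        # Nothing to reuse, or the caller asked to regenerate everything.
        return list(features)
    return [f for f, path in zip(features, cache_files) if not os.path.exists(path)]
```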

check_clear_data()

Check if data is cached and clear data if not load_cached

clear_data()

Free memory used for data arrays

property compute_workers

Get upper bound for compute workers based on memory limits. Used to compute derived features from source dataset.

data_fill(shifted_time_chunks, max_workers=None)

Fill final data array with extracted / computed chunks

Parameters:
  • shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array

  • max_workers (int | None) – Max number of workers to use for building final data array. If None, the max number of available workers will be used. If 1, cached data will be loaded in serial.

property derive_features

List of features which need to be derived from other features

classmethod extract_feature(file_paths, raster_index, feature, time_slice=slice(None, None, None), **kwargs)

Extract single feature from data source

Parameters:
  • file_paths (list) – path to data file

  • raster_index (ndarray) – Raster index array

  • feature (str) – Feature to extract from data

  • time_slice (slice) – slice of time to extract

  • kwargs (dict) – keyword arguments passed to source handler

Returns:

ndarray – Data array for extracted feature (spatial_1, spatial_2, temporal)

property extract_features

Features to extract directly from the source handler

property extract_workers

Get upper bound for extract workers based on memory limits. Used to extract data from source dataset. The max number of extract workers is number of time chunks * number of features

property feature_mem

Number of bytes for a single feature array. Used to estimate max_workers.

Returns:

int – Number of bytes for a single feature array

property file_paths

Get file paths for input data

property full_raw_lat_lon

Get the full lat/lon grid without doing any latitude inversion

get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)

Get names of cache files from cache_pattern and feature names

Parameters:
  • cache_pattern (str) – Pattern to use for cache file names

  • grid_shape (tuple) – Shape of grid to use for cache file naming

  • time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming

  • target (tuple) – Target to use for cache file naming

  • features (list) – List of features to use for cache file naming

Returns:

list – List of cache file names

static get_capped_workers(max_workers_cap, max_workers)

Get max number of workers for a given job. Capped to global max workers if specified

Parameters:
  • max_workers_cap (int | None) – Cap for job specific max_workers

  • max_workers (int | None) – Job specific max_workers

Returns:

max_workers (int | None) – job specific max_workers capped by max_workers_cap if provided
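
A minimal sketch of the capping rule, assuming that a `None` on either side means "no limit from that side". The function name `capped_workers` is hypothetical and the real method may handle edge cases differently.

```python
def capped_workers(max_workers_cap, max_workers):
    """Return the job-specific worker count limited by an optional global cap."""
    if max_workers_cap is None:
        return max_workers
    if max_workers is None:
        return max_workers_cap
    return min(max_workers_cap, max_workers)
```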

static get_closest_lat_lon(lat_lon, target)

Get closest indices to target lat lon

Parameters:
  • lat_lon (ndarray) – Array of lat/lon (spatial_1, spatial_2, 2) Last dimension in order of (lat, lon)

  • target (tuple) – (lat, lon) for target coordinate

Returns:

  • row (int) – row index for closest lat/lon to target lat/lon

  • col (int) – col index for closest lat/lon to target lat/lon
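
A nearest-grid-point lookup of this kind can be written in a few lines of NumPy. The sketch below assumes plain Euclidean distance in degrees; the function name `closest_lat_lon` and the small example grid are made up for illustration.

```python
import numpy as np


def closest_lat_lon(lat_lon, target):
    """Return (row, col) of the grid point nearest to target=(lat, lon)."""
    dist = np.hypot(lat_lon[..., 0] - target[0], lat_lon[..., 1] - target[1])
    row, col = np.unravel_index(np.argmin(dist), dist.shape)
    return int(row), int(col)


# Example grid with descending latitudes, so the lower left corner is [-1, 0].
lats = np.array([42.0, 41.0, 40.0])
lons = np.array([-105.0, -104.0, -103.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
lat_lon = np.stack([lat_grid, lon_grid], axis=-1)  # (spatial_1, spatial_2, 2)

row, col = closest_lat_lon(lat_lon, target=(40.1, -104.9))
```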

classmethod get_full_domain(file_paths)

Get target and shape for largest domain possible

classmethod get_handle_features(file_paths)

Get all available features in input data

Parameters:

file_paths (list) – List of input file paths

Returns:

handle_features (list) – List of available input features

classmethod get_input_arrays(data, chunk_number, f, handle_features)

Get only arrays needed for computations

Parameters:
  • data (dict) – Dictionary of feature arrays

  • chunk_number – time chunk for which to get input arrays

  • f (str) – feature to compute using input arrays

  • handle_features (list) – Features available in raw data

Returns:

dict – Dictionary of arrays with only needed features

classmethod get_inputs_recursive(feature, handle_features)

Lookup inputs needed to compute feature. Walk through inputs methods for each required feature to get all raw features.

Parameters:
  • feature (str) – Feature for which to get needed inputs for derivation

  • handle_features (list) – Features available in raw data

Returns:

list – List of input features
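
The recursive walk can be pictured with a plain dictionary standing in for the feature registry. Everything in this sketch is hypothetical: the registry contents, the feature names, and the helper `inputs_recursive`. The real registry resolves inputs through methods on feature classes rather than through static lists.

```python
def inputs_recursive(feature, handle_features, registry):
    """Resolve a feature down to the raw features available in the source data."""
    if feature in handle_features or feature not in registry:
        # Either already a raw feature or nothing more is known about it.
        return [feature]
    raw = []
    for needed in registry[feature]:
        for name in inputs_recursive(needed, handle_features, registry):
            if name not in raw:  # keep order, drop duplicates
                raw.append(name)
    return raw


# Hypothetical registry: derived feature -> features it is computed from.
registry = {
    "windspeed_100m": ["U_100m", "V_100m"],
    "power_density_100m": ["windspeed_100m", "rho_100m"],
}
handle_features = ["U_100m", "V_100m", "rho_100m"]
raw_inputs = inputs_recursive("power_density_100m", handle_features, registry)
```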

classmethod get_lat_lon(file_paths, raster_index, invert_lat=False)

Get lat/lon grid for requested target and shape

Parameters:
  • file_paths (list) – path to data file

  • raster_index (ndarray | list) – Raster index array or list of slices

  • invert_lat (bool) – Flag to invert data along the latitude axis. Wrf data tends to use an increasing ordering for latitude while wtk uses a decreasing ordering.

Returns:

ndarray – (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension

get_lat_lon_df(target, features=None)

Get timeseries for given target

Parameters:
  • target (tuple) – (lat, lon) for target coordinate

  • features (list | None) – Optional list of features to include in returned data. If None then all available features are returned.

Returns:

df (pd.DataFrame) – Pandas dataframe with columns for each feature and timeindex for the given target

classmethod get_node_cmd(config)

Get a CLI call to initialize DataHandler and cache data.

Parameters:

config (dict) – sup3r data handler config with all necessary args and kwargs to initialize DataHandler and run data extraction.

get_raster_index()

Get raster index for file data. Here we assume the list of paths in file_paths all have data with the same spatial domain. We use the first file in the list to compute the raster.

Returns:

raster_index (np.ndarray) – 2D array of grid indices

classmethod get_raw_feature_list(features, handle_features)

Lookup inputs needed to compute feature

Parameters:
  • features (list) – Features for which to get needed inputs for derivation

  • handle_features (list) – Features available in raw data

Returns:

list – List of input features

classmethod get_time_index(file_paths, max_workers=None, **kwargs)

Get time index from data files

Parameters:
  • file_paths (list) – path to data file

  • max_workers (int | None) – placeholder to match signature

  • kwargs (dict) – placeholder to match signature

Returns:

time_index (pd.DatetimeIndex) – Time index from h5 source file(s)

property grid_mem

Get memory used by a feature at a single time step

Returns:

int – Number of bytes for a single feature array at a single time step

property grid_shape

Get shape of raster

Returns:

_grid_shape (tuple) – (rows, cols) grid size.

property handle_features

All features available in raw input

classmethod has_exact_feature(feature, handle)

Check if exact feature is in handle

Parameters:
  • feature (str) – Raw feature name e.g. U_100m

  • handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains exact feature or not

classmethod has_multilevel_feature(feature, handle)

Check if multilevel data for feature is in handle

Parameters:
  • feature (str) – Raw feature name e.g. U_100m

  • handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains multilevel data for given feature

classmethod has_surrounding_features(feature, handle)

Check if handle has feature values at surrounding heights. e.g. if feature=U_40m check if the handler has u at heights below and above 40m

Parameters:
  • feature (str) – Raw feature name e.g. U_100m

  • handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether feature has surrounding heights

property hr_exo_features

Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set. These can also be input to the model as low-res features.

property hr_out_features

Get a list of high-resolution features that are intended to be output by the GAN. Does not include high-resolution exogenous features

property input_file_info

Method to provide info about files in log output. Since NETCDF files have single time slices, printing out all the file paths is just a text dump without much info.

Returns:

str – message to append to log output that does not include a huge info dump of file paths

property invert_lat

Whether to invert the latitude axis during data extraction. This is to enforce a descending latitude ordering so that the lower left corner of the grid is at idx=(-1, 0) instead of idx=(0, 0)

property is_time_independent

Get whether source data files are time independent

property lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This ensures that the lower left hand corner of the domain is given by lat_lon[-1, 0]

Returns:

ndarray

property latitude

Flattened list of latitudes

classmethod lats_are_descending(lat_lon)

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner)

Parameters:

lat_lon (np.ndarray) – Lat/Lon array with shape (n_lats, n_lons, 2)

Returns:

bool
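
As an illustration, a check like this can compare the first and last rows of the latitude channel. The sketch assumes a regular grid where comparing the first column is sufficient; the function name and the example grid are hypothetical.

```python
import numpy as np


def lats_descending(lat_lon):
    """True if latitude decreases along the first axis of an (n_lats, n_lons, 2) array."""
    return bool(lat_lon[0, 0, 0] > lat_lon[-1, 0, 0])


lats = np.array([40.0, 41.0, 42.0])
lons = np.array([-105.0, -104.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
ascending = np.stack([lat_grid, lon_grid], axis=-1)

# Flipping the first axis puts the lower left corner at index [-1, 0].
descending = ascending[::-1]
```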

lin_bc(bc_files, threshold=0.1)

Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc

Parameters:
  • bc_files (list | tuple | str) – One or more filepaths to .h5 files output by MonthlyLinearCorrection or LinearCorrection. These should contain datasets named “{feature}_scalar” and “{feature}_adder” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time is length 1 for annual correction or 12 for monthly correction.

  • threshold (float) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
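
The dataset names suggest a correction of the form `data * scalar + adder`. The sketch below applies that form to a single feature with NumPy and is an assumption-based illustration only: the helper `apply_linear_bc` and its `months` argument are hypothetical, and the nearest-neighbor matching controlled by `threshold` is not shown.

```python
import numpy as np


def apply_linear_bc(data, scalar, adder, months=None):
    """Apply data * scalar + adder on a (lat, lon, time) array.

    scalar and adder have shape (lat, lon, 1) for annual factors or
    (lat, lon, 12) for monthly factors. months holds the 1-12 month number
    of each time step and is only needed in the monthly case.
    """
    if scalar.shape[-1] == 1:
        return data * scalar + adder
    idx = np.asarray(months) - 1  # month number -> position in the factor array
    return data * scalar[..., idx] + adder[..., idx]


data = np.ones((2, 2, 3))
annual = apply_linear_bc(data, np.full((2, 2, 1), 2.0), np.full((2, 2, 1), 0.5))

monthly_scalar = np.broadcast_to(np.arange(1.0, 13.0), (2, 2, 12))
monthly = apply_linear_bc(
    data, monthly_scalar, np.zeros((2, 2, 12)), months=[1, 2, 12]
)
```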

load_cached_data(with_split=True)

Load data from cache files and split into training and validation

Parameters:

with_split (bool) – Whether to split into training and validation data or not.

property load_workers

Get upper bound on load workers based on memory limits. Used to load cached data.

property longitude

Flattened list of longitudes

classmethod lookup(feature, attr_name, handle_features=None)

Lookup feature in feature registry

Parameters:
  • feature (str) – Feature to lookup in registry

  • attr_name (str) – Type of method to lookup. e.g. inputs or compute

  • handle_features (list) – List of feature names (datasets) available in the source file. If feature is found explicitly in this list, height/pressure suffixes will not be appended to the output.

Returns:

method | None – Feature registry method corresponding to feature

property lr_features

Get a list of low-resolution features. It is assumed that all features are used in the low-resolution observations. If you want to use high-res-only features, use the DualDataHandler class.

property lr_only_features

List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.

mask_nan()

Drop timesteps with NaN data
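
Dropping time steps that contain any NaN can be sketched with a boolean mask over the temporal axis. This example assumes the 4D layout (spatial_1, spatial_2, temporal, features) used elsewhere on this page; the helper name is hypothetical, and any matching update of the time index is left out.

```python
import numpy as np


def drop_nan_timesteps(data):
    """Remove every time step that has a NaN in any location or feature."""
    nan_steps = np.isnan(data).any(axis=(0, 1, 3))
    return data[:, :, ~nan_steps, :]


data = np.zeros((2, 2, 5, 1))
data[0, 1, 3, 0] = np.nan  # a single bad value at time step 3
clean = drop_nan_timesteps(data)
```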

property means

Get the mean values for each feature.

Returns:

dict

property meta

Meta dataframe with coordinates.

property n_tsteps

Get number of time steps to extract

property need_full_domain

Check whether we need to get the full lat/lon grid to determine target and shape values

property noncached_features

Get list of features needing extraction or derivation

property norm_workers

Get upper bound on workers used for normalization.

normalize(means=None, stds=None, features=None, max_workers=None)

Normalize all data features.

Parameters:
  • means (dict | None) – Dictionary of means for all features with keys: feature names and values: mean values. If this is None, the self.means attribute will be used. If this is not None, this DataHandler object means attribute will be updated.

  • stds (dict | None) – Dictionary of standard deviation values for all features with keys: feature names and values: standard deviations. If this is None, the self.stds attribute will be used. If this is not None, this DataHandler object stds attribute will be updated.

  • features (list | None) – List of features used for indexing data array during normalization.

  • max_workers (None | int) – Max workers to perform normalization. If None, self.norm_workers will be used.
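
Per-feature normalization with dictionaries of means and standard deviations can be sketched as follows. The sketch assumes standard `(x - mean) / std` scaling along the last axis and returns a new array; the helper `normalize_features` and the example values are hypothetical, and the worker handling is omitted.

```python
import numpy as np


def normalize_features(data, features, means, stds):
    """Scale each feature channel of a (..., features) array to (x - mean) / std."""
    out = data.astype(np.float32)  # astype returns a copy, so data is untouched
    for i, name in enumerate(features):
        out[..., i] = (out[..., i] - means[name]) / stds[name]
    return out


features = ["U_100m", "V_100m"]
data = np.array([[[[10.0, 100.0], [20.0, 200.0]]]])  # shape (1, 1, 2, 2)
means = {"U_100m": 15.0, "V_100m": 150.0}
stds = {"U_100m": 5.0, "V_100m": 50.0}
normed = normalize_features(data, features, means, stds)
```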

classmethod parallel_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features, max_workers=None)

Compute features using parallel subprocesses

Parameters:
  • data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

  • file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.

  • raster_index (ndarray) – raster index for spatial domain

  • time_chunks (list) – List of slices to chunk data feature extraction along time dimension

  • derived_features (list) – list of feature strings which need to be derived

  • all_features (list) – list of all features including those requiring derivation from input features

  • handle_features (list) – Features available in raw data

  • max_workers (int | None) – Number of max workers to use for computation. If equal to 1 then method is run in serial

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. Includes e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

classmethod parallel_extract(file_paths, raster_index, time_chunks, input_features, max_workers=None, **kwargs)

Extract features using parallel subprocesses

Parameters:
  • file_paths (list) – list of file paths

  • raster_index (ndarray | list) – raster index for spatial domain

  • time_chunks (list) – List of slices to chunk data feature extraction along time dimension

  • input_features (list) – list of input feature strings

  • max_workers (int | None) – Number of max workers to use for extraction. If equal to 1 then method is run in serial

  • kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

parallel_load(data, cache_files, features, max_workers=None)

Load feature data in parallel

Parameters:
  • data (ndarray) – Array to fill with cached data

  • cache_files (list) – List of cache files for each feature

  • features (list) – List of requested features

  • max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.

classmethod pop_old_data(data, chunk_number, all_features)

Remove input feature data if no longer needed for requested features

Parameters:
  • data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

  • chunk_number (int) – time chunk index to check

  • all_features (list) – list of all requested features including those requiring derivation from input features

preflight()

Run some preflight checks and verify that the inputs are valid

qdm_bc(bc_files, reference_feature, relative=True, threshold=0.1, no_trend=False)

Bias Correction using Quantile Delta Mapping

Bias correct this DataHandler’s data with Quantile Delta Mapping. The required statistical distributions should be pre-calculated using sup3r.bias.qdm.QuantileDeltaMappingCorrection.

Warning: There is no guarantee that the coefficients from bc_files match the resource processed here. Be careful choosing bc_files.

Parameters:
  • bc_files (list | tuple | str) – One or more filepaths to .h5 files output by bias_calc.QuantileDeltaMappingCorrection. These should contain datasets named “base_{reference_feature}_params”, “bias_{feature}_params”, and “bias_fut_{feature}_params” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time.

  • reference_feature (str) – Name of the feature used as (historical) reference. Dataset with name “base_{reference_feature}_params” will be retrieved from bc_files.

  • relative (bool, default=True) – Whether to apply QDM as a relative (True) or absolute (False) correction.

  • threshold (float, default=0.1) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.

  • no_trend (bool, default=False) – An option to ignore the trend component of the correction, resulting in an ordinary Quantile Mapping, i.e. it corrects the bias by comparing the distribution of the biased dataset with that of a reference dataset. See params_mf of rex.utilities.bc_utils.QuantileDeltaMapping. Note that this assumes that “bias_{feature}_params” (params_mh) is the data distribution representative for the target data.

property raster_index

Raster index property

property raw_features

Get list of features needed for computations

property raw_lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This returns the grid without any lat inversion.

Returns:

ndarray

property raw_time_index

Time index for input data without time pruning. This is the base time index for the raw input data.

property raw_tsteps

Get number of time steps for all input files

classmethod recursive_compute(data, feature, handle_features, file_paths, raster_index)

Compute intermediate features recursively

Parameters:
  • data (dict) – dictionary of feature arrays. e.g. data[feature] = array. (spatial_1, spatial_2, temporal)

  • feature (str) – Name of feature to compute

  • handle_features (list) – Features available in raw data

  • file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.

  • raster_index (ndarray) – raster index for spatial domain

Returns:

ndarray – Array of computed feature data

property requested_shape

Get requested shape for cached data

run_all_data_init()

Build base 4D data array. Can handle multiple files but assumes each file has the same spatial domain

Returns:

data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)

run_data_compute()

Run the data computation / derivation from raw features to desired features.

run_data_extraction()

Run the raw dataset extraction process from disk to raw un-manipulated datasets.

run_nn_fill()

Run nn nan fill on full data array.

classmethod serial_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features)

Compute features in series

Parameters:
  • data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

  • file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.

  • raster_index (ndarray) – raster index for spatial domain

  • time_chunks (list) – List of slices to chunk data feature extraction along time dimension

  • derived_features (list) – list of feature strings which need to be derived

  • all_features (list) – list of all features including those requiring derivation from input features

  • handle_features (list) – Features available in raw data

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

serial_data_fill(shifted_time_chunks)

Fill final data array in serial

Parameters:

shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array

classmethod serial_extract(file_paths, raster_index, time_chunks, input_features, **kwargs)

Extract features in series

Parameters:
  • file_paths (list) – list of file paths

  • raster_index (ndarray) – raster index for spatial domain

  • time_chunks (list) – List of slices to chunk data feature extraction along time dimension

  • input_features (list) – list of input feature strings

  • kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

property shape

Full data shape

Returns:

shape (tuple) – Full data shape (spatial_1, spatial_2, temporal, features)

property single_ts_files

Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

property size

Size of data array

Returns:

size (int) – Number of total elements contained in data array

classmethod source_handler(file_paths, **kwargs)

Rex data handler

Note that xarray appears to treat open file handlers as singletons within a threadpool, so it's okay to open this source_handler without a context handler or a .close() statement.

Parameters:
  • file_paths (str | list) – paths to data files

  • kwargs (dict) – keyword arguments passed to source handler

Returns:

data (ResourceX)

property source_type

Get data type for source files. Either nc or h5

property stds

Get the standard deviation values for each feature.

Returns:

dict

property target

Get lower left corner of raster

Returns:

_target (tuple) – (lat, lon) lower left corner of raster.

property temporal_slice

Get temporal range to extract from full dataset

property ti_workers

Get max number of workers for computing time index

property time_chunk_size

Get upper bound on time chunk size based on memory limits

property time_chunks

Get time chunks which will be extracted from source data

Returns:

_time_chunks (list) – List of time chunks used to split up source data time dimension so that each chunk can be extracted individually
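
To make the chunking concrete, the sketch below splits a time dimension into slices and builds the nested `data[chunk_number][feature]` layout described for the extract and compute methods. It is illustrative only: `make_time_chunks`, the chunk size, and the toy series are assumptions, and lists stand in for the real arrays.

```python
def make_time_chunks(n_tsteps, chunk_size):
    """Split range(n_tsteps) into consecutive slices of at most chunk_size steps."""
    return [
        slice(start, min(start + chunk_size, n_tsteps))
        for start in range(0, n_tsteps, chunk_size)
    ]


time_chunks = make_time_chunks(n_tsteps=10, chunk_size=4)

# Toy stand-in for one feature's time series; each chunk is extracted on its own.
series = list(range(10))
data = {i: {"U_100m": series[chunk]} for i, chunk in enumerate(time_chunks)}
```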

property time_freq_hours

Get the time frequency in hours as a float

property time_index

Time index for input data with time pruning. This is the raw time index with a cropped range and time step applied.

time_index_conflict_check()

Check if the number of input files and the length of the time index is the same

property time_index_file

Get time index file path

property try_load

Check if we should try to load cache

classmethod valid_handle_features(features, handle_features)

Check if features are in handle

Parameters:
  • features (str | list) – Raw feature names e.g. U_100m

  • handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle

classmethod valid_input_features(features, handle_features)

Check if features are in handle or have compute methods

Parameters:
  • features (str | list) – Raw feature names e.g. U_100m

  • handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle