sup3r.preprocessing.data_handling.h5_data_handling.DataHandlerH5WindCC

class DataHandlerH5WindCC(*args, **kwargs)

Bases: DataHandlerH5

Special data handling and batch sampling for h5 wtk or nsrdb data for climate change applications

Parameters:

*args (list) – Same positional args as DataHandlerH5
**kwargs (dict) – Same keyword args as DataHandlerH5

Methods

`cache_data`(cache_file_paths)	Cache feature data to file and delete from memory
`cap_worker_args`(max_workers)	Cap all workers args by max_workers
`check_cached_features`(features[, ...])	Check which features have been cached and check flags to determine whether to load or extract these features again
`check_clear_data`()	Check if data is cached and clear data if not load_cached
`clear_data`()	Free memory used for data arrays
`data_fill`(shifted_time_chunks[, max_workers])	Fill final data array with extracted / computed chunks
`extract_feature`(file_paths, raster_index, ...)	Extract single feature from data source
`get_cache_file_names`(cache_pattern[, ...])	Get names of cache files from cache_pattern and feature names
`get_capped_workers`(max_workers_cap, max_workers)	Get max number of workers for a given job.
`get_closest_lat_lon`(lat_lon, target)	Get closest indices to target lat lon
`get_full_domain`(file_paths)	Get target and shape for largest domain possible
`get_handle_features`(file_paths)	Get all available features in input data
`get_input_arrays`(data, chunk_number, f, ...)	Get only arrays needed for computations
`get_inputs_recursive`(feature, handle_features)	Lookup inputs needed to compute feature.
`get_lat_lon`(file_paths, raster_index[, ...])	Get lat/lon grid for requested target and shape
`get_lat_lon_df`(target[, features])	Get timeseries for given target
`get_next`()	Get data for observation using random observation index.
`get_node_cmd`(config)	Get a CLI call to initialize DataHandler and cache data.
`get_observation_index`()	Randomly gets spatial sample and time sample
`get_raster_index`()	Get raster index for file data.
`get_raw_feature_list`(features, handle_features)	Lookup inputs needed to compute feature
`get_time_index`(file_paths[, max_workers])	Get time index from data files
`has_exact_feature`(feature, handle)	Check if exact feature is in handle
`has_multilevel_feature`(feature, handle)	Check if multilevel data for the feature is in handle
`has_surrounding_features`(feature, handle)	Check if handle has feature values at surrounding heights.
`lats_are_descending`(lat_lon)	Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner).
`lin_bc`(bc_files[, threshold])	Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc
`load_cached_data`([with_split])	Load data from cache files and split into training and validation
`lookup`(feature, attr_name[, handle_features])	Lookup feature in feature registry
`mask_nan`()	Drop timesteps with NaN data
`normalize`([means, stds, features, max_workers])	Normalize all data features.
`parallel_compute`(data, file_paths, ...[, ...])	Compute features using parallel subprocesses
`parallel_extract`(file_paths, raster_index, ...)	Extract features using parallel subprocesses
`parallel_load`(data, cache_files, features[, ...])	Load feature data in parallel
`pop_old_data`(data, chunk_number, all_features)	Remove input feature data if no longer needed for requested features
`preflight`()	Run some preflight checks and verify that the inputs are valid
`qdm_bc`(bc_files, reference_feature[, ...])	Bias Correction using Quantile Delta Mapping
`recursive_compute`(data, feature, ...)	Compute intermediate features recursively
`run_all_data_init`()	Build base 4D data array.
`run_daily_averages`()	Calculate daily average data and store as attribute.
`run_data_compute`()	Run the data computation / derivation from raw features to desired features.
`run_data_extraction`()	Run the raw dataset extraction process from disk to raw un-manipulated datasets.
`run_nn_fill`()	Run nn nan fill on full data array.
`serial_compute`(data, file_paths, ...)	Compute features in series
`serial_data_fill`(shifted_time_chunks)	Fill final data array in serial
`serial_extract`(file_paths, raster_index, ...)	Extract features in series
`source_handler`(file_paths, **kwargs)	Rex data handler
`split_data`([data, val_split, shuffle_time])	Split time dimension into set of training indices and validation indices.
`time_index_conflict_check`()	Check if the number of input files and the length of the time index is the same
`valid_handle_features`(features, handle_features)	Check if features are in handle
`valid_input_features`(features, handle_features)	Check if features are in handle or have compute methods

Attributes

`FEATURE_REGISTRY`
`attrs`	Get attributes of input data
`cache_files`	Cache files for storing extracted data
`cache_pattern`	Get correct cache file pattern for formatting.
`cached_features`	List of features which have been requested but have been determined not to need extraction.
`compute_workers`	Get upper bound for compute workers based on memory limits.
`derive_features`	List of features which need to be derived from other features
`extract_features`	Features to extract directly from the source handler
`extract_workers`	Get upper bound for extract workers based on memory limits.
`feature_mem`	Number of bytes for a single feature array.
`file_paths`	Get file paths for input data
`full_raw_lat_lon`	Get the full lat/lon grid without doing any latitude inversion
`grid_mem`	Get memory used by a feature at a single time step
`grid_shape`	Get shape of raster
`handle_features`	All features available in raw input
`hr_exo_features`	Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection.
`hr_out_features`	Get a list of high-resolution features that are intended to be output by the GAN.
`input_file_info`	Method to provide info about files in log output.
`invert_lat`	Whether to invert the latitude axis during data extraction.
`is_time_independent`	Get whether source data files are time independent
`lat_lon`	Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.
`latitude`	Flattened list of latitudes
`load_workers`	Get upper bound on load workers based on memory limits.
`longitude`	Flattened list of longitudes
`lr_features`	Get a list of low-resolution features.
`lr_only_features`	List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.
`means`	Get the mean values for each feature.
`meta`	Meta dataframe with coordinates.
`n_tsteps`	Get number of time steps to extract
`need_full_domain`	Check whether we need to get the full lat/lon grid to determine target and shape values
`noncached_features`	Get list of features needing extraction or derivation
`norm_workers`	Get upper bound on workers used for normalization.
`raster_index`	Raster index property
`raw_features`	Get list of features needed for computations
`raw_lat_lon`	Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.
`raw_time_index`	Time index for input data without time pruning.
`raw_tsteps`	Get number of time steps for all input files
`requested_shape`	Get requested shape for cached data
`shape`	Full data shape
`single_ts_files`	Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice
`size`	Size of data array
`source_type`	Get data type for source files.
`stds`	Get the standard deviation values for each feature.
`target`	Get lower left corner of raster
`temporal_slice`	Get temporal range to extract from full dataset
`ti_workers`	Get max number of workers for computing time index
`time_chunk_size`	Get upper bound on time chunk size based on memory limits
`time_chunks`	Get time chunks which will be extracted from source data
`time_freq_hours`	Get the time frequency in hours as a float
`time_index`	Time index for input data with time pruning.
`time_index_file`	Get time index file path
`try_load`	Check if we should try to load cache

REX_HANDLER: alias of MultiFileWindX

run_daily_averages(): Calculate daily average data and store as attribute.
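
The daily averaging step can be pictured with plain NumPy. This is only a sketch, not the sup3r implementation: the helper name `daily_average` and the assumption of exactly 24 hourly steps per day are illustrative.

```python
import numpy as np

def daily_average(hourly, steps_per_day=24):
    """Average a (spatial_1, spatial_2, temporal, features) array over each day.

    Hypothetical helper; assumes the time axis starts at midnight.
    """
    s1, s2, n_t, n_f = hourly.shape
    n_days = n_t // steps_per_day
    # Drop any trailing partial day, then fold time into (day, hour_of_day)
    trimmed = hourly[:, :, :n_days * steps_per_day, :]
    folded = trimmed.reshape(s1, s2, n_days, steps_per_day, n_f)
    # Averaging over the hour-of-day axis leaves one value per day
    return folded.mean(axis=3)

# Two days of hourly data on a 2 x 2 grid with one feature
hourly = np.arange(2 * 2 * 48, dtype=float).reshape(2, 2, 48, 1)
daily = daily_average(hourly)
```

The result has shape (spatial_1, spatial_2, temporal_hourly // 24, features), which matches the daily array shape described for get_next below.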

get_observation_index()

Randomly gets spatial sample and time sample

Returns:

obs_ind_hourly (tuple) – Tuple of sampled spatial grid, time slice, and features indices. Used to get single observation like self.data[observation_index]. This is for hourly high-res data slicing.
obs_ind_daily (tuple) – Same as obs_ind_hourly but the temporal index (i=2) is a slice of the daily data (self.daily_data) with day integers.
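
A minimal sketch of how a paired hourly/daily observation index might be built. The function name, the `sample_shape` argument, and the choice to start every sample on a day boundary are assumptions made for illustration, not details confirmed by this page.

```python
import numpy as np

def sample_observation_index(shape, sample_shape, rng, steps_per_day=24):
    """Hypothetical sampler.

    shape: (spatial_1, spatial_2, n_hours, n_features) of the full data
    sample_shape: (spatial_1, spatial_2, n_hours) of one observation
    """
    n_days = sample_shape[2] // steps_per_day
    row = rng.integers(0, shape[0] - sample_shape[0] + 1)
    col = rng.integers(0, shape[1] - sample_shape[1] + 1)
    # Assumption: start on a day boundary so both slices cover the same days
    day = rng.integers(0, shape[2] // steps_per_day - n_days + 1)
    spatial = (slice(row, row + sample_shape[0]),
               slice(col, col + sample_shape[1]))
    features = np.arange(shape[3])
    t_hourly = slice(day * steps_per_day, (day + n_days) * steps_per_day)
    t_daily = slice(day, day + n_days)
    return (*spatial, t_hourly, features), (*spatial, t_daily, features)

rng = np.random.default_rng(0)
data = np.zeros((10, 10, 240, 2))
obs_ind_hourly, obs_ind_daily = sample_observation_index(
    data.shape, (4, 4, 48), rng)
obs_hourly = data[obs_ind_hourly]
```

Both tuples share the spatial slices and feature indices; only the entry at position 2 differs.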

get_next()

Get data for observation using random observation index. Loops repeatedly over randomized time index

Returns:

obs_hourly (np.ndarray) – 4D array (spatial_1, spatial_2, temporal_hourly, features)
obs_daily_avg (np.ndarray) – 4D array but the temporal axis is temporal_hourly//24 (spatial_1, spatial_2, temporal_daily, features)

split_data(data=None, val_split=0.0, shuffle_time=False)

Split time dimension into set of training indices and validation indices. For NSRDB it makes sure that the splits happen at midnight.

Parameters:

data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)
val_split (float) – Fraction of data to separate for validation.
shuffle_time (bool) – No effect. Used to fit base class function signature.

Returns:

data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Training data fraction of initial data array. Initial data array is overwritten by this new data array.
val_data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Validation data fraction of initial data array.
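
To illustrate what splitting at midnight means, here is a sketch under stated assumptions: the data is hourly, the time axis starts at midnight, and the validation days are taken from the end of the record. The helper name `split_at_midnight` is hypothetical and the real method may choose the validation period differently.

```python
import numpy as np

def split_at_midnight(data, val_split=0.0, steps_per_day=24):
    """Split (spatial_1, spatial_2, temporal, features) on whole days."""
    n_days = data.shape[2] // steps_per_day
    n_val_days = int(round(n_days * val_split))
    # The split index is a multiple of steps_per_day, so it falls at midnight
    split = (n_days - n_val_days) * steps_per_day
    train = data[:, :, :split]
    val = data[:, :, split:n_days * steps_per_day]
    return train, val

data = np.zeros((4, 4, 240, 2))  # ten days of hourly data
train, val = split_at_midnight(data, val_split=0.2)
```

Keeping both pieces aligned to whole days means daily averages can be computed for either piece without partial days.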

property attrs

Get attributes of input data

Returns: dict – Dictionary of attributes

cache_data(cache_file_paths)

Cache feature data to file and delete from memory

Parameters: cache_file_paths (str | None) – Path to file for saving feature data

property cache_files: Cache files for storing extracted data

property cache_pattern

Get correct cache file pattern for formatting.

Returns: _cache_pattern (str) – The cache file pattern with formatting keys included.

property cached_features: List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.

cap_worker_args(max_workers): Cap all workers args by max_workers

static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)

Check which features have been cached and check flags to determine whether to load or extract these features again

Parameters:

features (list) – list of features to extract
cache_files (list | None) – Path to files with saved feature data
overwrite_cache (bool) – Whether to overwrite cached files
load_cached (bool) – Whether to load data from cache files

Returns:

list – List of features to extract. Might not include features which have cache files.

check_clear_data(): Check if data is cached and clear data if not load_cached

clear_data(): Free memory used for data arrays

property compute_workers: Get upper bound for compute workers based on memory limits. Used to compute derived features from source dataset.

data_fill(shifted_time_chunks, max_workers=None)

Fill final data array with extracted / computed chunks

Parameters:

shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array
max_workers (int | None) – Max number of workers to use for building final data array. If None, the max available workers will be used. If 1, cached data will be loaded in serial.

property derive_features: List of features which need to be derived from other features

classmethod extract_feature(file_paths, raster_index, feature, time_slice=slice(None, None, None), **kwargs)

Extract single feature from data source

Parameters:

file_paths (list) – path to data file
raster_index (ndarray) – Raster index array
feature (str) – Feature to extract from data
time_slice (slice) – slice of time to extract
kwargs (dict) – keyword arguments passed to source handler

Returns:

ndarray – Data array for extracted feature (spatial_1, spatial_2, temporal)

property extract_features: Features to extract directly from the source handler

property extract_workers: Get upper bound for extract workers based on memory limits. Used to extract data from source dataset. The max number of extract workers is number of time chunks * number of features

property feature_mem

Number of bytes for a single feature array. Used to estimate max_workers.

Returns: int – Number of bytes for a single feature array

property file_paths: Get file paths for input data

property full_raw_lat_lon: Get the full lat/lon grid without doing any latitude inversion

get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)

Get names of cache files from cache_pattern and feature names

Parameters:

cache_pattern (str) – Pattern to use for cache file names
grid_shape (tuple) – Shape of grid to use for cache file naming
time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming
target (tuple) – Target to use for cache file naming
features (list) – List of features to use for cache file naming

Returns:

list – List of cache file names

static get_capped_workers(max_workers_cap, max_workers)

Get max number of workers for a given job. Capped to global max workers if specified

Parameters:

max_workers_cap (int | None) – Cap for job specific max_workers
max_workers (int | None) – Job specific max_workers

Returns:

max_workers (int | None) – job specific max_workers capped by max_workers_cap if provided
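
The capping rule described above can be sketched as follows. This is an illustration of the documented behaviour (None means "no limit given"), written as a standalone function with a hypothetical name rather than copied from the library.

```python
def capped_workers(max_workers_cap, max_workers):
    """Return max_workers limited by max_workers_cap when a cap is given."""
    if max_workers_cap is None:
        # No global cap: the job-specific value passes through unchanged
        return max_workers
    if max_workers is None:
        # No job-specific value: fall back to the cap
        return max_workers_cap
    return min(max_workers_cap, max_workers)

results = [
    capped_workers(None, 4),
    capped_workers(2, None),
    capped_workers(2, 8),
    capped_workers(None, None),
]
```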

static get_closest_lat_lon(lat_lon, target)

Get closest indices to target lat lon

Parameters:

lat_lon (ndarray) – Array of lat/lon (spatial_1, spatial_2, 2) Last dimension in order of (lat, lon)
target (tuple) – (lat, lon) for target coordinate

Returns:

row (int) – row index for closest lat/lon to target lat/lon
col (int) – col index for closest lat/lon to target lat/lon
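
A nearest-point lookup of this kind can be written with NumPy. The sketch below assumes plain Euclidean distance in degrees and uses a hypothetical function name; it is not taken from the sup3r source.

```python
import numpy as np

def closest_lat_lon(lat_lon, target):
    """Return (row, col) of the grid point nearest to target=(lat, lon)."""
    diff = lat_lon - np.asarray(target, dtype=float)
    # Euclidean distance in degrees at every grid point
    dist = np.hypot(diff[..., 0], diff[..., 1])
    row, col = np.unravel_index(np.argmin(dist), dist.shape)
    return int(row), int(col)

# Small grid with descending latitudes, last dimension ordered (lat, lon)
lats = np.array([42.0, 41.0, 40.0])
lons = np.array([-105.0, -104.0, -103.0, -102.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
lat_lon = np.dstack([lat_grid, lon_grid])

row, col = closest_lat_lon(lat_lon, (40.2, -103.9))
```

Here the nearest grid point is (40.0, -104.0), which sits in the last row and second column.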

classmethod get_full_domain(file_paths): Get target and shape for largest domain possible

classmethod get_handle_features(file_paths)

Get all available features in input data

Parameters: file_paths (list) – List of input file paths
Returns: handle_features (list) – List of available input features

classmethod get_input_arrays(data, chunk_number, f, handle_features)

Get only arrays needed for computations

Parameters:

data (dict) – Dictionary of feature arrays
chunk_number – time chunk for which to get input arrays
f (str) – feature to compute using input arrays
handle_features (list) – Features available in raw data

Returns:

dict – Dictionary of arrays with only needed features

classmethod get_inputs_recursive(feature, handle_features)

Lookup inputs needed to compute feature. Walk through inputs methods for each required feature to get all raw features.

Parameters:

feature (str) – Feature for which to get needed inputs for derivation
handle_features (list) – Features available in raw data

Returns:

list – List of input features
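
The recursive walk through input methods can be sketched with a small dictionary standing in for the feature registry. The registry contents, feature names, and function name here are hypothetical; the real registry maps features to classes with inputs and compute methods.

```python
# Hypothetical registry: derived feature -> features it is computed from
HYPOTHETICAL_INPUTS = {
    "windspeed_100m": ["U_100m", "V_100m"],
    "power_density_100m": ["windspeed_100m", "density_100m"],
}

def inputs_recursive(feature, handle_features, registry=HYPOTHETICAL_INPUTS):
    """Resolve a feature down to the raw features needed to derive it."""
    # A feature found in the source data (or with no recipe) is already raw
    if feature in handle_features or feature not in registry:
        return [feature]
    raw = []
    for name in registry[feature]:
        for item in inputs_recursive(name, handle_features, registry):
            if item not in raw:
                raw.append(item)
    return raw

handle_features = ["U_100m", "V_100m", "density_100m"]
raw = inputs_recursive("power_density_100m", handle_features)
```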

classmethod get_lat_lon(file_paths, raster_index, invert_lat=False)

Get lat/lon grid for requested target and shape

Parameters:

file_paths (list) – path to data file
raster_index (ndarray | list) – Raster index array or list of slices
invert_lat (bool) – Flag to invert data along the latitude axis. Wrf data tends to use an increasing ordering for latitude while wtk uses a decreasing ordering.

Returns:

ndarray – (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension

get_lat_lon_df(target, features=None)

Get timeseries for given target

Parameters:

target (tuple) – (lat, lon) for target coordinate
features (list | None) – Optional list of features to include in returned data. If None then all available features are returned.

Returns:

df (pd.DataFrame) – Pandas dataframe with columns for each feature and timeindex for the given target

classmethod get_node_cmd(config)

Get a CLI call to initialize DataHandler and cache data.

Parameters: config (dict) – sup3r data handler config with all necessary args and kwargs to initialize DataHandler and run data extraction.

get_raster_index()

Get raster index for file data. Here we assume the list of paths in file_paths all have data with the same spatial domain. We use the first file in the list to compute the raster.

Returns: raster_index (np.ndarray) – 2D array of grid indices

classmethod get_raw_feature_list(features, handle_features)

Lookup inputs needed to compute feature

Parameters:

features (list) – Features for which to get needed inputs for derivation
handle_features (list) – Features available in raw data

Returns:

list – List of input features

classmethod get_time_index(file_paths, max_workers=None, **kwargs)

Get time index from data files

Parameters:

file_paths (list) – path to data file
max_workers (int | None) – placeholder to match signature
kwargs (dict) – placeholder to match signature

Returns:

time_index (pd.DatetimeIndex) – Time index from h5 source file(s)

property grid_mem

Get memory used by a feature at a single time step

Returns: int – Number of bytes for a single feature array at a single time step

property grid_shape

Get shape of raster

Returns: _grid_shape (tuple) – (rows, cols) grid size.

property handle_features: All features available in raw input

classmethod has_exact_feature(feature, handle)

Check if exact feature is in handle

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains exact feature or not

classmethod has_multilevel_feature(feature, handle)

Check if multilevel data for the feature is in handle

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether handle contains multilevel data for given feature

classmethod has_surrounding_features(feature, handle)

Check if handle has feature values at surrounding heights. e.g. if feature=U_40m check if the handler has u at heights below and above 40m

Parameters:

feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object

Returns:

bool – Whether feature has surrounding heights
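
The surrounding-heights check can be illustrated with a list of dataset names in place of a real file handle. The helper name and the `<name>_<height>m` naming assumption are for illustration only; the actual method inspects the handle object.

```python
import re

HEIGHT_PATTERN = re.compile(r"(.+)_(\d+)m")

def has_surrounding_heights(feature, available):
    """Check that `available` has the same variable below and above."""
    match = HEIGHT_PATTERN.fullmatch(feature)
    if match is None:
        return False
    base, height = match.group(1), int(match.group(2))
    heights = []
    for name in available:
        other = HEIGHT_PATTERN.fullmatch(name)
        # Only compare against the same variable, e.g. U with U
        if other and other.group(1) == base:
            heights.append(int(other.group(2)))
    has_below = any(h < height for h in heights)
    has_above = any(h > height for h in heights)
    return has_below and has_above

available = ["U_10m", "U_100m", "V_40m"]
inside = has_surrounding_heights("U_40m", available)
outside = has_surrounding_heights("U_200m", available)
```

With values on both sides of 40 m, U_40m could be interpolated; U_200m has nothing above it, so the check fails.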

property hr_exo_features: Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set. These can also be input to the model as low-res features.

property hr_out_features: Get a list of high-resolution features that are intended to be output by the GAN. Does not include high-resolution exogenous features

property input_file_info

Method to provide info about files in log output. Since NETCDF files have single time slices, printing out all the file paths is just a text dump without much info.

Returns: str – message to append to log output that does not include a huge info dump of file paths

property invert_lat: Whether to invert the latitude axis during data extraction. This is to enforce a descending latitude ordering so that the lower left corner of the grid is at idx=(-1, 0) instead of idx=(0, 0)

property is_time_independent: Get whether source data files are time independent

property lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This ensures that the lower left hand corner of the domain is given by lat_lon[-1, 0]

Returns: ndarray

property latitude: Flattened list of latitudes

classmethod lats_are_descending(lat_lon)

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner)

Parameters: lat_lon (np.ndarray) – Lat/Lon array with shape (n_lats, n_lons, 2)
Returns: bool
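
As a sketch of the descending-latitude convention, the check and the corresponding inversion could look like the following. Function names are hypothetical and the comparison of the first and last rows is an assumption about how such a check might be done.

```python
import numpy as np

def lats_descending(lat_lon):
    """True if latitude decreases from the first row to the last row."""
    return bool(lat_lon[0, 0, 0] > lat_lon[-1, 0, 0])

def enforce_descending(lat_lon):
    """Flip the latitude axis if needed so the lower left is at [-1, 0]."""
    return lat_lon if lats_descending(lat_lon) else lat_lon[::-1]

# Grid stored with ascending latitudes
lats = np.array([40.0, 41.0, 42.0])
lons = np.array([-105.0, -104.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
ascending = np.dstack([lat_grid, lon_grid])

fixed = enforce_descending(ascending)
lower_left = tuple(fixed[-1, 0])
```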

lin_bc(bc_files, threshold=0.1)

Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc

Parameters:

bc_files (list | tuple | str) – One or more filepaths to .h5 files output by MonthlyLinearCorrection or LinearCorrection. These should contain datasets named “{feature}_scalar” and “{feature}_adder” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time is length 1 for annual correction or 12 for monthly correction.
threshold (float) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
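
The scalar/adder correction can be sketched in NumPy. This assumes the correction has the form data * scalar + adder and that monthly factors are selected by calendar month; both are assumptions for illustration, as are the function and argument names. File reading and the nearest-neighbor threshold check are omitted.

```python
import numpy as np

def linear_bias_correct(data, scalar, adder, months=None):
    """Apply a linear correction to one feature.

    data: (lat, lon, time)
    scalar, adder: (lat, lon, 1) for annual or (lat, lon, 12) for monthly
    months: calendar month (1-12) of each time step, needed for monthly
    """
    if scalar.shape[-1] == 1:
        # Annual factors broadcast along the time axis
        return data * scalar + adder
    idx = np.asarray(months) - 1
    # Pick the factor for the month of each time step
    return data * scalar[..., idx] + adder[..., idx]

data = np.ones((2, 2, 3))
scalar = np.broadcast_to(np.arange(1.0, 13.0), (2, 2, 12))
adder = np.zeros((2, 2, 12))
corrected = linear_bias_correct(data, scalar, adder, months=[1, 1, 2])
```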

load_cached_data(with_split=True)

Load data from cache files and split into training and validation

Parameters: with_split (bool) – Whether to split into training and validation data or not.

property load_workers: Get upper bound on load workers based on memory limits. Used to load cached data.

property longitude: Flattened list of longitudes

classmethod lookup(feature, attr_name, handle_features=None)

Lookup feature in feature registry

Parameters:

feature (str) – Feature to lookup in registry
attr_name (str) – Type of method to lookup. e.g. inputs or compute
handle_features (list) – List of feature names (datasets) available in the source file. If feature is found explicitly in this list, height/pressure suffixes will not be appended to the output.

Returns:

method | None – Feature registry method corresponding to feature

property lr_features: Get a list of low-resolution features. It is assumed that all features are used in the low-resolution observations. If you want to use high-res-only features, use the DualDataHandler class.

property lr_only_features: List of feature names or patterns that should only be included in the low-res training set and not the high-res observations.

mask_nan(): Drop timesteps with NaN data

property means

Get the mean values for each feature.

Returns: dict

property meta: Meta dataframe with coordinates.

property n_tsteps: Get number of time steps to extract

property need_full_domain: Check whether we need to get the full lat/lon grid to determine target and shape values

property noncached_features: Get list of features needing extraction or derivation

property norm_workers: Get upper bound on workers used for normalization.

normalize(means=None, stds=None, features=None, max_workers=None)

Normalize all data features.

Parameters:

means (dict | None) – Dictionary of means for all features with keys: feature names and values: mean values. If this is None, the self.means attribute will be used. If this is not None, this DataHandler object means attribute will be updated.
stds (dict | None) – Dictionary of standard deviation values for all features with keys: feature names and values: standard deviations. If this is None, the self.stds attribute will be used. If this is not None, this DataHandler object stds attribute will be updated.
features (list | None) – List of features used for indexing data array during normalization.
max_workers (None | int) – Max workers to perform normalization. If None, self.norm_workers will be used.
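
Per-feature normalization with means and stds dictionaries can be sketched as below. This is a serial illustration with a hypothetical function name; it returns a new array, whereas the method documented here operates on the handler's own data and can use multiple workers.

```python
import numpy as np

def normalize_features(data, features, means, stds):
    """Standardize each feature of a (..., features) array."""
    out = data.astype(float, copy=True)
    for i, name in enumerate(features):
        # The last axis indexes features in the same order as `features`
        out[..., i] = (out[..., i] - means[name]) / stds[name]
    return out

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, size=(3, 3, 10, 2))
features = ["U_100m", "V_100m"]
means = {f: float(data[..., i].mean()) for i, f in enumerate(features)}
stds = {f: float(data[..., i].std()) for i, f in enumerate(features)}

normed = normalize_features(data, features, means, stds)
```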

classmethod parallel_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features, max_workers=None)

Compute features using parallel subprocesses

Parameters:

data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
derived_features (list) – list of feature strings which need to be derived
all_features (list) – list of all features including those requiring derivation from input features
handle_features (list) – Features available in raw data
max_workers (int | None) – Number of max workers to use for computation. If equal to 1 then method is run in serial

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. Includes e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

classmethod parallel_extract(file_paths, raster_index, time_chunks, input_features, max_workers=None, **kwargs)

Extract features using parallel subprocesses

Parameters:

file_paths (list) – list of file paths
raster_index (ndarray | list) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
input_features (list) – list of input feature strings
max_workers (int | None) – Number of max workers to use for extraction. If equal to 1 then method is run in serial
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

parallel_load(data, cache_files, features, max_workers=None)

Load feature data in parallel

Parameters:

data (ndarray) – Array to fill with cached data
cache_files (list) – List of cache files for each feature
features (list) – List of requested features
max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.

classmethod pop_old_data(data, chunk_number, all_features)

Remove input feature data if no longer needed for requested features

Parameters:

data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
chunk_number (int) – time chunk index to check
all_features (list) – list of all requested features including those requiring derivation from input features

preflight(): Run some preflight checks and verify that the inputs are valid

qdm_bc(bc_files, reference_feature, relative=True, threshold=0.1, no_trend=False)

Bias Correction using Quantile Delta Mapping

Bias correct this DataHandler’s data with Quantile Delta Mapping. The required statistical distributions should be pre-calculated using sup3r.bias.qdm.QuantileDeltaMappingCorrection.

Warning: There is no guarantee that the coefficients from bc_files match the resource processed here. Be careful choosing bc_files.

Parameters:

bc_files (list | tuple | str) – One or more filepaths to .h5 files output by bias_calc.QuantileDeltaMappingCorrection. These should contain datasets named “base_{reference_feature}_params”, “bias_{feature}_params”, and “bias_fut_{feature}_params” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time.
reference_feature (str) – Name of the feature used as (historical) reference. Dataset with name “base_{reference_feature}_params” will be retrieved from bc_files.
relative (bool, default=True) – Switcher to apply QDM as a relative (use True) or absolute (use False) correction value.
threshold (float, default=0.1) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
no_trend (bool, default=False) – An option to ignore the trend component of the correction, thus resulting in an ordinary Quantile Mapping, i.e. corrects the bias by comparing the distributions of the biased dataset with a reference dataset. See params_mf of rex.utilities.bc_utils.QuantileDeltaMapping. Note that this assumes that “bias_{feature}_params” (params_mh) is the data distribution representative for the target data.

property raster_index: Raster index property

property raw_features: Get list of features needed for computations

property raw_lat_lon

Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This returns the grid without any lat inversion.

Returns: ndarray

property raw_time_index: Time index for input data without time pruning. This is the base time index for the raw input data.

property raw_tsteps: Get number of time steps for all input files

classmethod recursive_compute(data, feature, handle_features, file_paths, raster_index)

Compute intermediate features recursively

Parameters:

data (dict) – dictionary of feature arrays. e.g. data[feature] = array. (spatial_1, spatial_2, temporal)
feature (str) – Name of feature to compute
handle_features (list) – Features available in raw data
file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain

Returns:

ndarray – Array of computed feature data

property requested_shape: Get requested shape for cached data

run_all_data_init()

Build base 4D data array. Can handle multiple files but assumes each file has the same spatial domain

Returns: data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)

run_data_compute(): Run the data computation / derivation from raw features to desired features.

run_data_extraction(): Run the raw dataset extraction process from disk to raw un-manipulated datasets.

run_nn_fill(): Run nn nan fill on full data array.

classmethod serial_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features)

Compute features in series

Parameters:

data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
derived_features (list) – list of feature strings which need to be derived
all_features (list) – list of all features including those requiring derivation from input features
handle_features (list) – Features available in raw data

Returns:

data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

serial_data_fill(shifted_time_chunks)

Fill final data array in serial

Parameters: shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array

classmethod serial_extract(file_paths, raster_index, time_chunks, input_features, **kwargs)

Extract features in series

Parameters:

file_paths (list) – list of file paths
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
input_features (list) – list of input feature strings
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)

Returns:

dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)

property shape

Full data shape

Returns: shape (tuple) – Full data shape (spatial_1, spatial_2, temporal, features)

property single_ts_files: Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

property size

Size of data array

Returns: size (int) – Number of total elements contained in data array

classmethod source_handler(file_paths, **kwargs)

Rex data handler

Note that xarray appears to treat open file handlers as singletons within a threadpool, so it's okay to open this source_handler without a context handler or a .close() statement.

Parameters:

file_paths (str | list) – paths to data files
kwargs (dict) – keyword arguments passed to source handler

Returns:

data (ResourceX)

property source_type: Get data type for source files. Either nc or h5

property stds

Get the standard deviation values for each feature.

Returns: dict

property target

Get lower left corner of raster

Returns: _target (tuple) – (lat, lon) lower left corner of raster.

property temporal_slice: Get temporal range to extract from full dataset

property ti_workers: Get max number of workers for computing time index

property time_chunk_size: Get upper bound on time chunk size based on memory limits

property time_chunks

Get time chunks which will be extracted from source data

Returns: _time_chunks (list) – List of time chunks used to split up source data time dimension so that each chunk can be extracted individually
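
One simple way to build such a list of time chunks is shown below. The function name and the fixed `chunk_size` argument are illustrative; in the class itself the chunk size is bounded by memory limits (see time_chunk_size).

```python
def make_time_chunks(n_tsteps, chunk_size):
    """Split range(n_tsteps) into consecutive slices of at most chunk_size."""
    return [
        slice(start, min(start + chunk_size, n_tsteps))
        for start in range(0, n_tsteps, chunk_size)
    ]

chunks = make_time_chunks(10, 4)
# Each slice can be used to extract one piece of the time dimension
lengths = [c.stop - c.start for c in chunks]
```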

property time_freq_hours: Get the time frequency in hours as a float

property time_index: Time index for input data with time pruning. This is the raw time index with a cropped range and time step applied.

time_index_conflict_check(): Check if the number of input files and the length of the time index is the same

property time_index_file: Get time index file path

property try_load: Check if we should try to load cache

classmethod valid_handle_features(features, handle_features)

Check if features are in handle

Parameters:

features (str | list) – Raw feature names e.g. U_100m
handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle

classmethod valid_input_features(features, handle_features)

Check if features are in handle or have compute methods

Parameters:

features (str | list) – Raw feature names e.g. U_100m
handle_features (list) – Features available in raw data

Returns:

bool – Whether feature basename is in handle