sup3r.preprocessing.data_handling.base.DataHandlerDC
- class DataHandlerDC(file_paths, features, target=None, shape=None, max_delta=20, temporal_slice=slice(None, None, 1), hr_spatial_coarsen=None, time_roll=0, val_split=0.0, sample_shape=(10, 10, 1), raster_file=None, raster_index=None, shuffle_time=False, time_chunk_size=None, cache_pattern=None, overwrite_cache=False, overwrite_ti_cache=False, load_cached=False, lr_only_features=(), hr_exo_features=(), handle_features=None, single_ts_files=None, mask_nan=False, fill_nan=False, worker_kwargs=None, res_kwargs=None)
Bases: DataHandler
Data-centric data handler
- Parameters:
file_paths (str | list) – A single source h5 wind file to extract raster data from or a list of netcdf files with identical grid. The string can be a unix-style file path which will be passed through glob.glob
features (list) – list of features to extract from the provided data
target (tuple) – (lat, lon) lower left corner of raster. Either need target+shape or raster_file.
shape (tuple) – (rows, cols) grid size. Either need target+shape or raster_file.
max_delta (int, optional) – Optional maximum limit on the raster shape that is retrieved at once. If shape is (20, 20) and max_delta=10, the full raster will be retrieved in four chunks of (10, 10). This helps adapt to non-regular grids that curve over large distances, by default 20
temporal_slice (slice) – Slice specifying extent and step of temporal extraction. e.g. slice(start, stop, time_pruning). If equal to slice(None, None, 1) the full time dimension is selected.
hr_spatial_coarsen (int | None) – Optional input to coarsen the high-resolution spatial field. This can be used if (for example) you have 2km source data, but you want the final high res prediction target to be 4km resolution, then hr_spatial_coarsen would be 2 so that the GAN is trained on aggregated 4km high-res data.
time_roll (int) – The number of places by which elements are shifted in the time axis. Can be used to convert data to different timezones. This is passed to np.roll(a, time_roll, axis=2) and happens AFTER the temporal_slice operation.
val_split (float32) – Fraction of data to store for validation
sample_shape (tuple) – Size of spatial and temporal domain used in a single high-res observation for batching
raster_file (str | None) – .txt file for raster_index array for the corresponding target and shape. If specified the raster_index will be loaded from the file if it exists or written to the file if it does not yet exist. If None and raster_index is not provided raster_index will be calculated directly. Either need target+shape, raster_file, or raster_index input.
raster_index (list) – List of tuples or slices. Used as an alternative to computing the raster index from target+shape or loading the raster index from file
shuffle_time (bool) – Whether to shuffle time indices before validation split
time_chunk_size (int) – Size of chunks to split time dimension into for parallel data extraction. If running in serial this can be set to the size of the full time index for best performance.
cache_pattern (str | None) – Pattern for files for saving feature data. e.g. file_path_{feature}.pkl. Each feature will be saved to a file with the feature name replaced in cache_pattern. If not None feature arrays will be saved here and not stored in self.data until load_cached_data is called. The cache_pattern can also include {shape}, {target}, {times} which will help ensure unique cache files for complex problems.
overwrite_cache (bool) – Whether to overwrite any previously saved cache files.
overwrite_ti_cache (bool) – Whether to overwrite any previously saved time index cache files.
load_cached (bool) – Whether to load data from cache files
lr_only_features (list | tuple) – List of feature names or patt*erns that should only be included in the low-res training set and not the high-res observations.
hr_exo_features (list | tuple) – List of feature names or patt*erns that should be included in the high-resolution observation but not expected to be output from the generative model. An example is high-res topography that is to be injected mid-network.
handle_features (list | None) – Optional list of features which are available in the provided data. Providing this eliminates the need for an initial search of available features prior to data extraction.
single_ts_files (bool | None) – Whether input files are single time steps or not. If they are this enables some reduced computation. If None then this will be determined from file_paths directly.
mask_nan (bool) – Flag to mask out (remove) any timesteps with NaN data from the source dataset. This is False by default because it can create discontinuities in the timeseries.
fill_nan (bool) – Flag to gap-fill any NaN data from the source dataset using a nearest neighbor algorithm. This is False by default because it can hide bad datasets that should be identified by the user.
worker_kwargs (dict | None) – Dictionary of worker values. Can include max_workers, extract_workers, compute_workers, load_workers, norm_workers, and ti_workers. Each argument needs to be an integer or None.
The value of max workers will set the value of all other worker args. If max_workers == 1 then all processes will be serialized. If max_workers == None then other worker args will use their own provided values.
extract_workers is the max number of workers to use for extracting features from source data. If None it will be estimated based on memory limits. If 1 processes will be serialized. compute_workers is the max number of workers to use for computing derived features from raw features in source data. load_workers is the max number of workers to use for loading cached feature data. norm_workers is the max number of workers to use for normalizing feature data. ti_workers is the max number of workers to use to get full time index. Useful when there are many input files each with a single time step. If this is greater than one, time indices for input files will be extracted in parallel and then concatenated to get the full time index. If input files do not all have time indices or if there are few input files this should be set to one.
res_kwargs (dict | None) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'concat_dim': 'Time', 'combine': 'nested', 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **res_kwargs)
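The max_delta chunking described in the parameter list above can be illustrated with a short sketch. This is not the sup3r implementation; the helper name chunk_slices is hypothetical and only shows how a (rows, cols) shape could be split into pieces no larger than max_delta per side.

```python
def chunk_slices(shape, max_delta):
    """Split a (rows, cols) raster shape into (row_slice, col_slice) pairs.

    Hypothetical helper: each chunk spans at most max_delta rows and columns.
    """
    rows, cols = shape
    chunks = []
    for r0 in range(0, rows, max_delta):
        for c0 in range(0, cols, max_delta):
            chunks.append((slice(r0, min(r0 + max_delta, rows)),
                           slice(c0, min(c0 + max_delta, cols))))
    return chunks


# The example from the max_delta description: shape (20, 20), max_delta=10
chunks = chunk_slices((20, 20), max_delta=10)
print(len(chunks))  # 4 chunks of (10, 10)
```

With the default max_delta=20, the same (20, 20) shape would be retrieved in a single chunk.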
Methods
cache_data
(cache_file_paths)Cache feature data to file and delete from memory
cap_worker_args
(max_workers)Cap all workers args by max_workers
check_cached_features
(features[, ...])Check which features have been cached and check flags to determine whether to load or extract these features again
check_clear_data
()Check if data is cached and clear data if not load_cached
clear_data
()Free memory used for data arrays
data_fill
(shifted_time_chunks[, max_workers])Fill final data array with extracted / computed chunks
extract_feature
(file_paths, raster_index, ...)Extract single feature from data source
get_cache_file_names
(cache_pattern[, ...])Get names of cache files from cache_pattern and feature names
get_capped_workers
(max_workers_cap, max_workers)Get max number of workers for a given job.
get_closest_lat_lon
(lat_lon, target)Get closest indices to target lat lon
get_full_domain
(file_paths)Get target and shape for full domain
get_handle_features
(file_paths)Get all available features in input data
get_input_arrays
(data, chunk_number, f, ...)Get only arrays needed for computations
get_inputs_recursive
(feature, handle_features)Lookup inputs needed to compute feature.
get_lat_lon
(file_paths, raster_index[, ...])Get lat/lon grid for requested target and shape
get_lat_lon_df
(target[, features])Get timeseries for given target
get_next
([temporal_weights, spatial_weights])Get data for observation using weighted random observation index.
get_node_cmd
(config)Get a CLI call to initialize DataHandler and cache data.
get_observation_index
([temporal_weights, ...])Randomly gets weighted spatial sample and time sample
get_raster_index
()Get raster index for file data.
get_raw_feature_list
(features, handle_features)Lookup inputs needed to compute feature
get_time_index
(file_paths[, max_workers])Get raw time index for source data
has_exact_feature
(feature, handle)Check if exact feature is in handle
has_multilevel_feature
(feature, handle)Check if handle contains multilevel data for the given feature
has_surrounding_features
(feature, handle)Check if handle has feature values at surrounding heights.
lats_are_descending
(lat_lon)Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner).
lin_bc
(bc_files[, threshold])Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc
load_cached_data
([with_split])Load data from cache files and split into training and validation
lookup
(feature, attr_name[, handle_features])Lookup feature in feature registry
mask_nan
()Drop timesteps with NaN data
normalize
([means, stds, features, max_workers])Normalize all data features.
parallel_compute
(data, file_paths, ...[, ...])Compute features using parallel subprocesses
parallel_extract
(file_paths, raster_index, ...)Extract features using parallel subprocesses
parallel_load
(data, cache_files, features[, ...])Load feature data in parallel
pop_old_data
(data, chunk_number, all_features)Remove input feature data if no longer needed for requested features
preflight
()Run some preflight checks and verify that the inputs are valid
qdm_bc
(bc_files, reference_feature[, ...])Bias Correction using Quantile Delta Mapping
recursive_compute
(data, feature, ...)Compute intermediate features recursively
run_all_data_init
()Build base 4D data array.
run_data_compute
()Run the data computation / derivation from raw features to desired features.
run_data_extraction
()Run the raw dataset extraction process from disk to raw un-manipulated datasets.
run_nn_fill
()Run nn nan fill on full data array.
serial_compute
(data, file_paths, ...)Compute features in series
serial_data_fill
(shifted_time_chunks)Fill final data array in serial
serial_extract
(file_paths, raster_index, ...)Extract features in series
source_handler
(file_paths, **kwargs)Handle for source data.
split_data
([data, val_split, shuffle_time])Split time dimension into set of training indices and validation indices
Check if the number of input files and the length of the time index is the same
valid_handle_features
(features, handle_features)Check if features are in handle
valid_input_features
(features, handle_features)Check if features are in handle or have compute methods
Attributes
FEATURE_REGISTRY
attrs
Get attributes of input data
cache_files
Cache files for storing extracted data
cache_pattern
Get correct cache file pattern for formatting.
cached_features
List of features which have been requested but have been determined not to need extraction.
compute_workers
Get upper bound for compute workers based on memory limits.
derive_features
List of features which need to be derived from other features
extract_features
Features to extract directly from the source handler
extract_workers
Get upper bound for extract workers based on memory limits.
feature_mem
Number of bytes for a single feature array.
file_paths
Get file paths for input data
full_raw_lat_lon
Get the full lat/lon grid without doing any latitude inversion
grid_mem
Get memory used by a feature at a single time step
grid_shape
Get shape of raster
handle_features
All features available in raw input
hr_exo_features
Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection.
hr_out_features
Get a list of high-resolution features that are intended to be output by the GAN.
input_file_info
Method to provide info about files in log output.
invert_lat
Whether to invert the latitude axis during data extraction.
is_time_independent
Get whether source data files are time independent
lat_lon
Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.
latitude
Flattened list of latitudes
load_workers
Get upper bound on load workers based on memory limits.
longitude
Flattened list of longitudes
lr_features
Get a list of low-resolution features.
lr_only_features
List of feature names or patt*erns that should only be included in the low-res training set and not the high-res observations.
means
Get the mean values for each feature.
meta
Meta dataframe with coordinates.
n_tsteps
Get number of time steps to extract
need_full_domain
Check whether we need to get the full lat/lon grid to determine target and shape values
noncached_features
Get list of features needing extraction or derivation
norm_workers
Get upper bound on workers used for normalization.
raster_index
Raster index property
raw_features
Get list of features needed for computations
raw_lat_lon
Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension.
raw_time_index
Time index for input data without time pruning.
raw_tsteps
Get number of time steps for all input files
requested_shape
Get requested shape for cached data
Full data shape
Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice
Size of data array
Get data type for source files.
Get the standard deviation values for each feature.
Get lower left corner of raster
Get temporal range to extract from full dataset
Get max number of workers for computing time index
Get upper bound on time chunk size based on memory limits
Get time chunks which will be extracted from source data
Get the time frequency in hours as a float
Time index for input data with time pruning.
Get time index file path
Check if we should try to load cache
- property attrs
Get attributes of input data
- Returns:
dict – Dictionary of attributes
- cache_data(cache_file_paths)
Cache feature data to file and delete from memory
- Parameters:
cache_file_paths (str | None) – Path to file for saving feature data
- property cache_files
Cache files for storing extracted data
- property cache_pattern
Get correct cache file pattern for formatting.
- Returns:
_cache_pattern (str) – The cache file pattern with formatting keys included.
- property cached_features
List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.
- cap_worker_args(max_workers)
Cap all workers args by max_workers
- static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)
Check which features have been cached and check flags to determine whether to load or extract these features again
- Parameters:
features (list) – list of features to extract
cache_files (list | None) – Path to files with saved feature data
overwrite_cache (bool) – Whether to overwrite cached files
load_cached (bool) – Whether to load data from cache files
- Returns:
list – List of features to extract. Might not include features which have cache files.
- check_clear_data()
Check if data is cached and clear data if not load_cached
- clear_data()
Free memory used for data arrays
- property compute_workers
Get upper bound for compute workers based on memory limits. Used to compute derived features from source dataset.
- data_fill(shifted_time_chunks, max_workers=None)
Fill final data array with extracted / computed chunks
- Parameters:
shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array
max_workers (int | None) – Max number of workers to use for building final data array. If None max available workers will be used. If 1 cached data will be loaded in serial
- property derive_features
List of features which need to be derived from other features
- abstract classmethod extract_feature(file_paths, raster_index, feature, time_slice=slice(None, None, None), **kwargs)
Extract single feature from data source
- Parameters:
file_paths (list) – path to data file
raster_index (ndarray) – Raster index array
time_slice (slice) – slice of time to extract
feature (str) – Feature to extract from data
kwargs (dict) – Keyword arguments passed to source handler
- Returns:
ndarray – Data array for extracted feature (spatial_1, spatial_2, temporal)
- property extract_features
Features to extract directly from the source handler
- property extract_workers
Get upper bound for extract workers based on memory limits. Used to extract data from source dataset. The max number of extract workers is number of time chunks * number of features
- property feature_mem
Number of bytes for a single feature array. Used to estimate max_workers.
- Returns:
int – Number of bytes for a single feature array
- property file_paths
Get file paths for input data
- property full_raw_lat_lon
Get the full lat/lon grid without doing any latitude inversion
- get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)
Get names of cache files from cache_pattern and feature names
- Parameters:
cache_pattern (str) – Pattern to use for cache file names
grid_shape (tuple) – Shape of grid to use for cache file naming
time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming
target (tuple) – Target to use for cache file naming
features (list) – List of features to use for cache file naming
- Returns:
list – List of cache file names
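As a rough illustration of how a cache_pattern with {feature}, {shape} and {target} keys could expand into one file name per feature, consider the sketch below. The helper name cache_file_names and the way shape and target are rendered into strings are assumptions made for this example; the actual formatting used by get_cache_file_names may differ.

```python
def cache_file_names(cache_pattern, features, grid_shape=None, target=None):
    """Fill a cache pattern once per feature (hypothetical helper).

    The string layouts chosen for {shape} and {target} are illustrative only.
    """
    names = []
    for feature in features:
        fields = {"feature": feature}
        if grid_shape is not None:
            fields["shape"] = f"{grid_shape[0]}x{grid_shape[1]}"
        if target is not None:
            fields["target"] = f"{target[0]}_{target[1]}"
        # str.format ignores unused keys but raises KeyError for missing ones
        names.append(cache_pattern.format(**fields))
    return names


names = cache_file_names(
    "cache/{feature}_{shape}_{target}.pkl",
    features=["U_100m", "V_100m"],
    grid_shape=(20, 20),
    target=(39.0, -105.0),
)
print(names[0])  # cache/U_100m_20x20_39.0_-105.0.pkl
```

Including the shape and target in the pattern keeps cache files from different domains from overwriting each other.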
- static get_capped_workers(max_workers_cap, max_workers)
Get max number of workers for a given job. Capped to global max workers if specified
- Parameters:
max_workers_cap (int | None) – Cap for job specific max_workers
max_workers (int | None) – Job specific max_workers
- Returns:
max_workers (int | None) – job specific max_workers capped by max_workers_cap if provided
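The capping rule can be sketched in a few lines. The function name capped_workers is hypothetical, and the behavior when max_workers is None but a cap is given (returning the cap) is an assumption rather than something this page states.

```python
def capped_workers(max_workers_cap, max_workers):
    """Return the job-specific worker count, capped if a cap is provided.

    Sketch only: the handling of max_workers=None with a cap is an assumption.
    """
    if max_workers_cap is None:
        return max_workers
    if max_workers is None:
        return max_workers_cap
    return min(max_workers_cap, max_workers)


print(capped_workers(None, 4))  # no cap, so the job value is kept
print(capped_workers(2, 8))     # capped to 2
```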
- static get_closest_lat_lon(lat_lon, target)
Get closest indices to target lat lon
- Parameters:
lat_lon (ndarray) – Array of lat/lon (spatial_1, spatial_2, 2) Last dimension in order of (lat, lon)
target (tuple) – (lat, lon) for target coordinate
- Returns:
row (int) – row index for closest lat/lon to target lat/lon
col (int) – col index for closest lat/lon to target lat/lon
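A minimal NumPy sketch of a nearest-coordinate lookup on a (spatial_1, spatial_2, 2) lat/lon array follows. It assumes plain Euclidean distance in degree space, which may not match the distance measure used by get_closest_lat_lon, and the helper name closest_lat_lon is hypothetical.

```python
import numpy as np


def closest_lat_lon(lat_lon, target):
    """Return (row, col) of the grid point nearest to target=(lat, lon).

    Assumes Euclidean distance in degrees (an illustrative choice).
    """
    dist = np.hypot(lat_lon[..., 0] - target[0], lat_lon[..., 1] - target[1])
    row, col = np.unravel_index(np.argmin(dist), dist.shape)
    return int(row), int(col)


# Small 3x3 grid with descending latitudes, so the lower left corner
# of the domain sits at lat_lon[-1, 0]
lats = np.linspace(42.0, 40.0, 3)
lons = np.linspace(-106.0, -104.0, 3)
lon_grid, lat_grid = np.meshgrid(lons, lats)
lat_lon = np.dstack([lat_grid, lon_grid])  # last dimension is (lat, lon)

row, col = closest_lat_lon(lat_lon, target=(40.1, -105.9))
print(row, col)  # 2 0
```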
- abstract classmethod get_full_domain(file_paths)
Get target and shape for full domain
- classmethod get_handle_features(file_paths)
Get all available features in input data
- Parameters:
file_paths (list) – List of input file paths
- Returns:
handle_features (list) – List of available input features
- classmethod get_input_arrays(data, chunk_number, f, handle_features)
Get only arrays needed for computations
- Parameters:
data (dict) – Dictionary of feature arrays
chunk_number (int) – time chunk for which to get input arrays
f (str) – feature to compute using input arrays
handle_features (list) – Features available in raw data
- Returns:
dict – Dictionary of arrays with only needed features
- classmethod get_inputs_recursive(feature, handle_features)
Lookup inputs needed to compute feature. Walk through inputs methods for each required feature to get all raw features.
- Parameters:
feature (str) – Feature for which to get needed inputs for derivation
handle_features (list) – Features available in raw data
- Returns:
list – List of input features
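The recursive walk from a derived feature down to raw features can be sketched with a toy registry. TOY_INPUTS and its feature names (apart from U_100m, which appears elsewhere on this page) are hypothetical stand-ins for the real feature registry and its inputs methods.

```python
# Hypothetical registry: derived feature -> features it is computed from
TOY_INPUTS = {
    "windspeed_100m": ["U_100m", "V_100m"],
    "power_density_100m": ["windspeed_100m", "rho_100m"],
}


def inputs_recursive(feature, handle_features):
    """Collect the raw features needed to derive `feature` (sketch only)."""
    # A feature found in the source data, or with no registered inputs,
    # is treated as raw
    if feature in handle_features or feature not in TOY_INPUTS:
        return [feature]
    raw = []
    for name in TOY_INPUTS[feature]:
        for item in inputs_recursive(name, handle_features):
            if item not in raw:
                raw.append(item)
    return raw


raw = inputs_recursive("power_density_100m", ["U_100m", "V_100m", "rho_100m"])
print(raw)  # ['U_100m', 'V_100m', 'rho_100m']
```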
- classmethod get_lat_lon(file_paths, raster_index, invert_lat=False)
Get lat/lon grid for requested target and shape
- Parameters:
file_paths (list) – path to data file
raster_index (ndarray | list) – Raster index array or list of slices
invert_lat (bool) – Flag to invert data along the latitude axis. Wrf data tends to use an increasing ordering for latitude while wtk uses a decreasing ordering.
- Returns:
ndarray – (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension
- get_lat_lon_df(target, features=None)
Get timeseries for given target
- Parameters:
target (tuple) – (lat, lon) for target coordinate
features (list | None) – Optional list of features to include in returned data. If None then all available features are returned.
- Returns:
df (pd.DataFrame) – Pandas dataframe with columns for each feature and timeindex for the given target
- classmethod get_node_cmd(config)
Get a CLI call to initialize DataHandler and cache data.
- Parameters:
config (dict) – sup3r data handler config with all necessary args and kwargs to initialize DataHandler and run data extraction.
- abstract get_raster_index()
Get raster index for file data. Here we assume the list of paths in file_paths all have data with the same spatial domain. We use the first file in the list to compute the raster
- Returns:
raster_index (np.ndarray) – 2D array of grid indices for H5 or list of slices for NETCDF
- classmethod get_raw_feature_list(features, handle_features)
Lookup inputs needed to compute feature
- Parameters:
features (list) – Features for which to get needed inputs for derivation
handle_features (list) – Features available in raw data
- Returns:
list – List of input features
- abstract get_time_index(file_paths, max_workers=None, **kwargs)
Get raw time index for source data
- property grid_mem
Get memory used by a feature at a single time step
- Returns:
int – Number of bytes for a single feature array at a single time step
- property grid_shape
Get shape of raster
- Returns:
_grid_shape (tuple) – (rows, cols) grid size.
- property handle_features
All features available in raw input
- classmethod has_exact_feature(feature, handle)
Check if exact feature is in handle
- Parameters:
feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object
- Returns:
bool – Whether handle contains exact feature or not
- classmethod has_multilevel_feature(feature, handle)
Check if handle contains multilevel data for the given feature
- Parameters:
feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object
- Returns:
bool – Whether handle contains multilevel data for given feature
- classmethod has_surrounding_features(feature, handle)
Check if handle has feature values at surrounding heights. e.g. if feature=U_40m check if the handler has u at heights below and above 40m
- Parameters:
feature (str) – Raw feature name e.g. U_100m
handle (xarray.Dataset) – netcdf data object
- Returns:
bool – Whether feature has surrounding heights
- property hr_exo_features
Get a list of exogenous high-resolution features that are only used for training e.g., mid-network high-res topo injection. These must come at the end of the high-res feature set. These can also be input to the model as low-res features.
- property hr_out_features
Get a list of high-resolution features that are intended to be output by the GAN. Does not include high-resolution exogenous features
- property input_file_info
Method to provide info about files in log output. Since NETCDF files have single time slices printing out all the file paths is just a text dump without much info.
- Returns:
str – message to append to log output that does not include a huge info dump of file paths
- property invert_lat
Whether to invert the latitude axis during data extraction. This is to enforce a descending latitude ordering so that the lower left corner of the grid is at idx=(-1, 0) instead of idx=(0, 0)
- property is_time_independent
Get whether source data files are time independent
- property lat_lon
Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This ensures that the lower left hand corner of the domain is given by lat_lon[-1, 0]
- Returns:
ndarray
- property latitude
Flattened list of latitudes
- classmethod lats_are_descending(lat_lon)
Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner)
- Parameters:
lat_lon (np.ndarray) – Lat/Lon array with shape (n_lats, n_lons, 2)
- Returns:
bool
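The descending-latitude check and the related inversion can be sketched as follows. This is an assumption about how such a check might be written, comparing the first and last rows of the latitude grid; the helper name lats_descending is hypothetical.

```python
import numpy as np


def lats_descending(lat_lon):
    """True if latitude decreases from the first row to the last row."""
    return bool(lat_lon[0, 0, 0] > lat_lon[-1, 0, 0])


# Grid stored with ascending latitudes (lower left corner at index (0, 0))
lats = np.array([40.0, 41.0, 42.0])
lons = np.array([-106.0, -105.0])
lon_grid, lat_grid = np.meshgrid(lons, lats)
lat_lon = np.dstack([lat_grid, lon_grid])

# Flipping the first axis enforces descending order, which moves the
# lower left corner to index (-1, 0)
flipped = lat_lon[::-1]
print(lats_descending(lat_lon), lats_descending(flipped))  # False True
```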
- lin_bc(bc_files, threshold=0.1)
Bias correct the data in this DataHandler using linear bias correction factors from files output by MonthlyLinearCorrection or LinearCorrection from sup3r.bias.bias_calc
- Parameters:
bc_files (list | tuple | str) – One or more filepaths to .h5 files output by MonthlyLinearCorrection or LinearCorrection. These should contain datasets named “{feature}_scalar” and “{feature}_adder” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time is length 1 for annual correction or 12 for monthly correction.
threshold (float) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
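A hedged sketch of how scalar and adder factors of shape (lat, lon, time) might be applied is shown below. It assumes the correction has the form data * scalar + adder and that monthly factors are selected by calendar month; the helper name apply_linear_bc and its months argument are hypothetical and not part of the lin_bc signature.

```python
import numpy as np


def apply_linear_bc(data, months, scalar, adder):
    """Apply data * scalar + adder on a (lat, lon, time) array (sketch).

    scalar and adder have a last axis of length 1 (annual) or 12 (monthly);
    months holds the calendar month (1-12) of each time step.
    """
    if scalar.shape[-1] == 1:
        return data * scalar + adder  # annual factors broadcast over time
    idx = np.asarray(months) - 1      # month number -> index along last axis
    return data * scalar[..., idx] + adder[..., idx]


data = np.ones((2, 2, 3))
months = [1, 1, 2]
scalar = np.ones((2, 2, 12))
scalar[..., 1] = 2.0   # February scalar
adder = np.zeros((2, 2, 12))
adder[..., 0] = 0.5    # January adder

out = apply_linear_bc(data, months, scalar, adder)
print(out[0, 0])  # [1.5 1.5 2. ]
```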
- load_cached_data(with_split=True)
Load data from cache files and split into training and validation
- Parameters:
with_split (bool) – Whether to split into training and validation data or not.
- property load_workers
Get upper bound on load workers based on memory limits. Used to load cached data.
- property longitude
Flattened list of longitudes
- classmethod lookup(feature, attr_name, handle_features=None)
Lookup feature in feature registry
- Parameters:
feature (str) – Feature to lookup in registry
attr_name (str) – Type of method to lookup. e.g. inputs or compute
handle_features (list) – List of feature names (datasets) available in the source file. If feature is found explicitly in this list, height/pressure suffixes will not be appended to the output.
- Returns:
method | None – Feature registry method corresponding to feature
- property lr_features
Get a list of low-resolution features. It is assumed that all features are used in the low-resolution observations. If you want to use high-res-only features, use the DualDataHandler class.
- property lr_only_features
List of feature names or patt*erns that should only be included in the low-res training set and not the high-res observations.
- mask_nan()
Drop timesteps with NaN data
- property means
Get the mean values for each feature.
- Returns:
dict
- property meta
Meta dataframe with coordinates.
- property n_tsteps
Get number of time steps to extract
- property need_full_domain
Check whether we need to get the full lat/lon grid to determine target and shape values
- property noncached_features
Get list of features needing extraction or derivation
- property norm_workers
Get upper bound on workers used for normalization.
- normalize(means=None, stds=None, features=None, max_workers=None)
Normalize all data features.
- Parameters:
means (dict | None) – Dictionary of means for all features with keys: feature names and values: mean values. If this is None, the self.means attribute will be used. If this is not None, this DataHandler object means attribute will be updated.
stds (dict | None) – Dictionary of standard deviation values for all features with keys: feature names and values: standard deviations. If this is None, the self.stds attribute will be used. If this is not None, this DataHandler object stds attribute will be updated.
features (list | None) – List of features used for indexing data array during normalization.
max_workers (None | int) – Max workers to perform normalization. If None, self.norm_workers will be used
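Per-feature normalization with means and stds dictionaries can be sketched as below, assuming the usual (x - mean) / std form applied along the last (feature) axis of a (spatial_1, spatial_2, temporal, features) array. The helper name normalize_features is hypothetical, and unlike the method it returns a copy instead of updating the handler's data.

```python
import numpy as np


def normalize_features(data, features, means, stds):
    """Return a normalized copy of a 4D array, one feature at a time."""
    out = data.astype(np.float64)  # astype copies, so the input is untouched
    for i, name in enumerate(features):
        out[..., i] = (out[..., i] - means[name]) / stds[name]
    return out


features = ["U_100m", "V_100m"]
data = np.arange(2 * 2 * 3 * 2, dtype=float).reshape(2, 2, 3, 2)
means = {f: data[..., i].mean() for i, f in enumerate(features)}
stds = {f: data[..., i].std() for i, f in enumerate(features)}

normed = normalize_features(data, features, means, stds)
```

After this step each feature slice of normed has approximately zero mean and unit standard deviation.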
- classmethod parallel_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features, max_workers=None)
Compute features using parallel subprocesses
- Parameters:
data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
file_paths (list) – Paths to data files. Used if compute method operates directly on source handler instead of input arrays. This is done with features without inputs methods like lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
derived_features (list) – list of feature strings which need to be derived
all_features (list) – list of all features including those requiring derivation from input features
handle_features (list) – Features available in raw data
max_workers (int | None) – Number of max workers to use for computation. If equal to 1 then method is run in serial
- Returns:
data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. Includes e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
- classmethod parallel_extract(file_paths, raster_index, time_chunks, input_features, max_workers=None, **kwargs)
Extract features using parallel subprocesses
- Parameters:
file_paths (list) – list of file paths
raster_index (ndarray | list) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
input_features (list) – list of input feature strings
max_workers (int | None) – Number of max workers to use for extraction. If equal to 1 then method is run in serial
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)
- Returns:
dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
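The data[chunk_number][feature] layout returned here can be illustrated with a small sketch that submits one job per (time chunk, feature) pair. It uses a thread pool from the standard library instead of subprocesses, and fake_extract is a hypothetical stand-in for reading data from a source file.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_extract(feature, time_slice):
    """Hypothetical stand-in for reading one feature over one time chunk."""
    return [f"{feature}@{t}" for t in range(time_slice.start, time_slice.stop)]


def extract_chunks(time_chunks, input_features, max_workers=None):
    """Build data[chunk_number][feature] with one job per chunk/feature pair."""
    data = {i: {} for i in range(len(time_chunks))}
    jobs = [(i, f) for i in range(len(time_chunks)) for f in input_features]
    if max_workers == 1:
        # Serial path: no pool is created
        for i, f in jobs:
            data[i][f] = fake_extract(f, time_chunks[i])
        return data
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fake_extract, f, time_chunks[i]): (i, f)
                   for i, f in jobs}
        for future, (i, f) in futures.items():
            data[i][f] = future.result()
    return data


time_chunks = [slice(0, 2), slice(2, 4)]
data = extract_chunks(time_chunks, ["U_100m", "V_100m"], max_workers=2)
print(data[1]["V_100m"])  # ['V_100m@2', 'V_100m@3']
```

The number of jobs equals the number of time chunks times the number of features, which matches the upper bound quoted for extract_workers.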
- parallel_load(data, cache_files, features, max_workers=None)
Load feature data in parallel
- Parameters:
data (ndarray) – Array to fill with cached data
cache_files (list) – List of cache files for each feature
features (list) – List of requested features
max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.
- classmethod pop_old_data(data, chunk_number, all_features)
Remove input feature data if no longer needed for requested features
- Parameters:
data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
chunk_number (int) – time chunk index to check
all_features (list) – list of all requested features including those requiring derivation from input features
- preflight()
Run some preflight checks and verify that the inputs are valid
- qdm_bc(bc_files, reference_feature, relative=True, threshold=0.1, no_trend=False)
Bias Correction using Quantile Delta Mapping
Bias correct this DataHandler’s data with Quantile Delta Mapping. The required statistical distributions should be pre-calculated using sup3r.bias.qdm.QuantileDeltaMappingCorrection.
Warning: There is no guarantee that the coefficients from bc_files match the resource processed here. Be careful choosing bc_files.
- Parameters:
bc_files (list | tuple | str) – One or more filepaths to .h5 files output by bias_calc.QuantileDeltaMappingCorrection. These should contain datasets named “base_{reference_feature}_params”, “bias_{feature}_params”, and “bias_fut_{feature}_params” where {feature} is one of the features contained by this DataHandler and the data is a 3D array of shape (lat, lon, time) where time.
reference_feature (str) – Name of the feature used as (historical) reference. Dataset with name “base_{reference_feature}_params” will be retrieved from bc_files.
relative (bool, default=True) – Switcher to apply QDM as a relative (use True) or absolute (use False) correction value.
threshold (float, default=0.1) – Nearest neighbor euclidean distance threshold. If the DataHandler coordinates are more than this value away from the bias correction lat/lon, an error is raised.
no_trend (bool, default=False) – An option to ignore the trend component of the correction, thus resulting in an ordinary Quantile Mapping, i.e. corrects the bias by comparing the distributions of the biased dataset with a reference dataset. See params_mf of rex.utilities.bc_utils.QuantileDeltaMapping. Note that this assumes that “bias_{feature}_params” (params_mh) is the data distribution representative for the target data.
- property raster_index
Raster index property
- property raw_features
Get list of features needed for computations
- property raw_lat_lon
Lat lon grid for data in format (spatial_1, spatial_2, 2) Lat/Lon array with same ordering in last dimension. This returns the grid without any lat inversion.
- Returns:
ndarray
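As a rough illustration of the (spatial_1, spatial_2, 2) layout, the sketch below builds a synthetic regular grid with NumPy. The helper name make_lat_lon_sketch and the coordinate values are made up for the example; nothing here reads from a sup3r handler.

```python
import numpy as np


def make_lat_lon_sketch(lats, lons):
    """Stack 2D latitude and longitude grids into one (spatial_1, spatial_2, 2) array."""
    lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")
    # Last dimension holds (lat, lon) in that order.
    return np.stack([lat_grid, lon_grid], axis=-1)


lat_lon = make_lat_lon_sketch(np.linspace(39.0, 40.0, 5),
                              np.linspace(-105.0, -104.0, 4))
lats_2d = lat_lon[..., 0]  # latitude at every grid cell
lons_2d = lat_lon[..., 1]  # longitude at every grid cell
```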
- property raw_time_index
Time index for input data without time pruning. This is the base time index for the raw input data.
- property raw_tsteps
Get number of time steps for all input files
- classmethod recursive_compute(data, feature, handle_features, file_paths, raster_index)
Compute intermediate features recursively
- Parameters:
data (dict) – dictionary of feature arrays. e.g. data[feature] = array. (spatial_1, spatial_2, temporal)
feature (str) – Name of feature to compute
handle_features (list) – Features available in raw data
file_paths (list) – Paths to data files. Used if the compute method operates directly on the source handler instead of input arrays. This is the case for features without inputs methods, such as lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
- Returns:
ndarray – Array of computed feature data
- property requested_shape
Get requested shape for cached data
- run_all_data_init()
Build base 4D data array. Can handle multiple files but assumes each file has the same spatial domain
- Returns:
data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)
- run_data_compute()
Run the data computation / derivation from raw features to desired features.
- run_data_extraction()
Run the raw dataset extraction process from disk to raw un-manipulated datasets.
- run_nn_fill()
Run nn nan fill on full data array.
- classmethod serial_compute(data, file_paths, raster_index, time_chunks, derived_features, all_features, handle_features)
Compute features in series
- Parameters:
data (dict) – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
file_paths (list) – Paths to data files. Used if the compute method operates directly on the source handler instead of input arrays. This is the case for features without inputs methods, such as lat_lon and topography.
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
derived_features (list) – list of feature strings which need to be derived
all_features (list) – list of all features including those requiring derivation from input features
handle_features (list) – Features available in raw data
- Returns:
data (dict) – dictionary of feature arrays, including computed features, with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
- serial_data_fill(shifted_time_chunks)
Fill final data array in serial
- Parameters:
shifted_time_chunks (list) – List of time slices corresponding to the appropriate location of extracted / computed chunks in the final data array
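The idea of placing per-chunk arrays into the final 4D array can be sketched as follows. This is an assumption about the general pattern, not the library's code: serial_fill_sketch, the zero-initialized output and the example feature names are all hypothetical.

```python
import numpy as np


def serial_fill_sketch(chunk_data, shifted_time_chunks, features, spatial_shape):
    """Copy chunked (spatial_1, spatial_2, temporal) arrays into one 4D array."""
    n_time = max(t_slice.stop for t_slice in shifted_time_chunks)
    out = np.zeros((*spatial_shape, n_time, len(features)), dtype=np.float32)
    for chunk_number, t_slice in enumerate(shifted_time_chunks):
        for f_idx, feature in enumerate(features):
            # Each chunk lands at its own time slice in the final array.
            out[:, :, t_slice, f_idx] = chunk_data[chunk_number][feature]
    return out


features = ["U_100m", "V_100m"]
shifted_time_chunks = [slice(0, 3), slice(3, 5)]
chunk_data = {
    i: {f: np.full((2, 2, t.stop - t.start), float(i)) for f in features}
    for i, t in enumerate(shifted_time_chunks)
}
filled = serial_fill_sketch(chunk_data, shifted_time_chunks, features, (2, 2))
```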
- classmethod serial_extract(file_paths, raster_index, time_chunks, input_features, **kwargs)
Extract features in series
- Parameters:
file_paths (list) – list of file paths
raster_index (ndarray) – raster index for spatial domain
time_chunks (list) – List of slices to chunk data feature extraction along time dimension
input_features (list) – list of input feature strings
kwargs (dict) – kwargs passed to source handler for data extraction. e.g. This could be {'parallel': True, 'chunks': {'south_north': 120, 'west_east': 120}} which then gets passed to xr.open_mfdataset(file, **kwargs)
- Returns:
dict – dictionary of feature arrays with integer keys for chunks and str keys for features. e.g. data[chunk_number][feature] = array. (spatial_1, spatial_2, temporal)
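The nested return structure, data[chunk_number][feature], may be easier to see in a small sketch. Here an in-memory dictionary of arrays stands in for the source files, so no source handler is opened; serial_extract_sketch and the feature names are illustrative assumptions.

```python
import numpy as np


def serial_extract_sketch(source, time_chunks, input_features):
    """Build data[chunk_number][feature] from (spatial_1, spatial_2, temporal) arrays."""
    data = {}
    for chunk_number, t_slice in enumerate(time_chunks):
        data[chunk_number] = {
            feature: source[feature][:, :, t_slice] for feature in input_features
        }
    return data


# Stand-in for raw file contents: 4 x 4 grid with 10 time steps.
source = {"U_100m": np.random.rand(4, 4, 10), "V_100m": np.random.rand(4, 4, 10)}
time_chunks = [slice(0, 4), slice(4, 8), slice(8, 10)]
data = serial_extract_sketch(source, time_chunks, ["U_100m", "V_100m"])
```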
- property shape
Full data shape
- Returns:
shape (tuple) – Full data shape (spatial_1, spatial_2, temporal, features)
- property single_ts_files
Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice
- property size
Size of data array
- Returns:
size (int) – Number of total elements contained in data array
- abstract classmethod source_handler(file_paths, **kwargs)
Handle for source data. Uses xarray, ResourceX, etc.
NOTE: xarray appears to treat open file handlers as singletons within a threadpool, so it's okay to open this source_handler without a context handler or a .close() statement.
- property source_type
Get data type for source files. Either nc or h5
- split_data(data=None, val_split=0.0, shuffle_time=False)
Split time dimension into set of training indices and validation indices
- Parameters:
data (np.ndarray) – 4D array of high res data (spatial_1, spatial_2, temporal, features)
val_split (float) – Fraction of data to separate for validation.
shuffle_time (bool) – Whether to shuffle time or not.
- Returns:
data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Training data fraction of initial data array. Initial data array is overwritten by this new data array.
val_data (np.ndarray) – (spatial_1, spatial_2, temporal, features) Validation data fraction of initial data array.
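A minimal sketch of splitting the time axis into training and validation parts is shown below. Which time steps end up in the validation set (here, the first fraction of the possibly shuffled index) is an assumption of this example rather than documented behavior, and split_time_sketch with its seed argument is a hypothetical helper.

```python
import numpy as np


def split_time_sketch(data, val_split=0.0, shuffle_time=False, seed=None):
    """Split a (spatial_1, spatial_2, temporal, features) array along time."""
    n_time = data.shape[2]
    indices = np.arange(n_time)
    if shuffle_time:
        np.random.default_rng(seed).shuffle(indices)
    n_val = int(val_split * n_time)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return data[:, :, train_idx, :], data[:, :, val_idx, :]


full = np.zeros((4, 4, 10, 2))
train, val = split_time_sketch(full, val_split=0.2, shuffle_time=True, seed=0)
```

With val_split=0.2 and 10 time steps, 2 steps go to validation and 8 to training. Shuffling breaks temporal continuity, which matters when the samples drawn later span several consecutive time steps.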
- property stds
Get the standard deviation values for each feature.
- Returns:
dict
- property target
Get lower left corner of raster
- Returns:
_target (tuple) – (lat, lon) lower left corner of raster.
- property temporal_slice
Get temporal range to extract from full dataset
- property ti_workers
Get max number of workers for computing time index
- property time_chunk_size
Get upper bound on time chunk size based on memory limits
- property time_chunks
Get time chunks which will be extracted from source data
- Returns:
_time_chunks (list) – List of time chunks used to split up source data time dimension so that each chunk can be extracted individually
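For illustration, a list of time chunks can be represented as consecutive slices over the time dimension. The helper below is a hypothetical sketch; how sup3r derives the chunk size from memory limits is not shown here.

```python
def time_chunks_sketch(n_steps, chunk_size):
    """Consecutive slices of at most chunk_size steps covering n_steps."""
    return [
        slice(start, min(start + chunk_size, n_steps))
        for start in range(0, n_steps, chunk_size)
    ]


chunks = time_chunks_sketch(10, 4)
# chunks == [slice(0, 4), slice(4, 8), slice(8, 10)]
```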
- property time_freq_hours
Get the time frequency in hours as a float
- property time_index
Time index for input data with time pruning. This is the raw time index with a cropped range and time step applied.
- time_index_conflict_check()
Check if the number of input files and the length of the time index are the same
- property time_index_file
Get time index file path
- property try_load
Check if we should try to load cache
- classmethod valid_handle_features(features, handle_features)
Check if features are in handle
- Parameters:
features (str | list) – Raw feature names e.g. U_100m
handle_features (list) – Features available in raw data
- Returns:
bool – Whether feature basename is in handle
- classmethod valid_input_features(features, handle_features)
Check if features are in handle or have compute methods
- Parameters:
features (str | list) – Raw feature names e.g. U_100m
handle_features (list) – Features available in raw data
- Returns:
bool – Whether feature basename is in handle
- get_observation_index(temporal_weights=None, spatial_weights=None)
Randomly gets weighted spatial sample and time sample
- Parameters:
temporal_weights (array) – Weights used to select time slice (n_time_chunks)
spatial_weights (array) – Weights used to select spatial chunks (n_lat_chunks * n_lon_chunks)
- Returns:
observation_index (tuple) – Tuple of sampled spatial grid, time slice, and features indices. Used to get single observation like self.data[observation_index]
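One way to picture weighted sampling is sketched below: a chunk is drawn according to its weight, then a start position is drawn uniformly inside that chunk. This is an assumption about the general approach, not the library's algorithm. In particular, the spatial chunks here are equal splits of the flattened valid start positions rather than an explicit n_lat_chunks by n_lon_chunks grid, and the function name and rng argument are hypothetical. The sketch also assumes there are at least as many valid start positions as chunks.

```python
import numpy as np


def weighted_observation_index_sketch(data_shape, sample_shape, temporal_weights,
                                      spatial_weights, rng=None):
    """Draw a (rows, cols, time, features) index tuple using chunk weights."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols, n_time, n_features = data_shape
    s1, s2, s_t = sample_shape

    # Valid start positions so the sample stays inside the data array.
    time_starts = np.arange(n_time - s_t + 1)
    n_row_starts, n_col_starts = rows - s1 + 1, cols - s2 + 1
    flat_starts = np.arange(n_row_starts * n_col_starts)

    def pick(starts, weights):
        weights = np.asarray(weights, dtype=float)
        bins = np.array_split(starts, weights.size)  # one bin per chunk
        chosen = bins[rng.choice(weights.size, p=weights / weights.sum())]
        return int(rng.choice(chosen))               # uniform within the chunk

    t0 = pick(time_starts, temporal_weights)
    r0, c0 = (int(v) for v in np.unravel_index(
        pick(flat_starts, spatial_weights), (n_row_starts, n_col_starts)))
    return (slice(r0, r0 + s1), slice(c0, c0 + s2),
            slice(t0, t0 + s_t), np.arange(n_features))


data = np.zeros((20, 20, 100, 3))
obs_idx = weighted_observation_index_sketch(
    data.shape, (5, 5, 10),
    temporal_weights=[0, 0, 0, 1],  # all weight on the last time chunk
    spatial_weights=[1, 1, 1, 1],   # uniform over spatial chunks
    rng=np.random.default_rng(0),
)
observation = data[obs_idx]
```

Putting all temporal weight on the last chunk forces the sampled time slice to start in the final quarter of the valid start positions.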
- get_next(temporal_weights=None, spatial_weights=None)
Get data for observation using weighted random observation index. Loops repeatedly over randomized time index.
- Parameters:
temporal_weights (array) – Weights used to select time slice (n_time_chunks)
spatial_weights (array) – Weights used to select spatial chunks (n_lat_chunks * n_lon_chunks)
- Returns:
observation (np.ndarray) – 4D array (spatial_1, spatial_2, temporal, features)