sup3r.preprocessing.data_handling.mixin.InputMixIn
- class InputMixIn(target, shape, raster_file=None, raster_index=None, temporal_slice=slice(None, None, 1), res_kwargs=None)[source]
Bases: CacheHandlingMixIn
MixIn class with properties and methods for handling the spatiotemporal data domain to extract from source data.
Provide properties of the spatiotemporal data domain
- Parameters:
target (tuple) – (lat, lon) lower left corner of raster. Either need target+shape or raster_file.
shape (tuple) – (rows, cols) grid size. Either need target+shape or raster_file.
raster_file (str | None) – File for raster_index array for the corresponding target and shape. If specified, the raster_index will be loaded from the file if it exists or written to the file if it does not yet exist. If None and raster_index is not provided, raster_index will be calculated directly. Either need target+shape, raster_file, or raster_index input.
raster_index (list) – List of tuples or slices. Used as an alternative to computing the raster index from target+shape or loading the raster index from file
temporal_slice (slice) – Slice specifying extent and step of temporal extraction. e.g. slice(start, stop, time_pruning). If equal to slice(None, None, 1) the full time dimension is selected.
res_kwargs (dict | None) – Dictionary of kwargs to pass to xarray.open_mfdataset.
Methods
cap_worker_args(max_workers) – Cap all worker args by max_workers
check_cached_features(features[, ...]) – Check which features have been cached and check flags to determine whether to load or extract these features again
get_cache_file_names(cache_pattern[, ...]) – Get names of cache files from cache_pattern and feature names
get_capped_workers(max_workers_cap, max_workers) – Get max number of workers for a given job.
get_full_domain(file_paths) – Get full lat/lon grid for when target + shape are not specified
get_lat_lon(file_paths, raster_index[, ...]) – Get lat/lon grid for requested target and shape
get_time_index(file_paths[, max_workers]) – Get raw time index for source data
lats_are_descending(lat_lon) – Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner).
parallel_load(data, cache_files, features[, ...]) – Load feature data in parallel
time_index_conflict_check() – Check if the number of input files and the length of the time index are the same
Attributes
cache_files – Cache files for storing extracted data
cache_pattern – Get correct cache file pattern for formatting.
cached_features – List of features which have been requested but have been determined not to need extraction.
file_paths – Get file paths for input data
full_raw_lat_lon – Get the full lat/lon grid without doing any latitude inversion
grid_shape – Get shape of raster
input_file_info – Method to provide info about files in log output.
invert_lat – Whether to invert the latitude axis during data extraction.
lat_lon – Lat lon grid for data in format (spatial_1, spatial_2, 2). Lat/Lon array with same ordering in last dimension.
latitude – Flattened list of latitudes
longitude – Flattened list of longitudes
meta – Meta dataframe with coordinates.
need_full_domain – Check whether we need to get the full lat/lon grid to determine target and shape values
noncached_features – Get list of features needing extraction or derivation
raw_lat_lon – Lat lon grid for data in format (spatial_1, spatial_2, 2). Lat/Lon array with same ordering in last dimension.
raw_time_index – Time index for input data without time pruning.
raw_tsteps – Get number of time steps for all input files
single_ts_files – Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice
source_type – Get data type for source files.
target – Get lower left corner of raster
temporal_slice – Get temporal range to extract from full dataset
ti_workers – Get max number of workers for computing time index
time_freq_hours – Get the time frequency in hours as a float
time_index – Time index for input data with time pruning.
time_index_file – Get time index file path
try_load – Check if we should try to load cache
- property raw_tsteps
Get number of time steps for all input files
- property single_ts_files
Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice
- static get_capped_workers(max_workers_cap, max_workers)[source]
Get max number of workers for a given job. Capped to global max workers if specified
- Parameters:
max_workers_cap (int | None) – Cap for job specific max_workers
max_workers (int | None) – Job specific max_workers
- Returns:
max_workers (int | None) – job specific max_workers capped by max_workers_cap if provided
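The capping rule described above can be sketched as follows. This is an illustrative stand-in, not the sup3r source: the function name `capped_workers_sketch` is hypothetical, and the behavior when only the cap is given (returning the cap) is an assumption.

```python
from typing import Optional


def capped_workers_sketch(max_workers_cap: Optional[int],
                          max_workers: Optional[int]) -> Optional[int]:
    """Hypothetical stand-in for get_capped_workers (not the sup3r code)."""
    if max_workers_cap is None:
        # No cap provided: the job-specific value passes through unchanged.
        return max_workers
    if max_workers is None:
        # Assumption: with no job-specific value, the cap itself is used.
        return max_workers_cap
    # Both provided: the job-specific value may not exceed the cap.
    return min(max_workers_cap, max_workers)
```

Under these assumptions, a job asking for 16 workers with a cap of 4 would run with 4, and a job with no cap keeps whatever it asked for.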
- abstract classmethod get_full_domain(file_paths)[source]
Get full lat/lon grid for when target + shape are not specified
- abstract classmethod get_lat_lon(file_paths, raster_index, invert_lat=False)[source]
Get lat/lon grid for requested target and shape
- abstract get_time_index(file_paths, max_workers=None, **kwargs)[source]
Get raw time index for source data
- property input_file_info
Method to provide info about files in log output. Since NetCDF files have single time slices, printing out all the file paths is just a text dump without much info.
- Returns:
str – message to append to log output that does not include a huge info dump of file paths
- property temporal_slice
Get temporal range to extract from full dataset
- property file_paths
Get file paths for input data
- property ti_workers
Get max number of workers for computing time index
- property need_full_domain
Check whether we need to get the full lat/lon grid to determine target and shape values
- property full_raw_lat_lon
Get the full lat/lon grid without doing any latitude inversion
- property raw_lat_lon
Lat lon grid for data in format (spatial_1, spatial_2, 2). Lat/Lon array with same ordering in last dimension. This returns the grid without any lat inversion.
- Returns:
ndarray
- property latitude
Flattened list of latitudes
- property longitude
Flattened list of longitudes
- property meta
Meta dataframe with coordinates.
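To illustrate how the flattened latitude and longitude lists and the meta dataframe relate to a (spatial_1, spatial_2, 2) grid, here is a minimal sketch. The grid values and the column names `latitude` and `longitude` are assumptions made for the example, not taken from the sup3r source.

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 grid with shape (spatial_1, spatial_2, 2);
# the last dimension holds (lat, lon) in that order.
lats = np.array([41.0, 40.0])
lons = np.array([-105.0, -104.0, -103.0])
lon_grid, lat_grid = np.meshgrid(lons, lats)
lat_lon = np.dstack([lat_grid, lon_grid])

# Flattened coordinate lists, one entry per grid cell (row-major order).
latitude = lat_lon[..., 0].flatten()
longitude = lat_lon[..., 1].flatten()

# Assumed column names for the coordinate dataframe.
meta = pd.DataFrame({"latitude": latitude, "longitude": longitude})
```

Each row of `meta` in this sketch corresponds to one grid cell, so its length equals rows times cols of the grid.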
- property lat_lon
Lat lon grid for data in format (spatial_1, spatial_2, 2). Lat/Lon array with same ordering in last dimension. This ensures that the lower left hand corner of the domain is given by lat_lon[-1, 0].
- Returns:
ndarray
- property invert_lat
Whether to invert the latitude axis during data extraction. This is to enforce a descending latitude ordering so that the lower left corner of the grid is at idx=(-1, 0) instead of idx=(0, 0)
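The effect of latitude inversion can be shown with a small NumPy sketch. The grid values below are made up, and flipping the first axis with `[::-1]` is one assumed way to perform the inversion; the actual extraction code may do it differently.

```python
import numpy as np

# Hypothetical raw grid with ascending latitudes, so the lower left
# corner of the domain sits at idx=(0, 0).
lats = np.array([40.0, 41.0, 42.0])
lons = np.array([-105.0, -104.0])
lon_grid, lat_grid = np.meshgrid(lons, lats)
raw_lat_lon = np.dstack([lat_grid, lon_grid])  # shape (3, 2, 2)

# Inverting the latitude axis gives a descending ordering, which moves
# the lower left corner to idx=(-1, 0).
lat_lon = raw_lat_lon[::-1]
lower_left = tuple(lat_lon[-1, 0])
```

After the flip, `lat_lon[-1, 0]` holds the same (lat, lon) pair that `raw_lat_lon[0, 0]` held before it.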
- property target
Get lower left corner of raster
- Returns:
_target (tuple) – (lat, lon) lower left corner of raster.
- classmethod lats_are_descending(lat_lon)[source]
Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner)
- Parameters:
lat_lon (np.ndarray) – Lat/Lon array with shape (n_lats, n_lons, 2)
- Returns:
bool
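A check like the one described above could be written as follows. This is a sketch under the assumption that comparing the first and last rows of the first column is sufficient; `lats_are_descending_sketch` and `make_grid` are hypothetical names, not part of sup3r.

```python
import numpy as np


def lats_are_descending_sketch(lat_lon):
    """Return True if latitude decreases along the first axis.

    Assumes lat_lon has shape (n_lats, n_lons, 2) with latitude stored
    at index 0 of the last dimension.
    """
    return bool(lat_lon[-1, 0, 0] < lat_lon[0, 0, 0])


def make_grid(lats, lons):
    """Build a (n_lats, n_lons, 2) lat/lon array for the example."""
    lon_grid, lat_grid = np.meshgrid(lons, lats)
    return np.dstack([lat_grid, lon_grid])


descending = make_grid([42.0, 41.0, 40.0], [-105.0, -104.0])
ascending = descending[::-1]
```

In this example `descending` already has its lowest latitude in the last row, while `ascending` would need its latitude axis inverted first.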
- property grid_shape
Get shape of raster
- Returns:
_grid_shape (tuple) – (rows, cols) grid size.
- property source_type
Get data type for source files. Either nc or h5
- property raw_time_index
Time index for input data without time pruning. This is the base time index for the raw input data.
- time_index_conflict_check()[source]
Check if the number of input files and the length of the time index are the same
- property time_index
Time index for input data with time pruning. This is the raw time index with a cropped range and time step applied.
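As a rough illustration of how a temporal slice crops and prunes a raw time index, the sketch below applies a `slice(start, stop, step)` to a pandas `DatetimeIndex`. The dates and slice values are made up for the example.

```python
import pandas as pd

# Hypothetical raw time index: 48 hourly steps.
raw_time_index = pd.date_range("2020-01-01", periods=48,
                               freq=pd.Timedelta(hours=1))

# slice(start, stop, time_pruning): keep every 3rd step from index 6 to 29.
temporal_slice = slice(6, 30, 3)
time_index = raw_time_index[temporal_slice]

# slice(None, None, 1) selects the full time dimension.
full_time_index = raw_time_index[slice(None, None, 1)]
```

Here `time_index` starts at 06:00 on the first day and advances in 3-hour steps, while `full_time_index` is identical to the raw index.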
- property time_freq_hours
Get the time frequency in hours as a float
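One plausible way to derive a time frequency in hours from a time index is to take the most common spacing between consecutive steps. The sketch below assumes that approach; `time_freq_hours_sketch` is a hypothetical name and the real property may compute the value differently.

```python
import pandas as pd


def time_freq_hours_sketch(time_index):
    """Most common step between consecutive timestamps, in hours."""
    # Differences between neighbouring timestamps; the first is NaT.
    diffs = time_index.to_series().diff().dropna()
    hours = diffs.dt.total_seconds() / 3600
    # Assumption: the modal spacing represents the data frequency.
    return float(hours.mode().iloc[0])


half_hourly = pd.date_range("2020-01-01", periods=10,
                            freq=pd.Timedelta(minutes=30))
freq = time_freq_hours_sketch(half_hourly)
```

Using the mode rather than the first difference keeps the result stable if a few steps are missing from the index.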
- property time_index_file
Get time index file path
- property cache_files
Cache files for storing extracted data
- property cache_pattern
Get correct cache file pattern for formatting.
- Returns:
_cache_pattern (str) – The cache file pattern with formatting keys included.
- property cached_features
List of features which have been requested but have been determined not to need extraction. Thus they have been cached already.
- static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)
Check which features have been cached and check flags to determine whether to load or extract these features again
- Parameters:
features (list) – list of features to extract
cache_files (list | None) – Path to files with saved feature data
overwrite_cache (bool) – Whether to overwrite cached files
load_cached (bool) – Whether to load data from cache files
- Returns:
list – List of features to extract. Might not include features which have cache files.
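The selection logic described above can be approximated as follows. This sketch assumes that a feature is skipped only when its cache file exists on disk and overwrite_cache is False, and it leaves out load_cached on the assumption that the flag only affects what the caller does with the cached files afterwards. The function name and the feature names are hypothetical.

```python
import os
import tempfile


def features_to_extract_sketch(features, cache_files=None,
                               overwrite_cache=False):
    """Return the features that still need extraction (illustrative only)."""
    if cache_files is None:
        # Nothing can be cached, so everything must be extracted.
        return list(features)
    to_extract = []
    for feature, cache_file in zip(features, cache_files):
        cached = os.path.exists(cache_file)
        if cached and not overwrite_cache:
            # Assumption: an existing cache file is reused as-is.
            continue
        to_extract.append(feature)
    return to_extract


with tempfile.TemporaryDirectory() as tmp:
    cached_path = os.path.join(tmp, "U_100m.pkl")
    missing_path = os.path.join(tmp, "V_100m.pkl")
    open(cached_path, "wb").close()  # pretend U_100m was cached earlier
    paths = [cached_path, missing_path]
    kept = features_to_extract_sketch(["U_100m", "V_100m"], paths)
    overwritten = features_to_extract_sketch(["U_100m", "V_100m"], paths,
                                             overwrite_cache=True)
```

With overwrite_cache=True every feature is returned for extraction again, even when a cache file is present.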
- get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)
Get names of cache files from cache_pattern and feature names
- Parameters:
cache_pattern (str) – Pattern to use for cache file names
grid_shape (tuple) – Shape of grid to use for cache file naming
time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming
target (tuple) – Target to use for cache file naming
features (list) – List of features to use for cache file naming
- Returns:
list – List of cache file names
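A cache pattern with formatting keys can be expanded per feature with plain string formatting. In the sketch below the `{feature}` key, the file path, and the feature names are assumptions for illustration; the keys that sup3r actually supports, and how grid_shape, time_index, and target enter the file name, are not specified in this section.

```python
def cache_file_names_sketch(cache_pattern, features):
    """Expand a pattern once per feature (illustrative only)."""
    # Assumption: the pattern contains a '{feature}' formatting key.
    return [cache_pattern.format(feature=feature) for feature in features]


names = cache_file_names_sketch("./cache/data_{feature}.pkl",
                                ["U_100m", "V_100m"])
```

The result is one file name per requested feature, in the same order as the feature list.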
- property noncached_features
Get list of features needing extraction or derivation
- parallel_load(data, cache_files, features, max_workers=None)
Load feature data in parallel
- Parameters:
data (ndarray) – Array to fill with cached data
cache_files (list) – List of cache files for each feature
features (list) – List of requested features
max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.
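Filling a preallocated array from per-feature cache files in parallel might look like the sketch below. It assumes one `.npy` file per feature and a feature axis in the last dimension of `data`; both are choices made for this example rather than details confirmed by the source, and `parallel_load_sketch` is a hypothetical name.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed

import numpy as np


def parallel_load_sketch(data, cache_files, features, max_workers=None):
    """Fill data[..., i] from cache_files[i] using a thread pool."""
    if len(cache_files) != len(features):
        raise ValueError("Need exactly one cache file per feature")
    # max_workers=None lets the executor pick its default worker count.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(np.load, path): idx
                   for idx, path in enumerate(cache_files)}
        for future in as_completed(futures):
            data[..., futures[future]] = future.result()
    return data


# Small demonstration with two made-up features on a 2 x 3 x 4 domain.
features = ["U_100m", "V_100m"]
arrays = [np.full((2, 3, 4), 1.0), np.full((2, 3, 4), 2.0)]
with tempfile.TemporaryDirectory() as tmp:
    cache_files = [os.path.join(tmp, f"{name}.npy") for name in features]
    for path, arr in zip(cache_files, arrays):
        np.save(path, arr)
    data = np.zeros((2, 3, 4, len(features)))
    parallel_load_sketch(data, cache_files, features, max_workers=2)
```

Each worker writes to a separate slice of the last axis, so the threads never touch the same part of `data`.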
- property try_load
Check if we should try to load cache