sup3r.preprocessing.data_handling.mixin.InputMixIn

class InputMixIn(target, shape, raster_file=None, raster_index=None, temporal_slice=slice(None, None, 1), res_kwargs=None)[source]

Bases: CacheHandlingMixIn

MixIn class with properties and methods for handling the spatiotemporal data domain to extract from source data.

Provide properties of the spatiotemporal data domain

Parameters:
  • target (tuple) – (lat, lon) lower left corner of raster. Either target and shape or raster_file must be provided.

  • shape (tuple) – (rows, cols) grid size. Either target and shape or raster_file must be provided.

  • raster_file (str | None) – File containing the raster_index array for the corresponding target and shape. If specified, the raster_index is loaded from the file if it exists, or written to the file if it does not yet exist. If None and raster_index is not provided, the raster_index is calculated directly. One of target+shape, raster_file, or raster_index must be provided.

  • raster_index (list) – List of tuples or slices. Used as an alternative to computing the raster index from target+shape or loading the raster index from file.

  • temporal_slice (slice) – Slice specifying extent and step of temporal extraction. e.g. slice(start, stop, time_pruning). If equal to slice(None, None, 1) the full time dimension is selected.

  • res_kwargs (dict | None) – Dictionary of kwargs to pass to xarray.open_mfdataset.
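
As a rough illustration of how temporal_slice selects time steps, the sketch below applies a plain Python slice to a list that stands in for the raw time index. The list contents and the variable names are hypothetical; only the slice(start, stop, step) semantics come from the parameter description above.

```python
# Hypothetical stand-in for a raw time index: 24 hourly step labels.
raw_time_index = [f"2020-01-01T{hour:02d}:00" for hour in range(24)]

# Default value: slice(None, None, 1) keeps every time step.
full = raw_time_index[slice(None, None, 1)]

# Keep steps 6 up to (not including) 18, taking every third step.
temporal_slice = slice(6, 18, 3)
pruned = raw_time_index[temporal_slice]

print(len(full))
print(pruned)
```

Here the third slice argument plays the role of time_pruning: a value of 3 keeps one step out of every three.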

Methods

cap_worker_args(max_workers)

Cap all workers args by max_workers

check_cached_features(features[, ...])

Check which features have been cached and check flags to determine whether to load or extract these features again

get_cache_file_names(cache_pattern[, ...])

Get names of cache files from cache_pattern and feature names

get_capped_workers(max_workers_cap, max_workers)

Get max number of workers for a given job.

get_full_domain(file_paths)

Get full lat/lon grid for when target + shape are not specified

get_lat_lon(file_paths, raster_index[, ...])

Get lat/lon grid for requested target and shape

get_time_index(file_paths[, max_workers])

Get raw time index for source data

lats_are_descending(lat_lon)

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner).

parallel_load(data, cache_files, features[, ...])

Load feature data in parallel

time_index_conflict_check()

Check if the number of input files and the length of the time index are the same

Attributes

cache_files

Cache files for storing extracted data

cache_pattern

Get correct cache file pattern for formatting.

cached_features

List of features which have been requested but have been determined not to need extraction.

file_paths

Get file paths for input data

full_raw_lat_lon

Get the full lat/lon grid without doing any latitude inversion

grid_shape

Get shape of raster

input_file_info

Method to provide info about files in log output.

invert_lat

Whether to invert the latitude axis during data extraction.

lat_lon

Lat/lon grid for the data, as an array of shape (spatial_1, spatial_2, 2) with (lat, lon) ordering in the last dimension.

latitude

Flattened list of latitudes

longitude

Flattened list of longitudes

meta

Meta dataframe with coordinates.

need_full_domain

Check whether we need to get the full lat/lon grid to determine target and shape values

noncached_features

Get list of features needing extraction or derivation

raw_lat_lon

Lat/lon grid for the data, as an array of shape (spatial_1, spatial_2, 2) with (lat, lon) ordering in the last dimension.

raw_time_index

Time index for input data without time pruning.

raw_tsteps

Get number of time steps for all input files

single_ts_files

Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

source_type

Get data type for source files.

target

Get lower left corner of raster

temporal_slice

Get temporal range to extract from full dataset

ti_workers

Get max number of workers for computing time index

time_freq_hours

Get the time frequency in hours as a float

time_index

Time index for input data with time pruning.

time_index_file

Get time index file path

try_load

Check if we should try to load cache

property raw_tsteps

Get number of time steps for all input files

property single_ts_files

Check if there is a file for each time step, in which case we can send a subset of files to the data handler according to ti_pad_slice

static get_capped_workers(max_workers_cap, max_workers)[source]

Get max number of workers for a given job, capped to the global max workers if specified.

Parameters:
  • max_workers_cap (int | None) – Cap for job specific max_workers

  • max_workers (int | None) – Job specific max_workers

Returns:

max_workers (int | None) – job specific max_workers capped by max_workers_cap if provided
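
The capping rule described above can be sketched as follows. This is a minimal stand-in, not the sup3r implementation: the function name capped_workers is hypothetical, and the handling of None (no cap means the job value is returned unchanged, no job value means the cap is used) is an assumption.

```python
from typing import Optional


def capped_workers(max_workers_cap: Optional[int],
                   max_workers: Optional[int]) -> Optional[int]:
    """Illustrative capping rule (assumed behaviour, not sup3r's code)."""
    if max_workers_cap is None:
        # No cap given: keep the job-specific value as is.
        return max_workers
    if max_workers is None:
        # No job-specific value: fall back to the cap.
        return max_workers_cap
    # Both given: never exceed the cap.
    return min(max_workers_cap, max_workers)


print(capped_workers(2, 8))
print(capped_workers(None, 4))
```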

cap_worker_args(max_workers)[source]

Cap all workers args by max_workers

abstract classmethod get_full_domain(file_paths)[source]

Get full lat/lon grid for when target + shape are not specified

abstract classmethod get_lat_lon(file_paths, raster_index, invert_lat=False)[source]

Get lat/lon grid for requested target and shape

abstract get_time_index(file_paths, max_workers=None, **kwargs)[source]

Get raw time index for source data

property input_file_info

Method to provide info about files in log output. Because NetCDF files hold single time slices, printing out all the file paths would be a text dump without much information.

Returns:

str – message to append to log output that does not include a huge info dump of file paths

property temporal_slice

Get temporal range to extract from full dataset

property file_paths

Get file paths for input data

property ti_workers

Get max number of workers for computing time index

property need_full_domain

Check whether we need to get the full lat/lon grid to determine target and shape values

property full_raw_lat_lon

Get the full lat/lon grid without doing any latitude inversion

property raw_lat_lon

Lat/lon grid for the data, as an array of shape (spatial_1, spatial_2, 2) with (lat, lon) ordering in the last dimension. This returns the grid without any lat inversion.

Returns:

ndarray

property latitude

Flattened list of latitudes

property longitude

Flattened list of longitudes

property meta

Meta dataframe with coordinates.

property lat_lon

Lat/lon grid for the data, as an array of shape (spatial_1, spatial_2, 2) with (lat, lon) ordering in the last dimension. This ensures that the lower left corner of the domain is given by lat_lon[-1, 0].

Returns:

ndarray

property invert_lat

Whether to invert the latitude axis during data extraction. This is to enforce a descending latitude ordering so that the lower left corner of the grid is at idx=(-1, 0) instead of idx=(0, 0)

property target

Get lower left corner of raster

Returns:

_target (tuple) – (lat, lon) lower left corner of raster.

classmethod lats_are_descending(lat_lon)[source]

Check if latitudes are in descending order (i.e. the target coordinate is already at the bottom left corner)

Parameters:

lat_lon (np.ndarray) – Lat/Lon array with shape (n_lats, n_lons, 2)

Returns:

bool
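
To make the ordering convention concrete, the sketch below builds a small grid with nested lists in place of an np.ndarray and checks whether the latitudes descend along the first axis. The helper names make_grid and lats_descending are hypothetical, and comparing only the first and last rows is an assumption about how such a check could be done.

```python
def make_grid(lats, lons):
    """Build a (n_lats, n_lons, 2) nested list of (lat, lon) pairs."""
    return [[(lat, lon) for lon in lons] for lat in lats]


def lats_descending(lat_lon):
    """True if the first row sits at a higher latitude than the last row."""
    return lat_lon[0][0][0] > lat_lon[-1][0][0]


ascending = make_grid([39.0, 40.0, 41.0], [-105.0, -104.0])
# Inverting the latitude axis flips the row order.
descending = ascending[::-1]

print(lats_descending(ascending))
print(lats_descending(descending))
# With descending latitudes the lower left corner is at index (-1, 0).
print(descending[-1][0])
```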

property grid_shape

Get shape of raster

Returns:

_grid_shape (tuple) – (rows, cols) grid size.

property source_type

Get data type for source files, either nc or h5.

property raw_time_index

Time index for input data without time pruning. This is the base time index for the raw input data.

time_index_conflict_check()[source]

Check if the number of input files and the length of the time index are the same

property time_index

Time index for input data with time pruning. This is the raw time index with a cropped range and time step applied.

property time_freq_hours

Get the time frequency in hours as a float
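
One way to derive a time frequency in hours from a time index is sketched below. How the property actually computes the value is not stated here, so using the median step size is an assumption, and time_freq_hours_sketch is a hypothetical helper operating on plain datetime objects.

```python
from datetime import datetime, timedelta
from statistics import median


def time_freq_hours_sketch(time_index):
    """Median spacing between consecutive time steps, in hours."""
    steps = [
        (later - earlier).total_seconds() / 3600
        for earlier, later in zip(time_index, time_index[1:])
    ]
    return float(median(steps))


# Hypothetical 30-minute time index with six steps.
time_index = [datetime(2020, 1, 1) + timedelta(minutes=30 * i) for i in range(6)]
freq = time_freq_hours_sketch(time_index)
print(freq)
```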

property time_index_file

Get time index file path

property cache_files

Cache files for storing extracted data

property cache_pattern

Get correct cache file pattern for formatting.

Returns:

_cache_pattern (str) – The cache file pattern with formatting keys included.

property cached_features

List of features which have been requested but have been determined not to need extraction because they have already been cached.

static check_cached_features(features, cache_files=None, overwrite_cache=False, load_cached=False)

Check which features have been cached and check flags to determine whether to load or extract these features again

Parameters:
  • features (list) – list of features to extract

  • cache_files (list | None) – Path to files with saved feature data

  • overwrite_cache (bool) – Whether to overwrite cached files

  • load_cached (bool) – Whether to load data from cache files

Returns:

list – List of features to extract. Might not include features which have cache files.
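
The decision described above might look roughly like the sketch below. It is a simplified assumption, not the sup3r logic: a feature is skipped only when its cache file already exists and overwrite_cache is False, and the load_cached flag is left out entirely. The function name features_to_extract and the file names are hypothetical.

```python
import os
import tempfile


def features_to_extract(features, cache_files=None, overwrite_cache=False):
    """Return the features that still need extraction (assumed rule)."""
    if cache_files is None or overwrite_cache:
        # Nothing cached, or the cache will be rewritten: extract everything.
        return list(features)
    # Skip any feature whose cache file is already on disk.
    return [f for f, path in zip(features, cache_files)
            if not os.path.exists(path)]


with tempfile.TemporaryDirectory() as tmp:
    features = ["U_100m", "V_100m"]
    cache_files = [os.path.join(tmp, f"cache_{f}.pkl") for f in features]
    # Pretend the first feature was cached by an earlier run.
    open(cache_files[0], "wb").close()

    remaining = features_to_extract(features, cache_files)
    everything = features_to_extract(features, cache_files, overwrite_cache=True)

print(remaining)
print(everything)
```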

get_cache_file_names(cache_pattern, grid_shape=None, time_index=None, target=None, features=None)

Get names of cache files from cache_pattern and feature names

Parameters:
  • cache_pattern (str) – Pattern to use for cache file names

  • grid_shape (tuple) – Shape of grid to use for cache file naming

  • time_index (list | pd.DatetimeIndex) – Time index to use for cache file naming

  • target (tuple) – Target to use for cache file naming

  • features (list) – List of features to use for cache file naming

Returns:

list – List of cache file names
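
A sketch of how a pattern could be expanded into one file name per feature is given below. The placeholder keys {feature}, {shape} and {target}, their string formats, and the helper name cache_file_names are all assumptions made for illustration; the formatting keys actually supported are not listed in this reference.

```python
def cache_file_names(cache_pattern, features, grid_shape=None, target=None):
    """Fill a cache pattern once per feature (hypothetical key names)."""
    names = []
    for feature in features:
        name = cache_pattern.replace("{feature}", feature)
        if grid_shape is not None:
            name = name.replace("{shape}", f"{grid_shape[0]}x{grid_shape[1]}")
        if target is not None:
            name = name.replace("{target}", f"{target[0]}_{target[1]}")
        names.append(name)
    return names


names = cache_file_names(
    "./cache/{feature}_{shape}_{target}.pkl",
    ["U_100m", "V_100m"],
    grid_shape=(20, 20),
    target=(39.0, -105.0),
)
print(names)
```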

property noncached_features

Get list of features needing extraction or derivation

parallel_load(data, cache_files, features, max_workers=None)

Load feature data in parallel

Parameters:
  • data (ndarray) – Array to fill with cached data

  • cache_files (list) – List of cache files for each feature

  • features (list) – List of requested features

  • max_workers (int | None) – Max number of workers to use for parallel data loading. If None the max number of available workers will be used.
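
The general pattern of loading one cache file per feature in parallel can be sketched with the standard library as shown below. This is not the sup3r implementation: the use of a thread pool, pickle files, and a plain list in place of the ndarray are assumptions, and parallel_load_sketch and load_one are hypothetical names.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_one(path):
    """Read one cached feature from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def parallel_load_sketch(data, cache_files, features, max_workers=None):
    """Fill data[i] with the contents of cache_files[i], one task per feature."""
    assert len(cache_files) == len(features)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_one, path): idx
                   for idx, path in enumerate(cache_files)}
        for future in as_completed(futures):
            # Results may finish in any order; the stored index keeps them aligned.
            data[futures[future]] = future.result()
    return data


with tempfile.TemporaryDirectory() as tmp:
    features = ["U_100m", "V_100m"]
    cache_files = []
    for i, feature in enumerate(features):
        path = os.path.join(tmp, f"{feature}.pkl")
        with open(path, "wb") as fh:
            pickle.dump([i, i + 1], fh)
        cache_files.append(path)

    data = [None] * len(features)
    parallel_load_sketch(data, cache_files, features, max_workers=2)

print(data)
```

Passing max_workers=None lets the executor choose its own default, which mirrors the documented behaviour of using the maximum number of available workers.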

property try_load

Check if we should try to load cache