sup3r.preprocessing.cachers.base.Cacher

class Cacher(data: Sup3rX | Sup3rDataset, cache_kwargs: Dict | None = None)

Bases: Container

Base cacher object. Writes the given data to H5 or NetCDF files. By default, each feature is written to a separate file. To write multiple features to the same file, call write_netcdf() or write_h5() directly.

Parameters:
  • data (Union[Sup3rX, Sup3rDataset]) – Data to write to file

  • cache_kwargs (dict) – Dictionary with kwargs for caching wrangled data. This should at minimum include a 'cache_pattern' key-value pair. The pattern must have a {feature} format key and either an h5 or nc file extension, depending on the desired output type.

    Can also include a max_workers key and a chunks key. max_workers is an integer specifying the number of threads to use for writing chunks to output files, and chunks is a dictionary of dictionaries for each feature (or a single dictionary to use for all features), e.g.:

        {'cache_pattern': …,
         'chunks': {
             'u_10m': {'time': 20, 'south_north': 100, 'west_east': 100}
         }}
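The constraints on cache_pattern described above can be checked before any data is written. The following is a minimal sketch under stated assumptions, not sup3r's own validation: the helper name expand_cache_pattern and the example path are hypothetical, and only the rules stated above (a {feature} format key and an .h5 or .nc extension) are enforced.

```python
import os


def expand_cache_pattern(cache_pattern, features):
    """Return {feature: file path} for a cache pattern (hypothetical helper)."""
    if "{feature}" not in cache_pattern:
        raise ValueError("cache_pattern must contain a {feature} format key")
    ext = os.path.splitext(cache_pattern)[1]
    if ext not in (".h5", ".nc"):
        raise ValueError("cache_pattern must end in .h5 or .nc")
    # One output file per feature, matching the default described above.
    return {f: cache_pattern.format(feature=f) for f in features}


cache_kwargs = {
    "cache_pattern": "./cache/{feature}.nc",  # illustrative path
    "max_workers": 2,
    "chunks": {"u_10m": {"time": 20, "south_north": 100, "west_east": 100}},
}
files = expand_cache_pattern(cache_kwargs["cache_pattern"], ["u_10m", "v_10m"])
```

Here files maps each feature name to its own output path, which mirrors the one-file-per-feature default.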

Note

This is only for saving cached data. To reload the cached files, load them with a Loader object. DataHandler objects can cache and reload from cache automatically.

Methods

add_coord_meta(out_file, data[, meta])

Add flattened coordinate meta to out_file.

cache_data(cache_pattern[, chunks, ...])

Cache data to file with file type based on user provided cache_pattern.

get_chunk_slices(chunks, shape)

Get slices used to write xarray data to netcdf file in chunks.

get_chunksizes(dset, data, chunks)

Get chunksizes after rechunking (could be undetermined beforehand if chunks == 'auto') and return rechunked data.

parse_chunks(feature, chunks, dims)

Parse chunks input to Cacher.

post_init_log([args_dict])

Log additional arguments after initialization.

wrap(data)

Return a Sup3rDataset object or tuple of such.

write_chunk(out_file, dset, chunk_slice, ...)

Add chunk to netcdf file.

write_h5(out_file, data[, features, chunks, ...])

Cache data to h5 file using user provided chunks value.

write_netcdf(out_file, data[, features, ...])

Cache data to a netcdf file.

write_netcdf_chunks(out_file, feature, data)

Write netcdf chunks with delayed dask tasks.

Attributes

data

Return underlying data.

shape

Get shape of underlying data.

cache_data(cache_pattern, chunks=None, max_workers=None, mode='w', attrs=None, verbose=False)

Cache data to file with file type based on user provided cache_pattern.

Parameters:
  • cache_pattern (str) – Cache file pattern. Must have a {feature} format key. The extension (.h5 or .nc) specifies which format to use for caching.

  • chunks (dict | None) – Chunk sizes for coordinate dimensions. e.g. {'u_10m': {'time': 10, 'south_north': 100, 'west_east': 100}}

  • max_workers (int | None) – Number of workers to use for parallel writing of chunks

  • mode (str) – Write mode for out_file. Defaults to 'w' (write).

  • attrs (dict | None) – Optional attributes to write to file. Dataset-specific attributes can be given by adding a dictionary with the dataset name as a key, e.g. {**global_attrs, dset: {…}}

  • verbose (bool) – Whether to log progress for each chunk written to output files.
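The attrs layout described above mixes global attributes and per-dataset dictionaries in one mapping. The following sketch shows one way such a mapping could be separated. It is illustrative only: split_attrs is a hypothetical helper, the attribute values are made up, and the rule used here (a key is dataset-specific when it matches a dataset name and its value is a dict) is an assumption, not a statement about how sup3r handles attrs internally.

```python
def split_attrs(attrs, dset_names):
    """Split attrs into (global_attrs, {dset: attrs}) (hypothetical helper)."""
    attrs = attrs or {}
    # Assumption: dataset-specific entries are dicts keyed by a dataset name.
    dset_attrs = {
        k: v for k, v in attrs.items()
        if k in dset_names and isinstance(v, dict)
    }
    global_attrs = {k: v for k, v in attrs.items() if k not in dset_attrs}
    return global_attrs, dset_attrs


global_attrs = {"source": "example", "version": "0.0"}  # illustrative values
attrs = {**global_attrs, "u_10m": {"units": "m/s"}}
g, d = split_attrs(attrs, ["u_10m", "v_10m"])
```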

static parse_chunks(feature, chunks, dims)

Parse chunks input to Cacher. Needs to be a dictionary of dimensions and chunk values but parsed to a tuple for H5 caching.
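As a rough illustration of the dictionary-to-tuple conversion mentioned above, the sketch below resolves the chunk sizes for one feature and orders them by dims. It is not sup3r's implementation: parse_chunks_sketch is a hypothetical name, returning both forms at once is a choice made for this example, and filling missing dimensions with None is an assumption.

```python
def parse_chunks_sketch(feature, chunks, dims):
    """Resolve chunk sizes for one feature (illustrative, not sup3r's code).

    Returns (per_dim_dict, tuple_in_dims_order). Dimensions missing from the
    chunks input get None in the tuple, which is an assumption of this sketch.
    """
    if chunks is None or chunks == "auto":
        return chunks, chunks
    # Feature-specific entries take precedence over a shared dictionary.
    fchunks = chunks.get(feature, chunks)
    if not isinstance(fchunks, dict):
        return fchunks, fchunks
    as_dict = {d: fchunks[d] for d in dims if d in fchunks}
    as_tuple = tuple(fchunks.get(d) for d in dims)
    return as_dict, as_tuple


dims = ("time", "south_north", "west_east")
chunks = {"u_10m": {"time": 10, "south_north": 100, "west_east": 100}}
as_dict, as_tuple = parse_chunks_sketch("u_10m", chunks, dims)
```

The dictionary form suits xarray-style rechunking by dimension name, while the tuple form follows the positional order of dims, which is what an H5 dataset's chunk shape expects.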

classmethod get_chunksizes(dset, data, chunks)

Get chunksizes after rechunking (could be undetermined beforehand if chunks == 'auto') and return rechunked data.

Parameters:
  • dset (str) – Name of feature to get chunksizes for.

  • data (Sup3rX | xr.Dataset) – Sup3rX or xr.Dataset containing data to be cached.

  • chunks (dict | None | 'auto') – Dictionary of chunksizes, either to use for all features or, if the dictionary includes feature keys, feature-specific chunksizes. Can also be None or 'auto'.

classmethod add_coord_meta(out_file, data, meta=None)

Add flattened coordinate meta to out_file. This is used for h5 caching.

Parameters:
  • out_file (str) – Name of output file.

  • data (Sup3rX | xr.Dataset) – Data being written to the given out_file.

  • meta (pd.DataFrame | None) – Optional additional meta information to be written to the given out_file. If this is None, only coordinate info is included in the meta written to out_file.

classmethod write_h5(out_file, data, features='all', chunks=None, max_workers=None, mode='w', attrs=None, verbose=False)

Cache data to h5 file using user provided chunks value.

Parameters:
  • out_file (str) – Name of file to write. Must have a .h5 extension.

  • data (Sup3rDataset | Sup3rX | xr.Dataset) – Data to write to file. Comes from self.data, so an xr.Dataset-like object with .dims and .coords

  • features (str | list) – Name of feature(s) to write to file.

  • chunks (dict | None) – Chunk sizes for coordinate dimensions. e.g. {'u_10m': {'time': 10, 'south_north': 100, 'west_east': 100}}

  • max_workers (int | None) – Number of workers to use for parallel writing of chunks

  • mode (str) – Write mode for out_file. Defaults to 'w' (write).

  • attrs (dict | None) – Optional attributes to write to file. Dataset-specific attributes can be given by adding a dictionary with the dataset name as a key, e.g. {**global_attrs, dset: {…}}. Can also include a global meta dataframe that will then be added to the coordinate meta.

  • verbose (bool) – Unused; present only to match the write_netcdf signature.

static get_chunk_slices(chunks, shape)

Get slices used to write xarray data to netcdf file in chunks.
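To make the idea of chunk slices concrete, the sketch below tiles an array shape with fixed-size chunks and returns one tuple of slices per chunk. It is a minimal illustration, not sup3r's code: chunk_slices_sketch is a hypothetical name, and it assumes chunks is a tuple of positive integers aligned with shape.

```python
from itertools import product


def chunk_slices_sketch(chunks, shape):
    """Return a list of slice tuples covering an array of the given shape."""
    per_dim = []
    for size, length in zip(chunks, shape):
        starts = range(0, length, size)
        # The final slice along a dimension is clipped to the array edge.
        per_dim.append([slice(s, min(s + size, length)) for s in starts])
    # Every combination of per-dimension slices is one chunk.
    return list(product(*per_dim))


slices = chunk_slices_sketch(chunks=(10, 100), shape=(25, 150))
```

With these inputs the first dimension splits into three slices and the second into two, giving six chunks in total.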

static write_chunk(out_file, dset, chunk_slice, chunk_data, msg=None)

Add chunk to netcdf file.

classmethod write_netcdf_chunks(out_file, feature, data, chunks=None, max_workers=None, verbose=False)

Write netcdf chunks with delayed dask tasks.
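The general pattern of writing chunks in parallel can be sketched with the standard library alone. The example below deliberately swaps dask delayed tasks for a concurrent.futures thread pool and uses a plain list as a stand-in for the output file, so it illustrates only the structure (one task per chunk slice, bounded by max_workers). The function name write_chunks_sketch and the use of a lock to serialize writes are assumptions of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def write_chunks_sketch(out, data, row_slices, max_workers=None):
    """Copy data into out one row-chunk at a time using a thread pool."""
    lock = Lock()  # assumption of this sketch: writes are serialized

    def write_one(rows):
        chunk = data[rows]  # stand-in for computing a lazy chunk
        with lock:
            out[rows] = chunk  # stand-in for writing to the output file
        return rows

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order the slices were submitted.
        return list(pool.map(write_one, row_slices))


data = [[r * 10 + c for c in range(4)] for r in range(6)]
out = [None] * len(data)
row_slices = [slice(0, 2), slice(2, 4), slice(4, 6)]
done = write_chunks_sketch(out, data, row_slices, max_workers=2)
```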

classmethod write_netcdf(out_file, data, features='all', chunks=None, max_workers=None, mode='w', attrs=None, verbose=False)

Cache data to a netcdf file.

Parameters:
  • out_file (str) – Name of file to write. Must have a .nc extension.

  • data (Sup3rDataset) – Data to write to file. Comes from self.data, so a Sup3rDataset with coords attributes

  • features (str | list) – Names of feature(s) to write to file.

  • chunks (dict | None) – Chunk sizes for coordinate dimensions, e.g. {'south_north': 100, 'west_east': 100, 'time': 10}. Can also include dataset-specific values, e.g. {'windspeed': {'south_north': 100, 'west_east': 100, 'time': 10}}

  • max_workers (int | None) – Number of workers to use for parallel writing of chunks

  • mode (str) – Write mode for out_file. Defaults to 'w' (write).

  • attrs (dict | None) – Optional attributes to write to file. Dataset-specific attributes can be given by adding a dictionary with the dataset name as a key, e.g. {**global_attrs, dset: {…}}

  • verbose (bool) – Whether to log output after each chunk is written.

property data

Return underlying data.

Returns:

Sup3rDataset

See also

wrap()

post_init_log(args_dict=None)

Log additional arguments after initialization.

property shape

Get shape of underlying data.

wrap(data)

Return a Sup3rDataset object or tuple of such. This is a tuple when the .data attribute belongs to a Collection object like BatchHandler. Otherwise this is a Sup3rDataset object, which is either a wrapped 2-tuple or 1-tuple (e.g. len(data) == 2 or len(data) == 1). This is a 2-tuple when .data belongs to a dual container object like DualSampler and a 1-tuple otherwise.