sup3r.preprocessing.cachers.base.Cacher
- class Cacher(data: Sup3rX | Sup3rDataset, cache_kwargs: Dict | None = None)
Bases: Container
Base cacher object. Simply writes given data to H5 or NETCDF files. By default every feature will be written to a separate file. To write multiple features to the same file call write_netcdf() or write_h5() directly.
- Parameters:
data (Union[Sup3rX, Sup3rDataset]) – Data to write to file.
cache_kwargs (dict) – Dictionary with kwargs for caching wrangled data. This should at minimum include a 'cache_pattern' key, value. This pattern must have a {feature} format key and either a .h5 or .nc file extension, based on desired output type. Can also include a max_workers key and a chunks key. max_workers is an integer specifying the number of threads to use for writing chunks to output files and chunks is a dictionary of dictionaries for each feature (or a single dictionary to use for all features). e.g.

    {'cache_pattern': …,
     'chunks': {
         'u_10m': {
             'time': 20, 'south_north': 100, 'west_east': 100
         }
     }
    }
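
For example, a minimal caching setup might look like the sketch below; the input file, features, and cache pattern are hypothetical, assuming Cacher and DataHandler are importable from sup3r.preprocessing:

    from sup3r.preprocessing import Cacher, DataHandler

    # Hypothetical inputs: the source file, features, and cache pattern
    # are illustrative assumptions only.
    handler = DataHandler('./wind_2020.nc', features=['u_10m', 'v_10m'])
    cacher = Cacher(
        handler.data,
        cache_kwargs={
            'cache_pattern': './cache_{feature}.h5',  # {feature} key required
            'max_workers': 2,
            'chunks': {'time': 20, 'south_north': 100, 'west_east': 100},
        },
    )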
Note
This is only for saving cached data. If you want to reload the cached files, load them with a Loader object. DataHandler objects can cache and reload from cache automatically.
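
A sketch of that reload step, assuming Loader is importable from sup3r.preprocessing and using the hypothetical cache pattern from above:

    from sup3r.preprocessing import Loader

    # Reload previously cached files; the glob pattern is illustrative.
    loaded = Loader('./cache_*.h5')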
Methods

add_coord_meta(out_file, data[, meta])
    Add flattened coordinate meta to out_file.
cache_data(cache_pattern[, chunks, ...])
    Cache data to file with file type based on user provided cache_pattern.
get_chunk_slices(chunks, shape)
    Get slices used to write xarray data to netcdf file in chunks.
get_chunksizes(dset, data, chunks)
    Get chunksizes after rechunking (could be undetermined beforehand if chunks == 'auto') and return rechunked data.
parse_chunks(feature, chunks, dims)
    Parse chunks input to Cacher.
post_init_log([args_dict])
    Log additional arguments after initialization.
wrap(data)
    Return a Sup3rDataset object or tuple of such.
write_chunk(out_file, dset, chunk_slice, ...)
    Add chunk to netcdf file.
write_h5(out_file, data[, features, chunks, ...])
    Cache data to h5 file using user provided chunks value.
write_netcdf(out_file, data[, features, ...])
    Cache data to a netcdf file.
write_netcdf_chunks(out_file, feature, data)
    Write netcdf chunks with delayed dask tasks.
Attributes

data
    Return underlying data.
shape
    Get shape of underlying data.
- cache_data(cache_pattern, chunks=None, max_workers=None, mode='w', attrs=None, verbose=False)
Cache data to file with file type based on user provided cache_pattern.
- Parameters:
cache_pattern (str) – Cache file pattern. Must have a {feature} format key. The extension (.h5 or .nc) specifies which format to use for caching.
chunks (dict | None) – Chunk sizes for coordinate dimensions. e.g. {'u_10m': {'time': 10, 'south_north': 100, 'west_east': 100}}
max_workers (int | None) – Number of workers to use for parallel writing of chunks.
mode (str) – Write mode for out_file. Defaults to 'w' (write).
attrs (dict | None) – Optional attributes to write to file. Can specify dataset specific attributes by adding a dictionary with the dataset name as a key. e.g. {**global_attrs, dset: {…}}
verbose (bool) – Whether to log progress for each chunk written to output files.
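
As a sketch, a direct call with the hypothetical cacher from above; the pattern and chunk values are illustrative:

    # Illustrative only: write each feature to its own netcdf file with
    # explicit per-dimension chunks and two writer threads.
    cacher.cache_data(
        cache_pattern='./cache_{feature}.nc',
        chunks={'time': 10, 'south_north': 100, 'west_east': 100},
        max_workers=2,
        verbose=True,
    )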
- static parse_chunks(feature, chunks, dims)
Parse chunks input to Cacher. The input needs to be a dictionary of dimensions and chunk values, which is parsed to a tuple for H5 caching.
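
For instance, the dict-to-tuple conversion presumably behaves as in this sketch; the exact return value is an assumption based on the docstring:

    # Assumed behavior: the per-feature dimension dict is ordered by
    # `dims` and returned as an h5py-style chunk tuple.
    chunks = {'u_10m': {'time': 20, 'south_north': 100, 'west_east': 100}}
    dims = ('time', 'south_north', 'west_east')
    # Cacher.parse_chunks('u_10m', chunks, dims) -> (20, 100, 100)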
- classmethod get_chunksizes(dset, data, chunks)
Get chunksizes after rechunking (could be undetermined beforehand if chunks == 'auto') and return rechunked data.
- Parameters:
dset (str) – Name of feature to get chunksizes for.
data (Sup3rX | xr.Dataset) – Sup3rX or xr.Dataset containing data to be cached.
chunks (dict | None | 'auto') – Dictionary of chunksizes either to use for all features or, if the dictionary includes feature keys, feature specific chunksizes. Can also be None or 'auto'.
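
The "undetermined beforehand" caveat comes from dask: with chunks == 'auto' the sizes are only known after rechunking, along the lines of this standalone sketch (not the library internals):

    import dask.array as da
    import numpy as np

    arr = da.from_array(np.zeros((240, 100, 100)))
    rechunked = arr.rechunk('auto')  # dask picks the chunk sizes here
    chunksizes = tuple(c[0] for c in rechunked.chunks)  # read them back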
- classmethod add_coord_meta(out_file, data, meta=None)
Add flattened coordinate meta to out_file. This is used for h5 caching.
- Parameters:
out_file (str) – Name of output file.
data (Sup3rX | xr.Dataset) – Data being written to the given out_file.
meta (pd.DataFrame | None) – Optional additional meta information to be written to the given out_file. If this is None then only coordinate info will be included in the meta written to the out_file.
- classmethod write_h5(out_file, data, features='all', chunks=None, max_workers=None, mode='w', attrs=None, verbose=False)
Cache data to h5 file using user provided chunks value.
- Parameters:
out_file (str) – Name of file to write. Must have a .h5 extension.
data (Sup3rDataset | Sup3rX | xr.Dataset) – Data to write to file. Comes from self.data, so an xr.Dataset-like object with .dims and .coords.
features (str | list) – Name of feature(s) to write to file.
chunks (dict | None) – Chunk sizes for coordinate dimensions. e.g. {'u_10m': {'time': 10, 'south_north': 100, 'west_east': 100}}
max_workers (int | None) – Number of workers to use for parallel writing of chunks.
mode (str) – Write mode for out_file. Defaults to 'w' (write).
attrs (dict | None) – Optional attributes to write to file. Can specify dataset specific attributes by adding a dictionary with the dataset name as a key. e.g. {**global_attrs, dset: {…}}. Can also include a global meta dataframe that will then be added to the coordinate meta.
verbose (bool) – Dummy arg to match write_netcdf signature.
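
A sketch of a direct call; the file name, features, and chunk values are illustrative assumptions:

    # Illustrative only: write two features to a single .h5 file.
    Cacher.write_h5(
        out_file='./cache.h5',
        data=handler.data,
        features=['u_10m', 'v_10m'],
        chunks={'time': 20, 'south_north': 100, 'west_east': 100},
        max_workers=2,
    )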
- static get_chunk_slices(chunks, shape)
Get slices used to write xarray data to netcdf file in chunks.
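
Conceptually this is a cartesian product of per-dimension slices; a self-contained sketch of that logic (not necessarily the exact implementation):

    import itertools

    def chunk_slices(chunks, shape):
        # Split each dimension of `shape` into steps of the matching
        # `chunks` entry, then take one slice per dimension.
        per_dim = [
            [slice(i, min(i + step, size)) for i in range(0, size, step)]
            for size, step in zip(shape, chunks)
        ]
        return list(itertools.product(*per_dim))

    # e.g. chunk_slices((20, 150), (30, 200)) -> 4 slice tuples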
- static write_chunk(out_file, dset, chunk_slice, chunk_data, msg=None)
Add chunk to netcdf file.
- classmethod write_netcdf_chunks(out_file, feature, data, chunks=None, max_workers=None, verbose=False)
Write netcdf chunks with delayed dask tasks.
- classmethod write_netcdf(out_file, data, features='all', chunks=None, max_workers=None, mode='w', attrs=None, verbose=False)
Cache data to a netcdf file.
- Parameters:
out_file (str) – Name of file to write. Must have a .nc extension.
data (Sup3rDataset) – Data to write to file. Comes from self.data, so a Sup3rDataset with coords attributes.
features (str | list) – Names of feature(s) to write to file.
chunks (dict | None) – Chunk sizes for coordinate dimensions. e.g. {'south_north': 100, 'west_east': 100, 'time': 10}. Can also include dataset specific values. e.g. {'windspeed': {'south_north': 100, 'west_east': 100, 'time': 10}}
max_workers (int | None) – Number of workers to use for parallel writing of chunks.
mode (str) – Write mode for out_file. Defaults to 'w' (write).
attrs (dict | None) – Optional attributes to write to file. Can specify dataset specific attributes by adding a dictionary with the dataset name as a key. e.g. {**global_attrs, dset: {…}}
verbose (bool) – Whether to log output after each chunk is written.
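
And a netcdf counterpart as a sketch, again with illustrative values:

    # Illustrative only: cache all features to one .nc file with
    # dataset specific chunking for windspeed.
    Cacher.write_netcdf(
        out_file='./cache.nc',
        data=handler.data,
        chunks={'windspeed': {'south_north': 100, 'west_east': 100, 'time': 10}},
        verbose=True,
    )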
- property data
Return underlying data.
- post_init_log(args_dict=None)
Log additional arguments after initialization.
- property shape
Get shape of underlying data.
- wrap(data)
Return a Sup3rDataset object or tuple of such. This is a tuple when the .data attribute belongs to a Collection object like BatchHandler. Otherwise this is a Sup3rDataset object, which is either a wrapped 2-tuple or 1-tuple (e.g. len(data) == 2 or len(data) == 1). This is a 2-tuple when .data belongs to a dual container object like DualSampler and a 1-tuple otherwise.