sup3r.preprocessing.cachers.base.Cacher
- class Cacher(data: Sup3rX | Sup3rDataset, cache_kwargs: Dict | None = None)
Bases: Container
Base cacher object. Simply writes given data to H5 or NETCDF files. By default every feature will be written to a separate file. To write multiple features to the same file call write_netcdf() or write_h5() directly.
- Parameters:
data (Union[Sup3rX, Sup3rDataset]) – Data to write to file
cache_kwargs (dict) – Dictionary with kwargs for caching wrangled data. This should at minimum include a 'cache_pattern' key and value. This pattern must have a {feature} format key and either an .h5 or .nc file extension, based on the desired output type. Can also include a max_workers key and a chunks key. max_workers is an integer specifying the number of threads to use for writing chunks to output files, and chunks is a dictionary of dictionaries for each feature (or a single dictionary to use for all features). e.g.

    {'cache_pattern': …,
     'chunks': {
         'u_10m': {
             'time': 20,
             'south_north': 100,
             'west_east': 100
         }
     }
    }
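For illustration, a minimal usage sketch. This is an assumption-laden example, not verbatim library usage: the DataHandler import path, its argument names, and the input file are hypothetical, and caching is assumed to be triggered on initialization, as the class description above suggests.

```python
# Minimal sketch, assuming a Sup3rX/Sup3rDataset source such as a DataHandler.
from sup3r.preprocessing import DataHandler  # assumed import path
from sup3r.preprocessing.cachers.base import Cacher

# Hypothetical input file and feature list
handler = DataHandler('era5_2020.nc', features=['u_10m'])

cacher = Cacher(
    data=handler.data,
    cache_kwargs={
        'cache_pattern': './cache_{feature}.h5',  # {feature} key is required
        'max_workers': 2,
        'chunks': {'u_10m': {'time': 20, 'south_north': 100,
                             'west_east': 100}},
    },
)
```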
Note
This is only for saving cached data. If you want to reload the cached files, load them with a Loader object. DataHandler objects can cache and reload from cache automatically.
Methods
cache_data(cache_pattern[, chunks, ...]) – Cache data to file with file type based on user provided cache_pattern.
get_chunksizes(dset, data, chunks) – Get chunksizes after rechunking (could be undetermined beforehand if chunks == 'auto') and return rechunked data.
parse_chunks(feature, chunks, dims) – Parse chunks input to Cacher.
post_init_log([args_dict]) – Log additional arguments after initialization.
wrap(data) – Return a Sup3rDataset object or tuple of such.
write_h5(out_file, data[, features, chunks, ...]) – Cache data to h5 file using user provided chunks value.
write_netcdf(out_file, data[, features, ...]) – Cache data to a netcdf file using xarray.
Attributes
data – Return underlying data.
shape – Get shape of underlying data.
- cache_data(cache_pattern, chunks=None, max_workers=None, mode='w', attrs=None, keep_dim_order=False, overwrite=False)
Cache data to file with file type based on user provided cache_pattern.
- Parameters:
cache_pattern (str) – Cache file pattern. Must have a {feature} format key. The extension (.h5 or .nc) specifies which format to use for caching.
chunks (dict | None) – Chunk sizes for coordinate dimensions. e.g. {'u_10m': {'time': 10, 'south_north': 100, 'west_east': 100}}
max_workers (int | None) – Number of workers to use for parallel writing of chunks
mode (str) – Write mode for out_file. Defaults to 'w'.
attrs (dict | None) – Optional attributes to write to file. Can specify dataset specific attributes by adding a dictionary with the dataset name as a key. e.g. {**global_attrs, dset: {…}}
keep_dim_order (bool) – Whether to keep the original dimension order of the data. If False then the data will be transposed to have the time dimension first.
overwrite (bool) – Whether to overwrite existing cache files.
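A hedged call sketch following the documented signature; the cacher instance, paths, and chunk values are assumptions carried over from the example above.

```python
# Sketch only: the .nc extension in the pattern selects netCDF output, and
# each feature is written to its own file per the class description.
cacher.cache_data(
    cache_pattern='./cache_{feature}.nc',
    chunks={'u_10m': {'time': 10, 'south_north': 100, 'west_east': 100}},
    max_workers=2,
    mode='w',
    overwrite=True,
)
```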
- static parse_chunks(feature, chunks, dims)
Parse chunks input to Cacher. Chunks need to be provided as a dictionary of dimensions and chunk values, but are parsed to a tuple for H5 caching.
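To illustrate what "parsed to a tuple" means, here is a simplified stand-in. This is not the sup3r implementation; the handling of flat dictionaries and non-dict inputs is an assumption based on the docstring.

```python
def parse_chunks_sketch(feature, chunks, dims):
    """Simplified stand-in for Cacher.parse_chunks, for illustration only.

    Selects the feature-specific chunk mapping if present and orders the
    chunk values to match the data's dimensions, since h5py expects chunks
    as a tuple.
    """
    fchunks = chunks.get(feature, chunks) if isinstance(chunks, dict) else chunks
    if isinstance(fchunks, dict):
        return tuple(fchunks.get(dim) for dim in dims)
    return fchunks

# e.g. parse_chunks_sketch(
#     'u_10m',
#     {'u_10m': {'time': 20, 'south_north': 100, 'west_east': 100}},
#     dims=('south_north', 'west_east', 'time'),
# ) -> (100, 100, 20)
```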
- classmethod get_chunksizes(dset, data, chunks)
Get chunksizes after rechunking (could be undetermined beforehand if chunks == 'auto') and return rechunked data.
- Parameters:
dset (str) – Name of feature to get chunksizes for.
data (Sup3rX | xr.Dataset) – Sup3rX or xr.Dataset containing data to be cached.
chunks (dict | None | 'auto') – Dictionary of chunksizes either to use for all features or, if the dictionary includes feature keys, feature specific chunksizes. Can also be None or 'auto'.
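The 'auto' case is why rechunking must happen before the chunksizes are read: the final sizes are only known once dask resolves them. A generic xarray/dask sketch of that pattern (plain xarray, not sup3r internals):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'u_10m': (('south_north', 'west_east', 'time'),
               np.zeros((200, 200, 50), dtype='float32'))}
)

# With chunks='auto' the chunk sizes are undetermined until dask picks them,
# so rechunk first and read the resolved sizes back from the dask array.
rechunked = ds['u_10m'].chunk('auto')
chunksizes = rechunked.data.chunksize  # resolved sizes, e.g. (200, 200, 50)
```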
- classmethod write_h5(out_file, data, features='all', chunks=None, max_workers=None, mode='w', attrs=None, keep_dim_order=False)
Cache data to h5 file using user provided chunks value.
- Parameters:
out_file (str) – Name of file to write. Must have a .h5 extension.
data (Sup3rDataset | Sup3rX | xr.Dataset) – Data to write to file. Comes from self.data, so an xr.Dataset-like object with .dims and .coords
features (str | list) – Name of feature(s) to write to file.
chunks (dict | None) – Chunk sizes for coordinate dimensions. e.g. {'u_10m': {'time': 10, 'south_north': 100, 'west_east': 100}}
max_workers (int | None) – Number of workers to use for parallel writing of chunks
mode (str) – Write mode for out_file. Defaults to 'w'.
attrs (dict | None) – Optional attributes to write to file. Can specify dataset specific attributes by adding a dictionary with the dataset name as a key. e.g. {**global_attrs, dset: {…}}. Can also include a global meta dataframe that will then be added to the coordinate meta.
keep_dim_order (bool) – Whether to keep the original dimension order of the data. If False then the data will be transposed to have the time dimension first.
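A hedged sketch of a direct write_h5 call against the documented classmethod signature; the file name, feature, and handler.data object are carried-over assumptions from the earlier example.

```python
# Sketch only: write a single feature to one .h5 file with explicit chunks.
Cacher.write_h5(
    out_file='./cache_u_10m.h5',
    data=handler.data,
    features=['u_10m'],
    chunks={'u_10m': {'time': 10, 'south_north': 100, 'west_east': 100}},
    max_workers=2,
    mode='w',
)
```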
- classmethod write_netcdf(out_file, data, features='all', chunks=None, max_workers=1, mode='w', attrs=None, keep_dim_order=False)
Cache data to a netcdf file using xarray.
- Parameters:
out_file (str) – Name of file to write. Must have a .nc extension.
data (Sup3rDataset) – Data to write to file. Comes from self.data, so a Sup3rDataset with coords attributes
features (str | list) – Names of feature(s) to write to file.
chunks (dict | None) – Chunk sizes for coordinate dimensions. e.g. {'south_north': 100, 'west_east': 100, 'time': 10}. Can also include dataset specific values. e.g. {'windspeed': {'south_north': 100, 'west_east': 100, 'time': 10}}
max_workers (int | None) – Number of workers to use for parallel writing of chunks
mode (str) – Write mode for out_file. Defaults to 'w'.
attrs (dict | None) – Optional attributes to write to file. Can specify dataset specific attributes by adding a dictionary with the dataset name as a key. e.g. {**global_attrs, dset: {…}}
keep_dim_order (bool) – Dummy arg to match write_h5 signature
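And the netCDF counterpart, again only a sketch against the documented signature; the feature name, path, and handler.data object are placeholders, not confirmed usage.

```python
# Sketch only: a single chunk mapping applied to all written features.
Cacher.write_netcdf(
    out_file='./cache_windspeed.nc',
    data=handler.data,
    features=['windspeed'],
    chunks={'south_north': 100, 'west_east': 100, 'time': 10},
    max_workers=1,
)
```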
- property data
Return underlying data.
- post_init_log(args_dict=None)
Log additional arguments after initialization.
- property shape
Get shape of underlying data.
- wrap(data)
Return a Sup3rDataset object or tuple of such. This is a tuple when the .data attribute belongs to a Collection object like BatchHandler. Otherwise this is a Sup3rDataset object, which is either a wrapped 3-tuple, 2-tuple, or 1-tuple (e.g. len(data) == 3, len(data) == 2 or len(data) == 1). This is a 3-tuple when .data belongs to a container object like DualSamplerWithObs, a 2-tuple when .data belongs to a dual container object like DualSampler, and a 1-tuple otherwise.