sup3r.postprocessing.collectors.nc.CollectorNC

class CollectorNC(file_paths)[source]

Bases: BaseCollector

Sup3r NETCDF file collection framework

Parameters:

file_paths (list | str) – Explicit list of file paths that will be sorted and collected, or a single string with a unix-style /search/patt*ern.<ext>. Files should have non-overlapping time_index values and spatial domains.
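A minimal usage sketch, assuming chunked sup3r output files in a scratch directory; the path and glob pattern below are illustrative:

    from sup3r.postprocessing.collectors.nc import CollectorNC

    # Glob pattern for chunked forward-pass output files; path is illustrative.
    collector = CollectorNC('/scratch/sup3r_out/sup3r_chunk_*.nc')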

Methods

collect(file_paths, out_file[, features, ...])

Collect data files from a directory into one output file.

get_chunk_indices(file)

Get spatial and temporal chunk indices from the given file name.

get_node_cmd(config)

Get a CLI call to collect data.

get_time_dim_name(filepath)

Get the name of the time dimension in the given file.

group_spatial_chunks()

Group files for the same spatial chunk together so that each entry has the same spatial footprint but covers different times.

write_data(out_file, dsets, time_index, ...)

Write list of datasets to out_file.

classmethod collect(file_paths, out_file, features='all', log_level=None, log_file=None, overwrite=True, res_kwargs=None, cacher_kwargs=None, is_regular_grid=True)[source]

Collect data files from a directory into one output file.

TODO: For a regular grid (lat values constant across lon and vice versa), collecting chunks split along both latitude and longitude is supported. For curvilinear grids, only chunks split by latitude can be collected. This should be generalized to allow any spatial chunking along any dimension, which would likely require either a new file naming scheme with a separate spatial index for both latitude and longitude, or checking each chunk to see how it is split.

Filename requirements:
  • Should end with “.nc”

Parameters:
  • file_paths (list | str) – Explicit list of file paths that will be sorted and collected, or a single string with a unix-style /search/patt*ern.nc.

  • out_file (str) – File path of final output file.

  • features (list | str) – List of dsets to collect. If ‘all’ then all data_vars will be collected.

  • log_level (str | None) – Desired log level, None will not initialize logging.

  • log_file (str | None) – Target log file. None logs to stdout.

  • overwrite (bool) – Whether to overwrite an existing output file.

  • res_kwargs (dict | None) – Dictionary of kwargs to pass to xarray.open_mfdataset.

  • cacher_kwargs (dict | None) – Dictionary of kwargs to pass to Cacher._write_single.

  • is_regular_grid (bool) – Whether the data is on a regular grid. If True then spatial chunks can be combined across both latitude and longitude.
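A sketch of a typical call using the signature above; the paths are illustrative, and the res_kwargs value shown is an ordinary xarray.open_mfdataset option:

    from sup3r.postprocessing.collectors.nc import CollectorNC

    # Collect all chunked NETCDF outputs into a single file. Paths are
    # illustrative; 'parallel' is a standard xarray.open_mfdataset kwarg.
    CollectorNC.collect(
        file_paths='/scratch/sup3r_out/sup3r_chunk_*.nc',
        out_file='/scratch/sup3r_out/sup3r_collected.nc',
        features='all',
        log_level='INFO',
        overwrite=True,
        res_kwargs={'parallel': True},
        is_regular_grid=True,
    )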

group_spatial_chunks()[source]

Group files for the same spatial chunk together so that each entry has the same spatial footprint but covers different times.
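Conceptually the grouping resembles the standalone sketch below; the trailing _{temporal}_{spatial}.nc naming is an assumption inferred from get_chunk_indices, and the real method operates on the collector's own file list:

    from collections import defaultdict

    def group_by_spatial_chunk(file_paths):
        """Illustrative stand-in: map each spatial chunk index to the files
        (one per temporal chunk) that share that spatial footprint."""
        groups = defaultdict(list)
        for fp in sorted(file_paths):
            # Assumed naming: ..._{temporal_index}_{spatial_index}.nc
            s_idx = fp.split('_')[-1].replace('.nc', '')
            groups[s_idx].append(fp)
        return groups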

static get_chunk_indices(file)

Get spatial and temporal chunk indices from the given file name.

Returns:

  • temporal_chunk_index (str) – Zero-padded integer for the temporal chunk index

  • spatial_chunk_index (str) – Zero-padded integer for the spatial chunk index
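For example, assuming a file name that ends with zero-padded temporal and spatial indices (the name and index order here are illustrative, inferred from the return values above):

    # Returns the zero-padded index strings parsed from the file name.
    t_idx, s_idx = CollectorNC.get_chunk_indices('sup3r_chunk_000003_000011.nc')
    # t_idx == '000003', s_idx == '000011'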

classmethod get_node_cmd(config)

Get a CLI call to collect data.

Parameters:

config (dict) – sup3r collection config with all necessary args and kwargs to run data collection.
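A hypothetical config sketch; the keys below simply mirror collect()'s parameters and are an assumption, not a verified schema:

    # Hypothetical collection config; keys mirror CollectorNC.collect() arguments.
    config = {
        'file_paths': '/scratch/sup3r_out/sup3r_chunk_*.nc',
        'out_file': '/scratch/sup3r_out/sup3r_collected.nc',
        'features': 'all',
        'log_file': '/scratch/sup3r_out/collect.log',
        'log_level': 'INFO',
    }
    cmd = CollectorNC.get_node_cmd(config)  # CLI string for job submission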

static get_time_dim_name(filepath)

Get the name of the time dimension in the given file.

Parameters:

filepath (str) – Path to the file

Returns:

time_key (str) – Name of the time dimension in the given file
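For instance (the file path is illustrative):

    # Returns e.g. 'time' or 'Times', depending on how the file was written.
    time_key = CollectorNC.get_time_dim_name(
        '/scratch/sup3r_out/sup3r_chunk_000000_000000.nc'
    )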

classmethod write_data(out_file, dsets, time_index, data_list, meta, global_attrs=None)

Write list of datasets to out_file.

Parameters:
  • out_file (str) – Pre-existing H5 file output path.

  • dsets (list) – List of datasets to write to out_file.

  • time_index (pd.DatetimeIndex) – Pandas datetime index to use for the file time_index.

  • data_list (list) – List of np.ndarray objects to write to out_file.

  • meta (pd.DataFrame) – Full meta dataframe for the final output data.

  • global_attrs (dict) – Namespace of file-global attributes for the final output data.
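A minimal sketch of a direct call; the (time, sites) array layout and the feature name are assumptions for illustration only:

    import numpy as np
    import pandas as pd

    from sup3r.postprocessing.collectors.nc import CollectorNC

    time_index = pd.date_range('2015-01-01', periods=24, freq='h')
    meta = pd.DataFrame({'latitude': [39.7, 39.8], 'longitude': [-105.2, -105.1]})

    # One dummy array per dataset; the (time, sites) shape is an assumption.
    data_list = [np.random.rand(len(time_index), len(meta))]

    CollectorNC.write_data(
        out_file='/scratch/sup3r_out/collected.h5',  # pre-existing H5 path per docstring
        dsets=['windspeed_100m'],
        time_index=time_index,
        data_list=data_list,
        meta=meta,
        global_attrs={'source': 'sup3r'},
    )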