nsrdb.file_handlers.collection.Collector

class Collector(collect_dir, dset)[source]

Bases: object

NSRDB file collection framework

Parameters:
  • collect_dir (str) – Directory that files are being collected from

  • dset (str) – Dataset/var name that is searched for in file names in collect_dir.

Methods

collect_daily(collect_dir, fn_out, dsets[, ...])

Collect daily data model files from a dir to one output file.

collect_dir(meta_final, collect_dir, ...[, ...])

Perform final collection of dsets for given collect_dir.

collect_flist(flist, collect_dir, f_out, dset)

Collect a dataset from a file list with data pre-init.

collect_flist_lowmem(flist, collect_dir, ...)

Collect a file list without data pre-init for low memory utilization.

filter_flist(flist, collect_dir, dset)

Filter file list so that only remaining files have given dset.

get_data(fpath, dset, time_index, meta, ...)

Retrieve a data array from a chunked file.

get_dset_attrs(h5dir[, ignore_dsets])

Get output file dataset attributes for a set of datasets.

get_flist(d, dset)

Get a date-sorted .h5 file list for a given var.

get_slices(final_time_index, final_meta, ...)

Get index slices where the new ti/meta belong in the final ti/meta.

verify_flist(flist, d, var)

Verify the correct number of files in d for var.

static verify_flist(flist, d, var)[source]

Verify the correct number of files in d for var. Raise if bad flist.

Filename requirements:
  • Expects file names with leading “YYYYMMDD_”.

  • Must have var in the file name.

  • Should end with “.h5”

Parameters:
  • flist (list) – List of .h5 files in directory d that contain the var string. Sorted by integer before the first underscore in the filename.

  • d (str) – Directory to get file list from.

  • var (str) – Variable name that is searched for in files in d.
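The filename rules above can be sketched as a standalone check (a hypothetical helper for illustration, not part of the Collector API):

```python
import re

def looks_collectible(fname, var):
    """Illustrative check mirroring the documented filename rules
    (hypothetical helper, not the actual verify_flist implementation)."""
    has_date = re.match(r'^\d{8}_', fname) is not None  # leading "YYYYMMDD_"
    has_var = var in fname                              # var in the file name
    is_h5 = fname.endswith('.h5')                       # ".h5" extension
    return has_date and has_var and is_h5

looks_collectible('20200101_ghi_0.h5', 'ghi')  # True
looks_collectible('ghi_20200101.h5', 'ghi')    # False: date must lead
```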

static filter_flist(flist, collect_dir, dset)[source]

Filter file list so that only remaining files have given dset.

static get_flist(d, dset)[source]

Get a date-sorted .h5 file list for a given var.

Filename requirements:
  • Expects file names with leading “YYYYMMDD_”.

  • Must have var in the file name.

  • Should end with “.h5”

Parameters:
  • d (str) – Directory to get file list from.

  • dset (str) – Variable name that is searched for in files in d.

Returns:

flist (list) – List of .h5 files in directory d that contain the var string. Sorted by integer before the first underscore in the filename.
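As a rough illustration of the documented filter-and-sort behavior (a sketch, not the actual implementation):

```python
def date_sorted_flist(fnames, dset):
    """Hypothetical sketch of the documented behavior: keep .h5 files
    containing dset, sorted by the integer before the first underscore."""
    flist = [f for f in fnames if dset in f and f.endswith('.h5')]
    return sorted(flist, key=lambda f: int(f.split('_')[0]))

files = ['20200102_ghi_0.h5', '20200101_ghi_0.h5', '20200101_dni_0.h5']
date_sorted_flist(files, 'ghi')
# ['20200101_ghi_0.h5', '20200102_ghi_0.h5']
```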

static get_slices(final_time_index, final_meta, new_time_index, new_meta)[source]

Get index slices where the new ti/meta belong in the final ti/meta.

Parameters:
  • final_time_index (pd.DatetimeIndex) – Time index of the final file that new_time_index is being written to.

  • final_meta (pd.DataFrame) – Meta data of the final file that new_meta is being written to.

  • new_time_index (pd.DatetimeIndex) – Chunk time index that is a subset of final_time_index.

  • new_meta (pd.DataFrame) – Chunk meta data that is a subset of the final_meta.

Returns:

  • row_slice (slice) – final_time_index[row_slice] = new_time_index

  • col_slice (slice) – final_meta[col_slice] = new_meta
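A minimal sketch of such a slice lookup, assuming the chunk time index and chunk sites are contiguous subsets of the final index and meta (illustrative only; `final_ti`, `final_gids`, and the chunk variables are made-up stand-ins):

```python
import pandas as pd

final_ti = pd.date_range('2020-01-01', periods=8, freq='h')
final_gids = [100, 101, 102, 103]

new_ti = final_ti[2:5]   # chunk time index (contiguous subset)
new_gids = [101, 102]    # chunk site gids (contiguous subset)

row_slice = slice(final_ti.get_loc(new_ti[0]),
                  final_ti.get_loc(new_ti[-1]) + 1)
col_slice = slice(final_gids.index(new_gids[0]),
                  final_gids.index(new_gids[-1]) + 1)

assert (final_ti[row_slice] == new_ti).all()
assert final_gids[col_slice] == new_gids
```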

static get_data(fpath, dset, time_index, meta, scale_factor, dtype, sites=None)[source]

Retrieve a data array from a chunked file.

Parameters:
  • fpath (str) – h5 file to get data from

  • dset (str) – dataset to retrieve data from fpath.

  • time_index (pd.DatetimeIndex) – Time index of the final file.

  • meta (pd.DataFrame) – Meta data of the final file.

  • scale_factor (int | float) – Final destination scale factor after collection. If the data retrieval from the files to be collected has a different scale factor, the collected data will be rescaled and returned as float32.

  • dtype (np.dtype) – Final dtype to return data as

  • sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.

Returns:

  • f_data (np.ndarray) – Data array from fpath cast to the input dtype.

  • row_slice (slice) – final_time_index[row_slice] = new_time_index

  • col_slice (slice) – final_meta[col_slice] = new_meta
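The scale_factor behavior described above can be sketched as follows; the arithmetic (unscale by the source factor, rescale by the final one) is an assumption read off the docstring, not the verified implementation:

```python
import numpy as np

def rescale(f_data, source_scale, final_scale, dtype):
    """Hypothetical sketch: if the source file's scale factor differs
    from the final one, rescale and return float32 per the docstring;
    otherwise cast to the requested dtype."""
    if source_scale != final_scale:
        return f_data.astype(np.float32) * (final_scale / source_scale)
    return f_data.astype(dtype)

raw = np.array([1000, 2000], dtype=np.int32)  # stored at scale factor 1000
rescale(raw, 1000, 100, np.int16)  # → array([100., 200.], dtype=float32)
```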

classmethod collect_flist(flist, collect_dir, f_out, dset, sites=None, sort=False, sort_key=None, var_meta=None, max_workers=None)[source]

Collect a dataset from a file list with data pre-init.

Note

Collects data that can be chunked in both space and time.

Parameters:
  • flist (list) – List of chunked filenames in collect_dir to collect.

  • collect_dir (str) – Directory of chunked files (flist).

  • f_out (str) – File path of final output file.

  • dset (str) – Dataset name to collect.

  • sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.

  • sort (bool) – flag to sort flist to determine meta data order.

  • sort_key (None | callable) – Optional sort key to sort flist by (determines how meta is built if f_out does not exist).

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None uses all available.
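Conceptually, the pre-init collection pattern can be sketched as follows (illustrative only; the real method reads chunk h5 files rather than in-memory triples):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# The final array is pre-initialized, then each chunk contributes a
# (data, row_slice, col_slice) triple written into place, optionally
# in parallel across workers.
final = np.zeros((4, 4), dtype=np.int16)
chunks = [
    (np.ones((2, 2), dtype=np.int16), slice(0, 2), slice(0, 2)),
    (np.full((2, 2), 2, dtype=np.int16), slice(2, 4), slice(2, 4)),
]

def write_chunk(chunk):
    data, row_slice, col_slice = chunk
    final[row_slice, col_slice] = data

with ThreadPoolExecutor(max_workers=2) as exe:
    list(exe.map(write_chunk, chunks))
```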

static collect_flist_lowmem(flist, collect_dir, f_out, dset, sort=False, sort_key=None, var_meta=None, log_file='collect_flist_lowmem.log', log_level='DEBUG')[source]

Collect a file list without data pre-init for low memory utilization.

Collects data that can be chunked in both space and time as long as f_out is pre-initialized.

Parameters:
  • flist (list | str) – List of chunked filenames in collect_dir to collect. Can also be a json.dumps(flist).

  • collect_dir (str) – Directory of chunked files (flist).

  • f_out (str) – File path of final output file. Must already be initialized with full time index and meta.

  • dset (str) – Dataset name to collect.

  • sort (bool) – flag to sort flist to determine meta data order.

  • sort_key (None | callable) – Optional sort key to sort flist by (determines how meta is built if f_out does not exist).

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.

  • log_level (str | None) – Desired log level, None will not initialize logging.

  • log_file (str | None) – Target log file. None logs to stdout.

classmethod collect_daily(collect_dir, fn_out, dsets, sites=None, n_writes=1, var_meta=None, max_workers=None, log_file='collect_daily.log', log_level='DEBUG')[source]

Collect daily data model files from a dir to one output file.

Assumes the file list is chunked in time (row chunked).

Filename requirements:
  • Expects file names with leading “YYYYMMDD_”.

  • Must have var in the file name.

  • Should end with “.h5”

Parameters:
  • collect_dir (str) – Directory of chunked files. Each file should be one variable for one day.

  • fn_out (str) – File path of final output file.

  • dsets (list | str) – List of datasets / variable names to collect. Can also be a single dataset or json.dumps(dsets).

  • sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.

  • n_writes (None | int) – Number of file list divisions to write per dataset. For example, if ghi and dni are being collected and n_writes is set to 2, half of the source ghi files will be collected at once and then written, then the second half of ghi files, then dni.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo. This is used if f_out has not yet been initialized.

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.

  • log_file (str | None) – Target log file. None logs to stdout.

  • log_level (str | None) – Desired log level, None will not initialize logging.
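The n_writes behavior described above can be illustrated with a simple split (filenames here are hypothetical, and the split shown is one plausible way to divide the list, not necessarily the exact implementation):

```python
import numpy as np

# With n_writes=2, each dataset's file list is divided in two: the first
# half is collected and written, then the second half, before moving on
# to the next dataset.
flist = ['20200101_ghi.h5', '20200102_ghi.h5',
         '20200103_ghi.h5', '20200104_ghi.h5']
n_writes = 2
groups = np.array_split(flist, n_writes)
# groups[0] -> ['20200101_ghi.h5', '20200102_ghi.h5']
# groups[1] -> ['20200103_ghi.h5', '20200104_ghi.h5']
```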

static get_dset_attrs(h5dir, ignore_dsets=('coordinates', 'time_index', 'meta'))[source]

Get output file dataset attributes for a set of datasets.

Parameters:
  • h5dir (str) – Path to directory containing multiple h5 files with all available dsets. Can also be a single h5 filepath.

  • ignore_dsets (tuple | list) – List of datasets to ignore (will not be aggregated).

Returns:

  • dsets (list) – List of datasets.

  • attrs (dict) – Dictionary of dataset attributes keyed by dset name.

  • chunks (dict) – Dictionary of chunk tuples keyed by dset name.

  • dtypes (dict) – Dictionary of numpy datatypes keyed by dset name.

  • ti (pd.DatetimeIndex) – Time index of source files in h5dir.

classmethod collect_dir(meta_final, collect_dir, collect_tag, fout, dsets=None, max_workers=None, log_file='collect_dir.log', log_level='DEBUG')[source]

Perform final collection of dsets for given collect_dir.

Parameters:
  • meta_final (str | pd.DataFrame) – Final meta data with index = gid.

  • collect_dir (str) – Directory path containing chunked h5 files to collect.

  • collect_tag (str) – String to be found in the names of files being collected.

  • fout (str) – File path to the output collected file (will be initialized by this method).

  • dsets (list | tuple) – Select datasets to collect (None will default to all dsets).

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.

  • log_file (str | None) – Target log file. None logs to stdout.

  • log_level (str | None) – Desired log level, None will not initialize logging.