nsrdb.file_handlers.collection.Collector

class Collector(collect_dir, dset)[source]

Bases: object

NSRDB file collection framework

Parameters:
  • collect_dir (str) – Directory that files are being collected from

  • dset (str) – Dataset/var name that is searched for in file names in collect_dir.

Methods

collect_daily(collect_dir, fn_out, dsets[, ...])

Collect daily data model files from a dir to one output file.

collect_dir(meta_final, collect_dir, ...[, ...])

Perform final collection of dsets for given collect_dir.

collect_flist(flist, collect_dir, f_out, dset)

Collect a dataset from a file list with data pre-init.

collect_flist_lowmem(flist, collect_dir, ...)

Collect a file list without data pre-init for low memory utilization.

filter_flist(flist, collect_dir, dset)

Filter file list so that only remaining files have given dset.

get_data(fpath, dset, time_index, meta, ...)

Retrieve a data array from a chunked file.

get_dset_attrs(h5dir[, ignore_dsets])

Get output file dataset attributes for a set of datasets.

get_flist(d, dset)

Get a date-sorted .h5 file list for a given var.

get_slices(final_time_index, final_meta, ...)

Get index slices where the new ti/meta belong in the final ti/meta.

verify_flist(flist, d, var)

Verify the correct number of files in d for var.

static verify_flist(flist, d, var)[source]

Verify the correct number of files in d for var. Raise if bad flist.

Filename requirements:
  • Expects file names with leading “YYYYMMDD_”.

  • Must have var in the file name.

  • Should end with “.h5”

Parameters:
  • flist (list) – List of .h5 files in directory d that contain the var string. Sorted by integer before the first underscore in the filename.

  • d (str) – Directory to get file list from.

  • var (str) – Variable name that is searched for in files in d.
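The filename rules above can be sketched as a standalone check (a hypothetical helper for illustration, not part of the Collector API):

```python
import re

def looks_collectible(fname, var):
    """Illustrative check mirroring the documented filename rules
    (hypothetical helper, not the actual verify_flist implementation)."""
    has_date = re.match(r'^\d{8}_', fname) is not None  # leading "YYYYMMDD_"
    has_var = var in fname                              # var in the file name
    is_h5 = fname.endswith('.h5')                       # ".h5" extension
    return has_date and has_var and is_h5

looks_collectible('20200101_ghi_0.h5', 'ghi')  # True
looks_collectible('ghi_20200101.h5', 'ghi')    # False: date must lead
```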

static filter_flist(flist, collect_dir, dset)[source]

Filter file list so that only remaining files have given dset.

static get_flist(d, dset)[source]

Get a date-sorted .h5 file list for a given var.

Filename requirements:
  • Expects file names with leading “YYYYMMDD_”.

  • Must have var in the file name.

  • Should end with “.h5”

Parameters:
  • d (str) – Directory to get file list from.

  • dset (str) – Variable name that is searched for in files in d.

Returns:

flist (list) – List of .h5 files in directory d that contain the var string. Sorted by integer before the first underscore in the filename.
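As a rough illustration of the documented filter-and-sort behavior (a sketch, not the actual implementation):

```python
def date_sorted_flist(fnames, dset):
    """Hypothetical sketch of the documented behavior: keep .h5 files
    containing dset, sorted by the integer before the first underscore."""
    flist = [f for f in fnames if dset in f and f.endswith('.h5')]
    return sorted(flist, key=lambda f: int(f.split('_')[0]))

files = ['20200102_ghi_0.h5', '20200101_ghi_0.h5', '20200101_dni_0.h5']
date_sorted_flist(files, 'ghi')
# ['20200101_ghi_0.h5', '20200102_ghi_0.h5']
```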

static get_slices(final_time_index, final_meta, new_time_index, new_meta)[source]

Get index slices where the new ti/meta belong in the final ti/meta.

Parameters:
  • final_time_index (pd.DatetimeIndex) – Time index of the final file that new_time_index is being written to.

  • final_meta (pd.DataFrame) – Meta data of the final file that new_meta is being written to.

  • new_time_index (pd.DatetimeIndex) – Chunk time index that is a subset of final_time_index.

  • new_meta (pd.DataFrame) – Chunk meta data that is a subset of the final_meta.

Returns:

  • row_slice (slice) – final_time_index[row_slice] = new_time_index

  • col_slice (slice) – final_meta[col_slice] = new_meta
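A minimal sketch of such a slice lookup, assuming the chunk time index and chunk sites are contiguous subsets of the final index and meta (illustrative only; `final_ti`, `final_gids`, and the chunk variables are made-up stand-ins):

```python
import pandas as pd

final_ti = pd.date_range('2020-01-01', periods=8, freq='h')
final_gids = [100, 101, 102, 103]

new_ti = final_ti[2:5]   # chunk time index (contiguous subset)
new_gids = [101, 102]    # chunk site gids (contiguous subset)

row_slice = slice(final_ti.get_loc(new_ti[0]),
                  final_ti.get_loc(new_ti[-1]) + 1)
col_slice = slice(final_gids.index(new_gids[0]),
                  final_gids.index(new_gids[-1]) + 1)

assert (final_ti[row_slice] == new_ti).all()
assert final_gids[col_slice] == new_gids
```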

static get_data(fpath, dset, time_index, meta, scale_factor, dtype, sites=None)[source]

Retrieve a data array from a chunked file.

Parameters:
  • fpath (str) – h5 file to get data from

  • dset (str) – dataset to retrieve data from fpath.

  • time_index (pd.DatetimeIndex) – Time index of the final file.

  • meta (pd.DataFrame) – Meta data of the final file.

  • scale_factor (int | float) – Final destination scale factor after collection. If the data retrieval from the files to be collected has a different scale factor, the collected data will be rescaled and returned as float32.

  • dtype (np.dtype) – Final dtype to return data as

  • sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.

Returns:

  • f_data (np.ndarray) – Data array from fpath cast to the input dtype.

  • row_slice (slice) – final_time_index[row_slice] = new_time_index

  • col_slice (slice) – final_meta[col_slice] = new_meta
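The scale_factor behavior described above can be sketched as follows; the arithmetic (unscale by the source factor, rescale by the final one) is an assumption read off the docstring, not the verified implementation:

```python
import numpy as np

def rescale(f_data, source_scale, final_scale, dtype):
    """Hypothetical sketch: if the source file's scale factor differs
    from the final one, rescale and return float32 per the docstring;
    otherwise cast to the requested dtype."""
    if source_scale != final_scale:
        return f_data.astype(np.float32) * (final_scale / source_scale)
    return f_data.astype(dtype)

raw = np.array([1000, 2000], dtype=np.int32)  # stored at scale factor 1000
rescale(raw, 1000, 100, np.int16)  # → array([100., 200.], dtype=float32)
```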

classmethod collect_flist(flist, collect_dir, f_out, dset, sites=None, sort=False, sort_key=None, var_meta=None, max_workers=None)[source]

Collect a dataset from a file list with data pre-init.

Note

Collects data that can be chunked in both space and time.

Parameters:
  • flist (list) – List of chunked filenames in collect_dir to collect.

  • collect_dir (str) – Directory of chunked files (flist).

  • f_out (str) – File path of final output file.

  • dset (str) – Dataset name to collect.

  • sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.

  • sort (bool) – flag to sort flist to determine meta data order.

  • sort_key (None | callable) – Optional sort key to sort flist by (determines how meta is built if f_out does not exist).

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None uses all available.
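Conceptually, the pre-init collection pattern can be sketched as follows (illustrative only; the real method reads chunk h5 files rather than in-memory triples):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# The final array is pre-initialized, then each chunk contributes a
# (data, row_slice, col_slice) triple written into place, optionally
# in parallel across workers.
final = np.zeros((4, 4), dtype=np.int16)
chunks = [
    (np.ones((2, 2), dtype=np.int16), slice(0, 2), slice(0, 2)),
    (np.full((2, 2), 2, dtype=np.int16), slice(2, 4), slice(2, 4)),
]

def write_chunk(chunk):
    data, row_slice, col_slice = chunk
    final[row_slice, col_slice] = data

with ThreadPoolExecutor(max_workers=2) as exe:
    list(exe.map(write_chunk, chunks))
```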

static collect_flist_lowmem(flist, collect_dir, f_out, dset, sort=False, sort_key=None, var_meta=None, log_file='collect_flist_lowmem.log', log_level='DEBUG')[source]

Collect a file list without data pre-init for low memory utilization.

Collects data that can be chunked in both space and time as long as f_out is pre-initialized.

Parameters:
  • flist (list | str) – List of chunked filenames in collect_dir to collect. Can also be a json.dumps(flist).

  • collect_dir (str) – Directory of chunked files (flist).

  • f_out (str) – File path of final output file. Must already be initialized with full time index and meta.

  • dset (str) – Dataset name to collect.

  • sort (bool) – flag to sort flist to determine meta data order.

  • sort_key (None | callable) – Optional sort key to sort flist by (determines how meta is built if f_out does not exist).

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.

  • log_level (str | None) – Desired log level, None will not initialize logging.

  • log_file (str | None) – Target log file. None logs to stdout.

classmethod collect_daily(collect_dir, fn_out, dsets, sites=None, n_writes=1, var_meta=None, max_workers=None, log_file='collect_daily.log', log_level='DEBUG')[source]

Collect daily data model files from a dir to one output file.

Assumes the file list is chunked in time (row chunked).

Filename requirements:
  • Expects file names with leading “YYYYMMDD_”.

  • Must have var in the file name.

  • Should end with “.h5”

Parameters:
  • collect_dir (str) – Directory of chunked files. Each file should be one variable for one day.

  • fn_out (str) – File path of final output file.

  • dsets (list | str) – List of datasets / variable names to collect. Can also be a single dataset or json.dumps(dsets).

  • sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.

  • n_writes (None | int) – Number of file list divisions to write per dataset. For example, if ghi and dni are being collected and n_writes is set to 2, half of the source ghi files will be collected at once and then written, then the second half of ghi files, then dni.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo. This is used if f_out has not yet been initialized.

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.

  • log_file (str | None) – Target log file. None logs to stdout.

  • log_level (str | None) – Desired log level, None will not initialize logging.
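The n_writes behavior described above can be illustrated with a simple split (filenames here are hypothetical, and the split shown is one plausible way to divide the list, not necessarily the exact implementation):

```python
import numpy as np

# With n_writes=2, each dataset's file list is divided in two: the first
# half is collected and written, then the second half, before moving on
# to the next dataset.
flist = ['20200101_ghi.h5', '20200102_ghi.h5',
         '20200103_ghi.h5', '20200104_ghi.h5']
n_writes = 2
groups = np.array_split(flist, n_writes)
# groups[0] -> ['20200101_ghi.h5', '20200102_ghi.h5']
# groups[1] -> ['20200103_ghi.h5', '20200104_ghi.h5']
```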

static get_dset_attrs(h5dir, ignore_dsets=('coordinates', 'time_index', 'meta'))[source]

Get output file dataset attributes for a set of datasets.

Parameters:
  • h5dir (str) – Path to directory containing multiple h5 files with all available dsets. Can also be a single h5 filepath.

  • ignore_dsets (tuple | list) – List of datasets to ignore (will not be aggregated).

Returns:

  • dsets (list) – List of datasets.

  • attrs (dict) – Dictionary of dataset attributes keyed by dset name.

  • chunks (dict) – Dictionary of chunk tuples keyed by dset name.

  • dtypes (dict) – Dictionary of numpy datatypes keyed by dset name.

  • ti (pd.DatetimeIndex) – Time index of source files in h5dir.

classmethod collect_dir(meta_final, collect_dir, collect_tag, fout, dsets=None, max_workers=None, log_file='collect_dir.log', log_level='DEBUG')[source]

Perform final collection of dsets for given collect_dir.

Parameters:
  • meta_final (str | pd.DataFrame) – Final meta data with index = gid.

  • collect_dir (str) – Directory path containing chunked h5 files to collect.

  • collect_tag (str) – String to be found in the names of files being collected.

  • fout (str) – File path to the output collected file (will be initialized by this method).

  • dsets (list | tuple) – Select datasets to collect (None will default to all dsets).

  • max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.

  • log_file (str | None) – Target log file. None logs to stdout.

  • log_level (str | None) – Desired log level, None will not initialize logging.