nsrdb.file_handlers.collection.Collector
- class Collector(collect_dir, dset)[source]
Bases:
object
NSRDB file collection framework
- Parameters:
collect_dir (str) – Directory that files are being collected from
dset (str) – Dataset/var name that is searched for in file names in collect_dir.
Methods
collect_daily(collect_dir, fn_out, dsets[, ...]) – Collect daily data model files from a dir to one output file.
collect_dir(meta_final, collect_dir, ...[, ...]) – Perform final collection of dsets for given collect_dir.
collect_flist(flist, collect_dir, f_out, dset) – Collect a dataset from a file list with data pre-init.
collect_flist_lowmem(flist, collect_dir, ...) – Collect a file list without data pre-init for low memory utilization.
filter_flist(flist, collect_dir, dset) – Filter file list so that only remaining files have given dset.
get_data(fpath, dset, time_index, meta, ...) – Retrieve a data array from a chunked file.
get_dset_attrs(h5dir[, ignore_dsets]) – Get output file dataset attributes for a set of datasets.
get_flist(d, dset) – Get a date-sorted .h5 file list for a given var.
get_slices(final_time_index, final_meta, ...) – Get index slices where the new ti/meta belong in the final ti/meta.
verify_flist(flist, d, var) – Verify the correct number of files in d for var.
- static verify_flist(flist, d, var)[source]
Verify the correct number of files in d for var. Raise if bad flist.
- Filename requirements:
Expects file names with leading “YYYYMMDD_”.
Must have var in the file name.
Should end with “.h5”.
- Parameters:
flist (list) – List of .h5 files in directory d that contain the var string. Sorted by integer before the first underscore in the filename.
d (str) – Directory to get file list from.
var (str) – Variable name that is searched for in files in d.
- static filter_flist(flist, collect_dir, dset)[source]
Filter file list so that only remaining files have given dset.
- static get_flist(d, dset)[source]
Get a date-sorted .h5 file list for a given var.
- Filename requirements:
Expects file names with leading “YYYYMMDD_”.
Must have var in the file name.
Should end with “.h5”.
- Parameters:
d (str) – Directory to get file list from.
dset (str) – Variable name that is searched for in files in d.
- Returns:
flist (list) – List of .h5 files in directory d that contain the var string. Sorted by integer before the first underscore in the filename.
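The date-based sort described above can be illustrated with plain Python. This is a simplified sketch of the documented behavior (filter to .h5 files containing the dset string, sort by the integer before the first underscore), not the library's actual implementation:

```python
import os

def date_sorted_flist(d, dset):
    """Sketch: keep .h5 files in directory d whose names contain the
    dset string, sorted by the integer before the first underscore
    (the leading YYYYMMDD date stamp)."""
    flist = [fn for fn in os.listdir(d)
             if fn.endswith('.h5') and dset in fn]
    return sorted(flist, key=lambda fn: int(fn.split('_')[0]))
```

Because the sort key is the parsed integer rather than the raw string, files sort chronologically even if the list of names is not lexicographically ordered.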
- static get_slices(final_time_index, final_meta, new_time_index, new_meta)[source]
Get index slices where the new ti/meta belong in the final ti/meta.
- Parameters:
final_time_index (pd.DatetimeIndex) – Time index of the final file that new_time_index is being written to.
final_meta (pd.DataFrame) – Meta data of the final file that new_meta is being written to.
new_time_index (pd.DatetimeIndex) – Chunk time index that is a subset of the final_time_index.
new_meta (pd.DataFrame) – Chunk meta data that is a subset of the final_meta.
- Returns:
row_slice (slice) – final_time_index[row_slice] = new_time_index
col_slice (slice) – final_meta[col_slice] = new_meta
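The slice lookup can be sketched with pandas. This assumes the chunk is a contiguous block of the final time index and meta (a simplified illustration satisfying the return contract above, not the library's implementation):

```python
import pandas as pd

def get_slices_sketch(final_time_index, final_meta, new_time_index, new_meta):
    """Sketch: locate the chunk's first timestamp and first site in the
    final index/meta, and return contiguous slices such that
    final_time_index[row_slice] equals new_time_index."""
    row_start = final_time_index.get_loc(new_time_index[0])
    row_slice = slice(row_start, row_start + len(new_time_index))
    col_start = final_meta.index.get_loc(new_meta.index[0])
    col_slice = slice(col_start, col_start + len(new_meta))
    return row_slice, col_slice
```

Contiguity is an assumption of this sketch; a chunk scattered through the final index would need index arrays rather than slices.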
- static get_data(fpath, dset, time_index, meta, scale_factor, dtype, sites=None)[source]
Retrieve a data array from a chunked file.
- Parameters:
fpath (str) – h5 file to get data from
dset (str) – dataset to retrieve data from fpath.
time_index (pd.DatetimeIndex) – Time index of the final file.
meta (pd.DataFrame) – Meta data of the final file.
scale_factor (int | float) – Final destination scale factor after collection. If the data retrieval from the files to be collected has a different scale factor, the collected data will be rescaled and returned as float32.
dtype (np.dtype) – Final dtype to return data as
sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.
- Returns:
f_data (np.ndarray) – Data array from the fpath cast as input dtype.
row_slice (slice) – final_time_index[row_slice] = new_time_index
col_slice (slice) – final_meta[col_slice] = new_meta
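The scale_factor handling described above can be sketched with NumPy. The rescaling direction (divide by the source scale factor, multiply by the destination scale factor) is an assumption for illustration; only the fall-back to float32 on mismatched scale factors is stated in the docs:

```python
import numpy as np

def rescale_sketch(raw, source_scale, dest_scale, dtype):
    """Sketch: if source and destination scale factors match, cast the
    raw array straight to the requested dtype; otherwise rescale
    (assumed: raw / source_scale * dest_scale) and return float32."""
    if source_scale == dest_scale:
        return raw.astype(dtype)
    return (raw.astype(np.float32) / source_scale * dest_scale).astype(np.float32)
```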
- classmethod collect_flist(flist, collect_dir, f_out, dset, sites=None, sort=False, sort_key=None, var_meta=None, max_workers=None)[source]
Collect a dataset from a file list with data pre-init.
Note
Collects data that can be chunked in both space and time.
- Parameters:
flist (list) – List of chunked filenames in collect_dir to collect.
collect_dir (str) – Directory of chunked files (flist).
f_out (str) – File path of final output file.
dset (str) – Dataset name to collect.
sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.
sort (bool) – Flag to sort flist to determine meta data order.
sort_key (None | fun) – Optional sort key to sort flist by (determines how meta is built if f_out does not exist).
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None uses all available.
- static collect_flist_lowmem(flist, collect_dir, f_out, dset, sort=False, sort_key=None, var_meta=None, log_file='collect_flist_lowmem.log', log_level='DEBUG')[source]
Collect a file list without data pre-init for low memory utilization.
Collects data that can be chunked in both space and time as long as f_out is pre-initialized.
- Parameters:
flist (list | str) – List of chunked filenames in collect_dir to collect. Can also be a json.dumps(flist).
collect_dir (str) – Directory of chunked files (flist).
f_out (str) – File path of final output file. Must already be initialized with full time index and meta.
dset (str) – Dataset name to collect.
sort (bool) – Flag to sort flist to determine meta data order.
sort_key (None | fun) – Optional sort key to sort flist by (determines how meta is built if f_out does not exist).
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
log_level (str | None) – Desired log level, None will not initialize logging.
log_file (str | None) – Target log file. None logs to stdout.
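Accepting flist as either a list or a json.dumps(flist) string suggests a normalization step like the following sketch (a hypothetical helper, not part of the API):

```python
import json

def normalize_flist(flist):
    """Sketch: accept a file list either as a Python list or as a
    JSON-encoded string (e.g. passed through a CLI or config file)
    and return a plain list of filenames."""
    if isinstance(flist, str):
        flist = json.loads(flist)
    return list(flist)
```

Encoding the list as JSON lets the same argument travel through command-line interfaces and config files that only carry strings.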
- classmethod collect_daily(collect_dir, fn_out, dsets, sites=None, n_writes=1, var_meta=None, max_workers=None, log_file='collect_daily.log', log_level='DEBUG')[source]
Collect daily data model files from a dir to one output file.
Assumes the file list is chunked in time (row chunked).
- Filename requirements:
Expects file names with leading “YYYYMMDD_”.
Must have var in the file name.
Should end with “.h5”.
- Parameters:
collect_dir (str) – Directory of chunked files. Each file should be one variable for one day.
fn_out (str) – File path of final output file.
dsets (list | str) – List of datasets / variable names to collect. Can also be a single dataset or json.dumps(dsets).
sites (None | np.ndarray) – Subset of site indices to collect. None collects all sites.
n_writes (None | int) – Number of file list divisions to write per dataset. For example, if ghi and dni are being collected and n_writes is set to 2, half of the source ghi files will be collected at once and then written, then the second half of ghi files, then dni.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo. This is used if f_out has not yet been initialized.
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.
log_file (str | None) – Target log file. None logs to stdout.
log_level (str | None) – Desired log level, None will not initialize logging.
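The n_writes division described above amounts to splitting each dataset's file list into chunks that are collected and written one at a time. A minimal sketch of that split (a hypothetical helper, not the library's implementation):

```python
def split_flist(flist, n_writes):
    """Sketch: divide an ordered file list into n_writes roughly equal
    chunks; collecting and writing one chunk before reading the next
    bounds peak memory at one chunk's worth of data."""
    size = -(-len(flist) // n_writes)  # ceiling division
    return [flist[i:i + size] for i in range(0, len(flist), size)]
```

With n_writes=2 and four ghi files, the first two files are collected and written, then the last two, before moving on to the next dataset.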
- static get_dset_attrs(h5dir, ignore_dsets=('coordinates', 'time_index', 'meta'))[source]
Get output file dataset attributes for a set of datasets.
- Parameters:
h5dir (str) – Path to directory containing multiple h5 files with all available dsets. Can also be a single h5 filepath.
ignore_dsets (tuple | list) – List of datasets to ignore (will not be aggregated).
- Returns:
dsets (list) – List of datasets.
attrs (dict) – Dictionary of dataset attributes keyed by dset name.
chunks (dict) – Dictionary of chunk tuples keyed by dset name.
dtypes (dict) – Dictionary of numpy datatypes keyed by dset name.
ti (pd.DatetimeIndex) – Time index of source files in h5dir.
- classmethod collect_dir(meta_final, collect_dir, collect_tag, fout, dsets=None, max_workers=None, log_file='collect_dir.log', log_level='DEBUG')[source]
Perform final collection of dsets for given collect_dir.
- Parameters:
meta_final (str | pd.DataFrame) – Final meta data with index = gid.
collect_dir (str) – Directory path containing chunked h5 files to collect.
collect_tag (str) – String to be found in files that are being collected.
fout (str) – File path to the output collected file (will be initialized by this method).
dsets (list | tuple | None) – Select datasets to collect (None will default to all dsets).
max_workers (int | None) – Number of workers to use in parallel. 1 runs serial, None will use all available workers.
log_file (str | None) – Target log file. None logs to stdout.
log_level (str | None) – Desired log level, None will not initialize logging.