nsrdb.nsrdb.NSRDB

class NSRDB(out_dir, year, grid, freq='5min', var_meta=None, make_out_dirs=True)[source]

Bases: object

Entry point for NSRDB data pipeline execution.

Parameters:
  • out_dir (str) – Project directory.

  • year (int | str) – Processing year.

  • grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first CSV column must be the NSRDB site gids.

  • freq (str) – Final desired NSRDB temporal frequency.

  • var_meta (str | None) – File path to the NSRDB variables metadata file. None will use the default file from the GitHub repo.

  • make_out_dirs (bool) – Flag to make output directories for logs, daily, collect, and final outputs.
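As a concrete illustration of the grid contract, the snippet below builds a minimal reference grid CSV. The coordinate column names are assumptions for this sketch (only the leading gid column is required by the documented contract), and the commented constructor call is hypothetical:

```python
import pandas as pd

# Hypothetical minimal grid: the first column must be the NSRDB site gids;
# the coordinate column names below are assumed for illustration.
grid = pd.DataFrame({'gid': [0, 1, 2],
                     'latitude': [39.74, 39.79, 39.84],
                     'longitude': [-105.18, -105.18, -105.18]})
grid.to_csv('ref_grid.csv', index=False)

# The pipeline entry point would then be created along these lines:
# nsrdb = NSRDB('./nsrdb_2020', 2020, 'ref_grid.csv', freq='5min')
```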

Methods

collect_data_model(out_dir, year, grid, ...)

Collect daily data model files to a single site-chunked output file.

collect_final(out_dir, year, grid[, ...])

Collect chunked files to single final output files.

date_to_doy(date)

Convert a date to a day of year integer.

doy_to_datestr(year, doy)

Convert day of year to YYYYMMDD string format.

gap_fill_clouds(out_dir, year, i_chunk[, ...])

Gap fill cloud properties in a collected data model output file.

init_output_h5(f_out, dsets, time_index, meta)

Initialize a target output h5 file if it does not already exist or if force is True.

make_out_dirs()

Ensure that all output directories exist.

ml_cloud_fill(out_dir, date[, fill_all, ...])

Gap fill cloud properties using a physics-guided neural network (phygnn).

run_all_sky(out_dir, year, grid[, freq, ...])

Run the all-sky physics model from collected .h5 files.

run_daily_all_sky(out_dir, year, grid, date)

Run the all-sky physics model from daily data model output files.

run_data_model(out_dir, date, grid[, ...])

Run the daily data model and save output files.

run_full(date, grid, freq[, var_meta, ...])

Run the full NSRDB pipeline in-memory using serial compute.

to_datetime(date)

Convert a date string or integer to a datetime object.

Attributes

OUTS

meta

Get the NSRDB meta dataframe from the grid file.

time_index_year

Get the NSRDB full-year time index.

make_out_dirs()[source]

Ensure that all output directories exist.

property time_index_year

Get the NSRDB full-year time index.

Returns:

nsrdb_ti (pd.DatetimeIndex) – Pandas datetime index for the current year at the NSRDB resolution.
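The full-year index presumably amounts to a pandas date range at the NSRDB resolution. A minimal sketch, assuming a simple left-closed range (not the actual property implementation):

```python
import pandas as pd

year, freq = 2020, '5min'
# Sketch: full-year time index at the NSRDB resolution, excluding the
# first timestamp of the following year.
nsrdb_ti = pd.date_range(start=f'{year}-01-01',
                         end=f'{year + 1}-01-01', freq=freq)[:-1]
print(len(nsrdb_ti))  # 105408 steps for the 2020 leap year
```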

property meta

Get the NSRDB meta dataframe from the grid file.

Returns:

meta (pd.DataFrame) – DataFrame of metadata from the grid file CSV. The first column must be the NSRDB site gids.

static init_output_h5(f_out, dsets, time_index, meta, force=False, var_meta=None)[source]

Initialize a target output h5 file if it does not already exist or if force is True.

Parameters:
  • f_out (str) – File path to final .h5 file.

  • dsets (list) – List of dataset / variable names that are to be contained in f_out.

  • time_index (pd.DatetimeIndex) – Time index to initialize in the file.

  • meta (pd.DataFrame) – Metadata to initialize in the file.

  • force (bool) – Flag to overwrite / force the creation of f_out even if a previous file exists.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
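The initialization can be pictured roughly as below. This is a simplified h5py sketch, not the actual implementation: the real method also applies the variable metadata (dtypes, scale factors, attributes), which is omitted here:

```python
import os

import h5py
import pandas as pd


def init_output_h5(f_out, dsets, time_index, meta, force=False):
    """Simplified sketch: create f_out with empty (time, sites) datasets
    unless the file already exists and force is False."""
    if os.path.exists(f_out) and not force:
        return
    with h5py.File(f_out, 'w') as f:
        # Store the time index as byte strings and the (numeric) meta as
        # a structured array; real NSRDB files carry richer attributes.
        f.create_dataset('time_index',
                         data=time_index.astype(str).values.astype('S'))
        f.create_dataset('meta', data=meta.to_records(index=False))
        for dset in dsets:
            f.create_dataset(dset, shape=(len(time_index), len(meta)),
                             dtype='float32')
```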

static doy_to_datestr(year, doy)[source]

Convert day of year to YYYYMMDD string format.

Parameters:
  • year (int) – Year of interest

  • doy (int) – Enumerated day of year.

Returns:

date (str) – Single day to extract ancillary data for. str in YYYYMMDD format.
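The conversion is straightforward with the standard library; a hypothetical re-implementation, not the class's actual code:

```python
import datetime


def doy_to_datestr(year, doy):
    """Sketch: offset January 1st by doy - 1 days, format as YYYYMMDD."""
    date = datetime.date(year, 1, 1) + datetime.timedelta(days=doy - 1)
    return date.strftime('%Y%m%d')


print(doy_to_datestr(2020, 60))  # '20200229' (2020 is a leap year)
```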

classmethod date_to_doy(date)[source]

Convert a date to a day of year integer.

Parameters:

date (datetime.date | str | int) – Single day to extract ancillary data for. Can be str or int in YYYYMMDD format.

Returns:

doy (int) – Day of year.
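The equivalent logic, sketched with the standard library (a hypothetical re-implementation):

```python
import datetime


def date_to_doy(date):
    """Sketch: normalize YYYYMMDD str/int input, then count the offset
    from January 1st of the same year, plus one."""
    if not isinstance(date, datetime.date):
        date = datetime.datetime.strptime(str(date), '%Y%m%d').date()
    return (date - datetime.date(date.year, 1, 1)).days + 1


print(date_to_doy(20200229))  # 60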

static to_datetime(date)[source]

Convert a date string or integer to a datetime object.

Parameters:

date (datetime.date | str | int) – Single day to extract ancillary data for. Can be str or int in YYYYMMDD format.

Returns:

date (datetime.date) – Date object.
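A hypothetical sketch of the normalization, mirroring the documented input types:

```python
import datetime


def to_datetime(date):
    """Sketch: accept a datetime.date, a 'YYYYMMDD' string, or a
    YYYYMMDD integer, and return a datetime.date."""
    if isinstance(date, datetime.date):
        return date
    return datetime.datetime.strptime(str(date), '%Y%m%d').date()


print(to_datetime('20200115'))  # 2020-01-15
```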

classmethod run_data_model(out_dir, date, grid, dist_lim=1.0, var_list=None, freq='5min', var_meta=None, factory_kwargs=None, mlclouds=False, max_workers=None, max_workers_regrid=None, log_file='data_model.log', log_level='DEBUG')[source]

Run the daily data model and save output files.

Parameters:
  • out_dir (str) – Project directory.

  • date (datetime.date | str | int) – Single day to extract ancillary data for. Can be str or int in YYYYMMDD format.

  • grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must be the NSRDB site gid’s.

  • dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this value from GOES data pixels will trigger a warning and be given missing cloud types and properties, resulting in a full clearsky timeseries.

  • var_list (list | tuple | None) – Variables to process with the data model. None will default to all NSRDB variables.

  • freq (str) – Final desired NSRDB temporal frequency.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • factory_kwargs (dict | None) – Optional namespace of kwargs to use to initialize variable data handlers from the data model’s variable factory. Keyed by variable name. Values can be “source_dir”, “handler”, etc. The source_dir for cloud variables can be a normal directory path or /directory/prefix*suffix, where /directory/ can have further subdirectories.

  • mlclouds (bool) – Flag to add extra variables to the variable processing list if mlclouds gap fill is expected to be run as the next pipeline step.

  • max_workers (int | None) – Number of workers to run in parallel. 1 runs serial, None uses all available workers.

  • max_workers_regrid (None | int) – Max parallel workers allowed for cloud regrid processing. None uses all available workers. 1 runs regrid in serial.

  • log_file (str) – File to log to. Will be put in output directory.

  • log_level (str | None) – Logging level (DEBUG, INFO). If None, no logging will be initialized.

classmethod collect_data_model(out_dir, year, grid, n_chunks, i_chunk, i_fname, n_writes=1, freq='5min', var_meta=None, log_file='collect_dm.log', log_level='DEBUG', max_workers=None, final=False, final_file_name=None)[source]

Collect daily data model files to a single site-chunked output file.

Parameters:
  • out_dir (str) – Project directory.

  • year (int | str) – Year of analysis

  • grid (str) – Final/full NSRDB grid file. The first column must be the NSRDB site gids.

  • n_chunks (int) – Number of chunks (site-wise) to collect to.

  • i_chunk (int) – Chunk index (site-wise, indexing n_chunks) to run.

  • i_fname (int) – File name index from sorted NSRDB.OUTS keys to run collection for.

  • n_writes (None | int) – Number of file list divisions to write per dataset. For example, if ghi and dni are being collected and n_writes is set to 2, half of the source ghi files will be collected at once and then written, then the second half of ghi files, then dni.

  • freq (str) – Final desired NSRDB temporal frequency.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • log_file (str) – File to log to. Will be put in output directory.

  • log_level (str | None) – Logging level (DEBUG, INFO). If None, no logging will be initialized.

  • max_workers (int | None) – Number of workers to run in parallel. 1 runs serial, None uses all available workers.

  • final (bool) – Flag signifying that this is the last step in the NSRDB pipeline. This will collect the data to the out_dir/final/ directory instead of the out_dir/collect/ directory.

  • final_file_name (str | None) – Final file name for filename outputs if this is the terminal job.
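The n_writes behavior described above can be sketched as a simple list division; this helper is hypothetical and not part of the class:

```python
def split_files(files, n_writes):
    """Sketch: divide the daily source files into n_writes near-equal
    groups; each group is collected and written before the next loads."""
    k, m = divmod(len(files), n_writes)
    return [files[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_writes)]


daily = [f'202001{d:02d}.h5' for d in range(1, 11)]
print([len(s) for s in split_files(daily, 2)])  # [5, 5]
```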

classmethod collect_final(out_dir, year, grid, collect_dir=None, freq='5min', var_meta=None, i_fname=None, tmp=False, log_file='final_collection.log', log_level='DEBUG')[source]

Collect chunked files to single final output files.

Parameters:
  • out_dir (str) – Project directory.

  • year (int | str) – Year of analysis

  • grid (str) – Final/full NSRDB grid file. The first column must be the NSRDB site gids.

  • collect_dir (str) – Directory with chunked files to be collected.

  • freq (str) – Final desired NSRDB temporal frequency.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • i_fname (int | None) – Optional index to collect just a single output file. Indexes the sorted OUTS class attribute keys.

  • tmp (bool) – Flag to use temporary scratch storage, then move to out_dir when finished. Doesn’t seem to be faster than collecting to normal scratch on the HPC.

  • log_file (str) – File to log to. Will be put in output directory.

  • log_level (str | None) – Logging level (DEBUG, INFO). If None, no logging will be initialized.

classmethod gap_fill_clouds(out_dir, year, i_chunk, rows=slice(None, None, None), cols=slice(None, None, None), col_chunk=None, var_meta=None, log_file='cloud_fill.log', log_level='DEBUG')[source]

Gap fill cloud properties in a collected data model output file.

Parameters:
  • out_dir (str) – Project directory.

  • year (int | str) – Year of analysis

  • i_chunk (int) – Chunk index (indexing n_chunks) to run.

  • rows (slice) – Subset of rows to gap fill.

  • cols (slice) – Subset of columns to gap fill.

  • col_chunk (None | int) – Optional chunking method to gap fill one column chunk at a time to reduce memory requirements. If provided, this should be an integer specifying how many columns to work on at one time.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • log_file (str) – File to log to. Will be put in output directory.

  • log_level (str | None) – Logging level (DEBUG, INFO). If None, no logging will be initialized.
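The col_chunk option amounts to iterating over column slices; a hypothetical sketch of that memory-limiting loop:

```python
def column_chunks(n_cols, col_chunk):
    """Sketch: yield column slices so at most col_chunk columns are
    loaded and gap filled at a time."""
    for i0 in range(0, n_cols, col_chunk):
        yield slice(i0, min(i0 + col_chunk, n_cols))


print([(s.start, s.stop) for s in column_chunks(10, 4)])
# [(0, 4), (4, 8), (8, 10)]
```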

classmethod ml_cloud_fill(out_dir, date, fill_all=False, model_path=None, var_meta=None, log_file='cloud_fill.log', log_level='DEBUG', col_chunk=None, max_workers=None)[source]

Gap fill cloud properties using a physics-guided neural network (phygnn).

Parameters:
  • out_dir (str) – Project directory.

  • date (datetime.date | str | int) – Single day data model output to run cloud fill on. Can be str or int in YYYYMMDD format.

  • fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.

  • model_path (str | None) – Directory to load phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • log_file (str) – File to log to. Will be put in output directory.

  • log_level (str | None) – Logging level (DEBUG, INFO). If None, no logging will be initialized.

  • col_chunk (None | int) – Optional chunking method to gap fill one column chunk at a time to reduce memory requirements. If provided, this should be an integer specifying how many columns to work on at one time.

  • max_workers (None | int) – Maximum workers to clean data in parallel. 1 is serial and None uses all available workers.

classmethod run_all_sky(out_dir, year, grid, freq='5min', var_meta=None, col_chunk=10, rows=slice(None, None, None), cols=slice(None, None, None), max_workers=None, log_file='all_sky.log', log_level='DEBUG', i_chunk=None, disc_on=False)[source]

Run the all-sky physics model from collected .h5 files.

Parameters:
  • out_dir (str) – Project directory.

  • year (int | str) – Year of analysis

  • grid (str) – Final/full NSRDB grid file. The first column must be the NSRDB site gids.

  • freq (str) – Final desired NSRDB temporal frequency.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • col_chunk (int) – Chunking method to run all sky one column chunk at a time to reduce memory requirements. This is an integer specifying how many columns to work on at one time.

  • rows (slice) – Subset of rows to run.

  • cols (slice) – Subset of columns to run.

  • max_workers (int | None) – Number of workers to run in parallel. 1 will run serial, None will use all available.

  • log_file (str) – File to log to. Will be put in output directory.

  • log_level (str | None) – Logging level (DEBUG, INFO). If None, no logging will be initialized.

  • i_chunk (None | int) – Enumerated file index if running on site chunk.

  • disc_on (bool) – Compute cloudy-sky DNI with the DISC model (True) or the FARMS-DNI model (False).

classmethod run_daily_all_sky(out_dir, year, grid, date, freq='5min', var_meta=None, col_chunk=500, rows=slice(None, None, None), cols=slice(None, None, None), max_workers=None, log_file='all_sky.log', log_level='DEBUG', disc_on=False)[source]

Run the all-sky physics model from daily data model output files.

Parameters:
  • out_dir (str) – Project directory.

  • year (int | str) – Year of analysis

  • grid (str) – Final/full NSRDB grid file. The first column must be the NSRDB site gids.

  • date (datetime.date | str | int) – Single day of data model output to run all-sky on. Can be str or int in YYYYMMDD format.

  • freq (str) – Final desired NSRDB temporal frequency.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • col_chunk (int) – Chunking method to run all sky one column chunk at a time to reduce memory requirements. This is an integer specifying how many columns to work on at one time.

  • rows (slice) – Subset of rows to run.

  • cols (slice) – Subset of columns to run.

  • max_workers (int | None) – Number of workers to run in parallel. 1 will run serial, None will use all available.

  • log_file (str) – File to log to. Will be put in output directory.

  • log_level (str | None) – Logging level (DEBUG, INFO). If None, no logging will be initialized.

  • disc_on (bool) – Compute cloudy-sky DNI with the DISC model (True) or the FARMS-DNI model (False).

classmethod run_full(date, grid, freq, var_meta=None, factory_kwargs=None, fill_all=False, model_path=None, dist_lim=1.0, max_workers=None, low_mem=False, log_file=None, log_level='INFO', disc_on=False)[source]

Run the full NSRDB pipeline in-memory using serial compute.

Parameters:
  • date (datetime.date | str | int) – Single day to extract ancillary data for. Can be str or int in YYYYMMDD format.

  • grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first CSV column must be the NSRDB site gids.

  • freq (str) – Final desired NSRDB temporal frequency.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • factory_kwargs (dict | None) – Optional namespace of kwargs to use to initialize variable data handlers from the data model’s variable factory. Keyed by variable name. Values can be “source_dir”, “handler”, etc. The source_dir for cloud variables can be a normal directory path or /directory/prefix*suffix, where /directory/ can have further subdirectories.

  • fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.

  • model_path (str | None) – Directory to load phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.

  • dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this value from GOES data pixels will trigger a warning and be given missing cloud types and properties, resulting in a full clearsky timeseries.

  • max_workers (int | None) – Number of workers to use for NSRDB computation. 1 runs in serial; otherwise runs in parallel. None uses all available cores. Defaults to None.

  • low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is: (n_time x n_sites x n_nodes_per_layer). low_mem=True will reduce this to (1000 x n_nodes_per_layer)

  • log_file (str) – File to log to. Will be put in output directory.

  • log_level (str | None) – Logging level (DEBUG, INFO). If None, no logging will be initialized.

  • disc_on (bool) – Compute cloudy-sky DNI with the DISC model (True) or the FARMS-DNI model (False).

Returns:

data_model (nsrdb.data_model.DataModel) – DataModel instance with all processed NSRDB variables available in the DataModel.processed_data attribute.