nsrdb.data_model.data_model.DataModel

class DataModel(date, nsrdb_grid, nsrdb_freq='5min', var_meta=None, factory_kwargs=None, scale=True, max_workers=None)

Bases: object

Datamodel for single-day ancillary data processing to NSRDB.

Parameters:

date (datetime.date) – Single day to extract MERRA2 data for.
nsrdb_grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must contain the NSRDB site gids.
nsrdb_freq (str) – Final desired NSRDB temporal frequency.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
factory_kwargs (dict | None) – Optional namespace of kwargs used to initialize variable data handlers from the data model’s variable factory. Keyed by variable name; each value holds handler kwargs such as “source_dir” or “handler”. For cloud variables, source_dir can be a normal directory path or a pattern of the form /directory/prefix*suffix, where /directory/ can contain further subdirectories.
scale (bool) – Flag to scale source data to reduced (integer) precision after data model processing.
max_workers (int | None) – Maximum number of workers to use in parallel. 1 runs serial, None uses all available workers.
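
The factory_kwargs namespace and the two accepted forms of source_dir can be sketched as follows. The variable names and paths are hypothetical, and split_source_dir is an illustrative helper showing one way a /directory/prefix*suffix pattern might be interpreted. It is not the NSRDB implementation and does not search subdirectories.

```python
import fnmatch
import os

# Hypothetical variable names and paths, for illustration only.
factory_kwargs = {
    "air_temperature": {"source_dir": "/data/merra"},
    "cloud_type": {"source_dir": "/data/goes/2018/goes_*_east.nc"},
}


def split_source_dir(source_dir):
    """Split a source_dir into (directory, file pattern).

    A plain directory path yields the pattern "*" (match every file).
    """
    if "*" in source_dir:
        directory, pattern = os.path.split(source_dir)
        return directory, pattern
    return source_dir, "*"


directory, pattern = split_source_dir(factory_kwargs["cloud_type"]["source_dir"])

# Filter a hypothetical directory listing with the extracted pattern.
matches = fnmatch.filter(["goes_001_east.nc", "goes_001_west.nc"], pattern)
```

Here directory is the part before the final path separator and pattern is the prefix*suffix file name filter.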

Methods

`check_merra_cloud_source`(var_list, ...)	Check if the cloud data source is a merra file and adjust variable lists and factory kwargs accordingly.
`convert_units`(var, data)	Convert MERRA data to NSRDB units.
`dump`(var, fpath_out, data[, purge, mode])	Write the processed data for one variable for a single day to disk.
`get_cloud_nn`(fp_cloud, cloud_kwargs, nsrdb_grid)	Nearest neighbors computation for cloud data regrid.
`get_dependencies`(var)	Get dependencies for a derived variable
`get_geo_nn`(df1, df2[, interp_method, ...])	Get the geographic nearest neighbor distances (km) and indices.
`get_single_cloud_data`(fp_cloud, ...[, dist_lim])	Get the cloud data from a single cloud source file, mapped onto the NSRDB grid.
`get_weights`(var_obj)	Get the irradiance model weights for AOD/Alpha.
`init_cloud_data`(cloud_obj_all)	Initialize a dictionary for all cloud datasets
`is_cloud_var`(var)	Determine whether or not the variable is a cloud variable from the CLAVR-x / GOES data
`is_derived_var`(var)	Determine whether or not the variable is derived from primary source datasets
`is_merra_cloud`(handler)	Check to see if cloud variables have merra2 source files for the current day
`run_clouds`(cloud_vars, date, nsrdb_grid[, ...])	Run cloud processing for multiple cloud variables.
`run_multiple`(var_list, date, nsrdb_grid[, ...])	Run ancillary data processing for multiple variables for single day.
`run_pre_flight`(var_list)	Run pre-flight checks, raise if specified paths/files are not found.
`run_single`(var, date, nsrdb_grid[, ...])	Run ancillary data processing for one variable for a single day.
`scale_data`(var, data)	Perform safe scaling and datatype conversion of data.
`unscale_data`(var, data)	Perform safe un-scaling and datatype conversion of data.

Attributes

`ALL_SKY_VARS`
`ALL_VARS`
`ALL_VARS_ML`
`CACHE_DIR`
`CLOUD_VARS`
`MERRA_VARS`
`MLCLOUDS_VARS`
`WEIGHTS`
`date`	Get the single-day datetime.date for this instance.
`nsrdb_data_shape`	Get the final NSRDB data shape for a single var.
`nsrdb_grid`	Return the grid.
`nsrdb_ti`	Get the NSRDB target time index.
`processed_data`	Get the processed data dictionary.
`var_meta`	Get the nsrdb variables meta data table.

property date: Get the single-day datetime.date for this instance.

property nsrdb_grid

Return the grid.

Returns: _nsrdb_grid (pd.DataFrame) – Reference grid data.

property nsrdb_ti

Get the NSRDB target time index.

Returns: nsrdb_ti (pd.DatetimeIndex) – Pandas datetime index for the current day at the NSRDB resolution.

property nsrdb_data_shape

Get the final NSRDB data shape for a single var.

Returns: _nsrdb_data_shape (tuple) – Two-entry shape tuple.
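
As a rough illustration of how the target time index and the two-entry data shape relate, the sketch below builds a single-day index at a 5-minute frequency using only the standard library. The (time, sites) ordering is an assumption taken from the (n_time, n_sites) cloud array shape documented for init_cloud_data, and n_sites is a hypothetical grid size.

```python
import datetime


def daily_time_index(date, freq_minutes=5):
    """Return all timestamps in `date` spaced `freq_minutes` apart."""
    start = datetime.datetime.combine(date, datetime.time(0, 0))
    step = datetime.timedelta(minutes=freq_minutes)
    n_steps = (24 * 60) // freq_minutes
    return [start + i * step for i in range(n_steps)]


date = datetime.date(2018, 1, 1)
time_index = daily_time_index(date, freq_minutes=5)

n_sites = 100  # hypothetical number of rows in the NSRDB grid
data_shape = (len(time_index), n_sites)  # assumed (time, sites) ordering
```

With pandas, pd.date_range(start, periods=288, freq='5min') would give an equivalent index.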

property var_meta

Get the nsrdb variables meta data table.

Returns: pd.DataFrame

property processed_data

Get the processed data dictionary.

Returns: _processed (dict) – Namespace of processed data set with __setitem__. Keys should be NSRDB variable names.

get_geo_nn(df1, df2, interp_method='NN', nn_method='haversine', labels=('latitude', 'longitude'), cache=False)

Get the geographic nearest neighbor distances (km) and indices.

Parameters:

df1/df2 (pd.DataFrame) – Dataframes containing coordinate columns with the corresponding labels.
interp_method (str) – Spatial interpolation method - either NN or IDW
nn_method (str | None) – NSRDB nearest_neighbor tree search method, either “haversine” or “kdtree”. None defaults to geo_nn.
labels (tuple | list) – Column labels corresponding to the lat/lon columns in df1/df2.
cache (bool | str) – Flag to cache nearest neighbor results or retrieve cached results instead of performing NN query. Strings are evaluated as the csv file name to cache.

Returns:

dist (ndarray) – Distance array in km returned if return_dist input arg set to True.
indices (ndarray) – 1D array of row indices in df1 that match df2. df1[df1.index[indices[i]]] is closest to df2[df2.index[i]].
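
The relationship between the returned distances and indices can be illustrated with a brute-force haversine search. This is a minimal sketch of the nearest-neighbor idea only: the real method uses a tree search, and the function names, Earth radius constant, and coordinates below are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_neighbor(source, target):
    """For each target point, find the closest source point.

    Returns (dist, indices): indices[i] is the position in `source`
    closest to target[i], and dist[i] is that distance in km.
    """
    dist, indices = [], []
    for t_lat, t_lon in target:
        d = [haversine_km(s_lat, s_lon, t_lat, t_lon) for s_lat, s_lon in source]
        i_min = min(range(len(d)), key=d.__getitem__)
        indices.append(i_min)
        dist.append(d[i_min])
    return dist, indices


# df1-like source coordinates and df2-like target coordinates as (lat, lon)
source = [(40.0, -105.0), (35.0, -100.0), (45.0, -110.0)]
target = [(39.5, -104.8), (44.0, -109.0)]
dist, indices = nearest_neighbor(source, target)
```

The brute-force loop costs one distance evaluation per source and target pair, which is why a tree-based search is preferable for full-size grids.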

static get_cloud_nn(fp_cloud, cloud_kwargs, nsrdb_grid, dist_lim=1.0)

Nearest neighbors computation for cloud data regrid.

Parameters:

fp_cloud (str) – Single cloud source file either .nc or .h5
cloud_kwargs (dict) – Kwargs for the initialization of CloudVarSingleH5 or CloudVarSingleNC along with fp_cloud
nsrdb_grid (pd.DataFrame) – Reference grid data for NSRDB.
dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this from any GOES data pixel trigger a warning and are assigned missing cloud types and properties, resulting in a full clear-sky time series.

Returns:

index (np.ndarray | None) – KDTree query results mapping cloud data to the NSRDB grid. e.g. nsrdb_data = cloud_data[index]. None if bad grid data.
cloud_obj_single (CloudVarSingleH5 | CloudVarSingleNC) – Initialized cloud variable handler for a single cloud source file. The .tree property should be initialized with this return obj
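
The effect of dist_lim on the regrid index can be sketched with a brute-force search in decimal degrees. This is an assumption-laden illustration: the documented method uses a KDTree, and the None sentinel for unmatched sites, the regrid_index name, and all coordinates and values here are hypothetical.

```python
import math


def regrid_index(cloud_coords, grid_coords, dist_lim=1.0):
    """Map each grid site to its nearest cloud pixel within dist_lim degrees.

    Returns one entry per grid site: the cloud pixel position, or None
    when no pixel lies within dist_lim.
    """
    index = []
    for g_lat, g_lon in grid_coords:
        best_i, best_d = None, math.inf
        for i, (c_lat, c_lon) in enumerate(cloud_coords):
            # Euclidean distance in decimal degrees, not a true distance
            d = math.hypot(c_lat - g_lat, c_lon - g_lon)
            if d < best_d:
                best_i, best_d = i, d
        index.append(best_i if best_d <= dist_lim else None)
    return index


cloud_coords = [(30.0, -90.0), (30.5, -90.5)]
grid_coords = [(30.1, -90.1), (30.6, -90.4), (40.0, -100.0)]
cloud_type = [3, 7]  # hypothetical per-pixel cloud data

index = regrid_index(cloud_coords, grid_coords, dist_lim=1.0)

# Equivalent of nsrdb_data = cloud_data[index], with gaps left as None
regridded = [cloud_type[i] if i is not None else None for i in index]
```

The third grid site is far from every pixel, so it receives no cloud data, which corresponds to the missing-data case described for dist_lim.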

get_weights(var_obj)

Get the irradiance model weights for AOD/Alpha.

Parameters: var_obj (MerraVar) – Merra processing variable object.
Returns: weights (np.ndarray | NoneType) – 1D array of weighting values for the given var in the current month. Returns None if var does not require weighting.

scale_data(var, data)

Perform safe scaling and datatype conversion of data.

Parameters:

var (str) – NSRDB variable name.
data (np.ndarray) – Data array to scale.

Returns:

data (np.ndarray) – Scaled data array with final dtype.

unscale_data(var, data)

Perform safe un-scaling and datatype conversion of data.

Parameters:

var (str) – NSRDB variable name.
data (np.ndarray) – Scaled data array to unscale.

Returns:

data (np.ndarray) – Unscaled float32 data array.
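
The round trip between scale_data and unscale_data can be pictured with the sketch below. It assumes that scaling multiplies by a per-variable scale factor, rounds, and clips to the range of the target integer dtype, and that unscaling divides by the same factor. The scale factor, the int16 range, and the helper names are all assumptions for illustration; the actual values come from the variable meta data table.

```python
def scale_value(value, scale_factor, dtype_min, dtype_max):
    """Scale a float to an integer, clipping to the target dtype range."""
    scaled = round(value * scale_factor)
    return max(dtype_min, min(dtype_max, scaled))


def unscale_value(scaled, scale_factor):
    """Convert a scaled integer back to a float."""
    return scaled / scale_factor


INT16_MIN, INT16_MAX = -32768, 32767
scale_factor = 100  # hypothetical: keeps two decimal places

temperatures = [21.37, -5.0, 400.0]
scaled = [scale_value(t, scale_factor, INT16_MIN, INT16_MAX) for t in temperatures]
unscaled = [unscale_value(s, scale_factor) for s in scaled]
```

The last value exceeds what int16 can hold at this scale factor, so it is clipped rather than allowed to overflow, which is one reading of "safe" scaling.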

run_pre_flight(var_list)

Run pre-flight checks, raise if specified paths/files are not found.

Parameters: var_list (list) – List of variable names.

static convert_units(var, data)

Convert MERRA data to NSRDB units.

Parameters:

var (str) – NSRDB Variable name.
data (np.ndarray) – Data for var.

Returns:

data (np.ndarray) – Data with NSRDB units if conversion is required for “var”.

classmethod check_merra_cloud_source(var_list, cloud_vars, date, var_meta, factory_kwargs)

Check if the cloud data source is a merra file and adjust variable lists and factory kwargs accordingly.

Parameters:

var_list (list) – List of variables being processed without the GOES cloud data handler
cloud_vars (list) – List of cloud data variables from GOES being processed with the cloud data handler
date (datetime.date) – Date of target processing
var_meta (pd.DataFrame | None | str) – CSV file or dataframe containing meta data for all NSRDB variables.
factory_kwargs (dict) – Optional namespace of kwargs used to initialize variable data handlers from the data model’s variable factory. Keyed by variable name; each value holds handler kwargs such as “source_dir” or “handler”. For cloud variables, source_dir can be a normal directory path or a pattern of the form /directory/prefix*suffix, where /directory/ can contain further subdirectories.

Returns:

var_list (list) – List of variables being processed without the GOES cloud data handler - cloud variables have been added to this list if merra is source
cloud_vars (list) – List of variables being processed with the GOES cloud data handler. This is empty if the data source is merra.
factory_kwargs (dict) – Optional namespace of kwargs to initialize variable data. If cloud variables are being sourced from merra, appropriate kwargs are added to this dict.
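
The reshuffling described by these return values can be sketched as follows. This is an illustration of the documented outcome only, not the library's code: the reassign_cloud_vars helper, the merra_is_source flag, the variable names, and the handler kwargs are all hypothetical.

```python
def reassign_cloud_vars(var_list, cloud_vars, factory_kwargs,
                        merra_is_source, merra_kwargs):
    """Move cloud variables into var_list when MERRA is the cloud source.

    Returns new (var_list, cloud_vars, factory_kwargs) without mutating
    the inputs.
    """
    var_list = list(var_list)
    cloud_vars = list(cloud_vars)
    factory_kwargs = {k: dict(v) for k, v in (factory_kwargs or {}).items()}
    if merra_is_source:
        for var in cloud_vars:
            if var not in var_list:
                var_list.append(var)
            # Add the MERRA-specific handler kwargs for this variable
            factory_kwargs.setdefault(var, {}).update(merra_kwargs)
        cloud_vars = []
    return var_list, cloud_vars, factory_kwargs


var_list, cloud_vars, factory_kwargs = reassign_cloud_vars(
    var_list=["air_temperature"],
    cloud_vars=["cloud_type", "cld_opd_dcomp"],  # hypothetical names
    factory_kwargs=None,
    merra_is_source=True,
    merra_kwargs={"handler": "MerraVar"},  # hypothetical kwargs
)
```

When merra_is_source is False the three values pass through unchanged, apart from being copied.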

static is_merra_cloud(handler)

Check to see if cloud variables have merra2 source files for the current day

Parameters:

handler (AncillaryVarHandler) – Base data model variable handler

Returns:

check (bool) – True if the source is merra, False if not
out (dict) – New factory kwargs for the variable if source is merra

is_cloud_var(var)

Determine whether or not the variable is a cloud variable from the CLAVR-x / GOES data

Parameters: var (str) – NSRDB variable name
Returns: is_cv (bool) – True if var is a cloud variable

is_derived_var(var)

Determine whether or not the variable is derived from primary source datasets

Parameters: var (str) – NSRDB variable name
Returns: is_derived (bool) – True if var is handled using a derived variable handler

get_dependencies(var)

Get dependencies for a derived variable

Parameters: var (str) – NSRDB variable name
Returns: deps (tuple) – Tuple of string names of dependencies of the derived variable input. Empty tuple if var is not derived.
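
One use of a dependency tuple is to make sure each derived variable is processed after its inputs. The sketch below shows that idea with standalone functions that mirror the documented return types. The dependency table and variable names are hypothetical, and processing_order is an illustrative helper that handles a single level of dependencies only.

```python
# Hypothetical dependency table; the real mapping is defined by the
# NSRDB derived variable handlers.
DERIVED_DEPENDENCIES = {
    "relative_humidity": ("air_temperature", "specific_humidity", "surface_pressure"),
    "dew_point": ("air_temperature", "specific_humidity", "surface_pressure"),
}


def is_derived_var(var):
    """True if var has an entry in the dependency table."""
    return var in DERIVED_DEPENDENCIES


def get_dependencies(var):
    """Tuple of dependency names, or an empty tuple if var is not derived."""
    return DERIVED_DEPENDENCIES.get(var, ())


def processing_order(var_list):
    """Order variables so each derived var follows its dependencies."""
    ordered = []
    for var in var_list:
        for dep in get_dependencies(var):
            if dep not in ordered:
                ordered.append(dep)
        if var not in ordered:
            ordered.append(var)
    return ordered


order = processing_order(["dew_point", "air_temperature"])
```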

init_cloud_data(cloud_obj_all)

Initialize a dictionary for all cloud datasets

Parameters: cloud_obj_all (CloudVar) – Cloud variable handler that can be used to iterate over all single cloud file handlers
Returns: cloud_data (dict) – Data dictionary of cloud datasets mapped to the NSRDB grid. Keys are the cloud variable names, values are 2D numpy arrays. Array shape is (n_time, n_sites).

classmethod get_single_cloud_data(fp_cloud, cloud_kwargs, nsrdb_grid, dist_lim=1.0)

Get the cloud data from a single cloud source file, mapped onto the NSRDB grid.

Parameters:

fp_cloud (str) – Single cloud source file either .nc or .h5
cloud_kwargs (dict) – Kwargs for the initialization of CloudVarSingleH5 or CloudVarSingleNC along with fp_cloud
nsrdb_grid (pd.DataFrame) – Reference grid data for NSRDB.
dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this from any GOES data pixel trigger a warning and are assigned missing cloud types and properties, resulting in a full clear-sky time series.

Returns:

single_data (dict | None) – Dictionary of source data for the single cloud file (single timestep) mapped onto the nsrdb_grid. Keys are nsrdb cloud dataset names and values are 1D (space,) arrays of data matching the nsrdb_grid. Returns None if something went wrong.

dump(var, fpath_out, data, purge=False, mode='w')

Write the processed data for one variable for a single day to disk.

Parameters:

var (str) – NSRDB var name.
fpath_out (str | None) – File path to dump results. If no file path is given, results will be returned as an object.
data (np.ndarray) – NSRDB-resolution data for the given var and the current day.
purge (bool) – Flag to purge data from memory after dumping to disk
mode (str, optional) – Mode to open fpath_out with, by default ‘w’

Returns:

data (str | np.ndarray) – Input data array if no purge, else file path to dump results.

classmethod run_single(var, date, nsrdb_grid, nsrdb_freq='5min', var_meta=None, fpath_out=None, factory_kwargs=None, scale=True, **kwargs)

Run ancillary data processing for one variable for a single day.

Parameters:

var (str) – NSRDB var name.
date (datetime.date) – Single day to extract ancillary data for.
nsrdb_grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must contain the NSRDB site gids.
nsrdb_freq (str) – Final desired NSRDB temporal frequency.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
fpath_out (str | None) – File path to dump results. If no file path is given, results will be returned as an object.
factory_kwargs (dict | None) – Optional namespace of kwargs used to initialize variable data handlers from the data model’s variable factory. Keyed by variable name; each value holds handler kwargs such as “source_dir” or “handler”. For cloud variables, source_dir can be a normal directory path or a pattern of the form /directory/prefix*suffix, where /directory/ can contain further subdirectories.
scale (bool) – Flag to scale source data to reduced (integer) precision after data model processing.
kwargs (dict) – Optional kwargs. Based on the NSRDB var name requested to be processed, this method runs one of several DataModel processing methods (_interpolate, _derive, _process_clouds). These kwargs will get passed to the processing method.

Returns:

data (np.ndarray) – NSRDB-resolution data for the given var and the current day.
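
The choice among the processing methods named in the kwargs description might look like the sketch below. The order of the checks and the contents of the two lookup tuples are assumptions. Only the three method names come from the description above.

```python
CLOUD_VARS = ("cloud_type",)   # hypothetical cloud variable names
DERIVED_VARS = ("dew_point",)  # hypothetical derived variable names


def select_method(var):
    """Return the name of the processing method assumed to handle var."""
    if var in CLOUD_VARS:
        return "_process_clouds"
    if var in DERIVED_VARS:
        return "_derive"
    # Everything else is treated as a directly interpolated variable
    return "_interpolate"


methods = {var: select_method(var)
           for var in ("cloud_type", "dew_point", "air_temperature")}
```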

classmethod run_clouds(cloud_vars, date, nsrdb_grid, nsrdb_freq='5min', dist_lim=1.0, var_meta=None, max_workers_regrid=None, scale=True, fpath_out=None, factory_kwargs=None)

Run cloud processing for multiple cloud variables.

Processing all cloud variables together is most efficient because it minimizes the number of kdtrees built during the regrid.

Parameters:

cloud_vars (list | tuple) – NSRDB cloud variable names.
date (datetime.date) – Single day to extract ancillary data for.
nsrdb_grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must contain the NSRDB site gids.
nsrdb_freq (str) – Final desired NSRDB temporal frequency.
dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this from any GOES data pixel trigger a warning and are assigned missing cloud types and properties, resulting in a full clear-sky time series.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
max_workers_regrid (None | int) – Max parallel workers allowed for cloud regrid processing. None uses all available workers. 1 runs regrid in serial.
scale (bool) – Flag to scale source data to reduced (integer) precision after data model processing.
fpath_out (str | None) – File path to dump results. If no file path is given, results will be returned as an object.
factory_kwargs (dict | None) – Optional namespace of kwargs used to initialize variable data handlers from the data model’s variable factory. Keyed by variable name; each value holds handler kwargs such as “source_dir” or “handler”. For cloud variables, source_dir can be a normal directory path or a pattern of the form /directory/prefix*suffix, where /directory/ can contain further subdirectories.

Returns:

data (dict) – Namespace of nsrdb data numpy arrays keyed by nsrdb variable name.

classmethod run_multiple(var_list, date, nsrdb_grid, nsrdb_freq='5min', dist_lim=1.0, var_meta=None, max_workers=None, max_workers_regrid=None, return_obj=False, scale=True, fpath_out=None, factory_kwargs=None)

Run ancillary data processing for multiple variables for single day.

Parameters:

var_list (list | None) – List of variables to process
date (datetime.date) – Single day to extract ancillary data for.
nsrdb_grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must contain the NSRDB site gids.
nsrdb_freq (str) – Final desired NSRDB temporal frequency.
dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this from any GOES data pixel trigger a warning and are assigned missing cloud types and properties, resulting in a full clear-sky time series.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
max_workers (int | None) – Number of workers to run in parallel. 1 will run serial, None will use all available.
max_workers_regrid (None | int) – Max parallel workers allowed for cloud regrid processing. None uses all available workers. 1 runs regrid in serial.
return_obj (bool) – Flag to return full DataModel object instead of just the processed data dictionary.
scale (bool) – Flag to scale source data to reduced (integer) precision after data model processing.
fpath_out (str | None) – File path to dump results. If no file path is given, results will be returned as an object.
factory_kwargs (dict | None) – Optional namespace of kwargs used to initialize variable data handlers from the data model’s variable factory. Keyed by variable name; each value holds handler kwargs such as “source_dir” or “handler”. For cloud variables, source_dir can be a normal directory path or a pattern of the form /directory/prefix*suffix, where /directory/ can contain further subdirectories.

Returns:

out (dict | DataModel) – Either the dictionary of processed data or the full DataModel object, with the data available through its processed_data property. Controlled by the return_obj flag.
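
A call to run_multiple might be assembled as in the sketch below. The grid path, variable names, and source directory are placeholders that must be replaced with real inputs, and run_example is a hypothetical wrapper. The keyword names and the import path follow the signature and module name documented on this page. The wrapper assumes return_obj=False and fpath_out left unset, so that the result is a dictionary of arrays.

```python
import datetime

# Placeholder inputs: replace the grid path, variable names, and source
# directory with real ones before running.
run_kwargs = {
    "var_list": ["air_temperature", "surface_pressure"],
    "date": datetime.date(2018, 1, 1),
    "nsrdb_grid": "/path/to/nsrdb_grid.csv",
    "nsrdb_freq": "5min",
    "max_workers": 1,     # 1 runs serially
    "return_obj": False,  # return the data dictionary, not the DataModel
    "scale": True,
    "factory_kwargs": {"air_temperature": {"source_dir": "/path/to/merra"}},
}


def run_example(kwargs):
    """Call DataModel.run_multiple; requires nsrdb and real input files."""
    from nsrdb.data_model.data_model import DataModel

    out = DataModel.run_multiple(**kwargs)
    # With return_obj=False, out maps variable names to data arrays
    for var, arr in out.items():
        print(var, arr.shape)
    return out
```

Calling run_example(run_kwargs) only works in an environment where nsrdb is installed and the placeholder paths point to real data.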