nsrdb.data_model.data_model.DataModel
- class DataModel(date, nsrdb_grid, nsrdb_freq='5min', var_meta=None, factory_kwargs=None, scale=True, max_workers=None)[source]
Bases: object
Datamodel for single-day ancillary data processing to NSRDB.
- Parameters:
date (datetime.date) – Single day to extract MERRA2 data for.
nsrdb_grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must be the NSRDB site gids.
nsrdb_freq (str) – Final desired NSRDB temporal frequency.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
factory_kwargs (dict | None) – Optional namespace of kwargs to use to initialize variable data handlers from the data model’s variable factory. Keyed by variable name. Values can be “source_dir”, “handler”, etc. The source_dir for cloud variables can be a normal directory path or /directory/prefix*suffix, where /directory/ can contain further subdirectories.
scale (bool) – Flag to scale source data to reduced (integer) precision after data model processing.
max_workers (int | None) – Maximum number of workers to use in parallel. 1 runs serial, None uses all available workers.
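To illustrate the factory_kwargs layout described above, the following minimal sketch builds a dictionary keyed by variable name and checks file paths against a /directory/prefix*suffix style pattern. The variable names, paths, and the helper matches_source_pattern are hypothetical and chosen only for illustration; the handlers' actual file search logic is not documented on this page.

```python
from fnmatch import fnmatchcase

# Hypothetical factory_kwargs: keyed by variable name, each value holds the
# kwargs for that variable's handler. Names and paths are illustrative only.
factory_kwargs = {
    "cloud_type": {"source_dir": "/data/clouds/goes_*_l2.nc"},
    "air_temperature": {"source_dir": "/data/merra"},
}


def matches_source_pattern(fpath, source_dir):
    """Return True if fpath matches a prefix*suffix style source_dir."""
    return fnmatchcase(fpath, source_dir)


pattern = factory_kwargs["cloud_type"]["source_dir"]
candidates = ["/data/clouds/goes_2018_001_l2.nc", "/data/clouds/other.nc"]
hits = [fp for fp in candidates if matches_source_pattern(fp, pattern)]
```

Only the first candidate path matches the pattern, since it starts with the prefix and ends with the suffix on either side of the wildcard.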
Methods
check_merra_cloud_source
(var_list, ...)Check if the cloud data source is a merra file and adjust variable lists and factory kwargs accordingly.
convert_units
(var, data)Convert MERRA data to NSRDB units.
dump
(var, fpath_out, data[, purge, mode])Dump processed data for one variable for a single day to disk.
get_cloud_nn
(fp_cloud, cloud_kwargs, nsrdb_grid)Nearest neighbors computation for cloud data regrid.
get_dependencies
(var)Get dependencies for a derived variable
get_geo_nn
(df1, df2[, interp_method, ...])Get the geographic nearest neighbor distances (km) and indices.
get_single_cloud_data
(fp_cloud, ...[, dist_lim])Get the source data from a single cloud data file, mapped onto the NSRDB grid.
get_weights
(var_obj)Get the irradiance model weights for AOD/Alpha.
init_cloud_data
(cloud_obj_all)Initialize a dictionary for all cloud datasets
is_cloud_var
(var)Determine whether or not the variable is a cloud variable from the CLAVR-x / GOES data
is_derived_var
(var)Determine whether or not the variable is derived from primary source datasets
is_merra_cloud
(handler)Check to see if cloud variables have merra2 source files for the current day
run_clouds
(cloud_vars, date, nsrdb_grid[, ...])Run cloud processing for multiple cloud variables.
run_multiple
(var_list, date, nsrdb_grid[, ...])Run ancillary data processing for multiple variables for single day.
run_pre_flight
(var_list)Run pre-flight checks, raise if specified paths/files are not found.
run_single
(var, date, nsrdb_grid[, ...])Run ancillary data processing for one variable for a single day.
scale_data
(var, data)Perform safe scaling and datatype conversion of data.
unscale_data
(var, data)Perform safe un-scaling and datatype conversion of data.
Attributes
ALL_SKY_VARS
ALL_VARS
ALL_VARS_ML
CACHE_DIR
CLOUD_VARS
MERRA_VARS
MLCLOUDS_VARS
WEIGHTS
date
Get the single-day datetime.date for this instance.
nsrdb_data_shape
Get the final NSRDB data shape for a single var.
nsrdb_grid
Return the grid.
nsrdb_ti
Get the NSRDB target time index.
processed_data
Get the processed data dictionary.
var_meta
Get the nsrdb variables meta data table.
- property date
Get the single-day datetime.date for this instance.
- property nsrdb_grid
Return the grid.
- Returns:
_nsrdb_grid (pd.DataFrame) – Reference grid data.
- property nsrdb_ti
Get the NSRDB target time index.
- Returns:
nsrdb_ti (pd.DatetimeIndex) – Pandas datetime index for the current day at the NSRDB resolution.
- property nsrdb_data_shape
Get the final NSRDB data shape for a single var.
- Returns:
_nsrdb_data_shape (tuple) – Two-entry shape tuple.
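The relationship between the target time index and the two-entry data shape can be sketched with the standard library alone. This is a rough illustration, not the property implementation: it assumes the index runs from midnight to the last step before the next midnight, that the shape is ordered (n_time, n_sites) as in the init_cloud_data description, and the helper single_day_time_index and the site count are hypothetical. The real property returns a pd.DatetimeIndex.

```python
from datetime import date, datetime, timedelta


def single_day_time_index(day, freq_minutes=5):
    """List the timestamps covering one day at a fixed minute frequency."""
    start = datetime(day.year, day.month, day.day)
    n_steps = 24 * 60 // freq_minutes
    return [start + timedelta(minutes=i * freq_minutes) for i in range(n_steps)]


time_index = single_day_time_index(date(2018, 1, 1), freq_minutes=5)
n_sites = 3  # hypothetical number of rows in the reference grid
data_shape = (len(time_index), n_sites)
```

At the default '5min' frequency this gives 288 timesteps per day, so each variable holds 288 values per grid site.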
- property var_meta
Get the nsrdb variables meta data table.
- Returns:
pd.DataFrame
- property processed_data
Get the processed data dictionary.
- Returns:
_processed (dict) – Namespace of processed data set with __setitem__. Keys should be NSRDB variable names.
- get_geo_nn(df1, df2, interp_method='NN', nn_method='haversine', labels=('latitude', 'longitude'), cache=False)[source]
Get the geographic nearest neighbor distances (km) and indices.
- Parameters:
df1/df2 (pd.DataFrame) – Dataframes containing coordinate columns with the corresponding labels.
interp_method (str) – Spatial interpolation method - either NN or IDW
nn_method (str | None) – NSRDB nearest_neighbor tree search method, either “haversine” or “kdtree”. None defaults to geo_nn.
labels (tuple | list) – Column labels corresponding to the lat/lon columns in df1/df2.
cache (bool | str) – Flag to cache nearest neighbor results or retrieve cached results instead of performing NN query. Strings are evaluated as the csv file name to cache.
- Returns:
dist (ndarray) – Distance array in km returned if return_dist input arg set to True.
indicies (ndarray) – 1D array of row indices in df1 that match df2. df1[df1.index[indicies[i]]] is closest to df2[df2.index[i]].
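The idea behind a haversine nearest neighbor query can be sketched as a brute-force search over plain (latitude, longitude) tuples. This is not the NSRDB implementation, which works on dataframes and tree-based searches; the function names, the sample coordinates, and the Earth radius constant are assumptions of this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def brute_force_nn(source, target):
    """For each target point, find the closest source point.

    source and target are lists of (latitude, longitude) tuples.
    Returns (dist, indices): distances in km and row positions in source.
    """
    dist, indices = [], []
    for t_lat, t_lon in target:
        d_all = [haversine_km(s_lat, s_lon, t_lat, t_lon)
                 for s_lat, s_lon in source]
        i_min = min(range(len(d_all)), key=d_all.__getitem__)
        indices.append(i_min)
        dist.append(d_all[i_min])
    return dist, indices


source = [(40.0, -105.0), (35.0, -100.0), (45.0, -110.0)]  # plays the role of df1
target = [(39.5, -104.5), (44.0, -109.0)]  # plays the role of df2
dist, indices = brute_force_nn(source, target)
```

Here source[indices[i]] is the point closest to target[i], mirroring the df1/df2 relationship in the Returns description. A brute-force search scales with the product of the two table sizes, which is why tree-based methods are preferred for full grids.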
- static get_cloud_nn(fp_cloud, cloud_kwargs, nsrdb_grid, dist_lim=1.0)[source]
Nearest neighbors computation for cloud data regrid.
- Parameters:
fp_cloud (str) – Single cloud source file either .nc or .h5
cloud_kwargs (dict) – Kwargs for the initialization of CloudVarSingleH5 or CloudVarSingleNC along with fp_cloud
nsrdb_grid (pd.DataFrame) – Reference grid data for NSRDB.
dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this value from any GOES data pixel trigger a warning and are assigned missing cloud types and properties, resulting in a full clearsky timeseries.
- Returns:
index (np.ndarray | None) – KDTree query results mapping cloud data to the NSRDB grid. e.g. nsrdb_data = cloud_data[index]. None if bad grid data.
cloud_obj_single (CloudVarSingleH5 | CloudVarSingleNC) – Initialized cloud variable handler for a single cloud source file. The .tree property should be initialized with this return obj
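The effect of dist_lim can be illustrated with a small brute-force regrid over coordinate tuples. This is only a sketch of the idea: the actual method uses a KDTree query, and the MISSING sentinel, the function name regrid_index, and the sample coordinates are hypothetical.

```python
import math

MISSING = -1  # hypothetical sentinel for "no cloud pixel within dist_lim"


def regrid_index(cloud_coords, grid_coords, dist_lim=1.0):
    """Map each grid site to its nearest cloud pixel in degree space.

    Distances are plain Euclidean distances on (latitude, longitude) in
    decimal degrees, not real-world distances. Sites with no pixel within
    dist_lim get the MISSING sentinel.
    """
    index = []
    for g_lat, g_lon in grid_coords:
        best_i, best_d = MISSING, math.inf
        for i, (c_lat, c_lon) in enumerate(cloud_coords):
            d = math.hypot(c_lat - g_lat, c_lon - g_lon)
            if d < best_d:
                best_i, best_d = i, d
        index.append(best_i if best_d <= dist_lim else MISSING)
    return index


cloud_coords = [(40.0, -105.0), (41.0, -104.0)]
grid_coords = [(40.1, -105.1), (40.9, -104.2), (50.0, -90.0)]
index = regrid_index(cloud_coords, grid_coords, dist_lim=1.0)
```

The third grid site lies far outside the cloud data footprint, so it receives the sentinel instead of an index. In the data model such sites are given missing cloud types and properties.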
- get_weights(var_obj)[source]
Get the irradiance model weights for AOD/Alpha.
- Parameters:
var_obj (MerraVar) – Merra processing variable object.
- Returns:
weights (np.ndarray | NoneType) – 1D array of weighting values for the given var in the current month. Returns None if var does not require weighting.
- scale_data(var, data)[source]
Perform safe scaling and datatype conversion of data.
- Parameters:
var (str) – NSRDB variable name.
data (np.ndarray) – Data array to scale.
- Returns:
data (np.ndarray) – Scaled data array with final dtype.
- unscale_data(var, data)[source]
Perform safe un-scaling and datatype conversion of data.
- Parameters:
var (str) – NSRDB variable name.
data (np.ndarray) – Scaled data array to unscale.
- Returns:
data (np.ndarray) – Unscaled float32 data array.
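One plausible reading of "safe" scaling is to multiply by a scale factor, round, and clip to the valid range of the target integer dtype before casting. The sketch below follows that reading with scalar Python values; the scale factor, the int16 bounds, and the function names scale_value and unscale_value are assumptions, and the real methods take their settings from the variable meta data and operate on numpy arrays.

```python
def scale_value(x, scale_factor, dtype_min, dtype_max):
    """Scale a float to an integer, clipping to the target dtype range."""
    scaled = round(x * scale_factor)
    return max(dtype_min, min(dtype_max, scaled))


def unscale_value(x_int, scale_factor):
    """Invert the scaling and return a float."""
    return x_int / scale_factor


# Hypothetical meta for one variable: scale factor 10, stored as int16.
SCALE_FACTOR = 10
INT16_MIN, INT16_MAX = -32768, 32767

raw = [21.37, -5.04, 4000.0]
scaled = [scale_value(x, SCALE_FACTOR, INT16_MIN, INT16_MAX) for x in raw]
restored = [unscale_value(x, SCALE_FACTOR) for x in scaled]
```

The round trip is lossy: values are quantized to the precision implied by the scale factor, and out-of-range values such as 4000.0 are clipped rather than wrapped around.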
- run_pre_flight(var_list)[source]
Run pre-flight checks, raise if specified paths/files are not found.
- Parameters:
var_list (list) – List of variable names
- static convert_units(var, data)[source]
Convert MERRA data to NSRDB units.
- Parameters:
var (str) – NSRDB Variable name.
data (np.ndarray) – Data for var.
- Returns:
data (np.ndarray) – Data with NSRDB units if conversion is required for “var”.
- classmethod check_merra_cloud_source(var_list, cloud_vars, date, var_meta, factory_kwargs)[source]
Check if the cloud data source is a merra file and adjust variable lists and factory kwargs accordingly.
- Parameters:
var_list (list) – List of variables being processed without the GOES cloud data handler
cloud_vars (list) – List of cloud data variables from GOES being processed with the cloud data handler
date (datetime.date) – Date of target processing
var_meta (pd.DataFrame | None | str) – CSV file or dataframe containing meta data for all NSRDB variables.
factory_kwargs (dict) – Optional namespace of kwargs to use to initialize variable data handlers from the data model’s variable factory. Keyed by variable name. Values can be “source_dir”, “handler”, etc. The source_dir for cloud variables can be a normal directory path or /directory/prefix*suffix, where /directory/ can contain further subdirectories.
- Returns:
var_list (list) – List of variables being processed without the GOES cloud data handler - cloud variables have been added to this list if merra is source
cloud_vars (list) – List of variables being processed with the GOES cloud data handler. This is empty if the data source is merra.
factory_kwargs (dict) – Optional namespace of kwargs to initialize variable data. If cloud variables are being sourced from merra, appropriate kwargs are added to this dict.
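The adjustment described in the Returns above can be sketched as a pure function on lists and dictionaries. The function reroute_cloud_vars, the is_merra_source flag, and the handler name are hypothetical; the real classmethod determines the source itself from date, var_meta, and factory_kwargs.

```python
def reroute_cloud_vars(var_list, cloud_vars, factory_kwargs,
                       is_merra_source, merra_kwargs):
    """Move cloud variables to the general list when their source is MERRA.

    Returns new (var_list, cloud_vars, factory_kwargs) objects without
    mutating the inputs.
    """
    factory_kwargs = {k: dict(v) for k, v in (factory_kwargs or {}).items()}
    if not is_merra_source:
        return list(var_list), list(cloud_vars), factory_kwargs
    new_var_list = list(var_list) + [v for v in cloud_vars if v not in var_list]
    for var in cloud_vars:
        # Add the MERRA-specific kwargs for each rerouted cloud variable.
        factory_kwargs.setdefault(var, {}).update(merra_kwargs)
    return new_var_list, [], factory_kwargs


var_list, cloud_vars, fk = reroute_cloud_vars(
    ["air_temperature"],
    ["cloud_type"],
    {"cloud_type": {"source_dir": "/data/merra"}},
    is_merra_source=True,
    merra_kwargs={"handler": "HypotheticalMerraHandler"},
)
```

When the source is MERRA, the cloud variable list comes back empty and the cloud variables are processed alongside the other variables.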
- static is_merra_cloud(handler)[source]
Check to see if cloud variables have merra2 source files for the current day
- Parameters:
handler (AncillaryVarHandler) – Base data model variable handler
- Returns:
check (bool) – True if the source is merra, False if not
out (dict) – New factory kwargs for the variable if source is merra
- is_cloud_var(var)[source]
Determine whether or not the variable is a cloud variable from the CLAVR-x / GOES data
- Parameters:
var (str) – NSRDB variable name
- Returns:
is_cv (bool) – True if var is a cloud variable
- is_derived_var(var)[source]
Determine whether or not the variable is derived from primary source datasets
- Parameters:
var (str) – NSRDB variable name
- Returns:
is_derived (bool) – True if var is handled using a derived variable handler
- get_dependencies(var)[source]
Get dependencies for a derived variable
- Parameters:
var (str) – NSRDB variable name
- Returns:
deps (tuple) – Tuple of string names of dependencies of the derived variable input. Empty tuple if var is not derived.
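The contract of is_derived_var and get_dependencies (an empty tuple for non-derived variables) can be illustrated with a simple lookup table. The variable names and the DERIVED_DEPENDENCIES mapping are hypothetical placeholders; the real methods consult the data model's variable handlers.

```python
# Hypothetical mapping of derived variables to their source variables.
DERIVED_DEPENDENCIES = {
    "derived_a": ("source_x", "source_y"),
    "derived_b": ("source_y",),
}


def is_derived_var(var):
    """Return True if var is computed from other variables."""
    return var in DERIVED_DEPENDENCIES


def get_dependencies(var):
    """Return the dependency names of var, or an empty tuple if not derived."""
    return DERIVED_DEPENDENCIES.get(var, ())


def required_sources(var_list):
    """Collect the unique source variables needed to produce var_list."""
    sources = []
    for var in var_list:
        # Non-derived variables depend only on themselves.
        for name in get_dependencies(var) or (var,):
            if name not in sources:
                sources.append(name)
    return sources


needed = required_sources(["derived_a", "derived_b", "source_z"])
```

Returning an empty tuple rather than None lets callers iterate over the result without a separate check.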
- init_cloud_data(cloud_obj_all)[source]
Initialize a dictionary for all cloud datasets
- Parameters:
cloud_obj_all (CloudVar) – Cloud variable handler that can be used to iterate over all single cloud file handlers.
- Returns:
cloud_data (dict) – Data dictionary of cloud datasets mapped to the NSRDB grid. Keys are the cloud variables names, values are 2D numpy arrays. Array shape is (n_time, n_sites).
- classmethod get_single_cloud_data(fp_cloud, cloud_kwargs, nsrdb_grid, dist_lim=1.0)[source]
Get the source data from a single cloud data file, mapped onto the NSRDB grid.
- Parameters:
fp_cloud (str) – Single cloud source file either .nc or .h5
cloud_kwargs (dict) – Kwargs for the initialization of CloudVarSingleH5 or CloudVarSingleNC along with fp_cloud
nsrdb_grid (pd.DataFrame) – Reference grid data for NSRDB.
dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this value from any GOES data pixel trigger a warning and are assigned missing cloud types and properties, resulting in a full clearsky timeseries.
- Returns:
single_data (dict | None) – Dictionary of source data for the single cloud file (single timestep) mapped onto the nsrdb_grid. Keys are nsrdb cloud dataset names and values are 1D (space,) arrays of data matching the nsrdb_grid. Returns None if something went wrong.
- dump(var, fpath_out, data, purge=False, mode='w')[source]
Dump processed data for one variable for a single day to disk.
- Parameters:
var (str) – NSRDB var name.
fpath_out (str | None) – File path to dump results. If no file path is given, results will be returned as an object.
data (np.ndarray) – NSRDB-resolution data for the given var and the current day.
purge (bool) – Flag to purge data from memory after dumping to disk
mode (str, optional) – Mode to open fpath_out with, by default ‘w’
- Returns:
data (str | np.ndarray) – Input data array if no purge, else file path to dump results.
- classmethod run_single(var, date, nsrdb_grid, nsrdb_freq='5min', var_meta=None, fpath_out=None, factory_kwargs=None, scale=True, **kwargs)[source]
Run ancillary data processing for one variable for a single day.
- Parameters:
var (str) – NSRDB var name.
date (datetime.date) – Single day to extract ancillary data for.
nsrdb_grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must be the NSRDB site gids.
nsrdb_freq (str) – Final desired NSRDB temporal frequency.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
fpath_out (str | None) – File path to dump results. If no file path is given, results will be returned as an object.
factory_kwargs (dict | None) – Optional namespace of kwargs to use to initialize variable data handlers from the data model’s variable factory. Keyed by variable name. Values can be “source_dir”, “handler”, etc. The source_dir for cloud variables can be a normal directory path or /directory/prefix*suffix, where /directory/ can contain further subdirectories.
scale (bool) – Flag to scale source data to reduced (integer) precision after data model processing.
kwargs (dict) – Optional kwargs. Based on the NSRDB var name requested to be processed, this method runs one of several DataModel processing methods (_interpolate, _derive, _process_clouds). These kwargs will get passed to the processing method.
- Returns:
data (np.ndarray) – NSRDB-resolution data for the given var and the current day.
- classmethod run_clouds(cloud_vars, date, nsrdb_grid, nsrdb_freq='5min', dist_lim=1.0, var_meta=None, max_workers_regrid=None, scale=True, fpath_out=None, factory_kwargs=None)[source]
Run cloud processing for multiple cloud variables.
It is most efficient to process all cloud variables together, which minimizes the number of KDTrees built during regrid.
- Parameters:
cloud_vars (list | tuple) – NSRDB cloud variables names.
date (datetime.date) – Single day to extract ancillary data for.
nsrdb_grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must be the NSRDB site gids.
nsrdb_freq (str) – Final desired NSRDB temporal frequency.
dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this value from any GOES data pixel trigger a warning and are assigned missing cloud types and properties, resulting in a full clearsky timeseries.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
max_workers_regrid (None | int) – Max parallel workers allowed for cloud regrid processing. None uses all available workers. 1 runs regrid in serial.
scale (bool) – Flag to scale source data to reduced (integer) precision after data model processing.
fpath_out (str | None) – File path to dump results. If no file path is given, results will be returned as an object.
factory_kwargs (dict | None) – Optional namespace of kwargs to use to initialize variable data handlers from the data model’s variable factory. Keyed by variable name. Values can be “source_dir”, “handler”, etc. The source_dir for cloud variables can be a normal directory path or /directory/prefix*suffix, where /directory/ can contain further subdirectories.
- Returns:
data (dict) – Namespace of nsrdb data numpy arrays keyed by nsrdb variable name.
- classmethod run_multiple(var_list, date, nsrdb_grid, nsrdb_freq='5min', dist_lim=1.0, var_meta=None, max_workers=None, max_workers_regrid=None, return_obj=False, scale=True, fpath_out=None, factory_kwargs=None)[source]
Run ancillary data processing for multiple variables for single day.
- Parameters:
var_list (list | None) – List of variables to process
date (datetime.date) – Single day to extract ancillary data for.
nsrdb_grid (str | pd.DataFrame) – CSV file containing the NSRDB reference grid to interpolate to, or a pre-extracted (and reduced) dataframe. The first csv column must be the NSRDB site gids.
nsrdb_freq (str) – Final desired NSRDB temporal frequency.
dist_lim (float) – Return only neighbors within this distance during cloud regrid. The distance is in decimal degrees (more efficient than real distance). NSRDB sites farther than this value from any GOES data pixel trigger a warning and are assigned missing cloud types and properties, resulting in a full clearsky timeseries.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing meta data for all NSRDB variables. Defaults to the NSRDB var meta csv in git repo.
max_workers (int | None) – Number of workers to run in parallel. 1 will run serial, None will use all available.
max_workers_regrid (None | int) – Max parallel workers allowed for cloud regrid processing. None uses all available workers. 1 runs regrid in serial.
return_obj (bool) – Flag to return full DataModel object instead of just the processed data dictionary.
scale (bool) – Flag to scale source data to reduced (integer) precision after data model processing.
fpath_out (str | None) – File path to dump results. If no file path is given, results will be returned as an object.
factory_kwargs (dict | None) – Optional namespace of kwargs to use to initialize variable data handlers from the data model’s variable factory. Keyed by variable name. Values can be “source_dir”, “handler”, etc. The source_dir for cloud variables can be a normal directory path or /directory/prefix*suffix, where /directory/ can contain further subdirectories.
- Returns:
out (dict | DataModel) – Either the dictionary of data or the full DataModel object with the data in the .processed_data property. Controlled by the return_obj flag.