nsrdb.gap_fill.mlclouds_fill.MLCloudsFill
- class MLCloudsFill(h5_source, fill_all=False, model_path=None, var_meta=None)[source]
Bases: object
Use the MLClouds algorithm with a phygnn model to fill missing cloud data
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.
model_path (dict | None) – Keyword arguments for the MultiCloudsModel.load method. Specifies cloud_prop_model_path for the cloud property model and optionally cloud_type_model_path for a cloud type model. Each value is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
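A minimal instantiation sketch (the directory path is a placeholder and assumes a multi-file resource set already exists there):

```python
from nsrdb.gap_fill.mlclouds_fill import MLCloudsFill

# Placeholder source directory; a /h5_dir/prefix*suffix pattern works too.
mlc = MLCloudsFill('/h5_dir/', fill_all=False)
```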
Methods
archive_cld_properties() – Archive original cloud property (cld_*) .h5 files.
clean_array(dset, array) – Clean a dataset array using temporal nearest neighbor interpolation.
clean_data_model(data_model[, fill_all, ...]) – Run the MLCloudsFill process on data in-memory in an nsrdb data model object.
clean_feature_data(feature_raw, fill_flag[, ...]) – Clean feature data.
fill_bad_cld_properties(predicted_data, ...) – Fill bad cloud properties in the feature data from predicted data.
fill_ctype_press(h5_source[, col_slice]) – Fill cloud type and pressure using simple temporal nearest neighbor.
Initialize a dict of numpy arrays for clean data.
mark_complete_archived_files() – Remove the .tmp marker from the archived files once MLCloudsFill is complete.
merra_clouds(h5_source[, var_meta, ...]) – Quick check to see if the cloud data is from a MERRA source, in which case it should be gap-free and cloud_fill_flag will be written with all 8's.
parse_feature_data([feature_data, col_slice]) – Parse raw feature data from .h5 files (will have gaps!).
predict_cld_properties(feature_data[, ...]) – Predict cloud properties with phygnn.
Run preflight checks and raise an error if datasets are missing.
prep_chunk(h5_source[, model_path, ...]) – Prepare a column chunk (slice) of data for phygnn prediction.
process_chunk(i_features, i_clean, i_flag, ...) – Use cleaned and prepared data to run phygnn predictions and create final filled data for a single column chunk.
run(h5_source[, fill_all, model_path, ...]) – Fill cloud properties using phygnn predictions.
write_fill_flag(fill_flag[, col_slice]) – Write the fill flag dataset to its daily file next to the cloud property files.
write_filled_data(filled_data[, col_slice]) – Write gap-filled cloud data to disk.
Attributes
DEFAULT_MODEL
dset_map – Mapping of datasets to .h5 files.
h5_source – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix.
phygnn_model – Pre-trained PhygnnModel instance.
- property phygnn_model
Pre-trained PhygnnModel instance
- Returns:
PhygnnModel
- property h5_source
Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
- Returns:
str
- property dset_map
Mapping of datasets to .h5 files
- Returns:
dict
- parse_feature_data(feature_data=None, col_slice=slice(None, None, None))[source]
Parse raw feature data from .h5 files (will have gaps!)
- Parameters:
feature_data (dict | None) – Pre-loaded feature data to add to (optional). Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites). Any dsets already in this input won’t be re-read.
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- Returns:
feature_data (dict) – Raw feature data with gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
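Continuing the instantiation sketch above, the returned dict maps nsrdb dataset names to (time x sites) arrays; the slice bounds here are arbitrary:

```python
# Load raw (gappy) feature data for the first 1000 sites.
feature_data = mlc.parse_feature_data(col_slice=slice(0, 1000))
for dset, arr in feature_data.items():
    print(dset, arr.shape)  # e.g. cloud_type (n_time, 1000)
```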
- static clean_array(dset, array)[source]
Clean a dataset array using temporal nearest neighbor interpolation.
- Parameters:
dset (str) – NSRDB dataset name
array (np.ndarray) – 2D (time x sites) float numpy array of data for dset. Missing values should be set to NaN.
- Returns:
array (np.ndarray) – 2D (time x sites) float numpy array of data for dset with missing values filled.
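For intuition, a stand-in for the temporal nearest-neighbor fill this method performs (an illustrative sketch assuming pandas with scipy installed, not the nsrdb implementation):

```python
import numpy as np
import pandas as pd

def nn_fill(array):
    """Fill NaNs along the time axis (axis 0) with nearest valid values."""
    df = pd.DataFrame(array).interpolate(method='nearest')
    return df.ffill().bfill().to_numpy()  # patch leading/trailing NaNs

arr = np.array([[1.0, np.nan],
                [np.nan, 4.0],
                [3.0, 5.0]])
print(nn_fill(arr))  # NaNs replaced by temporally nearest values
```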
- clean_feature_data(feature_raw, fill_flag, sza_lim=90, max_workers=1)[source]
Clean feature data
- Parameters:
feature_raw (dict) – Raw feature data with gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap filled. By default 90, which fills all missing daylight data.
max_workers (None | int) – Maximum workers to clean data in parallel. 1 is serial and None uses all available workers.
- Returns:
feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
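A hedged continuation of the earlier sketch; the fill_flag dtype and zero-initialization are assumptions, not documented behavior:

```python
import numpy as np

# Assumed: fill_flag mirrors the (time x sites) shape of the features.
fill_flag = np.zeros(feature_data['cloud_type'].shape, dtype=np.uint8)
feature_clean, fill_flag = mlc.clean_feature_data(
    feature_data, fill_flag, sza_lim=90, max_workers=1)
```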
- archive_cld_properties()[source]
Archive original cloud property (cld_*) .h5 files. This method creates .tmp files in a ./raw/ sub-directory. mark_complete_archived_files() should be run at the end to remove the .tmp designation, signifying that the cloud fill completed successfully.
- mark_complete_archived_files()[source]
Remove the .tmp marker from the archived files once MLCloudsFill is complete
- predict_cld_properties(feature_data, col_slice=None, low_mem=False)[source]
Predict cloud properties with phygnn
- Parameters:
feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Just used for logging in this method.
low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is (n_time x n_sites x n_nodes_per_layer); low_mem=True will reduce this to (1000 x n_nodes_per_layer).
- Returns:
predicted_data (dict) – Dictionary of predicted cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites).
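A conceptual sketch of the low-memory strategy described above: predict in fixed-size row batches so peak memory scales with the batch size rather than with n_time x n_sites. The function and model interface are illustrative, not the nsrdb internals:

```python
import numpy as np

def predict_low_mem(model, features_2d, batch=1000):
    """Run model.predict over row batches to cap peak memory."""
    out = np.empty(len(features_2d), dtype=np.float32)
    for i in range(0, len(features_2d), batch):
        out[i:i + batch] = model.predict(features_2d[i:i + batch]).ravel()
    return out
```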
- fill_bad_cld_properties(predicted_data, feature_data, col_slice=None)[source]
Fill bad cloud properties in the feature data from predicted data
- Parameters:
predicted_data (dict) – Dictionary of predicted cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites).
feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Just used for logging in this method.
- Returns:
filled_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites). The filled data is a combination of the input predicted_data and feature_data: the datasets in the predicted_data input are used to fill the feature_data input where (feature_data['flag'] == "bad_cloud").
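The documented fill condition reduces to a masked replacement; a sketch using a dataset name from this page:

```python
import numpy as np

# Predictions replace feature values only where the cleaning step
# flagged a timestep/site as "bad_cloud".
bad = feature_data['flag'] == 'bad_cloud'
filled_opd = np.where(bad, predicted_data['cld_opd_dcomp'],
                      feature_data['cld_opd_dcomp'])
```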
- static fill_ctype_press(h5_source, col_slice=slice(None, None, None))[source]
Fill cloud type and pressure using simple temporal nearest neighbor.
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- Returns:
cloud_type (np.ndarray) – 2D array (time x sites) of gap-filled cloud type data.
cloud_pres (np.ndarray) – 2D array (time x sites) of gap-filled cld_press_acha data.
sza (np.ndarray) – 2D array (time x sites) of solar zenith angle data.
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
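Because this is a static method it can be called without an instance; the path is a placeholder:

```python
ctype, cpres, sza, fill_flag = MLCloudsFill.fill_ctype_press(
    '/h5_dir/', col_slice=slice(None))
```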
- write_filled_data(filled_data, col_slice=slice(None, None, None))[source]
Write gap-filled cloud data to disk
- Parameters:
filled_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites). The filled data is a combination of the input predicted_data and feature_data: the datasets in the predicted_data input are used to fill the feature_data input where (feature_data['flag'] == "bad_cloud").
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- write_fill_flag(fill_flag, col_slice=slice(None, None, None))[source]
Write the fill flag dataset to its daily file next to the cloud property files.
- Parameters:
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- classmethod prep_chunk(h5_source, model_path=None, var_meta=None, sza_lim=90, col_slice=slice(None, None, None))[source]
Prepare a column chunk (slice) of data for phygnn prediction.
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
model_path (str | None) – Directory to load the phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap filled. By default 90, which fills all missing daylight data.
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- Returns:
feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites). This is just for the col_slice being worked on.
clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names; values are 2D arrays (time x sites) with nearest-neighbor cleaned values for cloud pressure and type. This is just for the col_slice being worked on.
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is just for the col_slice being worked on.
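A sketch of building column chunks and preparing the first one; the site count and chunk size are assumed values (run() normally handles this orchestration):

```python
n_sites, col_chunk = 8000, 1000  # assumptions for illustration
col_slices = [slice(i, min(i + col_chunk, n_sites))
              for i in range(0, n_sites, col_chunk)]
i_features, i_clean, i_flag = MLCloudsFill.prep_chunk(
    '/h5_dir/', sza_lim=90, col_slice=col_slices[0])
```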
- process_chunk(i_features, i_clean, i_flag, col_slice, clean_data, fill_flag, low_mem=False)[source]
Use cleaned and prepared data to run phygnn predictions and create final filled data for a single column chunk.
- Parameters:
i_features (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites). This is just for a single column chunk (col_slice).
i_clean (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites) of phygnn-predicted values (cloud opd and reff) or nearest-neighbor cleaned values (cloud pressure and type). This is just for a single column chunk (col_slice).
i_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is just for a single column chunk (col_slice).
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load.
clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites). This is for ALL chunks (full resource shape).
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is for ALL chunks (full resource shape).
low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is (n_time x n_sites x n_nodes_per_layer); low_mem=True will reduce this to (1000 x n_nodes_per_layer).
- Returns:
clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites). This has been updated with phygnn-predicted values (cloud opd and reff) or nearest-neighbor cleaned values (cloud pressure and type). This is for ALL chunks (full resource shape).
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is for ALL chunks (full resource shape).
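Continuing the chunking sketch: fold a prepared chunk back into full-shape outputs. The container shapes, dtypes, and dataset keys below are assumptions:

```python
import numpy as np

n_time = 17520  # assumed full temporal extent
fill_flag = np.zeros((n_time, n_sites), dtype=np.uint8)
clean_data = {dset: np.full((n_time, n_sites), np.nan, dtype=np.float32)
              for dset in ('cld_opd_dcomp', 'cld_reff_dcomp')}
clean_data, fill_flag = mlc.process_chunk(
    i_features, i_clean, i_flag, col_slices[0], clean_data, fill_flag)
```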
- classmethod clean_data_model(data_model, fill_all=False, model_path=None, var_meta=None, sza_lim=90, low_mem=False)[source]
Run the MLCloudsFill process on data in-memory in an nsrdb data model object.
- Parameters:
data_model (nsrdb.data_model.DataModel) – DataModel object with processed source data (cloud data + ancillary processed onto the nsrdb grid).
fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.
model_path (str | None) – Directory to load the phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap filled. By default 90, which fills all missing daylight data.
low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is (n_time x n_sites x n_nodes_per_layer); low_mem=True will reduce this to (1000 x n_nodes_per_layer).
- Returns:
data_model (nsrdb.data_model.DataModel) – DataModel object with processed source data (cloud data + ancillary processed onto the nsrdb grid). The cloud property datasets (cloud_type, cld_opd_dcomp, cld_reff_dcomp, cloud_fill_flag) are now cleaned.
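A short usage sketch for the in-memory path; data_model is assumed to be an already-processed nsrdb.data_model.DataModel:

```python
# Returns the same DataModel with cloud property datasets cleaned.
data_model = MLCloudsFill.clean_data_model(data_model, fill_all=False,
                                           sza_lim=90)
```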
- classmethod merra_clouds(h5_source, var_meta=None, merra_fill_flag=8, fill_all=False, model_path=None)[source]
Quick check to see if the cloud data is from a MERRA source, in which case it should be gap-free and cloud_fill_flag will be written with all 8's
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
merra_fill_flag (int) – Integer fill flag representing where merra data was used as source cloud data.
fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.
model_path (str | None) – Directory to load the phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
- Returns:
is_merra (bool) – Flag that is True if the cloud data is from MERRA
- classmethod run(h5_source, fill_all=False, model_path=None, var_meta=None, sza_lim=90, col_chunk=None, max_workers=None, low_mem=False)[source]
Fill cloud properties using phygnn predictions. Original files will be archived to a new "raw/" sub-directory.
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.
model_path (str | None) – Directory to load the phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap filled. By default 90, which fills all missing daylight data.
col_chunk (None | int) – Optional chunking method to gap fill one column chunk at a time to reduce memory requirements. If provided, this should be an integer specifying how many columns to work on at one time.
max_workers (None | int) – Maximum workers for running mlclouds in parallel. 1 is serial and None uses all available workers.
low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is (n_time x n_sites x n_nodes_per_layer); low_mem=True will reduce this to (1000 x n_nodes_per_layer).
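A typical end-to-end call per the signature above; the path, chunk size, and worker count are placeholders:

```python
from nsrdb.gap_fill.mlclouds_fill import MLCloudsFill

# Gap fill all cloud properties in place; originals archived to raw/.
MLCloudsFill.run('/h5_dir/prefix*suffix', fill_all=False,
                 col_chunk=1000, max_workers=4)
```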