nsrdb.gap_fill.mlclouds_fill.MLCloudsFill

class MLCloudsFill(h5_source, fill_all=False, model_path=None, var_meta=None)[source]

Bases: object

Use the MLClouds algorithm with phygnn model to fill missing cloud data

Parameters:
  • h5_source (str) – Path to directory containing multi-file resource file sets. Available formats:

    /h5_dir/
    /h5_dir/prefix*suffix

  • fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.

  • model_path (dict | None) – kwargs for the MultiCloudsModel.load method. Specifies cloud_prop_model_path for the cloud property model and, optionally, cloud_type_model_path for a cloud type model. Each value is typically a path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
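
Example (a minimal construction sketch; the directory and model file paths below are placeholders, not real files):

    from nsrdb.gap_fill.mlclouds_fill import MLCloudsFill

    # Placeholder paths for illustration only
    h5_source = '/h5_dir/'  # directory of daily .h5 resource files
    model_path = {'cloud_prop_model_path': '/models/cloud_prop.pkl'}

    mlc = MLCloudsFill(h5_source, fill_all=False, model_path=model_path)
    mlc.preflight()  # raises an error if required datasets are missing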

Methods

archive_cld_properties()

Archive original cloud property (cld_*) .h5 files.

clean_array(dset, array)

Clean a dataset array using temporal nearest neighbor interpolation.

clean_data_model(data_model[, fill_all, ...])

Run the MLCloudsFill process on in-memory data in an nsrdb data model object.

clean_feature_data(feature_raw, fill_flag[, ...])

Clean feature data

fill_bad_cld_properties(predicted_data, ...)

Fill bad cloud properties in the feature data from predicted data

fill_ctype_press(h5_source[, col_slice])

Fill cloud type and pressure using simple temporal nearest neighbor.

init_clean_arrays()

Initialize a dict of numpy arrays for clean data.

mark_complete_archived_files()

Remove the .tmp marker from the archived files once MLCloudsFill is complete

merra_clouds(h5_source[, var_meta, ...])

Quick check to see if the cloud data is from a MERRA source, in which case it should be gap-free and cloud_fill_flag will be written with all 8's.

parse_feature_data([feature_data, col_slice])

Parse raw feature data from .h5 files (will have gaps!)

predict_cld_properties(feature_data[, ...])

Predict cloud properties with phygnn

preflight()

Run preflight checks and raise error if datasets are missing

prep_chunk(h5_source[, model_path, ...])

Prepare a column chunk (slice) of data for phygnn prediction.

process_chunk(i_features, i_clean, i_flag, ...)

Use cleaned and prepared data to run phygnn predictions and create final filled data for a single column chunk.

run(h5_source[, fill_all, model_path, ...])

Fill cloud properties using phygnn predictions.

write_fill_flag(fill_flag[, col_slice])

Write the fill flag dataset to its daily file next to the cloud property files.

write_filled_data(filled_data[, col_slice])

Write gap filled cloud data to disk

Attributes

DEFAULT_MODEL

dset_map

Mapping of datasets to .h5 files

h5_source

Path to directory containing multi-file resource file sets. Available formats: /h5_dir/ or /h5_dir/prefix*suffix.

phygnn_model

Pre-trained PhygnnModel instance

preflight()[source]

Run preflight checks and raise error if datasets are missing

property phygnn_model

Pre-trained PhygnnModel instance

Returns:

PhygnnModel

property h5_source

Path to directory containing multi-file resource file sets. Available formats:

/h5_dir/
/h5_dir/prefix*suffix

Returns:

str

property dset_map

Mapping of datasets to .h5 files

Returns:

dict

parse_feature_data(feature_data=None, col_slice=slice(None, None, None))[source]

Parse raw feature data from .h5 files (will have gaps!)

Parameters:
  • feature_data (dict | None) – Pre-loaded feature data to add to (optional). Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites). Any dsets already in this input won’t be re-read.

  • col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.

Returns:

feature_data (dict) – Raw feature data with gaps. Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites).
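
For illustration, a hedged sketch of inspecting the parsed feature dictionary (assumes the mlc instance from the class example above):

    # Parse raw features for the first 500 sites only
    feature_data = mlc.parse_feature_data(col_slice=slice(0, 500))
    for dset, arr in feature_data.items():
        # Each value is a 2D (time x sites) array for the requested columns
        print(dset, arr.shape, arr.dtype)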

init_clean_arrays()[source]

Initialize a dict of numpy arrays for clean data.

static clean_array(dset, array)[source]

Clean a dataset array using temporal nearest neighbor interpolation.

Parameters:
  • dset (str) – NSRDB dataset name

  • array (np.ndarray) – 2D (time x sites) float numpy array of data for dset. Missing values should be set to NaN.

Returns:

array (np.ndarray) – 2D (time x sites) float numpy array of data for dset with missing values filled by temporal nearest neighbor interpolation.
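
A minimal numpy sketch of the temporal nearest-neighbor idea (a simplified illustration, not the library's implementation):

    import numpy as np

    def temporal_nn_fill(arr):
        """Fill NaNs in a (time x sites) float array with the value from
        the temporally nearest valid timestep in the same column."""
        out = arr.copy()
        t = np.arange(arr.shape[0])
        for j in range(arr.shape[1]):
            good = np.where(~np.isnan(out[:, j]))[0]
            if good.size and good.size < t.size:
                # for every timestep, index of the closest valid timestep
                nearest = good[np.abs(good[None, :] - t[:, None]).argmin(axis=1)]
                out[:, j] = out[nearest, j]
        return out

    arr = np.array([[1.0, np.nan], [np.nan, 4.0], [3.0, np.nan]])
    print(temporal_nn_fill(arr))  # [[1. 4.] [1. 4.] [3. 4.]]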

clean_feature_data(feature_raw, fill_flag, sza_lim=90, max_workers=1)[source]

Clean feature data

Parameters:
  • feature_raw (dict) – Raw feature data with gaps. Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites).

  • fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.

  • sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap-filled. By default 90, to fill all missing daylight data.

  • max_workers (None | int) – Maximum workers to clean data in parallel. 1 is serial and None uses all available workers.

Returns:

  • feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites).

  • fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.

archive_cld_properties()[source]

Archive original cloud property (cld_*) .h5 files. This method creates .tmp files in a ./raw/ sub-directory. mark_complete_archived_files() should be run at the end to remove the .tmp designation, signifying that the cloud fill completed successfully.

mark_complete_archived_files()[source]

Remove the .tmp marker from the archived files once MLCloudsFill is complete

predict_cld_properties(feature_data, col_slice=None, low_mem=False)[source]

Predict cloud properties with phygnn

Parameters:
  • feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites).

  • col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Just used for logging in this method.

  • low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is: (n_time x n_sites x n_nodes_per_layer). low_mem=True will reduce this to (1000 x n_nodes_per_layer)

Returns:

predicted_data (dict) – Dictionary of predicted cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites).
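
To make the low_mem trade-off concrete, here is a hedged sketch of batched prediction over flattened (time x sites) samples; the model.predict call is a stand-in for the phygnn interface and the numbers are illustrative:

    import numpy as np

    def predict_batched(model, features, batch_size=1000):
        """Predict in fixed-size batches so intermediate activations scale
        with batch_size rather than n_time * n_sites. For example, 17,520
        timesteps x 10,000 sites x 256 nodes/layer x 4 bytes is ~180 GB in
        one shot, vs ~1 MB per 1000-sample batch."""
        n = len(features)
        out = np.empty(n, dtype=np.float32)
        for i in range(0, n, batch_size):
            batch = features[i:i + batch_size]
            out[i:i + batch_size] = np.asarray(model.predict(batch)).ravel()
        return out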

fill_bad_cld_properties(predicted_data, feature_data, col_slice=None)[source]

Fill bad cloud properties in the feature data from predicted data

Parameters:
  • predicted_data (dict) – Dictionary of predicted cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites).

  • feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites).

  • col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Just used for logging in this method.

Returns:

filled_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites). The filled data is a combination of the input predicted_data and feature_data. The datasets in the predicted_data input are used to fill the feature_data input where: (feature_data['flag'] == "bad_cloud")
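
The fill rule can be illustrated with a tiny numpy sketch (synthetic arrays; the dataset and flag names follow the docstring, but this is not the library code):

    import numpy as np

    # 2 timesteps x 2 sites of synthetic data
    flag = np.array([['clear', 'bad_cloud'], ['bad_cloud', 'clear']])
    feature = np.array([[0.0, np.nan], [np.nan, 5.0]])   # e.g. cld_opd_dcomp
    predicted = np.array([[0.1, 2.2], [3.3, 4.4]])       # phygnn output

    # predicted values are used only where the flag marks a bad cloud
    filled = np.where(flag == 'bad_cloud', predicted, feature)
    print(filled)  # [[0.  2.2] [3.3 5. ]]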

static fill_ctype_press(h5_source, col_slice=slice(None, None, None))[source]

Fill cloud type and pressure using simple temporal nearest neighbor.

Parameters:
  • h5_source (str) – Path to directory containing multi-file resource file sets. Available formats:

    /h5_dir/ /h5_dir/prefix*suffix

  • col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.

Returns:

  • cloud_type (np.ndarray) – 2D array (time x sites) of gap-filled cloud type data.

  • cloud_pres (np.ndarray) – 2D array (time x sites) of gap-filled cld_press_acha data.

  • sza (np.ndarray) – 2D array (time x sites) of solar zenith angle data.

  • fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
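
Usage is a single hedged call (the directory is a placeholder); the four documented arrays come back in order:

    ctype, cpres, sza, fill_flag = MLCloudsFill.fill_ctype_press(
        '/h5_dir/', col_slice=slice(None))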

write_filled_data(filled_data, col_slice=slice(None, None, None))[source]

Write gap filled cloud data to disk

Parameters:
  • filled_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites). The filled data is a combination of the input predicted_data and feature_data. The datasets in the predicted_data input are used to fill the feature_data input where: (feature_data['flag'] == "bad_cloud")

  • col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.

write_fill_flag(fill_flag, col_slice=slice(None, None, None))[source]

Write the fill flag dataset to its daily file next to the cloud property files.

Parameters:
  • fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.

  • col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.

classmethod prep_chunk(h5_source, model_path=None, var_meta=None, sza_lim=90, col_slice=slice(None, None, None))[source]

Prepare a column chunk (slice) of data for phygnn prediction.

Parameters:
  • h5_source (str) – Path to directory containing multi-file resource file sets. Available formats:

    /h5_dir/
    /h5_dir/prefix*suffix

  • model_path (str | None) – Directory to load phygnn model from. This is typically a path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap-filled. By default 90, to fill all missing daylight data.

  • col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.

Returns:

  • feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites). This is just for the col_slice being worked on.

  • clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites) with nearest-neighbor cleaned values for cloud pressure and type. This is just for the col_slice being worked on.

  • fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is just for the col_slice being worked on.

process_chunk(i_features, i_clean, i_flag, col_slice, clean_data, fill_flag, low_mem=False)[source]

Use cleaned and prepared data to run phygnn predictions and create final filled data for a single column chunk.

Parameters:
  • i_features (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites). This is just for a single column chunk (col_slice).

  • i_clean (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites) of phygnn-predicted values (cloud opd and reff) or nearest-neighbor cleaned values (cloud pressure and type). This is just for a single column chunk (col_slice).

  • i_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is just for a single column chunk (col_slice).

  • col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load.

  • clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites). This is for ALL chunks (full resource shape).

  • fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is for ALL chunks (full resource shape).

  • low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is: (n_time x n_sites x n_nodes_per_layer). low_mem=True will reduce this to (1000 x n_nodes_per_layer)

Returns:

  • clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites). This has been updated with phygnn-predicted values (cloud opd and reff) or nearest-neighbor cleaned values (cloud pressure and type). This is for ALL chunks (full resource shape).

  • fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is for ALL chunks (full resource shape).
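
A hedged sketch of the serial chunked pattern using prep_chunk (the path and sizes are placeholders; MLCloudsFill.run presumably wraps this same prepare/process pattern, optionally in parallel):

    # Build column slices covering all sites in fixed-width chunks
    n_sites = 10000     # placeholder: total columns in the resource files
    col_chunk = 2000
    slices = [slice(i, min(i + col_chunk, n_sites))
              for i in range(0, n_sites, col_chunk)]

    for cs in slices:
        # Prepare cleaned features, cleaned ctype/pressure, and fill flags
        # for this chunk, then hand them to process_chunk() on an
        # MLCloudsFill instance to predict and update the full-shape outputs
        i_features, i_clean, i_flag = MLCloudsFill.prep_chunk(
            '/h5_dir/', sza_lim=90, col_slice=cs)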

classmethod clean_data_model(data_model, fill_all=False, model_path=None, var_meta=None, sza_lim=90, low_mem=False)[source]

Run the MLCloudsFill process on in-memory data in an nsrdb data model object.

Parameters:
  • data_model (nsrdb.data_model.DataModel) – DataModel object with processed source data (cloud data + ancillary processed onto the nsrdb grid).

  • fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.

  • model_path (str | None) – Directory to load phygnn model from. This is typically a path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap-filled. By default 90, to fill all missing daylight data.

  • low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is: (n_time x n_sites x n_nodes_per_layer). low_mem=True will reduce this to (1000 x n_nodes_per_layer)

Returns:

data_model (nsrdb.data_model.DataModel) – DataModel object with processed source data (cloud data + ancillary processed onto the nsrdb grid). The cloud property datasets (cloud_type, cld_opd_dcomp, cld_reff_dcomp, cloud_fill_flag) are now cleaned.

classmethod merra_clouds(h5_source, var_meta=None, merra_fill_flag=8, fill_all=False, model_path=None)[source]

Quick check to see if the cloud data is from a MERRA source, in which case it should be gap-free and cloud_fill_flag will be written with all 8's.

Parameters:
  • h5_source (str) – Path to directory containing multi-file resource file sets. Available formats:

    /h5_dir/
    /h5_dir/prefix*suffix

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • merra_fill_flag (int) – Integer fill flag representing where merra data was used as source cloud data.

  • fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.

  • model_path (str | None) – Directory to load phygnn model from. This is typically a path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.

Returns:

is_merra (bool) – Flag that is True if the cloud data is from MERRA.

classmethod run(h5_source, fill_all=False, model_path=None, var_meta=None, sza_lim=90, col_chunk=None, max_workers=None, low_mem=False)[source]

Fill cloud properties using phygnn predictions. Original files will be archived to a new "raw/" sub-directory.

Parameters:
  • h5_source (str) – Path to directory containing multi-file resource file sets. Available formats:

    /h5_dir/
    /h5_dir/prefix*suffix

  • fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.

  • model_path (str | None) – Directory to load phygnn model from. This is typically a path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.

  • var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.

  • sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap-filled. By default 90, to fill all missing daylight data.

  • col_chunk (None | int) – Optional chunking method to gap fill one column chunk at a time to reduce memory requirements. If provided, this should be an integer specifying how many columns to work on at one time.

  • max_workers (None | int) – Maximum workers for running mlclouds in parallel. 1 is serial and None uses all available workers.

  • low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is: (n_time x n_sites x n_nodes_per_layer). low_mem=True will reduce this to (1000 x n_nodes_per_layer)
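
Putting it together, a hedged end-to-end invocation (the directory is a placeholder; keyword defaults follow the signature above):

    from nsrdb.gap_fill.mlclouds_fill import MLCloudsFill

    # Originals are archived to a "raw/" sub-directory before overwriting
    MLCloudsFill.run(
        '/h5_dir/',
        fill_all=False,     # don't force-fill every cloudy timestep
        sza_lim=90,         # fill all missing daylight data
        col_chunk=2000,     # gap fill 2000 columns at a time to bound memory
        max_workers=4,      # 1 is serial, None uses all available workers
        low_mem=False,
    )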