nsrdb.gap_fill.mlclouds_fill.MLCloudsFill
- class MLCloudsFill(h5_source, fill_all=False, model_path=None, var_meta=None)[source]
Bases: object
Use the MLClouds algorithm with a phygnn model to fill missing cloud data
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.
model_path (dict | None) – Keyword arguments for the MultiCloudsModel.load method. Specifies cloud_prop_model_path for the cloud property model and optionally cloud_type_model_path for a cloud type model. Each value is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
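A minimal instantiation sketch (the directory path is a placeholder and assumes a multi-file resource set already exists there):

```python
from nsrdb.gap_fill.mlclouds_fill import MLCloudsFill

# Placeholder source directory; a /h5_dir/prefix*suffix pattern works too.
mlc = MLCloudsFill('/h5_dir/', fill_all=False)
```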
Methods
archive_cld_properties() – Archive original cloud property (cld_*) .h5 files.
clean_array(dset, array) – Clean a dataset array using temporal nearest neighbor interpolation.
clean_data_model(data_model[, fill_all, ...]) – Run the MLCloudsFill process on data in-memory in an nsrdb data model object.
clean_feature_data(feature_raw, fill_flag[, ...]) – Clean feature data.
fill_bad_cld_properties(predicted_data, ...) – Fill bad cloud properties in the feature data from predicted data.
fill_ctype_press(h5_source[, col_slice]) – Fill cloud type and pressure using simple temporal nearest neighbor.
Initialize a dict of numpy arrays for clean data.
mark_complete_archived_files() – Remove the .tmp marker from the archived files once MLCloudsFill is complete.
merra_clouds(h5_source[, var_meta, ...]) – Quick check to see if the cloud data is from a MERRA source, in which case it should be gap-free and cloud_fill_flag will be written with all 8's.
parse_feature_data([feature_data, col_slice]) – Parse raw feature data from .h5 files (will have gaps!).
predict_cld_properties(feature_data[, ...]) – Predict cloud properties with phygnn.
Run preflight checks and raise an error if datasets are missing.
prep_chunk(h5_source[, model_path, ...]) – Prepare a column chunk (slice) of data for phygnn prediction.
process_chunk(i_features, i_clean, i_flag, ...) – Use cleaned and prepared data to run phygnn predictions and create final filled data for a single column chunk.
run(h5_source[, fill_all, model_path, ...]) – Fill cloud properties using phygnn predictions.
write_fill_flag(fill_flag[, col_slice]) – Write the fill flag dataset to its daily file next to the cloud property files.
write_filled_data(filled_data[, col_slice]) – Write gap-filled cloud data to disk.
Attributes
DEFAULT_MODEL
dset_map – Mapping of datasets to .h5 files.
h5_source – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix.
phygnn_model – Pre-trained PhygnnModel instance.
- property phygnn_model
Pre-trained PhygnnModel instance
- Returns:
PhygnnModel
- property h5_source
Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
- Returns:
str
- property dset_map
Mapping of datasets to .h5 files
- Returns:
dict
- parse_feature_data(feature_data=None, col_slice=slice(None, None, None))[source]
Parse raw feature data from .h5 files (will have gaps!)
- Parameters:
feature_data (dict | None) – Pre-loaded feature data to add to (optional). Keys are the feature names (nsrdb dataset names), values are 2D numpy arrays (time x sites). Any dsets already in this input won’t be re-read.
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- Returns:
feature_data (dict) – Raw feature data with gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
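Continuing the instantiation sketch above, the returned dict maps nsrdb dataset names to (time x sites) arrays; the slice bounds here are arbitrary:

```python
# Load raw (gappy) feature data for the first 1000 sites.
feature_data = mlc.parse_feature_data(col_slice=slice(0, 1000))
for dset, arr in feature_data.items():
    print(dset, arr.shape)  # e.g. cloud_type (n_time, 1000)
```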
- static clean_array(dset, array)[source]
Clean a dataset array using temporal nearest neighbor interpolation.
- Parameters:
dset (str) – NSRDB dataset name
array (np.ndarray) – 2D (time x sites) float numpy array of data for dset. Missing values should be set to NaN.
- Returns:
array (np.ndarray) – 2D (time x sites) float numpy array of data for dset with missing values filled.
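For intuition, a stand-in for the temporal nearest-neighbor fill this method performs (an illustrative sketch assuming pandas with scipy installed, not the nsrdb implementation):

```python
import numpy as np
import pandas as pd

def nn_fill(array):
    """Fill NaNs along the time axis (axis 0) with nearest valid values."""
    df = pd.DataFrame(array).interpolate(method='nearest')
    return df.ffill().bfill().to_numpy()  # patch leading/trailing NaNs

arr = np.array([[1.0, np.nan],
                [np.nan, 4.0],
                [3.0, 5.0]])
print(nn_fill(arr))  # NaNs replaced by temporally nearest values
```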
- clean_feature_data(feature_raw, fill_flag, sza_lim=90, max_workers=1)[source]
Clean feature data
- Parameters:
feature_raw (dict) – Raw feature data with gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap filled. By default 90, which fills all missing daylight data.
max_workers (None | int) – Maximum workers to clean data in parallel. 1 is serial and None uses all available workers.
- Returns:
feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
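A hedged continuation of the earlier sketch; the fill_flag dtype and zero-initialization are assumptions, not documented behavior:

```python
import numpy as np

# Assumed: fill_flag mirrors the (time x sites) shape of the features.
fill_flag = np.zeros(feature_data['cloud_type'].shape, dtype=np.uint8)
feature_clean, fill_flag = mlc.clean_feature_data(
    feature_data, fill_flag, sza_lim=90, max_workers=1)
```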
- archive_cld_properties()[source]
Archive original cloud property (cld_*) .h5 files. This method creates .tmp files in a ./raw/ sub-directory. mark_complete_archived_files() should be run at the end to remove the .tmp designation, signifying that the cloud fill completed successfully.
- mark_complete_archived_files()[source]
Remove the .tmp marker from the archived files once MLCloudsFill is complete
- predict_cld_properties(feature_data, col_slice=None, low_mem=False)[source]
Predict cloud properties with phygnn
- Parameters:
feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Just used for logging in this method.
low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is (n_time x n_sites x n_nodes_per_layer); low_mem=True will reduce this to (1000 x n_nodes_per_layer).
- Returns:
predicted_data (dict) – Dictionary of predicted cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites).
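A conceptual sketch of the low-memory strategy described above: predict in fixed-size row batches so peak memory scales with the batch size rather than with n_time x n_sites. The function and model interface are illustrative, not the nsrdb internals:

```python
import numpy as np

def predict_low_mem(model, features_2d, batch=1000):
    """Run model.predict over row batches to cap peak memory."""
    out = np.empty(len(features_2d), dtype=np.float32)
    for i in range(0, len(features_2d), batch):
        out[i:i + batch] = model.predict(features_2d[i:i + batch]).ravel()
    return out
```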
- fill_bad_cld_properties(predicted_data, feature_data, col_slice=None)[source]
Fill bad cloud properties in the feature data from predicted data
- Parameters:
predicted_data (dict) – Dictionary of predicted cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites).
feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites).
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Just used for logging in this method.
- Returns:
filled_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites). The filled data is a combination of the input predicted_data and feature_data: the datasets in the predicted_data input are used to fill the feature_data input where (feature_data['flag'] == "bad_cloud").
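The documented fill condition reduces to a masked replacement; a sketch using a dataset name from this page:

```python
import numpy as np

# Predictions replace feature values only where the cleaning step
# flagged a timestep/site as "bad_cloud".
bad = feature_data['flag'] == 'bad_cloud'
filled_opd = np.where(bad, predicted_data['cld_opd_dcomp'],
                      feature_data['cld_opd_dcomp'])
```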
- static fill_ctype_press(h5_source, col_slice=slice(None, None, None))[source]
Fill cloud type and pressure using simple temporal nearest neighbor.
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- Returns:
cloud_type (np.ndarray) – 2D array (time x sites) of gap-filled cloud type data.
cloud_pres (np.ndarray) – 2D array (time x sites) of gap-filled cld_press_acha data.
sza (np.ndarray) – 2D array (time x sites) of solar zenith angle data.
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
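Because this is a static method it can be called without an instance; the path is a placeholder:

```python
ctype, cpres, sza, fill_flag = MLCloudsFill.fill_ctype_press(
    '/h5_dir/', col_slice=slice(None))
```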
- write_filled_data(filled_data, col_slice=slice(None, None, None))[source]
Write gap-filled cloud data to disk
- Parameters:
filled_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays of phygnn-predicted values (time x sites). The filled data is a combination of the input predicted_data and feature_data: the datasets in the predicted_data input are used to fill the feature_data input where (feature_data['flag'] == "bad_cloud").
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- write_fill_flag(fill_flag, col_slice=slice(None, None, None))[source]
Write the fill flag dataset to its daily file next to the cloud property files.
- Parameters:
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why.
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- classmethod prep_chunk(h5_source, model_path=None, var_meta=None, sza_lim=90, col_slice=slice(None, None, None))[source]
Prepare a column chunk (slice) of data for phygnn prediction.
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
model_path (str | None) – Directory to load the phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap filled. By default 90, which fills all missing daylight data.
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load. Use slice(None) for no chunking.
- Returns:
feature_data (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites). This is just for the col_slice being worked on.
clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names; values are 2D arrays (time x sites) with nearest-neighbor cleaned values for cloud pressure and type. This is just for the col_slice being worked on.
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is just for the col_slice being worked on.
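A sketch of building column chunks and preparing the first one; the site count and chunk size are assumed values (run() normally handles this orchestration):

```python
n_sites, col_chunk = 8000, 1000  # assumptions for illustration
col_slices = [slice(i, min(i + col_chunk, n_sites))
              for i in range(0, n_sites, col_chunk)]
i_features, i_clean, i_flag = MLCloudsFill.prep_chunk(
    '/h5_dir/', sza_lim=90, col_slice=col_slices[0])
```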
- process_chunk(i_features, i_clean, i_flag, col_slice, clean_data, fill_flag, low_mem=False)[source]
Use cleaned and prepared data to run phygnn predictions and create final filled data for a single column chunk.
- Parameters:
i_features (dict) – Clean feature data without gaps. Keys are the feature names (nsrdb dataset names); values are 2D numpy arrays (time x sites). This is just for a single column chunk (col_slice).
i_clean (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites) of phygnn-predicted values (cloud opd and reff) or nearest-neighbor cleaned values (cloud pressure and type). This is just for a single column chunk (col_slice).
i_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is just for a single column chunk (col_slice).
col_slice (slice) – Column slice of the resource data to work on. This is a result of chunking the columns to reduce memory load.
clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites). This is for ALL chunks (full resource shape).
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is for ALL chunks (full resource shape).
low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is (n_time x n_sites x n_nodes_per_layer); low_mem=True will reduce this to (1000 x n_nodes_per_layer).
- Returns:
clean_data (dict) – Dictionary of filled cloud properties. Keys are nsrdb dataset names, values are 2D arrays (time x sites). This has been updated with phygnn-predicted values (cloud opd and reff) or nearest-neighbor cleaned values (cloud pressure and type). This is for ALL chunks (full resource shape).
fill_flag (np.ndarray) – Integer array of flags showing what data was filled and why. This is for ALL chunks (full resource shape).
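Continuing the chunking sketch: fold a prepared chunk back into full-shape outputs. The container shapes, dtypes, and dataset keys below are assumptions:

```python
import numpy as np

n_time = 17520  # assumed full temporal extent
fill_flag = np.zeros((n_time, n_sites), dtype=np.uint8)
clean_data = {dset: np.full((n_time, n_sites), np.nan, dtype=np.float32)
              for dset in ('cld_opd_dcomp', 'cld_reff_dcomp')}
clean_data, fill_flag = mlc.process_chunk(
    i_features, i_clean, i_flag, col_slices[0], clean_data, fill_flag)
```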
- classmethod clean_data_model(data_model, fill_all=False, model_path=None, var_meta=None, sza_lim=90, low_mem=False)[source]
Run the MLCloudsFill process on data in-memory in an nsrdb data model object.
- Parameters:
data_model (nsrdb.data_model.DataModel) – DataModel object with processed source data (cloud data + ancillary processed onto the nsrdb grid).
fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.
model_path (str | None) – Directory to load the phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap filled. By default 90, which fills all missing daylight data.
low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is (n_time x n_sites x n_nodes_per_layer); low_mem=True will reduce this to (1000 x n_nodes_per_layer).
- Returns:
data_model (nsrdb.data_model.DataModel) – DataModel object with processed source data (cloud data + ancillary processed onto the nsrdb grid). The cloud property datasets (cloud_type, cld_opd_dcomp, cld_reff_dcomp, cloud_fill_flag) are now cleaned.
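A short usage sketch for the in-memory path; data_model is assumed to be an already-processed nsrdb.data_model.DataModel:

```python
# Returns the same DataModel with cloud property datasets cleaned.
data_model = MLCloudsFill.clean_data_model(data_model, fill_all=False,
                                           sza_lim=90)
```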
- classmethod merra_clouds(h5_source, var_meta=None, merra_fill_flag=8, fill_all=False, model_path=None)[source]
Quick check to see if the cloud data is from a MERRA source, in which case it should be gap-free and cloud_fill_flag will be written with all 8's
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
merra_fill_flag (int) – Integer fill flag representing where merra data was used as source cloud data.
fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.
model_path (str | None) – Directory to load the phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
- Returns:
is_merra (bool) – Flag that is True if the cloud data is from MERRA
- classmethod run(h5_source, fill_all=False, model_path=None, var_meta=None, sza_lim=90, col_chunk=None, max_workers=None, low_mem=False)[source]
Fill cloud properties using phygnn predictions. Original files will be archived to a new "raw/" sub-directory.
- Parameters:
h5_source (str) – Path to directory containing multi-file resource file sets. Available formats: /h5_dir/, /h5_dir/prefix*suffix
fill_all (bool) – Flag to fill all cloud properties for all timesteps where cloud_type is cloudy.
model_path (str | None) – Directory to load the phygnn model from. This is typically a file path to a .pkl file with an accompanying .json file in the same directory. None will try to use the default model path from the mlclouds project directory.
var_meta (str | pd.DataFrame | None) – CSV file or dataframe containing metadata for all NSRDB variables. Defaults to the NSRDB var meta CSV in the git repo.
sza_lim (int, optional) – Solar zenith angle limit below which missing cloud property data will be gap filled. By default 90, which fills all missing daylight data.
col_chunk (None | int) – Optional chunking method to gap fill one column chunk at a time to reduce memory requirements. If provided, this should be an integer specifying how many columns to work on at one time.
max_workers (None | int) – Maximum workers for running mlclouds in parallel. 1 is serial and None uses all available workers.
low_mem (bool) – Option to run predictions in low memory mode. Typically the memory bloat during prediction is (n_time x n_sites x n_nodes_per_layer); low_mem=True will reduce this to (1000 x n_nodes_per_layer).
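A typical end-to-end call per the signature above; the path, chunk size, and worker count are placeholders:

```python
from nsrdb.gap_fill.mlclouds_fill import MLCloudsFill

# Gap fill all cloud properties in place; originals archived to raw/.
MLCloudsFill.run('/h5_dir/prefix*suffix', fill_all=False,
                 col_chunk=1000, max_workers=4)
```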