phygnn.utilities.pre_processing.PreProcess
- class PreProcess(features, feature_names=None)[source]
Bases: object
Class to handle the pre-processing of feature data.
- Parameters:
features (np.ndarray | pd.DataFrame) – Feature data in a 2D array or DataFrame.
feature_names (str, optional) – Feature names, used if features is an ndarray, by default None
Methods
check_one_hot_categories(one_hot_categories[, feature_names]) – Check one hot features and categories for duplicate names and against feature names if provided.
normalize(arr[, mean, stdev]) – Normalize features with mean at 0 and stdev of 1.
one_hot(features[, feature_names, ...]) – Process str and int columns in the feature data to one-hot vectors.
process_one_hot([convert_int, categories, ...]) – Process str and int columns in the feature data to one-hot vectors.
unnormalize(norm_arr, mean, stdev) – Unnormalize data that was normalized to a mean of 0 and stdev of 1.
update_names(names, categories) – Update feature names with the OHE categories.
- static normalize(arr, mean=None, stdev=None)[source]
Normalize features with mean at 0 and stdev of 1.
- Parameters:
arr (ndarray | pd.DataFrame) – Native data. DataFrames are converted to arrays.
mean (float | None) – mean to use for normalization
stdev (float | None) – stdev to use for normalization
- Returns:
norm_arr (ndarray) – normalized data
mean (np.ndarray) – 1D array of mean values used for normalization with length equal to number of features
stdev (np.ndarray) – 1D array of stdev values used for normalization with length equal to number of features
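The normalization described above can be sketched with plain NumPy. This is an illustrative reimplementation, not the phygnn source: the name `normalize_sketch` is hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption.

```python
import numpy as np


def normalize_sketch(arr, mean=None, stdev=None):
    """Column-wise normalization to mean 0 and stdev 1 (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)
    # Compute per-feature statistics only when the caller supplies none
    if mean is None:
        mean = arr.mean(axis=0)
    if stdev is None:
        stdev = arr.std(axis=0)  # assumption: population stdev (ddof=0)
    norm_arr = (arr - mean) / stdev
    return norm_arr, mean, stdev


# Three samples, two features
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
norm_arr, mean, stdev = normalize_sketch(features)
```

Passing a previously computed `mean` and `stdev` back in lets new data be scaled with the same statistics as the training data.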
- static unnormalize(norm_arr, mean, stdev)[source]
Unnormalize data that was normalized to a mean of 0 and stdev of 1.
- Parameters:
norm_arr (ndarray) – normalized data
mean (float) – mean used for normalization
stdev (float) – stdev used for normalization
- Returns:
native_arr (ndarray) – native un-normalized data
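Unnormalizing inverts the scaling: multiply by the stdev and add the mean back. The round trip below is a minimal sketch in plain NumPy. `unnormalize_sketch` is a hypothetical name and this is not the phygnn source.

```python
import numpy as np


def unnormalize_sketch(norm_arr, mean, stdev):
    """Invert a mean/stdev normalization (hypothetical helper)."""
    return np.asarray(norm_arr, dtype=float) * stdev + mean


native = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
mean = native.mean(axis=0)
stdev = native.std(axis=0)

# Normalize, then recover the native values with the same statistics
norm_arr = (native - mean) / stdev
restored = unnormalize_sketch(norm_arr, mean, stdev)
```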
- static check_one_hot_categories(one_hot_categories, feature_names=None)[source]
Check one hot features and categories for duplicate names and against feature names if provided
- Parameters:
one_hot_categories (dict, optional) – Features to one-hot encode using given categories
feature_names (list, optional) – Feature names to check the one-hot categories against, by default None
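The exact checks are not spelled out above, so the following is only one plausible reading, written as a dependency-free sketch. It assumes that category names must be unique across all columns, that every key must appear in the feature names, and that a failed check raises `ValueError`. The helper name and all three rules are assumptions, not confirmed phygnn behavior.

```python
def check_categories_sketch(one_hot_categories, feature_names=None):
    """Validate a one-hot category dict (hypothetical helper, assumed rules)."""
    seen = set()
    for col, cats in one_hot_categories.items():
        # Assumption: each key must name an existing feature column
        if feature_names is not None and col not in feature_names:
            raise ValueError(f"Column {col!r} not found in feature names")
        for cat in cats:
            # Assumption: a category name may appear only once overall
            if cat in seen:
                raise ValueError(f"Duplicate category name: {cat!r}")
            seen.add(cat)


# A valid input passes silently
check_categories_sketch({"color": ["red", "blue"]}, ["color", "size"])

# The same category name under two columns is rejected
try:
    check_categories_sketch({"a": ["x"], "b": ["x"]})
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```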
- static update_names(names, categories)[source]
Update feature names with the OHE categories.
- Parameters:
names (list | None) – Feature or label names
categories (dict) – Categories to use for one hot encoding where a key is the original column name in the feature dataframe and value is a list of the possible unique values of the feature column. The value list must have as many or more entries as unique values in the feature column. This will name the feature column headers for the new one-hot-encoding if features is a dataframe. Empty dict or None results in category names being determined automatically. Format:
- {'col_name1' : ['cat1', 'cat2', 'cat3'],
'col_name2' : ['other_cat1', 'other_cat2']}
- Returns:
names (list | None) – Names updated with categories
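As a rough illustration of the renaming, the sketch below drops each original column name that appears as a key in `categories` and appends its category names at the end. That ordering is an assumption, chosen to match the "removed ... appended as new columns" wording used for the processed features. `update_names_sketch` is a hypothetical name.

```python
def update_names_sketch(names, categories):
    """Replace encoded column names with category names (hypothetical helper)."""
    if names is None:
        return None
    # Keep the columns that are not one-hot encoded, in their original order
    updated = [name for name in names if name not in categories]
    # Assumption: category names are appended after the remaining columns
    for cats in categories.values():
        updated.extend(cats)
    return updated


names = ["temp", "color", "size"]
categories = {"color": ["red", "green", "blue"]}
new_names = update_names_sketch(names, categories)
```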
- process_one_hot(convert_int=False, categories=None, return_ind=False)[source]
Process str and int columns in the feature data to one-hot vectors.
- Parameters:
convert_int (bool, optional) – Flag to convert integer data to one-hot vectors, by default False
categories (dict | None, optional) – Categories to use for one hot encoding where a key is the original column name in the feature dataframe and value is a list of the possible unique values of the feature column. The value list must have as many or more entries as unique values in the feature column. This will name the feature column headers for the new one-hot-encoding if features is a dataframe. Empty dict or None results in category names being determined automatically. Format:
- {'col_name1' : ['cat1', 'cat2', 'cat3'],
'col_name2' : ['other_cat1', 'other_cat2']}
by default None
return_ind (bool, optional) – Return one hot column indices, by default False
- Returns:
processed (np.ndarray | pd.DataFrame) – Feature data with str and int columns removed and one-hot boolean vectors appended as new columns. If features is a dataframe and categories is input, the new one-hot columns will be named according to categories.
one_hot_ind (list, optional) – List of numeric column indices in the native data that are to be transformed into one-hot vectors. Returned only if return_ind is True.
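To make the shape of the result concrete, here is a minimal NumPy sketch of the encoding itself: one string column is compared against an explicit category list to produce boolean one-hot vectors, which are then appended after the numeric columns. The helper name is hypothetical, and the code illustrates the idea rather than the phygnn implementation.

```python
import numpy as np


def one_hot_column_sketch(values, categories):
    """Encode one column as boolean one-hot vectors (hypothetical helper)."""
    values = np.asarray(values)
    cats = np.asarray(categories)
    # Broadcasting compares every value against every category:
    # the result has shape (n_samples, n_categories)
    return values[:, None] == cats[None, :]


colors = ["red", "blue", "red", "green"]
encoded = one_hot_column_sketch(colors, ["red", "green", "blue"])

# Append the one-hot columns after the remaining numeric feature column
numeric = np.array([[0.1], [0.2], [0.3], [0.4]])
processed = np.hstack([numeric, encoded.astype(float)])
```

The category list may contain more entries than the data does. Unused categories simply yield all-False columns, which keeps the column layout stable across datasets.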
- classmethod one_hot(features, feature_names=None, convert_int=False, categories=None, return_ind=False)[source]
Process str and int columns in the feature data to one-hot vectors.
- Parameters:
features (np.ndarray | pd.DataFrame) – Feature data in a 2D array or DataFrame.
feature_names (str, optional) – Feature names, used if features is an ndarray, by default None
convert_int (bool, optional) – Flag to convert integer data to one-hot vectors, by default False
categories (dict | None, optional) – Categories to use for one hot encoding where a key is the original column name in the feature dataframe and value is a list of the possible unique values of the feature column. The value list must have as many or more entries as unique values in the feature column. This will name the feature column headers for the new one-hot-encoding if features is a dataframe. Empty dict or None results in category names being determined automatically. Format:
- {'col_name1' : ['cat1', 'cat2', 'cat3'],
'col_name2' : ['other_cat1', 'other_cat2']}
by default None
return_ind (bool, optional) – Return one hot column indices, by default False
- Returns:
processed (np.ndarray | pd.DataFrame) – Feature data with str and int columns removed and one-hot boolean vectors appended as new columns. If features is a dataframe and categories is input, the new one-hot columns will be named according to categories.
one_hot_ind (list, optional) – List of numeric column indices in the native data that are to be transformed into one-hot vectors. Returned only if return_ind is True.
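The role of `convert_int` and of the returned indices can be sketched as a column scan: string columns are always selected, and integer columns only when `convert_int` is set. The rule that every value in a column must match the type is an assumption, and `one_hot_indices_sketch` is a hypothetical name. The library may detect columns differently.

```python
import numpy as np


def one_hot_indices_sketch(features, convert_int=False):
    """Find column indices to one-hot encode (hypothetical helper)."""
    # dtype=object preserves the Python type of each cell
    features = np.asarray(features, dtype=object)
    kinds = (str, int) if convert_int else (str,)
    indices = []
    for j in range(features.shape[1]):
        column = features[:, j]
        # Assumption: a column qualifies only if all of its values match;
        # bool is excluded because it is a subclass of int
        if all(isinstance(v, kinds) and not isinstance(v, bool) for v in column):
            indices.append(j)
    return indices


data = [[1.5, "red", 3], [2.5, "blue", 4]]
str_only = one_hot_indices_sketch(data)
with_ints = one_hot_indices_sketch(data, convert_int=True)
```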