phygnn.utilities.pre_processing.PreProcess

class PreProcess(features, feature_names=None)

Bases: object

Class to handle the pre-processing of feature data.

Parameters:
  • features (np.ndarray | pd.DataFrame) – Feature data in a 2D array or DataFrame.

  • feature_names (list, optional) – Feature names, used if features is an ndarray, by default None

Methods

check_one_hot_categories(one_hot_categories[, ...])

Check one-hot features and categories for duplicate names and, if feature names are provided, check them against those names.

normalize(arr[, mean, stdev])

Normalize features to a mean of 0 and a stdev of 1.

one_hot(features[, feature_names, ...])

Process str and int columns in the feature data to one-hot vectors.

process_one_hot([convert_int, categories, ...])

Process str and int columns in the feature data to one-hot vectors.

unnormalize(norm_arr, mean, stdev)

Restore data that was normalized to a mean of 0 and a stdev of 1 to its native scale.

update_names(names, categories)

Update feature names with the OHE categories.

static normalize(arr, mean=None, stdev=None)

Normalize features to a mean of 0 and a stdev of 1.

Parameters:
  • arr (ndarray | pd.DataFrame) – Native data. DataFrames are converted to arrays.

  • mean (float | None) – mean to use for normalization

  • stdev (float | None) – stdev to use for normalization

Returns:

  • norm_arr (ndarray) – normalized data

  • mean (np.ndarray) – 1D array of mean values used for normalization with length equal to number of features

  • stdev (np.ndarray) – 1D array of stdev values used for normalization with length equal to number of features
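The normalization described above can be illustrated with a standalone NumPy sketch. This is not phygnn's implementation: the function name `zscore_normalize` is hypothetical, and computing the statistics per column when `mean` or `stdev` is None is an assumption based on the documented return shapes.

```python
import numpy as np


def zscore_normalize(arr, mean=None, stdev=None):
    """Hypothetical stand-in for a column-wise z-score normalization."""
    arr = np.asarray(arr, dtype=float)
    # Assumption: statistics are computed per feature (column)
    # only when the caller does not supply them.
    if mean is None:
        mean = arr.mean(axis=0)
    if stdev is None:
        stdev = arr.std(axis=0)
    norm_arr = (arr - mean) / stdev
    return norm_arr, mean, stdev


features = np.array([[1.0, 10.0],
                     [2.0, 20.0],
                     [3.0, 30.0]])
norm_arr, mean, stdev = zscore_normalize(features)

# Reuse the training statistics to normalize new data consistently.
new_norm, _, _ = zscore_normalize(np.array([[2.0, 20.0]]),
                                  mean=mean, stdev=stdev)
```

Passing a previously computed `mean` and `stdev` back in is what keeps training and inference data on the same scale, which is presumably why the documented method returns both alongside the normalized array.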

static unnormalize(norm_arr, mean, stdev)

Restore data that was normalized to a mean of 0 and a stdev of 1 to its native scale.

Parameters:
  • norm_arr (ndarray) – normalized data

  • mean (float) – mean used for normalization

  • stdev (float) – stdev used for normalization

Returns:

native_arr (ndarray) – native un-normalized data
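As a rough illustration of the inverse operation, the sketch below multiplies by the stdev and adds the mean back. The function name `zscore_unnormalize` is hypothetical, and the formula is the standard z-score inverse rather than a copy of the library's code.

```python
import numpy as np


def zscore_unnormalize(norm_arr, mean, stdev):
    """Hypothetical inverse of a z-score normalization."""
    return np.asarray(norm_arr, dtype=float) * stdev + mean


native = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
mean = native.mean(axis=0)
stdev = native.std(axis=0)
norm_arr = (native - mean) / stdev

# Round trip: normalized data comes back to its native scale.
restored = zscore_unnormalize(norm_arr, mean, stdev)
```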

static check_one_hot_categories(one_hot_categories, feature_names=None)

Check one-hot features and categories for duplicate names and, if feature names are provided, check them against those names.

Parameters:
  • one_hot_categories (dict, optional) – Features to one-hot encode using given categories

  • feature_names (list, optional) – Feature names, by default None
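The kind of validation described here might look like the sketch below. The function name `check_categories_sketch`, the use of `ValueError`, and the exact rules are all assumptions made for illustration. The documentation above does not state which exception the real method raises.

```python
def check_categories_sketch(one_hot_categories, feature_names=None):
    """Hypothetical validation of a one-hot categories dict."""
    # Flatten all category names across every feature.
    all_categories = [c for cats in one_hot_categories.values() for c in cats]

    # Category names become column names, so they must be unique.
    if len(all_categories) != len(set(all_categories)):
        raise ValueError("Duplicate category names found.")

    # A category must not reuse the name of a feature being encoded.
    clashes = set(all_categories) & set(one_hot_categories)
    if clashes:
        raise ValueError(f"Category names clash with features: {sorted(clashes)}")

    # If feature names are known, every encoded feature must be among them.
    if feature_names is not None:
        missing = [f for f in one_hot_categories if f not in feature_names]
        if missing:
            raise ValueError(f"Features not found: {missing}")


categories = {"season": ["winter", "summer"], "sky": ["clear", "cloudy"]}
check_categories_sketch(categories, feature_names=["season", "sky", "temp"])

# The category "x" appears under two features, so the check fails.
try:
    check_categories_sketch({"a": ["x", "y"], "b": ["x", "z"]})
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
```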

static update_names(names, categories)

Update feature names with the OHE categories.

Parameters:
  • names (list | None) – Feature or label names

  • categories (dict) – Categories to use for one hot encoding where a key is the original column name in the feature dataframe and value is a list of the possible unique values of the feature column. The value list must have as many or more entries as unique values in the feature column. This will name the feature column headers for the new one-hot-encoding if features is a dataframe. Empty dict or None results in category names being determined automatically. Format:

    {'col_name1': ['cat1', 'cat2', 'cat3'],
     'col_name2': ['other_cat1', 'other_cat2']}

Returns:

names (list | None) – Names updated with categories
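A minimal sketch of such a renaming step follows. The function name `update_names_sketch` is hypothetical, and the ordering (original columns dropped, category names appended at the end) is an assumption chosen to match the "removed ... appended as new columns" wording used for the one-hot methods on this page.

```python
def update_names_sketch(names, categories):
    """Hypothetical renaming: drop encoded columns, append their categories."""
    if names is None:
        return None
    # Keep every name that is not being one-hot encoded.
    updated = [n for n in names if n not in categories]
    # Append the category names in the order the columns appeared.
    for col in names:
        if col in categories:
            updated.extend(categories[col])
    return updated


names = ["temp", "season", "wind"]
categories = {"season": ["winter", "summer"]}
new_names = update_names_sketch(names, categories)
```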

process_one_hot(convert_int=False, categories=None, return_ind=False)

Process str and int columns in the feature data to one-hot vectors.

Parameters:
  • convert_int (bool, optional) – Flag to convert integer data to one-hot vectors, by default False

  • categories (dict | None, optional) – Categories to use for one hot encoding where a key is the original column name in the feature dataframe and value is a list of the possible unique values of the feature column. The value list must have as many or more entries as unique values in the feature column. This will name the feature column headers for the new one-hot-encoding if features is a dataframe. Empty dict or None results in category names being determined automatically. Format:

    {'col_name1': ['cat1', 'cat2', 'cat3'],
     'col_name2': ['other_cat1', 'other_cat2']}

    By default None.

  • return_ind (bool, optional) – Return one hot column indices, by default False

Returns:

  • processed (np.ndarray | pd.DataFrame) – Feature data with str and int columns removed and one-hot boolean vectors appended as new columns. If features is a dataframe and categories is input, the new one-hot columns will be named according to categories.

  • one_hot_ind (list, optional) – List of integer column indices in the native data that are transformed into one-hot vectors. Returned only if return_ind is True.
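To make the described transformation concrete, here is a small standalone sketch. It is not the library's code: `one_hot_sketch` and `is_categorical` are hypothetical names, categories are determined automatically by sorting the unique values, and the one-hot columns are stored as 0.0/1.0 floats so they can sit in one array with the numeric columns.

```python
import numpy as np


def is_categorical(column, convert_int):
    """Assumption: str columns are always encoded, int columns on request."""
    if all(isinstance(v, str) for v in column):
        return True
    return convert_int and all(isinstance(v, int) for v in column)


def one_hot_sketch(rows, convert_int=False):
    """Hypothetical one-hot encoding of str (and optionally int) columns."""
    columns = list(zip(*rows))
    one_hot_ind = [j for j, col in enumerate(columns)
                   if is_categorical(col, convert_int)]

    # Remaining columns are kept as plain numeric data.
    keep = [j for j in range(len(columns)) if j not in one_hot_ind]
    blocks = [np.array([[row[j] for j in keep] for row in rows], dtype=float)]

    # Append one block of one-hot columns per encoded feature.
    for j in one_hot_ind:
        cats = sorted(set(columns[j]))
        blocks.append(np.array([[float(v == c) for c in cats]
                                for v in columns[j]]))

    return np.hstack(blocks), one_hot_ind


rows = [[1.5, "a", 3],
        [2.5, "b", 4],
        [3.5, "a", 3]]
processed, one_hot_ind = one_hot_sketch(rows)
processed_int, one_hot_ind_int = one_hot_sketch(rows, convert_int=True)
```

In the first call only the string column (index 1) is encoded. With `convert_int=True` the integer column (index 2) is encoded as well, so the result gains one column per unique integer value.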

classmethod one_hot(features, feature_names=None, convert_int=False, categories=None, return_ind=False)

Process str and int columns in the feature data to one-hot vectors.

Parameters:
  • features (np.ndarray | pd.DataFrame) – Feature data in a 2D array or DataFrame.

  • feature_names (list, optional) – Feature names, used if features is an ndarray, by default None

  • convert_int (bool, optional) – Flag to convert integer data to one-hot vectors, by default False

  • categories (dict | None, optional) – Categories to use for one hot encoding where a key is the original column name in the feature dataframe and value is a list of the possible unique values of the feature column. The value list must have as many or more entries as unique values in the feature column. This will name the feature column headers for the new one-hot-encoding if features is a dataframe. Empty dict or None results in category names being determined automatically. Format:

    {'col_name1': ['cat1', 'cat2', 'cat3'],
     'col_name2': ['other_cat1', 'other_cat2']}

    By default None.

  • return_ind (bool, optional) – Return one hot column indices, by default False

Returns:

  • processed (np.ndarray | pd.DataFrame) – Feature data with str and int columns removed and one-hot boolean vectors appended as new columns. If features is a dataframe and categories is input, the new one-hot columns will be named according to categories.

  • one_hot_ind (list, optional) – List of integer column indices in the native data that are transformed into one-hot vectors. Returned only if return_ind is True.
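The categories argument allows the value list to contain more entries than the unique values actually present in a column. The sketch below shows what that implies for a single column. `encode_with_categories` is a hypothetical helper, and raising `ValueError` for values missing from the list is an assumption.

```python
import numpy as np


def encode_with_categories(column, categories):
    """Hypothetical one-hot encoding against a fixed category list."""
    unknown = set(column) - set(categories)
    if unknown:
        raise ValueError(f"Values missing from categories: {sorted(unknown)}")
    # One boolean column per listed category, in the listed order.
    return np.array([[v == c for c in categories] for v in column])


column = ["winter", "summer", "winter"]
categories = {"season": ["winter", "spring", "summer"]}

encoded = encode_with_categories(column, categories["season"])
```

Because "spring" is listed but never occurs, its column is all False. Fixing the category list this way keeps the output width stable across datasets that do not each contain every category.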