flasc.data_processing.filtering

Implement filtering class and functions for FLASC data.

Functions

df_get_no_faulty_measurements

Get the number of faulty measurements for a specific turbine.

df_mark_turbdata_as_faulty

Mark turbine data as faulty based on a condition.

filter_df_by_faulty_impacting_turbines

Assign faulty measurements based on upstream turbines faults.

Classes

FlascFilter

Implement filtering class for SCADA data.

flasc.data_processing.filtering.df_get_no_faulty_measurements(df, turbine)

Get the number of faulty measurements for a specific turbine.

Parameters:
  • df (pd.DataFrame | FlascDataFrame) --

    Dataframe containing the turbine data, formatted in the generic SCADA data format. Namely, the dataframe should at the very least contain the columns:

    • Time of each measurement: time

    • Wind speed of each turbine: ws_000, ws_001, ...

    • Power production of each turbine: pow_000, pow_001, ...

  • turbine (int) -- The turbine identifier for which the number of faulty measurements should be counted.

Returns:

Number of faulty measurements for the turbine.

Return type:

N_isnan (int)
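
As a rough illustration of the counting described above, the sketch below tallies NaN entries in one turbine's power column. Treating a NaN in `pow_XXX` as the marker of a faulty measurement is an assumption (suggested by the return name `N_isnan`), and `count_faulty` is a hypothetical helper, not the FLASC function itself.

```python
import numpy as np
import pandas as pd


def count_faulty(df: pd.DataFrame, turbine: int) -> int:
    """Count rows whose power reading for `turbine` is NaN (assumed fault marker)."""
    col = f"pow_{turbine:03d}"
    return int(df[col].isna().sum())


# Small made-up dataset in the generic SCADA column layout.
df = pd.DataFrame(
    {
        "time": pd.date_range("2024-01-01", periods=4, freq="10min"),
        "ws_000": [5.0, 6.0, np.nan, 7.0],
        "pow_000": [300.0, np.nan, np.nan, 900.0],
    }
)
n_faulty = count_faulty(df, turbine=0)
```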

flasc.data_processing.filtering.df_mark_turbdata_as_faulty(df, cond, turbine_list, exclude_columns=[])

Mark turbine data as faulty based on a condition.

Parameters:
  • df (pd.DataFrame | FlascDataFrame) -- Dataframe containing the turbine data, formatted in the generic SCADA data format.

  • cond (iterable) -- List or array-like variable with bool entries indicating whether the condition is met. These should be situations in which you classify the data as faulty, for example high wind speeds combined with low power production, NaN entries, or self-flagged status variables.

  • turbine_list (int, list) -- Turbine identifier(s) for which the data should be flagged as faulty when the condition is met.

  • exclude_columns (list, optional) -- List of columns that should not be considered for the filtering. Defaults to [].

Returns:

Dataframe with the faulty measurements marked as None.

Return type:

pd.DataFrame | FlascDataFrame
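
The marking step can be pictured with plain pandas. This is a minimal sketch under the assumption that a turbine's columns are those ending in its zero-padded index (for example `_000`); `mark_faulty` is a hypothetical stand-in and may differ from the library's behavior in details.

```python
import numpy as np
import pandas as pd


def mark_faulty(df, cond, turbine_list, exclude_columns=()):
    """Return a copy of df with the selected turbines' columns set to NaN where cond holds."""
    if isinstance(turbine_list, int):
        turbine_list = [turbine_list]
    cond = np.asarray(cond, dtype=bool)
    out = df.copy()
    for ti in turbine_list:
        suffix = f"_{ti:03d}"  # assumed naming convention for per-turbine columns
        cols = [c for c in out.columns if c.endswith(suffix) and c not in exclude_columns]
        out.loc[cond, cols] = np.nan
    return out


df = pd.DataFrame(
    {
        "ws_000": [12.0, 11.5, 3.0],
        "pow_000": [5.0, 2000.0, 50.0],
        "ws_001": [12.1, 11.4, 3.1],
        "pow_001": [1990.0, 2005.0, 55.0],
    }
)
# High wind speed but almost no power: classify as faulty for turbine 0 only.
cond = (df["ws_000"] > 10.0) & (df["pow_000"] < 100.0)
marked = mark_faulty(df, cond, turbine_list=0)
```

Only the first row of turbine 0 is blanked here; turbine 1 and the input dataframe are left untouched.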

class flasc.data_processing.filtering.FlascFilter(df, turbine_names=None)

Implement filtering class for SCADA data.

This class allows a user to filter turbine data based on the wind-speed power curve. This class includes several useful filtering methods:

1. Filtering based on prespecified boxes/windows. Any data outside of the specified box is considered faulty.

2. Filtering based on x-wise distance from the mean power curve. Any data too far off the mean curve is considered faulty.

3. Filtering based on the standard deviation from the mean power curve. This differs slightly from (2) in that it allows the user to consider variations in standard deviation per power bin.

_get_all_unique_flags()

Returns all unique flags in the dataframe.

Private function that grabs all the unique filter flags that are available in self.df_filters and returns them as a list of strings. This is helpful when plotting the various filter sources in a scatter plot, for example.

Returns:

List with all unique flags available in self.df_filters, each entry being a string.

Return type:

all_flags (list)

_reset_mean_power_curves(ws_bins=np.arange(0.0, 25.5, 0.5))

_get_mean_power_curves(df=None, turbine_subset=None)

Calculates the mean power production in bins of the wind speed.

Calculates the mean power production in bins of the wind speed, for all turbines in the wind farm.

Parameters:
  • ws_bins (iterable, optional) -- Wind speed bins. Defaults to np.arange(0.0, 25.5, 0.5).

  • df (pd.DataFrame) --

    Dataframe containing the turbine data, formatted in the generic SCADA data format. Namely, the dataframe should at the very least contain the columns:

    • Time of each measurement: time

    • Wind speed of each turbine: ws_000, ws_001, ...

    • Power production of each turbine: pow_000, pow_001, ...

  • turbine_subset (list, optional) -- List of turbine indices to calculate the mean power curve for. If None is specified, defaults to calculating it for all turbines.

Returns:

Dataframe containing the wind speed bins and the mean power production value for every turbine.

Return type:

pd.DataFrame
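
The binning described here can be approximated as follows. This is a sketch for a single turbine using `pd.cut` and a groupby; the function name `mean_power_curve`, the output column names, and the use of bin midpoints as the wind speed coordinate are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def mean_power_curve(df, turbine, ws_bins=np.arange(0.0, 25.5, 0.5)):
    """Mean power per wind speed bin for one turbine; NaN readings are ignored."""
    ws = df[f"ws_{turbine:03d}"]
    pw = df[f"pow_{turbine:03d}"]
    # Assign every wind speed to a half-open bin [lower, upper).
    bins = pd.cut(ws, bins=ws_bins, right=False)
    # observed=False keeps empty bins, so the result has one row per bin.
    curve = pw.groupby(bins, observed=False).mean()
    bin_mids = 0.5 * (ws_bins[:-1] + ws_bins[1:])
    return pd.DataFrame({"ws": bin_mids, f"pow_{turbine:03d}": curve.to_numpy()})


df = pd.DataFrame(
    {
        "ws_000": [4.1, 4.3, 4.6, 8.2, 8.4],
        "pow_000": [100.0, 120.0, 150.0, 800.0, np.nan],
    }
)
curve = mean_power_curve(df, turbine=0)
```

Bins without any data come out as NaN, which makes gaps in the curve explicit rather than silently interpolated.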

_set_legend_alpha_to_one(lgd: Legend) -> None

Set the alpha value of the provided Legend object to be 1.

Parameters:

lgd (matplotlib.legend.Legend) -- Legend object

Return type:

None

reset_filters()

Reset all filter variables and assume all data is clean.

filter_by_condition(condition, label, ti: int, verbose: bool = True, apply_filters_to_df: bool = True)

Filter the dataframe for a specific condition, for a specific turbine.

This is a generic method to filter the dataframe for any particular condition, for a specific turbine or specific set of turbines. This provides a platform for user-specific queries to filter and then inspect the data with. You can call this function multiple times and the filters will aggregate chronologically. This filter directly cuts down the dataframe self.df to a filtered subset.

A correct usage is, for example:

    FlascFilter.filter_by_condition(
        condition=(FlascFilter.df["pow_{:03d}".format(ti)] < -1.0e-6),
        label="Power below zero",
        ti=ti,
        verbose=True,
    )

and:

    FlascFilter.filter_by_condition(
        condition=(FlascFilter.df["is_operation_normal_{:03d}".format(ti)] == False),
        label="Self-flagged (is_operation_normal==False)",
        ti=ti,
        verbose=True,
    )

Parameters:
  • condition (iterable) -- List or array-like variable with bool entries indicating whether the condition is met. These should be situations in which you classify the data as faulty, for example high wind speeds combined with low power production, NaN entries, or self-flagged status variables.

  • label (str) -- Name or description of the fault/condition that is flagged.

  • ti (int) -- Turbine identifier, typically an integer, but may also be a list. The measurements of all listed turbines are flagged as faulty wherever condition==True.

  • verbose (bool, optional) -- Print information to console. Defaults to True.

  • apply_filters_to_df (bool, optional) -- Assign the flagged measurements in self.df directly as NaN. Defaults to True.

Returns:

The filtered dataframe.

All measurements that are flagged as faulty are overwritten by "None"/"NaN". If apply_filters_to_df==True, then this dataframe is equal to the internally filtered dataframe 'self.df'.

Return type:

pd.DataFrame | FlascDataFrame
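
One way to picture how successive conditions aggregate is to keep one label per row and let the first filter that flags a row keep its label. The sketch below does this with plain pandas; the `flags` series and the `"clean"` label are hypothetical bookkeeping chosen for illustration and are not necessarily how `self.df_filters` is stored internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ws_000": [5.0, 6.0, 7.0, 8.0],
        "pow_000": [-5.0, 400.0, 600.0, 800.0],
        "is_operation_normal_000": [True, True, False, True],
    }
)
ti = 0
# One label per row; "clean" until some filter flags the row.
flags = pd.Series("clean", index=df.index)

conditions = [
    (df[f"pow_{ti:03d}"] < -1.0e-6, "Power below zero"),
    (~df[f"is_operation_normal_{ti:03d}"], "Self-flagged (is_operation_normal==False)"),
]
for condition, label in conditions:
    # Rows flagged by an earlier filter keep their original label.
    newly_flagged = condition & (flags == "clean")
    flags.loc[newly_flagged] = label

# Apply the aggregated filters: blank out the turbine's measurements on flagged rows.
df_filtered = df.copy()
df_filtered.loc[flags != "clean", [f"ws_{ti:03d}", f"pow_{ti:03d}"]] = np.nan
```

Keeping the labels separate from the data makes it possible to inspect which filter removed which rows before the NaNs are written.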

filter_by_sensor_stuck_faults(columns: list, ti: int, n_consecutive_measurements: int = 3, stddev_threshold: float = 0.001, plot: bool = False, verbose: bool = True)

Filter the turbine measurements for sensor-stuck type of faults.

This is the situation where a turbine measurement reads the exact same value for multiple consecutive timestamps. This typically indicates a "frozen" sensor rather than a true physical effect. This is particularly the case for signals that are known to change at a high rate and are measured with high precision, e.g., wind speed and wind direction measurements.

Parameters:
  • columns (list) -- List of columns which should be checked for sensor-stuck type of faults. A typical choice is ["ws_000", "wd_000"] with ti=0, which are the wind speed and wind direction for turbine 0. We can safely assume that those measurements should change between every 10-minute measurement. Note that you may not want to include "pow_000", since that measurement may be constant for longer periods of time even during normal operation, e.g., when the turbine is shutdown at very low wind speeds or when the turbine is operating above rated wind speed. Note that if any of the signals in 'columns' is flagged as frozen ("stuck"), all measurements of that turbine will be marked faulty.

  • ti (int) -- The turbine identifier for which its measurements should be flagged as faulty when the signals in the columns are found to be frozen ("stuck"). This is typically the turbine number that corresponds to the columns, e.g., if you use columns=["ws_000", "wd_000"] then ti=0, and if you use ["ws_003", "wd_003"] you use ti=3.

  • n_consecutive_measurements (int, optional) -- Number of consecutive measurements that should read the same value for the measurement to be considered "frozen". Defaults to 3.

  • stddev_threshold (float, optional) -- Threshold value, typically a low number. If the set of consecutive measurements does not differ by more than this value, the measurements are considered stuck. Defaults to 0.001.

  • plot (bool, optional) -- Produce plots highlighting a handful of situations in which the measurements are stuck in time. This is typically only helpful if you have more than 1% of measurements being faulty, and you would like to figure out whether this is a numerical issue or this is actually happening. Defaults to False.

  • verbose (bool, optional) -- Print information to console. Defaults to True.

Returns:

Pandas DataFrame with the filtered data, in which faulty turbine measurements are flagged as None/NaN. This is an aggregated filtering variable, so it includes faulty-flagged measurements from filter operations in previous steps.

Return type:

pd.DataFrame | FlascDataFrame
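
A frozen signal can be detected with a rolling standard deviation. The sketch below is one possible reading of the parameters above (a window of `n_consecutive_measurements` samples whose standard deviation falls below `stddev_threshold`); `stuck_mask` is a hypothetical helper and the library's actual detection logic may differ.

```python
import pandas as pd


def stuck_mask(series, n_consecutive_measurements=3, stddev_threshold=0.001):
    """Boolean mask, True where a value belongs to a run of near-constant readings."""
    rolling_std = series.rolling(n_consecutive_measurements).std()
    window_is_stuck = rolling_std < stddev_threshold  # NaN compares as False
    # rolling_std[i] covers rows i-n+1..i, so extend each hit back over its window.
    mask = window_is_stuck.copy()
    for shift in range(1, n_consecutive_measurements):
        mask |= window_is_stuck.shift(-shift, fill_value=False)
    return mask


df = pd.DataFrame(
    {
        "ws_000": [7.1, 7.4, 6.9, 6.9, 6.9, 7.2, 7.5],
        "wd_000": [270.0, 271.5, 269.8, 270.4, 268.9, 268.9, 268.9],
    }
)
ws_stuck = stuck_mask(df["ws_000"])
# If any checked signal is frozen, the whole turbine row is treated as faulty.
any_stuck = ws_stuck | stuck_mask(df["wd_000"])
```

Combining the per-column masks with a logical OR mirrors the note above that a single frozen signal marks all measurements of that turbine as faulty.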

filter_by_power_curve(ti, m_ws_lb=0.95, m_pow_lb=1.01, m_ws_rb=1.05, m_pow_rb=0.99, ws_deadband=0.5, pow_deadband=20.0, no_iterations=10, cutoff_ws=20.0)

Filter the data by offset from the mean power curve in the x-direction.

This is an iterative process because the estimated mean curve actually changes as data is filtered. This process typically converges within a couple iterations.

Parameters:
  • ti (int) -- The turbine identifier for which the data should be filtered.

  • m_ws_lb (float, optional) -- Multiplier on the wind speed defining the left bound for the power curve. Any data to the left of this curve is considered faulty. Defaults to 0.95.

  • m_pow_lb (float, optional) -- Multiplier on the power defining the left bound for the power curve. Any data to the left of this curve is considered faulty. Defaults to 1.01.

  • m_ws_rb (float, optional) -- Multiplier on the wind speed defining the right bound for the power curve. Any data to the right of this curve is considered faulty. Defaults to 1.05.

  • m_pow_rb (float, optional) -- Multiplier on the power defining the right bound for the power curve. Any data to the right of this curve is considered faulty. Defaults to 0.99.

  • ws_deadband (float, optional) -- Deadband in [m/s] around the median power curve within which data is classified as valid by default. Defaults to 0.50.

  • pow_deadband (float, optional) -- Deadband in [kW] around the median power curve within which data is classified as valid by default. Defaults to 20.0.

  • no_iterations (int, optional) -- Number of iterations. The solution typically converges in 2-3 steps, but as the process is very fast, it's better to run a higher number of iterations. Defaults to 10.

  • cutoff_ws (float, optional) -- Upper wind speed limit for the filtering. Typically, this is a value just below the cut-out wind speed. Setting it above the cut-out wind speed causes issues, because we effectively end up with two curves for the same power production (one in region 2, one going down from the cut-out wind speed), which confuses the algorithm. A value of around 15-25 m/s is therefore suggested. Defaults to 20.0.
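
To make the roles of the multipliers and deadbands concrete, the sketch below builds a left and a right bounding curve from a reference power curve and flags points outside them. It is a single-pass simplification under stated assumptions: the way the deadbands are added to the scaled curves, the helper name `power_curve_outliers`, and the reference curve values are all illustrative, and the iterative re-estimation of the mean curve is left out.

```python
import numpy as np


def power_curve_outliers(ws, pw, curve_ws, curve_pow,
                         m_ws_lb=0.95, m_pow_lb=1.01, m_ws_rb=1.05, m_pow_rb=0.99,
                         ws_deadband=0.5, pow_deadband=20.0, cutoff_ws=20.0):
    """Flag points left of the left bound or right of the right bound of a power curve."""
    ws = np.asarray(ws, dtype=float)
    pw = np.asarray(pw, dtype=float)
    curve_ws = np.asarray(curve_ws, dtype=float)
    curve_pow = np.asarray(curve_pow, dtype=float)
    # Scale and shift the reference curve to obtain the two bounding curves (assumed form).
    lb_ws = curve_ws * m_ws_lb - ws_deadband
    lb_pow = curve_pow * m_pow_lb + pow_deadband
    rb_ws = curve_ws * m_ws_rb + ws_deadband
    rb_pow = curve_pow * m_pow_rb - pow_deadband
    # For a non-decreasing curve, "left of the bound" is the same as "above it".
    too_far_left = pw > np.interp(ws, lb_ws, lb_pow)
    too_far_right = pw < np.interp(ws, rb_ws, rb_pow)
    # Points at or above cutoff_ws are not evaluated.
    return (too_far_left | too_far_right) & (ws < cutoff_ws)


# Made-up reference curve: cut-in near 4 m/s, rated power of 2000 kW at 12 m/s.
curve_ws = [0.0, 4.0, 8.0, 12.0, 25.0]
curve_pow = [0.0, 0.0, 1000.0, 2000.0, 2000.0]
ws = [8.0, 10.0, 5.0, 22.0]
pw = [1000.0, 200.0, 1500.0, 0.0]
faulty = power_curve_outliers(ws, pw, curve_ws, curve_pow)
```

In an iterative variant, the flagged points would be removed, the mean curve recomputed from the remaining data, and the check repeated for `no_iterations` rounds.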

filter_by_floris_power_curve(fm, ti, m_ws_lb=0.95, m_pow_lb=1.01, m_ws_rb=1.05, m_pow_rb=0.99, ws_deadband=0.5, pow_deadband=20.0, cutoff_ws=20.0)

Filter the data by offset from the floris power curve.

Parameters:
  • fm (FlorisModel) -- The FlorisModel object for the farm

  • ti (int) -- The turbine identifier for which the data should be filtered.

  • m_ws_lb (float, optional) -- Multiplier on the wind speed defining the left bound for the power curve. Any data to the left of this curve is considered faulty. Defaults to 0.95.

  • m_pow_lb (float, optional) -- Multiplier on the power defining the left bound for the power curve. Any data to the left of this curve is considered faulty. Defaults to 1.01.

  • m_ws_rb (float, optional) -- Multiplier on the wind speed defining the right bound for the power curve. Any data to the right of this curve is considered faulty. Defaults to 1.05.

  • m_pow_rb (float, optional) -- Multiplier on the power defining the right bound for the power curve. Any data to the right of this curve is considered faulty. Defaults to 0.99.

  • ws_deadband (float, optional) -- Deadband in [m/s] around the median power curve within which data is classified as valid by default. Defaults to 0.50.

  • pow_deadband (float, optional) -- Deadband in [kW] around the median power curve within which data is classified as valid by default. Defaults to 20.0.

  • cutoff_ws (float, optional) -- Wind speed up to which the median power curve is calculated and the data is filtered for. You should make sure this variable is set to a value above the rated wind speed and below the cut-out wind speed. If you are experiencing problems with data filtering and your data points have a downward trend near the high wind speeds, try decreasing this variable's value to 15.0.

Returns:

Pandas DataFrame with the filtered data, in which faulty turbine measurements are flagged as None/NaN. This is an aggregated filtering variable, so it includes faulty-flagged measurements from filter operations in previous steps.

Return type:

pd.DataFrame | FlascDataFrame

get_df()

Return the filtered dataframe to the user.

Returns:

Pandas DataFrame with the filtered data, in which faulty turbine measurements are flagged as None/NaN. This is an aggregated filtering variable, so it includes faulty-flagged measurements from filter operations in previous steps.

Return type:

pd.DataFrame | FlascDataFrame

get_power_curve(calculate_missing=True)

Return the turbine estimated mean power curves to the user.

Parameters:

calculate_missing (bool, optional) -- Calculate the median power curves for the turbines whose power curves have not yet been calculated. Defaults to True.

Returns:

Dataframe containing the estimated mean power curves.

Return type:

pd.DataFrame

plot_farm_mean_power_curve(fm=None)

Plot mean of all turbines' power curves and show individual curves.

Also estimate and plot a mean turbine power curve.

Parameters:

fm (FlorisModel) -- The FlorisModel object for the farm. If specified by the user, then the farm-average turbine power curve from FLORIS will be plotted on top of the SCADA-based power curves.

Returns:

The figure and axis objects of the plot.

Return type:

tuple (fig, ax)

plot_filters_custom_scatter(ti, x_col, y_col, xlabel='Wind speed (m/s)', ylabel='Power (kW)', ax=None)

Plot the filtered data in a scatter plot.

Plot the filtered data in a scatter plot, categorized by the source of their filter/fault. This is a generic function that allows the user to plot various numeric variables on the x and y axis.

Parameters:
  • ti (int) -- Turbine identifier. This is used to determine which turbine's filter history should be looked at.

  • x_col (str) -- Column name to plot on the x-axis. A common choice is "ws_000" for ti=0, for example.

  • y_col (str) -- Column name to plot on the y-axis. A common choice is "pow_000" for ti=0, for example.

  • xlabel (str, optional) -- Figure x-axis label. Defaults to 'Wind speed (m/s)'.

  • ylabel (str, optional) -- Figure y-axis label. Defaults to 'Power (kW)'.

  • ax (plt.Axis, optional) -- Pyplot Figure axis in which the figure should be produced. If None is specified, creates a new figure. Defaults to None.

Returns:

The figure axis in which the scatter plot is drawn.

Return type:

ax

plot_filters_custom_scatter_bokeh(ti, x_col, y_col, title='Wind-speed vs. power curve', xlabel='Wind speed (m/s)', ylabel='Power (kW)', p=None)

Plot the filtered data in a scatter plot.

Plot the filtered data in a scatter plot, categorized by the source of their filter/fault. This is a generic function that allows the user to plot various numeric variables on the x and y axis.

Parameters:
  • ti (int) -- Turbine identifier. This is used to determine which turbine's filter history should be looked at.

  • x_col (str) -- Column name to plot on the x-axis. A common choice is "ws_000" for ti=0, for example.

  • y_col (str) -- Column name to plot on the y-axis. A common choice is "pow_000" for ti=0, for example.

  • title (str, optional) -- Figure title. Defaults to 'Wind-speed vs. power curve'.

  • xlabel (str, optional) -- Figure x-axis label. Defaults to 'Wind speed (m/s)'.

  • ylabel (str, optional) -- Figure y-axis label. Defaults to 'Power (kW)'.

  • p (Bokeh Figure, optional) -- Figure to plot in. If None is specified, creates a new figure. Defaults to None.

Returns:

The figure axis in which the scatter plot is drawn.

Return type:

ax

plot_filters_in_ws_power_curve(ti, fm=None, ax=None)

Plot faulty data in the wind speed power curve.

Plot the wind speed power curve and connect each faulty datapoint to the label it was classified as faulty with.

Parameters:
  • ti (int) -- Turbine number which should be plotted.

  • fm (FlorisModel, optional) -- floris object. If not None, will use this to plot the turbine power curves as implemented in floris. Defaults to None.

  • ax (plt.Axis) -- Pyplot Axis object.

Returns:

The figure axis in which the scatter plot is drawn.

Return type:

ax

plot_postprocessed_in_ws_power_curve(ti, fm=None, ax=None)

Plot the postprocessed data in the wind speed power curve.

Plot the wind speed power curve and mark faulty data according to their filters.

Parameters:
  • ti (int) -- Turbine number which should be plotted.

  • fm (FlorisModel, optional) -- floris object. If not None, will use this to plot the turbine power curves as implemented in floris. Defaults to None.

  • ax (Matplotlib.pyplot Axis, optional) -- Axis to plot in. If None is specified, creates a new figure and axis. Defaults to None.

Returns:

The figure axis in which the scatter plot is drawn.

Return type:

ax

plot_filters_in_time(ti, ax=None)

Plot the filtered data in time.

Generate a bar plot in which the data is grouped by week and the filtering results are shown relative to the amount of data in each week. This plot can be particularly useful to investigate whether certain weeks or time periods show a particularly high number of faulty measurements. This can often be correlated with maintenance time windows, and the user may opt to completely remove any measurements in the identified time period from the dataset.

Parameters:
  • ti (int) -- Index of the turbine of interest.

  • ax (Matplotlib.pyplot Axis, optional) -- Axis to plot in. If None is specified, creates a new figure and axis. Defaults to None.
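
The weekly aggregation behind such a bar plot can be sketched without any plotting. The example assumes a hypothetical per-row `flag` column holding the filter label for one turbine (with `"clean"` for unflagged rows) and computes, for every week, the share of rows carrying each label.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "time": pd.date_range("2024-01-01", periods=28, freq="D"),
        # Hypothetical per-row filter labels for one turbine.
        "flag": ["clean"] * 7 + ["Power below zero"] * 7 + ["clean"] * 14,
    }
)
# Group rows by calendar week (Monday to Sunday) and by filter label.
week = df["time"].dt.to_period("W")
counts = df.groupby([week, "flag"]).size().unstack(fill_value=0)
# Normalize by the number of rows in each week to get shares between 0 and 1.
weekly_share = counts.div(counts.sum(axis=1), axis=0)
```

A week in which a single label dominates, such as the second week here, is the kind of pattern that often points to a maintenance window.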

plot_filters_in_time_bokeh(ti, p=None)

Plot the filtered data in time.

Generate a bar plot in which the data is grouped by week and the filtering results are shown relative to the amount of data in each week. This plot can be particularly useful to investigate whether certain weeks or time periods show a particularly high number of faulty measurements. This can often be correlated with maintenance time windows, and the user may opt to completely remove any measurements in the identified time period from the dataset.

Parameters:
  • ti (int) -- Index of the turbine of interest.

  • p (Bokeh Figure, optional) -- Figure to plot in. If None is specified, creates a new figure. Defaults to None.

Returns:

The figure axis in which the scatter plot is drawn.

Return type:

axis

flasc.data_processing.filtering.filter_df_by_faulty_impacting_turbines(df, ti, df_impacting_turbines, verbose=True)

Assign faulty measurements based on upstream turbines faults.

Assigns a turbine's measurement to NaN for each timestamp for which any of the turbines that are shedding a wake on this turbine is reporting NaN measurements.

Parameters:
  • df (pd.DataFrame | FlascDataFrame) -- Dataframe with SCADA data with measurements formatted according to wd_000, wd_001, wd_002, pow_000, pow_001, pow_002, and so on.

  • ti (int) -- Turbine number for which we are filtering the data. Each turbine that impacts the power production of turbine 'ti' by more than 0.1% is required to report a non-faulty measurement. If not, we classify the measurement of turbine 'ti' as faulty, because we cannot sufficiently know the inflow conditions of this turbine.

  • df_impacting_turbines (pd.DataFrame) --

    A Pandas DataFrame in the format of:

                       0       1          2   3   4   5   6
        wd
        0.0       [6, 5]     [5]     [3, 5]  []  []  []  []
        3.0          [6]     [5]     [3, 5]  []  []  []  []
        ...          ...     ...        ...  ..  ..  ..  ..
        354.0  [6, 5, 3]  [5, 0]     [3, 5]  []  []  []  []
        357.0     [6, 5]     [5]  [3, 5, 4]  []  []  []  []

    The columns indicate the turbine of interest, i.e., the turbine that is waked, and each row shows which turbines are waking that turbine for that particular wind direction. Such a DataFrame can be obtained with:

        import flasc.utilities.floris_tools as ftools
        df_impacting_turbines = ftools.get_all_impacting_turbines(fi)

  • verbose (bool, optional) -- Print information to the console. Defaults to True.

Returns:

The postprocessed dataframe for 'df', filtered for inter-turbine issues like curtailment and turbine downtime.

Return type:

pd.DataFrame
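
The lookup logic can be illustrated on a two-turbine toy case. This is a sketch under several assumptions: the lookup table below is made up, the measured wind direction is taken from the turbine's own `wd_XXX` column, the nearest tabulated direction is used, and a NaN in an upstream turbine's `pow_XXX` column counts as a faulty upstream measurement.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: index is wind direction (deg), columns are the waked
# turbine, entries list the turbines whose wakes reach it.
df_impacting_turbines = pd.DataFrame(
    {0: [[1], []], 1: [[], [0]]},
    index=pd.Index([90.0, 270.0], name="wd"),
)
df = pd.DataFrame(
    {
        "wd_000": [90.0, 90.0, 270.0],
        "pow_000": [500.0, 510.0, 520.0],
        "wd_001": [90.0, 90.0, 270.0],
        "pow_001": [np.nan, 700.0, np.nan],
    }
)

ti = 0
wd_table = df_impacting_turbines.index.to_numpy()
out = df.copy()
for row in out.index:
    wd = out.loc[row, f"wd_{ti:03d}"]
    # Pick the tabulated wind direction closest to the measured one (circular distance).
    diff = np.abs((wd_table - wd + 180.0) % 360.0 - 180.0)
    upstream = df_impacting_turbines[ti].iloc[int(np.argmin(diff))]
    # Blank turbine ti if any turbine waking it reports a NaN power measurement.
    if any(pd.isna(df.loc[row, f"pow_{tj:03d}"]) for tj in upstream):
        out.loc[row, [f"wd_{ti:03d}", f"pow_{ti:03d}"]] = np.nan
```

Only the first row is blanked: in the last row turbine 1 is also faulty, but at 270 degrees it does not wake turbine 0, so the measurement is kept.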