buildings_bench.evaluation

The buildings_bench.evaluation module contains the main functionality for evaluating a model on the benchmark tasks.

The buildings_bench.evaluation.managers.DatasetMetricsManager class is the main entry point.

Simple usage

from buildings_bench import BuildingTypes
from buildings_bench.evaluation.managers import DatasetMetricsManager

# By default, the DatasetMetricsManager keeps track of NRMSE, NMAE, and NMBE
metrics_manager = DatasetMetricsManager()

# Iterate over the dataset using our building dataset generator
for building_name, building_dataset in buildings_datasets_generator:

    # Register a new building with the manager
    metrics_manager.add_building_to_dataset_if_missing(
        dataset_name, building_name,
    )

    # Your model makes predictions
    # ...

    # Register the predictions with the manager
    metrics_manager(
        dataset_name,                 # the name of the dataset, e.g., electricity
        building_name,                # the name of the building, e.g., MT_001
        continuous_targets,           # the ground truth 24 hour targets
        predictions,                  # the model's 24 hour predictions
        BuildingTypes.RESIDENTIAL_INT,    # an int indicating the building type
    )

Advanced usage (with scoring rule)

from buildings_bench.evaluation.managers import DatasetMetricsManager
from buildings_bench.evaluation import scoring_rule_factory

metrics_manager = DatasetMetricsManager(scoring_rule=scoring_rule_factory('crps'))

# Iterate over the dataset
for building_name, building_dataset in buildings_datasets_generator:

    # Register a new building with the manager
    metrics_manager.add_building_to_dataset_if_missing(
        dataset_name, building_name,
    )

    # Your model makes predictions
    # ...

    # Register the predictions with the manager
    metrics_manager(
        dataset_name,           # the name of the dataset, e.g., electricity
        building_name,          # the name of the building, e.g., MT_001
        continuous_targets,     # the ground truth 24 hour targets
        predictions,            # the model's 24 hour predictions
        building_types_mask,    # a boolean tensor indicating building type
        y_categories=targets,   # for scoring rules, the ground truth (discrete categories if using tokenization)
        y_distribution_params=distribution_params, # for scoring rules, the distribution parameters
        centroids=centroids   # for scoring rules with categorical variables, the centroid values
    )

metrics_factory

`buildings_bench.evaluation.metrics_factory(name, types=[MetricType.SCALAR])`

Create a metric from a name. By default, a scalar metric is returned.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the metric.	required
`types`	`List[MetricType]`	The types of the metric.	`[SCALAR]`

Returns:

Name	Type	Description
`metrics_list`	`List[Metric]`	A list of metrics.

scoring_rule_factory

`buildings_bench.evaluation.scoring_rule_factory(name)`

Create a scoring rule from a name.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the scoring rule.	required

Returns:

Name	Type	Description
`sr`	`ScoringRule`	A scoring rule.

all_metrics_list

`buildings_bench.evaluation.all_metrics_list()`

Returns all registered metrics.

Returns:

Name	Type	Description
`metrics_list`	`List[Metric]`	A list of metrics.

BuildingTypes

`buildings_bench.evaluation.managers.BuildingTypes`

Enum for supported types of buildings.

Attributes:

Name	Type	Description
`RESIDENTIAL`	`str`	Residential building type.
`COMMERCIAL`	`str`	Commercial building type.
`RESIDENTIAL_INT`	`int`	Integer representation of residential building type (0).
`COMMERCIAL_INT`	`int`	Integer representation of commercial building type (1).

DatasetMetricsManager

`buildings_bench.evaluation.managers.DatasetMetricsManager`

A class that manages a MetricsManager for each building in one or more benchmark datasets. One DatasetMetricsManager can be used to keep track of all metrics when evaluating a model on all of the benchmark's datasets.

This class will create a pandas DataFrame summary containing the metrics for each building.

Default metrics are NRMSE (CVRMSE), NMAE, NMBE.

`__call__(dataset_name, building_id, y_true, y_pred, building_types_mask=None, building_type=BuildingTypes.COMMERCIAL_INT, **kwargs)`

Compute metrics for a batch of predictions for a single building in a dataset.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	The name of the dataset.	required
`building_id`	`str`	The unique building identifier.	required
`y_true`	`Tensor`	The true (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].	required
`y_pred`	`Tensor`	The predicted (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].	required
`building_types_mask`	`Tensor`	A boolean mask indicating the building type of each building: True (1) if commercial, False (0) if residential. Shape is [batch_size].	`None`
`building_type`	`int`	The building type of the batch. Can be provided instead of building_types_mask if all buildings are of the same type.	`COMMERCIAL_INT`
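As a rough illustration of how `building_types_mask` and `building_type` relate, the sketch below splits a batch by a boolean mask and builds a uniform mask from a single integer type. It uses plain Python lists instead of tensors, and both helper names are hypothetical; the library's internals may differ.

```python
def split_by_building_type(values, building_types_mask):
    """Split per-building values into (residential, commercial) lists.

    Follows the documented convention: True marks a commercial building,
    False a residential one. Hypothetical helper, not a library function.
    """
    if len(values) != len(building_types_mask):
        raise ValueError("values and mask must have the same length")
    residential = [v for v, is_com in zip(values, building_types_mask) if not is_com]
    commercial = [v for v, is_com in zip(values, building_types_mask) if is_com]
    return residential, commercial


def mask_from_building_type(building_type, batch_size):
    """Build a uniform mask when every building in the batch shares one type.

    Uses the documented integer codes: 1 for commercial, 0 for residential.
    """
    return [building_type == 1] * batch_size


residential, commercial = split_by_building_type(["a", "b", "c"], [True, False, True])
uniform_mask = mask_from_building_type(0, batch_size=2)
```

Passing a single `building_type` is the shorter route when a batch comes from one building, which is the case in the usage examples at the top of this page.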

Other Parameters:

Name	Type	Description
`y_categories`	`Tensor`	The true (quantized) load values.
`y_distribution_params`	`Tensor`	The predicted distribution parameters (logits, Gaussian parameters, etc.).
`centroids`	`Tensor`	The bin values for the quantized load.
`loss`	`Tensor`	The loss for the batch.

`__init__(metrics=default_metrics, scoring_rule=None)`

Parameters:

Name	Type	Description	Default
`metrics`	`List[Metric]`	A list of metrics to compute for each building type.	`default_metrics`
`scoring_rule`	`ScoringRule`	A scoring rule to compute for each building type.	`None`

`add_building_to_dataset_if_missing(dataset_name, building_id)`

If the building does not exist, add a new MetricsManager for the building.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	The name of the dataset.	required
`building_id`	`str`	The unique building identifier.	required

`get_building_from_dataset(dataset_name, building_id)`

If the dataset and building exist, return the MetricsManager for the building.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	The name of the dataset.	required
`building_id`	`str`	The unique building identifier.	required

Returns:

Type	Description
`Optional[MetricsManager]`	A MetricsManager if the dataset and building exist, otherwise None.

`summary(dataset_name=None)`

Return a summary of the metrics for the dataset.

Parameters:

Name	Type	Description	Default
`dataset_name`	`str`	The name of the dataset to summarize. If None, summarize all datasets.	`None`

Returns: A pandas DataFrame with the following columns:

    - dataset: The name of the dataset.
    - building_id: The unique ID of the building.
    - building_type: The type of the building.
    - metric: The name of the metric.
    - metric_type: The type of the metric. (scalar or hour_of_day)
    - value: The value of the metric.
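Because the summary is in long format (one row per building and metric), it can be aggregated by filtering on `metric_type` and grouping on `metric`. The sketch below shows the idea on plain dictionaries that mirror the documented columns. The rows, the `building_type` strings, and the helper name are made up for illustration.

```python
from collections import defaultdict
from statistics import median

# Illustrative rows only: the values and building_type strings are invented.
rows = [
    {"dataset": "electricity", "building_id": "MT_001", "building_type": "commercial",
     "metric": "cvrmse", "metric_type": "scalar", "value": 0.20},
    {"dataset": "electricity", "building_id": "MT_002", "building_type": "commercial",
     "metric": "cvrmse", "metric_type": "scalar", "value": 0.30},
    {"dataset": "electricity", "building_id": "MT_001", "building_type": "commercial",
     "metric": "nmae", "metric_type": "scalar", "value": 0.10},
]


def median_by_metric(rows, metric_type="scalar"):
    """Median value per metric name over all rows of the given metric type."""
    grouped = defaultdict(list)
    for row in rows:
        if row["metric_type"] == metric_type:
            grouped[row["metric"]].append(row["value"])
    return {name: median(values) for name, values in grouped.items()}


medians = median_by_metric(rows)
```

With the actual DataFrame, the equivalent pandas expression would be `df[df["metric_type"] == "scalar"].groupby("metric")["value"].median()`.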

MetricsManager

`buildings_bench.evaluation.managers.MetricsManager`

A class that keeps track of all metrics (and a scoring rule) for one or more buildings.

Metrics are computed for each building type (residential and commercial).

Example:

from buildings_bench.evaluation.managers import MetricsManager
from buildings_bench.evaluation import metrics_factory
from buildings_bench import BuildingTypes
import torch


metrics_manager = MetricsManager(metrics=metrics_factory('cvrmse'))

metrics_manager(
    y_true=torch.FloatTensor([1, 2, 3]).view(1,3,1),
    y_pred=torch.FloatTensor([1, 2, 3]).view(1,3,1),
    building_type=BuildingTypes.RESIDENTIAL_INT
)

for metric in metrics_manager.metrics[BuildingTypes.RESIDENTIAL]:
    metric.mean()
    print(metric.value) # prints tensor(0.)

`__call__(y_true, y_pred, building_types_mask=None, building_type=BuildingTypes.COMMERCIAL_INT, **kwargs)`

Compute metrics for a batch of predictions.

Parameters:

Name	Type	Description	Default
`y_true`	`Tensor`	The true (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].	required
`y_pred`	`Tensor`	The predicted (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].	required
`building_types_mask`	`Tensor`	A boolean mask indicating the building type of each building: True (1) if commercial, False (0) if residential. Shape is [batch_size].	`None`
`building_type`	`int`	The building type of the batch. Can be provided instead of building_types_mask if all buildings are of the same type.	`COMMERCIAL_INT`

Other Parameters:

Name	Type	Description
`y_categories`	`Tensor`	The true (quantized) load values.
`y_distribution_params`	`Tensor`	The predicted distribution parameters (logits, Gaussian parameters, etc.).
`centroids`	`Tensor`	The bin values for the quantized load.
`loss`	`Tensor`	The loss for the batch.

`__init__(metrics=None, scoring_rule=None)`

Initializes the MetricsManager.

Parameters:

Name	Type	Description	Default
`metrics`	`List[Metric]`	A list of metrics to compute for each building type.	`None`
`scoring_rule`	`ScoringRule`	A scoring rule to compute for each building type.	`None`

`get_ppl()`

Returns the perplexity of the accumulated loss.
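For reference, perplexity is conventionally the exponential of the mean loss. The sketch below assumes the accumulated losses are per-batch cross-entropy values in nats. The function name and the list-based accumulation are illustrative, not the library's implementation.

```python
import math


def perplexity(batch_losses):
    """Exponential of the mean loss (losses assumed to be cross-entropy in nats)."""
    if not batch_losses:
        raise ValueError("no losses have been accumulated")
    return math.exp(sum(batch_losses) / len(batch_losses))


# A model that is uniformly unsure among 4 categories has loss log(4) per batch.
ppl = perplexity([math.log(4.0), math.log(4.0)])
```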

`reset(loss=True)`

Reset the metrics.

`summary(with_loss=False, with_ppl=False)`

Return a summary of the metrics for the dataset.

A summary maps keys to objects of type Metric or ScoringRule.

MetricType

`buildings_bench.evaluation.metrics.MetricType`

Enum class for metric types.

Attributes:

Name	Type	Description
`SCALAR`	`str`	A scalar metric.
`HOUR_OF_DAY`	`str`	A metric that is calculated for each hour of the day.

BuildingsBenchMetric

`buildings_bench.evaluation.metrics.BuildingsBenchMetric`

An abstract class for all metrics.

The basic idea is to accumulate the errors etc. in a list and then calculate the mean of the errors etc. at the end of the evaluation.

Calling the metric will add the error to the list of errors. Calling .mean() will calculate the mean of the errors, populating the .value attribute.

Attributes:

Name	Type	Description
`name`	`str`	The name of the metric.
`type`	`MetricType`	The type of the metric.
`value`	`float`	The value of the metric.
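The accumulate-then-average pattern described above can be sketched in a few lines of plain Python. `RunningMeanMetric` is a hypothetical stand-in that works on lists of floats; it is not the library class and omits tensors and metric types.

```python
class RunningMeanMetric:
    """Hypothetical stand-in: collect errors on each call, average on mean()."""

    def __init__(self, name, error_fn):
        self.name = name
        self.error_fn = error_fn
        self.errors = []   # grows with every call
        self.value = None  # populated only by mean()

    def __call__(self, y_true, y_pred):
        self.errors.extend(self.error_fn(t, p) for t, p in zip(y_true, y_pred))

    def mean(self):
        if not self.errors:
            raise ValueError("no errors have been accumulated")
        self.value = sum(self.errors) / len(self.errors)

    def reset(self):
        self.errors = []
        self.value = None


mae = RunningMeanMetric("mae", lambda t, p: abs(t - p))
mae([1.0, 2.0], [1.5, 2.0])  # first batch: errors 0.5, 0.0
mae([3.0], [2.0])            # second batch: error 1.0
mae.mean()                   # value is now the mean over all three errors
```

Deferring the average to the end means every prediction carries equal weight, regardless of how the data was split into batches.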

Metric

`buildings_bench.evaluation.metrics.Metric`

Bases: BuildingsBenchMetric

A class that represents an error metric.

Example:

rmse = Metric('rmse', MetricType.SCALAR, squared_error, sqrt=True)
mae = Metric('mae', MetricType.SCALAR, absolute_error)
nmae = Metric('nmae', MetricType.SCALAR, absolute_error, normalize=True)
cvrmse = Metric('cvrmse', MetricType.SCALAR, squared_error, normalize=True, sqrt=True)
nmbe = Metric('nmbe', MetricType.SCALAR, bias_error, normalize=True)

`__call__(y_true, y_pred)`

Parameters:

Name	Type	Description	Default
`y_true`	`Tensor`	shape [batch_size, pred_len]	required
`y_pred`	`Tensor`	shape [batch_size, pred_len]	required

`__init__(name, type, function, **kwargs)`

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the metric.	required
`type`	`MetricType`	The type of the metric.	required
`function`	`Callable`	A function that takes two tensors and returns a tensor.	required

Other Parameters:

Name	Type	Description
`normalize`	`bool`	Whether to normalize the error.
`sqrt`	`bool`	Whether to take the square root of the error.
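To make the roles of `normalize` and `sqrt` concrete, here is a minimal pure-Python sketch of how an elementwise error can be reduced to CVRMSE, NMAE, and NMBE. It assumes normalization divides by the mean of the true values (the usual CVRMSE definition), that the square root is taken before normalizing, and that the bias error is `y_true - y_pred`. All function names here are illustrative.

```python
import math


def squared(y_true, y_pred):
    return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]


def absolute(y_true, y_pred):
    return [abs(t - p) for t, p in zip(y_true, y_pred)]


def bias(y_true, y_pred):
    # Sign convention is an assumption: positive means under-prediction.
    return [t - p for t, p in zip(y_true, y_pred)]


def reduce_errors(errors, y_true, normalize=False, sqrt=False):
    """Mean of the errors, optionally square-rooted, then optionally normalized."""
    value = sum(errors) / len(errors)
    if sqrt:
        value = math.sqrt(value)
    if normalize:
        value /= sum(y_true) / len(y_true)  # divide by the mean true load
    return value


y_true = [2.0, 4.0, 6.0]
y_pred = [3.0, 4.0, 5.0]
cvrmse = reduce_errors(squared(y_true, y_pred), y_true, normalize=True, sqrt=True)
nmae = reduce_errors(absolute(y_true, y_pred), y_true, normalize=True)
nmbe = reduce_errors(bias(y_true, y_pred), y_true, normalize=True)
```

In this example the over- and under-predictions cancel, so the NMBE is zero even though CVRMSE and NMAE are not.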

`mean()`

Calculate the mean of the error metric.

`reset()`

Reset the metric.

absolute_error

`buildings_bench.evaluation.metrics.absolute_error(y_true, y_pred)`

A PyTorch method that calculates the absolute error (AE) metric.

Parameters:

Name	Type	Description	Default
`y_true`	`Tensor`	[batch, pred_len]	required
`y_pred`	`Tensor`	[batch, pred_len]	required

Returns:

Name	Type	Description
`error`	`Tensor`	[batch, pred_len]

squared_error

`buildings_bench.evaluation.metrics.squared_error(y_true, y_pred)`

A PyTorch method that calculates the squared error (SE) metric.

Parameters:

Name	Type	Description	Default
`y_true`	`Tensor`	[batch, pred_len]	required
`y_pred`	`Tensor`	[batch, pred_len]	required

Returns:

Name	Type	Description
`error`	`Tensor`	[batch, pred_len]

bias_error

`buildings_bench.evaluation.metrics.bias_error(y_true, y_pred)`

A PyTorch method that calculates the bias error (BE) metric.

Parameters:

Name	Type	Description	Default
`y_true`	`Tensor`	[batch, pred_len]	required
`y_pred`	`Tensor`	[batch, pred_len]	required

Returns:

Name	Type	Description
`error`	`Tensor`	[batch, pred_len]

ScoringRule

`buildings_bench.evaluation.scoring_rules.ScoringRule`

Bases: BuildingsBenchMetric

An abstract class for all scoring rules.

RankedProbabilityScore

`buildings_bench.evaluation.scoring_rules.RankedProbabilityScore`

Bases: ScoringRule

A class that calculates the ranked probability score (RPS) metric for categorical distributions.

`rps(y_true, y_pred_logits, centroids)`

A PyTorch method that calculates the ranked probability score metric for categorical distributions.

Since the bin values are centroids of clusters along the real line, the width of each bin is computed by summing the distances to its left and right neighboring centroids and dividing by 2. The first and last bins have only one neighbor, so only the distance to the right centroid (first bin) or to the left centroid (last bin) is used.

Parameters:

Name	Type	Description	Default
`y_true`	`Tensor`	of shape [batch_size, seq_len, 1] categorical labels	required
`y_pred_logits`	`Tensor`	of shape [batch_size, seq_len, vocab_size] logits	required
`centroids`	`Tensor`	of shape [vocab_size]	required
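The following pure-Python sketch illustrates one way to compute a width-weighted RPS for a single time step. It makes two assumptions that the text does not settle: edge bins use the full distance to their single neighbor (a halved distance is another plausible reading), and the score is the width-weighted sum of squared differences between the predicted and true cumulative distributions. Function names are illustrative.

```python
import math


def bin_widths(centroids):
    """Interior bins: mean of left and right gaps. Edge bins: the single gap (assumed)."""
    n = len(centroids)
    if n < 2:
        raise ValueError("need at least two centroids")
    widths = []
    for i in range(n):
        if i == 0:
            widths.append(centroids[1] - centroids[0])
        elif i == n - 1:
            widths.append(centroids[-1] - centroids[-2])
        else:
            left = centroids[i] - centroids[i - 1]
            right = centroids[i + 1] - centroids[i]
            widths.append((left + right) / 2)
    return widths


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ranked_probability_score(true_index, logits, centroids):
    """Width-weighted squared distance between predicted and true CDFs."""
    score, cdf = 0.0, 0.0
    for k, (p, w) in enumerate(zip(softmax(logits), bin_widths(centroids))):
        cdf += p
        true_cdf = 1.0 if k >= true_index else 0.0
        score += w * (cdf - true_cdf) ** 2
    return score


centroids = [0.0, 1.0, 3.0, 7.0]
widths = bin_widths(centroids)
uniform_score = ranked_probability_score(1, [0.0, 0.0, 0.0, 0.0], centroids)
confident_score = ranked_probability_score(1, [-20.0, 20.0, -20.0, -20.0], centroids)
```

A prediction concentrated on the correct bin scores close to zero, while the uniform prediction is penalized more heavily where the bins are wide.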

ContinuousRankedProbabilityScore

`buildings_bench.evaluation.scoring_rules.ContinuousRankedProbabilityScore`

Bases: ScoringRule

A class that calculates the Gaussian continuous ranked probability score (CRPS) metric.

`crps(true_continuous, y_pred_distribution_params)`

Computes the Gaussian CRPS.

Parameters:

Name	Type	Description	Default
`true_continuous`	`Tensor`	of shape [batch_size, seq_len, 1]	required
`y_pred_distribution_params`	`Tensor`	of shape [batch_size, seq_len, 2]	required
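For a Gaussian forecast the CRPS has a well-known closed form, sketched below for a single scalar target. The sketch assumes the two distribution parameters are the mean and the standard deviation, in that order; the page does not state the layout of the last dimension, so treat that as an assumption.

```python
import math


def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) forecast against observation y."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


at_mean = gaussian_crps(0.0, 0.0, 1.0)
one_sigma_off = gaussian_crps(1.0, 0.0, 1.0)
```

Lower is better. The score grows as the observation moves away from the predicted mean, and it scales linearly with `sigma` when the standardized error is held fixed.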

aggregate

`buildings_bench.evaluation.aggregate.return_aggregate_median(model_list, results_dir, experiment='zero_shot', metrics=['cvrmse'], exclude_simulated=True, only_simulated=False, oov_list=[], reps=50000)`

Compute the aggregate median for a list of models and metrics over all buildings. Also returns the stratified 95% bootstrap CIs for the aggregate median.

Parameters:

Name	Type	Description	Default
`model_list`	`list`	List of models to compute aggregate median for.	required
`results_dir`	`str`	Path to directory containing results.	required
`experiment`	`str`	Experiment type. Defaults to 'zero_shot'. Options: 'zero_shot', 'transfer_learning'.	`'zero_shot'`
`metrics`	`list`	List of metrics to compute aggregate median for. Defaults to ['cvrmse'].	`['cvrmse']`
`exclude_simulated`	`bool`	Whether to exclude simulated data. Defaults to True.	`True`
`only_simulated`	`bool`	Whether to only include simulated data. Defaults to False.	`False`
`oov_list`	`list`	List of OOV buildings to exclude. Defaults to [].	`[]`
`reps`	`int`	Number of bootstrap replicates to use. Defaults to 50000.	`50000`

Returns:

Name	Type	Description
`result_dict`	`Dict`	Dictionary containing aggregate median and CIs for each metric and building type.
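The idea behind a stratified bootstrap confidence interval for a median can be sketched with the standard library alone. In this sketch each stratum (assumed here to be a dataset) is resampled with replacement at its own size, the resamples are pooled, and a percentile interval is taken over the replicate medians. The function name, the choice of strata, and the percentile method are assumptions, not a description of how `return_aggregate_median` is implemented.

```python
import random
from statistics import median


def stratified_bootstrap_median_ci(strata, reps=2000, alpha=0.05, seed=0):
    """Return (median, lower, upper) using a stratified percentile bootstrap.

    strata maps a stratum name to a list of per-building metric values.
    """
    rng = random.Random(seed)
    pooled = [v for values in strata.values() for v in values]
    point = median(pooled)

    replicate_medians = []
    for _ in range(reps):
        sample = []
        for values in strata.values():
            # Resample within the stratum so its share of the pool is preserved.
            sample.extend(rng.choices(values, k=len(values)))
        replicate_medians.append(median(sample))

    replicate_medians.sort()
    lower = replicate_medians[int((alpha / 2) * reps)]
    upper = replicate_medians[min(reps - 1, int((1 - alpha / 2) * reps))]
    return point, lower, upper


# Illustrative per-building CVRMSE values grouped by a made-up dataset name.
strata = {
    "dataset_a": [0.1, 0.2, 0.3, 0.4, 0.5],
    "dataset_b": [0.2, 0.3, 0.4],
}
point, lower, upper = stratified_bootstrap_median_ci(strata)
```

Resampling within each stratum keeps every dataset's share of buildings fixed across replicates, so a large dataset cannot crowd out a small one by chance.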
