
buildings_bench.evaluation

The buildings_bench.evaluation module contains the main functionality for evaluating a model on the benchmark tasks.

The buildings_bench.evaluation.managers.DatasetMetricsManager class is the main entry point.

Simple usage

from buildings_bench import BuildingTypes
from buildings_bench.evaluation.managers import DatasetMetricsManager

# By default, the DatasetMetricsManager keeps track of NRMSE, NMAE, and NMBE
metrics_manager = DatasetMetricsManager()

# Iterate over the dataset using our building dataset generator
for building_name, building_dataset in buildings_datasets_generator:

    # Register a new building with the manager
    metrics_manager.add_building_to_dataset_if_missing(
        dataset_name, building_name,
    )

    # Your model makes predictions
    # ...

    # Register the predictions with the manager
    metrics_manager(
        dataset_name,                 # the name of the dataset, e.g., electricity
        building_name,                # the name of the building, e.g., MT_001
        continuous_targets,           # the ground truth 24 hour targets
        predictions,                  # the model's 24 hour predictions
        building_type=BuildingTypes.RESIDENTIAL_INT,  # an int indicating the building type
    )

Advanced usage (with scoring rule)

from buildings_bench.evaluation.managers import DatasetMetricsManager
from buildings_bench.evaluation import scoring_rule_factory

metrics_manager = DatasetMetricsManager(scoring_rule = scoring_rule_factory('crps'))

# Iterate over the dataset
for building_name, building_dataset in buildings_datasets_generator:

    # Register a new building with the manager
    metrics_manager.add_building_to_dataset_if_missing(
        dataset_name, building_name,
    )

    # Your model makes predictions
    # ...

    # Register the predictions with the manager
    metrics_manager(
        dataset_name,           # the name of the dataset, e.g., electricity
        building_name,          # the name of the building, e.g., MT_001
        continuous_targets,     # the ground truth 24 hour targets
        predictions,            # the model's 24 hour predictions
        building_types_mask,    # a boolean tensor indicating building type
        y_categories=targets,   # for scoring rules, the ground truth (discrete categories if using tokenization)
        y_distribution_params=distribution_params, # for scoring rules, the distribution parameters
        centroids=centroids   # for scoring rules with categorical variables, the centroid values
    )

metrics_factory

buildings_bench.evaluation.metrics_factory(name: str, types: List[MetricType] = [MetricType.SCALAR]) -> List[Metric]

Create a metric from a name. By default, returns a scalar metric.

Parameters:

Name Type Description Default
name str

The name of the metric.

required
types List[MetricType]

The types of the metric.

[MetricType.SCALAR]

Returns:

Name Type Description
metrics_list List[Metric]

A list of metrics.

scoring_rule_factory

buildings_bench.evaluation.scoring_rule_factory(name: str) -> ScoringRule

Create a scoring rule from a name.

Parameters:

Name Type Description Default
name str

The name of the scoring rule.

required

Returns:

Name Type Description
sr ScoringRule

A scoring rule.

all_metrics_list

buildings_bench.evaluation.all_metrics_list() -> List[Metric]

Returns all registered metrics.

Returns:

Name Type Description
metrics_list List[Metric]

A list of metrics.


BuildingTypes

buildings_bench.evaluation.managers.BuildingTypes

Enum for supported types of buildings.

Attributes:

Name Type Description
RESIDENTIAL str

Residential building type.

COMMERCIAL str

Commercial building type.

RESIDENTIAL_INT int

Integer representation of residential building type (0).

COMMERCIAL_INT int

Integer representation of commercial building type (1).

DatasetMetricsManager

buildings_bench.evaluation.managers.DatasetMetricsManager

A class that manages a MetricsManager for each building in one or more benchmark datasets. One DatasetMetricsManager can be used to keep track of all metrics when evaluating a model on all of the benchmark's datasets.

This class will create a Pandas DataFrame summary containing the metrics for each building.

Default metrics are NRMSE (CVRMSE), NMAE, NMBE.

__call__(dataset_name: str, building_id: str, y_true: torch.Tensor, y_pred: torch.Tensor, building_types_mask: torch.Tensor = None, building_type: int = BuildingTypes.COMMERCIAL_INT, **kwargs: int) -> None

Compute metrics for a batch of predictions for a single building in a dataset.

Parameters:

Name Type Description Default
dataset_name str

The name of the dataset.

required
building_id str

The unique building identifier.

required
y_true torch.Tensor

The true (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].

required
y_pred torch.Tensor

The predicted (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].

required
building_types_mask torch.Tensor

A boolean mask indicating the building type of each building: True (1) if commercial, False (0) if residential. Shape is [batch_size].

None
building_type int

The building type of the batch. Can be provided instead of building_types_mask if all buildings are of the same type.

BuildingTypes.COMMERCIAL_INT

Other Parameters:

Name Type Description
y_categories torch.Tensor

The true load values. (quantized)

y_distribution_params torch.Tensor

logits, Gaussian params, etc.

centroids torch.Tensor

The bin values for the quantized load.

loss torch.Tensor

The loss for the batch.

summary(dataset_name: str = None) -> pd.DataFrame

Return a summary of the metrics for the dataset.

Parameters:

Name Type Description Default
dataset_name str

The name of the dataset to summarize. If None, summarize all datasets.

None

Returns:

Type Description
pd.DataFrame

A Pandas dataframe with the following columns:

  • dataset: The name of the dataset.
  • building_id: The unique ID of the building.
  • building_type: The type of the building.
  • metric: The name of the metric.
  • metric_type: The type of the metric. (scalar or hour_of_day)
  • value: The value of the metric.
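Because the summary is in long format (one row per building, metric, and metric type), a typical next step is to filter and group it. A minimal sketch of that step follows, assuming rows shaped like the columns listed above. The rows are plain dictionaries with made-up values so the sketch runs without pandas, and the string form of building_type and the helper name median_by_building_type are assumptions, not part of the library.

```python
import statistics
from collections import defaultdict

# Hypothetical rows mirroring the documented summary columns (values are made up).
rows = [
    {"dataset": "electricity", "building_id": "MT_001", "building_type": "commercial",
     "metric": "cvrmse", "metric_type": "scalar", "value": 0.25},
    {"dataset": "electricity", "building_id": "MT_002", "building_type": "commercial",
     "metric": "cvrmse", "metric_type": "scalar", "value": 0.5},
    {"dataset": "electricity", "building_id": "MT_003", "building_type": "commercial",
     "metric": "cvrmse", "metric_type": "scalar", "value": 0.75},
    {"dataset": "electricity", "building_id": "MT_004", "building_type": "residential",
     "metric": "cvrmse", "metric_type": "scalar", "value": 0.5},
    {"dataset": "electricity", "building_id": "MT_005", "building_type": "residential",
     "metric": "cvrmse", "metric_type": "scalar", "value": 1.5},
    # A row of a different metric type, which the filter below skips.
    {"dataset": "electricity", "building_id": "MT_001", "building_type": "commercial",
     "metric": "cvrmse", "metric_type": "hour_of_day", "value": 9.0},
]


def median_by_building_type(rows, metric="cvrmse", metric_type="scalar"):
    """Group the matching rows by building type and take the median value."""
    grouped = defaultdict(list)
    for row in rows:
        if row["metric"] == metric and row["metric_type"] == metric_type:
            grouped[row["building_type"]].append(row["value"])
    return {btype: statistics.median(values) for btype, values in grouped.items()}


medians = median_by_building_type(rows)
print(medians)
```

With a real DataFrame, the same filter-then-group logic would normally be written with boolean indexing and groupby.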

MetricsManager

buildings_bench.evaluation.managers.MetricsManager

A class that keeps track of all metrics (and a scoring rule) for one or more buildings.

Metrics are computed for each building type (residential and commercial).

Example:

from buildings_bench.evaluation.managers import MetricsManager
from buildings_bench.evaluation import metrics_factory
from buildings_bench import BuildingTypes
import torch


metrics_manager = MetricsManager(metrics=metrics_factory('cvrmse'))

metrics_manager(
    y_true=torch.FloatTensor([1, 2, 3]).view(1,3,1),
    y_pred=torch.FloatTensor([1, 2, 3]).view(1,3,1),
    building_type = BuildingTypes.RESIDENTIAL_INT
)

for metric in metrics_manager.metrics[BuildingTypes.RESIDENTIAL]:
    metric.mean()
    print(metric.value) # prints tensor(0.)

__init__(metrics: List[Metric] = None, scoring_rule: ScoringRule = None)

Initializes the MetricsManager.

Parameters:

Name Type Description Default
metrics List[Metric]

A list of metrics to compute for each building type.

None
scoring_rule ScoringRule

A scoring rule to compute for each building type.

None

get_ppl()

Returns the perplexity of the accumulated loss.
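Perplexity is conventionally the exponential of the mean cross-entropy loss. The sketch below illustrates that relationship only. It assumes the accumulated losses are per-batch mean cross-entropies in nats, and the function name perplexity is hypothetical rather than the library's implementation.

```python
import math


def perplexity(batch_losses):
    """Return exp(mean loss); losses are assumed to be mean cross-entropies in nats."""
    if not batch_losses:
        raise ValueError("no losses have been accumulated")
    return math.exp(sum(batch_losses) / len(batch_losses))


# A model that is uniformly unsure between 4 categories has loss log(4),
# so its perplexity is 4 (up to floating-point rounding).
ppl = perplexity([math.log(4.0), math.log(4.0)])
print(ppl)
```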

summary(with_loss = False, with_ppl = False)

Return a summary of the metrics for the dataset.

A summary maps keys to objects of type Metric or ScoringRule.

reset(loss: bool = True) -> None

Reset the metrics.

__call__(y_true: torch.Tensor, y_pred: torch.Tensor, building_types_mask: torch.Tensor = None, building_type: int = BuildingTypes.COMMERCIAL_INT, **kwargs: int)

Compute metrics for a batch of predictions.

Parameters:

Name Type Description Default
y_true torch.Tensor

The true (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].

required
y_pred torch.Tensor

The predicted (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].

required
building_types_mask torch.Tensor

A boolean mask indicating the building type of each building: True (1) if commercial, False (0) if residential. Shape is [batch_size].

None
building_type int

The building type of the batch. Can be provided instead of building_types_mask if all buildings are of the same type.

BuildingTypes.COMMERCIAL_INT

Other Parameters:

Name Type Description
y_categories torch.Tensor

The true load values. (quantized)

y_distribution_params torch.Tensor

logits, Gaussian params, etc.

centroids torch.Tensor

The bin values for the quantized load.

loss torch.Tensor

The loss for the batch.


MetricType

buildings_bench.evaluation.metrics.MetricType

Enum class for metric types.

Attributes:

Name Type Description
SCALAR str

A scalar metric.

HOUR_OF_DAY str

A metric that is calculated for each hour of the day.

BuildingsBenchMetric

buildings_bench.evaluation.metrics.BuildingsBenchMetric

An abstract class for all metrics.

The basic idea is to accumulate the errors in a list and then calculate their mean at the end of the evaluation.

Calling the metric will add the error to the list of errors. Calling .mean() will calculate the mean of the errors, populating the .value attribute.
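The accumulate-then-mean pattern described above can be sketched in a few lines of plain Python. RunningMetric is a hypothetical stand-in, not the library class. It hard-codes absolute error and works on flat lists instead of tensors.

```python
class RunningMetric:
    """Hypothetical illustration of the accumulate-then-mean pattern."""

    def __init__(self, name):
        self.name = name
        self.errors = []   # grows with every call
        self.value = None  # populated only by mean()

    def __call__(self, y_true, y_pred):
        # Store the per-element absolute errors of this batch.
        self.errors.extend(abs(t - p) for t, p in zip(y_true, y_pred))

    def mean(self):
        # Average over everything accumulated so far.
        self.value = sum(self.errors) / len(self.errors)

    def reset(self):
        self.errors = []
        self.value = None


mae = RunningMetric("mae")
mae([1.0, 2.0], [1.5, 2.0])  # errors: 0.5, 0.0
mae([3.0], [2.0])            # error: 1.0
mae.mean()
print(mae.value)
```

Note that value stays None until mean() is called, which matches the two-step usage in the MetricsManager example.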

Attributes:

Name Type Description
name str

The name of the metric.

type MetricType

The type of the metric.

value float

The value of the metric.

Metric

buildings_bench.evaluation.metrics.Metric

Bases: BuildingsBenchMetric

A class that represents an error metric.

Example:

rmse = Metric('rmse', MetricType.SCALAR, squared_error, sqrt=True)
mae = Metric('mae', MetricType.SCALAR, absolute_error)
nmae = Metric('nmae', MetricType.SCALAR, absolute_error, normalize=True)
cvrmse = Metric('cvrmse', MetricType.SCALAR, squared_error, normalize=True, sqrt=True)
nmbe = Metric('nmbe', MetricType.SCALAR, bias_error, normalize=True)

__init__(name: str, type: MetricType, function: Callable, **kwargs: Callable)

Parameters:

Name Type Description Default
name str

The name of the metric.

required
type MetricType

The type of the metric.

required
function Callable

A function that takes two tensors and returns a tensor.

required

Other Parameters:

Name Type Description
normalize bool

Whether to normalize the error.

sqrt bool

Whether to take the square root of the error.
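To see how the two flags could combine into a metric such as CVRMSE, here is a rough sketch under stated assumptions: the square root is applied to the mean error first, and normalization divides by the mean of the true values, which is the standard definition of CV(RMSE). The helper name aggregate and the order of operations are assumptions, not a description of the library's internals.

```python
import math


def aggregate(errors, y_true, normalize=False, sqrt=False):
    """Reduce per-element errors to a scalar (assumed order: mean, sqrt, normalize)."""
    value = sum(errors) / len(errors)
    if sqrt:
        value = math.sqrt(value)
    if normalize:
        value = value / (sum(y_true) / len(y_true))
    return value


y_true = [2.0, 4.0, 6.0]
y_pred = [2.0, 4.0, 9.0]
squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]  # [0.0, 0.0, 9.0]

rmse = aggregate(squared, y_true, sqrt=True)                    # sqrt(3)
cvrmse = aggregate(squared, y_true, normalize=True, sqrt=True)  # sqrt(3) / 4
print(rmse, cvrmse)
```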

__call__(y_true, y_pred) -> None

Parameters:

Name Type Description Default
y_true torch.Tensor

shape [batch_size, pred_len]

required
y_pred torch.Tensor

shape [batch_size, pred_len]

required

reset() -> None

Reset the metric.

mean() -> None

Calculate the mean of the error metric.

absolute_error

buildings_bench.evaluation.metrics.absolute_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor

A PyTorch method that calculates the absolute error (AE) metric.

Parameters:

Name Type Description Default
y_true torch.Tensor

[batch, pred_len]

required
y_pred torch.Tensor

[batch, pred_len]

required

Returns:

Name Type Description
error torch.Tensor

[batch, pred_len]

squared_error

buildings_bench.evaluation.metrics.squared_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor

A PyTorch method that calculates the squared error (SE) metric.

Parameters:

Name Type Description Default
y_true torch.Tensor

[batch, pred_len]

required
y_pred torch.Tensor

[batch, pred_len]

required

Returns:

Name Type Description
error torch.Tensor

[batch, pred_len]

bias_error

buildings_bench.evaluation.metrics.bias_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor

A PyTorch method that calculates the bias error (BE) metric.

Parameters:

Name Type Description Default
y_true torch.Tensor

[batch, pred_len]

required
y_pred torch.Tensor

[batch, pred_len]

required

Returns:

Name Type Description
error torch.Tensor

[batch, pred_len]
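The three error functions are element-wise, so the output keeps the [batch, pred_len] shape of the inputs. A small sketch on nested lists shows the idea without requiring torch. The sign convention used for the bias error (true minus predicted) is an assumption, and elementwise is a hypothetical helper.

```python
def elementwise(fn, y_true, y_pred):
    """Apply fn(true, pred) to every element of two [batch, pred_len] nested lists."""
    return [
        [fn(t, p) for t, p in zip(row_true, row_pred)]
        for row_true, row_pred in zip(y_true, y_pred)
    ]


y_true = [[1.0, 2.0, 3.0]]
y_pred = [[1.5, 2.0, 2.0]]

ae = elementwise(lambda t, p: abs(t - p), y_true, y_pred)   # absolute error
se = elementwise(lambda t, p: (t - p) ** 2, y_true, y_pred)  # squared error
be = elementwise(lambda t, p: t - p, y_true, y_pred)         # bias error (assumed sign)
print(ae, se, be)
```

Unlike the absolute and squared errors, the bias error keeps its sign, so over- and under-predictions can cancel when averaged.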


ScoringRule

buildings_bench.evaluation.scoring_rules.ScoringRule

Bases: BuildingsBenchMetric

An abstract class for all scoring rules.

RankedProbabilityScore

buildings_bench.evaluation.scoring_rules.RankedProbabilityScore

Bases: ScoringRule

A class that calculates the ranked probability score (RPS) metric for categorical distributions.

rps(y_true, y_pred_logits, centroids) -> None

A PyTorch method that calculates the ranked probability score metric for categorical distributions.

Since the bin values are centroids of clusters along the real line, we have to compute the width of the bins by summing the distance to the left and right centroids of the bin (divided by 2), except for the first and last bins, where we only need to sum the distance to the right centroid of the first bin and the left centroid of the last bin, respectively.

Parameters:

Name Type Description Default
y_true torch.Tensor

of shape [batch_size, seq_len, 1] categorical labels

required
y_pred_logits torch.Tensor

of shape [batch_size, seq_len, vocab_size] logits

required
centroids torch.Tensor

of shape [vocab_size]

required
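The bin-width rule described for rps can be sketched as follows. This covers only the width computation, not the score itself. It assumes sorted centroids and, since the description leaves this open, that the first and last bins use the full distance to their single neighbor. The function name bin_widths is hypothetical.

```python
def bin_widths(centroids):
    """Width of each bin given sorted centroids along the real line."""
    n = len(centroids)
    if n < 2:
        raise ValueError("need at least two centroids")
    widths = []
    for i in range(n):
        if i == 0:
            # First bin: only a right neighbor exists.
            widths.append(centroids[1] - centroids[0])
        elif i == n - 1:
            # Last bin: only a left neighbor exists.
            widths.append(centroids[-1] - centroids[-2])
        else:
            left = centroids[i] - centroids[i - 1]
            right = centroids[i + 1] - centroids[i]
            # Interior bin: sum of both distances, divided by 2.
            widths.append((left + right) / 2)
    return widths


widths = bin_widths([0.0, 1.0, 3.0, 7.0])
print(widths)
```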

ContinuousRankedProbabilityScore

buildings_bench.evaluation.scoring_rules.ContinuousRankedProbabilityScore

Bases: ScoringRule

A class that calculates the Gaussian continuous ranked probability score (CRPS) metric.

crps(true_continuous, y_pred_distribution_params) -> None

Computes the Gaussian CRPS.

Parameters:

Name Type Description Default
true_continuous torch.Tensor

of shape [batch_size, seq_len, 1]

required
y_pred_distribution_params torch.Tensor

of shape [batch_size, seq_len, 2]

required
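For reference, the Gaussian CRPS has a well-known closed form: with z = (y - mu) / sigma, CRPS = sigma * (z * (2 * Phi(z) - 1) + 2 * phi(z) - 1 / sqrt(pi)), where Phi and phi are the standard normal CDF and PDF. The scalar sketch below evaluates that formula. It assumes the two distribution parameters are the mean and the standard deviation, which the description above does not confirm, and gaussian_crps is a hypothetical name.

```python
import math


def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) forecast for the observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


on_target = gaussian_crps(0.0, 0.0, 1.0)   # observation at the forecast mean
off_target = gaussian_crps(1.0, 0.0, 1.0)  # observation one sigma away
print(on_target, off_target)
```

Lower is better: the score grows as the observation moves away from the forecast mean, and it scales linearly with sigma when z is held fixed.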

aggregate

buildings_bench.evaluation.aggregate.return_aggregate_median(model_list, results_dir, experiment = 'zero_shot', metrics = ['cvrmse'], exclude_simulated = True, only_simulated = False, oov_list = [], reps = 50000)

Compute the aggregate median for a list of models and metrics over all buildings. Also returns the stratified 95% bootstrap CIs for the aggregate median.

Parameters:

Name Type Description Default
model_list list

List of models to compute aggregate median for.

required
results_dir str

Path to directory containing results.

required
experiment str

Experiment type. Defaults to 'zero_shot'. Options: 'zero_shot', 'transfer_learning'.

'zero_shot'
metrics list

List of metrics to compute aggregate median for. Defaults to ['cvrmse'].

['cvrmse']
exclude_simulated bool

Whether to exclude simulated data. Defaults to True.

True
only_simulated bool

Whether to only include simulated data. Defaults to False.

False
oov_list list

List of OOV buildings to exclude. Defaults to [].

[]
reps int

Number of bootstrap replicates to use. Defaults to 50000.

50000

Returns:

Name Type Description
result_dict Dict

Dictionary containing aggregate median and CIs for each metric and building type.
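As an illustration of what a stratified percentile bootstrap for a median looks like, consider the sketch below. It is not the library's implementation: the choice of strata, the percentile method, and the name stratified_bootstrap_median are all assumptions, and the values are made up.

```python
import random
import statistics


def stratified_bootstrap_median(strata, reps=2000, alpha=0.05, seed=0):
    """Median over all values, with a percentile CI from resampling within each stratum."""
    rng = random.Random(seed)
    pooled = [v for values in strata.values() for v in values]
    point = statistics.median(pooled)

    medians = []
    for _ in range(reps):
        sample = []
        for values in strata.values():
            # Resample with replacement, keeping each stratum's size fixed.
            sample.extend(rng.choices(values, k=len(values)))
        medians.append(statistics.median(sample))

    medians.sort()
    lo = medians[int((alpha / 2) * reps)]
    hi = medians[min(reps - 1, int((1 - alpha / 2) * reps))]
    return point, (lo, hi)


# Hypothetical per-building metric values grouped into two strata.
strata = {"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [10.0, 20.0, 30.0]}
point, (lo, hi) = stratified_bootstrap_median(strata)
print(point, lo, hi)
```

Resampling within each stratum keeps every group's share of the sample constant across replicates, which is the point of stratifying.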