buildings_bench.evaluation
The buildings_bench.evaluation module contains the main functionality for evaluating a model on the benchmark tasks.
The buildings_bench.evaluation.managers.DatasetMetricsManager class is the main entry point.
Simple usage
from buildings_bench import BuildingTypes
from buildings_bench.evaluation.managers import DatasetMetricsManager

# By default, the DatasetMetricsManager keeps track of NRMSE, NMAE, and NMBE
metrics_manager = DatasetMetricsManager()

# Iterate over the dataset using our building dataset generator
for building_name, building_dataset in buildings_datasets_generator:
    # Register a new building with the manager
    metrics_manager.add_building_to_dataset_if_missing(
        dataset_name, building_name,
    )

    # Your model makes predictions
    # ...

    # Register the predictions with the manager
    metrics_manager(
        dataset_name,        # the name of the dataset, e.g., electricity
        building_name,       # the name of the building, e.g., MT_001
        continuous_targets,  # the ground truth 24 hour targets
        predictions,         # the model's 24 hour predictions
        building_type=BuildingTypes.RESIDENTIAL_INT,  # an int indicating the building type
    )
Advanced usage (with scoring rule)
from buildings_bench.evaluation.managers import DatasetMetricsManager
from buildings_bench.evaluation import scoring_rule_factory

metrics_manager = DatasetMetricsManager(scoring_rule=scoring_rule_factory('crps'))

# Iterate over the dataset
for building_name, building_dataset in buildings_datasets_generator:
    # Register a new building with the manager
    metrics_manager.add_building_to_dataset_if_missing(
        dataset_name, building_name,
    )

    # Your model makes predictions
    # ...

    # Register the predictions with the manager
    metrics_manager(
        dataset_name,         # the name of the dataset, e.g., electricity
        building_name,        # the name of the building, e.g., MT_001
        continuous_targets,   # the ground truth 24 hour targets
        predictions,          # the model's 24 hour predictions
        building_types_mask,  # a boolean tensor indicating building type
        y_categories=targets,  # for scoring rules, the ground truth (discrete categories if using tokenization)
        y_distribution_params=distribution_params,  # for scoring rules, the distribution parameters
        centroids=centroids,  # for scoring rules with categorical variables, the centroid values
    )
metrics_factory
buildings_bench.evaluation.metrics_factory(name: str, types: List[MetricType] = [MetricType.SCALAR]) -> List[Metric]
Create a metric from a name. By default, returns a scalar metric.
Parameters:
Name  Type  Description  Default
name  str  The name of the metric.  required
types  List[MetricType]  The types of the metric.  [MetricType.SCALAR]
Returns:
Name  Type  Description
metrics_list  List[Metric]  A list of metrics.
scoring_rule_factory
buildings_bench.evaluation.scoring_rule_factory(name: str) -> ScoringRule
Create a scoring rule from a name.
Parameters:
Name  Type  Description  Default
name  str  The name of the scoring rule.  required
Returns:
Name  Type  Description
sr  ScoringRule  A scoring rule.
all_metrics_list
buildings_bench.evaluation.all_metrics_list() -> List[Metric]
Returns all registered metrics.
Returns:
Name  Type  Description
metrics_list  List[Metric]  A list of metrics.
BuildingTypes
buildings_bench.evaluation.managers.BuildingTypes
Enum for supported types of buildings.
Attributes:
Name  Type  Description
RESIDENTIAL  str  Residential building type.
COMMERCIAL  str  Commercial building type.
RESIDENTIAL_INT  int  Integer representation of residential building type (0).
COMMERCIAL_INT  int  Integer representation of commercial building type (1).
DatasetMetricsManager
buildings_bench.evaluation.managers.DatasetMetricsManager
A class that manages a MetricsManager for each building in one or more benchmark datasets. One DatasetMetricsManager can be used to keep track of all metrics when evaluating a model on all of the benchmark's datasets.
This class will create a pandas DataFrame summary containing the metrics for each building.
Default metrics are NRMSE (CVRMSE), NMAE, NMBE.
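The bookkeeping described above (one tracker per building, grouped by dataset, summarized one row per building) can be sketched without torch or pandas. This is an illustration of the pattern only, not the library's implementation: PerBuildingRegistry, record, and the mean_error column are hypothetical names, and a plain list of floats stands in for a MetricsManager.

```python
from collections import defaultdict


class PerBuildingRegistry:
    """Hypothetical stand-in for a dataset -> building -> tracker mapping."""

    def __init__(self):
        # dataset name -> {building id -> list of recorded errors}
        self._datasets = defaultdict(dict)

    def add_if_missing(self, dataset_name, building_id):
        # Create the tracker only on first sight, so repeated calls are harmless.
        if building_id not in self._datasets[dataset_name]:
            self._datasets[dataset_name][building_id] = []

    def record(self, dataset_name, building_id, error):
        self._datasets[dataset_name][building_id].append(error)

    def summary(self, dataset_name=None):
        # One row per building; None means "summarize all datasets".
        names = [dataset_name] if dataset_name is not None else list(self._datasets)
        rows = []
        for name in names:
            for building_id, errors in self._datasets[name].items():
                mean = sum(errors) / len(errors) if errors else float("nan")
                rows.append(
                    {"dataset": name, "building_id": building_id, "mean_error": mean}
                )
        return rows


registry = PerBuildingRegistry()
for error in (1.0, 3.0):
    registry.add_if_missing("electricity", "MT_001")
    registry.record("electricity", "MT_001", error)
rows = registry.summary("electricity")
```

Registering inside the loop mirrors the usage pattern shown earlier on this page: the call is idempotent, so it is safe to repeat for every batch of the same building.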
__call__(dataset_name: str, building_id: str, y_true: torch.Tensor, y_pred: torch.Tensor, building_types_mask: torch.Tensor = None, building_type: int = BuildingTypes.COMMERCIAL_INT, **kwargs: int) -> None
Compute metrics for a batch of predictions for a single building in a dataset.
Parameters:
Name  Type  Description  Default
dataset_name  str  The name of the dataset.  required
building_id  str  The unique building identifier.  required
y_true  torch.Tensor  The true (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].  required
y_pred  torch.Tensor  The predicted (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].  required
building_types_mask  torch.Tensor  A boolean mask indicating the building type of each building. True (1) if commercial, False (0) if residential. Shape is [batch_size].  None
building_type  int  The building type of the batch. Can be provided instead of building_types_mask if all buildings are of the same type.  BuildingTypes.COMMERCIAL_INT
Other Parameters:
Name  Type  Description
y_categories  torch.Tensor  The true (quantized) load values.
y_distribution_params  torch.Tensor  The predicted distribution parameters (logits, Gaussian params, etc.).
centroids  torch.Tensor  The bin values for the quantized load.
loss  torch.Tensor  The loss for the batch.
summary(dataset_name: str = None) -> pd.DataFrame
Return a summary of the metrics for the dataset.
Parameters:
Name  Type  Description  Default
dataset_name  str  The name of the dataset to summarize. If None, summarize all datasets.  None
Returns:
Type  Description
pd.DataFrame  A Pandas dataframe with the following columns:

MetricsManager
buildings_bench.evaluation.managers.MetricsManager
A class that keeps track of all metrics (and a scoring rule) for one or more buildings.
Metrics are computed for each building type (residential and commercial).
Example:
from buildings_bench.evaluation.managers import MetricsManager
from buildings_bench.evaluation import metrics_factory
from buildings_bench import BuildingTypes
import torch

metrics_manager = MetricsManager(metrics=metrics_factory('cvrmse'))
metrics_manager(
    y_true=torch.FloatTensor([1, 2, 3]).view(1, 3, 1),
    y_pred=torch.FloatTensor([1, 2, 3]).view(1, 3, 1),
    building_type=BuildingTypes.RESIDENTIAL_INT
)
for metric in metrics_manager.metrics[BuildingTypes.RESIDENTIAL]:
    metric.mean()
    print(metric.value)  # prints tensor(0.)
__init__(metrics: List[Metric] = None, scoring_rule: ScoringRule = None)
Initializes the MetricsManager.
Parameters:
Name  Type  Description  Default
metrics  List[Metric]  A list of metrics to compute for each building type.  None
scoring_rule  ScoringRule  A scoring rule to compute for each building type.  None

get_ppl()
Returns the perplexity of the accumulated loss.
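For reference, perplexity is conventionally the exponential of the mean loss. The sketch below assumes each accumulated loss is a mean cross-entropy in nats and that batches are weighted equally. How get_ppl actually averages its losses is not stated here, so treat both points as assumptions. The function name perplexity is illustrative.

```python
import math


def perplexity(batch_losses):
    """exp of the mean loss; assumes mean cross-entropy in nats per batch."""
    if not batch_losses:
        raise ValueError("no losses accumulated")
    return math.exp(sum(batch_losses) / len(batch_losses))


# A model that is uniformly unsure among 4 tokens has cross-entropy log(4),
# which corresponds to a perplexity of about 4.
uniform_over_4 = math.log(4)
ppl = perplexity([uniform_over_4, uniform_over_4])
```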
summary(with_loss = False, with_ppl = False)
Return a summary of the metrics for the dataset.
A summary maps keys to objects of type Metric or ScoringRule.
reset(loss: bool = True) -> None
Reset the metrics.
__call__(y_true: torch.Tensor, y_pred: torch.Tensor, building_types_mask: torch.Tensor = None, building_type: int = BuildingTypes.COMMERCIAL_INT, **kwargs: int)
Compute metrics for a batch of predictions.
Parameters:
Name  Type  Description  Default
y_true  torch.Tensor  The true (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].  required
y_pred  torch.Tensor  The predicted (unscaled, continuous) load values. Shape is [batch_size, pred_len, 1].  required
building_types_mask  torch.Tensor  A boolean mask indicating the building type of each building. True (1) if commercial, False (0) if residential. Shape is [batch_size].  None
building_type  int  The building type of the batch. Can be provided instead of building_types_mask if all buildings are of the same type.  BuildingTypes.COMMERCIAL_INT
Other Parameters:
Name  Type  Description
y_categories  torch.Tensor  The true (quantized) load values.
y_distribution_params  torch.Tensor  The predicted distribution parameters (logits, Gaussian params, etc.).
centroids  torch.Tensor  The bin values for the quantized load.
loss  torch.Tensor  The loss for the batch.
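The relationship between building_types_mask and building_type can be illustrated with plain lists. This is a sketch of the idea only: split_by_building_type and mask_from_building_type are hypothetical helpers, and the library operates on torch tensors, not Python lists. The 0/1 encoding follows RESIDENTIAL_INT and COMMERCIAL_INT as documented above.

```python
def split_by_building_type(errors, commercial_mask):
    """Split per-sample errors into residential (False) and commercial (True)."""
    if len(errors) != len(commercial_mask):
        raise ValueError("mask must have one entry per sample in the batch")
    groups = {"residential": [], "commercial": []}
    for err, is_commercial in zip(errors, commercial_mask):
        groups["commercial" if is_commercial else "residential"].append(err)
    return groups


def mask_from_building_type(building_type, batch_size):
    """Expand one building-type int (0 residential, 1 commercial) to a mask."""
    return [building_type == 1] * batch_size


mixed = split_by_building_type([0.1, 0.2, 0.3], [True, False, True])
all_residential = mask_from_building_type(0, batch_size=2)
```

Passing a single building_type is then just the special case in which every entry of the mask is the same.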
MetricType
buildings_bench.evaluation.metrics.MetricType
Enum class for metric types.
Attributes:
Name  Type  Description
SCALAR  str  A scalar metric.
HOUR_OF_DAY  str  A metric that is calculated for each hour of the day.
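The difference between the two metric types comes down to which axes are averaged. The sketch below assumes errors arranged as [batch_size, pred_len] rows in which every column corresponds to the same hour of the day. That alignment is an assumption made for illustration, and the helper names are hypothetical.

```python
def scalar_mean(errors):
    """SCALAR-style aggregation: one number over all samples and steps."""
    flat = [e for row in errors for e in row]
    return sum(flat) / len(flat)


def hour_of_day_mean(errors):
    """HOUR_OF_DAY-style aggregation: one number per column (horizon step)."""
    n_rows = len(errors)
    return [sum(column) / n_rows for column in zip(*errors)]


# Two samples, three horizon steps each.
errors = [[1.0, 2.0, 3.0],
          [3.0, 4.0, 5.0]]
overall = scalar_mean(errors)
per_step = hour_of_day_mean(errors)
```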
BuildingsBenchMetric
buildings_bench.evaluation.metrics.BuildingsBenchMetric
An abstract class for all metrics.
The basic idea is to accumulate the errors in a list and then compute their mean at the end of the evaluation.
Calling the metric appends the error to the list of errors. Calling .mean() computes the mean of the accumulated errors and populates the .value attribute.
Attributes:
Name  Type  Description
name  str  The name of the metric.
type  MetricType  The type of the metric.
value  float  The value of the metric.
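The accumulate-then-average pattern can be sketched in a few lines of plain Python. RunningError is a hypothetical stand-in, not the library class: it hard-codes the absolute error and works on flat lists of floats, whereas the real metrics take tensors and a configurable error function.

```python
class RunningError:
    """Hypothetical minimal metric: collect errors, average on demand."""

    def __init__(self, name):
        self.name = name
        self.errors = []
        self.value = None  # populated by mean()

    def __call__(self, y_true, y_pred):
        # Store the element-wise absolute errors of this batch.
        self.errors.extend(abs(t - p) for t, p in zip(y_true, y_pred))

    def mean(self):
        if not self.errors:
            raise ValueError("no errors accumulated")
        self.value = sum(self.errors) / len(self.errors)

    def reset(self):
        self.errors = []
        self.value = None


mae = RunningError("mae")
mae([1.0, 2.0], [1.0, 4.0])  # first batch: errors 0 and 2
mae([3.0], [2.0])            # second batch: error 1
mae.mean()
```

Deferring the mean until the end means the result is an average over all elements seen, not an average of per-batch averages, which matters when batches differ in size.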
Metric
buildings_bench.evaluation.metrics.Metric
Bases: BuildingsBenchMetric
A class that represents an error metric.
Example:
rmse = Metric('rmse', MetricType.SCALAR, squared_error, sqrt=True)
mae = Metric('mae', MetricType.SCALAR, absolute_error)
nmae = Metric('nmae', MetricType.SCALAR, absolute_error, normalize=True)
cvrmse = Metric('cvrmse', MetricType.SCALAR, squared_error, normalize=True, sqrt=True)
nmbe = Metric('nmbe', MetricType.SCALAR, bias_error, normalize=True)
__init__(name: str, type: MetricType, function: Callable, **kwargs: Callable)
Parameters:
Name  Type  Description  Default
name  str  The name of the metric.  required
type  MetricType  The type of the metric.  required
function  Callable  A function that takes two tensors and returns a tensor.  required
Other Parameters:
Name  Type  Description
normalize  bool  Whether to normalize the error.
sqrt  bool  Whether to take the square root of the error.
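To make the effect of the normalize and sqrt flags concrete, here is a plain-Python sketch of how RMSE, CVRMSE, and NMAE fall out of one reduction step. The order of operations (mean, then optional square root, then optional division by the mean of the targets) is an assumption about the usual definitions, not a statement about the library's internals. finalize is a hypothetical helper.

```python
import math


def finalize(errors, targets, normalize=False, sqrt=False):
    """Reduce accumulated element-wise errors to a single value.

    Assumed order: mean, then optional square root, then optional
    division by the mean of the targets.
    """
    value = sum(errors) / len(errors)
    if sqrt:
        value = math.sqrt(value)
    if normalize:
        value = value / (sum(targets) / len(targets))
    return value


y_true = [2.0, 4.0, 6.0]
y_pred = [1.0, 4.0, 8.0]
squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]

rmse = finalize(squared, y_true, sqrt=True)
cvrmse = finalize(squared, y_true, normalize=True, sqrt=True)
nmae = finalize(absolute, y_true, normalize=True)
```

With these inputs the mean target is 4, so CVRMSE is RMSE divided by 4 and NMAE is the mean absolute error (1) divided by 4.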
__call__(y_true, y_pred) -> None
Parameters:
Name  Type  Description  Default
y_true  torch.Tensor  shape [batch_size, pred_len]  required
y_pred  torch.Tensor  shape [batch_size, pred_len]  required
reset() -> None
Reset the metric.
mean() -> None
Calculate the mean of the error metric.
absolute_error
buildings_bench.evaluation.metrics.absolute_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor
A PyTorch function that calculates the absolute error (AE) metric.
Parameters:
Name  Type  Description  Default
y_true  torch.Tensor  [batch, pred_len]  required
y_pred  torch.Tensor  [batch, pred_len]  required
Returns:
Name  Type  Description
error  torch.Tensor  [batch, pred_len]
squared_error
buildings_bench.evaluation.metrics.squared_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor
A PyTorch function that calculates the squared error (SE) metric.
Parameters:
Name  Type  Description  Default
y_true  torch.Tensor  [batch, pred_len]  required
y_pred  torch.Tensor  [batch, pred_len]  required
Returns:
Name  Type  Description
error  torch.Tensor  [batch, pred_len]
bias_error
buildings_bench.evaluation.metrics.bias_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor
A PyTorch function that calculates the bias error (BE) metric.
Parameters:
Name  Type  Description  Default
y_true  torch.Tensor  [batch, pred_len]  required
y_pred  torch.Tensor  [batch, pred_len]  required
Returns:
Name  Type  Description
error  torch.Tensor  [batch, pred_len]
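All three error functions are element-wise and preserve the [batch, pred_len] shape. The plain-Python sketch below shows that shape-preserving behavior on nested lists. The _sketch names are hypothetical, and the sign convention for the bias error (y_true minus y_pred) is an assumption, since this page does not state it.

```python
def _elementwise(fn, y_true, y_pred):
    # Apply fn to matching entries of two [batch, pred_len] nested lists.
    return [[fn(t, p) for t, p in zip(row_t, row_p)]
            for row_t, row_p in zip(y_true, y_pred)]


def absolute_error_sketch(y_true, y_pred):
    return _elementwise(lambda t, p: abs(t - p), y_true, y_pred)


def squared_error_sketch(y_true, y_pred):
    return _elementwise(lambda t, p: (t - p) ** 2, y_true, y_pred)


def bias_error_sketch(y_true, y_pred):
    # Assumed sign convention: positive means the model under-predicts.
    return _elementwise(lambda t, p: t - p, y_true, y_pred)


y_true = [[1.0, 2.0]]
y_pred = [[2.0, 0.0]]
ae = absolute_error_sketch(y_true, y_pred)
se = squared_error_sketch(y_true, y_pred)
be = bias_error_sketch(y_true, y_pred)
```

Unlike the absolute and squared errors, bias errors keep their sign, so over- and under-predictions can cancel when averaged.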
ScoringRule
buildings_bench.evaluation.scoring_rules.ScoringRule
RankedProbabilityScore
buildings_bench.evaluation.scoring_rules.RankedProbabilityScore
Bases: ScoringRule
A class that calculates the ranked probability score (RPS) metric for categorical distributions.
rps(y_true, y_pred_logits, centroids) -> None
A PyTorch method that calculates the ranked probability score metric for categorical distributions.
Because the bin values are centroids of clusters along the real line, the width of each bin is computed as the sum of the distances to its left and right neighboring centroids, divided by 2. The first and last bins have only one neighbor, so they use only the distance to the right centroid and to the left centroid, respectively.
Parameters:
Name  Type  Description  Default
y_true  torch.Tensor  Categorical labels of shape [batch_size, seq_len, 1].  required
y_pred_logits  torch.Tensor  Logits of shape [batch_size, seq_len, vocab_size].  required
centroids  torch.Tensor  Bin centroids of shape [vocab_size].  required
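The bin-width rule described for rps can be sketched on a plain list of sorted centroids. This follows one reading of that description: interior bins get half the sum of their two neighboring gaps, and the edge bins get their single neighboring gap in full. Whether the library halves the edge gaps is not stated, so treat that detail as an assumption. bin_widths is a hypothetical helper.

```python
def bin_widths(centroids):
    """Widths for bins whose representative values are sorted centroids."""
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    # Distances between consecutive centroids.
    gaps = [b - a for a, b in zip(centroids, centroids[1:])]
    widths = [gaps[0]]                       # first bin: right neighbor only
    for left, right in zip(gaps, gaps[1:]):  # interior bins: average both gaps
        widths.append((left + right) / 2)
    widths.append(gaps[-1])                  # last bin: left neighbor only
    return widths


widths = bin_widths([0.0, 1.0, 3.0, 6.0])
```

For the centroids 0, 1, 3, 6 the gaps are 1, 2, 3, so the interior widths are 1.5 and 2.5 and the edge widths are 1 and 3.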
ContinuousRankedProbabilityScore
buildings_bench.evaluation.scoring_rules.ContinuousRankedProbabilityScore
Bases: ScoringRule
A class that calculates the Gaussian continuous ranked probability score (CRPS) metric.
crps(true_continuous, y_pred_distribution_params) -> None
Computes the Gaussian CRPS.
Parameters:
Name  Type  Description  Default
true_continuous  torch.Tensor  of shape [batch_size, seq_len, 1]  required
y_pred_distribution_params  torch.Tensor  of shape [batch_size, seq_len, 2]  required
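For a Gaussian forecast the CRPS has a well-known closed form: with z = (y - mu) / sigma, CRPS = sigma * (z * (2 * Phi(z) - 1) + 2 * phi(z) - 1 / sqrt(pi)), where Phi and phi are the standard normal CDF and PDF. The scalar sketch below implements that formula with the standard library. It assumes the two distribution parameters are the mean and the standard deviation. Their order and parameterization in y_pred_distribution_params are not stated on this page.

```python
import math


def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) forecast for an observation y."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # standard normal PDF
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))           # standard normal CDF
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))


on_target = gaussian_crps(0.0, mu=0.0, sigma=1.0)
off_target = gaussian_crps(3.0, mu=0.0, sigma=1.0)
```

Lower is better: the score grows as the observation moves away from the forecast mean, and for a perfectly centered forecast it scales linearly with sigma, so needlessly wide forecasts are penalized.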
aggregate
buildings_bench.evaluation.aggregate.return_aggregate_median(model_list, results_dir, experiment = 'zero_shot', metrics = ['cvrmse'], exclude_simulated = True, only_simulated = False, oov_list = [], reps = 50000)
Compute the aggregate median for a list of models and metrics over all buildings. Also returns the stratified 95% bootstrap CIs for the aggregate median.
Parameters:
Name  Type  Description  Default
model_list  list  List of models to compute aggregate median for.  required
results_dir  str  Path to directory containing results.  required
experiment  str  Experiment type. Options: 'zero_shot', 'transfer_learning'.  'zero_shot'
metrics  list  List of metrics to compute aggregate median for.  ['cvrmse']
exclude_simulated  bool  Whether to exclude simulated data.  True
only_simulated  bool  Whether to only include simulated data.  False
oov_list  list  List of OOV buildings to exclude.  []
reps  int  Number of bootstrap replicates to use.  50000
Returns:
Name  Type  Description
result_dict  Dict  Dictionary containing aggregate median and CIs for each metric and building type.
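A stratified percentile bootstrap for a median can be sketched with the standard library alone. This is not the implementation behind return_aggregate_median: the function name, the choice of strata (here, one list of per-building values per hypothetical dataset name), and the percentile indexing are all assumptions made for illustration, and reps is kept small so the example runs quickly.

```python
import random
import statistics


def stratified_bootstrap_median(strata, reps=1000, alpha=0.05, seed=0):
    """Median over all values plus a percentile bootstrap CI.

    `strata` maps a stratum name to its per-building metric values. Each
    replicate resamples with replacement inside every stratum, so the
    stratum sizes are preserved.
    """
    rng = random.Random(seed)
    pooled = [v for values in strata.values() for v in values]
    point = statistics.median(pooled)
    medians = []
    for _ in range(reps):
        sample = []
        for values in strata.values():
            sample.extend(rng.choices(values, k=len(values)))
        medians.append(statistics.median(sample))
    medians.sort()
    lo = medians[int((alpha / 2) * (reps - 1))]
    hi = medians[int((1 - alpha / 2) * (reps - 1))]
    return point, (lo, hi)


cvrmse_by_dataset = {
    "dataset_a": [0.1, 0.2, 0.3, 0.4],  # hypothetical per-building CVRMSE values
    "dataset_b": [0.5, 0.6, 0.7],
}
point, (lo, hi) = stratified_bootstrap_median(cvrmse_by_dataset, reps=200)
```

Resampling within each stratum keeps every replicate's composition fixed, so the interval reflects variation across buildings and not variation in how many buildings each stratum happens to contribute.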