buildings_bench.evaluation
The buildings_bench.evaluation module contains the main functionality for evaluating a model on the benchmark tasks.
The buildings_bench.evaluation.managers.DatasetMetricsManager class is the main entry point.
Simple usage

```python
from buildings_bench import BuildingTypes
from buildings_bench.evaluation.managers import DatasetMetricsManager

# By default, the DatasetMetricsManager keeps track of NRMSE, NMAE, and NMBE
metrics_manager = DatasetMetricsManager()

# Iterate over the dataset using our building dataset generator
for building_name, building_dataset in buildings_datasets_generator:
    # Register a new building with the manager
    metrics_manager.add_building_to_dataset_if_missing(
        dataset_name, building_name,
    )

    # Your model makes predictions
    # ...

    # Register the predictions with the manager
    metrics_manager(
        dataset_name,        # the name of the dataset, e.g., electricity
        building_name,       # the name of the building, e.g., MT_001
        continuous_targets,  # the ground-truth 24-hour targets
        predictions,         # the model's 24-hour predictions
        building_type=BuildingTypes.RESIDENTIAL_INT,  # an int indicating the building type
    )
```
Advanced usage (with scoring rule)

```python
from buildings_bench.evaluation.managers import DatasetMetricsManager
from buildings_bench.evaluation import scoring_rule_factory

metrics_manager = DatasetMetricsManager(scoring_rule=scoring_rule_factory('crps'))

# Iterate over the dataset
for building_name, building_dataset in buildings_datasets_generator:
    # Register a new building with the manager
    metrics_manager.add_building_to_dataset_if_missing(
        dataset_name, building_name,
    )

    # Your model makes predictions
    # ...

    # Register the predictions with the manager
    metrics_manager(
        dataset_name,         # the name of the dataset, e.g., electricity
        building_name,        # the name of the building, e.g., MT_001
        continuous_targets,   # the ground-truth 24-hour targets
        predictions,          # the model's 24-hour predictions
        building_types_mask,  # a boolean tensor indicating building type
        y_categories=targets,  # for scoring rules, the ground truth (discrete categories if using tokenization)
        y_distribution_params=distribution_params,  # for scoring rules, the distribution parameters
        centroids=centroids,   # for scoring rules with categorical variables, the centroid values
    )
```
metrics_factory
buildings_bench.evaluation.metrics_factory(name, types=[MetricType.SCALAR])
Create a metric from a name. By default, this returns a scalar metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The name of the metric. | *required* |
| `types` | `List[MetricType]` | The types of the metric. | `[MetricType.SCALAR]` |
Returns: metrics_list (List[Metric]): A list of metrics.
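For example, a minimal sketch of creating metrics by name (using 'cvrmse', a registered name that appears elsewhere on this page; requesting both metric types is illustrative):

```python
from buildings_bench.evaluation import metrics_factory
from buildings_bench.evaluation.metrics import MetricType

# Request both a scalar and an hour-of-day variant of CVRMSE.
metrics = metrics_factory('cvrmse', types=[MetricType.SCALAR, MetricType.HOUR_OF_DAY])
for metric in metrics:
    print(metric.name, metric.type)
```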
scoring_rule_factory
buildings_bench.evaluation.scoring_rule_factory(name)
Create a scoring rule from a name.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The name of the scoring rule. | *required* |
Returns: sr (ScoringRule): A scoring rule.
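A minimal sketch ('crps' is the name used in the advanced usage example above; that the categorical rule below is registered as 'rps' is an assumption):

```python
from buildings_bench.evaluation import scoring_rule_factory

crps = scoring_rule_factory('crps')  # Gaussian continuous ranked probability score
rps = scoring_rule_factory('rps')    # assumed name for the categorical ranked probability score
```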
all_metrics_list
buildings_bench.evaluation.all_metrics_list()
Returns all registered metrics.
Returns:

| Name | Type | Description |
|---|---|---|
| `metrics_list` | `List[Metric]` | A list of metrics. |
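A quick way to inspect what is registered (the exact contents depend on the installed version):

```python
from buildings_bench.evaluation import all_metrics_list

for metric in all_metrics_list():
    print(metric.name, metric.type)
```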
BuildingTypes
buildings_bench.evaluation.managers.BuildingTypes
Enum for supported types of buildings.
Attributes:

| Name | Type | Description |
|---|---|---|
| `RESIDENTIAL` | `str` | Residential building type. |
| `COMMERCIAL` | `str` | Commercial building type. |
| `RESIDENTIAL_INT` | `int` | Integer representation of residential building type (0). |
| `COMMERCIAL_INT` | `int` | Integer representation of commercial building type (1). |
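As a quick illustration of how the integer codes line up with the boolean masks used by the managers below (a sketch; the example mask values are illustrative):

```python
import torch
from buildings_bench import BuildingTypes

# A batch of 4 buildings: the first two commercial, the last two residential.
# The mask convention matches the integer codes: True (1) == commercial.
building_types_mask = torch.tensor([True, True, False, False])

assert BuildingTypes.COMMERCIAL_INT == 1 and BuildingTypes.RESIDENTIAL_INT == 0
```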
DatasetMetricsManager
buildings_bench.evaluation.managers.DatasetMetricsManager
A class that manages a MetricsManager for each building in one or more benchmark datasets. A single DatasetMetricsManager can keep track of all metrics when evaluating a model on all of the benchmark's datasets.
This class will create a pandas DataFrame summary containing the metrics for each building.
The default metrics are NRMSE (CVRMSE), NMAE, and NMBE.
__call__(dataset_name, building_id, y_true, y_pred, building_types_mask=None, building_type=BuildingTypes.COMMERCIAL_INT, **kwargs)
Compute metrics for a batch of predictions for a single building in a dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `building_id` | `str` | The unique building identifier. | *required* |
| `y_true` | `Tensor` | The true (unscaled) load values (continuous). Shape is [batch_size, pred_len, 1]. | *required* |
| `y_pred` | `Tensor` | The predicted (unscaled) load values (continuous). Shape is [batch_size, pred_len, 1]. | *required* |
| `building_types_mask` | `Tensor` | A boolean mask indicating the building type of each building. True (1) if commercial, False (0). Shape is [batch_size]. | `None` |
| `building_type` | `int` | The building type of the batch. Can be provided instead of building_types_mask if all buildings are of the same type. | `BuildingTypes.COMMERCIAL_INT` |
Other Parameters:

| Name | Type | Description |
|---|---|---|
| `y_categories` | `Tensor` | The true (quantized) load values. |
| `y_distribution_params` | `Tensor` | Logits, Gaussian parameters, etc. |
| `centroids` | `Tensor` | The bin values for the quantized load. |
| `loss` | `Tensor` | The loss for the batch. |
__init__(metrics=default_metrics, scoring_rule=None)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `List[Metric]` | A list of metrics to compute for each building type. | `default_metrics` |
| `scoring_rule` | `ScoringRule` | A scoring rule to compute for each building type. | `None` |
add_building_to_dataset_if_missing(dataset_name, building_id)
If the building does not exist, add a new MetricsManager for the building.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `building_id` | `str` | The unique building identifier. | *required* |
get_building_from_dataset(dataset_name, building_id)
If the dataset and building exist, return the MetricsManager for the building.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | The name of the dataset. | *required* |
| `building_id` | `str` | The unique building identifier. | *required* |

Returns:

| Type | Description |
|---|---|
| `Optional[MetricsManager]` | A MetricsManager if the dataset and building exist, otherwise None. |
summary(dataset_name=None)
Return a summary of the metrics for the dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | The name of the dataset to summarize. If None, summarize all datasets. | `None` |
Returns: A pandas DataFrame with the following columns:
- dataset: The name of the dataset.
- building_id: The unique ID of the building.
- building_type: The type of the building.
- metric: The name of the metric.
- metric_type: The type of the metric. (scalar or hour_of_day)
- value: The value of the metric.
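A minimal sketch of working with the summary after evaluation (assuming the default CVRMSE metric is registered under the name 'cvrmse', as elsewhere on this page):

```python
df = metrics_manager.summary()                            # all datasets
df_electricity = metrics_manager.summary('electricity')   # a single dataset

# e.g., the median CVRMSE across all buildings
print(df[df.metric == 'cvrmse'].value.median())
```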
MetricsManager
buildings_bench.evaluation.managers.MetricsManager
A class that keeps track of all metrics (and a scoring rule) for one or more buildings.
Metrics are computed for each building type (residential and commercial).
Example:

```python
from buildings_bench.evaluation.managers import MetricsManager
from buildings_bench.evaluation import metrics_factory
from buildings_bench import BuildingTypes
import torch

metrics_manager = MetricsManager(metrics=metrics_factory('cvrmse'))

metrics_manager(
    y_true=torch.FloatTensor([1, 2, 3]).view(1, 3, 1),
    y_pred=torch.FloatTensor([1, 2, 3]).view(1, 3, 1),
    building_type=BuildingTypes.RESIDENTIAL_INT,
)

for metric in metrics_manager.metrics[BuildingTypes.RESIDENTIAL]:
    metric.mean()
    print(metric.value)  # prints tensor(0.)
```
__call__(y_true, y_pred, building_types_mask=None, building_type=BuildingTypes.COMMERCIAL_INT, **kwargs)
Compute metrics for a batch of predictions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Tensor` | The true (unscaled) load values (continuous). Shape is [batch_size, pred_len, 1]. | *required* |
| `y_pred` | `Tensor` | The predicted (unscaled) load values (continuous). Shape is [batch_size, pred_len, 1]. | *required* |
| `building_types_mask` | `Tensor` | A boolean mask indicating the building type of each building. True (1) if commercial, False (0). Shape is [batch_size]. | `None` |
| `building_type` | `int` | The building type of the batch. Can be provided instead of building_types_mask if all buildings are of the same type. | `BuildingTypes.COMMERCIAL_INT` |
Other Parameters:

| Name | Type | Description |
|---|---|---|
| `y_categories` | `Tensor` | The true (quantized) load values. |
| `y_distribution_params` | `Tensor` | Logits, Gaussian parameters, etc. |
| `centroids` | `Tensor` | The bin values for the quantized load. |
| `loss` | `Tensor` | The loss for the batch. |
__init__(metrics=None, scoring_rule=None)
Initializes the MetricsManager.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `metrics` | `List[Metric]` | A list of metrics to compute for each building type. | `None` |
| `scoring_rule` | `ScoringRule` | A scoring rule to compute for each building type. | `None` |
get_ppl()
Returns the perplexity of the accumulated loss.
reset(loss=True)
Reset the metrics. If loss is True, also reset the accumulated loss.
summary(with_loss=False, with_ppl=False)
Return a summary of the metrics for the dataset.
A summary maps keys to objects of type Metric or ScoringRule.
MetricType
buildings_bench.evaluation.metrics.MetricType
Enum class for metric types.
Attributes:

| Name | Type | Description |
|---|---|---|
| `SCALAR` | `str` | A scalar metric. |
| `HOUR_OF_DAY` | `str` | A metric that is calculated for each hour of the day. |
BuildingsBenchMetric
buildings_bench.evaluation.metrics.BuildingsBenchMetric
An abstract class for all metrics.
The basic idea is to accumulate the errors in a list and then calculate the mean of the errors at the end of the evaluation.
Calling the metric adds the error to the list of errors. Calling .mean() calculates the mean of the errors, populating the .value attribute.
Attributes:

| Name | Type | Description |
|---|---|---|
| `name` | `str` | The name of the metric. |
| `type` | `MetricType` | The type of the metric. |
| `value` | `float` | The value of the metric. |
Metric
buildings_bench.evaluation.metrics.Metric
Bases: BuildingsBenchMetric
A class that represents an error metric.
Example:

```python
rmse = Metric('rmse', MetricType.SCALAR, squared_error, sqrt=True)
mae = Metric('mae', MetricType.SCALAR, absolute_error)
nmae = Metric('nmae', MetricType.SCALAR, absolute_error, normalize=True)
cvrmse = Metric('cvrmse', MetricType.SCALAR, squared_error, normalize=True, sqrt=True)
nmbe = Metric('nmbe', MetricType.SCALAR, bias_error, normalize=True)
```
__call__(y_true, y_pred)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Tensor` | Shape [batch_size, pred_len]. | *required* |
| `y_pred` | `Tensor` | Shape [batch_size, pred_len]. | *required* |
__init__(name, type, function, **kwargs)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | The name of the metric. | *required* |
| `type` | `MetricType` | The type of the metric. | *required* |
| `function` | `Callable` | A function that takes two tensors and returns a tensor. | *required* |

Other Parameters:

| Name | Type | Description |
|---|---|---|
| `normalize` | `bool` | Whether to normalize the error. |
| `sqrt` | `bool` | Whether to take the square root of the error. |
mean()
Calculate the mean of the error metric.
reset()
Reset the metric.
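Putting the pieces together, a minimal sketch of the accumulate / mean / reset lifecycle (the tensor values are illustrative):

```python
import torch
from buildings_bench.evaluation.metrics import Metric, MetricType, absolute_error

mae = Metric('mae', MetricType.SCALAR, absolute_error)

# Accumulate errors over two batches of shape [batch_size, pred_len].
mae(torch.ones(2, 24), torch.zeros(2, 24))  # all absolute errors are 1
mae(torch.ones(2, 24), torch.ones(2, 24))   # all absolute errors are 0

mae.mean()        # populates .value with the mean of the accumulated errors
print(mae.value)  # tensor(0.5000), given the equal-sized batches above
mae.reset()       # clear the accumulated errors before the next building
```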
absolute_error
buildings_bench.evaluation.metrics.absolute_error(y_true, y_pred)
A PyTorch method that calculates the absolute error (AE) metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Tensor` | [batch, pred_len] | *required* |
| `y_pred` | `Tensor` | [batch, pred_len] | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `error` | `Tensor` | [batch, pred_len] |
squared_error
buildings_bench.evaluation.metrics.squared_error(y_true, y_pred)
A PyTorch method that calculates the squared error (SE) metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Tensor` | [batch, pred_len] | *required* |
| `y_pred` | `Tensor` | [batch, pred_len] | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `error` | `Tensor` | [batch, pred_len] |
bias_error
buildings_bench.evaluation.metrics.bias_error(y_true, y_pred)
A PyTorch method that calculates the bias error (BE) metric.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Tensor` | [batch, pred_len] | *required* |
| `y_pred` | `Tensor` | [batch, pred_len] | *required* |

Returns:

| Name | Type | Description |
|---|---|---|
| `error` | `Tensor` | [batch, pred_len] |
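For reference, minimal elementwise sketches consistent with the shapes documented above (not the library's source; the sign convention of the bias in particular is an assumption):

```python
import torch

def absolute_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    return torch.abs(y_pred - y_true)

def squared_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    return (y_pred - y_true) ** 2

def bias_error(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    # Assumed convention: prediction minus ground truth.
    return y_pred - y_true
```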
ScoringRule
buildings_bench.evaluation.scoring_rules.ScoringRule
The base class for the scoring rules below.
RankedProbabilityScore
buildings_bench.evaluation.scoring_rules.RankedProbabilityScore
Bases: ScoringRule
A class that calculates the ranked probability score (RPS) metric for categorical distributions.
rps(y_true, y_pred_logits, centroids)
A PyTorch method that calculates the ranked probability score metric for categorical distributions.
Since the bin values are centroids of clusters along the real line, the width of each bin is computed by summing the halved distances to the bin's left and right neighboring centroids. The first and last bins are the exception: their widths use only the distance to the right neighbor of the first centroid and to the left neighbor of the last centroid, respectively.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Tensor` | Categorical labels, of shape [batch_size, seq_len, 1]. | *required* |
| `y_pred_logits` | `Tensor` | Logits, of shape [batch_size, seq_len, vocab_size]. | *required* |
| `centroids` | `Tensor` | Of shape [vocab_size]. | *required* |
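A sketch of the computation described above (not the library's source): the RPS is a bin-width-weighted squared difference between the predicted and true CDFs, with the edge-bin widths handled as an assumption:

```python
import torch
import torch.nn.functional as F

def rps_sketch(y_true: torch.Tensor, y_pred_logits: torch.Tensor, centroids: torch.Tensor):
    """y_true: [B, T, 1] integer labels; y_pred_logits: [B, T, V]; centroids: [V], sorted."""
    V = centroids.shape[0]
    # Interior bin widths: half the gap to each neighboring centroid.
    # Edge bins use only their single neighboring gap (halved here as an assumption).
    gaps = centroids[1:] - centroids[:-1]                      # [V-1]
    widths = torch.empty(V)
    widths[0], widths[-1] = gaps[0] / 2, gaps[-1] / 2
    widths[1:-1] = (gaps[:-1] + gaps[1:]) / 2
    # Predicted and true cumulative distributions over the ordered bins.
    cdf_pred = torch.cumsum(F.softmax(y_pred_logits, dim=-1), dim=-1)         # [B, T, V]
    cdf_true = torch.cumsum(F.one_hot(y_true.squeeze(-1), V).float(), dim=-1)
    # Width-weighted squared CDF difference, summed over bins, averaged over the batch.
    return (((cdf_pred - cdf_true) ** 2) * widths).sum(dim=-1).mean()
```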
ContinuousRankedProbabilityScore
buildings_bench.evaluation.scoring_rules.ContinuousRankedProbabilityScore
Bases: ScoringRule
A class that calculates the Gaussian continuous ranked probability score (CRPS) metric.
crps(true_continuous, y_pred_distribution_params)
Computes the Gaussian CRPS.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `true_continuous` | `Tensor` | Of shape [batch_size, seq_len, 1]. | *required* |
| `y_pred_distribution_params` | `Tensor` | Of shape [batch_size, seq_len, 2]. | *required* |
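For reference, the Gaussian CRPS has a well-known closed form, CRPS(N(μ, σ²), y) = σ[z(2Φ(z) − 1) + 2φ(z) − 1/√π] with z = (y − μ)/σ. A sketch (the [mean, stddev] layout of the last dimension is an assumption):

```python
import math
import torch

def gaussian_crps_sketch(true_continuous: torch.Tensor, y_pred_distribution_params: torch.Tensor):
    """true_continuous: [B, T, 1]; y_pred_distribution_params: [B, T, 2] as (mean, stddev)."""
    mu = y_pred_distribution_params[..., 0:1]
    sigma = y_pred_distribution_params[..., 1:2]
    z = (true_continuous - mu) / sigma
    std_normal = torch.distributions.Normal(0.0, 1.0)
    pdf = std_normal.log_prob(z).exp()   # standard normal density φ(z)
    cdf = std_normal.cdf(z)              # standard normal CDF Φ(z)
    crps = sigma * (z * (2 * cdf - 1) + 2 * pdf - 1.0 / math.sqrt(math.pi))
    return crps.mean()
```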
aggregate
buildings_bench.evaluation.aggregate.return_aggregate_median(model_list, results_dir, experiment='zero_shot', metrics=['cvrmse'], exclude_simulated=True, only_simulated=False, oov_list=[], reps=50000)
Compute the aggregate median for a list of models and metrics over all buildings. Also returns the stratified 95% bootstrap CIs for the aggregate median.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `model_list` | `list` | List of models to compute the aggregate median for. | *required* |
| `results_dir` | `str` | Path to directory containing results. | *required* |
| `experiment` | `str` | Experiment type. Options: 'zero_shot', 'transfer_learning'. | `'zero_shot'` |
| `metrics` | `list` | List of metrics to compute the aggregate median for. | `['cvrmse']` |
| `exclude_simulated` | `bool` | Whether to exclude simulated data. | `True` |
| `only_simulated` | `bool` | Whether to only include simulated data. | `False` |
| `oov_list` | `list` | List of OOV buildings to exclude. | `[]` |
| `reps` | `int` | Number of bootstrap replicates to use. | `50000` |

Returns:

| Name | Type | Description |
|---|---|---|
| `result_dict` | `Dict` | Dictionary containing the aggregate median and CIs for each metric and building type. |
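A minimal usage sketch (the model name and results path are placeholders):

```python
from buildings_bench.evaluation.aggregate import return_aggregate_median

result_dict = return_aggregate_median(
    model_list=['my_model'],   # placeholder model name
    results_dir='results/',    # placeholder path to saved results
    experiment='zero_shot',
    metrics=['cvrmse', 'nmae'],
)
```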