buildings_bench.data

Functions and class definitions for loading Torch and Pandas datasets.

Main entry points for loading PyTorch and Pandas datasets:

load_pretraining() (used for pretraining)
load_torch_dataset() (used for benchmark tasks)
load_pandas_dataset() (used for benchmark tasks)

Available PyTorch Datasets:

Buildings900K (used for pretraining)
TorchBuildingDataset (used for benchmark tasks)
PandasTransformerDataset (used for benchmark tasks)

load_pretraining

`buildings_bench.data.load_pretraining(name, num_buildings_ablation=-1, apply_scaler_transform='', scaler_transform_path=None, weather_inputs=None, custom_idx_filename='', context_len=168, pred_len=24)`

Pre-training datasets: buildings-900k-train, buildings-900k-val

Parameters:

Name	Type	Description	Default
`name`	`str`	Name of the dataset to load.	required
`num_buildings_ablation`	`int`	Number of buildings to use for pre-training. If -1, use all buildings.	`-1`
`apply_scaler_transform`	`str`	If not using quantized load or unscaled loads, applies a {boxcox,standard} scaling transform to the load. Default: ''.	`''`
`scaler_transform_path`	`Path`	Path to data for transform, e.g., pickled data for BoxCox transform.	`None`
`weather_inputs`	`List[str]`	list of weather feature names to use as additional inputs. Default: None.	`None`
`custom_idx_filename`	`str`	customized index filename. Default: ''	`''`
`context_len`	`int`	Length of the context. Defaults to 168.	`168`
`pred_len`	`int`	Length of the prediction horizon. Defaults to 24.	`24`

Returns:

Type	Description
`Dataset`	torch.utils.data.Dataset: Dataset for pretraining.
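The `apply_scaler_transform` options name two common transforms. As a rough sketch of the arithmetic only, not the library's implementation, the snippet below applies a standard (z-score) scaling and a Box-Cox power transform to a made-up list of loads. The function names and the `lam` parameter are hypothetical. In practice the statistics would be fit on training data and loaded from `scaler_transform_path`, and the library may apply further normalization.

```python
import math
from statistics import mean, pstdev


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def boxcox(values, lam):
    """Box-Cox power transform; inputs must be strictly positive."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


loads = [1.0, 2.0, 4.0, 8.0]  # made-up hourly loads, for illustration only
scaled = standard_scale(loads)
transformed = boxcox(loads, lam=0.0)  # lam = 0 reduces to the natural log
```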

load_torch_dataset

`buildings_bench.data.load_torch_dataset(name, dataset_path=None, apply_scaler_transform='', scaler_transform_path=None, weather_inputs=None, include_outliers=False, context_len=168, pred_len=24)`

Load datasets by name.

Parameters:

Name	Type	Description	Default
`name`	`str`	Name of the dataset to load.	required
`dataset_path`	`Path`	Path to the benchmark data. Optional.	`None`
`apply_scaler_transform`	`str`	If not using quantized load or unscaled loads, applies a {boxcox,standard} scaling transform to the load. Default: ''.	`''`
`scaler_transform_path`	`Path`	Path to data for transform, e.g., pickled data for BoxCox transform.	`None`
`weather_inputs`	`List[str]`	list of weather feature names to use as additional inputs. Default: None.	`None`
`include_outliers`	`bool`	Use version of BuildingsBench with outliers.	`False`
`context_len`	`int`	Length of the context. Defaults to 168.	`168`
`pred_len`	`int`	Length of the prediction horizon. Defaults to 24.	`24`

Returns:

Name	Type	Description
`dataset`	`Union[TorchBuildingDatasetsFromCSV, TorchBuildingDatasetFromParquet]`	Dataset for benchmarking.

load_pandas_dataset

`buildings_bench.data.load_pandas_dataset(name, dataset_path=None, feature_set='engineered', weather_inputs=None, apply_scaler_transform='', scaler_transform_path=None, include_outliers=False)`

Load datasets by name.

Parameters:

Name	Type	Description	Default
`name`	`str`	Name of the dataset to load.	required
`dataset_path`	`Path`	Path to the benchmark data. Optional.	`None`
`feature_set`	`str`	Feature set to use. Default: 'engineered'.	`'engineered'`
`weather_inputs`	`List[str]`	list of weather feature names to use as additional inputs. Default: None.	`None`
`apply_scaler_transform`	`str`	If not using quantized load or unscaled loads, applies a {boxcox,standard} scaling transform to the load. Default: ''.	`''`
`scaler_transform_path`	`Path`	Path to data for transform, e.g., pickled data for BoxCox transform.	`None`
`include_outliers`	`bool`	Use version of BuildingsBench with outliers.	`False`

Returns:

Name	Type	Description
`dataset`	`PandasBuildingDatasetsFromCSV`	Generator of Pandas datasets for benchmarking.

The Buildings-900K PyTorch Dataset

`buildings_bench.data.buildings900K.Buildings900K`

Bases: Dataset

This is an indexed dataset for the Buildings-900K dataset. It uses an index file to quickly load a sub-sequence from a time series in a multi-building Parquet file. The index file is a tab-separated file with the following columns:

Building-type-and-year (e.g., comstock_tmy3_release_1)
Census region (e.g., by_puma_midwest)
PUMA ID
Building ID
Hour of year pointer (e.g., 0070)

The sequence pointer is used to extract the slice [pointer - context length : pointer + pred length] for a given building ID.
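The slice arithmetic above can be sketched as follows. This is a minimal illustration that assumes the five tab-separated columns appear in the order listed. The helper names and the PUMA and building IDs in the sample row are made up.

```python
def parse_index_line(line):
    """Split one tab-separated index row into its five fields."""
    building_type_year, census_region, puma_id, building_id, pointer = (
        line.rstrip("\n").split("\t")
    )
    # The zero-padded pointer (e.g., "0170") becomes an integer hour of year.
    return building_type_year, census_region, puma_id, building_id, int(pointer)


def window_bounds(pointer, context_len=168, pred_len=24):
    """Return the [start, end) bounds of the slice around the pointer."""
    return pointer - context_len, pointer + pred_len


row = "comstock_tmy3_release_1\tby_puma_midwest\tpuma_0001\t12345\t0170\n"
*_, building_id, pointer = parse_index_line(row)
start, end = window_bounds(pointer)
```

With the default lengths, the slice always spans `context_len + pred_len = 192` hours.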

The time series are not stored chronologically and must be sorted by timestamp after loading.

Each dataloader worker has its own file pointer to the index file, because sharing a single file pointer across worker processes causes multiprocessing errors. For random access, we seek to the correct line in the index file.
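One way to get random access by seeking is to make every index row the same byte length, so row `i` starts at byte `i * line_len`. The sketch below assumes that fixed-width layout, which the zero-padded pointer suggests but the text does not confirm. The two-column rows and the `read_row` helper are hypothetical.

```python
import os
import tempfile

# Build a tiny fixed-width index: every row has the same number of bytes.
rows = [
    f"bldg_{i:04d}\t{hour:04d}\n".encode("utf-8")
    for i, hour in enumerate([168, 192, 216])
]
line_len = len(rows[0])
assert all(len(r) == line_len for r in rows)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.writelines(rows)
    index_path = tmp.name


def read_row(fp, row_idx, line_len):
    """Jump straight to a row without scanning the rows before it."""
    fp.seek(row_idx * line_len)
    return fp.read(line_len).decode("utf-8").rstrip("\n").split("\t")


# Binary mode allows buffering=0, so each read goes to the requested offset.
fp = open(index_path, "rb", buffering=0)
third = read_row(fp, 2, line_len)
first = read_row(fp, 0, line_len)
fp.close()
os.remove(index_path)
```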

`__init__(dataset_path, index_file, context_len=168, pred_len=24, apply_scaler_transform='', scaler_transform_path=None, weather_inputs=None)`

Parameters:

Name	Type	Description	Default
`dataset_path`	`Path`	Path to the pretraining dataset.	required
`index_file`	`str`	Name of the index file	required
`context_len`	`int`	Length of the context. Defaults to 168. The index file has to be generated with the same context length.	`168`
`pred_len`	`int`	Length of the prediction horizon. Defaults to 24. The index file has to be generated with the same pred length.	`24`
`apply_scaler_transform`	`str`	Apply a scaler transform to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the scaler transform. Defaults to None.	`None`
`weather_inputs`	`List[str]`	list of weather features to use. Default: None.	`None`

`__read_index_file(index_file)`

Extract metadata from index file.

`collate_fn()`

Returns a function taking only one argument (the list of items to be batched).
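The pattern described here is a factory that returns a one-argument callable, which is the shape PyTorch's `DataLoader` expects for its `collate_fn` argument. The sketch below is generic and uses plain lists rather than tensors. `make_collate_fn` and the sample keys are hypothetical and do not reflect how `Buildings900K` actually batches items.

```python
def make_collate_fn(keys):
    """Return a one-argument function that batches a list of sample dicts."""

    def collate(samples):
        # Group the values of each key across all samples in the batch.
        return {key: [sample[key] for sample in samples] for key in keys}

    return collate


collate = make_collate_fn(keys=("load", "building_type"))
batch = collate(
    [
        {"load": [0.1, 0.2], "building_type": 0},
        {"load": [0.3, 0.4], "building_type": 1},
    ]
)
```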

`init_fp()`

Each worker needs to open its own file pointer, because sharing a file pointer across processes causes multiprocessing errors.

This is not called in the main process; it is called in the DataLoader worker_init_fn. The file is opened in binary mode, which lets us disable buffering.

TorchBuildingDataset

`buildings_bench.data.datasets.TorchBuildingDataset`

Bases: Dataset

PyTorch Dataset for a single building's energy timeseries (a Pandas Dataframe) with a timestamp index and a power column.

Used to iterate over mini-batches of 192-hour timeseries (168 hours of context, 24 hours prediction horizon).

`__init__(dataframe, building_latlon, building_type, context_len=168, pred_len=24, sliding_window=24, apply_scaler_transform='', scaler_transform_path=None, is_leap_year=False, weather_dataframe=None, weather_transform_path=None)`

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	Pandas DataFrame with a timestamp index and a 'power' column.	required
`building_latlon`	`List[float]`	Latitude and longitude of the building.	required
`building_type`	`BuildingTypes`	Building type for the dataset.	required
`context_len`	`int`	Length of context. Defaults to 168.	`168`
`pred_len`	`int`	Length of prediction. Defaults to 24.	`24`
`sliding_window`	`int`	Stride for sliding window to split timeseries into test samples. Defaults to 24.	`24`
`apply_scaler_transform`	`str`	Apply scaler transform {boxcox,standard} to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the pickled data for BoxCox transform. Defaults to None.	`None`
`is_leap_year`	`bool`	Is the year a leap year? Defaults to False.	`False`
`weather_dataframe`	`DataFrame`	Weather timeseries data. Defaults to None.	`None`
`weather_transform_path`	`Path`	Path to the pickled data for weather transform. Defaults to None.	`None`
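To make the roles of `context_len`, `pred_len`, and `sliding_window` concrete, the sketch below enumerates the start offsets of every full window in an hourly series. It illustrates the stride arithmetic only. `window_starts` is a hypothetical helper, and the class may handle the edges of the series differently.

```python
def window_starts(series_len, context_len=168, pred_len=24, sliding_window=24):
    """Start offsets of every full context-plus-horizon window in a series."""
    window_len = context_len + pred_len
    # Step by the stride; stop once a full window no longer fits.
    return list(range(0, series_len - window_len + 1, sliding_window))


starts = window_starts(8760)  # 8760 hours in a non-leap year
leap_starts = window_starts(8784)  # 8784 hours in a leap year
```

Under these assumptions, a non-leap year yields 358 windows and a leap year yields 359.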

PandasTransformerDataset

`buildings_bench.data.datasets.PandasTransformerDataset`

Bases: Dataset

Create a Torch Dataset from a Pandas DataFrame.

Used to iterate over mini-batches of, e.g., 192-hour (168 hours context + 24 hour pred horizon) timeseries.

`__init__(df, context_len=168, pred_len=24, sliding_window=24, weather_inputs=None)`

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	Pandas DataFrame with columns: load, latitude, longitude, hour of day, day of week, day of year, building type	required
`context_len`	`int`	Length of context. Defaults to 168.	`168`
`pred_len`	`int`	Length of prediction sequence for the forecasting model. Defaults to 24.	`24`
`sliding_window`	`int`	Stride for sliding window to split timeseries into test samples. Defaults to 24.	`24`
`weather_inputs`	`List[str]`	list of weather feature names to use as additional inputs. Defaults to None. The df is assumed to already have the weather inputs in the list as columns.	`None`

TorchBuildingDatasetFromParquet

`buildings_bench.data.datasets.TorchBuildingDatasetFromParquet`

Generate PyTorch Datasets out of EULP parquet files.

Each file has multiple buildings (with same Lat/Lon and building type) and each building is a column. All time series are for the same year.

Attributes:

Name	Type	Description
`building_datasets`	`dict`	Maps unique building ids to a TorchBuildingDataset.

`__init__(data_path, parquet_datasets, building_latlons, building_types, weather_inputs=None, context_len=168, pred_len=24, sliding_window=24, apply_scaler_transform='', scaler_transform_path=None, leap_years=None)`

Parameters:

Name	Type	Description	Default
`data_path`	`Path`	Path to the dataset	required
`parquet_datasets`	`List[str]`	List of paths to Parquet files. Each file has a timestamp index and multiple columns, one per building.	required
`building_latlons`	`List[List[float]]`	List of latlons for each parquet file.	required
`building_types`	`List[BuildingTypes]`	List of building types for each parquet file.	required
`weather_inputs`	`List[str]`	list of weather feature names to use as additional inputs. Default: None.	`None`
`context_len`	`int`	Length of context. Defaults to 168.	`168`
`pred_len`	`int`	Length of prediction. Defaults to 24.	`24`
`sliding_window`	`int`	Stride for sliding window to split timeseries into test samples. Defaults to 24.	`24`
`apply_scaler_transform`	`str`	Apply scaler transform {boxcox,standard} to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the pickled data for BoxCox transform. Defaults to None.	`None`
`leap_years`	`List[int]`	List of leap years. Defaults to None.	`None`

`__iter__()`

Generator to iterate over the building datasets.

Yields:

Type	Description
`Tuple[str, TorchBuildingDataset]`	A pair of building id and TorchBuildingDataset object.

TorchBuildingDatasetsFromCSV

`buildings_bench.data.datasets.TorchBuildingDatasetsFromCSV`

Generate PyTorch Datasets from a list of CSV files.

Attributes:

Name	Type	Description
`building_datasets`	`dict`	Maps unique building ids to a list of tuples (year, TorchBuildingDataset).

`__init__(data_path, building_year_files, building_latlon, building_type, weather_inputs=None, context_len=168, pred_len=24, sliding_window=24, apply_scaler_transform='', scaler_transform_path=None, leap_years=None)`

Parameters:

Name	Type	Description	Default
`data_path`	`Path`	Path to the dataset	required
`building_year_files`	`List[str]`	List of paths to CSV files. Each file has a timestamp index and multiple columns, one per building.	required
`building_type`	`BuildingTypes`	Building type for the dataset.	required
`weather_inputs`	`List[str]`	list of weather feature names to use as additional inputs. Defaults to None.	`None`
`context_len`	`int`	Length of context. Defaults to 168.	`168`
`pred_len`	`int`	Length of prediction sequence for the forecasting model. Defaults to 24.	`24`
`sliding_window`	`int`	Stride for sliding window to split timeseries into test samples. Defaults to 24.	`24`
`apply_scaler_transform`	`str`	Apply scaler transform {boxcox,standard} to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the pickled data for BoxCox transform. Defaults to None.	`None`
`leap_years`	`List[int]`	List of leap years. Defaults to None.	`None`

`__iter__()`

A Generator for TorchBuildingDataset objects.

Yields:

Type	Description
`Tuple[str, ConcatDataset]`	A tuple of the building id and a ConcatDataset of the TorchBuildingDataset objects for all years.
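The iteration contract described above (a mapping from building id to per-year datasets, yielded as one concatenated dataset per building) can be sketched without PyTorch. In this stand-in, plain lists play the role of `TorchBuildingDataset` objects and list concatenation plays the role of `ConcatDataset`. The class name, the `add` method, and the choice to order years ascending are assumptions for illustration.

```python
from itertools import chain


class MultiYearDatasets:
    """Stand-in container: building id -> list of (year, samples) tuples."""

    def __init__(self):
        self.building_datasets = {}

    def add(self, building_id, year, samples):
        self.building_datasets.setdefault(building_id, []).append((year, samples))

    def __iter__(self):
        for building_id, year_datasets in self.building_datasets.items():
            # Assumption: concatenate the years in ascending order.
            ordered = sorted(year_datasets, key=lambda pair: pair[0])
            yield building_id, list(chain.from_iterable(s for _, s in ordered))


datasets = MultiYearDatasets()
datasets.add("b1", 2015, ["c"])
datasets.add("b1", 2014, ["a", "b"])
datasets.add("b2", 2014, ["x"])
pairs = list(datasets)
```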

PandasBuildingDatasetsFromCSV

`buildings_bench.data.datasets.PandasBuildingDatasetsFromCSV`

Generate Pandas Dataframes from a list of CSV files.

Can be used with sklearn models or tree-based models that require Pandas Dataframes. In this case, set `features='engineered'` to generate a dataframe with engineered features.

Create a dictionary of building datasets from a list of CSV files. Used as a generator to iterate over Pandas Dataframes for each building. Each Pandas Dataframe contains all of the years of data for the building.

Attributes:

Name	Type	Description
`building_datasets`	`dict`	Maps unique building ids to a list of tuples (year, Dataframe).

`__init__(data_path, building_year_files, building_latlon, building_type, weather_inputs=None, features='transformer', apply_scaler_transform='', scaler_transform_path=None, leap_years=[])`

Parameters:

Name	Type	Description	Default
`data_path`	`Path`	Path to the dataset	required
`building_year_files`	`List[str]`	List of paths to CSV files. Each file has a timestamp index and multiple columns, one per building.	required
`building_type`	`BuildingTypes`	Building type for the dataset.	required
`weather_inputs`	`List[str]`	list of weather feature names to use as additional inputs. Defaults to None.	`None`
`features`	`str`	Type of features to use. Defaults to 'transformer'. {'transformer','engineered'} 'transformer' features: load, latitude, longitude, hour of day, day of week, day of year, building type 'engineered' features are an expansive list of mainly calendar-based features, useful for traditional ML models.	`'transformer'`
`apply_scaler_transform`	`str`	Apply scaler transform {boxcox,standard} to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the pickled data for BoxCox transform. Defaults to None.	`None`
`leap_years`	`List[int]`	List of leap years. Defaults to [].	`[]`
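The 'transformer' feature set includes three calendar fields: hour of day, day of week, and day of year. As a minimal sketch of where such values come from, the snippet below derives them from a timestamp with the standard library. `calendar_features` is a hypothetical helper, and the library may encode or normalize these values differently.

```python
from datetime import datetime


def calendar_features(ts):
    """Derive raw calendar fields from a timestamp."""
    return {
        "hour_of_day": ts.hour,  # 0..23
        "day_of_week": ts.weekday(),  # Monday is 0, Sunday is 6
        "day_of_year": ts.timetuple().tm_yday,  # 1..365, or 366 in leap years
    }


feats = calendar_features(datetime(2018, 3, 1, 13))
```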

`__iter__()`

Generator for iterating over the dataset.

Yields:

Type	Description
`Tuple[str, DataFrame]`	A pair of building id and Pandas dataframe. The dataframe has all years concatenated.
