buildings_bench.data

Functions and class definitions for loading Torch and Pandas datasets.

Main entry points for loading PyTorch and Pandas datasets:

load_pretraining() (used for pretraining)
load_torch_dataset() (used for benchmark tasks)
load_pandas_dataset() (used for benchmark tasks)

Available PyTorch Datasets:

Buildings900K (used for pretraining)
TorchBuildingDataset (used for benchmark tasks)
PandasTransformerDataset (used for benchmark tasks)

load_pretraining

`buildings_bench.data.load_pretraining(name: str, num_buildings_ablation: int = -1, apply_scaler_transform: str = '', scaler_transform_path: Path = None, context_len: int = 168, pred_len: int = 24) -> torch.utils.data.Dataset`

Pre-training datasets: buildings-900k-train, buildings-900k-val

Parameters:

Name	Type	Description	Default
`name`	`str`	Name of the dataset to load.	required
`num_buildings_ablation`	`int`	Number of buildings to use for pre-training. If -1, use all buildings.	`-1`
`apply_scaler_transform`	`str`	If not using quantized load or unscaled loads, applies a {boxcox,standard} scaling transform to the load. Default: ''.	`''`
`scaler_transform_path`	`Path`	Path to data for transform, e.g., pickled data for BoxCox transform.	`None`
`context_len`	`int`	Length of the context. Defaults to 168.	`168`
`pred_len`	`int`	Length of the prediction horizon. Defaults to 24.	`24`

Returns:

Type	Description
`torch.utils.data.Dataset`	Dataset for pretraining.
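
To make the two `apply_scaler_transform` options more concrete, here is a minimal standard-library sketch. It assumes that 'standard' means conventional z-score scaling and that 'boxcox' means the usual Box-Cox power transform with a fixed lambda. The function names, the lambda value, and the sample loads are hypothetical, and this is not the library's implementation, which reads its transform data from `scaler_transform_path`.

```python
import math
import statistics


def standard_scale(loads):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(loads)
    std = statistics.pstdev(loads)
    return [(x - mean) / std for x in loads], mean, std


def standard_unscale(scaled, mean, std):
    """Invert the z-score scaling."""
    return [z * std + mean for z in scaled]


def boxcox(loads, lam):
    """Box-Cox power transform; inputs must be strictly positive."""
    if lam == 0:
        return [math.log(x) for x in loads]
    return [(x ** lam - 1.0) / lam for x in loads]


def inverse_boxcox(values, lam):
    """Invert the Box-Cox transform."""
    if lam == 0:
        return [math.exp(y) for y in values]
    return [(y * lam + 1.0) ** (1.0 / lam) for y in values]


hourly_kwh = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up positive loads

scaled, mean, std = standard_scale(hourly_kwh)
restored = standard_unscale(scaled, mean, std)

bc = boxcox(hourly_kwh, lam=0.5)  # lambda chosen arbitrarily for illustration
bc_restored = inverse_boxcox(bc, lam=0.5)
```

Both transforms are invertible, so forecasts made in the scaled space can be mapped back to the original load units.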

load_torch_dataset

`buildings_bench.data.load_torch_dataset(name: str, dataset_path: Path = None, apply_scaler_transform: str = '', scaler_transform_path: Path = None, include_outliers: bool = False, context_len: int = 168, pred_len: int = 24) -> Union[TorchBuildingDatasetsFromCSV, TorchBuildingDatasetFromParquet]`

Load datasets by name.

Parameters:

Name	Type	Description	Default
`name`	`str`	Name of the dataset to load.	required
`dataset_path`	`Path`	Path to the benchmark data. Optional.	`None`
`apply_scaler_transform`	`str`	If not using quantized load or unscaled loads, applies a {boxcox,standard} scaling transform to the load. Default: ''.	`''`
`scaler_transform_path`	`Path`	Path to data for transform, e.g., pickled data for BoxCox transform.	`None`
`include_outliers`	`bool`	Use version of BuildingsBench with outliers.	`False`
`context_len`	`int`	Length of the context. Defaults to 168.	`168`
`pred_len`	`int`	Length of the prediction horizon. Defaults to 24.	`24`

Returns:

Name	Type	Description
`dataset`	`Union[TorchBuildingDatasetsFromCSV, TorchBuildingDatasetFromParquet]`	Dataset for benchmarking.

load_pandas_dataset

`buildings_bench.data.load_pandas_dataset(name: str, dataset_path: Path = None, feature_set: str = 'engineered', apply_scaler_transform: str = '', scaler_transform_path: Path = None, include_outliers: bool = False) -> PandasBuildingDatasetsFromCSV`

Load datasets by name.

Parameters:

Name	Type	Description	Default
`name`	`str`	Name of the dataset to load.	required
`dataset_path`	`Path`	Path to the benchmark data. Optional.	`None`
`feature_set`	`str`	Feature set to use. Default: 'engineered'.	`'engineered'`
`apply_scaler_transform`	`str`	If not using quantized load or unscaled loads, applies a {boxcox,standard} scaling transform to the load. Default: ''.	`''`
`scaler_transform_path`	`Path`	Path to data for transform, e.g., pickled data for BoxCox transform.	`None`
`include_outliers`	`bool`	Use version of BuildingsBench with outliers.	`False`

Returns:

Name	Type	Description
`dataset`	`PandasBuildingDatasetsFromCSV`	Generator of Pandas datasets for benchmarking.

The Buildings-900K PyTorch Dataset

`buildings_bench.data.buildings900K.Buildings900K`

Bases: torch.utils.data.Dataset

This is an indexed dataset for the Buildings-900K dataset. It uses an index file to quickly load a sub-sequence from a time series in a multi-building Parquet file. The index file is a tab-separated file with the following columns:

Building-type-and-year (e.g., comstock_tmy3_release_1)
Census region (e.g., by_puma_midwest)
PUMA ID
Building ID
Hour of year pointer (e.g., 0070)

The sequence pointer is used to extract the slice [pointer - context length : pointer + pred length] for a given building ID.
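
The slice arithmetic can be sketched in plain Python. The helper name `extract_window` and the toy series are hypothetical; the sketch only illustrates how a pointer splits a window into a context part and a prediction part, not how the dataset class reads Parquet data.

```python
def extract_window(series, pointer, context_len=168, pred_len=24):
    """Return (context, target) for the slice [pointer - context_len : pointer + pred_len]."""
    start = pointer - context_len
    end = pointer + pred_len
    if start < 0 or end > len(series):
        raise ValueError("pointer leaves too little history or future for this window")
    window = series[start:end]
    # The first context_len values are the model input, the rest are the forecast target.
    return window[:context_len], window[context_len:]


# Toy stand-in for one building's hourly load over a non-leap year (8760 hours).
hourly_load = list(range(8760))

# Pointers are zero-padded strings in the index file; "0170" is a made-up value.
pointer = int("0170")
context, target = extract_window(hourly_load, pointer)
```

With the default lengths, each extracted window covers 168 + 24 = 192 hours.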

The time series are not stored chronologically and must be sorted by timestamp after loading.

Each dataloader worker has its own file pointer to the index file, which avoids the multiprocessing errors that arise when workers share a file pointer. We 'seek' to the correct line in the index file for random access.
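
Seek-based random access can be sketched as follows, under the assumption that every line of the index file has the same byte length (zero-padded fields such as `0070` would make that possible). The rows, the PUMA and building IDs, the file name, and the helper `read_index_row` are all hypothetical.

```python
import os
import tempfile

# Made-up index rows with identical field widths.
rows = [
    ("comstock_tmy3_release_1", "by_puma_midwest", "G17000100", "000101", "0170"),
    ("comstock_tmy3_release_1", "by_puma_midwest", "G17000100", "000101", "0194"),
    ("comstock_tmy3_release_1", "by_puma_midwest", "G17000200", "000202", "0218"),
]
lines = [("\t".join(row) + "\n").encode("utf-8") for row in rows]
line_len = len(lines[0])  # bytes per line, including the newline


def read_index_row(fp, row_number, line_len):
    """Jump straight to a row by byte offset instead of scanning the file."""
    fp.seek(row_number * line_len)
    raw = fp.read(line_len)
    return raw.decode("utf-8").rstrip("\n").split("\t")


with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "toy_index.idx")
    with open(index_path, "wb") as f:
        f.writelines(lines)

    # Binary mode allows buffering=0, so each read goes directly to the file.
    with open(index_path, "rb", buffering=0) as fp:
        third_row = read_index_row(fp, 2, line_len)
        first_row = read_index_row(fp, 0, line_len)
```

Because the offset is computed from the row number, rows can be read in any order at the same cost, which suits shuffled sampling.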

`__init__(dataset_path: Path, index_file: str, context_len: int = 168, pred_len: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None)`

Parameters:

Name	Type	Description	Default
`dataset_path`	`Path`	Path to the pretraining dataset.	required
`index_file`	`str`	Name of the index file	required
`context_len`	`int`	Length of the context. Defaults to 168. The index file has to be generated with the same context length.	`168`
`pred_len`	`int`	Length of the prediction horizon. Defaults to 24. The index file has to be generated with the same pred length.	`24`
`apply_scaler_transform`	`str`	Apply a scaler transform to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the scaler transform. Defaults to None.	`None`

`init_fp()`

Each worker needs to open its own file pointer to avoid multiprocessing errors from sharing a file pointer.

This is called in the DataLoader worker_init_fn, not in the main process. The file is opened in binary mode, which lets us disable buffering.

`__read_index_file(index_file: Path) -> None`

Extract metadata from index file.

TorchBuildingDataset

`buildings_bench.data.datasets.TorchBuildingDataset`

Bases: torch.utils.data.Dataset

PyTorch Dataset for a single building's Pandas Dataframe with a timestamp index and a 'power' column.

Used to iterate over mini-batches of 192-hour subsequences.

`__init__(dataframe: pd.DataFrame, building_latlon: List[float], building_type: BuildingTypes, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, is_leap_year: bool = False)`

Parameters:

Name	Type	Description	Default
`dataframe`	`pd.DataFrame`	Pandas DataFrame with a timestamp index and a 'power' column.	required
`building_latlon`	`List[float]`	Latitude and longitude of the building.	required
`building_type`	`BuildingTypes`	Building type for the dataset.	required
`context_len`	`int`	Length of context. Defaults to 168.	`168`
`pred_len`	`int`	Length of prediction. Defaults to 24.	`24`
`sliding_window`	`int`	Stride for sliding window to split timeseries into test samples. Defaults to 24.	`24`
`apply_scaler_transform`	`str`	Apply scaler transform {boxcox,standard} to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the pickled data for BoxCox transform. Defaults to None.	`None`
`is_leap_year`	`bool`	Is the year a leap year? Defaults to False.	`False`
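
The interaction between `context_len`, `pred_len`, and `sliding_window` can be sketched without PyTorch. The helper `window_bounds` is hypothetical and shows one plausible way to enumerate windows; the library's exact handling of the final window may differ.

```python
def window_bounds(n_hours, context_len=168, pred_len=24, sliding_window=24):
    """Return (start, split, end) index triples, one per sample window."""
    total = context_len + pred_len  # 192 hours with the defaults
    bounds = []
    for start in range(0, n_hours - total + 1, sliding_window):
        # Hours [start, split) are the context and [split, end) the prediction target.
        bounds.append((start, start + context_len, start + total))
    return bounds


# A short 10-day series (240 hours) yields three windows with a 24-hour stride.
bounds = window_bounds(n_hours=240)
```

With a stride equal to `pred_len`, consecutive prediction targets tile the series without overlapping, while the contexts overlap heavily.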

PandasTransformerDataset

`buildings_bench.data.datasets.PandasTransformerDataset`

Bases: torch.utils.data.Dataset

Create a Torch Dataset out of a Pandas DataFrame.

Used to iterate over mini-batches of 192-hour sub-sequences.

`__init__(df: pd.DataFrame, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24)`

Parameters:

Name	Type	Description	Default
`df`	`pd.DataFrame`	Pandas DataFrame with columns: load, latitude, longitude, hour of day, day of week, day of year, building type	required
`context_len`	`int`	Length of context. Defaults to 168.	`168`
`pred_len`	`int`	Length of prediction sequence for the forecasting model. Defaults to 24.	`24`
`sliding_window`	`int`	Stride for sliding window to split timeseries into test samples. Defaults to 24.	`24`

TorchBuildingDatasetFromParquet

`buildings_bench.data.datasets.TorchBuildingDatasetFromParquet`

Generate PyTorch Datasets out of Parquet files.

Each file has multiple buildings (with same Lat/Lon and building type) and each building is a column. All time series are for the same year.

Attributes:

Name	Type	Description
`building_datasets`	`dict`	Maps unique building ids to a TorchBuildingDataset.

`__init__(parquet_datasets: List[str], building_latlons: List[List[float]], building_types: List[BuildingTypes], context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = None)`

Parameters:

Name	Type	Description	Default
`parquet_datasets`	`List[str]`	List of paths to a parquet file, each has a timestamp index and multiple columns, one per building.	required
`building_latlons`	`List[List[float]]`	List of latlons for each parquet file.	required
`building_types`	`List[BuildingTypes]`	List of building types for each parquet file.	required
`context_len`	`int`	Length of context. Defaults to 168.	`168`
`pred_len`	`int`	Length of prediction. Defaults to 24.	`24`
`sliding_window`	`int`	Stride for sliding window to split timeseries into test samples. Defaults to 24.	`24`
`apply_scaler_transform`	`str`	Apply scaler transform {boxcox,standard} to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the pickled data for BoxCox transform. Defaults to None.	`None`
`leap_years`	`List[int]`	List of leap years. Defaults to None.	`None`

`__iter__() -> Iterator[Tuple[str, TorchBuildingDataset]]`

Generator to iterate over the building datasets.

Yields:

Type	Description
`Iterator[Tuple[str, TorchBuildingDataset]]`	A pair of building id, TorchBuildingDataset objects.

TorchBuildingDatasetsFromCSV

`buildings_bench.data.datasets.TorchBuildingDatasetsFromCSV`

Generate PyTorch Datasets from a list of CSV files.

Attributes:

Name	Type	Description
`building_datasets`	`dict`	Maps unique building ids to a list of tuples (year, TorchBuildingDataset).

`__init__(data_path: Path, building_year_files: List[str], building_latlon: List[float], building_type: BuildingTypes, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = None)`

Parameters:

Name	Type	Description	Default
`data_path`	`Path`	Path to the dataset	required
`building_year_files`	`List[str]`	List of paths to a csv file, each has a timestamp index and multiple columns, one per building.	required
`building_latlon`	`List[float]`	Latitude and longitude of the building.	required
`building_type`	`BuildingTypes`	Building type for the dataset.	required
`context_len`	`int`	Length of context. Defaults to 168.	`168`
`pred_len`	`int`	Length of prediction sequence for the forecasting model. Defaults to 24.	`24`
`sliding_window`	`int`	Stride for sliding window to split timeseries into test samples. Defaults to 24.	`24`
`apply_scaler_transform`	`str`	Apply scaler transform {boxcox,standard} to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the pickled data for BoxCox transform. Defaults to None.	`None`
`leap_years`	`List[int]`	List of leap years. Defaults to None.	`None`

`__iter__() -> Iterator[Tuple[str, torch.utils.data.ConcatDataset]]`

A Generator for TorchBuildingDataset objects.

Yields:

Type	Description
`Iterator[Tuple[str, torch.utils.data.ConcatDataset]]`	A tuple of the building id and a ConcatDataset of the TorchBuildingDataset objects for all years.
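
The per-building iteration pattern can be sketched in plain Python. The class `ToyBuildingDatasets` is hypothetical; plain lists stand in for the per-year `TorchBuildingDataset` objects, list concatenation stands in for `ConcatDataset`, and sorting by year is an assumption rather than documented behavior.

```python
class ToyBuildingDatasets:
    """Maps building ids to a list of (year, samples) tuples."""

    def __init__(self):
        self.building_datasets = {}

    def add(self, building_id, year, samples):
        self.building_datasets.setdefault(building_id, []).append((year, samples))

    def __iter__(self):
        # Yield one (building id, merged samples) pair per building.
        for building_id, yearly in self.building_datasets.items():
            merged = []
            for _, samples in sorted(yearly, key=lambda pair: pair[0]):
                merged.extend(samples)
            yield building_id, merged


datasets = ToyBuildingDatasets()
datasets.add("bldg_a", 2015, ["a-2015-w0", "a-2015-w1"])
datasets.add("bldg_a", 2014, ["a-2014-w0"])
datasets.add("bldg_b", 2014, ["b-2014-w0"])

results = list(datasets)
```

A caller can therefore evaluate one building at a time with a simple `for building_id, dataset in datasets:` loop.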

PandasBuildingDatasetsFromCSV

`buildings_bench.data.datasets.PandasBuildingDatasetsFromCSV`

Generate Pandas Dataframes from a list of CSV files.

Create a dictionary of building datasets from a list of CSV files. Used as a generator to iterate over Pandas DataFrames for each building. Each DataFrame contains all of the years of data for the building.

Attributes:

Name	Type	Description
`building_datasets`	`dict`	Maps unique building ids to a list of tuples (year, Dataframe).

`__init__(data_path: Path, building_year_files: List[str], building_latlon: List[float], building_type: BuildingTypes, features: str = 'transformer', apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = [])`

Parameters:

Name	Type	Description	Default
`data_path`	`Path`	Path to the dataset	required
`building_year_files`	`List[str]`	List of paths to a csv file, each has a timestamp index and multiple columns, one per building.	required
`building_latlon`	`List[float]`	Latitude and longitude of the building.	required
`building_type`	`BuildingTypes`	Building type for the dataset.	required
`features`	`str`	Type of features to use. Defaults to 'transformer'. {'transformer','engineered'} 'transformer' features: load, latitude, longitude, hour of day, day of week, day of year, building type 'engineered' features are an expansive list of mainly calendar-based features, useful for traditional ML models.	`'transformer'`
`apply_scaler_transform`	`str`	Apply scaler transform {boxcox,standard} to the load. Defaults to ''.	`''`
`scaler_transform_path`	`Path`	Path to the pickled data for BoxCox transform. Defaults to None.	`None`
`leap_years`	`List[int]`	List of leap years. Defaults to [].	`[]`

`__iter__() -> Iterator[Tuple[str, pd.DataFrame]]`

Generator for iterating over the dataset.

Yields:

Type	Description
`Iterator[Tuple[str, pd.DataFrame]]`	A pair of building id and Pandas dataframe. The dataframe has all years concatenated.
