buildings_bench.data

Functions and class definitions for loading Torch and Pandas datasets.

Main entry points for loading PyTorch and Pandas datasets:

  • load_pretraining() (used for pretraining)
  • load_torch_dataset() (used for benchmark tasks)
  • load_pandas_dataset() (used for benchmark tasks)

Available PyTorch Datasets:

  • Buildings900K (used for pretraining)
  • TorchBuildingDataset (used for benchmark tasks)
  • PandasTransformerDataset (used for benchmark tasks)

load_pretraining

buildings_bench.data.load_pretraining(name: str, num_buildings_ablation: int = -1, apply_scaler_transform: str = '', scaler_transform_path: Path = None, context_len: int = 168, pred_len: int = 24) -> torch.utils.data.Dataset

Load a pre-training dataset by name. Available names: buildings-900k-train, buildings-900k-val.

Parameters:

Name Type Description Default
name str

Name of the dataset to load.

required
num_buildings_ablation int

Number of buildings to use for pre-training. If -1, use all buildings.

-1
apply_scaler_transform str

Scaling transform to apply to the load, one of {boxcox, standard}. Use it when working with neither quantized loads nor unscaled loads. Default: ''.

''
scaler_transform_path Path

Path to data for transform, e.g., pickled data for BoxCox transform.

None
context_len int

Length of the context. Defaults to 168.

168
pred_len int

Length of the prediction horizon. Defaults to 24.

24

Returns:

Type Description
torch.utils.data.Dataset

Dataset for pretraining.
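The two apply_scaler_transform options named above, standard and Box-Cox, are well-known transforms that can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names, the toy loads, and the choice of lambda are assumptions, and in practice the fitted parameters would come from the file at scaler_transform_path.

```python
import math
import statistics

def standard_scale(values, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mean) / std for v in values]

def standard_unscale(scaled, mean, std):
    """Undo standard_scale."""
    return [s * std + mean for s in scaled]

def boxcox(values, lam):
    """Box-Cox transform; inputs must be strictly positive."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def inverse_boxcox(transformed, lam):
    """Undo boxcox for the same lambda."""
    if lam == 0:
        return [math.exp(t) for t in transformed]
    return [(t * lam + 1.0) ** (1.0 / lam) for t in transformed]

# Toy load values (assumed, for illustration only).
loads = [1.2, 3.4, 0.8, 5.6, 2.1]

mean, std = statistics.fmean(loads), statistics.pstdev(loads)
scaled = standard_scale(loads, mean, std)
restored = standard_unscale(scaled, mean, std)

lam = 0.5  # hypothetical lambda; a real one would be fitted on training data
bc = boxcox(loads, lam)
bc_restored = inverse_boxcox(bc, lam)
```

Both transforms are invertible, so forecasts made in the transformed space can be mapped back to the original load units.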

load_torch_dataset

buildings_bench.data.load_torch_dataset(name: str, dataset_path: Path = None, apply_scaler_transform: str = '', scaler_transform_path: Path = None, include_outliers: bool = False, context_len: int = 168, pred_len: int = 24) -> Union[TorchBuildingDatasetsFromCSV, TorchBuildingDatasetFromParquet]

Load datasets by name.

Parameters:

Name Type Description Default
name str

Name of the dataset to load.

required
dataset_path Path

Path to the benchmark data. Optional.

None
apply_scaler_transform str

Scaling transform to apply to the load, one of {boxcox, standard}. Use it when working with neither quantized loads nor unscaled loads. Default: ''.

''
scaler_transform_path Path

Path to data for transform, e.g., pickled data for BoxCox transform.

None
include_outliers bool

Use version of BuildingsBench with outliers.

False
context_len int

Length of the context. Defaults to 168.

168
pred_len int

Length of the prediction horizon. Defaults to 24.

24

Returns:

Name Type Description
dataset Union[TorchBuildingDatasetsFromCSV, TorchBuildingDatasetFromParquet]

Dataset for benchmarking.

load_pandas_dataset

buildings_bench.data.load_pandas_dataset(name: str, dataset_path: Path = None, feature_set: str = 'engineered', apply_scaler_transform: str = '', scaler_transform_path: Path = None, include_outliers: bool = False) -> PandasBuildingDatasetsFromCSV

Load datasets by name.

Parameters:

Name Type Description Default
name str

Name of the dataset to load.

required
dataset_path Path

Path to the benchmark data. Optional.

None
feature_set str

Feature set to use. Default: 'engineered'.

'engineered'
apply_scaler_transform str

Scaling transform to apply to the load, one of {boxcox, standard}. Use it when working with neither quantized loads nor unscaled loads. Default: ''.

''
scaler_transform_path Path

Path to data for transform, e.g., pickled data for BoxCox transform.

None
include_outliers bool

Use version of BuildingsBench with outliers.

False

Returns:

Name Type Description
dataset PandasBuildingDatasetsFromCSV

Generator of Pandas datasets for benchmarking.


The Buildings-900K PyTorch Dataset

buildings_bench.data.buildings900K.Buildings900K

Bases: torch.utils.data.Dataset

An indexed dataset for Buildings-900K. It uses an index file to quickly load a sub-sequence from a time series in a multi-building Parquet file. The index file is a tab-separated file with the following columns:

  1. Building-type-and-year (e.g., comstock_tmy3_release_1)
  2. Census region (e.g., by_puma_midwest)
  3. PUMA ID
  4. Building ID
  5. Hour of year pointer (e.g., 0070)

The sequence pointer is used to extract the slice [pointer - context length : pointer + pred length] for a given building ID.

The time series are not stored chronologically and must be sorted by timestamp after loading.
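The pointer arithmetic and the sort step described above can be sketched in plain Python. This is a minimal illustration rather than the library's code: the extract_window helper, the (timestamp, load) record layout, the bounds check, and the toy data are all assumptions made for the example.

```python
from datetime import datetime, timedelta
import random

def extract_window(records, pointer, context_len=168, pred_len=24):
    """Return the loads in [pointer - context_len, pointer + pred_len).

    `records` is a list of (timestamp, load) pairs in arbitrary order
    (a hypothetical layout chosen for this sketch).
    """
    if pointer < context_len or pointer + pred_len > len(records):
        raise ValueError("pointer leaves no room for the full window")
    # The stored series is not chronological, so sort by timestamp first.
    ordered = sorted(records, key=lambda rec: rec[0])
    window = ordered[pointer - context_len : pointer + pred_len]
    return [load for _, load in window]

# Toy hourly series for one year, shuffled to mimic unordered storage.
start = datetime(2018, 1, 1)
records = [(start + timedelta(hours=h), float(h)) for h in range(8760)]
random.Random(0).shuffle(records)

window = extract_window(records, pointer=200)
context, target = window[:168], window[168:]
```

With the default lengths, the window holds 168 context hours followed by 24 target hours.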

Each DataLoader worker has its own file pointer to the index file, which avoids multiprocessing errors caused by sharing one file pointer across processes. Random access works by seeking to the correct line in the index file.
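Seeking to a line by number only works if the byte offset of that line can be computed. One common way to get that, assumed here and not confirmed by this page, is to pad every line of the index file to the same width. The helper names and the sample rows below are made up for the sketch.

```python
import os
import tempfile

def write_fixed_width_index(path, rows):
    """Write tab-separated rows, padding every line to the same byte length."""
    lines = ["\t".join(row) for row in rows]
    width = max(len(line) for line in lines)
    with open(path, "wb") as f:
        for line in lines:
            f.write(line.ljust(width).encode("ascii") + b"\n")
    return width + 1  # bytes per line, including the newline

def read_index_line(fp, line_number, line_size):
    """Jump straight to `line_number` without scanning earlier lines."""
    fp.seek(line_number * line_size)
    raw = fp.read(line_size).decode("ascii")
    return raw.rstrip(" \n").split("\t")

# Made-up rows following the five documented columns.
rows = [
    ("comstock_tmy3_release_1", "by_puma_midwest", "puma_a", "101", "0070"),
    ("comstock_tmy3_release_1", "by_puma_midwest", "puma_a", "101", "0094"),
    ("resstock_tmy3_release_1", "by_puma_south", "puma_b", "7", "0200"),
]

with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "index.idx")
    line_size = write_fixed_width_index(index_path, rows)
    # Binary mode permits buffering=0, so reads go straight to the file.
    with open(index_path, "rb", buffering=0) as fp:
        third = read_index_line(fp, 2, line_size)
        first = read_index_line(fp, 0, line_size)
```

Because each lookup is a single seek and read, the cost of fetching a row does not grow with its position in the file.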

__init__(dataset_path: Path, index_file: str, context_len: int = 168, pred_len: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None)

Parameters:

Name Type Description Default
dataset_path Path

Path to the pretraining dataset.

required
index_file str

Name of the index file.

required
context_len int

Length of the context. Defaults to 168. The index file has to be generated with the same context length.

168
pred_len int

Length of the prediction horizon. Defaults to 24. The index file has to be generated with the same pred length.

24
apply_scaler_transform str

Apply a scaler transform to the load. Defaults to ''.

''
scaler_transform_path Path

Path to the scaler transform. Defaults to None.

None
init_fp()

Each worker needs to open its own file pointer to avoid multiprocessing errors caused by sharing one file pointer.

This is not called in the main process; it is called from the DataLoader worker_init_fn. The file is opened in binary mode, which allows buffering to be disabled.
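The deferred-open pattern described here can be sketched without PyTorch. LazyIndexReader and the two-argument worker_init_fn below are hypothetical stand-ins; a real DataLoader hook receives only worker_id and would typically reach the dataset through torch.utils.data.get_worker_info().dataset.

```python
import os
import tempfile

class LazyIndexReader:
    """Hypothetical stand-in for a dataset that defers opening its index file."""

    def __init__(self, index_path):
        self.index_path = index_path
        self.fp = None  # nothing is opened when the object is constructed

    def init_fp(self):
        # Unbuffered binary handle, private to whichever process calls this.
        self.fp = open(self.index_path, "rb", buffering=0)

    def close(self):
        if self.fp is not None:
            self.fp.close()
            self.fp = None

def worker_init_fn(worker_id, dataset):
    """Shape of a per-worker hook (simplified: the dataset is passed in directly)."""
    dataset.init_fp()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.idx")
    with open(path, "wb") as f:
        f.write(b"row0\nrow1\n")

    # Two readers stand in for two workers, each with its own handle.
    reader_a, reader_b = LazyIndexReader(path), LazyIndexReader(path)
    opened_before_init = reader_a.fp is not None
    worker_init_fn(0, reader_a)
    worker_init_fn(1, reader_b)

    reader_a.fp.seek(5)              # moves only reader_a's offset
    line_a = reader_a.fp.readline()
    line_b = reader_b.fp.readline()  # reader_b still starts at offset 0

    reader_a.close()
    reader_b.close()
```

Separate handles keep separate offsets, so a seek in one worker cannot disturb a read in another.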

__read_index_file(index_file: Path) -> None

Extract metadata from index file.


TorchBuildingDataset

buildings_bench.data.datasets.TorchBuildingDataset

Bases: torch.utils.data.Dataset

PyTorch Dataset for a single building's Pandas Dataframe with a timestamp index and a 'power' column.

Used to iterate over mini-batches of 192-hour sub-sequences (168 context hours plus 24 prediction hours with the default lengths).

__init__(dataframe: pd.DataFrame, building_latlon: List[float], building_type: BuildingTypes, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, is_leap_year: bool = False)

Parameters:

Name Type Description Default
dataframe pd.DataFrame

Pandas DataFrame with a timestamp index and a 'power' column.

required
building_latlon List[float]

Latitude and longitude of the building.

required
building_type BuildingTypes

Building type for the dataset.

required
context_len int

Length of context. Defaults to 168.

168
pred_len int

Length of prediction. Defaults to 24.

24
sliding_window int

Stride for sliding window to split timeseries into test samples. Defaults to 24.

24
apply_scaler_transform str

Apply scaler transform {boxcox,standard} to the load. Defaults to ''.

''
scaler_transform_path Path

Path to the pickled data for BoxCox transform. Defaults to None.

None
is_leap_year bool

Is the year a leap year? Defaults to False.

False
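How context_len, pred_len, and sliding_window interact can be illustrated by listing window start offsets. This is a sketch under an assumption: it keeps only full windows and starts at offset 0, which this page does not state. The window_starts helper is hypothetical.

```python
def window_starts(series_len, context_len=168, pred_len=24, sliding_window=24):
    """Start offsets of every full (context + prediction) window."""
    window_len = context_len + pred_len
    last_start = series_len - window_len
    # range() is empty when the series is shorter than one window.
    return list(range(0, last_start + 1, sliding_window))

# One non-leap year of hourly data.
starts = window_starts(8760)
num_samples = len(starts)
```

With a stride of 24 and a prediction length of 24, consecutive windows have back-to-back, non-overlapping prediction horizons while their contexts overlap.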

PandasTransformerDataset

buildings_bench.data.datasets.PandasTransformerDataset

Bases: torch.utils.data.Dataset

Create a Torch Dataset out of a Pandas DataFrame.

Used to iterate over mini-batches of 192-hour sub-sequences.

__init__(df: pd.DataFrame, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24)

Parameters:

Name Type Description Default
df pd.DataFrame

Pandas DataFrame with columns: load, latitude, longitude, hour of day, day of week, day of year, building type

required
context_len int

Length of context. Defaults to 168.

168
pred_len int

Length of prediction sequence for the forecasting model. Defaults to 24.

24
sliding_window int

Stride for sliding window to split timeseries into test samples. Defaults to 24.

24
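The calendar columns expected in df (hour of day, day of week, day of year) can be derived from a timestamp with the standard library. The sketch below is an assumption-laden illustration: the dictionary keys are made up, and the library may encode or normalize these values differently.

```python
from datetime import datetime

def calendar_features(ts):
    """Derive the three calendar features from one timestamp."""
    return {
        "hour_of_day": ts.hour,                  # 0..23
        "day_of_week": ts.weekday(),             # 0 = Monday .. 6 = Sunday
        "day_of_year": ts.timetuple().tm_yday,   # 1..365, or 366 in leap years
    }

features = calendar_features(datetime(2018, 1, 1, 5))
leap_end = calendar_features(datetime(2020, 12, 31, 23))
```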

TorchBuildingDatasetFromParquet

buildings_bench.data.datasets.TorchBuildingDatasetFromParquet

Generate PyTorch Datasets out of Parquet files.

Each file has multiple buildings (with same Lat/Lon and building type) and each building is a column. All time series are for the same year.

Attributes:

Name Type Description
building_datasets dict

Maps unique building ids to a TorchBuildingDataset.

__init__(parquet_datasets: List[str], building_latlons: List[List[float]], building_types: List[BuildingTypes], context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = None)

Parameters:

Name Type Description Default
parquet_datasets List[str]

List of paths to Parquet files; each file has a timestamp index and multiple columns, one per building.

required
building_latlons List[List[float]]

List of [latitude, longitude] pairs, one per Parquet file.

required
building_types List[BuildingTypes]

List of building types for each parquet file.

required
context_len int

Length of context. Defaults to 168.

168
pred_len int

Length of prediction. Defaults to 24.

24
sliding_window int

Stride for sliding window to split timeseries into test samples. Defaults to 24.

24
apply_scaler_transform str

Apply scaler transform {boxcox,standard} to the load. Defaults to ''.

''
scaler_transform_path Path

Path to the pickled data for BoxCox transform. Defaults to None.

None
leap_years List[int]

List of leap years. Defaults to None.

None
__iter__() -> Iterator[Tuple[str, TorchBuildingDataset]]

Generator to iterate over the building datasets.

Yields:

Type Description
Iterator[Tuple[str, TorchBuildingDataset]]

A (building id, TorchBuildingDataset) pair.

TorchBuildingDatasetsFromCSV

buildings_bench.data.datasets.TorchBuildingDatasetsFromCSV

Generate PyTorch Datasets from a list of CSV files.

Attributes:

Name Type Description
building_datasets dict

Maps unique building ids to a list of tuples (year, TorchBuildingDataset).

__init__(data_path: Path, building_year_files: List[str], building_latlon: List[float], building_type: BuildingTypes, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = None)

Parameters:

Name Type Description Default
data_path Path

Path to the dataset.

required
building_year_files List[str]

List of paths to CSV files; each file has a timestamp index and multiple columns, one per building.

required
building_latlon List[float]

Latitude and longitude of the buildings.

required
building_type BuildingTypes

Building type for the dataset.

required
context_len int

Length of context. Defaults to 168.

168
pred_len int

Length of prediction sequence for the forecasting model. Defaults to 24.

24
sliding_window int

Stride for sliding window to split timeseries into test samples. Defaults to 24.

24
apply_scaler_transform str

Apply scaler transform {boxcox,standard} to the load. Defaults to ''.

''
scaler_transform_path Path

Path to the pickled data for BoxCox transform. Defaults to None.

None
leap_years List[int]

List of leap years. Defaults to None.

None
__iter__() -> Iterator[Tuple[str, torch.utils.data.ConcatDataset]]

A Generator for TorchBuildingDataset objects.

Yields:

Type Description
Iterator[Tuple[str, torch.utils.data.ConcatDataset]]

A tuple of the building id and a ConcatDataset of the TorchBuildingDataset objects for all years.
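The iteration contract described above, a mapping from building id to per-year datasets that is flattened into one dataset per building, can be mimicked with plain lists. YearlyDatasets is a hypothetical stand-in: lists play the role of TorchBuildingDataset objects, and list concatenation plays the role of ConcatDataset.

```python
from itertools import chain

class YearlyDatasets:
    """Hypothetical stand-in mirroring the documented attribute and __iter__."""

    def __init__(self, building_datasets):
        # building id -> list of (year, dataset) tuples
        self.building_datasets = building_datasets

    def __iter__(self):
        for building_id, yearly in self.building_datasets.items():
            # Join the per-year datasets in stored order, as ConcatDataset would.
            combined = list(chain.from_iterable(ds for _, ds in yearly))
            yield building_id, combined

datasets = YearlyDatasets({
    "bldg_1": [(2018, [1, 2]), (2019, [3])],
    "bldg_2": [(2018, [9])],
})
pairs = list(datasets)
```

A caller therefore loops with `for building_id, dataset in datasets:` and sees each building once, with all of its years combined.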

PandasBuildingDatasetsFromCSV

buildings_bench.data.datasets.PandasBuildingDatasetsFromCSV

Generate Pandas Dataframes from a list of CSV files.

Create a dictionary of building datasets from a list of CSV files. Used as a generator to iterate over Pandas DataFrames, one per building. Each DataFrame contains all of the years of data for the building.

Attributes:

Name Type Description
building_datasets dict

Maps unique building ids to a list of tuples (year, Dataframe).

__init__(data_path: Path, building_year_files: List[str], building_latlon: List[float], building_type: BuildingTypes, features: str = 'transformer', apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = [])

Parameters:

Name Type Description Default
data_path Path

Path to the dataset.

required
building_year_files List[str]

List of paths to CSV files; each file has a timestamp index and multiple columns, one per building.

required
building_latlon List[float]

Latitude and longitude of the buildings.

required
building_type BuildingTypes

Building type for the dataset.

required
features str

Type of features to use, one of {'transformer', 'engineered'}. Defaults to 'transformer'. The 'transformer' features are: load, latitude, longitude, hour of day, day of week, day of year, building type. The 'engineered' features are an expansive list of mainly calendar-based features, useful for traditional ML models.

'transformer'
apply_scaler_transform str

Apply scaler transform {boxcox,standard} to the load. Defaults to ''.

''
scaler_transform_path Path

Path to the pickled data for BoxCox transform. Defaults to None.

None
leap_years List[int]

List of leap years. Defaults to [].

[]
__iter__() -> Iterator[Tuple[str, pd.DataFrame]]

Generator for iterating over the dataset.

Yields:

Type Description
Iterator[Tuple[str, pd.DataFrame]]

A pair of building id and Pandas dataframe. The dataframe has all years concatenated.