Datasets

The pretraining dataset and evaluation data are available for download here as tar files, or can be accessed via AWS S3 here. The benchmark datasets total less than 1GB, and the pretraining data is ~110GB.

The pretraining data is divided into 4 compressed files:

  • comstock_amy2018.tar.gz: ~21GB
  • comstock_tmy3.tar.gz: ~21GB
  • resstock_amy2018.tar.gz: ~33GB
  • resstock_tmy3.tar.gz: ~33GB

and one compressed file for the metadata:

  • metadata.tar.gz

The evaluation datasets are compressed into a single file:

  • BuildingsBench.tar.gz

Download all files to a folder on a storage device with at least 250GB of free space, then decompress all of the downloaded files. Decompression creates a new subdirectory called BuildingsBench. This is the data directory, which is distinct from the GitHub code repository, although both folders are named "BuildingsBench".
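
If you prefer to script the decompression, here is a minimal Python sketch. It assumes all archives were downloaded to a single folder; data_root is a hypothetical path you should replace with your own.

import tarfile
from pathlib import Path

data_root = Path('/path/to/storage')  # hypothetical download location

# Extract each downloaded archive in place; this creates the
# BuildingsBench/ data directory described below.
for archive in sorted(data_root.glob('*.tar.gz')):
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall(path=data_root)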

Data directory organization

  • BuildingsBench/
    • Buildings-900K/end-use-load-profiles-for-us-building-stock/2021/: The Buildings-900K pretraining and validation data.
      • comstock_amy2018_release_1/
        • timeseries_individual_buildings/
          • by_puma_midwest/
            • upgrade=0/
              • puma={puma_id}/*.parquet
              • ...
          • by_puma_northeast/
          • by_puma_south/
          • by_puma_west/
        • weather/
          • amy2018/
            • {puma_id}_2018.csv
            • ...
        • metadata/
          • metadata.parquet
      • ...: Other datasets
    • BDG-2/: Building Data Genome Project 2 BuildingsBench evaluation data with outliers removed.
      • {building_id}={year}.csv: The .csv files for the BDG-2 dataset.
      • ...: Other buildings
    • ...: Other evaluation datasets (Borealis, Electricity, etc.)
    • buildingsbench_with_outliers/: The BuildingsBench evaluation data with outliers.
      • BDG-2/: Building Data Genome Project 2 BuildingsBench evaluation data with outliers.
        • {building_id}={year}.csv: The .csv files for the BDG-2 dataset.
        • ...: Other buildings
      • ...: Other evaluation datasets (Borealis, Electricity, etc.)
    • LICENSES/: Licenses for each evaluation dataset redistributed in BuildingsBench.
    • metadata/: Metadata for the evaluation suite.
      • benchmark.toml: Metadata for the benchmark. For each dataset, we specify:
        • building_type: residential or commercial.
        • latlon: a list of two floats representing the location of the building(s).
        • conus_location: The name of the county or city in the U.S. where the building is located, or a county/city in the U.S. of similar climate to the building's true location.
        • actual_location: The county/city where the building is actually located. This will differ from conus_location when the building is outside of the CONUS. These values are for bookkeeping and can be set to dummy values.
        • url: The URL where the dataset was obtained from.
      • building_years.txt: List of .csv files included in the benchmark. Each line is of the form {dataset}/{building_id}={year}.csv.
      • withheld_pumas.tsv: List of PUMAs withheld from the training/validation set of Buildings-900K, which we use as synthetic test data.
      • map_of_pumas_in_census_region*.csv: Maps PUMA IDs to their geographical centroid (lat/lon).
      • spatial_tract_lookup_table.csv: Mapping between census tract identifiers and other geographies.
      • list_oov.py: Python script to generate a list of buildings that are OOV for the Buildings-900K tokenizer.
      • oov.txt: List of buildings that are OOV for the Buildings-900K tokenizer.
      • transfer_learning_commercial_buildings.txt: List of 100 commercial buildings from the benchmark we use for evaluating transfer learning.
      • transfer_learning_residential_buildings.txt: List of 100 residential buildings from the benchmark we use for evaluating transfer learning.
      • transfer_learning_hyperparameter_tuning.txt: List of 2 held out buildings (1 commercial, 1 residential) that can be used for hyperparameter tuning.
      • train*.idx: Index files for fast dataloading of Buildings-900K (16GB uncompressed).
      • val*.idx: Index files for fast dataloading of Buildings-900K.
      • transforms/: Directory for storing data transform info.
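
As an illustration, the per-dataset entries in benchmark.toml can be inspected with Python's built-in TOML parser. This is a hedged sketch: it assumes Python 3.11+ (for tomllib) and that the working directory is the parent of the data directory.

import tomllib  # Python 3.11+; use the third-party `toml` package otherwise

with open('BuildingsBench/metadata/benchmark.toml', 'rb') as f:
    benchmark = tomllib.load(f)

# Each dataset lives under a buildings_bench.{dataset} table
# (see "Adding a new dataset" below).
for name, meta in benchmark['buildings_bench'].items():
    print(name, meta['building_type'], meta['latlon'])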

Exploring the data

See our dataset quick start Jupyter notebook.

Parquet file format

The pretraining dataset Buildings-900K is stored as a collection of PUMA-level parquet files. Each parquet file is stored in a directory named after a unique PUMA ID: puma={puma_id}/*.parquet. The first column is the timestamp, in the format YYYY-MM-DD HH:MM:SS, and each subsequent column, named by building ID, is the energy consumption in kWh for a different building in that PUMA. The parquet files are compressed with snappy. Sort by the timestamp after loading, as row order is not guaranteed.

import pyarrow.parquet as pq

puma_id = '...'  # fill in a real PUMA ID (the puma={puma_id} directory name)
bldg_id = '00001'

# Select the timestamp and one building's column, then sort chronologically.
df = pq.read_table(f'puma={puma_id}', columns=['timestamp', bldg_id]) \
       .to_pandas().sort_values(by='timestamp')
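
The accompanying weather files follow the {puma_id}_2018.csv pattern shown in the directory listing above. A hedged sketch for loading the weather series for the same PUMA; the column layout is not documented here, and the sketch assumes the first column is a timestamp.

import pandas as pd

# Weather file for this PUMA (path relative to the release directory)
weather = pd.read_csv(f'weather/amy2018/{puma_id}_2018.csv',
                      index_col=0, parse_dates=True)
print(weather.head())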

CSV file format

Most CSV files in the benchmark are named {building_id}={year}.csv and correspond to a single building's energy consumption time series. The first column is the timestamp (the Pandas index), in the format YYYY-MM-DD HH:MM:SS, and the second column is the energy consumption in kWh.

Certain datasets store multiple buildings in a single file. In this case, the first column is again the timestamp (the Pandas index), and each subsequent column, named by building ID, is the energy consumption in kWh for a different building.
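
A minimal loading sketch that covers both layouts; the path is a placeholder, so substitute a real benchmark file.

import pandas as pd

csv_path = 'BDG-2/...'  # substitute a real {building_id}={year}.csv file

# Both layouts: timestamp index, one kWh column per building.
df = pd.read_csv(csv_path, index_col=0, parse_dates=True)
series = df.iloc[:, 0]   # single-building file: the only column
# For multi-building files, select a column by building ID instead:
# series = df['some_building_id']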

Adding a new dataset

For a new CSV dataset named {dataset}:

  • Create a directory called {dataset} of CSV files with filenames {building_id}={year}.csv.
  • Add the line {dataset}/{building_id}={year}.csv for each file to the building_years.txt file.
  • Add the appropriate metadata for the dataset to benchmark.toml under the buildings_bench.{dataset} tag.
  • Add {dataset} to the benchmark registry in buildings_bench/data/__init__.py.

You can now use the provided torch and pandas dataloaders to load this dataset by its name, {dataset}.
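
As an example, the building_years.txt entries for the new dataset can be generated with a short script. This is a sketch: the dataset name and paths are hypothetical, and it assumes the working directory is the parent of the data directory.

from pathlib import Path

dataset = 'my_dataset'  # hypothetical new dataset name
dataset_dir = Path('BuildingsBench') / dataset

# One line per {building_id}={year}.csv file, appended to building_years.txt
lines = [f'{dataset}/{p.name}' for p in sorted(dataset_dir.glob('*=*.csv'))]
with open('BuildingsBench/metadata/building_years.txt', 'a') as f:
    f.write('\n'.join(lines) + '\n')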

Out-of-vocab test consumption values

Hourly consumption values > 5100 kWh are larger than the maximum value seen during pretraining on Buildings-900K. We consider these "out-of-vocab" (OOV) and remove such buildings from evaluation. This prevents errors due to extrapolation, which is not the focus of this benchmark. See list_oov.py for the code we use to generate the list of OOV buildings.
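
For reference, the check reduces to a simple threshold test. Below is a hedged sketch: the 5100 kWh threshold comes from the text above, while the function and file-loading details are illustrative; list_oov.py is the authoritative script.

import pandas as pd

OOV_THRESHOLD_KWH = 5100.0  # max hourly value seen during pretraining

def is_oov(csv_path: str) -> bool:
    """True if any hourly consumption value exceeds the pretraining max."""
    df = pd.read_csv(csv_path, index_col=0, parse_dates=True)
    return bool((df > OOV_THRESHOLD_KWH).any().any())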