buildings_bench.data
Functions and class definitions for loading Torch and Pandas datasets.
Main entry points for loading PyTorch and Pandas datasets:
- load_pretraining() (used for pretraining)
- load_torch_dataset() (used for benchmark tasks)
- load_pandas_dataset() (used for benchmark tasks)
Available PyTorch Datasets:
- Buildings900K (used for pretraining)
- TorchBuildingDataset (used for benchmark tasks)
- PandasTransformerDataset (used for benchmark tasks)
load_pretraining
buildings_bench.data.load_pretraining(name: str, num_buildings_ablation: int = -1, apply_scaler_transform: str = '', scaler_transform_path: Path = None, weather_inputs: List[str] = None, custom_idx_filename: str = '', context_len: int = 168, pred_len: int = 24) -> torch.utils.data.Dataset
Pre-training datasets: buildings-900k-train, buildings-900k-val
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the dataset to load. | required |
num_buildings_ablation | int | Number of buildings to use for pre-training. If -1, use all buildings. | -1 |
apply_scaler_transform | str | If not using quantized or unscaled loads, applies a {boxcox,standard} scaling transform to the load. | '' |
scaler_transform_path | Path | Path to data for the transform, e.g., pickled data for the BoxCox transform. | None |
weather_inputs | List[str] | List of weather feature names to use as additional inputs. | None |
custom_idx_filename | str | Customized index filename. | '' |
context_len | int | Length of the context. | 168 |
pred_len | int | Length of the prediction horizon. | 24 |
Returns:
Type | Description |
---|---|
torch.utils.data.Dataset | Dataset for pretraining. |
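A hedged usage sketch: the returned torch Dataset is typically wrapped in a DataLoader. The call is guarded because it needs torch plus the downloaded Buildings-900K data; the dataset name 'buildings-900k-train' comes from the list above, and wiring `init_fp()` / `collate_fn()` into the DataLoader follows the Buildings900K notes further down. The batch size and worker count are arbitrary illustration values.

```python
# Sketch only: falls back to loader = None when torch, buildings_bench,
# or the Buildings-900K data is unavailable in this environment.
context_len, pred_len = 168, 24
window = context_len + pred_len  # each pretraining sample spans 192 hours

try:
    import torch
    from buildings_bench.data import load_pretraining

    dataset = load_pretraining(
        'buildings-900k-train',
        context_len=context_len,
        pred_len=pred_len,
    )
    loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=64,
        num_workers=4,
        # each worker opens its own index-file pointer (see init_fp())
        worker_init_fn=lambda _: dataset.init_fp(),
        collate_fn=dataset.collate_fn(),
    )
except Exception:  # package or data not available
    loader = None
```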
load_torch_dataset
buildings_bench.data.load_torch_dataset(name: str, dataset_path: Path = None, apply_scaler_transform: str = '', scaler_transform_path: Path = None, weather_inputs: List[str] = None, include_outliers: bool = False, context_len: int = 168, pred_len: int = 24) -> Union[TorchBuildingDatasetsFromCSV, TorchBuildingDatasetFromParquet]
Load datasets by name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the dataset to load. | required |
dataset_path | Path | Path to the benchmark data. Optional. | None |
apply_scaler_transform | str | If not using quantized or unscaled loads, applies a {boxcox,standard} scaling transform to the load. | '' |
scaler_transform_path | Path | Path to data for the transform, e.g., pickled data for the BoxCox transform. | None |
weather_inputs | List[str] | List of weather feature names to use as additional inputs. | None |
include_outliers | bool | Use the version of BuildingsBench with outliers. | False |
context_len | int | Length of the context. | 168 |
pred_len | int | Length of the prediction horizon. | 24 |
Returns:
Name | Type | Description |
---|---|---|
dataset | Union[TorchBuildingDatasetsFromCSV, TorchBuildingDatasetFromParquet] | Dataset for benchmarking. |
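The returned object is iterable, yielding (building_id, dataset) pairs (see the `__iter__` methods below). A hedged sketch of that contract: `'<benchmark-name>'` is a placeholder, not a real dataset name, and the call is guarded since it needs the downloaded benchmark data; the stub in the fallback branch mimics the same iteration contract.

```python
# summarize() only relies on the documented contract: iterating the
# return value yields (building_id, dataset) pairs.
def summarize(building_datasets):
    return {bid: len(ds) for bid, ds in building_datasets}

try:
    from buildings_bench.data import load_torch_dataset
    sizes = summarize(load_torch_dataset('<benchmark-name>'))
except Exception:  # package or data not available
    sizes = summarize([("bldg_1", [0, 1, 2])])  # stub with the same contract
```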
load_pandas_dataset
buildings_bench.data.load_pandas_dataset(name: str, dataset_path: Path = None, feature_set: str = 'engineered', weather_inputs: List[str] = None, apply_scaler_transform: str = '', scaler_transform_path: Path = None, include_outliers: bool = False) -> PandasBuildingDatasetsFromCSV
Load datasets by name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the dataset to load. | required |
dataset_path | Path | Path to the benchmark data. Optional. | None |
feature_set | str | Feature set to use. | 'engineered' |
weather_inputs | List[str] | List of weather feature names to use as additional inputs. | None |
apply_scaler_transform | str | If not using quantized or unscaled loads, applies a {boxcox,standard} scaling transform to the load. | '' |
scaler_transform_path | Path | Path to data for the transform, e.g., pickled data for the BoxCox transform. | None |
include_outliers | bool | Use the version of BuildingsBench with outliers. | False |
Returns:
Name | Type | Description |
---|---|---|
dataset | PandasBuildingDatasetsFromCSV | Generator of Pandas datasets for benchmarking. |
The Buildings-900K PyTorch Dataset
buildings_bench.data.buildings900K.Buildings900K
Bases: torch.utils.data.Dataset
This is an indexed dataset for the Buildings-900K dataset. It uses an index file to quickly load a sub-sequence from a time series in a multi-building Parquet file. The index file is tab-separated with the following columns:
- Building-type-and-year (e.g., comstock_tmy3_release_1)
- Census region (e.g., by_puma_midwest)
- PUMA ID
- Building ID
- Hour of year pointer (e.g., 0070)
The sequence pointer is used to extract the slice [pointer - context length : pointer + pred length] for a given building ID.
The time series are not stored chronologically and must be sorted by timestamp after loading.
Each dataloader worker has its own file pointer to the index file. This is to avoid weird multiprocessing errors from sharing a file pointer. We 'seek' to the correct line in the index file for random access.
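The seek-based random access described above can be sketched with plain Python. This assumes fixed-width index lines, so that row `i` starts at byte `i * width`; the column values and the width of 80 bytes are illustrative, not the library's actual layout.

```python
import os
import tempfile

# Write a tiny fixed-width, tab-separated index file mirroring the
# columns listed above (building-type-and-year, census region, PUMA ID,
# building ID, hour-of-year pointer).
rows = [
    ("comstock_tmy3_release_1", "by_puma_midwest", "puma_01", "bldg_42", "0070"),
    ("resstock_tmy3_release_1", "by_puma_south", "puma_07", "bldg_07", "0123"),
]
width = 80  # hypothetical fixed byte width per line

path = os.path.join(tempfile.mkdtemp(), "index.idx")
with open(path, "wb") as f:
    for row in rows:
        f.write(("\t".join(row).ljust(width - 1) + "\n").encode("utf-8"))

# Random access: open in binary mode with buffering disabled and seek
# straight to the line we want instead of scanning the file.
with open(path, "rb", buffering=0) as fp:
    fp.seek(1 * width)  # jump to the second row
    fields = fp.read(width).decode("utf-8").rstrip().split("\t")
```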
__init__(dataset_path: Path, index_file: str, context_len: int = 168, pred_len: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, weather_inputs: List[str] = None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_path | Path | Path to the pretraining dataset. | required |
index_file | str | Name of the index file. | required |
context_len | int | Length of the context. The index file has to be generated with the same context length. | 168 |
pred_len | int | Length of the prediction horizon. The index file has to be generated with the same prediction length. | 24 |
apply_scaler_transform | str | Apply a scaler transform to the load. | '' |
scaler_transform_path | Path | Path to the scaler transform. | None |
weather_inputs | List[str] | List of weather features to use. | None |
init_fp()
Each worker needs to open its own file pointer to avoid weird multiprocessing errors from sharing a file pointer.
This is not called in the main process. This is called in the DataLoader worker_init_fn. The file is opened in binary mode which lets us disable buffering.
__read_index_file(index_file: Path) -> None
Extract metadata from index file.
collate_fn()
Returns a function taking only one argument (the list of items to be batched).
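The pattern of returning a one-argument batching function (as torch's `DataLoader(collate_fn=...)` expects) can be sketched without torch. `make_collate_fn` is a hypothetical stand-in, not the library's implementation:

```python
# A collate factory: configuration is captured in the closure, and the
# returned function takes only the list of items to be batched.
def make_collate_fn(keys):
    def collate(items):
        # items: list of per-sample dicts -> dict of lists, one per key
        return {k: [item[k] for item in items] for k in keys}
    return collate

collate = make_collate_fn(["load"])
batch = collate([{"load": [1.0, 2.0]}, {"load": [3.0, 4.0]}])
```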
TorchBuildingDataset
buildings_bench.data.datasets.TorchBuildingDataset
Bases: torch.utils.data.Dataset
PyTorch Dataset for a single building's energy timeseries (a Pandas DataFrame) with a timestamp index and a 'power' column.
Used to iterate over mini-batches of 192-hour timeseries (168 hours of context, 24 hours prediction horizon).
__init__(dataframe: pd.DataFrame, building_latlon: List[float], building_type: BuildingTypes, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, is_leap_year: bool = False, weather_dataframe: pd.DataFrame = None, weather_transform_path: Path = None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataframe | pd.DataFrame | Pandas DataFrame with a timestamp index and a 'power' column. | required |
building_latlon | List[float] | Latitude and longitude of the building. | required |
building_type | BuildingTypes | Building type for the dataset. | required |
context_len | int | Length of the context. | 168 |
pred_len | int | Length of the prediction horizon. | 24 |
sliding_window | int | Stride for the sliding window used to split the timeseries into test samples. | 24 |
apply_scaler_transform | str | Apply a {boxcox,standard} scaler transform to the load. | '' |
scaler_transform_path | Path | Path to the pickled data for the BoxCox transform. | None |
is_leap_year | bool | Whether the year is a leap year. | False |
weather_dataframe | pd.DataFrame | Weather timeseries data. | None |
weather_transform_path | Path | Path to the pickled data for the weather transform. | None |
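The context/prediction windowing described above can be sketched on a toy hourly series: the stride is `sliding_window`, and each sample spans `context_len + pred_len` points. Plain lists are used here instead of the dataset class.

```python
# Two weeks of hourly values (336 points), split into overlapping
# (context, horizon) pairs with a 24-hour stride.
context_len, pred_len, sliding_window = 168, 24, 24
series = list(range(24 * 14))

window = context_len + pred_len  # 192 hours per sample
samples = [
    (series[i : i + context_len], series[i + context_len : i + window])
    for i in range(0, len(series) - window + 1, sliding_window)
]
# number of samples = (336 - 192) // 24 + 1 = 7
```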
PandasTransformerDataset
buildings_bench.data.datasets.PandasTransformerDataset
Bases: torch.utils.data.Dataset
Create a Torch Dataset from a Pandas DataFrame.
Used to iterate over mini-batches of, e.g., 192-hour timeseries (168 hours of context + 24-hour prediction horizon).
__init__(df: pd.DataFrame, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, weather_inputs: List[str] = None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | pd.DataFrame | Pandas DataFrame with columns: load, latitude, longitude, hour of day, day of week, day of year, building type. | required |
context_len | int | Length of the context. | 168 |
pred_len | int | Length of the prediction sequence for the forecasting model. | 24 |
sliding_window | int | Stride for the sliding window used to split the timeseries into test samples. | 24 |
weather_inputs | List[str] | List of weather feature names to use as additional inputs. The df is assumed to already contain these as columns. | None |
TorchBuildingDatasetFromParquet
buildings_bench.data.datasets.TorchBuildingDatasetFromParquet
Generate PyTorch Datasets out of EULP parquet files.
Each file has multiple buildings (with same Lat/Lon and building type) and each building is a column. All time series are for the same year.
Attributes:
Name | Type | Description |
---|---|---|
building_datasets | dict | Maps unique building ids to a TorchBuildingDataset. |
__init__(data_path: Path, parquet_datasets: List[str], building_latlons: List[List[float]], building_types: List[BuildingTypes], weather_inputs: List[str] = None, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_path | Path | Path to the dataset. | required |
parquet_datasets | List[str] | List of paths to Parquet files, each with a timestamp index and multiple columns, one per building. | required |
building_latlons | List[List[float]] | List of latlons, one per Parquet file. | required |
building_types | List[BuildingTypes] | List of building types, one per Parquet file. | required |
weather_inputs | List[str] | List of weather feature names to use as additional inputs. | None |
context_len | int | Length of the context. | 168 |
pred_len | int | Length of the prediction horizon. | 24 |
sliding_window | int | Stride for the sliding window used to split the timeseries into test samples. | 24 |
apply_scaler_transform | str | Apply a {boxcox,standard} scaler transform to the load. | '' |
scaler_transform_path | Path | Path to the pickled data for the BoxCox transform. | None |
leap_years | List[int] | List of leap years. | None |
__iter__() -> Iterator[Tuple[str, TorchBuildingDataset]]
Generator to iterate over the building datasets.
Yields:
Type | Description |
---|---|
Iterator[Tuple[str, TorchBuildingDataset]] | A pair of building id and TorchBuildingDataset. |
TorchBuildingDatasetsFromCSV
buildings_bench.data.datasets.TorchBuildingDatasetsFromCSV
Generate PyTorch Datasets from a list of CSV files.
Attributes:
Name | Type | Description |
---|---|---|
building_datasets | dict | Maps unique building ids to a list of tuples (year, TorchBuildingDataset). |
__init__(data_path: Path, building_year_files: List[str], building_latlon: List[float], building_type: BuildingTypes, weather_inputs: List[str] = None, context_len: int = 168, pred_len: int = 24, sliding_window: int = 24, apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_path | Path | Path to the dataset. | required |
building_year_files | List[str] | List of paths to CSV files, each with a timestamp index and multiple columns, one per building. | required |
building_latlon | List[float] | Latitude and longitude of the building. | required |
building_type | BuildingTypes | Building type for the dataset. | required |
weather_inputs | List[str] | List of weather feature names to use as additional inputs. | None |
context_len | int | Length of the context. | 168 |
pred_len | int | Length of the prediction sequence for the forecasting model. | 24 |
sliding_window | int | Stride for the sliding window used to split the timeseries into test samples. | 24 |
apply_scaler_transform | str | Apply a {boxcox,standard} scaler transform to the load. | '' |
scaler_transform_path | Path | Path to the pickled data for the BoxCox transform. | None |
leap_years | List[int] | List of leap years. | None |
__iter__() -> Iterator[Tuple[str, torch.utils.data.ConcatDataset]]
A Generator for TorchBuildingDataset objects.
Yields:
Type | Description |
---|---|
Iterator[Tuple[str, torch.utils.data.ConcatDataset]] | A tuple of the building id and a ConcatDataset of the TorchBuildingDataset objects for all years. |
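The iteration pattern above (per-building concatenation of per-year datasets) can be sketched with plain lists standing in for `TorchBuildingDataset` and `ConcatDataset`; the building ids and values are illustrative.

```python
# building_datasets maps building id -> list of (year, dataset) tuples,
# mirroring the attribute documented above.
building_datasets = {
    "bldg_A": [(2017, ["a17"]), (2018, ["a18"])],
    "bldg_B": [(2018, ["b18"])],
}

def iter_buildings(datasets):
    for building_id, year_datasets in datasets.items():
        concatenated = []  # stand-in for torch.utils.data.ConcatDataset
        for _year, ds in year_datasets:
            concatenated.extend(ds)
        yield building_id, concatenated

pairs = dict(iter_buildings(building_datasets))
```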
PandasBuildingDatasetsFromCSV
buildings_bench.data.datasets.PandasBuildingDatasetsFromCSV
Generate Pandas DataFrames from a list of CSV files.
Can be used with sklearn models or tree-based models that require Pandas DataFrames. In this case, use features='engineered' to generate a dataframe with engineered features.
Creates a dictionary of building datasets from a list of CSV files. Used as a generator to iterate over Pandas DataFrames for each building; each DataFrame contains all years of data for the building.
Attributes:
Name | Type | Description |
---|---|---|
building_datasets | dict | Maps unique building ids to a list of tuples (year, DataFrame). |
__init__(data_path: Path, building_year_files: List[str], building_latlon: List[float], building_type: BuildingTypes, weather_inputs: List[str] = None, features: str = 'transformer', apply_scaler_transform: str = '', scaler_transform_path: Path = None, leap_years: List[int] = [])
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_path | Path | Path to the dataset. | required |
building_year_files | List[str] | List of paths to CSV files, each with a timestamp index and multiple columns, one per building. | required |
building_latlon | List[float] | Latitude and longitude of the building. | required |
building_type | BuildingTypes | Building type for the dataset. | required |
weather_inputs | List[str] | List of weather feature names to use as additional inputs. | None |
features | str | Type of features to use: {'transformer','engineered'}. 'transformer' features: load, latitude, longitude, hour of day, day of week, day of year, building type. 'engineered' features are an expansive list of mainly calendar-based features, useful for traditional ML models. | 'transformer' |
apply_scaler_transform | str | Apply a {boxcox,standard} scaler transform to the load. | '' |
scaler_transform_path | Path | Path to the pickled data for the BoxCox transform. | None |
leap_years | List[int] | List of leap years. | [] |
__iter__() -> Iterator[Tuple[str, pd.DataFrame]]
Generator for iterating over the dataset.
Yields:
Type | Description |
---|---|
Iterator[Tuple[str, pd.DataFrame]] | A pair of building id and Pandas DataFrame. The DataFrame has all years concatenated. |
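The 'transformer' calendar features listed above (hour of day, day of week, day of year) can be derived directly from a timestamp index with pandas. This is a hedged sketch: the column names here are illustrative, not the library's exact ones.

```python
import pandas as pd

# A toy two-day hourly load series with a timestamp index.
idx = pd.date_range("2018-01-01", periods=48, freq="h")
df = pd.DataFrame({"load": range(48)}, index=idx)

# Calendar features derived from the DatetimeIndex.
df["hour_of_day"] = df.index.hour
df["day_of_week"] = df.index.dayofweek  # Monday = 0
df["day_of_year"] = df.index.dayofyear
```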