Data Preprocessing

The classes documented here process raw data into CELAVI-ready input datasets. Depending on the data source, methods in the ComputeLocations class may need to be edited, or new methods added, to handle the format of the raw data.

Data Manager

The Data class and its child classes define the data format (column names) and data types for each CELAVI input dataset. Backfilling can be enabled or disabled for each dataset type.

Created on Fri Jan 22 08:05:13 2021

Uses code from Feedstock Production Emissions to Air Model (FPEAM) Copyright (c) 2018 Alliance for Sustainable Energy, LLC; Noah Fisher. Builds on functionality in the FPEAM’s Data.py. Unmodified FPEAM code is available at https://github.com/NREL/fpeam.

@author: aeberle

class celavi.data_manager.Data(df=None, fpath=None, columns=None, backfill=True)

Data representation. Specific datasets are created as child classes with defined column names, data types, and backfilling values. Creating child classes removes the need to define column names, data types, and backfilling values each time a class is called to read data from a file.

__init__(df=None, fpath=None, columns=None, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value
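The parent/child pattern described above can be sketched as follows. This is a simplified, standalone illustration and not the CELAVI implementation: SketchData and SketchNodeLocations are hypothetical names, and rows are held as plain dictionaries rather than in a pandas DataFrame.

```python
class SketchData:
    """Hypothetical stand-in for a parent data class (not celavi.data_manager.Data)."""

    # Child classes override this with their own {column name: type} mapping.
    COLUMNS = {}

    def __init__(self, rows=None, columns=None):
        # Fall back to the class-level column definition when none is passed.
        self.columns = columns if columns is not None else self.COLUMNS
        self.rows = [self._cast(row) for row in (rows or [])]

    def _cast(self, row):
        # Keep only the defined columns and coerce each value to its type.
        return {name: kind(row[name]) for name, kind in self.columns.items()}


class SketchNodeLocations(SketchData):
    """Child class whose columns mirror those listed for TransportationNodeLocations."""

    COLUMNS = {"node_id": int, "lat": float, "long": float}


# The caller does not need to repeat the column definitions.
nodes = SketchNodeLocations(rows=[{"node_id": "7", "lat": "39.74", "long": "-105.17"}])
print(nodes.rows)
```

Because the column mapping lives on the child class, code that reads a node locations file only has to supply the data itself.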

load(fpath, columns, memory_map=True, header=0, **kwargs)

Load data from a text file at <fpath>. Check and set column names.

See pandas.read_table() help for additional arguments.

Parameters
  • fpath ([string]) – file path to CSV file or SQLite database file

  • columns ([dict]) – {name: type, …}

  • memory_map ([bool]) – load directly to memory for improved performance

  • header ([int]) – 0-based row index containing column names

Returns

Return type

DataFrame
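A minimal sketch of the "check and set column names" step, assuming comma-delimited text and a {name: type} columns mapping. It uses the standard-library csv module instead of pandas so that it runs on its own; load_rows is a hypothetical helper, not part of CELAVI.

```python
import csv
import io


def load_rows(handle, columns, header=0):
    """Read delimited text and cast each column to its declared type.

    `columns` is a {name: type} mapping; `header` is the 0-based index of
    the row holding the column names.
    """
    lines = handle.read().splitlines()[header:]
    reader = csv.DictReader(lines)
    # Fail early if the file does not contain every expected column.
    missing = set(columns) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [{name: kind(row[name]) for name, kind in columns.items()} for row in reader]


# Made-up node location data for illustration only.
raw = io.StringIO("node_id,lat,long\n1,40.0,-105.0\n2,41.5,-104.2\n")
rows = load_rows(raw, {"node_id": int, "lat": float, "long": float})
print(rows)
```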

backfill(column, value=0)

Replace NaNs in <column> with <value>.

Parameters
  • column ([string]) – Name of column with NaNs to be backfilled

  • value ([any]) – Value for backfill

Returns

DataFrame with [column] backfilled with [value]

Return type

DataFrame
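The replacement of missing values can be illustrated without pandas. The sketch below is an assumption-laden stand-in for the method above: it treats both None and float NaN as missing and works on a list of row dictionaries.

```python
import math


def backfill(rows, column, value=0):
    """Return a copy of `rows` with NaN/None entries in `column` replaced by `value`."""
    filled = []
    for row in rows:
        current = row[column]
        is_missing = current is None or (
            isinstance(current, float) and math.isnan(current)
        )
        filled.append({**row, column: value if is_missing else current})
    return filled


# Made-up example: the second row has no installation year.
rows = [{"p_year": 2015.0}, {"p_year": float("nan")}]
filled = backfill(rows, "p_year", value=0)
print(filled)
```

The input rows are left unchanged; only the returned copy carries the backfilled values.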

validate()

Check that data are not empty.

Return False if empty and True otherwise.

Returns

Return type

Boolean flag

class celavi.data_manager.TransportationGraph(df=None, fpath=None, columns={'countyfp': <class 'str'>, 'edge_id': <class 'int'>, 'fclass': <class 'int'>, 'statefp': <class 'str'>, 'u_of_edge': <class 'int'>, 'v_of_edge': <class 'int'>, 'weight': <class 'float'>}, backfill=True)

Read in and process the underlying transportation network for the Router module.

__init__(df=None, fpath=None, columns={'countyfp': <class 'str'>, 'edge_id': <class 'int'>, 'fclass': <class 'int'>, 'statefp': <class 'str'>, 'u_of_edge': <class 'int'>, 'v_of_edge': <class 'int'>, 'weight': <class 'float'>}, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value

class celavi.data_manager.TransportationNodeLocations(df=None, fpath=None, columns={'lat': <class 'float'>, 'long': <class 'float'>, 'node_id': <class 'int'>}, backfill=True)

Read in and process the node locations in the transportation graph.

__init__(df=None, fpath=None, columns={'lat': <class 'float'>, 'long': <class 'float'>, 'node_id': <class 'int'>}, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value

class celavi.data_manager.Locations(df=None, fpath=None, columns={'facility_id': <class 'int'>, 'facility_type': <class 'str'>, 'lat': <class 'float'>, 'long': <class 'float'>, 'region_id_1': <class 'str'>, 'region_id_2': <class 'str'>, 'region_id_3': <class 'str'>, 'region_id_4': <class 'str'>}, backfill=True)

Read in and process raw location datasets for facilities other than power plants.

__init__(df=None, fpath=None, columns={'facility_id': <class 'int'>, 'facility_type': <class 'str'>, 'lat': <class 'float'>, 'long': <class 'float'>, 'region_id_1': <class 'str'>, 'region_id_2': <class 'str'>, 'region_id_3': <class 'str'>, 'region_id_4': <class 'str'>}, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value

class celavi.data_manager.TechUnitLocations(df=None, fpath=None, columns={'eia_id': <class 'float'>, 'p_cap': <class 'float'>, 'p_name': <class 'str'>, 'p_tnum': <class 'float'>, 'p_year': <class 'float'>, 't_cap': <class 'float'>, 't_county': <class 'str'>, 't_fips': <class 'int'>, 't_model': <class 'str'>, 't_state': <class 'str'>, 'xlong': <class 'float'>, 'ylat': <class 'float'>}, backfill=True)

Read in and process raw power plant locations dataset.

Dataset is downloadable at https://eerscmap.usgs.gov/uswtdb/

No manual changes are needed to the raw dataset before it is processed.

__init__(df=None, fpath=None, columns={'eia_id': <class 'float'>, 'p_cap': <class 'float'>, 'p_name': <class 'str'>, 'p_tnum': <class 'float'>, 'p_year': <class 'float'>, 't_cap': <class 'float'>, 't_county': <class 'str'>, 't_fips': <class 'int'>, 't_model': <class 'str'>, 't_state': <class 'str'>, 'xlong': <class 'float'>, 'ylat': <class 'float'>}, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value

class celavi.data_manager.OtherFacilityLocations(df=None, fpath=None, columns={'facility_id': <class 'int'>, 'facility_type': <class 'str'>, 'lat': <class 'float'>, 'long': <class 'float'>, 'region_id_1': <class 'str'>, 'region_id_2': <class 'str'>, 'region_id_3': <class 'str'>, 'region_id_4': <class 'str'>}, backfill=True)

Read in and process additional, miscellaneous facility location datasets.

__init__(df=None, fpath=None, columns={'facility_id': <class 'int'>, 'facility_type': <class 'str'>, 'lat': <class 'float'>, 'long': <class 'float'>, 'region_id_1': <class 'str'>, 'region_id_2': <class 'str'>, 'region_id_3': <class 'str'>, 'region_id_4': <class 'str'>}, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value

class celavi.data_manager.LandfillLocations(df=None, fpath=None, columns={'City': <class 'str'>, 'County': <class 'str'>, 'Current Landfill Status': <class 'str'>, 'Landfill Closure Year': <class 'str'>, 'Landfill ID': <class 'int'>, 'Latitude': <class 'float'>, 'Longitude': <class 'float'>, 'State': <class 'str'>}, backfill=True)

Read in and process raw landfill facility locations dataset from the U.S. EPA’s LMOP database at https://www.epa.gov/lmop.

__init__(df=None, fpath=None, columns={'City': <class 'str'>, 'County': <class 'str'>, 'Current Landfill Status': <class 'str'>, 'Landfill Closure Year': <class 'str'>, 'Landfill ID': <class 'int'>, 'Latitude': <class 'float'>, 'Longitude': <class 'float'>, 'State': <class 'str'>}, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value

class celavi.data_manager.StandardScenarios(df=None, fpath=None, columns={'state': <class 'str'>, 't': <class 'int'>, 'wind-ons_MW': <class 'float'>}, backfill=True)

Read in and process Standard Scenarios electricity grid mix datasets, viewable and downloadable at https://cambium.nrel.gov/?project=c3fec8d8-6243-4a8a-9bff-66af71889958 . More information on the Standard Scenarios project is available from https://www.nrel.gov/analysis/standard-scenarios.html .

This class is set up to use the annual, state-level datasets, with file names that end in: “_annual_state.csv”. Using any other type of dataset will produce an error.

USER NOTE: Delete the first line of the raw Standard Scenarios file before reading in to CELAVI.
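The first line can also be removed programmatically. The helper below is a hypothetical convenience, not part of CELAVI; the file names and contents in the demonstration are made up.

```python
import os
import tempfile


def drop_first_line(src_path, dst_path):
    """Copy `src_path` to `dst_path`, skipping the first line."""
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        next(src, None)  # discard the first line, if there is one
        for line in src:
            dst.write(line)


# Demonstration with a made-up two-column file; real files have more columns.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "example_annual_state.csv")
clean_path = os.path.join(workdir, "example_annual_state_clean.csv")
with open(raw_path, "w", encoding="utf-8") as f:
    f.write("descriptive line to remove\nstate,t\nCO,2022\n")

drop_first_line(raw_path, clean_path)
with open(clean_path, encoding="utf-8") as f:
    cleaned = f.read()
print(cleaned)
```

Writing to a separate output file keeps the raw download intact in case the step needs to be repeated.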

__init__(df=None, fpath=None, columns={'state': <class 'str'>, 't': <class 'int'>, 'wind-ons_MW': <class 'float'>}, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value

class celavi.data_manager.RoutePairs(df=None, fpath=None, columns={'destination_facility_type': <class 'str'>, 'in_state_only': <class 'bool'>, 'source_facility_type': <class 'str'>, 'vkmt_max': <class 'float'>}, backfill=True)

Read in and process the dataset defining allowable facility pairs for the Router.

__init__(df=None, fpath=None, columns={'destination_facility_type': <class 'str'>, 'in_state_only': <class 'bool'>, 'source_facility_type': <class 'str'>, 'vkmt_max': <class 'float'>}, backfill=True)
Parameters
  • df – Initial data frame

  • fpath – Filepath location of data to be read in

  • columns – List of columns to backfill

  • backfill – Boolean flag: perform backfilling with datatype-specific value

Compute Locations

Methods in this class perform merging and filtering operations to generate one dataset of all supply chain facility locations and types, and a second dataset of technology unit installations over time.

class celavi.compute_locations.ComputeLocations(start_year: int, power_plant_locations, landfill_locations, other_facility_locations, transportation_graph, node_locations, lookup_facility_type, technology_data_filename, standard_scenarios_filename)

The ComputeLocations class performs preprocessing that ingests raw facility location data and returns a formatted, aggregated dataset of all facilities involved in a circular supply chain.

The specific preprocessing steps will need to be adjusted for different raw data sources to allow for variations in column names, data types, and the number of files ingested. The output format of the processed dataset must remain as it is defined in this code for the remainder of the CELAVI code base to function.

__init__(start_year: int, power_plant_locations, landfill_locations, other_facility_locations, transportation_graph, node_locations, lookup_facility_type, technology_data_filename, standard_scenarios_filename)
Parameters
  • start_year (int) – Calendar year when the simulation begins.

  • power_plant_locations – Data set of renewable energy power plant locations including lat, long, region identifier columns, and other power-plant-specific information as needed.

  • landfill_locations – Data set of landfill locations.

  • other_facility_locations – Data set of all other facility types involved in the case study.

  • transportation_graph – Transportation network data.

  • node_locations – Node locations within transportation graph.

  • lookup_facility_type – File defining the allowable set of facility types in the case study.

  • technology_data_filename – File with technology-specific installation data, including year installed, location (lat, long, and region identifiers), and any technology-specific data required to model material flows into power plants.

  • standard_scenarios_filename – File of output from NREL’s ReEDS model, used for capacity expansion projections past the current year.

wind_power_plant()

Ingests raw data from the U.S. Wind Turbine Database (USWTDB), filters it down to the contiguous U.S., and creates a data frame that can be combined with other sets of facility location data. The number_of_technology_units file is also created from this dataset.

See the TechUnitLocations child class of the Data class for column names and data types, and for where to download the USWTDB.

Returns

wind_plant_locations – Dataset of power plant locations including unique facility ID, the facility type identifier, a lat/long pair, and four generic region identifiers (country, state, county, etc.)

Columns:
  • facility_id : int

  • facility_type : str

  • lat : float

  • long : float

  • region_id_1 : str

  • region_id_2 : str

  • region_id_3 : str

  • region_id_4 : str

Return type

pd.DataFrame

landfill()

Processes raw data from U.S. EPA Landfill Methane Outreach Program (LMOP) to create a dataset of landfill locations in the contiguous U.S. See the LandfillLocations child class of the Data class for column names and download location.

Returns

landfill_locations_no_nulls – Dataset of landfill locations including unique facility ID, the facility type identifier, a lat/long pair, and four generic region identifiers (country, state, county, etc.)

Columns:
  • facility_id : int

  • facility_type : str

  • lat : float

  • long : float

  • region_id_1 : str

  • region_id_2 : str

  • region_id_3 : str

  • region_id_4 : str

Return type

pd.DataFrame

other_facility()

Process additional facility data that does not already have a dataset-specific method. See the OtherFacilityLocations child class of the Data class for column names.

If using this method, the location dataset being read in must already be in the same format as the Return value.

Returns

facility_locations – Dataset of generic facility locations including unique facility ID, the facility type identifier, a lat/long pair, and four generic region identifiers (country, state, county, etc.)

Columns:
  • facility_id : int

  • facility_type : str

  • lat : float

  • long : float

  • region_id_1 : str

  • region_id_2 : str

  • region_id_3 : str

  • region_id_4 : str

Return type

pd.DataFrame

capacity_projections()

Use NREL’s Standard Scenarios for electricity grid mix projections to calculate future technology unit installations. See the StandardScenarios child class of the Data parent class for additional information on obtaining and formatting the input dataset.

Future technology unit installations do not have locations defined in the input datasets. Locations for these installations are calculated by U.S. state (region_id_2) by averaging the locations of previously installed technology units.
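The state-level averaging can be sketched as below. This assumes a simple unweighted mean of lat and long per region_id_2; whether CELAVI weights the average (for example by capacity) is not stated here, and mean_location_by_state is a hypothetical helper.

```python
from collections import defaultdict


def mean_location_by_state(units):
    """Average lat/long of existing units, grouped by region_id_2 (state)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # [lat total, long total, count]
    for unit in units:
        acc = sums[unit["region_id_2"]]
        acc[0] += unit["lat"]
        acc[1] += unit["long"]
        acc[2] += 1
    return {state: (lat / n, lon / n) for state, (lat, lon, n) in sums.items()}


# Made-up coordinates for illustration only.
existing = [
    {"region_id_2": "CO", "lat": 40.0, "long": -104.0},
    {"region_id_2": "CO", "lat": 38.0, "long": -106.0},
    {"region_id_2": "IA", "lat": 42.0, "long": -93.0},
]
centroids = mean_location_by_state(existing)
print(centroids)
```

A future installation in a given state would then be assigned that state's averaged coordinates.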

join_facilities(locations_output_file)

Call other ComputeLocations methods to process raw locations datasets into a single facility location dataset. Creates the number_of_technology_units dataset from historical installation data and capacity expansion projection data.

Parameters

locations_output_file (str) – Path where the processed and aggregated locations dataset is saved
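A rough sketch of the aggregation step, assuming each processed dataset already uses the eight output columns listed for the methods above. The helper names and the example rows (including the facility type labels) are invented for illustration; the real method works with pandas data frames and writes to locations_output_file.

```python
import csv
import io

FACILITY_COLUMNS = [
    "facility_id", "facility_type", "lat", "long",
    "region_id_1", "region_id_2", "region_id_3", "region_id_4",
]


def join_facility_rows(*datasets):
    """Concatenate facility datasets that already share the output columns."""
    combined = []
    for dataset in datasets:
        for row in dataset:
            # Keep only the shared columns, in a fixed order.
            combined.append({name: row[name] for name in FACILITY_COLUMNS})
    return combined


def write_rows(rows, handle):
    """Write the combined rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FACILITY_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Invented example rows.
power_plants = [{"facility_id": 1, "facility_type": "power plant", "lat": 40.0,
                 "long": -105.0, "region_id_1": "USA", "region_id_2": "CO",
                 "region_id_3": "Weld", "region_id_4": ""}]
landfills = [{"facility_id": 2, "facility_type": "landfill", "lat": 41.6,
              "long": -93.6, "region_id_1": "USA", "region_id_2": "IA",
              "region_id_3": "Polk", "region_id_4": ""}]

combined = join_facility_rows(power_plants, landfills)
output = io.StringIO()
write_rows(combined, output)
print(output.getvalue())
```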

Data Filtering

The Data Filtering methods produce subsets of the facility locations, technology unit installation, and routes datasets (all outputs of preprocessing methods documented here) by U.S. state (region_id_2) and/or by the distance between facility pairs.

celavi.data_filtering.filter_locations(loc_filename, tech_units_filename, states)

This function filters the facility and technology unit location datasets down to the list of states provided in the scenario-specific config file.

Data is not returned; the filtered datasets overwrite the original datasets (CSV files).

Parameters
  • loc_filename (str) – Path to the unfiltered computed locations dataset.

  • tech_units_filename (str) – Path to the unfiltered number of technology units dataset.

  • states (list) – List of states to include in the filtered datasets.
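The row selection itself can be sketched as follows. The sketch assumes states are identified in the region_id_2 column and that the list uses the same identifiers as the dataset (two-letter abbreviations here, which is an assumption). It works in memory; the documented function reads and overwrites CSV files instead.

```python
def filter_by_state(rows, states, state_column="region_id_2"):
    """Keep only rows whose state identifier appears in `states`."""
    keep = set(states)
    return [row for row in rows if row[state_column] in keep]


# Made-up facility rows.
locations = [
    {"facility_id": 1, "region_id_2": "CO"},
    {"facility_id": 2, "region_id_2": "IA"},
    {"facility_id": 3, "region_id_2": "TX"},
]
filtered = filter_by_state(locations, ["CO", "IA"])
print(filtered)
```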

celavi.data_filtering.filter_routes(filtered_locations_filename, routes_filename)

This function filters the routes file so that only routes whose source and destination are both within the states specified in the scenario config file are kept. Routes that originate or terminate outside the specified states are removed from the dataset.

The filtered routes dataset overwrites the original routes dataset (CSV file).

Parameters
  • filtered_locations_filename (str) – Path to the locations dataset that was previously filtered by state using the filter_locations method.

  • routes_filename (str) – Path to the unfiltered routes dataset. Either the computed routes dataset provided by the Router or a custom routes dataset may be used.
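One way to express this rule is to keep a route only when both of its endpoints appear in the filtered locations dataset. The column names source_facility_id and destination_facility_id are hypothetical, chosen for illustration; the actual routes file may label its endpoints differently.

```python
def filter_routes_sketch(routes, filtered_locations):
    """Keep routes whose source and destination are both in the filtered locations."""
    kept_ids = {loc["facility_id"] for loc in filtered_locations}
    return [
        route for route in routes
        if route["source_facility_id"] in kept_ids
        and route["destination_facility_id"] in kept_ids
    ]


# Made-up data: facility 3 was removed by the state filter.
filtered_locations = [{"facility_id": 1}, {"facility_id": 2}]
routes = [
    {"source_facility_id": 1, "destination_facility_id": 2, "vkmt": 120.0},
    {"source_facility_id": 1, "destination_facility_id": 3, "vkmt": 900.0},
    {"source_facility_id": 3, "destination_facility_id": 2, "vkmt": 450.0},
]
kept = filter_routes_sketch(routes, filtered_locations)
print(kept)
```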

Router

Router methods implement the Dijkstra minimum-distance algorithm to find on-road transportation routes between supply chain facility pairs.

Created on Fri Jan 22 08:03:13 2021

routing.py uses Dijkstra’s algorithm to compute distances between vertices on a network.

Developed using code from Feedstock Production Emissions to Air Model (FPEAM) Copyright (c) 2018 Alliance for Sustainable Energy, LLC; Noah Fisher. Builds on functionality in the FPEAM’s Router.py and Data.py. Unmodified FPEAM code is available at https://github.com/NREL/fpeam.

@author: aeberle

class celavi.routing.Router(edges, node_map, memory=None, algorithm=<function bidirectional_dijkstra>)

Calculate minimum-distance routes between supply chain facilities.

__init__(edges, node_map, memory=None, algorithm=<function bidirectional_dijkstra>)
Parameters
  • edges ([DataFrame]) – DataFrame of edges within the routing (transportation) network.

  • node_map ([DataFrame]) – DataFrame of nodes within the routing (transportation) network.

  • memory ([joblib.Memory]) – Allows for caching.

  • algorithm ([function]) – Method for finding minimum-distance route. Defaults to bidirectional_dijkstra.

get_route(start, end)

Find the route from <start> to <end>, if one exists.

Parameters
  • start ([list] [long, lat]) – Starting point of route: a node in node_map.

  • end ([list] [long, lat]) – Ending point of route: a node in node_map.

Returns

Length and characteristics of route from <start> to <end>.

Columns:
  • region_transportation : str

  • fclass : int

  • vkmt : float

Return type

[DataFrame]
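For readers unfamiliar with the underlying search, the sketch below runs a plain (single-direction) Dijkstra over an edge list that uses the u_of_edge, v_of_edge, and weight columns of the transportation graph. It is not the Router implementation: the default algorithm named in the signature is bidirectional_dijkstra, the network is assumed here to be undirected, and start and end are given as node IDs rather than [long, lat] pairs.

```python
import heapq
from collections import defaultdict


def shortest_path(edges, source, target):
    """Plain Dijkstra over an undirected edge list; returns (length, node path)."""
    graph = defaultdict(list)
    for edge in edges:
        graph[edge["u_of_edge"]].append((edge["v_of_edge"], edge["weight"]))
        graph[edge["v_of_edge"]].append((edge["u_of_edge"], edge["weight"]))

    best = {source: 0.0}   # shortest known distance to each node
    previous = {}          # predecessor of each node on its best path
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node]:
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))

    if target not in best:
        return None  # no route exists
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


# Made-up three-node network.
edges = [
    {"u_of_edge": 1, "v_of_edge": 2, "weight": 4.0},
    {"u_of_edge": 2, "v_of_edge": 3, "weight": 1.0},
    {"u_of_edge": 1, "v_of_edge": 3, "weight": 7.0},
]
result = shortest_path(edges, 1, 3)
print(result)
```

The bidirectional variant searches from both endpoints at once and typically explores fewer nodes on large road networks, but it returns the same minimum distance.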

static get_all_routes(locations_file, route_pair_file, distance_filtering, transportation_graph, node_locations, routes_output_file, routing_output_folder)

Calculate distances traveled between all connected supply chain facilities.

Includes distance traveled through each transportation region (e.g., county FIPS) and road class.

This method has no return value. Routes are saved to a CSV file.

Parameters
  • locations_file (str) – File containing processed supply chain facility locations.

  • route_pair_file (str) – File defining the allowable facility pairs to connect with routes.

  • distance_filtering (str) – File defining distance-based route filtering by facility pair.

  • transportation_graph (str) – File of transportation network data.

  • node_locations (str) – File of node locations within transportation graph.

  • routes_output_file (str) – Path to file where complete routes dataset is saved.

  • routing_output_folder (str) – Path to directory for intermediate routing outputs.
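The per-region, per-road-class breakdown can be illustrated by summing edge lengths along a route. The sketch assumes that each edge's weight is a distance that contributes to vkmt and that the transportation region is the edge's countyfp value; both are assumptions made for illustration, and summarize_route is a hypothetical helper.

```python
from collections import defaultdict


def summarize_route(path_edges):
    """Sum edge weights along a route by (transportation region, road class)."""
    totals = defaultdict(float)
    for edge in path_edges:
        totals[(edge["countyfp"], edge["fclass"])] += edge["weight"]
    return [
        {"region_transportation": region, "fclass": fclass, "vkmt": vkmt}
        for (region, fclass), vkmt in sorted(totals.items())
    ]


# Made-up edges along one route.
path_edges = [
    {"countyfp": "08059", "fclass": 1, "weight": 2.5},
    {"countyfp": "08059", "fclass": 1, "weight": 1.5},
    {"countyfp": "08001", "fclass": 3, "weight": 4.0},
]
summary = summarize_route(path_edges)
print(summary)
```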