1 - Solar Bounty Data Prize Data Download#

Overview#

This notebook demonstrates how to download data from the OEDI PVDAQ data archives. It focuses on the PV Solar Bounty Data Prize sites, which use a slightly different archive architecture than the rest of the PVDAQ archives. For more information on how to download the other datasets within the PVDAQ archives, see the code in the main.py module located in this same repository.

The PVDAQ Data Archives reside in an Amazon Web Services Simple Storage Service (S3) bucket, so there are modules available that provide easy access to this data through the S3 APIs. This part of the module focuses on the boto3 package for performing the download. Since these are public datasets, the method does not require passing authorization keys.
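As a quick illustration of what anonymous access looks like (a minimal sketch, assuming boto3 is already installed; the bucket name and prefix are the same ones used later in this notebook), the following lists a few object keys from the archive:

import boto3
from botocore.handlers import disable_signing

# Create an unsigned (anonymous) S3 resource -- the OEDI bucket is public, so no credentials are needed
s3 = boto3.resource("s3")
s3.meta.client.meta.events.register("choose-signer.s3.*", disable_signing)
bucket = s3.Bucket("oedi-data-lake")

# Connectivity check: print the first few object keys under the Solar Data Prize prefix
for i, obj in enumerate(bucket.objects.filter(Prefix="pvdaq/2023-solar-data-prize/")):
    print(obj.key)
    if i >= 4:
        break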

Steps:

# If running on Google Colab, execute this cell first to install the dependencies and prevent a "ModuleNotFoundError" in later cells:
!pip install pvdaq_access
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pvdaq_access in c:\users\sayala\documents\github\pvdaq_access (0+untagged.18.g44d26b7.dirty)
Requirement already satisfied: boto3 in c:\programdata\anaconda3\lib\site-packages (from pvdaq_access) (1.24.28)
Requirement already satisfied: botocore in c:\programdata\anaconda3\lib\site-packages (from pvdaq_access) (1.27.59)
Requirement already satisfied: jmespath in c:\programdata\anaconda3\lib\site-packages (from pvdaq_access) (0.10.0)
Requirement already satisfied: numpy in c:\users\sayala\appdata\roaming\python\python311\site-packages (from pvdaq_access) (1.24.4)
Requirement already satisfied: pandas in c:\users\sayala\appdata\roaming\python\python311\site-packages (from pvdaq_access) (2.1.0)
Requirement already satisfied: python-dateutil in c:\programdata\anaconda3\lib\site-packages (from pvdaq_access) (2.8.2)
Requirement already satisfied: pytz in c:\users\sayala\appdata\roaming\python\python311\site-packages (from pvdaq_access) (2023.3)
Requirement already satisfied: s3transfer in c:\programdata\anaconda3\lib\site-packages (from pvdaq_access) (0.6.0)
Requirement already satisfied: six in c:\programdata\anaconda3\lib\site-packages (from pvdaq_access) (1.16.0)
Requirement already satisfied: urllib3 in c:\programdata\anaconda3\lib\site-packages (from pvdaq_access) (1.26.16)
Requirement already satisfied: configparser in c:\users\sayala\appdata\roaming\python\python311\site-packages (from pvdaq_access) (6.0.0)
Requirement already satisfied: requests in c:\users\sayala\appdata\roaming\python\python311\site-packages (from pvdaq_access) (2.31.0)
Requirement already satisfied: tzdata>=2022.1 in c:\users\sayala\appdata\roaming\python\python311\site-packages (from pandas->pvdaq_access) (2023.3)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\sayala\appdata\roaming\python\python311\site-packages (from requests->pvdaq_access) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in c:\programdata\anaconda3\lib\site-packages (from requests->pvdaq_access) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in c:\programdata\anaconda3\lib\site-packages (from requests->pvdaq_access) (2023.7.22)
import os
import boto3
import botocore
from botocore.handlers import disable_signing

1. Setup#

The first step is to indicate which system you wish to access and where you want the data to be stored. The pvdaq_access module can take these as parameters, but here we will query the user directly. For a list of the prize sites, please follow the link PV Solar Data Prize Sites.

System ID for the Webinar: We will be using system 2107.
site = input("Which Solar Bounty Data Prize Site do you wish to access? Enter its unique ID number: ")
path = ''
if site:
    path = input("Where do you want to download the data to? Enter full path: ")
    if path:
        if os.path.isdir(path):
            print ("Site " + site + " time-series data to be downloaded to " + path)
        else:
            raise OSError('Path ' + path + " does not exist. Please add or change the path and restart.")
---------------------------------------------------------------------------
StdinNotImplementedError                  Traceback (most recent call last)
Cell In[3], line 1
----> 1 site = input("Which Solar Bounty Data Prize Site do you wish to access? Enter thier unique ID number: ")
      2 path = ''
      3 if site:

File C:\ProgramData\anaconda3\Lib\site-packages\ipykernel\kernelbase.py:1172, in Kernel.raw_input(self, prompt)
   1165 """Forward raw_input to frontends
   1166 
   1167 Raises
   1168 ------
   1169 StdinNotImplementedError if active frontend doesn't support stdin.
   1170 """
   1171 if not self._allow_stdin:
-> 1172     raise StdinNotImplementedError(
   1173         "raw_input was called, but this frontend does not support input requests."
   1174     )
   1175 return self._input_request(
   1176     str(prompt),
   1177     self._parent_ident["shell"],
   1178     self.get_parent("shell"),
   1179     password=False,
   1180 )

StdinNotImplementedError: raw_input was called, but this frontend does not support input requests.
if site == '':
    site = '2107'
    
if path == '':
    path = os.getcwd()
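If the notebook runs in an environment that does not support stdin (as shown by the StdinNotImplementedError above), the interactive prompts can be skipped by assigning the parameters directly. A minimal non-interactive sketch, assuming site 2107 and the current working directory as the destination:

import os

# Non-interactive alternative to the input() prompts above
site = '2107'          # unique ID of the Solar Bounty Data Prize site
path = os.getcwd()     # download destination; any existing directory works

if not os.path.isdir(path):
    raise OSError('Path ' + path + ' does not exist. Please create it and rerun.')
print('Site ' + site + ' time-series data to be downloaded to ' + path)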

2. Downloading the Data#

This next section connects to the OEDI S3 resource for the PV Solar Bounty Data Prize sites and pulls all of the site's data down to your indicated location.

s3 = boto3.resource("s3")
s3.meta.client.meta.events.register("choose-signer.s3.*", disable_signing)
bucket = s3.Bucket("oedi-data-lake")
print ("Beginning download process ")

# Find each target file in the bucket
target_dir = site + '_OEDI'
prefix = "pvdaq/2023-solar-data-prize/" + target_dir + "/data/"
objects = bucket.objects.filter(Prefix=prefix)

for obj in objects:
    if obj.key == prefix:
        continue            
    try:
        bucket.download_file(obj.key, os.path.join(path, os.path.basename(obj.key)).replace("\\", "/"))
    except botocore.exceptions.ClientError as e:
        print ('ERROR: Boto3 exception ' + str(e))
    else:
        print ('File ' + os.path.join(path, os.path.basename(obj.key)) + " downloaded successfully.")
Beginning download process 
File C:\Users\sayala\Documents\GitHub\pvdaq_access\tutorials\tutorials\2107_electrical_data.csv downloaded successfully.
File C:\Users\sayala\Documents\GitHub\pvdaq_access\tutorials\tutorials\2107_environment_data.csv downloaded successfully.
File C:\Users\sayala\Documents\GitHub\pvdaq_access\tutorials\tutorials\2107_irradiance_data.csv downloaded successfully.
File C:\Users\sayala\Documents\GitHub\pvdaq_access\tutorials\tutorials\2107_meter_15m_data.csv downloaded successfully.

3. Load and plot the file#

For this check, select the irradiance data file, 2107_irradiance_data.csv.
print ("File download results")
files = os.listdir(path)
for file in files:
    print(file)

which_file = input("Which file from your download would you like to check? ")
File download results
.ipynb_checkpoints
1 - Downloading Data, and doing a Quality Assessment.py
1 - Solar Bounty Data Prize Data Download.html
1 - Solar Bounty Data Prize Data Download.ipynb
1 - Solar Bounty Data Prize Data Download.py
2 - Download using pvdaq_access.html
2 - Download using pvdaq_access.ipynb
2 - Download using pvdaq_access.py
2107_electrical_data.csv
2107_environment_data.csv
2107_irradiance_data.csv
2107_meter_15m_data.csv
SolarBountyDataPrize_DATA
Which file from your download would you like to check? 
if which_file == '':
    which_file = '2107_irradiance_data.csv'

Parse file into dataframe and examine info#

import pandas as pd
#Read in file
try: 
    df = pd.read_csv(os.path.join(path, which_file))
except FileNotFoundError:
    print("File not found.")
except pd.errors.EmptyDataError:
    print("No data")
except pd.errors.ParserError:
    print("Parse error")
else:
    df.set_index('measured_on', inplace=True)
    #extract file info
    df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 531019 entries, 2017-11-01 07:10:00 to 2023-11-01 23:55:00
Data columns (total 1 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   poa_irradiance_o_149574  531019 non-null  float64
dtypes: float64(1)
memory usage: 8.1+ MB
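Note that the measured_on timestamps were read as plain strings, which is why the summary above shows a generic Index rather than a DatetimeIndex. If you plan to slice by date or resample, one option (a sketch, not part of the original notebook) is to parse the timestamps while reading the file:

import os
import pandas as pd

# Re-read the file with the timestamp column parsed into a DatetimeIndex
df = pd.read_csv(os.path.join(path, which_file),
                 index_col='measured_on', parse_dates=['measured_on'])
print(df.index.dtype)   # expected: datetime64[ns]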

Perform a quick plot to examine data#

import matplotlib.pyplot as plt

# Set the columns to plot
plot_cols = ['poa_irradiance_o_149574']

# Plot the time series
axes = df[plot_cols].plot(marker='.', alpha=0.5, figsize=(11, 9))
# Rotate the x-axis tick labels
plt.xticks(rotation=45)
# Add labels and title, then show the plot
plt.xlabel('measured_on')
plt.ylabel('poa_irradiance_o_149574')
plt.title('Time Series data check')
plt.show()
Figure: time-series plot of poa_irradiance_o_149574 versus measured_on for site 2107.
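With roughly 531,000 rows, the individual markers overlap heavily. To see the long-term trend more clearly, a sketch that resamples to daily means before plotting (this assumes the index was parsed to a DatetimeIndex as described above):

import matplotlib.pyplot as plt

# Resample to daily means to reduce overplotting (requires a DatetimeIndex)
daily = df['poa_irradiance_o_149574'].resample('D').mean()

daily.plot(figsize=(11, 4))
plt.xlabel('measured_on')
plt.ylabel('daily mean poa_irradiance_o_149574')
plt.title('Daily mean POA irradiance, site 2107')
plt.show()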