Scaling up solar energy production quickly, over the next 20-50 years, is key to ending our current reliance on climate-damaging fossil fuels. However, solar energy is variable due to clouds and weather.
Electric utility companies need accurate forecasts of solar energy availability in order to plan the correct mix of fossil fuels and renewable energy to be used on any given day.
Errors in forecasting solar energy availability could lead to large expenses in extra fossil fuel consumption or emergency purchases of electricity from neighboring utilities.
Machine learning has the potential to find complex, statistically significant patterns that link numerical weather forecasts to solar energy generation. These trained models can then be used to make accurate predictions for solar power generation at a given solar energy generation site.
For this project, I used 18 years of 3-hourly numerical weather forecasts (500,000+ examples, 7920 raw features) covering 1994-2012 from the NOAA/ESRL Global Ensemble Forecast System, obtained from the Kaggle AMS 2013-2014 Solar Energy Prediction Contest.
These forecasts are from 11 different global weather models for 12, 15, 18, 21 and 24 hours ahead at 144 different latitude/longitude locations across Oklahoma.
I used these forecasts to predict the total solar energy integrated from sunrise to sunset, as measured by 98 Oklahoma Mesonet sites spaced across Oklahoma. The actual total solar energy was measured directly by pyranometers at each site at a 5-minute cadence from 1994-2012.
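To make "total integrated solar energy" concrete, here is a minimal sketch of integrating 5-minute pyranometer readings over a day with the trapezoid rule. The function name, the sample readings, and the units (W/m^2 in, J/m^2 out) are assumptions for illustration, not taken from the project code or the Mesonet data files.

```python
from datetime import datetime, timedelta


def integrate_daily_energy(samples):
    """Integrate irradiance samples over time with the trapezoid rule.

    `samples` is a list of (datetime, irradiance in W/m^2) pairs sorted by time.
    Returns the integrated energy in J/m^2.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        dt_seconds = (t1 - t0).total_seconds()
        total += 0.5 * (p0 + p1) * dt_seconds
    return total


# Hypothetical readings: a constant 600 W/m^2 for one hour at a 5-minute cadence.
start = datetime(2000, 6, 1, 12, 0)
readings = [(start + timedelta(minutes=5 * i), 600.0) for i in range(13)]
energy = integrate_daily_energy(readings)
print(energy)  # 600 W/m^2 * 3600 s = 2160000.0 J/m^2
```

In practice the list would hold every reading between sunrise and sunset for one site and one day.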
I trained several hundred gradient-boosted decision tree models with XGBoost, using weather forecasts spatially averaged over 100x100 km areas, to predict the actual integrated total solar energy.
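The spatial averaging step can be sketched roughly as follows: pick the forecast grid points nearest to a Mesonet site and average their values, either plainly or weighted by distance. Everything here is an assumption for illustration: the function names, the great-circle distance, and in particular the inverse-distance weights, since this document only says the weighted average depends on distance.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def spatial_average(station, grid, num_close=4, method="avg"):
    """Average forecast values over the `num_close` grid points nearest a station.

    `station` is (lat, lon); `grid` is a list of (lat, lon, value) tuples.
    method="avg" is a plain mean; method="wavg" uses inverse-distance weights
    (an assumed weighting, not necessarily the one used by the project).
    """
    if not grid:
        raise ValueError("grid must contain at least one point")
    ranked = sorted(
        (haversine_km(station[0], station[1], lat, lon), value)
        for lat, lon, value in grid
    )
    nearest = ranked[:num_close]
    if method == "avg":
        return sum(value for _, value in nearest) / len(nearest)
    if method == "wavg":
        # Clamp tiny distances so a grid point on top of the station does not divide by zero.
        weights = [1.0 / max(dist, 1e-6) for dist, _ in nearest]
        return sum(w * value for w, (_, value) in zip(weights, nearest)) / sum(weights)
    raise ValueError(f"unknown method: {method}")


# Hypothetical station surrounded by four grid points.
station = (35.0, -97.0)
grid = [(34.0, -98.0, 10.0), (34.0, -96.0, 20.0), (36.0, -98.0, 30.0), (36.0, -96.0, 40.0)]
print(spatial_average(station, grid, method="avg"))
print(spatial_average(station, grid, method="wavg"))
```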
I ran XGBoost to find numerical patterns that link weather forecasts to the actual integrated solar energy measured at each Oklahoma Mesonet site. On a portion of the data not used in model training, I was able to predict the available solar energy 24 hours ahead to within 5.8% for the vast majority of days.
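As a rough illustration of evaluating on data not used in training, the sketch below holds out the most recent rows and reports the mean absolute error as a percentage of the mean measured value. Both the chronological split and this particular error metric are assumptions; the document does not state how the held-out portion was chosen or how the percentage was computed.

```python
def chronological_split(rows, holdout_fraction=0.2):
    """Split time-ordered rows into (train, holdout), keeping the latest rows for holdout."""
    num_holdout = int(round(len(rows) * holdout_fraction))
    cut = len(rows) - num_holdout
    return rows[:cut], rows[cut:]


def relative_mae_percent(actual, predicted):
    """Mean absolute error expressed as a percentage of the mean absolute measured value."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    abs_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return 100.0 * abs_error / sum(abs(a) for a in actual)


# Hypothetical measured and predicted daily energies (arbitrary units).
measured = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 300.0, 380.0]
print(relative_mae_percent(measured, forecast))  # 4.0

train_rows, holdout_rows = chronological_split(list(range(10)))
print(len(train_rows), len(holdout_rows))  # 8 2
```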
Numerical weather forecasts with grid spacings of 10+ kilometers lack the resolution necessary to predict the locations of clouds directly. Clouds are the main source of uncertainty in solar energy generation. Thus, more precise satellite data would be necessary to increase predictive power.
An XGBoost decision-tree model trained to predict solar energy using numerical weather predictions can be easily implemented in real time for a given solar energy generation site. The model could be periodically updated with a longer time baseline and site-specific weather information.
Solar energy prediction models can also be useful for rooftop solar in conjunction with batteries. A virtual power plant controlling a network of charged batteries linked to multiple rooftop solar sites would likely find solar energy prediction useful for maximizing revenue from the sale of electricity to the grid.
- Download the weather forecast files gefs_test.tar.gz (or gefs_test.zip) and gefs_train.tar.gz (or gefs_train.zip) from the Kaggle competition data webpage.
- Move the files to the Data/ directory.
  - If you downloaded the .tar.gz files:
      mv gefs_train.tar.gz Data/
      mv gefs_test.tar.gz Data/
  - Or if you downloaded the .zip files:
      mv gefs_train.zip Data/
      mv gefs_test.zip Data/
- Move into the Data directory:
    cd Data/
- Open the .tar.gz or .zip files to make the Data/train/ and Data/test/ directories.
  - If you downloaded the .tar.gz files, on OSX you can run:
      tar -xzvf gefs_train.tar.gz
      tar -xzvf gefs_test.tar.gz
  - Or if you downloaded the .zip files:
      unzip gefs_train.zip
      unzip gefs_test.zip
  There should now be Data/train/ and Data/test/ directories with multiple *.nc files such as test/apcp_sfc_latlon_subset_20080101_20121130.nc, test/dlwrf_sfc_latlon_subset_20080101_20121130.nc, etc. Each file gives weather features described on the Kaggle data webpage.
- Install the netCDF4 module.
- Switch back to the PredictingSolarEnergy/ directory:
    cd ..
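Before training, it can help to confirm that the setup steps above worked. The following is a small sketch, run from the PredictingSolarEnergy/ directory, that counts the extracted *.nc files and checks whether the netCDF4 module can be found. The helper names are hypothetical and not part of the project code.

```python
import importlib.util
from pathlib import Path


def count_nc_files(data_dir="Data"):
    """Return how many *.nc files sit in the train/ and test/ subdirectories."""
    root = Path(data_dir)
    return {sub: len(list((root / sub).glob("*.nc"))) for sub in ("train", "test")}


def module_available(name):
    """Return True if a top-level module with this name can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# A count of 0 means the archives were not extracted where expected.
print(count_nc_files("Data"))
print("netCDF4 available:", module_available("netCDF4"))
```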
First, run train_solar_predict.py to assemble, feature-engineer and normalize the raw features for one of the 11 different global forecast weather models, and to train XGBoost models on them. Second, run ensemble.py to run XGBoost a second time, combining multiple XGBoost models generated by train_solar_predict.py.
train_solar_predict.py is designed for a lot of experimentation to find the optimal amount of spatial averaging and feature engineering of the weather forecast grid points, as well as for hyperparameter tuning of XGBoost. From the PredictingSolarEnergy/ directory, run the code as:
python Code/train_solar_predict.py --outdir OUTDIR --modelnum MODELNUM --numclosegrid NUM --debug DEBUG --method METH --numrandstate NUMRAND --tag TAG
- OUTDIR is the name of the directory for the output files.
- MODELNUM is the global weather forecast model to use (an integer 0-10).
- NUM is the number of grid points over which to spatially average a global weather forecast model. Set to 7 for best results.
- METH is a string specifying the type of spatial averaging to perform:
  - avg for a straightforward spatial average of the forecast models.
  - use4 for no averaging. XGBoost will determine how best to use the different weather forecasts from different latitudes and longitudes.
  - wavg for using a spatial average weighted by the distance from each weather model grid point to each Mesonet weather station in Oklahoma.
- NUMRAND is the integer number of times to run XGBoost at different random states.
- TAG is a string to tag the output files with.
- DEBUG is for debugging, always set to 0 (1 for debug).
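Since train_solar_predict.py handles one global forecast model per run, covering all 11 models means 11 invocations. A minimal sketch of generating those command lines follows; the flags are the ones listed above, while the helper name, the output directory naming scheme, and the default values other than NUM=7 are assumptions chosen for illustration.

```python
import shlex


def build_train_commands(outdir_prefix, tag, num_close_grid=7, method="avg",
                         num_rand_state=1, debug=0):
    """Build one train_solar_predict.py command per forecast model (MODELNUM 0-10)."""
    commands = []
    for model_num in range(11):
        commands.append([
            "python", "Code/train_solar_predict.py",
            "--outdir", f"{outdir_prefix}{model_num}",  # hypothetical naming scheme
            "--modelnum", str(model_num),
            "--numclosegrid", str(num_close_grid),
            "--debug", str(debug),
            "--method", method,
            "--numrandstate", str(num_rand_state),
            "--tag", tag,
        ])
    return commands


# Print the commands instead of running them, so they can be reviewed first.
for command in build_train_commands("out_model", "run1"):
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or each argument list can be passed to subprocess.run from the PredictingSolarEnergy/ directory.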
ensemble.py is designed to aggregate the models from train_solar_predict.py and fit an XGBoost model to make the final predictions to be submitted to Kaggle. From the PredictingSolarEnergy/ directory, run the code as:
python Code/ensemble.py --indirtag INDIRTAG --outdir OUTDIR --tag TAG
- INDIRTAG is the name of the output directories from train_solar_predict.py with XGBoost models that fit the raw features.
- OUTDIR is the name of the directory to place the output files.
- TAG is the string to tag the output files with.
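The idea of a second-stage model that combines first-stage models is often called stacking: each first-stage model's predictions become one input column for the final model. The sketch below only shows how such an input table could be assembled. It is an illustration under assumptions, since this document does not describe how ensemble.py actually arranges its inputs, and the function and model names are hypothetical.

```python
def stack_predictions(per_model_predictions):
    """Turn {model_name: [predictions...]} into (column_names, rows).

    Each row holds the predictions of every first-stage model for one example,
    in a fixed column order, ready to be used as features by a second-stage model.
    """
    names = sorted(per_model_predictions)
    lengths = {len(per_model_predictions[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("need at least one model, and all models must predict the same examples")
    columns = (per_model_predictions[name] for name in names)
    rows = [list(values) for values in zip(*columns)]
    return names, rows


# Hypothetical predictions from two first-stage models for two examples.
names, rows = stack_predictions({"model1": [1.5, 2.5], "model0": [1.0, 2.0]})
print(names)  # ['model0', 'model1']
print(rows)   # [[1.0, 1.5], [2.0, 2.5]]
```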