Aggregate fleet data by subplant, not plant #395

grgmiller · 2024-11-20T19:08:58Z

Purpose

This PR is in response to #392, which explains:

If a plant has multiple generators that burn multiple fuels, all of the generation and emissions for this plant are assigned the plant primary fuel type when aggregating to the fleet level or other higher level aggregations.

For example, in 2022, the Meramec plant overall burned more natural gas than coal, although only units 1 and 2 burned natural gas, while units 3 and 4 burned coal. However, since this plant was categorized as natural gas, all of the coal emissions also got lumped in with natural gas, which can throw off the fleet totals.

Going forward, we should probably try and publish subplant-level data (at least annually) as well as perform all fleet-level aggregations based on subplant-level data, rather than aggregating the already aggregated plant-level data.
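
To illustrate the problem described in #392, here is a minimal pandas sketch with invented numbers. The plant ID, column names, and values are all hypothetical and are not taken from OGE or from the actual Meramec data. The sketch only shows how a single plant-level primary fuel hides coal emissions that a subplant-level grouping keeps separate.

```python
import pandas as pd

# Hypothetical plant with four subplants; all IDs and values are invented.
subplants = pd.DataFrame(
    {
        "plant_id_eia": [9999, 9999, 9999, 9999],
        "subplant_id": [1, 2, 3, 4],
        "subplant_primary_fuel": ["natural_gas", "natural_gas", "coal", "coal"],
        "fuel_consumed_mmbtu": [600.0, 500.0, 400.0, 300.0],
        "co2_mass_lb": [70_000.0, 58_000.0, 82_000.0, 62_000.0],
    }
)

# Plant-level approach: pick one primary fuel per plant (here, the fuel with
# the largest total fuel consumption) and apply it to every subplant.
fuel_by_plant = subplants.groupby(
    ["plant_id_eia", "subplant_primary_fuel"], as_index=False
)["fuel_consumed_mmbtu"].sum()
idx = fuel_by_plant.groupby("plant_id_eia")["fuel_consumed_mmbtu"].idxmax()
plant_fuel = fuel_by_plant.loc[idx, ["plant_id_eia", "subplant_primary_fuel"]].rename(
    columns={"subplant_primary_fuel": "plant_primary_fuel"}
)
subplants = subplants.merge(plant_fuel, on="plant_id_eia", how="left")

# Fleet totals under each approach.
fleet_from_plants = subplants.groupby("plant_primary_fuel")["co2_mass_lb"].sum()
fleet_from_subplants = subplants.groupby("subplant_primary_fuel")["co2_mass_lb"].sum()
```

In this sketch, fleet_from_plants has only a natural_gas entry that contains all of the plant's emissions, while fleet_from_subplants keeps the coal and natural gas emissions in separate fleets.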

This PR does three main things:

Exports new subplant-level, annual and monthly results as part of the results/plant_data/ outputs
Uses subplants (rather than plants) as the basis for identifying fleet composition
Updates code to ensure complete fuel mapping to all subplants (see below)

Closes CAR-4709, CAR-4708

What the code is doing

Specific changes in this PR include:

Export monthly and annual subplant-level data as part of the results. We actually use this data elsewhere so it would be useful to have this exported.
Re-arrange the order of the data pipeline so that all monthly/annual results (plant data and power sector data) are exported before turning to the hourly imputation and data exports. This makes the pipeline a bit cleaner for the historical data years as well.
Update the generated averages function. We previously exported an output table of generated averages, which represented a US-average, fleet-level generated emission rate. We use this in the pipeline as the backstop emission rate for non-US balancing areas. To make this clearer, I updated the function to export a "US" file to results/power_sector_data.
Remove the option to export shaped fleet data as part of the plant-level results. OGE originally did not export hourly data for all EIA plants. We changed this in "Add hourly data for all individual plants" (#246) but retained the option to export data the old way. This PR removes that option, so data can only be exported the new way.
Update the documentation to reflect the updated pipeline order, and update step numbers in pipeline
When calculating residual fleet profiles, aggregate CEMS data based on capacity-based, subplant primary fuels rather than fuel-based. This is likely better aligned with how the data would be reported to EIA-930.
Shaping EIA data: Shape subplant-level data instead of plant-level data.
When shaping EIA data, assign profiles based on the subplant, capacity-based primary fuel
Update and clarify add_missing_cems_profiles(). In our existing pipeline, when shaping data, we use CEMS profiles as one of the backstop hourly profiles. However, I noticed that we were using the CEMS profiles calculated for the residual profiles rather than the CEMS profiles calculated specifically for backstop purposes. This PR fixes this.

Ensuring complete subplant_primary_fuel coverage

Because this PR switches to basing all aggregations on the subplant level, testing raised a UserWarning: primary fuel category mappings were not available for all subplants. This was not previously a problem when we were using plant-level data.

Digging into this issue, I found that there were three types of issues preventing a complete mapping:

The issue: Sometimes a generator changes its plant_id_eia or generator_id over time while the EPA plant and unit identifiers remain the same. For example, when a generator is repowered, EIA assigns it a new generator ID: at plant 10298, generator "GEN1" was renamed "GT1" after a repower in 2014. In addition, generators at an existing plant sometimes switch ownership and are assigned a new plant ID. For example, at plant 1571, three generators switched to a new plant ID (65285) in 2021. Two of them retired that year, while the third remained at the new plant code through 2022 and then switched BACK to plant 1571 in 2023. The fix: I added these mappings to the epa_eia_crosswalk_manual.csv file. Because the mappings are time-dependent, I also added new columns indicating the start and end year for which each mapping is valid (if no year is specified, the mapping is assumed to be valid for all time). This also required some tweaks to the code where we convert the CEMS IDs to EIA IDs: because we only ever load a single year of CEMS data at a time, we can filter the mappings to only those that are valid for that year.
The issue: CEMS reports data for generators that are still "proposed" in EIA. Previously, when constructing the primary fuel table (which is used to assign fleet identities to subplants), we were only determining primary fuels for generators that were currently operational according to EIA. The fix: We now add proposed generators that are in the late stage of their development (currently under construction) to the primary fuel table. Because these generators do not yet have any reported fuel data in EIA, the fuel type is determined by the capacity-based fuel reported in EIA-860.
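
The year filter described in the first item above could look roughly like the following. This is a sketch under assumptions, not the code in this PR: the column names start_year and end_year, the helper name filter_crosswalk_to_year, the unit and generator IDs for plant 1571, and the exact year boundaries are all hypothetical.

```python
import pandas as pd

NAN = float("nan")  # a blank start or end year means "no bound"

# Illustrative rows only; the IDs for plant 1571 and the year boundaries are
# assumptions about how the examples above might be encoded.
crosswalk = pd.DataFrame(
    {
        "plant_id_epa": [10298, 10298, 1571],
        "emissions_unit_id_epa": ["U1", "U1", "U3"],
        "plant_id_eia": [10298, 10298, 65285],
        "generator_id": ["GEN1", "GT1", "G3"],
        "start_year": [NAN, 2014, 2021],
        "end_year": [2013, NAN, 2022],
    }
)


def filter_crosswalk_to_year(crosswalk: pd.DataFrame, year: int) -> pd.DataFrame:
    """Keep only the manual mappings that are valid for a single data year."""
    starts_ok = crosswalk["start_year"].isna() | (crosswalk["start_year"] <= year)
    ends_ok = crosswalk["end_year"].isna() | (crosswalk["end_year"] >= year)
    return crosswalk[starts_ok & ends_ok]


valid_2022 = filter_crosswalk_to_year(crosswalk, 2022)
valid_2023 = filter_crosswalk_to_year(crosswalk, 2023)
```

With these rows, the 2022 filter keeps the mapping to plant 65285, while the 2023 filter drops it, so the unit would fall back to whatever mapping applies without the manual override.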

This also revealed a bug in the gross-to-net (GTN) generation conversion, where retired generators that reported 0 generation to CEMS were getting net generation applied incorrectly. I fixed this by dropping these subplants from the GTN conversion calculation so that they fall back to the default backstop ratio, which should still result in zero net generation.
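
The reasoning behind the GTN fix can be checked with a small sketch. Everything here is hypothetical (the column names, the ratio values, and the DEFAULT_GTN_RATIO constant), and OGE's actual GTN calculation is more involved. The point is only that a subplant with zero gross generation still ends up with zero net generation once it falls back to a default ratio.

```python
import pandas as pd

DEFAULT_GTN_RATIO = 0.97  # hypothetical backstop ratio, for illustration only

cems = pd.DataFrame(
    {
        "subplant_id": [1, 2],
        # subplant 2 stands in for a retired unit reporting zero generation
        "gross_generation_mwh": [1000.0, 0.0],
    }
)

# Hypothetical subplant-specific ratios. Subplant 2 is deliberately absent,
# mirroring the idea of dropping zero-generation subplants from the calculation.
gtn_ratios = pd.DataFrame({"subplant_id": [1], "gtn_ratio": [0.95]})

# Subplants without a specific ratio fall back to the default ratio.
cems = cems.merge(gtn_ratios, on="subplant_id", how="left")
cems["gtn_ratio"] = cems["gtn_ratio"].fillna(DEFAULT_GTN_RATIO)
cems["net_generation_mwh"] = cems["gross_generation_mwh"] * cems["gtn_ratio"]
```

Because any ratio multiplied by zero gross generation is zero, the dropped subplant gets zero net generation regardless of the value chosen for the default ratio.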

Impacts

These changes would likely have wide ranging impacts on the results:

Aggregating fleets by subplants will change the fleet average emission rates
Aggregating fleets by subplant may also change the profiles used to shape hourly data
Changing some of the backstop profiles used for shaping may also change hourly profiles for EIA data

Testing

Have not yet tested:

Run the pipeline for 2022 without errors
Compare new outputs to old outputs

Where to look

Suggested order to review files:

data_pipeline
data_cleaning
helpers
impute_hourly_profiles
output_data
validation, column checks, consumed

Usage Example/Visuals

N/A

Review estimate

30-45 minutes (happy to walk through any of this)

Future work

As I was working on this, I noticed several opportunities to reorganize which modules certain functions live in, or to create new modules to shorten some of our existing ones. I didn't implement this here because it would make the line-item changes to existing functions harder to review. In the future, we could:

Create an "aggregation" module that is separate from outputs and hourly imputation
Move the "calculate and export hourly plant data" to a different module?

Checklist

Update the documentation to reflect changes made in this PR
Format all updated python files using black
Clear outputs from all notebooks modified
Add docstrings and type hints to any new functions created

grgmiller · 2024-11-23T22:07:40Z

src/oge/output_data.py

@@ -287,85 +287,59 @@ def output_data_quality_metrics(
        )


-def output_plant_data(
-    df: pd.DataFrame,
+def write_plant_data_to_results(


Updated to specifically output monthly and annual plant-level data, or monthly and annual subplant-level data

grgmiller · 2024-11-23T22:08:07Z

src/oge/output_data.py

@@ -398,10 +372,10 @@ def convert_results(df: pd.DataFrame) -> pd.DataFrame:
    return converted


-def write_generated_averages(
+def write_national_fleet_averages(


Updated to make this clearer and to write the "US" file to results rather than outputs

grgmiller · 2024-11-23T22:08:38Z

src/oge/column_checks.py

@@ -372,6 +330,36 @@
    },
 }

+DATA_COLUMNS = [


Moved from data_cleaning

grgmiller · 2024-11-23T22:14:55Z

src/oge/load_data.py

@@ -143,7 +143,11 @@ def load_cems_ids() -> pd.DataFrame:
            filters=[["year", "==", year]],
            columns=["plant_id_epa", "plant_id_eia", "emissions_unit_id_epa"],
        ).drop_duplicates()
+        cems_id_year = apply_dtypes(cems_id_year)
+        # update the plant_id_eia column using manual matches
+        cems_id_year = update_epa_to_eia_map(cems_id_year, year)


This was moved up to apply dtypes and EIA mappings on a year-by-year basis, rather than at the end.

grgmiller · 2024-11-23T23:02:51Z

notebooks/manual_data/zip_data.ipynb

I added a utility function to remove certain files locally that I never use (like all of the metric unit files) to save space on my computer.

src/oge/data_cleaning.py

src/oge/helpers.py

src/oge/impute_hourly_profiles.py

src/oge/output_data.py

src/oge/reference_tables/epa_eia_crosswalk_manual.csv

src/oge/load_data.py

rouille

Looks good

grgmiller added 10 commits November 19, 2024 16:50

output subplant data

13fdfe3

export fleet data

f23b908

aggregate cems based on capacity fuels

516fd11

shape EIA based on subplant primary fuel

69e117e

clean up cems backstop profiles

e7d3ee0

update step numbers and docs

f9df46c

add US fleet average results

66cac73

add notebook to remove unnecessary files

2e5fe51

move function to fix circular import

a5c646a

add new pudl columns to dtypes

2f0fb07

grgmiller requested a review from rouille November 22, 2024 23:12

fix formatting

1040577

grgmiller marked this pull request as ready for review November 22, 2024 23:13

grgmiller linked an issue Nov 22, 2024 that may be closed by this pull request

Aggregate data to fuel types using subplant-level rather than plant-level data #392

Open

grgmiller added 5 commits November 22, 2024 16:26

add more helpful warning for missing fleet keys

d1349ff

fix missing primary fuels for CEMS

768be7d

fix issue with UC generators not being added

0c2a89c

add years to epa-eia crosswalk

6c892f7

fix bugs

55d749f

grgmiller added 2 commits November 23, 2024 14:57

update broken import

c27fd00

update notebooks

9da20da

fix bug with gtn

8c3ab72