Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove geopandas and fiona + bugfixes #833

Closed
wants to merge 4 commits into from

Conversation

JoshCu
Copy link
Contributor

@JoshCu JoshCu commented Aug 22, 2024

Geopandas

geopandas (more specifically it's fiona dependency) causes issues when wheels aren't available and it attempts to build from source on arm64/aarch64. building fiona is currently preventing NGIAB from building and has caused issues with arm NGIAB in the past.
As far as I can tell it's just used to read non-geometric data from geopackages and non-geometric data from geojson files.

#this
gdf = gpd.open_file("conus.gpkg", layer="divides")
# can be done like this
with sqlite3.connect("conus.gpkg") as conn:
    df = pd.read_sql_query("SELECT * FROM divides", conn)

The geometry is an unpleasant blob that would need to be decoded something like this, but it doesn't appear to be used anywhere so we could probably not bother select it from the geopackage in the first place to save some time and memory

Similarly, geojson is read into a normal pandas df, the geometry is less mangled than the geopackage but still not a proper geometry object. The geometry doesn't seem to be used anywhere, so I just left some comments explaining the format if anyone in the future does need to use it.

I haven't done performance tests on this, but it should also be faster than geopandas.

Some similar work

#604
#568

Removals

  • All imports and use of geopandas and fiona

Changes

  • Geopackages now open with pandas + sqlite3 and are normal dataframes
  • Geojson now opens by reading the raw json and reformatting the contents into a normal pandas dataframe

Testing & Screenshots

It worked on a small simulation generated with ngiab_data_preprocess, running inside NGIAB
image

Additional Bugfixes

In order to test this using geopackages generated by ngiab_data_preprocess (v20.1 of the hydrofabric) I had to modify the regex and pandas table merging.

Regex

The original regex would match both flowpaths and flowpath_attributes as flowpaths.
image
matched_layers: {'flowpaths': 'flowpath_attributes', 'flowpath_attributes': 'flowpath_attributes', 'lakes': None, 'nexus': 'nexus', 'network': None}
adding $ to check for the end of line fixed this.

table merging

I noticed this while debugging the regex issue; trying to merge two dataframes when either one of them didn't have the "id" column was causing errors. Now it checks to see what combination of the flowpaths and flowpath_attributes exist and merge accordingly.

@shorvath-noaa
Copy link
Contributor

shorvath-noaa commented Aug 23, 2024

The geometries are actually used in a couple places (that's why the GitHub Actions tests are failing). t-route uses the geometries to retrieve lat/lon information for nexus points (which are intended to be passed to the model engine via BMI functions at some point to then pass along to coastal models) and waterbodies (to mimic the LAKEOUT files that WRF-Hydro creates, which include waterbody lat/lon). There might be additional places where these packages are used, but I'd need to take a closer look.

So, if we want to remove geopandas and fiona, two possible options are:

  1. Find another geospatial manipulation package to do these operations.
  2. Ask if lat/lon info can be included in the hydrofabric somewhere other than the geometry columns.

@@ -29,14 +28,19 @@ def find_layer_name(layers, pattern):
if re.search(pattern, layer, re.IGNORECASE):
return layer
return None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor whitespace diff here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I must not have had black installed

crs = geometry_columns[0][1]

if has_spatial_metadata:
# select everything from the layer, + the midpoint of it's bounding box
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the midpoint of its bounding box.

@AminTorabi-NOAA
Copy link
Contributor

I test the performance, there is a good improvement using sql

@JoshCu
Copy link
Contributor Author

JoshCu commented Aug 27, 2024

I didn't want to litter the code with too many comments, so I'll leave some notes here:

Geopackages property fields in the geopackages aren't always indexed: This doesn't seem to be relevant here, but if you wanted to select a subset of the entries by id:

SELECT * FROM divides WHERE id=wb-9999

Takes ~1s on the conus gpkg. If there's an index on id, it takes <30ms

CREATE INDEX "diid" ON "divides" ( "id" ASC )

I overlooked this on my gpkg and lost an afternoon to performance testing different access types

@JoshCu
Copy link
Contributor Author

JoshCu commented Sep 3, 2024

This change was quite large and moved away from a well maintained package to hard coded operations. My main goal was to fix t-route for NGIAB and that can be achieved with two less intrusive changes #838 and #839.

@JoshCu JoshCu closed this Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants