Merge pull request #23 from PermafrostDiscoveryGateway/develop
Develop
julietcohen authored May 12, 2023
2 parents 4f31e95 + 5456628 commit 05b50ba
Showing 7 changed files with 529 additions and 267 deletions.
13 changes: 10 additions & 3 deletions README.md
Expand Up @@ -31,13 +31,14 @@ stager.stage('path/to/input/file.shp')

## Vector file staging for the PDG tiling pipeline

This repository contains code that prepares vector data (e.g. shapefiles) for subsequent steps in the [PDG](https://permafrost.arcticdata.io/) tiling pipeline (such as [viz-3dtiles](https://github.com/PermafrostDiscoveryGateway/viz-3dtiles) and [viz-raster](https://github.com/PermafrostDiscoveryGateway/viz-raster)). The staging step creates output vector files that conform to a specified [OGC Two Dimensional Tile Matrix Set](http://docs.opengeospatial.org/is/17-083r2/17-083r2.html) ("TMS"). Specifically, for each input file, the staging process:
This repository contains code that prepares vector data (e.g. shapefiles, geopackages) for subsequent steps in the [PDG](https://permafrost.arcticdata.io/) tiling pipeline (such as [viz-3dtiles](https://github.com/PermafrostDiscoveryGateway/viz-3dtiles) and [viz-raster](https://github.com/PermafrostDiscoveryGateway/viz-raster)). The staging step creates output vector files that conform to a specified [OGC Two Dimensional Tile Matrix Set](http://docs.opengeospatial.org/is/17-083r2/17-083r2.html) ("TMS"). Specifically, for each input file, the staging process:

1. Simplifies polygons and re-projects them to the Coordinate Reference System ("CRS") used by the desired TMS.
2. Assigns area, centroid, and other properties to each polygon.
3. Saves polygons to one file for each tile in the specified level of the TMS.
3. Identifies duplicate polygons in the tiles.
4. Saves polygons to one file for each tile in the specified level of the TMS.

Polygons are assigned to a tile file if the polygon is within the tile or if it intersects with the bounding box of the tile (i.e. if it is at least *partially* within that tile). This means that polygons that fall within two or more tiles will be duplicated in the output. (This allows subsequent rasterization steps to measure the area of polygons that are only partially within the tile - otherwise some area is lost.)
Polygons are assigned to a tile file if the polygon is within the tile or if it intersects with the bounding box of the tile (i.e. if it is at least *partially* within that tile). This means that polygons that fall within two or more tiles will be duplicated in the output. (This allows subsequent rasterization steps to measure the area of polygons that are only partially within the tile - otherwise some area is lost). The duplicated polygons are labeled as such so they can be removed during staging or a later step in the PDG visualization pipeline. The step at which these polygons are removed is determined by the configuration file.

However, polygon-tile relationships are also identified using the centroid of each polygon: The `centroid_tile` property assigned to polygons identifies the tile within which the polygon's centroid falls. (In the rare event that a polygon's centroid falls exactly on a tile boundary, the polygon will be added to the southern/eastern tile.)
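As a rough illustration of the two polygon-tile relationships described above, here is a minimal sketch that uses only the Python standard library. It is not the package's implementation: each polygon is approximated by its bounding box, and the grid constants (`TILE_SIZE`, `ORIGIN_X`, `ORIGIN_Y`) and all function names are hypothetical. Tile rows are assumed to increase southward from a top-left origin, so flooring places a centroid that lies exactly on a boundary in the southern/eastern tile.

```python
import math

# Hypothetical grid: square tiles, origin at the top-left corner,
# columns increasing eastward and rows increasing southward.
TILE_SIZE = 1.0
ORIGIN_X, ORIGIN_Y = 0.0, 4.0


def centroid_tile(cx, cy):
    """Return the (col, row) of the tile that contains the centroid.

    Flooring sends a centroid lying exactly on a tile boundary to the
    eastern (larger col) or southern (larger row) tile.
    """
    col = math.floor((cx - ORIGIN_X) / TILE_SIZE)
    row = math.floor((ORIGIN_Y - cy) / TILE_SIZE)
    return col, row


def tiles_for_bbox(minx, miny, maxx, maxy):
    """Return every (col, row) tile that the bounding box overlaps.

    Assumption: the polygon is approximated by its bounding box. Boxes whose
    edges lie exactly on a tile boundary are not treated specially here.
    """
    col_min = math.floor((minx - ORIGIN_X) / TILE_SIZE)
    col_max = math.floor((maxx - ORIGIN_X) / TILE_SIZE)
    row_min = math.floor((ORIGIN_Y - maxy) / TILE_SIZE)
    row_max = math.floor((ORIGIN_Y - miny) / TILE_SIZE)
    return [
        (col, row)
        for row in range(row_min, row_max + 1)
        for col in range(col_min, col_max + 1)
    ]


def stage_polygon(bbox, centroid):
    """Emit one record per overlapped tile, so the polygon is duplicated."""
    home = centroid_tile(*centroid)
    return [
        {"tile": tile, "centroid_tile": home, "centroid_within_tile": tile == home}
        for tile in tiles_for_bbox(*bbox)
    ]


# A polygon straddling two tiles, with its centroid on the shared boundary.
for record in stage_polygon((0.5, 2.2, 1.5, 2.8), (1.0, 2.5)):
    print(record)
```

The polygon is written once per overlapped tile, but only one of those records has `centroid_within_tile` set to true, which is the property a later step can use to count each polygon exactly once.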

Expand All @@ -55,6 +56,7 @@ After being run through this staging process, each polygon will be assigned the
- **staging_filename** (string) - The path to the file from which this polygon originated
- **staging_identifier** (string) - A unique identifier for the polygon
- **staging_centroid_within_tile** (boolean) - True when the `centroid_tile` property matches the `tile`, i.e. when the centroid of the polygon is within the same tile as the file it is saved in
- **staging_duplicated** (boolean) - True when the polygon has been identified as a duplicate based on the deduplication method specified in the configuration

## Summary fields

Expand All @@ -71,3 +73,8 @@ The staging process will also output a summary CSV file with one row for each ti

- It is assumed that incoming vector data comprises only valid polygons. **Any non-polygon data is removed**, including multi-polygons, points, lines, or other geometries.
- It's also assumed that each incoming vector file is staged only once. If a file passes through the staging step twice, then all polygons from that file will be duplicated in the output (but with a different identifier). This is due to the fact that when a tile file already exists, additional polygons that belong to this tile will be appended to the file.
- The input data contains no `NaN` or infinite values; if it does contain one of these, the value is known and specified in the configuration. Failing to specify this value can cause issues later in the visualization pipeline.
- For release 0.1.0, the deduplication method `neighbor` has not been thoroughly tested. The deduplication method should be `None` or `footprints`.
- If the deduplication method specified in the configuration is `footprints`, the footprint file(s) are provided with a structure that follows the [docs](https://github.com/PermafrostDiscoveryGateway/viz-staging/blob/main/docs/footprints.md).
- For logging to work properly, the node running the script that uses this package has a `/tmp` directory, where the `log.log` file is written.
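
Several of the assumptions above can be checked before a run. The following is a minimal sketch, not part of the package: `check_staging_assumptions` is a hypothetical helper, the footprint path is a placeholder, and only the option names `deduplicate_method` and `dir_footprints` are taken from the `ConfigManager` documentation.

```python
import os

# Example configuration fragment; the directory path is a placeholder.
config = {
    "deduplicate_method": "footprints",
    "dir_footprints": "path/to/footprints",
}


def check_staging_assumptions(config, tmp_dir="/tmp"):
    """Return a list of human-readable problems (empty if none were found).

    Hypothetical helper: it only mirrors the assumptions listed in the README.
    """
    problems = []
    method = config.get("deduplicate_method")
    if method not in (None, "footprints", "neighbor"):
        problems.append(f"Unknown deduplication method: {method!r}")
    elif method == "neighbor":
        problems.append(
            "The 'neighbor' method has not been thoroughly tested for release 0.1.0."
        )
    elif method == "footprints" and not config.get("dir_footprints"):
        problems.append("The 'footprints' method needs the 'dir_footprints' option.")
    if not os.path.isdir(tmp_dir):
        problems.append(f"No {tmp_dir} directory is available for log.log.")
    return problems


for problem in check_staging_assumptions(config):
    print("WARNING:", problem)
```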

70 changes: 41 additions & 29 deletions pdgstaging/ConfigManager.py
@@ -1,13 +1,14 @@
import json
import logging
from . import logging_config
import os
from .Deduplicator import deduplicate_neighbors, deduplicate_by_footprint
from .TilePathManager import TilePathManager
import warnings
from coloraide import Color
import colormaps as cmaps

logger = logging.getLogger(__name__)
logger = logging_config.logger


class ConfigManager():
Expand Down Expand Up @@ -125,7 +126,7 @@ class ConfigManager():
- Tiling & rasterization options.
- tms_id : str
The ID of the TMS to use. Must match a TMS supported by the
morecantile library. Defaults to 'WorldCRS84Quad'.
morecantile library. Defaults to 'WGS1984Quad'.
- tile_path_structure : list of str
A list of strings that represent the directory structure of
last segment of the path that uses the tms (TileMatrixSet),
Expand Down Expand Up @@ -202,7 +203,9 @@ class ConfigManager():
format accepted by the coloraide library (see:
https://facelessuser.github.io/coloraide/color/), or
the name of a colormap from the colormaps library (see:
https://pratiman-91.github.io/colormaps).
https://pratiman-91.github.io/colormaps). Colors with
transparency hex codes are accepted (see:
https://gist.github.com/lopspower/03fb1cc0ac9f32ef38f4)
- nodata_val: int or float or None or np.nan
The value of pixels to interpret as no data or missing
data. Defaults to None.
Expand Down Expand Up @@ -233,20 +236,24 @@ class ConfigManager():
Options are 'staging', 'raster', '3dtiles', or None to skip
deduplication. If set to 'staging', then duplicates will
also be removed in raster and 3dtiles.
- deduplicate_method : 'neighbor', 'footprints', or None
- deduplicate_method : 'footprints', 'neighbor', or None
The method to use for deduplication. Options are
'neighbor', 'footprints', or None. If None, then no
deduplication will be performed. If 'neighbor', then the
input data will be deduplicated by removing nearby or
overlapping polygons, as determined by the
'deduplicate_centroid_tolerance' and
'deduplicate_overlap_tolerance' options. If 'footprints',
deduplication will be performed. If 'footprints',
then the input data will be deduplicated by removing
polygons that are contained within sections of overlapping
file footprints. This method requires footprint vector
files that have the same name as the input vector files,
stored in a directory specified by the 'dir_footprints'
option.
option. If 'neighbor', then the input data will be
deduplicated by removing nearby or overlapping polygons,
as determined by the 'deduplicate_centroid_tolerance' and
'deduplicate_overlap_tolerance' options. Note that with
release 0.1.0, the 'neighbor' method has not been
thoroughly tested. Only the 'footprints' method has been
thoroughly tested and should be applied to input data,
as this release is tailored to a dataset that requires
this deduplication method.
- deduplicate_keep_rules : list of tuple: []
Required for both deduplication methods. Rules that define
which of the polygons to keep when two or more are
Expand All @@ -273,7 +280,9 @@ class ConfigManager():
the intersecting polygons to be considered a duplicate. If
False, then the overlap_tolerance proportion must be True
for only one of the intersecting polygons to be considered
a duplicate. Default is True.
a duplicate. Default is True. Note that with release 0.1.0,
the 'neighbor' method has not been thoroughly tested
and should not be applied to input data.
- deduplicate_centroid_tolerance : float, optional
For the 'neighbor' deduplication method only. The maximum
distance between the centroids of two polygons to be
Expand All @@ -291,11 +300,14 @@ class ConfigManager():
before calculating the distance between them.
centroid_tolerance will use the units of this CRS. Set to
None to skip the re-projection and use the CRS of the
GeoDataFrame.
GeoDataFrame. Note that with release 0.1.0,
the 'neighbor' method has not been thoroughly tested
and should not be applied to input data.
- deduplicate_clip_to_footprint : bool, optional
For the 'footprints' deduplication method only. If True,
then polygons that fall outside the bounds of the
associated footprint will be removed. Default is False.
associated footprint will be removed. Default is True for
release version 0.1.0, but will be False for future releases.
- deduplicate_clip_method: str, optional
For the 'footprints' deduplication method only, when
deduplicate_clip_to_footprint is True. The method to use to
Expand All @@ -306,7 +318,7 @@ class ConfigManager():
'intersects', 'overlaps', 'touches', 'within' (any option
listed by
geopandas.GeoDataFrame.sindex.valid_query_predicates).
Defaults to 'within'.
Defaults to 'intersects'.
Example config:
---------------
Expand Down Expand Up @@ -381,7 +393,7 @@ class ConfigManager():
# Staging options
'simplify_tolerance': 0.0001,
# Tiling & rasterization options
'tms_id': 'WorldCRS84Quad',
'tms_id': 'WGS1984Quad',
'tile_path_structure': ('style', 'tms', 'z', 'x', 'y'),
'z_range': (0, 13),
'tile_size': (256, 256),
Expand Down Expand Up @@ -417,8 +429,8 @@ class ConfigManager():
'deduplicate_overlap_both': True,
'deduplicate_centroid_tolerance': None,
'deduplicate_distance_crs': 'EPSG:3857',
'deduplicate_clip_to_footprint': False,
'deduplicate_clip_method': 'within'
'deduplicate_clip_to_footprint': True,
'deduplicate_clip_method': 'intersects'
}

tiling_scheme_map = {
Expand Down Expand Up @@ -1149,7 +1161,7 @@ def get_raster_config(self):
Returns
-------
dict
A dictionairy containing the configuration for shape,
A dictionary containing the configuration for shape,
centroid_properties, and stats for Raster.from_vector method.
"""

Expand Down Expand Up @@ -1185,7 +1197,7 @@ def get_path_manager_config(self):
A dict containing the configuration for the Tile Path Manager
class. Example:
{
'tms_id: 'WorldCRS84Quad',
'tms_id': 'WGS1984Quad',
'path_structure': ['style', 'tms', 'z', 'x', 'y'],
'base_dirs': {
'geotiff': {
Expand Down Expand Up @@ -1255,8 +1267,7 @@ def get_deduplication_config(self, gdf=None):
'centroid_tolerance': self.get(
'deduplicate_centroid_tolerance'),
'distance_crs': self.get('deduplicate_distance_crs'),
'return_intersections': False,
'label': True,
'return_intersections': False, # not used at the moment, need to re-introduce this feature since removed dict step when labeling duplicates
'prop_duplicated': self.polygon_prop('duplicated')
}
if(method == 'footprints'):
Expand All @@ -1273,12 +1284,7 @@ def get_deduplication_config(self, gdf=None):
return {
'split_by': file_prop,
'footprints': footprints,
'keep_rules': self.get('deduplicate_keep_rules'),
'clip_to_footprint': self.get(
'deduplicate_clip_to_footprint'),
'clip_method': self.get(
'deduplicate_clip_method'),
'label': True,
'keep_rules': self.get('deduplicate_keep_rules'),
'prop_duplicated': self.polygon_prop('duplicated')
}

Expand Down Expand Up @@ -1307,20 +1313,26 @@ def deduplicate_at(self, step):

def get_deduplication_method(self):
"""
Return the deduplication method
Return the deduplication method set in the config.
Returns
-------
str
The deduplication method.
The deduplication function itself (or None if no method is
set), which can be assigned to a new variable and executed
as the deduplication function.
"""
method = self.get('deduplicate_method')
if(method == 'neighbor'):
logger.warning("Deduplication method 'neighbor' has not been "
"tested for release 0.1.0. Please use deduplication "
"method 'footprints' or None for this release.")
return deduplicate_neighbors
if(method == 'footprints'):
return deduplicate_by_footprint
return None


def footprint_path_from_input(self, path, check_exists=False):
"""
Get the footprint path from an input path
Expand Down
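
Because `get_deduplication_method` returns a function (or `None`) rather than a name, a caller can invoke the result directly. The sketch below shows that dispatch pattern in isolation. The two `deduplicate_*` functions are stand-ins that return their input unchanged, the real signatures are not shown in this diff, and `run_deduplication` is a hypothetical caller.

```python
def deduplicate_neighbors(data, **options):
    """Stand-in for the real function; returns its input unchanged."""
    return data


def deduplicate_by_footprint(data, **options):
    """Stand-in for the real function; returns its input unchanged."""
    return data


def get_deduplication_method(method):
    """Map a configured method name to the function that implements it."""
    if method == "neighbor":
        return deduplicate_neighbors
    if method == "footprints":
        return deduplicate_by_footprint
    return None


def run_deduplication(data, method, **options):
    """Hypothetical caller: skip deduplication when no method is configured."""
    dedup = get_deduplication_method(method)
    if dedup is None:
        return data
    return dedup(data, **options)


polygons = ["polygon_a", "polygon_b"]
print(run_deduplication(polygons, "footprints"))
print(run_deduplication(polygons, None))
```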