feat: add dedicated download functions #26

Draft
wants to merge 22 commits into
base: main
e10646b
refactor: change available release versions caching
RaczeQ Nov 5, 2024
ed43112
feat: added timing decorator for aggregating total time with nested c…
RaczeQ Nov 5, 2024
32b5591
feat: added columns_to_download functionality
RaczeQ Nov 7, 2024
a4c7caf
chore: added dummy advanced functions file structure
RaczeQ Nov 7, 2024
6ec6e9e
chore: add pragma no cover
RaczeQ Nov 8, 2024
40c74f7
fix: change test working directory
RaczeQ Nov 8, 2024
aaefd02
chore: prepare them_type classification for wide form
RaczeQ Nov 8, 2024
f8a711f
chore: add wide form functions base code
RaczeQ Nov 13, 2024
73d0ee7
fix(pre-commit.ci): auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 13, 2024
1b7241a
feat: add working wide form logic for default cases
RaczeQ Nov 24, 2024
550b725
chore: make multiprocessing manager a singleton
RaczeQ Nov 24, 2024
42e3615
chore: cleaned nested Path call
RaczeQ Nov 24, 2024
8784c65
feat: added working buildings with parts download logic
RaczeQ Nov 24, 2024
fab6343
feat: refactor internat data download to work with multiple theme typ…
RaczeQ Dec 4, 2024
d4e38ea
feat: add dedicated function for downloading buildings with parts
RaczeQ Dec 4, 2024
f837525
Merge branch 'main' into 10-add-some-dedicated-functions-for-highways…
RaczeQ Dec 4, 2024
9098f12
fix: remove prints from functions
RaczeQ Dec 4, 2024
9f06060
fix: change windows test path
RaczeQ Dec 4, 2024
3863a81
feat: finish building and poi wide format logic
RaczeQ Dec 8, 2024
baa67c9
feat: add theme and type metadata info
RaczeQ Dec 8, 2024
f63f589
chore: remove unnecessary download buildings function
RaczeQ Dec 10, 2024
75be4f7
feat: add option to download multiple theme type datasets at once
RaczeQ Dec 14, 2024
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Automatic total time wrapper decorator to aggregate nested function calls
- Parameter `columns_to_download` for selecting columns to download from the dataset [#23](https://github.com/kraina-ai/overturemaestro/issues/23)

### Changed

- Refactored available release versions caching [#24](https://github.com/kraina-ai/overturemaestro/issues/24)
- Removed hive partitioned parquet schema columns from GeoDataFrame loading

## [0.1.1] - 2024-11-24

### Changed
2 changes: 2 additions & 0 deletions README.md
@@ -80,6 +80,8 @@ Required:

- `geoarrow-rust-core (>=0.3.0)`: For transforming Arrow data to Shapely objects

- `duckdb (>=1.1.0)`: For transforming downloaded data to the wide format

- `pooch (>=1.6.0)`: For downloading precalculated dataset indexes

- `rich (>=12.0.0)`: For showing progress bars
27 changes: 27 additions & 0 deletions overturemaestro/_duckdb.py
@@ -0,0 +1,27 @@
"""Helper functions for DuckDB."""

from pathlib import Path
from typing import Union

import duckdb


def _sql_escape(value: str) -> str:
    """Escape value for SQL query."""
    return value.replace("'", "''")


def _set_up_duckdb_connection(tmp_dir_path: Union[str, Path]) -> "duckdb.DuckDBPyConnection":
    """Create DuckDB connection in a given directory."""
    local_db_file = "db.duckdb"
    connection = duckdb.connect(
        database=str(Path(tmp_dir_path) / local_db_file),
        config=dict(preserve_insertion_order=False),
    )
    connection.sql("SET enable_progress_bar = false;")
    connection.sql("SET enable_progress_bar_print = false;")

    connection.install_extension("spatial")
    connection.load_extension("spatial")

    return connection
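To illustrate what the quote-doubling in `_sql_escape` achieves, here is a small sketch that embeds an escaped value in a SQL string literal. It uses the standard-library `sqlite3` module as a stand-in for DuckDB, purely so the example runs without extra dependencies. The local `sql_escape` copy and the sample value are assumptions for illustration.

```python
import sqlite3


def sql_escape(value: str) -> str:
    """Double single quotes so the value can sit inside a SQL string literal."""
    return value.replace("'", "''")


name = "O'Hare"

# sqlite3 stands in for DuckDB here; both accept '' as an escaped quote.
connection = sqlite3.connect(":memory:")
row = connection.execute(f"SELECT '{sql_escape(name)}'").fetchone()
connection.close()
```

Without the escaping step, the apostrophe in the value would end the string literal early and the query would fail to parse.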
6 changes: 6 additions & 0 deletions overturemaestro/_exceptions.py
@@ -1 +1,7 @@
class QueryNotGeocodedError(ValueError): ...


class MissingColumnError(ValueError): ...


class HierarchyDepthOutOfBoundsError(ValueError): ...
31 changes: 23 additions & 8 deletions overturemaestro/_parquet_multiprocessing.py
@@ -1,8 +1,9 @@
import multiprocessing
from multiprocessing.managers import SyncManager
from pathlib import Path
from queue import Empty, Queue
from time import sleep, time
from typing import TYPE_CHECKING, Any, Callable, Optional, Union
from typing import TYPE_CHECKING, Any, Callable, Optional, Union, cast

from overturemaestro._rich_progress import VERBOSITY_MODE, TrackProgressSpinner

@@ -31,13 +32,14 @@ def _job(
columns: Optional[list[str]],
filesystem: "fs.FileSystem",
) -> None: # pragma: no cover
import hashlib

import pyarrow.dataset as ds
import pyarrow.parquet as pq

current_pid = multiprocessing.current_process().pid

filepath = save_path / f"{current_pid}.parquet"
writer = None
writers = {}
while not queue.empty():
try:
file_name, row_group_index = None, None
@@ -61,10 +63,16 @@ def _job(
tracker.value += 1
continue

if not writer:
writer = pq.ParquetWriter(filepath, result_table.schema)
h = hashlib.new("sha256")
h.update(result_table.schema.to_string().encode())
schema_hash = h.hexdigest()

if schema_hash not in writers:
filepath = save_path / str(current_pid) / f"{schema_hash}.parquet"
filepath.parent.mkdir(exist_ok=True, parents=True)
writers[schema_hash] = pq.ParquetWriter(filepath, result_table.schema)

writer.write_table(result_table)
writers[schema_hash].write_table(result_table)

with tracker_lock:
tracker.value += 1
@@ -80,7 +88,7 @@ def _job(
)
raise MultiprocessingRuntimeError(msg) from ex

if writer:
for writer in writers.values():
writer.close()
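The change above replaces a single writer per process with one writer per distinct table schema, keyed by a SHA-256 hash of the schema string. The grouping idea can be sketched with the standard library only. In this sketch plain strings stand in for `pyarrow` schemas and lists stand in for Parquet writers, and the names `schema_key` and `group_rows_by_schema` are hypothetical.

```python
import hashlib


def schema_key(schema_description: str) -> str:
    """Return a stable SHA-256 hex digest for a textual schema description."""
    return hashlib.sha256(schema_description.encode()).hexdigest()


def group_rows_by_schema(
    batches: list[tuple[str, list[dict]]],
) -> dict[str, list[dict]]:
    """Collect rows into one bucket per distinct schema description."""
    buckets: dict[str, list[dict]] = {}
    for schema_description, rows in batches:
        key = schema_key(schema_description)
        # A new bucket is created lazily, like opening a writer on first use.
        buckets.setdefault(key, []).extend(rows)
    return buckets


batches = [
    ("id: string, height: double", [{"id": "a", "height": 3.0}]),
    ("id: string, name: string", [{"id": "b", "name": "cafe"}]),
    ("id: string, height: double", [{"id": "c", "height": 5.5}]),
]
buckets = group_rows_by_schema(batches)
```

Batches that share a schema end up in the same bucket, which mirrors how tables with identical schemas are appended to the same output file.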


@@ -107,6 +115,13 @@ def exception(self) -> Optional[tuple[Exception, str]]:
return self._exception


class SingletonContextManager(SyncManager):
    def __new__(cls, ctx: multiprocessing.context.SpawnContext) -> "SingletonContextManager":
        if not hasattr(cls, "instance"):
            cls.instance = ctx.Manager()
        return cast(SingletonContextManager, cls.instance)


def _read_row_group_number(path: str, filesystem: "fs.FileSystem") -> int:
import pyarrow.parquet as pq

@@ -154,7 +169,7 @@ def map_parquet_dataset(

from overturemaestro._rich_progress import TrackProgressBar

manager = ctx.Manager()
manager = SingletonContextManager(ctx=ctx)

queue: Queue[tuple[str, int]] = manager.Queue()
tracker: ValueProxy[int] = manager.Value("i", 0)
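The `SingletonContextManager` change relies on `__new__` returning a cached object created by a factory call, so repeated construction reuses one manager. A minimal sketch of that pattern follows. `ExpensiveResource` and `SharedResource` are hypothetical stand-ins, and no real multiprocessing manager is started here.

```python
class ExpensiveResource:
    """Stand-in for an object that is costly to create, such as a manager."""

    created = 0

    def __init__(self) -> None:
        type(self).created += 1


class SharedResource:
    """Calling the class returns one cached ExpensiveResource."""

    def __new__(cls) -> "ExpensiveResource":
        if not hasattr(cls, "instance"):
            # Created once on first use and stored on the class.
            cls.instance = ExpensiveResource()
        # The returned object is not a SharedResource instance,
        # so Python skips SharedResource.__init__.
        return cls.instance


first = SharedResource()
second = SharedResource()
```

Both names refer to the same object, and the expensive constructor runs only once for the lifetime of the process.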
49 changes: 49 additions & 0 deletions overturemaestro/advanced_functions/__init__.py
@@ -0,0 +1,49 @@
# """
# Advanced functions.

# This module contains dedicated functions for specific use cases.
# """

# from overturemaestro.advanced_functions.buildings import (
# convert_bounding_box_to_buildings_geodataframe,
# convert_bounding_box_to_buildings_parquet,
# convert_geometry_to_buildings_geodataframe,
# convert_geometry_to_buildings_parquet,
# )
# from overturemaestro.advanced_functions.poi import (
# convert_bounding_box_to_pois_geodataframe,
# convert_bounding_box_to_pois_parquet,
# convert_geometry_to_pois_geodataframe,
# convert_geometry_to_pois_parquet,
# )
# from overturemaestro.advanced_functions.transportation import (
# convert_bounding_box_to_roads_geodataframe,
# convert_bounding_box_to_roads_parquet,
# convert_geometry_to_roads_geodataframe,
# convert_geometry_to_roads_parquet,
# )
# from overturemaestro.advanced_functions.wide_form import (
# convert_bounding_box_to_wide_form_geodataframe,
# convert_bounding_box_to_wide_form_parquet,
# convert_geometry_to_wide_form_geodataframe,
# convert_geometry_to_wide_form_parquet,
# )

# __all__ = [
# "convert_bounding_box_to_buildings_geodataframe",
# "convert_bounding_box_to_buildings_parquet",
# "convert_bounding_box_to_pois_geodataframe",
# "convert_bounding_box_to_pois_parquet",
# "convert_bounding_box_to_roads_geodataframe",
# "convert_bounding_box_to_roads_parquet",
# "convert_bounding_box_to_wide_form_geodataframe",
# "convert_bounding_box_to_wide_form_parquet",
# "convert_geometry_to_buildings_geodataframe",
# "convert_geometry_to_buildings_parquet",
# "convert_geometry_to_pois_geodataframe",
# "convert_geometry_to_pois_parquet",
# "convert_geometry_to_roads_geodataframe",
# "convert_geometry_to_roads_parquet",
# "convert_geometry_to_wide_form_geodataframe",
# "convert_geometry_to_wide_form_parquet",
# ]