Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update airflow.io.path.ObjectStoragePath for upcoming universal_pathlib release #37067

Closed
2 tasks done
ap-- opened this issue Jan 29, 2024 · 5 comments · Fixed by #37524
Closed
2 tasks done

Update airflow.io.path.ObjectStoragePath for upcoming universal_pathlib release #37067

ap-- opened this issue Jan 29, 2024 · 5 comments · Fixed by #37524

Comments

@ap--
Copy link
Contributor

ap-- commented Jan 29, 2024

Apache Airflow version

main (development)

If "Other Airflow 2 version" selected, which one?

No response

What happened?

I'm opening this issue to track a needed refactor of airflow.io.path.ObjectStoragePath to support the next release of universal_pathlib without having to have a special implementation for Python-3.12

What you think should happen instead?

Tagging @bolkedebruin, @uranusjr, @potiuk

It would be great if I could have some insight into the intended behaviour for the custom UPath subclass airflow.io.path.ObjectStoragePath

Looking at the current implementation it seems that these are the intended changes:

  • (1) extracting a connection ID from the URI

    airflow/airflow/io/path.py

    Lines 148 to 151 in 07fd364

    userinfo, have_info, hostinfo = parsed_url.netloc.rpartition("@")
    if have_info:
    conn_id = conn_id or userinfo or None
    parsed_url = parsed_url._replace(netloc=hostinfo)
  • (2) delegating fsspec filesystem creation to airflow.io.store.ObjectStore
    def __init__(
    self,
    parsed_url: SplitResult | None,
    conn_id: str | None = None,
    **kwargs: typing.Any,
    ) -> None:
    # warning: we are not calling super().__init__ here
    # as it will try to create a new fs from a different
    # set if registered filesystems
    if parsed_url and parsed_url.scheme:
    self._store = attach(parsed_url.scheme, conn_id)
    else:
    self._store = attach("file", conn_id)
    @property
    def _fs(self) -> AbstractFileSystem:
    return self._store.fs
  • (3) customization and extension of the UPath interface with several custom methods.

I am working on supporting (1) and (2) via custom methods that can be overwritten in UPath subclasses, and allow to have a single implementation for python versions prior to 3.12 and 3.12+, see: fsspec/universal_pathlib#172

Once that is integrated in UPath it should simplify the ObjectStoragePath implementation a lot. The custom accessor will be replaced by a factory method for creating fsspec filesystems, and the custom __new__ implementation by a method for parsing additional storage_options.

Let me know if that sounds good, and what's the best way to collaborate on this issue and move forward.

Cheers,
Andreas 😃

How to reproduce

N/A

Operating System

N/A

Versions of Apache Airflow Providers


Deployment

Other

Deployment details

N/A

Anything else?

N/A

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@ap-- ap-- added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jan 29, 2024
Copy link

boring-cyborg bot commented Jan 29, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@bolkedebruin
Copy link
Contributor

Hey @ap-- ,

That just sounds awesome! Obviously any simplification while retaining flexibility sounds great.

The generic need for a custom class on top of UPath is indeed to support Airflow-isms, like what you mentioned but also to be able to de/serialize a path into something that can be stored in a database. We are also a bit more "object store" oriented (semantically) than UPath I guess, so a little bit more strict in what we support how how things behave.

@dirrao dirrao removed kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jan 31, 2024
@ap--
Copy link
Contributor Author

ap-- commented Jan 31, 2024

The generic need for a custom class on top of UPath is indeed to support Airflow-isms, like what you mentioned but also to be able to de/serialize a path into something that can be stored in a database. We are also a bit more "object store" oriented (semantically) than UPath I guess, so a little bit more strict in what we support how how things behave.

Great 👍 Once I finish the implementation for fsspec/universal_pathlib#173 I'll open a PR here.

@potiuk
Copy link
Member

potiuk commented Feb 14, 2024

Last blocker for #36755

@ap--
Copy link
Contributor Author

ap-- commented Mar 4, 2024

@bolkedebruin I released a new version of universal_pathlib yesterday, that vendors the required code to delay import of the required fsspec filesystems until they are actually used for IO.

PR #37524 should in theory pass all tests now, if universal_pathlib>=0.2.2 is used as a dependency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants