Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ObjectStoragePath for universal_pathlib>=v0.2.1 #37524

Merged
merged 10 commits into from
Feb 20, 2024

Conversation

ap--
Copy link
Contributor

@ap-- ap-- commented Feb 18, 2024

closes: #37067
related: #36546

This PR adjusts the airflow.io.path.ObjectStoragePath implementation to support the new universal_pathlib>=0.2.1.

Functionality should be identical to the previous implementation. A few notes:

  • ObjectStoragePath._accessor._store was keeping an ObjectStore instance reference just for retrieving the conn_id. The current refactor does not keep the store instance referenced, since the conn_id is available in the storage_options of the UPath instance.
  • The __hash__ method was previously decorated with lru_cache, which is bad, because it keeps a reference to each ObjectStoragePath instance in the cache dict (stored on the class), preventing cleanup.
  • There were a few unused slots _bucket, _key, etc.. that seemed to be leftovers from a previous implementation.

Tagging: @potiuk @bolkedebruin


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

Copy link

boring-cyborg bot commented Feb 18, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@ap--
Copy link
Contributor Author

ap-- commented Feb 18, 2024

Also, if a core developer can point me to the easiest way to run the relevant parts of the test suite on an M2 mac that would be wonderful ❤️

@potiuk
Copy link
Member

potiuk commented Feb 18, 2024

Cool.

Also, if a core developer can point me to the easiest way to run the relevant parts of the test suite on an M2 mac that would be wonderful ❤️

This should be rather easy :

Contributing docs: https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst
Particularly quick start: https://github.com/apache/airflow/blob/main/contributing-docs/03_contributors_quick_start.rst

You have two options:

A) Using your favourite IDE and (recommended) hatch:

hatch shell -> will create and drop you in the right Airflow env with [devel] extras installed. hatch env find will print you where the venv is created - you can configure your IDE to use this venv and then run tests from your IDE via point/click

Details: https://github.com/apache/airflow/blob/main/contributing-docs/03_contributors_quick_start.rst#setting-up-virtual-env

You can also manage your env in whatever way you are used to and run pip -e ".[devel]" in the venv to install Airlfow and all dev tools needed to run tests.

We use pytest to run our tests.

B) using 1-1 reproducible CI dockerized environment:

  1. Install pipx and do pipx install -e ./dev/breeze after checking out Airflow
  2. breeze ci-image build --upgrade-to-newer-dependencies in your PR should build the container dependencies (use --help to see other options like changing python version)
  3. breeze should use the image with docker compose command and drop you into shelll inside the docker-compose
  4. then pytest tests/iowill run io tests (you can also see the set of tests that will fail in the CI output above

Details: https://github.com/apache/airflow/blob/main/contributing-docs/03_contributors_quick_start.rst#setting-up-breeze

@potiuk
Copy link
Member

potiuk commented Feb 18, 2024

(I approved/run) the tests BTW.

@potiuk
Copy link
Member

potiuk commented Feb 18, 2024

Also installing pre-commit and running pre-commit install is a good idea - this way all static checks will be locally run for you before you commit anything giving you a quick feedback on style/linting/mypy/similar .

- uses MemoryFileSystems instead of mocked 'fake filesystems'
- tests intention for `.stat()` (two added keys)
- tests local filesystems with absolute file urls, because fsspec
  interprets `file://bucket/key` as `file://{os.getcwd()}/bucket/key`
tests/io/test_path.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this! I just started looking at what was required and you beat me right to it.

The PR introduces some subtle behavioral changes (improvements definitely!). Particularly around what a path is and I agree with @uranusjr that those require strengthening of the tests and possibly improvements to the docs.

tests/io/test_path.py Outdated Show resolved Hide resolved
tests/io/test_path.py Show resolved Hide resolved
airflow/providers/common/io/xcom/backend.py Outdated Show resolved Hide resolved
Copy link
Contributor

@bolkedebruin bolkedebruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thanks for this @ap--

@bolkedebruin bolkedebruin merged commit 08bc0f4 into apache:main Feb 20, 2024
98 of 99 checks passed
Copy link

boring-cyborg bot commented Feb 20, 2024

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@bolkedebruin
Copy link
Contributor

Oops, I think I might have missed a failing test / regression. I'll take a look

@ephraimbuddy
Copy link
Contributor

There are lots of conflicts while cherrypicking this to 2.8.2. I will be moving it to 2.9.0 except anyone want to try the cherry-picking? cc @bolkedebruin @potiuk

@ephraimbuddy ephraimbuddy removed this from the Airflow 2.8.2 milestone Feb 20, 2024
@ephraimbuddy ephraimbuddy added this to the Airflow 2.9.0 milestone Feb 20, 2024
@bolkedebruin
Copy link
Contributor

bolkedebruin commented Feb 20, 2024

Why? That seems weird @ephraimbuddy ? (I.e. the conflicts). It's quite isolated on its own and no I think this should be in 2.8.2.

@potiuk
Copy link
Member

potiuk commented Feb 20, 2024

Awesome! Thanks for this @ap--

Fantastic :)

@potiuk
Copy link
Member

potiuk commented Feb 20, 2024

Why? That seems weird @ephraimbuddy ? (I.e. the conflicts). It's quite isolated on its own and no I think this should be in 2.8.2.

Because there are other committs that will need to be cherry-picked before that one - those that introduced the backend and we have not cherry-picked them,

@bolkedebruin
Copy link
Contributor

Ah yes. Well up to you. Seems nice to have this in 2.8.2 but its fine in 2.9

@potiuk
Copy link
Member

potiuk commented Feb 20, 2024

Those are io related changes that are not cherry-picked to v2-8-test - @ephraimbuddy you might try but IMHO it's a bit too risky (and also some code there explicitly mentions >= 2.9.0 so it would have to be changed).

08bc0f4490 Update ObjectStoragePath for universal_pathlib>=v0.2.1 (#37524)
 573d650708 AIP-58: Add object storage backend for xcom (#37058)
 33996a49f1 Add support for openlineage to AFS and common.io (#36410)
 5cf2560cce Some improvements to Airflow IO code (#36259)

@ephraimbuddy
Copy link
Contributor

Those are io related changes that are not cherry-picked to v2-8-test - @ephraimbuddy you might try but IMHO it's a bit too risky (and also some code there explicitly mentions >= 2.9.0 so it would have to be changed).

08bc0f4490 Update ObjectStoragePath for universal_pathlib>=v0.2.1 (#37524)
 573d650708 AIP-58: Add object storage backend for xcom (#37058)
 33996a49f1 Add support for openlineage to AFS and common.io (#36410)
 5cf2560cce Some improvements to Airflow IO code (#36259)

I think we should skip this one. I will do 2.9 next month instead of 2.8.3

@bolkedebruin
Copy link
Contributor

Great! Even better because I think there is a regression now @ap-- . The modules are eagerly loaded now ie:

o = ObjectStoragePath("s3://xxx")

Will now load s3fs right away, which it shouldn't do as this code will be often placed at the top of a DAG and thus loaded every single time it is parsed.

@ap--
Copy link
Contributor Author

ap-- commented Feb 20, 2024

Great! Even better because I think there is a regression now @ap-- . The modules are eagerly loaded now ie:

o = ObjectStoragePath("s3://xxx")

Will now load s3fs right away, which it shouldn't do as this code will be often placed at the top of a DAG and thus loaded every single time it is parsed.

This design was intentional, because fsspec filesystems can implement their own storage_option parsing and protocol strip methods, and correct parsing behaviour can only be ensured if the filesystem class is importable.

I will see if I can make changes to universal_pathlib to delay that as long as possible, or offer another way of handling these cases. What operations on an ObjectStoragePath would you expect to not import the underlying filesystem_spec AbstractFileSystem class? And is ObjectStoragePath supposed to handle all UPath filesystems, or only a subset?

I'll open an issue in universal_pathlib for further discussion.

@ephraimbuddy ephraimbuddy added type:improvement Changelog: Improvements changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) and removed type:improvement Changelog: Improvements labels Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
AIP-58 changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update airflow.io.path.ObjectStoragePath for upcoming universal_pathlib release Airflow Python 3.12 support
5 participants