PERF-#7397: Avoid materializing index/columns in shape checks #7398

Merged (5 commits) on Sep 21, 2024

@noloerino (Collaborator) commented on Sep 13, 2024

What do these changes do?

Calling len(pd.DataFrame(...)) currently materializes the frame's index and returns the length of the resulting pd.Index object. This PR adds a get_axis_len method to the query compiler so that the length of the index or columns can be determined without necessarily materializing the labels.

This may not make a large difference for existing backends, since the underlying PandasDataFrame caches the index/column labels together with the length of that axis. Other backends, however, may cache the shape separately from the actual labels, and the new method lets them skip materializing those labels. Frontend methods that previously called len(df.index) should therefore call the equivalent len(df) instead, so that they do not trigger this materialization.
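
To illustrate the idea, here is a minimal, self-contained sketch that does not use Modin itself. Only the method name get_axis_len comes from this PR; the class names, the `axis` argument convention (0 for index, 1 for columns), the fallback in the base class, and the `materializations` counter are assumptions made for the example.

```python
class BaseQueryCompilerSketch:
    """Hypothetical base class: by default, axis length falls back to the labels."""

    def get_axis_len(self, axis):
        # Assumed default: materialize the labels and take their length.
        return len(self.index if axis == 0 else self.columns)


class ShapeCachingQueryCompiler(BaseQueryCompilerSketch):
    """Hypothetical backend that knows its shape but builds labels lazily."""

    def __init__(self, n_rows, n_cols):
        self._shape = (n_rows, n_cols)
        self._index = None
        self.materializations = 0  # counts how often the index labels are built

    @property
    def index(self):
        if self._index is None:
            self.materializations += 1
            self._index = list(range(self._shape[0]))  # stand-in for an expensive fetch
        return self._index

    @property
    def columns(self):
        return [f"c{i}" for i in range(self._shape[1])]

    def get_axis_len(self, axis):
        # Override: answer from the cached shape, never touching the labels.
        return self._shape[axis]


class FrameSketch:
    """Hypothetical frontend object that delegates to a query compiler."""

    def __init__(self, query_compiler):
        self._query_compiler = query_compiler

    @property
    def index(self):
        return self._query_compiler.index

    def __len__(self):
        return self._query_compiler.get_axis_len(0)


qc = ShapeCachingQueryCompiler(n_rows=1000, n_cols=3)
df = FrameSketch(qc)

n_via_len = len(df)                      # goes through get_axis_len
built_after_len = qc.materializations    # labels not built yet

n_via_index = len(df.index)              # forces the labels to be built
built_after_index = qc.materializations
```

In this sketch both calls return the same row count, but only len(df.index) triggers the label construction, which is why frontend code is better off calling len(df).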

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves PERF: Add explicit query compiler method for len/shape checks #7397
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@noloerino changed the title from "[DRAFT] PERF-#7379: Avoid materializing index/columns in shape checks" to "PERF-#7379: Avoid materializing index/columns in shape checks" on Sep 18, 2024
@noloerino marked this pull request as ready for review on Sep 19, 2024, 17:08
@noloerino changed the title from "PERF-#7379: Avoid materializing index/columns in shape checks" to "PERF-#7397: Avoid materializing index/columns in shape checks" on Sep 19, 2024
Review threads (outdated, resolved):
  • modin/core/storage_formats/pandas/query_compiler.py
  • modin/pandas/dataframe.py
  • modin/pandas/dataframe.py
  • modin/pandas/dataframe.py
  • modin/pandas/series.py
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
@anmyachev (Collaborator) left a comment

LGTM!

@anmyachev merged commit cc717a0 into modin-project:main on Sep 21, 2024
39 checks passed
Successfully merging this pull request may close these issues.

PERF: Add explicit query compiler method for len/shape checks
2 participants