Fix Merge divisions after filtering partitions #1152

Merged
merged 6 commits into from
Oct 18, 2024

Member

@rjzamora rjzamora commented Oct 16, 2024

A Dask-cuDF user reported an error that suggested something was going wrong in the Merge graph:

kwargs:    {'how': 'inner', 'indicator': False, 'left_index': False, 'right_index': False, 'suffixes': ('_x', '_y'), 'result_meta': Empty DataFrame
Columns: [id_x, id_y, jaccard, id, uid]
Index: [], 'left_on': ['id_y'], 'right_on': ['id']}
Exception: 'AttributeError("\'tuple\' object has no attribute \'merge\'")'

It turns out that the problem happens when you perform a broadcast merge after filtering the partitions of the larger collection. On main, Merge calls _divisions() to check the divisions of a (possibly-filtered) child expression, but I'm pretty sure we need to use the child's divisions property instead, since that is what reflects the filtering. This PR makes the necessary fix and adds test coverage.
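To illustrate the distinction, here is a minimal pure-Python sketch. `ToyExpr` is a hypothetical stand-in, not the actual dask-expr class. It assumes that `_divisions()` describes the unfiltered collection while the `divisions` property accounts for the selected partitions, and it only handles a contiguous selection.

```python
class ToyExpr:
    """Hypothetical stand-in for an expression whose partitions may be filtered."""

    def __init__(self, full_divisions, partitions=None):
        self._full_divisions = tuple(full_divisions)
        # None means "no filtering"; otherwise a contiguous list of partition indices.
        self._partitions = partitions

    @property
    def _filtered(self):
        return self._partitions is not None

    def _divisions(self):
        # Divisions of the collection as if no partitions had been dropped.
        return self._full_divisions

    @property
    def divisions(self):
        full = self._divisions()
        if not self._filtered:
            return full
        # Lower bound of every selected partition, plus the upper bound of the last one.
        bounds = [full[i] for i in self._partitions]
        bounds.append(full[self._partitions[-1] + 1])
        return tuple(bounds)


# Ten partitions, filtered down to the first five.
child = ToyExpr(full_divisions=range(0, 110, 10), partitions=[0, 1, 2, 3, 4])

# A parent that consults _divisions() sees the unfiltered layout.
stale_npartitions = len(child._divisions()) - 1

# A parent that consults the divisions property sees the filtered layout.
filtered_npartitions = len(child.divisions) - 1
```

In this toy model, a parent that sizes its graph from `child._divisions()` would plan for ten partitions when the filtered child only has five. That is the kind of mismatch described above.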

@rjzamora rjzamora added the bug Something isn't working label Oct 16, 2024
@rjzamora rjzamora self-assigned this Oct 16, 2024
@rjzamora rjzamora marked this pull request as ready for review October 16, 2024 17:13
assert not expr._filtered
assert expr.left._filtered
assert expr.divisions == expr._divisions()
assert len(expr.divisions) == 6
Collaborator

Can you check the result as well?

Member Author

Good call.

While adding this check I actually noticed that divisions are often "wrong" after a merge (unrelated to the changes in this PR). More specifically, it seems like we are inheriting divisions after a broadcast join, even if we aren't merging on the index. If I understand correctly, we inherit the partition count in this case, but the divisions are likely to change.

Collaborator

Yeah, that's correct. We should probably only inherit divisions on index merges; pandas is not consistent with the index when you merge on columns :(

@phofl phofl merged commit 2898409 into dask:main Oct 18, 2024
6 checks passed
Collaborator

phofl commented Oct 18, 2024

thanks

@rjzamora rjzamora deleted the partitions-filtered-merge-fix branch October 18, 2024 14:44