Fix Merge divisions after filtering partitions #1152
assert not expr._filtered
assert expr.left._filtered
assert expr.divisions == expr._divisions()
assert len(expr.divisions) == 6
can you check the result as well?
Good call.
While adding this check I actually noticed that divisions are often "wrong" after a merge (unrelated to the changes in this PR). More specifically, it seems like we are inheriting divisions after a broadcast join, even if we aren't merging on the index. If I understand correctly, we inherit the partition count in this case, but the divisions are likely to change.
Yeah, that's correct. We should probably only inherit on index merges; pandas is not consistent with the index when you merge on columns :(
thanks
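For reference, a minimal pandas sketch of the index behavior mentioned above. The frames, column names, and index labels are made up for illustration and are not taken from the PR's test. Merging on a column discards both input indexes, while merging on the index keeps the original labels, which appears to be why inherited divisions would only be meaningful for index merges.

```python
import pandas as pd

# Hypothetical example frames; the index labels are deliberately non-default
# so it is obvious whether they survive the merge.
left = pd.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]}, index=[100, 200, 300])
right = pd.DataFrame({"key": [1, 3], "y": [0.1, 0.3]}, index=[100, 300])

# Merge on a column: pandas ignores both indexes and the result gets a
# fresh default integer index, so divisions of the inputs say nothing
# about the output index.
on_column = left.merge(right, on="key")

# Merge on the index: the output labels come from the input indexes,
# so index-based divisions can still describe the result.
on_index = left.merge(right[["y"]], left_index=True, right_index=True)

print(on_column.index.tolist())
print(on_index.index.tolist())
```

In the column merge, none of the labels 100, 200, 300 appear in the output index, so any divisions carried over from the larger collection would describe an index that no longer exists.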
A Dask-cuDF user reported an error that suggested something was going wrong in the Merge graph. It turns out that the problem happens when you perform a broadcast merge after filtering the partitions of the larger collection. In main, we use _divisions() to check the divisions of a (possibly-filtered) child expression when I'm pretty sure we need to be using the child's divisions property. This PR makes the necessary fix and adds test coverage.