Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Improve sort parameter doc of DataFrameGroupBy.value_counts #59412

Conversation

KevsterAmp
Copy link
Contributor

@KevsterAmp KevsterAmp commented Aug 5, 2024

I just copied the sort docstring of Dataframe.value_counts since the latter is mirrored to behaved like it.

image

@KevsterAmp KevsterAmp requested a review from rhshadrach as a code owner August 5, 2024 04:33
@KevsterAmp
Copy link
Contributor Author

Seems like I have changed the wrong value_count() function. I have changed pandas.core.groupby.SeriesGroupBy.value_counts rather than pandas.core.groupby.DataFrameGroupBy.value_counts. Applying fix in a bit

@KevsterAmp
Copy link
Contributor Author

Fixed it
image

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

@@ -2303,7 +2303,7 @@ def value_counts(
normalize : bool, default False
Return proportions rather than frequencies.
sort : bool, default True
Sort by frequencies.
Sort by frequencies when True. Sort by DataFrame column values when False.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also want to modify the docstring on SeriesGroupBy.value_counts. Can you make the same adjustment there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing

@rhshadrach rhshadrach added Docs Groupby Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 5, 2024
@KevsterAmp
Copy link
Contributor Author

@rhshadrach - are there any extra changes needed for this PR? Thanks

@sfc-gh-joshi
Copy link

sfc-gh-joshi commented Aug 8, 2024

@KevsterAmp I left an additional ? on the parent issue: #59307 (comment)

It seems like when the groupby's sort flag and value_counts sort flag are both false, no sorting is done. Is this a behavioral bug, or just something else to add to documentation?

edit: It seems also that when value_counts has sort=True, the result frame is still being sorted on the DataFrame column values:

>>> data0 = {
    "by": ["c", "b", "a", "a", "b", "b", "c", "a"],
    "value1": ["ee", "aa", "bb", "aa", "bb", "cc", "dd", "aa"],
    "value2": [1, 2, 3, 1, 1, 3, 2, 1],
}
>>> pd.DataFrame(data0).groupby(by=["by"]).value_counts(sort=True)
by  value1  value2
a   aa      1         2
    bb      3         1
b   aa      2         1
    bb      1         1
    cc      3         1
c   dd      2         1
    ee      1         1

In the input frame, the row c ee 1 (which appears once) appears before c dd 1 (also appearing once), but the result ends up being sorted on value1 as well as the counts. I think it's worth clarifying whether value_counts(sort=True) will also sort on the DataFrame's values, or if it's just that the algorithm used to sort the counts column is unstable.

@KevsterAmp
Copy link
Contributor Author

Fixed the doc, thanks.

image

@KevsterAmp KevsterAmp requested a review from rhshadrach August 12, 2024 01:53
@KevsterAmp
Copy link
Contributor Author

@KevsterAmp I left an additional ? on the parent issue: #59307 (comment)

It seems like when the groupby's sort flag and value_counts sort flag are both false, no sorting is done. Is this a behavioral bug, or just something else to add to documentation?

edit: It seems also that when value_counts has sort=True, the result frame is still being sorted on the DataFrame column values:

>>> data0 = {
    "by": ["c", "b", "a", "a", "b", "b", "c", "a"],
    "value1": ["ee", "aa", "bb", "aa", "bb", "cc", "dd", "aa"],
    "value2": [1, 2, 3, 1, 1, 3, 2, 1],
}
>>> pd.DataFrame(data0).groupby(by=["by"]).value_counts(sort=True)
by  value1  value2
a   aa      1         2
    bb      3         1
b   aa      2         1
    bb      1         1
    cc      3         1
c   dd      2         1
    ee      1         1

In the input frame, the row c ee 1 (which appears once) appears before c dd 1 (also appearing once), but the result ends up being sorted on value1 as well as the counts. I think it's worth clarifying whether value_counts(sort=True) will also sort on the DataFrame's values, or if it's just that the algorithm used to sort the counts column is unstable.

Not sure if I will address this in this PR, since the scope of this one is to just modify the doc. Thanks

@KevsterAmp
Copy link
Contributor Author

@rhshadrach - are there any extra changes needed for this PR? Thanks

@rhshadrach
Copy link
Member

@KevsterAmp - I am hesitant to make this change since certain cases that are correct now become incorrect. When you do df.groupby(..., sort=False).value_counts(sort=False), pandas is preserving the order of the input. See #59307 (comment)

@rhshadrach
Copy link
Member

@KevsterAmp - I've opened #59307 based on the discussion in the linked issue. Unfortunately that supersedes this PR. Going to close here - but thanks for working on this!

@rhshadrach rhshadrach closed this Sep 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Docs Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: GroupBy.value_counts doesn't preserve original order for non-grouping rows
4 participants