PERF/BENCHMARK: comprehensive cat-groupby benchmarks #19026

Closed
jreback opened this issue Jan 1, 2018 · 4 comments
Labels
Benchmark: Performance (ASV) benchmarks; Categorical: Categorical Data Type; Groupby

Comments

@jreback
Contributor

jreback commented Jan 1, 2018

we have a number of groupby benchmarks with categoricals, but I think we need a comprehensive set to exercise combinations of:

groupby on cat/object columns
cython function (e.g. first/max/...)
.agg variants of cython functions

xref SO

In [4]: import pandas as pd
   ...: import numpy as np
   ...: animals = ['Dog', 'Cat']
   ...: days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday','Saturday']
   ...: N = 1000000
   ...: df = pd.DataFrame({'animals': np.array(animals).take(np.random.randint(0, len(animals), size=N)),
   ...:                    'days': np.array(days).take(np.random.randint(0, len(days), size=N))})
   ...: df2 = df.copy()
   ...: df2['animals'] = df2['animals'].astype('category')
   ...: 
   ...: df3 = df2.copy()
   ...: df3['animals'] = df3['animals'].cat.codes
   ...: 
   ...: # group on an object key; aggregate an object column (df) vs. a cat column (df2)
   ...: print('groupby on object')
   ...: %timeit df.groupby('days').agg({'animals': 'first'})
   ...: %timeit df2.groupby('days').agg({'animals': 'first'})
   ...: 
   ...: 
   ...: # group on object (df) / cat (df2) / integer codes (df3); aggregate the grouping column
   ...: print('groupby on cat / codes / agg')
   ...: %timeit df.groupby('animals').agg({'animals': 'first'})
   ...: %timeit df2.groupby('animals').agg({'animals': 'first'})
   ...: %timeit df3.groupby('animals').agg({'animals': 'first'})
   ...: 
   ...: print('groupby on cat / codes / cython')
   ...: %timeit df2.groupby('animals').first()
   ...: %timeit df3.groupby('animals').first()
   ...: 
[1] groupby on object
270 ms +- 5.22 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
118 ms +- 1.96 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
[2] groupby on cat / codes / agg
147 ms +- 2.53 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
69.1 ms +- 1.56 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
22.2 ms +- 838 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
[3] groupby on cat / codes / cython
156 ms +- 4.32 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
169 ms +- 4.8 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

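so [3] (calling .first() directly: 156 ms and 169 ms) should be as fast as [2] (the same aggregation through .agg: 69.1 ms and 22.2 ms); culprit is here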

we could have multiple PRs to solve this issue (benchmark & perf fix)
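
As a rough illustration of the combinations listed above, here is a minimal sketch of an asv-style parametrized benchmark. The class name, the parameter names, the "last" reduction, the smaller row count, and the explicit observed=True are assumptions made for this sketch and are not taken from the pandas benchmark suite.

```python
import numpy as np
import pandas as pd


class GroupByCategoricalAgg:
    # Hypothetical benchmark class. asv calls setup() and then each
    # time_* method once per combination of the parameters below.
    params = [
        ["object", "category", "codes"],  # representation of the grouping key
        ["first", "last"],                # cython-backed reductions
        ["method", "agg"],                # gb.first() vs. gb.agg({...: "first"})
    ]
    param_names = ["key_dtype", "func", "variant"]

    def setup(self, key_dtype, func, variant):
        n = 100_000  # smaller than the N used in the issue, to keep runs short
        rng = np.random.default_rng(0)
        animals = np.array(["Dog", "Cat"])
        days = np.array(["Sunday", "Monday", "Tuesday", "Wednesday",
                         "Thursday", "Friday", "Saturday"])
        df = pd.DataFrame({
            "animals": animals.take(rng.integers(0, len(animals), size=n)),
            "days": days.take(rng.integers(0, len(days), size=n)),
        })
        if key_dtype in ("category", "codes"):
            df["animals"] = df["animals"].astype("category")
        if key_dtype == "codes":
            # Replace the categorical key with its integer codes.
            df["animals"] = df["animals"].cat.codes
        self.df = df

    def time_groupby(self, key_dtype, func, variant):
        # observed=True is an assumption; it only matters for categorical keys.
        gb = self.df.groupby("animals", observed=True)
        if variant == "method":
            getattr(gb, func)()      # e.g. gb.first()
        else:
            gb.agg({"days": func})   # e.g. gb.agg({"days": "first"})
```

Keeping the key representation, the reduction, and the call style as separate parameters means a regression in one path (for example the direct cython call on a categorical key) shows up as its own row in the asv results.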

jreback added the Benchmark, Categorical, Groupby, and Performance labels Jan 1, 2018
jreback added this to the Next Major Release milestone Jan 1, 2018
@jreback
Contributor Author

jreback commented Jan 1, 2018

cc @mroeschke @TomAugspurger
cc @jakevdp

@TomAugspurger
Contributor

I’ve found that temporarily making __array__ raise an exception is an easy way to figure out which methods are unnecessarily materializing the full array of objects.
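
A minimal sketch of that debugging trick, assuming the method in question is Categorical.__array__. The names MaterializationError, forbid_categorical_materialization, and materializes are hypothetical helpers written for this example, not pandas API. Note that pandas may catch exceptions internally on some code paths, so a False result is only a hint, not proof.

```python
from contextlib import contextmanager

import numpy as np
import pandas as pd


class MaterializationError(RuntimeError):
    """Raised when a Categorical is converted to a full ndarray."""


@contextmanager
def forbid_categorical_materialization():
    # Temporarily replace Categorical.__array__ so that any conversion of a
    # Categorical to a dense ndarray fails loudly, then restore the original.
    original = pd.Categorical.__array__

    def _raise(self, *args, **kwargs):
        raise MaterializationError("Categorical.__array__ was called")

    pd.Categorical.__array__ = _raise
    try:
        yield
    finally:
        pd.Categorical.__array__ = original


def materializes(func):
    # Return True if calling func() triggers Categorical.__array__.
    with forbid_categorical_materialization():
        try:
            func()
        except MaterializationError:
            return True
    return False


cat = pd.Categorical(["Dog", "Cat", "Dog"])
print(materializes(lambda: np.asarray(cat)))  # explicit dense conversion
print(materializes(lambda: cat.codes.sum()))  # touches only the integer codes
```

The same helper could be pointed at a groupby call, e.g. materializes(lambda: df2.groupby('animals').first()), to check whether that path converts the categorical column to objects on a given pandas version.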

@rtlee9
Contributor

rtlee9 commented May 11, 2019

I can add a more comprehensive set of benchmarks for .agg with cython functions on categorical and object groupbys to the asv benchmarks, if no one else is already working on that.

mroeschke removed the Performance label Jun 12, 2021
mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@mroeschke
Member

I think we have these by now, so closing.
