benchmark #4

Open

kaixiongg opened this issue Mar 28, 2024 · 5 comments

kaixiongg commented Mar 28, 2024

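Are there any benchmarks comparing individual functions (time-series and cross-sectional operators) in your polars implementation against the equivalents in other Python libraries?

I tried it myself, and the performance gap against bottleneck's move_xxx functions seems fairly large, though I'm not sure my test setup is right. If convenient, could you add single-operator benchmarks for the cross-sectional and time-series operators? Thanks.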

Owner

wukan1986 commented Mar 28, 2024

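It would be best if polars provided a native rolling_rank. Unfortunately I don't know Rust, and the polars developers are focused on more important things, so for now I can only substitute other approaches.

The most important factor for speed is reducing the number of cross-language calls. For example, the advantage of move_rank is that it supports 2D input: a close-price matrix for 5000 stocks needs only one call, which is of course the fastest. But the 2D layout cannot handle trading suspensions. The ta_cn project has a solution for suspensions in 2D, but it is cumbersome.

The other approach is to fall back to 1D and use groupby to handle the second dimension. This is inevitably slower: 5000 stocks means 5000 calls.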

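On choosing between bottleneck and pandas to implement rolling_rank:

- bottleneck has been unmaintained for a long time
- pandas uses a SkipList, so rolling().rank() is not slow

So in the end I call the corresponding pandas function directly.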
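As an illustration of the pandas route described above, here is a minimal sketch of a rolling rank built on `Series.rolling(...).rank()` (available in pandas 1.4 and later). The helper name `rolling_rank` and the sample data are hypothetical, and this is not necessarily how the project wraps the call.

```python
import numpy as np
import pandas as pd


def rolling_rank(values: np.ndarray, window: int) -> np.ndarray:
    """Rank of the newest value inside each trailing window (hypothetical helper)."""
    s = pd.Series(values)
    # The first `window - 1` positions have no full window and come back as NaN.
    return s.rolling(window).rank().to_numpy()


x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
ranks = rolling_rank(x, 3)
print(ranks)
```

Each output value is the rank (1 = smallest) of the current element among the last `window` elements.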

Author

kaixiongg commented Mar 28, 2024

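Thanks for the reply. If convenient, could we connect on WeChat: kaiwnd111
From my tests, leaving complex time-series operators aside, even a simple one like rolling_sum seems much slower as a single operator. I'm not sure whether I'm using it incorrectly.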

Owner

wukan1986 commented May 11, 2024

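When computing time-series operators over multiple assets you generally need group_by, which can be very slow, but sorting beforehand helps a lot.

df = df.sort("asset")
df = df.group_by("asset").map_groups(func)  # func: placeholder for the per-asset function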

Author

kaixiongg commented Jun 3, 2024 (edited)

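I don't have multiple assets either. I'm only testing how the rolling operators perform on a plain 1D array (purely comparing the performance of the lowest-level operators).

The code is as follows:
import polars as pl
import numpy as np
import bottleneck as bn

length = 100000
n_time = 20

np.random.seed(0)
data1 = np.random.rand(length)

df = pl.DataFrame({
    "data1": data1
})

# polars: rolling max over a window of n_time rows
result_pl = df.with_columns(
    rolling_max=pl.col("data1").rolling_max(window_size=n_time),
)

# bottleneck: the same rolling max on the raw NumPy array
result_bn = bn.move_max(data1, n_time)

In this test polars comes out about five times slower than bn...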

Owner

wukan1986 commented Jun 3, 2024

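polars only solves the parallel-computation problem; the algorithm inside each function has not been optimized, possibly because of limited manpower.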
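To make concrete what a per-function algorithmic optimization can look like, below is a sketch of one well-known technique for rolling maximum: a monotonic deque that does amortized O(1) work per element instead of rescanning each window. This is an illustration only; it makes no claim about how polars or bottleneck implement their rolling max internally.

```python
import math
from collections import deque


def rolling_max(values, window):
    """Rolling maximum in O(n) total time using a monotonic deque of indices."""
    out = []
    candidates = deque()  # indices whose values are strictly decreasing
    for i, v in enumerate(values):
        # Drop candidates that can never be the max again because v dominates them.
        while candidates and values[candidates[-1]] <= v:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it has slid out of the window.
        if candidates[0] <= i - window:
            candidates.popleft()
        # Emit NaN until the first full window is available.
        out.append(values[candidates[0]] if i >= window - 1 else math.nan)
    return out


result = rolling_max([1, 3, 2, 5, 4, 1, 0], 3)
print(result)  # → [nan, nan, 3, 5, 5, 5, 4]
```

Each index is pushed and popped at most once, which is why the total cost stays linear regardless of the window size.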
