perf: improve user specified flush delay automatically by statistics #159

rueian · 2022-12-14T15:05:17Z

Choose flush delay automatically by statistics from previous flush histories.

It generally achieves better throughput than just using MaxFlushDelay provided by users directly while still keeping reasonable CPU usage.

Benchmark comparison to the old v0.0.89 by using the code https://github.com/FZambia/pipelines:

`PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100`

▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-100us-p1.txt
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-100us-p1.txt
▶ benchstat old-100us-p1.txt new-100us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    12.0µs ± 0%     3.8µs ± 2%  -68.47%  (p=0.000 n=20+20)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%     ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

`PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100`

▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-100us-p128.txt
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-100us-p128.txt
▶ benchstat old-100us-p128.txt new-100us-p128.txt
name        old time/op    new time/op    delta
Rueidis-10     601ns ± 2%     589ns ± 2%  -1.89%  (p=0.000 n=20+19)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%    ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%    ~     (all equal)

`PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20`

▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-20us-p1.txt
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-20us-p1.txt
▶ benchstat old-20us-p1.txt new-20us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    4.43µs ± 1%    3.80µs ± 2%  -14.38%  (p=0.000 n=20+20)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%     ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

`PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20`

▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-20us-p128.txt
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-20us-p128.txt
▶ benchstat old-20us-p128.txt new-20us-p128.txt
name        old time/op    new time/op    delta
Rueidis-10     586ns ± 3%     587ns ± 3%   ~     (p=0.622 n=20+19)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%   ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%   ~     (all equal)

codecov · 2022-12-14T15:35:55Z

Codecov Report

Base: 96.19% // Head: 96.17% // Decreases project coverage by -0.01% ⚠️

Coverage data is based on head (ba3adb3) compared to base (28cad24).
Patch coverage: 73.07% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #159      +/-   ##
==========================================
- Coverage   96.19%   96.17%   -0.02%     
==========================================
  Files          36       36              
  Lines       25382    25401      +19     
==========================================
+ Hits        24415    24429      +14     
- Misses        846      851       +5     
  Partials      121      121

Impacted Files	Coverage Δ
pipe.go	`99.13% <73.07%> (-0.62%)`	⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

rueian · 2022-12-14T16:15:51Z

related to #156

FZambia · 2022-12-14T16:51:45Z

Hello @rueian - this looks interesting, some initial thoughts after looking at PR (have not even run it yet):

Is it possible to describe the logic behind this formula?
I was thinking on a topic a bit – to be fair I am a bit nervous at the moment to enable non-deterministic connection behaviour with auto adjustment – reasonable CPU is a bit subjective matter, so I think automatic adjustment should be optional

FZambia · 2022-12-14T17:11:22Z

Also, I suppose this should be tested with read-intensive commands – in https://github.com/FZambia/pipelines I only tried PUBLISH command (which has a very small response), but probably if client reads a lot from Redis then more smooth write flushes of read-intensive commands may be preferred for CPU - but just a shot in the dark.

FZambia · 2022-12-14T17:48:31Z

Another thought: let's say we have a bigger latency, and set some delay we are comfortable with in the app. As far as I understand this auto adjustment does not take latency into account - it may be that there is no much sense to reduce delay automatically given large RTT time (since it may result in larger CPU on Redis at the end).

BTW, I experimented locally with https://github.com/Shopify/toxiproxy yesterday to add some latency to Redis. But I am on MacOS, for Linux tc may be more obvious choice - https://daniel.haxx.se/blog/2010/12/14/add-latency-to-localhost/.

FZambia · 2022-12-15T15:01:27Z

I tried to experiment a bit. Still - the more I think about this the more I tend to think that automatic adjustment does a bit separate thing: it does not reduce CPU, but goes towards throughput. For example, for parallelism 1 and delay 100 microseconds I got similar results as you, but CPU usage Increased 3x on both app and Redis in case of using automatic adjustment. But I suppose if I wanted to improve throughput - I could just tune the MaxFlushDelay parameter myself as I know my app and its load as a developer better.

This may be a good thing in general, but it feels a completely different mode than we had before with MaxFlushDelay. The point of adding delay in my case was to reduce CPU, but still have acceptable throughput. As application nodes/instances usually scale faster than Redis.

And I still doubt whether Rueidis really needs complex behaviour like this - probably better to drop the idea at all otherwise we risk to get pretty unmanageable behavior. Really worry about the fact that adjustments in runtime may be too tricky to maintain and there are too many corner cases, types/size of data, latency involved.

Maybe I am too conservative in that – but just prefer keeping things as simple as possible.

rueian · 2022-12-15T15:23:15Z

reasonable CPU is a bit subjective matte

Indeed. I use the word reasonable because the CPU usage of this approach is similar to redigo's.

Is it possible to describe the logic behind this formula?

Let's start with the question:

How long should we wait for more commands?

Sure, we could just sleep for MaxFlushDelay specified by the user and have the lowest CPU usage. But it could easily wait too long and result in very poor throughput.

For example, sleeping for 100 microseconds reduced the throughput by ~290%.

▶ benchstat ori-0us-p1.txt old-100us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    3.08µs ± 0%   11.99µs ± 0%  +289.49%  (p=0.000 n=19+20)

To avoid sleeping too long, I would like to make this sleep including the successive flush similar to previous flushes. In other words, I want time(sleep) + time(flush) == avg time(previous flushes).

Assuming the time(flush) is proportional to how many bytes it flushes, then I can have an average flush time of a byte donated as byteWork.

I can also have an approximated average receiving time of a byte, by tracking durations between each explicit flush, donated as byteWait.

Therefore, the original question can be approximated to How many bytes I should wait for to make them take a similar time to average flushes donated as nByte = avg time(previous flushes) / (byteWait + byteWork)

Then the sleeping duration can be nByte * byteWait.

rueian · 2022-12-15T16:18:16Z

But I suppose if I wanted to improve throughput - I could just tune the MaxFlushDelay parameter myself as I know my app and its load as a developer better.

Sure, but there are many cases that their traffic pattern can't be fit to a fixed MaxFlushDelay. Adaptive approach may still be good to have.

The point of adding delay in my case was to reduce CPU, but still have acceptable throughput.

We probably can also have a mode allowing user to set acceptable throughput then trying to reduce CPU usage.

Maybe I am too conservative in that – but just prefer keeping things as simple as possible.

Don't worry. This won't be merged until we find a better solution.

FZambia · 2022-12-16T19:01:48Z

Current MaxFlushDelay, while a bit manual, provides a working understandable way to tradeoff throughput/CPU. And it works deterministically. Maybe selected value is not a very good fit for some setup – but possible to tune. I'd say – it's already like acceptable latency (well, not exactly of course, more like acceptable delay but it seems good enough to solve a problem).

I think you implemented some bright ideas here, but I am not sure at this point that I'd use this mode somehow. Probably it's better to wait for more feedback here, maybe from other users.

FZambia · 2022-12-17T08:28:29Z

Thinking more.. Not sure whether it's good idea or not – can this batching strategy be a pluggable element somehow – just to keep Rueidis core simple yet extensible

rueian · 2022-12-17T15:36:32Z

Thinking more.. Not sure whether it's good idea or not – can this batching strategy be a pluggable element somehow – just to keep Rueidis core simple yet extensible

It is a good idea. Probably something like this func(ctx CurrentFlushContext) time.Duration allows users to choose how long they want to wait based on some information.

rueian changed the title ~~perf: improve user specified flush delay automatically by statistic~~ perf: improve user specified flush delay automatically by statistics Dec 14, 2022

rueian force-pushed the auto-flush-delay branch 3 times, most recently from 08c3c3b to dfa11e7 Compare December 14, 2022 15:29

rueian mentioned this pull request Dec 14, 2022

Migrate to github.com/rueian/rueidis centrifugal/centrifuge#262

Merged

perf: improve user specified flush delay automatically by statistics

ba3adb3

rueian force-pushed the auto-flush-delay branch from dfa11e7 to ba3adb3 Compare December 19, 2022 15:07

rueian force-pushed the main branch 3 times, most recently from d936e9a to 8410f48 Compare July 6, 2023 12:59

rueian force-pushed the main branch 2 times, most recently from fdb6041 to b9b0f4b Compare September 20, 2023 13:29

rueian force-pushed the main branch from deaea1a to 8a61462 Compare October 13, 2024 18:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: improve user specified flush delay automatically by statistics #159

perf: improve user specified flush delay automatically by statistics #159

rueian commented Dec 14, 2022 •

edited

Loading

codecov bot commented Dec 14, 2022 •

edited

Loading

rueian commented Dec 14, 2022

FZambia commented Dec 14, 2022 •

edited

Loading

FZambia commented Dec 14, 2022 •

edited

Loading

FZambia commented Dec 14, 2022 •

edited

Loading

FZambia commented Dec 15, 2022 •

edited

Loading

rueian commented Dec 15, 2022 •

edited

Loading

rueian commented Dec 15, 2022

FZambia commented Dec 16, 2022 •

edited

Loading

FZambia commented Dec 17, 2022

rueian commented Dec 17, 2022

perf: improve user specified flush delay automatically by statistics #159

Are you sure you want to change the base?

perf: improve user specified flush delay automatically by statistics #159

Conversation

rueian commented Dec 14, 2022 • edited Loading

PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100

PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100

PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20

PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20

codecov bot commented Dec 14, 2022 • edited Loading

Codecov Report

rueian commented Dec 14, 2022

FZambia commented Dec 14, 2022 • edited Loading

FZambia commented Dec 14, 2022 • edited Loading

FZambia commented Dec 14, 2022 • edited Loading

FZambia commented Dec 15, 2022 • edited Loading

rueian commented Dec 15, 2022 • edited Loading

rueian commented Dec 15, 2022

FZambia commented Dec 16, 2022 • edited Loading

FZambia commented Dec 17, 2022

rueian commented Dec 17, 2022

rueian commented Dec 14, 2022 •

edited

Loading

`PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100`

`PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100`

`PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20`

`PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20`

codecov bot commented Dec 14, 2022 •

edited

Loading

FZambia commented Dec 14, 2022 •

edited

Loading

FZambia commented Dec 14, 2022 •

edited

Loading

FZambia commented Dec 14, 2022 •

edited

Loading

FZambia commented Dec 15, 2022 •

edited

Loading

rueian commented Dec 15, 2022 •

edited

Loading

FZambia commented Dec 16, 2022 •

edited

Loading