Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: improve user specified flush delay automatically by statistics #159

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

rueian
Copy link
Collaborator

@rueian rueian commented Dec 14, 2022

Choose flush delay automatically by statistics from previous flush histories.

It generally achieves better throughput than just using MaxFlushDelay provided by users directly while still keeping reasonable CPU usage.


Benchmark comparison to the old v0.0.89 by using the code https://github.com/FZambia/pipelines:

PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100

▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-100us-p1.txt
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-100us-p1.txt
▶ benchstat old-100us-p1.txt new-100us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    12.0µs ± 0%     3.8µs ± 2%  -68.47%  (p=0.000 n=20+20)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%     ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100

▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-100us-p128.txt
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=100 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-100us-p128.txt
▶ benchstat old-100us-p128.txt new-100us-p128.txt
name        old time/op    new time/op    delta
Rueidis-10     601ns ± 2%     589ns ± 2%  -1.89%  (p=0.000 n=20+19)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%    ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%    ~     (all equal)

PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20

▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-20us-p1.txt
▶ PIPE_PARALLELISM=1 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-20us-p1.txt
▶ benchstat old-20us-p1.txt new-20us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    4.43µs ± 1%    3.80µs ± 2%  -14.38%  (p=0.000 n=20+20)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%     ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20

▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > old-20us-p128.txt
▶ PIPE_PARALLELISM=128 PIPE_MAX_FLUSH_DELAY=20 go test -run xxx -bench BenchmarkRueidis -benchtime=2s -count 20 > new-20us-p128.txt
▶ benchstat old-20us-p128.txt new-20us-p128.txt
name        old time/op    new time/op    delta
Rueidis-10     586ns ± 3%     587ns ± 3%   ~     (p=0.622 n=20+19)

name        old alloc/op   new alloc/op   delta
Rueidis-10     80.0B ± 0%     80.0B ± 0%   ~     (all equal)

name        old allocs/op  new allocs/op  delta
Rueidis-10      1.00 ± 0%      1.00 ± 0%   ~     (all equal)

@rueian rueian changed the title perf: improve user specified flush delay automatically by statistic perf: improve user specified flush delay automatically by statistics Dec 14, 2022
@rueian rueian force-pushed the auto-flush-delay branch 3 times, most recently from 08c3c3b to dfa11e7 Compare December 14, 2022 15:29
@codecov
Copy link

codecov bot commented Dec 14, 2022

Codecov Report

Base: 96.19% // Head: 96.17% // Decreases project coverage by -0.01% ⚠️

Coverage data is based on head (ba3adb3) compared to base (28cad24).
Patch coverage: 73.07% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #159      +/-   ##
==========================================
- Coverage   96.19%   96.17%   -0.02%     
==========================================
  Files          36       36              
  Lines       25382    25401      +19     
==========================================
+ Hits        24415    24429      +14     
- Misses        846      851       +5     
  Partials      121      121              
Impacted Files Coverage Δ
pipe.go 99.13% <73.07%> (-0.62%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@rueian
Copy link
Collaborator Author

rueian commented Dec 14, 2022

related to #156

@FZambia
Copy link
Contributor

FZambia commented Dec 14, 2022

Hello @rueian - this looks interesting, some initial thoughts after looking at PR (have not even run it yet):

  • Is it possible to describe the logic behind this formula?
  • I was thinking on a topic a bit – to be fair I am a bit nervous at the moment to enable non-deterministic connection behaviour with auto adjustment – reasonable CPU is a bit subjective matter, so I think automatic adjustment should be optional

@FZambia
Copy link
Contributor

FZambia commented Dec 14, 2022

Also, I suppose this should be tested with read-intensive commands – in https://github.com/FZambia/pipelines I only tried PUBLISH command (which has a very small response), but probably if client reads a lot from Redis then more smooth write flushes of read-intensive commands may be preferred for CPU - but just a shot in the dark.

@FZambia
Copy link
Contributor

FZambia commented Dec 14, 2022

Another thought: let's say we have a bigger latency, and set some delay we are comfortable with in the app. As far as I understand this auto adjustment does not take latency into account - it may be that there is no much sense to reduce delay automatically given large RTT time (since it may result in larger CPU on Redis at the end).

BTW, I experimented locally with https://github.com/Shopify/toxiproxy yesterday to add some latency to Redis. But I am on MacOS, for Linux tc may be more obvious choice - https://daniel.haxx.se/blog/2010/12/14/add-latency-to-localhost/.

@FZambia
Copy link
Contributor

FZambia commented Dec 15, 2022

I tried to experiment a bit. Still - the more I think about this the more I tend to think that automatic adjustment does a bit separate thing: it does not reduce CPU, but goes towards throughput. For example, for parallelism 1 and delay 100 microseconds I got similar results as you, but CPU usage Increased 3x on both app and Redis in case of using automatic adjustment. But I suppose if I wanted to improve throughput - I could just tune the MaxFlushDelay parameter myself as I know my app and its load as a developer better.

This may be a good thing in general, but it feels a completely different mode than we had before with MaxFlushDelay. The point of adding delay in my case was to reduce CPU, but still have acceptable throughput. As application nodes/instances usually scale faster than Redis.

And I still doubt whether Rueidis really needs complex behaviour like this - probably better to drop the idea at all otherwise we risk to get pretty unmanageable behavior. Really worry about the fact that adjustments in runtime may be too tricky to maintain and there are too many corner cases, types/size of data, latency involved.

Maybe I am too conservative in that – but just prefer keeping things as simple as possible.

@rueian
Copy link
Collaborator Author

rueian commented Dec 15, 2022

reasonable CPU is a bit subjective matte

Indeed. I use the word reasonable because the CPU usage of this approach is similar to redigo's.

Is it possible to describe the logic behind this formula?

Let's start with the question:

How long should we wait for more commands?

Sure, we could just sleep for MaxFlushDelay specified by the user and have the lowest CPU usage. But it could easily wait too long and result in very poor throughput.

For example, sleeping for 100 microseconds reduced the throughput by ~290%.

▶ benchstat ori-0us-p1.txt old-100us-p1.txt
name        old time/op    new time/op    delta
Rueidis-10    3.08µs ± 0%   11.99µs ± 0%  +289.49%  (p=0.000 n=19+20)

To avoid sleeping too long, I would like to make this sleep including the successive flush similar to previous flushes. In other words, I want time(sleep) + time(flush) == avg time(previous flushes).

Assuming the time(flush) is proportional to how many bytes it flushes, then I can have an average flush time of a byte donated as byteWork.

I can also have an approximated average receiving time of a byte, by tracking durations between each explicit flush, donated as byteWait.

Therefore, the original question can be approximated to How many bytes I should wait for to make them take a similar time to average flushes donated as nByte = avg time(previous flushes) / (byteWait + byteWork)

Then the sleeping duration can be nByte * byteWait.

@rueian
Copy link
Collaborator Author

rueian commented Dec 15, 2022

But I suppose if I wanted to improve throughput - I could just tune the MaxFlushDelay parameter myself as I know my app and its load as a developer better.

Sure, but there are many cases that their traffic pattern can't be fit to a fixed MaxFlushDelay. Adaptive approach may still be good to have.

The point of adding delay in my case was to reduce CPU, but still have acceptable throughput.

We probably can also have a mode allowing user to set acceptable throughput then trying to reduce CPU usage.

Maybe I am too conservative in that – but just prefer keeping things as simple as possible.

Don't worry. This won't be merged until we find a better solution.

@FZambia
Copy link
Contributor

FZambia commented Dec 16, 2022

Current MaxFlushDelay, while a bit manual, provides a working understandable way to tradeoff throughput/CPU. And it works deterministically. Maybe selected value is not a very good fit for some setup – but possible to tune. I'd say – it's already like acceptable latency (well, not exactly of course, more like acceptable delay but it seems good enough to solve a problem).

I think you implemented some bright ideas here, but I am not sure at this point that I'd use this mode somehow. Probably it's better to wait for more feedback here, maybe from other users.

@FZambia
Copy link
Contributor

FZambia commented Dec 17, 2022

Thinking more.. Not sure whether it's good idea or not – can this batching strategy be a pluggable element somehow – just to keep Rueidis core simple yet extensible

@rueian
Copy link
Collaborator Author

rueian commented Dec 17, 2022

Thinking more.. Not sure whether it's good idea or not – can this batching strategy be a pluggable element somehow – just to keep Rueidis core simple yet extensible

It is a good idea. Probably something like this func(ctx CurrentFlushContext) time.Duration allows users to choose how long they want to wait based on some information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants