Rethink about vertical scaling based on PreferredMaxReplicas #329

Open
sanposhiho opened this issue Feb 14, 2024 · 6 comments

@sanposhiho
Collaborator

sanposhiho commented Feb 14, 2024

https://github.com/mercari/tortoise/blob/main/pkg/recommender/recommender.go#L185-L196

To prevent the deployment from creating too many, too-small replicas, Tortoise has a feature where, when the replica number goes above 30 (this threshold is configurable via PreferredMaxReplicas), it tries to make Pods vertically bigger.

With the current implementation, we simply keep applying resourceRequest.MilliValue() * 1.1 until the replica number goes below 30, and it works to some extent. But it keeps recreating Pods many times, which is not great, because vertical scaling requires restarting Pods.
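
For reference, here is a minimal sketch of that current strategy; the constant and function names are illustrative, not the actual identifiers in pkg/recommender/recommender.go:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative stand-ins for the configured threshold and growth factor;
// the real values live in the Tortoise recommender configuration.
const (
	preferredMaxReplicas = 30
	growthFactor         = 1.1
)

// recommendRequest mirrors the behavior described above: while the replica
// count stays above PreferredMaxReplicas, bump the resource request by 10%
// on every reconciliation.
func recommendRequest(currentReplicas int32, req resource.Quantity) resource.Quantity {
	if currentReplicas <= preferredMaxReplicas {
		return req // replicas are under control; keep the request as-is
	}
	grown := int64(float64(req.MilliValue()) * growthFactor)
	return *resource.NewMilliQuantity(grown, req.Format)
}

func main() {
	req := resource.MustParse("500m")
	fmt.Println(recommendRequest(45, req).String()) // 550m
}
```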

We should consider another way to achieve this that works better but stays as simple as the current strategy.

@sanposhiho sanposhiho added kind/feature New feature or request priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Feb 14, 2024
@sanposhiho sanposhiho self-assigned this Feb 14, 2024
@sanposhiho
Collaborator Author

For now, I'll implement a feature flag in the Tortoise controller so that we can temporarily disable this feature.

@sanposhiho
Collaborator Author

Implemented the alpha VerticalScalingBasedOnPreferredMaxReplicas feature gate, disabled by default.
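
(For illustration, a hypothetical sketch of what consulting such a gate could look like; the actual flag plumbing in the controller may differ:)

```go
// Hypothetical feature-gate table; not the controller's real wiring.
var featureGates = map[string]bool{
	// Alpha: disabled unless explicitly enabled in the controller config.
	"VerticalScalingBasedOnPreferredMaxReplicas": false,
}

// verticalScalingOnPreferredMaxReplicas reports whether the vertical
// scaling behavior described in this issue should run at all.
func verticalScalingOnPreferredMaxReplicas() bool {
	return featureGates["VerticalScalingBasedOnPreferredMaxReplicas"]
}
```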

@lchavey

lchavey commented Feb 15, 2024

Could we adjust the rate of scaling using something similar to TCP exponential backoff, or TCP window scaling as used for slow start, to increase the vertical scaling?
(https://en.wikipedia.org/wiki/Exponential_backoff). For Tortoise, the delay from TCP exponential backoff, or the window increase, becomes the vertical scaling factor.

This could reduce the number of scaling iterations at the cost of some over-scaling.
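
As an illustration of the idea, a sketch assuming a hypothetical counter of consecutive scale-up rounds (none of these names exist in Tortoise):

```go
package main

import (
	"fmt"
	"math"
)

// exponentialGrowthFactor grows like TCP backoff: the exponent doubles on
// each consecutive round the replica count stays above PreferredMaxReplicas,
// giving factors of 1.1, 1.21, ~1.46, ... capped at maxFactor to limit
// over-provisioning.
func exponentialGrowthFactor(consecutiveScaleUps int, maxFactor float64) float64 {
	factor := math.Pow(1.1, math.Pow(2, float64(consecutiveScaleUps-1)))
	return math.Min(factor, maxFactor)
}

func main() {
	for round := 1; round <= 4; round++ {
		fmt.Printf("round %d: factor %.4f\n", round, exponentialGrowthFactor(round, 2.0))
	}
	// round 1: 1.1000, round 2: 1.2100, round 3: 1.4641, round 4: 2.0000 (capped)
}
```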

@harpratap
Contributor

@sanposhiho Could you please elaborate more on this part?

But it keeps recreating Pods many times, which is not great, because vertical scaling requires restarting Pods.

How frequent is too frequent? Based on this, we can probably try exponential backoff like @lchavey suggested, but we will need to handle some edge cases: if the backoff window is too long, it will cause Pods to crash and throttle.

@sanposhiho
Collaborator Author

Currently, each Tortoise is reconciled every 15s, meaning Tortoise keeps restarting (scaling up) Pods every 15s until the replica number goes below 30, which is obviously too frequent.

Yup, "exponential backoff" would be a good idea to try out. Actually, the delay of this vertical scaling doesn't cause any problem on services because HPA still keeps increasing the replica in case of CPU utilization reaches the threshold of threshold. If vertical scaling up is too late, we can modify the factor from 1.1 to something bigger.

@lchavey

lchavey commented Feb 17, 2024

Sorry, I may have misunderstood the original post.

it tries to make Pods vertically bigger.
With the current implementation, we simply keep applying resourceRequest.MilliValue() * 1.1

I was reading this as: we increase the vertical size by a factor of 1.1 each time. So I was thinking of using "exponential scaling" for the vertical factor itself.

This got me thinking that we could use both (time and scale).
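
For concreteness, a self-contained sketch of that combined idea (every name and value here is hypothetical): the window between restarts grows exponentially, and so does the factor applied when a restart finally happens.

```go
package main

import (
	"fmt"
	"math"
	"time"

	"k8s.io/apimachinery/pkg/api/resource"
)

const (
	reconcileInterval = 15 * time.Second
	maxBackoff        = 10 * time.Minute
	maxFactor         = 2.0 // never more than double the request in one step
)

// nextRecommendation combines time and scale: it skips the bump while the
// exponentially growing backoff window is still open, and when it does
// scale, it uses an exponentially growing factor instead of a flat 1.1.
func nextRecommendation(req resource.Quantity, attempt int, lastScaleUp, now time.Time) (resource.Quantity, bool) {
	backoff := reconcileInterval << attempt
	if backoff <= 0 || backoff > maxBackoff {
		backoff = maxBackoff
	}
	if now.Sub(lastScaleUp) < backoff {
		return req, false // inside the backoff window: no restart this round
	}
	factor := math.Min(math.Pow(1.1, math.Pow(2, float64(attempt))), maxFactor)
	grown := int64(float64(req.MilliValue()) * factor)
	return *resource.NewMilliQuantity(grown, req.Format), true
}

func main() {
	req := resource.MustParse("500m")
	next, scaled := nextRecommendation(req, 0, time.Now().Add(-time.Minute), time.Now())
	fmt.Println(next.String(), scaled) // 550m true: first round still uses the 1.1 factor
}
```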

@sanposhiho sanposhiho removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 26, 2024