Rethink about vertical scaling based on PreferredMaxReplicas #329

Open
sanposhiho opened this issue Feb 14, 2024 · 6 comments

@sanposhiho
Collaborator

sanposhiho commented Feb 14, 2024

https://github.com/mercari/tortoise/blob/main/pkg/recommender/recommender.go#L185-L196

To prevent the deployment from creating too many, too-small replicas, Tortoise has a feature where, when the replica number goes above 30 (this threshold is configurable via PreferredMaxReplicas), it tries to make Pods vertically bigger.

With the current implementation, we simply keep applying resourceRequest.MilliValue() * 1.1 until the replica number goes below 30, and it works to some extent. But it keeps recreating Pods many times, which is not great, because vertical scaling requires restarting Pods.
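
For reference, here is a minimal sketch of that current strategy; the constant and function names are illustrative, not the actual identifiers in pkg/recommender/recommender.go:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Illustrative stand-ins for the configured threshold and growth factor;
// the real values live in the Tortoise recommender configuration.
const (
	preferredMaxReplicas = 30
	growthFactor         = 1.1
)

// recommendRequest mirrors the behavior described above: while the replica
// count stays above PreferredMaxReplicas, bump the resource request by 10%
// on every reconciliation.
func recommendRequest(currentReplicas int32, req resource.Quantity) resource.Quantity {
	if currentReplicas <= preferredMaxReplicas {
		return req // replicas are under control; keep the request as-is
	}
	grown := int64(float64(req.MilliValue()) * growthFactor)
	return *resource.NewMilliQuantity(grown, req.Format)
}

func main() {
	req := resource.MustParse("500m")
	fmt.Println(recommendRequest(45, req).String()) // 550m
}
```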

We should consider another way to achieve this that works better but stays as simple as the current strategy.

@sanposhiho sanposhiho added kind/feature New feature or request priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Feb 14, 2024
@sanposhiho sanposhiho self-assigned this Feb 14, 2024
@sanposhiho
Collaborator Author

For now, I'll implement a feature flag in the Tortoise controller so that we can temporarily disable this feature.

@sanposhiho
Collaborator Author

Implemented the alpha VerticalScalingBasedOnPreferredMaxReplicas feature gate, disabled by default.
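
(For illustration, a hypothetical sketch of what consulting such a gate could look like; the actual flag plumbing in the controller may differ:)

```go
// Hypothetical feature-gate table; not the controller's real wiring.
var featureGates = map[string]bool{
	// Alpha: disabled unless explicitly enabled in the controller config.
	"VerticalScalingBasedOnPreferredMaxReplicas": false,
}

// verticalScalingOnPreferredMaxReplicas reports whether the vertical
// scaling behavior described in this issue should run at all.
func verticalScalingOnPreferredMaxReplicas() bool {
	return featureGates["VerticalScalingBasedOnPreferredMaxReplicas"]
}
```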

@lchavey

lchavey commented Feb 15, 2024

Could we adjust the rate of scaling using something similar to TCP exponential backoff, or TCP window scaling as used for slow start, to increase the vertical scaling?
(https://en.wikipedia.org/wiki/Exponential_backoff). For Tortoise, the delay from TCP exponential backoff, or the window increase, becomes the vertical scaling factor.

This could reduce the number of scaling iterations at the cost of some over-scaling.
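
As an illustration of the idea, a sketch assuming a hypothetical counter of consecutive scale-up rounds (none of these names exist in Tortoise):

```go
package main

import (
	"fmt"
	"math"
)

// exponentialGrowthFactor grows like TCP backoff: the exponent doubles on
// each consecutive round the replica count stays above PreferredMaxReplicas,
// giving factors of 1.1, 1.21, ~1.46, ... capped at maxFactor to limit
// over-provisioning.
func exponentialGrowthFactor(consecutiveScaleUps int, maxFactor float64) float64 {
	factor := math.Pow(1.1, math.Pow(2, float64(consecutiveScaleUps-1)))
	return math.Min(factor, maxFactor)
}

func main() {
	for round := 1; round <= 4; round++ {
		fmt.Printf("round %d: factor %.4f\n", round, exponentialGrowthFactor(round, 2.0))
	}
	// round 1: 1.1000, round 2: 1.2100, round 3: 1.4641, round 4: 2.0000 (capped)
}
```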

@harpratap
Contributor

@sanposhiho Could you please elaborate more on this part?

But it keeps recreating Pods many times, which is not great, because vertical scaling requires restarting Pods.

How frequent is too frequent? Based on this, we can probably try exponential backoff like @lchavey suggested, but we will need to handle some edge cases: if the backoff window is too long, it will cause Pods to crash and throttle.

@sanposhiho
Collaborator Author

Currently, each Tortoise is reconciled every 15s, meaning Tortoise keeps restarting (scaling up) Pods every 15s until the replica number goes below 30, which is obviously too frequent.

Yup, "exponential backoff" would be a good idea to try out. Actually, the delay of this vertical scaling doesn't cause any problem on services because HPA still keeps increasing the replica in case of CPU utilization reaches the threshold of threshold. If vertical scaling up is too late, we can modify the factor from 1.1 to something bigger.

@lchavey

lchavey commented Feb 17, 2024

Sorry, I may have misunderstood the original post.

it tries to make Pods vertically bigger.
With the current implementation, we simply keep applying resourceRequest.MilliValue() * 1.1

I was reading this as: we increase the vertical size by a factor of 1.1 each time. So I was thinking of using "exponential scaling" for the vertical factor itself.

This got me thinking that we could use both (time and scale).
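
For concreteness, a self-contained sketch of that combined idea (every name and value here is hypothetical): the window between restarts grows exponentially, and so does the factor applied when a restart finally happens.

```go
package main

import (
	"fmt"
	"math"
	"time"

	"k8s.io/apimachinery/pkg/api/resource"
)

const (
	reconcileInterval = 15 * time.Second
	maxBackoff        = 10 * time.Minute
	maxFactor         = 2.0 // never more than double the request in one step
)

// nextRecommendation combines time and scale: it skips the bump while the
// exponentially growing backoff window is still open, and when it does
// scale, it uses an exponentially growing factor instead of a flat 1.1.
func nextRecommendation(req resource.Quantity, attempt int, lastScaleUp, now time.Time) (resource.Quantity, bool) {
	backoff := reconcileInterval << attempt
	if backoff <= 0 || backoff > maxBackoff {
		backoff = maxBackoff
	}
	if now.Sub(lastScaleUp) < backoff {
		return req, false // inside the backoff window: no restart this round
	}
	factor := math.Min(math.Pow(1.1, math.Pow(2, float64(attempt))), maxFactor)
	grown := int64(float64(req.MilliValue()) * factor)
	return *resource.NewMilliQuantity(grown, req.Format), true
}

func main() {
	req := resource.MustParse("500m")
	next, scaled := nextRecommendation(req, 0, time.Now().Add(-time.Minute), time.Now())
	fmt.Println(next.String(), scaled) // 550m true: first round still uses the 1.1 factor
}
```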

@sanposhiho sanposhiho removed the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 26, 2024