feat: Discover Instance Type Capacity Memory Overhead #7004

Open · wants to merge 2 commits into base: main
Conversation

@jukie jukie commented Sep 12, 2024

Fixes #5161

Description
This adds a new controller that manages a cache on the instancetype provider, storing the memory capacity overhead for each instance type by comparing the advertised capacity against the actual value reported once a Kubernetes Node registers with the cluster.
NewInstanceType() then consults this cache when calculating memory capacity. If a cached value doesn't exist, it falls back to the existing vmMemoryOverheadPercent logic.

The cache is keyed by a combination of the instance type name and a hash of nodeClass.Status.AMIs, so cached values should always be accurate: when an AMI update is triggered, the key changes and the existing vmMemoryOverheadPercent calculation is used instead, ensuring safe instance creation every time.

How was this change tested?
Suite tests

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened:
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jukie jukie requested a review from a team as a code owner September 12, 2024 22:06
@jukie jukie changed the title WIP: Add memory overhead tracking per nodeclass and ami family Draft: Add memory overhead tracking per nodeclass and ami family Sep 12, 2024

netlify bot commented Sep 12, 2024

Deploy Preview for karpenter-docs-prod canceled.

🔨 Latest commit: 065f328
🔍 Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/66faef18e9c75d000842516a

@jukie jukie changed the title Draft: Add memory overhead tracking per nodeclass and ami family Draft: feat: Discover Instance Type Capacity Memory Overhead Sep 12, 2024
@jukie jukie changed the title Draft: feat: Discover Instance Type Capacity Memory Overhead [DRAFT]: feat: Discover Instance Type Capacity Memory Overhead Sep 12, 2024
@jukie jukie marked this pull request as draft September 12, 2024 23:53
@jukie jukie changed the title [DRAFT]: feat: Discover Instance Type Capacity Memory Overhead [DRAFT] feat: Discover Instance Type Capacity Memory Overhead Sep 12, 2024
@jukie jukie changed the title [DRAFT] feat: Discover Instance Type Capacity Memory Overhead feat: Discover Instance Type Capacity Memory Overhead Sep 14, 2024
@jukie jukie marked this pull request as ready for review September 14, 2024 08:30
@njtran (Contributor) left a comment:

Thanks for the PR! Looks really good!

pkg/providers/instancetype/instancetype.go (resolved)

    nodeClassMap := lo.Associate(nodeClassList.Items, func(nodeClass v1.EC2NodeClass) (string, v1.EC2NodeClass) {
        return nodeClass.Hash(), nodeClass
    })

Contributor:
Not sure you want to do the hash, as that encompasses more details than we care about for resource allocation.

@jukie (Author) Sep 26, 2024:

Further down there's a match against this map and the node's nodeclass hash annotation, so we're only storing cache values for the latest version of the nodeclass. Would you instead prefer keying this map on the nodeclass name, and then, during the node loop, checking that the node's AMI is within the nodeclass status.amis list?
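The alternative jukie floats above (key by nodeClass name, and during the node loop only record entries for nodes whose AMI is still in status.amis) could be sketched roughly as follows. The type and field names are illustrative assumptions, not Karpenter's actual API types.

```go
package main

import "fmt"

// ec2NodeClass is a hypothetical, stripped-down stand-in for the real
// EC2NodeClass: just a name and the AMIs currently listed in its status.
type ec2NodeClass struct {
	Name       string
	StatusAMIs []string
}

// amiStillCurrent reports whether the AMI a node was launched from is still
// present in the nodeClass's status.amis list; if not, the overhead
// observation for that node would be skipped rather than cached.
func amiStillCurrent(nodeAMI string, nc ec2NodeClass) bool {
	for _, a := range nc.StatusAMIs {
		if a == nodeAMI {
			return true
		}
	}
	return false
}

func main() {
	nc := ec2NodeClass{Name: "default", StatusAMIs: []string{"ami-0abc", "ami-0def"}}
	fmt.Println(amiStillCurrent("ami-0abc", nc)) // launched from a current AMI: cacheable
	fmt.Println(amiStillCurrent("ami-0old", nc)) // stale AMI: skip caching for this node
}
```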

pkg/providers/instancetype/instancetype.go (outdated, resolved)
@jukie commented Sep 26, 2024

CC @jmdeal (I think you were the one on the call today?) There's a negligible amount of variance when measuring overhead in Ki, but I was unable to find any variance for the same instance type when measuring in Mi, which is what my PR does.

Here's a simple script that can be used to verify: https://gist.github.com/jukie/df045af8fec68941f5d119044bf04aee

Instance Type: m6i.24xlarge
Variance detected in memory capacities:
 - 389990024
 - 389990016
 - 389990004
 - 389989996
 - 389990000
 - 389990004
 - 389989988
 - 389990000
 - 389990012
 - 389990012
 - 389989988
 - 389989980
 - 389990020
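The claim above can be checked directly against the sampled capacities: assuming the figures are in KiB (the units aren't stated in the output), they span only a few dozen KiB and all truncate to the same whole-MiB value.

```go
package main

import "fmt"

func main() {
	// The m6i.24xlarge capacity samples quoted above, assumed to be in KiB.
	samplesKi := []int64{
		389990024, 389990016, 389990004, 389989996, 389990000, 389990004,
		389989988, 389990000, 389990012, 389990012, 389989988, 389989980,
		389990020,
	}
	distinctMi := map[int64]bool{}
	for _, s := range samplesKi {
		distinctMi[s/1024] = true // integer division truncates to whole MiB
	}
	fmt.Println(len(distinctMi)) // prints 1: no variance at MiB granularity
}
```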

@jukie jukie requested a review from njtran September 26, 2024 22:11
Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent
2 participants