feat: Discover Instance Type Memory Capacity #7004

jukie · 2024-09-12T22:06:11Z

Description
This adds a new controller that manages a cache on the instancetype provider. The cache stores the memory capacity overhead for each instance type, derived from the actual value reported after a Kubernetes Node registers with the cluster.
NewInstanceType() then consults the cache when calculating memory capacity. If a cached value doesn't exist, it falls back to the existing vmMemoryOverheadPercent logic.
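
For reviewers, here is a rough sketch of the lookup-with-fallback behavior described above. It is not the code in this PR: `memoryMiB`, the plain map standing in for the cache, the example keys, and the 0.075 overhead value are all assumptions made for illustration, and everything is simplified to whole Mi.

```go
package main

import (
	"fmt"
	"math"
)

// memoryMiB returns the memory capacity (in Mi) to use for an instance type.
// discovered is a hypothetical stand-in for the cache described above; the
// real cache type and the signature of NewInstanceType() are not shown here.
func memoryMiB(discovered map[string]int64, key string, advertisedMiB int64, vmMemoryOverheadPercent float64) int64 {
	// Prefer a value that was discovered from a registered Node.
	if v, ok := discovered[key]; ok {
		return v
	}
	// Otherwise fall back to subtracting a percentage-based overhead,
	// rounded up so the estimate stays on the conservative side.
	overhead := int64(math.Ceil(float64(advertisedMiB) * vmMemoryOverheadPercent))
	return advertisedMiB - overhead
}

func main() {
	// Hypothetical cache contents and keys.
	discovered := map[string]int64{"m6i.24xlarge-abc123": 380849}

	// Cache hit: the discovered value is used as-is.
	fmt.Println(memoryMiB(discovered, "m6i.24xlarge-abc123", 393216, 0.075))
	// Cache miss: the percentage-based estimate is used instead.
	fmt.Println(memoryMiB(discovered, "m6i.24xlarge-def456", 393216, 0.075))
}
```

The point of the fallback is that a cache miss never blocks instance creation; it only means the more conservative estimate is used until a Node of that type has registered.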

The cache is keyed by a combination of the instance type name and a hash of nodeClass.Status.AMIs, so cached values should always be accurate. When an AMI update is triggered, the key changes and the existing vmMemoryOverheadPercent calculation is used again, which ensures safe instance creation every time.
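
To illustrate the keying idea, here is a minimal sketch assuming the AMIs can be reduced to a list of ID strings. The `cacheKey` helper, the FNV-1a hash, and the exact key layout are assumptions for this example and are not taken from the PR; only the "instance type name plus base16 hash of the AMIs" shape comes from the description and commit list.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// cacheKey builds a key from the instance type name and a hex (base16)
// hash of the AMI IDs. Hypothetical helper: the PR hashes
// nodeClass.Status.AMIs, whose exact hashing is not reproduced here.
func cacheKey(instanceType string, amiIDs []string) string {
	// Sort a copy so the key does not depend on the order of the AMIs.
	sorted := append([]string(nil), amiIDs...)
	sort.Strings(sorted)

	h := fnv.New64a()
	for _, id := range sorted {
		h.Write([]byte(id))
		h.Write([]byte{0}) // separator so "ab","c" differs from "a","bc"
	}
	return fmt.Sprintf("%s-%016x", instanceType, h.Sum64())
}

func main() {
	before := cacheKey("m6i.24xlarge", []string{"ami-aaa", "ami-bbb"})
	after := cacheKey("m6i.24xlarge", []string{"ami-ccc"})

	// A new set of AMIs yields a new key, so the old entry is simply
	// never looked up again and the fallback logic applies until a Node
	// built from the new AMI registers.
	fmt.Println(before)
	fmt.Println(after)
	fmt.Println(before != after)
}
```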

How was this change tested?
Suite tests and live environment

Does this change impact docs?

Yes, PR includes docs updates
Yes, issue opened:
No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

netlify · 2024-09-12T22:06:26Z

✅ Deploy Preview for karpenter-docs-prod ready!

Name	Link
🔨 Latest commit	`90d9c55`
🔍 Latest deploy log	https://app.netlify.com/sites/karpenter-docs-prod/deploys/67107692a7a1e80008f75dd1
😎 Deploy Preview	https://deploy-preview-7004--karpenter-docs-prod.netlify.app

njtran

Thanks for the PR! Looks really good!

pkg/controllers/providers/instancetype/memoryoverhead/controller.go

pkg/providers/instancetype/instancetype.go

jukie · 2024-09-26T22:08:10Z

CC @jmdeal (I think you were the one on the call today?). There's a negligible amount of variance when measuring overhead in Ki, but I was unable to find any variance within the same instance type when measuring in Mi, which is what my PR does.

Here's a simple script that can be used to verify: https://gist.github.com/jukie/df045af8fec68941f5d119044bf04aee

Instance Type: m6i.24xlarge
Variance detected in memory capacities:
 - 389990024
 - 389990016
 - 389990004
 - 389989996
 - 389990000
 - 389990004
 - 389989988
 - 389990000
 - 389990012
 - 389990012
 - 389989988
 - 389989980
 - 389990020
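
As a quick sanity check on the numbers above, here is a small sketch that assumes the listed capacities are in Ki and truncates each to whole Mi. The `kiToMi` and `distinct` helpers are made up for this example and are not part of the PR or the linked gist.

```go
package main

import "fmt"

// observedKi holds the m6i.24xlarge capacities listed above,
// assumed to be in Ki.
var observedKi = []int64{
	389990024, 389990016, 389990004, 389989996, 389990000,
	389990004, 389989988, 389990000, 389990012, 389990012,
	389989988, 389989980, 389990020,
}

// kiToMi truncates a Ki value to whole Mi (1 Mi = 1024 Ki).
func kiToMi(ki int64) int64 {
	return ki / 1024
}

// distinct counts the number of unique values in a slice.
func distinct(values []int64) int {
	seen := make(map[int64]struct{}, len(values))
	for _, v := range values {
		seen[v] = struct{}{}
	}
	return len(seen)
}

func main() {
	mi := make([]int64, 0, len(observedKi))
	for _, ki := range observedKi {
		mi = append(mi, kiToMi(ki))
	}
	fmt.Println("distinct Ki values:", distinct(observedKi))
	fmt.Println("distinct Mi values:", distinct(mi))
}
```

Every value in the list falls within the same 1024 Ki window, so each one truncates to 380849 Mi and the variance disappears at Mi granularity.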

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

pkg/controllers/providers/instancetype/memoryoverhead/controller.go

pkg/cache/cache.go

pkg/providers/instancetype/instancetype.go

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

jmdeal

/karpenter snapshot

pkg/controllers/providers/instancetype/discoveredcapacitycache/controller.go

pkg/controllers/providers/instancetype/discoveredcapacitycache/suite_test.go

pkg/test/environment.go

github-actions · 2024-10-10T18:10:52Z

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-7194ce607171bb52280313c434a574b616877c85.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-7194ce607171bb52280313c434a574b616877c85" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

jukie · 2024-10-14T23:25:49Z

@jmdeal could you take another look please

coveralls · 2024-10-15T21:52:51Z

Pull Request Test Coverage Report for Build 11377237253

Details

58 of 83 (69.88%) changed or added relevant lines in 6 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.2%) to 82.883%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/controllers/controllers.go	0	1	0.0%
pkg/operator/operator.go	0	1	0.0%
pkg/providers/instancetype/instancetype.go	43	49	87.76%
pkg/controllers/providers/instancetype/capacity/controller.go	10	27	37.04%

Totals
Change from base Build 11372026312:	-0.2%
Covered Lines:	5641
Relevant Lines:	6806

💛 - Coveralls

jukie · 2024-10-15T22:01:50Z

Fixed the conflicts

jukie · 2024-10-16T17:00:13Z

Any other changes needed here @jmdeal?

jmdeal

One nit, and I'm going to let E2Es run. Otherwise LGTM.

/karpenter snapshot

pkg/controllers/providers/instancetype/capacity/controller.go

jmdeal

/karpenter snapshot

github-actions · 2024-10-16T17:49:02Z

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-7548492412f9ec41de90fd048119ccc0ba7d0141.
To install you must login to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-7548492412f9ec41de90fd048119ccc0ba7d0141" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

Co-authored-by: Jason Deal <dealj@umich.edu>

jukie · 2024-10-17T02:30:05Z

I think the E2Es need a retry after committing your suggestion, but the prior run appeared to pass.

jmdeal

No need to rerun, the only changes since they last ran were the comment nit and some docs changes you merged in. This LGTM 🚀.

jukie · 2024-11-01T04:41:58Z

Hi @jmdeal, I saw 1.0.7 was released but this wasn't included. Will this make it into the next one?

jmdeal · 2024-11-01T16:44:13Z

We weren't planning on including this in a patch release; those are typically reserved for critical bug and security fixes. The plan is to include this in the next minor version release, v1.1.0.

jukie · 2024-11-02T03:48:48Z

Thanks!

jukie requested a review from a team as a code owner September 12, 2024 22:06

jukie requested a review from edibble21 September 12, 2024 22:06

jukie changed the title ~~WIP: Add memory overhead tracking per nodeclass and ami family~~ Draft: Add memory overhead tracking per nodeclass and ami family Sep 12, 2024

jukie mentioned this pull request Sep 12, 2024

Discover Instance Type Capacity Memory Overhead Instead of vmMemoryOverheadPercent #5161

Closed

jukie changed the title ~~Draft: Add memory overhead tracking per nodeclass and ami family~~ Draft: feat: Discover Instance Type Capacity Memory Overhead Sep 12, 2024

jukie changed the title ~~Draft: feat: Discover Instance Type Capacity Memory Overhead~~ [DRAFT]: feat: Discover Instance Type Capacity Memory Overhead Sep 12, 2024

jukie marked this pull request as draft September 12, 2024 23:53

jukie changed the title ~~[DRAFT]: feat: Discover Instance Type Capacity Memory Overhead~~ [DRAFT] feat: Discover Instance Type Capacity Memory Overhead Sep 12, 2024

jukie force-pushed the memory-overhead-controller branch from 2493708 to 6ee1b24 Compare September 14, 2024 08:29

jukie changed the title ~~[DRAFT] feat: Discover Instance Type Capacity Memory Overhead~~ feat: Discover Instance Type Capacity Memory Overhead Sep 14, 2024

jukie marked this pull request as ready for review September 14, 2024 08:30

jukie force-pushed the memory-overhead-controller branch from 6975505 to c6f9af5 Compare September 14, 2024 15:31

jukie mentioned this pull request Sep 17, 2024

feat: Add VMReserved to InstanceTypeOverhead kubernetes-sigs/karpenter#1673

Closed

njtran reviewed Sep 26, 2024

jukie requested a review from njtran September 26, 2024 22:11

rebase

cda22f9

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

jukie force-pushed the memory-overhead-controller branch from c63c9e6 to cda22f9 Compare September 30, 2024 18:27

jukie and others added 7 commits September 30, 2024 12:33

cleanup

065f328

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

Filter by AMI vs nodeclass hash

d194b2f

Docs

15701a3

Merge branch 'main' into memory-overhead-controller

945e325

use base16 for hash key

d53e74c

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

comment

62346bf

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

Whitespace and error log

f12d27f

jmdeal reviewed Oct 1, 2024

jukie and others added 3 commits October 2, 2024 23:59

startup reconcile, longer cache expiry, override capacity

be14680

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

Merge branch 'main' into memory-overhead-controller

59b893d

missed a few renames

e2a9413

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

jmdeal self-assigned this Oct 10, 2024

jmdeal reviewed Oct 10, 2024

Add tests and renaming

aa3ed23

jukie force-pushed the memory-overhead-controller branch from be96dfa to aa3ed23 Compare October 11, 2024 03:46

jukie and others added 3 commits October 11, 2024 08:15

Update tests

37f1ac0

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

Merge branch 'main' into memory-overhead-controller

bf17a94

comments and error output

77f68b5

Signed-off-by: jukie <10012479+Jukie@users.noreply.github.com>

jukie requested a review from jmdeal October 13, 2024 20:22

Merge branch 'main' into memory-overhead-controller

e6f8d86

Merge branch 'main' into memory-overhead-controller

d3b7880

jukie added 2 commits October 15, 2024 16:22

Fix ci

46d7bfa

linting

7548492

jmdeal reviewed Oct 16, 2024

pkg/controllers/providers/instancetype/capacity/controller.go (outdated)

jmdeal reviewed Oct 16, 2024

jukie and others added 2 commits October 16, 2024 14:56

Update pkg/controllers/providers/instancetype/capacity/controller.go

36a5307

Co-authored-by: Jason Deal <dealj@umich.edu>

Merge branch 'main' into memory-overhead-controller

90d9c55

jmdeal approved these changes Oct 17, 2024

jmdeal merged commit f9b3292 into aws:main Oct 21, 2024
17 checks passed

jukie deleted the memory-overhead-controller branch October 21, 2024 17:24

BEvgeniyS mentioned this pull request Oct 21, 2024

fix: accurately track allocatable resources for nodes kubernetes-sigs/karpenter#1420

Closed
