
Karpenter v1.0.0 chart does not respect Helm values in packaged CRDs leading to ArgoCD upgrade difficulty #6847

Open
booleanbetrayal opened this issue Aug 22, 2024 · 80 comments
Labels
bug (Something isn't working), needs-triage (Issues that need to be triaged)

Comments

@booleanbetrayal

Description

Observed Behavior:

Setting any webhook values (such as webhook.serviceName) in the primary karpenter chart has no effect. Instead, the chart embeds static CRDs (symlinked in this repository), unlike the karpenter-crd chart, which does respect these values.

This makes upgrading from v0.37.0 to v1.0.0 using platforms like ArgoCD difficult and error-prone since ArgoCD is capable of managing CRDs and will likewise detect drift if they are updated manually or via a secondary chart such as karpenter-crd.

In our case, the conversion.webhook.clientConfig.service.name value for the conversion webhook (specified in the static CRDs) deployed with the karpenter chart did not match the name that the chart generates for the Service itself (using the karpenter.fullName helper), and so all webhook calls would fail. Likewise, the webhook cannot be disabled via values, and manual edits show up as drift in ArgoCD.
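
For reference, the static conversion stanza baked into the packaged CRDs looks roughly like the following (a sketch reconstructed from the mismatch described above; the namespace and port values are illustrative):

  spec:
    conversion:
      strategy: Webhook
      webhook:
        clientConfig:
          service:
            name: karpenter          # hardcoded; does not follow karpenter.fullName
            namespace: kube-system   # illustrative
            port: 8443               # illustrative
        conversionReviewVersions:
          - v1beta1
          - v1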

Using both charts in ArgoCD attempts to create duplicate resources and could cause unexpected issues. There is no way to disable CRD deployment in the primary karpenter chart.

A possible workaround for our specific issue would be to set the nameOverride value to karpenter so that it matches the static serviceName buried in the CRD, but ideally we wouldn't need these conversion webhooks at all and would prefer to have them disabled entirely.
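
For completeness, the workaround values would amount to something like this (a sketch; which of the two overrides the chart's karpenter.fullName helper actually consumes should be verified against the chart templates):

  # values.yaml for the karpenter chart (workaround sketch)
  nameOverride: karpenter
  fullnameOverride: karpenter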

Expected Behavior:

The karpenter Helm package respects Helm values in its CRDs, just as the karpenter-crd chart does. If static CRDs are required, they can be referenced elsewhere, e.g. from the GitHub repository.

Alternatively (and less elegantly), a dual-chart solution with a flag to disable CRDs in the primary chart (or simply not packaging them at all) could also work, as it would allow the other chart to install without duplicates.

Reproduction Steps (Please include YAML):

Attempt to upgrade the karpenter chart from v0.37.0 to v1.0.0 using ArgoCD. Specify override Helm values for webhook properties and observe that no diff reflecting those overrides is presented for the CRD definitions.

Versions:

  • Chart Version: 1.0.0
  • Kubernetes Version (kubectl version): 1.30.2
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@booleanbetrayal added the bug (Something isn't working) and needs-triage (Issues that need to be triaged) labels on Aug 22, 2024
@engedaam
Contributor

As part of ArgoCD, you should be able to disable CRD installation, no? https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-skip-crds. Also, it sounds like you updated the fullnameOverride? Does that align with what is passed as the webhook.serviceName in the karpenter-crd chart?

@booleanbetrayal
Author

Hi @engedaam and thanks for the quick feedback. That skip-crds setting applies to an ArgoCD Application, but that might deploy multiple charts / sub-charts. In our case, we have other Helm packages being deployed in this namespace / ArgoCD Application (kube-system) which are modeled in this fashion. We wouldn't want them to all carry this setting, as this is the only problematic deployment.

Other popular charts often include some flag to toggle CRD deployment for more granular fidelity. For example, ESO uses an installCRDs flag for this feature.
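
For comparison, this is roughly what such a toggle looks like in chart values (installCRDs is the ESO flag named above; the crds.enabled form is a hypothetical equivalent, not something the karpenter chart currently exposes):

  # external-secrets chart values
  installCRDs: true

  # hypothetical equivalent if the karpenter chart exposed a similar toggle
  crds:
    enabled: false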

We should be able to update the fullnameOverride for the karpenter chart to match the "karpenter" value for conversion.webhook.clientConfig.service.name that its bundled (static) CRDs set, and hopefully everything just works at that point since the Service would be named appropriately. However, that requires rebuilding all our existing Karpenter resources to support the new naming, along with IAM trust policies, etc.

At this point, I'm curious how people running Karpenter within an ArgoCD Application are performing the v1.0.0 upgrade.

@booleanbetrayal
Author

At this point, I'm curious how people running Karpenter within an ArgoCD Application are performing the v1.0.0 upgrade.

Just to provide more context, Helm won't even overwrite resources that are managed-by ArgoCD, and I'm not sure how to spoof Helm's .Release.Service stamping. As a result, AFAIK, the official v1 Migration Guide can't be followed for ArgoCD Karpenter deployments and the CRD upgrade step. Even if you could spoof it so Helm CLI did deploy those karpenter-crd CRDs, the Karpenter CRDs would immediately show up as drifting out of sync with the karpenter Chart source in ArgoCD Application display, as they would differ from the bundled variants.

@Vinaum8

Vinaum8 commented Aug 23, 2024

+1

@Vinaum8

Vinaum8 commented Aug 23, 2024

The expected behavior described in this issue would be perfect, as it would give us the option of enabling or disabling CRD installation in the karpenter Helm chart.

@artem-nefedov
Contributor

artem-nefedov commented Aug 23, 2024

Given 75d482f is already in main, I assume official guidelines from now on will be "disable CRDs in the main chart and always use karpenter-crd if you require any sort of CRD customization".

@booleanbetrayal
Author

booleanbetrayal commented Aug 23, 2024

@artem-nefedov - How does one disable CRDs in the main chart? main is still currently bundling all the static CRDs as it was before, without any mechanism to prevent deployment in ArgoCD, as far as I can tell.

EDIT: I've verified that the 1.0.1 package release is continuing to distribute the static CRDs.

@artem-nefedov
Contributor

@artem-nefedov - How does one disable CRDs in the main chart? main is still currently bundling all the static CRDs as it was before, without any mechanism to prevent deployment in ArgoCD, as far as I can tell.

  1. With helm CLI, you would use --skip-crds option.
  2. With Flux2 GitOps, you would set spec.install.crds: Skip and spec.upgrade.crds: Skip in the HelmRelease (see the sketch after this list).
  3. I don't use ArgoCD, but quick search found that it supports feature to skip CRDs since January 2022: feat: add skipCrds flag for helm charts argoproj/argo-cd#8012
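
As an illustration of option 2, a minimal Flux HelmRelease sketch (chart source, version, and names are illustrative; the apiVersion may be v2beta2 on older Flux releases):

  apiVersion: helm.toolkit.fluxcd.io/v2
  kind: HelmRelease
  metadata:
    name: karpenter
    namespace: kube-system
  spec:
    interval: 10m
    chart:
      spec:
        chart: karpenter
        version: "1.0.x"              # illustrative
        sourceRef:
          kind: HelmRepository
          name: karpenter             # illustrative; e.g. an OCI repository for public.ecr.aws/karpenter
    install:
      crds: Skip
    upgrade:
      crds: Skip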

@booleanbetrayal
Author

@artem-nefedov - I've posted a response to that workaround elsewhere, but the gist is that an ArgoCD Application may comprise multiple Helm charts. Skipping CRDs to work around one problematic chart could cascade into multiple issues with the maintenance of CRDs in the other charts.

@artem-nefedov
Contributor

@booleanbetrayal Maybe I'm missing something, but so far this looks like a solvable problem. Split it into two separate Applications, and optionally add a third top-level Application that installs both if you still want a single entrypoint.

@adamwilner

This exact thing happened to me; I spent several hours with it. Skipping CRDs at an application level is not an option. I would need to be able to include both the karpenter and karpenter-crd charts in the same ArgoCD application with only the CRDs from the karpenter chart skipped. Or, even better, a solution that only requires the karpenter chart, with consistent Service naming.

@booleanbetrayal
Author

@booleanbetrayal Maybe I'm missing something, but so far this looks like solvable problem. Split it into two separate Applications, and optionally add a third top-level Application that installs both if you still want to have a single entrypoint.

@artem-nefedov - So if my Application is kube-system, deploying to the kube-system Namespace, and I am running Karpenter within it along with multiple other charts (AWS-LB, ESO, etc.) that typically control their CRD deployments with Helm flags, I am supposed to re-architect my application to move and rename resources into separate Applications and Namespaces in all my clusters, juggling Karpenter controller uptime so that Nodes can continue to spool, all to support this v1.0.0 upgrade?

You might not have much familiarity with ArgoCD, but I can tell you that this is not a light chore: it would be prone to outages, is not documented as a migration vector, and would feel especially egregious in light of AWS's recent endorsement of ArgoCD deployment approaches.

@jmdeal
Contributor

jmdeal commented Aug 26, 2024

For context, we went with this design following Helm v3's recommendations for managing CRDs. These recommendations state that, in a main chart, CRDs should be placed in the crds directory. Resources in this directory cannot be templated, which is why the packaged CRDs are hardcoded with the package defaults (this only affects the conversion block). When templating is required, the recommendation is to use a separate chart, in this case karpenter-crd.
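
For illustration, the chart layout this corresponds to (a sketch; file names are representative rather than an exact listing):

  karpenter/                                  # main chart: crds/ is installed verbatim, never templated
    Chart.yaml
    crds/
      karpenter.sh_nodepools.yaml             # static CRD with a hardcoded conversion block
      karpenter.sh_nodeclaims.yaml
      karpenter.k8s.aws_ec2nodeclasses.yaml
    templates/
      ...

  karpenter-crd/                              # separate chart: CRDs live under templates/, so values apply
    Chart.yaml
    templates/
      karpenter.sh_nodepools.yaml             # conversion block rendered from values
      ...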

Other popular charts often include some flag to toggle CRD deployment for more granular fidelity.

The linked examples in this thread appear to be working around Helm's design for managing CRDs. Unless absolutely necessary, we'd rather not go down this path and instead stick to standard Helm practices.

Skipping CRDs at an application level is not an option.

This does not appear to be necessary. skipCrds is scoped to the source, not the application. You should be able to skip CRDs in one source and not another within a single application. I was able to verify with the following application; CRDs were not installed from the main Karpenter chart, but were installed from the aws-controllers-k8s chart:

project: default
destination:
  server: https://kubernetes.default.svc
  namespace: karpenter
sources:
  - repoURL: public.ecr.aws/aws-controllers-k8s
    targetRevision: 1.2.17
    chart: ec2-chart
  - repoURL: public.ecr.aws/karpenter
    targetRevision: 1.0.1
    chart: karpenter-crd
  - repoURL: public.ecr.aws/karpenter
    targetRevision: 1.0.1
    helm:
      parameters:
        - name: settings.clusterName
          value: redacted
        - name: serviceAccount.annotations.eks\.amazonaws\.com/role-arn
          value: redacted
      skipCrds: true
    chart: karpenter

@booleanbetrayal
Author

@jmdeal - This is a bit of an annoying workaround for users who already have karpenter being deployed in an ArgoCD application, originating from a private GitOps repo, containing relevant chart binaries and YAML values. For example, here is the core of our Application definition:

project: kube-system
source:
  helm:
    valueFiles:
      - staging-values.yaml
  path: kube-system
  repoURL: user@privately-hosted-git-repo.com/some-org/some-private-gitops-repo
  targetRevision: main

In this instance we would either have to expose our Application to third-party supply chain repositories (a deal-breaker for lots of orgs who attempt to mitigate supply-chain attacks / reduce third-party dependency failures) or figure out a new scheme of pushing the karpenter chart + value definitions to a private repo under a new repo, tag, or path.

Even if that were feasible for most users, they would then be forced to perform a moderately risky Application migration to move a pre-1.0.0 running Karpenter controller + provisioning (NodePool, NodeClass) manifests into the appropriate new source state. It feels like a lot of users could accidentally induce some sort of Karpenter downtime with such a requirement if they fail to atomically swap resources with identical configuration.

If you all absolutely do not wish to implement a convenience flag for this and are adamant about keeping with the Helm 3 recommendations, why not just drop the static CRDs from the primary chart and make the karpenter-crd chart an explicit or implicit dependency? That would also align with the same Helm 3 recommendations (option 2), which state:

Another way to do this is to put the CRD definition in one chart, and then put any resources that use that CRD in another chart.

In this method, each chart must be installed separately. However, this workflow may be more useful for cluster operators who have admin access to a cluster

Or alternatively, just drop the conversion webhook altogether and clarify the upgrade requirements for the appropriate provisioning manifests?

@thomasgouveia

If you all absolutely do not wish to implement a convenience flag for this and are adamant about keeping with the Helm 3 recommendations, why not just drop the static CRDs from the primary chart and make the karpenter-crd chart an explicit or implicit dependency? That would also align with the same Helm 3 recommendations (option 2)

+1. I also mentioned adding a dependency to the karpenter chart here. Do you have any feedback about it?

@trojaond

trojaond commented Aug 27, 2024

@booleanbetrayal Have you considered including karpenter-crd as a separate Helm chart in your private repo and referencing it from a single ArgoCD application? That way it does not affect the supply chain.

  project: {project}
  destination:
    namespace: kube-system
    server: 'https://kubernetes.default.svc'
  sources:
  - repoURL: {private_git_repository}
    targetRevision: HEAD
    path: '{path_to_helmcharts}/karpenter-crd/'
  - repoURL: {private_git_repository}
    targetRevision: HEAD
    path: '{path_to_helmcharts}/karpenter/'
    helm:
      skipCrds: true
      values: |-
          ...

This respects the Helm 3 recommendations (option 2), as there are two charts but they are deployed from a single ArgoCD Application.

Also worth mentioning that multi-source applications are still considered a beta feature.

@booleanbetrayal
Author

booleanbetrayal commented Aug 27, 2024

@trojaond - Currently, our ArgoCD application paths are all namespace-oriented, with differing values files based on environment (development, staging, production, etc.). This change breaks those conventions and would require other scripting changes to accommodate, so even if this ArgoCD multi-source beta feature is viable (and does not "change in backwards incompatible ways"), it's not very desirable. It would be much more operator-friendly if there were just a single chart with a convenience flag, or two separate charts, one with templated CRDs and one with none.

EDIT - Additionally, I think the Option 2 statement ...

Another way to do this is to put the CRD definition in one chart, and then put any resources that use that CRD in another chart.

... implies that the CRDs are omitted from one of the charts, not included in both, and does not require command-line flags to work around the duplication.

@Vinaum8

Vinaum8 commented Aug 27, 2024

Why not put an option to enable or disable CRDs in the Helm chart?

@artem-nefedov
Contributor

Why not put an option to enable or disable CRDs in the Helm chart?

Helm doesn't allow any templating for CRDs that are located in "crds" directory, so it's impossible to affect them based on input values. They can only be turned off by --skip-crds option (and similar things in various controllers). To make it possible to turn off CRDs via values, they must be moved from "crds" to "templates", like it's done in karpenter-crd chart, but that could bring other unwanted side effects (there are reasons why "crds" directory exists and is handled differently).
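
To make the difference concrete, a sketch of the kind of conversion block the karpenter-crd chart can render from values because its CRDs live under templates/ (the webhook.serviceName key is taken from the discussion above; the other keys are assumptions):

  # CRD fragment under karpenter-crd/templates/ (sketch)
  spec:
    conversion:
      strategy: Webhook
      webhook:
        clientConfig:
          service:
            name: {{ .Values.webhook.serviceName }}
            namespace: {{ .Release.Namespace }}
            port: {{ .Values.webhook.port }}   # assumed value key
  # the equivalent block shipped in the main chart's crds/ directory is static,
  # so no value can change it there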

@booleanbetrayal
Author

Helm doesn't allow any templating for CRDs that are located in "crds" directory, so it's impossible to affect them based on input values. They can only be turned off by --skip-crds option (and similar things in various controllers). To make it possible to turn off CRDs via values, they must be moved from "crds" to "templates", like it's done in karpenter-crd chart, but that could bring other unwanted side effects (there are reasons why "crds" directory exists and is handled differently).

@artem-nefedov - What are the side effects of templating CRDs? Lots of Helm charts template CRDs, including ArgoCD itself.

@andrewjamesbrown

@artem-nefedov unless I'm mistaken, helm supports this via helm/helm#7138 (i.e what ArgoCD is doing)

@artem-nefedov
Contributor

artem-nefedov commented Aug 27, 2024

@artem-nefedov unless I'm mistaken, helm supports this via helm/helm#7138 (i.e what ArgoCD is doing)

@andrewjamesbrown This is only for outputting CRDs via helm template command. It still doesn't allow any actual templating (changing the content) based on input values. And it requires a separate option.

@artem-nefedov
Contributor

@artem-nefedov - What are the side effects of templating CRDs? Lots of Helm charts template CRDs, including ArgoCD itself.

@booleanbetrayal For comprehensive reasoning (as the team behind helm sees it), read this:
https://github.com/helm/community/blob/f9e06c16d89ccea1bea77c01a6a96ae3b309f823/architecture/crds.md

@andrewjamesbrown

@artem-nefedov my mistake... ArgoCD CRDs are in the templates/crds directory not crds
🤦

@dcherniv

dcherniv commented Sep 11, 2024

@adrianmiron

@dcherniv The latest version includes an update to the controller IAM role attached policy which is enabled with a variable flag. When i turn that on, the policy gets a long update and it can no longer spawn resources (in either 0.37.X or 1.0.X )

I didn't know this flag existed. I just tested with the flag enabled in the module on 20.24.0 and I cannot reproduce. I scaled one of my node pools up and down and saw launch templates get created and deleted with no problem.

@adrianmiron

@dcherniv Can you please show me what your terraform plan looks like when updating the policy to the v1 version? You can post it in terraform-aws-modules/terraform-aws-eks#3152 to avoid spam here, maybe?

@dcherniv

dcherniv commented Sep 11, 2024

Some more updates.
I just tried to go from 0.37.3 to 1.0.2 in production and can confidently say there's no valid upgrade path. I tried with webhooks enabled and with webhooks disabled. I tried rolling the nodeclaims before going to 1.0.2, and tried running 0.37.3, creating the new nodeclaims with the proper v1 spec, and migrating to those using a phased migration.
There appears to be no way to avoid this error on the NodeClaims:

{"level":"ERROR","time":"2024-09-11T19:57:52.140Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodepool.hash","controllerGroup":"karpenter.sh","controllerKind":"NodePool","NodePool":{"name":"default"},"namespace":"","name":"default","reconcileID":"109f1dc7-640d-45ee-85b6-ced5a57d64bb","error":"NodeClaim.karpenter.sh \"default-k8vbn\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]; NodeClaim.karpenter.sh \"default-vn6mj\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]; NodeClaim.karpenter.sh \"default-2gt6z\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]","errorCauses":[{"error":"NodeClaim.karpenter.sh \"default-k8vbn\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"},{"error":"NodeClaim.karpenter.sh \"default-vn6mj\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"},{"error":"NodeClaim.karpenter.sh \"default-2gt6z\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"}]}

I don't know what the karpenter devs think we are supposed to do, but I'd highly suggest not upgrading to 1.x.x; it is an absolute disaster.
It seems this release wasn't tested, like at all. I'm starting to doubt that karpenter has any kind of QA at all. This is pretty basic stuff.
The only upgrade path is what I did in dev:

  1. Upgrade to 1.0.2
  2. Force delete all instances (physically, through the AWS console)
  3. Wait for karpenter to provision new ones and observe the above error in the logs because karpenter is stupid like that
  4. Roll back to 0.37.3 and let that version delete the old claims
  5. Upgrade back to 1.0.2
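
For reference, the field those errors complain about is spec.nodeClassRef.group, which on the v1 API looks roughly like this (a sketch; names are illustrative):

  # NodeClaim spec fragment under karpenter.sh/v1
  spec:
    nodeClassRef:
      group: karpenter.k8s.aws   # the field reported as missing in the errors above
      kind: EC2NodeClass
      name: default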

@booleanbetrayal
Author

I don't know what the karpenter devs think we are supposed to do, but I'd highly suggest not upgrading to 1.x.x; it is an absolute disaster. It seems this release wasn't tested, like at all. I'm starting to doubt that karpenter has any kind of QA at all. This is pretty basic stuff.

They expect us to re-package the Chart archives (and then subsequently ignore all checksum validation during helm dep update calls).

@dcherniv

dcherniv commented Sep 11, 2024

They expect us to re-package the Chart archives (and then subsequently ignore all checksum validation during helm dep update calls).

This is not to mention the problem with the IAM policy. I reproduced it. The 1.0.2 policy is not backwards compatible with 0.37.3 Karpenter. As soon as I enabled that v1 flag in my IAM module while still keeping Karpenter itself on 0.37.3, it promptly broke with errors about missing permissions for launch templates and EC2 tagging.
What a clown show this upgrade is.

The main problem is that in this procedure https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure they expect you to keep the webhooks on throughout the upgrade, even on version 1.0.2.
The problem, of course, is that on 1.0.2 the webhook straight up doesn't work and gives no indication as to why:

{"level":"ERROR","time":"2024-09-11T20:30:53.969Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","namespace":"istio-system"},"namespace":"istio-system","name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","reconcileID":"86d25014-bd4b-4b5d-852f-a5c4d50fe7c4","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}

And if you do the "procedure":

  kubectl delete validatingwebhookconfiguration validation.webhook.config.karpenter.sh
  kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.k8s.aws
  kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.sh
  kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.k8s.aws

You still get the NodeClaims nonsense that was supposed to be taken care of by the webhook:

{"level":"ERROR","time":"2024-09-11T20:33:41.534Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.tagging","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"default-arm64-dgxr9"},"namespace":"","name":"default-arm64-dgxr9","reconcileID":"67898e7c-e5f5-47fb-b8cb-1a094ddf26df","error":"NodeClaim.karpenter.sh \"default-arm64-dgxr9\" is invalid: [spec.nodeClassRef.group: Required value, <nil>: Invalid value: \"null\": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation]"}

@bryantbiggs
Member

Let's keep the convo civil and professional please - we should assume good intent. I can vouch for the Karpenter team in terms of how much time and effort went into not only creating the v1.0 release, but also testing and validating it.

Please continue to share any findings or steps taken to resolve, as this will help both the Karpenter team and other users who may run into similar challenges.

@dcherniv

dcherniv commented Sep 11, 2024

@bryantbiggs

Let's keep the convo civil and professional please - we should assume good intent. I can vouch for the Karpenter team in terms of how much time and effort went into not only creating the v1.0 release, but also testing and validating it.

I apologize, but you have to walk a mile in my shoes. This is extremely frustrating for an end user: going through the upgrade path at https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure, it just doesn't work.
I'm happy to share exactly when and how it breaks, but this is the happy path, and even that doesn't work.
I love karpenter and would like it to succeed, but things like these just turn users away.

@booleanbetrayal
Author

@bryantbiggs - Do you have authority in the Karpenter release process? Can you ensure that ArgoCD and Flux both become test vectors for future releases?

@dcherniv

dcherniv commented Sep 11, 2024

@bryantbiggs
I've created additional issues for the specific problems I'm currently having: #6981 #6982 #6983
If we can resolve these, then at least there's a semblance of a path forward. Right now it's just filling up the logs with errors.

@dcherniv

dcherniv commented Sep 11, 2024

I finally resolved my issues in prod with stale nodeclaims RE: #6981

Issue #6981 is likely caused, in my case, by the fact that the conversion webhooks do NOT work on the karpenter 1.0.2 chart: #6982. If you manage to get the conversion webhook working on 1.0.2, then the below probably won't apply to you.

Warning: the step below may incur downtime if you go too fast! I also wouldn't advise going this route if you have a large cluster. Luckily mine was a handful of nodes, which still took me the better part of a day.
I did this by strategically terminating nodes in the AWS console whose NodeClaim didn't have the group field set for some reason:

for i in `kubectl get nodeclaims | grep -v NAME | awk '{print $1}'` ; do echo $i ; kubectl get nodeclaim -o yaml $i | grep group ; done
default-arm64-6pv26
    group: karpenter.k8s.aws
default-arm64-8xdhn
    group: karpenter.k8s.aws
default-arm64-9xmjx
    group: karpenter.k8s.aws
default-arm64-fcdc7
    group: karpenter.k8s.aws
default-arm64-kndsr
    group: karpenter.k8s.aws
default-arm64-vrzlw
    group: karpenter.k8s.aws
default-bvqlp
    group: karpenter.k8s.aws
default-h74m4
[...]

For the NodeClaims that didn't have group set in the above output, I had to terminate the instance forcefully through the AWS console.
Once karpenter provisioned replacements for these, I quickly reverted back to 0.37.3 with webhooks enabled (because 0.37.3 with webhooks disabled doesn't even work; see #6975).
0.37.3 cleaned up the stale claims with no problems. Carefully monitor the NodeClaims (you don't want 0.37.3 to provision new claims, because 1.0.2 won't be able to delete them and you'd be back at square one). Anyway, you've got to be quick here.
I executed the "procedure" real quick:

  kubectl delete validatingwebhookconfiguration validation.webhook.config.karpenter.sh
  kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.k8s.aws
  kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.sh
  kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.k8s.aws

Then I quickly synced to 1.0.2 in ArgoCD with the prune option.
Finally, I have NodeClaims on the new v1 spec and no errors in the karpenter logs.
And don't forget to sync the IAM policy to the v1 version, because the v1 policy is not compatible with 0.37.3 (see #6983).

To go back and forth between 0.37.3 and 1.0.2, I suggest manually removing ALL of the conditionals in the IAM policy for the time being. This is outlined in step 8 here: https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure
Unfortunately it fails to mention that the policy is completely incompatible: 0.37.3 doesn't work with the policy for 1.0.2, and vice versa.

PS. Don't forget to disable the webhooks on version 1.0.2 when you are going back and forth. I suggest having two values files prepared for this case (see #6982)

Good luck!

@jonathan-innis
Contributor

jonathan-innis commented Sep 12, 2024

For a lot of folks here, it seems like the validating and mutating webhooks are hanging around even after the upgrade to v1.0, where they were dropped from the chart, and I'm wondering if we can help with this bit. Is this expected behavior for Argo? I'd expect that when a resource is removed from the Helm release, the respective resource is also pruned by whichever GitOps tool is applying it.

@booleanbetrayal
Author

booleanbetrayal commented Sep 12, 2024

@jonathan-innis - ArgoCD has an optional Prune option in syncing, which will delete any tracked resources that are presently missing from the chart manifests (due to templating or value conditions). Until they are Pruned, they will reflect in the ArgoCD UI as a diff with missing contents. However, in our case, I do not believe we had any visibility in our Application that any ValidatingWebhookConfigurations or MutatingWebhookConfigurations ever existed.

If some other process had updated the metadata.labels values on those resources, or they had been explicitly overwritten by the chart manifests themselves (possibly?), then ArgoCD would lose the ability to track them, as they would become detached from ArgoCD management. Therefore, they would not show up in the Application at all, let alone with any indication of a diff, and would not be cleaned up when Prune was enabled.
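
For context, ArgoCD's default resource tracking hinges on metadata like the following, so anything that rewrites it detaches the resource from the Application (a sketch assuming default ArgoCD settings; values are illustrative):

  metadata:
    labels:
      app.kubernetes.io/instance: kube-system          # default label-based tracking: the Application name
    annotations:
      argocd.argoproj.io/tracking-id: kube-system:admissionregistration.k8s.io/ValidatingWebhookConfiguration:kube-system/validation.webhook.karpenter.sh   # used with annotation-based tracking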

I am theorizing here that there was some label mismanagement as part of the numerous charts / versions / hooks / templating helpers that broke these resources at deployment onset, but I'm definitely not rewinding the clock on attempting to reproduce this one.

I think it really behooves the Karpenter team to incorporate ArgoCD and Flux tooling into release testing procedures (cc: @bryantbiggs)

@adrianmiron

I finally resolved my issues in prod with stale nodeclaims RE: #6981 [...]

When doing the rollback to remove stale NodeClaims, do it to 0.37.2, with webhooks disabled; that worked for me.

However, what happens if your cluster wants to spawn nodes during the switch? You get 0.37.2 NodeClaims again :)) ... unless you set up different NodePools for both versions, block the 0.37 NodePools from creating nodes when moving to 1.X, and keep that setting when going back to 0.37 to avoid spawning bad nodes. In other words, it is not worth doing on 10 clusters.

@dcherniv

dcherniv commented Sep 12, 2024

I just tried to roll back to 0.37.3 from 1.0.2 on one of the non-important clusters.
Upon changing the manifests to 0.37.3 with webhook.enabled set in the karpenter chart, the webhooks showed a diff (not present; ArgoCD wants to install them).
[screenshot]
I synced just one of the webhooks and then disabled webhook.enabled in the values.
ArgoCD found that the webhook is not in the configuration and wants to delete it:
[screenshot]
I didn't delete it, but instead bumped the chart to 1.0.2 again, all the while keeping webhook.enabled: false in the values.
ArgoCD still detected that the webhook is present but shouldn't be present on chart 1.0.2.
[screenshot]
Syncing with prune in ArgoCD then deleted the webhook properly.

The above was a partial sync.

On a full sync: I just did a full rollback to 0.37.3 with webhook.enabled set to true.
Webhooks are enabled. The chart is on 0.37.3.
[screenshot]
Disabling the webhooks in values on the same 0.37.3 chart and refreshing in ArgoCD, boom: the webhooks disappeared entirely from the configuration, even though they are still there on the cluster.
Attaching the webhook YAML as text:
validation.webhook.config.karpenter.sh.txt
So this problem happens even on chart 0.37.3 without upgrading at all. As soon as the webhook.enabled value is set to false, the webhooks disappear from the configuration but are not uninstalled.

It still wants to delete the cert, which is under the same conditional:
[screenshot]

I bet the problem is universal and has nothing to do with ArgoCD. It probably happens with Helm too.

@dcherniv

When doing the rollback to remove stale NodeClaims, do it to 0.37.2, with webhooks disabled; that worked for me.

0.37.2 worked. I just tried following the v1 upgrade guide and there it was mentioned that you have to be on the latest patch version, which was unfortunately broken :(

@dcherniv

dcherniv commented Sep 12, 2024

I think this has something to do with https://kubernetes.io/docs/concepts/architecture/garbage-collection/#orphaned-dependents maybe?
since the webhook specifies the karpenter ?deployment/rs? as its owner in the ownerReferences block:

  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Namespace
    name: karpenter
    uid: a4d1f1ad-a593-4666-8677-62337f9c602f

But when we flip the webhook.enabled flag in the chart values, this causes the re-creation of the karpenter ReplicaSet (since the reference goes by uid, it's hard to tell who the owner is, whether it's the ReplicaSet or the Deployment).
When the ReplicaSet is deleted, the webhooks become orphans and are removed from consideration or something.
I'm not 100% sure how this owner reference block works, but it seems plausible in my head.

@jonathan-innis
Contributor

jonathan-innis commented Sep 16, 2024

I am theorizing here that there was some label mismanagement as part of the numerous charts / versions / hooks / templating helpers that broke these resources at deployment onset, but I'm definitely not rewinding the clock on attempting to reproduce this one.

That would be my guess as well. We need to look into the reconciler in the code that's injecting the certificate into the Validation and Mutating Webhook Configurations. I suspect that it may be doing exactly what you are saying and removing ownership. And solving that problem would reduce a lot of the pain here.

I think it really behooves the Karpenter team to incorporate ArgoCD and Flux tooling into release testing procedures

I'd be interested to see what other projects do around testing with GitOps tooling integrations. I'd imagine that there's some prior art out there for integration testing but I'd be interested to see which CNCF projects have it.

@iNoahNothing

iNoahNothing commented Sep 16, 2024

FWIW, I found an easier way to recover from this:
1. kubectl delete the NodeClaims that were left over from the earlier karpenter deploy
2. Temporarily update the nodeclaims.karpenter.sh CRD to make group not a required field, so that you can edit the NodeClaims without failing CEL validation
3. kubectl edit the NodeClaims and remove the finalizer so that the resource gets removed from the API server
4. Restart karpenter and let it spin up new nodes

This will temporarily orphan the old nodes that were claimed by the deleted NodeClaims, but Karpenter appeared to cleanly remove those once it was back in a healthy state. This was relatively painless and did not incur downtime.

Edit: you can actually just temporarily update the nodeclaims.karpenter.sh CRD to remove the validation that blocks editing the spec and manually add the group. This made it so that we didn't have to orphan any nodes.
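
For illustration, the manual edit in steps 2-3 amounts to something like this on a single NodeClaim (a sketch; names are illustrative):

  apiVersion: karpenter.sh/v1
  kind: NodeClaim
  metadata:
    name: default-k8vbn           # illustrative
    finalizers: []                # finalizer removed so the API server can delete the object
  spec:
    nodeClassRef:
      group: karpenter.k8s.aws    # added manually once the CRD validation is relaxed
      kind: EC2NodeClass
      name: default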

@adrianmiron

FWIW, I found an easier way to recover from this [...]

That sounds clever, but I would not do that on my prod clusters :D

@iNoahNothing

That sounds clever, but I would not do that on my prod clusters :D

Curious as to why this would be thought of as riskier than force-deleting the nodes and having to do the delicate and well-timed dance of rolling back and rolling forward described above. Since the nodes are drained before being removed, everything was restarted and running before the nodes were reaped.

@adrianmiron

Well, I would not use either solution on prod; I will wait for an official version which does this for us without any tricks. We do not need to move to 1.0.X right now, and most of the stuff here is just to discover issues and possible solutions so the devs know about them.

@Vinaum8

Vinaum8 commented Sep 18, 2024

Well, I would not use either solution on prod; I will wait for an official version which does this for us without any tricks. We do not need to move to 1.0.X right now, and most of the stuff here is just to discover issues and possible solutions so the devs know about them.

me too

@jonathan-innis
Contributor

That would be my guess as well. We need to look into the reconciler in the code that's injecting the certificate into the Validation and Mutating Webhook Configurations. I suspect that it may be doing exactly what you are saying and removing ownership. And solving that problem would reduce a lot of the pain here.

So at least for this specific issue, I think we've tracked down what's going on here.

There's an open issue here that actually specifically describes this interaction between Knative and ArgoCD. What it comes down to is that ArgoCD refuses to delete objects with ownerReferences -- even if it was the original creator of the object.

Because knative injects an owner reference onto the ValidatingWebhookConfiguration and MutatingWebhookConfiguration as part of its reconciliation flow, this causes ArgoCD to ignore these objects when pruning and leave them behind on the cluster. I'm trying out what happens when we don't have these owner references by patching the knative package that we are using here: #7042 to see if that resolves the issue (I suspect it should).

Code link to where ArgoCD does this check: https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L1167

@jonathan-innis
Contributor

Trying to get this feature into upstream knative/pkg so we can add the change and patch this back to the Karpenter versions that are affected by this: knative/pkg#3095

@ShahroZafar

ShahroZafar commented Sep 24, 2024

Hi, we are planning to upgrade to v1 but before upgrading I landed on this issue.

We are using ArgoCD to manage the karpenter controller. We have 2 applications:

  1. Karpenter controller
  2. Nodepools and ec2nodeclasses

We plan to go from v0.37.3 (we upgraded to this version just because it was mentioned in the upgrade guide; we had to manually update the namespace to karpenter in the CRDs to fix the "karpenter service not found" issue) to v1.0.2.

I see that the conversion webhooks are removed in the v1.0.2 Helm chart. Since the webhooks are removed (though this issue correctly points out that they are only removed from ArgoCD when you set webhook.enabled to false, and not actually removed, which I confirmed on v0.37.3 as well), how is the conversion from v1beta1 to v1 supposed to work?

If the webhook is to work, should we put the second application out of sync so that ArgoCD doesn't revert the changes made by the conversion webhook?
