
Karpenter 0.37 Upgrade: Generic Ephemeral Volumes Not Deleting After Pod Removal Without Enabling Webhook #6997

Open
apjneeraj opened this issue Sep 12, 2024 · 1 comment
Labels: bug (Something isn't working), needs-triage (Issues that need to be triaged)
apjneeraj commented Sep 12, 2024

Description

Observed Behavior:

We use generic ephemeral volumes in one of our use cases, and the lifecycle of these volumes follows the lifecycle of the pod. Up to Karpenter 0.36.x, those volumes (PVCs/PVs) were deleted automatically as soon as the pod was deleted.
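For context, the kind of pod we are talking about looks roughly like the upstream example (trimmed here; the storageClassName is a placeholder for whichever EBS-backed class the cluster uses):

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app
    spec:
      containers:
        - name: my-app
          image: busybox:1.36
          command: ["sleep", "1000000"]
          volumeMounts:
            - mountPath: /scratch
              name: scratch-volume
      volumes:
        - name: scratch-volume
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: gp3        # placeholder: any EBS-backed StorageClass
                resources:
                  requests:
                    storage: 1Gi

When the pod is deleted, the resulting PVC (named my-app-scratch-volume) and its EBS-backed PV are expected to be garbage-collected, because the PVC's ownerReference points at the pod.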

We upgraded to Karpenter 0.37.2, which ships a webhook that is disabled by default. With both the v1 and v1beta1 APIs present, some functionality broke: we were unable to use kubectl get nodepool|nodeclaims|ec2nodeclasses directly without the API group suffix. That did not appear to break anything server-side until we noticed hundreds of EBS volumes in the available state, meaning the pods using those volumes were already gone but the underlying PVCs and volumes were still lying around. That was not the case before the upgrade.
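For reference, the fully qualified resource names we had to fall back to looked roughly like this (group suffixes per the Karpenter APIs; exact short-name behaviour may vary by cluster):

    # List Karpenter resources using the full API group suffix, since the
    # bare short names were ambiguous while v1 and v1beta1 were both installed.
    kubectl get nodepools.karpenter.sh
    kubectl get nodeclaims.karpenter.sh
    kubectl get ec2nodeclasses.karpenter.k8s.aws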

Further investigation showed that the only recent change in the cluster was Karpenter, and our Grafana dashboard showed a spike in PVC counts right after the Karpenter upgrade.

Due to other CRD and webhook issues in the Karpenter chart related to #6847 and #6867, there was no direct way for us to use Flux to change the default namespace hardcoded in the CRDs shipped with the main chart.

Workaround: we had to manually update the CRDs in the cluster and then enable the webhook, which is enabled by default in the more recent 0.37.3 chart version. After that we stopped observing the issue and the PVC count remained steady.
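For anyone hitting the same thing, the webhook part of the workaround was essentially a helm upgrade with the chart's webhook.enabled value flipped on. The command below is a sketch, assuming the upstream OCI chart location and that other values are reused, not our literal command:

    # Re-run the chart with the webhook enabled (adjust namespace/values to your install).
    helm upgrade karpenter oci://public.ecr.aws/karpenter/karpenter \
      --namespace kube-system \
      --version 0.37.2 \
      --set webhook.enabled=true \
      --reuse-values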

Expected Behavior:
Upgrading to chart 0.37.2 without enabling the webhook should still work, and PVCs created through generic ephemeral volumes should be cleaned up automatically, as they were with 0.36.x.

Question: how is enabling the webhook, or Karpenter in general, involved in deleting those PVCs or PVs? My understanding is that Karpenter works alongside the scheduler and is not directly involved in creating or deleting PVCs. Is there anything Karpenter started doing through the webhook that blocks PVC deletion?

Reproduction Steps (Please include YAML):

  1. Upgrade to Karpenter chart 0.37.2 and do not enable the webhook.
  2. Create a sample pod using the manifest from https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/#generic-ephemeral-volumes.
  3. A new pod my-app and a PVC my-app-scratch-volume will be created.
  4. Run kubectl get pvc my-app-scratch-volume -oyaml to see the ownerReferences; it will look something like:
     ownerReferences:
     - apiVersion: v1
       blockOwnerDeletion: true
       controller: true
       kind: Pod
       name: my-app
  5. Delete the pod: kubectl delete pod my-app
  6. Observe that the PVC created in step 2 is still present and not cleaned up after pod deletion: kubectl get pvc
  7. Check the AWS EC2 console for the EBS volume backing the PVC; it will be in the available state and free to be deleted. The volume ID can be fetched with the steps below (see also the sketch after this list):
     1. kubectl describe pvc my-app-scratch-volume | grep -i volume:
     2. kubectl describe pv <pv name from above step> | grep -i VolumeHandle:
  8. Only if we delete the PVC manually does the underlying EBS volume get deleted.
  9. If we enable the webhook and update the CRDs to override the default namespace for Karpenter, everything goes back to normal.
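A rough shell sketch of the PVC -> PV -> EBS lookup from step 7, assuming the EBS CSI driver populates spec.csi.volumeHandle and that the AWS CLI is configured for the right account and region:

    # Resolve PVC -> PV -> EBS volume ID, then check the volume's state in AWS.
    PV_NAME=$(kubectl get pvc my-app-scratch-volume -o jsonpath='{.spec.volumeName}')
    VOLUME_ID=$(kubectl get pv "$PV_NAME" -o jsonpath='{.spec.csi.volumeHandle}')
    aws ec2 describe-volumes --volume-ids "$VOLUME_ID" \
      --query 'Volumes[0].State' --output text    # shows "available" once the pod is gone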

Versions:

  • Chart Version: 0.37.2
  • Kubernetes Version (kubectl version): 1.29.0
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
apjneeraj added the bug and needs-triage labels Sep 12, 2024
apjneeraj (Author) commented:

Is there anything else I can provide or any more information needed to get some insights here? Thanks
