CNI not removing network built on a node after IP is lost externally and IPAMD reconciles this state #2834

Closed
AbeOwlu opened this issue Mar 8, 2024 · 7 comments
Labels: question, stale (Issue or PR is stale)

Comments

AbeOwlu commented Mar 8, 2024

IPAM reconciliation:
Scenario:

  • A pod is created and assigned an IP, 10.0.2.99.
  • After sandbox initialization completes, the IP is reclaimed by an automation in the network, external to the cluster.
  • The IPAMD logs show an IP pool reconcile that catches the lost IP and reconciles its cache by calling the EC2 endpoint.
  • The network route for this pod's IP, 10.0.2.99, remains unchanged on the local node. Other nodes can no longer reach the pod on 10.0.2.99, but it is still reachable from its host node, so kubernetes liveness probes keep succeeding, keeping an unhealthy pod in the cluster.

{"level":"debug","ts":"2024-03-08T18:10:50.378Z","caller":"rpc/rpc.pb.go:713","msg":"AddNetworkRequest: K8S_POD_NAME:\"liveness-http\" K8S_POD_NAMESPACE:\"gateway-ns\" K8S_POD_INFRA_CONTAINER_ID:\"7f92409d45a01365839f5db2b7c30c35626c1de02779233046bf5c1bd2c59380\" ContainerID:\"7f92409d45a01365839f5db2b7c30c35626c1de02779233046bf5c1bd2c59380\" IfName:\"eth0\" NetworkName:\"aws-cni\" Netns:\"/var/run/netns/cni-d4e752dc-bdf7-f594-2a1a-38dfa2445dfb\""}

{"level":"info","ts":"2024-03-08T18:10:50.378Z","caller":"datastore/data_store.go:750","msg":"AssignPodIPv4Address: Assign IP 10.0.2.99 to sandbox aws-cni/7f92409d45a01365839f5db2b7c30c35626c1de02779233046bf5c1bd2c59380/eth0"}

External automation event at March 08, 2024, 18:11:25 (UTC+00:00): UnassignPrivateIpAddresses  "privateIpAddress": "10.0.2.99"

{"level":"warn","ts":"2024-03-08T18:12:00.256Z","caller":"ipamd/ipamd.go:1404","msg":"Instance metadata does not match data store! ipPool: [10.0.2.99 10.0.2.27 10.0.2.158], metadata: [{\n  Primary: true,\n  PrivateIpAddress: \"10.0.2.149\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"10.0.2.27\"\n} {\n  Primary: false,\n  PrivateIpAddress: \"10.0.2.158\"\n}]"}

{"level":"info","ts":"2024-03-08T18:12:00.334Z","caller":"datastore/data_store.go:578","msg":"UnAssignPodIPAddress: Unassign IP 10.0.2.99 from sandbox aws-cni/7f92409d45a01365839f5db2b7c30c35626c1de02779233046bf5c1bd2c59380/eth0"}

What you expected to happen:

  • After the "UnAssignPodIPAddress: Unassign IP 10.0.2.99 from sandbox aws-cni/7f9240..." event, the CNI is triggered to tear down the host network route for this IP, so that the liveness probe can eventually fail and the pod can be healed.
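
For reference, the stale state can be confirmed on the affected node with something like the following. This is a sketch: the introspection port assumes default aws-node settings (introspection enabled on 127.0.0.1:61679), and the route check assumes the per-pod host route the CNI installs in the main routing table.

# host route installed by the CNI for the pod is still present
ip route show | grep 10.0.2.99
# IPAMD's datastore no longer lists the IP after the reconcile
curl -s http://127.0.0.1:61679/v1/pods | grep 10.0.2.99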

How to reproduce it (as minimally and precisely as possible):

  • create a pod with liveness and readiness probes, for example:
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness3
  name: liveness-http3
spec:
  containers:
  - name: ngo-proxy
    image: gcr.io/google_containers/echoserver:1.4
    # args:
    # - /server
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        # httpHeaders:
        # - name: Custom-Header
        #   value: Awesome
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
      # initialDelaySeconds: 50
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 2
  restartPolicy: Always
  • remove the IP from the node the pod is scheduled on, at any time after the pod is running (for example, with the command shown below)
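
The external removal can be simulated with the same EC2 API call seen in the automation event above; for example (the ENI ID here is a placeholder for the ENI holding the pod's IP):

aws ec2 unassign-private-ip-addresses \
    --network-interface-id eni-0123456789abcdef0 \
    --private-ip-addresses 10.0.2.99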

Anything else we need to know?:

  • during the sweep phase of the nodeIPPoolReconcile process, should the CNI be invoked to updateHostNetwork for the removed IPs? (a rough sketch of that cleanup is shown after this list)
  • see issue
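
For illustration, the host-network cleanup in question would roughly amount to deleting the per-pod route (and any policy rules) that the CNI installed for the reclaimed IP. This is a hand-run sketch, not CNI code; whether the ip rule entries exist depends on the CNI version and on whether the IP came from a secondary ENI.

# delete the stale route to the pod's host-side veth
ip route del 10.0.2.99/32
# delete any policy rules added for this pod IP, if present
ip rule del to 10.0.2.99/32 2>/dev/null || true
ip rule del from 10.0.2.99/32 2>/dev/null || true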

Environment:

  • Kubernetes version (use kubectl version):
  • CNI Version: image: 602401143452.dkr.ecr.us-west-1.amazonaws.com/amazon-k8s-cni-init:v1.15.3-eksbuild.1
  • OS (e.g: cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a):
Linux ....compute.internal 5.10.198-187.748.amzn2.x86_64 #1 SMP Tue Oct 24 19:49:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
@AbeOwlu AbeOwlu added the bug label Mar 8, 2024
jdn5126 (Contributor) commented Mar 8, 2024

@AbeOwlu what is this "external event" that reclaims an IP on an ENI? Only the IPAM daemon should be assigning and unassigning IPs on an ENI. Before calling the EC2 API to unassign IPs, it removes those IPs from the datastore. That precondition is required to avoid this exact scenario.

@jdn5126 jdn5126 added question and removed bug labels Mar 8, 2024
AbeOwlu (Author) commented Mar 13, 2024

There is an automation pipeline that is (incorrectly, I might add) detecting a drift in the VPC network and unassigning an IP from the EC2 instance at the moment.

  • looking into this further, it actually appears that the CRI attempts to recreate the container sandbox, but the CNI was not responsive (connection refused on the 3 attempts), so the orchestrator may be handling this case.

Will update with more details and logs...

GnatorX commented Apr 19, 2024

I think I hit this issue too. Let me circle back with some more info

GnatorX commented May 31, 2024

We had this issue. aws/amazon-vpc-resource-controller-k8s#412 deleted branch ENIs from pods, and the CNI didn't do anything about the missing network interface or lost IP address.

orsenthil (Member) commented

@AbeOwlu - the CNI will not remove any interface that it doesn't manage. For any external changes introduced to interfaces that the CNI does manage, it will garbage collect them if they are not in use. If that didn't happen and you can reproduce this as a bug, let us know. Otherwise, we can close this ticket.


This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days.

@github-actions github-actions bot added the stale Issue or PR is stale label Aug 26, 2024

github-actions bot commented Sep 9, 2024

Issue closed due to inactivity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, or stale) Sep 9, 2024