etcd-only and agent nodes do not properly fail back during apiserver outage #11349

Open
brandond opened this issue Nov 21, 2024 · 0 comments
Labels: kind/enhancement (An improvement to existing functionality)

brandond commented Nov 21, 2024

Environmental Info:
K3s Version:
v1.30.6+k3s1, but earlier releases are affected as well... probably going back to May/June when the load-balancer code was last reworked. Maybe this has never worked right; I'm not sure yet.

Node(s) CPU architecture, OS, and Version:
n/a

Cluster Configuration:
cluster with dedicated etcd-only and control-plane-only servers

Describe the bug:

In the current state, if alternating apiservers are shut down for more than about 15 seconds, the cluster essentially gets stuck until the last apiserver to be shut down is started again, or until /var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json is removed from all nodes and k3s is restarted. Removing the JSON file forces the nodes to re-fetch the apiserver list from etcd, which is guaranteed to contain the address of the most recently started apiserver node.
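For context on why deleting the state file helps, here is a minimal Go sketch of the startup behavior described above. The names and the JSON shape are hypothetical stand-ins, not the actual k3s code: it just assumes the agent seeds its load balancer from the cached file when it exists, and only falls back to the --server address otherwise.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Hypothetical shape of a cached load balancer state; the real
// k3s-agent-load-balancer.json may differ.
type lbState struct {
	ServerAddresses []string `json:"ServerAddresses"`
}

// seedServers returns the backends the load balancer starts with: the cached
// list when the state file exists, otherwise just the configured --server
// address. Deleting the file therefore forces a fresh start from --server,
// after which the apiserver list is re-fetched.
func seedServers(stateFile, defaultServer string) []string {
	data, err := os.ReadFile(stateFile)
	if err != nil {
		return []string{defaultServer} // no cache: start from --server only
	}
	var state lbState
	if err := json.Unmarshal(data, &state); err != nil || len(state.ServerAddresses) == 0 {
		return []string{defaultServer}
	}
	return state.ServerAddresses // cache wins, even if every entry is stale
}

func main() {
	servers := seedServers("/var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json", "172.17.0.4:6443")
	fmt.Println("initial load balancer backends:", servers)
}
```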

Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Starting k3s agent v1.31.2+k3s1 (6da20424)"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.4:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 172.17.0.4:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Supervisor proxy using supervisor=https://127.0.0.1:6444 apiserver=https://127.0.0.1:6444 lb=true"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Dial error from load balancer k3s-agent-load-balancer after 163.025µs: dial tcp 172.17.0.7:6443: connect: connection refused"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Failed over to new server for load balancer k3s-agent-load-balancer: 172.17.0.7:6443 -> 172.17.0.4:6443"

Nov 21 09:22:09 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:09Z" level=info msg="Getting list of apiserver endpoints from server"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Failed to retrieve list of apiservers from server: Get \"https://127.0.0.1:6444/v1-k3s/apiservers\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Got apiserver addresses from supervisor: []"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Got apiserver addresses from kubernetes endpoints: []"

Nov 21 09:22:19 systemd-node-6 k3s[1084]: E1121 09:22:19.786464    1084 server.go:666] "Failed to retrieve node info" err="apiserver disabled"
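The failure mode in the logs above boils down to a circular dependency: the agent can only refresh its apiserver list through its own local load balancer, so once every cached backend is down the refresh itself times out. The following is a small, self-contained Go sketch of that pattern, using hypothetical names (and addresses borrowed from the log excerpt), not the actual k3s code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins to illustrate the behavior; not the actual k3s types.
type loadBalancer struct {
	servers []string // backends currently known to the local load balancer
}

// getAPIServersViaLB models "Getting list of apiserver endpoints from server":
// the request goes through the local load balancer, so it can only succeed if
// at least one of the load balancer's current backends is reachable.
func getAPIServersViaLB(ctx context.Context, lb *loadBalancer, reachable map[string]bool) ([]string, error) {
	for _, s := range lb.servers {
		if reachable[s] {
			// In reality this would be an HTTPS request to the supervisor,
			// proxied through 127.0.0.1:6444.
			return []string{s}, nil
		}
	}
	<-ctx.Done() // nothing reachable: the request just times out
	return nil, errors.New("context deadline exceeded")
}

func main() {
	// State similar to the log excerpt above: every backend the load balancer
	// currently knows about is down.
	lb := &loadBalancer{servers: []string{"172.17.0.7:6443", "172.17.0.4:6443"}}
	reachable := map[string]bool{} // no apiserver is reachable right now

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	addrs, err := getAPIServersViaLB(ctx, lb, reachable)
	fmt.Println(addrs, err) // [] context deadline exceeded, i.e. "Got apiserver addresses from supervisor: []"

	// The backend list is only ever refreshed through this same path, so once
	// every cached backend is down the agent can never learn about an
	// apiserver that comes back on an address it has already removed.
}
```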

Steps To Reproduce:

  1. Create a cluster with 3 etcd, 2 control-plane, and 1 agent node. Use the first etcd node as the --server address for the other nodes.
  2. Stop k3s on the first control-plane node.
  3. Wait 20 seconds. Note that the first control-plane node is removed from the apiserver load balancer on the etcd and agent nodes, as the surviving control-plane node has removed it from the apiserver endpoint list because it is down.
  4. Stop k3s on the second control-plane node, and start k3s on the first control-plane node.
  5. Note that the etcd and agent nodes do not fail back over to the first control-plane node, as it was removed from the load balancer. They will go NotReady in the node list after a minute or so.
  6. Start k3s on the second control-plane server.
  7. Note that the etcd and agent nodes reconnect and go back to Ready.

Expected behavior:
etcd and agent nodes fail back to the first control-plane node, and do not require the second control-plane node to come back up before they are functional.

Actual behavior:
etcd and agent nodes are stuck talking to an etcd node with no apiserver and to a control-plane node that is down.

Additional context / logs:

Our docs say to use a fixed registration address that is backed by multiple servers, but that guidance is not always followed. I'm not sure it would actually make a difference in this case anyway.

Options:

  • Periodically re-sync the apiserver list from etcd, in case all of the previously known apiserver endpoints become unavailable (a rough sketch of this option follows the list).
  • Have the load balancer watch the addresses of control-plane nodes, instead of apiserver endpoints.
  • Watch both apiserver endpoints AND nodes: weight a server lower if it is not in both lists, and only remove it from the load balancer once it is gone from both. Compared to the other options, this might just be doing more work for little benefit.
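To make the first option concrete, here is a minimal Go sketch of a periodic re-sync loop. The names (loadBalancer, resyncLoop, fetchFromEtcd) are hypothetical stand-ins, not the actual k3s types or APIs; the point is only that the refresh bypasses the local load balancer when none of its backends are healthy.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical types and names to illustrate the idea; not the actual k3s code.

// fetchFromEtcd stands in for reading the apiserver endpoint list straight
// from etcd, bypassing the local load balancer entirely.
type fetchFromEtcd func(ctx context.Context) ([]string, error)

type loadBalancer struct {
	servers []string
	healthy bool // true if at least one backend is currently reachable
}

func (lb *loadBalancer) setServers(addrs []string) { lb.servers = addrs }

// resyncLoop sketches the first option: on an interval, if none of the current
// backends are reachable, refresh the backend list directly from etcd.
func resyncLoop(ctx context.Context, lb *loadBalancer, fetch fetchFromEtcd, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if lb.healthy {
				continue // normal operation, nothing to do
			}
			addrs, err := fetch(ctx)
			if err != nil || len(addrs) == 0 {
				continue // etcd unavailable, or no apiservers registered yet
			}
			fmt.Printf("re-syncing load balancer backends from etcd: %v\n", addrs)
			lb.setServers(addrs)
		}
	}
}

func main() {
	// Every cached backend is down, but etcd already knows about the apiserver
	// that just came back on a previously removed address.
	lb := &loadBalancer{servers: []string{"172.17.0.7:6443"}, healthy: false}
	fetch := func(ctx context.Context) ([]string, error) {
		return []string{"172.17.0.4:6443"}, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	resyncLoop(ctx, lb, fetch, 100*time.Millisecond)
	fmt.Println("backends after re-sync:", lb.servers)
}
```

Since the re-sync only runs when no backend is healthy, the extra load on etcd stays negligible, while stuck nodes can still discover an apiserver that comes back on an address they have already removed.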
Projects: Status: Peer Review