etcd-only and agent nodes do not properly fail back during apiserver outage #11349

Open
brandond opened this issue Nov 21, 2024 · 0 comments
Labels: kind/enhancement (An improvement to existing functionality)

brandond commented Nov 21, 2024

Environmental Info:
K3s Version:
v1.30.6+k3s1, but earlier releases are affected as well... probably going back to May/June when the load-balancer code was last reworked. Maybe this has never worked right; I'm not sure yet.

Node(s) CPU architecture, OS, and Version:
n/a

Cluster Configuration:
cluster with dedicated etcd-only and control-plane-only servers

Describe the bug:

In the current state, if alternating apiservers are shut down for more than about 15 seconds, the cluster essentially gets stuck until the last apiserver to be shut down is started again, or until /var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json is removed from all nodes and k3s is restarted. Removing the JSON file forces the nodes to re-fetch the apiserver list from etcd, which is guaranteed to contain the address of the most recently started apiserver node.
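For context on why deleting the state file helps, here is a minimal Go sketch of the startup behavior described above. The names and the JSON shape are hypothetical stand-ins, not the actual k3s code: it just assumes the agent seeds its load balancer from the cached file when it exists, and only falls back to the --server address otherwise.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Hypothetical shape of a cached load balancer state; the real
// k3s-agent-load-balancer.json may differ.
type lbState struct {
	ServerAddresses []string `json:"ServerAddresses"`
}

// seedServers returns the backends the load balancer starts with: the cached
// list when the state file exists, otherwise just the configured --server
// address. Deleting the file therefore forces a fresh start from --server,
// after which the apiserver list is re-fetched.
func seedServers(stateFile, defaultServer string) []string {
	data, err := os.ReadFile(stateFile)
	if err != nil {
		return []string{defaultServer} // no cache: start from --server only
	}
	var state lbState
	if err := json.Unmarshal(data, &state); err != nil || len(state.ServerAddresses) == 0 {
		return []string{defaultServer}
	}
	return state.ServerAddresses // cache wins, even if every entry is stale
}

func main() {
	servers := seedServers("/var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json", "172.17.0.4:6443")
	fmt.Println("initial load balancer backends:", servers)
}
```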

Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Starting k3s agent v1.31.2+k3s1 (6da20424)"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.4:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 172.17.0.4:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Supervisor proxy using supervisor=https://127.0.0.1:6444 apiserver=https://127.0.0.1:6444 lb=true"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Dial error from load balancer k3s-agent-load-balancer after 163.025µs: dial tcp 172.17.0.7:6443: connect: connection refused"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Failed over to new server for load balancer k3s-agent-load-balancer: 172.17.0.7:6443 -> 172.17.0.4:6443"

Nov 21 09:22:09 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:09Z" level=info msg="Getting list of apiserver endpoints from server"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Failed to retrieve list of apiservers from server: Get \"https://127.0.0.1:6444/v1-k3s/apiservers\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Got apiserver addresses from supervisor: []"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Got apiserver addresses from kubernetes endpoints: []"

Nov 21 09:22:19 systemd-node-6 k3s[1084]: E1121 09:22:19.786464    1084 server.go:666] "Failed to retrieve node info" err="apiserver disabled"
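The failure mode in the logs above boils down to a circular dependency: the agent can only refresh its apiserver list through its own local load balancer, so once every cached backend is down the refresh itself times out. The following is a small, self-contained Go sketch of that pattern, using hypothetical names (and addresses borrowed from the log excerpt), not the actual k3s code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins to illustrate the behavior; not the actual k3s types.
type loadBalancer struct {
	servers []string // backends currently known to the local load balancer
}

// getAPIServersViaLB models "Getting list of apiserver endpoints from server":
// the request goes through the local load balancer, so it can only succeed if
// at least one of the load balancer's current backends is reachable.
func getAPIServersViaLB(ctx context.Context, lb *loadBalancer, reachable map[string]bool) ([]string, error) {
	for _, s := range lb.servers {
		if reachable[s] {
			// In reality this would be an HTTPS request to the supervisor,
			// proxied through 127.0.0.1:6444.
			return []string{s}, nil
		}
	}
	<-ctx.Done() // nothing reachable: the request just times out
	return nil, errors.New("context deadline exceeded")
}

func main() {
	// State similar to the log excerpt above: every backend the load balancer
	// currently knows about is down.
	lb := &loadBalancer{servers: []string{"172.17.0.7:6443", "172.17.0.4:6443"}}
	reachable := map[string]bool{} // no apiserver is reachable right now

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	addrs, err := getAPIServersViaLB(ctx, lb, reachable)
	fmt.Println(addrs, err) // [] context deadline exceeded, i.e. "Got apiserver addresses from supervisor: []"

	// The backend list is only ever refreshed through this same path, so once
	// every cached backend is down the agent can never learn about an
	// apiserver that comes back on an address it has already removed.
}
```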

Steps To Reproduce:

  1. Create a cluster with 3 etcd, 2 control-plane, and 1 agent node. Use the first etcd node as the --server address for the other nodes.
  2. Stop k3s on the first control-plane node.
  3. Wait 20 seconds. Note that the first control-plane node is removed from the apiserver load balancer on the etcd and agent nodes, as the surviving control-plane node has removed it from the apiserver endpoint list because it is down.
  4. Stop k3s on the second control-plane node, and start k3s on the first control-plane node.
  5. Note that the etcd and agent nodes do not fail back over to the first control-plane node, as it was removed from the load balancer. They will go NotReady in the node list after a minute or so.
  6. Start k3s on the second control-plane server.
  7. Note that the etcd and agent nodes reconnect and go back to Ready.

Expected behavior:
etcd and agent nodes fail back to the first control-plane node, and do not require the second control-plane node to come back up before they are functional.

Actual behavior:
etcd and agent nodes are stuck talking to an etcd node with no apiserver and to a control-plane node that is down.

Additional context / logs:

Our docs say to use a fixed registration address that is backed by multiple servers, but that guidance is not always followed. I'm not sure it would actually make a difference in this case anyway.

Options:

  • Periodically re-sync the apiserver list from etcd, in case all of the previously known apiserver endpoints become unavailable (a rough sketch of this option follows the list).
  • Have the load balancer watch the addresses of control-plane nodes, instead of apiserver endpoints.
  • Watch both apiserver endpoints AND nodes: weight a server lower if it is not in both lists, and only remove it from the load balancer once it is gone from both. Compared to the other options, this might just be doing more work for little benefit.
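To make the first option concrete, here is a minimal Go sketch of a periodic re-sync loop. The names (loadBalancer, resyncLoop, fetchFromEtcd) are hypothetical stand-ins, not the actual k3s types or APIs; the point is only that the refresh bypasses the local load balancer when none of its backends are healthy.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical types and names to illustrate the idea; not the actual k3s code.

// fetchFromEtcd stands in for reading the apiserver endpoint list straight
// from etcd, bypassing the local load balancer entirely.
type fetchFromEtcd func(ctx context.Context) ([]string, error)

type loadBalancer struct {
	servers []string
	healthy bool // true if at least one backend is currently reachable
}

func (lb *loadBalancer) setServers(addrs []string) { lb.servers = addrs }

// resyncLoop sketches the first option: on an interval, if none of the current
// backends are reachable, refresh the backend list directly from etcd.
func resyncLoop(ctx context.Context, lb *loadBalancer, fetch fetchFromEtcd, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if lb.healthy {
				continue // normal operation, nothing to do
			}
			addrs, err := fetch(ctx)
			if err != nil || len(addrs) == 0 {
				continue // etcd unavailable, or no apiservers registered yet
			}
			fmt.Printf("re-syncing load balancer backends from etcd: %v\n", addrs)
			lb.setServers(addrs)
		}
	}
}

func main() {
	// Every cached backend is down, but etcd already knows about the apiserver
	// that just came back on a previously removed address.
	lb := &loadBalancer{servers: []string{"172.17.0.7:6443"}, healthy: false}
	fetch := func(ctx context.Context) ([]string, error) {
		return []string{"172.17.0.4:6443"}, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 250*time.Millisecond)
	defer cancel()
	resyncLoop(ctx, lb, fetch, 100*time.Millisecond)
	fmt.Println("backends after re-sync:", lb.servers)
}
```

Since the re-sync only runs when no backend is healthy, the extra load on etcd stays negligible, while stuck nodes can still discover an apiserver that comes back on an address they have already removed.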
Projects: Status: Peer Review