Environmental Info:
K3s Version:
v1.30.6+k3s1 but earlier releases are affected as well... probably going back to May/June when the load-balancer stuff was last worked on. Maybe this has never worked right, I'm not sure yet.
Node(s) CPU architecture, OS, and Version:
n/a
Cluster Configuration:
cluster with dedicated etcd-only and control-plane-only servers
Describe the bug:
In the current state, if alternating apiservers are shut down for more than about 15 seconds, the cluster essentially gets stuck until the last apiserver is started again, or /var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json is removed from all nodes and k3s is restarted. Removing the JSON file forces the nodes to re-fetch the apiserver list from etcd, which is guaranteed to contain the address of the most recently started apiserver node.
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Starting k3s agent v1.31.2+k3s1 (6da20424)"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.4:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 172.17.0.4:6443"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Supervisor proxy using supervisor=https://127.0.0.1:6444 apiserver=https://127.0.0.1:6444 lb=true"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Dial error from load balancer k3s-agent-load-balancer after 163.025µs: dial tcp 172.17.0.7:6443: connect: connection refused"
Nov 21 09:22:07 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:07Z" level=debug msg="Failed over to new server for load balancer k3s-agent-load-balancer: 172.17.0.7:6443 -> 172.17.0.4:6443"
Nov 21 09:22:09 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:09Z" level=info msg="Getting list of apiserver endpoints from server"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Failed to retrieve list of apiservers from server: Get \"https://127.0.0.1:6444/v1-k3s/apiservers\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Got apiserver addresses from supervisor: []"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: time="2024-11-21T09:22:19Z" level=info msg="Got apiserver addresses from kubernetes endpoints: []"
Nov 21 09:22:19 systemd-node-6 k3s[1084]: E1121 09:22:19.786464 1084 server.go:666] "Failed to retrieve node info" err="apiserver disabled"
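A minimal sketch of the state-file behavior described above (this is not the actual k3s code; the struct and field names are invented for illustration): the agent seeds its load balancer from the cached JSON file and only asks the supervisor when that cache is missing, which is why stale, dead addresses can persist.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// lbState is a hypothetical stand-in for the cached load-balancer state;
// the real file layout may differ.
type lbState struct {
	ServerAddresses []string `json:"ServerAddresses"`
}

// loadServers prefers whatever the on-disk cache says, even if every cached
// address is currently unreachable; the supervisor/etcd list is only consulted
// when the cache is absent or unreadable. This is why deleting
// k3s-agent-load-balancer.json and restarting k3s unsticks the nodes.
func loadServers(path string, fetchFromSupervisor func() []string) []string {
	if b, err := os.ReadFile(path); err == nil {
		var state lbState
		if json.Unmarshal(b, &state) == nil && len(state.ServerAddresses) > 0 {
			return state.ServerAddresses
		}
	}
	return fetchFromSupervisor()
}

func main() {
	servers := loadServers("/var/lib/rancher/k3s/agent/etc/k3s-agent-load-balancer.json",
		func() []string { return []string{"172.17.0.4:6443"} })
	fmt.Println(servers)
}
```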
Steps To Reproduce:
Create a cluster with 3 etcd, 2 control-plane, 1 agent. Use the first etcd node as the --server address for the other nodes.
Stop k3s on the first control-plane node.
Wait 20 seconds. Note that the first control-plane node is removed from the apiserver load-balancer on the etcd and agent nodes, as the surviving control-plane node has removed it from the apiserver endpoint list due to it being down.
Stop k3s on the second control-plane node, and start k3s on the first control-plane node.
Note that the etcd and agent nodes do not fail back to the first control-plane node, as it was removed from the loadbalancer. They will go NotReady in the node list after a minute or so (see the sketch after these steps).
Start k3s on the second control-plane server.
Note that the etcd and agent nodes reconnect and go back to Ready.
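As a rough illustration of the mechanism these steps exercise (this is not k3s source; the type and method names are made up): the load balancer's server set is replaced wholesale with whatever the reachable apiservers currently report, so a server that was dropped while it was down never comes back once the apiservers reporting it go away.

```go
package main

import "fmt"

// loadBalancer is a toy stand-in for the agent load balancer.
type loadBalancer struct {
	servers map[string]bool
}

// setServers mirrors the "Adding/Removing server from load balancer" log
// lines: the new list fully replaces the old one, with no memory of servers
// that were previously known.
func (lb *loadBalancer) setServers(addresses []string) {
	next := map[string]bool{}
	for _, a := range addresses {
		next[a] = true
	}
	lb.servers = next
}

func main() {
	lb := &loadBalancer{servers: map[string]bool{}}

	lb.setServers([]string{"172.17.0.4:6443", "172.17.0.7:6443"}) // both apiservers up
	lb.setServers([]string{"172.17.0.7:6443"})                    // step 2: .4 stopped and dropped
	// Step 3: .7 is stopped and .4 is started again, but no reachable apiserver
	// is left to report .4, so the last known list never changes.
	fmt.Println(lb.servers) // map[172.17.0.7:6443:true] -- stuck on a dead server
}
```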
Expected behavior:
etcd and agent nodes fail back to the first control-plane node, and do not require the second control-plane node to come back up before they are functional.
Actual behavior:
etcd and agent nodes are stuck talking to an etcd node with no apiserver, and a control-plane node that is down.
Additional context / logs:
Our docs say to use a fixed registration address that is backed by multiple servers, but that guidance is not always followed. I'm not sure if it even actually makes a difference in this case.
Options:
Periodically re-sync the apiserver list from etcd, in case none of the previously known apiserver endpoints are available (a rough sketch of this option follows the list).
Have the loadbalancer watch addresses of control-plane nodes, instead of apiserver endpoints.
Watch both apiserver endpoints AND nodes, weight a server lower if it's not in both lists, and only remove it from the loadbalancer once it's gone from both. Compared to the other options, this might just be doing more work for little benefit.
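For the first option, a hedged sketch of what a periodic re-sync might look like (the interface, function names, and interval below are invented for illustration and are not existing k3s APIs):

```go
package loadbalancer

import (
	"context"
	"time"
)

// balancer is a hypothetical view of the agent load balancer.
type balancer interface {
	HasHealthyServer() bool
	SetServers(addresses []string)
}

// resyncLoop re-seeds the balancer from etcd, but only when every currently
// known server is unhealthy, so it would not interfere with the normal
// endpoint-driven updates.
func resyncLoop(ctx context.Context, lb balancer, listFromEtcd func(context.Context) ([]string, error)) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if lb.HasHealthyServer() {
				continue
			}
			if addresses, err := listFromEtcd(ctx); err == nil && len(addresses) > 0 {
				lb.SetServers(addresses)
			}
		}
	}
}
```

The fallback only fires when the whole known server set is down, which is exactly the situation in the reproduction above where the cluster currently gets stuck.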