Rework loadbalancer server selection logic #11329

Open · brandond wants to merge 11 commits into master from loadbalancer-rewrite

Conversation

@brandond (Member) commented Nov 15, 2024

Proposed Changes

This PR does the following:

  • groups related functions into separate files
  • simplifies the LoadBalancer public functions to be more consistent with regard to naming
  • moves the load-balancer server list into a new type, doing away with a number of redundant shared-state variables (localServerURL, defaultServerAddress, randomServers, currentServerAddress, nextServerIndex), and reworks the dialer and health-checking logic to use the sorted list maintained by that type (a rough sketch follows this list)
  • adds a fallback check to retrieve apiserver endpoints from the server address, in case of a total apiserver outage
  • fixes problems with tests that made it difficult to diagnose issues while testing the above changes
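
As a rough illustration of the consolidation described in the list above, here is a minimal Go sketch of a single list type that owns the server set and the default server in place of separate shared variables. All names here (serverList, setAddresses, and so on) are hypothetical, not the actual k3s types:

```go
package loadbalancer

import (
	"sort"
	"sync"
)

// serverList is a sketch of a type that owns the full set of servers,
// replacing separate shared-state variables for the default server,
// shuffled server order, current server, and next index.
type serverList struct {
	mu             sync.Mutex
	defaultAddress string
	servers        map[string]*server
}

type server struct {
	address   string
	isDefault bool
}

func newServerList(defaultAddress string) *serverList {
	return &serverList{
		defaultAddress: defaultAddress,
		servers: map[string]*server{
			defaultAddress: {address: defaultAddress, isDefault: true},
		},
	}
}

// setAddresses replaces the set of known servers, always retaining the
// default server even when it is not in the new list.
func (sl *serverList) setAddresses(addresses []string) {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	seen := map[string]bool{sl.defaultAddress: true}
	for _, address := range addresses {
		seen[address] = true
		if _, ok := sl.servers[address]; !ok {
			sl.servers[address] = &server{address: address}
		}
	}
	for address := range sl.servers {
		if !seen[address] {
			delete(sl.servers, address)
		}
	}
}

// addresses returns the known servers in a stable order; the real logic
// would sort by health-check state rather than lexically.
func (sl *serverList) addresses() []string {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	out := make([]string, 0, len(sl.servers))
	for address := range sl.servers {
		out = append(out, address)
	}
	sort.Strings(out)
	return out
}
```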

This should be easier to test and maintain, and provide more consistent behavior:

  1. All connections are sent to the same server (the active server), as long as it passes health checks and can be dialed.
  2. If a server fails health checks or cannot be dialed, we want to pick a new active server and close any remaining connections to the failed node so that they reconnect to the new server (see the sketch after this list).
  3. The new active server should be picked from the list of servers in order of preference:
    1. Servers that recently recovered from a failed state, as they may have just been taken down for patching and are less likely to go down again soon
    2. Servers that are passing health checks
    3. The default server (if it wouldn't otherwise be present in the server list)
    4. Servers that were failed, but have since passed a single health check or been dialed successfully
    5. Servers that failed their most recent health check or dial
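
For illustration only (hypothetical names, not the k3s code), a minimal sketch of the failure handling in point 2: when a server fails a health check or dial, it is marked unhealthy and its tracked connections are closed so the clients behind them reconnect and get routed to the new active server.

```go
package loadbalancer

import (
	"net"
	"sync"
)

// trackedServer is a sketch of per-server connection tracking: when the
// server fails a health check or dial, its connections are closed so the
// clients behind them reconnect through the load balancer and land on
// whichever server is now active.
type trackedServer struct {
	mu          sync.Mutex
	address     string
	healthy     bool
	connections map[net.Conn]struct{}
}

// track registers a new connection so it can be closed on failure.
func (s *trackedServer) track(conn net.Conn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.connections == nil {
		s.connections = map[net.Conn]struct{}{}
	}
	s.connections[conn] = struct{}{}
}

// recordFailure marks the server unhealthy and closes its connections.
func (s *trackedServer) recordFailure() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.healthy = false
	for conn := range s.connections {
		conn.Close()
		delete(s.connections, conn)
	}
}
```
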
Server state preference order:

```mermaid
flowchart LR
Active --> Preferred --> Healthy --> Unchecked --> Standby --> Recovering --> Failed
```
Possible state changes:

```mermaid
stateDiagram-v2
[*] --> Unchecked
[*] --> Standby
Unchecked --> Recovering
Unchecked --> Failed
Unchecked --> Invalid
Failed --> Recovering
Failed --> Invalid
Recovering --> Preferred
Recovering --> Active
Recovering --> Failed
Recovering --> Invalid
Standby --> Unchecked
Standby --> Invalid
Healthy --> Failed
Healthy --> Active
Healthy --> Invalid
Preferred --> Failed
Preferred --> Healthy
Preferred --> Active
Preferred --> Invalid
Active --> Failed
Active --> Invalid
Invalid --> [*]
```
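
The preference ordering above can be expressed as a comparison on an ordered state enum. The following is a hypothetical Go sketch that mirrors the diagram names; it is not the actual k3s implementation:

```go
package loadbalancer

import "sort"

// state values are declared in preference order: when picking a new
// active server, lower values win.
type state int

const (
	stateActive state = iota
	statePreferred
	stateHealthy
	stateUnchecked
	stateStandby
	stateRecovering
	stateFailed
	stateInvalid
)

type candidate struct {
	address string
	state   state
}

// sortByPreference orders candidates so the most preferred state comes
// first; ties keep their existing relative order.
func sortByPreference(candidates []candidate) {
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].state < candidates[j].state
	})
}
```
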
Logging

Health check state transitions are also logged at the INFO level, for better visibility when debug logging is not enabled:

Nov 17 20:45:32 systemd-node-2 systemd[1]: Starting Lightweight Kubernetes...

Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Updated load balancer k3s-agent-load-balancer default server: 172.17.0.4:6443"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->ACTIVE from successful dial"

Nov 17 20:45:47 systemd-node-2 k3s[362]: time="2024-11-17T20:45:47Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.8:6443"
Nov 17 20:45:47 systemd-node-2 k3s[362]: time="2024-11-17T20:45:47Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.7:6443 172.17.0.8:6443] [default: 172.17.0.4:6443]"

Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Connecting to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:46:48 systemd-node-2 k3s[362]: time="2024-11-17T20:46:48Z" level=info msg="Server 172.17.0.8:6443@PREFERRED->HEALTHY from successful health check"

Nov 17 20:46:53 systemd-node-2 systemd[1]: Starting Lightweight Kubernetes...

Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Updated load balancer k3s-agent-load-balancer default server: 172.17.0.4:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.8:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Server 172.17.0.8:6443@PREFERRED->ACTIVE from successful dial"

Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Connecting to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->FAILED from failed health check"

Nov 17 20:47:10 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:10Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:10 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:10Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:47:11 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:11Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:11 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:11Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443] [default: 172.17.0.4:6443]"

Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:50:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:50:50Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->HEALTHY from successful health check"

Types of Changes

tech debt

Verification

Testing

Yes

Linked Issues

User-Facing Change

Further Comments

@brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from 416f641 to f38a6cf, on November 16, 2024 04:21
@brandond marked this pull request as ready for review on November 16, 2024 04:29
@brandond requested a review from a team as a code owner on November 16, 2024 04:29

codecov bot commented Nov 16, 2024

Codecov Report

Attention: Patch coverage is 78.70036% with 118 lines in your changes missing coverage. Please review.

Project coverage is 42.59%. Comparing base (cd4dded) to head (fbdbcc5).
Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| pkg/agent/loadbalancer/servers.go | 83.02% | 41 Missing and 5 partials ⚠️ |
| pkg/etcd/etcdproxy.go | 0.00% | 19 Missing ⚠️ |
| pkg/agent/loadbalancer/loadbalancer.go | 69.04% | 8 Missing and 5 partials ⚠️ |
| pkg/agent/tunnel/tunnel.go | 83.33% | 10 Missing and 3 partials ⚠️ |
| pkg/server/router.go | 85.18% | 5 Missing and 3 partials ⚠️ |
| pkg/agent/loadbalancer/httpproxy.go | 80.00% | 5 Missing and 2 partials ⚠️ |
| pkg/agent/config/config.go | 63.63% | 3 Missing and 1 partial ⚠️ |
| pkg/util/client.go | 66.66% | 2 Missing and 1 partial ⚠️ |
| pkg/agent/proxy/apiproxy.go | 66.66% | 1 Missing ⚠️ |
| pkg/cli/token/token.go | 0.00% | 1 Missing ⚠️ |
| ... and 3 more | | |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master   #11329      +/-   ##
==========================================
- Coverage   47.18%   42.59%   -4.59%     
==========================================
  Files         179      180       +1     
  Lines       18600    18768     +168     
==========================================
- Hits         8776     7995     -781     
- Misses       8469     9573    +1104     
+ Partials     1355     1200     -155     
```

| Flag | Coverage Δ | |
|------|------------|---|
| e2etests | 34.81% <62.81%> (-7.40%) | ⬇️ |
| inttests | 18.69% <18.95%> (-16.00%) | ⬇️ |
| unittests | 14.18% <50.36%> (+0.35%) | ⬆️ |

@brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from a149b59 to 5443d8e, on November 16, 2024 10:21
@brandond changed the title from "[WIP] Rework loadbalancer server selection logic" to "Rework loadbalancer server selection logic" on Nov 16, 2024
@brandond force-pushed the loadbalancer-rewrite branch 13 times, most recently from 168f212 to a50f5d9, on November 19, 2024 11:24
@liyimeng (Contributor) commented:

@brandond Thanks! With the old implementation, we recently observed a panic when the apiserver was offline. I hope this sorts out all the issues.

@liyimeng (Contributor) commented:

I even suspect this might be the cause of #11346.

@brandond (Member, Author) commented:

@liyimeng what you're describing sounds like #10317, which was fixed a while ago.

None of these fields or functions are used in k3s or rke2

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
dereknola previously approved these changes Nov 21, 2024
@manuelbuil (Contributor) left a comment:

Looks good! One extra theoretical question: is it possible for the health check to succeed but the dial to fail?

Comment on lines +257 to +259:

```go
// otherwise, close its connections and demote it to preferred
s.state = statePreferred
defer s.closeAll()
```
Contributor:

I'm confused by this use case. How would the agent have more connections with a server which was not the previously active one?

@brandond (Member, Author) replied Nov 22, 2024:

The dialContext loop will dial all the servers in the list until one succeeds. Each dialing goroutine has its own copy of the list and may end up connecting to different servers and calling recordSuccess near-simultaneously if they are all stuck in recovering due to failing health checks. Once one passes health checks and becomes active, we should close all the connections to other servers so that they switch over to the active one with the most connections.

Note that only the server list has a lock on it; individual server state can be changed from different goroutines at any time.
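
For illustration only, a minimal sketch (hypothetical names, not the k3s code) of a dial loop like the one described: each goroutine walks its own snapshot of the address list in preference order and uses the first server that dials successfully.

```go
package loadbalancer

import (
	"context"
	"fmt"
	"net"
)

// dialFirstAvailable walks a snapshot of the address list in preference
// order and returns the first connection that succeeds. The real code
// would also record the success, which may promote that server and close
// connections held open to other servers.
func dialFirstAvailable(ctx context.Context, addresses []string) (net.Conn, error) {
	dialer := &net.Dialer{}
	var errs []error
	for _, address := range addresses {
		conn, err := dialer.DialContext(ctx, "tcp", address)
		if err == nil {
			return conn, nil
		}
		errs = append(errs, err)
	}
	return nil, fmt.Errorf("all servers failed: %v", errs)
}
```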

@brandond (Member, Author) commented Nov 22, 2024:

> @manuelbuil: Looks good! One extra theoretical question: is it possible for the health check to succeed but the dial to fail?

Yes, on RKE2 where the etcd and apiserver pods may not be up yet when the service has just started.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
…rivate

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@brandond force-pushed the loadbalancer-rewrite branch 2 times, most recently from 034ac01 to bca06c4, on November 22, 2024 10:14
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@liyimeng (Contributor) commented:

> @liyimeng what you're describing sounds like #10317, which was fixed a while ago.

Ha, I did not notice that. Yes, that is the fix I need.

@brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from 9e7c1bf to 3f388a5, on November 22, 2024 21:43
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
…watch fails

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
The error message should be printf style, not just concatenated. The
current message is garbled if the command or result contains things that
look like formatting directives:

`Internal error occurred: error sending request: Post "https://10.10.10.102:10250/exec/default/volume-test/volume-test?command=sh&command=-c&command=echo+local-path-test+%!!(MISSING)E(MISSING)+%!!(MISSING)F(MISSING)data%!!(MISSING)F(MISSING)test&error=1&output=1": proxy error from 127.0.0.1:6443 while dialing 10.10.10.102:10250, code 502: 502 Bad Gateway`
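
For illustration, a small standalone Go example (not the k3s code) of the difference: concatenating the payload into the format string lets its % runes be misread as formatting directives, while passing it as a printf argument prints it verbatim.

```go
package main

import "fmt"

func main() {
	// A result string containing URL-encoded characters, similar to the
	// garbled message above.
	result := `Post "https://10.10.10.102:10250/exec?command=echo+%2Fdata%2Ftest"`

	// Concatenated into the format string: the % runes in the payload are
	// treated as formatting directives and the output is garbled,
	// e.g. %!F(MISSING).
	fmt.Printf("error sending request: " + result + "\n")

	// Printf style: the payload is passed as an argument and printed
	// verbatim.
	fmt.Printf("error sending request: %s\n", result)
}
```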

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>