Rework loadbalancer server selection logic #11329

Open · brandond wants to merge 11 commits into master from loadbalancer-rewrite

Conversation

@brandond (Member) commented Nov 15, 2024

Proposed Changes

This PR does the following:

  • groups related functions into separate files
  • simplifies the LoadBalancer public functions to be more consistent with regard to naming
  • moves the load-balancer server list into a new type, doing away with a number of redundant shared-state variables (localServerURL, defaultServerAddress, randomServers, currentServerAddress, nextServerIndex), and reworks the dialer and health-checking logic to use the sorted list maintained by that type (a rough sketch follows this list)
  • adds a fallback check to retrieve apiserver endpoints from the server address, in case of a total apiserver outage
  • fixes problems with tests that made it difficult to diagnose issues while testing the above changes
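
As a rough illustration of the consolidation described in the list above, here is a minimal Go sketch of a single list type that owns the server set and the default server in place of separate shared variables. All names here (serverList, setAddresses, and so on) are hypothetical, not the actual k3s types:

```go
package loadbalancer

import (
	"sort"
	"sync"
)

// serverList is a sketch of a type that owns the full set of servers,
// replacing separate shared-state variables for the default server,
// shuffled server order, current server, and next index.
type serverList struct {
	mu             sync.Mutex
	defaultAddress string
	servers        map[string]*server
}

type server struct {
	address   string
	isDefault bool
}

func newServerList(defaultAddress string) *serverList {
	return &serverList{
		defaultAddress: defaultAddress,
		servers: map[string]*server{
			defaultAddress: {address: defaultAddress, isDefault: true},
		},
	}
}

// setAddresses replaces the set of known servers, always retaining the
// default server even when it is not in the new list.
func (sl *serverList) setAddresses(addresses []string) {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	seen := map[string]bool{sl.defaultAddress: true}
	for _, address := range addresses {
		seen[address] = true
		if _, ok := sl.servers[address]; !ok {
			sl.servers[address] = &server{address: address}
		}
	}
	for address := range sl.servers {
		if !seen[address] {
			delete(sl.servers, address)
		}
	}
}

// addresses returns the known servers in a stable order; the real logic
// would sort by health-check state rather than lexically.
func (sl *serverList) addresses() []string {
	sl.mu.Lock()
	defer sl.mu.Unlock()
	out := make([]string, 0, len(sl.servers))
	for address := range sl.servers {
		out = append(out, address)
	}
	sort.Strings(out)
	return out
}
```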

This should be easier to test and maintain, and provide more consistent behavior:

  1. All connections are sent to the same server (the active server), as long as it passes health checks and can be dialed.
  2. If a server fails health checks or cannot be dialed, we want to pick a new active server and close any remaining connections to the failed node so that they reconnect to the new server (see the sketch after this list).
  3. The new active server should be picked from the list of servers in order of preference:
    1. Servers that recently recovered from a failed state, as they may have just been taken down for patching and are less likely to go down again soon
    2. Servers that are passing health checks
    3. The default server (if it wouldn't otherwise be present in the server list)
    4. Servers that were failed, but have since passed a single health check or been dialed successfully
    5. Servers that failed their most recent health check or dial
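
For illustration only (hypothetical names, not the k3s code), a minimal sketch of the failure handling in point 2: when a server fails a health check or dial, it is marked unhealthy and its tracked connections are closed so the clients behind them reconnect and get routed to the new active server.

```go
package loadbalancer

import (
	"net"
	"sync"
)

// trackedServer is a sketch of per-server connection tracking: when the
// server fails a health check or dial, its connections are closed so the
// clients behind them reconnect through the load balancer and land on
// whichever server is now active.
type trackedServer struct {
	mu          sync.Mutex
	address     string
	healthy     bool
	connections map[net.Conn]struct{}
}

// track registers a new connection so it can be closed on failure.
func (s *trackedServer) track(conn net.Conn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.connections == nil {
		s.connections = map[net.Conn]struct{}{}
	}
	s.connections[conn] = struct{}{}
}

// recordFailure marks the server unhealthy and closes its connections.
func (s *trackedServer) recordFailure() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.healthy = false
	for conn := range s.connections {
		conn.Close()
		delete(s.connections, conn)
	}
}
```
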
Server state preference order:

```mermaid
flowchart LR
Active --> Preferred --> Healthy --> Unchecked --> Standby --> Recovering --> Failed
```
Possible state changes:

```mermaid
stateDiagram-v2
[*] --> Unchecked
[*] --> Standby
Unchecked --> Recovering
Unchecked --> Failed
Unchecked --> Invalid
Failed --> Recovering
Failed --> Invalid
Recovering --> Preferred
Recovering --> Active
Recovering --> Failed
Recovering --> Invalid
Standby --> Unchecked
Standby --> Invalid
Healthy --> Failed
Healthy --> Active
Healthy --> Invalid
Preferred --> Failed
Preferred --> Healthy
Preferred --> Active
Preferred --> Invalid
Active --> Failed
Active --> Invalid
Invalid --> [*]
```
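
The preference ordering above can be expressed as a comparison on an ordered state enum. The following is a hypothetical Go sketch that mirrors the diagram names; it is not the actual k3s implementation:

```go
package loadbalancer

import "sort"

// state values are declared in preference order: when picking a new
// active server, lower values win.
type state int

const (
	stateActive state = iota
	statePreferred
	stateHealthy
	stateUnchecked
	stateStandby
	stateRecovering
	stateFailed
	stateInvalid
)

type candidate struct {
	address string
	state   state
}

// sortByPreference orders candidates so the most preferred state comes
// first; ties keep their existing relative order.
func sortByPreference(candidates []candidate) {
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].state < candidates[j].state
	})
}
```
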
Logging

Health check state transitions are also logged at the INFO level, for better visibility when debug logging is not enabled:

Nov 17 20:45:32 systemd-node-2 systemd[1]: Starting Lightweight Kubernetes...

Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Updated load balancer k3s-agent-load-balancer default server: 172.17.0.4:6443"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->ACTIVE from successful dial"

Nov 17 20:45:47 systemd-node-2 k3s[362]: time="2024-11-17T20:45:47Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.8:6443"
Nov 17 20:45:47 systemd-node-2 k3s[362]: time="2024-11-17T20:45:47Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.7:6443 172.17.0.8:6443] [default: 172.17.0.4:6443]"

Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Connecting to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:46:48 systemd-node-2 k3s[362]: time="2024-11-17T20:46:48Z" level=info msg="Server 172.17.0.8:6443@PREFERRED->HEALTHY from successful health check"

Nov 17 20:46:53 systemd-node-2 systemd[1]: Starting Lightweight Kubernetes...

Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Updated load balancer k3s-agent-load-balancer default server: 172.17.0.4:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.8:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Server 172.17.0.8:6443@PREFERRED->ACTIVE from successful dial"

Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Connecting to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->FAILED from failed health check"

Nov 17 20:47:10 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:10Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:10 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:10Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:47:11 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:11Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:11 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:11Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443] [default: 172.17.0.4:6443]"

Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:50:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:50:50Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->HEALTHY from successful health check"

Types of Changes

tech debt

Verification

Testing

Yes

Linked Issues

User-Facing Change

Further Comments

@brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from 416f641 to f38a6cf, on November 16, 2024 04:21
@brandond marked this pull request as ready for review on November 16, 2024 04:29
@brandond requested a review from a team as a code owner on November 16, 2024 04:29

codecov bot commented Nov 16, 2024

Codecov Report

Attention: Patch coverage is 78.70036% with 118 lines in your changes missing coverage. Please review.

Project coverage is 42.59%. Comparing base (cd4dded) to head (fbdbcc5).
Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| pkg/agent/loadbalancer/servers.go | 83.02% | 41 Missing and 5 partials ⚠️ |
| pkg/etcd/etcdproxy.go | 0.00% | 19 Missing ⚠️ |
| pkg/agent/loadbalancer/loadbalancer.go | 69.04% | 8 Missing and 5 partials ⚠️ |
| pkg/agent/tunnel/tunnel.go | 83.33% | 10 Missing and 3 partials ⚠️ |
| pkg/server/router.go | 85.18% | 5 Missing and 3 partials ⚠️ |
| pkg/agent/loadbalancer/httpproxy.go | 80.00% | 5 Missing and 2 partials ⚠️ |
| pkg/agent/config/config.go | 63.63% | 3 Missing and 1 partial ⚠️ |
| pkg/util/client.go | 66.66% | 2 Missing and 1 partial ⚠️ |
| pkg/agent/proxy/apiproxy.go | 66.66% | 1 Missing ⚠️ |
| pkg/cli/token/token.go | 0.00% | 1 Missing ⚠️ |
| ... and 3 more | | |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master   #11329      +/-   ##
==========================================
- Coverage   47.18%   42.59%   -4.59%     
==========================================
  Files         179      180       +1     
  Lines       18600    18768     +168     
==========================================
- Hits         8776     7995     -781     
- Misses       8469     9573    +1104     
+ Partials     1355     1200     -155     
```

| Flag | Coverage Δ | |
|------|------------|---|
| e2etests | 34.81% <62.81%> (-7.40%) | ⬇️ |
| inttests | 18.69% <18.95%> (-16.00%) | ⬇️ |
| unittests | 14.18% <50.36%> (+0.35%) | ⬆️ |

@brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from a149b59 to 5443d8e, on November 16, 2024 10:21
@brandond changed the title from "[WIP] Rework loadbalancer server selection logic" to "Rework loadbalancer server selection logic" on Nov 16, 2024
@brandond force-pushed the loadbalancer-rewrite branch 13 times, most recently from 168f212 to a50f5d9, on November 19, 2024 11:24
@liyimeng (Contributor) commented:

@brandond Thanks! With the old implementation, we recently observed a panic when the apiserver was offline. I hope this sorts out all the issues.

@liyimeng (Contributor) commented:

I even suspect this might be the cause of #11346.

@brandond (Member, Author) commented:

@liyimeng what you're describing sounds like #10317, which was fixed a while ago.

None of these fields or functions are used in k3s or rke2

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
dereknola previously approved these changes Nov 21, 2024
@manuelbuil (Contributor) left a comment:

Looks good! One extra theoretical question: is it possible for the health check to succeed but the dial to fail?

Comment on lines +257 to +259:

```go
// otherwise, close its connections and demote it to preferred
s.state = statePreferred
defer s.closeAll()
```
Contributor:

I'm confused by this use case. How would the agent have more connections with a server which was not the previously active one?

@brandond (Member, Author) replied Nov 22, 2024:

The dialContext loop will dial all the servers in the list until one succeeds. Each dialing goroutine has its own copy of the list and may end up connecting to different servers and calling recordSuccess near-simultaneously if they are all stuck in recovering due to failing health checks. Once one passes health checks and becomes active, we should close all the connections to other servers so that they switch over to the active one with the most connections.

Note that only the server list has a lock on it; individual server state can be changed from different goroutines at any time.
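
For illustration only, a minimal sketch (hypothetical names, not the k3s code) of a dial loop like the one described: each goroutine walks its own snapshot of the address list in preference order and uses the first server that dials successfully.

```go
package loadbalancer

import (
	"context"
	"fmt"
	"net"
)

// dialFirstAvailable walks a snapshot of the address list in preference
// order and returns the first connection that succeeds. The real code
// would also record the success, which may promote that server and close
// connections held open to other servers.
func dialFirstAvailable(ctx context.Context, addresses []string) (net.Conn, error) {
	dialer := &net.Dialer{}
	var errs []error
	for _, address := range addresses {
		conn, err := dialer.DialContext(ctx, "tcp", address)
		if err == nil {
			return conn, nil
		}
		errs = append(errs, err)
	}
	return nil, fmt.Errorf("all servers failed: %v", errs)
}
```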

@brandond (Member, Author) commented Nov 22, 2024:

> @manuelbuil: Looks good! One extra theoretical question: is it possible for the health check to succeed but the dial to fail?

Yes, on RKE2 where the etcd and apiserver pods may not be up yet when the service has just started.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
…rivate

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@brandond force-pushed the loadbalancer-rewrite branch 2 times, most recently from 034ac01 to bca06c4, on November 22, 2024 10:14
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@liyimeng (Contributor) commented:

> @liyimeng what you're describing sounds like #10317, which was fixed a while ago.

Ha, I did not notice that. Yes, that is the fix I need.

@brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from 9e7c1bf to 3f388a5, on November 22, 2024 21:43
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
…watch fails

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
The error message should be printf style, not just concatenated. The
current message is garbled if the command or result contains things that
look like formatting directives:

`Internal error occurred: error sending request: Post "https://10.10.10.102:10250/exec/default/volume-test/volume-test?command=sh&command=-c&command=echo+local-path-test+%!!(MISSING)E(MISSING)+%!!(MISSING)F(MISSING)data%!!(MISSING)F(MISSING)test&error=1&output=1": proxy error from 127.0.0.1:6443 while dialing 10.10.10.102:10250, code 502: 502 Bad Gateway`
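
For illustration, a small standalone Go example (not the k3s code) of the difference: concatenating the payload into the format string lets its % runes be misread as formatting directives, while passing it as a printf argument prints it verbatim.

```go
package main

import "fmt"

func main() {
	// A result string containing URL-encoded characters, similar to the
	// garbled message above.
	result := `Post "https://10.10.10.102:10250/exec?command=echo+%2Fdata%2Ftest"`

	// Concatenated into the format string: the % runes in the payload are
	// treated as formatting directives and the output is garbled,
	// e.g. %!F(MISSING).
	fmt.Printf("error sending request: " + result + "\n")

	// Printf style: the payload is passed as an argument and printed
	// verbatim.
	fmt.Printf("error sending request: %s\n", result)
}
```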

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>