-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework loadbalancer server selection logic #11329
base: master
Are you sure you want to change the base?
Conversation
416f641
to
f38a6cf
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #11329 +/- ##
==========================================
- Coverage 47.18% 42.59% -4.59%
==========================================
Files 179 180 +1
Lines 18600 18768 +168
==========================================
- Hits 8776 7995 -781
- Misses 8469 9573 +1104
+ Partials 1355 1200 -155
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
a149b59
to
5443d8e
Compare
168f212
to
a50f5d9
Compare
@brandond Thanks! With old implementation, we recently observe a panic when apiserver is offline. I hope this sort out all issues. |
I even doubt that this is the cause of #11346 |
a50f5d9
to
74dde01
Compare
None of these fields or functions are used in k3s or rke2 Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
74dde01
to
7e516bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! One extra theoretical question: is it possible for healthcheck to work but dial to fail?
// otherwise, close its connections and demote it to preferred | ||
s.state = statePreferred | ||
defer s.closeAll() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by this use case. How would the agent have more connections with a server which was not the previously active one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dialContext loop will dial all the servers in the list until one succeeds. Each dialing goroutine has its own copy of the list and may end up connecting to different servers and calling recordSuccess near-simultaneously if they are all stuck in recovering due to failing health checks. Once one passes health checks and becomes active, we should close all the connections to other servers so that they switch over to the active one with the most connections.
Note that only the server list has a lock on it, individual server state can be changed from different goroutines at any time.
7e516bd
to
bb6b0cb
Compare
Yes, on RKE2 where the etcd and apiserver pods may not be up yet when the service has just started. |
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
…rivate Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
034ac01
to
bca06c4
Compare
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
bca06c4
to
b74b2b1
Compare
9e7c1bf
to
3f388a5
Compare
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
…watch fails Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
The error message should be printf style, not just concatenated. The current message is garbled if the command or result contains things that look like formatting directives: `Internal error occurred: error sending request: Post "https://10.10.10.102:10250/exec/default/volume-test/volume-test?command=sh&command=-c&command=echo+local-path-test+%!!(MISSING)E(MISSING)+%!!(MISSING)F(MISSING)data%!!(MISSING)F(MISSING)test&error=1&output=1": proxy error from 127.0.0.1:6443 while dialing 10.10.10.102:10250, code 502: 502 Bad Gateway` Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
3f388a5
to
fbdbcc5
Compare
Proposed Changes
This PR does the following:
This should be easier to test and maintain, and provide more consistent behavior:
Server state preference order:
Possible state changes:
Logging
Health check state transitions are also logged at INFO level, for better visibility when not running with debug logging enabled:
Types of Changes
tech debt
Verification
Testing
Yes
Linked Issues
User-Facing Change
Further Comments