Embedded load-balancer behavior is flakey and hard to understand #11334

Open
brandond opened this issue Nov 17, 2024 · 0 comments
Labels: kind/enhancement (An improvement to existing functionality)

brandond (Member) commented Nov 17, 2024

The loadbalancer server list is a bit of a mess. Its behavior has been tinkered with a lot over the last year, but it's still hard to reason about, and this has caused a spate of issues.

From a code perspective, the loadbalancer state is directly accessed by a number of functions that all poke at various index vars, current and default server name vars, a list of server addresses, another RANDOM list of server addresses, and a map of addresses to structs that hold state:

serviceName string            // name of the service, used in log messages
configFile string             // path where the server list is persisted
localAddress string           // address the local listener is bound to
localServerURL string         // URL of the local listener
defaultServerAddress string   // address of the fallback (default) server
ServerURL string              // URL of the upstream server
ServerAddresses []string      // configured server addresses
randomServers []string        // shuffled copy of ServerAddresses
servers map[string]*server    // per-address connection and health-check state
currentServerAddress string   // address currently being dialed
nextServerIndex int           // index into randomServers used for failover
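
For comparison, here is a minimal sketch (not the actual k3s implementation; every type and field name here is hypothetical) of what consolidating this state behind a single lock could look like: one ordered server list instead of parallel slices, maps, and index variables.

package loadbalancer

import "sync"

// serverState is a hypothetical per-server record that keeps the address and
// health status (and, in a fuller version, connection tracking) in one place.
type serverState struct {
	address string
	healthy bool
}

// serverList is a hypothetical consolidation of the fields listed above:
// one mutex guarding one ordered slice of servers plus the default server.
type serverList struct {
	mutex         sync.RWMutex
	servers       []*serverState // ordered candidates; the default server can be kept last
	defaultServer string         // address of the fallback (default) server
}

// snapshot copies the current server list so callers can iterate and dial
// without holding the lock.
func (sl *serverList) snapshot() []serverState {
	sl.mutex.RLock()
	defer sl.mutex.RUnlock()
	out := make([]serverState, 0, len(sl.servers))
	for _, s := range sl.servers {
		out = append(out, *s)
	}
	return out
}

Dial-time decisions would then operate on the snapshot rather than iterating the live list under a read lock.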

The DialContext function is called whenever a new connection comes in, and holds a read lock while iterating (possibly twice) over the random server list, and servers may be added or removed at any time. The code is VERY hard to read and understand, given the number of variables involved:

var allChecksFailed bool
startIndex := lb.nextServerIndex
for {
	targetServer := lb.currentServerAddress

	server := lb.servers[targetServer]
	if server == nil || targetServer == "" {
		logrus.Debugf("Nil server for load balancer %s: %s", lb.serviceName, targetServer)
	} else if allChecksFailed || server.healthCheck() {
		dialTime := time.Now()
		conn, err := server.dialContext(ctx, network, targetServer)
		if err == nil {
			return conn, nil
		}
		logrus.Debugf("Dial error from load balancer %s after %s: %s", lb.serviceName, time.Now().Sub(dialTime), err)
		// Don't close connections to the failed server if we're retrying with health checks ignored.
		// We don't want to disrupt active connections if it is unlikely they will have anywhere to go.
		if !allChecksFailed {
			defer server.closeAll()
		}
	} else {
		logrus.Debugf("Dial health check failed for %s", targetServer)
	}

	newServer, err := lb.nextServer(targetServer)
	if err != nil {
		return nil, err
	}
	if targetServer != newServer {
		logrus.Debugf("Failed over to new server for load balancer %s: %s -> %s", lb.serviceName, targetServer, newServer)
	}
	if ctx.Err() != nil {
		return nil, ctx.Err()
	}

	maxIndex := len(lb.randomServers)
	if startIndex > maxIndex {
		startIndex = maxIndex
	}
	if lb.nextServerIndex == startIndex {
		if allChecksFailed {
			return nil, errors.New("all servers failed")
		}
		logrus.Debugf("Health checks for all servers in load balancer %s have failed: retrying with health checks ignored", lb.serviceName)
		allChecksFailed = true
	}
}

We should simplify the load-balancer behavior so that it functions more reliably and is easier to understand and explain.
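
As a rough illustration of the direction (a hypothetical sketch only, not the proposed implementation; the candidate type and dialFirstAvailable helper are made up for this example), a simpler failover dial could walk an ordered snapshot of servers, preferring healthy ones and only falling back to unhealthy ones after every healthy server has failed:

package loadbalancer

import (
	"context"
	"errors"
	"net"
)

// candidate is a hypothetical snapshot of one server's address and health.
type candidate struct {
	address string
	healthy bool
}

// dialFirstAvailable is a hypothetical helper: try healthy servers in order
// first, then retry every server with health checks ignored, and give up only
// when all servers have failed or the context is cancelled.
func dialFirstAvailable(ctx context.Context, network string, servers []candidate) (net.Conn, error) {
	var dialer net.Dialer
	for _, ignoreHealth := range []bool{false, true} {
		for _, s := range servers {
			if !ignoreHealth && !s.healthy {
				continue
			}
			if conn, err := dialer.DialContext(ctx, network, s.address); err == nil {
				return conn, nil
			}
			if err := ctx.Err(); err != nil {
				return nil, err
			}
		}
	}
	return nil, errors.New("all servers failed")
}

This keeps the retry semantics of the current loop (one pass honoring health checks, one pass ignoring them) but drops the shared nextServerIndex and currentServerAddress bookkeeping.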

@brandond brandond self-assigned this Nov 17, 2024
@brandond brandond moved this from New to Working in K3s Development Nov 17, 2024
@brandond brandond added this to the 2024-12 Release Cycle milestone Nov 17, 2024
@brandond brandond added the kind/enhancement An improvement to existing functionality label Nov 17, 2024
@brandond brandond moved this from Working to Peer Review in K3s Development Nov 22, 2024