Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: Add Route Reconciliation #1749

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from
Draft

Fix: Add Route Reconciliation #1749

wants to merge 13 commits into from

Conversation

aauren
Copy link
Collaborator

@aauren aauren commented Oct 9, 2024

This PR is still very early and is not fully implemented, but I wanted to start to get some feedback going on the direction of this before I got too much further down the road.

The end goal of this PR is to enable route reconciliation within the route sync controller. However, in order to do that, I wanted to break out injectRoute() from the NetworkRoutesController.

Also, over the years, it has become obvious that the injectRoute() function is hard for new users to parse, and there is often a lot of confusion about how it works because the logic is complicated. So this also attempts to split out the logic in inject route into BGP type logic as well as Route type logic.

Final fix for: #1738

@aauren aauren added the override-stale Don't allow automatic management of stale issues / PRs label Oct 9, 2024
@@ -0,0 +1,45 @@
package pkg
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twz123 I wanted to get your feedback about this convention within Go.

As I started to get into the refactoring of InjectRoute, I found that I kept getting into circular dependency type situations between the bgp and routes packages.

After watching a video from Brian Ketelsen about structuring go programs, as well as several other posts, it seems like pulling up your domain models into the root of the project is a conventional way of working around circular dependencies.

Doing this with the below methods, did work out for me, but I noticed that this architecture is limited. For instance, I couldn't leave the TunnelCreator interface with EncapType and EncapPort because those are declared in the tunnels package. And for some reason, it didn't feel right to bring those up into this top level package either.

So I worked around it for now, but maybe this is showing that either I should have brought them up, or that maybe I'm embracing the wrong pattern here.

I would be curious for your thoughts.

@@ -1,82 +0,0 @@
package routes
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twz123 I ended up bringing this back into the routing package because I realized that it was behaving a bit more like an extension of the NetworkRoutesController, in that, in order to have it process route reconciliation I needed it to work with the routes source of truth, which is BGP. At that point it is no longer strictly related to host routes (although now that I'm thinking about it, it means that maybe host_route_sync.go isn't the right name for the file any more either.

Anyway, curious about your thoughts here as well.

@@ -30,3 +32,109 @@ func DeleteByDestination(destinationSubnet *net.IPNet) error {
}
return nil
}

func InjectRoute(subnet *net.IPNet, gw net.IP, ipa utils.NodeIPAware, tn pkg.Tunneler, rs pkg.RouteSyncer,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twz123 - Now that this logic is disentangled with the BGP logic that used to be buried in here I like it a lot better. It feels a lot cleaner and more straight forward to me.

pkg/bgp/path.go Outdated
"k8s.io/klog/v2"
)

func PathChanged(path *gobgpapi.Path, pl PeerLister, rs pkg.RouteSyncer, tc pkg.TunnelCleaner) error {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twz123 - This feels like the right pattern, because it decouples the routes package from the bgp package by making it implementation agnostic. So I think that making all of the arguments to this function interfaces is the right thing to do, but wanted to see if you had any feedback.

type LinuxRouter struct {
NodeIPA pkg.NodeIPAware
Tunneler pkg.Tunneler
RouteSyncer pkg.RouteSyncer
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't particularly like the circular dependency here. LinuxRouter requires pkg.RouteSyncer so that it can update the state host routes. pkg.RouteSyncer in turn requires LinuxRouter so that it can reuse the InjectRoute() logic when new routes are found during checkCacheAgainstBGP().

It feels like there should be a better / cleaner way to do this.

@aauren aauren force-pushed the aauren/issue1738 branch 2 times, most recently from 25b6ff4 to c204b26 Compare October 27, 2024 20:34
@aauren aauren marked this pull request as ready for review October 27, 2024 20:35
Add a route reconciliation step to the host_route_sync file that uses
the BGP state table as a system of record for the routes that should or
should not be synced to the system.

In theory, kube-router should be consistently informed of route changes
via BGP Path change events that are sent to us from GoBGP. However, in
practice it seems that some routes are getting stuck in our state table.
Host route syncing is an important part of kube-router that is run in
its own goroutine, so we should be checking it for health the way we do
the major controllers to make sure that it never stops functioning.

Additionally, as the health check controller has been continuously added
on to, we also make some modifications here to make it a bit more robust
and scalable.
In addition to added / removed routes, there is also the chance that the
gateway has changed and we need to track that as well by fixing the
cache state.
We don't want to add paths that are either not the Best path, which is
something that we get for free on the BGP Watch method that we have in
the NRC, or paths that are currently being withdrawn (even though this
should be an edge case).
In a List operation, we get our own routes that we've advertised. This
was a problem that we didn't have with BGP Watches. However, this now
affects us, so we need to remove IPs that represent ourselves, so that
we don't add needless routes.
Adding a boolean to the return values of InjectRoute() allows us to
track when a route is actually inserted vs. when it was called. This
helps us better represent HostRoutesStaleAddedCounter metric.

Without this, the administrator will see a bunch of routes seemingly
being added every time we call checkCacheAgainstBGP() when, in fact,
most of them are tossed out depending on the user's mesh subnet
settings.
There are a couple of items that have typically ended up in a no-op for
us when considering routes to inject. However, now that we have a route
map where we track route state, we need this not just to be no-ops, but
also update the route state cache as well to ensure that the route
doesn't get replaced in the future.

When we find tunnels to clean up, we need to not only remove the tunnel
and the route to that tunnel, but also remove the route from the state
map.

When we discover that no route needs to be added to the host because
it's not in the same subnet and we weren't supposed to create a tunnel,
then we also clean it up and ensure that it isn't in our state as well.
@aauren aauren marked this pull request as draft November 17, 2024 19:33
@aauren
Copy link
Collaborator Author

aauren commented Nov 17, 2024

Most of this PR is out-dated. After several attempts to patch it up, I've found that my approach of adding BGP Cache Checking into the route_sync controller directly, just isn't the best way to do what I'm trying to do here. Over time I've seen that it is frought with problems and logic gaps.

I'm leaving this PR around for now, while I re-consider my approach here, but in the end, I think that most of the problems can be resolved by adopting commit 4c9fb28.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
override-stale Don't allow automatic management of stale issues / PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant