-
Notifications
You must be signed in to change notification settings - Fork 338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kubevirt: Force neighbors cache update after live migrtation #4416
base: master
Are you sure you want to change the base?
kubevirt: Force neighbors cache update after live migrtation #4416
Conversation
docs, docs, please fix the docs issue here as well if possible @qinqon ? |
You mean https://issues.redhat.com/browse/SDN-4904, right ? |
I meant #4398 :D ah they are the same! |
207ac71
to
fa77351
Compare
Rebased |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushing my review comments for the first commit only. Will move to the other commits right away.
docs/design/live-migration.md
Outdated
┌──────────────────────┐ │ │ ┌──────────────────────┐ | ||
│ lsp ns-virt-launcher1│──┘ └───│ lsp pod1 │ | ||
│ address: │ │ address: │ | ||
│ 0a:58:0a:f4:00:01 │ │ 0a:58:0a:f1:00:02 │ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MAC address for lsp pod1 shouldn't have changed, I suppose? Here I see that the f4
bytes became f1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, will fix it.
docs/design/live-migration.md
Outdated
```text | ||
┌────────────────────────┐ ┌────────────────────────┐ | ||
│logical switch node2 │ │logic switch node1 │ | ||
│ (10.244.1.0/24) │ │ (10.244.1.0/24) │ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here the subnet for logical switch node1 should be the same as before: 10.244.0.0/24
docs/design/live-migration.md
Outdated
└─────────────────────────────────┘ | ||
``` | ||
|
||
Now the neighbors tables are incorrect since non of those mac address are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non
-> none
, address
-> addresses
docs/design/live-migration.md
Outdated
Now the neighbors tables are incorrect since non of those mac address are | ||
are part of the logical switches, after some time or traffic neighbors cache | ||
get invalidated and updated so they will point to the arp_proxy mac `0a:58:0a:f3:00:00`, problem | ||
is that it may take too much times and connections will be broken. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
times
-> time
docs/design/live-migration.md
Outdated
is that it may take too much times and connections will be broken. | ||
|
||
To improve the situation ovn-k sends the following GARPs and unsolicited NA taking | ||
into account tha the arp proxy mac is `0a:58:0a:f3:00:00`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tha
-> that
docs/design/live-migration.md
Outdated
|
||
At node2 advertise to VM that pod's from the same subnet should go over proxy_arp: | ||
- `GARP(10.244.0.9 -> 0a:58:0a:f3:00:00, vm mac(0a:58:0a:f4:00:01))` | ||
- one per pod at same subnet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for clarity, here I would say something like `one per pod on the same subnet (in this example, only for pod1).
fa77351
to
4f58658
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly asking for more comments. I haven't looked at the last two commits yet.
test/e2e/kubevirt.go
Outdated
WithPolling(polling). | ||
WithTimeout(timeout). | ||
Should(Succeed(), func() string { return stage + ": " + podName + ": " + output }) | ||
output, err := kubevirt.RunCommand(vmi, fmt.Sprintf("curl http://%s", net.JoinHostPort(podIP, "8000")), 5*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you briefly explain why there's no need for the call to Eventually
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So Eventually
was enmascarading the real issue, since that it's the time it takes for arp neighbors cache to be reconciled, removing it showed the issue and force for it to be solved.
resource: virtualMachine, | ||
test: liveMigrate, | ||
topology: "localnet", | ||
eastWestTrafficTimeout: 15 * time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're increasing the timeout from 5 to 15 seconds because we want to allow enough time for the GARPs to be sent on both nodes (old, new) and for ovn-k to remove the VM mac address from the old node LSP, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue we fix here is for default pod network, this tests are for multi-homing live migration we have lower expectations right now, we have some jira to improve live migration at multi-homing, that's why we put this value here to be able to re-use the east/west traffic function but make it pass at multi-homing live migration.
go-controller/pkg/kubevirt/pod.go
Outdated
@@ -146,6 +148,31 @@ func AllVMPodsAreCompleted(client *factory.WatchFactory, pod *corev1.Pod) (bool, | |||
return true, nil | |||
} | |||
|
|||
// isLiveMigrationInProgress return true if a live migration is still progressing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add to the comment how live migration is detected? From what I understand (to continue your comment line) "that is when there's more than one pod associated to the same VM".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is old code.
go-controller/pkg/kubevirt/pod.go
Outdated
@@ -199,6 +226,29 @@ func nodeContainsPodSubnet(watchFactory *factory.WatchFactory, nodeName string, | |||
return false, nil | |||
} | |||
|
|||
// findNodeOwningSubnet will return the node owning the subnet | |||
// contains the subnets from the argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the sentence in the comment is broken :)
How about findNodeOwningSubnet returns the node owning the subnet found in the pod annotation passed as input
|
||
// notifyARPProxyMACForIP will send an GARP to force arp caches to clean up | ||
// at pods and reference to the arp proxy gw mac | ||
func notifyARPProxyMACForIPs(ips []*net.IPNet, dstMAC net.HardwareAddr) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is one of the core points from this PR. I know you described the global picture in the markdown document, but could you also describe here what's happening in the relevant functions and in the commit message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, so the commits implementing the design doc are from the PoC, if implementatino is "sound" I will refactor them with proper commit messages and structure.
4f58658
to
4413548
Compare
This change add the design to advertise neighbors after kubevirt's VM live migration to improve connectivity Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Enrique Llorente <ellorent@redhat.com>
Signed-off-by: Enrique Llorente <ellorent@redhat.com>
4413548
to
c0d4d35
Compare
The approach looks good to me. The pod informer used in the logic is a shared informer, so it shouldn't add overhead to the existing code. |
This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or reach out to maintainers for code reviews or consider closing this if you do not plan to work on it. |
What this PR does and why is it needed
Improve connectivity after kubevirt live migration by forcing updating neighbors cache at VM and pods, it remove flakiness at difference scenarios. The details are at the first commit on the design doc.
Which issue(s) this PR fixes
Fixes #4237
Fixes #3986
Fixes https://issues.redhat.com/browse/SDN-4860
Special notes for reviewers
The main change to review is the first commit that contains the design doc changes, this is where review has to go. The rest of the PR is the PoC's code to illustrate the design doc, after design doc is ok the rest of the PR will be reworked.
PoC TODO:
Details to documentation updates
live migration design docs updated
Description for the changelog
kubevirt: Force neighbors cache update after live migrtation
Does this PR introduce a user-facing change?
NONE
For more information on release notes see: TBD
-->