eBPF not issuing TCP RST on unknown connections #8882
Comments
Yeah, this is a known issue for many years 😿 It hasn't been addressed yet because, with nodeports, the idea is that the node itself will respond - but that does not seem to work for you. With other services advertised via BGP it does not work either, and it is related to #5039. #8854 might also be at play. Let's say we will prioritize the work higher.
We do issue an ICMP error when there is a service but that service does not have a backend.
Relates to #7983. I am not even sure iptables mode (vanilla k8s) would respond with RST - hence I treat it as kind/Enhancement.
An additional factor here is that I wouldn't even expect eBPF to interfere here, since we are plainly working with host IPs - no pods or calico interfaces are involved at all. It is pure host-to-host communication, with IP addresses not managed via calico or K8s. And, as mentioned, in non-eBPF mode it behaves as expected.
Well, calico eBPF interferes on any host device that is included in the dataiface regexp. But my expectation would be that node IPs would just work 🤔 Do you see any conntrack entries in Linux, or dropped packets in the eBPF counters, etc.?
All eBPF drop counters stay at 0, and I didn't see anything special in conntrack. In a cluster without eBPF, as far as I understand, we pass [...]. In the eBPF case, we see the following directly in the INPUT chain: [...]. Observing the counter, this rule triggers for my unrelated packet and causes the drop. So this at least should be the reason for the differing behaviour. Anything we can do about this?
@sfudeus great investigation. Interestingly, this rule should only be hit when (a) we see a mid-flow TCP packet for which we do not have a conntrack entry in eBPF (the connection predates enabling eBPF) and (b) Linux does not have an entry for an existing flow either - so it may be some spoofed packet and thus it should be dropped. 🤔 I will try to investigate.
@tomastigera that does fit, because in my scenario we do have such a mid-flow packet: it was rerouted due to ECMP when one potential (and original) destination was removed. I understand the concern about spoofing, but I'd say there is only a risk if such a packet could break an existing connection's state. In this case, it would only be about "revealing" that no matching connection exists. And that response would be sent to the "original" IP, not the spoofer. And I'm not sure to what extent this would require the sequence numbers to match sufficiently.
Hmmm, it is an interesting one. I agree. Let me think about how to achieve the desired behaviour. I think we just need to pass these packets through policy anyway and not just rely on iptables and conntrack. Because when conntrack exists, we know that the packet was allowed, so we let it through. But if conntrack does not exist, it is already too late to police the packet and thus we need to drop it. The question is what to do with such a packet if it would now be denied by policy but an iptables conntrack entry exists. I think it is fine to deny; however, at startup (the switch from iptables to ebpf) we could break existing connections, and that is what we do not want in the first place. And we cannot just send an RST from ebpf, because what if there was a conntrack entry in the kernel! With newer kernels, we could query Linux conntrack. Or we may just allow the packet and mark it with "drop it if conntrack does not exist, because it is denied by policy" 🤔 @fasaxc WDYT?
For a smooth switch from iptables to eBPF mode, we do not want to interrupt existing connections. If we see mid-flow packets, we pass them to the host stack. If the stack can verify that they belong to an existing conntrack entry, we let them through and we learn the conntrack. We drop the rest. However, there are some situations when we can see a stray TCP packet during eBPF mode, for instance when a pod dies and ECMP kicks in and sends a packet to a different host. If such a packet gets dropped, the sending end of the connection remains stuck. This change sends an RST for such a stream instead of just dropping the packets, so that the end host can break the connection. Fixes projectcalico#8882
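The handling described in the commit message can be summarised as a small decision model. The sketch below is purely illustrative - the real logic lives in Calico's BPF programs, and the function and names here are hypothetical:

```python
# Illustrative model of the mid-flow TCP handling described above.
# This is NOT Calico's actual dataplane code; names are hypothetical.

def midflow_verdict(bpf_conntrack_hit: bool, linux_conntrack_hit: bool) -> str:
    """Decide what to do with a mid-flow TCP packet (no SYN seen by eBPF)."""
    if bpf_conntrack_hit:
        # Known flow in the BPF conntrack map: normal fast path.
        return "allow"
    if linux_conntrack_hit:
        # Connection predates eBPF mode; the host stack vouches for it,
        # so let it through and learn a BPF conntrack entry.
        return "allow and learn conntrack"
    # Stray mid-flow packet, e.g. rerouted by ECMP after a peer went away.
    # Previously this was silently dropped; with the fix an RST is sent
    # so the sender can tear the connection down immediately.
    return "drop and send TCP RST"


if __name__ == "__main__":
    print(midflow_verdict(False, False))  # -> "drop and send TCP RST"
```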
@sfudeus could you try this image and see if it works for you?
calico-node didn't come up ready on 3 of 8 nodes, with an internal dataplane main loop timeout:
With loglevel set to info, nothing special is visible in the logs AFAICS. Initially, it reports live but not ready. Later, it reports timeouts for both liveness and readiness (as in the excerpt above). I could not identify a pattern as to which hosts would come up and which would not. But regarding the original issue: all 3 masters came up fine with the new calico-node, and I properly received my TCP RST when sending out-of-line packets. So it seems this might be solved.
@sfudeus what's the load like on your hosts? Do they have a lot of pods, is CPU maxed out? (That timeout suggests either a concurrency bug and Felix has locked up, or your nodes have a very high workload and Felix is struggling.)
@fasaxc nodes were only minimally loaded. This was on a bare-metal sandbox cluster without real load. Theoretically there could have been CPU throttling at the pod level because of the CPU limits involved, but I don't remember seeing anything specific there. I can retry and report back if there is something. Edit: seems like the same 3 nodes as yesterday don't become healthy - ah, and we don't even have a CPU limit anymore on calico-node. I'm attaching a log excerpt of the InternalDataplaneMainLoop logs from health.go
This is a newer image
As discussed in Slack, it came up fine in a cluster after a reboot, and now even in the other cluster where I originally had seen issues. Strange, since these clusters get rebooted roughly once a week anyway.
Expected Behavior
When a node receives a packet for an unknown TCP connection on the host IP, or to a closed port, it should by default reject that connection by issuing a packet with the TCP RST bit set. There may be situations where it should be silently dropped, but by default it should behave transparently.
This behaviour should be identical between the eBPF mode and the iptables mode.
Current Behavior
When sending a TCP packet to a port on the host IP of a node running calico in eBPF mode, for unknown connections there is no reply: the packet is silently dropped and the client will keep retrying until timeout. This happens for unknown connections against both closed ports and open ports. New connections against closed ports are properly rejected.
When sending such an unknown TCP packet to a node running calico in iptables mode, there is an immediate TCP RST packet and the client can terminate immediately (and potentially reestablish a connection).
Possible Solution
Even in eBPF mode, a packet with the RST bit set should be sent for any connection not in conntrack.
Steps to Reproduce (for bugs)
Trivial python script to trigger this:
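The original script is not included in the thread; below is a minimal sketch of one way such a reproduction could look, using scapy with a hypothetical target IP and port (requires root, since it crafts raw packets):

```python
#!/usr/bin/env python3
# Hypothetical reproduction sketch (the reporter's original script is not
# shown in the thread). Sends a mid-flow TCP ACK that belongs to no existing
# connection and reports whether a RST comes back.
from scapy.all import IP, TCP, sr1

TARGET_IP = "192.0.2.10"  # placeholder: host IP of an eBPF-enabled node
TARGET_PORT = 6443        # placeholder: any port, e.g. the apiserver port

# A bare ACK with arbitrary sequence numbers: no handshake preceded it,
# so neither endpoint has any state for this "connection".
probe = IP(dst=TARGET_IP) / TCP(sport=54321, dport=TARGET_PORT,
                                flags="A", seq=1000, ack=1000)

reply = sr1(probe, timeout=5, verbose=False)
if reply is None:
    print("no reply - packet silently dropped (observed eBPF mode behaviour)")
elif reply.haslayer(TCP) and reply[TCP].flags & 0x04:  # RST bit set
    print("got TCP RST - expected behaviour (as in iptables mode)")
else:
    print("unexpected reply:", reply.summary())
```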
Context
We are running the apiservers on an anycast IP in the hostNetwork, announced via BGP. When one master node goes down and withdraws its announcement, ECMP routing will route the traffic to a remaining control plane node. With that, such a remaining node will see traffic for unknown TCP connections, right after the ECMP change.
In our case, on eBPF-enabled clusters, kubelets' connections to the apiserver start to hang until a timeout is reached, partially causing nodes to be ejected as not ready anymore. We identified that this happens because the kubelets don't notice that their connection to the apiserver broke; they keep on resending packets because they never see a TCP reset.
On the non-eBPF clusters, this scenario is not an issue at all. After the ECMP change, kubelet will see a TCP reset, and establish a new TCP connection immediately.
Remarks
This might be related to #8854, or at least affected by a fix for it. From the user perspective, it is a different thing.
Your Environment