
eBPF not issuing TCP RST on unknown connections #8882

Open
sfudeus opened this issue Jun 5, 2024 · 14 comments · May be fixed by #8933
Labels
area/bpf eBPF Dataplane issues kind/enhancement

Comments

@sfudeus

sfudeus commented Jun 5, 2024

Expected Behavior

When a node receives a packet for an unknown TCP connection on the host IP, or for a closed port, it should by default reject that connection by sending a packet with the TCP RST bit set. There may be situations where the packet should be silently dropped instead, but by default the node should behave transparently.

This behaviour should be identical between the eBPF mode and the iptables mode.

Current Behavior

When sending a TCP packet for an unknown connection to a port on the host IP of a node running Calico in eBPF mode, no reply is sent; the packet is silently dropped and the client keeps retrying until it times out. This happens for unknown connections against both closed and open ports. New connections against closed ports are properly rejected.

When sending such an unknown TCP packet to a node running Calico in iptables mode, a TCP RST packet is sent immediately and the client can terminate (and potentially re-establish) the connection right away.

Possible Solution

Even in eBPF mode, a packet with the RST bit set should be sent for any connection that is not in conntrack.
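
For illustration, here is a minimal scapy sketch of the kind of RST reply the kernel stack produces in iptables mode for such a stray segment; the function name is made up, and this is only a host-side illustration, not how the eBPF dataplane would implement it:

from scapy.all import IP, TCP, send

def rst_for(pkt):
    # Reset a stray segment: swap the endpoints and, per RFC 793, derive
    # the sequence/acknowledgement numbers from the offending segment.
    ip, tcp = pkt[IP], pkt[TCP]
    if tcp.flags.A:  # the segment carried an ACK (e.g. a stray PSH+ACK)
        reply_tcp = TCP(sport=tcp.dport, dport=tcp.sport, flags="R", seq=tcp.ack)
    else:            # no ACK: the reset acknowledges the segment instead
        reply_tcp = TCP(sport=tcp.dport, dport=tcp.sport, flags="RA",
                        seq=0, ack=tcp.seq + len(tcp.payload))
    send(IP(src=ip.dst, dst=ip.src) / reply_tcp)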

Steps to Reproduce (for bugs)

  • Send a TCP packet without a proper prior handshake to the node IP of an eBPF-enabled node, regardless of whether the port is open or not.
  • Observe that you won't get a reply.

Trivial python script to trigger this:

#!/usr/bin/env python3

from scapy.all import IP, TCP, Raw, sr1
import argparse

def main():
    parser = argparse.ArgumentParser("rst-test")
    parser.add_argument("dest_host", help="the destination host to test against")
    parser.add_argument("dest_port", help="the destination port to test against", type=int)
    args = parser.parse_args()

    # Mid-flow segment (PSH+ACK with payload) for a connection that was never established.
    packet = IP(dst=args.dest_host) / TCP(sport=13337, dport=args.dest_port, seq=1337, ack=1336, flags="PA") / Raw(load="test")

    # Send the packet and wait up to 10 seconds for a single reply.
    response = sr1(packet, timeout=10)

    if response:
        if response.haslayer(TCP) and response.getlayer(TCP).flags & 0x04:
            print("Received RST packet")
        else:
            print("Did not receive RST packet")
    else:
        print("No response received")

if __name__ == "__main__":
    main()
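
The script requires scapy and raw-socket privileges (typically root). An example invocation (the host and port values are only placeholders): sudo ./rst-test.py 192.0.2.10 6443. In iptables mode it prints "Received RST packet"; on the affected eBPF nodes it ends with "No response received".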

Context

We run the apiservers on an anycast IP in the hostNetwork, announced via BGP. When one master node goes down and withdraws its announcement, ECMP routing routes the traffic to a remaining control plane node. As a result, right after the ECMP change, that remaining node sees traffic for TCP connections it does not know.

In our case, on eBPF-enabled clusters, the kubelets' connections to the apiserver start to hang until a timeout is reached, in some cases causing nodes to be marked not-ready. We identified that this happens because the kubelets don't notice that their connection to the apiserver broke; they keep resending packets because they never see a TCP reset.

On the non-eBPF clusters, this scenario is not an issue at all: after the ECMP change, the kubelet sees a TCP reset and immediately establishes a new TCP connection.

Remarks

This might be related to #8854, or at least be affected by a fix for it. From the user's perspective, it is a different issue.

Your Environment

  • Calico version 3.27.4/3.28.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes v1.28.9
  • Operating System and version: FlatCar beta 3874.1.0 with kernel 6.6.21-flatcar
@tomastigera tomastigera added kind/enhancement area/bpf eBPF Dataplane issues labels Jun 5, 2024
@tomastigera
Contributor

Yeah, this has been a known issue for many years 😿 It hasn't been addressed yet because, with nodeports, the idea is that the node itself will respond; it seems that does not work for you. With other services advertised via BGP, that does not work, and it is related to #5039. #8854 might also be at play. Let's say we will prioritize the work higher.

@tomastigera
Contributor

We do issue an ICMP error when there is a service but that service does not have a backend.

@tomastigera
Contributor

Relates to #7983. I am not even sure iptables mode (vanilla k8s) would respond with an RST, hence I treat it as kind/enhancement.

@sfudeus
Author

sfudeus commented Jun 5, 2024

An additional factor here is that I wouldn't even expect eBPF to interfere, since we are working purely with the host IP; no pods or Calico interfaces are involved at all. It is pure host-to-host communication, with IP addresses not managed by Calico or Kubernetes. And, as mentioned, in non-eBPF mode it behaves as expected.

@tomastigera
Contributor

Well, Calico eBPF interferes on any host device that matches the data interface regexp. But my expectation would be that node IPs would just work 🤔 Do you see any conntrack entries in Linux, or dropped packets in the eBPF counters, etc.?

@sfudeus
Author

sfudeus commented Jun 7, 2024

All eBPF drop counters stay at 0, and I didn't see anything special in conntrack.
But I dug a bit into the iptables rules.

In a cluster without eBPF, as far as I understand, we pass through
INPUT->cali-INPUT->cali-from-host-endpoint->cali-fh-any-interface-at-all->cali-failsafe-in
and at least there we have a failsafe rule which causes the accept (and then ends in a reset because there is no connection).

In the eBPF case, we see directly the following in the INPUT chain:
373 49106 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:XQL0mC-L6wldZdgN */ /* Drop packets from unknown flows. */ mark match 0x5000000/0x5000000

Observing the counters, this rule triggers for my unrelated packet and causes the drop. So this at least should be the reason for the differing behaviour. Is there anything we can do about this?
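
As a side note, a quick way to double-check whether Linux conntrack has an entry for the test flow is to scan /proc/net/nf_conntrack. A minimal sketch, assuming the nf_conntrack module is loaded (so the proc file exists) and using the source port from the reproduction script above with an example destination port:

SPORT, DPORT = "13337", "6443"  # 6443 is only an example destination port

with open("/proc/net/nf_conntrack") as f:  # requires root and the nf_conntrack module
    matches = [line.strip() for line in f
               if f"sport={SPORT}" in line and f"dport={DPORT}" in line]

if matches:
    print("Linux conntrack entries for the flow:")
    print("\n".join(matches))
else:
    print("no Linux conntrack entry for the flow")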

@tomastigera
Contributor

@sfudeus great investigation. Interestingly, this rule should only be hit when (a) we see a mid-flow TCP packet for which we do not have a conntrack entry in eBPF (the connection predates enabling eBPF) and (b) Linux does not have an entry for an existing flow either - so it may be a spoofed packet and thus it should be dropped. 🤔 I will try to investigate.

@sfudeus
Author

sfudeus commented Jun 7, 2024

@tomastigera that does fit, because in my scenario we do have such a mid-flow packet: it was rerouted due to ECMP when one potential (and original) destination was removed. I understand the concern about spoofing, but I'd say there is only a risk if such a packet could break an existing connection state. In this case, it would only "reveal" that no matching connection exists, and that response would be sent to the "original" IP, not the spoofer. I am also not sure to what extent this would require the sequence numbers to match.
And, at least in my case (not sure how generic this is): if someone were able to inject an invalid packet into the hostNetwork at this level, we'd have a problem anyway. A spoofed packet from outside should have been rejected for its invalid source IP before being routed into that network.

@tomastigera
Contributor

Hmmm it is an interesting one. I agree. Let me think about how to achieve the desired behaviour.

I think we just need to pass these packets through policy anyway and not just rely on iptables and conntrack. When a conntrack entry exists, we know the connection was allowed, so we let it through. But if the conntrack entry does not exist, it is already too late to police the packet and thus we need to drop it. The question is what to do with such a packet if it would now be denied by policy but an iptables conntrack entry exists. I think it is fine to deny; however, at startup (the switch from iptables to eBPF) we could break existing connections, and that is what we do not want in the first place. And we cannot quite just send an RST from eBPF, because what if there was a conntrack entry in the kernel? With newer kernels, we could query Linux conntrack. Or we may just allow the packet and mark it with "drop it if conntrack does not exist, because denied by policy" 🤔 @fasaxc WDYT?
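
A rough sketch of the disposition being discussed for a mid-flow packet, written as plain Python pseudocode; all names and verdict strings here are invented for illustration and are not the actual BPF program's API:

def handle_midflow_tcp(flow, bpf_conntrack, policy_allows):
    """Illustrative decision flow for a mid-flow TCP packet (no SYN seen)."""
    if flow in bpf_conntrack:
        # Known flow: it was allowed when the conntrack entry was created.
        return "accept"
    if policy_allows(flow):
        # Pass to the host stack; if Linux conntrack recognises the flow
        # (e.g. the connection predates enabling eBPF), learn the entry.
        return "pass-to-host-and-learn"
    # Denied by policy and unknown to eBPF conntrack. Options discussed above:
    # drop silently (current behaviour, leaves the peer hanging), defer to
    # Linux conntrack via a mark, or reply with a TCP RST.
    return "drop-or-mark-or-rst"

# Example: a stray packet rerouted by ECMP, unknown to eBPF conntrack but
# allowed by host endpoint policy, is handed to the host stack.
print(handle_midflow_tcp(("10.0.0.1", 13337, "10.0.0.2", 6443),
                         bpf_conntrack=set(), policy_allows=lambda f: True))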

tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 20, 2024
For a smooth switch from iptables to eBPF mode, we do not want to
interrupt existing connections. If we see mid-flow packets, we pass them
to the host stack. If the stack can verify that they belong to an
existing conntrack entry, we let them through and learn the conntrack entry.

We drop the rest. However, there are some situations where we can see a
stray TCP packet in eBPF mode, for instance when a pod dies and ECMP
kicks in and sends a packet to a different host.

If such a packet gets dropped, that end of the connection remains stuck.
This change sends an RST for such a stream instead of just dropping the
packets, so that the end host can break the connection.

Fixes projectcalico#8882
@tomastigera tomastigera linked a pull request Jun 20, 2024 that will close this issue
tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 20, 2024
@tomastigera
Contributor

@sfudeus could you try this image and see if it works for you? thruby/node:3-28-tcp-reject-1

tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 24, 2024
tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 25, 2024
tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 25, 2024
@sfudeus
Author

sfudeus commented Jun 25, 2024

calico-node didn't come up ready on 3 of 8 nodes, with an internal dataplane main loop timeout:

calico-node 2024-06-25 13:47:17.147 [INFO][53] felix/health.go 336: Overall health status changed: live=false ready=false
calico-node +---------------------------+---------+----------------+-----------------+--------+
calico-node |         COMPONENT         | TIMEOUT |    LIVENESS    |    READINESS    | DETAIL |
calico-node +---------------------------+---------+----------------+-----------------+--------+
calico-node | BPFEndpointManager        | -       | -              | reporting ready |        |
calico-node | CalculationGraph          | 30s     | reporting live | reporting ready |        |
calico-node | FelixStartup              | -       | reporting live | reporting ready |        |
calico-node | InternalDataplaneMainLoop | 1m30s   | timed out      | timed out       |        |
calico-node +---------------------------+---------+----------------+-----------------+--------+
calico-node 2024-06-25 13:47:17.149 [WARNING][53] felix/health.go 286: Reporter is not live: timed out. name="InternalDataplaneMainLoop"
calico-node 2024-06-25 13:47:17.149 [INFO][53] felix/health.go 294: Reporter is not ready: timed out. name="InternalDataplaneMainLoop"
calico-node 2024-06-25 13:47:27.122 [WARNING][53] felix/health.go 286: Reporter is not live: timed out. name="InternalDataplaneMainLoop"
calico-node 2024-06-25 13:47:27.122 [INFO][53] felix/health.go 294: Reporter is not ready: timed out. name="InternalDataplaneMainLoop"
calico-node 2024-06-25 13:47:27.128 [WARNING][53] felix/health.go 286: Reporter is not live: timed out. name="InternalDataplaneMainLoop"

With loglevel set to info, nothing special visible in the logs AFAICS.

Initially, it reports live but not ready. Later, it reports timeouts for both liveness and readiness (as in the excerpt above). I could not identify a pattern for which hosts come up and which do not.

But regarding the original issue: all 3 masters came up fine with the new calico-node, and I properly received my TCP RST when sending stray packets. So it seems this part might be solved.

@fasaxc
Member

fasaxc commented Jun 26, 2024

@sfudeus what's the load like on your hosts? Do they have a lot of pods, is CPU maxed out? (That timeout suggests either a concurrency bug and Felix has locked up, or your nodes have a very high workload and Felix is struggling.)

@sfudeus
Author

sfudeus commented Jun 26, 2024

@fasaxc nodes were only minimally loaded. This was a bare-metal sandbox cluster without real load. Theoretically there could have been CPU throttling at the pod level because of the CPU limits involved, but I don't remember seeing anything specific there. I can retry and report back if there is something.

edit: it seems the same 3 nodes as yesterday don't become healthy - ah, and we don't even have a CPU limit on calico-node anymore. I'm attaching a log excerpt of the InternalDataplaneMainLoop logs from health.go
InternalDataplaneMainLoop-sample.log

tomastigera added a commit to tomastigera/project-calico-calico that referenced this issue Jun 28, 2024
@tomastigera
Contributor

This is a newer image, thruby/node:3-28-tcp-reject-2, that came up well for me. From the log excerpt it is hard to say anything beyond that there is indeed a loop timeout.
