Felix should configure iptables rules such that VXLAN UDP Flows are not tracked in conntrack, when in VXLAN mode #8934

Open
adkafka opened this issue Jun 21, 2024 · 6 comments

@adkafka

adkafka commented Jun 21, 2024

We have a workload that manages many (10,000s) TCP connections per node. Traffic is sent between nodes in a Kubernetes cluster. We use Calico in a VXLAN configuration as our CNI. Felix manages the iptables entries as expected. Additionally, we have kube-proxy running on our cluster in a standard configuration (using iptables not ipvs).

The issue we are noticing is that our Conntrack tables are unexpectedly full. Some of the entries in the Conntrack table are expected (the TCP connections responsible for our application traffic), but to my surprise, almost half of the entries are UDP "connections" belonging to the VXLAN tunnels between nodes. All of these connections are in the "UNREPLIED" state.

Here are some commands to illustrate this:

$ cat /proc/sys/net/nf_conntrack_max
131072
$ cat /proc/net/nf_conntrack | wc -l
129605
$ cat /proc/net/nf_conntrack | grep "udp" | wc -l
59262
$ cat /proc/net/nf_conntrack | grep "udp" | grep "UNREPLIED" | grep "dport=4789" | wc -l
59258
$ cat /proc/net/nf_conntrack | grep "tcp" | wc -l
70343
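
The counts above can also be collected in a single pass. This is a minimal sketch, assuming the `/proc/net/nf_conntrack` line format that the commands above rely on. The function name `vxlan_conntrack_summary` is hypothetical, and 4789 is assumed to be the VXLAN port in use:

```shell
#!/bin/sh
# Summarise how much of a conntrack dump is VXLAN UDP traffic.
# Usage: vxlan_conntrack_summary [FILE] [PORT]
# FILE defaults to /proc/net/nf_conntrack, PORT to 4789.
vxlan_conntrack_summary() {
    file="${1:-/proc/net/nf_conntrack}"
    port="${2:-4789}"
    awk -v port="$port" '
        { total++ }
        # Count a line as VXLAN if it is UDP and contains dport=PORT.
        $0 ~ /(^| )udp / && $0 ~ ("dport=" port "( |$)") {
            vxlan++
            if ($0 ~ /\[UNREPLIED\]/) unreplied++
        }
        END {
            pct = total ? 100 * vxlan / total : 0
            printf "total=%d vxlan=%d unreplied=%d vxlan_pct=%.1f\n",
                   total, vxlan, unreplied, pct
        }
    ' "$file"
}

# Example (reading the live table usually needs root):
#   vxlan_conntrack_summary
#   vxlan_conntrack_summary ~/Desktop/nf_conntrack.txt
```

Like the `grep` pipelines, this matches on the text of each entry, so it is only as reliable as the dump format.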

Of the 129,605 flows tracked in Conntrack, 59,258 (~46%) are UDP flows corresponding to VXLAN. This significantly limits how many connections each node in our cluster can handle. Luckily, when a node's Conntrack table does fill up, after dropping a couple of packets it enters "early_drop" mode and removes many UNREPLIED connections from the table (which, in our case, are the UDP VXLAN flows). This prevents significant application impact, but it makes monitoring our Conntrack usage much more difficult.
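
One way to see how often this eviction happens is to watch the `early_drop` counter that `conntrack -S` reports per CPU. This is a rough sketch, assuming the `key=value` output format of conntrack-tools. The helper name `sum_early_drop` is hypothetical:

```shell
#!/bin/sh
# Sum the per-CPU early_drop counters from `conntrack -S` output on stdin.
sum_early_drop() {
    awk '
        {
            # Each line holds key=value pairs for one CPU.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^early_drop=[0-9]+$/) {
                    split($i, kv, "=")
                    sum += kv[2]
                }
        }
        END { print sum + 0 }
    '
}

# Example (needs root and conntrack-tools):
#   conntrack -S | sum_early_drop
```

Sampling this value periodically and alerting on increases may be a more stable signal than the raw table size while the VXLAN entries are still being tracked.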

After some discussion in the Calico Slack (#networking https://calicousers.slack.com/archives/CPEPF833L/p1718663125404499), we decided to experiment with adding iptables rules so that these VXLAN UDP flows are not tracked in Conntrack. We found that this had the desired effect and caused no impact to our application traffic. Therefore, we are proposing that Calico automatically add these rules when in VXLAN mode. It may be worth putting this behind a configuration flag that defaults to "off" to avoid accidentally breaking any workloads.

The iptables rules I added to each of these nodes so that VXLAN UDP traffic is not tracked were:

$ iptables --table raw --append OUTPUT --protocol udp --dport 4789 --jump NOTRACK
$ iptables --table raw --append PREROUTING --protocol udp --dport 4789 --jump NOTRACK
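
Because `--append` adds another copy of the rule every time it runs, anything that applies these rules repeatedly may want to check for them first. This is a minimal sketch using `iptables --check`. The function name `ensure_vxlan_notrack` and the `IPTABLES` override variable are hypothetical, and this is not a description of how Felix programs its rules:

```shell
#!/bin/sh
# Add the NOTRACK rule to a raw-table chain only if it is not already present.
# Usage: ensure_vxlan_notrack CHAIN [PORT]
# IPTABLES may name a different binary to run instead of iptables.
ensure_vxlan_notrack() {
    chain="$1"
    port="${2:-4789}"
    ipt="${IPTABLES:-iptables}"
    # --check exits non-zero when no matching rule exists in the chain.
    if ! "$ipt" --table raw --check "$chain" \
            --protocol udp --dport "$port" --jump NOTRACK 2>/dev/null; then
        "$ipt" --table raw --append "$chain" \
            --protocol udp --dport "$port" --jump NOTRACK
    fi
}

# Example (needs root):
#   ensure_vxlan_notrack OUTPUT
#   ensure_vxlan_notrack PREROUTING
```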

This was based off a tool I found online that did something very similar (https://review.opendev.org/c/openstack/tripleo-heat-templates/+/831444/1/deployment/neutron/neutron-ovs-agent-container-puppet.yaml).

After we apply these rules on our nodes, we see 0 entries in Conntrack matching the UDP port:

$ conntrack -L | grep "udp" | grep "UNREPLIED" | grep "dport=4789" | wc -l
0

My understanding is that tracking these UDP flows in Conntrack has no advantage. These flows remain in the UNREPLIED state because the traffic only flows one way. Therefore, stateful connection tracking has no positive impact.

Expected Behavior

Conntrack table does not fill up with VXLAN UDP flows.

Current Behavior

Conntrack table contains a non-trivial number of VXLAN UDP flows, causing the table to fill prematurely.

Possible Solution

Configure Felix to add NOTRACK rules to the raw table in iptables when used in VXLAN mode. Controlling this with a configuration parameter seems ideal, in case there are some unique workloads where this change does have an impact (though I can't think of one).

Your Environment

  • Calico version: v3.28.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
  • Operating System and version: Amazon Linux 2 (5.10.217-205.860.amzn2.x86_64)
@cyclinder
Contributor

cyclinder commented Jun 24, 2024

I noticed this issue before, but I also found that it doesn't affect communication. Thanks for the report, @adkafka! I'd like to help solve this. This feature doesn't need a flag, right?

@adkafka
Author

adkafka commented Jun 24, 2024

this feature doesn't need a flag, right?

I cannot think of a use case where we'd want these flows tracked in conntrack. So, from that perspective, no feature flag should be needed. With that said, there could be some creative use cases out there that I'm not familiar with. Adding a feature flag (even if we default the flag such that it doesn't create these flows in conntrack) could help support these use cases. Perhaps we should wait until we know of a concrete use case to add the feature flag though? I hope we can get some guidance from the maintainers about this.

@cyclinder
Contributor

Yes, I agree that we don't need a feature flag; this should be the default behavior. We also need an ack from the maintainers.

/cc @caseydavenport @tomastigera @fasaxc

@fasaxc
Member

fasaxc commented Jun 25, 2024

I think this is likely to be a good idea but we need to check for interactions with Calico host endpoint policy, where Calico is policing traffic on the host's own interface. That feature already has an auto-allow for VXLAN but we'd need to check that that all worked correctly with NOTRACK (and there was no performance regression, for example). Were there any established flows for VXLAN, or does it always go into this state?

@adkafka
Author

adkafka commented Jun 25, 2024

Were there any established flows for VXLAN, or does it always go into this state?

There are 0 established (ASSURED) VXLAN UDP flows in the conntrack tables I looked at. All 59,258 VXLAN UDP flows are UNREPLIED in the instance I examined:

$ cat ~/Desktop/nf_conntrack.txt | grep "udp" | grep "dport=4789" | wc -l
   59258
$ cat ~/Desktop/nf_conntrack.txt | grep "udp" | grep "dport=4789" | grep "UNREPLIED" | wc -l
   59258

This matches my understanding of VXLAN. It is a one-way tunnel between hosts. If the host responds, it will be over a different UDP flow because that other host will use the destination port of 4789 to respond.
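
To illustrate why the return traffic never matches, compare the reply tuple conntrack would expect with the packet the peer actually sends. This is a sketch with made-up addresses and ports. It assumes the peer picks its own outer source port (VXLAN encapsulation typically derives it from a hash of the inner flow) and always sends to port 4789. The function names are hypothetical:

```shell
#!/bin/sh
# Print the reply tuple conntrack would expect for an original UDP tuple.
# Arguments: src dst sport dport
expected_reply() {
    echo "src=$2 dst=$1 sport=$4 dport=$3"
}

# Compare a returning packet against the expected reply of an original tuple.
# $1..$4: original tuple (src dst sport dport)
# $5..$8: tuple of the packet coming back (src dst sport dport)
classify_return() {
    if [ "$(expected_reply "$1" "$2" "$3" "$4")" = "src=$5 dst=$6 sport=$7 dport=$8" ]; then
        echo "match"
    else
        echo "new flow"
    fi
}

# Node A (10.0.0.1) sends VXLAN to node B (10.0.0.2) from source port 41000.
# B's return packet uses its own source port and destination port 4789,
# so it does not match the expected reply (sport=4789 dport=41000).
classify_return 10.0.0.1 10.0.0.2 41000 4789   10.0.0.2 10.0.0.1 52000 4789
```

Under these assumptions the return packet creates a second entry of its own, and both entries stay UNREPLIED.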

@fasaxc
Member

fasaxc commented Jun 25, 2024

Yeah, that makes sense; we're likely getting no benefit from conntrack then.
