
Bird CPU usage is almost always 100% #95

Closed
mithilarun opened this issue Jun 29, 2021 · 32 comments

@mithilarun

mithilarun commented Jun 29, 2021

This is likely #77 all over again, but we're seeing the bird process run at 100% CPU almost all of the time.

Expected Behavior

Bird should not consume an entire CPU core just to run.

Current Behavior

[Screenshot: Bird CPU usage]

Possible Solution

We were able to lower the CPU usage by hand-editing /etc/calico/confd/config/bird.cfg in the calico-node container and setting the following values:

protocol kernel {
  ....
  scan time 10;       # Scan kernel routing table every 10 seconds
}
....
protocol device {
  ...
  scan time 10;    # Scan interfaces every 10 seconds
}

These values are not configurable through confd, so I had to hand-edit the file.
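
One way to get the running bird to pick up a hand edit like this without restarting the pod is to reload the configuration via BIRD's CLI. A minimal sketch only (the pod name is hypothetical, and this assumes birdcl/birdcl6 are present in the calico-node image, as a later comment in this thread shows; note also that confd may regenerate bird.cfg, so a hand edit is only a temporary mitigation):

# Pod name is hypothetical; adjust to the affected calico-node pod.
POD=calico-node-xxxxx
kubectl exec -n kube-system "$POD" -- birdcl configure    # reload bird.cfg
kubectl exec -n kube-system "$POD" -- birdcl6 configure   # reload bird6.cfg, if IPv6 is enabled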

Steps to Reproduce (for bugs)

Context

Most calico-node pods in our K8s environment are not completely up:

# kubectl get pods -A  | grep calico-node | tail
kube-system              calico-node-tjfxz                                        0/1     Running            4          6d13h
kube-system              calico-node-tmq8d                                        0/1     Running            1          8d
kube-system              calico-node-ttq2v                                        0/1     Running            1          12d
kube-system              calico-node-txdgs                                        0/1     Running            3          7d
kube-system              calico-node-txs88                                        1/1     Running            1          4m39s
kube-system              calico-node-v4npm                                        0/1     Running            2          14d
kube-system              calico-node-v56lq                                        0/1     Running            2          7d7h
kube-system              calico-node-v7nfv                                        0/1     Running            33         14d
kube-system              calico-node-vggbt                                        0/1     Running            2          6d15h
kube-system              calico-node-zpvz2                                        1/1     Running            5          6d13h

Your Environment

  • bird version: BIRD version v0.3.3+birdv1.6.8
  • Operating System and version: Ubuntu 20.04.2 LTS
  • Link to your project (optional):
# ip addr | wc -l
50794
# ip route | wc -l
380

We are using kube-proxy in IPVS mode because iptables mode is inefficient at this scale.
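
For reference, most of those ~50k addresses typically live on the kube-ipvs0 dummy interface when kube-proxy runs in IPVS mode (one address per Service VIP). A rough check, run directly on a node (a sketch, not taken from the report above):

# Assumes kube-proxy in IPVS mode, where each Service VIP is added to kube-ipvs0.
ip -4 addr show dev kube-ipvs0 | grep -c 'inet '
ip -6 addr show dev kube-ipvs0 | grep -c 'inet6'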

@mithilarun
Author

mithilarun commented Jun 29, 2021

Verified that we have the fix mentioned in projectcalico/confd#314. That is not helping.

sh-4.4# grep 'interface' /etc/calico/confd/config/bird.cfg
# Watch interface up/down events.
  scan time 2;    # Scan interfaces every 2 seconds
  interface -"cali*", -"kube-ipvs*", "*"; # Exclude cali* and kube-ipvs* but
                                          # kube-ipvs0 interface. We exclude
                                          # kube-ipvs0 because this interface

@shivendra-ntnx

Do we have any workaround here?

@mnaser

mnaser commented Nov 9, 2021

Something that helped us a bit is the patch above, but we're still seeing this heavily. When running tcpdump, I see a whole load of AF_NETLINK traffic.

@dbfancier

dbfancier commented May 11, 2022

We ran into the same problem in our production environment.

CPU usage of bird is usually around 30%, but occasionally spikes to 100% and stays there for a while.

We did a CPU hot-spot analysis using perf and found that the CPU time was concentrated in the functions if_find_by_name (about 86%) and if_find_by_index (about 11%).

So I sent SIGUSR1 to bird to get a dump. It shows that iface_list has 30,000-40,000 nodes. The index field of most nodes is 0, their flags include LINK-DOWN and SHUTDOWN, and their MTU is 0.

These devices no longer exist on the host, but remain in iface_list. Our scenario is offline training, so many pods are created and deleted every day.

For now, I rebuild the list with the crude workaround of killing bird so that it restarts.

I wonder if kif_scan() has a problem with the iface_list maintenance mechanism. We hope the community will help identify and fix the problem.

Thanks a lot.
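
For anyone wanting to repeat this kind of check without parsing a SIGUSR1 dump, BIRD's CLI can also list the interfaces it is tracking. A rough sketch (the pod name is hypothetical, and this assumes birdcl is available in the calico-node container, as a later comment in this thread shows):

# Pod name is hypothetical; adjust to the node you are investigating.
POD=calico-node-xxxxx
kubectl exec -n kube-system "$POD" -- birdcl show interfaces summary | wc -l
# A count that keeps growing far beyond the number of interfaces actually
# present on the host would match the stale iface_list entries described above.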

@mgleung
Contributor

mgleung commented Sep 20, 2022

@mithilarun @shivendra-ntnx @mnaser any other details about your cluster setup you can share? I'm trying to see if this can be fixed by addressing #102 or if we are looking at a separate issue.

@mithilarun
Author

@mgleung We had to tear the cluster down, but it looked quite similar to what @dbfancier reported here: #95 (comment)

@caseydavenport
Member

This PR was merged to master recently: #104

It looks like it has potential to fix this issue. We'll soak it and release it in v3.25 and hopefully we can close this then.

@ialidzhikov

@caseydavenport, is there a chance to backport the fix to 3.24 and 3.23? When can we expect 3.25 to be released?

@caseydavenport
Member

Here are cherry-picks for v3.23 and v3.24:

v3.25 should be available by the end of the year.

@dilyevsky
Contributor

dilyevsky commented Nov 15, 2022

We were observing high Bird CPU usage and failing liveness probes on clusters with a large number of Services running with kube-proxy in IPVS mode. What happens is that the kube-ipvs0 interface accumulates a large number of addresses, which get picked up by the kif_scan code (device protocol) -> ifa_update:

[Screenshot: Screen Shot 2022-11-14 at 4.46.26 PM]

I have a patch (6680cc9) that ignores address updates for DOWN interfaces in the kif_scan loop and seems to improve this corner case. I can open a PR for it unless someone has a better idea of how to tackle this.

@caseydavenport
Member

@dilyevsky what version are you running? I thought in modern versions we exclude that interface from the BIRD direct protocol:

https://github.com/projectcalico/calico/blob/master/confd/etc/calico/confd/templates/bird.cfg.template#L116

@dilyevsky
Contributor

@caseydavenport v3.19, but it looks to be in the latest too. You're right - it's excluded in the direct protocol, but device is still picking up all the interfaces, and there doesn't seem to be an interface option there - bird complains of a "syntax error" when you try to add one. My thinking was that if an interface is in a managed DOWN state, there's no point in ingesting its addresses in device, so let me know if that makes sense to you.

@ialidzhikov

@caseydavenport thank you very much for the cherry-picks! Do you also plan to cut new patch releases for 3.23 and 3.24? Thank you in advance!

@caseydavenport
Member

My thinking was that if an interface is in a managed DOWN state, there's no point in ingesting its addresses in device, so let me know if that makes sense to you.

This makes sense to me - I'd want to think about it a bit more to make sure there's no reason we'd want that for other cases. Maybe @neiljerram or @song-jiang would know.

Do you also plan to cut new patch releases for 3.23 and 3.24? Thank you in advance!

@mgleung is running releases at the moment, so he can chime in. I know things are a bit slow right now around the holidays so I doubt there will be another release this year, if I had to guess.

@nelljerram
Member

In the docs for BIRD 2, there is an interface option in the protocol device section, which suggests that it wouldn't be fundamentally problematic to skip some interfaces in the scan associated with the device protocol. In order to be more confident, it might help to track down when that was added to the BIRD 2 code, to see if there were other changes needed with that. In advance of that, the idea of skipping DOWN interfaces does feel like it should be safe.

For another approach, I tried reading our BIRD (1.6 based) code to understand the interface scanning more deeply, but it mutates global state and is not easy to follow - would need to schedule more time to follow that approach properly.

@mgleung
Contributor

mgleung commented Jan 5, 2023

@ialidzhikov We currently don't have any patch releases for v3.24 and v3.23 planned, since we are focusing on getting v3.25 out. Sorry we're a little behind on the releases at the moment.

@ialidzhikov

@mgleung, thanks for sharing. The last patch releases for Calico were at the beginning of November 2022. It feels odd that the fixes are merged but we cannot consume them from upstream. I hope that cutting the patch releases will be prioritised after the v3.25 release. Thank you in advance!

@mgleung
Contributor

mgleung commented Jan 9, 2023

@ialidzhikov, thanks for the feedback. I can't make any promises about an exact timeline, but if these are sought after fixes, then that makes a compelling argument to cut the patch releases sooner rather than later.

@ialidzhikov

@mgleung, we now see that 3.25 is released. Can you give an ETA for the patch releases? Thanks in advance!

@mgleung
Contributor

mgleung commented Jan 27, 2023

@ialidzhikov if all goes well, I'm hoping to have it cut in the next couple of weeks.

@mithilarun
Author

@mgleung I see cherry-picks done for 3.23 and 3.24, but there isn't a release that we can consume yet. Do you have an ETA on when those might be available?

@florianbeer

florianbeer commented Aug 28, 2023

Just chiming in: we very likely have the same problem on a few of our clusters. All of them have a high number of pods being created and destroyed via Kubernetes Jobs.

Setting scan time for bird and bird6 does lower the CPU usage a bit, and the readiness probe of the affected calico-node pods goes green again.

Versions:

# calico-node -v
v3.25.0
# bird --version
BIRD version v0.3.3+birdv1.6.8

Settings:

# sed -i 's/scan time 2\;/scan time 10\;/g' /etc/calico/confd/config/bird{,6}.cfg

# birdcl configure
BIRD v0.3.3+birdv1.6.8 ready.
Reading configuration from /etc/calico/confd/config/bird.cfg
Reconfigured

# birdcl6 configure
BIRD v0.3.3+birdv1.6.8 ready.
Reading configuration from /etc/calico/confd/config/bird6.cfg
Reconfigured

@axel7born
Contributor

The issue doesn't seem to be completely resolved by #104. When creating and deleting a large number of pods in a cluster, we've noticed that the number of interfaces visible with dump interfaces gradually increases over time.

This issue can be easily reproduced by creating a Kubernetes job with a large number of completions. However, it does take some time, and only a fraction of the created pods results in a permanent increase in the number of internal interfaces.
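
For concreteness, a reproduction along these lines might look like the sketch below (all names and sizes are illustrative, not taken from the report above); once the Job has churned through its pods, the interface count bird reports on the affected nodes can be compared with the interfaces actually present on the host.

# Illustrative only: a Job that churns many short-lived pods.
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: iface-churn
spec:
  completions: 2000
  parallelism: 20
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: sleep
        image: busybox
        command: ["sleep", "1"]
EOF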

Would it be sensible to remove all interfaces with the IF_SHUTDOWN flag by iterating over the interfaces?

Another suggestion could be to make the watchdog timeout in the bird config configurable via Calico, or set it to a reasonable default value (perhaps 10 seconds). This way, problematic processes would automatically be restarted.

@nueavv

nueavv commented Sep 25, 2024

I sent a kill -SIGUSR1 signal to the BIRD process but can't find where the logs are being written. Could anyone share where I should be looking, especially in a container environment? It seems like I'm facing a similar issue to the one others here have experienced.

@nelljerram
Member

@nueavv What is your BIRD issue?

@nueavv

nueavv commented Sep 25, 2024

It is in Kubernetes. I am using Calico!

@nelljerram
Member

I'm sorry, you're not giving enough detail for me to understand. As far as we know, this issue was fixed by #104 and #111 - and so should be closed now. (Unfortunately it seems we missed closing this last year.)

@nelljerram
Member

Resolved by #104 and #111

@nueavv

nueavv commented Sep 25, 2024

@nelljerram I just want to make sure that the problem I’m experiencing is because of this. Could you help confirm?

@nelljerram
Member

@nueavv I'm afraid I have no idea because I don't think you've described your problem yet. It will probably be clearest for you to do that in a new issue at https://github.com/projectcalico/bird/issues/new/choose

@nueavv

nueavv commented Sep 25, 2024

@nelljerram Thanks for your response. What I'm trying to do is gather information about the iface_list. I have cronjob-based pods that start and stop every minute, and I want to confirm if they are the cause. I sent a kill -SIGUSR1 signal to the BIRD process on the Calico node, but I couldn't find any logs generated anywhere. Could you guide me on where I might be able to find this information?

@nueavv

nueavv commented Sep 25, 2024

I found the logs! Thank you @nelljerram
