rke2 server eats up all the memory #6370
Comments
PS: K3s has the same problem.
Please upgrade to v1.28.11+rke2r1.
@brandond BTW, I still don't see an option to upgrade RKE2 to 1.28.11 from Rancher. Do you have any info on when it will be available? For a few weeks now I have had to go and manually clear my bucket.
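If the target release does not show up in the Rancher UI yet, one possible workaround is to set the version directly on the provisioning Cluster object. This is only a sketch: the cluster name is hypothetical, and running an RKE2 release newer than what your Rancher patch version lists is not necessarily supported.

```yaml
# Hypothetical example: edit the provisioning Cluster object, e.g.
#   kubectl edit clusters.provisioning.cattle.io -n fleet-default my-cluster
# and set the desired RKE2 release explicitly.
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: my-cluster          # hypothetical cluster name
  namespace: fleet-default  # namespace Rancher typically uses for provisioned clusters
spec:
  kubernetesVersion: v1.28.11+rke2r1  # the release mentioned above
```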
@serhiynovos, yes, I am using a local MinIO to store a copy of the snapshots. Update: S3 snapshots were on, but they are disabled right now. Only local storage.
@harridu Please check your bucket. There should be a lot of snapshots. You can clean them up manually and see if that resolves the issue.
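For reference, a minimal sketch of how the snapshots could be listed and cleaned up manually. The snapshot name, MinIO alias, and bucket are hypothetical, and the exact subcommands should be checked against the installed rke2 version.

```bash
# On a server node: list the snapshots rke2 knows about (local and S3)
rke2 etcd-snapshot list

# Delete a specific snapshot by name (hypothetical name shown)
rke2 etcd-snapshot delete etcd-snapshot-node1-1720000000

# Or inspect the bucket directly with the MinIO client, since a local MinIO
# is used here (alias "myminio" and bucket "rke2-snapshots" are hypothetical)
mc ls myminio/rke2-snapshots
mc rm --recursive --force myminio/rke2-snapshots/
```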
Deleted. How come the S3 storage is still in use, even though it is disabled in the GUI?
After removing all backups in S3, rke2 memory usage seems to stay low. I still have no idea why it backs up to S3 at all. If 1.28.11 provides fixes, then please make it available in Rancher 2.8.5.
I'm not sure what you mean. Why does it back up to S3 when you configure S3 for backups?
@brandond I think @harridu means that he disabled S3 backups in the Rancher UI, but RKE2 still uploads them to S3 storage. BTW, I finally got the 1.28.11 version on Rancher. The issue with S3 is resolved.
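One way to check whether S3 uploads are actually turned off on a node, independent of what the UI shows, is to look at the rke2 config that Rancher writes on the server nodes. A sketch only: the drop-in file name and the values shown are assumptions and may differ per Rancher version.

```yaml
# /etc/rancher/rke2/config.yaml.d/50-rancher.yaml (path assumed; check your node)
# If S3 snapshots are really disabled, the etcd-s3* keys should be absent or false.
etcd-snapshot-schedule-cron: "0 */12 * * *"     # example schedule
etcd-snapshot-retention: 5
etcd-s3: true                                   # if this is still true, uploads continue
etcd-s3-endpoint: minio.example.internal:9000   # hypothetical endpoint
etcd-s3-bucket: rke2-snapshots                  # hypothetical bucket
```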
For us that are on …, any suggestions or input on this?
Upgrade to 2.9.0, or deal with running newer RKE2 releases that are not technically supported by the patch release of Rancher that you're on. |
Thanks! As a side note: we just noticed that when we check RKE2 versions in Rancher (the UI), we can select version …
It is, yes. |
Closing as resolved in releases that have a fix for the s3 snapshot prune issue. |
@brandond The memory issue is partially solved. Retention is working fine, it deletes old snapshots from both S3 and local storage, but memory keeps increasing while etcd backup to S3 is enabled. We noticed that on the etcd leader node, RAM usage constantly increases by the size of the db on a daily basis. It seems that rke2 caches the upload to S3 and never releases it. After snapshot backups to S3 are disabled, memory immediately drops: the node was at 90% memory usage and dropped to 15%. rke2 v1.30.4+rke2r1 with a 3-node etcd cluster.
What do you mean by "caches it and never releases it"? What led you to this conclusion?
What do you mean by "increases by the size of the db"? Can you provide figures demonstrating this?
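For anyone wanting to gather the requested figures, a minimal sketch that logs the RSS of the rke2 server process once an hour; the log path and interval are arbitrary choices, not anything rke2 provides itself.

```bash
#!/bin/sh
# Append a timestamped RSS sample (in KiB) for the rke2 server process every hour.
while true; do
  pid=$(pgrep -f 'rke2 server' | head -n1)
  [ -n "$pid" ] && echo "$(date -Is) rss_kib=$(ps -o rss= -p "$pid")" >> /var/log/rke2-rss.log
  sleep 3600
done
```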
@boris-stojnev Did you try the latest stable 1.30.6 version? I had a similar issue in previous versions but don't experience it anymore after upgrading.
@serhiynovos No, I didn’t try it. I can’t upgrade at this point, but I’m not seeing anything related to etcd snapshots in the release notes. :-/
I would recommend upgrading to v1.30.6 or newer. As per the upstream release notes:
You are likely seeing this memory leak in core Kubernetes. |
I'm going to follow up after I enable it again, to buy some time until the next upgrade cycle. On a side note, you should be consistent in defining snapshot retention. It says per node, for example 30 per node, so my expectation is to have 30 locally on each node and 90 in S3 (for a 3-node etcd cluster), but there are 30 in total on S3, which means I have only the last 10 per node on S3.
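To illustrate the retention expectation described above, a small sketch of the per-node setting and the counts it implies; the values simply mirror the example in the comment, and the observed behaviour is exactly what is being questioned.

```yaml
# Per-node retention, set on each of the 3 etcd nodes
etcd-snapshot-retention: 30
# Expected: 30 local snapshots per node, so 3 x 30 = 90 uploaded to S3 in total
# Observed: only 30 snapshots in total on S3, i.e. roughly the last 10 per node
```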
Not really related at all to the issue under discussion here, see instead: |
Environmental Info:
RKE2 Version:
Node(s) CPU architecture, OS, and Version:
Debian 12 running inside KVM, 4 cores, 32 GByte memory, no swap
Cluster Configuration:
3 controller nodes, 32 GByte RAM and 4 cores each, KVM
6 "real" worker nodes, 512 GByte RAM and 64 cores each
All Debian 12, RKE2 1.28.10, managed in Rancher 2.8.5
Describe the bug:
On the control plane nodes, rke2 uses up quite a big chunk of memory. On the first control plane node I get:
That is 20 GByte RSS. On the other control plane nodes it is "just" 3 GByte, which is still way too much for 3 days of uptime. Memory usage increases over time, until the first control plane node runs into OOM.
The worker nodes seem fine.
Steps To Reproduce:
Set up a cluster using Rancher 2.8.5 and RKE2 1.28.10 and watch memory usage grow. If I use RKE2 on the command line to set up a cluster, there is no such problem.