Memory leaks in pmon daemons related to Redis set activity #17025
Comments
It looks like
|
I issued a
|
Back up to 14.2% for psud this morning.
|
Since both issues are dealing with thermal issues on this platform, it's possible there may be a link to the changes @arunlk-dell mentioned here: #16666 |
I didn't catch the memory consumption in time over the weekend, and the switch crashed again this morning after running out of memory. This is a critical issue, and I'd be surprised if I'm the only one seeing it. |
I have identified that certain file reads are not properly closed. That could also lead to increased memory usage. |
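The unclosed file reads mentioned above are a common pattern in code that polls sensor files. As a minimal sketch (the helper name `read_sysfs_value` is hypothetical and is not taken from the Dell platform code), a `with` block guarantees the descriptor is released on every poll, even when the read fails:

```python
import os
import tempfile


def read_sysfs_value(path, default=None):
    """Read one value from a sysfs-style file, always closing the descriptor.

    Hypothetical helper for illustration; the real platform code may differ.
    """
    try:
        # The with-block closes the file even if read() raises.
        with open(path, "r") as handle:
            return handle.read().strip()
    except OSError:
        # Missing or unreadable sensor files fall back to the default.
        return default


if __name__ == "__main__":
    # A temporary file stands in for a sensor node here.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("42000\n")
    try:
        print(read_sysfs_value(tmp.name))
        print(read_sysfs_value(tmp.name + ".missing", default="N/A"))
    finally:
        os.unlink(tmp.name)
```

Leaked descriptors usually show up as a growing file descriptor count rather than a large resident memory increase, so this kind of fix may only account for part of the growth reported in this thread.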
@arunlk-dell , please share the PR for the potential fix. Thanks! |
You're not the only one; I'm seeing the exact same thing. I periodically restart pmon as a workaround, but I'm hoping for a fix. |
While it gets sorted out, I added a daily cron job to restart |
Built from
@arunlk-dell do you have a PR for the fix yet? I'd be happy to build and test it. |
An update on the impact of this: I think it extends beyond the Dell units. It's very visible there because resources are tight (4GB of RAM). But I had a build of SONiC from
...and I just (accidentally, actually - I thought I was on the SSH session to my Dell unit) checked the memory usage on my Micas unit and I see that
That unit is much more powerful than my n3248te, so the crashes aren't as frequent (or noticeable since it's in my lab). @gechiang does this need to be assigned to someone with broader responsibility than just the Dell platforms? Since I see the problem is likely outside of the Dell platform code, I'll also look at |
Can you share how much memory you have on your device, and gather the output of the following command as soon as the device boots up?
top -H
Then wait for a few hours (preferably until right before your cron job kicks in to restart pmon) and gather the same command output again. Also, can you try running a 202305-based or 202205-based image and see whether you also see this issue? Thanks! |
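For anyone collecting the before-and-after numbers requested above, here is a rough sketch of a snapshot script that reads resident memory straight from `/proc` on Linux. The function names and the list of daemon keywords are assumptions for illustration; they are not part of SONiC.

```python
import os


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:     12345 kB"
            return int(line.split()[1])
    return None


def snapshot(keywords=("psud", "pcied", "thermalctld", "xcvrd")):
    """Map "pid cmdline" to resident memory (KiB) for matching processes.

    The keyword list is an assumption; adjust it to the daemons of interest.
    """
    result = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                raw = f.read().replace(b"\0", b" ")
            cmdline = raw.decode(errors="replace").strip()
            if not any(k in cmdline for k in keywords):
                continue
            with open(f"/proc/{pid}/status") as f:
                rss = parse_vmrss_kib(f.read())
        except OSError:
            continue  # process exited between listing and reading
        if rss is not None:
            result[f"{pid} {cmdline}"] = rss
    return result


if __name__ == "__main__":
    # Largest consumers first.
    for proc, rss in sorted(snapshot().items(), key=lambda kv: -kv[1]):
        print(f"{rss:>10} KiB  {proc}")
```

Running it once after boot and again a few hours later, then comparing the two outputs, gives the same kind of comparison as the two `top -H` captures.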
@gechiang sure, I'll work on that. I've been working on tracking this down today, and have found that if I comment out these lines that interact to https://github.com/sonic-net/sonic-platform-daemons/blob/1c9b01d12393f25db7b258f58d10ea77c3f5b682/sonic-psud/scripts/psud#L610 |
I disabled the cron restart job and just restarted the Here is the
Here's the
Here are the other outputs you asked for. In that first one, I think
|
Here's the updated information, 3 hours later.
|
My best guess at this point is that it's a leak somewhere in the But that doesn't explain why I'm only seeing it in |
I see this issue too on Edgecore/Accton-AS7326-56X and Ufispace/S8901-54XC-980B boxes running current master. On my side the leak seems to be present in psud, pcied, thermalctld and xcvrd. See below the top memory consumption on an AS7326-56X with 8 days of uptime:
9651 root 20 0 5456652 5.2g 15636 S 0.0 33.4 48:02.97 python3
pid info:
monit status
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
For comparison, an AS7326-56X box running 202211 with 82 days of uptime:
top - 08:21:39 up 82 days, 6:08, 1 user, load average: 0.61, 0.95, 0.97
6870 root 20 0 2498512 635616 140268 S 11.6 3.9 17091:02 syncd
(none of the 4 pmon processes are in top)
monit status
docker stats --no-stream |
Thanks for that note. That seems to support my thought that this may be a leak in the |
I compiled an image with this PR: The memory leak is reduced but is still present. See below the stats for the same AS7326-56X box with 2 days of uptime:
top - 09:06:54 up 2 days, 18:13, 1 user, load average: 0.50, 0.73, 0.80
monit status
docker stats --no-stream |
@justindthomas The title says there are a lot of set operations on the Redis DB. Can you try running the unit test mentioned here https://github.com/sonic-net/sonic-swss-common/pull/343/files and see whether it shows a similar issue? |
Is there any news? |
I believe I have the same problem with the Celestica DX010, which I reported in #18871, but this might extend down to older builds, since I am on 202311. After a reboot this morning:
PMON log before the reboot (or maybe during it):
|
@prgeor I unassigned myself, as the issue is common to all platforms. The changes I have are specific to the Dell platform. |
I am not brave enough to go to master for production. I had a lot of issues with 202305. When making a lot of configuration changes, I would suddenly start to have issues where swss would just crash and not work unless I completely reloaded the ISO. That is, even if I started from a clean config_db.json or reloaded a known-good config_db.json, swss would just crash constantly. |
I raised a PR to fix this issue: sonic-net/sonic-swss-common#876. Please check it. |
Thank you! When will this be passed on to the 202311 build? |
Once this PR is merged, I will raise another PR for 202311 branch ASAP. I estimate it will be done next week. |
I compiled an image with the latest PR, sonic-net/sonic-swss-common#876, and the issue is fixed:
Platform: x86_64-accton_as7326_56x-r0
uptime
The pmon memory usage remains stable at ~184 MiB. |
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity
Description
The issue reports a memory leak on the Redis set operations.
Reason
Didn't decrease the reference count after PySequence_GetItem.
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string.
Fix
Refer PR: Fix swig template memory leak #859 from @praveenraja1.
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string.
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.
Signed-off-by: Ze Gan <zegan@microsoft.com>
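The unit test described in the PR text checks that memory does not grow dramatically after many set operations. The actual test is not reproduced in this thread, so the following is only a sketch of the idea: `fake_set` is a hypothetical stand-in for a set call through the SWIG binding, and the helper names are made up for illustration. Process-level RSS is measured because memory allocated by C++ code inside the binding (for example a `std::string` that is never deleted) does not go through Python's allocator and would not be visible to `tracemalloc`.

```python
import resource


def peak_rss_kib():
    """Peak resident set size of this process (reported in KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth_kib(operation, iterations=100_000, warmup=1_000):
    """Run operation(i) repeatedly and return the growth in peak RSS."""
    # Warm up first so one-time allocations are not counted as a leak.
    for i in range(warmup):
        operation(i)
    before = peak_rss_kib()
    for i in range(iterations):
        operation(i)
    return peak_rss_kib() - before


if __name__ == "__main__":
    store = {}

    def fake_set(i):
        # Hypothetical stand-in for a Redis table set; overwrites one key,
        # so a leak-free implementation should show almost no growth.
        store["PSU_INFO|PSU 1"] = [("temp", str(i)), ("status", "true")]

    growth = rss_growth_kib(fake_set)
    print(f"peak RSS growth: {growth} KiB")
```

A real test would replace `fake_set` with the binding call under test and assert that the growth stays below a chosen threshold; the threshold itself is a judgment call and depends on the platform.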
This issue has been fixed, and the related PR has been merged into sonic-buildimage. |
@Pterosaur can you tell me how to track this against Azure DevOps? I still can't figure out how to find out which PRs have been merged in Azure DevOps. |
Hi @mbze430, I just checked the sonic-swss-common submodule of sonic-buildimage, and both the master and 202311 branches include my fix PR: If you download an image from Azure DevOps (a.k.a. Azure Pipelines), just check the commit version via |
as of ver: SONiC Software Version: SONiC.202311.564467-43bcca75f |
Description
In the last couple of weeks, I've noticed that my switch is spontaneously rebooting every few days. I started watching the logs more closely as a result and am seeing that after being up for a day or so, it starts complaining about memory thresholds being exceeded. I start seeing errors like these repeated continuously:
I update to master each Monday, so I'm not sure if this is a very recent change, but I didn't notice this behavior before the last couple of weeks (i.e., I did see it on the software I deployed on Monday, 10/16 and this last Monday, 10/23). It may have been happening before that, but I was also in the switch making changes more frequently then than I have been in the last couple of weeks.
I also noticed this morning that my system status LED was blinking yellow (it had been solid green, with everything reporting OK). This is the show system-health summary right now:
Steps to reproduce the issue:
Describe the results you received:
It reboots after a couple of days with memory consumption errors in the logs.
Describe the results you expected:
It runs continuously without interruption.
Output of show version:
Output of show techsupport:
output.txt
Additional information you deem important (e.g. issue happens only occasionally):
File is too large. I uploaded it to Dropbox: https://www.dropbox.com/scl/fi/9ps71obxnxzca8wusz522/sonic_dump_sonic_20231027_011924.tar.gz?rlkey=cg5m1a4tom2an8g0x9iykw1et&dl=0