Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leaks in pmon daemons related to Redis set activity #17025

Closed
justindthomas opened this issue Oct 26, 2023 · 36 comments
Closed

Memory leaks in pmon daemons related to Redis set activity #17025

justindthomas opened this issue Oct 26, 2023 · 36 comments
Assignees
Labels
Issue for 202311 MSFT P0 Priority of the issue Triaged this issue has been triaged

Comments

@justindthomas
Copy link
Contributor

justindthomas commented Oct 26, 2023

Description

In the last couple of weeks, I've noticed that my switch is spontaneously rebooting every few days. I started watching the logs more closely as a result and am seeing that after being up for a day or so, it starts complaining about memory thresholds being exceeded. I start seeing errors like these repeated continuously:

Oct 27 01:14:16.901857 sonic ERR monit[498]: 'sonic' mem usage of 93.1% matches resource limit [mem usage>90.0%]
Oct 27 01:14:16.903620 sonic NOTICE rsyslog_plugin: :- publish: EVENT_PUBLISHED: {"sonic-events-host:memory-usage":{"limit":"90.0","timestamp":"2023-10-27T01:14:16.902305Z","usage":"93.1"}}
Oct 27 01:14:17.391484 sonic ERR memory_threshold_check: Free memory 238372 is less then free memory threshold 396047

I update to master each Monday, so I'm not sure if this is a very recent change, but I didn't notice this behavior before the last couple of weeks (i.e., I did see it on the software I deployed on Monday, 10/16 and this last Monday, 10/23). It may have been happening before that, but I was also in the switch making changes more frequently then than I have been in the last couple of weeks.

I also noticed this morning that my system status LED was blinking yellow (it had been solid green, with everything reporting OK). This is the show system-health summary right now:

jdt@sonic:~$ sudo show system-health summary
System status summary

  System status LED  blink_yellow
  Services:
    Status: Not OK
    Not Running: sonic, memory_check
  Hardware:
    Status: OK

Steps to reproduce the issue:

  1. Turn the switch on and leave it running.

Describe the results you received:

It runs continuously without interruption.

Describe the results you expected:

It reboots after a couple of days with memory consumption errors in the logs.

Output of show version:

jdt@sonic:~$ sudo show version

SONiC Software Version: SONiC.master.391826-73dd38a5c
SONiC OS Version: 11
Distribution: Debian 11.7
Kernel: 5.10.0-23-2-amd64
Build commit: 73dd38a5c
Build date: Sat Oct 21 12:25:57 UTC 2023
Built by: AzDevOps@vmss-soni0029ZK

Platform: x86_64-dellemc_n3248te_c3338-r0
HwSKU: DellEMC-N3248TE
ASIC: broadcom
ASIC Count: 1
Serial Number: 4GNXV43
Model Number: 0WNWT9
Hardware Revision:
Uptime: 01:19:01 up 1 day, 16:06,  1 user,  load average: 1.33, 1.37, 1.19
Date: Fri 27 Oct 2023 01:19:01

Docker images:
REPOSITORY                    TAG                       IMAGE ID       SIZE
docker-gbsyncd-broncos        latest                    6e303fa4697c   349MB
docker-gbsyncd-broncos        master.391826-73dd38a5c   6e303fa4697c   349MB
docker-gbsyncd-credo          latest                    0bd955d467d2   320MB
docker-gbsyncd-credo          master.391826-73dd38a5c   0bd955d467d2   320MB
docker-syncd-brcm             latest                    0e34565ef63f   674MB
docker-syncd-brcm             master.391826-73dd38a5c   0e34565ef63f   674MB
docker-macsec                 latest                    ac9336c9e36d   326MB
docker-dhcp-relay             latest                    7b42d0ccbd2e   307MB
docker-orchagent              latest                    7c8d9d883d66   336MB
docker-orchagent              master.391826-73dd38a5c   7c8d9d883d66   336MB
docker-eventd                 latest                    0a9f8e958401   299MB
docker-eventd                 master.391826-73dd38a5c   0a9f8e958401   299MB
docker-fpm-frr                latest                    aa4b85b82143   356MB
docker-fpm-frr                master.391826-73dd38a5c   aa4b85b82143   356MB
docker-nat                    latest                    dd760f56804b   327MB
docker-nat                    master.391826-73dd38a5c   dd760f56804b   327MB
docker-sflow                  latest                    37e7c15da8a4   326MB
docker-sflow                  master.391826-73dd38a5c   37e7c15da8a4   326MB
docker-teamd                  latest                    afce1ffc456e   324MB
docker-teamd                  master.391826-73dd38a5c   afce1ffc456e   324MB
docker-snmp                   latest                    312cb3cfc41b   338MB
docker-snmp                   master.391826-73dd38a5c   312cb3cfc41b   338MB
docker-sonic-telemetry        latest                    733d471ec6ed   387MB
docker-sonic-telemetry        master.391826-73dd38a5c   733d471ec6ed   387MB
docker-platform-monitor       latest                    56df3e17f6cd   419MB
docker-platform-monitor       master.391826-73dd38a5c   56df3e17f6cd   419MB
docker-router-advertiser      latest                    22064ca2727c   299MB
docker-router-advertiser      master.391826-73dd38a5c   22064ca2727c   299MB
docker-lldp                   latest                    ca3aad56211e   341MB
docker-lldp                   master.391826-73dd38a5c   ca3aad56211e   341MB
docker-mux                    latest                    932ba9db8aa2   348MB
docker-mux                    master.391826-73dd38a5c   932ba9db8aa2   348MB
docker-database               latest                    b9165208f67f   299MB
docker-database               master.391826-73dd38a5c   b9165208f67f   299MB
docker-sonic-mgmt-framework   latest                    686c7ce701ea   416MB
docker-sonic-mgmt-framework   master.391826-73dd38a5c   686c7ce701ea   416MB

Output of show techsupport:

output.txt

Additional information you deem important (e.g. issue happens only occasionally):

File is too large. I uploaded it to Dropbox: https://www.dropbox.com/scl/fi/9ps71obxnxzca8wusz522/sonic_dump_sonic_20231027_011924.tar.gz?rlkey=cg5m1a4tom2an8g0x9iykw1et&dl=0

@justindthomas
Copy link
Contributor Author

It looks like /usr/local/bin/psud may be taking more than its fair share of memory and seems to be growing over time. Maybe thermalctld as well, but at a smaller scale.

jdt@sonic:~$ ps aux | sort -rnk 4 | head -5
root      125996  6.1 22.2 909852 879440 pts/0   S    Oct25 134:30 python3 /usr/local/bin/psud
root      123126 15.8 11.7 2017780 465432 pts/0  Sl   Oct25 347:47 /usr/bin/syncd --diag -u -s -p /etc/sai.d/sai.profile -b /tmp/break_before_make_objects
root      126101  0.8  4.8 221812 191320 pts/0   S    Oct25  18:41 python3 /usr/local/bin/thermalctld
300       123501  0.0  2.3 1201932 93944 pts/0   Sl   Oct25   0:04 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp --asic-offload=notify_on_offload
root      125995  1.3  1.9 317460 78228 pts/0    Sl   Oct25  28:43 python3 /usr/local/bin/xcvrd

jdt@sonic:~$ ps aux | head -1; ps aux | sort -rnk 4 | head -5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      125996  6.1 22.2 910108 879704 pts/0   S    Oct25 134:33 python3 /usr/local/bin/psud
root      123126 15.8 11.7 2017780 465432 pts/0  Sl   Oct25 347:54 /usr/bin/syncd --diag -u -s -p /etc/sai.d/sai.profile -b /tmp/break_before_make_objects
root      126101  0.8  4.8 221812 191320 pts/0   S    Oct25  18:41 python3 /usr/local/bin/thermalctld
300       123501  0.0  2.3 1201932 93676 pts/0   Sl   Oct25   0:04 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp --asic-offload=notify_on_offload
root      125995  1.3  1.9 317460 78228 pts/0    Sl   Oct25  28:44 python3 /usr/local/bin/xcvrd

jdt@sonic:~$ ps aux | head -1; ps aux | sort -rnk 4 | head -5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      125996  6.1 22.2 913080 882464 pts/0   R    Oct25 135:01 python3 /usr/local/bin/psud
root      123126 15.8 11.7 2017780 465432 pts/0  Sl   Oct25 349:09 /usr/bin/syncd --diag -u -s -p /etc/sai.d/sai.profile -b /tmp/break_before_make_objects
root      126101  0.8  4.8 222460 191640 pts/0   S    Oct25  18:45 python3 /usr/local/bin/thermalctld
300       123501  0.0  2.3 1201932 93692 pts/0   Sl   Oct25   0:04 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp --asic-offload=notify_on_offload
root      125995  1.3  1.9 317460 78752 pts/0    Sl   Oct25  28:50 python3 /usr/local/bin/xcvrd

jdt@sonic:~$ ps aux | head -1; ps aux | sort -rnk 4 | head -5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      125996  6.1 22.3 915016 884400 pts/0   S    Oct25 135:18 python3 /usr/local/bin/psud
root      123126 15.8 11.7 2017780 465432 pts/0  Sl   Oct25 349:52 /usr/bin/syncd --diag -u -s -p /etc/sai.d/sai.profile -b /tmp/break_before_make_objects
root      126101  0.8  4.8 222848 192204 pts/0   S    Oct25  18:47 python3 /usr/local/bin/thermalctld
300       123501  0.0  2.3 1201932 93288 pts/0   Sl   Oct25   0:04 /usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp --asic-offload=notify_on_offload
root      125995  1.3  1.9 317716 79152 pts/0    Sl   Oct25  28:53 python3 /usr/local/bin/xcvrd

@justindthomas justindthomas changed the title Apparent memory leak on N3248TE-ON Apparent memory leak on N3248TE-ON (possibly psud and thermalctld in pmon) Oct 26, 2023
@justindthomas
Copy link
Contributor Author

justindthomas commented Oct 26, 2023

I issued a docker restart pmon and that reset the memory usage and brought the system-health back to OK. Memory is climbing in those components almost immediately, however:

jdt@sonic:~$ ps aux | head -1; ps aux | sort -rnk 4 | egrep 'psud|thermalctld'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     1171748  0.3  0.8  52808 34804 pts/0    S    02:18   0:00 python3 /usr/local/bin/thermalctld
root     1171746  6.2  0.8  52932 35240 pts/0    S    02:18   0:10 python3 /usr/local/bin/psud
root     1171796  0.9  0.6  53064 26072 pts/0    S    02:18   0:01 python3 /usr/local/bin/thermalctld
jdt      1173525  0.0  0.0   6868   648 pts/0    S+   02:20   0:00 grep -E psud|thermalctld

@justindthomas
Copy link
Contributor Author

Back up to 14.2% for psud this morning.

jdt@sonic:~$ ps aux | head -1; ps aux | sort -rnk 4 | egrep 'psud|thermalctld'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     1171746  6.1 14.2 580880 563228 pts/0   S    Oct27  82:59 python3 /usr/local/bin/psud
root     1171796  0.8  3.2 157168 130244 pts/0   S    Oct27  11:30 python3 /usr/local/bin/thermalctld
root     1171748  0.0  0.8  52808 34720 pts/0    S    Oct27   0:00 python3 /usr/local/bin/thermalctld
jdt      1794409  0.0  0.0   6868   712 pts/0    S+   00:50   0:00 grep -E psud|thermalctld

@justindthomas
Copy link
Contributor Author

Since both issues are dealing with thermal issues on this platform, it's possible there may be a link to the changes @arunlk-dell mentioned here: #16666

@justindthomas
Copy link
Contributor Author

Didn't catch the memory consumption in time over the weekend and the switch crashed again this morning after running out of memory. This is a critical issue - I'm surprised if I'm the only one seeing it.

@arunlk-dell
Copy link
Contributor

Have identified certain file read are not properly closed. It could also lead to increase in mem usage.
Have setup a device with latest community sonic and monitoring it. Will check on it.

@gechiang
Copy link
Collaborator

gechiang commented Nov 8, 2023

@arunlk-dell , please share the PR for the potential fix. Thanks!

@gechiang gechiang added Triaged this issue has been triaged DELL labels Nov 8, 2023
@pliviu
Copy link

pliviu commented Nov 26, 2023

Didn't catch the memory consumption in time over the weekend and the switch crashed again this morning after running out of memory. This is a critical issue - I'm surprised if I'm the only one seeing it.

You're not the only one, i'm seeing the exact same thing. I periodically restart pmon as an workaround but i'm hoping for a fix

@justindthomas
Copy link
Contributor Author

You're not the only one, i'm seeing the exact same thing. I periodically restart pmon as an workaround but i'm hoping for a fix

While it gets sorted out, I added a daily cron job to restart pmon. That has helped to keep the switch from rebooting unexpectedly. Those reboots cause problems with some iscsi database mounts I'm using.

@justindthomas
Copy link
Contributor Author

Built from master and this is still an issue. To hack around it, I've added /etc/cron.daily/restart-pmon with a mode of 755:

#!/bin/bash

docker restart pmon

@arunlk-dell do you have a PR for the fix yet? I'd be happy to build and test it.

@pliviu
Copy link

pliviu commented Dec 4, 2023

Have identified certain file read are not properly closed. It could also lead to increase in mem usage. Have setup a device with latest community sonic and monitoring it. Will check on it.

The memory leak is there for sure, you can see how my cpu graph looks like with pmon process restarted every night at 02:00 A.M. If you can share some info about the files not properly closed maybe we can help fixing this issue.

pmon-leak

@justindthomas
Copy link
Contributor Author

justindthomas commented Dec 8, 2023

An update on the impact of this: I think it extends beyond the Dell units. It's very visible there because resources are tight (4GB of RAM). But I had a build of SONiC from master compiled for a Micas unit I'm testing:

admin@sonic:~$ show ver

SONiC Software Version: SONiC.HEAD.0-dirty-20231204.222613
SONiC OS Version: 11
Distribution: Debian 11.8
Kernel: 5.10.0-23-2-amd64
Build commit: 389306a3b
Build date: Tue Dec  5 07:34:56 UTC 2023
Built by: micasrd@micasrd-PowerEdge-R730

Platform: x86_64-micas_m2-w6510-48v8c-r0
HwSKU: M2-W6510-48V8C
ASIC: broadcom
ASIC Count: 1
Serial Number: G1SL72A00002B
Model Number: 01018904
Hardware Revision: 100
Uptime: 18:30:15 up 1 day, 20:09,  1 user,  load average: 0.90, 0.84, 0.92
Date: Fri 08 Dec 2023 18:30:15

...and I just (accidentally, actually - I thought I was on the SSH session to my Dell unit) checked the memory usage on my Micas unit and I see that psud is gobbling memory like crazy there too:

admin@sonic:~$ ps aux | head -1; ps aux | sort -rnk 4 | head -5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        6033  6.8 13.8 1126480 1108928 pts/0 R    Dec06 182:00 python3 /usr/local/bin/psud
root        6036  0.1  9.4 774444 757504 pts/0   S    Dec06   4:24 /usr/bin/python3 /usr/local/bin/pcied
root        3621 10.0  8.8 2831004 710740 pts/0  Sl   Dec06 267:16 /usr/bin/syncd --diag -u -s -p /etc/sai.d/sai.profile -b /tmp/break_before_make_objects
root        6043  2.8  4.8 418072 391732 pts/0   S    Dec06  75:01 python3 /usr/local/bin/thermalctld
root        6032  8.5  1.5 346692 127140 pts/0   Sl   Dec06 225:19 python3 /usr/local/bin/xcvrd

That unit is much more powerful than my n3248te, so the crashes aren't as frequent (or noticeable since it's in my lab).

@gechiang does this need to be assigned to someone with broader responsibility than just the Dell platforms? Since I see the problem is likely outside of the Dell platform code, I'll also look at psud to see if it's something I can create a pull request for.

@gechiang
Copy link
Collaborator

gechiang commented Dec 9, 2023

can you share how much memory you have on your device and gather the output of the following commands as soon as the device boots up

top -H
docker stats --no-stream
sudo monit status

Then wait for few hours (prefer right before your cron job kicks in to restart pmon) to gather the same comand output

Also, canyou try running a 202305 based image or 202205 based image and see if you also see this issue?

Thanks!

@justindthomas
Copy link
Contributor Author

@justindthomas
Copy link
Contributor Author

I disabled the cron restart job and just restarted the pmon container. I'll add another comment with the updated stats in a couple of hours. This device has 4GB of RAM.

Here is the top -H at start from the container:

top - 19:10:04 up 2 days, 0 min,  0 users,  load average: 2.95, 3.03, 2.84
Threads:  20 total,   1 running,  19 sleeping,   0 stopped,   0 zombie
%Cpu(s): 48.4 us,  6.5 sy,  0.0 ni, 45.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3863.8 total,    294.1 free,   2560.2 used,   1009.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    962.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    102 root      20   0    7028   3228   2740 R   6.7   0.1   0:00.01 top
      1 root      20   0   29732  25476   9448 S   0.0   0.6   0:01.33 supervisord
     31 root      20   0  124036  26592  14996 S   0.0   0.7   0:00.26 python3
     32 root      20   0  124036  26592  14996 S   0.0   0.7   0:00.00 ZMQbg/Reaper
     33 root      20   0  124036  26592  14996 S   0.0   0.7   0:00.00 ZMQbg/IO/0
     34 root      20   0  222220   4188   3684 S   0.0   0.1   0:00.02 rsyslogd
     35 root      20   0  222220   4188   3684 S   0.0   0.1   0:00.01 in:imuxsock
     36 root      20   0  222220   4188   3684 S   0.0   0.1   0:00.01 in:imfile
     37 root      20   0  222220   4188   3684 S   0.0   0.1   0:00.01 rs:main Q:Reg
     40 root      20   0  271104  35436  16364 S   0.0   0.9   0:00.62 python3
     86 root      20   0  271104  35436  16364 S   0.0   0.9   0:00.09 python3
     87 root      20   0  271104  35436  16364 S   0.0   0.9   0:00.18 python3
     88 root      20   0  271104  35436  16364 S   0.0   0.9   0:00.93 python3
     42 root      20   0   50492  35264  15864 S   0.0   0.9   0:04.49 python3
     44 root      20   0   49540  33972  15632 S   0.0   0.9   0:00.61 python3
     45 root      20   0   50940  35632  15728 S   0.0   0.9   0:00.59 python3
     47 root      20   0   38236  22304  13816 S   0.0   0.6   0:00.25 python3
     68 root      20   0   55312   1128      0 S   0.0   0.0   0:00.31 sensord
     89 root      20   0   51084  26364   6360 S   0.0   0.7   0:00.98 python3
     96 root      20   0    4160   3276   2864 S   0.0   0.1   0:00.01 bash

Here's the ps that I'm watching to monitor the psud usage at the start (from the host).

Every 3.0s: ps aux | grep psud                                                                                                                                                        sonic: Sat Dec  9 19:10:56 2023

root     1668068  5.9  0.8  50748 35528 pts/0    S    19:08   0:07 python3 /usr/local/bin/psud
jdt      1668460  0.1  0.0   7216  3280 pts/2    S+   19:09   0:00 watch -n3 ps aux | grep psud
jdt      1669063  0.0  0.0   7216  1436 pts/2    S+   19:10   0:00 watch -n3 ps aux | grep psud
jdt      1669064  0.0  0.0   2576   952 pts/2    S+   19:10   0:00 sh -c ps aux | grep psud
jdt      1669066  0.0  0.0   6972  2068 pts/2    S+   19:10   0:00 grep psud

Here are the other outputs you asked for. In that first one, I think bgp is so high because of a problem with frrcfgd. I'm using the sonic_frr_mgmt_framework to configure the BGP and OSPF routing in the database and noticed that that daemon tends to run hot. I haven't had time to dig in to why that might be yet.

jdt@sonic:/var/log$ docker stats --no-stream
CONTAINER ID   NAME             CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O   PIDS
bc7213ac0e2d   snmp             4.24%     74.68MiB / 3.773GiB   1.93%     0B / 0B   0B / 0B     12
2fcc5b38652b   pmon             1.50%     122.3MiB / 3.773GiB   3.17%     0B / 0B   0B / 0B     19
1c9eb09f3b9f   mgmt-framework   0.03%     55.93MiB / 3.773GiB   1.45%     0B / 0B   0B / 0B     15
86d4deb3a68c   lldp             0.06%     64.68MiB / 3.773GiB   1.67%     0B / 0B   0B / 0B     14
35b8ad65dbf5   gnmi             0.86%     109.3MiB / 3.773GiB   2.83%     0B / 0B   0B / 0B     28
33cd64fe0e02   dhcp_relay       0.32%     62.69MiB / 3.773GiB   1.62%     0B / 0B   0B / 0B     28
c174510c980a   radv             0.03%     40.97MiB / 3.773GiB   1.06%     0B / 0B   0B / 0B     9
dc7ea08357fb   bgp              96.22%    228.9MiB / 3.773GiB   5.93%     0B / 0B   0B / 0B     34
052227f05ded   syncd            19.37%    502.1MiB / 3.773GiB   13.00%    0B / 0B   0B / 0B     53
ca09bd7dfd3b   teamd            0.05%     45.43MiB / 3.773GiB   1.18%     0B / 0B   0B / 0B     15
7c2de035651d   swss             0.18%     86.33MiB / 3.773GiB   2.23%     0B / 0B   0B / 0B     49
76d7d0b9e8c5   eventd           0.19%     49.29MiB / 3.773GiB   1.28%     0B / 0B   0B / 0B     20
7495ebd599f5   database         9.93%     139.5MiB / 3.773GiB   3.61%     0B / 0B   0B / 0B     14
jdt@sonic:/var/log$ sudo monit status
Monit 5.20.0 uptime: 2d 0h 1m

Filesystem 'root-overlay'
  status                       Accessible
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  permission                   755
  uid                          0
  gid                          0
  filesystem flags             0x1000
  block size                   4 kB
  space total                  28.8 GB (of which 0.1% is reserved for root user)
  space free for non superuser 23.4 GB [81.4%]
  space free total             23.5 GB [81.4%]
  inodes total                 1929536
  inodes free                  1829501 [94.8%]
  data collected               Sat, 09 Dec 2023 19:11:27

Filesystem 'var-log'
  status                       Accessible
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  permission                   755
  uid                          0
  gid                          0
  filesystem flags             0x1000
  block size                   4 kB
  space total                  487.2 MB (of which 7.4% is reserved for root user)
  space free for non superuser 269.6 MB [55.3%]
  space free total             305.5 MB [62.7%]
  inodes total                 32768
  inodes free                  32679 [99.7%]
  data collected               Sat, 09 Dec 2023 19:11:27

System 'sonic'
  status                       Running
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  load average                 [3.16] [3.08] [2.87]
  cpu                          63.1%us 16.1%sy 0.0%wa
  memory usage                 2.8 GB [75.3%]
  swap usage                   0 B [0.0%]
  uptime                       2d 0h 2m
  boot time                    Thu, 07 Dec 2023 19:09:59
  data collected               Sat, 09 Dec 2023 19:11:27

Process 'rsyslog'
  status                       Running
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          20610
  parent pid                   1
  uid                          0
  effective uid                0
  gid                          0
  uptime                       1d 23h 48m
  threads                      14
  children                     9
  cpu                          0.0%
  cpu total                    0.0%
  memory                       0.3% [12.3 MB]
  memory total                 1.7% [65.9 MB]
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'routeCheck'
  status                       Status ok
  monitoring status            Waiting
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'dualtorNeighborCheck'
  status                       Status ok
  monitoring status            Waiting
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'diskCheck'
  status                       Status ok
  monitoring status            Waiting
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'container_checker'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'vnetRouteCheck'
  status                       Status ok
  monitoring status            Waiting
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'memory_check'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'container_memory_snmp'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'container_memory_gnmi'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

Program 'container_eventd'
  status                       Status ok
  monitoring status            Waiting
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 19:11:27

@justindthomas
Copy link
Contributor Author

Here's the updated information, 3 hours later.

Threads:  20 total,   1 running,  19 sleeping,   0 stopped,   0 zombie
%Cpu(s): 77.1 us, 20.3 sy,  0.0 ni,  2.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3863.8 total,    206.8 free,   2745.4 used,    911.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    777.6 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     42 root      20   0  143208 127688  15668 S   6.6   3.2  14:06.03 python3
     88 root      20   0  274432  39512  16184 S   1.0   1.0   2:36.28 python3
      1 root      20   0   29732  25456   9428 S   0.0   0.6   0:08.12 supervisord
     31 root      20   0  124036  26636  14976 S   0.0   0.7   0:01.30 python3
     32 root      20   0  124036  26636  14976 S   0.0   0.7   0:00.00 ZMQbg/Reaper
     33 root      20   0  124036  26636  14976 S   0.0   0.7   0:00.00 ZMQbg/IO/0
     34 root      20   0  222220   6208   3684 S   0.0   0.2   0:00.02 rsyslogd
     35 root      20   0  222220   6208   3684 S   0.0   0.2   0:00.22 in:imuxsock
     36 root      20   0  222220   6208   3684 S   0.0   0.2   0:00.01 in:imfile
     37 root      20   0  222220   6208   3684 S   0.0   0.2   0:00.05 rs:main Q:Reg
     40 root      20   0  274432  39512  16184 S   0.0   1.0   0:00.62 python3
     86 root      20   0  274432  39512  16184 S   0.0   1.0   0:04.58 python3
     87 root      20   0  274432  39512  16184 S   0.0   1.0   0:47.23 python3
     44 root      20   0   49540  33840  15436 S   0.0   0.9   0:00.68 python3
     45 root      20   0   50940  35420  15516 S   0.0   0.9   0:00.63 python3
     47 root      20   0   38236  22292  13804 S   0.0   0.6   0:01.22 python3
     68 root      20   0   55312   2592   1456 S   0.0   0.1   0:17.24 sensord
     89 root      20   0   69676  44940   6404 S   0.0   1.1   1:58.48 python3
     96 root      20   0    4160   3380   2860 S   0.0   0.1   0:00.01 bash
   1545 root      20   0    7028   3216   2720 R   0.0   0.1   0:00.00 top
Every 3.0s: ps aux | grep psud                                                                                                                                                        sonic: Sat Dec  9 23:12:01 2023

root     1668068  5.8  3.2 143596 127952 pts/0   S    19:08  14:10 python3 /usr/local/bin/psud
jdt      1801100  0.3  0.0   7216  3304 pts/2    S+   23:11   0:00 watch -n3 ps aux | grep psud
jdt      1801107  0.0  0.0   7216  1380 pts/2    S+   23:12   0:00 watch -n3 ps aux | grep psud
jdt      1801108  0.0  0.0   2576   952 pts/2    S+   23:12   0:00 sh -c ps aux | grep psud
jdt      1801110  0.0  0.0   6972  2104 pts/2    S+   23:12   0:00 grep psud
jdt@sonic:~$ docker stats --no-stream
CONTAINER ID   NAME             CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O   PIDS
bc7213ac0e2d   snmp             5.14%     74.58MiB / 3.773GiB   1.93%     0B / 0B   0B / 0B     12
2fcc5b38652b   pmon             5.39%     247.8MiB / 3.773GiB   6.41%     0B / 0B   0B / 0B     19
1c9eb09f3b9f   mgmt-framework   0.09%     59.84MiB / 3.773GiB   1.55%     0B / 0B   0B / 0B     15
86d4deb3a68c   lldp             0.04%     64.73MiB / 3.773GiB   1.68%     0B / 0B   0B / 0B     14
35b8ad65dbf5   gnmi             0.88%     110.2MiB / 3.773GiB   2.85%     0B / 0B   0B / 0B     28
33cd64fe0e02   dhcp_relay       0.04%     62.62MiB / 3.773GiB   1.62%     0B / 0B   0B / 0B     28
c174510c980a   radv             0.04%     40.97MiB / 3.773GiB   1.06%     0B / 0B   0B / 0B     9
dc7ea08357fb   bgp              95.23%    229.2MiB / 3.773GiB   5.93%     0B / 0B   0B / 0B     34
052227f05ded   syncd            11.79%    501.8MiB / 3.773GiB   12.99%    0B / 0B   0B / 0B     53
ca09bd7dfd3b   teamd            0.05%     45.38MiB / 3.773GiB   1.17%     0B / 0B   0B / 0B     15
7c2de035651d   swss             0.17%     87.9MiB / 3.773GiB    2.27%     0B / 0B   0B / 0B     49
76d7d0b9e8c5   eventd           0.37%     49.39MiB / 3.773GiB   1.28%     0B / 0B   0B / 0B     20
7495ebd599f5   database         4.41%     138.1MiB / 3.773GiB   3.58%     0B / 0B   0B / 0B     14
jdt@sonic:~$ sudo monit status
Monit 5.20.0 uptime: 2d 4h 2m

Filesystem 'root-overlay'
  status                       Accessible
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  permission                   755
  uid                          0
  gid                          0
  filesystem flags             0x1000
  block size                   4 kB
  space total                  28.8 GB (of which 0.1% is reserved for root user)
  space free for non superuser 23.4 GB [81.4%]
  space free total             23.5 GB [81.4%]
  inodes total                 1929536
  inodes free                  1829501 [94.8%]
  data collected               Sat, 09 Dec 2023 23:12:47

Filesystem 'var-log'
  status                       Accessible
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  permission                   755
  uid                          0
  gid                          0
  filesystem flags             0x1000
  block size                   4 kB
  space total                  487.2 MB (of which 7.4% is reserved for root user)
  space free for non superuser 265.4 MB [54.5%]
  space free total             301.3 MB [61.8%]
  inodes total                 32768
  inodes free                  32679 [99.7%]
  data collected               Sat, 09 Dec 2023 23:12:47

System 'sonic'
  status                       Running
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  load average                 [2.03] [2.22] [2.24]
  cpu                          62.4%us 14.4%sy 0.0%wa
  memory usage                 3.0 GB [78.9%]
  swap usage                   0 B [0.0%]
  uptime                       2d 4h 3m
  boot time                    Thu, 07 Dec 2023 19:09:59
  data collected               Sat, 09 Dec 2023 23:12:47

Process 'rsyslog'
  status                       Running
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          20610
  parent pid                   1
  uid                          0
  effective uid                0
  gid                          0
  uptime                       2d 3h 49m
  threads                      14
  children                     9
  cpu                          0.0%
  cpu total                    0.0%
  memory                       0.3% [12.3 MB]
  memory total                 1.7% [65.9 MB]
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'routeCheck'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'dualtorNeighborCheck'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'diskCheck'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'container_checker'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'vnetRouteCheck'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'memory_check'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'container_memory_snmp'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'container_memory_gnmi'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

Program 'container_eventd'
  status                       Status ok
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  last exit value              0
  last output                  -
  data collected               Sat, 09 Dec 2023 23:12:47

@justindthomas
Copy link
Contributor Author

My best guess at this point is that it's a leak somewhere in the C swsscommon Table.set interface. I don't see any reason the Python code itself would be causing it.

But that doesn't explain why I'm only seeing it in psud. Although I don't see many other services that run as a daemon and issue a lot of repeated set commands.

@bobo-t
Copy link

bobo-t commented Dec 10, 2023

I see this issue too on Edgecore/Accton-AS7326-56X and Ufispace/S8901-54XC-980B boxes running current master. On my side the leak seems to be present in psud, pcied, thermalctld and xcrvd.

See below the top memory consumption on a AS7326-56X with 8 days uptime:
top - 08:09:35 up 8 days, 12:52, 1 user, load average: 0.48, 0.48, 0.53
Tasks: 285 total, 1 running, 281 sleeping, 0 stopped, 3 zombie
%Cpu(s): 3.6 us, 0.5 sy, 0.0 ni, 95.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15892.0 total, 255.5 free, 14467.7 used, 1168.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1028.4 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                 

9651 root 20 0 5456652 5.2g 15636 S 0.0 33.4 48:02.97 python3
9655 root 20 0 4539116 4.3g 16136 S 0.0 27.8 21:44.09 pcied
9749 root 20 0 1998292 1.9g 6580 S 0.0 12.1 14:53.60 python3
9649 root 20 0 936872 711512 16420 S 2.9 4.4 361:31.60 python3
6641 root 20 0 2551008 616236 173160 S 16.7 3.8 1864:51 syncd

pid info:
9651 pts/0 S 48:05 python3 /usr/local/bin/psud
9655 pts/0 S 21:45 /usr/bin/python3 /usr/local/bin/pcied
9749 pts/0 S 14:54 python3 /usr/local/bin/thermalctld
9649 pts/0 Sl 361:47 python3 /usr/local/bin/xcvrd

monit status
<...>
System 'sonic'
status Resource limit matched
monitoring status Monitored
monitoring mode active
on reboot start
load average [0.34] [0.37] [0.45]
cpu 4.3%us 1.1%sy 0.0%wa
memory usage 14.5 GB [93.6%]
swap usage 0 B [0.0%]
uptime 8d 13h 2m
boot time Fri, 01 Dec 2023 17:16:53
data collected Sun, 10 Dec 2023 06:18:33

docker stats --no-stream

CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
836b8cba7d28 snmp 5.24% 89.79MiB / 15.52GiB 0.57% 0B / 0B 6.5MB / 131kB 12
78eea650e9c8 lldp 0.07% 71.67MiB / 15.52GiB 0.45% 0B / 0B 4.62MB / 123kB 14
6366ff2b407f pmon 4.12% 12.14GiB / 15.52GiB 78.24% 0B / 0B 21.1MB / 4.65MB 19
276fa354f73e bgp 0.08% 190.9MiB / 15.52GiB 1.20% 0B / 0B 19.5MB / 1.3GB 32
acb30fa550fd syncd 5.42% 539.8MiB / 15.52GiB 3.40% 0B / 0B 259MB / 786kB 51
3381cff2150d teamd 0.42% 62.79MiB / 15.52GiB 0.40% 0B / 0B 1.6MB / 98.3kB 18
f9d7e44a7941 iccpd 0.11% 43.89MiB / 15.52GiB 0.28% 0B / 0B 737kB / 73.7kB 10
0c6c479e482f swss 0.16% 99.78MiB / 15.52GiB 0.63% 0B / 0B 26.8MB / 266kB 49
3fb3109ee034 eventd 0.23% 66.03MiB / 15.52GiB 0.42% 0B / 0B 352kB / 94.2kB 20
e787bdd536be database 4.94% 138.8MiB / 15.52GiB 0.87% 0B / 0B 48.5MB / 86kB 14

I compare with a AS7326-56X box running 202211, 82 days uptime:

top - 08:21:39 up 82 days, 6:08, 1 user, load average: 0.61, 0.95, 0.97
Tasks: 295 total, 1 running, 292 sleeping, 0 stopped, 2 zombie
%Cpu(s): 3.5 us, 1.6 sy, 0.0 ni, 94.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15903.2 total, 11311.5 free, 2441.4 used, 2150.2 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 13069.0 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                 

6870 root 20 0 2498512 635616 140268 S 11.6 3.9 17091:02 syncd
676 root 20 0 2048452 111160 52712 S 0.0 0.7 540:41.45 dockerd
4737 root 20 0 229100 104380 7720 S 7.3 0.6 10584:23 redis-server
6915 root 20 0 672052 86252 16400 S 0.0 0.5 186:19.87 orchagent
6615 root 20 0 566336 65480 9244 S 0.3 0.4 164:27.83 eventd

(none of the 4 pmon processes are in top)

monit status
<...>
System 'sonic'
status Running
monitoring status Monitored
monitoring mode active
on reboot start
load average [0.72] [0.97] [0.97]
cpu 4.3%us 1.1%sy 0.0%wa
memory usage 2.8 GB [17.8%]
swap usage 0 B [0.0%]
uptime 82d 6h 8m
boot time Tue, 19 Sep 2023 00:13:36
data collected Sun, 10 Dec 2023 06:21:43

docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
2d58ee86afe1 snmp 10.55% 87.11MiB / 15.53GiB 0.55% 0B / 0B 6.03MB / 131kB 12
31a360e1a888 lldp 3.66% 72.04MiB / 15.53GiB 0.45% 0B / 0B 4.38MB / 111kB 14
475b12ac0e92 pmon 3.95% 193.2MiB / 15.53GiB 1.21% 0B / 0B 19.4MB / 3.81MB 19
6e03f21c15b9 bgp 0.06% 206.7MiB / 15.53GiB 1.30% 0B / 0B 16.7MB / 11.6GB 30
e94a072318f9 syncd 24.04% 544MiB / 15.53GiB 3.42% 0B / 0B 231MB / 385kB 51
44c0af962a42 teamd 0.86% 72.91MiB / 15.53GiB 0.46% 0B / 0B 3.74MB / 98.3kB 22
b6df39bc23d3 p4rt 0.03% 40.9MiB / 15.53GiB 0.26% 0B / 0B 0B / 127kB 9
3f87518e5e6f iccpd 0.09% 46.75MiB / 15.53GiB 0.29% 0B / 0B 758kB / 73.7kB 10
f504f27d0095 swss 0.17% 159.4MiB / 15.53GiB 1.00% 0B / 0B 17.8MB / 266kB 50
a93b832d4c84 eventd 0.23% 109.2MiB / 15.53GiB 0.69% 0B / 0B 283kB / 94.2kB 20
e00477933221 database 4.91% 183.1MiB / 15.53GiB 1.15% 0B / 0B 47.2MB / 106kB 14

@justindthomas
Copy link
Contributor Author

On my side the leak seems to be present in psud, pcied, thermalctld and xcrvd.

Thanks for that note. That seems to support my thought that this may be a leak in the swsscommon Table set function. I was looking closely at xcrvd code yesterday because of some similarities to psud, but I hadn't noticed the same leak behavior there (probably just lighter usage of that in my scenario).

@justindthomas justindthomas changed the title Apparent memory leak on N3248TE-ON (possibly psud and thermalctld in pmon) Memory leaks in pmon daemons related to Redis set activity Dec 11, 2023
@bobo-t
Copy link

bobo-t commented Mar 4, 2024

I compiled an image with this PR:
sonic-net/sonic-swss-common#859

The memory leak is reduced but is still present, see below the stats for the same box AS7326-56X with 2 days uptime

top - 09:06:54 up 2 days, 18:13, 1 user, load average: 0.50, 0.73, 0.80
Tasks: 284 total, 1 running, 280 sleeping, 0 stopped, 3 zombie
%Cpu(s): 3.7 us, 0.9 sy, 0.0 ni, 95.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15892.0 total, 10683.8 free, 3199.9 used, 2008.3 buff/cache

monit status
<...>
System 'sonic'
status Running
monitoring status Monitored
monitoring mode active
on reboot start
load average [0.85] [0.78] [0.81]
cpu 4.2%us 1.1%sy 0.0%wa
memory usage 3.5 GB [22.7%]
swap usage 0 B [0.0%]
uptime 2d 18h 14m
boot time Fri, 01 Mar 2024 12:53:37
data collected Mon, 04 Mar 2024 07:07:51

docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c2c648f961e9 snmp 2.94% 87.74MiB / 15.52GiB 0.55% 0B / 0B 6.61MB / 197kB 12
40715633e710 lldp 0.04% 73.13MiB / 15.52GiB 0.46% 0B / 0B 4.2MB / 123kB 14
c2e3fb38d7f4 pmon 1.27% 1.109GiB / 15.52GiB 7.15% 0B / 0B 22.1MB / 3.91MB 19
68dd01aa8d87 bgp 0.06% 194.5MiB / 15.52GiB 1.22% 0B / 0B 17.6MB / 403MB 32
4208f075dea9 syncd 24.89% 537.1MiB / 15.52GiB 3.38% 0B / 0B 258MB / 664kB 52
8215c54eda9d teamd 29.13% 63.71MiB / 15.52GiB 0.40% 0B / 0B 1.62MB / 98.3kB 18
cc7d4286ea68 iccpd 20.83% 46.18MiB / 15.52GiB 0.29% 0B / 0B 778kB / 73.7kB 10
c684419a8fb2 swss 22.75% 89.57MiB / 15.52GiB 0.56% 0B / 0B 25.3MB / 266kB 49
4ba10b3b33d8 eventd 2.27% 65.83MiB / 15.52GiB 0.41% 0B / 0B 352kB / 94.2kB 26
18fc27613dd1 database 2.91% 142.8MiB / 15.52GiB 0.90% 0B / 0B 48.4MB / 86kB 14

@prgeor
Copy link
Contributor

prgeor commented Apr 17, 2024

@justindthomas The title says there are lot many set operation on redis DB. Can you try run the unit test mentioned here https://github.com/sonic-net/sonic-swss-common/pull/343/files and see if it similar issue.

@dvouilloz
Copy link

Is there any news ?
I have exactly the same problem with a Mellanox SN2010.

@mbze430
Copy link

mbze430 commented May 15, 2024

I also believe I have the same problem with the Celestica DX010 which I have reported in #18871 but this might extend down to older builds... since i am on 202311

From after a reboot this morning:

admin@sonic:~$ sudo docker stats --no-stream
CONTAINER ID   NAME             CPU %     MEM USAGE / LIMIT     MEM %     NET I/O   BLOCK I/O         PIDS
c774224056c8   dhcp_relay       0.72%     73.48MiB / 7.736GiB   0.93%     0B / 0B   10.2MB / 217kB    69
8623ab159769   snmp             15.48%    66.11MiB / 7.736GiB   0.83%     0B / 0B   12.6MB / 131kB    11
268b2ad842c7   pmon             0.56%     446.6MiB / 7.736GiB   5.64%     0B / 0B   19.3MB / 147kB    20
5b05efd6d4f0   mgmt-framework   0.03%     58.44MiB / 7.736GiB   0.74%     0B / 0B   65.9MB / 69.6kB   15
8261ca44f473   lldp             0.04%     59.59MiB / 7.736GiB   0.75%     0B / 0B   10.3MB / 115kB    13
f28dfff8d7fd   gnmi             1.34%     99.03MiB / 7.736GiB   1.25%     0B / 0B   43.8MB / 86kB     28
ca106ae11b65   bgp              0.06%     169.9MiB / 7.736GiB   2.15%     0B / 0B   22.2MB / 35.4MB   30
c933aa6d838b   radv             0.04%     33.23MiB / 7.736GiB   0.42%     0B / 0B   5.7MB / 77.8kB    8
419bf92d9d39   syncd            16.50%    531.6MiB / 7.736GiB   6.71%     0B / 0B   286MB / 274kB     49
ad8352bc6290   teamd            0.16%     39.12MiB / 7.736GiB   0.49%     0B / 0B   5.26MB / 90.1kB   15
5cf4f2acfda6   swss             0.22%     76.38MiB / 7.736GiB   0.96%     0B / 0B   33.8MB / 303kB    48
f57b759b15de   eventd           0.30%     43.37MiB / 7.736GiB   0.55%     0B / 0B   4.16MB / 77.8kB   19
1d53abbaf1e7   database         11.39%    132.8MiB / 7.736GiB   1.68%     0B / 0B   48.1MB / 86kB     13
admin@sonic:~$

PMON log before the (maybe during the reboot):

2024-05-10 23:05:50,312 INFO exited: dependent-startup (exit status 0; expected)
sonic-platform package not installed, attempting to install...
Error: Unable to locate /usr/share/sonic/platform/sonic_platform-1.0-py2-none-any.whl
/usr/local/lib/python3.9/dist-packages/supervisor/options.py:473: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
  self.warnings.warn(
2024-05-15 08:30:59,802 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2024-05-15 08:30:59,802 INFO Set uid to user 0 succeeded
Unlinking stale socket /var/run/supervisor.sock
2024-05-15 08:31:00,117 INFO RPC interface 'supervisor' initialized
2024-05-15 08:31:00,117 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-05-15 08:31:00,118 INFO supervisord started with pid 1
2024-05-15 08:31:01,123 INFO spawned: 'dependent-startup' with pid 21
2024-05-15 08:31:01,128 INFO spawned: 'supervisor-proc-exit-listener' with pid 22
2024-05-15 08:31:02,495 INFO success: dependent-startup entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-15 08:31:02,495 INFO success: supervisor-proc-exit-listener entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-15 08:31:02,598 INFO spawned: 'rsyslogd' with pid 25
2024-05-15 08:31:03,810 INFO success: rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-15 08:31:04,883 INFO spawned: 'chassis_db_init' with pid 29
2024-05-15 08:31:04,885 INFO success: chassis_db_init entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-05-15 08:31:04,897 INFO spawned: 'lm-sensors' with pid 30
2024-05-15 08:31:04,898 INFO success: lm-sensors entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2024-05-15 08:31:04,913 INFO spawned: 'fancontrol' with pid 31
2024-05-15 08:31:04,942 INFO spawned: 'xcvrd' with pid 33
2024-05-15 08:31:04,970 INFO spawned: 'psud' with pid 37
2024-05-15 08:31:05,001 INFO spawned: 'syseepromd' with pid 41
2024-05-15 08:31:05,040 INFO spawned: 'thermalctld' with pid 46
2024-05-15 08:31:05,078 INFO spawned: 'pcied' with pid 51
2024-05-15 08:31:07,155 INFO exited: chassis_db_init (exit status 0; expected)
2024-05-15 08:31:07,215 INFO exited: lm-sensors (exit status 0; expected)
2024-05-15 08:31:15,260 INFO success: fancontrol entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2024-05-15 08:31:15,260 INFO success: xcvrd entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2024-05-15 08:31:15,261 INFO success: psud entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2024-05-15 08:31:15,261 INFO success: syseepromd entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2024-05-15 08:31:15,261 INFO success: thermalctld entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2024-05-15 08:31:15,262 INFO success: pcied entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2024-05-15 08:31:16,730 INFO exited: dependent-startup (exit status 0; expected)

@arunlk-dell arunlk-dell removed their assignment May 17, 2024
@arunlk-dell
Copy link
Contributor

@prgeor unassigned from my name as the issue common to all platforms. The changes which i have is specific to dell platform.

@mbze430
Copy link

mbze430 commented May 17, 2024

I am not brave enough to go to master for production. I had a lot of issues with 202305. when doing a lot of configuration or change configuration, I would suddenly start to have issues where swss would just crash and not work unless I completely reload the ISO.

Meaning, even if I start off from a clean config_db.json or reloading a good config_db.json. swss would just crash contantly.

@Pterosaur
Copy link
Contributor

I raise a PR: sonic-net/sonic-swss-common#876 for fixing this issue, please check it.

@mbze430
Copy link

mbze430 commented May 24, 2024

Thank you! when will this get pass on to the 202311 build?

@Pterosaur
Copy link
Contributor

Thank you! when will this get pass on to the 202311 build?

Once this PR is merged, I will raise another PR for 202311 branch ASAP. I estimate it will be done next week.

@bobo-t
Copy link

bobo-t commented May 26, 2024

I compiled an image with the last PR: sonic-net/sonic-swss-common#876 and the issue is fixed:

Platform: x86_64-accton_as7326_56x-r0

uptime
08:56:11 up 1 day, 14 min, 1 user, load average: 0.71, 0.64, 0.64
docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
ef96419d7c3d snmp 0.20% 84.7MiB / 15.52GiB 0.53% 0B / 0B 6.21MB / 131kB 12
167d15f7afee lldp 0.06% 75.82MiB / 15.52GiB 0.48% 0B / 0B 4.14MB / 123kB 14
fd4b820499af pmon 1.14% 184.8MiB / 15.52GiB 1.16% 0B / 0B 19.4MB / 3.16MB 19
3c0fc3d007ab bgp 0.09% 189.3MiB / 15.52GiB 1.19% 0B / 0B 17.4MB / 133MB 32
6732c346084b syncd 20.27% 532.2MiB / 15.52GiB 3.35% 0B / 0B 257MB / 680kB 51
4cd3e20a8ade teamd 0.70% 65.7MiB / 15.52GiB 0.41% 0B / 0B 1.59MB / 98.3kB 20
088a52dfdd7b iccpd 0.13% 44.13MiB / 15.52GiB 0.28% 0B / 0B 733kB / 73.7kB 10
f53da8f61a96 swss 0.17% 89.77MiB / 15.52GiB 0.56% 0B / 0B 25.3MB / 266kB 49
e66027027c37 eventd 0.25% 62.5MiB / 15.52GiB 0.39% 0B / 0B 352kB / 98.3kB 19
519bc82b4070 database 5.34% 138.8MiB / 15.52GiB 0.87% 0B / 0B 48MB / 86kB 14

The pmon memory usage remains stable at ~ 184 MiB

lguohan pushed a commit to sonic-net/sonic-swss-common that referenced this issue May 28, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak #859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
Pterosaur added a commit to Pterosaur/sonic-swss-common that referenced this issue May 29, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak sonic-net#859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
Pterosaur added a commit to Pterosaur/sonic-swss-common that referenced this issue May 29, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak sonic-net#859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
Pterosaur added a commit to Pterosaur/sonic-swss-common that referenced this issue May 29, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak sonic-net#859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
yxieca pushed a commit to sonic-net/sonic-swss-common that referenced this issue Jun 3, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak #859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
@Pterosaur
Copy link
Contributor

This issue has been fixed and related PR has been merged into the sonic-buildimage.

@mbze430
Copy link

mbze430 commented Jun 7, 2024

@Pterosaur can you tell me how to track this against Azure DevOps? I still can't figure out how to find out what PR have been merged with the Azure DevOps.

@Pterosaur
Copy link
Contributor

Pterosaur commented Jun 7, 2024

@Pterosaur can you tell me how to track this against Azure DevOps? I still can't figure out how to find out what PR have been merged with the Azure DevOps.

Hi @mbze430 , I just checked the submodule, sonic-swss-common, of sonic-buildimage on the master and 202311 branch has included my fixed PR:

image
image

If you download a image from Azure DevOps(a.k.a Azure Pipeline), just chek the commit version via show ver in the sonic machine to see whether this commit includes the fixed submodule.

@mbze430
Copy link

mbze430 commented Jun 8, 2024

as of ver: SONiC Software Version: SONiC.202311.564467-43bcca75f
PMON is no longer showing any more memory leak. Thanks for fixing!

arfeigin pushed a commit to arfeigin/sonic-swss-common that referenced this issue Jun 27, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak sonic-net#859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
Pterosaur added a commit to Pterosaur/sonic-swss-common that referenced this issue Aug 28, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak sonic-net#859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
vincentchiang-ec pushed a commit to vincentchiang-ec/sonic-swss-common_ec that referenced this issue Oct 8, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak #859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
vincentchiang-ec pushed a commit to edge-core/sonic-swss-common that referenced this issue Oct 8, 2024
Fix the issue sonic-net/sonic-buildimage#17025 about Redis set activity

Description
The issue reports a memory leak on the Redis set operations

Reason
Didn't decrease the reference count after PySequence_GetItem
Use the inappropriate Swig API and didn't release the string of SWIG_AsPtr_std_string
Fix:
Refer PR: Fix swig template memory leak #859 from @praveenraja1
Replace the API SWIG_AsPtr_std_string to SWIG_AsVal_std_string
Add unit test
To monitor there is no dramatic memory increment after a huge amount of Redis set operations.

Signed-off-by: Ze Gan <zegan@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue for 202311 MSFT P0 Priority of the issue Triaged this issue has been triaged
Projects
None yet
Development

No branches or pull requests

10 participants