[Dual-ToR][active-active] After 'config reload' it takes a lot of time to set up mux ports #15644

Open
ayurkiv-nvda opened this issue Nov 11, 2024 · 7 comments

@ayurkiv-nvda
Contributor

ayurkiv-nvda commented Nov 11, 2024

Description

This is a regression compared to 202311.
(It did not reproduce on 202311.689016-bba184c4d.)

Steps to reproduce the issue:

The problem was found while running the community test test_active_link_admin_down_config_reload_downstream[active-active] on 202405.
Original scenario:

  1. Set up Dual-ToR A-A
  2. Shut down all mux ports on ToR-1
  3. config save -y
  4. config reload -y

Simplified scenario:

  1. Set up Dual-ToR A-A
  2. config reload -y

Describe the results you received:

  • the community test fails
  • after 'config reload' finishes, the mux ports are expected to be up within the time the test allows (60 seconds after config reload is executed, plus 30 extra seconds), but they are not
  • redis-dump -d 0 -k "HW_MUX_CABLE_TABLE:*" returns empty lines, and show mux status is empty

On 202405 it takes approximately 2 minutes 15 seconds between config reload and non-empty data in redis-dump -d 0 -k "HW_MUX_CABLE_TABLE:*".

On 202311 it takes ~2 minutes between config reload and non-empty data in redis-dump -d 0 -k "HW_MUX_CABLE_TABLE:*".
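
The timings above can be measured with a small polling loop. The following is a minimal sketch, not the test's actual implementation: `time_until_ready`, `mux_table_has_entries`, and the 5-second interval are hypothetical names and values chosen for illustration, and the emptiness check assumes that an empty table yields blank output or an empty JSON object.

```python
import subprocess
import time
from typing import Callable, Optional


def time_until_ready(
    probe: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None if timeout_s passes first.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def mux_table_has_entries() -> bool:
    """Hypothetical probe: run the redis-dump command quoted in this issue.

    Assumption: blank output or "{}" means the table is still empty.
    """
    result = subprocess.run(
        ["redis-dump", "-d", "0", "-k", "HW_MUX_CABLE_TABLE:*"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() not in ("", "{}")
```

Calling `time_until_ready(mux_table_has_entries, timeout_s=300)` on the DUT right after `config reload -y` would return the number of seconds until the table is populated, or `None` if it stays empty for 5 minutes.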

Describe the results you expected:

show mux status shows valid info 2 minutes after config reload

Output of show version:

SONiC Software Version: SONiC.202405.689023-cf8484700
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: cf8484700
Build date: Thu Nov  7 12:26:51 UTC 2024
Built by: azureuser@939e246cc000000

Platform: x86_64-mlnx_msn4600c-r0
HwSKU: Mellanox-SN4600C-C64
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2131X10295
Model Number: MSN4600-CS2FO
Hardware Revision: AB
Uptime: 15:21:09 up  1:11,  1 user,  load average: 0.48, 0.37, 0.50
Date: Mon 11 Nov 2024 15:21:09

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@bingwang-ms
Collaborator

Hi @ayurkiv-nvda, can you please share syslog when running into this issue?
@zjswhhh, @lolyu FYI.

@ayurkiv-nvda
Contributor Author

Hello @bingwang-ms
Just sent it via mail, please check
🙂

@zjswhhh
Contributor

zjswhhh commented Nov 12, 2024

Hi @ayurkiv-nvda -

Based on the log you shared, config reload started at 01:25:28:
2024 Nov 12 01:25:28.386894 mtvr-tigon-02 NOTICE switch_hash: 'reload' executing with command: config reload -y

But the interfaces didn't come up until 01:27:37:

... ... 
Line 119207: 2024 Nov 12 01:27:37.784629 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet16 oper state set from down to up
Line 119235: 2024 Nov 12 01:27:37.791358 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet20 oper state set from down to up
Line 119268: 2024 Nov 12 01:27:37.802129 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet32 oper state set from down to up
Line 119294: 2024 Nov 12 01:27:37.809063 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet116 oper state set from down to up
Line 119308: 2024 Nov 12 01:27:37.817029 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet124 oper state set from down to up
Line 119338: 2024 Nov 12 01:27:37.825814 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet120 oper state set from down to up
Line 119361: 2024 Nov 12 01:27:37.832792 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet28 oper state set from down to up
Line 119380: 2024 Nov 12 01:27:37.839368 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet44 oper state set from down to up
Line 119405: 2024 Nov 12 01:27:37.846062 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet36 oper state set from down to up
Line 119422: 2024 Nov 12 01:27:37.852459 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet60 oper state set from down to up
Line 119438: 2024 Nov 12 01:27:37.858093 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet48 oper state set from down to up
Line 119465: 2024 Nov 12 01:27:37.865628 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet24 oper state set from down to up
Line 119479: 2024 Nov 12 01:27:37.870400 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet72 oper state set from down to up
Line 119543: 2024 Nov 12 01:27:37.888840 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet56 oper state set from down to up
Line 119555: 2024 Nov 12 01:27:37.893710 mtvr-tigon-02 NOTICE swss#orchagent: :- updatePortOperStatus: Port Ethernet40 oper state set from down to up
... ...
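
For reference, the gap between the two timestamps quoted above can be computed directly from the syslog lines. This is a minimal sketch: `parse_syslog_ts` is a hypothetical helper, the `Line NNNNNN:` search prefix is dropped from the port line, and month-name parsing assumes an English locale.

```python
from datetime import datetime

# Timestamp layout used by the quoted syslog lines, e.g. "2024 Nov 12 01:25:28.386894".
SYSLOG_TS_FORMAT = "%Y %b %d %H:%M:%S.%f"


def parse_syslog_ts(line: str) -> datetime:
    """Parse the timestamp from the first four whitespace-separated fields."""
    return datetime.strptime(" ".join(line.split()[:4]), SYSLOG_TS_FORMAT)


reload_line = (
    "2024 Nov 12 01:25:28.386894 mtvr-tigon-02 NOTICE switch_hash: "
    "'reload' executing with command: config reload -y"
)
first_up_line = (
    "2024 Nov 12 01:27:37.784629 mtvr-tigon-02 NOTICE swss#orchagent: "
    ":- updatePortOperStatus: Port Ethernet16 oper state set from down to up"
)

# Seconds between the reload command and the first port coming up.
delay_s = (parse_syslog_ts(first_up_line) - parse_syslog_ts(reload_line)).total_seconds()
print(f"{delay_s:.3f} s")  # prints "129.398 s"
```

That is roughly 2 minutes 9 seconds from the reload command to the first port going operationally up.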

SONiC assumes that links should come operationally up within 5 min after config reload is executed. Normally I have seen links up within 2 min, so this might still fall within our expectation. May I ask you a couple of questions though:

  • are you running tests on a physical setup or using a simulated environment?
  • did this test use to pass on your setup on an earlier version?

Though we might need to adjust the timeout for this test anyway.

@ayurkiv-nvda
Contributor Author

ayurkiv-nvda commented Nov 12, 2024

Hello @zjswhhh
- We are running the test on nic-simulator for Dual-ToR A-A, but the switch is physical.
- Yes, this test passed successfully on 202311 (checked on 202311.689016-bba184c4d).
The difference is 10-15 seconds (for port readiness) when executing test_active_link_admin_down_config_reload_downstream.

@zjswhhh
Contributor

zjswhhh commented Nov 13, 2024

Hi @ayurkiv-nvda - Checked our nightly test data, this case passed consistently on 202405 and 202311.

Since the delay still matches our expectation, I feel it's unlikely to be an image issue.

The current wait time in this test is 60s+30s after config reload is triggered. I suggest we increase that to 120s or 150s.
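
As a rough check of the candidate wait times against the delays reported earlier in this thread, here is a minimal sketch. The delay values are assumptions rounded from "~2 minutes" and "approximately 2 minutes 15 seconds" in the original report, and they describe when `HW_MUX_CABLE_TABLE` becomes non-empty on one setup, which may not be the exact condition the test waits for.

```python
from typing import List

# Wait the test currently allows after config reload: 60 s + 30 s.
CURRENT_WAIT_S = 60 + 30

# Approximate delays from the original report (assumption: rounded values).
REPORTED_DELAY_S = {"202311": 120, "202405": 135}

# Current wait plus the two proposed values.
CANDIDATE_WAITS_S = (CURRENT_WAIT_S, 120, 150)


def covered_releases(wait_s: int) -> List[str]:
    """Return the releases whose reported delay fits within wait_s."""
    return sorted(rel for rel, delay in REPORTED_DELAY_S.items() if delay <= wait_s)


for wait_s in CANDIDATE_WAITS_S:
    print(wait_s, covered_releases(wait_s))
```

Under these assumed figures, only the 150s wait covers both reported delays.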

@ayurkiv-nvda
Contributor Author

Hello @zjswhhh
I think in this case we need to move this ticket from sonic-buildimage to sonic-mgmt so that it can be handled in the test_active_link_admin_down_config_reload_downstream test.

@prgeor prgeor added the Triaged label Nov 20, 2024
@prgeor
Contributor

prgeor commented Nov 20, 2024

@zjswhhh @bingwang-ms can you confirm who is working on increasing the timeout?

@prgeor prgeor assigned zjswhhh and unassigned bingwang-ms Nov 20, 2024
@prgeor prgeor added the MSFT label Nov 20, 2024
@prgeor prgeor transferred this issue from sonic-net/sonic-buildimage Nov 20, 2024