
Restart syncd fails in "applyView" with condition current view switches: 0 != temporary view switches #639

Open
rajkumar38 opened this issue Jul 9, 2020 · 15 comments

@rajkumar38
Contributor

rajkumar38 commented Jul 9, 2020

@kcudnik @lguohan : Please check.

All dockers were fully up when I issued "sudo systemctl restart syncd"; it fails in apply_view, which
causes an orchagent crash. Full logs attached for reference.

sonic-buildimage commit : f31eabb5ee65f7d37d57d0da85dacf39d3b5fad1
sonic-buildimage/src/sonic-sairedis commit 322dd01

syslog_FATAL_error.txt

Feb 14 10:23:37.448931 sonic NOTICE swss#orchagent: :- syncd_apply_view: Notify syncd APPLY_VIEW
Feb 14 10:23:37.448931 sonic NOTICE swss#orchagent: :- notifySyncd: sending syncd: APPLY_VIEW
Feb 14 10:23:37.450194 sonic WARNING syncd#syncd: :- processNotifySyncd: syncd received APPLY VIEW, will translate
Feb 14 10:23:37.450999 sonic NOTICE syncd#syncd: :- dump: getting took 0.000143 sec
Feb 14 10:23:37.451152 sonic NOTICE syncd#syncd: :- getAsicView: ASIC_STATE switch count: 0:
Feb 14 10:23:37.451342 sonic NOTICE syncd#syncd: :- getAsicView: get asic view from ASIC_STATE took 0.000990 sec
Feb 14 10:23:37.454001 sonic NOTICE syncd#syncd: :- dump: getting took 0.002015 sec
Feb 14 10:23:37.454562 sonic NOTICE syncd#syncd: :- getAsicView: TEMP_ASIC_STATE switch count: 1:
Feb 14 10:23:37.454699 sonic NOTICE syncd#syncd: :- getAsicView: oid:0x21000000000000: objects count: 64
Feb 14 10:23:37.455059 sonic NOTICE syncd#syncd: :- getAsicView: get asic view from TEMP_ASIC_STATE took 0.003421 sec
Feb 14 10:23:37.455239 sonic ERR syncd#syncd: :- applyView: current view switches: 0 != temporary view switches: 1, FATAL
Feb 14 10:23:37.475524 sonic NOTICE syncd#syncd: :- applyView: apply took 0.025017 sec
Feb 14 10:23:37.477195 sonic ERR syncd#syncd: :- run: Runtime error: :- applyView: current view switches: 0 != temporary view switches: 1, FATAL
Feb 14 10:23:37.477339 sonic NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Feb 14 10:23:37.485238 sonic NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfull
Feb 14 10:23:37.505620 sonic INFO swss#supervisord: orchagent terminate called after throwing an instance of 'std::invalid_argument'
Feb 14 10:23:37.506549 sonic INFO swss#supervisord: orchagent what(): parse error - unexpected 'o'

=============================================================================

root@sonic:/home/admin# show version

SONiC Software Version: SONiC.jun9.0-dirty-20200706.194346
Distribution: Debian 10.4
Kernel: 4.19.0-6-armmp
Build commit: f31eabb5
Build date: Mon Jul 6 14:54:32 UTC 2020
Built by: marvell@cpss-build7

Platform: armhf-marvell_et6448m_52x-r0
HwSKU: et6448m
ASIC: marvell
Serial Number: 201908150001
Uptime: 11:38:48 up 1:27, 1 user, load average: 0.51, 0.56, 0.52

Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-orchagent jun9.0-dirty-20200706.194346 9af6ab3099fe 352MB
docker-orchagent latest 9af6ab3099fe 352MB
docker-fpm-frr jun9.0-dirty-20200706.194346 1dedd2aa317c 300MB
docker-fpm-frr latest 1dedd2aa317c 300MB
docker-sflow jun9.0-dirty-20200706.194346 b6fb327d6add 286MB
docker-sflow latest b6fb327d6add 286MB
docker-nat jun9.0-dirty-20200706.194346 696e475b4d4c 287MB
docker-nat latest 696e475b4d4c 287MB
docker-teamd jun9.0-dirty-20200706.194346 c203e683b9a5 286MB
docker-teamd latest c203e683b9a5 286MB
docker-syncd-mrvl jun9.0-dirty-20200706.194346 a15ba5251617 531MB
docker-syncd-mrvl latest a15ba5251617 531MB
docker-snmp-sv2 jun9.0-dirty-20200706.194346 1967491bf338 348MB
docker-snmp-sv2 latest 1967491bf338 348MB
docker-lldp jun9.0-dirty-20200706.194346 f49ecd7684d8 328MB
docker-lldp latest f49ecd7684d8 328MB
docker-database jun9.0-dirty-20200706.194346 c85327cc927e 313MB
docker-database latest c85327cc927e 313MB
docker-platform-monitor jun9.0-dirty-20200706.194346 9752ae58bea8 322MB
docker-platform-monitor latest 9752ae58bea8 322MB
docker-router-advertiser jun9.0-dirty-20200706.194346 be949c4f52b2 270MB
docker-router-advertiser latest be949c4f52b2 270MB
docker-dhcp-relay jun9.0-dirty-20200706.194346 a05bf6ab2d92 277MB
docker-dhcp-relay latest a05bf6ab2d92 277MB

@kcudnik
Collaborator

kcudnik commented Jul 10, 2020

applyView: current view switches: 0 != temporary view switches: 1, FATAL
This suggests there was no switch in the previous run of syncd.
What scenario is this? Cold boot, then warm boot, or cold boot only?
This seems to be running on VS; is this a consistent repro?

@kcudnik
Collaborator

kcudnik commented Jul 10, 2020

something is wrong here:
Feb 14 10:22:37.915367 sonic NOTICE syncd#syncd: :- helperLoadColdVids: read 1346 COLD VIDS
Feb 14 10:22:44.154247 sonic NOTICE syncd#syncd: :- onSyncdStart: hard reinit succeeded
hard reinit succeeded, and there were over 1k objects in the db

Feb 14 10:23:37.043951 sonic NOTICE syncd#syncd: :- onSwitchCreateInInitViewMode: new switch oid:0x21000000000000 contains hardware info: '' // this indicates that switch is already present in syncd memory, which is ok

Feb 14 10:23:37.450999 sonic NOTICE syncd#syncd: :- dump: getting took 0.000143 sec
Feb 14 10:23:37.451152 sonic NOTICE syncd#syncd: :- getAsicView: ASIC_STATE switch count: 0: // and this is wrong: no switches in the database dump, even though one was already loaded; the dump time is also very low, which suggests the db is empty
Was the db somehow cleared?

Feb 14 10:23:37.451342 sonic NOTICE syncd#syncd: :- getAsicView: get asic view from ASIC_STATE took 0.000990 sec
Feb 14 10:23:37.454001 sonic NOTICE syncd#syncd: :- dump: getting took 0.002015 sec
Feb 14 10:23:37.454562 sonic NOTICE syncd#syncd: :- getAsicView: TEMP_ASIC_STATE switch count: 1:
Feb 14 10:23:37.454699 sonic NOTICE syncd#syncd: :- getAsicView: oid:0x21000000000000: objects count: 64
Feb 14 10:23:37.455059 sonic NOTICE syncd#syncd: :- getAsicView: get asic view from TEMP_ASIC_STATE took 0.003421 sec
Feb 14 10:23:37.455239 sonic ERR syncd#syncd: :- applyView: current view switches: 0 != temporary view switches: 1, FATAL

@kcudnik
Collaborator

kcudnik commented Jul 10, 2020

After reaching this state, could you check with "redis-cli -n 1" and "keys *" to see if there are any ASIC_STATE keys other than TEMP_ASIC_STATE?
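This check can be scripted against a saved key dump. A minimal sketch, assuming the "keys *" output has been redirected to a file (as was later done with ASIC.txt in this issue); the inline fallback sample is purely illustrative:

```shell
# Sketch: count current-view vs temp-view keys in a saved "keys *" dump.
# On the device the dump would come from: redis-cli -n 1 keys '*' > ASIC.txt
dump=ASIC.txt
if [ ! -f "$dump" ]; then
    # fall back to an inline sample so the script runs stand-alone
    dump=$(mktemp)
    cat > "$dump" <<'EOF'
TEMP_ASIC_STATE:SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000
TEMP_ASIC_STATE:SAI_OBJECT_TYPE_PORT:oid:0x1000000000002
RIDTOVID
EOF
fi

# TEMP_ASIC_STATE keys would also match a bare ASIC_STATE pattern,
# so anchor both matches at the start of the line.
current=$(grep -c '^ASIC_STATE:' "$dump")
temp=$(grep -c '^TEMP_ASIC_STATE:' "$dump")

echo "current view keys: $current"
echo "temporary view keys: $temp"
```

With the sample data the current-view count is 0 while the temp view has entries, matching the applyView failure signature.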

@rajkumar38
Contributor Author

rajkumar38 commented Jul 11, 2020

applyView: current view switches: 0 != temporary view switches: 1, FATAL
This suggests there was no switch in the previous run of syncd.
What scenario is this? Cold boot, then warm boot, or cold boot only?
This seems to be running on VS; is this a consistent repro?

Steps to reproduce:

Using an ET-6448M board with an ARMADA-385 CPU, 48 1G copper ports, and 4 10G SFP+ ports.

  1. In stand-alone box, load t1 preset config and reboot the box.
  2. Verify all the ports are present and IP address are set.
  3. Restart syncd, using command "sudo systemctl restart syncd" multiple times.
    I'm consistently seeing this issue on the second iteration of the syncd restart.

@rajkumar38
Contributor Author

After reaching this state, could you check with "redis-cli -n 1" and "keys *" to see if there are any ASIC_STATE keys other than TEMP_ASIC_STATE?

Attached redis-cli output in issue state.
ASIC.txt

@kcudnik
Collaborator

kcudnik commented Jul 11, 2020

In ASIC.txt there are no ASIC_STATE* entries; it seems those entries were removed or the database was cleared. Could we work on this together this week?

@rajkumar38
Contributor Author

In ASIC.txt there are no ASIC_STATE* entries; it seems those entries were removed or the database was cleared. Could we work on this together this week?

Yes. We can work on this.

@kcudnik
Collaborator

kcudnik commented Jul 13, 2020

Since the ASIC.txt dump doesn't contain "HIDDEN" or any ASIC_STATE* keys, the database was flushed.
I also suspect the RIDTOVID and VIDTORID hashes have a small entry count, fewer than 100; could you also check "keys RIDTOVID"?

If so, some script probably did FLUSHALL on the database, most likely outside syncd/sairedis/OA.

@rajkumar38
Contributor Author

rajkumar38 commented Jul 14, 2020

Since the ASIC.txt dump doesn't contain "HIDDEN" or any ASIC_STATE* keys, the database was flushed.
I also suspect the RIDTOVID and VIDTORID hashes have a small entry count, fewer than 100; could you also check "keys RIDTOVID"?

If so, some script probably did FLUSHALL on the database, most likely outside syncd/sairedis/OA.

"keys RIDTOVID" has some 116 entries.
ASIC_RIDTOVID_keys.txt

Additionally, I collected the "redis-cli monitor" logs.
redis_monitor_logs.txt

And syncd logs with "DEBUG" level enabled.
syslog_DEBUG_logs.txt

This shows that the Lua scripts loaded in the syncd context as part of the docker restart clean up the ASIC state. Hope these logs are helpful.

Note that in the above log file syncd crashed; this is because of the 32-bit arch, which we will fix in parallel.
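Whatever flushed the DB should be visible in the monitor capture itself. A minimal sketch that scans a saved "redis-cli monitor" log for flush commands; the inline sample below is illustrative, on the device the input would be the attached redis_monitor_logs.txt:

```shell
# Scan a "redis-cli monitor" capture for commands that wipe a database.
log=redis_monitor_logs.txt
if [ ! -f "$log" ]; then
    # fall back to an inline sample so the script runs stand-alone
    log=$(mktemp)
    cat > "$log" <<'EOF'
1550139724.135934 [0 127.0.0.1:53888] "SELECT" "1"
1550139724.137403 [1 127.0.0.1:53888] "FLUSHDB"
1550139761.398453 [1 lua] "HSET" "TEMP_ASIC_STATE:..."
EOF
fi

# A FLUSHDB/FLUSHALL entry between syncd start and applyView would
# explain the empty current view.
grep -En '"FLUSH(DB|ALL)"' "$log"
```

The timestamp and client address printed on the matching line point at the process that issued the flush.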

@kcudnik
Collaborator

kcudnik commented Jul 14, 2020

Hey, thanks for those logs; this explains the issue. Take a look here:
1550139701.569199 [1 unix:/var/run/redis/redis.sock] "KEYS" "ASIC_STATE:*" // full of objects
...
1550139724.135934 [0 127.0.0.1:53888] "SELECT" "1"
1550139724.137403 [1 127.0.0.1:53888] "FLUSHDB" // THIS CLEARS DB !
...
1550139761.398453 [1 lua] "HSET" "TEMP_ASIC_STATE:SAI_OBJECT_TYPE_ROUTE_ENTRY:{"de // OA populates temp view
...
1550139761.738593 [1 lua] "KEYS" "ASIC_STATE:*" // this one is empty
1550139760.677580 [1 unix:/var/run/redis/redis.sock] "KEYS" "TEMP_ASIC_STATE:*" // this one is full of objects

Will try to investigate this internally.

This is related to sonic-buildimage/files/scripts/swss.sh execution; it seems this script is executed when OA and syncd are already running.
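If swss.sh is indeed flushing the ASIC DB while the daemons are still up, one possible direction is to guard the flush. A hedged sketch with hypothetical names; this is not the real swss.sh content, just an illustration of the idea:

```shell
# Hypothetical guard: only flush the ASIC DB (redis db 1) when the
# daemons that own its state are actually stopped. The function name
# and the flush command below are illustrative, not taken from swss.sh.
flush_asic_db() {
    if pgrep -x syncd >/dev/null 2>&1 || pgrep -x orchagent >/dev/null 2>&1; then
        echo "syncd/orchagent still running, skipping ASIC DB flush"
        return 1
    fi
    redis-cli -n 1 FLUSHDB
}
```

Called from the service stop path, a guard like this would leave the current view intact when only syncd is restarted.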

@kcudnik
Collaborator

kcudnik commented Jul 14, 2020

Could you try to execute "sudo systemctl restart swss" or "config reload" instead of "syncd"? I think restarting syncd alone is currently not supported.
Is there a reason why you want to restart only syncd?

@rajkumar38
Contributor Author

Could you try to execute "sudo systemctl restart swss" or "config reload" instead of "syncd"? I think restarting syncd alone is currently not supported.
Is there a reason why you want to restart only syncd?

Thanks for the update. I see this issue only with restart_syncd; all other cases (reboot, config reload, restart swss) are fine.
Any reason why restart_syncd is not supported?
I see a restart_syncd test case in the sonic-mgmt repo: https://github.com/Azure/sonic-mgmt/blob/master/ansible/roles/test/tasks/restart_syncd.yml

@rajkumar38
Contributor Author

@yxieca @kcudnik Hi, is this bug scheduled for a fix? Let me know if you need any more logs. Thanks.

@kcudnik
Collaborator

kcudnik commented Aug 27, 2020

I don't know why restart_syncd is not supported; maybe @yxieca can help here.
From the syncd/sairedis perspective this is not a bug, and supporting the restart_syncd command is a system-level concern.

@yxieca
Contributor

yxieca commented Aug 27, 2020

Restarting syncd should be the same as restarting swss, which could trigger some dependent service failures. The issue is not really with swss or syncd; at this moment, config reload is the safest way to recover.

With that said, I think we should try to make restarting syncd/swss work.
