Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed race conditions and bugs in xds client #941

Merged
merged 3 commits into from
Oct 10, 2023
Merged

Fixed race conditions and bugs in xds client #941

merged 3 commits into from
Oct 10, 2023

Conversation

shivamgupta1
Copy link
Contributor

@shivamgupta1 shivamgupta1 commented Oct 6, 2023

Summary

Fix three issues:

  • XdsClientImpl currently fails to detect connection failure to the ADS server. Connection failure leads to no notification to the StreamObserver since onError and onCompleted are terminal notifications. Instead, the onReady notification needs to be relied on to detect successful connection establishment.
  • There were some race conditions
    • startRpcStream public method didn't use the single-threaded executor for execution. It also didn't check for an already existing stream, which could lead to multiple ads streams.
    • It didn't account for backed off client state, when retries are exponentially backed off due to repeat failures. In such cases, stream could still get started due to external method calls.
  • Allow passing custom executor for the xDS client for custom monitoring and IC handling.

Testing Done

Built local snapshot and deployed against toki. Typically it seems to take 1 sec for the stream to get ready. Tested against various scenarios:

  • Successful connection
  • Pinned observer; connection failure on startup.
  • Pinned observer; observer crashes while running.
  • Pinned observer; observer starts up again.

Normal state

2023/10/06 14:14:31.752 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream started, connected to server: main.indis-registry-observer.ei-ltx1.atd.disco.linkedin.com:32123
2023/10/06 14:14:31.753 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/clusters/NonExistentCluster
2023/10/06 14:14:31.754 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/clusters/NonExistentCluster]
2023/10/06 14:14:32.739 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream ready, cancelled timeout task: true
2023/10/06 14:14:32.742 INFO [XdsLoadBalancer] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Enabled primary stores
2023/10/06 14:14:46.672 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/services/dataVaultAccessControlList
2023/10/06 14:14:46.673 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/services/dataVaultAccessControlList]
2023/10/06 14:18:57.778 INFO [ClusterLoadBalancerSubscriber] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] getting new ClusterInfoItem for cluster TokiBackend: _clusterProperties = ClusterProperties [_clusterName=TokiBackend, _prioritizedSchemes=[], _properties={}, _bannedUris=[], _partitionProperties=com.linkedin.d2.balancer.properties.NullPartitionProperties@1aa51757, _sslSessionValidationStrings=[toki-restli-backend], _darkClusterConfigMap={darkTokiBackend={multiplier=1.0, dispatcherMaxRequestsToBuffer=1, dispatcherOutboundTargetRate=0.0, darkClusterStrategyList=[RELATIVE_TRAFFIC], transportClientProperties={http.streamingTimeout=30000, http.idleTimeout=30000, http.protocolVersion=HTTP_2, http.maxHeaderSize=16384, http.maxResponseSize=20971520, http.requestTimeout=30000}, dispatcherBufferedRequestExpiryInSeconds=1}}, _delegated=false]

Connection failure on startup (Observer pinned and crashed)

2023/10/06 15:11:24.422 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream started, connected to server: ltx1-app4252.stg.linkedin.com:32123
2023/10/06 15:11:24.424 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/clusters/NonExistentCluster
2023/10/06 15:11:24.426 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/clusters/NonExistentCluster]
2023/10/06 15:11:29.284 WARN [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream not ready within 5000 milliseconds
2023/10/06 15:11:29.286 INFO [XdsLoadBalancer] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Enabled backup stores
2023/10/06 15:13:18.556 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream ready, cancelled timeout task: false

Start observer; immediately the stream becomes ready, and all the pending subscription requests are sent again

2023/10/06 15:13:18.556 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream ready, cancelled timeout task: false
2023/10/06 15:13:18.557 INFO [XdsLoadBalancer] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Enabled primary stores
2023/10/06 15:13:18.560 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/clusters/DataVaultAclService
2023/10/06 15:13:18.561 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/clusters/DataVaultAclService]
2023/10/06 15:13:18.563 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/clusters/TokiBackend
2023/10/06 15:13:18.563 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/clusters/TokiBackend]
2023/10/06 15:13:18.563 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/clusters/LixBackend
2023/10/06 15:13:18.564 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/clusters/LixBackend]
2023/10/06 15:13:18.564 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/clusters/DataVaultChangeMgmt
2023/10/06 15:13:18.565 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/clusters/DataVaultChangeMgmt]
2023/10/06 15:13:18.565 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/services/dataVaultAccessControlList
2023/10/06 15:13:18.565 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/services/dataVaultAccessControlList]
2023/10/06 15:13:18.565 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/services/dataVaultAclChanges
2023/10/06 15:13:18.566 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/services/dataVaultAclChanges]
2023/10/06 15:13:18.568 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/services/lixClientContexts
2023/10/06 15:13:18.568 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/services/lixClientContexts]
2023/10/06 15:13:18.568 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE resource /d2/services/tokiBackend
2023/10/06 15:13:18.568 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/services/tokiBackend]
2023/10/06 15:13:18.569 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE_MAP resource /d2/uris/DataVaultAclService
2023/10/06 15:13:18.569 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE_MAP request for resources: [/d2/uris/DataVaultAclService]
2023/10/06 15:13:18.569 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE_MAP resource /d2/uris/TokiBackend
2023/10/06 15:13:18.570 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE_MAP request for resources: [/d2/uris/TokiBackend]
2023/10/06 15:13:18.570 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE_MAP resource /d2/uris/LixBackend
2023/10/06 15:13:18.570 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE_MAP request for resources: [/d2/uris/LixBackend]
2023/10/06 15:13:18.571 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Subscribe D2_NODE_MAP resource /d2/uris/DataVaultChangeMgmt
2023/10/06 15:13:18.571 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE_MAP request for resources: [/d2/uris/DataVaultChangeMgmt]

The local filestore is also updated immediately

[sgupta8@sgupta8-ld2 toki]$ ls -ahl /tmp/d2/indis/uris/TokiBackend.ini
-rw-r--r-- 1 sgupta8 eng 753 Oct  6 15:13 /tmp/d2/indis/uris/TokiBackend.ini

Test Observer crash

2023/10/06 15:16:34.302 ERROR [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream closed with status UNAVAILABLE: Network closed for unknown reason. Cause: null
2023/10/06 15:16:34.302 INFO [XdsLoadBalancer] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Enabled backup stores
2023/10/06 15:16:34.302 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Retry ADS stream in 0 ns
2023/10/06 15:16:34.312 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream started, connected to server: ltx1-app4252.stg.linkedin.com:32123
2023/10/06 15:16:34.313 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE request for resources: [/d2/clusters/DataVaultAclService, /d2/clusters/TokiBackend, /d2/services/dataVaultAclChanges, /d2/clusters/NonExistentCluster, /d2/clusters/LixBackend, /d2/services/tokiBackend, /d2/clusters/DataVaultChangeMgmt, /d2/services/dataVaultAccessControlList, /d2/services/lixClientContexts]
2023/10/06 15:16:34.316 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Sending D2_NODE_MAP request for resources: [/d2/uris/LixBackend, /d2/uris/TokiBackend, /d2/uris/DataVaultChangeMgmt, /d2/uris/DataVaultAclService]
2023/10/06 15:16:39.309 WARN [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream not ready within 5000 milliseconds

Start Observer again after crash

2023/10/06 15:18:23.929 INFO [XdsClientImpl] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] ADS stream ready, cancelled timeout task: false
2023/10/06 15:18:23.930 INFO [XdsLoadBalancer] [D2 xDS PropertyEventExecutor-6-1] [toki-war] [] Enabled primary stores

Filestore is updated again

[sgupta8@sgupta8-ld2 toki]$ ls -ahl /tmp/d2/indis/uris/TokiBackend.ini
-rw-r--r-- 1 sgupta8 eng 753 Oct  6 15:18 /tmp/d2/indis/uris/TokiBackend.ini

@bohhyang
Copy link
Contributor

bohhyang commented Oct 6, 2023

you need to run ./scripts/update-changelog to update the correct entry in the change log:)

Copy link
Contributor

@bohhyang bohhyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! main comment about adding a timeout config. Thanks for fixing this!

Copy link
Contributor

@bohhyang bohhyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@shivamgupta1 shivamgupta1 merged commit 5e63519 into master Oct 10, 2023
2 checks passed
@shivamgupta1 shivamgupta1 deleted the bug/xds branch October 10, 2023 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants