Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cancel xds stream ready timeout when the stream is closed. Correct xds connection status metric. #1007

Merged
merged 4 commits into from
Jun 25, 2024

Conversation

bohhyang
Copy link
Contributor

@bohhyang bohhyang commented Jun 24, 2024

Summary

See context at slack discussion.

This change:

  1. cleans up xds stream ready timeout future when that stream is closed (instead of being ready).
  2. align xds connection status metric with what the connection listeners receive (in the previous edge case, the connected is indeed good, but the listeners are mis-notified that it's not connected).
  3. cleans up connection retry future when that reconnection stream is ready.
  4. After all, if similar issues happen in future where xds stream is actually ready and taking in updates but the event bus is incorrectly enabled with backup store, updated XdsToD2PropertiesAdaptor will notify that xds is reconnected when it receives an update.

Tests

N/A

d2/src/main/java/com/linkedin/d2/xds/XdsClientImpl.java Outdated Show resolved Hide resolved
Comment on lines 271 to 276
if (_retryRpcStreamFuture != null)
{
_log.info("ADS stream reconnected after {} milliseconds",
_retryRpcStreamFuture.getDelay(TimeUnit.MILLISECONDS));
_retryRpcStreamFuture = null;
_xdsClientJmx.incrementReconnectionCount();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these conditions seem to be precluded by the isInBackoff check above at line 255? Also, getDelay seems to return the still remaining delay and not the elapsed time.

Copy link
Contributor Author

@bohhyang bohhyang Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

precluded by the isInBackoff check above at line 255

not really. This is not for checking backoff but just to clean _retryRpcStreamFuture for the case when the retry future is executed (and the reconnect stream is ready), but the future itself is not cleaned. This check will differentiate the initial connect from reconnections (previously, reconnectionCount metric was incremented for initial connection as well, which was not a reconnect).

getDelay seems to return the still remaining delay

whoops, that's right. Removed, thanks.

Comment on lines +261 to +270
// Also confirm ready timeout future is not null to avoid notifying multiple times.
if (!_adsStream.isReady() || _readyTimeoutFuture == null)
{
// if the ready timeout future is non-null, a reconnect notification hasn't been sent yet.
if (_readyTimeoutFuture != null)
{
// timeout task will be cancelled only if it hasn't already executed.
boolean cancelledTimeout = _readyTimeoutFuture.cancel(false);
_log.info("ADS stream ready, cancelled timeout task: {}", cancelledTimeout);
_readyTimeoutFuture = null; // set it to null to avoid repeat notifications to subscribers.
_xdsClientJmx.incrementReconnectionCount();
notifyStreamReconnect();
}
_xdsClientJmx.setIsConnected(true);
return;
}

// timeout task will be cancelled only if it hasn't already executed.
boolean cancelledTimeout = _readyTimeoutFuture.cancel(false);
_log.info("ADS stream ready, cancelled timeout task: {}", cancelledTimeout);
_readyTimeoutFuture = null; // set it to null to avoid repeat notifications to subscribers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although not too bad, this might be some unneeded refactoring? (don't fix what ain't broken)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it's just return-early, as suggested by @brycezhongqing above.

@bohhyang bohhyang merged commit bba13f9 into master Jun 25, 2024
2 checks passed
@bohhyang bohhyang deleted the by/disconnectMetric branch June 25, 2024 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants