-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to avoid timing windows in WebSocket eventstream resulting in blocked streams in edge case reconnect scenario #152
Conversation
…gging Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Been thinking a bit more about this, and I realize there is still a window. Going to work a bit more on a fix. |
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Codecov Report
@@ Coverage Diff @@
## main #152 +/- ##
==========================================
+ Coverage 97.43% 97.46% +0.03%
==========================================
Files 56 56
Lines 6590 6595 +5
==========================================
+ Hits 6421 6428 +7
+ Misses 128 127 -1
+ Partials 41 40 -1
Continue to review full report at Codecov.
|
Ok - a further update to the fix, which makes me more confident about it. I have removed the |
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
close(t.closingChannel) | ||
t.closingChannel = make(chan struct{}) | ||
select { | ||
case t.receiverChannel <- errors.Errorf(errors.WebSocketClosed, connInfo): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the meat of the change.
We've seen a hard to diagnose pattern of logs, that seem to result in an event stream getting stuck:
This PR provides a very speculative change that might fix this.
The
cycleTopic
code assumes it's safe to close thet.closingChannel
and then create a new one, at the same time another thread is spinning around a retry loop.I'm speculating that isn't actually safe, and it's possible the loop comes back round and ends up getting the wrong
t.closingChannel
when it comes back round here in some timing condition:firefly-ethconnect/internal/events/websockets.go
Line 47 in 8545f1c
The result is the next disconnect doesn't actually wake the loop, because it's listening on the wrong
t.closingChannel
.So this PR does two things:
Stops closing the channel, instead does a non-blocking push of abool
down the channel