break out of infinite retry for submitted ops #1595
Conversation
Force-pushed from 8d79de6 to f968a81
Thanks @shorsher - I see the problem now. I was thinking we could approach this differently, by using the backend ID that is included in the transaction to EVMConnect: when the operation retries through the EVMConnect connector, the connector should be able to recognize that the transaction already succeeded, instead of marking it as Failed or returning an error.
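For reference, the idea would be something like the sketch below. All names here are hypothetical, for illustration only - this is not EVMConnect's actual API. The submission carries a stable backend ID so the connector can match a retry to a transaction it already accepted, instead of failing it.

package sketch

import (
	"context"
	"errors"
)

// connector is a hypothetical abstraction of a blockchain connector.
type connector interface {
	SendTransaction(ctx context.Context, backendID string, payload []byte) error
}

// errAlreadyKnown stands in for however a real connector reports that it
// has already seen this transaction ID.
var errAlreadyKnown = errors.New("transaction ID already known")

// submitWithBackendID reuses the same backend ID on every retry, so the
// connector can deduplicate instead of rejecting the resubmission.
func submitWithBackendID(ctx context.Context, conn connector, backendID string, payload []byte) error {
	err := conn.SendTransaction(ctx, backendID, payload)
	if errors.Is(err, errAlreadyKnown) {
		// The connector matched backendID to a transaction it already
		// processed, so treat the retry as success rather than failure.
		return nil
	}
	return err
}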
@@ -147,6 +148,7 @@ func (e *Ethereum) Init(ctx context.Context, cancelCtx context.CancelFunc, conf
 	e.capabilities = &blockchain.Capabilities{}
 	e.callbacks = common.NewBlockchainCallbacks()
 	e.subs = common.NewFireflySubscriptions()
+	e.count = 0
This count is global, so this would only work the first time a TX is sent in the whole of FF.
I think the bug is that FireFly is not correctly handling the conflict error, @EnriqueL8. The error gets properly classified in the operations manager as a 409 (which it is), but the step that unblocks the processor is missing. EVMConnect receives and successfully handles the transaction, so I don't see how using a different ID would make any difference - we would still be resubmitting the transaction.
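The classification shape this relies on looks roughly like the sketch below (assumed definitions, based on the type assertion used later in this PR; the real ones live in the operations manager):

type ConflictError interface {
	error
	IsConflictError() bool
}

// shouldRetry distinguishes "the connector already has this submission"
// from genuinely retryable failures.
func shouldRetry(err error) bool {
	if conflictErr, ok := err.(ConflictError); ok && conflictErr.IsConflictError() {
		return false // the submission was received; retrying cannot succeed
	}
	return err != nil
}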
Force-pushed from 22629de to 68ad190
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##              main    #1595   +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files          337      337
  Lines        29492    29498       +6
=========================================
+ Hits         29492    29498       +6

☔ View full report in Codecov by Sentry.
I think this is the correct fix.
The only alternative I was considering is whether Operations Manager should swallow the error. We have very specific conflict handling added there in #1406 for HTTP 409, but even though it does some processing of the error, that code elects to still return the error upward. You could make the case that Operations Manager has handled the error and could stop it from bubbling up any further - but that would potentially have wider implications.
So all of that to say, I think re-parsing the conflict error up at this level feels like the safest solution to the bug.
Great stuff, and I agree with @awrichar on the solution. The other approach we had discussed, around checking whether the op was successful, was more complicated.
One comment below on the successful flow that might have been missed, plus some missing codecov coverage.
internal/batch/batch_processor.go (Outdated)
conflictErr, conflictTestOk := err.(operations.ConflictError)
if conflictTestOk && conflictErr.IsConflictError() {
	// We know that the connector has received our batch, so we shouldn't need to retry
	return true, nil
}
Agreed this is the right approach, but we are missing the else section where we update the message state. In the issue, the first EVMConnect call gets an EOF and the second gets a conflict, so the state is never updated for the case where it's a pinned batch TX.
Updated to add the message state, thanks for catching.
If an operation receives a conflict error from the blockchain connector, it will currently retry submitting that operation continually, but it will never succeed. Instead, we shouldn't retry if we know the connector has received the submission. Signed-off-by: Alex Shorsher <alex.shorsher@kaleido.io>
Force-pushed from e4a24f3 to ef997cd
Signed-off-by: Alex Shorsher <alex.shorsher@kaleido.io>
Force-pushed from ef997cd to a116123
conflictErr, conflictTestOk := err.(operations.ConflictError)
if conflictTestOk && conflictErr.IsConflictError() {
	// We know that the connector has received our batch, so we shouldn't need to retry
	payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateSent)
I think we need the whole block, and we can extract this to a function:
if core.IsPinned(payload.Batch.TX.Type) {
payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateSent)
} else {
payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateConfirmed)
}
For pinned batches we have to wait for the blockchain TX - that is why we set them as sent - and you might have just assumed this was always the case in the conflict path. But this batch processor is completely agnostic to how the batch is handled; it could be a conflict from data exchange or any other plugin needed to dispatch that batch. In the case where it's not pinned, we don't need to wait for a TX, so we can set it as confirmed.
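A minimal sketch of that extraction (the function name is hypothetical; payload, core, and addMessageUpdate are the existing batch processor identifiers used in the diff above):

func (bp *batchProcessor) markMessagesDispatched(payload *DispatchPayload) {
	if core.IsPinned(payload.Batch.TX.Type) {
		// Pinned batches still wait for the blockchain TX, so messages
		// only move to "sent" until that TX confirms.
		payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateSent)
	} else {
		// Unpinned batches have no TX to wait for, so messages can be
		// confirmed as soon as the dispatch plugin has the batch.
		payload.addMessageUpdate(payload.Messages, core.MessageStateReady, core.MessageStateConfirmed)
	}
}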
Furthermore, there is a snippet of code above that handles a cancelled batch pin. It's an edge case, but if the last error before the cancel is a conflict, we will incorrectly update the state of the initial batch from cancelled to sent.
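One possible shape for that guard (a sketch only; "cancelled" stands in for however the existing cancellation handling exposes that state, and markMessagesDispatched is the hypothetical helper from the sketch above):

if conflictTestOk && conflictErr.IsConflictError() {
	if cancelled {
		// The batch was cancelled while dispatch was failing; leave the
		// messages in their cancelled state rather than moving them to sent.
		return true, nil
	}
	bp.markMessagesDispatched(payload)
	return true, nil
}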
- don't infinitely retry for submitted operations
Fix for #1594
I've noticed that if there's an error (timeout/disconnect/connection loss) during a BatchPin submission to the blockchain connector (EVMConnect in my case), it's possible for EVMConnect to successfully process the transaction while the FireFly operation is marked Failed. Then the batch processor resubmits the batch, EVMConnect returns a 409, and the processor retries indefinitely, preventing new batches from being dispatched.
Example error message:

FF10458: Conflict from blockchain connector: FF21065: ID 'default:f5296ba7-1b23-4c7f-8612-620dafc0e40a' is not unique d=pinned_broadcast ns=default opcache=1UGYM3mn p=did:firefly:org/org_5452a6| pid=61421 role=batchmgr

Currently, the only way to fix this is restarting FireFly.
Instead, I propose we break the retry loop by having the batch processor not retry on conflict errors.
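A minimal sketch of the proposed behavior, with hypothetical names (dispatch and waitForRetryBackoff stand in for the processor's existing dispatch and backoff logic):

for {
	err := dispatch(payload)
	if err == nil {
		break // dispatched successfully
	}
	if conflictErr, ok := err.(operations.ConflictError); ok && conflictErr.IsConflictError() {
		// The connector already has this batch; resubmitting can only
		// produce another 409, so stop retrying and move on.
		break
	}
	waitForRetryBackoff() // back off before the next attempt
}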