Add retries to notification calls #125
Conversation
start := time.Now()
_, err := me.client.Send(ctx, fcmMsg)
if me.metrics != nil {
    me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
Should we add the retry attempt to the metrics? I don't see the counter being added to the metrics.
The counter would go after all the retries. There is no need to mark it as a failure if we are retrying.
We could add a separate metric for that, though. 0/5
Yes, I think we should at least add some metrics/logs so we have some idea that this is indeed happening and why. That would allow us to narrow down what could be causing future double notifications, etc.
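For instance, a counter could be bumped every time a retry happens. A minimal sketch, assuming the proxy's metrics are exposed through the Prometheus Go client; the metric name, labels, and helper below are hypothetical and not part of the existing metrics interface:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// notificationRetries is a hypothetical counter that would sit next to the
// existing notification-response histogram. The label names are assumptions.
var notificationRetries = prometheus.NewCounterVec(prometheus.CounterOpts{
    Namespace: "mattermost_push_proxy",
    Name:      "notification_retries_total",
    Help:      "Number of push notification send attempts that were retried.",
}, []string{"platform", "reason"})

func init() {
    prometheus.MustRegister(notificationRetries)
}

// IncrementNotificationRetry would be called from the retry path, right
// before re-sending, e.g. IncrementNotificationRetry("android", "unavailable").
func IncrementNotificationRetry(platform, reason string) {
    notificationRetries.WithLabelValues(platform, reason).Inc()
}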
if isRetriableError(err) {
    if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
        return me.SendNotificationWithRetry(fcmMsg, nextIteration)
    }
Should we add a metric or log showing that the maximum number of retries was reached?
Yes, we probably should.
server/apple_notification_server.go
}

if err != nil {
    if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
Here we are not checking whether we should retry in the same way we do on Android.
On iOS the library works slightly differently.
We may get a response from the library saying that the notification was not sent; those are the "known errors". Anything that comes back in the err variable is most likely network related, so that is what we are retrying here.
Hmmm, I see. Even when there is a mismatch with the device ID or the device is no longer registered (for example, the app has been uninstalled)?
Yes, those scenarios are covered inside the response, and not in the error.
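For reference, a sketch of how that split looks with the sideshow/apns2 client used here: APNs-level rejections (bad device token, unregistered device, and so on) come back in the response with Sent() == false and a Reason, while transport problems come back in err. The wrapper function below is illustrative only, not code from this PR:

import (
    "context"
    "fmt"

    "github.com/sideshow/apns2"
)

// sendOnce shows the two failure channels: transport errors in err,
// APNs-level rejections in the response. Only the former class is retried.
func sendOnce(ctx context.Context, client *apns2.Client, n *apns2.Notification) error {
    res, err := client.PushWithContext(ctx, n)
    if err != nil {
        // Network/context problems (timeouts, connection resets, ...):
        // this is the class of error the retry targets.
        return fmt.Errorf("transport error sending APNs push: %w", err)
    }
    if !res.Sent() {
        // APNs accepted the request but rejected the notification; the cause is
        // in res.Reason (e.g. "BadDeviceToken", "Unregistered"). These are the
        // "known errors" handled without retrying.
        return fmt.Errorf("APNs rejected notification: %s", res.Reason)
    }
    return nil
}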
server/apple_notification_server.go
res, err := me.AppleClient.PushWithContext(ctx, notification)
if me.metrics != nil {
    me.metrics.observerNotificationResponse(PushNotifyApple, time.Since(start).Seconds())
Same comment as on Android for metrics and logs
if isRetriableError(err) {
    me.logger.Errorf("Failed to send android push did=%v retry=%v error=%v", fcmMsg.Token, retry, err)
    if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
        return me.SendNotificationWithRetry(fcmMsg, nextIteration)
Should we add a wait here before immediately retrying? In the server we use exponential backoff to give it some time, in case it's a network hiccup of some sort, before continuously retrying - see https://github.com/mattermost/mattermost/blob/master/server/channels/utils/backoff.go#L14-L15
Also in the CWS we use a backoff library that follows the same concept: https://github.com/mattermost/customer-web-server/blob/master/app/cloud.go#L204
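A sketch of what that could look like here, borrowing the progressive backoff idea from the linked helpers; the timing constants and the jitter are assumptions, not values taken from either repository:

import (
    "math/rand"
    "time"
)

// backoffDelay returns how long to wait before retry attempt n (0-based):
// roughly 100ms, 200ms, 400ms, ... capped at 2s, plus a little jitter so
// concurrent sends don't retry in lockstep.
func backoffDelay(attempt int) time.Duration {
    d := (100 * time.Millisecond) << uint(attempt)
    if d > 2*time.Second {
        d = 2 * time.Second
    }
    jitter := time.Duration(rand.Int63n(int64(d) / 4))
    return d + jitter
}

// Usage inside the retry path, before re-sending:
//
//     time.Sleep(backoffDelay(retry))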
if me.metrics != nil {
    me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
}
err := me.SendNotificationWithRetry(fcmMsg, 0)
For Android, retries are already built into the HTTP client: https://github.com/firebase/firebase-admin-go/blob/6a28190d0d30fe0134648566c7e89f8635a51672/messaging/messaging.go#L892.
There is no need to do anything. In the future, please feel free to ask me for a review or take a deeper look at the documentation. :)
I missed that. Thank you for pointing it out!
That being said, one of the reasons we are retrying is that FCM may return an "internal server error", which, from looking around the web, seems to be a fluke in their system that we just have to retry.
Looking at the internals, I see that only the serviceUnavailable status is retried, so other problems (like the context timing out, or this internal server error) won't get retried. Am I right?
If that is so, how would you approach this?
I would trust the implementation to be correct about which status codes to retry on. Also, any network errors like timeouts are retried as well.
So the only missing thing is to retry on additional status codes. I'd want to see some metrics on the distribution of errors we are getting from FCM and then take a call.
For context, overriding the retry policy from the user side is difficult: firebase/firebase-admin-go#579.
Right now I don't see FCM errors in the metrics, but @marianunez spotted a few some time ago.
For iOS we have recently found a couple regarding context timeouts and other errors, so that is why we are adding the retry for both.
@marianunez The library doesn't expose the error codes. It has a few util functions to distinguish the errors, and that is what we are doing in the PR.
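For concreteness, a sketch of what such a helper could look like on top of the firebase-admin-go v4 SDK, whose errorutils package exposes error-classification helpers; treat the exact set of checks (and the SDK version) as assumptions to validate against the module actually pinned by the proxy, not as the PR's final code:

import (
    "context"
    "errors"

    "firebase.google.com/go/v4/errorutils"
)

// isRetriableError reports whether a send failure looks transient (FCM-side
// 5xx-style failures or timeouts), as opposed to permanent errors such as an
// invalid or unregistered token.
func isRetriableError(err error) bool {
    if err == nil {
        return false
    }
    return errorutils.IsUnavailable(err) ||
        errorutils.IsInternal(err) ||
        errorutils.IsDeadlineExceeded(err) ||
        errors.Is(err, context.DeadlineExceeded)
}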
Not sure how we can really get that. The library exposes just an error and doesn't expose the HTTP status code. We could use some casting to get it out, but I am not sure if that is what we want.
@larkox - casting is exactly what we want :)
@marianunez - thanks for the link. It's clear that we should retry on 5xx. However, from the logs, it's not clear whether the ones with an internal error were logged after a retry on 503, or whether they came from some other non-503 error that wasn't retried. That's what I am trying to find out here.
In any case, the library is poorly written and makes it such that the user cannot choose to override the retry policy, in spite of the fact that the client clearly takes a retry policy struct as a parameter. There isn't even a way to bypass retrying so that we can copy over just the retry code.
So in the worst case, we would have to fork the repo and move some packages out of internal, so that they can be imported and overridden from the client.
But before that, I wanted some metrics on the distribution of the status codes. If we see that out of all 34 internal errors, all were actually 503, then I think we could leave this alone. But if we see other non-503 status codes, then we have a decision to make.
@agnivade Just to understand your ask.
You want a first change to surface the status code of the response, let it simmer for a while so we get new data, verify what kind of 5xx error it is, and then decide whether this PR makes sense or not. Am I right?
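If the proxy is on firebase-admin-go v4, one low-risk way to surface that data without touching the retry behaviour might be the SDK's errorutils.HTTPResponse helper; this is an assumption about the SDK surface (verify the helper exists in the pinned version), and the label scheme below is hypothetical:

import (
    "strconv"

    "firebase.google.com/go/v4/errorutils"
)

// fcmStatusLabel turns an FCM send error into a status-code label for a
// counter, so the distribution of 503 vs other 5xx responses becomes visible.
func fcmStatusLabel(err error) string {
    if resp := errorutils.HTTPResponse(err); resp != nil {
        return strconv.Itoa(resp.StatusCode)
    }
    return "unknown" // transport errors, context timeouts, etc.
}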
If we see that out of all 34 internal errors, all were actually 503, then I think we could leave this alone
@agnivade I may have missed this in the thread above, but why don't we want to retry for a 503?
@larkox - Yep, exactly.
@marianunez - To clarify: we, ourselves, don't want to retry for 503, because the client already does by default. My objective here is to figure out if the internal failures are all 503 or not. If it is, then we know that those were already retried and still failed. If it's something other than 503, then yes we definitely need to do something ourselves.
We need to look at the error codes to make sure we are doing the retries when recommended by FCM, and also see the guidance on when the retries can happen.
@agnivade I did some checking on the error, and did some refactoring. We are still retrying other errors, but we are filtering out (and logging) when the error is 503. Let me know if this would be OK for you, or if you would prefer to remove the retry on the first iteration completely.
This does not seem to be what we discussed here: #125 (comment). A couple of things:
Re: Not what we discussed
Re: Reflection
I see. You updated the PR without any extra context, so I didn't know about this extra conversation with Maria. Anyways, the problem is with implementing our own retry logic. The current logic that you have written is not correct, because it's not respecting the ...
But before we do anything ourselves, why not gather some data first and see what response codes we are getting? I wanted to see if all error codes are always 503 or not, in which case we don't need to retry anything ourselves.
Ideally, I would prefer not to have any duplication at all. So worst case, we would need to fork the repo and maintain it ourselves. But before that, I wanted to check whether we need to retry at all.
Ah yes ofc, thanks for that. Just goes to show how unconfigurable that module is.
@agnivade OK. Then let's do that. Let's drop the retry for now, and just collect data about the error code.
To avoid noise, I will close this PR in favor of #126, since even if we decide to continue this, we will probably take a different approach (forking the library).
@marianunez I prefer to handle all retries at once, when we decide on what to do with Android.
Summary
We have seen several errors that seem to be network hiccups. To avoid missing notifications, we add retry logic whenever an unexpected error appears.
This may result in duplicated notifications, but we prefer to err on the safe side instead of having missed notifications.
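As a rough sketch of the overall shape, pieced together from the diff snippets quoted above; the timeout, the MAX_RETRIES value, the backoff call, and the final "max retries" log are assumptions layered on top for illustration, not the PR's exact code:

import (
    "context"
    "time"

    "firebase.google.com/go/v4/messaging" // v4 import path assumed
)

const MAX_RETRIES = 3 // illustrative value

// SendNotificationWithRetry re-sends on transient errors only, bounded by
// MAX_RETRIES, accepting the small chance of a duplicate notification in
// exchange for fewer missed ones.
func (me *AndroidNotificationServer) SendNotificationWithRetry(fcmMsg *messaging.Message, retry int) error {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) // timeout assumed
    defer cancel()

    _, err := me.client.Send(ctx, fcmMsg)
    if err == nil {
        return nil
    }
    if isRetriableError(err) {
        me.logger.Errorf("Failed to send android push did=%v retry=%v error=%v", fcmMsg.Token, retry, err)
        if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
            time.Sleep(backoffDelay(retry)) // assumed helper, see the backoff sketch above
            return me.SendNotificationWithRetry(fcmMsg, nextIteration)
        }
        // Surface retry exhaustion explicitly so it can be logged/alerted on.
        me.logger.Errorf("Max retries reached for android push did=%v error=%v", fcmMsg.Token, err)
    }
    return err
}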
Ticket Link
Fix https://mattermost.atlassian.net/browse/MM-56827