Add retries to notification calls #125

Closed · wants to merge 3 commits

Conversation

@larkox (Contributor) commented Aug 20, 2024

Summary

We have seen several errors that appear to be network hiccups. To avoid missing notifications, we add retry logic whenever an unexpected error appears.

This may result in duplicated notifications, but we prefer to err on the safe side rather than miss notifications entirely.
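
For illustration, here is a minimal standalone sketch of that retry shape (iterative rather than the recursive form in the diff below; every name in it is a placeholder, not this PR's code):

```go
package main

import (
	"errors"
	"fmt"
)

// maxRetries is a placeholder cap; the PR itself uses a MAX_RETRIES constant.
const maxRetries = 3

// sendWithRetry retries send up to maxRetries times, but only for errors the
// caller classifies as transient ("network hiccups").
func sendWithRetry(send func() error, isTransient func(error) bool) error {
	var err error
	for attempt := 0; attempt < maxRetries; attempt++ {
		err = send()
		if err == nil || !isTransient(err) {
			return err
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxRetries, err)
}

func main() {
	calls := 0
	err := sendWithRetry(
		func() error { calls++; return errors.New("transient network error") },
		func(error) bool { return true },
	)
	fmt.Println(calls, err) // 3 giving up after 3 attempts: transient network error
}
```

Because a send can succeed on the provider side right before a timeout is reported back to us, any retry of this kind is exactly what can produce the duplicates mentioned above.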

Ticket Link

Fix https://mattermost.atlassian.net/browse/MM-56827

@larkox added the 1: Dev Review (Requires review by a core committer) label on Aug 20, 2024
```go
start := time.Now()
_, err := me.client.Send(ctx, fcmMsg)
if me.metrics != nil {
	me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
```
Contributor

Should we add the retry attempt to the metrics? I don't see the counter being added to the metrics.

Contributor Author

The counter would go after all the retries. No need to mark it as a failure if we are retrying.

We could add some different metric for that though. 0/5

Contributor

Yes, I think we should at least add a metric or log so we have some visibility that this is indeed happening and why; that would allow us to narrow down what could be causing future double notifications, etc.

```go
if isRetriableError(err) {
	if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
		return me.SendNotificationWithRetry(fcmMsg, nextIteration)
	}
```
Contributor

Should we add a metric or log showing that the max retries was reached?

Contributor Author

Yes, we probably should.

```go
}

if err != nil {
	if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
```
Contributor

Here we are not checking if we should retry the same way we do on Android

Contributor Author

On iOS the library works slightly differently.

We may get a response from the library saying that the notification was not sent; those are the "known errors". Anything that comes back in the err variable is most likely network related, so that is what we are retrying here.
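
To make the distinction concrete, a sketch assuming AppleClient is a sideshow/apns2-style client (the PushWithContext call in the diff suggests it); the checkPush helper and the error wording are made up for illustration:

```go
package push

import (
	"context"
	"fmt"

	"github.com/sideshow/apns2"
)

// checkPush illustrates the split described above: transport failures surface
// in err (the case retried in this PR), while APNs rejections surface in the
// response and are handled as "known errors".
func checkPush(ctx context.Context, client *apns2.Client, n *apns2.Notification) error {
	res, err := client.PushWithContext(ctx, n)
	if err != nil {
		// Network-level failure (timeout, connection reset, ...): nothing
		// definitive came back from APNs, so this is the retriable path.
		return fmt.Errorf("transport error (retriable): %w", err)
	}
	if !res.Sent() {
		// APNs responded but rejected the push, e.g. an unregistered device
		// token after the app was uninstalled. Handled from the response,
		// not retried blindly.
		return fmt.Errorf("rejected by APNs: status=%d reason=%s", res.StatusCode, res.Reason)
	}
	return nil
}
```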

Contributor

Hmmm, I see. Even when there is a mismatch with the device ID, or the device is no longer registered (for example, because the app has been uninstalled)?

Contributor Author

Yes, those scenarios are covered inside the response, and not in the error.


```go
res, err := me.AppleClient.PushWithContext(ctx, notification)
if me.metrics != nil {
	me.metrics.observerNotificationResponse(PushNotifyApple, time.Since(start).Seconds())
```
Contributor

Same comment as on Android for metrics and logs

@larkox requested a review from enahum on August 20, 2024 at 12:58
```go
if isRetriableError(err) {
	me.logger.Errorf("Failed to send android push did=%v retry=%v error=%v", fcmMsg.Token, retry, err)
	if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
		return me.SendNotificationWithRetry(fcmMsg, nextIteration)
```
Member

Should we add a wait here before immediately retrying? In the server we do exponential backoff to give it some time, in case it's a network hiccup of some sort, before continuously retrying - see https://github.com/mattermost/mattermost/blob/master/server/channels/utils/backoff.go#L14-L15

Also, in the CWS we use a backoff library built around the same concept: https://github.com/mattermost/customer-web-server/blob/master/app/cloud.go#L204
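
A sketch of what that wait could look like: capped exponential backoff with jitter, called just before the recursive retry. The durations are illustrative, not values taken from the linked helpers or this PR:

```go
package push

import (
	"math/rand"
	"time"
)

// backoffWait sleeps before retry attempt n (0-based): 100ms, 200ms, 400ms, ...
// capped at 5s, with up to 50% random jitter so concurrent senders don't
// retry in lockstep.
func backoffWait(attempt int) {
	const (
		base = 100 * time.Millisecond
		max  = 5 * time.Second
	)
	d := base << uint(attempt)
	if d > max || d <= 0 { // d <= 0 guards against shift overflow
		d = max
	}
	d += time.Duration(rand.Int63n(int64(d) / 2))
	time.Sleep(d)
}
```

In the PR's recursive form, this would amount to a backoffWait(retry) call right before return me.SendNotificationWithRetry(fcmMsg, nextIteration).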

```go
if me.metrics != nil {
	me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
}
err := me.SendNotificationWithRetry(fcmMsg, 0)
```
Member

For Android, retries are already built into the HTTP client: https://github.com/firebase/firebase-admin-go/blob/6a28190d0d30fe0134648566c7e89f8635a51672/messaging/messaging.go#L892.

There is no need to do anything. In the future, please feel free to ask me for review, or take a deeper look at the documentation. :)

@larkox (Contributor Author) commented Aug 22, 2024

I missed that. Thank you for pointing it out!

That being said, the reason we are retrying is, among others, that FCM may return an "internal server error", which, from what I can find around the web, seems to be a fluke on their side that we just have to retry.

Looking at the internals, I see that only the serviceUnavailable status is retried, so other problems (like the context timing out, or this internal server error) won't get retried. Am I right?

If that is so, how would you approach this?

Member

I would trust the implementation to be correct about which status codes to retry on. Network errors like timeouts are also retried.

So the only missing thing is to retry on additional status codes. I'd want to see some metrics on the distribution of errors we are getting from FCM and then take a call.

For context, overriding the retry policy from the user side is difficult: firebase/firebase-admin-go#579.

Contributor Author

Right now I don't see FCM errors on the metrics, but @marianunez spotted a few some time ago.

For iOS, we have recently found a couple related to context timeouts and other issues, which is why we are adding the retry for both.

@marianunez (Member) commented Aug 22, 2024

For example, in the last 7 days there have been 54 instances of FCM push failures:

[Screenshot: FCM push failure counts over the last 7 days]

Contributor Author

@marianunez The library doesn't expose the error codes. It has a few util functions to distinguish the errors, and that is what we are doing in the PR.
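
For reference, one possible shape for isRetriableError built on such helpers. Which classification functions actually exist depends on the firebase-admin-go version in go.mod, so treat every name below as an assumption to verify rather than this PR's code:

```go
package push

import (
	"firebase.google.com/go/v4/errorutils"
	"firebase.google.com/go/v4/messaging"
)

// isRetriableError treats FCM availability/internal errors as transient and
// anything pointing at a dead token or bad request as permanent. The helper
// names are assumptions; check the firebase-admin-go version in use.
func isRetriableError(err error) bool {
	if err == nil {
		return false
	}
	// Permanent: dead token or malformed request; retrying won't help.
	if messaging.IsUnregistered(err) || errorutils.IsInvalidArgument(err) {
		return false
	}
	// Transient: FCM-side hiccups worth another attempt.
	return errorutils.IsInternal(err) || errorutils.IsUnavailable(err)
}
```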

Member

> Not sure how we can really get that. The library exposes just an error, and doesn't expose the HTTP status code. We could use some casting to get it out, but not sure if that is what we want.

@larkox - casting is exactly what we want :)

@marianunez - thanks for the link. It's clear that we should retry on 5xx. However, from the logs it's not clear whether the entries with "internal error" were logged after a retried 503, or whether they came from some other non-503 error that wasn't retried. That's what I am trying to find out here.

In any case, the library is poorly written: it prevents the user from overriding the retry policy despite the fact that the client clearly takes a retry policy struct as a parameter. There isn't even a way to bypass retrying so that we could copy over just the retry code.

So in the worst case, we would have to fork the repo and move some packages out of internal so that they can be imported and overridden from the client.

But before that, I wanted some metrics on the distribution of the status codes. If we see that all 34 internal errors were actually 503s, then I think we can leave this alone. But if we see other non-503 status codes, then we have a decision on our hands.

Contributor Author

@agnivade Just to understand your ask.

You want a first change to surface the status code of the response, let it simmer for a while so we get new data, verify what kind of 5xx error it is, and then decide whether this PR makes sense or not. Am I right?

Member

> If we see that all 34 internal errors were actually 503s, then I think we can leave this alone

@agnivade I may have missed this in the thread above, but why don't we want to retry for a 503?

Member

@larkox - Yep, exactly.

@marianunez - To clarify: we, ourselves, don't want to retry for 503, because the client already does by default. My objective here is to figure out if the internal failures are all 503 or not. If it is, then we know that those were already retried and still failed. If it's something other than 503, then yes we definitely need to do something ourselves.

@marianunez (Member) left a comment

We need to look at the error codes to make sure we are doing the retries when FCM recommends them, and also check FCM's guidance on when retries can happen.

@larkox (Contributor Author) commented Sep 3, 2024

@agnivade I did some checking on the errors and some refactoring. We are still retrying other errors, but we are now filtering out (and logging) the case where the error is a 503. Let me know if this would be OK for you, or if you would prefer to remove the retry completely in this first iteration.

@agnivade (Member) commented Sep 3, 2024

This does not seem to be what we discussed here: #125 (comment).

A couple of things:

  1. Do we really need reflection here? I believe we should just be able to type-assert the error with err.(*Type) and get the target type.
  2. We should not implement the retry logic (on Android) ourselves, because you also have to respect the X-Retry-After response header and fold the exponential backoff around it. All of that is already implemented in the FCM client. We could potentially copy over the retry logic, but having two separate pieces of retry logic feels incorrect to me, especially since the library can internally start retrying on new status codes at any time - so a library upgrade could break our logic.

@larkox (Contributor Author) commented Sep 3, 2024

Re: not what we discussed
You are right. After talking with @marianunez, we thought that if the concern was retrying what the library already retries, we could skip retrying those cases and still retry everything else. If you are not ok with that approach, I can revert it.

Re: reflection
The type (FirebaseError, if I am not mistaken) is internal, so I cannot type-assert it directly. Hence the reflection.

@agnivade (Member) commented Sep 3, 2024

> You are right. After talking with @marianunez, we thought that if the concern was retrying what the library already retries, we could skip retrying those cases and still retry everything else. If you are not ok with that approach, I can revert it.

I see. You updated the PR without any extra context, so I didn't know about this extra conversation with Maria. Anyway, the problem is with implementing our own retry logic: the current logic you have written is not correct because it does not respect the X-Retry-After header, so that's the first thing to fix.

But before we do anything ourselves, why not gather some data first and see what response codes we are getting? I wanted to see if all error codes are always 503 or not, in which case, we don't need to retry anything ourselves.

Ideally, I would prefer not to have any duplication at all. So worst case, we would need to fork the repo and maintain it ourselves. But before that, I wanted to check if we need to retry at all or not.
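
To sketch the kind of instrumentation this implies, failures could be counted by coarse category so the distribution shows up on a dashboard before deciding on any retry changes. The metric name, labels, and errorutils helpers below are placeholders for illustration, not this repo's code:

```go
package push

import (
	"firebase.google.com/go/v4/errorutils"
	"github.com/prometheus/client_golang/prometheus"
)

// fcmSendErrors counts FCM send failures by coarse category.
var fcmSendErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "push_proxy_fcm_send_errors_total",
		Help: "FCM send failures grouped by error category.",
	},
	[]string{"category"},
)

func init() {
	prometheus.MustRegister(fcmSendErrors)
}

// recordFCMError buckets an FCM send error so the status-code distribution
// can be inspected before committing to a retry strategy.
func recordFCMError(err error) {
	switch {
	case err == nil:
		return
	case errorutils.IsUnavailable(err): // HTTP 503
		fcmSendErrors.WithLabelValues("unavailable").Inc()
	case errorutils.IsInternal(err): // HTTP 500
		fcmSendErrors.WithLabelValues("internal").Inc()
	default:
		fcmSendErrors.WithLabelValues("other").Inc()
	}
}
```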

> The type (FirebaseError, if I am not mistaken) is internal, so I cannot type-assert it directly.

Ah yes ofc, thanks for that. Just goes to show how unconfigurable that module is.

@larkox (Contributor Author) commented Sep 3, 2024

@agnivade OK. Then let's do that. Let's drop the retry for now, and just collect data about the error code.

@larkox (Contributor Author) commented Sep 3, 2024

To avoid noise, I will close this PR in favor of #126, since even if we decide to continue with this, we will probably take a different approach (forking the library).

@larkox closed this on Sep 3, 2024
@marianunez (Member)

> To avoid noise, I will close this PR in favor of #126, since even if we decide to continue with this, we will probably take a different approach (forking the library).

@larkox What about the iOS retry? Don't we want to continue with that?

@larkox (Contributor Author) commented Sep 4, 2024

@marianunez I prefer to handle all the retries at once, when we decide what to do with Android.

Labels: 1: Dev Review (Requires review by a core committer)

4 participants