Add retries to notification calls #125
Conversation
start := time.Now()
_, err := me.client.Send(ctx, fcmMsg)
if me.metrics != nil {
    me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
Should we add the retry attempt to the metrics? I don't see the counter being added to the metrics.
The counter would go after all the retries. There is no need to mark it as a failure if we are retrying.
We could add a separate metric for that, though. 0/5
Yes, I think we should at least add some metrics/logs so we have some idea that this is indeed happening and why. That would allow us to narrow down what could be causing future double notifications, etc.
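For instance, a counter could be bumped every time a retry happens. A minimal sketch, assuming the proxy's metrics are exposed through the Prometheus Go client; the metric name, labels, and helper below are hypothetical and not part of the existing metrics interface:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// notificationRetries is a hypothetical counter that would sit next to the
// existing notification-response histogram. The label names are assumptions.
var notificationRetries = prometheus.NewCounterVec(prometheus.CounterOpts{
    Namespace: "mattermost_push_proxy",
    Name:      "notification_retries_total",
    Help:      "Number of push notification send attempts that were retried.",
}, []string{"platform", "reason"})

func init() {
    prometheus.MustRegister(notificationRetries)
}

// IncrementNotificationRetry would be called from the retry path, right
// before re-sending, e.g. IncrementNotificationRetry("android", "unavailable").
func IncrementNotificationRetry(platform, reason string) {
    notificationRetries.WithLabelValues(platform, reason).Inc()
}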
if isRetriableError(err) {
    if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
        return me.SendNotificationWithRetry(fcmMsg, nextIteration)
    }
Should we add a metric or log showing that the maximum number of retries was reached?
Yes, we probably should.
server/apple_notification_server.go
}

if err != nil {
    if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
Here we are not checking whether we should retry in the same way we do on Android.
On iOS the library works slightly differently.
We may get a response from the library saying that the notification was not sent; those are the "known errors". Anything that comes back in the err variable is most likely network related, so that is what we are retrying here.
Hmmm, I see. Even when there is a mismatch with the device ID or the device is no longer registered (for example, the app has been uninstalled)?
Yes, those scenarios are covered inside the response, and not in the error.
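For reference, a sketch of how that split looks with the sideshow/apns2 client used here: APNs-level rejections (bad device token, unregistered device, and so on) come back in the response with Sent() == false and a Reason, while transport problems come back in err. The wrapper function below is illustrative only, not code from this PR:

import (
    "context"
    "fmt"

    "github.com/sideshow/apns2"
)

// sendOnce shows the two failure channels: transport errors in err,
// APNs-level rejections in the response. Only the former class is retried.
func sendOnce(ctx context.Context, client *apns2.Client, n *apns2.Notification) error {
    res, err := client.PushWithContext(ctx, n)
    if err != nil {
        // Network/context problems (timeouts, connection resets, ...):
        // this is the class of error the retry targets.
        return fmt.Errorf("transport error sending APNs push: %w", err)
    }
    if !res.Sent() {
        // APNs accepted the request but rejected the notification; the cause is
        // in res.Reason (e.g. "BadDeviceToken", "Unregistered"). These are the
        // "known errors" handled without retrying.
        return fmt.Errorf("APNs rejected notification: %s", res.Reason)
    }
    return nil
}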
server/apple_notification_server.go
res, err := me.AppleClient.PushWithContext(ctx, notification)
if me.metrics != nil {
    me.metrics.observerNotificationResponse(PushNotifyApple, time.Since(start).Seconds())
Same comment as on Android for metrics and logs
if isRetriableError(err) {
    me.logger.Errorf("Failed to send android push did=%v retry=%v error=%v", fcmMsg.Token, retry, err)
    if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
        return me.SendNotificationWithRetry(fcmMsg, nextIteration)
Should we add a wait here before immediately retrying? In the server we use exponential backoff to give it some time, in case it's a network hiccup of some sort, before continuously retrying - see https://github.com/mattermost/mattermost/blob/master/server/channels/utils/backoff.go#L14-L15
Also in the CWS we use a backoff library that follows the same concept: https://github.com/mattermost/customer-web-server/blob/master/app/cloud.go#L204
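A sketch of what that could look like here, borrowing the progressive backoff idea from the linked helpers; the timing constants and the jitter are assumptions, not values taken from either repository:

import (
    "math/rand"
    "time"
)

// backoffDelay returns how long to wait before retry attempt n (0-based):
// roughly 100ms, 200ms, 400ms, ... capped at 2s, plus a little jitter so
// concurrent sends don't retry in lockstep.
func backoffDelay(attempt int) time.Duration {
    d := (100 * time.Millisecond) << uint(attempt)
    if d > 2*time.Second {
        d = 2 * time.Second
    }
    jitter := time.Duration(rand.Int63n(int64(d) / 4))
    return d + jitter
}

// Usage inside the retry path, before re-sending:
//
//     time.Sleep(backoffDelay(retry))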
if me.metrics != nil {
    me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
}
err := me.SendNotificationWithRetry(fcmMsg, 0)
For Android, retries are already built into the HTTP client: https://github.com/firebase/firebase-admin-go/blob/6a28190d0d30fe0134648566c7e89f8635a51672/messaging/messaging.go#L892.
There is no need to do anything. In the future, please feel free to ask me for a review or take a deeper look at the documentation. :)
I missed that. Thank you for pointing it out!
That being said, one of the reasons we are retrying is that FCM may return an "internal server error", which, from looking around the web, seems to be a fluke in their system that we just have to retry.
Looking at the internals, I see that only the serviceUnavailable status is retried, so other problems (like the context timing out, or this internal server error) won't get retried. Am I right?
If that is so, how would you approach this?
I would trust the implementation to be correct about which status codes to retry on. Also, any network errors like timeouts are retried as well.
So the only missing thing is to retry on additional status codes. I'd want to see some metrics on the distribution of errors we are getting from FCM and then take a call.
For context, overriding the retry policy from the user side is difficult: firebase/firebase-admin-go#579.
Right now I don't see FCM errors in the metrics, but @marianunez spotted a few some time ago.
For iOS we have recently found a couple regarding context timeouts and other errors, so that is why we are adding the retry for both.
@marianunez The library doesn't expose the error codes. It has a few util functions to distinguish the errors, and that is what we are doing in the PR.
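For concreteness, a sketch of what such a helper could look like on top of the firebase-admin-go v4 SDK, whose errorutils package exposes error-classification helpers; treat the exact set of checks (and the SDK version) as assumptions to validate against the module actually pinned by the proxy, not as the PR's final code:

import (
    "context"
    "errors"

    "firebase.google.com/go/v4/errorutils"
)

// isRetriableError reports whether a send failure looks transient (FCM-side
// 5xx-style failures or timeouts), as opposed to permanent errors such as an
// invalid or unregistered token.
func isRetriableError(err error) bool {
    if err == nil {
        return false
    }
    return errorutils.IsUnavailable(err) ||
        errorutils.IsInternal(err) ||
        errorutils.IsDeadlineExceeded(err) ||
        errors.Is(err, context.DeadlineExceeded)
}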
Not sure how we can really get that. The library exposes just an error and doesn't expose the HTTP status code. We could use some casting to get it out, but I am not sure if that is what we want.
@larkox - casting is exactly what we want :)
@marianunez - thanks for the link. It's clear that we should retry on 5xx. However, from the logs, it's not clear whether the ones with an internal error were logged after a retry on 503, or whether they came from some other non-503 error that wasn't retried. That's what I am trying to find out here.
In any case, the library is poorly written and makes it such that the user cannot choose to override the retry policy, in spite of the fact that the client clearly takes a retry policy struct as a parameter. There isn't even a way to bypass retrying so that we can copy over just the retry code.
So in the worst case, we would have to fork the repo and move some packages out of internal, so that they can be imported and overridden from the client.
But before that, I wanted some metrics on the distribution of the status codes. If we see that out of all 34 internal errors, all were actually 503, then I think we could leave this alone. But if we see other non-503 status codes, then we have a decision to make.
@agnivade Just to understand your ask.
You want a first change to surface the status code of the response, let it simmer for a while so we get new data, verify what kind of 5xx error it is, and then decide whether this PR makes sense or not. Am I right?
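If the proxy is on firebase-admin-go v4, one low-risk way to surface that data without touching the retry behaviour might be the SDK's errorutils.HTTPResponse helper; this is an assumption about the SDK surface (verify the helper exists in the pinned version), and the label scheme below is hypothetical:

import (
    "strconv"

    "firebase.google.com/go/v4/errorutils"
)

// fcmStatusLabel turns an FCM send error into a status-code label for a
// counter, so the distribution of 503 vs other 5xx responses becomes visible.
func fcmStatusLabel(err error) string {
    if resp := errorutils.HTTPResponse(err); resp != nil {
        return strconv.Itoa(resp.StatusCode)
    }
    return "unknown" // transport errors, context timeouts, etc.
}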
If we see that out of all 34 internal errors, all were actually 503, then I think we could leave this alone
@agnivade I may have missed this in the thread above, but why don't we want to retry for a 503?
@larkox - Yep, exactly.
@marianunez - To clarify: we, ourselves, don't want to retry for 503, because the client already does by default. My objective here is to figure out if the internal failures are all 503 or not. If it is, then we know that those were already retried and still failed. If it's something other than 503, then yes we definitely need to do something ourselves.
We need to look at the error codes to make sure we are doing the retries when recommended by FCM, and also see the guidance on when the retries can happen.
@agnivade I did some checking on the error, and did some refactoring. We are still retrying other errors, but we are filtering out (and logging) when the error is 503. Let me know if this would be OK for you, or if you would prefer to remove the retry on the first iteration completely.
This does not seem to be what we discussed here: #125 (comment). A couple of things:
Re: Not what we discussed
Re: Reflection
I see. You updated the PR without any extra context, so I didn't know about this extra conversation with Maria. Anyways, the problem is with implementing our own retry logic. The current logic that you have written is not correct, because it's not respecting the ...
But before we do anything ourselves, why not gather some data first and see what response codes we are getting? I wanted to see if all error codes are always 503 or not, in which case we don't need to retry anything ourselves.
Ideally, I would prefer not to have any duplication at all. So worst case, we would need to fork the repo and maintain it ourselves. But before that, I wanted to check whether we need to retry at all.
Ah yes ofc, thanks for that. Just goes to show how unconfigurable that module is.
@agnivade OK. Then let's do that. Let's drop the retry for now, and just collect data about the error code.
To avoid noise, I will close this PR in favor of #126, since even if we decide to continue this, we will probably take a different approach (forking the library).
@marianunez I prefer to handle all retries at once, when we decide on what to do with Android.
Summary
We have seen several errors that seem to be network hiccups. To avoid missing notifications, we add retry logic whenever an unexpected error appears.
This may result in duplicated notifications, but we prefer to err on the safe side instead of having missed notifications.
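As a rough sketch of the overall shape, pieced together from the diff snippets quoted above; the timeout, the MAX_RETRIES value, the backoff call, and the final "max retries" log are assumptions layered on top for illustration, not the PR's exact code:

import (
    "context"
    "time"

    "firebase.google.com/go/v4/messaging" // v4 import path assumed
)

const MAX_RETRIES = 3 // illustrative value

// SendNotificationWithRetry re-sends on transient errors only, bounded by
// MAX_RETRIES, accepting the small chance of a duplicate notification in
// exchange for fewer missed ones.
func (me *AndroidNotificationServer) SendNotificationWithRetry(fcmMsg *messaging.Message, retry int) error {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) // timeout assumed
    defer cancel()

    _, err := me.client.Send(ctx, fcmMsg)
    if err == nil {
        return nil
    }
    if isRetriableError(err) {
        me.logger.Errorf("Failed to send android push did=%v retry=%v error=%v", fcmMsg.Token, retry, err)
        if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
            time.Sleep(backoffDelay(retry)) // assumed helper, see the backoff sketch above
            return me.SendNotificationWithRetry(fcmMsg, nextIteration)
        }
        // Surface retry exhaustion explicitly so it can be logged/alerted on.
        me.logger.Errorf("Max retries reached for android push did=%v error=%v", fcmMsg.Token, err)
    }
    return err
}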
Ticket Link
Fix https://mattermost.atlassian.net/browse/MM-56827