Add retries to notification calls #125

Closed · wants to merge 3 commits
Changes from 2 commits
server/android_notification_server.go — 31 additions, 8 deletions

@@ -166,16 +166,9 @@ func (me *AndroidNotificationServer) SendNotification(msg *PushNotification) PushResponse
 		},
 	}
 
-	ctx, cancel := context.WithTimeout(context.Background(), me.sendTimeout)
-	defer cancel()
-
 	me.logger.Infof("Sending android push notification for device=%v type=%v ackId=%v", me.AndroidPushSettings.Type, msg.Type, msg.AckID)
 
-	start := time.Now()
-	_, err := me.client.Send(ctx, fcmMsg)
-	if me.metrics != nil {
-		me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
-	}
+	err := me.SendNotificationWithRetry(fcmMsg, 0)
Member (agnivade):

For Android, retries are already built-in to the HTTP client: https://github.com/firebase/firebase-admin-go/blob/6a28190d0d30fe0134648566c7e89f8635a51672/messaging/messaging.go#L892.

There is no need to do anything. In the future, please feel free to ask me for review or take a deeper look at the documentation. :)

Contributor Author (larkox, Aug 22, 2024):

I missed that. Thank you for pointing it out!

That said, one reason we are retrying is that FCM may return an "internal server error", which, from what I can find around the web, seems to be a fluke in their system that we just have to retry.

Looking at the internals, I see that only the serviceUnavailable status is retried, so other problems (like the context timing out, or this internal server error) won't get retried. Am I right?

If that is so, how would you approach this?

Member (agnivade):

I would trust the implementation to be correct about which status codes to retry on. Also, any network errors like timeouts are already retried.

So the only missing piece is retrying on additional status codes. I'd want to see some metrics on the distribution of errors we are getting from FCM and then take a call.

For context, overriding the retry policy from the user side is difficult: firebase/firebase-admin-go#579.

Contributor Author (larkox):

Right now I don't see FCM errors in the metrics, but @marianunez spotted a few some time ago.

For iOS we have recently found a couple regarding context timeouts, among other things, which is why we are adding the retry for both.

Member (marianunez, Aug 22, 2024):

For example, in the last 7 days there have been 54 instances of FCM push failures:

[Screenshot 2024-08-22 at 8:30 AM: graph of FCM push failure counts]

Contributor Author (larkox):

@marianunez The library doesn't expose the error codes. It has a few util functions to distinguish the errors, and that is what we are doing in the PR.

Member (agnivade):

> Not sure how we can really get that. The library exposes just an error, and doesn't expose the HTTP status code. We could use some casting to get it out, but not sure if that is what we want.

@larkox - casting is exactly what we want :)

@marianunez - thanks for the link. It's clear that we should retry on 5xx. However, from the logs, it's not clear whether the entries with "internal error" were logged after a retry on a 503, or whether they came from some other non-503 error that wasn't retried. That's what I am trying to find out here.

In any case, the library is poorly written and prevents the user from overriding the retry policy, in spite of the fact that the client clearly takes a retry policy struct as a parameter. There isn't even a way to bypass retrying so that we can copy over just the retry code.

So in the worst case, we would have to fork the repo and move some packages out of internal so that they can be imported and overridden from the client.

But before that, I wanted some metrics on the distribution of the status codes. If we see that all 34 internal errors were actually 503s, then I think we can leave this alone. But if we see other non-503 status codes, then we have a decision to make.
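
For reference, one way the status code could be surfaced is via the errorutils package in firebase-admin-go v4, which attaches the underlying *http.Response to FCM errors. A minimal sketch, assuming the proxy is on v4; this helper is illustrative, not part of the PR:

import "firebase.google.com/go/v4/errorutils"

// fcmStatusCode is a hypothetical helper: it returns the HTTP status code
// attached to an FCM error, or 0 when no response is attached (for example,
// a pure network or timeout error).
func fcmStatusCode(err error) int {
	if resp := errorutils.HTTPResponse(err); resp != nil {
		return resp.StatusCode
	}
	return 0
}

Logging this value alongside each failure would give exactly the status-code distribution being asked for here.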

Contributor Author (larkox):

@agnivade Just to understand your ask: you want a first change to surface the status code of the response, let it simmer for a while so we get new data, verify what kind of 5xx errors they are, and then decide whether this PR makes sense or not. Am I right?

Member (marianunez):

> If we see that all 34 internal errors were actually 503s, then I think we can leave this alone

@agnivade I may have missed this in the thread above, but why don't we want to retry on a 503?

Member (agnivade):

@larkox - Yep, exactly.

@marianunez - To clarify: we ourselves don't want to retry on 503, because the client already does that by default. My objective here is to figure out whether the internal failures are all 503s or not. If they are, then we know those were already retried and still failed. If some are something other than 503, then yes, we definitely need to do something ourselves.


 	if err != nil {
 		me.logger.Errorf("Failed to send FCM push sid=%v did=%v err=%v type=%v", msg.ServerID, msg.DeviceID, err, me.AndroidPushSettings.Type)

@@ -220,3 +213,33 @@ func (me *AndroidNotificationServer) SendNotification(msg *PushNotification) PushResponse
 	}
 	return NewOkPushResponse()
 }

+func isRetriableError(err error) bool {
+	// Retry on any error that is not one of the known, non-retriable FCM
+	// errors; this deliberately includes FCM's "internal server error".
+	if err == nil {
+		return false
+	}
+	return !(messaging.IsUnregistered(err) ||
+		messaging.IsInvalidArgument(err) ||
+		messaging.IsQuotaExceeded(err) ||
+		messaging.IsSenderIDMismatch(err) ||
+		messaging.IsThirdPartyAuthError(err))
+}

+func (me *AndroidNotificationServer) SendNotificationWithRetry(fcmMsg *messaging.Message, retry int) error {
+	ctx, cancel := context.WithTimeout(context.Background(), me.sendTimeout)
+	defer cancel()
+
+	start := time.Now()
+	_, err := me.client.Send(ctx, fcmMsg)
+	if me.metrics != nil {
+		me.metrics.observerNotificationResponse(PushNotifyAndroid, time.Since(start).Seconds())
Contributor:

Should we add the retry attempt to the metrics? I don't see the counter being added to the metrics.

Contributor Author (larkox):

The counter would go in after all the retries. No need to mark it as a failure if we are still retrying.

We could add a different metric for that, though. 0/5

Contributor:

Yes, I think we should at least add a metric/log so we have some idea that this is indeed happening and why; that would allow us to narrow down what could be causing future double notifications, etc.
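
A hypothetical sketch of that instrumentation; incrementNotificationRetry does not exist in the current metrics interface, and the log line is illustrative, not part of this PR:

// Count and log each retry attempt so retry storms show up in dashboards.
if me.metrics != nil {
	me.metrics.incrementNotificationRetry(PushNotifyAndroid) // assumed new metrics method
}
me.logger.Infof("Retrying android push did=%v attempt=%v", fcmMsg.Token, retry+1)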

+	}
+
+	if isRetriableError(err) {
+		me.logger.Errorf("Failed to send android push did=%v retry=%v error=%v", fcmMsg.Token, retry, err)
+		if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
+			return me.SendNotificationWithRetry(fcmMsg, nextIteration)
Member:

Should we add a wait here before immediately retrying? In the server we use exponential backoff to give it some time, in case it's a network hiccup of some sort, before continuously retrying - see https://github.com/mattermost/mattermost/blob/master/server/channels/utils/backoff.go#L14-L15

Also, in the CWS we use a backoff library built on the same concept: https://github.com/mattermost/customer-web-server/blob/master/app/cloud.go#L204
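
A minimal sketch of the suggested backoff, assuming a simple doubling delay with a cap; the constants are illustrative and not part of this PR:

// Back off before the next attempt: 100ms, 200ms, 400ms, ... capped at 2s.
// A jittered backoff library like the ones linked above could be used instead.
delay := time.Duration(100*(1<<retry)) * time.Millisecond
if delay > 2*time.Second {
	delay = 2 * time.Second
}
time.Sleep(delay)
return me.SendNotificationWithRetry(fcmMsg, nextIteration)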

+		}
Contributor:

Should we add a metric or log showing that the maximum number of retries was reached?

Contributor Author (larkox):

Yes, we probably should.

+		me.logger.Errorf("Max retries reached did=%v", fcmMsg.Token)
+	}
+
+	return err
+}
server/apple_notification_server.go — 23 additions, 8 deletions

@@ -227,15 +227,8 @@ func (me *AppleNotificationServer) SendNotification(msg *PushNotification) PushResponse

 	if me.AppleClient != nil {
 		me.logger.Infof("Sending apple push notification for device=%v type=%v ackId=%v", me.ApplePushSettings.Type, msg.Type, msg.AckID)
-		start := time.Now()
-
-		ctx, cancel := context.WithTimeout(context.Background(), me.sendTimeout)
-		defer cancel()
-
-		res, err := me.AppleClient.PushWithContext(ctx, notification)
-		if me.metrics != nil {
-			me.metrics.observerNotificationResponse(PushNotifyApple, time.Since(start).Seconds())
-		}
+		res, err := me.SendNotificationWithRetry(notification, 0)
 		if err != nil {
 			me.logger.Errorf("Failed to send apple push sid=%v did=%v err=%v type=%v", msg.ServerID, msg.DeviceID, err, me.ApplePushSettings.Type)
 			if me.metrics != nil {

@@ -269,3 +262,25 @@ func (me *AppleNotificationServer) SendNotification(msg *PushNotification) PushResponse
 	}
 	return NewOkPushResponse()
 }

+func (me *AppleNotificationServer) SendNotificationWithRetry(notification *apns.Notification, retry int) (*apns.Response, error) {
+	start := time.Now()
+
+	ctx, cancel := context.WithTimeout(context.Background(), me.sendTimeout)
+	defer cancel()
+
+	res, err := me.AppleClient.PushWithContext(ctx, notification)
+	if me.metrics != nil {
+		me.metrics.observerNotificationResponse(PushNotifyApple, time.Since(start).Seconds())
Contributor:

Same comment as on Android regarding metrics and logs.

+	}
+
+	if err != nil {
+		me.logger.Errorf("Failed to send apple push did=%v retry=%v error=%v", notification.DeviceToken, retry, err)
+		if nextIteration := retry + 1; nextIteration < MAX_RETRIES {
Contributor:

Here we are not checking whether we should retry, the way we do on Android.

Contributor Author (larkox):

On iOS the library works slightly differently.

We may get a response from the library saying the notification was not sent; those would be the "known errors". Anything that comes back in the err variable is probably network-related, so that is what we are retrying here.

Contributor:

Hmmm, I see. Even when there is a mismatch with the device ID, or the device is no longer registered (for example, the app has been uninstalled)?

Contributor Author (larkox):

Yes, those scenarios are covered inside the response, and not in the error.
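
A sketch of that split, assuming the apns2-style client wrapped here; the branches are illustrative and not part of the PR:

// Transport-level failures surface in err; APNs-level rejections surface
// in the response. Only the former are worth retrying blindly.
res, err := me.AppleClient.PushWithContext(ctx, notification)
if err != nil {
	// Network or timeout failure: the request may never have reached APNs.
}
if res != nil && !res.Sent() {
	// APNs rejected the push; res.Reason explains why (for example, an
	// unregistered device token). Retrying won't help for these.
}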

+			return me.SendNotificationWithRetry(notification, nextIteration)
+		}
+		me.logger.Errorf("Max retries reached did=%v", notification.DeviceToken)
+	}
+
+	return res, err
+}
server/server.go — 1 addition

@@ -27,6 +27,7 @@ const (
 	HEADER_REAL_IP             = "X-Real-IP"
 	WAIT_FOR_SERVER_SHUTDOWN   = time.Second * 5
 	CONNECTION_TIMEOUT_SECONDS = 60
+	MAX_RETRIES                = 3
 )

 type NotificationServer interface {