fix(workload-launcher): SocketException handling #350

Closed
wants to merge 17 commits

Conversation

@hanslemm hanslemm commented Aug 20, 2024

What

These changes enable proper handling of SocketException in the workload-launcher, specifically when calling ClaimedProcessor.getWorkloadList().

Before this PR, SocketException errors would go "unnoticed", leaving a falsely healthy pod when the service should either retry or throw an error, since it cannot perform its core functions.

How

  1. It allows retries on SocketException.
  2. It limits the maximum retry policy to 10 minutes instead of 365 days.
  3. If the SocketException still occurs after the 10-minute maximum retry, it throws the error.

With this approach, SocketException is handled properly and the application is more likely to recover from it.

If it does not recover, the thrown error fails the deployment, which is desirable: without a successful .getWorkloadList() call the service cannot perform its core functions and would otherwise sit there as a falsely healthy pod.
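
For illustration, a minimal Failsafe sketch of the behavior described above (retry on SocketException, cap the total retry time at 10 minutes, then let the error propagate). This is not the exact diff in this PR, and fetchWorkloadList() is only a placeholder for the real workload-api call made by ClaimedProcessor:

import dev.failsafe.Failsafe
import dev.failsafe.RetryPolicy
import java.net.SocketException
import java.time.Duration

// Placeholder standing in for the real workload-api call in ClaimedProcessor.getWorkloadList().
fun fetchWorkloadList(): List<String> = emptyList()

// Retry SocketException with exponential backoff, give up after 10 minutes in total,
// and let the final exception propagate so the pod fails visibly instead of hanging as healthy.
val socketRetryPolicy = RetryPolicy.builder<Any>()
    .handle(SocketException::class.java)
    .withBackoff(Duration.ofSeconds(20), Duration.ofMinutes(10))
    .withMaxDuration(Duration.ofMinutes(10))
    .withMaxAttempts(-1)
    .build()

val workloads = Failsafe.with(socketRetryPolicy).get { fetchWorkloadList() }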

Reasoning

In my deployment, the workload-launcher pod hits a SocketException with Operation not permitted whenever I do an update, and recovering requires manually deleting the pod. That is because the SocketException is not retried and, although the call is fundamental to the healthy operation of the pod, no error is thrown either - so the pod stays marked as healthy in K8s even though it cannot launch workloads.

Here are the logs that I found in my deployment:

ERROR i.a.w.l.StartupApplicationEventListener$onApplicationEvent$1(invoke):41 - rehydrateAndProcessClaimed failed
dev.failsafe.FailsafeException: java.net.SocketException: Operation not permitted
	at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:196) ~[failsafe-3.3.2.jar:3.3.2]
	at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376) ~[failsafe-3.3.2.jar:3.3.2]
	at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112) ~[failsafe-3.3.2.jar:3.3.2]
	at io.airbyte.workload.launcher.ClaimedProcessor.getWorkloadList(ClaimedProcessor.kt:111) ~[io.airbyte-airbyte-workload-launcher-0.63.15.jar:?]
	at io.airbyte.workload.launcher.ClaimedProcessor.retrieveAndProcess(ClaimedProcessor.kt:60) ~[io.airbyte-airbyte-workload-launcher-0.63.15.jar:?]
	at io.airbyte.workload.launcher.StartupApplicationEventListener$onApplicationEvent$1.invoke(StartupApplicationEventListener.kt:37) [io.airbyte-airbyte-workload-launcher-0.63.15.jar:?]
	at io.airbyte.workload.launcher.StartupApplicationEventListener$onApplicationEvent$1.invoke(StartupApplicationEventListener.kt:35) [io.airbyte-airbyte-workload-launcher-0.63.15.jar:?]
	at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30) [kotlin-stdlib-1.9.24.jar:1.9.24-release-822]

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@hanslemm hanslemm closed this Aug 20, 2024
@hanslemm hanslemm reopened this Aug 20, 2024
@hanslemm hanslemm changed the title undo(changes) Fix(workload-launcher): SocketException handling Aug 20, 2024
@hanslemm hanslemm changed the title Fix(workload-launcher): SocketException handling fix(workload-launcher): SocketException handling Aug 20, 2024
@hanslemm hanslemm marked this pull request as ready for review August 20, 2024 18:36

hanslemm commented Aug 21, 2024

FYI, I've built this image and used it in my production environment, and it successfully fixed the SocketException issues. If anyone has the issue, the image is available here while this PR is not merged: docker.io/hlemm/airbyte-workload-launcher:dev-amd64

@jdpgrailsdev jdpgrailsdev left a comment

:shipit: Seems reasonable to me.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ hanslemm
❌ jpefaur
You have signed the CLA already but the status is still pending? Let us recheck it.

@gosusnp gosusnp left a comment

@hanslemm, thank you for your contribution. I have some concerns with parts of the change, which I explained inline.

Looking at this issue, I think the bug is that the launcher doesn't recover properly. However, I don't think it should fail just because it is unable to reach a dependency.

return Failsafe.with(
  RetryPolicy.builder<Any>()
-   .withBackoff(Duration.ofSeconds(20), Duration.ofDays(365))
+   .withBackoff(Duration.ofSeconds(20), Duration.ofMinutes(10))
Contributor

I don't think we should simply change this value.
One of the requirements of the launcher is to make sure we never exceed a given number of pending workloads, and I believe the current logic relies on this retry-"forever" behavior.
This change would effectively allow the launcher to start after 10 minutes even if it wasn't able to fetch the list of workloads that were in flight.

Coming back to your initial intent: at the moment, the launcher remains stuck in this state by design.

@hanslemm hanslemm Sep 7, 2024

Hi @gosusnp, thanks for taking the time to review this.

True, I missed the maximum retry policy: as proposed, the change only sets the maximum backoff to 10 minutes but still retries forever (I would need to add a .withMaxDuration(Duration.ofXXX(X))). I also understand your view that this is not necessary and that it should retry forever.

Yet, from my perspective, since ClaimedProcessor.getWorkloadList() is not optional for the pod to function but rather sine qua non, I believe it is fair to fail after some threshold, similar to how pipelines fail after X retries. This brings more transparency about the real status of the pod, instead of just logging failed attempts with backoffs that grow toward 365 days, and then retrying every 365 days after that. What the ideal threshold is, I am not sure.

I do understand that this retry-forever behavior seems relevant to other aspects of the project for reasons that are not 100% clear to me (although I believe nothing should retry forever: after a given limit it should stop and throw an error, for transparency and resource optimization - but that is just my opinion).

Perhaps the ideal would be to expose ENV VARs for the backoff, maximum backoff, and maximum retry duration (surfaced in the Helm chart), so that operators can choose their own threshold - see the rough sketch below.

Furthermore, even implementing the retry for SocketException alone (without the 10-minute fail-hard behavior or an ENV VAR for it) is, in my view, already a win, with no apparent downsides.
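
A rough, hypothetical sketch of that ENV VAR idea, purely for discussion; none of these variable names exist in Airbyte today, and the defaults simply mirror the values discussed in this thread:

import dev.failsafe.RetryPolicy
import java.net.SocketException
import java.time.Duration

// Hypothetical settings; names and defaults are illustrative only.
val initialBackoffSeconds = (System.getenv("LAUNCHER_RETRY_INITIAL_BACKOFF_SECONDS") ?: "20").toLong()
val maxBackoffSeconds = (System.getenv("LAUNCHER_RETRY_MAX_BACKOFF_SECONDS") ?: "600").toLong()
val maxRetryDurationSeconds = (System.getenv("LAUNCHER_RETRY_MAX_DURATION_SECONDS") ?: "600").toLong()

// Retry SocketException with an operator-configurable backoff and overall time limit.
val configurableRetryPolicy = RetryPolicy.builder<Any>()
    .handle(SocketException::class.java)
    .withBackoff(Duration.ofSeconds(initialBackoffSeconds), Duration.ofSeconds(maxBackoffSeconds))
    .withMaxDuration(Duration.ofSeconds(maxRetryDurationSeconds))
    .withMaxAttempts(-1)
    .build()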

Contributor

I view the dependency of the workload-launcher on the workload-api the same way as an api-server's dependency on a database. In the api-server / database situation, you don't want your liveness to depend on the db, to avoid churning through all the pods if the db were to become unavailable. I believe this is true for the launcher and the workload-api as well.

Now, I am interested in understanding why you are running into this situation more than we do. I suspect we might have a couple of issues; I wonder if one of them is that the launcher starts before the workload-api.
Would you mind sharing some context on what operations you are doing when you run into this?

Author

The only operation that triggers this for me is a new release.

As you mentioned, most likely the workload-api is not yet ready by the time the workload-launcher makes the request, and without a retry it cannot recover, forcing me to kill the pod manually.

@hanslemm hanslemm Sep 10, 2024

If the pod is not killed, the pipelines won't run and will retry for as long as they can (without success, of course).

Killing the pod recovers Airbyte instantly once it comes back up.

@hanslemm
Author

@gosusnp can we release this?

In the last commit I already removed your point of concern.

The current state retries on SocketException and does not limit the number of retries, as requested.

Thanks.

gosusnp commented Sep 12, 2024

@hanslemm, sorry for the delay; I haven't forgotten about this, just a bit of backlog on my end.

What I'd like to understand is the effort of fixing our boot logic compared to extending error handling in a way that could lead to surprises.
One consideration, for example: in a setup with multiple clusters, is it worth failing the launcher deploy because a given data-plane cannot reach a workload-api instance? In that case, a networking error could fail and block a deployment.

gosusnp commented Sep 13, 2024

So I spent some time trying to understand how we could remain stuck in this loop. Looking at the Failsafe documentation, I think we have a few errors in this code that cause us not to retry as expected:

  • we set an exponential backoff ceiling that is way too high, thinking it means we will wait forever; what it actually means is that eventually we could wait up to 365 days before a single retry, not that we will keep retrying for that long
  • we didn't change the max attempts, which means that by default we only retry 3 times
  • because we want to keep retrying transient network errors, we never rethrow those exceptions

What I think we need for the expected behavior:

  • .withBackoff(Duration.ofSeconds(5), Duration.ofSeconds(60))
  • .withMaxAttempts(-1)
  • we shouldn't need to change the exception re-throwing logic

I can spend more time tomorrow to test my theory and fix.
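
For reference, a minimal sketch of those settings put together, assuming the existing exception-handling logic stays untouched; fetchWorkloadList() is just a placeholder for the real workload-api call:

import dev.failsafe.Failsafe
import dev.failsafe.RetryPolicy
import java.time.Duration

// Placeholder for the real workload-api call.
fun fetchWorkloadList(): List<String> = emptyList()

// Back off between 5s and 60s and retry indefinitely (the default is only 3 attempts).
val retryPolicy = RetryPolicy.builder<Any>()
    .withBackoff(Duration.ofSeconds(5), Duration.ofSeconds(60))
    .withMaxAttempts(-1)
    .build()

val workloads = Failsafe.with(retryPolicy).get { fetchWorkloadList() }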

@hanslemm
Author

Thanks @gosusnp.

gosusnp commented Sep 17, 2024

Small update here: as I started setting up some tests, I found an issue that could leave the workload-launcher in a bad state. It was possible to fail the resume-claims process and stay in a running state, waiting for a lock that never clears.
I merged d74ce0d, which should address this issue.

gosusnp commented Sep 18, 2024

@hanslemm, I merged f75a0fa, which is a mix of making sure we retry on socket exceptions, simplifying some of the nested retry loops (mostly for our own sanity), and making sure those conditions are now tested.
Combined with the previous fix I mentioned, the current behavior should be that the launcher retries on socket exceptions or 5xx errors; any other error makes the launcher fail hard and causes the pod to fail.

Please let us know if you still observe such errors.
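
A rough sketch of that retry classification (not the merged code); ServerErrorException below is a hypothetical stand-in for whatever exception the workload-api client raises on a 5xx response:

import dev.failsafe.RetryPolicy
import java.net.SocketException
import java.time.Duration

// Hypothetical stand-in for "the workload-api client reported a 5xx response".
class ServerErrorException(message: String) : RuntimeException(message)

// Transient failures (socket errors, 5xx) are retried with a bounded backoff;
// any other exception propagates, so the launcher fails hard and the pod fails.
val launcherRetryPolicy = RetryPolicy.builder<Any>()
    .handle(SocketException::class.java, ServerErrorException::class.java)
    .withBackoff(Duration.ofSeconds(5), Duration.ofSeconds(60))
    .withMaxAttempts(-1)
    .build()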

@davinchia
Contributor

We are going to close this. Please open an issue if there are follow-up questions, thanks!

@davinchia davinchia closed this Sep 23, 2024
hanslemm commented Sep 23, 2024

Perfect, thanks guys! I will update my app in the cluster after the Airbyte 1.0 event.
