fix(workload-launcher): SocketException handling #350

Closed
wants to merge 17 commits

Conversation

@hanslemm hanslemm commented Aug 20, 2024

What

These changes enable proper handling of SocketException in the workload-launcher, specifically when calling ClaimedProcessor.getWorkloadList().

Before this PR, SocketException errors would go "unnoticed", leaving a falsely healthy pod when the service should either retry or throw an error, since it cannot perform its core functions.

How

  1. It allows retries on SocketException.
  2. It limits the maximum retry policy to 10 minutes instead of 365 days.
  3. If the SocketException still occurs after the 10-minute maximum retry, it throws the error.

With this approach, SocketException is handled properly and the application is more likely to recover from it.

If it does not recover, the thrown error fails the deployment, which is desirable: without a successful .getWorkloadList() call the service cannot perform its core functions and would otherwise sit there as a falsely healthy pod.
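
For illustration, a minimal Failsafe sketch of the behavior described above (retry on SocketException, cap the total retry time at 10 minutes, then let the error propagate). This is not the exact diff in this PR, and fetchWorkloadList() is only a placeholder for the real workload-api call made by ClaimedProcessor:

import dev.failsafe.Failsafe
import dev.failsafe.RetryPolicy
import java.net.SocketException
import java.time.Duration

// Placeholder standing in for the real workload-api call in ClaimedProcessor.getWorkloadList().
fun fetchWorkloadList(): List<String> = emptyList()

// Retry SocketException with exponential backoff, give up after 10 minutes in total,
// and let the final exception propagate so the pod fails visibly instead of hanging as healthy.
val socketRetryPolicy = RetryPolicy.builder<Any>()
    .handle(SocketException::class.java)
    .withBackoff(Duration.ofSeconds(20), Duration.ofMinutes(10))
    .withMaxDuration(Duration.ofMinutes(10))
    .withMaxAttempts(-1)
    .build()

val workloads = Failsafe.with(socketRetryPolicy).get { fetchWorkloadList() }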

Reasoning

In my deployment, the workload-launcher pod hits a SocketException with Operation not permitted whenever I do an update, and recovering requires manually deleting the pod. That is because the SocketException is not retried and, although the call is fundamental to the healthy operation of the pod, no error is thrown either - so the pod stays marked as healthy in K8s even though it cannot launch workloads.

Here are the logs that I found in my deployment:

ERROR i.a.w.l.StartupApplicationEventListener$onApplicationEvent$1(invoke):41 - rehydrateAndProcessClaimed failed
dev.failsafe.FailsafeException: java.net.SocketException: Operation not permitted
	at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:196) ~[failsafe-3.3.2.jar:3.3.2]
	at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376) ~[failsafe-3.3.2.jar:3.3.2]
	at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112) ~[failsafe-3.3.2.jar:3.3.2]
	at io.airbyte.workload.launcher.ClaimedProcessor.getWorkloadList(ClaimedProcessor.kt:111) ~[io.airbyte-airbyte-workload-launcher-0.63.15.jar:?]
	at io.airbyte.workload.launcher.ClaimedProcessor.retrieveAndProcess(ClaimedProcessor.kt:60) ~[io.airbyte-airbyte-workload-launcher-0.63.15.jar:?]
	at io.airbyte.workload.launcher.StartupApplicationEventListener$onApplicationEvent$1.invoke(StartupApplicationEventListener.kt:37) [io.airbyte-airbyte-workload-launcher-0.63.15.jar:?]
	at io.airbyte.workload.launcher.StartupApplicationEventListener$onApplicationEvent$1.invoke(StartupApplicationEventListener.kt:35) [io.airbyte-airbyte-workload-launcher-0.63.15.jar:?]
	at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30) [kotlin-stdlib-1.9.24.jar:1.9.24-release-822]

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@hanslemm hanslemm closed this Aug 20, 2024
@hanslemm hanslemm reopened this Aug 20, 2024
@hanslemm hanslemm changed the title undo(changes) Fix(workload-launcher): SocketException handling Aug 20, 2024
@hanslemm hanslemm changed the title Fix(workload-launcher): SocketException handling fix(workload-launcher): SocketException handling Aug 20, 2024
@hanslemm hanslemm marked this pull request as ready for review August 20, 2024 18:36

hanslemm commented Aug 21, 2024

FYI, I've built this image and used it in my production environment, and it successfully fixed the SocketException issues. If anyone has the issue, the image is available here while this PR is not merged: docker.io/hlemm/airbyte-workload-launcher:dev-amd64

@jdpgrailsdev jdpgrailsdev left a comment

:shipit: Seems reasonable to me.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ hanslemm
❌ jpefaur
You have signed the CLA already but the status is still pending? Let us recheck it.

@gosusnp gosusnp left a comment

@hanslemm, thank you for your contribution. I have some concerns with parts of the change, which I explained inline.

Looking at this issue, I think the bug is that the launcher doesn't recover properly. However, I don't think it should fail just because it is unable to reach a dependency.

return Failsafe.with(
  RetryPolicy.builder<Any>()
-   .withBackoff(Duration.ofSeconds(20), Duration.ofDays(365))
+   .withBackoff(Duration.ofSeconds(20), Duration.ofMinutes(10))
Contributor

I don't think we should simply change this value.
One of the requirements of the launcher is to make sure we never exceed a given number of pending workloads, and I believe the current logic relies on this retry-"forever" behavior.
This change would effectively allow the launcher to start after 10 minutes even if it wasn't able to fetch the list of workloads that were in flight.

Coming back to your initial intent: at the moment, the launcher remains stuck in this state by design.

@hanslemm hanslemm Sep 7, 2024

Hi @gosusnp, thanks for taking the time to review this.

True, I missed the maximum retry policy: as proposed, the change only sets the maximum backoff to 10 minutes but still retries forever (I would need to add a .withMaxDuration(Duration.ofXXX(X))). I also understand your view that this is not necessary and that it should retry forever.

Yet, from my perspective, since ClaimedProcessor.getWorkloadList() is not optional for the pod to function but rather sine qua non, I believe it is fair to fail after some threshold, similar to how pipelines fail after X retries. This brings more transparency about the real status of the pod, instead of just logging failed attempts with backoffs that grow toward 365 days, and then retrying every 365 days after that. What the ideal threshold is, I am not sure.

I do understand that this retry-forever behavior seems relevant to other aspects of the project for reasons that are not 100% clear to me (although I believe nothing should retry forever: after a given limit it should stop and throw an error, for transparency and resource optimization - but that is just my opinion).

Perhaps the ideal would be to expose ENV VARs for the backoff, maximum backoff, and maximum retry duration (surfaced in the Helm chart), so that operators can choose their own threshold - see the rough sketch below.

Furthermore, even implementing the retry for SocketException alone (without the 10-minute fail-hard behavior or an ENV VAR for it) is, in my view, already a win, with no apparent downsides.
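
A rough, hypothetical sketch of that ENV VAR idea, purely for discussion; none of these variable names exist in Airbyte today, and the defaults simply mirror the values discussed in this thread:

import dev.failsafe.RetryPolicy
import java.net.SocketException
import java.time.Duration

// Hypothetical settings; names and defaults are illustrative only.
val initialBackoffSeconds = (System.getenv("LAUNCHER_RETRY_INITIAL_BACKOFF_SECONDS") ?: "20").toLong()
val maxBackoffSeconds = (System.getenv("LAUNCHER_RETRY_MAX_BACKOFF_SECONDS") ?: "600").toLong()
val maxRetryDurationSeconds = (System.getenv("LAUNCHER_RETRY_MAX_DURATION_SECONDS") ?: "600").toLong()

// Retry SocketException with an operator-configurable backoff and overall time limit.
val configurableRetryPolicy = RetryPolicy.builder<Any>()
    .handle(SocketException::class.java)
    .withBackoff(Duration.ofSeconds(initialBackoffSeconds), Duration.ofSeconds(maxBackoffSeconds))
    .withMaxDuration(Duration.ofSeconds(maxRetryDurationSeconds))
    .withMaxAttempts(-1)
    .build()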

Contributor

I view the dependency of the workload-launcher on the workload-api the same way as an api-server's dependency on a database. In the api-server / database situation, you don't want your liveness to depend on the db, to avoid churning through all the pods if the db were to become unavailable. I believe this is true for the launcher and the workload-api as well.

Now, I am interested in understanding why you are running into this situation more than we do. I suspect we might have a couple of issues; I wonder if one of them is that the launcher starts before the workload-api.
Would you mind sharing some context on what operations you are doing when you run into this?

Author

The only operation that triggers this for me is a new release.

As you mentioned, most likely the workload-api is not yet ready by the time the workload-launcher makes the request, and without a retry it cannot recover, forcing me to kill the pod manually.

@hanslemm hanslemm Sep 10, 2024

If the pod is not killed, the pipelines won't run and will retry for as long as they can (without success, of course).

Killing the pod recovers Airbyte instantly once it comes back up.

@hanslemm
Author

@gosusnp can we release this?

In the last commit I already removed your point of concern.

The current state retries on SocketException and does not limit the number of retries, as requested.

Thanks.

gosusnp commented Sep 12, 2024

@hanslemm, sorry for the delay; I haven't forgotten about this, just a bit of backlog on my end.

What I'd like to understand is the effort of fixing our boot logic compared to extending error handling in a way that could lead to surprises.
One consideration, for example: in a setup with multiple clusters, is it worth failing the launcher deploy because a given data-plane cannot reach a workload-api instance? In that case, a networking error could fail and block a deployment.

gosusnp commented Sep 13, 2024

So I spent some time trying to understand how we could remain stuck in this loop. Looking at the Failsafe documentation, I think we have a few errors in this code that cause us not to retry as expected:

  • we set an exponential backoff ceiling that is way too high, thinking it means we will wait forever; what it actually means is that eventually we could wait up to 365 days before a single retry, not that we will keep retrying for that long
  • we didn't change the max attempts, which means that by default we only retry 3 times
  • because we want to keep retrying transient network errors, we never rethrow those exceptions

What I think we need for the expected behavior:

  • .withBackoff(Duration.ofSeconds(5), Duration.ofSeconds(60))
  • .withMaxAttempts(-1)
  • we shouldn't need to change the exception re-throwing logic

I can spend more time tomorrow to test my theory and fix.
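
For reference, a minimal sketch of those settings put together, assuming the existing exception-handling logic stays untouched; fetchWorkloadList() is just a placeholder for the real workload-api call:

import dev.failsafe.Failsafe
import dev.failsafe.RetryPolicy
import java.time.Duration

// Placeholder for the real workload-api call.
fun fetchWorkloadList(): List<String> = emptyList()

// Back off between 5s and 60s and retry indefinitely (the default is only 3 attempts).
val retryPolicy = RetryPolicy.builder<Any>()
    .withBackoff(Duration.ofSeconds(5), Duration.ofSeconds(60))
    .withMaxAttempts(-1)
    .build()

val workloads = Failsafe.with(retryPolicy).get { fetchWorkloadList() }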

@hanslemm
Author

Thanks @gosusnp.

gosusnp commented Sep 17, 2024

Small update here: as I started setting up some tests, I found an issue that could leave the workload-launcher in a bad state. It was possible to fail the resume-claims process and stay in a running state, waiting for a lock that never clears.
I merged d74ce0d, which should address this issue.

gosusnp commented Sep 18, 2024

@hanslemm, I merged f75a0fa, which is a mix of making sure we retry on socket exceptions, simplifying some of the nested retry loops (mostly for our own sanity), and making sure those conditions are now tested.
Combined with the previous fix I mentioned, the current behavior should be that the launcher retries on socket exceptions or 5xx errors; any other error makes the launcher fail hard and causes the pod to fail.

Please let us know if you still observe such errors.
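
A rough sketch of that retry classification (not the merged code); ServerErrorException below is a hypothetical stand-in for whatever exception the workload-api client raises on a 5xx response:

import dev.failsafe.RetryPolicy
import java.net.SocketException
import java.time.Duration

// Hypothetical stand-in for "the workload-api client reported a 5xx response".
class ServerErrorException(message: String) : RuntimeException(message)

// Transient failures (socket errors, 5xx) are retried with a bounded backoff;
// any other exception propagates, so the launcher fails hard and the pod fails.
val launcherRetryPolicy = RetryPolicy.builder<Any>()
    .handle(SocketException::class.java, ServerErrorException::class.java)
    .withBackoff(Duration.ofSeconds(5), Duration.ofSeconds(60))
    .withMaxAttempts(-1)
    .build()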

@davinchia
Contributor

We are going to close this. Please open an issue if there are follow-up questions, thanks!

@davinchia davinchia closed this Sep 23, 2024
hanslemm commented Sep 23, 2024

Perfect, thanks guys! I will update my app in the cluster after the Airbyte 1.0 event.
