[Improvement] How to Avoid Driver Resource Deadlock When Submitting Spark Jobs via Kyuubi #6710

Open
hippozjs opened this issue Sep 25, 2024 · 3 comments

Comments

@hippozjs

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Problem Description:
When Spark jobs are submitted concurrently to the same Kubernetes (K8s) namespace via Kyuubi, several Spark drivers may start successfully and enter the Running state at the same time. Together they can exhaust the namespace's resources, leaving none for the Spark executors to start. Each driver then waits for executors that can never be scheduled, and since the drivers do not release their own resources while waiting, the jobs end up mutually waiting on each other's resources, i.e. a deadlock.
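
As a purely illustrative example (all numbers below are assumed, not measured): with a namespace quota of 12 CPUs and each driver requesting 2 CPUs, six concurrently started drivers consume the entire quota and no executor can ever be scheduled.

```scala
// Purely illustrative arithmetic; every number here is assumed.
object DriverDeadlockIllustration extends App {
  val namespaceQuotaCpus = 12 // CPU limit of the namespace ResourceQuota (assumed)
  val driverCpus         = 2  // CPUs requested by each Spark driver (assumed)

  val concurrentDrivers    = 6
  val cpusUsedByDrivers    = concurrentDrivers * driverCpus          // 12
  val cpusLeftForExecutors = namespaceQuotaCpus - cpusUsedByDrivers  // 0

  // All six drivers are Running and each waits for at least one executor,
  // but 0 CPUs remain, so no executor can ever be scheduled. No driver
  // releases its CPUs while waiting, so the mutual wait never resolves.
  println(s"CPUs left for executors: $cpusLeftForExecutors")
}
```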

How should we improve?

Solution:
We tried using Yunikorn-Gang scheduling to solve this issue, but it cannot completely avoid the deadlock. Therefore, we adopted another approach: adding a switch in Kyuubi. When the switch is turned on, Spark jobs submitted to the same namespace are processed sequentially instead of in parallel, and the submission of the current job depends on the running status of the previous job (see the sketch after this list):
1. If both the driver and executors exist, the previous Spark job has successfully acquired its resources, and the current Spark job can be submitted.
2. If only the driver exists and no executor is present, the system waits for a period of time and then re-checks the previous job's status, repeating until both the driver and executors are present, at which point the current Spark job is submitted.
3. A timeout can be configured. If the timeout is greater than 0 and the previous job still does not have both a driver and executors when the timeout expires, the previous driver is killed and the current Spark job is submitted. If the timeout is less than or equal to 0, the system waits indefinitely until the previous Spark job has both a driver and executors before submitting the current job.
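
A minimal sketch of the gating check, assuming the fabric8 Kubernetes client and the standard Spark-on-Kubernetes pod labels spark-role=driver / spark-role=executor; the object and method names, the polling interval, and the exact integration point in Kyuubi are hypothetical and would be settled in the PR:

```scala
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientBuilder}
import scala.jdk.CollectionConverters._

object SequentialSubmitGate {

  /**
   * Block until the previously submitted Spark application in `namespace`
   * has both a running driver and at least one running executor, or until
   * `timeoutMs` elapses. Returns true when the next job can be submitted,
   * false on timeout (the caller may then kill the stuck driver and proceed,
   * as described in item 3 above).
   */
  def awaitPreviousAppReady(
      client: KubernetesClient,
      namespace: String,
      timeoutMs: Long,
      pollIntervalMs: Long = 5000L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs

    // Running pods in the namespace that carry the given Spark role label.
    def runningPods(role: String) =
      client.pods().inNamespace(namespace)
        .withLabel("spark-role", role)
        .list().getItems.asScala
        .filter(_.getStatus.getPhase == "Running")

    // timeoutMs <= 0 means "wait indefinitely", matching item 3 above.
    while (timeoutMs <= 0 || System.currentTimeMillis() < deadline) {
      val drivers   = runningPods("driver")
      val executors = runningPods("executor")
      // No driver left: the previous job already finished and freed its
      // resources. Driver plus executors: the previous job holds what it
      // needs. In either case the current job can be submitted.
      if (drivers.isEmpty || executors.nonEmpty) return true
      Thread.sleep(pollIntervalMs)
    }
    false
  }
}
```

Usage sketch (the namespace name and timeout are made up for illustration):

```scala
val client = new KubernetesClientBuilder().build()
if (!SequentialSubmitGate.awaitPreviousAppReady(client, "spark-jobs", timeoutMs = 300000L)) {
  // Timeout reached: kill the previous driver pod here, then submit the current job.
}
```

For simplicity the sketch inspects all Spark pods in the namespace; a real implementation would scope the check to the previously submitted application, for example by its spark-app-selector label.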

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.

Hello @hippozjs,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.

@sudohainguyen
Contributor

We tried using Yunikorn-Gang scheduling to solve this issue, but it cannot completely avoid the deadlock.

Could you elaborate on this? We're combining Yunikorn and Kyuubi as well, with concurrent requests, and have had no problems so far; I believe it does resolve your case.

@pan3793
Member

pan3793 commented Oct 15, 2024

I also suppose such issues should be addressed by Yunikorn or Volcano
