[Improvement] How to Avoid Driver Resource Deadlock When Submitting Spark Jobs via Kyuubi #6710

Open
hippozjs opened this issue Sep 25, 2024 · 3 comments

Comments

@hippozjs

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Problem Description:
When Spark jobs are submitted concurrently to the same Kubernetes (K8s) namespace via Kyuubi, several Spark drivers may start successfully and enter the Running state at the same time. Together they can exhaust the namespace's resources, leaving none for the Spark executors to start. Each driver then waits for executors that can never be scheduled, and since the drivers do not release their own resources while waiting, the jobs end up mutually waiting on each other's resources, i.e. a deadlock.
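
As a purely illustrative example (all numbers below are assumed, not measured): with a namespace quota of 12 CPUs and each driver requesting 2 CPUs, six concurrently started drivers consume the entire quota and no executor can ever be scheduled.

```scala
// Purely illustrative arithmetic; every number here is assumed.
object DriverDeadlockIllustration extends App {
  val namespaceQuotaCpus = 12 // CPU limit of the namespace ResourceQuota (assumed)
  val driverCpus         = 2  // CPUs requested by each Spark driver (assumed)

  val concurrentDrivers    = 6
  val cpusUsedByDrivers    = concurrentDrivers * driverCpus          // 12
  val cpusLeftForExecutors = namespaceQuotaCpus - cpusUsedByDrivers  // 0

  // All six drivers are Running and each waits for at least one executor,
  // but 0 CPUs remain, so no executor can ever be scheduled. No driver
  // releases its CPUs while waiting, so the mutual wait never resolves.
  println(s"CPUs left for executors: $cpusLeftForExecutors")
}
```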

How should we improve?

Solution:
We tried using Yunikorn-Gang scheduling to solve this issue, but it cannot completely avoid the deadlock. Therefore, we adopted another approach: adding a switch in Kyuubi. When the switch is turned on, Spark jobs submitted to the same namespace are processed sequentially instead of in parallel, and the submission of the current job depends on the running status of the previous job (see the sketch after this list):
1. If both the driver and executors exist, the previous Spark job has successfully acquired its resources, and the current Spark job can be submitted.
2. If only the driver exists and no executor is present, the system waits for a period of time and then re-checks the previous job's status, repeating until both the driver and executors are present, at which point the current Spark job is submitted.
3. A timeout can be configured. If the timeout is greater than 0 and the previous job still does not have both a driver and executors when the timeout expires, the previous driver is killed and the current Spark job is submitted. If the timeout is less than or equal to 0, the system waits indefinitely until the previous Spark job has both a driver and executors before submitting the current job.
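
A minimal sketch of the gating check, assuming the fabric8 Kubernetes client and the standard Spark-on-Kubernetes pod labels spark-role=driver / spark-role=executor; the object and method names, the polling interval, and the exact integration point in Kyuubi are hypothetical and would be settled in the PR:

```scala
import io.fabric8.kubernetes.client.{KubernetesClient, KubernetesClientBuilder}
import scala.jdk.CollectionConverters._

object SequentialSubmitGate {

  /**
   * Block until the previously submitted Spark application in `namespace`
   * has both a running driver and at least one running executor, or until
   * `timeoutMs` elapses. Returns true when the next job can be submitted,
   * false on timeout (the caller may then kill the stuck driver and proceed,
   * as described in item 3 above).
   */
  def awaitPreviousAppReady(
      client: KubernetesClient,
      namespace: String,
      timeoutMs: Long,
      pollIntervalMs: Long = 5000L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs

    // Running pods in the namespace that carry the given Spark role label.
    def runningPods(role: String) =
      client.pods().inNamespace(namespace)
        .withLabel("spark-role", role)
        .list().getItems.asScala
        .filter(_.getStatus.getPhase == "Running")

    // timeoutMs <= 0 means "wait indefinitely", matching item 3 above.
    while (timeoutMs <= 0 || System.currentTimeMillis() < deadline) {
      val drivers   = runningPods("driver")
      val executors = runningPods("executor")
      // No driver left: the previous job already finished and freed its
      // resources. Driver plus executors: the previous job holds what it
      // needs. In either case the current job can be submitted.
      if (drivers.isEmpty || executors.nonEmpty) return true
      Thread.sleep(pollIntervalMs)
    }
    false
  }
}
```

Usage sketch (the namespace name and timeout are made up for illustration):

```scala
val client = new KubernetesClientBuilder().build()
if (!SequentialSubmitGate.awaitPreviousAppReady(client, "spark-jobs", timeoutMs = 300000L)) {
  // Timeout reached: kill the previous driver pod here, then submit the current job.
}
```

For simplicity the sketch inspects all Spark pods in the namespace; a real implementation would scope the check to the previously submitted application, for example by its spark-app-selector label.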

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.

Hello @hippozjs,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.

@sudohainguyen
Contributor

We tried using Yunikorn-Gang scheduling to solve this issue, but it cannot completely avoid the deadlock.

Could you elaborate on this? We're combining Yunikorn and Kyuubi as well, with concurrent requests, and have had no problems so far; I believe it does resolve your case.

@pan3793
Member

pan3793 commented Oct 15, 2024

I also suppose such issues should be addressed by Yunikorn or Volcano
