How to Dynamically Allocate Resources with Limited GPU Resources? #5013

oogou11 · 2024-10-09T01:11:12Z

BentoML's gRPC integration makes tensor transmission more efficient once services are deployed as microservices. However, I recently ran into a problem: after containerized deployment, GPU resources cannot be scheduled dynamically. My environment has four GPUs, each hosting one model, and each model has been wrapped with business logic and exposed as a microservice. The model loaded on the third GPU is too large and triggers CUDA out-of-memory errors during computation, while the first GPU still has ample free memory. As a workaround, I modified the code to catch the exception and redirect that computation to device=cuda:0, which leaves us reacting to failures instead of preventing them. The official documentation does not mention any strategy for dynamic GPU resource scheduling.
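For readers who want to picture the workaround described above, here is a minimal sketch of catching an out-of-memory error and retrying on another device. It is not the code from my project and not a BentoML API: `DeviceOutOfMemory`, `run_with_fallback`, and the `compute` callback are hypothetical names used only for illustration. With PyTorch, the caught exception would presumably be `torch.cuda.OutOfMemoryError`.

```python
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


class DeviceOutOfMemory(RuntimeError):
    """Hypothetical stand-in for a framework OOM error (e.g. torch.cuda.OutOfMemoryError)."""


def run_with_fallback(
    compute: Callable[[str], T],
    devices: Sequence[str],
    oom_errors: Tuple[Type[BaseException], ...] = (DeviceOutOfMemory,),
) -> Tuple[str, T]:
    """Try `compute` on each device in order; move on only when an OOM error is raised.

    Returns the device that succeeded together with the result.
    """
    last_error = None
    for device in devices:
        try:
            return device, compute(device)
        except oom_errors as exc:
            # This device ran out of memory; remember the error and try the next one.
            last_error = exc
    raise RuntimeError(f"all devices ran out of memory: {list(devices)}") from last_error
```

Calling `run_with_fallback(infer, ["cuda:2", "cuda:0"])` would try the third GPU first and fall back to the first one only on an out-of-memory error. Any other exception propagates unchanged, so real bugs are not hidden by the retry.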
Can a dynamic scheduling algorithm be integrated? It should take parameters such as whether dynamic resource scheduling is enabled, business concurrency, initial data size (audio or text length, etc.), and the model's tensor size. I am currently implementing this in my own project. Is it already available in BentoML (version >= 1.3.3)?
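To make the request concrete, here is a rough sketch of the kind of decision I have in mind, using the parameters listed above. Everything in it is an assumption for illustration: `SchedulingRequest`, `required_bytes`, `pick_device`, and the `activation_factor` multiplier are hypothetical, and the memory estimate is deliberately naive. The free-memory map could be filled from something like `torch.cuda.mem_get_info`, which reports free and total bytes per device.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SchedulingRequest:
    dynamic: bool             # whether dynamic resource scheduling is enabled
    concurrency: int          # expected number of concurrent business requests
    input_bytes: int          # initial data size per request (audio, text, ...)
    activation_factor: float  # hypothetical multiplier from input size to working tensor memory


def required_bytes(req: SchedulingRequest) -> int:
    """Naive estimate of the GPU memory needed to serve the request."""
    return int(req.concurrency * req.input_bytes * req.activation_factor)


def pick_device(
    req: SchedulingRequest,
    free_bytes: Mapping[str, int],
    default_device: str,
) -> Optional[str]:
    """Return the device to use, or None if no device has enough free memory."""
    if not req.dynamic:
        # Static placement: always use the device the service was assigned.
        return default_device
    need = required_bytes(req)
    if free_bytes.get(default_device, 0) >= need:
        return default_device
    # Otherwise prefer the device with the most free memory that still fits.
    candidates = {dev: free for dev, free in free_bytes.items() if free >= need}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```

The sketch keeps the default device whenever it fits, so work only moves when the assigned GPU is actually short of memory. A real scheduler would also need to account for the memory already held by model weights and for requests in flight.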

frostming · 2024-10-23T09:22:58Z

Maintainer

The dynamic GPU resource scheduling described in this discussion is not supported by BentoML. BentoML currently lets users control GPU device assignment directly.

2 replies

oogou11 Oct 26, 2024
Author

Yes. In my current AIGC project I have run into many GPU utilization problems, especially with large-parameter models. The Bento framework has made deployment easier, but it is difficult to make full use of multiple GPUs. As a result, we have added a lot of code to adapt the LLM models, which significantly increases code complexity. The project team is gradually shifting to the Ray framework, while Bento seems to have become merely a web framework, which is not what I wanted to see.

oogou11 Oct 26, 2024
Author

Will the Bento framework have research and implementations in this area in the future?

Answer selected by oogou11
