Replies: 1 comment
Really appreciate your question & inspection! Your hypothesis is right. According to your experiment, every single inference takes 2.4 ~ 3.6 seconds. The batching mechanism of BentoML is already optimized for this kind of slow inference, but it requires users to adjust some parameters.

How to understand the parameter "mb_max_latency": the cork algorithm of … The default value of … My suggestion is to set …
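The tail of this reply appears truncated, so the cork-algorithm explanation, the default value, and the concrete suggestion are missing. For orientation only, here is a sketch of where these knobs go in the 0.x-era @bentoml.api decorator; the numeric values are illustrative, not the replier's recommendation:

```python
# Fragment only: this decorator goes on the batch-enabled endpoint of the
# BentoService (a fuller service sketch appears further down, under the
# question). The numbers below are illustrative, not a recommendation.
@bentoml.api(
    input=ImageInput(),    # from bentoml.adapters
    batch=True,            # enable micro-batching for this endpoint
    mb_max_latency=10000,  # milliseconds the marshaling layer may hold
                           # ("cork") requests while forming a batch
    mb_max_batch_size=32,  # upper bound on requests merged into one batch
)
def detect_batch(self, images):
    ...
```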
I'm deploying a service in a context where I would expect it to receive many requests at once. This seems like a natural use case for batching. So I've been experimenting with the batching feature to see if I do get a performance gain. But I am having trouble actually seeing the benefit.
The model is a TensorFlow object detection model. So first, to test out the model directly (outside of Bento), I created two functions: one takes in a list of images and processes them one by one through the model, and the other combines them into one big tensor to put through the model.
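Presumably the two functions looked roughly like this (a sketch; the SavedModel path, the detect_fn handle, and the assumption that all images share one shape and dtype are placeholders, not the asker's actual code):

```python
import numpy as np
import tensorflow as tf

# Placeholder path to the exported object detection SavedModel.
detect_fn = tf.saved_model.load(
    "exported_model/saved_model").signatures["serving_default"]


def detect_one_by_one(images):
    """Run each image through the model in its own forward pass."""
    results = []
    for img in images:
        batch = tf.convert_to_tensor(img[np.newaxis, ...], dtype=tf.uint8)
        results.append(detect_fn(batch))
    return results


def detect_as_one_batch(images):
    """Stack every image into a single tensor and do one forward pass.

    Assumes the images were already resized to a common shape and that
    the exported signature accepts a batched input.
    """
    batch = tf.convert_to_tensor(np.stack(images), dtype=tf.uint8)
    return detect_fn(batch)
```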
On 150 images, I get a noticeable speedup, as one would expect.
I then created two services: one that does not have batching enabled and one that does (and in the batching case, uses a similar approach of creating one large tensor).
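A sketch of what the two variants could look like with the 0.x BentoService API; the artifact name, the signature lookup, the single-shape assumption, and collapsing both variants into one class (in practice they may well be two separate services) are all assumptions for brevity:

```python
import bentoml
import numpy as np
import tensorflow as tf
from bentoml.adapters import ImageInput
from bentoml.frameworks.tensorflow import TensorflowSavedModelArtifact


@bentoml.env(infer_pip_packages=True)
@bentoml.artifacts([TensorflowSavedModelArtifact("detector")])
class DetectionService(bentoml.BentoService):

    def _infer(self, stacked):
        # One forward pass over an already-stacked uint8 batch tensor;
        # assumes the exported signature accepts a batched input.
        fn = self.artifacts.detector.signatures["serving_default"]
        return fn(tf.convert_to_tensor(stacked, dtype=tf.uint8))

    # Variant 1: no batching. Called once per request with a single image.
    @bentoml.api(input=ImageInput(), batch=False)
    def detect(self, image):
        outputs = self._infer(image[np.newaxis, ...])
        return {k: v[0].numpy().tolist() for k, v in outputs.items()}

    # Variant 2: batching enabled. BentoML hands over a list of images
    # gathered from concurrent requests; they go through the model as one
    # big tensor, and exactly one result per input image is returned.
    @bentoml.api(input=ImageInput(), batch=True)
    def detect_batch(self, images):
        outputs = self._infer(np.stack(images))
        return [{k: v[i].numpy().tolist() for k, v in outputs.items()}
                for i in range(len(images))]
```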
Next, I started three services:
I then sent 150 images to each in serial, where I send a request and wait for its response before sending the next. As expected, they are roughly the same.
Finally, I sent 150 images with async requests, where I send the images one by one but over many threads so that the service gets many requests at the same time. Again, I see no difference in time between the three.
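A minimal client sketch covering both the serial and the concurrent test; the endpoint URL, the API name, the image folder, and the raw-bytes payload format are placeholders and may need adjusting to the service's actual input adapter:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

# Placeholder endpoint; adjust to the service's actual host/port and API name.
URL = "http://localhost:5000/detect_batch"
IMAGE_PATHS = sorted(Path("test_images").glob("*.jpg"))[:150]


def send_one(path):
    """POST a single image and wait for its prediction."""
    with open(path, "rb") as f:
        resp = requests.post(URL, data=f.read(),
                             headers={"Content-Type": "image/jpeg"})
    resp.raise_for_status()
    return resp.json()


def run_serial(paths):
    """One request at a time: the server never sees concurrent requests."""
    start = time.perf_counter()
    results = [send_one(p) for p in paths]
    return results, time.perf_counter() - start


def run_concurrent(paths, workers=32):
    """Many requests in flight at once, so the server can form batches."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send_one, paths))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, serial_s = run_serial(IMAGE_PATHS)
    _, concurrent_s = run_concurrent(IMAGE_PATHS)
    print(f"serial: {serial_s:.1f}s  concurrent: {concurrent_s:.1f}s")
```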
My main hypothesis is that the batch size for any given batch is not that big, so the gains aren't there. Is there a way to inspect how big the batches that are actually being used are?
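One low-tech way to check, assuming a batch=True endpoint like the sketch above: the API method receives the whole micro-batch as a list, so logging its length shows the effective batch size of every call:

```python
class DetectionService(bentoml.BentoService):
    ...

    @bentoml.api(input=ImageInput(), batch=True)
    def detect_batch(self, images):
        # `images` is the micro-batch assembled by BentoML's marshaling
        # layer, so its length is the effective batch size for this call.
        print(f"micro-batch size: {len(images)}", flush=True)
        ...
```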