Replies: 1 comment
Really appreciate your question & inspection! Your hypothesis is right. According to your experiment, every single inference takes 2.4 ~ 3.6 seconds. The batching mechanism of BentoML is already optimized for this kind of slow inference, but it requires users to adjust some parameters.

How to understand the parameter "mb_max_latency": the cork algorithm of … The default value of … My suggestion is to set …
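The tail of this reply appears truncated, so the cork-algorithm explanation, the default value, and the concrete suggestion are missing. For orientation only, here is a sketch of where these knobs go in the 0.x-era @bentoml.api decorator; the numeric values are illustrative, not the replier's recommendation:

```python
# Fragment only: this decorator goes on the batch-enabled endpoint of the
# BentoService (a fuller service sketch appears further down, under the
# question). The numbers below are illustrative, not a recommendation.
@bentoml.api(
    input=ImageInput(),    # from bentoml.adapters
    batch=True,            # enable micro-batching for this endpoint
    mb_max_latency=10000,  # milliseconds the marshaling layer may hold
                           # ("cork") requests while forming a batch
    mb_max_batch_size=32,  # upper bound on requests merged into one batch
)
def detect_batch(self, images):
    ...
```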
I'm deploying a service in a context where I would expect it to receive many requests at once. This seems like a natural use case for batching. So I've been experimenting with the batching feature to see if I do get a performance gain. But I am having trouble actually seeing the benefit.
The model is a TensorFlow object detection model. So first, to test out the model directly (outside of Bento), I created two functions: one takes in a list of images and processes them one by one through the model, and the other combines them into one big tensor to put through the model.
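Presumably the two functions looked roughly like this (a sketch; the SavedModel path, the detect_fn handle, and the assumption that all images share one shape and dtype are placeholders, not the asker's actual code):

```python
import numpy as np
import tensorflow as tf

# Placeholder path to the exported object detection SavedModel.
detect_fn = tf.saved_model.load(
    "exported_model/saved_model").signatures["serving_default"]


def detect_one_by_one(images):
    """Run each image through the model in its own forward pass."""
    results = []
    for img in images:
        batch = tf.convert_to_tensor(img[np.newaxis, ...], dtype=tf.uint8)
        results.append(detect_fn(batch))
    return results


def detect_as_one_batch(images):
    """Stack every image into a single tensor and do one forward pass.

    Assumes the images were already resized to a common shape and that
    the exported signature accepts a batched input.
    """
    batch = tf.convert_to_tensor(np.stack(images), dtype=tf.uint8)
    return detect_fn(batch)
```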
On 150 images, I get a noticeable speedup, as one would expect.
I then created two services: one that does not have batching enabled and one that does (and in the batching case, uses a similar approach of creating one large tensor).
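A sketch of what the two variants could look like with the 0.x BentoService API; the artifact name, the signature lookup, the single-shape assumption, and collapsing both variants into one class (in practice they may well be two separate services) are all assumptions for brevity:

```python
import bentoml
import numpy as np
import tensorflow as tf
from bentoml.adapters import ImageInput
from bentoml.frameworks.tensorflow import TensorflowSavedModelArtifact


@bentoml.env(infer_pip_packages=True)
@bentoml.artifacts([TensorflowSavedModelArtifact("detector")])
class DetectionService(bentoml.BentoService):

    def _infer(self, stacked):
        # One forward pass over an already-stacked uint8 batch tensor;
        # assumes the exported signature accepts a batched input.
        fn = self.artifacts.detector.signatures["serving_default"]
        return fn(tf.convert_to_tensor(stacked, dtype=tf.uint8))

    # Variant 1: no batching. Called once per request with a single image.
    @bentoml.api(input=ImageInput(), batch=False)
    def detect(self, image):
        outputs = self._infer(image[np.newaxis, ...])
        return {k: v[0].numpy().tolist() for k, v in outputs.items()}

    # Variant 2: batching enabled. BentoML hands over a list of images
    # gathered from concurrent requests; they go through the model as one
    # big tensor, and exactly one result per input image is returned.
    @bentoml.api(input=ImageInput(), batch=True)
    def detect_batch(self, images):
        outputs = self._infer(np.stack(images))
        return [{k: v[i].numpy().tolist() for k, v in outputs.items()}
                for i in range(len(images))]
```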
Next, I started three services:
I then sent 150 images to each in serial, where I send a request and wait for its response before sending the next. As expected, they are roughly the same.
Finally, I sent 150 images with async requests, where I send the images one by one but over many threads so that the service gets many requests at the same time. Again, I see no difference in time between the three.
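A minimal client sketch covering both the serial and the concurrent test; the endpoint URL, the API name, the image folder, and the raw-bytes payload format are placeholders and may need adjusting to the service's actual input adapter:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

# Placeholder endpoint; adjust to the service's actual host/port and API name.
URL = "http://localhost:5000/detect_batch"
IMAGE_PATHS = sorted(Path("test_images").glob("*.jpg"))[:150]


def send_one(path):
    """POST a single image and wait for its prediction."""
    with open(path, "rb") as f:
        resp = requests.post(URL, data=f.read(),
                             headers={"Content-Type": "image/jpeg"})
    resp.raise_for_status()
    return resp.json()


def run_serial(paths):
    """One request at a time: the server never sees concurrent requests."""
    start = time.perf_counter()
    results = [send_one(p) for p in paths]
    return results, time.perf_counter() - start


def run_concurrent(paths, workers=32):
    """Many requests in flight at once, so the server can form batches."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(send_one, paths))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, serial_s = run_serial(IMAGE_PATHS)
    _, concurrent_s = run_concurrent(IMAGE_PATHS)
    print(f"serial: {serial_s:.1f}s  concurrent: {concurrent_s:.1f}s")
```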
My main hypothesis is that the batch size for any given batch is not that big, so the gains aren't there. Is there a way to inspect how big the batches that are actually being used are?
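One low-tech way to check, assuming a batch=True endpoint like the sketch above: the API method receives the whole micro-batch as a list, so logging its length shows the effective batch size of every call:

```python
class DetectionService(bentoml.BentoService):
    ...

    @bentoml.api(input=ImageInput(), batch=True)
    def detect_batch(self, images):
        # `images` is the micro-batch assembled by BentoML's marshaling
        # layer, so its length is the effective batch size for this call.
        print(f"micro-batch size: {len(images)}", flush=True)
        ...
```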