Skip to content

Commit

Permalink
Added concurrency guide. (#677)
Browse files Browse the repository at this point in the history
* Added concurrency guide.

* fix typo

Co-authored-by: Philip Kiely - Baseten <98474633+philipkiely-baseten@users.noreply.github.com>

---------

Co-authored-by: Philip Kiely - Baseten <98474633+philipkiely-baseten@users.noreply.github.com>
  • Loading branch information
squidarth and philipkiely-baseten authored Sep 22, 2023
1 parent ff2e1ae commit 59fda14
Show file tree
Hide file tree
Showing 5 changed files with 119 additions and 1 deletion.
117 changes: 117 additions & 0 deletions docs/guides/concurrency.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
---
title: "How to configure concurrency"
description: "A guide to setting concurrency for your model"
---

Configuring concurrency is one of the major knobs available for getting the most performance
out of your model. In this doc, we'll cover the options that are available to you.

# What is concurrency, and why configure it?

At a very high level, "concurrency" in this context refers to how many requests a single replica can
process at the same time. There are no right answers to what this number ought to be -- the specifics
of your model and the metrics you are optimizing for (throughput? latency?) matter a lot for determining this.

In Baseten & Truss, there are two notions of concurrency:
* **Concurrency Target** -- the number of requests that will be sent to a model at the same time
* **Predict Concurrency** -- once requests have made it onto the model container, the "predict concurrency" governs how many
requests can go through the `predict` function on your Truss at once.

# Concurrency Target

The concurrency target is set in the Baseten UI, and to re-iterate, governs the maximum number of requests that will be sent
to a single model replica.

<Frame>
<img src="/images/concurrency-target-picture.png" />
</Frame>

An important note about this setting is that it is also used as a part of the auto-scaling parameters. If all replicas have
hit their Concurrency Target, this triggers Baseten's autoscaling.

Let's dive into a concrete example:

<Frame>
<img src="/images/concurrency-flow-chart-high-level.png" />
</Frame>

Let's say that there is a single replica of a model, and the concurrency target is 2. If 5 requests come in, the first 2 will
be sent to the replica, and the other 3 get queued up. Once the requests on the container complete the queued up
requests will make it to the model container.

<Note>
Remember that if all replicas have hit their concurrency target, this will trigger autoscaling. So in this specific example,
the queuing of requests 3-5 will trigger another replica to come up, if the model has not hit its max replicas yet.
</Note>


# Predict Concurrency

Alright, so we've talked about the **Concurreny Target** feature that governs how many requests will be sent to a model at once.
predict concurrency is a bit different -- it operates on the level of the model container and governs how many requests will go
through the `predict` function concurrently.

To get a sense for why this matters, let's recap the structure of a Truss:

```python model.py
class Model:

def __init__(self):
...

def preprocess(self, request):
...

def predict(self, request):
...

def postprocess(self, response):
...
```

In this Truss model, there are three functions that are called in order to serve a request:
* **preprocess** -- this function is used to perform any prework / modifications on the request before the `predict` function
runs. For instance, if you are running an image classification model, and need to download images from S3, this is a good placeholder
to do it.
* **predict** -- this function is where the actual inference happens. It is likely where the logic that runs GPU code lives
* **postprocess** -- this function is used to perform any postwork / modifications on the response before it is returned to the
user. For instance, if you are running a text-to-image model, this is a good place to implement the logic for uploading an image
to S3.

You can see with these three functions and the behaviors that they are used for that you might want to have different
levels of concurrency for the `predict` function. The most common need here is to limit access to the GPU, since multiple
requests running on the GPU at the same time could cause serious degradation in performance.

Unlike **Concurrency Target**, which is configured in the Baseten UI, the **Predict Concurrency** is configured as a part
of the Truss Config (in the `config.yaml` file).

```yaml config.yaml
model_name: "My model with concurrency limits"
...
runtime:
predict_concurrency: 2 # the default is 1
...
```

To better understand this, let's use a specific example:

<Frame>
<img src="/images/concurrency-flow-model-pod.png" />
</Frame>

Let's say predict concurrency is 1.
1. Two requests come in to the pod.
2. Both requests will begin preprocessing immediately (let's say,
downloading images from S3).
3. Once the first request finishes preprocessing, it will begin running on the GPU. The second request
will then remain queued until the first request finishes running on the GPU in predict.
4. After the first request finishes, the second request will begin being processed on the GPU
5. Once the second request finishes, it will begin postprocessing, even if the first request is not done postprocessing

To reiterate, predict concurrency is really great to use if you want to protect your GPU resource on your model pod,
while still allowing for high concurrency for the pre and post-process steps.

<Note>
Remember that to actually achieve the predict concurrency you desire, the Concurrency Target must be at least that amount,
so that the requests make it to the model container.
</Note>
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/images/concurrency-flow-model-pod.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/images/concurrency-target-picture.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
3 changes: 2 additions & 1 deletion docs/mint.json
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,8 @@
"group": "Guides",
"pages": [
"guides/secrets",
"guides/base-images"
"guides/base-images",
"guides/concurrency"
]
},
{
Expand Down

0 comments on commit 59fda14

Please sign in to comment.