---
title: "How to configure concurrency"
description: "A guide to setting concurrency for your model"
---

Configuring concurrency is one of the major knobs available for getting the most performance
out of your model. In this doc, we'll cover the options that are available to you.

# What is concurrency, and why configure it?

At a very high level, "concurrency" in this context refers to how many requests a single replica can
process at the same time. There is no universally right value for this number -- the specifics
of your model and the metrics you are optimizing for (throughput? latency?) determine what works best.

In Baseten and Truss, there are two notions of concurrency:
* **Concurrency Target** -- the number of requests that will be sent to a model replica at the same time.
* **Predict Concurrency** -- once requests have made it onto the model container, the predict concurrency governs how many
requests can go through the `predict` function on your Truss at once.

# Concurrency Target

The Concurrency Target is set in the Baseten UI and, to reiterate, governs the maximum number of requests that will be sent
to a single model replica.

<Frame>
<img src="/images/concurrency-target-picture.png" />
</Frame>

An important note about this setting is that it also feeds into the autoscaling parameters: if all replicas have
hit their Concurrency Target, Baseten's autoscaling is triggered.

Let's dive into a concrete example:

<Frame>
<img src="/images/concurrency-flow-chart-high-level.png" />
</Frame>

Let's say that there is a single replica of a model, and the Concurrency Target is 2. If 5 requests come in, the first 2 will
be sent to the replica, and the other 3 are queued. As the requests on the container complete, queued
requests are released to the model container.

<Note>
Remember that if all replicas have hit their Concurrency Target, this triggers autoscaling. So in this specific example,
the queuing of requests 3-5 will cause another replica to come up, as long as the model has not hit its max replica count yet.
</Note>

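To build intuition for this queuing behavior, here is a minimal, purely illustrative simulation. This is not how Baseten
implements the Concurrency Target (that happens server-side); it just uses an `asyncio.Semaphore` to stand in for a single
replica with a target of 2:

```python
import asyncio

CONCURRENCY_TARGET = 2  # hypothetical per-replica concurrency target

async def handle_request(request_id: int, replica_slots: asyncio.Semaphore) -> None:
    async with replica_slots:  # requests beyond the target wait here, i.e. they queue
        print(f"request {request_id} started")
        await asyncio.sleep(1.0)  # stand-in for one second of model inference
        print(f"request {request_id} finished")

async def main() -> None:
    replica_slots = asyncio.Semaphore(CONCURRENCY_TARGET)
    # 5 requests arrive at once: the first 2 run immediately, the other 3 queue
    await asyncio.gather(*(handle_request(i, replica_slots) for i in range(1, 6)))

asyncio.run(main())
```

Running this, requests 1 and 2 finish after about a second, and requests 3-5 only start once slots free up -- the same
kind of backlog that, on Baseten, signals the autoscaler to add a replica.
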
# Predict Concurrency

So far, we've covered the **Concurrency Target**, which governs how many requests will be sent to a model replica at once.
Predict Concurrency is a bit different -- it operates at the level of the model container and governs how many requests can go
through the `predict` function concurrently.

To get a sense for why this matters, let's recap the structure of a Truss:

```python model.py
class Model:

    def __init__(self):
        ...

    def preprocess(self, request):
        ...

    def predict(self, request):
        ...

    def postprocess(self, response):
        ...
```

In this Truss model, there are three functions that are called in order to serve a request:
* **preprocess** -- this function is used to perform any prework / modifications on the request before the `predict` function
runs. For instance, if you are running an image classification model and need to download images from S3, this is a good place
to do it.
* **predict** -- this function is where the actual inference happens. It is typically where the logic that runs on the GPU lives.
* **postprocess** -- this function is used to perform any postwork / modifications on the response before it is returned to the
user. For instance, if you are running a text-to-image model, this is a good place to implement the logic for uploading an image
to S3. A sketch of what this might look like in practice follows this list.

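For illustration only, here is a hypothetical image-classification Truss that fills in those three steps. The request fields,
the placeholder prediction, and the assumption that `requests` and `Pillow` are available in the environment are all made up
for this example, not part of any real deployment:

```python model.py
import io

import requests  # assumed to be listed in the Truss's requirements
from PIL import Image


class Model:
    def __init__(self, **kwargs):
        self._model = None  # weights would be loaded at startup; omitted in this sketch

    def preprocess(self, request):
        # Network/CPU-bound: download the image named in the request before predict runs.
        image_bytes = requests.get(request["image_url"], timeout=10).content
        request["image"] = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        return request

    def predict(self, request):
        # GPU-bound step in a real model; here we just return the image size
        # so the sketch stays self-contained.
        width, height = request["image"].size
        return {"label": "placeholder", "input_size": [width, height]}

    def postprocess(self, response):
        # Network/CPU-bound: e.g. upload results to S3 or reshape the response.
        response["status"] = "ok"
        return response
```
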
With these three functions and the behaviors they are used for in mind, you can see why you might want a different
level of concurrency for the `predict` function. The most common need here is to limit access to the GPU, since multiple
requests running on the GPU at the same time could seriously degrade performance.

Unlike **Concurrency Target**, which is configured in the Baseten UI, **Predict Concurrency** is configured as part
of the Truss config (in the `config.yaml` file).

```yaml config.yaml
model_name: "My model with concurrency limits"
...
runtime:
  predict_concurrency: 2 # the default is 1
...
```

To better understand this, let's use a specific example:

<Frame>
<img src="/images/concurrency-flow-model-pod.png" />
</Frame>

Let's say predict concurrency is 1.
1. Two requests come in to the pod.
2. Both requests begin preprocessing immediately (let's say, downloading images from S3).
3. Once the first request finishes preprocessing, it begins running on the GPU in `predict`. The second request
remains queued until the first request finishes running on the GPU.
4. After the first request finishes `predict`, the second request begins being processed on the GPU.
5. Once the second request finishes `predict`, it begins postprocessing, even if the first request is not done postprocessing yet.

A minimal sketch of this gating pattern follows below.

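The following is an illustrative sketch, not Truss's internal implementation: it mimics the behavior above by gating only the
predict step behind an `asyncio.Semaphore` sized to `predict_concurrency`, while preprocessing and postprocessing run
unrestricted.

```python
import asyncio

PREDICT_CONCURRENCY = 1  # mirrors runtime.predict_concurrency in config.yaml

async def preprocess(request_id: int) -> None:
    print(f"request {request_id}: preprocessing (e.g. downloading from S3)")
    await asyncio.sleep(0.5)  # simulated I/O; runs concurrently for all requests

async def predict(request_id: int, predict_slots: asyncio.Semaphore) -> None:
    async with predict_slots:  # only `predict_concurrency` requests on the GPU at once
        print(f"request {request_id}: running predict")
        await asyncio.sleep(1.0)  # simulated GPU work

async def postprocess(request_id: int) -> None:
    print(f"request {request_id}: postprocessing (e.g. uploading results)")
    await asyncio.sleep(0.5)  # simulated I/O; also unrestricted

async def handle(request_id: int, predict_slots: asyncio.Semaphore) -> None:
    await preprocess(request_id)
    await predict(request_id, predict_slots)
    await postprocess(request_id)

async def main() -> None:
    predict_slots = asyncio.Semaphore(PREDICT_CONCURRENCY)
    # Two requests arrive together: both preprocess immediately,
    # but predict is serialized by the semaphore.
    await asyncio.gather(handle(1, predict_slots), handle(2, predict_slots))

asyncio.run(main())
```
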
To reiterate, predict concurrency is a great tool for protecting the GPU on your model pod
while still allowing high concurrency for the preprocessing and postprocessing steps.

<Note>
Remember that to actually achieve your desired predict concurrency, the Concurrency Target must be at least that amount,
so that enough requests make it to the model container.
</Note>