diff --git a/docs/guides/concurrency.mdx b/docs/guides/concurrency.mdx
new file mode 100644
index 000000000..ac0689293
--- /dev/null
+++ b/docs/guides/concurrency.mdx
@@ -0,0 +1,117 @@
+---
+title: "How to configure concurrency"
+description: "A guide to setting concurrency for your model"
+---
+
+Configuring concurrency is one of the major knobs available for getting the most performance
+out of your model. In this doc, we'll cover the options available to you.
+
+# What is concurrency, and why configure it?
+
+At a very high level, "concurrency" in this context refers to how many requests a single replica can
+process at the same time. There is no single right answer for what this number ought to be -- the specifics
+of your model and the metrics you are optimizing for (throughput? latency?) matter a lot in determining it.
+
+In Baseten & Truss, there are two notions of concurrency:
+* **Concurrency Target** -- the maximum number of requests that will be sent to a model replica at the same time.
+* **Predict Concurrency** -- once requests have made it onto the model container, governs how many of them
+can go through the `predict` function on your Truss at once.
+
+# Concurrency Target
+
+The concurrency target is set in the Baseten UI and, to reiterate, governs the maximum number of requests that will be sent
+to a single model replica.
+
+<Frame>
+<img src="/images/concurrency-target-picture.png" />
+</Frame>
+
+An important note about this setting is that it is also used as part of the autoscaling parameters. If all replicas have
+hit their Concurrency Target, this triggers Baseten's autoscaling.
+
+Let's dive into a concrete example:
+
+<Frame>
+<img src="/images/concurrency-flow-chart-high-level.png" />
+</Frame>
+
+Let's say that there is a single replica of a model, and the concurrency target is 2. If 5 requests come in, the first 2 will
+be sent to the replica, and the other 3 are queued up. As requests on the container complete, the queued
+requests are sent to the model container.
+
+<Note>
+Remember that if all replicas have hit their concurrency target, this triggers autoscaling. So in this specific example,
+the queuing of requests 3-5 will cause another replica to come up, if the model has not yet hit its max replicas.
+</Note>
+
+# Predict Concurrency
+
+Alright, so we've talked about the **Concurrency Target** feature, which governs how many requests will be sent to a model at once.
+**Predict Concurrency** is a bit different -- it operates at the level of the model container and governs how many requests
+go through the `predict` function concurrently.
+
+To get a sense of why this matters, let's recap the structure of a Truss:
+
+```python model.py
+class Model:
+
+    def __init__(self):
+        ...
+
+    def preprocess(self, request):
+        ...
+
+    def predict(self, request):
+        ...
+
+    def postprocess(self, response):
+        ...
+```
+
+In this Truss model, three functions are called in order to serve a request:
+* **preprocess** -- performs any prework / modifications on the request before the `predict` function
+runs. For instance, if you are running an image classification model and need to download images from S3, this is a good
+place to do it.
+* **predict** -- where the actual inference happens. This is most likely where your GPU code runs.
+* **postprocess** -- performs any postwork / modifications on the response before it is returned to the
+user. For instance, if you are running a text-to-image model, this is a good place to upload the generated image
+to S3.
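+
+To make these roles concrete, here is a minimal sketch of the image-classification example from the list above. The
+bucket name and the `run_classifier` helper are hypothetical stand-ins for your own storage and GPU inference code --
+the point is which kind of work lands in which function:
+
+```python model.py
+import tempfile
+
+import boto3  # assumed dependency for the S3 example
+
+
+class Model:
+
+    def __init__(self):
+        self._s3 = boto3.client("s3")
+
+    def preprocess(self, request):
+        # I/O-bound prework: download the input image from S3
+        # ("my-input-images" is a hypothetical bucket name).
+        local_path = tempfile.mktemp(suffix=".jpg")
+        self._s3.download_file("my-input-images", request["image_key"], local_path)
+        return {"image_path": local_path}
+
+    def predict(self, request):
+        # GPU-bound inference -- the step that predict concurrency can protect.
+        # `run_classifier` is a hypothetical stand-in for the model's forward pass.
+        return {"label": run_classifier(request["image_path"])}
+
+    def postprocess(self, response):
+        # Lightweight postwork: reshape or enrich the response before returning it.
+        response["status"] = "complete"
+        return response
+```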
+
+Given these three functions and what each is used for, you can see why you might want a different level of
+concurrency for the `predict` function than for the other two. The most common need here is to limit access to the GPU,
+since multiple requests running on the GPU at the same time can seriously degrade performance.
+
+Unlike **Concurrency Target**, which is configured in the Baseten UI, **Predict Concurrency** is configured as part
+of the Truss config (in the `config.yaml` file):
+
+```yaml config.yaml
+model_name: "My model with concurrency limits"
+...
+runtime:
+  predict_concurrency: 2 # the default is 1
+...
+```
+
+To better understand this, let's walk through a specific example:
+
+<Frame>
+<img src="/images/concurrency-flow-model-pod.png" />
+</Frame>
+
+Let's say predict concurrency is 1.
+1. Two requests come in to the pod.
+2. Both requests begin preprocessing immediately (let's say, downloading images from S3).
+3. Once the first request finishes preprocessing, it begins running on the GPU. The second request
+remains queued until the first request finishes running `predict` on the GPU.
+4. After the first request finishes `predict`, the second request begins running on the GPU.
+5. Once the second request finishes `predict`, it begins postprocessing, even if the first request has not
+finished postprocessing yet.
+
+To reiterate, predict concurrency is really useful when you want to protect the GPU on your model pod while
+still allowing high concurrency for the preprocessing and postprocessing steps.
+
+<Note>
+Remember that to actually achieve the predict concurrency you desire, the Concurrency Target must be at least that
+high, so that enough requests make it to the model container.
+</Note>
diff --git a/docs/images/concurrency-flow-chart-high-level.png b/docs/images/concurrency-flow-chart-high-level.png
new file mode 100644
index 000000000..760a2a539
Binary files /dev/null and b/docs/images/concurrency-flow-chart-high-level.png differ
diff --git a/docs/images/concurrency-flow-model-pod.png b/docs/images/concurrency-flow-model-pod.png
new file mode 100644
index 000000000..8e2e9614d
Binary files /dev/null and b/docs/images/concurrency-flow-model-pod.png differ
diff --git a/docs/images/concurrency-target-picture.png b/docs/images/concurrency-target-picture.png
new file mode 100644
index 000000000..3e1e4528d
Binary files /dev/null and b/docs/images/concurrency-target-picture.png differ
diff --git a/docs/mint.json b/docs/mint.json
index 9aa6bfb6d..d8b709887 100644
--- a/docs/mint.json
+++ b/docs/mint.json
@@ -68,7 +68,8 @@
       "group": "Guides",
       "pages": [
         "guides/secrets",
-        "guides/base-images"
+        "guides/base-images",
+        "guides/concurrency"
       ]
     },
     {