---
title: "How to configure concurrency"
description: "A guide to setting concurrency for your model"
---

Configuring concurrency is one of the major knobs available for getting the most performance
out of your model. In this doc, we'll cover the options that are available to you.

# What is concurrency, and why configure it?

At a very high level, "concurrency" in this context refers to how many requests a single replica can
process at the same time. There is no universally right value for this number -- the specifics
of your model and the metrics you are optimizing for (throughput? latency?) determine what works best.

In Baseten and Truss, there are two notions of concurrency:
* **Concurrency Target** -- the number of requests that will be sent to a model replica at the same time.
* **Predict Concurrency** -- once requests have made it onto the model container, the predict concurrency governs how many
requests can go through the `predict` function on your Truss at once.

# Concurrency Target

The Concurrency Target is set in the Baseten UI and, to reiterate, governs the maximum number of requests that will be sent
to a single model replica.

<Frame>
<img src="/images/concurrency-target-picture.png" />
</Frame>

An important note about this setting is that it also feeds into the autoscaling parameters: if all replicas have
hit their Concurrency Target, Baseten's autoscaling is triggered.

Let's dive into a concrete example:

<Frame>
<img src="/images/concurrency-flow-chart-high-level.png" />
</Frame>

Let's say that there is a single replica of a model, and the Concurrency Target is 2. If 5 requests come in, the first 2 will
be sent to the replica, and the other 3 are queued. As the requests on the container complete, queued
requests are released to the model container.

<Note>
Remember that if all replicas have hit their Concurrency Target, this triggers autoscaling. So in this specific example,
the queuing of requests 3-5 will cause another replica to come up, as long as the model has not hit its max replica count yet.
</Note>

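To build intuition for this queuing behavior, here is a minimal, purely illustrative simulation. This is not how Baseten
implements the Concurrency Target (that happens server-side); it just uses an `asyncio.Semaphore` to stand in for a single
replica with a target of 2:

```python
import asyncio

CONCURRENCY_TARGET = 2  # hypothetical per-replica concurrency target

async def handle_request(request_id: int, replica_slots: asyncio.Semaphore) -> None:
    async with replica_slots:  # requests beyond the target wait here, i.e. they queue
        print(f"request {request_id} started")
        await asyncio.sleep(1.0)  # stand-in for one second of model inference
        print(f"request {request_id} finished")

async def main() -> None:
    replica_slots = asyncio.Semaphore(CONCURRENCY_TARGET)
    # 5 requests arrive at once: the first 2 run immediately, the other 3 queue
    await asyncio.gather(*(handle_request(i, replica_slots) for i in range(1, 6)))

asyncio.run(main())
```

Running this, requests 1 and 2 finish after about a second, and requests 3-5 only start once slots free up -- the same
kind of backlog that, on Baseten, signals the autoscaler to add a replica.
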
# Predict Concurrency

So far, we've covered the **Concurrency Target**, which governs how many requests will be sent to a model replica at once.
Predict Concurrency is a bit different -- it operates at the level of the model container and governs how many requests can go
through the `predict` function concurrently.

To get a sense for why this matters, let's recap the structure of a Truss:

```python model.py
class Model:

    def __init__(self):
        ...

    def preprocess(self, request):
        ...

    def predict(self, request):
        ...

    def postprocess(self, response):
        ...
```

In this Truss model, there are three functions that are called in order to serve a request:
* **preprocess** -- this function is used to perform any prework / modifications on the request before the `predict` function
runs. For instance, if you are running an image classification model and need to download images from S3, this is a good place
to do it.
* **predict** -- this function is where the actual inference happens. It is typically where the logic that runs on the GPU lives.
* **postprocess** -- this function is used to perform any postwork / modifications on the response before it is returned to the
user. For instance, if you are running a text-to-image model, this is a good place to implement the logic for uploading an image
to S3. A sketch of what this might look like in practice follows this list.

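For illustration only, here is a hypothetical image-classification Truss that fills in those three steps. The request fields,
the placeholder prediction, and the assumption that `requests` and `Pillow` are available in the environment are all made up
for this example, not part of any real deployment:

```python model.py
import io

import requests  # assumed to be listed in the Truss's requirements
from PIL import Image


class Model:
    def __init__(self, **kwargs):
        self._model = None  # weights would be loaded at startup; omitted in this sketch

    def preprocess(self, request):
        # Network/CPU-bound: download the image named in the request before predict runs.
        image_bytes = requests.get(request["image_url"], timeout=10).content
        request["image"] = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        return request

    def predict(self, request):
        # GPU-bound step in a real model; here we just return the image size
        # so the sketch stays self-contained.
        width, height = request["image"].size
        return {"label": "placeholder", "input_size": [width, height]}

    def postprocess(self, response):
        # Network/CPU-bound: e.g. upload results to S3 or reshape the response.
        response["status"] = "ok"
        return response
```
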
With these three functions and the behaviors they are used for in mind, you can see why you might want a different
level of concurrency for the `predict` function. The most common need here is to limit access to the GPU, since multiple
requests running on the GPU at the same time could seriously degrade performance.

Unlike **Concurrency Target**, which is configured in the Baseten UI, **Predict Concurrency** is configured as part
of the Truss config (in the `config.yaml` file).

```yaml config.yaml
model_name: "My model with concurrency limits"
...
runtime:
  predict_concurrency: 2 # the default is 1
...
```

To better understand this, let's use a specific example:

<Frame>
<img src="/images/concurrency-flow-model-pod.png" />
</Frame>

Let's say predict concurrency is 1.
1. Two requests come in to the pod.
2. Both requests begin preprocessing immediately (let's say, downloading images from S3).
3. Once the first request finishes preprocessing, it begins running on the GPU in `predict`. The second request
remains queued until the first request finishes running on the GPU.
4. After the first request finishes `predict`, the second request begins being processed on the GPU.
5. Once the second request finishes `predict`, it begins postprocessing, even if the first request is not done postprocessing yet.

A minimal sketch of this gating pattern follows below.

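The following is an illustrative sketch, not Truss's internal implementation: it mimics the behavior above by gating only the
predict step behind an `asyncio.Semaphore` sized to `predict_concurrency`, while preprocessing and postprocessing run
unrestricted.

```python
import asyncio

PREDICT_CONCURRENCY = 1  # mirrors runtime.predict_concurrency in config.yaml

async def preprocess(request_id: int) -> None:
    print(f"request {request_id}: preprocessing (e.g. downloading from S3)")
    await asyncio.sleep(0.5)  # simulated I/O; runs concurrently for all requests

async def predict(request_id: int, predict_slots: asyncio.Semaphore) -> None:
    async with predict_slots:  # only `predict_concurrency` requests on the GPU at once
        print(f"request {request_id}: running predict")
        await asyncio.sleep(1.0)  # simulated GPU work

async def postprocess(request_id: int) -> None:
    print(f"request {request_id}: postprocessing (e.g. uploading results)")
    await asyncio.sleep(0.5)  # simulated I/O; also unrestricted

async def handle(request_id: int, predict_slots: asyncio.Semaphore) -> None:
    await preprocess(request_id)
    await predict(request_id, predict_slots)
    await postprocess(request_id)

async def main() -> None:
    predict_slots = asyncio.Semaphore(PREDICT_CONCURRENCY)
    # Two requests arrive together: both preprocess immediately,
    # but predict is serialized by the semaphore.
    await asyncio.gather(handle(1, predict_slots), handle(2, predict_slots))

asyncio.run(main())
```
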
To reiterate, predict concurrency is a great tool for protecting the GPU on your model pod
while still allowing high concurrency for the preprocessing and postprocessing steps.

<Note>
Remember that to actually achieve your desired predict concurrency, the Concurrency Target must be at least that amount,
so that enough requests make it to the model container.
</Note>