diff --git a/docs/examples/performance/tgi-server.mdx b/docs/examples/performance/tgi-server.mdx
index 6135474fb..fe50a3cb3 100644
--- a/docs/examples/performance/tgi-server.mdx
+++ b/docs/examples/performance/tgi-server.mdx
@@ -20,35 +20,35 @@ This example will cover:
 Get started by creating a new Truss:
 
 ```sh
-truss init --backend TGI opt125
+truss init --backend TGI falcon-7b
 ```
 
 You're going to see a couple of prompts. Follow along with the instructions below:
 
-1. Type `facebook/opt-125M` when prompted for `model`.
+1. Type `tiiuae/falcon-7b` when prompted for `model`.
 2. Press the `tab` key when prompted for `endpoint`. Select the `generate_stream` endpoint.
-3. Give your model a name like `OPT-125M`.
+3. Give your model a name like `Falcon 7B`.
 
 Finally, navigate to the directory:
 
 ```sh
-cd opt125
+cd falcon-7b
 ```
 
 ### Step 2: Setting resources and other arguments
 
 You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.
 
-OPT-125M will need a GPU so let's set the correct resources. Update the `resources` key with the following:
+Falcon 7B needs a GPU, so let's set the correct resources. Update the `resources` key with the following:
 
 ```yaml config.yaml
 resources:
-  accelerator: T4
+  accelerator: A10G
   cpu: "4"
   memory: 16Gi
   use_gpu: true
 ```
 
-Also notice the `build` key which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying vLLM server which you can find [here](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py).
+Also notice the `build` key, which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying [TGI server](https://github.com/huggingface/text-generation-inference).
 
 ### Step 3: Deploy the model
@@ -56,14 +56,14 @@ Also notice the `build` key which contains the `model_server` we're using as wel
 
 You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
 
-Let's deploy our OPT-125M vLLM model.
+Let's deploy our Falcon 7B TGI model.
 
 ```sh
 truss push
 ```
 
 You can invoke the model with:
 
 ```sh
-truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "sample": true}} --published'
+truss predict -d '{"inputs": "What is a large language model?", "parameters": {"max_new_tokens": 128, "do_sample": true}}' --published
 ```
@@ -74,16 +74,16 @@ truss predict -d '{"inputs": "What is a large language model?", "parameters": {"
 build:
   arguments:
     endpoint: generate_stream
-    model: facebook/opt-125M
+    model: tiiuae/falcon-7b
     model_server: TGI
 environment_variables: {}
 external_package_dirs: []
 model_metadata: {}
-model_name: OPT-125M
+model_name: Falcon 7B
 python_version: py39
 requirements: []
 resources:
-  accelerator: T4
+  accelerator: A10G
   cpu: "4"
   memory: 16Gi
   use_gpu: true
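
For a quick end-to-end check of the `generate_stream` endpoint this diff wires up, the deployed model can also be called over plain HTTP rather than through `truss predict`. The sketch below is illustrative, not part of the docs change: it assumes Baseten's standard invocation URL pattern, an API key in a `BASETEN_API_KEY` environment variable, and a placeholder model ID (substitute the one shown after `truss push` completes).

```python
# A minimal sketch of streaming tokens from the deployed model over HTTP.
# MODEL_ID is a hypothetical placeholder; the URL pattern below assumes
# Baseten's standard model-invocation endpoint.
import os

import requests

MODEL_ID = "abc123"  # substitute your deployed model's ID
API_KEY = os.environ["BASETEN_API_KEY"]

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json={
        "inputs": "What is a large language model?",
        "parameters": {"max_new_tokens": 128, "do_sample": True},
    },
    stream=True,  # don't buffer: generate_stream returns tokens incrementally
)
resp.raise_for_status()

# TGI's streaming endpoint speaks server-sent events, so each non-empty
# line typically looks like `data: {"token": ...}`; parse as needed.
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8"), flush=True)
```

Passing `stream=True` keeps `requests` from buffering the whole response, which is the point of choosing `generate_stream` over the non-streaming `generate` endpoint.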