This codelab simulates a scenario in which a startup CEO wants to build a cloud-native intelligent app on top of an open-source large language model. In particular, they want to quickly test and compare different cloud providers to find the best price-performance.

In this codelab, you will follow a step-by-step guide to experiment with state-of-the-art hardware like NVIDIA A100 GPUs, large language models like Meta's Llama 3.1, and serving software like vLLM. You'll leverage cloud-native technologies like Terraform, Docker, and Linux Bash on major cloud providers such as Azure and AWS.
- Bash (Unix shell) is required to execute the commands in this codelab.
- Azure Cloud Shell is recommended. (Note: it is also highly recommended to mount a storage account in case the browser is accidentally closed; follow the instructions here.) Alternatively, macOS and Ubuntu are supported.
In your lab environment, clone the repository and enter the directory:

```bash
git clone https://github.com/Azure-Samples/compete-labs
cd compete-labs
```
Install dependencies, authenticate, and initialize the environment by running the command below:

```bash
source scripts/init.sh
```

Then pick the cloud provider you want to test. For Azure:

```bash
export CLOUD=azure
export REGION=eastus2
```

Or, for AWS:

```bash
export CLOUD=aws
export REGION=us-west-2
```
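Before provisioning, it can be worth verifying that authentication succeeded. Below is a minimal sanity check, assuming `init.sh` has set up the Azure CLI and AWS CLI (an assumption, not part of the lab scripts):

```bash
# Optional sanity check: confirm credentials for the chosen cloud are active.
# Assumes the Azure CLI (az) and AWS CLI (aws) are available, e.g. via scripts/init.sh.
if [ "$CLOUD" = "azure" ]; then
  az account show --query name -o tsv                         # prints the active subscription name
else
  aws sts get-caller-identity --query Account --output text   # prints the AWS account ID
fi
```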
Provision the infrastructure resources, such as the GPU Virtual Machine:

```bash
source scripts/resources.sh provision $CLOUD $REGION
```
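If you want to see what was created, you can list the compute instances in the target region. This is an optional spot check, not part of the lab; the exact resource names and tags are chosen by the provisioning script:

```bash
# Optional spot check: list compute instances visible to your account.
if [ "$CLOUD" = "azure" ]; then
  az vm list --show-details --output table
else
  aws ec2 describe-instances --region $REGION \
    --query "Reservations[].Instances[].{Id:InstanceId,Type:InstanceType,State:State.Name}" \
    --output table
fi
```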
Deploy the LLM-backed inferencing server using Docker:

```bash
source scripts/server.sh deploy $CLOUD
```
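For context, a deployment like this typically boils down to running vLLM's OpenAI-compatible server in a container. The sketch below is illustrative only, assuming the public `vllm/vllm-openai` image and a Hugging Face token in `$HF_TOKEN`; the script's actual image, flags, and model may differ:

```bash
# Illustrative only: roughly what a containerized vLLM deployment looks like.
# Requires the NVIDIA Container Toolkit for --gpus all.
docker run --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```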
Download the Llama 3.1 8B model from Hugging Face, load it into the GPUs, and start the HTTP server:

```bash
source scripts/server.sh start $CLOUD
```
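Downloading and loading the model can take several minutes. If you want to poll readiness yourself, vLLM's OpenAI-compatible server exposes a `/health` endpoint; the sketch below assumes it listens on port 8000 and is run on the VM (or replace `localhost` with the VM's address):

```bash
# Optional: poll until the server reports healthy (assumes port 8000).
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for server..."
  sleep 10
done
echo "server is ready"
```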
Send some prompt requests to the HTTP server to test the chat completions endpoint:

```bash
source scripts/server.sh test $CLOUD
```
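For reference, a raw request to the OpenAI-compatible chat completions endpoint looks roughly like this, assuming port 8000 and that the model name matches what the server loaded:

```bash
# Illustrative request against vLLM's OpenAI-compatible API.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```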
Clean up the infrastructure resources, such as the GPU Virtual Machine:

```bash
source scripts/resources.sh cleanup $CLOUD $REGION
```
Collect and upload the test results to Azure Data Explorer:

```bash
source scripts/publish.sh $CLOUD
```

Finally, check out the aggregated and visualized test results on the dashboard.