Run Petals server on Windows
You can use WSL or Docker to run Petals on Windows. In this guide, we will show how to set up Petals on WSL (Windows Subsystem for Linux).
- This tutorial works on Windows 10-11 and NVIDIA GPUs with driver version >= 495 (this requirement is usually met for fresh installations). If you have an AMD GPU, follow this guide until you run into errors, then switch to the patched version of Petals mentioned in the AMD GPU tutorial.
- Launch Windows PowerShell as an Administrator and install WSL 2:
wsl --install
If you previously had WSL 1, please upgrade as explained here.
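If you are not sure which WSL version your distribution uses, a quick check and upgrade from PowerShell might look like this (the distribution name `Ubuntu` is an assumption, adjust it to match the output of the first command):

```powershell
# List installed distributions and the WSL version each one uses
wsl -l -v

# Convert a WSL 1 distribution (assumed here to be named "Ubuntu") to WSL 2
wsl --set-version Ubuntu 2
```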
- Open WSL and check that your GPUs are available:
nvidia-smi
- In WSL, install basic Python stuff:
sudo apt update
sudo apt install python3-pip python-is-python3
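To confirm that the tools landed inside WSL, you can quickly check the versions (thanks to `python-is-python3`, the `python` command should now point to Python 3):

```bash
# Both commands should succeed and report Python 3.x and a matching pip
python --version
pip --version
```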
- Then, install Petals:
python -m pip install git+https://github.com/bigscience-workshop/petals
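Before starting the server, it may be worth verifying that Petals imports cleanly and that PyTorch can see your GPU through WSL. A minimal sanity check (not required by Petals itself) could be:

```bash
# Check that petals imports and that CUDA is visible to PyTorch inside WSL
python -c "import petals, torch; print('CUDA available:', torch.cuda.is_available())"
```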
- Run the Petals server:
python -m petals.cli.run_server petals-team/StableBeluga2
This will host a part of Stable Beluga 2 on your machine. You can also host `meta-llama/Llama-2-70b-hf`, `meta-llama/Llama-2-70b-chat-hf`, repos with LLaMA-65B, `bigscience/bloom`, `bigscience/bloomz`, and other compatible models from 🤗 Model Hub, or add support for new model architectures.

❓ Got an error? Check out the "Troubleshooting" page. Most errors are covered there and are easy to fix, including:
- `hivemind.dht.protocol.ValidationError: local time must be within 3 seconds of others`
- `Killed`
- `torch.cuda.OutOfMemoryError: CUDA out of memory`
- If you have an error about `auth_token`, see the "Want to host LLaMA 2?" section below.
- If your error is not covered there, let us know in Discord and we will help!
🦙 Want to host LLaMA 2? Request access to its weights at the ♾️ Meta AI website and 🤗 Model Hub, generate an 🔑 access token, then use this command:
python -m petals.cli.run_server meta-llama/Llama-2-70b-chat-hf --token YOUR_TOKEN_HERE
💪 Want to share multiple GPUs? In this case, you'd need to run a separate Petals server for each GPU. Open a separate WSL console for each GPU, then run this in the first console:
CUDA_VISIBLE_DEVICES=0 python -m petals.cli.run_server petals-team/StableBeluga2
Do the same for each console, replacing `CUDA_VISIBLE_DEVICES=0` with `CUDA_VISIBLE_DEVICES=1`, `CUDA_VISIBLE_DEVICES=2`, etc.
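For example, with two GPUs the second console would run the same command pinned to the second device:

```bash
# Second console: serve the same model on GPU 1
CUDA_VISIBLE_DEVICES=1 python -m petals.cli.run_server petals-team/StableBeluga2
```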
- Once all blocks are loaded, check that your server is available at https://health.petals.dev. If your server is listed as available through a "Relay", please read the section below.
If you have a NAT or a firewall, Petals will use relays for NAT/firewall traversal by default, which negatively impacts performance. If your computer has a public IP address, we strongly recommend setting up port forwarding to make the server available directly. We explain how to do this below.
- Create the `.wslconfig` file in your user's home directory with the following contents:
[wsl2]
localhostforwarding=true
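Note that changes to `.wslconfig` take effect only after WSL is restarted. You can force a restart from PowerShell:

```powershell
# Shut down all WSL instances so the new .wslconfig is picked up on the next launch
wsl --shutdown
```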
- In WSL, find out the IP address of your WSL container (`172.X.X.X`):
sudo apt install net-tools
ifconfig
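If you prefer not to install `net-tools`, the same address can usually be obtained with built-in tools (assuming the WSL network interface is named `eth0`, which is the default in WSL 2):

```bash
# Print the IP address(es) assigned to the WSL container
hostname -I

# Or inspect the eth0 interface directly
ip addr show eth0
```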
- On the Windows side (e.g., in PowerShell run as Administrator), allow traffic to be routed into the WSL container (replace `172.X.X.X` with the IP address you found in the previous step):
netsh interface portproxy add v4tov4 listenport=31330 listenaddress=0.0.0.0 connectport=31330 connectaddress=172.X.X.X
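You can verify that the port proxy rule was registered, and remove it later if you no longer need it, with the same `netsh` tool:

```powershell
# List all active v4-to-v4 port proxy rules
netsh interface portproxy show v4tov4

# Remove the rule once you stop hosting the server
netsh interface portproxy delete v4tov4 listenport=31330 listenaddress=0.0.0.0
```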
- Set up your firewall (e.g., Windows Defender) to allow traffic from the outside world to port 31330/tcp.
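If you use the built-in Windows Defender firewall, one way to do this is from PowerShell run as Administrator (a sketch; the rule name is arbitrary):

```powershell
# Allow inbound TCP connections to port 31330 for the Petals server
New-NetFirewallRule -DisplayName "Petals server" -Direction Inbound -Protocol TCP -LocalPort 31330 -Action Allow
```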
- If you have a router, set it up to allow connections from the outside world (port 31330/tcp) to your computer (port 31330/tcp).
- Run the Petals server with the parameters `--public_ip YOUR_PUBLIC_IP --port 31330`:
python -m petals.cli.run_server petals-team/StableBeluga2 --public_ip YOUR_PUBLIC_IP --port 31330
- Ensure that the server prints `This server is available directly` (not `via relays`) after startup.