Example (single & multinode) (v0.0.6 ‐ v0.0.10)
Setepenre edited this page Mar 13, 2024 · 2 revisions
- 2 DGX 8xA100 SXM 80GB (the example uses the hostnames cn-d003 and cn-d004)
The script below configures milabench for both single-node and multi-node benchmarks and runs both in one shot. It takes 1h to run in full.
Note that this script only works for milabench v0.0.6 - v0.0.10.
To make it work on your system, only the following values need to be tweaked:
- USERNAME: username used to ssh to both machines
- SSH_KEY_FILE: key used to ssh to both machines
- ARCH: GPU arch (cuda/rocm)
- WORKER_0: IP or hostname of the main machine
- WORKER_1: IP or hostname of the secondary machine
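The values listed above can be sanity-checked before an hour-long run. The following is a minimal sketch, not part of milabench: validate_config is a hypothetical helper name, and it only checks what the list states, namely that ARCH is cuda or rocm and that the SSH key file is readable.

```shell
#!/bin/bash
# Hypothetical helper (not part of milabench): fail early on bad settings.
validate_config() {
    local arch="$1" key_file="$2"

    # The script only has branches for these two architectures.
    case "$arch" in
        cuda|rocm) ;;
        *)
            echo "ARCH must be 'cuda' or 'rocm', got '$arch'" >&2
            return 1
            ;;
    esac

    # The key is mounted into the container, so it must be readable.
    if [ ! -r "$key_file" ]; then
        echo "SSH key not readable: $key_file" >&2
        return 1
    fi
    return 0
}

# Example, using the variable names from the script:
# validate_config "$ARCH" "$SSH_KEY_FILE" || exit 1
```

Since the script's if/elif has no fallback branch, an unexpected ARCH value would otherwise skip the benchmark run without an error message.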
The script must be placed and tweaked on the main machine.
Once that is done, simply run bash run.sh.
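Because the run depends on SSH access from the main machine to both workers, a quick connectivity check before launching can help. The sketch below is an assumption-laden illustration: check_workers and the SSH_BIN override are hypothetical names introduced here, and the check uses the key file explicitly, which may differ from how your SSH agent or default key is set up.

```shell
#!/bin/bash
# Hypothetical pre-flight helper (not part of milabench).
# SSH_BIN is an illustrative override so the function can be dry-run.
check_workers() {
    local user="$1" key_file="$2"
    shift 2
    local host status=0

    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # -n keeps ssh from reading the terminal's stdin.
        if "${SSH_BIN:-ssh}" -n -i "$key_file" \
               -o BatchMode=yes -o ConnectTimeout=5 \
               "$user@$host" true; then
            echo "ok: $host"
        else
            echo "unreachable: $host" >&2
            status=1
        fi
    done
    return "$status"
}

# Example, using the variable names from the script:
# check_workers "$USERNAME" "$SSH_KEY_FILE" "$WORKER_0" "$WORKER_1" || exit 1
```

The function reports every host rather than stopping at the first failure, so a single pass shows which machines need attention.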
You can find the full result folder generated by the example.
A report output is also included below.
#!/bin/bash
set -m
#
#
#
echo ">> Configure the benchmark"
echo "=========================="
#
# Tweak the values to fit your system
#
USERNAME=${USER:-"mila"}
SSH_KEY_FILE=$HOME/.ssh/id_rsa
ARCH="cuda"
WORKER_0="cn-d003"
WORKER_1="cn-d004"
VERSION="v0.0.10"
# Derived
IMAGE="ghcr.io/mila-iqia/milabench:$ARCH-$VERSION"
# Create the config file
cat >overrides.yaml <<EOL
opt-6_7b-multinode:
  docker_image: "$IMAGE"
  worker_user: "$USERNAME"
  manager_addr: "$WORKER_0"
  worker_addrs:
    - "$WORKER_1"
  num_machines: 2
  capabilities:
    nodes: 2
opt-1_3b-multinode:
  docker_image: "$IMAGE"
  worker_user: "$USERNAME"
  manager_addr: "$WORKER_0"
  worker_addrs:
    - "$WORKER_1"
  num_machines: 2
  capabilities:
    nodes: 2
EOL
echo "<< ======================="
echo ""
echo ">> Prepare docker images"
echo "========================"
echo ssh $USERNAME@$WORKER_0 "docker pull $IMAGE"
echo ssh $USERNAME@$WORKER_1 "docker pull $IMAGE"
ssh $USERNAME@$WORKER_0 "docker pull $IMAGE" &
ssh $USERNAME@$WORKER_1 "docker pull $IMAGE" &
fg
fg
echo "<< ====================="
echo ""
#
#
#
echo ">> Run milabench"
echo "================"
if [ "$ARCH" = "cuda" ]; then
    docker run -it --rm --gpus all --network host --ipc=host --privileged \
        -v "$SSH_KEY_FILE":/milabench/id_milabench \
        -v "$(pwd)/results":/milabench/envs/runs \
        "$IMAGE" \
        milabench run --override "$(cat overrides.yaml)"
elif [ "$ARCH" = "rocm" ]; then
    docker run -it --rm --network host --ipc host --privileged \
        --security-opt seccomp=unconfined --group-add video \
        -v /opt/amdgpu/share/libdrm/amdgpu.ids:/opt/amdgpu/share/libdrm/amdgpu.ids \
        -v /opt/rocm:/opt/rocm \
        -v "$(pwd)/results":/milabench/envs/runs \
        "$IMAGE" \
        milabench run --override "$(cat overrides.yaml)"
fi
echo "<< ============="
echo ""
#
#
#
echo ">> Print report"
echo "==============="
docker run -it --rm \
    -v "$(pwd)/results":/milabench/envs/runs \
    "$IMAGE" \
    milabench report --runs /milabench/envs/runs
echo "<< ============"
===============
Source: /milabench/envs/runs
=================
Benchmark results
=================
                          fail  n       perf  sem%   std%  peak_memory          score  weight
bert-fp16                    0  8     155.09  0.3%   4.3%        24600    1241.393129    0.00
bert-fp32                    0  8      29.56  0.0%   0.5%        31564     236.597358    0.00
bert-tf32                    0  8     119.87  0.3%   5.5%        31566     959.552126    0.00
bert-tf32-fp16               0  8     154.75  0.3%   4.2%        24600    1238.401888    3.00
convnext_large-fp16          0  8     339.21  0.9%  13.3%        27462    2749.709059    0.00
convnext_large-fp32          0  8      45.52  0.6%   9.0%        49582     359.979549    0.00
convnext_large-tf32          0  8     146.50  0.9%  14.2%        49582    1168.371136    0.00
convnext_large-tf32-fp16     0  8     338.60  0.9%  13.6%        27462    2746.288606    3.00
davit_large                  0  8     310.22  0.2%   5.2%        34504    2487.821250    1.00
davit_large-multi            0  1    2311.79  1.0%   7.6%        41824    2311.786629    5.00
dlrm                         0  1  178899.18  1.9%  14.5%         3490  178899.178933    1.00
focalnet                     0  8     387.85  0.2%   5.3%        26350    3112.201303    2.00
opt-1_3b                     0  1      28.32  0.0%   0.2%        42262      28.323743    5.00
opt-1_3b-multinode           0  1      32.37  0.0%   0.2%        41578      32.367767   10.00
opt-6_7b                     0  1      13.20  0.0%   0.1%        56820      13.204616    5.00
opt-6_7b-multinode           0  1      10.79  0.0%   0.0%        47694      10.785362   10.00
reformer                     0  8      61.38  0.0%   1.0%        25404     491.560766    1.00
regnet_y_128gf               0  8      83.80  0.2%   5.2%        31554     671.706002    2.00
resnet152                    0  8     665.49  0.3%   6.1%        36398    5338.088352    1.00
resnet152-multi              0  1    5115.10  1.3%  10.2%        38934    5115.104518    5.00
resnet50                     0  8     998.41  0.5%  11.8%         4730    8008.913503    1.00
stargan                      0  8      43.79  1.4%  30.3%        37426     352.550097    1.00
super-slomo                  0  8      41.50  0.1%   1.9%        33800     331.820253    1.00
t5                           0  8      47.73  0.2%   3.7%        34388     381.825063    2.00
whisper                      0  8     547.80  0.2%   3.3%         9278    4384.366568    1.00
Scores
------
Failure rate: 0.00% (PASS)
Score: 205.26
<< ============
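The report does not say how the final Score is derived. The figures above look consistent with a weighted geometric mean of the score column using the weight column, with weight 0.00 rows contributing nothing. That is an inference from the numbers, not milabench's documented formula. Under that assumption, a minimal sketch for recomputing it from the table rows follows; weighted_geomean is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: weighted geometric mean over report rows read from stdin.
# Assumes whitespace-separated rows laid out as in the report table:
#   name fail n perf sem% std% peak_memory score weight
# The formula itself is an assumption inferred from the report figures.
weighted_geomean() {
    awk 'NF == 9 && $9 > 0 {
             log_sum += $9 * log($8)   # weight * ln(score)
             weight_sum += $9
         }
         END {
             if (weight_sum > 0) printf "%.2f\n", exp(log_sum / weight_sum)
         }'
}

# Example with two rows copied from the report:
weighted_geomean <<'ROWS'
opt-1_3b 0 1 28.32 0.0% 0.2% 42262 28.323743 5.00
opt-6_7b 0 1 13.20 0.0% 0.1% 56820 13.204616 5.00
ROWS
```

The header line has only eight fields and is skipped by the NF == 9 filter, so the whole table can be piped in unchanged.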