trtllm-serve : Failure to launch openAI-API on multiple nodes with 8 GPUs each #2594

Open
sivabreddy opened this issue Dec 19, 2024 · 1 comment


I built the engine with the following command:

trtllm-build \
    --checkpoint_dir /var/run/models/TensorRT-Model-Optimizer/llm_ptq/llama3_405b_instruct_fp8_tp16_pp1 \
    --output_dir /var/run/models/llama-3.1-405b-instruct/1x16/engine \
    --max_num_tokens 128000 \
    --max_seq_len 128000 \
    --max_batch_size 4 \
    --gpt_attention_plugin auto \
    --context_fmha enable \
    --kv_cache_type paged \
    --multiple_profiles enable \
    --workers 8
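
Before launching, it can help to confirm that the build wrote one engine file per rank. The sketch below assumes the `config.json` plus `rank<N>.engine` layout that `trtllm-build` normally writes to `--output_dir`. The helper name `missing_engine_files` is made up for illustration and is not part of TensorRT-LLM.

```python
from pathlib import Path


def missing_engine_files(engine_dir: str, tp_size: int, pp_size: int) -> list[str]:
    """Return the expected engine files that are absent from engine_dir.

    Assumption: the directory holds config.json plus one rank<N>.engine
    per rank, with tp_size * pp_size ranks in total.
    """
    root = Path(engine_dir)
    world_size = tp_size * pp_size
    expected = ["config.json"] + [f"rank{r}.engine" for r in range(world_size)]
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Path and parallelism taken from the build command above (TP=16, PP=1).
    engine_dir = "/var/run/models/llama-3.1-405b-instruct/1x16/engine"
    missing = missing_engine_files(engine_dir, tp_size=16, pp_size=1)
    print("missing files:", missing or "none")
```

An empty result means every expected file is present. It does not say anything about whether the engines themselves are valid.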

I'm trying to start an OpenAI API-compatible server with the "trtllm-serve" command, whose CLI is defined at:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/commands/serve.py

I'm using mpirun to launch the server.

mpirun --allow-run-as-root --oversubscribe \
    --report-bindings \
    --host triton-multinode-leader-68968c8ffc-lb2kb:8,triton-multinode-worker1-7cbb7c6d5c-h542v:8 \
    --mca orte_base_help_aggregate 0 \
    --mca btl_tcp_if_include eth0 \
    --mca plm_rsh_agent kubessh \
    --mca orte_keepalive_timeout 60 \
    --mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx3_1:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1 \
    -x UCX_TLS=rc,sm \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_HCA="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx3_1:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1" \
    -x NCCL_NET_GDR_LEVEL=5 \
    -x NCCL_SOCKET_IFNAME=eth0 \
    -x NCCL_P2P_DISABLE=0 \
    -np 16 \
trtllm-serve \
--host "0.0.0.0" \
--tokenizer /var/run/models/llama-3.1-405b-instruct/hf_download \
--tp_size 16 \
--pp_size 1 \
--max_beam_width 1 \
--max_seq_len 15000 \
--max_num_tokens 15000 \
--max_batch_size 4 \
--kv_cache_free_gpu_memory_fraction 0.90 \
/var/run/models/llama-3.1-405b-instruct/1x16/engine
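
Two things in a launch like this are easy to get out of sync: the MPI rank count (`-np`) versus `tp_size * pp_size`, and the device names passed through `UCX_NET_DEVICES` / `NCCL_IB_HCA` versus what each node actually exposes. A minimal pre-flight sketch follows. The function names are hypothetical, the world-size rule assumes no parallelism dimension other than TP and PP is in use, and the `available` set is an assumed input you would collect yourself (for example from `ucx_info -d` on each node).

```python
def check_world_size(num_ranks: int, slots_per_host: list[int],
                     tp_size: int, pp_size: int) -> list[str]:
    """Return human-readable problems with the rank layout (empty if none)."""
    problems = []
    world_size = tp_size * pp_size  # assumption: only TP and PP are used
    if num_ranks != world_size:
        problems.append(
            f"-np {num_ranks} != tp_size * pp_size = {world_size}")
    total_slots = sum(slots_per_host)
    if total_slots < num_ranks:
        problems.append(
            f"hosts provide {total_slots} slots, fewer than -np {num_ranks}")
    return problems


def unavailable_devices(requested: str, available: set[str]) -> list[str]:
    """Return requested device names that are not in the available set.

    `requested` is a comma-separated list such as the value passed to
    UCX_NET_DEVICES or NCCL_IB_HCA.
    """
    names = [d.strip() for d in requested.split(",") if d.strip()]
    return [d for d in names if d not in available]


if __name__ == "__main__":
    # Values copied from the mpirun command above.
    print(check_world_size(num_ranks=16, slots_per_host=[8, 8],
                           tp_size=16, pp_size=1))

    requested = ("mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx3_1:1,"
                 "mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1")
    # Assumed inventory for illustration: ports mlx5_0:1 through mlx5_9:1.
    available = {f"mlx5_{i}:1" for i in range(10)}
    print(unavailable_devices(requested, available))
```

With the values from the command above, the rank layout check reports no problems, while the device check flags `mlx3_1:1`, the same name the UCX warnings in the log below complain about.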

I get the following runtime error:

root@triton-multinode-leader-68968c8ffc-vkpw8:/var/run/models/TensorRT-LLM/tensorrt_llm/commands# mpirun --allow-run-as-root --oversubscribe    --report-bindings     --host triton-multinode-leader-68968c8ffc-vkpw8:8,triton-multinode-worker1-7cbb7c6d5c-blcp4:8     --mca orte_base_help_aggregate 0     --mca btl_tcp_if_include eth0     --mca plm_rsh_agent kubessh     --mca orte_keepalive_timeout 60     --mca pml ucx     -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx3_1:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1    -x UCX_TLS=rc,sm     -x NCCL_DEBUG=INFO     -x NCCL_IB_DISABLE=0     -x NCCL_IB_HCA="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx3_1:1,mlx5_6:1,mlx5_7:1,mlx5_8:1,mlx5_9:1"     -x NCCL_NET_GDR_LEVEL=5     -x NCCL_SOCKET_IFNAME=eth0     -x NCCL_P2P_DISABLE=0     -np 16 trtllm-serve --host "0.0.0.0" --tokenizer /var/run/models/llama-3.1-405b-instruct/hf_download --tp_size 8 --pp_size 2 --max_beam_width 1 --max_seq_len 15000 --max_num_tokens 15000 --max_batch_size 4 --kv_cache_free_gpu_memory_fraction 0.90 /var/run/models/llama-3.1-405b-instruct/2x8/engine
There was a problem when trying to write in your cache folder (hub). You should set the environment variable TRANSFORMERS_CACHE to a writable directory.
[the line above appears 16 times in total, once per MPI rank]
[triton-multinode-leader-68968c8ffc-vkpw8:05385] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-leader-68968c8ffc-vkpw8:05387] MCW rank 4 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-worker1-7cbb7c6d5c-blcp4:04321] MCW rank 13 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-worker1-7cbb7c6d5c-blcp4:04319] MCW rank 11 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-worker1-7cbb7c6d5c-blcp4:04323] MCW rank 15 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-leader-68968c8ffc-vkpw8:05384] MCW rank 1 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-leader-68968c8ffc-vkpw8:05383] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-leader-68968c8ffc-vkpw8:05390] MCW rank 7 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-leader-68968c8ffc-vkpw8:05386] MCW rank 3 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-leader-68968c8ffc-vkpw8:05389] MCW rank 6 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-leader-68968c8ffc-vkpw8:05388] MCW rank 5 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-worker1-7cbb7c6d5c-blcp4:04322] MCW rank 14 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-worker1-7cbb7c6d5c-blcp4:04318] MCW rank 10 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-worker1-7cbb7c6d5c-blcp4:04316] MCW rank 8 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-worker1-7cbb7c6d5c-blcp4:04317] MCW rank 9 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[triton-multinode-worker1-7cbb7c6d5c-blcp4:04320] MCW rank 12 bound to socket 0[core 0[hwt 0]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]], socket 0[core 8[hwt 0-1]], socket 0[core 9[hwt 0-1]], socket 0[core 10[hwt 0-1]], socket 0[core 11[hwt 0-1]], socket 0[core 12[hwt 0-1]], socket 0[core 13[hwt 0-1]], socket 0[core 14[hwt 0-1]], socket 0[core 15[hwt 0-1]], socket 0[core 16[hwt 0-1]], socket 0[core 17[hwt 0-1]], socket 0[core 18[hwt 0-1]], socket 0[core 19[hwt 0-1]], socket 0[core 20[hwt 0-1]], socket 0[core 21[hwt 0-1]], socket 0[core 22[hwt 0-1]], socket 0[core 23[hwt 0-1]], socket 0[core 24[hwt 0-1]], socket 0[core 25[hwt 0-1]], socket 0[core 26[hwt 0-1]], socket 0[core 27[hwt 0-1]], socket 0[core 28[hwt 0-1]], socket 0[core 29[hwt 0-1]], socket 0[core 30[hwt 0-1]], socket 0[core 31[hwt 0-1]], socket 0[core 32[hwt 0-1]], socket 0[core 33[hwt 0-1]], socket 0[core 34[hwt 0-1]], socket 0[core 35[hwt 0-1]], socket 0[core 36[hwt 0-1]], socket 0[core 37[hwt 0-1]]: [B./../BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB][BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB/BB]
[1734600614.145533] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4321 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.153153] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4316 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.208646] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4320 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.214052] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4318 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.214091] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4323 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.230537] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4317 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.272303] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4319 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.288751] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4322 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.598473] [triton-multinode-leader-68968c8ffc-vkpw8:5386 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.618871] [triton-multinode-leader-68968c8ffc-vkpw8:5385 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.628930] [triton-multinode-leader-68968c8ffc-vkpw8:5387 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.646100] [triton-multinode-leader-68968c8ffc-vkpw8:5390 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.668800] [triton-multinode-leader-68968c8ffc-vkpw8:5384 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.675720] [triton-multinode-leader-68968c8ffc-vkpw8:5389 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.684026] [triton-multinode-leader-68968c8ffc-vkpw8:5383 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600614.686617] [triton-multinode-leader-68968c8ffc-vkpw8:5388 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.089126] [triton-multinode-leader-68968c8ffc-vkpw8:5386 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.091283] [triton-multinode-leader-68968c8ffc-vkpw8:5385 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.095339] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4319 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.113149] [triton-multinode-leader-68968c8ffc-vkpw8:5383 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.113425] [triton-multinode-leader-68968c8ffc-vkpw8:5389 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.113996] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4317 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.115366] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4316 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.115411] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4321 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.124298] [triton-multinode-leader-68968c8ffc-vkpw8:5390 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.128210] [triton-multinode-leader-68968c8ffc-vkpw8:5384 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.128892] [triton-multinode-leader-68968c8ffc-vkpw8:5387 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.130476] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4318 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.136306] [triton-multinode-leader-68968c8ffc-vkpw8:5388 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.137097] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4322 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.189160] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4323 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[1734600615.200641] [triton-multinode-worker1-7cbb7c6d5c-blcp4:4320 :0]     ucp_context.c:1234 UCX  WARN  network device 'mlx3_1:1' is not available, please use one or more of: 'eth0'(tcp), 'lo'(tcp), 'mlx5_0:1'(ib), 'mlx5_1:1'(ib), 'mlx5_2:1'(ib), 'mlx5_3:1'(ib), 'mlx5_4:1'(ib), 'mlx5_5:1'(ib), 'mlx5_6:1'(ib), 'mlx5_7:1'(ib), 'mlx5_8:1'(ib), 'mlx5_9:1'(ib), 'net1'(tcp), 'net2'(tcp), 'net3'(tcp), 'net4'(tcp), 'net5'(tcp), 'net6'
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.dev2024121700
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/serve.py", line 84, in main
    llm = LLM(**llm_args.to_dict())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/llmapi/llm.py", line 130, in __init__
    raise RuntimeError(
RuntimeError: Only 8 GPUs are available, but 16 are required.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20078,1],9]
  Exit code:    1
--------------------------------------------------------------------------
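
For reference, the arithmetic behind the error can be sketched as follows. This is only an illustration, assuming the check compares the GPUs visible on a single node against `tp_size * pp_size` (16 for the `tp16_pp1` checkpoint above). The function names are hypothetical and this is not the actual code in `tensorrt_llm/llmapi/llm.py`.

```python
# Hypothetical sketch of a GPU-count check; not TensorRT-LLM's implementation.

def required_gpus(tp_size: int, pp_size: int) -> int:
    """World size needed by an engine built with the given parallelism."""
    return tp_size * pp_size


def check_single_node(local_gpus: int, tp_size: int, pp_size: int) -> None:
    """Assumed behaviour: compare only the GPUs visible on this node."""
    needed = required_gpus(tp_size, pp_size)
    if local_gpus < needed:
        raise RuntimeError(
            f"Only {local_gpus} GPUs are available, but {needed} are required."
        )


def check_multi_node(gpus_per_node: int, num_nodes: int,
                     tp_size: int, pp_size: int) -> None:
    """Alternative: compare the GPUs across all nodes in the job."""
    needed = required_gpus(tp_size, pp_size)
    total = gpus_per_node * num_nodes
    if total < needed:
        raise RuntimeError(
            f"Only {total} GPUs are available, but {needed} are required."
        )


if __name__ == "__main__":
    # Two nodes with 8 GPUs each, engine built for tp=16, pp=1.
    check_multi_node(gpus_per_node=8, num_nodes=2, tp_size=16, pp_size=1)
    try:
        check_single_node(local_gpus=8, tp_size=16, pp_size=1)
    except RuntimeError as err:
        print(err)
```

Under that assumption, a per-node check fails on every rank with the same message seen above, while a job-wide check would pass for 2 nodes x 8 GPUs.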
@vinaykagithapu

I'm getting the same issue.
