This repository has been archived by the owner on Mar 20, 2023. It is now read-only.
It looks like the environment variable SHIPYARD_USER_CMD in the file .shipyard.envlist is also hardcoded as UCX_NET_DEVICES=mlx5_0:1. This causes multinode MPI jobs to fail with Gen1 VMs that have mlx4 devices.
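One way to avoid both hardcoded names would be to detect the device at node-prep time. The following is only a sketch under stated assumptions, not the actual Batch Shipyard code: `detect_ib_device` is a hypothetical helper, it assumes the HCA is exposed under the standard Linux sysfs directory `/sys/class/infiniband`, and it keeps port 1 as in the existing `UCX_NET_DEVICES=mlx5_0:1` setting.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Batch Shipyard.
# Prints the name of the first Mellanox device found under the given
# sysfs directory (default: /sys/class/infiniband), e.g. mlx4_0 or mlx5_0.
detect_ib_device() {
    local sysfs_dir="${1:-/sys/class/infiniband}"
    local dev
    for dev in "$sysfs_dir"/mlx*; do
        # If nothing matches, the glob stays literal, so check that the path exists.
        if [ -e "$dev" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}

# Assumption: port 1 is the port in use, as in the hardcoded value.
if ib_dev="$(detect_ib_device)"; then
    export UCX_NET_DEVICES="${ib_dev}:1"
fi
```

On an NC24rs_v3 node this would be expected to pick up mlx4_0, while newer SKUs would still resolve to mlx5_0. If no matching device is present, `UCX_NET_DEVICES` is left unset.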
Problem Description
Creating a multi-instance pool with NC24rs_v3 fails during start prep because it looks for mlx5_0 in shipyard_nodeprep.sh, lines 1609-1612. The NC24rs_v3 has the ConnectX-3 card, which is identified as mlx4_0, not mlx5_0. Manually modifying shipyard_nodeprep.sh each time a pool is created works around the issue.
Batch Shipyard Version
3.9.1 (Mac)
Steps to Reproduce
Resize a multi-instance pool containing NC24rs_v3 and wait for it to fail.
Expected Results
Node finds the PKEYS and boots normally without intervention.
Actual Results
Manual intervention is required each time a pool is created or modified.
Redacted Configuration