Pool resize with NC24rs_v3 fails to find PKEYS during nodeprep #360

themorey · 2020-11-04T18:54:52Z

Problem Description

Creating a multi-instance pool with NC24rs_v3 fails during node prep because shipyard_nodeprep.sh (lines 1609-1612) looks for the mlx5_0 device:

export_ib_pkey()
{
    key0=$(cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/0)
    key1=$(cat /sys/class/infiniband/mlx5_0/ports/1/pkeys/1)

The NC24rs_v3 has a ConnectX-3 card, which is identified as mlx4_0, not mlx5_0. Manually modifying shipyard_nodeprep.sh each time a pool is created works around the issue.
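
One possible direction, as an untested sketch rather than the actual Batch Shipyard fix: detect the device name from sysfs instead of assuming mlx5_0. The function names find_ib_device and read_ib_pkeys are hypothetical and do not exist in shipyard_nodeprep.sh. The sysfs root is passed as an argument only so the logic can be exercised without InfiniBand hardware.

```shell
# Hypothetical helper (not part of shipyard_nodeprep.sh): print the name of
# the first mlx* device that exposes pkeys on port 1, e.g. mlx4_0 or mlx5_0.
find_ib_device() {
    _ib_root="${1:-/sys/class/infiniband}"
    for _ib_dev in "$_ib_root"/mlx*; do
        # An unmatched glob stays literal, so this check also covers "no device".
        if [ -e "$_ib_dev/ports/1/pkeys/0" ]; then
            basename "$_ib_dev"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: read pkeys 0 and 1 from whichever device was found
# and print "<device> <key0> <key1>".
read_ib_pkeys() {
    _ib_root="${1:-/sys/class/infiniband}"
    _ib_name=$(find_ib_device "$_ib_root") || return 1
    _ib_key0=$(cat "$_ib_root/$_ib_name/ports/1/pkeys/0")
    _ib_key1=$(cat "$_ib_root/$_ib_name/ports/1/pkeys/1")
    echo "$_ib_name $_ib_key0 $_ib_key1"
}
```

With something like this, export_ib_pkey() could call read_ib_pkeys instead of reading the mlx5_0 paths directly, so the same script would work on both mlx4 and mlx5 nodes.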

Batch Shipyard Version

3.9.1 (Mac)

Steps to Reproduce

Resize a multi-instance pool containing NC24rs_v3 and wait for it to fail.

Expected Results

Node finds the PKEYS and boots normally without intervention.

Actual Results

Manual intervention is required each time a pool is created or modified.

Redacted Configuration

pool_specification:
  id: arvinas-relion-pool-NCv3
  vm_configuration:
    platform_image:
      offer: CentOS-HPC
      publisher: OpenLogic
      sku: '7.7'
      version: '7.7.2020062600'
  vm_count:
    dedicated: 0
    low_priority: 0
  vm_size: STANDARD_NC24rs_v3
  autoscale:
    evaluation_interval: 00:05:00
    scenario:
      name: active_tasks
      maximum_vm_count:
        dedicated: 4
        low_priority: 4
      maximum_vm_increment_per_evaluation:
        dedicated: -1
        low_priority: -1
      bias_node_type: low_priority
  inter_node_communication_enabled: true
  virtual_network:
    arm_subnet_id: /subscriptions/{sub}/resourceGroups/{RG}/providers/Microsoft.Network/virtualNetworks/{Vnet}/subnets/{sn}
  ssh:
    username: shipyard


themorey · 2020-11-05T15:53:25Z

It looks like the environment variable SHIPYARD_USER_CMD in the file .shipyard.envlist is also hardcoded as UCX_NET_DEVICES=mlx5_0:1. This causes multinode MPI jobs to fail with Gen1 VMs that have mlx4 devices.
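
A similar sketch for the UCX side, again an assumption and not how Batch Shipyard currently builds .shipyard.envlist: derive the device:port value from sysfs rather than hardcoding mlx5_0:1. The function name ucx_net_devices is hypothetical, and the sysfs root and port number are parameters only for illustration.

```shell
# Hypothetical helper: print "<device>:<port>" for the first InfiniBand
# device that has the requested port, e.g. mlx4_0:1 on ConnectX-3 nodes.
ucx_net_devices() {
    _ucx_root="${1:-/sys/class/infiniband}"
    _ucx_port="${2:-1}"
    for _ucx_dev in "$_ucx_root"/*; do
        if [ -d "$_ucx_dev/ports/$_ucx_port" ]; then
            echo "$(basename "$_ucx_dev"):$_ucx_port"
            return 0
        fi
    done
    return 1
}

# Possible usage on a node (left commented out; requires InfiniBand hardware):
# UCX_NET_DEVICES=$(ucx_net_devices) && export UCX_NET_DEVICES
```

This only picks the first device it finds; nodes with more than one HCA would need a more deliberate selection.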
