feat: Install Nvidia DOCA on the servers post provisioning
Signed-off-by: Boris Glimcher <Boris.Glimcher@emc.com>
glimchb committed Jan 16, 2024
1 parent 9f00d11 commit b6103c8
Showing 20 changed files with 113 additions and 5 deletions.
@@ -13,7 +13,7 @@ This playbook achieves the following tasks:

* Configures a docker registry to pull images from the internet and store them locally

* Optionally installs OFED and CUDA
* Optionally installs OFED, DOCA and CUDA

.. toctree::

@@ -25,6 +25,18 @@ Optional configurations managed by the provision tool
* CUDA requires an additional reboot while being installed. While this is taken care of by Omnia, users are required to wait an additional few minutes when running the provision tool with CUDA installation for the target nodes to come up.


**Installing DOCA**

**Using the provision tool**

* If ``nvidia_doca_path`` is provided in ``input/provision_config.yml`` and NVIDIA DPUs are available on the target nodes, DOCA packages will be deployed post provisioning without user intervention.

**Using the Network playbook**

* DOCA can also be installed using `network.yml <../../Roles/Network/index.html>`_ after provisioning the servers (assuming the provision tool did not install the DOCA packages).

.. note:: The DOCA package can be downloaded from `here <https://developer.nvidia.com/networking/doca>`_.
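
As a sketch, the provision-tool route comes down to a single line in ``input/provision_config.yml``. The path below is the example value from the parameter description, not a required location; substitute wherever the DOCA rpm actually sits on the control plane.

```yaml
# input/provision_config.yml (fragment)
# Example value only: point this at the DOCA host repo rpm downloaded to the control plane.
nvidia_doca_path: "/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"
```

Leaving the value as ``""`` means DOCA is not installed during provisioning.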

**Installing OFED**

**Using the provision tool**
@@ -194,6 +194,10 @@ Fill in all provision-specific parameters in ``input/provision_config.yml``
| ``string`` | |
| Optional | |
+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nvidia_doca_path | Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: "/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm" |
| ``string`` | |
| Optional | |
+----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

.. note::

@@ -54,12 +54,14 @@ Note the compatibility between cluster OS and control plane OS below:

.. [1] Ensure that control planes running RHEL have an active subscription or are configured to access local repositories. The following repositories should be enabled on the control plane: **AppStream**, **Code Ready Builder (CRB)**, **BaseOS**. For RHEL control planes running 8.5 and below, ensure that sshpass is additionally available to install or download to the control plane (from any local repository).
* To **optionally** set up CUDA and OFED using the provisioning tool, download the required repositories to the control plane from here to deploy on the target nodes:
* To **optionally** set up CUDA, DOCA and OFED using the provisioning tool, download the required repositories to the control plane from here to deploy on the target nodes:

1. `For NVIDIA GPUs: <https://developer.nvidia.com/cuda-downloads/>`_: CUDA is a parallel computing platform and application programming interface that allows software to use certain types of graphics processing units for general purpose processing, an approach called general-purpose computing on GPUs.

2. `For Mellanox <https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/>`_: OFED (OpenFabrics Enterprise Distribution) is open-source software for RDMA and kernel bypass applications. OFED can be used in business, research and scientific environments that require highly efficient networks, storage connectivity and parallel computing.

3. `For NVIDIA DPUs: <https://developer.nvidia.com/networking/doca/>`_: DOCA is ...

* Ensure that all connection names under the network manager match their corresponding device names.
To verify network connection names: ::

1 change: 1 addition & 0 deletions docs/source/InstallationGuides/addinganewnode.rst
@@ -8,6 +8,7 @@ While adding a new node to the cluster, users can modify the following:
- The operating system
- CUDA
- OFED
- DOCA

A new node can be added using the following ways:

@@ -6,6 +6,7 @@ In the event that an existing Omnia cluster needs a different OS version or a fr
- The operating system
- CUDA
- OFED
- DOCA

Omnia can re-provision the cluster by running the following command: ::

2 changes: 2 additions & 0 deletions docs/source/Overview/SupportMatrix/omniainstalledsoftware.rst
@@ -126,6 +126,8 @@ Software Installed by Omnia
+------------------------------------+------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MLNX-OFED | BSD License | MLNX_OFED is an NVIDIA tested and packaged version of OFED that supports two interconnect types using the same RDMA (remote DMA) and kernel bypass APIs called OFED verbs – InfiniBand and Ethernet. |
+------------------------------------+------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NVIDIA DOCA | NVIDIA License | The NVIDIA® DOCA® is the key to unlocking the potential of the NVIDIA® BlueField® networking platform to offload, accelerate, and isolate data center workloads. |
+------------------------------------+------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ansible pylibssh | LGPL 2.1 | Python bindings to client functionality of libssh specific to Ansible use case. |
+------------------------------------+------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| perl-DBD-Pg | GNU General Public License v3 | DBD::Pg - PostgreSQL database driver for the DBI module |
12 changes: 11 additions & 1 deletion docs/source/Roles/Network/index.rst
@@ -16,7 +16,9 @@ Some of the network features Omnia offers are:

2. Infiniband switch configuration

To install OFED drivers, enter all required parameters in ``input/network_config.yml``:
3. NVIDIA DOCA

To install OFED and DOCA drivers, enter all required parameters in ``input/network_config.yml``:


+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
@@ -37,6 +39,14 @@ To install OFED drivers, enter all required parameters in ``input/network_config
| | * ``false`` <- Default |
| | * ``true`` |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nvidia_doca_offline_path | Absolute path to local copy of rpm file containing DOCA package. The package can be downloaded from https://developer.nvidia.com/networking/doca/. |
| [optional] | |
| ``string`` | |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| nvidia_doca_version | Indicates the version of DOCA to be downloaded. If ``nvidia_doca_offline_path`` is not given, declaring this variable is mandatory. |
| [optional] | |
| ``string`` | **Default value**: 2.5.0-0.0.1.23.10.1.1.9.0 |
+------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
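
A minimal sketch of the two ways these parameters can combine in ``input/network_config.yml``. The offline path is an assumed example location (borrowed from the ``nvidia_doca_path`` example), not a value this commit prescribes.

```yaml
# input/network_config.yml (fragment)

# Option 1 (assumed example path): install from a local copy of the DOCA package.
nvidia_doca_offline_path: "/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"

# Option 2: leave the offline path empty and name the version to download instead.
# nvidia_doca_offline_path: ""
# nvidia_doca_version: 2.5.0-0.0.1.23.10.1.1.9.0
```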

To run the script: ::

5 changes: 5 additions & 0 deletions docs/source/Tables/bmc.csv
@@ -283,6 +283,11 @@ Optional",Absolute path to a local copy of the .iso file containing Mellanox OF
``string``

Optional","Absolute path to local copy of .rpm file containing CUDA packages. The cuda rpm can be downloaded from https://developer.nvidia.com/cuda-downloads. CUDA will be installed post provisioning without any user intervention. Eg: cuda_toolkit_path: ""/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"""
"**nvidia_doca_path**

``string``

Optional","Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: ""/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"""
"**apptainer_support**

``boolean`` [1]_
5 changes: 5 additions & 0 deletions docs/source/Tables/mapping.csv
@@ -258,6 +258,11 @@ Optional",Absolute path to a local copy of the .iso file containing Mellanox OF
``string``

Optional","Absolute path to local copy of .rpm file containing CUDA packages. The cuda rpm can be downloaded from https://developer.nvidia.com/cuda-downloads. CUDA will be installed post provisioning without any user intervention. Eg: cuda_toolkit_path: ""/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"""
"**nvidia_doca_path**

``string``

Optional","Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: ""/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"""
"**apptainer_support**

``boolean`` [1]_
5 changes: 5 additions & 0 deletions docs/source/Tables/snmpwalk.csv
@@ -265,6 +265,11 @@ Optional",Absolute path to a local copy of the .iso file containing Mellanox OF
``string``

Optional","Absolute path to local copy of .rpm file containing CUDA packages. The cuda rpm can be downloaded from https://developer.nvidia.com/cuda-downloads. CUDA will be installed post provisioning without any user intervention. Eg: cuda_toolkit_path: ""/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"""
"**nvidia_doca_path**

``string``

Optional","Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: ""/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"""
"**apptainer_support**

``boolean`` [1]_
5 changes: 5 additions & 0 deletions docs/source/Tables/switch-based.csv
@@ -299,6 +299,11 @@ Optional",Absolute path to a local copy of the .iso file containing Mellanox OF
``string``

Optional","Absolute path to local copy of .rpm file containing CUDA packages. The cuda rpm can be downloaded from https://developer.nvidia.com/cuda-downloads. CUDA will be installed post provisioning without any user intervention. Eg: cuda_toolkit_path: ""/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"""
"**nvidia_doca_path**

``string``

Optional","Absolute path to local copy of .rpm file containing DOCA packages. The doca rpm can be downloaded from https://developer.nvidia.com/networking/doca. DOCA will be installed post provisioning without any user intervention. Eg: nvidia_doca_path: ""/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"""
"**apptainer_support**

``boolean`` [1]_
10 changes: 10 additions & 0 deletions input/network_config.yml
@@ -33,3 +33,13 @@ mlnx_ofed_version: 5.4-2.4.1.3
# Mandatory variable
# Default value: true
mlnx_ofed_add_kernel_support: true

# Absolute path to local copy of .rpm file containing DOCA package.
# The package can be downloaded from https://developer.nvidia.com/networking/doca/
# Optional variable.
nvidia_doca_offline_path: ""

# If nvidia_doca_offline_path is not given, declaring this variable is mandatory.
# The DOCA package is downloaded as per version mentioned in this variable.
# Default value: 2.5.0-0.0.1.23.10.1.1.9.0
nvidia_doca_version: 2.5.0-0.0.1.23.10.1.1.9.0
8 changes: 8 additions & 0 deletions input/provision_config.yml
@@ -256,6 +256,14 @@ mlnx_ofed_path: ""
# cuda_toolkit_path: "/root/cuda-repo-rhel8-12-0-local-12.0.0_525.60.13-1.x86_64.rpm"
cuda_toolkit_path: ""

#### Optional, discovery_mechanism: mapping or switch_based or bmc or snmpwalk
# Absolute path to local copy of .rpm file containing DOCA packages.
# The DOCA rpm can be downloaded from https://developer.nvidia.com/networking/doca
# DOCA will be installed post provisioning without any user intervention
# Example:
# nvidia_doca_path: "/root/doca-host-repo-rhel86-2.5.0-0.0.1.2.5.0108.1.el8.23.10.1.1.9.0.x86_64.rpm"
nvidia_doca_path: ""

#### Mandatory, discovery_mechanism: mapping or switch_based or bmc or snmpwalk
# apptainer will be installed on the cluster to enable execution of HPC benchmarks in a containerized environment.
# If apptainer_support: false, apptainer will not be installed on the cluster
10 changes: 10 additions & 0 deletions network/network.yml
@@ -39,6 +39,16 @@
        name: mlnx_ofed
        tasks_from: validations.yml

- name: Validate input parameters for nvidia_doca
  hosts: localhost
  connection: local
  gather_facts: true
  tasks:
    - name: Validate variables from network_config.yml
      ansible.builtin.include_role:
        name: nvidia_doca
        tasks_from: validations.yml

- name: Check nodes having Infiniband Support
  hosts: all
  tasks:
Empty file.
2 changes: 1 addition & 1 deletion prereq.sh
@@ -50,7 +50,7 @@ echo ""
echo ""
echo "Download the ISO file required to provision in the control plane."
echo ""
echo "Download OFED ISO and CUDA RPM file to install OFED and CUDA during provisioning."
echo "Download the OFED ISO and the DOCA and CUDA RPM files to install OFED, DOCA and CUDA during provisioning."
echo ""
echo "Please configure all the NICs and set the hostname for the control plane in the format hostname.domain_name. Eg: controlplane.omnia.test"
echo ""
2 changes: 1 addition & 1 deletion provision/README.rst
@@ -16,6 +16,6 @@ This playbook achieves the following tasks:

* Configures a docker registry to pull images from the internet and store them locally

* Optionally installs OFED and CUDA
* Optionally installs OFED, DOCA and CUDA

`Click here <https://omnia-doc.readthedocs.io/en/latest/InstallationGuides/InstallingProvisionTool/index.html>`_ for more information on ``provision.yml``.
@@ -17,6 +17,7 @@
  ansible.builtin.set_fact:
    ofed_config_status: false
    cuda_config_status: false
    doca_config_status: false

- name: Set ofed_config_status to true
  ansible.builtin.set_fact:
Expand Down Expand Up @@ -75,3 +76,27 @@
  when:
    - cuda_config_status
    - not verify_cuda_path.stat.exists

- name: Set doca_config_status to true
  ansible.builtin.set_fact:
    doca_config_status: true
  when: nvidia_doca_path | default("", true) | length > 1

- name: Warning - waiting for {{ warning_wait_time }} seconds
  ansible.builtin.pause:
    seconds: "{{ warning_wait_time }}"
    prompt: "{{ doca_rpm_empty_msg }}"
  when: not doca_config_status

- name: Verify the nvidia_doca_path
  ansible.builtin.stat:
    path: "{{ nvidia_doca_path }}"
  register: verify_doca_path
  when: doca_config_status

- name: Assert nvidia_doca_path location
  ansible.builtin.fail:
    msg: "{{ nvidia_doca_path_missing_msg }}"
  when:
    - doca_config_status
    - not verify_doca_path.stat.exists
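
The final task is named "Assert" but is implemented with ``ansible.builtin.fail``. For comparison, a sketch of the same check written with ``ansible.builtin.assert`` might look like the following. This is an alternative for illustration, not what the commit does; it reuses the ``verify_doca_path``, ``doca_config_status`` and ``nvidia_doca_path_missing_msg`` names from the hunk above.

```yaml
# Sketch only: an alternative to the ansible.builtin.fail task, not part of this commit.
- name: Assert nvidia_doca_path location
  ansible.builtin.assert:
    that:
      - verify_doca_path.stat.exists
    fail_msg: "{{ nvidia_doca_path_missing_msg }}"
  when: doca_config_status
```

Both forms stop the play when the rpm is missing. The ``assert`` form states the expected condition positively, while the ``fail`` form matches the existing CUDA check in the same file.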
3 changes: 3 additions & 0 deletions provision/roles/provision_validation/vars/main.yml
@@ -211,6 +211,9 @@ ofed_rhel_check: "rhel{{ provision_os_version }}"
cuda_rpm_empty_msg: "[WARNING] cuda_toolkit_path variable empty in provision_config.yml. CUDA won't be installed during provisioning."
cuda_toolkit_path_missing_msg: "Failed. Incorrect cuda_toolkit_path: {{ cuda_toolkit_path }} provided.
Make sure CUDA toolkit rpm file is present in the provided cuda_toolkit_path variable in provision_config.yml."
doca_rpm_empty_msg: "[WARNING] nvidia_doca_path variable empty in provision_config.yml. DOCA won't be installed during provisioning."
nvidia_doca_path_missing_msg: "Failed. Incorrect nvidia_doca_path: {{ nvidia_doca_path }} provided.
Make sure DOCA rpm file is present in the provided nvidia_doca_path variable in provision_config.yml."

# Usage: validate_repo_path.yml
update_repos_success_msg: "Validated update_repos"