
Agenda (USA Pacific time zone)

Date Time Topic Speaker/Moderator
09/20 08:00-08:15
Opening Remarks and UCF

Unified Communication Framework (UCF) - a collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications. In this talk we will present recent advances in the development of UCF projects, including Open UCX and Apache Spark UCX, as well as incubation projects in the areas of SmartNIC programming, benchmarking, and other areas of accelerated compute.

Gilad Shainer, NVIDIA

Gilad Shainer serves as senior vice-president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. He serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF and CCIX consortia, a member of the IBTA, and a contributor to the PCI-SIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds MSc and BSc degrees in Electrical Engineering from the Technion – Israel Institute of Technology.

08:15-09:00
A Deep Dive into DPU Computing – Addressing HPC/AI Performance Bottlenecks

AI and scientific workloads demand ultra-fast processing of high-resolution simulations, extreme-size datasets, and highly parallelized algorithms. As these computing requirements continue to grow, the traditional GPU-CPU architecture suffers further from imbalanced computing, data latency, and a lack of parallelism and data pre-processing. The introduction of the Data Processing Unit (DPU) brings a new tier of computing to address these bottlenecks and to enable, for the first time, compute overlapping and nearly zero communication latency. The session will deliver a deep dive into DPU computing and how it can help address long-standing performance bottlenecks. Performance results for a variety of HPC and AI applications will be presented as well.

Gilad Shainer, NVIDIA

See bio above.

09:00-10:00
Protocols v2 update

An update on the status of the protocols v2 implementation: what is upstream, what is planned for next year, performance status, error flows, and debug/analysis infrastructure.

Yossi Itigin, NVIDIA

Yossi Itigin is a UCX team lead at NVIDIA, focusing on high-performance communication middleware, and a maintainer of the OpenUCX project. Prior to joining NVIDIA, Mr. Itigin spent nine years at Mellanox Technologies in different technical roles, all related to developing and optimizing RDMA software.

10:00-11:00 Break
11:00-11:45
UCX exported memory key API for better DPU experience

The UCX API for managing exported memory keys lets users develop applications offloaded to a DPU with direct access to system memory available on the host.
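
As context, here is a minimal sketch in C of the export/import flow, based on the ucp_memh_pack()/ucp_mem_map() pattern in recent OpenUCX releases. The helper names are illustrative, the exact flag and field names (UCP_MEMH_PACK_FLAG_EXPORT, exported_memh_buffer) are assumptions that may differ from the API version presented in the talk, and error handling is omitted.

```c
/* Sketch: export a host memory region and import it on the DPU.
 * Flag/field names follow recent OpenUCX and are assumptions here. */
#include <ucp/api/ucp.h>

/* Host side: map a buffer and pack an exportable memory handle. */
void export_region(ucp_context_h ctx, void *addr, size_t len,
                   void **pack_buf, size_t *pack_len)
{
    ucp_mem_map_params_t mp = {
        .field_mask = UCP_MEM_MAP_PARAM_FIELD_ADDRESS |
                      UCP_MEM_MAP_PARAM_FIELD_LENGTH,
        .address    = addr,
        .length     = len
    };
    ucp_mem_h memh;
    ucp_mem_map(ctx, &mp, &memh);

    ucp_memh_pack_params_t pp = {
        .field_mask = UCP_MEMH_PACK_PARAM_FIELD_FLAGS,
        .flags      = UCP_MEMH_PACK_FLAG_EXPORT  /* exportable memkey */
    };
    ucp_memh_pack(memh, &pp, pack_buf, pack_len);
    /* ship *pack_buf to the DPU over any control channel */
}

/* DPU side: import the packed handle for direct access to host memory. */
void import_region(ucp_context_h ctx, const void *pack_buf, ucp_mem_h *memh)
{
    ucp_mem_map_params_t mp = {
        .field_mask           = UCP_MEM_MAP_PARAM_FIELD_EXPORTED_MEMH_BUFFER,
        .exported_memh_buffer = (void *)pack_buf
    };
    ucp_mem_map(ctx, &mp, memh);
}
```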

Dmitry Gladkov, NVIDIA

BIO

11:45-12:30
UCX multirail support for RMA

Multirail support can provide a significant performance boost on certain platforms. In this talk we will describe how multirail is supported for RMA operations in UCX and demonstrate the performance benefits using benchmarks.
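
For illustration, the sketch below shows one way to enable multiple rails through the UCX configuration API rather than environment variables. MAX_RNDV_RAILS is the existing knob for rendezvous multirail; whether the RMA multirail work described here reuses this knob, and the device names shown, are assumptions.

```c
/* Sketch: requesting multiple rails programmatically via the UCX
 * config API before context creation. Error handling omitted. */
#include <ucp/api/ucp.h>

ucp_context_h init_multirail(void)
{
    ucp_config_t *config;
    ucp_config_read(NULL, NULL, &config);

    /* Equivalent to exporting UCX_MAX_RNDV_RAILS / UCX_NET_DEVICES
     * in the environment. Device names are examples. */
    ucp_config_modify(config, "MAX_RNDV_RAILS", "2");
    ucp_config_modify(config, "NET_DEVICES", "mlx5_0:1,mlx5_1:1");

    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_RMA
    };
    ucp_context_h ctx;
    ucp_init(&params, config, &ctx);
    ucp_config_release(config);
    return ctx;
}
```

The same effect can be had purely from the environment, for example when measuring RMA bandwidth with ucx_perftest.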

Sergey Oblomov, NVIDIA

BIO

12:30-13:30 Break
13:30-14:15
UCX-Py: New C++ Backend and Features

Python is a language that has been growing in adoption over the past years. It is great for prototyping as well as numerous real-world applications, particularly in, but not limited to, the fields of numerical analysis, machine learning, artificial intelligence, and data visualization. Python provides users with a high-level abstraction over compute resources that are often a barrier to entry for scientists in different fields. As computational resource complexity continues to increase, Python becomes more relevant every day, including in the field of communication. To cope with this ever-growing complexity, Python requires abstractions for tasks that are often managed in lower-level, compiled languages like C and C++. UCX-Py closes that gap by providing a high-level interface to UCX, delivering high-performance communications with the capability to use specialized hardware directly from Python.

In this year's talk we will present our ongoing effort to rewrite UCX-Py's backend. Previously, UCX-Py was written in Cython, a C-extensions library for Python. The new C++ backend, currently dubbed UCXX, provides the same basic functionality on top of the UCP layer that the previous pure-Cython implementation did, but introduces new features that may give applications a performance boost, particularly for small messages. Additionally, we hope that the new C++ backend can be extended by the community and plugged directly into object-oriented applications without any additional wrapping of UCX's C code, as well as used for bindings to other object-oriented languages. This effort should also simplify debugging of the Python communication layers.

Among the new features are a dedicated UCX worker thread, delayed submissions, direct Python future notification, and Python multi-buffer transfers. The dedicated worker thread is responsible for progressing the worker and setting the state of all UCXX requests. Delayed submissions allow deferring all message transfer calls, even eager ones, to the worker thread. Direct Python future notification lets a Python application run a separate notifier thread whose sole purpose is to notify the application when a message transfer has completed, reducing the overhead of the Python asynchronous I/O interface, and multi-buffer transfers in a single asynchronous call help reduce overhead further.

Peter Entschev, NVIDIA

Speaker bio

14:15-15:00
AMD GPUs in the UCX ecosystem

This presentation will give an overview of the current state of using AMD Instinct™ MI series accelerators and the ROCm software stack in the UCX ecosystem. The talk will cover work performed in three different libraries. First, we present the current support for ROCm device memory in UCX and the most recent work in this area. This includes revising the hardware agent selection logic for data transfer operations on ROCm device memory; enhancements to the memory type detection to account for different ROCm memory types and Linux kernel versions; and tuning and validation for supporting the AMD Instinct™ MI250X GPUs.

The second part of the talk will introduce the support for AMD GPUs in the UCC library, enabling high-performance collective operations on data in ROCm device memory in Open MPI and PyTorch. Support for AMD GPUs is available through the UCP team layer (TL) component, which requires a ROCm-enabled compilation of the UCX library, and through the ROCm Communication Collectives Library (RCCL), a library providing high-performance collective operations on AMD GPUs. Having support through both UCP and RCCL gives end users the ability to choose the most performant configuration for their use case. A special focus of the presentation will be on reduction operations, which require the temporary buffers used in the algorithm to match the memory type of the input and output buffers in order to maximize the performance of the compute operations. Early results show significant performance benefits for reduction operations when using device buffers.

The final part of the presentation will give some details on the ROCm device memory support in Open MPI, a feature introduced earlier this year. The key contribution is improving the performance of non-contiguous derived datatypes in Open MPI when using device memory. Preliminary results indicate significant performance improvements, up to a factor of 38 for certain derived datatypes. The presentation will also give a brief outlook on the new accelerator framework currently under development in Open MPI. This framework provides an abstraction layer for utilizing GPUs in Open MPI and allows the library to support multiple GPU vendors and APIs simultaneously. GPU support thus becomes a runtime option, simplifying the packaging of Open MPI.
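
As a concrete illustration of the usage model this work enables, the hedged sketch below shows an MPI reduction operating directly on ROCm device memory, with UCC or RCCL selected underneath by the MPI library. It assumes a ROCm-enabled Open MPI build and omits error handling.

```c
/* Sketch: GPU-aware MPI collective on a ROCm device buffer.
 * No staging through host memory is required by the application. */
#include <hip/hip_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const size_t n = 1 << 20;
    double *dbuf;
    hipMalloc((void **)&dbuf, n * sizeof(double));
    /* ... fill dbuf with a GPU kernel or hipMemcpy ... */

    /* The reduction reads and writes the device pointer directly. */
    MPI_Allreduce(MPI_IN_PLACE, dbuf, n, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    hipFree(dbuf);
    MPI_Finalize();
    return 0;
}
```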

Edgar Gabriel, AMD

Edgar Gabriel is an active software developer with 18 years of experience in high-performance computing, parallel file I/O, performance tuning, and parallel application development. He is the main architect, developer, and maintainer of the Open MPI parallel file I/O components, and one of the original authors of the Open MPI library.

15:00 Adjourn
09/21 08:00-08:15
Opening Remarks
Pavel Shamis (Pasha), NVIDIA

Pavel Shamis recently joined NVIDIA; prior to that he was a Principal Research Engineer at Arm. His work is focused on the co-design of software and hardware building blocks for high-performance interconnect technologies, the development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development on multiple projects in the high-performance communication domain, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, Open MPI, MVAPICH, OpenSHMEM, and others. Pavel is a board member of the UCF consortium and a co-maintainer of Open UCX. He holds multiple patents in the area of in-network accelerators. Pavel is a recipient of the 2015 R&D100 award for his contribution to the development of the CORE-Direct in-network computing technology and the 2019 R&D100 award for the development of the Open Unified Communication X (Open UCX) software framework for HPC, data analytics, and AI.

08:15-09:00
InfiniBand Performance Isolation Best Practices

High performance computing and artificial intelligence have evolved to be the primary data processing engines for wide commercial use. HPC clouds host growing numbers of users and applications, and therefore need to carefully manage network resources and provide performance isolation between workloads. We'll explore best practices for optimizing network activity and supporting a variety of applications and users on the same network, including application examples from on-premises clusters and from the Microsoft Azure HPC cloud.

Gilad Shainer, NVIDIA

See bio above.

Jithin Jose, Microsoft

Speaker Bio

09:00-09:45
Unified Collective Communication (UCC) State of the Union 2022

abstract

Manjunath Gorentla Venkata, NVIDIA

Manjunath Gorentla Venkata is a director of architecture and principal HPC architect at NVIDIA. He has researched, architected, and developed multiple HPC products and features. His team is primarily responsible for developing features for parallel programming models, libraries, and network libraries to address the needs of HPC and AI/DL systems. The innovations architected and designed by him and his team land as features in NVIDIA networking products, including UCC, UCX, CX HCAs, and BlueField DPUs. Prior to NVIDIA, Manju worked as a research scientist at DOE's ORNL, focusing on middleware for HPC systems, including InfiniBand and Cray systems. Manju earned Ph.D. and M.S. degrees in computer science from the University of New Mexico.

Valentin Petrov, NVIDIA

BIO

Ferrol Aderholdt, NVIDIA

BIO

Sergey Lebedev, NVIDIA

BIO

09:45-10:45 Break
10:45-11:30
MPICH + UCX: State of the Union

In this talk, we will discuss the current state of MPICH support for the UCX library, focusing on changes since the last annual meeting. Topics covered will include build configuration, point-to-point communication, RMA, multi-threading, GPU support, and more. We also look toward future UCX development items for the coming year.

Yanfei Guo, Argonne National Laboratory

Dr. Yanfei Guo holds an appointment as an Assistant Computer Scientist at Argonne National Laboratory. He is a member of the Programming Models and Runtime Systems group. He has been working on multiple software projects including MPI, Yaksa, and OSHMPI. His research interests include parallel programming models and runtime systems in extreme-scale supercomputing systems, data-intensive computing, and cloud computing systems. Yanfei received the best paper award at the USENIX International Conference on Autonomic Computing 2013 (ICAC'13). His work on programming models and runtime systems has been published in peer-reviewed conferences and journals including the ACM/IEEE Supercomputing Conference (SC'14, SC'15) and IEEE Transactions on Parallel and Distributed Systems (TPDS).

11:30-12:15
Stream-synchronous communication in UCX

Applications that take advantage of GPU capabilities often use stream abstractions to express dependencies and concurrency and to make the best use of the underlying hardware. Streams capture the notion of a queue of tasks that the GPU executes in order, allowing compute tasks (such as GPU kernels) and communication tasks (such as a memory copy between host and device memory) to be enqueued and dequeued. The GPU is not required to maintain any ordering between tasks belonging to different streams, so applications commonly use multiple streams to increase occupancy of GPU resources. A task enqueued onto a stream is generally asynchronous from the CPU's perspective but synchronous with respect to other tasks enqueued on the same stream. A current limitation of UCX (and of most libraries built on it) is that it provides no abstractions for building dependencies between tasks enqueued onto streams and UCX communication operations. This means that if the CPU is required to send the result of a GPU kernel to a peer process, it must first synchronize with the stream onto which the kernel was enqueued. CPU resources are thus wasted even though there are ways to build communication dependencies without explicit CPU intervention in the critical path. The problem is especially important to solve in applications dominated by short-running kernels, where kernel launch overheads are the primary bottleneck. Finally, such capabilities are already part of existing communication libraries such as NCCL, so this limitation in UCX presents a gap that applications want addressed for better composition. In this work, we plan to present:

1. the current shortcomings of CPU-synchronous communication;
2. alternatives for extending the UCX API to embed stream objects into communication tasks;
3. stream-synchronous send/receive and progress semantics;
4. interoperability with CPU-synchronous semantics;
5. implications on protocol implementations for performance and overlap.
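
To make item 1 above concrete, here is a minimal sketch of the CPU-synchronous pattern that stream-aware semantics would remove from the critical path. The kernel launch and endpoint setup are assumed elsewhere, and the request-completion loop is elided.

```c
/* Sketch: sending a GPU kernel's result today requires the CPU to
 * drain the stream before UCX may touch the data. */
#include <cuda_runtime.h>
#include <ucp/api/ucp.h>

void send_kernel_result(ucp_ep_h ep, cudaStream_t stream,
                        void *dev_buf, size_t len, ucp_tag_t tag)
{
    /* produce_result<<<grid, block, 0, stream>>>(dev_buf); */

    /* CPU blocks here even though only the send depends on the
     * kernel -- the overhead the proposed stream-synchronous
     * semantics aim to eliminate. */
    cudaStreamSynchronize(stream);

    ucp_request_param_t param = { .op_attr_mask = 0 };
    ucs_status_ptr_t req = ucp_tag_send_nbx(ep, dev_buf, len, tag, &param);
    /* ... progress the worker and wait for req completion ... */
    (void)req;
}
```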

Akshay Venkatesh, NVIDIA

Speaker Bio

Sreeram Potluri, NVIDIA

Speaker Bio

Jim Dinan, NVIDIA

Jim Dinan is a principal engineer at NVIDIA in the GPU communications team. Prior to joining NVIDIA, Jim was a principal engineer at Intel and a James Wallace Givens postdoctoral fellow at Argonne National Laboratory. He earned a Ph.D. in computer science from The Ohio State University and a B.S. in computer systems engineering from the University of Massachusetts at Amherst. Jim has served for more than a decade on open standards committees for HPC parallel programming models, including MPI and OpenSHMEM, and he currently leads the MPI Hybrid & Accelerator Working Group.

Hessam Mirsadeghi, NVIDIA

Speaker Bio

12:15-12:30 Break
12:30-13:15
Bring the BitCODE - Moving Compute and Data in Distributed Heterogeneous Systems

In this paper, we present a framework for moving compute and data between processing elements in a distributed heterogeneous system. The implementation of the framework is based on the LLVM compiler toolchain combined with the UCX communication framework. The framework can generate binary machine code or LLVM bitcode for multiple CPU architectures and move the code to remote machines while dynamically optimizing and linking the code on the target platform. The remotely injected code can recursively propagate itself to other remote machines or generate new code. The goal of this paper is threefold: (a) to present an architecture and implementation of the framework that provides the essential infrastructure to program a new class of disaggregated systems wherein heterogeneous programming elements such as compute nodes and data processing units (DPUs) are distributed across the system, (b) to demonstrate how the framework can be integrated with modern, high-level programming languages such as Julia, and (c) to demonstrate and evaluate a new class of eXtended Remote Direct Memory Access (X-RDMA) communication operations that are enabled by this framework. To evaluate the capabilities of the framework, we used a cluster with Fujitsu CPUs and a heterogeneous cluster with Intel CPUs and BlueField-2 DPUs, interconnected using a high-performance RDMA fabric. We demonstrate an X-RDMA pointer-chase application that outperforms an RDMA GET-based implementation by 70% and is as fast as active messages, but does not require function predeployment on remote platforms.

Luis E. Peña, Arm

Speaker bio

13:15-14:00
UCX on RISC-V After-Action Review

Tactical Computing Labs recently ported UCX to RISC-V. Porting UCX to RISC-V presents opportunities for the high performance computing (HPC) community to identify gaps in the current RISC-V GNU/Linux implementation, codify RISC-V ISA extensions for HPC, and identify nuances in the RISC-V ISA specification which need to be clarified for HPC software.

RISC-V is an "open source" instruction set architecture (ISA) providing hardware developers a royalty-free specification, or contract, for implementing RISC-V processors. The RISC-V ISA is segmented into extensions, providing hardware developers "building blocks" to select from when designing and implementing a RISC-V processor. Currently, RISC-V enjoys popularity in the IoT (Internet-of-Things) device market, while commercial GNU/Linux distribution support for RISC-V is nascent. As a consequence of RISC-V's popularity in the IoT device market, GNU/Linux support for RISC-V has been driven by and focused on IoT devices.

Tactical Computing Labs' port of UCX to RISC-V identified gaps in current operating system support and in the ISA specification that are relevant to hardware engineers and software developers interested in the utilization of RISC-V in HPC. This technical talk will highlight and explain the gaps discovered by the port: limitations in the current GNU/Linux support for RISC-V, the impact of the GNU/Linux kernel's RISC-V support on interconnect choices, the base RISC-V ISA's support for managing memory consistency and its implications in an HPC context, nuances in the RISC-V specification's advice for handling self-modifying code, and recommendations for parties interested in the use of RISC-V for HPC.
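
As one concrete example of the self-modifying-code nuance, the hedged C sketch below shows the instruction-cache synchronization a runtime must perform after writing code on RISC-V. The helper name is illustrative; __builtin___clear_cache is the portable GCC/Clang hook, which on Linux/RISC-V goes through the kernel's icache-flush interface to emit the required FENCE.I semantics.

```c
/* Sketch: installing generated instructions into an executable page.
 * The caller is assumed to have mapped exec_page writable+executable. */
#include <stdint.h>
#include <string.h>

void install_code(void *exec_page, const uint32_t *insns, size_t n)
{
    memcpy(exec_page, insns, n * sizeof(uint32_t));

    /* Without this synchronization, a hart executing exec_page may
     * fetch stale instructions -- the gap the port had to reason about. */
    __builtin___clear_cache((char *)exec_page,
                            (char *)exec_page + n * sizeof(uint32_t));
}
```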

Christopher Taylor, Tactical Computing Labs

Speaker bio

14:00 Adjourn
09/22 08:00-08:15
Opening Remarks for OpenSHMEM session

OpenSHMEM update

Steve Poole, LANL

BIO

08:15-08:45
Rethinking OpenSHMEM Concepts for Better Small Message Performance

OpenSHMEM update

Aaron Welch, University of Houston

BIO

08:45-09:15
QoS-based Interfaces for Taming Tail Latency

OpenSHMEM update

Vishwanath Venkatesan, NVIDIA and Manjunath Gorentla Venkata, NVIDIA

BIO

09:15-10:00 Break
10:00-11:00
Panel: Future Direction of OpenSHMEM and Related Technologies

Community discussion

Steve Poole, LANL; Pavel Shamis, NVIDIA; Oscar Hernandez, NVIDIA; Tony Curtis, Stony Brook University; Jim Dinan, NVIDIA; Manjunath Gorentla Venkata, NVIDIA; and Matthew Baker, Voltron Data

BIO

11:00-11:05 Adjourn