Work-Around: Segfault in MPI_Init with HIP #4237

Merged: 3 commits, Aug 28, 2023
Changes from 2 commits
Source/ablastr/parallelization/MPIInitHelpers.cpp: 23 changes (20 additions, 3 deletions)
@@ -1,7 +1,6 @@
-/* Copyright 2020 Axel Huebl
- *
- * This file is part of ABLASTR.
+/* This file is part of ABLASTR.
  *
+ * Authors: Axel Huebl
  * License: BSD-3-Clause-LBNL
  */
 #include "MPIInitHelpers.H"
@@ -15,10 +14,18 @@
 # include <mpi.h>
 #endif
 
+// OLCFDEV-1655: Segfault during MPI_Init & in PMI_Allgather
+// https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#olcfdev-1655-occasional-seg-fault-during-mpi-init
+#if defined(AMREX_USE_HIP)
+#include <hip/hip_runtime.h>
+#endif
+
+#include <iostream>
 #include <string>
 #include <utility>
+#include <sstream>
 
 
 namespace ablastr::parallelization
 {
 int
@@ -40,6 +47,16 @@ namespace ablastr::parallelization
 std::pair< int, int >
 mpi_init (int argc, char* argv[])
 {
+    // OLCFDEV-1655: Segfault during MPI_Init & in PMI_Allgather
+    // https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#olcfdev-1655-occasional-seg-fault-during-mpi-init
+#if defined(AMREX_USE_HIP) && defined(AMREX_USE_MPI)
+    hipError_t hip_ok = hipInit(0);
+    if (hip_ok != hipSuccess) {
+        std::cerr << "hipInit failed with error code " << hip_ok << "! Aborting now.\n";
+        return 1;
+    }
+#endif
+
 const int thread_required = mpi_thread_required();
 #ifdef AMREX_USE_MPI
 int thread_provided = -1;

Review comment (Member), on the early return above:

Is there a reason for not using ABLASTR_ALWAYS_ASSERT_WITH_MESSAGE here?

Reply from @ax3l (Member, Author), Aug 26, 2023:

Yes: anything that calls into AMReX functions cannot be used safely before AMReX is initialized. amrex::Assert implements multiple things, and not all of them work with an uninitialized AMReX context.

For instance, not even amrex::Print() works before init; I had to develop a work-around falling back to all-print in pyAMReX: AMReX-Codes/pyamrex#174

Raising a standard exception is a clean thing to do here: we are initializing MPI, and technically there is no AMReX involved at this point in time yet.

ax3l marked this conversation as resolved.
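To make the resolved discussion concrete, below is a minimal sketch of how the early return could be replaced by a standard exception, as the reply suggests. It is an illustrative assumption, not the code from the third commit; it relies only on the C++ standard library and the HIP runtime, which is why it is safe before AMReX (and MPI) are initialized. It assumes <sstream> and <stdexcept> are included.

    // Hypothetical variant of the block above (not the merged patch):
    // abort via a standard exception instead of an early return.
    #if defined(AMREX_USE_HIP) && defined(AMREX_USE_MPI)
        hipError_t const hip_ok = hipInit(0);
        if (hip_ok != hipSuccess) {
            // Build the message with the standard library only; AMReX helpers
            // such as ABLASTR_ALWAYS_ASSERT_WITH_MESSAGE are not usable yet.
            std::ostringstream msg;
            msg << "hipInit failed with error code " << hip_ok << "! Aborting now.";
            throw std::runtime_error(msg.str());
        }
    #endif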
Source/main.cpp: 1 change (1 addition, 0 deletions)

@@ -18,6 +18,7 @@
 
 #include <AMReX_Print.H>
 
+
 int main(int argc, char* argv[])
 {
 ablastr::parallelization::mpi_init(argc, argv);
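For readers outside the WarpX/ABLASTR code base, the pattern this PR applies can be shown in a standalone program: touch the HIP runtime before MPI initialization, then request the desired MPI threading level. This is an illustrative sketch assuming an MPI + HIP build; it is not code from the PR, and the requested level (MPI_THREAD_MULTIPLE) is an example rather than what mpi_thread_required() necessarily returns.

    // Sketch: OLCFDEV-1655 work-around in a plain MPI + HIP program.
    #include <hip/hip_runtime.h>
    #include <mpi.h>

    #include <cstdio>
    #include <cstdlib>

    int main (int argc, char* argv[])
    {
        // Initialize the HIP runtime before MPI_Init to avoid the occasional
        // segfault in MPI_Init / PMI_Allgather described in OLCFDEV-1655.
        if (hipInit(0) != hipSuccess) {
            std::fprintf(stderr, "hipInit failed! Aborting now.\n");
            return EXIT_FAILURE;
        }

        // Loosely mirrors ablastr::parallelization::mpi_init: request a
        // threading level and record what the MPI library actually provides.
        int provided = MPI_THREAD_SINGLE;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0 && provided < MPI_THREAD_MULTIPLE) {
            std::fprintf(stderr, "Warning: provided MPI thread level %d is below the requested level.\n", provided);
        }

        MPI_Finalize();
        return EXIT_SUCCESS;
    }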