Work-Around: Segfault in MPI_Init with HIP #4237

Merged
merged 3 commits into ECP-WarpX:development from work-around-hip-mpi-hpe on Aug 28, 2023

Conversation


@ax3l ax3l commented Aug 25, 2023

See:
https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#olcfdev-1655-occasional-seg-fault-during-mpi-init

Proposed work-around for @tmsclark in #4236.

I think this might be a general defect of HPE/Cray GPU-aware MPI implementations at this point: explicitly initializing the device context before MPI init can help satisfy the assumptions that GPU-aware MPI makes during its own initialization. A minimal sketch of the intended ordering follows the list below.
I am clarifying with AMD whether we can check for an already initialized HIP/ROCm runtime, in case we want to move this safety net into AMReX at some point, too. (It is also not clear whether hipInit is idempotent.)

  • make part of ablastr::parallelization::mpi_init?
  • test on AMD GPUs, i.e. Frontier, LUMI
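
For reference, a minimal standalone sketch of the ordering this work-around aims for (illustrative only, not the diff in this PR; includes and error handling are assumptions):

#include <hip/hip_runtime.h>
#include <mpi.h>
#include <cstdio>

int main (int argc, char* argv[])
{
    // establish a HIP device context *before* MPI_Init, so that GPU-aware
    // Cray MPICH finds an initialized GPU runtime (OLCFDEV-1655)
    hipError_t const err = hipInit(0);
    if (err != hipSuccess) {
        std::fprintf(stderr, "hipInit failed: %s\n", hipGetErrorString(err));
        return 1;
    }

    MPI_Init(&argc, &argv);
    // ... the usual AMReX/WarpX initialization and simulation would go here ...
    MPI_Finalize();
    return 0;
}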

@ax3l ax3l added the bug (Something isn't working), backend: hip (Specific to ROCm execution (GPUs)), component: parallelization (Guard cell exchanges and particle redistribution), and workaround labels on Aug 25, 2023
All that counts is that HIP is initialized before GPU-aware MPI.

@lucafedeli88 lucafedeli88 left a comment

Thanks a lot for this PR! I've left a small comment. I'll test these changes as soon as possible.

#if defined(AMREX_USE_HIP) && defined(AMREX_USE_MPI)
    // work-around: create the HIP device context before MPI_Init (OLCFDEV-1655)
    hipError_t hip_ok = hipInit(0);
    if (hip_ok != hipSuccess) {
        std::cerr << "hipInit failed with error code " << hip_ok << "! Aborting now.\n";
        return 1;  // closing lines reconstructed for readability, not verbatim from the diff
    }
#endif
@lucafedeli88 lucafedeli88 commented on the lines above:

Is there a reason for not using ABLASTR_ALWAYS_ASSERT_WITH_MESSAGE here?


@ax3l ax3l commented Aug 26, 2023

Yes: anything that calls into AMReX cannot be used (safely) before AMReX is initialized. amrex::Assert does several things, and not all of them work with an uninitialized AMReX context.

For instance, not even amrex::Print() works before init; I had to develop a work-around that falls back to printing from all ranks in pyAMReX: AMReX-Codes/pyamrex#174

Raising a standard exception is a clean thing to do here: we init MPI, and technically there is no AMReX involved at this point yet.
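
To illustrate the pattern (a hedged sketch, not the code in this PR; the helper name is invented):

#include <iostream>
#include <stdexcept>
#include <string>

// hypothetical helper: report a fatal error that happens before AMReX (and
// possibly before MPI) is initialized; amrex::Abort, amrex::Print, and
// ABLASTR_ALWAYS_ASSERT_WITH_MESSAGE all assume an initialized AMReX context,
// so only the C++ standard library is used here
[[noreturn]] void abort_before_amrex_init (std::string const& msg)
{
    std::cerr << msg << std::endl;
    throw std::runtime_error(msg);
}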


ax3l commented Aug 28, 2023

I tested this on Frontier and it does not cause issues.


ax3l commented Aug 28, 2023

@tmsclark @lucafedeli88 let me know if it fixes your LUMI issue :)

@ax3l ax3l merged commit f02ad26 into ECP-WarpX:development Aug 28, 2023
32 checks passed
@ax3l ax3l deleted the work-around-hip-mpi-hpe branch August 28, 2023 17:25