Exawind (CPU) build failure on Perlmutter #570

Open
ajpowelsnl opened this issue Oct 4, 2023 · 8 comments

ajpowelsnl commented Oct 4, 2023

Summary

  • The issue arose while attempting a CPU-only Spack build of exawind (for container engineering) in a spack-manager project
  • A recently submitted amr-wind PR did not fix the exawind build failures
  • A Trilinos patch (kokkos_zero_length_team.patch) may not interact properly with the recent "multiphase" additions for CPU-only builds
  • @lastephy and @wyphan are tracking the issue
  • @ndellingwood, have you seen this failure before?

Error

==> Installing trilinos-13.0.1-g7whzyzvrjad7gggyhjxsmfd46uluh4s
==> No binary for trilinos-13.0.1-g7whzyzvrjad7gggyhjxsmfd46uluh4s found: installing from source
==> Using cached archive: /pscratch/sd/a/ajpowel/s_man_2/spack-manager/spack/var/spack/cache/_source-cache/archive/0b/0bce7066c27e83085bc189bf524e535e5225636c9ee4b16291a38849d6c2216d.tar.gz
2 out of 2 hunks FAILED -- saving rejects to file packages/kokkos/core/src/Cuda/Kokkos_Cuda_Parallel.hpp.rej


==> Patch /pscratch/sd/a/ajpowel/s_man_2/spack-manager/repos/exawind/packages/trilinos/kokkos_zero_length_team.patch failed.
==> Error: ProcessError: Command exited with status 1:
    '/usr/bin/patch' '-s' '-p' '1' '-i' '/pscratch/sd/a/ajpowel/s_man_2/spack-manager/repos/exawind/packages/trilinos/kokkos_zero_length_team.patch' '-d' '.


==> Warning: Skipping build of nalu-wind-multiphase-onvuy4bsahg4a7ejmfjo6wcecuyqkbxn since trilinos-13.0.1-g7whzyzvrjad7gggyhjxsmfd46uluh4s failed
==> Warning: Skipping build of exawind-multiphase-q7cdvlpy5otawckgddbpi42hqwtklnmy since nalu-wind-multiphase-onvuy4bsahg4a7ejmfjo6wcecuyqkbxn failed
==> Error: exawind-multiphase-q7cdvlpy5otawckgddbpi42hqwtklnmy: Package was not installed
==> Error: Installation request failed.  Refer to reported errors for failing package(s).

Environment

Currently Loaded Modules:
  1) craype-x86-milan                        6) cray-dsmml/0.2.2       11) perftools-base/23.03.0   16) cudatoolkit/11.7
  2) libfabric/1.15.2.0                      7) cray-libsci/23.02.1.1  12) cpe/23.03                17) craype-accel-nvidia80
  3) craype-network-ofi                      8) cray-mpich/8.1.25      13) xalt/2.10.2              18) gpu/1.0
  4) xpmem/2.6.2-2.5_2.27__gd067c3f.shasta   9) craype/2.7.20          14) Nsight-Compute/2022.1.1
  5) PrgEnv-gnu/8.3.3                       10) gcc/11.2.0             15) Nsight-Systems/2022.2.1

Reproducer

git clone --recursive git@github.com:sandialabs/spack-manager.git
cd spack-manager/
export SPACK_MANAGER=${PWD}
source start.sh && spack-start
spack spec exawind%gcc~cuda
spack install exawind%gcc~cuda

wyphan commented Oct 5, 2023

I think I've pinpointed this failed patch to this commit in Kokkos core:
kokkos/kokkos@96077d5
This commit moves the function mentioned in the patch into a different file. Looks like affected versions are Kokkos >= 3.7.00.

As shown in the Spack error log above, Trilinos 13.0.1 is affected, but I'm honestly unsure which Kokkos version is embedded in that particular Trilinos release. Summoning @tasmith4, who adjusted this patch to apply only to Trilinos <= 13.3.0. Perhaps it needs to be adjusted again.

Also, since this patch pertains to the CUDA support part of Kokkos, perhaps the patch is unnecessary for ~cuda builds?
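
The condition discussed above (patch limited to Trilinos <= 13.3.0, and possibly unnecessary for ~cuda builds) can be sketched in plain Python. This is only an illustration: `parse_version` and `should_apply_patch` are hypothetical helpers, not part of Spack or the exawind repo, and gating on CUDA is the suggestion raised in this comment, not a confirmed fix. Plain tuple comparison is also a simplification of Spack's version-range semantics. In a real recipe the same idea would probably go into the `when=` argument of the `patch` directive.

```python
def parse_version(text):
    """Turn '13.0.1' into (13, 0, 1); return None for branch names such as 'develop'."""
    parts = text.split(".")
    if not all(part.isdigit() for part in parts):
        return None
    return tuple(int(part) for part in parts)


def should_apply_patch(trilinos_version, cuda_enabled, max_version="13.3.0"):
    """Hypothetical rule: apply the patch only to CUDA builds of Trilinos <= max_version."""
    version = parse_version(trilinos_version)
    if version is None:
        # Assumption: branch versions (e.g. 'develop') are newer than the cutoff.
        return False
    return cuda_enabled and version <= parse_version(max_version)


# The CPU-only build from this issue versus a CUDA build of the same release.
print(should_apply_patch("13.0.1", cuda_enabled=False))
print(should_apply_patch("13.0.1", cuda_enabled=True))
```

Under this rule the CPU-only build in the error log would skip the patch entirely, which sidesteps the failed hunks but does not address why the hunks fail on CUDA builds.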


tasmith4 commented Oct 5, 2023

@wyphan I'm actually not on the project anymore, so I don't have the latest info on exactly how that patch should apply.

@psakievich can you take a look?


wyphan commented Oct 5, 2023

@tasmith4 Ah, sorry about that. Please feel free to hit unsubscribe from notifications for this issue.

psakievich self-assigned this on Oct 5, 2023

psakievich commented Oct 5, 2023

I am curious why trilinos@13.0.1 is being used. That version is very old and almost certainly incompatible with the current exawind stack. We have been pinned to a single commit for months to get our challenge problem through. See:

version: [13.4.0.2023.02.28, develop]

and
version("13.4.0.2023.02.28", commit="8b3e2e1db4c7e07db13225c73057230c4814706f")

This also conflicts directly with the require statement in the Perlmutter packages.yaml:

    trilinos:
      require:
      - any_of: ["@13.4.0", "@develop"]

@ajpowelsnl and @wyphan will you investigate this further? Seems like something is off in the way you're configuring this case. I would expect it to not even concretize.
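
One quick way to confirm the mismatch is to pull the package versions out of the install log and compare them with the pin. The sketch below is a rough check, not a Spack feature: `installed_versions` and `satisfies_pin` are hypothetical helpers, the regular expression assumes the `==> Installing <name>-<version>-<32-character hash>` line format seen in the error log above, and the prefix match is only an approximation of how Spack evaluates `@13.4.0`.

```python
import re

# Matches lines such as "==> Installing trilinos-13.0.1-<32-character hash>".
INSTALL_LINE = re.compile(
    r"^==> Installing (?P<name>[a-z0-9-]+?)-(?P<version>\d[\w.]*)-(?P<hash>[a-z0-9]{32})$"
)


def installed_versions(log_text):
    """Return {package name: version} for every 'Installing' line in the log."""
    found = {}
    for line in log_text.splitlines():
        match = INSTALL_LINE.match(line.strip())
        if match:
            found[match.group("name")] = match.group("version")
    return found


def satisfies_pin(version, allowed=("13.4.0", "develop")):
    """Approximate check: exact match, or a longer version sharing an allowed prefix."""
    return any(version == a or version.startswith(a + ".") for a in allowed)


# Sample line copied from the error log in this issue.
log = "==> Installing trilinos-13.0.1-g7whzyzvrjad7gggyhjxsmfd46uluh4s\n"
versions = installed_versions(log)
print(versions["trilinos"], satisfies_pin(versions["trilinos"]))
```

For the log in this issue the check reports trilinos 13.0.1 as outside the pin, while 13.4.0.2023.02.28 would pass.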


wyphan commented Oct 5, 2023

@psakievich I think trilinos@13.0.1 got picked up in the container build on Perlmutter, since no Spack package preferences exist for it yet outside of @ajpowelsnl's fork. Building on bare metal on Perlmutter does pick trilinos@13.4.0.2023.02.28 as expected.

@psakievich

Is the container using spack-manager? If not then all bets are off on getting this to work currently. @jrood-nrel and I have it on our todo list to upstream our repo changes to mainline spack in the coming weeks, but I currently don't have high hopes for a non-spack-manager build.

@ajpowelsnl

Hi @psakievich - @wyphan and I will work through these questions when we have access to the machine again, and provide definitive answers. And yes, the container uses spack-manager.


wyphan commented Oct 12, 2023

@psakievich I think I've pinpointed the issue to the old Spack version that is currently pinned in the spack-manager repo. The trilinos Spack recipe at that commit contains the following:

    version(
        "13.0.1",
        sha256="0bce7066c27e83085bc189bf524e535e5225636c9ee4b16291a38849d6c2216d",
        preferred=True,
    )

https://github.com/spack/spack/blob/ee68baf254ce8f401704ef1a62b77057487d4a12/var/spack/repos/builtin/packages/trilinos/package.py#L48-L52

This part no longer exists in Spack develop since commit spack/spack@b85a66f.
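
To illustrate why a `preferred=True` flag can pull in an old release when no site preferences are present, here is a deliberately simplified model of version selection. It is not Spack's concretizer: `pick_version` is a hypothetical function, and the `declared` list is made-up example data apart from the 13.0.1 entry quoted above.

```python
def pick_version(declared, allowed=None):
    """Pick a version from `declared`, a list of (version, preferred) pairs, newest first.

    `allowed` stands in for a site-level requirement; None means no requirement.
    """
    candidates = [(v, p) for v, p in declared if allowed is None or v in allowed]
    if not candidates:
        raise ValueError("no declared version satisfies the requirement")
    for version, preferred in candidates:
        if preferred:
            # A preferred version wins over newer ones.
            return version
    # Otherwise fall back to the newest remaining candidate.
    return candidates[0][0]


# Illustrative data; only the preferred 13.0.1 entry comes from the recipe above.
declared = [("13.4.0", False), ("13.2.0", False), ("13.0.1", True)]
print(pick_version(declared))                      # no site requirement
print(pick_version(declared, allowed={"13.4.0"}))  # with a site requirement
```

In this model the unconstrained call returns 13.0.1 and the constrained call returns 13.4.0, which mirrors the difference between the container build and the bare-metal build reported earlier in the thread.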
