Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong data received when sharing a team with multiple threads #979

Open
nirandaperera opened this issue May 23, 2024 · 22 comments
Open

Wrong data received when sharing a team with multiple threads #979

nirandaperera opened this issue May 23, 2024 · 22 comments

Comments

@nirandaperera
Copy link

nirandaperera commented May 23, 2024

Hi, I have the following setup.

  • UCC lib initialized with UCC_THREAD_MULTIPLE
  • UCC ctx created with UCC_CONTEXT_SHARED
  • 1 team created with UCC_COLLECTIVE_INIT_AND_POST_UNORDERED and UCC_NO_SYNC_COLLECTIVES (IINM these configs are not used underneath)
  • Separate thread for ucc ctx progress

I have multiple threads issuing a set of tagged allgather operations (with callbacks) to the team. I am keeping the send and receive buffers alive in the data object of the callback.
It seems the receive buffers are receiving wrong data from the parallel allgather operations.
It's working fine for a single thread.

Can someone please help me out here?

Is sharing a team with multiple threads, an anti-pattern? I checked the allgather test code, and it seemed to me that there was a separate team for every test proc.

@nirandaperera
Copy link
Author

I am seeing several Message truncated errors in the UCC log.

6479650.125378] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.126129] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
[1716479650.126132] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.126893] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2
[1716479650.126895] [XPS-15-9510:3236166:7]     tl_ucp_coll.c:133  TL_UCP ERROR failure in recv completion Message truncated
[1716479650.127440] [XPS-15-9510:3236166:7]          ec_cpu.c:186  cpu ec DEBUG executor finalize, eee: 0x584f0ee37600
[1716479650.127449] [XPS-15-9510:3236166:7]     ucp_request.c:748  UCX  DEBUG message truncated: recv_length 10 offset 0 buffer_size 2

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

@nirandaperera Just going out on a limb and say that UCX was also built with thread-multiple, correct?

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

@nirandaperera can you paste a reproducer?

@nirandaperera
Copy link
Author

nirandaperera commented May 24, 2024

@janjust I was using conda ucx 1.15.0

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

I'm assuming it's already built with --thread-multiple, can you check?

$ ucx_info -v
# Library version: 1.17.0
# Library path: /hpc/local/benchmarks/daily/next/2024-05-22/hpcx-gcc-redhat7/ucx/mt/lib/libucs.so.0
# API headers version: 1.17.0
# Git branch 'v1.17.x', revision b67bc34
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --with-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.4.0/redhat7 --with-gdrcopy --prefix=/build-result/hpcx-gcc-redhat7/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37

Should say
--enable-mt

@nirandaperera
Copy link
Author

@janjust yes it does

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

Do you have a simple reproducer handy and we'll take a look.
We did something similar to what you're doing with AR but changed to 1 ucc_context/thread due to performance, not correctness

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

Just to add onto this: from what i'm hearing we've never tried this model of multitple threads/team, and to me it seems to mimic multiple threads/mpi_comm which would be against the spec.

What I suspect is happening the threads are progressing each other's callbacks because the context is shared thus each thread picks up the task in no particular order (this is me guessing over what we've observed in our case).

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

@nsarka Can you please look at this thread - does what I'm saying make sense, is this what you observed with sliding-window?

@nirandaperera
Copy link
Author

@janjust FWIW I am using tagged collectives

@nirandaperera
Copy link
Author

nirandaperera commented May 24, 2024

@janjust There's a separate thread that progresses the ctx. From what I understood from the source code that all teams will be ultimately enqueing tasks to a single progress queue in the context, isn't it?

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

yes, I believe that's correct

@nirandaperera
Copy link
Author

nirandaperera commented May 24, 2024

@janjust I was using v1.2 I just tried my code with the ucc master branch build and I don't see this error anymore. I think previously allgather knomial algorithm was active. But FWIU in the master branch, there are two new algorithms,

    3 :            bruck : O(log(N)) Variation of Bruck algorithm
    4 :          sparbit : O(log(N)) SPARBIT algorithm

I think one of these is now running the operation.
(I tried v1.3 as well & it had the same issue)

@nirandaperera
Copy link
Author

Sorry, I misspoke. I put some logs in the code and found out that it's the knomial algorithm in both cases.

@Sergei-Lebedev
Copy link
Contributor

there was a fix for multithreaded case, it was merged into master but we didn't include it into v1.3.x branch
#932

@nirandaperera
Copy link
Author

nirandaperera commented May 24, 2024

@Sergei-Lebedev it turned out it was the latest PR that fixed the problem. #926

I think this change might be the fix.
https://github.com/openucx/ucc/pull/926/files#diff-cd0947a917169a44f349c3331aace588a31757bdcc4d555f717048318719a09aR376

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

hmm, odd, that pr had nothing to do with multi-threading

@Sergei-Lebedev
Copy link
Contributor

@Sergei-Lebedev it turned out it was the latest PR that fixed the problem. #926

I think this change might be the fix. https://github.com/openucx/ucc/pull/926/files#diff-cd0947a917169a44f349c3331aace588a31757bdcc4d555f717048318719a09aR376

yes, could be

@nirandaperera
Copy link
Author

@janjust I think the tags have not been propagated properly until this PR. 🙂

@nirandaperera
Copy link
Author

nirandaperera commented May 28, 2024

@Sergei-Lebedev is there a way to get a patch release for v1.3?

@Sergei-Lebedev
Copy link
Contributor

@Sergei-Lebedev is there a way to get a patch release for v1.3?

@manjugv wdyt?

@nirandaperera
Copy link
Author

@Sergei-Lebedev it might need a proper test case for this scenario anyways, I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants