-
Hi Marcus,
Q1: Local (and global) arrays can be defined on a subset of the processes of the MPI group used by an IO. Moreover, one process can contribute more than one block (with multiple Put() calls in a step).
Q2: The fact that you can define a global value, and any attribute, on every process is a convenience. ADIOS has to filter out the duplicates at some point, either when writing or when reading, since only a single value is presented to the reader. Of course this has a performance and memory cost; it is negligible in a few instances, but with thousands of attributes/global values on thousands of processes it could become observable. In short: if you can, define attributes and global values on a single process, but you are allowed to do so on all processes.
Best regards
-
@lizdulac @dmitry-ganyushin Can you please check this code? Thanks
-
I've taken a quick look, but I'm unable to reproduce this behaviour. Unless I missed it, you don't specify exactly how you're determining the number of variables in the output file, but sampling some of the degrees of parallelism and output counts that you listed, and using bpls as a means of counting variables, I get the expected results. For example:
I don't have a good guess at what might explain what you're seeing, but if you tell us something about your read/verification method, we can go from there.
-
Interesting... That points to some kind of problem with the variable naming, with rank 0's name somehow always being used instead of the appropriate rank's. I'm not sure how that would happen from looking at the code, but it's something to look at. BTW, it would be much more efficient in terms of ADIOS metadata overhead to use the same variable name across all ranks and instead pull the appropriate block out (block 0 will be from rank 0, block 1 from rank 1, etc., assuming that all ranks write one block).
-
I don't think the group API is likely to help, because it just manipulates the names internally. How about this: try changing the naming convention. BP5 does some funky name manipulation internally, adding internal information alongside the variable name and handling them as a set; it's possible that there's a bug there. Maybe instead of underscores or dashes, try different characters.
-
Hi, I managed to run my original MWE on two other systems:
In both cases I can reproduce the issue when using BP5:
And, yes, I agree that the behaviour looks very strange. But I am now quite confident that it's not a problem with my local software stack. Cheers
-
Installed Debian on WSL (Debian 12.2, gcc (Debian 12.2.0-14) 12.2.0, mpirun (Open MPI) 4.1.4) and saw the same problem. Specifically, with 9776 variables I get 1/8th of that count, but 8 blocks per variable. At 9768 variables, I get 9768 variables with 1 block each. This happens even if I rename the variables to xxx-_RRRR-. If I increase the length of the name by one extra character to xxxx-*, the issue starts to show up at 9600 variables. With five x's, the limit is 9432. With a single x in the name, the limit is 32320. I printed the list of variables per process, and they generate different variable names as expected. I also changed the function to return the adios2::Variable variable instead of a std::string and used that in Put() to make sure ADIOS does not find the wrong variable based on a string.
-
Debug
…On Fri, Oct 20, 2023, 8:55 PM Greg Eisenhauer ***@***.***> wrote:
Debug or Release build? I've tried dozens of ways to make this happen, but totally failed. The MPI version shouldn't matter, but the compiler may be important.
-
Well, I still haven't been able to duplicate this anywhere, which is a bit of a mystery, but I'm sufficiently convinced that it's a hash collision problem to proceed with that as a fix. @mohrmarc, would you please clone https://github.com/eisenhauer/ADIOS2.git and try branch "ExtremeVars" in your situation? This branch changes the FFS internal format hash to SipHash. It isn't ready to be merged yet, for several reasons. Most importantly, I expect any change in the hashing algorithm to change the set of situations in which we see collisions, so just checking that things now work in some circumstance where they didn't before isn't sufficient. What I'd like to see is a bit more of a torture test, and I'd also like to run some numbers on the computational overhead of using a cryptographic hash as opposed to the non-crypto hash we've been using. That's going to take more time than I can spend at the moment, but perhaps this is a stopgap solution.
-
I'm going to close this; it can be reopened if there are further questions. The summary is that there was a bug in FFS where the format body hash was calculated over only a portion of the format body, with that portion being determined by the value of the bottom 16 bits of the body length. This meant that for larger formats (many variables), the probability of a hash collision was higher. This bug exists in all versions of ADIOS through 2.9.x, but is not in current GitHub master and will not be in 2.10. It should only be a problem if you 1) are writing thousands of variables, and 2) change the variable names in a structured way across ranks (such as naming the variables based on the MPI rank). If either of these things is not true, your odds of a hash collision are still about 1 in 2^62. You also won't get a hash collision if the format body length mod 2^16 is bigger than 300 or so, but you have little control over this at the application level (except maybe by adding dummy variables).
-
Hi,
I only recently came across ADIOS2, but find it quite fascinating. We have already successfully implemented an ADIOS2-based data exporter for data visualisation with ParaView in our HyTeG FEM framework.
However, I have now started working on a checkpoint/restore functionality and ran into strange problems which could be based on misconceptions on my side. Thus, I wanted to clarify some details that did not become fully clear to me after going through the documentation. I have some suspicions, but would like to cross-check.
I hope these questions aren't too trivial or that I just overlooked the answers.
Cheers
Marcus