-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise Memory Consumption during Setup #2005
Optimise Memory Consumption during Setup #2005
Conversation
…umonguous-buffer-of-all-the-labels-in-the-world
…umonguous-buffer-of-all-the-labels-in-the-world
…umonguous-buffer-of-all-the-labels-in-the-world
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor issues / comments.
…umonguous-buffer-of-all-the-labels-in-the-world
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I'll leave merging up to you, in case you'd like feedback from others on this change.
do we still need the struct |
No and it should no longer be here. |
res.max_rss = (n.max_rss == -1) ? "-" : std::to_string(n.max_rss); | ||
res.fst_rss = (n.fst_rss == -1) ? "-" : std::to_string(n.fst_rss); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity: why are you not using snprintf
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No particular reason, that was just the way it was done here before.
arbor/communication/communicator.cpp
Outdated
if (!sources.count(sgid)) sources.emplace(sgid, label_map{sgid, rec}); | ||
auto src_lid = sources.at(sgid).resolve(c.source); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!sources.count(sgid)) sources.emplace(sgid, label_map{sgid, rec}); | |
auto src_lid = sources.at(sgid).resolve(c.source); | |
auto src_lid = sources.try_emplace(sgid, sgid, rec).first->second.resolve(c.source); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if I do this, will the value be constructed? If so, it's precisely running against what I want.
… on test 8.13MB -> 4.63MB
Ponder whether we can make MT build more efficient.
Looks like this'll be essential for large simulations, the use-case we promote Arbor for. Is it possible to provide either strategy as a user option? Give me fast init, or give me low memory usage? Or even auto-activate the low mem path upon ncells > threshold (e.g. 100M?). If not that, maybe a troubleshooting checklist for when e.g. people run out of memory or other fail situations. |
Use the simple but well-known FNV-1a hash function to map `cell_tag_type` aka `std::string` to an `uint64_t` for label resolution. The former type has a size of 32B or more and the latter 8B, thus cutting the storage and bandwidth requirements by 3/4. The hash function is implemented from the reference given on the authors' website/wikipedia and is extremely simple. If we ever experience issues, we might consider switching this to something of higher quality via an external library, candidates are `xxHASH` and `Murmur3`. https://github.com/Cyan4973/xxHash Note: This should further relieve the memory pressure on larger scale simulation as formulated in #1969 and make #2005 less urgent. There is no performance impact (at laptop scale), but the memory savings are worth it.
Spin-off from #2005. Make the primary load balancing cleaner and faster and more maintainable. Thus: - remove all MPI calls, this is now purely local - remove temporary data structures and/or coral them into their own little scopes - simplify `super_cells` vs `regular_cells` - sort less - sparse connection tables Partially inspired by external feedback
This has been splintered and merged in those splinters. |
Introduction
Historically, our setup phase is quite cavalier with performance and memory which has given rise
to various concerns for larger networks. As we are preparing for large deployments in the next
HPC generation, we need to address this.
Tests on MPI are still failing for domain decomposition due to a cyclic shift of groups
which shouldn't influence the test or functionality. But I am too stupid to fix the test to
pass.
Label Resolution
Our label resolution scheme concatenates a global vector of all source labels across all MPI tasks.$\mathcal{O}(10)$ --- labels per cell.
In my recent experiments, this exhausts the node-local memory (512GB) at around 150M cells,
with only a few ---
Fix
In each cell group, build a lazy map of the GIDs we are connected to and the associated source
labels. This reduces the required memory to the actual high-water-mark, but incurs the cost
of instantiating cells multiple times.
If this still not enough, we can proceed to minimize the memory consumption even more by
(simple) evict labels from the cache if we are close to the limit. Costs more instantiations, butDonesaves memory
long as needed. Now memory is O(1), but we need to fiddle with construction order all the while
keeping the connectivity stable. Which might be complicated
Thingification during Cable Cell Construction
Cable cells constantly invoke
thingify(region, provider)
during construction. Mostly however, wepaint
those region in succession, ie:This costs a lost of temporary allocations.
Fix
Simply cache the last region, following the idea that this is the 90% case.
Temporary data structures during Setup
embed_pwlin
is particularly bad with managing allocation so much so that it shows up on timeprofiles as a major cost centre. This originates from structures which are nested vectors under the
hood and must return values for reasons of genericity.
Fix
Return references where possible, convert the rest to continuation passing style so we never actually
return a value but take a callback which is invoked with the references.
mechanism_catalogue
allocates temporariesoperator[]
returns a value, where a const reference would be sufficient.Fix
Obvious, no?
General
Split large functions -- especially in
fvm_layout
-- into smaller ones. This fences in temporary memory allocations.Clean-up and Extras
Rename
mc_cell_*
to the propercable_cell_*
This also renames -- finally! --
mc_cell_*
tocable_cell_*
Add RSS field to Profiler
To check for the actual memory consumption recording of the high watermark memory was added to the
profiler. It now prints the memory used at the 1st hit of this cost centre and the associated maximum. Thus,
we can determine allocation during runtime and setup.
Example (not very helpful, but illustrative)
Add logging
operator new
This will -- if enabled at compile time -- spit out the amount of Bytes allocated and the callstack⚠️ This is hilariously slow, but really helps pinpointing
of the allocation site to
stderr
during runtime.allocations. Use the
script/memory-log-to-profile.py
to convert log files to a report like this:Linked
Fixes #1969