
Optimise Memory Consumption during Setup #2005

Conversation

@thorstenhater (Contributor) commented Oct 19, 2022

Introduction

Historically, our setup phase has been quite cavalier with performance and memory, which has given rise
to various concerns for larger networks. As we are preparing for large deployments on the next
HPC generation, we need to address this.

⚠️ Important ⚠️

Tests on MPI are still failing for domain decomposition due to a cyclic shift of groups,
which shouldn't affect the test or the functionality. So far, however, I have not managed to make
the test pass.

Label Resolution

Our label resolution scheme concatenates the source labels of all cells across all MPI tasks into one
global vector. In my recent experiments, this exhausts the node-local memory (512 GB) at around 150M cells,
with only a few --- $\mathcal{O}(10)$ --- labels per cell.

Fix

In each cell group, build a lazy map of the GIDs we are connected to and their associated source
labels. This reduces the required memory to the actual high-water mark, but incurs the cost
of instantiating cells multiple times (a sketch follows the list below).

If this is still not enough, we can minimize memory consumption further by

  • (simple) evicting labels from the cache when we are close to the limit. This costs more
    instantiations, but saves memory. Done.
  • (optimal) observing which cells connect to a given source GID and instantiating each cell only
    for as long as needed. Memory then becomes $\mathcal{O}(1)$, but we have to fiddle with the
    construction order while keeping the connectivity stable, which might be complicated.
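
A minimal sketch of the lazy map with the simple eviction strategy; the names (`gid_type`,
`label_map`, `instantiate_labels`) are hypothetical stand-ins, not Arbor's actual API:

  #include <cstddef>
  #include <unordered_map>

  using gid_type = unsigned long long;   // stand-in for Arbor's cell GID type
  struct label_map { /* resolved source labels of one cell */ };

  // Hypothetical: instantiate the cell for `gid` and extract its source labels.
  label_map instantiate_labels(gid_type gid) { return {}; }

  // Cells are instantiated lazily, only when a connection references them.
  struct label_cache {
      std::size_t limit;                 // eviction threshold
      std::unordered_map<gid_type, label_map> cache;

      const label_map& get(gid_type gid) {
          auto it = cache.find(gid);
          if (it == cache.end()) {
              // Simple eviction: drop the cache near the limit; evicted
              // entries cost a re-instantiation if they are needed again.
              if (cache.size() >= limit) cache.clear();
              it = cache.emplace(gid, instantiate_labels(gid)).first;
          }
          return it->second;
      }
  };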

Thingification during Cable Cell Construction

Cable cells constantly invoke thingify(region, provider) during construction. Mostly, however, we
paint the same region in succession, i.e.:

  # ...
  .paint('(tag 1)', ...)
  .paint('(tag 1)', ...)
  .paint('(tag 1)', ...)
  #...

This costs a lot of temporary allocations.

Fix

Simply cache the last region, following the idea that this is the 90% case.
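
A minimal sketch of the idea; `extent` and `thingify_impl` are hypothetical stand-ins for the
actual region-resolution machinery:

  #include <optional>
  #include <string>
  #include <vector>

  using extent = std::vector<int>;                          // stand-in for a resolved region
  extent thingify_impl(const std::string&) { return {}; }   // stub resolver

  struct region_cache {
      std::optional<std::string> last_expr;  // last region expression seen
      extent last_extent;                    // its resolved form

      const extent& resolve(const std::string& expr) {
          // Re-resolve only when the expression changes; repeated paints
          // of the same region (the 90% case) hit the cache.
          if (last_expr != expr) {
              last_extent = thingify_impl(expr);
              last_expr = expr;
          }
          return last_extent;
      }
  };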

Temporary data structures during Setup

embed_pwlin is particularly bad at managing allocations, so much so that it shows up in time
profiles as a major cost centre. This originates from structures that are nested vectors under the
hood and must return values for reasons of genericity.

Fix

Return references where possible; convert the rest to continuation-passing style, so we never
actually return a value but instead take a callback that is invoked with the references.
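
A minimal sketch of the continuation-passing variant; `piecewise_data` is a hypothetical stand-in
for the nested-vector structures inside embed_pwlin:

  #include <cstddef>
  #include <utility>
  #include <vector>

  struct piecewise_data {
      std::vector<std::vector<double>> per_branch;

      // Before: returning by value copies the inner vector on every call.
      std::vector<double> values(std::size_t branch) const {
          return per_branch.at(branch);
      }

      // After: continuation-passing style hands the caller a reference,
      // so no temporary is ever materialized.
      template <typename F>
      void with_values(std::size_t branch, F&& f) const {
          std::forward<F>(f)(per_branch.at(branch));
      }
  };

  // Usage: pw.with_values(3, [](const std::vector<double>& vs) { /* consume */ });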

mechanism_catalogue allocates temporaries

operator[] returns a value, where a const reference would be sufficient.

Fix

Obvious, no?
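
A sketch of the change; `mechanism_info` and the map member are placeholders for the catalogue's
actual internals:

  #include <string>
  #include <unordered_map>

  struct mechanism_info {};  // placeholder

  struct catalogue {
      std::unordered_map<std::string, mechanism_info> infos;

      // Before: return by value, copying mechanism_info on every lookup.
      // mechanism_info operator[](const std::string& name) const { return infos.at(name); }

      // After: return a const reference; no temporary is allocated.
      const mechanism_info& operator[](const std::string& name) const {
          return infos.at(name);
      }
  };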

General

Split large functions -- especially in fvm_layout -- into smaller ones. This confines temporary memory allocations to smaller scopes, so they are released sooner.

Clean-up and Extras

Rename mc_cell_* to the proper cable_cell_*

This also renames -- finally! -- mc_cell_* to cable_cell_*

Add RSS field to Profiler

To check the actual memory consumption, recording of the high-water-mark memory (RSS) was added to the
profiler. It now prints the memory in use at the first hit of each cost centre and the associated maximum. Thus,
we can separate allocations made during setup from those made at runtime.
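
For illustration, the high-water mark can be sampled on POSIX systems roughly like this (a sketch,
not necessarily the profiler's actual code; on Linux, ru_maxrss is reported in kB):

  #include <sys/resource.h>

  // Maximum resident set size of this process so far.
  long max_rss_kb() {
      rusage usage{};
      getrusage(RUSAGE_SELF, &usage);
      return usage.ru_maxrss;
  }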

Example (not very helpful, but illustrative)

REGION             CALLS  WALL/s  THREAD/s      %  MAX_RSS/kB  1st_RSS/kB
------             -----  ------  --------  -----  ----------  ----------
root                   -   0.686     5.485  100.0           -           -
  advance              -   0.685     5.481   99.9           -           -
    integrate          -   0.685     5.476   99.8           -           -
      current          -   0.281     2.246   40.9           -           -
        pas        40200   0.189     1.515   27.6       29248       29232
        zero       40200   0.035     0.283    5.2       29248       29232
        hh         40200   0.031     0.249    4.5       29248       29232
        expsyn     40200   0.025     0.198    3.6       29248       29232

Add logging operator new

This will -- if enabled at compile time -- print the number of bytes allocated and the call stack
of the allocation site to stderr at runtime. ⚠️ This is hilariously slow, but really helps pinpointing
allocations. Use script/memory-log-to-profile.py to convert log files to a report like this:

Size/kB   Count
4628      88274     root
4628      88274       start
4627      88269         main
4487      86748           arb::simulation::simulation(arb::recipe const&, std::__1::shared_ptr<arb::execution_context>, arb::domain_decomposition const&, unsigned long long)
4487      86748             arb::simulation::simulation(arb::recipe const&, std::__1::shared_ptr<arb::execution_context>, arb::domain_decomposition const&, unsigned long long)
4487      86747               arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)
4487      86747                 arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)
3106      69279                   void arb::simulation_state::foreach_group_index<arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0>(arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0&&)
3102      69276                     arb::simulation_state::foreach_group_index<arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0>(arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0&&)::{lambda(int)#1}::operator()(int) const
3102      69276                       arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0::operator()(std::__1::unique_ptr<arb::cell_group, std::__1::default_delete<arb::cell_group> >&, int) const
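
For reference, a logging replacement of the global operator new can look roughly like this; a
sketch under an assumed compile-time switch ARB_LOG_ALLOC, and the real version also records the
call stack of the allocation site (e.g. via backtrace(3)):

  #include <cstdio>
  #include <cstdlib>
  #include <new>

  #ifdef ARB_LOG_ALLOC
  void* operator new(std::size_t size) {
      void* ptr = std::malloc(size);
      if (!ptr) throw std::bad_alloc{};
      // Log size and address to stderr; hilariously slow, as noted above.
      std::fprintf(stderr, "alloc %zu B @ %p\n", size, ptr);
      return ptr;
  }

  void operator delete(void* ptr) noexcept { std::free(ptr); }
  #endif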

Linked

Fixes #1969: Running $2.1\cdot 10^9$ simple cells blows the memory.

@thorstenhater thorstenhater marked this pull request as ready for review January 19, 2023 11:06
@AdhocMan (Contributor) left a comment

Just some minor issues / comments.

Review threads: arbor/communication/communicator.cpp (outdated), arbor/label_resolution.hpp (outdated), test/unit-distributed/test_communicator.cpp
@AdhocMan previously approved these changes Jan 27, 2023

Looks good to me. I'll leave merging up to you, in case you'd like feedback from others on this change.

@boeschf (Contributor) commented Jan 31, 2023

Do we still need the struct arb::connectivity?

@thorstenhater (Author)

No, and it should no longer be here.

Comment on lines +369 to +370
res.max_rss = (n.max_rss == -1) ? "-" : std::to_string(n.max_rss);
res.fst_rss = (n.fst_rss == -1) ? "-" : std::to_string(n.fst_rss);
Reviewer (Contributor): Out of curiosity, why are you not using snprintf here?

@thorstenhater (Author): No particular reason, that was just the way it was done here before.

Comment on lines 148 to 149
if (!sources.count(sgid)) sources.emplace(sgid, label_map{sgid, rec});
auto src_lid = sources.at(sgid).resolve(c.source);
Reviewer (Contributor) suggested:

auto src_lid = sources.try_emplace(sgid, sgid, rec).first->second.resolve(c.source);

@thorstenhater (Author): So, if I do this, will the value be constructed? If so, it runs precisely counter to what I want.

@thorstenhater changed the title from "Remove a global data-structure" to "Optimise Memory Consumption during Setup" on Jun 17, 2023
@brenthuisman (Contributor) commented Jul 4, 2023

Looks like this'll be essential for large simulations, the use case we promote Arbor for. Is it possible to provide either strategy as a user option? Give me fast init, or give me low memory usage? Or even auto-activate the low-memory path when ncells > threshold (e.g. 100M)?

If not that, maybe a troubleshooting checklist for when people run out of memory or hit other failure modes.

thorstenhater added a commit that referenced this pull request Nov 29, 2023
Use the simple but well-known FNV-1a hash function to map
`cell_tag_type` (aka `std::string`) to a `uint64_t` for label resolution.
The former type has a size of 32 B or more, the latter 8 B, thus cutting
the storage and bandwidth requirements by 3/4.

The hash function is implemented from the reference given on the
authors' website/Wikipedia and is extremely simple. If we ever experience
issues, we might consider switching to something of higher quality from an
external library; candidates are `xxHash` and `Murmur3`.

https://github.com/Cyan4973/xxHash

Note: This should further relieve the memory pressure on larger-scale
simulations as described in #1969 and make #2005 less urgent.

There is no performance impact (at laptop scale), but the memory savings
are worth it.
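
For reference, the 64-bit FNV-1a recurrence is simple enough to restate here (a sketch from the
published parameters, not necessarily the exact code merged):

  #include <cstdint>
  #include <string_view>

  // 64-bit FNV-1a with the reference parameters:
  // offset basis 14695981039346656037, prime 1099511628211.
  constexpr std::uint64_t fnv1a(std::string_view s) {
      std::uint64_t h = 14695981039346656037ull;
      for (unsigned char c: s) {
          h ^= c;
          h *= 1099511628211ull;
      }
      return h;
  }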
thorstenhater added a commit that referenced this pull request Feb 27, 2024
Spin-off from #2005. Make the primary load balancing cleaner, faster,
and more maintainable.
Thus:
- remove all MPI calls; this is now purely local
- remove temporary data structures and/or corral them into their own
little scopes
- simplify `super_cells` vs `regular_cells`
- sort less
- sparse connection tables

Partially inspired by external feedback
@thorstenhater (Author)

This has been splintered into smaller PRs, and those splinters have been merged.
