
Implement load-balancing for MD #3662

Open · wants to merge 39 commits into base: python
Conversation

hirschsn (Contributor):
This PR implements a generic version of the "domain decomposition" cell system/topology that allows for load-balanced grids and repartitioning.

The load balancing itself is implemented in an external library (librepa), which this PR adds as an optional dependency of ESPResSo. Additionally, a shared-library module called "GenericDD" is compiled, which implements the new cell system; the ESPResSo core depends on it. If librepa is not present, the module is compiled to stubs that raise an error. The Python interface for cell_system is extended with a "set_generic_dd" method, analogous to the other cell systems; the interface functionality for generic_dd lives in a separate Python file, generic_dd. The test suite is extended to also run several smaller tests (collision_detection, pairs, random_pairs) with generic_dd, plus a new test that checks that the new cell system, with its different grid types and repartitionings, yields the same energy in a simple NVE setting as ESPResSo's default "domain decomposition" cell system.

Example:
With these changes, it is possible to do:

s = espressomd.system.System(box_l=...)
# Set up the system ...
dd = s.cell_structure.set_generic_dd("kd_tree", use_verlet_lists=True)
# "kd_tree" is one of the grids that librepa offers. Note that
# "set_generic_dd" returns an object that conveniently allows you to
# repartition.

load_metric = dd.metric("npart")
while not done:
    s.integrator.run(1000)
    # Repartition if the maximum number of particles on any process
    # divided by the average is greater than 1.1.
    if load_metric.pimbalance() > 1.1:
        dd.repart(load_metric)
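For illustration, the imbalance criterion used above (the maximum particle count on any process divided by the average count per process) can be sketched in plain Python. This is a hypothetical re-implementation, not the PR's actual code; in a real run the per-rank counts would come from an MPI allgather rather than a hard-coded list:

```python
# Hypothetical sketch of the "pimbalance" criterion: the maximum
# particle count on any rank divided by the average count per rank.
# In an MPI run, `counts` would be gathered from all ranks.

def particle_imbalance(counts):
    """Return max(load) / mean(load) for a list of per-rank loads."""
    return max(counts) / (sum(counts) / len(counts))

counts = [120, 80, 100, 100]        # particle counts of four ranks
imbalance = particle_imbalance(counts)
print(imbalance)                    # 120 / 100 = 1.2 > 1.1 -> repartition
```

With a perfectly balanced load (all counts equal), the metric is exactly 1.0, so a threshold slightly above 1.0 (such as the 1.1 in the example) controls how much imbalance is tolerated before repartitioning.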

Limitations:

  • Only MD; no coupling that requires ESPResSo's default decompositions is possible (this may currently be a hard failure that is not caught in the code)
  • Currently only fully periodic simulation boxes supported
  • ... probably more ...

Description of changes:

  • Implement a new cell system "generic_dd"
  • Change cells.[ch]pp to properly dispatch to this cell system
  • Add python and script interfaces for generic_dd
  • Add generic_dd cell system to several existing tests

Missing:

  • Documentation in users guide about usage of generic_dd

Suggestions and feedback welcome.

fweik (Contributor) commented Apr 14, 2020:

This looks very good. I'll have a look at how to deal with the limitations and review the rest, but I will only have time to give it a proper look next week, so bear with me.

hirschsn (Contributor, Author):

@fweik Sure, take your time.

fweik mentioned this pull request on Apr 20, 2020.
hirschsn (Contributor, Author):

Note to self: the wait_any fix needs some work. Newer boost::mpi versions handle nonblocking communication differently, and thus waitany.hpp does not compile with newer Boost versions.

fweik (Contributor) commented Apr 28, 2020:

@hirschsn I'm still looking into this. But there are other changes to the cell systems which improve encapsulation and which will need to be merged before this. Will keep you posted...

fweik (Contributor) commented May 6, 2020:

@hirschsn I had a first look, and I think there is one point in the design that we should reconsider: it may be better to trigger the repartitioning via the resort. This has the advantage that the resort is called regularly during the simulation (e.g. when a particle has moved a certain distance). Your DD could then decide internally what to do, e.g. decide based on the metric every 100 invocations, or do nothing (manual repart only) and force a repart on a global resort (those typically occur only when there are new particles or other major changes). Resort can be triggered directly from the interface, and this is basically what the AtomDecomposition does. What do you think?
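The resort-triggered scheme described here could look roughly like the following sketch. All names (`GenericDD`, `REPART_INTERVAL`, `THRESHOLD`, the `resort` signature) are illustrative assumptions for the sake of the example, not ESPResSo or librepa API:

```python
# Hypothetical sketch of repartitioning driven by the resort callback:
# the decomposition decides internally, checking the load metric only
# every REPART_INTERVAL-th local resort, and always repartitioning on
# a global resort (new particles or other major changes).

REPART_INTERVAL = 100
THRESHOLD = 1.1

class GenericDD:
    def __init__(self, metric):
        self.metric = metric      # callable returning the current imbalance
        self.n_resorts = 0
        self.reparts = 0          # counts how often we actually repartition

    def repart(self):
        self.reparts += 1         # real code would rebuild the grid here

    def resort(self, global_resort=False):
        self.n_resorts += 1
        if global_resort:
            self.repart()         # always repartition on a global resort
        elif self.n_resorts % REPART_INTERVAL == 0:
            # Check the metric only every REPART_INTERVAL-th invocation.
            if self.metric() > THRESHOLD:
                self.repart()

dd = GenericDD(metric=lambda: 1.2)
for _ in range(200):
    dd.resort()
print(dd.reparts)   # metric checked at resorts 100 and 200 -> 2 reparts
```

The point of the design is that the integrator only ever calls `resort()`; whether and when a repartition happens is a private policy of the decomposition, while a manual `repart()` remains available from the interface.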

hirschsn (Contributor, Author) commented May 6, 2020:

The idea behind triggering it manually is that I (read: anyone :D) can test different strategies with this interface, and, in fact, implement them in Python in the simulation script. This might not be what mere users of load balancing want, I agree.

At some point in the near future I also wanted to offer automatic capabilities, which is exactly what you are describing. Different automatic strategies could be implemented locally in generic_dd or elsewhere. The hook into resort, however, is worth considering right now.

Do you see any problems with also offering manual repart capabilities in addition to, let's say, something like this (conceptually):

system.cell_system.set_generic_dd(..., auto_loadbalancing="npart");

Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does.

Could you elaborate? I don't get what you want to tell me. :)

codecov bot commented May 6, 2020:

Codecov Report

Merging #3662 into python will decrease coverage by 0%.
The diff coverage is 31%.


@@           Coverage Diff           @@
##           python   #3662    +/-   ##
=======================================
- Coverage      88%     87%    -1%     
=======================================
  Files         524     532     +8     
  Lines       23471   23782   +311     
=======================================
+ Hits        20658   20742    +84     
- Misses       2813    3040   +227     
Impacted Files Coverage Δ
src/core/CellStructure.hpp 100% <ø> (ø)
src/core/communication.cpp 91% <0%> (-4%) ⬇️
src/core/generic-dd/metric.cpp 0% <0%> (ø)
src/core/generic-dd/metric.hpp 0% <0%> (ø)
src/core/ghosts.hpp 100% <ø> (ø)
src/script_interface/generic_dd/si_generic_dd.hpp 0% <0%> (ø)
src/script_interface/generic_dd/si_metric.hpp 0% <0%> (ø)
src/core/generic-dd/generic_dd.cpp 7% <7%> (ø)
src/core/ghosts.cpp 82% <12%> (-18%) ⬇️
src/core/cells.cpp 82% <25%> (-6%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f105e3...a088aae. Read the comment docs.

fweik (Contributor) commented May 6, 2020:

Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does.

I just wanted to say that you'd still have the possibility to call it manually, but I guess you can also do that directly via the Python binding of generic_dd.

Do you see any problems with also offering manual repart capabilities

No I think that's fine.

system.cell_system.set_generic_dd(..., auto_loadbalancing="npart");

is about what I had in mind.

As you are saying, this can probably also be addressed later. The test failures are due to the wait_any issue you described earlier, I suppose?

hirschsn (Contributor, Author) commented May 6, 2020:

Test failures: yes, I will take care of wait_any today. Also, my [[noreturn]] failed on older compilers because errexit was not [[noreturn]]. However, the osx tests do not seem to like making errexit [[noreturn]].

I am currently looking into the failing test cases and will ping you, once I'm done.

jngrad (Member) commented May 6, 2020:

Also, my [[noreturn]] failed on older compilers because errexit was not [[noreturn]]. However, the osx tests do not seem to like making errexit [[noreturn]].

Just a quick note: the Clang 6 jobs were recently removed in favor of Clang 9, and the osx-cuda job was removed. For AppleClang 9 on osx, I'm not sure why there's an error; it should support attributes.

mkuron (Member) commented May 6, 2020:

AppleClang 9

That is somewhere between Clang 6 and Clang 7 if I remember correctly. AppleClang's version numbers match the Xcode major version number, not the Clang major version number.

However, even Clang 6 should have supported [[noreturn]], which was introduced in C++11.

hirschsn (Contributor, Author) commented May 7, 2020:

@jngrad @mkuron You're right. This was actually a linker error. Noreturn works fine.

src/core/CMakeLists.txt: review thread (outdated, resolved)