
Implement load-balancing for MD #3662

Open · wants to merge 39 commits into base: python
Conversation

hirschsn (Contributor):
This PR implements a generic version of the "domain decomposition" cell system/topology that allows for load-balanced grids and repartitioning.

The load balancing itself is implemented in an external library (librepa), which this PR adds as an optional dependency of ESPResSo. Additionally, a shared-library module called "GenericDD" is compiled, which implements the new cell system; the ESPResSo core depends on it. If librepa is not present, the module is compiled to stubs that raise an error. The Python interface for cell_system is extended with a "set_generic_dd" method, analogous to the other cell systems; the interface functionality for generic_dd lives in a separate Python file, generic_dd. The test suite is extended to also run several smaller tests (collision_detection, pairs, random_pairs) with generic_dd, plus a new test that checks that the new cell system, with its different grid types and repartitionings, yields the same energy in a simple NVE setting as ESPResSo's default "domain decomposition" cell system.

Example:
With these changes, it is possible to do:

s = espressomd.system.System(box_l=...)
# Set up the system ...
dd = s.cell_structure.set_generic_dd("kd_tree", use_verlet_lists=True)
# "kd_tree" is one of the grids that librepa offers. Note that
# "set_generic_dd" returns an object that conveniently allows you to
# repartition.

load_metric = dd.metric("npart")
while not done:
    s.integrator.run(1000)
    # Repartition if the maximum number of particles on any process
    # divided by the average is greater than 1.1.
    if load_metric.pimbalance() > 1.1:
        dd.repart(load_metric)
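For illustration, the imbalance criterion used above (the maximum particle count on any process divided by the average count per process) can be sketched in plain Python. This is a hypothetical re-implementation, not the PR's actual code; in a real run the per-rank counts would come from an MPI allgather rather than a hard-coded list:

```python
# Hypothetical sketch of the "pimbalance" criterion: the maximum
# particle count on any rank divided by the average count per rank.
# In an MPI run, `counts` would be gathered from all ranks.

def particle_imbalance(counts):
    """Return max(load) / mean(load) for a list of per-rank loads."""
    return max(counts) / (sum(counts) / len(counts))

counts = [120, 80, 100, 100]        # particle counts of four ranks
imbalance = particle_imbalance(counts)
print(imbalance)                    # 120 / 100 = 1.2 > 1.1 -> repartition
```

With a perfectly balanced load (all counts equal), the metric is exactly 1.0, so a threshold slightly above 1.0 (such as the 1.1 in the example) controls how much imbalance is tolerated before repartitioning.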

Limitations:

  • Only MD; no coupling that requires ESPResSo's default decompositions is possible (this may currently be a hard failure that is not caught in the code)
  • Currently only fully periodic simulation boxes supported
  • ... probably more ...

Description of changes:

  • Implement a new cell system "generic_dd"
  • Change cells.[ch]pp to properly dispatch to this cell system
  • Add python and script interfaces for generic_dd
  • Add generic_dd cell system to several existing tests

Missing:

  • Documentation in users guide about usage of generic_dd

Suggestions and feedback welcome.

fweik (Contributor) commented Apr 14, 2020:

This looks very good. I'll have a look at how to deal with the limitations and review the rest, but I will only have time to give it a proper look next week, so bear with me.

hirschsn (Contributor, Author):

@fweik Sure, take your time.

fweik mentioned this pull request on Apr 20, 2020.
hirschsn (Contributor, Author):

Note to self: the wait_any fix needs some work. Newer boost::mpi versions handle nonblocking communication differently, and thus waitany.hpp does not compile with newer Boost versions.

fweik (Contributor) commented Apr 28, 2020:

@hirschsn I'm still looking into this. But there are other changes to the cell systems which improve encapsulation and which will need to be merged before this. Will keep you posted...

fweik (Contributor) commented May 6, 2020:

@hirschsn I had a first look, and I think there is one point in the design that we should reconsider: it may be better to trigger the repartitioning via the resort. This has the advantage that the resort is called regularly during the simulation (e.g. when a particle has moved a certain distance). Your DD could then decide internally what to do, e.g. decide based on the metric every 100 invocations, or do nothing (manual repart only) and force a repart on a global resort (those typically occur only when there are new particles or other major changes). Resort can be triggered directly from the interface, and this is basically what the AtomDecomposition does. What do you think?
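The resort-triggered scheme described here could look roughly like the following sketch. All names (`GenericDD`, `REPART_INTERVAL`, `THRESHOLD`, the `resort` signature) are illustrative assumptions for the sake of the example, not ESPResSo or librepa API:

```python
# Hypothetical sketch of repartitioning driven by the resort callback:
# the decomposition decides internally, checking the load metric only
# every REPART_INTERVAL-th local resort, and always repartitioning on
# a global resort (new particles or other major changes).

REPART_INTERVAL = 100
THRESHOLD = 1.1

class GenericDD:
    def __init__(self, metric):
        self.metric = metric      # callable returning the current imbalance
        self.n_resorts = 0
        self.reparts = 0          # counts how often we actually repartition

    def repart(self):
        self.reparts += 1         # real code would rebuild the grid here

    def resort(self, global_resort=False):
        self.n_resorts += 1
        if global_resort:
            self.repart()         # always repartition on a global resort
        elif self.n_resorts % REPART_INTERVAL == 0:
            # Check the metric only every REPART_INTERVAL-th invocation.
            if self.metric() > THRESHOLD:
                self.repart()

dd = GenericDD(metric=lambda: 1.2)
for _ in range(200):
    dd.resort()
print(dd.reparts)   # metric checked at resorts 100 and 200 -> 2 reparts
```

The point of the design is that the integrator only ever calls `resort()`; whether and when a repartition happens is a private policy of the decomposition, while a manual `repart()` remains available from the interface.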

hirschsn (Contributor, Author) commented May 6, 2020:

The idea behind triggering it manually is that I (read: anyone :D) can test different strategies with this interface, and, in fact, implement them in Python in the simulation script. This might not be what mere users of load balancing want, I agree.

At some point in the near future I also wanted to offer automatic capabilities, which is exactly what you are describing. Different automatic strategies could be implemented locally in generic_dd or elsewhere. The hook into resort, however, is worth considering right now.

Do you see any problems with also offering manual repart capabilities in addition to, let's say, something like this (conceptually):

system.cell_system.set_generic_dd(..., auto_loadbalancing="npart");

Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does.

Could you elaborate? I don't get what you want to tell me. :)

codecov bot commented May 6, 2020:

Codecov Report

Merging #3662 into python will decrease coverage by 0%.
The diff coverage is 31%.


@@           Coverage Diff           @@
##           python   #3662    +/-   ##
=======================================
- Coverage      88%     87%    -1%     
=======================================
  Files         524     532     +8     
  Lines       23471   23782   +311     
=======================================
+ Hits        20658   20742    +84     
- Misses       2813    3040   +227     
Impacted Files Coverage Δ
src/core/CellStructure.hpp 100% <ø> (ø)
src/core/communication.cpp 91% <0%> (-4%) ⬇️
src/core/generic-dd/metric.cpp 0% <0%> (ø)
src/core/generic-dd/metric.hpp 0% <0%> (ø)
src/core/ghosts.hpp 100% <ø> (ø)
src/script_interface/generic_dd/si_generic_dd.hpp 0% <0%> (ø)
src/script_interface/generic_dd/si_metric.hpp 0% <0%> (ø)
src/core/generic-dd/generic_dd.cpp 7% <7%> (ø)
src/core/ghosts.cpp 82% <12%> (-18%) ⬇️
src/core/cells.cpp 82% <25%> (-6%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8f105e3...a088aae. Read the comment docs.

fweik (Contributor) commented May 6, 2020:

Resort can be directly triggered from the interface, and this is basically what the AtomDecomposition does.

I just wanted to say that you'd still have the possibility to call it manually, but I guess you can also do that directly via the Python binding of generic_dd.

Do you see any problems with also offering manual repart capabilities

No I think that's fine.

system.cell_system.set_generic_dd(..., auto_loadbalancing="npart");

is about what I had in mind.

As you are saying, this can probably also be addressed later. The test failures are due to the wait_any issue you described earlier, I suppose?

hirschsn (Contributor, Author) commented May 6, 2020:

Test failures: yes, I will take care of wait_any today. Also, my [[noreturn]] failed on older compilers because errexit was not [[noreturn]]. However, the osx tests do not seem to like making errexit [[noreturn]].

I am currently looking into the failing test cases and will ping you, once I'm done.

jngrad (Member) commented May 6, 2020:

Also, my [[noreturn]] failed on older compilers because errexit was not [[noreturn]]. However, the osx tests do not seem to like making errexit [[noreturn]].

Just a quick note: the Clang 6 jobs were recently removed in favor of Clang 9, and the osx-cuda job was removed. For AppleClang 9 on osx, I'm not sure why there's an error; it should support attributes.

mkuron (Member) commented May 6, 2020:

AppleClang 9

That is somewhere between Clang 6 and Clang 7 if I remember correctly. AppleClang's version numbers match the Xcode major version number, not the Clang major version number.

However, even Clang 6 should have supported [[noreturn]], which was introduced in C++11.

hirschsn (Contributor, Author) commented May 7, 2020:

@jngrad @mkuron You're right. This was actually a linker error. Noreturn works fine.

src/core/CMakeLists.txt: review thread (outdated, resolved)