Scaling example #894

Closed
fritzgoebel wants to merge 3 commits

Conversation

fritzgoebel (Collaborator)

This PR adds an example enabling users to perform simple strong and weak scaling studies for the distributed SpMV.
The example can be run with 2D (5-pt or 9-pt) and 3D (7-pt or 27-pt) stencil matrices. All matrix values are 1 for now, as the result values don't really matter for this example.
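For illustration, the kind of stencil assembly meant here could look roughly like the following sketch (the function name generate_5pt_stencil_sketch and the index types are placeholders, not the PR's actual code):

#include <ginkgo/ginkgo.hpp>

// Hypothetical sketch, not the code from this PR: assemble a 5-pt stencil on
// an n x n grid with every stored value set to 1, since only the sparsity
// pattern matters for the scaling study.
gko::matrix_data<double, gko::int64> generate_5pt_stencil_sketch(gko::int64 n)
{
    gko::matrix_data<double, gko::int64> data;
    data.size = gko::dim<2>{static_cast<gko::size_type>(n * n),
                            static_cast<gko::size_type>(n * n)};
    const int dx[] = {0, 1, -1, 0, 0};
    const int dy[] = {0, 0, 0, 1, -1};
    for (gko::int64 row = 0; row < n * n; ++row) {
        const auto ix = row % n;
        const auto iy = row / n;
        for (int k = 0; k < 5; ++k) {
            const auto jx = ix + dx[k];
            const auto jy = iy + dy[k];
            if (jx >= 0 && jx < n && jy >= 0 && jy < n) {
                // every stored value is 1
                data.nonzeros.emplace_back(row, jy * n + jx, 1.0);
            }
        }
    }
    return data;
}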

@fritzgoebel added the reg:example and mod:mpi labels on Sep 30, 2021
@fritzgoebel requested a review from a team on September 30, 2021 11:18
@fritzgoebel self-assigned this on Sep 30, 2021
@ginkgo-bot added the mod:hip, reg:build, and type:matrix-format labels on Sep 30, 2021
@fritzgoebel (Collaborator, Author)

format!

@tcojean (Member) commented Sep 30, 2021

That's a really useful addition!

I have a few quick comments/questions, directed not only at you but also at people who know more about the MPI backend:

  • This only works on a single node, right, since you directly pass the rank to the executor?
    • With MPI, isn't there a mode where you use CUDA_VISIBLE_DEVICES or similar variables so that each MPI process only sees one device, which would always be number 0 but different for every MPI process?
    • Should OpenMP be disabled (until we can bind threads), or do we need to mention somewhere that the user needs to take care of that through MPI/OpenMP variables, or only put one rank per machine?
  • What safety additions are required so that exceptions don't get thrown all the time? Device allocation and probably some MPI interaction calls would benefit from try/catch, or is that managed at a lower level?

@pratikvn (Member)

@fritzgoebel, I think it might make sense to add this to an updated version of the mpi-base-dist-mat branch, which also runs CI for MPI?

@tcojean,

  1. Yes, I think you can have a local_rank getter, which should be sufficient for a 1 MPI rank to 1 GPU association (see the sketch after this list). See for example: https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/examples/distributed-solver/distributed-solver.cpp#L95

  2. Disabling OpenMP is a good idea and I think we should do that until we fix/update the thread binding.

  3. I guess you mean each rank throwing an exception? MPI calls are wrapped in GKO_ASSERT_NO_MPI_ERRORS, so those should be captured and thrown properly.
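A minimal sketch of the pattern referenced in point 1, assuming a communicator that exposes a local_rank() getter (names are approximate; see the linked distributed-solver example for the actual API on that branch):

// Sketch only; the communicator creation and the local_rank() getter are
// assumed here, not copied from the branch.
const auto comm = gko::mpi::communicator::create_world();
const auto local_rank = comm->local_rank();
// Each rank picks the device matching its node-local rank, so one MPI rank
// maps to one GPU even when several ranks share a node.
const auto exec = gko::CudaExecutor::create(
    local_rank, gko::ReferenceExecutor::create(), true);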

@tcojean (Member) commented Sep 30, 2021

  1. Yes, I think you can have a local_rank getter, which should be sufficient for a 1 MPI rank to 1 GPU association. See for example: https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/examples/distributed-solver/distributed-solver.cpp#L95

Neat, using this would fix the issue in this example, then.

  2. Disabling OpenMP is a good idea and I think we should do that until we fix/update the thread binding.

It's maybe also fine as long as people are told to only use one process per machine, but it's indeed not going to be as powerful as people might expect (only that mode should be able to work for now, I think).

  3. I guess you mean each rank throwing an exception? MPI calls are wrapped in GKO_ASSERT_NO_MPI_ERRORS, so those should be captured and thrown properly.

Yes, they are thrown, but that means most Ginkgo API calls need to be surrounded by the user with try/catch blocks so that the exceptions are caught properly by the main function instead of being thrown into the wild? https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/include/ginkgo/core/base/exception_helpers.hpp#L335

We have the same issue with memory allocations, for example.

Quick mention: it looks like the checks are missing in some places: https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/mpi/base/bindings.hpp#L87
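A minimal sketch of the try/catch wrapping discussed above, assuming the example's gko::mpi::init_finalize setup (the body is a placeholder):

#include <ginkgo/ginkgo.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    const auto fin = gko::mpi::init_finalize(argc, argv);
    try {
        // set up executor, stencil matrix and vectors, run the SpMV ...
    } catch (const gko::Error& err) {
        // Catch Ginkgo exceptions (MPI errors, allocation failures, ...) so
        // they don't escape main and kill the rank without a message.
        std::cerr << "Rank failed: " << err.what() << std::endl;
        return 1;
    }
    return 0;
}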

@fritzgoebel changed the base branch from distributed-matrix to mpi-base-dist-mat on September 30, 2021 13:25
@fritzgoebel (Collaborator, Author)

format!

@Slaedr (Contributor) left a comment

Looks good. I have a question about the device IDs and ranks though.

{"cuda",
 [local_rank] {
     return gko::CudaExecutor::create(
         local_rank, gko::ReferenceExecutor::create(), true);
@Slaedr (Contributor) commented Sep 30, 2021

Continuing along the lines of @tcojean's questions: right now this might not work when there's only device 0 on each process, right? Even if the device ID is global and independent of which process is assigned what resource by the scheduler, there's no guarantee that device ID 3 would be accessible to rank 3, for example.

Would it not be better to either (1) always assume device 0, assuming the user maps one GPU to one process in the job scheduler (--gpus-per-task=1 on Slurm, I think), or (2) use hwloc or something to query the number of GPUs available to this task and then decide the device_id? Option 2 becomes useful only when we want to support multiple GPUs per process, so maybe we ignore that for now.

@tcojean (Member) commented Sep 30, 2021

I'm not sure what is best to take care of that right now, but I think that's a fairly important and fairly annoying point. AFAIK, both situations can happen: either only one device (number 0) is available to each MPI process (as you say, via some job scheduler option), or each process sees the full machine and the GPUs thus need to be split up like now.

We should probably try to make it work in both situations? But what would be a convenient way? Should we have a simple function somewhere doing min(available_devices, local_rank)?
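Something like the following placeholder helper could cover both situations, assuming CudaExecutor::get_num_devices() is used to query the visible devices (the clamp keeps the id in range when a rank sees fewer devices than its local rank):

#include <algorithm>
#include <ginkgo/ginkgo.hpp>

// Placeholder along the lines of min(available_devices, local_rank):
// if every rank sees all GPUs, ranks spread over the devices; if each rank
// only sees device 0, the clamp always returns 0.
int choose_device_id(int local_rank)
{
    const auto num_devices = gko::CudaExecutor::get_num_devices();
    return std::min(local_rank, num_devices > 0 ? num_devices - 1 : 0);
}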

@Slaedr (Contributor) commented Oct 1, 2021

According to this, on Summit the default setting is one GPU per process. On Spock, the default setting is that all processes see all GPUs. I guess we need to support both, or strongly document that users need to set the correct scheduler settings.

A project member commented:

I think that is regarding assignment rather than what the process can view. With Summit, AFAIK, it is a bit different because of resource sets: I think you always need to specify how many GPUs you need, and the numbering depends on how many resource sets you ask for. By default, I think it is set to 1, so the GPUs will be numbered from 0 to 5.

}
ranges_array.get_data()[comm->size()] = mat_size;
auto partition = gko::share(
    part_type::build_from_contiguous(exec->get_master(), ranges_array));
A contributor commented:

We should probably make it explicit in the naming that this is a row partition. Either gko::distributed::RowPartition for the class name, or part_type::build_from_contiguous_rows, something like that.

@greole (Collaborator) left a comment

Here are some very minor typos I found.

/**
* Generates matrix data for a 2D stencil matrix. If restricted is set to true,
* creates a 5-pt stencil, if it is false creates a 9-pt stencil. If
* strong_scaling is set to true, creates the same problemsize independent of

Suggested change
- * strong_scaling is set to true, creates the same problemsize independent of
+ * strong_scaling is set to true, creates the same problem size independent of

* Generates matrix data for a 2D stencil matrix. If restricted is set to true,
* creates a 5-pt stencil, if it is false creates a 9-pt stencil. If
* strong_scaling is set to true, creates the same problemsize independent of
* the number of ranks, if it false the problem size grows with the number of

Suggested change
- * the number of ranks, if it false the problem size grows with the number of
+ * the number of ranks, if set to false the problem size grows with the number of

/**
* Generates matrix data for a 3D stencil matrix. If restricted is set to true,
* creates a 7-pt stencil, if it is false creates a 27-pt stencil. If
* strong_scaling is set to true, creates the same problemsize independent of

Suggested change
- * strong_scaling is set to true, creates the same problemsize independent of
+ * strong_scaling is set to true, creates the same problem size independent of

* Generates matrix data for a 3D stencil matrix. If restricted is set to true,
* creates a 7-pt stencil, if it is false creates a 27-pt stencil. If
* strong_scaling is set to true, creates the same problemsize independent of
* the number of ranks, if it false the problem size grows with the number of

Suggested change
- * the number of ranks, if it false the problem size grows with the number of
+ * the number of ranks, if set to false the problem size grows with the number of

int main(int argc, char* argv[])
{
    const auto fin = gko::mpi::init_finalize(argc, argv);
    // Use some shortcuts. In Ginkgo, vectors are seen as a gko::matrix::Dense

This comment, starting at "In Ginkgo ...", is a bit convoluted; maybe leave it out altogether?

@MarcelKoch (Member)

Superseded by #1204

@MarcelKoch closed this Jul 7, 2023