Scaling example #894

Closed
fritzgoebel wants to merge 3 commits

Conversation

fritzgoebel (Collaborator)

This PR adds an example enabling users to perform simple strong and weak scaling studies for the distributed SpMV.
The example can be run with 2D (5-pt or 9-pt) and 3D (7-pt or 27-pt) stencil matrices. All matrix values are 1 for now, as the result values don't really matter for this example.
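For illustration, the kind of stencil assembly meant here could look roughly like the following sketch (the function name generate_5pt_stencil_sketch and the index types are placeholders, not the PR's actual code):

#include <ginkgo/ginkgo.hpp>

// Hypothetical sketch, not the code from this PR: assemble a 5-pt stencil on
// an n x n grid with every stored value set to 1, since only the sparsity
// pattern matters for the scaling study.
gko::matrix_data<double, gko::int64> generate_5pt_stencil_sketch(gko::int64 n)
{
    gko::matrix_data<double, gko::int64> data;
    data.size = gko::dim<2>{static_cast<gko::size_type>(n * n),
                            static_cast<gko::size_type>(n * n)};
    const int dx[] = {0, 1, -1, 0, 0};
    const int dy[] = {0, 0, 0, 1, -1};
    for (gko::int64 row = 0; row < n * n; ++row) {
        const auto ix = row % n;
        const auto iy = row / n;
        for (int k = 0; k < 5; ++k) {
            const auto jx = ix + dx[k];
            const auto jy = iy + dy[k];
            if (jx >= 0 && jx < n && jy >= 0 && jy < n) {
                // every stored value is 1
                data.nonzeros.emplace_back(row, jy * n + jx, 1.0);
            }
        }
    }
    return data;
}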

@fritzgoebel added the reg:example and mod:mpi labels on Sep 30, 2021
@fritzgoebel requested a review from a team on September 30, 2021 11:18
@fritzgoebel self-assigned this on Sep 30, 2021
@ginkgo-bot added the mod:hip, reg:build, and type:matrix-format labels on Sep 30, 2021
@fritzgoebel (Collaborator, Author)

format!

@tcojean (Member) commented Sep 30, 2021

That's a really useful addition!

I have a few quick comments/questions, directed not only at you but also at people who know more about the MPI backend:

  • This only works on a single node, right, since you directly pass the rank to the executor?
    • With MPI, isn't there a mode where you use CUDA_VISIBLE_DEVICES or similar variables so that each MPI process only sees one device, which would always be number 0 but different for every MPI process?
    • Should OpenMP be disabled (until we can bind threads), or do we need to mention somewhere that the user needs to take care of that through MPI/OpenMP variables, or only put one rank per machine?
  • What safety additions are required so that exceptions don't get thrown all the time? Device allocation and probably some MPI interaction calls would benefit from try/catch, or is that managed at a lower level?

@pratikvn (Member)

@fritzgoebel, I think it might make sense to add this to an updated version of the mpi-base-dist-mat branch, which also runs CI for MPI?

@tcojean,

  1. Yes, I think you can have a local_rank getter, which should be sufficient for a 1 MPI rank to 1 GPU association (see the sketch after this list). See for example: https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/examples/distributed-solver/distributed-solver.cpp#L95

  2. Disabling OpenMP is a good idea and I think we should do that until we fix/update the thread binding.

  3. I guess you mean each rank throwing an exception? MPI calls are wrapped in GKO_ASSERT_NO_MPI_ERRORS, so those should be captured and thrown properly.
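A minimal sketch of the pattern referenced in point 1, assuming a communicator that exposes a local_rank() getter (names are approximate; see the linked distributed-solver example for the actual API on that branch):

// Sketch only; the communicator creation and the local_rank() getter are
// assumed here, not copied from the branch.
const auto comm = gko::mpi::communicator::create_world();
const auto local_rank = comm->local_rank();
// Each rank picks the device matching its node-local rank, so one MPI rank
// maps to one GPU even when several ranks share a node.
const auto exec = gko::CudaExecutor::create(
    local_rank, gko::ReferenceExecutor::create(), true);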

@tcojean (Member) commented Sep 30, 2021

  1. Yes, I think you can have a local_rank getter, which should be sufficient for a 1 MPI rank to 1 GPU association. See for example: https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/examples/distributed-solver/distributed-solver.cpp#L95

Neat, using this would fix the issue in this example, then.

  2. Disabling OpenMP is a good idea and I think we should do that until we fix/update the thread binding.

It's maybe also fine as long as people are told to only use one process per machine, but it's indeed not going to be as powerful as people might expect (only that mode should be able to work for now, I think).

  3. I guess you mean each rank throwing an exception? MPI calls are wrapped in GKO_ASSERT_NO_MPI_ERRORS, so those should be captured and thrown properly.

Yes, they are thrown, but that means most Ginkgo API calls need to be surrounded by the user with try/catch blocks so that the exceptions are caught properly by the main function instead of being thrown into the wild? https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/include/ginkgo/core/base/exception_helpers.hpp#L335

We have the same issue with memory allocations, for example.

Quick mention: it looks like the checks are missing in some places: https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/mpi/base/bindings.hpp#L87
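A minimal sketch of the try/catch wrapping discussed above, assuming the example's gko::mpi::init_finalize setup (the body is a placeholder):

#include <ginkgo/ginkgo.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    const auto fin = gko::mpi::init_finalize(argc, argv);
    try {
        // set up executor, stencil matrix and vectors, run the SpMV ...
    } catch (const gko::Error& err) {
        // Catch Ginkgo exceptions (MPI errors, allocation failures, ...) so
        // they don't escape main and kill the rank without a message.
        std::cerr << "Rank failed: " << err.what() << std::endl;
        return 1;
    }
    return 0;
}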

@fritzgoebel changed the base branch from distributed-matrix to mpi-base-dist-mat on September 30, 2021 13:25
@fritzgoebel (Collaborator, Author)

format!

@Slaedr (Contributor) left a comment

Looks good. I have a question about the device IDs and ranks though.

{"cuda",
 [local_rank] {
     return gko::CudaExecutor::create(
         local_rank, gko::ReferenceExecutor::create(), true);
@Slaedr (Contributor) commented Sep 30, 2021

Continuing along the lines of @tcojean's questions: right now this might not work when there's only device 0 on each process, right? Even if the device ID is global and independent of which process is assigned what resource by the scheduler, there's no guarantee that device ID 3 would be accessible to rank 3, for example.

Would it not be better to either (1) always assume device 0, assuming the user maps one GPU to one process in the job scheduler (--gpus-per-task=1 on Slurm, I think), or (2) use hwloc or something to query the number of GPUs available to this task and then decide the device_id? Option 2 becomes useful only when we want to support multiple GPUs per process, so maybe we ignore that for now.

@tcojean (Member) commented Sep 30, 2021

I'm not sure what is best to take care of that right now, but I think that's a fairly important and fairly annoying point. AFAIK, both situations can happen: either only one device (number 0) is available to each MPI process (as you say, via some job scheduler option), or each process sees the full machine and the GPUs thus need to be split up like now.

We should probably try to make it work in both situations? But what would be a convenient way? Should we have a simple function somewhere doing min(available_devices, local_rank)?
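Something like the following placeholder helper could cover both situations, assuming CudaExecutor::get_num_devices() is used to query the visible devices (the clamp keeps the id in range when a rank sees fewer devices than its local rank):

#include <algorithm>
#include <ginkgo/ginkgo.hpp>

// Placeholder along the lines of min(available_devices, local_rank):
// if every rank sees all GPUs, ranks spread over the devices; if each rank
// only sees device 0, the clamp always returns 0.
int choose_device_id(int local_rank)
{
    const auto num_devices = gko::CudaExecutor::get_num_devices();
    return std::min(local_rank, num_devices > 0 ? num_devices - 1 : 0);
}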

@Slaedr (Contributor) commented Oct 1, 2021

According to this, on Summit the default setting is one GPU per process. On Spock, the default setting is that all processes see all GPUs. I guess we need to support both, or strongly document that users need to set the correct scheduler settings.

A project member commented:

I think that is regarding assignment rather than what the process can view. With Summit, AFAIK, it is a bit different because of resource sets: I think you always need to specify how many GPUs you need, and the numbering depends on how many resource sets you ask for. By default, I think it is set to 1, so the GPUs will be numbered from 0 to 5.

}
ranges_array.get_data()[comm->size()] = mat_size;
auto partition = gko::share(
    part_type::build_from_contiguous(exec->get_master(), ranges_array));
A contributor commented:

We should probably make it explicit in the naming that this is a row partition. Either gko::distributed::RowPartition for the class name, or part_type::build_from_contiguous_rows, something like that.

@greole (Collaborator) left a comment

Here are some very minor typos I found.

/**
* Generates matrix data for a 2D stencil matrix. If restricted is set to true,
* creates a 5-pt stencil, if it is false creates a 9-pt stencil. If
* strong_scaling is set to true, creates the same problemsize independent of

Suggested change
- * strong_scaling is set to true, creates the same problemsize independent of
+ * strong_scaling is set to true, creates the same problem size independent of

* Generates matrix data for a 2D stencil matrix. If restricted is set to true,
* creates a 5-pt stencil, if it is false creates a 9-pt stencil. If
* strong_scaling is set to true, creates the same problemsize independent of
* the number of ranks, if it false the problem size grows with the number of

Suggested change
- * the number of ranks, if it false the problem size grows with the number of
+ * the number of ranks, if set to false the problem size grows with the number of

/**
* Generates matrix data for a 3D stencil matrix. If restricted is set to true,
* creates a 7-pt stencil, if it is false creates a 27-pt stencil. If
* strong_scaling is set to true, creates the same problemsize independent of

Suggested change
- * strong_scaling is set to true, creates the same problemsize independent of
+ * strong_scaling is set to true, creates the same problem size independent of

* Generates matrix data for a 3D stencil matrix. If restricted is set to true,
* creates a 7-pt stencil, if it is false creates a 27-pt stencil. If
* strong_scaling is set to true, creates the same problemsize independent of
* the number of ranks, if it false the problem size grows with the number of

Suggested change
- * the number of ranks, if it false the problem size grows with the number of
+ * the number of ranks, if set to false the problem size grows with the number of

int main(int argc, char* argv[])
{
    const auto fin = gko::mpi::init_finalize(argc, argv);
    // Use some shortcuts. In Ginkgo, vectors are seen as a gko::matrix::Dense

This comment, starting at "In Ginkgo ...", is a bit convoluted; maybe leave it out altogether?

@MarcelKoch (Member)

Superseded by #1204

@MarcelKoch closed this Jul 7, 2023