Scaling example #894
Conversation
That's a really useful addition! I have a few quick comments/questions, not only to you but also to people who know more about the MPI backend.
@fritzgoebel, I think it might make sense to add this to an updated version of this branch, @tcojean.
Neat, using this would fix the issue in this example then.
It's maybe also fine as long as people are told to only use one process per machine, but it's indeed not going to be as powerful as people might expect (only that mode should be able to work for now, I think).
Yes, they are thrown, but that means most Ginkgo API calls need to be surrounded by the user in try/catch blocks. We have the same issue with memory allocations, for example. Quick mention: it looks like the checks are missing in some places: https://github.com/ginkgo-project/ginkgo/blob/mpi-base-dist-mat/mpi/base/bindings.hpp#L87
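A minimal sketch of the wrapping this forces on users; the executor and matrix setup here are illustrative placeholders, not this example's actual code:

```cpp
#include <fstream>
#include <iostream>

#include <ginkgo/ginkgo.hpp>

int main()
{
    try {
        // Any Ginkgo call that allocates or touches the device can throw,
        // so user code ends up guarding whole regions like this.
        auto exec = gko::CudaExecutor::create(
            0, gko::ReferenceExecutor::create());
        auto A = gko::read<gko::matrix::Csr<>>(std::ifstream("data/A.mtx"),
                                               exec);
    } catch (const gko::Error& e) {
        std::cerr << "Ginkgo error: " << e.what() << std::endl;
        return 1;
    }
}
```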
Force-pushed from e120fd6 to d94db77.
Looks good. I have a question about the device IDs and ranks though.
{"cuda", | ||
[local_rank] { | ||
return gko::CudaExecutor::create( | ||
local_rank, gko::ReferenceExecutor::create(), true); |
Continuing along @tcojean's questions, right now this might not work when there's only device 0 on each process, right? Even if the device ID is global and independent of which process is assigned what resource by the scheduler, there's no guarantee device ID 3 would be accessible to rank 3, for example.
Would it not be better to either (1) always assume device 0, assuming the user maps one GPU to one process in the job scheduler (`--gpus-per-task=1` on Slurm, I think), or (2) use hwloc or something to query the number of GPUs available to this task and then decide the `device_id`? Option 2 becomes useful only when we want to support multiple GPUs per process, so maybe we ignore that for now.
I'm not sure what is best to take care of that right now, but I think that's a fairly important point and a fairly annoying one. AFAIK, both situations can happen where only one device (number 0) is available for each MPI process (as you say using some job scheduler option), or each process sees the full machine and thus GPUs need to be split like now.
We probably should try to make it work in both situations? But what would be a convenient way? Should we have a simple function somewhere doing `min(available_devices, local_rank)`?
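A sketch of what such a helper could look like, assuming `gko::CudaExecutor::get_num_devices()` reports the devices visible to this process; the modulo is a variant of the `min` idea that also wraps ranks around the visible devices:

```cpp
#include <ginkgo/ginkgo.hpp>

// Hypothetical helper: map an MPI local rank to a usable CUDA device ID.
// If the scheduler exposes a single device per process, every rank gets
// device 0; if every rank sees the whole node, ranks are spread
// round-robin over the visible devices.
int map_local_rank_to_device(int local_rank)
{
    const auto num_devices = gko::CudaExecutor::get_num_devices();
    return num_devices > 0 ? local_rank % num_devices : 0;
}
```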
According to this, on Summit the default setting is one GPU per process. On Spock, the default is that all processes see all GPUs. I guess we need to support both, or strongly document that users must set the correct scheduler settings.
I think that is regarding assignment rather than what the process can view. With Summit, AFAIK, it is a bit different with resource sets. I think you always need to specify how many GPUs you need. And the numbering depends on how many resource sets you ask for. By default, I think it is set to 1, so the GPUs will be numbered from 0 to 5.
```cpp
}
ranges_array.get_data()[comm->size()] = mat_size;
auto partition = gko::share(
    part_type::build_from_contiguous(exec->get_master(), ranges_array));
```
We should probably make it explicit in the naming that this is a row partition: either `gko::distributed::RowPartition` for the class name, or `part_type::build_from_contiguous_rows`, something like that.
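For context, a sketch of how the contiguous row ranges feeding this call might be assembled; the array type, index type, and the even split are assumptions rather than the PR's exact code:

```cpp
// Hypothetical setup: rank r owns rows [r * chunk, (r + 1) * chunk), with
// the final entry clamped to mat_size so every row is covered.
gko::Array<gko::int64> ranges_array(exec->get_master(), comm->size() + 1);
const auto chunk = mat_size / comm->size();
for (int r = 0; r < comm->size(); ++r) {
    ranges_array.get_data()[r] = r * chunk;
}
ranges_array.get_data()[comm->size()] = mat_size;
auto partition = gko::share(
    part_type::build_from_contiguous(exec->get_master(), ranges_array));
```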
Here are some very minor typos I found.
```cpp
/**
 * Generates matrix data for a 2D stencil matrix. If restricted is set to true,
 * creates a 5-pt stencil, if it is false creates a 9-pt stencil. If
 * strong_scaling is set to true, creates the same problemsize independent of
```
```diff
- * strong_scaling is set to true, creates the same problemsize independent of
+ * strong_scaling is set to true, creates the same problem size independent of
```
```cpp
 * Generates matrix data for a 2D stencil matrix. If restricted is set to true,
 * creates a 5-pt stencil, if it is false creates a 9-pt stencil. If
 * strong_scaling is set to true, creates the same problemsize independent of
 * the number of ranks, if it false the problem size grows with the number of
```
```diff
- * the number of ranks, if it false the problem size grows with the number of
+ * the number of ranks, if set to false the problem size grows with the number of
```
```cpp
/**
 * Generates matrix data for a 3D stencil matrix. If restricted is set to true,
 * creates a 7-pt stencil, if it is false creates a 27-pt stencil. If
 * strong_scaling is set to true, creates the same problemsize independent of
```
```diff
- * strong_scaling is set to true, creates the same problemsize independent of
+ * strong_scaling is set to true, creates the same problem size independent of
```
```cpp
 * Generates matrix data for a 3D stencil matrix. If restricted is set to true,
 * creates a 7-pt stencil, if it is false creates a 27-pt stencil. If
 * strong_scaling is set to true, creates the same problemsize independent of
 * the number of ranks, if it false the problem size grows with the number of
```
```diff
- * the number of ranks, if it false the problem size grows with the number of
+ * the number of ranks, if set to false the problem size grows with the number of
```
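To make the generator's contract concrete, here is a minimal sketch of the restricted (5-pt) 2D case with all values set to 1, matching the PR description; the function name and index types are assumptions, not the PR's implementation:

```cpp
#include <ginkgo/ginkgo.hpp>

// Hypothetical 5-pt stencil assembly on an n x n grid: each grid point is
// coupled to itself and its four axis-aligned neighbors.
gko::matrix_data<double, gko::int64> generate_2d_5pt_stencil(gko::int64 n)
{
    gko::matrix_data<double, gko::int64> data{gko::dim<2>(n * n, n * n)};
    for (gko::int64 i = 0; i < n; ++i) {
        for (gko::int64 j = 0; j < n; ++j) {
            const auto row = i * n + j;
            data.nonzeros.emplace_back(row, row, 1.0);  // center
            if (i > 0) data.nonzeros.emplace_back(row, row - n, 1.0);
            if (i < n - 1) data.nonzeros.emplace_back(row, row + n, 1.0);
            if (j > 0) data.nonzeros.emplace_back(row, row - 1, 1.0);
            if (j < n - 1) data.nonzeros.emplace_back(row, row + 1, 1.0);
        }
    }
    data.ensure_row_major_order();  // sort entries row-major for consumers
    return data;
}
```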
```cpp
int main(int argc, char* argv[])
{
    const auto fin = gko::mpi::init_finalize(argc, argv);
    // Use some shortcuts. In Ginkgo, vectors are seen as a gko::matrix::Dense
```
This comment, starting at "In Ginkgo ...", is a bit convoluted; maybe leave it out altogether?
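Presumably the shortcuts are type aliases along these lines (the alias names are assumptions), which may be self-explanatory enough without the prose:

```cpp
using ValueType = double;
// Ginkgo has no separate vector class; (multi-)vectors are Dense matrices
// with one or more columns.
using vec = gko::matrix::Dense<ValueType>;
```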
Superseded by #1204
This PR adds an example enabling users to perform simple strong and weak scaling studies for the distributed SpMV.
The example can be run with 2D (5-pt or 9-pt) and 3D (7-pt or 27-pt) stencil matrices. All matrix values are 1 for now as the result values don't really matter for this example.