Add EFA NCCL test case, unmanaged nodegroup template #427
containers:
- image: public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3
- image: TODO
Temporary until there's a build pipeline set up for the image added in this PR.
- FI_LOG_LEVEL=warn
- -x
- FI_EFA_USE_DEVICE_RDMA=1
- -x
- OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=0
This is important: it makes the test case fail if something goes wrong with EFA RDMA, instead of falling back to the dirt-slow host buffer copy.
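A minimal sketch of how the `-x NAME=value` pairs in the snippet above line up as launcher arguments, assuming the launcher is Open MPI's `mpirun` (whose `-x` flag exports an environment variable to the remote ranks). The helper name `mpirun_env_args` is hypothetical and is not part of this PR:

```python
def mpirun_env_args(env):
    """Flatten {NAME: value} into ['-x', 'NAME=value', ...], keeping insertion order.

    Hypothetical helper for illustration only; the PR writes these
    arguments out by hand in the test manifest.
    """
    args = []
    for name, value in env.items():
        args += ["-x", f"{name}={value}"]
    return args


# Values copied from the diff under review.
efa_env = {
    "FI_LOG_LEVEL": "warn",
    "FI_EFA_USE_DEVICE_RDMA": "1",
    # 0 keeps the GDR-required check enabled, so the run fails
    # rather than silently falling back to host buffer copies.
    "OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK": "0",
}

print(mpirun_env_args(efa_env))
```

Each variable needs its own `-x`, which is why the manifest alternates `-x` entries with `NAME=value` entries.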
@@ -52,6 +52,7 @@ spec:
- p4de.24xlarge
- trn1.32xlarge
- trn1n.32xlarge
- p5.48xlarge
Do we have some control over it, so that we can easily tune/stabilize the test?
This adds the p5 types to the nodeSelector for the EFA device plugin; it's just a prerequisite for using p5s.
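To illustrate why the list matters, here is a rough sketch of the selection logic, assuming the device plugin only lands on nodes whose instance-type label is in the allowed list. The label key `node.kubernetes.io/instance-type` and the function name are assumptions for illustration; the manifest in this PR may key on something else, and only the entries visible in the diff are listed:

```python
# Instance types visible in the diff hunk; the real list is longer.
ALLOWED_INSTANCE_TYPES = [
    "p4de.24xlarge",
    "trn1.32xlarge",
    "trn1n.32xlarge",
    "p5.48xlarge",  # added by this PR
]

# Assumed label key (the well-known Kubernetes instance-type label).
INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"


def is_plugin_eligible(node_labels, allowed=ALLOWED_INSTANCE_TYPES):
    """Return True if the node's instance type is in the allowed list."""
    return node_labels.get(INSTANCE_TYPE_LABEL) in allowed


print(is_plugin_eligible({INSTANCE_TYPE_LABEL: "p5.48xlarge"}))
print(is_plugin_eligible({INSTANCE_TYPE_LABEL: "m5.large"}))
```

Under this assumption, a p5 node without the new entry would never run the device plugin, so its pods could not request EFA interfaces.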
Description of changes:
This adds a new option to the eksapi deployer, --efa, which creates an EFA-enabled unmanaged nodegroup. It also adds a case to the nvidia e2e tests that performs an AllReduce operation using 2 workers on EFA-enabled nodes.
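For readers unfamiliar with the collective, the following is a small pure-Python sketch of what a sum AllReduce computes across 2 workers. It only models the result; it does not use NCCL, MPI, or EFA, and the function name `all_reduce_sum` is hypothetical:

```python
def all_reduce_sum(worker_buffers):
    """Simulate a sum AllReduce: every worker ends up with the element-wise sum.

    worker_buffers is a list with one equal-length list of numbers per worker.
    """
    length = len(worker_buffers[0])
    if any(len(buf) != length for buf in worker_buffers):
        raise ValueError("all workers must contribute buffers of equal length")
    reduced = [sum(values) for values in zip(*worker_buffers)]
    # Each worker receives its own copy of the reduced buffer.
    return [list(reduced) for _ in worker_buffers]


# Two workers, matching the worker count used in the new test case.
result = all_reduce_sum([[1, 2, 3], [10, 20, 30]])
print(result)  # [[11, 22, 33], [11, 22, 33]]
```

In the real test the buffers live in GPU memory on separate nodes, so the reduction exercises the inter-node network path, which is the part EFA provides.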
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.