-
Notifications
You must be signed in to change notification settings - Fork 429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARCH/X86: Add Turin support #10272
base: master
Are you sure you want to change the base?
ARCH/X86: Add Turin support #10272
Conversation
Signed-off-by: Arun Chandran <Arun.Chandran@amd.com> Acked-by: Edgar Gabriel <edgar.gabriel@amd.com>
src/ucs/arch/x86_64/cpu.c
Outdated
case 0x00 ... 0x2f: | ||
case 0x40 ... 0x4f: | ||
case 0x60 ... 0x7f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems it's a non-standard extension? maybe use if condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I use the 'if condition' for 'model', it would make the code less structured. The 'case ranges' are a well-known extension supported by both GCC and Clang. Can we stick with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd prefer if/else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change the style then.
It would be good to have this patch btw. in 1.18.0 (if possible) |
config/ucx.conf
Outdated
@@ -35,3 +35,7 @@ UCX_GDR_COPY_LAT=get:1.65e-6,put:0.65e-6 | |||
[AMD Genoa] | |||
CPU model=Genoa | |||
UCX_DISTANCE_BW=auto,sys:5100MBs | |||
|
|||
[AMD Turin] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that possible to set UCX_DISTANCE_BW
one time for all AMD CPU families? Something like:
[AMD]
CPU vendor=AMD
UCX_DISTANCE_BW=auto,sys:5100MBs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain more about "UCX_DISTANCE_BW" ? Why we have to keep 'sys' at 5100MBs?[I was not sure about this, I used the same value that used in Rome, Milan & Genoa] Is this should be calculated at runtime?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question, that's pretty old setting that was measured long time ago on some AMD system and I am not sure whether we need to set it for Turin systems too. @yosefe WDYT?
My previous comment was only suggesting to set it one time for all AMD system instead of patching each CPU generation separately, but probably if we don't want to set this value for this generation this change would redundant too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe could you please comment on "UCX_DISTANCE_BW" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UCX_DISTANCE_BW is specifying the expected PCIe bandwidth as a function of distance as defined here: https://github.com/openucx/ucx/blob/master/src/ucs/config/global_opts.c#L78
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently UCX cannot determine the actual ucs_global_opts.dist.sys.bandwidth value, the default is 220 MB/s which is probably not right value for AMD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be derrived reading /proc or /sys entries ? How shall we proceed further here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be derrived reading /proc or /sys entries ? How shall we proceed further here?
This is a question for AMD, I guess. I'm not aware of such way, but would be great to have.
I'd suggest for now to define single BW value for all AMD CPU's , as suggest by @ivankochin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yosefe @ivankochin Apologies for my ignorance, but I want to ensure I understand this correctly before proceeding. With 2 Turin-only nodes connected via IB, are there benchmark tests available to measure the impact of varying "UCX_DISTANCE_BW=auto,sys:values_from_low_to_high" values? Or Does this only apply to GPU-to-RDMA transfers in a GPU cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This setting may affect the threshold switching of GPU protocols, so probably benchmarks like "osu_bw D D" could uncover the difference on some systems.
d8692fa
to
e591a17
Compare
Acked-by: Edgar Gabriel edgar.gabriel@amd.com
What?
Adding support for AMD's Turin processors.
Ref: https://www.amd.com/en/products/processors/server/epyc/9005-series.html