Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARCH/X86: Add Turin support #10272

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

arun-chandran-edarath
Copy link
Contributor

Acked-by: Edgar Gabriel edgar.gabriel@amd.com

What?

Adding support for AMD's Turin processors.

Ref: https://www.amd.com/en/products/processors/server/epyc/9005-series.html

Signed-off-by: Arun Chandran <Arun.Chandran@amd.com>
Acked-by: Edgar Gabriel <edgar.gabriel@amd.com>
Comment on lines 491 to 493
case 0x00 ... 0x2f:
case 0x40 ... 0x4f:
case 0x60 ... 0x7f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems it's a non-standard extension? maybe use if condition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I use the 'if condition' for 'model', it would make the code less structured. The 'case ranges' are a well-known extension supported by both GCC and Clang. Can we stick with it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd prefer if/else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change the style then.

@edgargabriel
Copy link
Contributor

It would be good to have this patch btw. in 1.18.0 (if possible)

config/ucx.conf Outdated
@@ -35,3 +35,7 @@ UCX_GDR_COPY_LAT=get:1.65e-6,put:0.65e-6
[AMD Genoa]
CPU model=Genoa
UCX_DISTANCE_BW=auto,sys:5100MBs

[AMD Turin]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that possible to set UCX_DISTANCE_BW one time for all AMD CPU families? Something like:

[AMD]
CPU vendor=AMD
UCX_DISTANCE_BW=auto,sys:5100MBs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain more about "UCX_DISTANCE_BW" ? Why we have to keep 'sys' at 5100MBs?[I was not sure about this, I used the same value that used in Rome, Milan & Genoa] Is this should be calculated at runtime?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question, that's pretty old setting that was measured long time ago on some AMD system and I am not sure whether we need to set it for Turin systems too. @yosefe WDYT?

My previous comment was only suggesting to set it one time for all AMD system instead of patching each CPU generation separately, but probably if we don't want to set this value for this generation this change would redundant too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yosefe could you please comment on "UCX_DISTANCE_BW" ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UCX_DISTANCE_BW is specifying the expected PCIe bandwidth as a function of distance as defined here: https://github.com/openucx/ucx/blob/master/src/ucs/config/global_opts.c#L78

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently UCX cannot determine the actual ucs_global_opts.dist.sys.bandwidth value, the default is 220 MB/s which is probably not right value for AMD.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be derrived reading /proc or /sys entries ? How shall we proceed further here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be derrived reading /proc or /sys entries ? How shall we proceed further here?

This is a question for AMD, I guess. I'm not aware of such way, but would be great to have.
I'd suggest for now to define single BW value for all AMD CPU's , as suggest by @ivankochin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yosefe @ivankochin Apologies for my ignorance, but I want to ensure I understand this correctly before proceeding. With 2 Turin-only nodes connected via IB, are there benchmark tests available to measure the impact of varying "UCX_DISTANCE_BW=auto,sys:values_from_low_to_high" values? Or Does this only apply to GPU-to-RDMA transfers in a GPU cluster?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This setting may affect the threshold switching of GPU protocols, so probably benchmarks like "osu_bw D D" could uncover the difference on some systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants