
Is it a good idea to use GCN cross lane instruction for optimization? #510

Open · fancyIX opened this issue Oct 24, 2023 · 15 comments

Comments
fancyIX commented Oct 24, 2023

Many CUDA optimization techniques can be migrated to AMD OpenCL. Besides the smaller LDS, one big barrier is that OpenCL doesn't have a cross-lane shuffle function like CUDA's shfl. However, inline assembly is well supported by the ROCm compiler on Navi cards, so we can use DPP instructions to exchange registers between threads even faster. Anyone interested in this work?
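To illustrate, here is a minimal sketch of such a DPP exchange via inline assembly as accepted by the ROCm OpenCL compiler (the helper name is made up, and the s_nop instructions cover the DPP wait states on GCN hardware):

// Shift values by one lane within each 16-lane row of the wavefront,
// using a DPP modifier on v_mov_b32. No LDS traffic is involved.
inline float dpp_row_shl1(float x) {
    float result;
    __asm__ volatile(
        "s_nop 1\n"
        "v_mov_b32_dpp %0, %1 row_shl:1 row_mask:0xf bank_mask:0xf\n"
        "s_nop 1"
        : "=&v"(result)
        : "v"(x));
    return result;
}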

fancyIX (Author) commented Oct 24, 2023

It seems that even the sub_group functions are used only for Intel, not for AMD.
I mean https://bashbaug.github.io/OpenCL-Docs/html/OpenCL_Ext.html

CNugteren (Owner) commented:

Is this related to the subgroup shuffling, which is already implemented for NVIDIA and Intel but not used for AMD?
See lines 115 to 144 here and lines 20 to 54 here.

fancyIX (Author) commented Oct 24, 2023

> Is this related to the subgroup shuffling, which is already implemented for NVIDIA and Intel but not used for AMD?

Partially, yes.
Also, there is a lot of LDS reading/writing. I guess using DPP instructions could improve performance a lot, based on my experience optimizing miners for AMD GPUs.


CNugteren (Owner) commented:

I'm happy to review a pull request for this feature and/or provide some guidance for anyone who wants to develop this. I don't have time myself (nor the hardware to test on), so we'll have to rely on the community.

fancyIX (Author) commented Oct 25, 2023

@CNugteren without modifying the logic much, just replacing the LDS reads/writes, I'm not sure that can improve the performance a lot. It seems like "invert" and "transpose" could be improved a lot. Basically, any frequent data exchange between threads in a wavefront could potentially be sped up. Any suggestions on this?

CNugteren (Owner) commented:

Regarding optimizing the loads/stores from memory, I'm not sure there is that much to gain, but it depends on the matrix dimensions of course. In the ideal case GEMM is compute-bound and not memory-bound. But I'm not familiar with AMD's recent GPU architectures and thus I can't say much about the actual benefits of these load instructions you are talking about.

Regarding improving the transpose or invert functions, I also don't think that is where the big gains are, because ideally they don't consume much time; it is the matrix-multiplication kernel itself afterwards that matters most. But again, this depends on the actual parameters the user supplies to the CLBlast program. That said, every little bit helps, so contributions there are also welcome.

I think the main benefit could be by using these cross-lane operations on AMD GPUs in the same way the current 'shuffle' instructions are used: to move data across threads in a cheap way, instead of going through the local SRAM memories or caches. But again I haven't studied recent AMD architectures much so I don't know about the impact these instructions can have on the total picture.
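For reference, a minimal sketch of that idea using the portable cl_khr_subgroups extension (an illustrative demo kernel, not CLBlast code): lane 0's value reaches every work-item in the subgroup directly, with no __local staging buffer and no barrier().

#pragma OPENCL EXTENSION cl_khr_subgroups : enable

// Broadcast lane 0's value across the subgroup without touching local memory.
__kernel void broadcast_demo(__global const float* in, __global float* out) {
    const size_t gid = get_global_id(0);
    out[gid] = sub_group_broadcast(in[gid], 0);
}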

fancyIX (Author) commented Oct 25, 2023

Is "shuffle" can be applied to any opencl kernel? Any candidate kernel to investigate on?

fancyIX (Author) commented Oct 25, 2023

Found this article interesting: https://cnugteren.github.io/tutorial/pages/page10.html
I'm not sure where this shfl logic is in the current implementation. Maybe here:

const realN aval = clblast_sub_group_shuffle(apm[_ki], _ni);

It seems like we can replace it with AMD OpenCL's extension for subgroup shuffling. I'm not sure how much that could improve the speed; the time saved on LDS reads/writes may not be much. But there may be potential for more wavefronts to run if we save some LDS usage.
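A hypothetical sketch of what an AMD wrapper could look like, assuming the ROCm OpenCL compiler exposes clang's __builtin_amdgcn_ds_bpermute builtin (the wrapper name here is made up, and this would cover single precision with VWN == 1 only):

inline float clblast_sub_group_shuffle_amd(float value, int src_lane) {
    // ds_bpermute: each lane reads the value held by the lane whose
    // byte address is given in the first operand, hence the multiply by 4.
    const int bits = __builtin_amdgcn_ds_bpermute(src_lane * 4, as_int(value));
    return as_float(bits);
}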

CNugteren (Owner) commented:

Is "shuffle" can be applied to any opencl kernel? Any candidate kernel to investigate on?

The main kernel would be the level-3 GEMM kernel (the regular one, not the 'direct' one). That kernel covers most of the compute-heavy work in CLBlast.

> Found this article interesting: https://cnugteren.github.io/tutorial/pages/page10.html
> I'm not sure where this shfl logic is in the current implementation.

Yes, I think that is the same, although that tutorial is quite old compared to the current CLBlast kernel implementation, so some things might have changed.

> It seems like we can replace it with AMD OpenCL's extension for subgroup shuffling.

Indeed, see also the links I posted above pointing at the Intel and NVIDIA implementations. You can probably add an AMD version there, then run the CLBlast GEMM tuner and see if you get more performance out.

fancyIX (Author) commented Oct 28, 2023

@CNugteren while I am working on a PR that uses cross-lane instructions to do the subgroup shuffling, I have a question:
https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L47C3-L47C3

It seems like the instruction

shfl.sync.idx.b32

only works when "realN" is "float". If N is bigger than 1, or real is double, a single b32 instruction doesn't seem to work, assuming one b32 instruction can only process one 32-bit register.

CNugteren (Owner) commented:

> It seems like the instruction shfl.sync.idx.b32 only works when "realN" is "float". If N is bigger than 1, or real is double, a single b32 instruction doesn't seem to work, assuming one b32 instruction can only process one 32-bit register.

You can see the definition of realN here:

And thus you can use the VWN define to guard your code. So you can do something like:

#if VWN == 1
    // your code
#else
    // regular fallback code
#endif

Or you could have a specific implementation for VWN == 2 etc. as well?
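For the double-precision case, one possible sketch is to split the 64-bit value into two 32-bit halves and shuffle each half separately (clblast_sub_group_shuffle_i32 is a hypothetical 32-bit lane-shuffle primitive, standing in for whichever one gets implemented):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

inline double shuffle_double(double value, int src_lane) {
    int2 halves = as_int2(value);  // reinterpret the 64-bit value as two 32-bit ints
    halves.x = clblast_sub_group_shuffle_i32(halves.x, src_lane);
    halves.y = clblast_sub_group_shuffle_i32(halves.y, src_lane);
    return as_double(halves);      // reassemble the shuffled halves
}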

fancyIX (Author) commented Oct 30, 2023

The current AMD PR doesn't work with 64-bit precision, where two registers are needed per double. I will change the PR.
But I still don't know whether the current NVIDIA implementation works: it only uses one instruction with a 32-bit operand. How is that supposed to work with 64-bit precision or N greater than 2?

fancyIX (Author) commented Oct 31, 2023

@tyler-utah what do you think?
https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L47C3-L47C3

It only uses one instruction with a 32-bit operand. How is that supposed to work with 64-bit precision or N greater than 2?

CNugteren (Owner) commented:

That NVIDIA feature is simply guarded so that it only activates in single precision:
https://github.com/CNugteren/CLBlast/blob/master/src/utilities/compile.cpp#L69

You can do something similar for AMD.
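On the kernel side, the same pattern as the VWN guard above could apply, using the PRECISION define that CLBlast already passes to its kernels (a sketch, assuming the AMD path is enabled through a similar USE_SUBGROUP_SHUFFLING define):

#if defined(USE_SUBGROUP_SHUFFLING) && PRECISION == 32
    // cross-lane fast path (single precision, VWN == 1)
#else
    // regular LDS-based fallback
#endif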
