Variable Length Matrix Multiplication Batched Processing #409
Currently in XgemmBatched::BatchedGemmDirect (routines/levelx), the m, n, and k values are scalars. This makes it possible to process a batch of matrix multiplications in which the sizes m, n, and k do not change across the batch.

However, there may be use cases where the sizes of m, n, and k are completely variable. This forces one to use the non-batched call in a tight loop, which isn't ideal.

My suggestion is to make m, n, and k vectors, so that the dimensions of each multiplication can be specified individually. Optionally, a vector of size 1 could be passed in to indicate that all sizes are the same, instead of having two separate functions.
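For concreteness, a rough sketch of what the suggested convention could look like; everything here (the `SizeAt` helper and its semantics) is hypothetical and not part of CLBlast:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper implementing the "vector of size 1 means all batches
// share the value" convention from the suggestion above.
inline size_t SizeAt(const std::vector<size_t>& sizes, size_t batch) {
  return sizes.size() == 1 ? sizes[0] : sizes[batch];
}

int main() {
  const std::vector<size_t> ms = {3, 5, 7};  // per-batch m values
  const std::vector<size_t> ks = {4};        // one k shared by all batches
  for (size_t b = 0; b < 3; ++b) {
    const size_t m = SizeAt(ms, b);  // 3, 5, 7
    const size_t k = SizeAt(ks, b);  // 4, 4, 4
    (void)m; (void)k;                // a real routine would launch GEMM b here
  }
}
```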
I've labeled this as a feature request, and I'm not sure I can work on it any time soon. But if anyone else wants to pick it up, feel free, and I'll do a proper code review or help with questions. About your suggestion: have you seen this feature in any other (GPU) BLAS libraries? If not, why do you think they don't have it? I can imagine the gain here might not be that large, because your m/n/k values will need to be stored in and retrieved from off-chip memory, and you'll probably have to give up all kinds of other m/n/k-based optimisations (such as running a faster kernel when they are a multiple of 32) and so on. Also, have you tried the loop option with a pre-allocated temporary buffer as described in the other issue, and without synchronisations? How much performance do you lose?
I don't know of any other libraries that offer this feature, but there is a paper exploring the subject. I'm writing a custom neural network framework which isn't finished yet, so I can't give performance numbers. But I presume that enqueuing a kernel once with an optimal tiling structure would be faster than repeatedly enqueuing with different parameters.
I would then suggest that you first build your software using the available CLBlast routines and get back to me when you have an example use case and performance numbers. If cuBLAS/cuDNN don't implement this, I doubt it is useful in general. Also, if this is indeed for neural networks (training or inference?), I don't see the use case yet: operations can either be batched or have to be sequential (e.g. two layers following each other). I don't know of a use case where you would want to run multiple small matrix multiplications in parallel with different m/n/k values.
I am trying to build my software (in Java) using the OpenCL kernels directly, since I don't want two different libraries managing GPU memory. My problem is that I currently get CL_INVALID_WORK_GROUP_SIZE when I attempt to use the suggested tuning parameters, for example with the following defines (with precision left at the default 32 bits):
I've set the calculation for global and local work sizes as follows, in order to mimic what I see in xgemm.cpp (under level-3 routines):
However, when testing with a set of three 3x3 matrices generated with the following code, I get CL_INVALID_WORK_GROUP_SIZE. Can you please explain exactly how local and global work-group sizes should be calculated with respect to the size of the input matrices?
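A minimal sketch of the computation being mimicked here, modelled on the src/routines/levelx/xgemmbatched.cpp excerpt quoted later in this thread; the parameter values (WGD, MDIMCD, NDIMCD) are illustrative placeholders, not tuned ones:

```cpp
#include <cstddef>
#include <vector>

// Round x up to the next multiple of y (same behaviour as CLBlast's Ceil helper).
size_t Ceil(size_t x, size_t y) { return ((x + y - 1) / y) * y; }

int main() {
  // Illustrative tuning parameters; real values come from the tuner database.
  const size_t WGD = 8;     // tile size processed by one work-group
  const size_t MDIMCD = 8;  // local work size, dimension 0
  const size_t NDIMCD = 8;  // local work size, dimension 1

  const size_t m = 3, n = 3, batch_count = 3;

  // One work-group computes a WGD x WGD tile of C, so m and n are first
  // padded up to a multiple of WGD.
  const size_t m_ceiled = Ceil(m, WGD);
  const size_t n_ceiled = Ceil(n, WGD);

  // Three dimensions: the third one is the batch dimension.
  const std::vector<size_t> global{(m_ceiled * MDIMCD) / WGD,
                                   (n_ceiled * NDIMCD) / WGD,
                                   batch_count};
  const std::vector<size_t> local{MDIMCD, NDIMCD, 1};
  // global = {8, 8, 3}, local = {8, 8, 1} for this example.
}
```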
In general, what can help is to go through the CLBlast paper and the cited paper by Matsumoto. Another thing that could help is to look at the tuner code: it also needs to run the kernels, but it is much simpler than the actual library, since there is only one kernel to run. However, the tuner does make some assumptions as well. For the particular problem you describe, I would just print out the local and global values, see what they are, and check whether or not they make sense. You can optionally do the same in the CLBlast library for comparison. From the code you've pasted it looks fine in general, although double-check your rounding helpers against CLBlast's (see CLBlast/src/utilities/utilities.cpp, line 369 in 46fb748).

Also, in general, be careful that there are two kinds of GEMM kernels in CLBlast, but it looks like you are handling that correctly, at least from the details you've shared.
I've looked at the paper and am trying to understand exactly what the rules are for tile sizing, in other words how to determine valid values for the defines above. The main problem is that I don't have a working CLBlast dev environment: I've imported CLBlast through JOCLBlast, which provides the Java -> C++ bindings, but my IDE (IntelliJ) doesn't support C++ debugging AFAIK. Looking at the XgemmDirectBatchedNN kernel, it sets the kernel attribute `__kernel __attribute__((reqd_work_group_size(MDIMCD, NDIMCD, 1)))` requesting a specific local work-group size, and then calls the XgemmDirect inline function. Given that I've set my local work-group size using the MDIMCD and NDIMCD defines, why is it failing with CL_INVALID_WORK_GROUP_SIZE? Can you point me in the direction of the sanity checks for this kernel?
If you look in the OpenCL specification (https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html) you will see that there can be multiple reasons for CL_INVALID_WORK_GROUP_SIZE. As I said in the previous message, I believe yours relates to:

> CL_INVALID_WORK_GROUP_SIZE if local_work_size is specified and number of work-items specified by global_work_size is not evenly divisible by size of work-group given by local_work_size or does not match the work-group size specified for kernel using the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier in program source.
But if you've already tried my earlier suggestion and it doesn't help, you should check the other conditions under which this call can fail, as listed in the spec. If you want to know about the constraints, you could check the rest of the library's source code, or look at the tuner as I also suggested above. E.g. here you see some of the constraints of the tuner: https://github.com/CNugteren/CLBlast/blob/master/src/tuning/kernels/xgemm_direct.hpp#L118. You can also run the tuner and play around with some values. Of course, there are also device constraints, not just code-design constraints.
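A sketch of the kind of pre-launch sanity checks being suggested, combining tuner-style divisibility constraints with the device limit from the spec; the exact constraint set here is illustrative, not CLBlast's authoritative one (that lives in the tuner source linked above):

```cpp
#include <cstddef>
#include <cstdio>

bool IsMultiple(size_t x, size_t y) { return (x % y) == 0; }

// Illustrative checks: tile size divisible by the local work size in each
// dimension, and the total local size within the device's
// CL_DEVICE_MAX_WORK_GROUP_SIZE limit.
bool CheckLaunch(size_t WGD, size_t MDIMCD, size_t NDIMCD,
                 size_t max_work_group_size) {
  if (!IsMultiple(WGD, MDIMCD) || !IsMultiple(WGD, NDIMCD)) return false;
  if (MDIMCD * NDIMCD > max_work_group_size) return false;
  return true;
}

int main() {
  // Example: WGD = MDIMCD = NDIMCD = 8 on a device allowing 1024 work-items.
  std::printf("%s\n", CheckLaunch(8, 8, 8, 1024) ? "ok" : "invalid");
}
```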
I've defined the Ceil and IsMultiple functions accordingly. I've also added asserts to match what you linked in xgemm_direct.hpp; however, even with those in place and passing, I still get CL_INVALID_WORK_GROUP_SIZE. My device (Nvidia GTX 1070) has the following parameters:
I use the following code for my defines as well as the localWorkGroup and globalWorkGroup sizes. Does anything else catch your eye as being wrong here (in the context of three sets of 3x3 matrix multiplications)?
I see now that you are using the batched kernel (sorry, I should have re-read earlier), but you are only supplying 2 dimensions of local and global sizes; is that another bug perhaps? The third dimension is the batch dimension (1 for the local size, the number of batches for the global size). Other than that, I would really dig deeper. Don't just set some global/local sizes and observe the error, but actually print and inspect those values. What are they? Do they make sense? Are they computed correctly? How do they relate to the possible causes for the error as per the OpenCL spec linked above? And if all of that doesn't help: how do they compare to the ones used in CLBlast? Just modify one of the code samples in the samples folder.
For the current code, can you explain exactly how I can determine whether it makes sense or not? Since I don't have detailed knowledge of the tiling algorithm, I don't know how to interpret it. Also, even if it doesn't make sense, I presume that for CL_INVALID_WORK_GROUP_SIZE to be returned there must be a check somewhere in the CLBlast code for this kernel, aside from just `__kernel __attribute__((reqd_work_group_size(MDIMCD, NDIMCD, 1)))`.
To be honest, it feels a bit like I'm debugging your code. It should be fairly simple (without knowledge of CLBlast, but with my previous pointers) to do this debugging yourself. To illustrate, what I've done just now is mimic your problem and run the corresponding sample:

```diff
diff --git a/samples/sgemm_batched.cpp b/samples/sgemm_batched.cpp
index 32c465c..51521b8 100644
--- a/samples/sgemm_batched.cpp
+++ b/samples/sgemm_batched.cpp
@@ -40,13 +40,13 @@ int main() {
   const auto device_id = 0;
 
   // Example arguments
-  const size_t batch_count = 261;
-  const size_t m = 1;
-  const size_t n = 1;
-  const size_t k = 40;
-  const auto a_ld = 2560;
-  const auto b_ld = 160;
-  const auto c_ld = 261;
+  const size_t batch_count = 1;
+  const size_t m = 3;
+  const size_t n = 3;
+  const size_t k = 3;
+  const auto a_ld = 3;
+  const auto b_ld = 3;
+  const auto c_ld = 3;
   std::vector<float> alphas(batch_count);
   std::vector<float> betas(batch_count);
   std::vector<size_t> a_offsets(batch_count);
diff --git a/src/routines/levelx/xgemmbatched.cpp b/src/routines/levelx/xgemmbatched.cpp
index 0d7ae5a..7346c41 100644
--- a/src/routines/levelx/xgemmbatched.cpp
+++ b/src/routines/levelx/xgemmbatched.cpp
@@ -315,6 +315,8 @@ void XgemmBatched<T>::BatchedGemmDirect(const size_t m, const size_t n, const si
       batch_count
   };
   const auto local = std::vector<size_t>{db_["MDIMCD"], db_["NDIMCD"], 1};
+  printf("Local: %zu %zu %zu\n", local[0], local[1], local[2]);
+  printf("Global: %zu %zu %zu\n", global[0], global[1], global[2]);
 
   // Launches the kernel
   RunKernel(kernel, queue_, device_, global, local, event_);
```

Then I get the following output:
Thus indeed something is different. Now, looking logically at the formulas that compute this, you can see that it has to be a multiple of WGD, the tiling factor (described in a comment in the kernel source). So in short: try with WGD equal to 8.
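To make the suggested check concrete, a small worked example (assuming, purely for illustration, MDIMCD = NDIMCD = 8 alongside the suggested WGD = 8, with the sample's modified sizes m = n = 3 and batch_count = 1):

```cpp
#include <cassert>
#include <cstddef>

size_t Ceil(size_t x, size_t y) { return ((x + y - 1) / y) * y; }

int main() {
  const size_t WGD = 8, MDIMCD = 8, NDIMCD = 8;  // illustrative values
  const size_t m = 3, n = 3;
  const size_t g0 = Ceil(m, WGD) * MDIMCD / WGD;  // = 8
  const size_t g1 = Ceil(n, WGD) * NDIMCD / WGD;  // = 8
  // Each global dimension is then a whole multiple of the corresponding
  // local dimension, which is one of the conditions the spec imposes
  // before a launch can succeed.
  assert(g0 % MDIMCD == 0 && g1 % NDIMCD == 0);
}
```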
Sorry, I didn't mean to be troublesome. WGD equal to 8 resulted in the same CL_INVALID_WORK_GROUP_SIZE error as before. I realize that this may be something that isn't clearly defined in the documentation, so I've taken your advice and set up a working build environment and a test project that will let me compare step by step exactly what is going on with the tuning values. Inshallah I will get to the bottom of this and report back.
I found the problem shortly afterwards: in my clEnqueueNDRangeKernel call, workDim was set to 1 whereas it should have been globalWorkSize.length. However, I've been wrestling with a new problem over the last week: sometimes a clEnqueueReadBuffer call returns a CL_OUT_OF_RESOURCES error, and even when it doesn't, the buffer doesn't contain the correct result. I've tried side-stepping the Windows TDR timeout via a registry hack, but that didn't fix it. I did some more digging and found out that CLBlast doesn't actually use the arguments as supplied to this function but first modifies them through Xgemm::ProcessArguments. Hopefully something there will enlighten me as to why I'm getting this error on a read operation.
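A minimal sketch of the corrected launch call described here (the `launch_batched` name is a placeholder, and queue/kernel setup is omitted; the clEnqueueNDRangeKernel signature itself is the standard OpenCL one):

```cpp
#include <CL/cl.h>

// Enqueue a 3-D launch; `queue` and `kernel` are assumed to be set up already.
cl_int launch_batched(cl_command_queue queue, cl_kernel kernel,
                      const size_t global[3], const size_t local[3]) {
  // The fix discussed above: work_dim must be 3 (the length of the size
  // arrays), not 1, otherwise the local size cannot match the kernel's
  // reqd_work_group_size(MDIMCD, NDIMCD, 1) and the launch is rejected.
  return clEnqueueNDRangeKernel(queue, kernel,
                                3,      /* work_dim */
                                NULL,   /* global_work_offset */
                                global, local,
                                0, NULL, NULL);
}
```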
I converted the ProcessArguments logic to Java and noticed that it actually picks the kernel name dynamically depending on the result. I also fixed some of the OpenCL kernel argument parameters. In actuality the kernel CLBlast was running was XgemmDirectBatchedTN, not XgemmDirectBatchedNN (apparently CLBlast needs to transpose both A and C for row-major multiplication; B is transposed in the col-major case). However, this alone did not resolve the issue. After praying for an answer, the divine provided one: when I was supplying the primitives to the kernel, I was doing so by providing a pointer to a cl_mem buffer, whereas the kernel itself was expecting copied values, not a pointer. This was the reason for the CL_OUT_OF_RESOURCES (I earnestly wish for better error codes in OpenCL...). After changing the way primitives are handled and passing them directly, it works as expected at last :)
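For reference, a sketch of the difference in question using the standard clSetKernelArg call; the `SetScalarArg` wrapper and the parameter name `kSizeM` are illustrative, not the poster's code:

```cpp
#include <CL/cl.h>

// Sets a scalar kernel argument, e.g. a size declared as `const int kSizeM`
// in the OpenCL C kernel: the value itself is copied into the argument.
void SetScalarArg(cl_kernel kernel, cl_uint index, cl_int value) {
  clSetKernelArg(kernel, index, sizeof(cl_int), &value);
  // What caused the trouble above would be the equivalent of passing a
  // cl_mem handle instead, i.e. clSetKernelArg(kernel, index,
  // sizeof(cl_mem), &some_buffer), which only matches kernel parameters
  // declared as __global pointers.
}
```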