test-backend-ops : use flops for some performance tests #9657
Conversation
- parallelize tensor quantization
- use a different set of cases for performance and correctness tests
ggml-ci
Force-pushed from 2c54964 to d4c57cd
GGML_UNUSED(t);
return 2 * m * n * k * bs[0] * nr[0] * bs[1] * nr[1];
Theoretically you would only need
I agree, but most sources seem to use
The PR looks good to me based on static code analysis. Right now I'm on a train with an unreliable internet connection; I'll test the code when I get home.
tests/test-backend-ops.cpp
Outdated
static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
    std::vector<std::unique_ptr<test_case>> test_cases;

    for (int bs : {1, 8, 16, 32, 512}) {
Matrix multiplications become more efficient for larger batch sizes. At the same time the number of runs is scaled in such a way that the total FLOPs are roughly constant. So you could consider removing 16 and instead adding 1024 or 2048. That would make the test faster and cover a wider range.
I have changed it to include only 1 and 512. I think that should provide a good overview of the performance that can be expected during generation and prompt processing for people casually running the benchmark, but I expect that people working on optimizing an operation will modify this function to add the test cases relevant to what they are working on. It would be nice to be able to specify the parameters of the test cases from the command line as well, but that's a more complex change.
The way the number of runs was scaled was not very good, and I noticed that some of the tests were too short and produced very inaccurate results. I changed it so that the memory size or FLOPs is used as an initial estimate of how many times to duplicate the op in the graph, but the graph is then evaluated repeatedly until it has run for at least one second. This seems to produce much more reliable results.
* test-backend-ops : use flops for some performance tests
- parallelize tensor quantization
- use a different set of cases for performance and correctness tests
- run each test for at least one second
Fixes #8898