
test-backend-ops : use flops for some performance tests #9657

Merged: 2 commits merged into master on Sep 28, 2024

Conversation

@slaren slaren commented Sep 26, 2024

  • Use flops for some performance tests
  • Parallelize tensor quantization
  • Use a different set of test cases for performance and correctness tests

Fixes #8898

@github-actions bot added the testing label (Everything test related) Sep 26, 2024
```cpp
GGML_UNUSED(t);
// 2*m*n*k FLOPs per matrix multiplication, scaled by batch and repeat dims
return 2 * m * n * k * bs[0] * nr[0] * bs[1] * nr[1];
```
Collaborator

Theoretically you would only need $m \cdot n \cdot k$ multiplications and $m \cdot n \cdot (k - 1)$ additions. In practice, however, this is not going to make a difference since $k \gg 1$.

Collaborator Author

I agree, but most sources seem to use $2mnk$, so I thought it would be more important to be consistent with the way other people measure FLOPS than to correct a small inaccuracy.

@JohannesGaessler
Collaborator

The PR looks good to me based on static code analysis. Right now I'm on a train with an unreliable internet connection; I'll test the code when I get home.

```cpp
static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
    std::vector<std::unique_ptr<test_case>> test_cases;

    for (int bs : {1, 8, 16, 32, 512}) {
```
Collaborator

Matrix multiplications become more efficient for larger batch sizes. At the same time the number of runs is scaled in such a way that the total FLOPs are roughly constant. So you could consider removing 16 and instead adding 1024 or 2048. That would make the test faster and cover a wider range.

Collaborator Author

I have changed it to include only 1 and 512. I think that should provide a good overview of the performance that can be expected during generation and prompt processing for people casually running the benchmark, but I expect that people working on optimizing an operation will modify this function to add the test cases relevant to what they are working on. It would be nice to be able to specify the parameters of the test cases from the command line as well, but that's a more complex change.

The way the number of runs was scaled was not very good, and I noticed that some of the tests were too short and produced very inaccurate results. I changed it so that the memory size or FLOPs is used as an initial estimate of how many times to duplicate the op in the graph, but the graph is evaluated repeatedly until it has run for at least one second. This seems to produce much more reliable results.
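The strategy described above can be sketched as follows. This is a hypothetical standalone version, not the PR's actual code: the names, the 10 GFLOP initial budget, and the `eval_graph` callback are all illustrative assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>

// Hypothetical sketch: a FLOP estimate picks how many copies of the op to put
// in the graph, then the graph is evaluated repeatedly until at least one
// second of wall time has elapsed. Returns the achieved FLOP/s.
static double benchmark_flops(uint64_t flops_per_op,
                              const std::function<void(int)> & eval_graph) {
    // initial budget: duplicate the op until the graph holds ~10 GFLOP (assumed value)
    const uint64_t target_flops = 10ULL * 1000 * 1000 * 1000;
    const int n_per_graph = (int) std::max<uint64_t>(1, target_flops / flops_per_op);

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    uint64_t n_ops = 0;
    do {
        eval_graph(n_per_graph); // evaluate the graph holding n_per_graph copies of the op
        n_ops += (uint64_t) n_per_graph;
    } while (std::chrono::duration<double>(clock::now() - start).count() < 1.0);

    const double sec = std::chrono::duration<double>(clock::now() - start).count();
    return (double) n_ops * (double) flops_per_op / sec;
}
```

Measuring wall time over the whole repeated run, rather than a single short graph evaluation, is what avoids the inaccurate results mentioned above for very cheap ops.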

@slaren slaren merged commit 1b2f992 into master Sep 28, 2024
54 checks passed
@slaren slaren deleted the sl/test-backend-ops-perf-flops branch September 28, 2024 12:32
matiaslin pushed a commit to matiaslin/llama.cpp that referenced this pull request Sep 28, 2024
* test-backend-ops : use flops for some performance tests

- parallelize tensor quantization

- use a different set of cases for performance and correctness tests

- run each test for at least one second
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Successfully merging this pull request may close these issues.

test-backend-ops performance numbers incorrect
2 participants