Kleidi Integration #5162
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5162
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 3abbc5e with merge base b60fa71.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from dd97512 to 89b1783.
Llama Benchmarks on One Plus 12 with 4 Threads
so IIUC ~20% faster prefill for longer prompts with QP8? seems like a win :)
@@ -630,7 +630,11 @@ Error defineConvertNode(
    subgraph_ptr,
    remapped_ids.at(graph_node->input_id()),
    remapped_ids.at(graph_node->output_id()),
#ifdef ENABLE_XNNPACK_KLEIDI
    0x00000080);
what is this magic?
XNNPACK folk didn't make this available through xnnpack.h. They actually do the same with mediapipe, LOL.
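For readers wondering how that literal ends up being used: below is a minimal sketch (not the actual patch) of naming and forwarding the hard-coded value inside `defineConvertNode`. The constant name `kMaybePackForQb4wGemm` is purely illustrative, and the assumption is that `0x00000080` stays in sync with XNNPACK's internal `XNN_FLAG_MAYBE_PACK_FOR_QB4W_GEMM`, which xnnpack.h does not export.

```cpp
#include <cstdint>
#include <xnnpack.h>  // for xnn_define_convert and xnn_status

// ENABLE_XNNPACK_KLEIDI is the compile-time guard used by this PR.
// ASSUMPTION: 0x00000080 matches XNNPACK's internal
// XNN_FLAG_MAYBE_PACK_FOR_QB4W_GEMM, which is not exposed in xnnpack.h.
#ifdef ENABLE_XNNPACK_KLEIDI
constexpr uint32_t kMaybePackForQb4wGemm = 0x00000080;  // illustrative name
#else
constexpr uint32_t kMaybePackForQb4wGemm = 0;  // no extra packing hint
#endif

// Inside defineConvertNode(), the flag is forwarded as the last argument:
xnn_status status = xnn_define_convert(
    subgraph_ptr,
    remapped_ids.at(graph_node->input_id()),
    remapped_ids.at(graph_node->output_id()),
    kMaybePackForQb4wGemm);
```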
Force-pushed from 35a5ee0 to 4d96001.
Force-pushed from afb5b66 to 9a06ca9.
@@ -630,7 +630,14 @@ Error defineConvertNode(
    subgraph_ptr,
    remapped_ids.at(graph_node->input_id()),
    remapped_ids.at(graph_node->output_id()),
#ifdef ENABLE_XNNPACK_KLEIDI
    // This maps to XNNPACK's XNN_FLAG_MAYBE_PACK_FOR_QB4W_GEMM
perhaps we should fix it in XNNPACK, independent of this PR
I don't think they want this public yet, since it would cause backwards compatibility issues in the future. This is likely the path until they have a complete qp8 story.
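If the hard-coded value does stay around for a while, one possible interim pattern (just a sketch, not what this PR does) is to guard it so that a future public definition in xnnpack.h would take precedence automatically:

```cpp
// Hypothetical forward-compat shim: if XNNPACK later exports the flag in
// xnnpack.h, that definition is used; otherwise fall back to the known value.
#ifndef XNN_FLAG_MAYBE_PACK_FOR_QB4W_GEMM
#define XNN_FLAG_MAYBE_PACK_FOR_QB4W_GEMM 0x00000080
#endif
```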
I assume we will add this back or fix this before landing this internally?
I hope to not have this at all anymore
@mcr229 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 9a06ca9 to 05aaa34.
Summary:

# Bringing KleidiAI QB4 Kernels to ExecuTorch

KleidiAI has released QB4 kernels which pack the activation while dynamically quantizing it, improving the performance of the GEMM kernel. We leverage these kernels through XNNPACK by wiring them up there. This integration is still waiting on a couple of dependent PRs in other repos to land.

## Dependent PR Tracking

* google/XNNPACK#7003
* https://gitlab.arm.com/kleidi/kleidiai/-/merge_requests/28

## Notes on the Update

When updating XNNPACK to the branch with the integrated Kleidi kernels, we have to make some changes to the CMake files because of refactoring done in XNNPACK. prod-microkernels and kleidiai are both static libraries linked into libXNNPACK.a. Since the llama runner (which links against xnnpack_backend) is a separate project, we need to install these new static libraries so that we can later link them into the llama runner. These changes can be seen in the corresponding CMake files. The new feature is currently guarded behind the EXECUTORCH_XNNPACK_ENABLE_KLEIDI flag.

## Repro

```
git submodule sync
git submodule update --init
```

I used the following aliases to make it easier to build llama_main for Android:

```
alias build_et_android="cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android \
    -DEXECUTORCH_ENABLE_LOGGING=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
    -DXNNPACK_ENABLE_ARM_BF16=OFF \
    -Bcmake-out-android . \
    && cmake --build cmake-out-android -j16 --target install --config Release"

alias build_llama_android="cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android \
    -DCMAKE_BUILD_TYPE=Release \
    -DPYTHON_EXECUTABLE=python \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_USE_TIKTOKEN=ON \
    -Bcmake-out-android/examples/models/llama2 \
    examples/models/llama2 \
    && cmake --build cmake-out-android/examples/models/llama2 -j16 --config Release"
```

I run the following:

```
build_et_android
build_llama_android
cd cmake-out-android/examples/models/llama2
adb push llama_main /data/local/tmp/
adb push <path/to/llama3.pte> /data/local/tmp
adb push <path/to/tiktokenizer> /data/local/tmp
adb shell "cd /data/local/tmp && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.bin> --cpu_threads=4"
```

## Benchmarks

I ran llama3.1 with

* sdpa_w_kvcache
* quantized embeddings
* 4-bit blockwise quantized weights
* dynamic shapes
* parallel prefill

on a Samsung S22 with 4 threads.

### Baseline (QD8)

```
I 00:00:32.772974 executorch:stats.h:84] Prompt Tokens: 8 Generated Tokens: 119
I 00:00:32.772980 executorch:stats.h:90] Model Load Time: 15.273000 (seconds)
I 00:00:32.773014 executorch:stats.h:100] Total inference time: 17.488000 (seconds) Rate: 6.804666 (tokens/second)
I 00:00:32.773019 executorch:stats.h:108] Prompt evaluation: 2.971000 (seconds) Rate: 2.692696 (tokens/second)
I 00:00:32.773023 executorch:stats.h:119] Generated 119 tokens: 14.517000 (seconds) Rate: 8.197286 (tokens/second)
I 00:00:32.773027 executorch:stats.h:127] Time to first generated token: 2.971000 (seconds)
I 00:00:32.773030 executorch:stats.h:134] Sampling time over 127 tokens: 0.173000 (seconds)
```

### QP8

```
I 00:00:46.767429 executorch:stats.h:84] Prompt Tokens: 8 Generated Tokens: 119
I 00:00:46.767437 executorch:stats.h:90] Model Load Time: 28.297000 (seconds)
I 00:00:46.767475 executorch:stats.h:100] Total inference time: 18.436000 (seconds) Rate: 6.454762 (tokens/second)
I 00:00:46.767483 executorch:stats.h:108] Prompt evaluation: 1.770000 (seconds) Rate: 4.519774 (tokens/second)
I 00:00:46.767491 executorch:stats.h:119] Generated 119 tokens: 16.666000 (seconds) Rate: 7.140286 (tokens/second)
I 00:00:46.767522 executorch:stats.h:127] Time to first generated token: 1.770000 (seconds)
I 00:00:46.767527 executorch:stats.h:134] Sampling time over 127 tokens: 0.189000 (seconds)
```

We see a ~+68% performance improvement on prefill and a ~-13% regression on decode. See the dependent XNNPACK PR for more benchmarking details.

Pull Request resolved: pytorch#5162
Reviewed By: digantdesai
Differential Revision: D63651987
Pulled By: mcr229
This pull request was exported from Phabricator. Differential Revision: D63651987
Force-pushed from 05aaa34 to 8302fad.
This pull request was exported from Phabricator. Differential Revision: D63651987
Force-pushed from 8302fad to 3abbc5e.