Commit
Summary:

# Bringing KleidiAI QB4 Kernels to ExecuTorch

KleidiAI has released QB4 kernels, which pack the activation while dynamically quantizing it to improve the performance of the GEMM kernel. We leverage these kernels through XNNPACK by wiring them up there. This integration is still waiting on a couple of dependent PRs in other repos to land.

## Dependent PR Tracking
* google/XNNPACK#7003
* https://gitlab.arm.com/kleidi/kleidiai/-/merge_requests/28

## Notes on the Update
When updating XNNPACK to the branch with the integrated Kleidi kernels, we have to make some changes to the CMake setup because of refactoring done in XNNPACK. prod-microkernels and kleidiai are both static libraries linked into libXNNPACK.a. Since the llama runner (which links against xnnpack_backend) lives in a separate project, we need to install these new static libraries so that we can later link them properly into the llama runner. These changes can be seen in the corresponding CMake files. The new feature is currently guarded behind the EXECUTORCH_XNNPACK_ENABLE_KLEIDI flag.
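For illustration, here is a minimal CMake sketch of the install step described above. The target names (`XNNPACK`, `xnnpack_backend`, `kleidiai`, `prod-microkernels`) and the flag come from this description; the exact install rules in the actual CMake files may differ.

```
# Sketch only: when the Kleidi integration is enabled, XNNPACK produces extra
# static archives that libXNNPACK.a depends on. Install them alongside the
# backend so a downstream project (the llama runner) can link against them.
if(EXECUTORCH_XNNPACK_ENABLE_KLEIDI)
  set(_xnnpack_extra_libs prod-microkernels kleidiai)
endif()

install(
  TARGETS xnnpack_backend XNNPACK ${_xnnpack_extra_libs}
  DESTINATION lib
)
```

Installing the archives matters because the llama runner performs its own link step in a separate CMake project, so the symbols cannot be resolved from libXNNPACK.a alone.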
## Repro
```
git submodule sync
git submodule update --init
```

I used the following aliases to make it easier to build llama_main for Android:

```
alias build_et_android="cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-23 \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android \
  -DEXECUTORCH_ENABLE_LOGGING=1 \
  -DCMAKE_BUILD_TYPE=Release \
  -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
  -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
  -DEXECUTORCH_BUILD_XNNPACK=ON \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
  -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
  -DXNNPACK_ENABLE_ARM_BF16=OFF \
  -Bcmake-out-android . \
  && cmake --build cmake-out-android -j16 --target install --config Release"

alias build_llama_android="cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-23 \
  -DCMAKE_INSTALL_PREFIX=cmake-out-android \
  -DCMAKE_BUILD_TYPE=Release \
  -DPYTHON_EXECUTABLE=python \
  -DEXECUTORCH_BUILD_XNNPACK=ON \
  -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
  -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
  -DEXECUTORCH_USE_TIKTOKEN=ON \
  -Bcmake-out-android/examples/models/llama2 \
  examples/models/llama2 \
  && cmake --build cmake-out-android/examples/models/llama2 -j16 --config Release"
```

I then run the following:

```
build_et_android
build_llama_android
cd cmake-out-android/examples/models/llama2
adb push llama_main /data/local/tmp/
adb push <path/to/llama3.pte> /data/local/tmp
adb push <path/to/tiktokenizer> /data/local/tmp
adb shell "cd /data/local/tmp && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.bin> --cpu_threads=4"
```

## Benchmarks
I ran llama3.1 on a Samsung S22 with 4 threads, with:
* sdpa_w_kvcache
* quantized embeddings
* 4-bit blockwise quantized weights
* dynamic shapes
* parallel prefill

### Baseline (QD8)
```
I 00:00:32.772974 executorch:stats.h:84] Prompt Tokens: 8 Generated Tokens: 119
I 00:00:32.772980 executorch:stats.h:90] Model Load Time: 15.273000 (seconds)
I 00:00:32.773014 executorch:stats.h:100] Total inference time: 17.488000 (seconds) Rate: 6.804666 (tokens/second)
I 00:00:32.773019 executorch:stats.h:108] Prompt evaluation: 2.971000 (seconds) Rate: 2.692696 (tokens/second)
I 00:00:32.773023 executorch:stats.h:119] Generated 119 tokens: 14.517000 (seconds) Rate: 8.197286 (tokens/second)
I 00:00:32.773027 executorch:stats.h:127] Time to first generated token: 2.971000 (seconds)
I 00:00:32.773030 executorch:stats.h:134] Sampling time over 127 tokens: 0.173000 (seconds)
```

### QP8
```
I 00:00:46.767429 executorch:stats.h:84] Prompt Tokens: 8 Generated Tokens: 119
I 00:00:46.767437 executorch:stats.h:90] Model Load Time: 28.297000 (seconds)
I 00:00:46.767475 executorch:stats.h:100] Total inference time: 18.436000 (seconds) Rate: 6.454762 (tokens/second)
I 00:00:46.767483 executorch:stats.h:108] Prompt evaluation: 1.770000 (seconds) Rate: 4.519774 (tokens/second)
I 00:00:46.767491 executorch:stats.h:119] Generated 119 tokens: 16.666000 (seconds) Rate: 7.140286 (tokens/second)
I 00:00:46.767522 executorch:stats.h:127] Time to first generated token: 1.770000 (seconds)
I 00:00:46.767527 executorch:stats.h:134] Sampling time over 127 tokens: 0.189000 (seconds)
```

We see a ~+68% perf improvement on prefill (4.52 vs. 2.69 tokens/second on prompt evaluation) and a ~-13% regression on decode (7.14 vs. 8.20 tokens/second on generation). See the dependent XNNPACK PR for more benchmarking details.

Pull Request resolved: pytorch#5162
Reviewed By: digantdesai
Differential Revision: D63651987
Pulled By: mcr229