
Kleidi Integration #5162

Closed
wants to merge 1 commit

Conversation

mcr229 (Contributor) commented Sep 7, 2024

Bringing KleidiAI QB4 Kernels to ExecuTorch

KleidiAI has released QB4 kernels, which pack the activation while dynamically quantizing it to improve the performance of the GEMM kernel. We leverage these kernels by wiring them up through XNNPACK. This integration is still waiting on a couple of dependent PRs in other repos to land.

Dependent PR Tracking

  • google/XNNPACK#7003
  • https://gitlab.arm.com/kleidi/kleidiai/-/merge_requests/28

Notes on the Update

When updating XNNPACK to the branch with the integrated Kleidi kernels, we have to make some changes to the CMake files because of refactoring done in XNNPACK. prod-microkernels and kleidiai are both static libraries linked into libXNNPACK.a. Since the llama runner (which links against xnnpack_backend) lives in a separate project, we need to install these new static libraries so that we can later link them into the llama runner properly. These changes can be seen in the corresponding CMake files. The new feature is currently guarded behind the EXECUTORCH_XNNPACK_ENABLE_KLEIDI flag; a rough sketch of the guarded install step is shown below.
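As a minimal sketch (using the target names prod-microkernels and kleidiai from the description above; the actual target names and destinations live in the backend CMake files, not here), the guarded install step looks roughly like this:

```cmake
# Sketch only -- target names/destinations follow the description above,
# not the literal diff in backends/xnnpack/CMakeLists.txt.
if(EXECUTORCH_XNNPACK_ENABLE_KLEIDI)
  # libXNNPACK.a now depends on these static libraries, so they must be
  # installed alongside it for out-of-tree consumers (e.g. the llama runner,
  # which links against xnnpack_backend) to resolve their symbols at link time.
  install(
    TARGETS prod-microkernels kleidiai
    ARCHIVE DESTINATION lib
  )
endif()
```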

Repro

git submodule sync
git submodule update --init

I used the following aliases to make it easier to build llama_main for Android:

alias build_et_android="cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android \
    -DEXECUTORCH_ENABLE_LOGGING=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_XNNPACK_ENABLE_KLEIDI=ON \
    -DXNNPACK_ENABLE_ARM_BF16=OFF \
    -Bcmake-out-android . && cmake --build cmake-out-android -j16 --target install --config Release
"
alias build_llama_android="cmake  -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-out-android \
    -DCMAKE_BUILD_TYPE=Release \
    -DPYTHON_EXECUTABLE=python \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_USE_TIKTOKEN=ON \
    -Bcmake-out-android/examples/models/llama2 \
    examples/models/llama2 && cmake --build cmake-out-android/examples/models/llama2 -j16 --config Release
"

I run the following:

build_et_android
build_llama_android
cd cmake-out-android/examples/models/llama2
adb push llama_main /data/local/tmp/
adb push <path/to/llama3.pte> /data/local/tmp
adb push <path/to/tiktokenizer> /data/local/tmp
adb shell "cd /data/local/tmp && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.bin> --cpu_threads=4"

Benchmarks

I ran llama3.1 with

  • sdpa_w_kvcache
  • quantized embeddings
  • 4bit blockwise quantized weights
  • dynamic shapes
  • parallel prefill

on Samsung S22 w/4 threads

Baseline (QD8)

I 00:00:32.772974 executorch:stats.h:84]        Prompt Tokens: 8    Generated Tokens: 119
I 00:00:32.772980 executorch:stats.h:90]        Model Load Time:                15.273000 (seconds)
I 00:00:32.773014 executorch:stats.h:100]       Total inference time:           17.488000 (seconds)              Rate:  6.804666 (tokens/second)
I 00:00:32.773019 executorch:stats.h:108]               Prompt evaluation:      2.971000 (seconds)               Rate:  2.692696 (tokens/second)
I 00:00:32.773023 executorch:stats.h:119]               Generated 119 tokens:   14.517000 (seconds)              Rate:  8.197286 (tokens/second)
I 00:00:32.773027 executorch:stats.h:127]       Time to first generated token:  2.971000 (seconds)
I 00:00:32.773030 executorch:stats.h:134]       Sampling time over 127 tokens:  0.173000 (seconds)

QP8

I 00:00:46.767429 executorch:stats.h:84]        Prompt Tokens: 8    Generated Tokens: 119
I 00:00:46.767437 executorch:stats.h:90]        Model Load Time:                28.297000 (seconds)
I 00:00:46.767475 executorch:stats.h:100]       Total inference time:           18.436000 (seconds)              Rate:  6.454762 (tokens/second)
I 00:00:46.767483 executorch:stats.h:108]               Prompt evaluation:      1.770000 (seconds)               Rate:  4.519774 (tokens/second)
I 00:00:46.767491 executorch:stats.h:119]               Generated 119 tokens:   16.666000 (seconds)              Rate:  7.140286 (tokens/second)
I 00:00:46.767522 executorch:stats.h:127]       Time to first generated token:  1.770000 (seconds)
I 00:00:46.767527 executorch:stats.h:134]       Sampling time over 127 tokens:  0.189000 (seconds)

We see a ~+68% performance improvement on prefill and a ~-13% regression on decode. See the dependent XNNPACK PR for more benchmarking details.
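(These percentages follow from the rates above: prompt evaluation goes from 2.692696 to 4.519774 tok/s, and 4.519774 / 2.692696 ≈ 1.68, i.e. ~+68%; generation goes from 8.197286 to 7.140286 tok/s, and 7.140286 / 8.197286 ≈ 0.87, i.e. ~-13%.)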

pytorch-bot (bot) commented Sep 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5162

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3abbc5e with merge base b60fa71:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Sep 7, 2024
@mcr229 mcr229 requested a review from digantdesai September 7, 2024 08:20
@mcr229 force-pushed the kleidi_integration branch 3 times, most recently from dd97512 to 89b1783 on September 7, 2024 23:37
mcr229 (Contributor, Author) commented Sep 10, 2024

Llama Benchmarks on One Plus 12 with 4 Threads

prompt_size=93

| Metric (One Plus 12) | Config | Run 1 | Run 2 | Run 3 | Average | (QP8-QD8)/QD8 |
|---|---|---|---|---|---|---|
| Model Load Time (seconds) | QD8 | 13.722 | 13.772 | 13.775 | 13.75633333 | 1.193486636 |
| | QP8 | 29.989 | 30.6 | 29.934 | 30.17433333 | |
| Total Inference Time (tok/s) | QD8 | 3.754002 | 3.773166 | 3.770655 | 3.765941 | 0.1071610876 |
| | QP8 | 4.157496 | 4.21627 | 4.134744 | 4.169503333 | |
| Prompt Evaluation (tok/s) | QD8 | 15.680324 | 15.773406 | 15.768057 | 15.74059567 | 0.1980304134 |
| | QP8 | 18.753781 | 19.226793 | 18.592563 | 18.85771233 | |
| Token Generation (tok/s) | QD8 | 10.87652 | 10.914928 | 10.90093 | 10.89745933 | -0.03175853405 |
| | QP8 | 10.562286 | 10.536102 | 10.555728 | 10.551372 | |

prompt_size=8

| Metric (One Plus 12) | Config | Run 1 | Run 2 | Run 3 | Average | (QP8-QD8)/QD8 |
|---|---|---|---|---|---|---|
| Model Load Time (seconds) | QD8 | 13.709 | 13.665 | 13.757 | 13.71033333 | 1.18749848 |
| | QP8 | 29.987 | 29.992 | 29.995 | 29.99133333 | |
| Total Inference Time (tok/s) | QD8 | 10.653536 | 10.663082 | 10.67935 | 10.66532267 | -0.03114314279 |
| | QP8 | 10.350526 | 10.334347 | 10.31464 | 10.333171 | |
| Prompt Evaluation (tok/s) | QD8 | 15.355086 | 15.355086 | 16.632017 | 15.78072967 | 0.1046826753 |
| | QP8 | 17.777778 | 17.316017 | 17.204301 | 17.43269867 | |
| Token Generation (tok/s) | QD8 | 11.174758 | 11.185262 | 11.161133 | 11.17371767 | -0.036838172 |
| | QP8 | 10.772155 | 10.766308 | 10.747832 | 10.76209833 | |

digantdesai (Contributor) commented Sep 10, 2024

so IIUC ~20% faster prefill for longer prompts with QP8? seems like a win :)

backends/xnnpack/CMakeLists.txt (outdated; resolved)
@@ -630,7 +630,11 @@ Error defineConvertNode(
subgraph_ptr,
remapped_ids.at(graph_node->input_id()),
remapped_ids.at(graph_node->output_id()),
#ifdef ENABLE_XNNPACK_KLEIDI
0x00000080);
Contributor

what is this magic?

Contributor Author

XNNPACK folk didn't make this available through xnnpack.h. They actually do the same with mediapipe LOL:

https://github.com/google-ai-edge/mediapipe/blob/cae031ac4ad34cf45a2f29f02615beb049de0e49/mediapipe/tasks/cc/genai/inference/utils/xnn_utils/graph_builder.cc#L384

@mcr229 force-pushed the kleidi_integration branch 3 times, most recently from 35a5ee0 to 4d96001 on September 14, 2024 00:28
@mcr229 force-pushed the kleidi_integration branch 7 times, most recently from afb5b66 to 9a06ca9 on September 30, 2024 18:14
@mcr229 mcr229 requested a review from digantdesai September 30, 2024 18:17
@@ -630,7 +630,14 @@ Error defineConvertNode(
subgraph_ptr,
remapped_ids.at(graph_node->input_id()),
remapped_ids.at(graph_node->output_id()),
#ifdef ENABLE_XNNPACK_KLEIDI
// This maps to XNNPACK's XNN_FLAG_MAYBE_PACK_FOR_QB4W_GEMM
Contributor

perhaps we should fix it in XNNPACK, independent of this PR

Contributor Author

I don't think they want this public yet, since it would cause backwards compatibility issues in the future. This is likely the path until they have a complete qp8 story.

Contributor

I assume we will add this back or fix this before landing this internally?

Contributor Author

I hope to not have this at all anymore

@facebook-github-bot

@mcr229 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mcr229 added a commit to mcr229/executorch that referenced this pull request Sep 30, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D63651987

@facebook-github-bot

@mcr229 merged this pull request in 8079eb7.
