
Qualcomm AI Engine Direct - Optimization and fix mutable buffer issue #5072

Merged

Conversation

shewu-quic
Collaborator

@shewu-quic shewu-quic commented Sep 4, 2024

Summary:

  • Add a pass to convert linear to conv2d: We found an accuracy drop caused by the QNN Linear op in llama3, and it is fixed by the convert-linear-to-conv2d pass (see the sketch below).
  • Work around the mutable buffer issue for the index_put op: We add a pass to replace the input of the index_put op. This workaround results in a performance regression.
  • Insert a copy op for int64 inputs to convert int64 to int32 in the i64toi32 pass.
  • Support QNN RMS Norm and use the native rms norm in llama_transformer.
  • Add a pass to compose rms norm.

Note that QNN supports RMS Norm starting from 2.25.
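
For reference, the core idea behind the convert-linear-to-conv2d pass is that a Linear layer is numerically equivalent to a 1x1 Conv2d applied to an input reshaped to (N, C, 1, 1). The snippet below is a minimal, standalone sketch of that equivalence in plain PyTorch; it is illustrative only, not the actual ExecuTorch/QNN pass, and the helper name is made up.

# Minimal sketch: Linear(in_features, out_features) behaves like
# Conv2d(in_features, out_features, kernel_size=1) with the same weights.
import torch
import torch.nn as nn

def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    conv = nn.Conv2d(linear.in_features, linear.out_features,
                     kernel_size=1, bias=linear.bias is not None)
    # Linear weight has shape (out, in); Conv2d weight has shape (out, in, 1, 1).
    conv.weight.data.copy_(linear.weight.data.view(*linear.weight.shape, 1, 1))
    if linear.bias is not None:
        conv.bias.data.copy_(linear.bias.data)
    return conv

x = torch.randn(2, 8)                         # (batch, in_features)
lin = nn.Linear(8, 4)
conv = linear_to_conv2d(lin)
y_conv = conv(x.view(2, 8, 1, 1)).view(2, 4)  # reshape to NCHW and flatten back
assert torch.allclose(lin(x), y_conv, atol=1e-5)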


pytorch-bot bot commented Sep 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5072

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9b98827 with merge base 99fbca3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 4, 2024
@shewu-quic
Collaborator Author

shewu-quic commented Sep 4, 2024

Hi @cccclai,
This PR optimizes accuracy for llama3 and works around the mutable buffer issue.
We found a memory issue with tokenizer loading at runtime on the mainline branch.
Because the runner first tries to load the tiktoken tokenizer file with BPETokenizer, it seems to result in an OOM on a 16GB device.

Please have a look, thank you.

Result

This PR "with spin quant R1+R2" and calibration one sentence with 128 seq_len

# Prompt:
<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
# Result:
Hello! Yes, of course! Facebook is a very popular social media platform that was created in 2004 by Mark Zuckerberg and his colleagues while they were still students at Harvard University. Initially, it was called "Facemaker" and was intended to be a tool to help people create and share content on other websites. However, it quickly grew in popularity and became a social media platform in itself, allowing users to create profiles, share updates, photos, and connect with friends and

This PR, calibrated on one sentence with 128 seq_len

# Prompt:
<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

# Result:
Hey there! Ah, Facebook! That's like, the OG social media platform, dude! Founded way back in 2004 by Mark Zuckerberg and his college buddies, Facebook has been the go-to spot for people to share their lives, connect with friends and family, and even find love (or so they say). With over two billion monthly active users, Facebook is like, the ultimate digital watercooler! You know, where people share their thoughts, photos, videos,

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from 59f4b20 to a7e6164 on September 4, 2024 15:30
@cccclai
Contributor

cccclai commented Sep 4, 2024

This looks awesome! It seems like it generates results with good quality; do I understand it correctly?

Regarding this

We found a memory issue with tokenizer loading at runtime on the mainline branch.

How do I repro it, and how did you still manage to generate results despite the OOM issue?

@cccclai
Contributor

cccclai commented Sep 4, 2024

Looks like you fall back aten.embedding.default to CPU; is that how you resolve the OOM issue?

@shewu-quic
Collaborator Author

Looks like you fall back aten.embedding.default to CPU; is that how you resolve the OOM issue?

Yes, I currently fall back to CPU for it, and it works on a 16GB device.
I think this change also helps memory usage, because it seems to need a larger spill buffer than our calculation predicts.
We will try to fix our calculation in a follow-up PR.

But I think we could also try quantizing the embedding op to reduce memory usage.

@shewu-quic
Collaborator Author

This looks awesome! It seems like it generates results with good quality; do I understand it correctly?

Yes, I think the good news is that it generates good results.

Regarding this

We found a memory issue with tokenizer loading at runtime on the mainline branch.

How do I repro it, and how did you still manage to generate results despite the OOM issue?

I think you could reproduce it with this PR.
For now, I just comment out the line that loads BPETokenizer to work around this issue.
I also tried moving the tokenizer loading to before the model loading, but it didn't work on my 16GB device. It seems to fail to load tokenizer.model (the llama3 tokenizer); the process gets killed.

@cccclai
Contributor

cccclai commented Sep 4, 2024

For now, I just comment out the line that loads BPETokenizer to work around this issue.

Hmm, I'm a bit confused. Isn't BPETokenizer necessary for prompt decoding and encoding? How do you generate results without a tokenizer?

@shewu-quic
Collaborator Author

For now, I just comment out the line that loads BPETokenizer to work around this issue.

Hmm, I'm a bit confused. Isn't BPETokenizer necessary for prompt decoding and encoding? How do you generate results without a tokenizer?

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

@cccclai
Contributor

cccclai commented Sep 5, 2024

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

Hmm, if we use llama3, how do we end up using the bpe tokenizer?

Contributor

@cccclai cccclai left a comment


Looks great! Mind holding off a bit until #4942 is merged? Still working with the team to land it.

@cccclai
Contributor

cccclai commented Sep 5, 2024

#4942 is just approved. Will merge it asap

@shewu-quic
Collaborator Author

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

@shewu-quic
Collaborator Author

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

Hmm, if we use llama3, how do we end up using the bpe tokenizer?

Ohh, could we use the bpe tokenizer for llama3?
BTW, I tried to run XNNPACK on my device and I encountered the same OOM issue in the runner due to bpe tokenizer loading at runtime.

@cccclai
Contributor

cccclai commented Sep 6, 2024

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

That is unrelated. It's resolved after rebase.

@cccclai
Contributor

cccclai commented Sep 6, 2024

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

Hmm, if we use llama3, how do we end up using the bpe tokenizer?

Ohh, could we use the bpe tokenizer for llama3? BTW, I tried to run XNNPACK on my device and I encountered the same OOM issue in the runner due to bpe tokenizer loading at runtime.

Hmm, I think we will need to use tiktokenizer for llama3. The bpe tokenizer is only for llama2.

Hmm I think we should just use tiktokenizer for llama3...are you using llama3 tokenizer?

@cccclai
Contributor

cccclai commented Sep 6, 2024

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

Hmm, if we use llama3, how do we end up using the bpe tokenizer?

Ohh, could we use the bpe tokenizer for llama3? BTW, I tried to run XNNPACK on my device and I encountered the same OOM issue in the runner due to bpe tokenizer loading at runtime.

Hmm I think we need to use tik tokenizer for llama3. I thought it just falls back to bpe tokenizer if it fails. Are you using llama3 tokenizer?

@shewu-quic
Collaborator Author

shewu-quic commented Sep 6, 2024

Hmm I think we need to use tik tokenizer for llama3. I thought it just falls back to bpe tokenizer if it fails. Are you using llama3 tokenizer?

Yes, I am using the llama3 tokenizer.

Yes, I agree with your point. It should fail to load with the bpe tokenizer and then switch to loading the tik tokenizer.
However, loading with bpe raises an OOM, so it never continues on to load the tik tokenizer.
Therefore, I commented out those lines and it works normally.

@shewu-quic
Collaborator Author

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

That is unrelated. It's resolved after rebase.

Got it. I thought it was my problem, so I added a transform to replace rms norm instead of changing llama_transformer.py.

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch 2 times, most recently from d58fb95 to 8ab0e79 on September 6, 2024 05:14
@cccclai
Contributor

cccclai commented Sep 6, 2024

Yes, I agree with your point. It should fail to load with the bpe tokenizer and then switch to loading the tik tokenizer. However, loading with bpe raises an OOM, so it never continues on to load the tik tokenizer. Therefore, I commented out those lines and it works normally.

Hmm, I think the default is tiktokenizer, and if it fails, it will be reset and the bpe tokenizer will be loaded.
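
(For clarity, the load order being described is roughly the following. This is an illustrative Python sketch of what the C++ runner does; the loader callables are hypothetical stand-ins, not actual ExecuTorch APIs.)

# Illustrative sketch of the tokenizer load order: try the llama3 tiktoken
# format first and fall back to the llama2 BPE format only if that fails.
def load_tokenizer(path, load_tiktoken, load_bpe):
    try:
        return load_tiktoken(path)   # hypothetical tiktoken loader
    except Exception:
        return load_bpe(path)        # hypothetical BPE loader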

@shewu-quic
Collaborator Author

Yes, I agree with your point. It should fail to load with the bpe tokenizer and then switch to loading the tik tokenizer. However, loading with bpe raises an OOM, so it never continues on to load the tik tokenizer. Therefore, I commented out those lines and it works normally.

Hmm, I think the default is tiktokenizer, and if it fails, it will be reset and the bpe tokenizer will be loaded.

Whoops, it changed yesterday... :)
Great, sorry to bother you.

@cccclai
Contributor

cccclai commented Sep 6, 2024

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

That is unrelated. It's resolved after rebase.

Got it. I thought it was my problem, so I added a transform to replace rms norm instead of changing llama_transformer.py.

Actually, for this one, do you remember which job was failing? Is it the [test-llama-runner-qnn job](https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-llama-runner-qnn)? If so, it's the qnn llama end-to-end job...

@shewu-quic
Collaborator Author

shewu-quic commented Sep 6, 2024

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

That is unrelated. It's resolved after rebase.

Got it. I thought it was my problem, so I added a transform to replace rms norm instead of changing llama_transformer.py.

Actually, for this one, do you remember which job was failing? Is it the [test-llama-runner-qnn job](https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-llama-runner-qnn)? If so, it's the qnn llama end-to-end job...

Oh, this test needs QNN 2.25 or above due to RMS Norm support.
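
(Background for the RMS Norm dependency: the compose-rms-norm pass pattern-matches the rms norm computation in the graph, presumably so it can be lowered to the QNN RMSNorm op available from 2.25. For reference, a minimal PyTorch sketch of the standard formulation, not the QNN kernel:)

# RMS norm: scale the input by the reciprocal root-mean-square over the last
# dimension, then apply a learned elementwise weight.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

x = torch.randn(1, 4, 16)
y = rms_norm(x, torch.ones(16))   # output shape (1, 4, 16)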

@cccclai
Contributor

cccclai commented Sep 6, 2024

Oh, this test needs QNN 2.25 or above due to RMS Norm support.

I see, can we update these lines then? https://github.com/pytorch/executorch/blob/main/.ci/scripts/setup-qnn-deps.sh#L15-L18

@shewu-quic
Collaborator Author

Oh, this test needs QNN 2.25 or above due to RMS Norm support.

I see, can we update these lines then? https://github.com/pytorch/executorch/blob/main/.ci/scripts/setup-qnn-deps.sh#L15-L18

Hi,
I found that there is no libc++.so in ${QNN_SDK_ROOT}/lib/x86_64-linux-clang.
We need to install it on the system manually.

@cccclai
Contributor

cccclai commented Sep 6, 2024

@shewu-quic

I see, let me get some help from internal teams on this. In the worst case, we disable the CI.

@cccclai
Contributor

cccclai commented Sep 6, 2024

The backup plan is to delete the llama qnn CI job temporarily and resume it later. Essentially:

  1. remove the qnn version bump
  2. delete these lines:

       test-llama-runner-qnn-linux:
         name: test-llama-runner-qnn-linux
         uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
         strategy:
           matrix:
             dtype: [fp32]
             build-tool: [cmake]
             mode: [qnn]
           fail-fast: false
         with:
           runner: linux.2xlarge
           docker-image: executorch-ubuntu-22.04-clang12-android
           submodules: 'true'
           ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
           timeout: 900
           script: |
             # The generic Linux job chooses to use base env, not the one setup by the image
             CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
             conda activate "${CONDA_ENV}"
             DTYPE=${{ matrix.dtype }}
             BUILD_TOOL=${{ matrix.build-tool }}
             MODE=${{ matrix.mode }}
             PYTHON_EXECUTABLE=python bash .ci/scripts/setup-qnn-deps.sh
             PYTHON_EXECUTABLE=python bash .ci/scripts/build-qnn-sdk.sh
             # Setup executorch
             PYTHON_EXECUTABLE=python bash .ci/scripts/setup-linux.sh buck2
             # Install requirements for export_llama
             PYTHON_EXECUTABLE=python bash examples/models/llama2/install_requirements.sh
             # Test llama2
             PYTHON_EXECUTABLE=python bash .ci/scripts/test_llama.sh stories110M "${BUILD_TOOL}" "${DTYPE}" "${MODE}"

  3. merge this PR
  4. recover the test after we figure out how to do it properly

@cccclai
Contributor

cccclai commented Sep 6, 2024

#4942 is merged, could you rebase, and remove the llama test?

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from 3fbfc6c to 0e26422 on September 7, 2024 02:27
@shewu-quic
Collaborator Author

#4942 is merged, could you rebase, and remove the llama test?

Thanks, I have rebased and removed the test.
We could then file another PR to recover it; I think we could use apt-get to install libc++ or copy it from an older QNN library :)

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from 0e26422 to 10d5344 on September 7, 2024 03:09
@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai
Contributor

cccclai commented Sep 7, 2024

Hi, can you add this change? Without it the PR breaks some internal tests (we use buck internally):

--- a/fbcode/executorch/examples/models/llama2/TARGETS
+++ b/fbcode/executorch/examples/models/llama2/TARGETS
@@ -71,6 +71,7 @@
         "export_llama_lib.py",
         "model.py",
         "source_transformation/quantize.py",
+        "source_transformation/rms_norm.py",
         "source_transformation/rope.py",
         "source_transformation/sdpa.py",
     ],

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from 10d5344 to be55d6e on September 8, 2024 02:14
@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai
Contributor

cccclai commented Sep 8, 2024

Looks like the patch is not part of the new commit. Did I miss anything?

edit: Actually saw them. Thanks!

@shewu-quic
Collaborator Author

Looks like the patch is not part of the new commit. Did I miss anything?

edit: Actually saw them. Thanks!

Do I need to rebase again?

@cccclai
Contributor

cccclai commented Sep 9, 2024

Hey, sorry, could you rebase again? I'm having issues merging.

@cccclai
Contributor

cccclai commented Sep 9, 2024

Hey, could you rebase? I'm having issues landing...

shewu-quic and others added 3 commits September 9, 2024 14:51
Summary:
- Add a pass to convert linear to conv2d:
  We found an accuracy drop caused by the QNN Linear op in llama3, and it is fixed by the convert-linear-to-conv2d pass.
- Work around the mutable buffer issue for the index_put op:
  We add a pass to replace the input of the index_put op.
  This workaround results in a performance regression.
- Insert a copy op for int64 inputs to convert int64 to int32 in the i64toi32 pass.
- Support QNN RMS Norm and use the native rms norm in llama_transformer.
- Add a pass to compose rms norm.
@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from be55d6e to 9b98827 on September 9, 2024 06:52
@shewu-quic
Collaborator Author

@cccclai
done.

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kirklandsign kirklandsign merged commit 85410e4 into pytorch:main Sep 9, 2024
36 checks passed