Qualcomm AI Engine Direct - Optimization and fix mutable buffer issue #5072
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5072
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 9b98827 with merge base 99fbca3.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai, please have a look, thank you.
Result
This PR "with spin quant R1+R2" and calibration with one sentence at 128 seq_len
This PR with calibration with one sentence at 128 seq_len
force-pushed from 59f4b20 to a7e6164 (Compare)
This looks awesome, it seems like it generates results with good quality, do I understand it correctly? Regarding this:
How do I repro, and how did you still manage to generate results despite the OOM issue?
Looks like you fall back to CPU here?
Yes, I currently fall back it to CPU, and it can work on a 16GB device. But I think we could try to quantize the embedding op to reduce memory usage.
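For reference, a minimal sketch of that idea in plain PyTorch (my own illustration, not the quantizer used in this PR): quantize each embedding row to int8 with a per-row scale and dequantize only the looked-up rows, which cuts the table's memory roughly 4x versus fp32.

```python
# Hedged sketch only; sizes and helper names here are hypothetical.
import torch

def quantize_embedding_weight(weight: torch.Tensor):
    # weight: [vocab_size, dim]; one scale per vocabulary row
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def embedding_lookup(q: torch.Tensor, scale: torch.Tensor, ids: torch.Tensor):
    # Dequantize only the rows that are actually looked up
    return q[ids].to(torch.float32) * scale[ids]

w = torch.randn(32000, 512)                 # hypothetical embedding table size
q, s = quantize_embedding_weight(w)
ids = torch.tensor([1, 5, 42])
print((embedding_lookup(q, s, ids) - w[ids]).abs().max())  # small quantization error
```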
Yes, I think the good news is that it generates good results.
I think you could reproduce it with this PR.
Hmm, I'm a bit confused. Isn't BPETokenizer necessary for prompt encoding and decoding? How do you generate results without the tokenizer?
Because for llama3 we should use the tiktoken tokenizer to do encoding and decoding, right?
Hmm, if we use llama3, how do we end up using the bpe tokenizer..?
Looks great! Mind holding a bit until #4942 is merged? Still working with the team to land it.
#4942 was just approved. Will merge it asap.
Thanks for your effort. There are some failures for the llama runner test in the PR check. Is this expected?
Ohh, could we use the bpe tokenizer for llama3?
That is unrelated. It's resolved after a rebase.
Hmm, I think we will need to use the tiktoken tokenizer for llama3; the bpe tokenizer is only for llama2. I thought it just falls back to the bpe tokenizer if it fails. Are you using the llama3 tokenizer?
Yes, I am using the llama3 tokenizer. And I agree with what you mentioned: it should fail to load with the bpe tokenizer and then switch to loading the tiktoken tokenizer.
Got it. I thought it was my problem, so I added a transform to replace rms norm instead of changing llama_transformer.py.
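As an illustration of that kind of source transform (a hedged sketch, not the exact pass in this PR; `CustomRMSNorm` is a stand-in for the model's own implementation): walk the module tree and swap each custom RMSNorm for `torch.nn.RMSNorm`, reusing the trained scale so numerics are unchanged.

```python
# Hedged sketch only; torch.nn.RMSNorm requires PyTorch >= 2.4.
import torch
import torch.nn as nn

class CustomRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

def replace_rms_norm(module: nn.Module) -> nn.Module:
    # Recursively swap every CustomRMSNorm child for the native op.
    for name, child in module.named_children():
        if isinstance(child, CustomRMSNorm):
            native = nn.RMSNorm(child.weight.numel(), eps=child.eps)
            native.weight = child.weight  # reuse the trained scale
            setattr(module, name, native)
        else:
            replace_rms_norm(child)
    return module

model = nn.Sequential(nn.Linear(16, 16), CustomRMSNorm(16))
replace_rms_norm(model)
print(model)  # the second module is now torch.nn.RMSNorm
```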
force-pushed from d58fb95 to 8ab0e79 (Compare)
Hmm, I think the default is the tiktoken tokenizer, and if it fails, it will be reset and the bpe tokenizer will be loaded.
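For clarity, a placeholder sketch of that try-tiktoken-then-fall-back-to-bpe order (the actual runner is C++, and as the next reply notes the order changed recently; the loader functions below are hypothetical stand-ins, not the real API):

```python
# Hedged, self-contained sketch of the fallback control flow only.
class TokenizerLoadError(Exception):
    pass

def load_tiktoken(path: str):
    # Pretend the artifact is not a tiktoken model so the fallback triggers.
    raise TokenizerLoadError(f"{path} is not a tiktoken model")

def load_bpe(path: str):
    return ("bpe", path)

def load_tokenizer(path: str):
    # Default to the llama3 tiktoken tokenizer; fall back to llama2 BPE on failure.
    try:
        return load_tiktoken(path)
    except TokenizerLoadError:
        return load_bpe(path)

print(load_tokenizer("tokenizer.model"))  # ('bpe', 'tokenizer.model')
```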
Whoops, it changed yesterday.... :)
Actually for this one, do you remember which job is failing? Is it the [test-llama-runner-qnn job](https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-llama-runner-qnn)? Then it's the qnn llama end-to-end job...
Oh, for this test, it needs QNN 2.25 or above due to RMS Norm support.
I see, can we update these lines then? https://github.com/pytorch/executorch/blob/main/.ci/scripts/setup-qnn-deps.sh#L15-L18
Hi,
I see, let me get some help from internal teams on this. In the worst case we disable the CI.
The backup plan is to delete the llama qnn CI job temporarily and resume it later, essentially:
#4942 is merged, could you rebase and remove the llama test?
force-pushed from 3fbfc6c to 0e26422 (Compare)
Thanks, I have rebased and removed the test.
force-pushed from 0e26422 to 10d5344 (Compare)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi, can you add this change? It breaks some internal tests (we use buck internally):
force-pushed from 10d5344 to be55d6e (Compare)
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Looks like the patch is not part of the new commit. Did I miss anything? Edit: Actually I saw them. Thanks!
Do I need to rebase again?
Hey, sorry, could you rebase again? I'm having issues merging.
Hey, could you rebase? I'm having issues landing....
Summary:
- Add a pass to convert linear to conv2d: we found an accuracy drop caused by the QNN Linear op in llama3, and it is fixed by converting linear to conv2d (a sketch of the equivalence follows below).
- Work around the mutable buffer issue for the index_put op: we add a pass to replace the input of the index_put op. Under this workaround, there is a performance regression.
- Insert a copy op for int64 inputs to convert int64 to int32 in the i64toi32 pass.
- Support QNN RMS Norm and use the native rms norm in llama_transformer.
- Add a pass to compose rms norm.
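For the linear-to-conv2d item, here is a minimal numerical sketch of the equivalence that kind of pass relies on (my own illustration, not the pass in this PR): a Linear over [batch, seq, dim] matches a 1x1 Conv2d over the same data laid out as NCHW.

```python
# Illustrative check of the linear == 1x1 conv2d equivalence (not the PR's pass).
import torch
import torch.nn as nn

batch, seq, din, dout = 2, 8, 64, 32
linear = nn.Linear(din, dout, bias=True)

# Copy the linear weights into a 1x1 conv: [dout, din] -> [dout, din, 1, 1]
conv = nn.Conv2d(din, dout, kernel_size=1, bias=True)
conv.weight.data = linear.weight.data.view(dout, din, 1, 1)
conv.bias.data = linear.bias.data

x = torch.randn(batch, seq, din)
y_linear = linear(x)                               # [batch, seq, dout]
y_conv = conv(x.permute(0, 2, 1).unsqueeze(-1))    # NCHW: [batch, din, seq, 1]
y_conv = y_conv.squeeze(-1).permute(0, 2, 1)       # back to [batch, seq, dout]

print(torch.allclose(y_linear, y_conv, atol=1e-5))  # True up to float error
```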
force-pushed from be55d6e to 9b98827 (Compare)
@cccclai
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: note that QNN supports RMS Norm only in version 2.25 and later.