
Qualcomm AI Engine Direct - Optimization and fix mutable buffer issue #5072

Merged

Conversation

shewu-quic
Collaborator

@shewu-quic shewu-quic commented Sep 4, 2024

Summary:

  • Add a pass to convert linear to conv2d: We found an accuracy drop caused by the QNN Linear op in llama3, and it is fixed by the convert-linear-to-conv2d pass (see the sketch below).
  • Work around the mutable buffer issue for the index_put op: We add a pass to replace the input of the index_put op. This workaround results in a performance regression.
  • Insert a copy op for int64 inputs to convert int64 to int32 in the i64toi32 pass.
  • Support QNN RMS Norm and use the native rms norm in llama_transformer.
  • Add a pass to compose rms norm.

Note that QNN supports RMS Norm starting from 2.25.
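
For reference, the core idea behind the convert-linear-to-conv2d pass is that a Linear layer is numerically equivalent to a 1x1 Conv2d applied to an input reshaped to (N, C, 1, 1). The snippet below is a minimal, standalone sketch of that equivalence in plain PyTorch; it is illustrative only, not the actual ExecuTorch/QNN pass, and the helper name is made up.

# Minimal sketch: Linear(in_features, out_features) behaves like
# Conv2d(in_features, out_features, kernel_size=1) with the same weights.
import torch
import torch.nn as nn

def linear_to_conv2d(linear: nn.Linear) -> nn.Conv2d:
    conv = nn.Conv2d(linear.in_features, linear.out_features,
                     kernel_size=1, bias=linear.bias is not None)
    # Linear weight has shape (out, in); Conv2d weight has shape (out, in, 1, 1).
    conv.weight.data.copy_(linear.weight.data.view(*linear.weight.shape, 1, 1))
    if linear.bias is not None:
        conv.bias.data.copy_(linear.bias.data)
    return conv

x = torch.randn(2, 8)                         # (batch, in_features)
lin = nn.Linear(8, 4)
conv = linear_to_conv2d(lin)
y_conv = conv(x.view(2, 8, 1, 1)).view(2, 4)  # reshape to NCHW and flatten back
assert torch.allclose(lin(x), y_conv, atol=1e-5)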


pytorch-bot bot commented Sep 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5072

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9b98827 with merge base 99fbca3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 4, 2024
@shewu-quic
Collaborator Author

shewu-quic commented Sep 4, 2024

Hi @cccclai,
This PR optimizes accuracy for llama3 and works around the mutable buffer issue.
We found a memory issue with tokenizer loading at runtime on the mainline branch.
Because the runner first tries to load the tiktoken tokenizer file with BPETokenizer, it seems to result in an OOM on a 16GB device.

Please have a look, thank you.

Result

This PR "with spin quant R1+R2" and calibration one sentence with 128 seq_len

# Prompt:
<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
# Result:
Hello! Yes, of course! Facebook is a very popular social media platform that was created in 2004 by Mark Zuckerberg and his colleagues while they were still students at Harvard University. Initially, it was called "Facemaker" and was intended to be a tool to help people create and share content on other websites. However, it quickly grew in popularity and became a social media platform in itself, allowing users to create profiles, share updates, photos, and connect with friends and

This PR, calibrated on one sentence with 128 seq_len

# Prompt:
<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n

# Result:
Hey there! Ah, Facebook! That's like, the OG social media platform, dude! Founded way back in 2004 by Mark Zuckerberg and his college buddies, Facebook has been the go-to spot for people to share their lives, connect with friends and family, and even find love (or so they say). With over two billion monthly active users, Facebook is like, the ultimate digital watercooler! You know, where people share their thoughts, photos, videos,

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from 59f4b20 to a7e6164 on September 4, 2024 15:30
@cccclai
Contributor

cccclai commented Sep 4, 2024

This looks awesome! It seems like it generates results with good quality; do I understand it correctly?

Regarding this

We found a memory issue with tokenizer loading at runtime on the mainline branch.

How do I repro it, and how did you still manage to generate results despite the OOM issue?

@cccclai
Contributor

cccclai commented Sep 4, 2024

Looks like you fall back aten.embedding.default to CPU; is that how you resolve the OOM issue?

@shewu-quic
Collaborator Author

Looks like you fall back aten.embedding.default to CPU; is that how you resolve the OOM issue?

Yes, I currently fall back to CPU for it, and it works on a 16GB device.
I think this change also helps memory usage, because it seems to need a larger spill buffer than our calculation predicts.
We will try to fix our calculation in a follow-up PR.

But I think we could also try quantizing the embedding op to reduce memory usage.

@shewu-quic
Collaborator Author

This looks awesome! It seems like it generates results with good quality; do I understand it correctly?

Yes, I think the good news is that it generates good results.

Regarding this

We found a memory issue with tokenizer loading at runtime on the mainline branch.

How do I repro it, and how did you still manage to generate results despite the OOM issue?

I think you could reproduce it with this PR.
For now, I just comment out the line that loads BPETokenizer to work around this issue.
I also tried moving the tokenizer loading to before the model loading, but it didn't work on my 16GB device. It seems to fail to load tokenizer.model (the llama3 tokenizer); the process gets killed.

@cccclai
Contributor

cccclai commented Sep 4, 2024

For now, I just comment out the line that loads BPETokenizer to work around this issue.

Hmm, I'm a bit confused. Isn't BPETokenizer necessary for prompt decoding and encoding? How do you generate results without a tokenizer?

@shewu-quic
Collaborator Author

For now, I just comment out the line that loads BPETokenizer to work around this issue.

Hmm, I'm a bit confused. Isn't BPETokenizer necessary for prompt decoding and encoding? How do you generate results without a tokenizer?

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

@cccclai
Contributor

cccclai commented Sep 5, 2024

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

Hmm, if we use llama3, how do we end up using the bpe tokenizer?

Contributor

@cccclai cccclai left a comment


Looks great! Mind holding off a bit until #4942 is merged? Still working with the team to land it.

@cccclai
Contributor

cccclai commented Sep 5, 2024

#4942 is just approved. Will merge it asap

@shewu-quic
Collaborator Author

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

@shewu-quic
Collaborator Author

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

Hmm, if we use llama3, how do we end up using the bpe tokenizer?

Ohh, could we use the bpe tokenizer for llama3?
BTW, I tried to run XNNPACK on my device and I encountered the same OOM issue in the runner due to bpe tokenizer loading at runtime.

@cccclai
Contributor

cccclai commented Sep 6, 2024

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

That is unrelated. It's resolved after rebase.

@cccclai
Contributor

cccclai commented Sep 6, 2024

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

Hmm, if we use llama3, how do we end up using the bpe tokenizer?

Ohh, could we use the bpe tokenizer for llama3? BTW, I tried to run XNNPACK on my device and I encountered the same OOM issue in the runner due to bpe tokenizer loading at runtime.

Hmm, I think we will need to use tiktokenizer for llama3. The bpe tokenizer is only for llama2.

Hmm I think we should just use tiktokenizer for llama3...are you using llama3 tokenizer?

@cccclai
Contributor

cccclai commented Sep 6, 2024

Because for llama3 we should use tik_tokenizer for encoding and decoding, right?

Hmm, if we use llama3, how do we end up using the bpe tokenizer?

Ohh, could we use the bpe tokenizer for llama3? BTW, I tried to run XNNPACK on my device and I encountered the same OOM issue in the runner due to bpe tokenizer loading at runtime.

Hmm I think we need to use tik tokenizer for llama3. I thought it just falls back to bpe tokenizer if it fails. Are you using llama3 tokenizer?

@shewu-quic
Collaborator Author

shewu-quic commented Sep 6, 2024

Hmm I think we need to use tik tokenizer for llama3. I thought it just falls back to bpe tokenizer if it fails. Are you using llama3 tokenizer?

Yes, I am using the llama3 tokenizer.

Yes, I agree with your point. It should fail to load with the bpe tokenizer and then switch to loading the tik tokenizer.
However, loading with bpe raises an OOM, so it never continues on to load the tik tokenizer.
Therefore, I commented out those lines and it works normally.

@shewu-quic
Collaborator Author

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

That is unrelated. It's resolved after rebase.

Got it. I thought it was my problem, so I added a transform to replace rms norm instead of changing llama_transformer.py.

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch 2 times, most recently from d58fb95 to 8ab0e79 on September 6, 2024 05:14
@cccclai
Contributor

cccclai commented Sep 6, 2024

Yes, I agree with your point. It should fail to load with the bpe tokenizer and then switch to loading the tik tokenizer. However, loading with bpe raises an OOM, so it never continues on to load the tik tokenizer. Therefore, I commented out those lines and it works normally.

Hmm, I think the default is tiktokenizer, and if it fails, it will be reset and the bpe tokenizer will be loaded.
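
(For clarity, the load order being described is roughly the following. This is an illustrative Python sketch of what the C++ runner does; the loader callables are hypothetical stand-ins, not actual ExecuTorch APIs.)

# Illustrative sketch of the tokenizer load order: try the llama3 tiktoken
# format first and fall back to the llama2 BPE format only if that fails.
def load_tokenizer(path, load_tiktoken, load_bpe):
    try:
        return load_tiktoken(path)   # hypothetical tiktoken loader
    except Exception:
        return load_bpe(path)        # hypothetical BPE loader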

@shewu-quic
Collaborator Author

Yes, I agree with your point. It should fail to load with the bpe tokenizer and then switch to loading the tik tokenizer. However, loading with bpe raises an OOM, so it never continues on to load the tik tokenizer. Therefore, I commented out those lines and it works normally.

Hmm, I think the default is tiktokenizer, and if it fails, it will be reset and the bpe tokenizer will be loaded.

Whoops, it changed yesterday... :)
Great, sorry to bother you.

@cccclai
Contributor

cccclai commented Sep 6, 2024

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

That is unrelated. It's resolved after rebase.

Got it. I thought it was my problem, so I added a transform to replace rms norm instead of changing llama_transformer.py.

Actually, for this one, do you remember which job was failing? Is it the [test-llama-runner-qnn job](https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-llama-runner-qnn)? If so, it's the qnn llama end-to-end job...

@shewu-quic
Collaborator Author

shewu-quic commented Sep 6, 2024

#4942 is just approved. Will merge it asap

Thanks for your effort. There are some failures for the llama runner test in the PR checks. Is this expected?

That is unrelated. It's resolved after rebase.

Got it. I thought it was my problem, so I added a transform to replace rms norm instead of changing llama_transformer.py.

Actually, for this one, do you remember which job was failing? Is it the [test-llama-runner-qnn job](https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=test-llama-runner-qnn)? If so, it's the qnn llama end-to-end job...

Oh, this test needs QNN 2.25 or above due to RMS Norm support.
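
(Background for the RMS Norm dependency: the compose-rms-norm pass pattern-matches the rms norm computation in the graph, presumably so it can be lowered to the QNN RMSNorm op available from 2.25. For reference, a minimal PyTorch sketch of the standard formulation, not the QNN kernel:)

# RMS norm: scale the input by the reciprocal root-mean-square over the last
# dimension, then apply a learned elementwise weight.
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight

x = torch.randn(1, 4, 16)
y = rms_norm(x, torch.ones(16))   # output shape (1, 4, 16)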

@cccclai
Contributor

cccclai commented Sep 6, 2024

Oh, this test needs QNN 2.25 or above due to RMS Norm support.

I see, can we update these lines then? https://github.com/pytorch/executorch/blob/main/.ci/scripts/setup-qnn-deps.sh#L15-L18

@shewu-quic
Collaborator Author

Oh, this test needs QNN 2.25 or above due to RMS Norm support.

I see, can we update these lines then? https://github.com/pytorch/executorch/blob/main/.ci/scripts/setup-qnn-deps.sh#L15-L18

Hi,
I found that there is no libc++.so in ${QNN_SDK_ROOT}/lib/x86_64-linux-clang.
We need to install it on the system manually.

@cccclai
Contributor

cccclai commented Sep 6, 2024

@shewu-quic

I see, let me get some help from internal teams on this. In the worst case, we disable the CI.

@cccclai
Contributor

cccclai commented Sep 6, 2024

The backup plan is to delete the llama qnn CI job temporarily and resume it later. Essentially:

  1. remove the qnn version bump
  2. delete these lines:

       test-llama-runner-qnn-linux:
         name: test-llama-runner-qnn-linux
         uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
         strategy:
           matrix:
             dtype: [fp32]
             build-tool: [cmake]
             mode: [qnn]
           fail-fast: false
         with:
           runner: linux.2xlarge
           docker-image: executorch-ubuntu-22.04-clang12-android
           submodules: 'true'
           ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
           timeout: 900
           script: |
             # The generic Linux job chooses to use base env, not the one setup by the image
             CONDA_ENV=$(conda env list --json | jq -r ".envs | .[-1]")
             conda activate "${CONDA_ENV}"
             DTYPE=${{ matrix.dtype }}
             BUILD_TOOL=${{ matrix.build-tool }}
             MODE=${{ matrix.mode }}
             PYTHON_EXECUTABLE=python bash .ci/scripts/setup-qnn-deps.sh
             PYTHON_EXECUTABLE=python bash .ci/scripts/build-qnn-sdk.sh
             # Setup executorch
             PYTHON_EXECUTABLE=python bash .ci/scripts/setup-linux.sh buck2
             # Install requirements for export_llama
             PYTHON_EXECUTABLE=python bash examples/models/llama2/install_requirements.sh
             # Test llama2
             PYTHON_EXECUTABLE=python bash .ci/scripts/test_llama.sh stories110M "${BUILD_TOOL}" "${DTYPE}" "${MODE}"

  3. merge this PR
  4. recover the test after we figure out how to do it properly

@cccclai
Contributor

cccclai commented Sep 6, 2024

#4942 is merged, could you rebase, and remove the llama test?

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from 3fbfc6c to 0e26422 on September 7, 2024 02:27
@shewu-quic
Collaborator Author

#4942 is merged, could you rebase, and remove the llama test?

Thanks, I have rebased and removed the test.
We could then file another PR to recover it; I think we could use apt-get to install libc++ or copy it from an older QNN library :)

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from 0e26422 to 10d5344 on September 7, 2024 03:09
@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai
Contributor

cccclai commented Sep 7, 2024

Hi, can you add this change? Without it the PR breaks some internal tests (we use buck internally):

--- a/fbcode/executorch/examples/models/llama2/TARGETS
+++ b/fbcode/executorch/examples/models/llama2/TARGETS
@@ -71,6 +71,7 @@
         "export_llama_lib.py",
         "model.py",
         "source_transformation/quantize.py",
+        "source_transformation/rms_norm.py",
         "source_transformation/rope.py",
         "source_transformation/sdpa.py",
     ],

@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from 10d5344 to be55d6e on September 8, 2024 02:14
@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai
Contributor

cccclai commented Sep 8, 2024

Looks like the patch is not part of the new commit. Did I miss anything?

edit: Actually saw them. Thanks!

@shewu-quic
Collaborator Author

Looks like the patch is not part of the new commit. Did I miss anything?

edit: Actually saw them. Thanks!

Do I need to rebase again?

@cccclai
Contributor

cccclai commented Sep 9, 2024

Hey, sorry, could you rebase again? I'm having issues merging.

@cccclai
Contributor

cccclai commented Sep 9, 2024

Hey, could you rebase? I'm having issues landing...

shewu-quic and others added 3 commits September 9, 2024 14:51
Summary:
- Add a pass to convert linear to conv2d:
  We found an accuracy drop caused by the QNN Linear op in llama3, and it is fixed by the convert-linear-to-conv2d pass.
- Work around the mutable buffer issue for the index_put op:
  We add a pass to replace the input of the index_put op.
  This workaround results in a performance regression.
- Insert a copy op for int64 inputs to convert int64 to int32 in the i64toi32 pass.
- Support QNN RMS Norm and use the native rms norm in llama_transformer.
- Add a pass to compose rms norm.
@shewu-quic shewu-quic force-pushed the dev1/hutton/convert_linear_to_conv2d branch from be55d6e to 9b98827 on September 9, 2024 06:52
@shewu-quic
Collaborator Author

@cccclai
done.

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kirklandsign kirklandsign merged commit 85410e4 into pytorch:main Sep 9, 2024
36 checks passed