🐛 Describe the bug
I ran Llama-v3.2-3B-Chat (W4A16 precision) from ai-hub-models on a Snapdragon 8 Gen 3 device and measured about 20 tokens/s.
For comparison, I ran the same Llama 3.2 3B model, quantized to W4A16, through ExecuTorch with the QNN backend on the same device and measured only about 10 tokens/s.
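For reference, a minimal sketch of how such a tokens/s figure can be computed; `generate_fn` is a hypothetical stand-in for whichever runner produces the output (it is not an API from ai-hub-models or ExecuTorch), and the only assumption is that throughput means generated tokens divided by wall-clock time:

```python
import time
from typing import Callable

def tokens_per_second(
    generate_fn: Callable[[str, int], int],
    prompt: str,
    max_new_tokens: int,
) -> float:
    """Decode throughput = number of generated tokens / wall-clock seconds.

    `generate_fn(prompt, max_new_tokens)` is a hypothetical placeholder
    for the actual runner (AI Hub demo app or ExecuTorch llama runner);
    it must return the number of tokens it actually generated.
    """
    start = time.perf_counter()
    generated = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return generated / elapsed
```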
Could you provide insight into what might be causing this 2x performance gap? Is there something in how ExecuTorch handles W4A16 quantized models on the QNN backend that could explain it?
Any guidance or suggestions would be greatly appreciated!