comparison with qualcomm ai hub model #7411

Open
DongGeun123 opened this issue Dec 20, 2024 · 3 comments
Labels: module: qnn (Related to Qualcomm's QNN delegate) · partner: qualcomm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm)

Comments

@DongGeun123 commented Dec 20, 2024

🐛 Describe the bug

I ran Llama-v3.2-3B-Chat (W4A16 precision) from ai-hub-models on a Snapdragon 8 Gen 3 device, achieving 20 tokens/s.
For comparison, I ran inference on the Llama 3.2 3B model quantized to W4A16 using ExecuTorch with the QNN backend on the same device, and observed 10 tokens/s.
Could you provide insight into what might be causing this performance difference? Are there issues with how ExecuTorch handles quantized models that could explain this gap?
Any guidance or suggestions would be greatly appreciated!
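For context, the ExecuTorch side of this comparison goes through the standard Llama export flow with the QNN delegate. A minimal sketch of the kind of command involved is below; the paths are placeholders and the exact flag names may differ between ExecuTorch versions, so treat it as illustrative rather than the reporter's exact invocation:

```bash
# Sketch: export Llama to a QNN-delegated .pte with 16-bit
# activations / 4-bit weights (W4A16). Paths are placeholders and
# flag names may vary across ExecuTorch versions.
python -m examples.models.llama2.export_llama \
    --checkpoint /path/to/consolidated.00.pth \
    --params /path/to/params.json \
    --use_kv_cache \
    --disable_dynamic_shape \
    --qnn \
    --pt2e_quantize qnn_16a4w
```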

@AndreaChiChengdu

Good question. I found the same issue with Llama 3.1 8B via the ExecuTorch QNN backend; its performance is less than half of what Qualcomm's QAI Hub claims.
My device is a Xiaomi 14 Pro with an SM8650 (V75 NPU) and 16 GB of RAM.

@kimishpatel added the module: qnn label on Jan 3, 2025
@kimishpatel (Contributor) commented

cc: @cccclai

@cccclai added the partner: qualcomm label on Jan 3, 2025
@cccclai (Contributor) commented Jan 3, 2025

Yeah, we found that the model definition in llama_transformer.py isn't ideal for running Llama models on the QNN backend. We've started a new model definition in https://github.com/pytorch/executorch/tree/e66cdaf514e15242692073db1271aae4657f2033/examples/qualcomm/oss_scripts/llama3_2, which has better latency numbers.

It's still a work in progress, so please expect some rough edges if you try it out, or wait until it's more settled.
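Aside from the model definition, throughput comparisons against AI Hub's published numbers are only meaningful if both sides measure decode the same way (steady-state decode, excluding prefill and warm-up). A minimal sketch of such a measurement, where `generate_one_token` is a hypothetical stand-in for one decode step of whatever runner is used:

```python
import time

def decode_tokens_per_sec(generate_one_token, n_tokens=128):
    """Measure steady-state decode throughput in tokens/s.

    generate_one_token: hypothetical callable that emits one token
    per call against an already-prefilled KV cache.
    """
    generate_one_token()  # warm-up step, excluded from timing
    start = time.perf_counter()
    for _ in range(n_tokens):
        generate_one_token()
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```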
