[Question] About CPU performance #1666

Open
mingfeima opened this issue Sep 3, 2024 · 0 comments
Labels: enhancement (New feature or request)

Comments

@mingfeima

Hi, I am an engineer from Intel, and I work mostly on performance optimization of PyTorch on Intel Xeon CPUs (I am also the PyTorch module maintainer for CPU performance). I just came across this amazing project, and the chart in the blog post fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse says that DeepSparse accelerates the sparse-quantized Llama 2 models to 6-8x faster than the dense FP32 baseline.

[Figure: chart from the blog showing the sparse-quantized Llama 2 models running 6-8x faster than the dense FP32 baseline]

The 6-8x speedup of the sparse model over the dense model is a fascinating result. My goal is to check whether there is a chance to further improve performance with our previous work on LLM optimizations.

I ran the script from https://github.com/neuralmagic/deepsparse?tab=readme-ov-file#try-it-now. However, the hardware profiler shows that hardware efficiency is still not very high: only ~12 cores are in use on average on a 40-core machine, leading to significant synchronization overhead and a very high CPI (cycles per instruction). Maybe I can do something to improve this, but I am not very familiar with this codebase, so I need some guidance here (a sketch of my invocation follows the questions below):

  • How can I reproduce the results above?
  • How is the model deployed? With ONNX Runtime?
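
For reference, here is roughly how I invoked the model, following the README's "Try it now" example. This is a minimal sketch: the SparseZoo model stub is a placeholder, since I am not sure which stub corresponds to the Llama 2 checkpoints from the blog.

```python
# Minimal sketch following the DeepSparse README "Try it now" example.
from deepsparse import TextGeneration

# Placeholder: substitute the sparse-quantized Llama 2 stub from SparseZoo here.
MODEL_STUB = "zoo:..."

pipeline = TextGeneration(model=MODEL_STUB)

prompt = "Write a quicksort function in Python."
# Generate and print the completion; this is the loop I profiled.
print(pipeline(prompt, max_new_tokens=128).generations[0].text)
```

If there is a recommended knob for controlling the number of worker threads (a num_cores option, if that is still the right one), a pointer would also help me correlate the profiler data with the engine's threading behavior.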

Additionally, are you continuing this sparse fine-tuning work on other models, for example Llama 3? And what about int4 quantization?
