Model runs slower with ARM-NN than with XNNPACK on Cortex A53 #784

Open
answerdon opened this issue Aug 19, 2024 · 1 comment
Labels
TIME WAIT: Waiting for an appropriate period for a response before closing the issue.

Comments

@answerdon

I have experimented with multiple models using ARM-NN on a Cortex-A53 (mostly int8-quantized models with latency < 200 ms), and I found that XNNPACK generally gives better latency than ARM-NN. I am trying to understand what kinds of models perform better with ARM-NN.

For example, I compared the two delegates on the MobileNet v2 int8 model downloaded from the Arm ML-zoo: https://github.com/ARM-software/ML-zoo/tree/master/models/image_classification/mobilenet_v2_1.0_224/tflite_int8

./benchmark_model --graph=./mobilenet_v2_1.0_224_INT8.tflite --external_delegate_path=./libarmnnDelegate.so --external_delegate_options="backends:CpuAcc;disable-tflite-runtime-fallback:true;number-of-threads:1"
Log parameter values verbosely: [0]
Graph: [./mobilenet_v2_1.0_224_INT8.tflite]
External delegate path: [./libarmnnDelegate.so]
External delegate options: [backends:CpuAcc,CpuRef;disable-tflite-runtime-fallback:true;number-of-threads:1]
Loaded model ./mobilenet_v2_1.0_224_INT8.tflite
INFO: Initialized TensorFlow Lite runtime.
Couldn't find any of the following OpenCL library: libOpenCL.so libGLES_mali.so libmali.so 
INFO: TfLiteArmnnDelegate: Added backend CpuAcc
INFO: TfLiteArmnnDelegate: Created TfLite ArmNN delegate.
EXTERNAL delegate created.
VERBOSE: Replacing 66 node(s) with delegate (TfLiteArmNnDelegate) node, yielding 1 partitions for the whole graph.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 4.02094
Initialized session in 287.252ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=2 first=468655 curr=159104 min=159104 max=468655 avg=313880 std=154775

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=177554 curr=131598 min=131528 max=177554 avg=134539 std=6398

Inference timings in us: Init: 287252, First inference: 468655, Warmup (avg): 313880, Inference (avg): 134539
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=67.3633 overall=77.9492

./benchmark_model --graph=./mobilenet_v2_1.0_224_INT8.tflite --num_threads=1
INFO: Initialized TensorFlow Lite runtime.
INFO: Applying 1 TensorFlow Lite delegate(s) lazily.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
VERBOSE: Replacing 64 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 4 partitions for the whole graph.
INFO: Successfully applied the default TensorFlow Lite delegate indexed at 0.
Num threads: [1]
Graph: [./mobilenet_v2_1.0_224_INT8.tflite]
Enable op profiling: [0]
#threads used for CPU inference: [1]
Loaded model mobilenet_v2_1.0_224_INT8.tflite
The input model file size (MB): 4.02094
Initialized session in 108.149ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=158233 curr=138142 min=138142 max=158233 avg=148234 std=7143

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=120254 curr=119630 min=119404 max=123512 avg=119935 std=722

Inference timings in us: Init: 108149, First inference: 158233, Warmup (avg): 148234, Inference (avg): 119935
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=9.45312 overall=13.9961
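
To put the two runs side by side, the summary lines can be parsed and divided. The following is a minimal sketch in Python; the helper names `parse_timings` and `compare` are hypothetical and not part of benchmark_model or ARM-NN, and the two input strings are copied verbatim from the logs above.

```python
import re

# Summary lines copied verbatim from the two benchmark_model runs in this issue.
ARMNN_LINE = (
    "Inference timings in us: Init: 287252, First inference: 468655, "
    "Warmup (avg): 313880, Inference (avg): 134539"
)
XNNPACK_LINE = (
    "Inference timings in us: Init: 108149, First inference: 158233, "
    "Warmup (avg): 148234, Inference (avg): 119935"
)


def parse_timings(line):
    """Return {label: microseconds} from an 'Inference timings in us' line."""
    # Keep only the text after the fixed prefix, then pull out "label: number" pairs.
    _, _, fields = line.partition("Inference timings in us:")
    return {
        label.strip(): float(value)
        for label, value in re.findall(r"([^:,]+):\s*([0-9.]+)", fields)
    }


def compare(armnn_line, xnnpack_line):
    """Return {label: ArmNN time / XNNPACK time}; values above 1.0 mean ArmNN is slower."""
    armnn = parse_timings(armnn_line)
    xnnpack = parse_timings(xnnpack_line)
    return {label: armnn[label] / xnnpack[label] for label in armnn}


ratios = compare(ARMNN_LINE, XNNPACK_LINE)
for label, ratio in ratios.items():
    print(f"{label}: ArmNN / XNNPACK = {ratio:.2f}x")
```

On these numbers the steady-state inference gap is roughly 12% (134539 us vs 119935 us), while initialization and the first inference are well over 2x slower with the ARM-NN delegate, so the comparison depends on whether startup cost matters for the use case.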
@Colm-in-Arm
Collaborator

Hello answerdon

There are many factors that affect execution time, and there will be cases where Arm NN does not provide improved performance. Can I suggest you try the evaluate_network.sh script in armnn/tests/ExecuteNetwork/? It tries different parameters with ExecuteNetwork and the TfLite delegate to help you choose parameters that might improve performance.

Colm.
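
For readers who want to try a parameter sweep with the benchmark_model command from the original post, here is a rough sketch in Python. It is not the evaluate_network.sh script and makes no claim about which parameters that script tries. It only recombines the flags and delegate option keys that already appear in this issue (`backends`, `disable-tflite-runtime-fallback`, `number-of-threads`). The function names and the thread counts 1, 2 and 4 are assumptions chosen for illustration.

```python
import shlex
from itertools import product


def delegate_options(backends, threads, fallback_disabled=True):
    """Build the semicolon-separated option string in the form used in this issue."""
    return ";".join([
        "backends:" + ",".join(backends),
        "disable-tflite-runtime-fallback:" + ("true" if fallback_disabled else "false"),
        f"number-of-threads:{threads}",
    ])


def sweep_commands(graph, delegate_path, backend_sets, thread_counts):
    """Return one shell-quoted benchmark_model command per (backends, threads) pair."""
    commands = []
    for backends, threads in product(backend_sets, thread_counts):
        argv = [
            "./benchmark_model",
            f"--graph={graph}",
            f"--external_delegate_path={delegate_path}",
            f"--external_delegate_options={delegate_options(backends, threads)}",
        ]
        # shlex.join quotes the option string so the ';' separators survive the shell.
        commands.append(shlex.join(argv))
    return commands


# Hypothetical sweep: the thread counts are an assumption, not taken from the issue.
for command in sweep_commands(
    graph="./mobilenet_v2_1.0_224_INT8.tflite",
    delegate_path="./libarmnnDelegate.so",
    backend_sets=[["CpuAcc"], ["CpuAcc", "CpuRef"]],
    thread_counts=[1, 2, 4],
):
    print(command)
```

The script only prints the commands, so they can be reviewed before running them on the target device.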

@orlmon01 added the TIME WAIT label (Waiting for an appropriate period for a response before closing the issue.) on Oct 16, 2024