Curated content organized to give a clear view of the subject, from algorithm to hardware execution.
- Approximate Computing in Deep Neural Networks
- PTQ: Post-Training Quantization
- QAT: Quantization-Aware Training
- 2022 Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, Armeniakos et al.
- 2019 Deep Neural Network Approximation for Custom Hardware: Where We’ve Been, Where We’re Going, Wang et al.
- 2017 Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Sze et al.
- 2019 Recent Advances in Convolutional Neural Network Acceleration, Zhang et al.
- 2020 Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey, Deng et al.
- 2020 Approximation Computing Techniques to Accelerate CNN Based Image Processing Applications – A Survey in Hardware/Software Perspective, Manikandan et al.
- 2021 Pruning and Quantization for Deep Neural Network Acceleration: A Survey, Liang et al.
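As a concrete illustration of PTQ and of combining pruning with quantization, the sketch below applies unstructured magnitude pruning and then uniform affine 8-bit quantization with min/max calibration to a small weight list. It is a generic recipe written under these assumptions; the function names are hypothetical and it is not the method of any survey or tool listed here.

```python
# Minimal sketch: magnitude pruning followed by uniform affine post-training
# quantization (PTQ) with min/max calibration. Pure Python; all names are
# hypothetical and not taken from any listed tool.


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest |w|.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def calibrate(values, num_bits=8):
    """Min/max calibration: scale and zero-point for an unsigned integer grid."""
    qmax = 2 ** num_bits - 1
    # Stretch the range to include 0 so that pruned zeros stay exactly zero.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clamping out-of-range values."""
    qmax = 2 ** num_bits - 1
    return [min(max(round(v / scale) + zero_point, 0), qmax) for v in values]


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer codes."""
    return [(x - zero_point) * scale for x in q]


if __name__ == "__main__":
    w = [0.02, -0.75, 0.31, -0.04, 0.9, 0.005, -0.3, 0.6]
    w_pruned = magnitude_prune(w, sparsity=0.5)
    scale, zp = calibrate(w_pruned)
    codes = quantize(w_pruned, scale, zp)
    w_hat = dequantize(codes, scale, zp)
    print("pruned:", w_pruned)
    print("uint8 codes:", codes)
    print("max abs error:", max(abs(a - b) for a, b in zip(w_pruned, w_hat)))
```

Stretching the calibration range to include 0 keeps pruned weights exactly zero after quantization, so the sparsity survives. Per-channel scales, percentile calibration and QAT fine-tuning are common refinements that this sketch leaves out.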
Name | Description | Framework | Supported Approximations |
---|---|---|---|
NEMO | Small library for minimizing DNNs, intended for ultra-low-power targets such as PULP-NN | PyTorch, ONNX | PTQ, QAT |
Microsoft NNI | Lightweight toolkit for feature engineering, neural architecture search, hyperparameter tuning and model compression | PyTorch, TensorFlow (+Keras), MXNet, Caffe2, CNTK, Theano | Pruning, PTQ |
PocketFlow | Open-source framework for compressing and accelerating DNNs | TensorFlow | PTQ, QAT, Pruning |
TensorFlow Model Optimization | Toolkit to optimize ML/DNN models | TensorFlow (Keras) | Clustering, Quantization (PTQ, QAT), Pruning |
QKeras | Quantization extension to Keras that provides drop-in replacements for some Keras layers | TensorFlow (Keras) | Quantization (QAT) |
Brevitas | PyTorch extension for quantizing DNN models | PyTorch | PTQ, QAT |
TFApprox | Adds ApproxConv layers to TensorFlow to emulate approximate multipliers (typically from EvoApproxLib) on GPU | TensorFlow | Approximate Multipliers |
N2D2 | Toolset to import or train models, apply quantization, and export to various formats (C/C++, ...) | ONNX | QAT (license required), PTQ |
Distiller | Open-source Python package for neural network compression research (supports fine-tuning) | PyTorch | Pruning, Quantization (QAT), Knowledge Distillation, Conditional Computation, Regularization |
AdaPT | Fast emulation framework that extends PyTorch to support approximate inference as well as approximation-aware retraining | PyTorch | Approximate Multipliers |
Intel Neural Compressor | INC is an open-source Python library for neural network compression | TensorFlow, PyTorch, ONNX Runtime, MXNet | Pruning (Magnitude, Gradient), Quantization (PTQ, Dynamic, QAT, Mixed precision), Knowledge Distillation |
Qualcomm AIMET | Open-source library for quantization and compression of trained neural networks, with a model zoo | TensorFlow, PyTorch | Pruning (Channel), Spatial SVD, Per-layer compression ratio selection, Quantization (PTQ, QAT, Simulation, Rounding, Bias correction, Cross-layer equalization, Mixed precision) |
OpenMMRazor | MMRazor is an open-source toolkit for model slimming and AutoML | OpenMMLab (PyTorch) | Neural Architecture Search (NAS), Pruning, Knowledge Distillation (KD), Quantization (in the next release) |
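TFApprox and AdaPT emulate approximate multipliers in software. One simple way to do that, sketched here as an assumption rather than a description of either tool, is to precompute the full product table of an 8-bit approximate multiplier and route every multiplication through it. The multiplier `approx_mul_u8` (which clears the low bits of each operand) is hypothetical; a real flow would use the behaviour of a characterized circuit, for example one from EvoApproxLib.

```python
# Minimal sketch: emulating an approximate unsigned 8x8-bit multiplier with a
# precomputed lookup table (LUT). ASSUMPTION: the approximate design below is a
# hypothetical operand-truncation multiplier, not a circuit from EvoApproxLib.

TRUNC_BITS = 2  # hypothetical: low-order bits cleared in each operand


def approx_mul_u8(a, b, trunc=TRUNC_BITS):
    """Approximate product of two uint8 values: drop the low `trunc` bits of each operand."""
    mask = 0xFF & ~((1 << trunc) - 1)
    return (a & mask) * (b & mask)


# Enumerate every possible uint8 x uint8 product once; inference then only
# needs table lookups instead of evaluating the approximate circuit.
LUT = [[approx_mul_u8(a, b) for b in range(256)] for a in range(256)]


def dot_exact(xs, ws):
    """Reference dot product with exact integer multiplications."""
    return sum(x * w for x, w in zip(xs, ws))


def dot_approx(xs, ws):
    """Dot product in which every multiplication goes through the approximate LUT."""
    return sum(LUT[x][w] for x, w in zip(xs, ws))


if __name__ == "__main__":
    activations = [12, 200, 37, 255]
    weights = [3, 17, 128, 64]
    exact = dot_exact(activations, weights)
    approx = dot_approx(activations, weights)
    print(f"exact={exact} approx={approx} rel_error={(exact - approx) / exact:.4f}")
```

Because truncation can only lower each product, the approximate dot product never exceeds the exact one in this sketch; other approximate designs can err in both directions.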
- DORY - Automatic tool to deploy DNNs on low-cost MCUs with typically less than 1 MB of on-chip SRAM
- Glow - Machine learning compiler and execution engine for hardware accelerators (PyTorch, ONNX)
- TensorFlow Lite - Set of tools to help developers run TensorFlow models on mobile, embedded, and IoT devices. It enables on-device machine learning inference with low latency and a small binary size (Linux, Android, MCU). A curated list of TFLite content is also available.
- OpenVINO - OpenCL-based graph compiler for Intel environments (Intel CPU, Intel GPU, dedicated accelerators)
- N2D2 - Framework for training and exporting DNNs to different formats, particularly standalone, quantized C/C++ compilable projects with very few dependencies; supports import from ONNX models
- Vitis AI - Optimal Artificial Intelligence Inference from Edge to Cloud (compiler / optimizer / quantizer / profiler / IP set)
- ONNX Runtime graph optimizations - Optimize ONNX graphs (simplification)
- Mirage - (GPU) Mirage is a tensor algebra superoptimizer that automatically discovers highly-optimized tensor programs for DNNs. Mirage automatically identifies and verifies sophisticated optimizations, many of which require joint optimization at the kernel, thread block, and thread levels of the GPU compute hierarchy.
- OpenXLA - XLA (Accelerated Linear Algebra) is an open-source machine learning (ML) compiler for GPUs, CPUs, and ML accelerators. It takes models from popular ML frameworks such as PyTorch, TensorFlow, and JAX, and optimizes them for high-performance execution across these hardware platforms.
Name | Description | Environment | Performance |
---|---|---|---|
Esperanto ET-SoC-1 | Chip with 1000+ low-power RISC-V cores for energy-efficient ML/DNN processing | Cloud | 800 TOPS @ 20 W |
Google TPU | Processing unit for DNN workloads, built around an efficient systolic array | Cloud, Edge | v4: 275 TFLOPS @ 200 W / v3: 90 TOPS @ 250 W / Coral Edge: 4 TOPS @ 2 W |
GreenWaves GAP8 | Multi-GOPS, fully programmable RISC-V IoT edge computing engine featuring an 8-core cluster with a CNN accelerator, coupled with an ultra-low-power MCU with 30 μW state-retentive sleep power (75 mW) | Edge | 600 GMAC/s/W |
Intel Movidius Myriad | Vector processing unit for accelerating DNN inference; interfaces with the OpenVINO toolkit; 16 programmable cores | Edge | 1 TOPS @ 1.5 W - 2.67 TOPS/W |
Synaptic NPU VIP9000 | Neural processing unit for accelerating DNN inference; 22 NN cores (Conv) and 8 Tensor Cores; supports BFloat16 | Edge | 6.75 TOPS @ ? W |
SiMa.ai ML accelerator MLSoC | SoC for accelerating DNN inference (PCIe/SPI/I2C...); supports INT8 | Edge/Cloud | 50 TOPS @ 5 W |
Moffett Antoum | SoC for accelerating sparse CV/LLM DNN inference | Cloud | 29.5 TOPS / 3.7 TFLOPS @ 70 W |
IBM NorthPole | NPU for DNN inference; vector-matrix multiplication (VMM) + 2x NoC; INT4/8/16 | Cloud | - |
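The Google TPU entry mentions a systolic array. The sketch below simulates, cycle by cycle, how operands flow through such an array during a matrix multiplication. It assumes an output-stationary dataflow with one processing element (PE) per output value, a simplification chosen for readability that does not model the TPU or any other chip in the table.

```python
# Minimal sketch: cycle-level simulation of an output-stationary systolic array.
# ASSUMPTION: output-stationary dataflow and one PE per output element are chosen
# for clarity; this does not model any chip listed in the table.


def systolic_matmul(A, B):
    """Compute C = A @ B on an n x m grid of PEs; return (C, cycles).

    PE(i, j) keeps the running sum of C[i][j]. Row i of A enters the left edge
    delayed by i cycles, column j of B enters the top edge delayed by j cycles,
    and every operand moves one PE right (A) or down (B) per cycle, so A[i][s]
    and B[s][j] meet in PE(i, j) at cycle i + j + s.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    acc = [[0] * m for _ in range(n)]
    a_reg = [[0] * m for _ in range(n)]  # value each PE passes to its right neighbour
    b_reg = [[0] * m for _ in range(n)]  # value each PE passes to the PE below
    cycles = n + m + k - 2  # last operands reach PE(n-1, m-1) at cycle n+m+k-3
    for t in range(cycles):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:  # left edge: skewed feed of row i of A
                    s = t - i
                    new_a[i][j] = A[i][s] if 0 <= s < k else 0
                else:       # otherwise take the value from the left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:  # top edge: skewed feed of column j of B
                    s = t - j
                    new_b[i][j] = B[s][j] if 0 <= s < k else 0
                else:       # otherwise take the value from the neighbour above
                    new_b[i][j] = b_reg[i - 1][j]
                acc[i][j] += new_a[i][j] * new_b[i][j]  # one MAC per PE per cycle
        a_reg, b_reg = new_a, new_b
    return acc, cycles


if __name__ == "__main__":
    C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, "in", cycles, "cycles")  # [[19, 22], [43, 50]] in 4 cycles
```

With this skewed feeding, an n x m array finishes an (n x k) by (k x m) product in n + m + k - 2 cycles. Simulators such as SCALE-Sim or Maestro explore these dataflow trade-offs in far more detail.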
- Maestro - open-source tool for modeling and evaluating the performance and energy-efficiency of different dataflows for DNNs
- hls4ml - Package for generating HLS code from various ML frameworks (good PyTorch support); creates a streaming architecture
- FINN - Framework for creating hardware accelerators (HLS code) from Brevitas-quantized models, down to BNNs; creates a PE-based architecture
- N2D2 - Framework for generating HLS code from N2D2-trained models (supports ONNX import); creates a streaming architecture
- ScaleHLS - HLS framework on MLIR. Can compile HLS C/C++ or ONNX model to optimized HLS C/C++ in order to generate high-efficiency RTL design using downstream tools, such as Vivado HLS. Focus on scalability, automated DSE engine.
- DNN-Neurosim - Framework for evaluating the performance of on-chip DNN inference or training
- SCALE-Sim - ARM CNN accelerator simulator that provides cycle-accurate timing, power/energy, memory bandwidth and trace results for a specified accelerator configuration and neural network architecture.
- Eyeriss Energy Estimator - Energy Estimator for MIT's Eyeriss Hardware Accelerator
- Torchbench - collection of deep learning benchmarks you can use to benchmark your models, optimized for the PyTorch framework.
- Renode - Functional simulation platform for MCU dev & test (single and multi-node)
- 2022 Cross-Layer Approximation for Printed Machine Learning Circuits (code) - Algorithmic and logic-level approximation (coefficient replacement + netlist pruning) through a full DSE for printed ML applications.
- 2020 Deep Neural Network Compression by In-Parallel Pruning-Quantization - Uses Bayesian optimization to solve the pruning and quantization problems jointly, with fine-tuning.
- 2020 OPQ: Compressing Deep Neural Networks with One-shot Pruning-Quantization - Analytical one-shot compression (pruning + quantization) of DNNs using only pretrained weight values, followed by fine-tuning to recover accuracy.
- Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity - Large matrix multiplications are tiled; this method maintains a regular sparsity pattern at the tile level, improving efficiency.
- 2020 Utilizing Explainable AI for Quantization and Pruning of Deep Neural Networks - Uses DeepLIFT (explainable AI) as a hint to improve compression by determining the importance of neurons and features.
- 2021 Post-training deep neural network pruning via layer-wise calibration - Layer-wise sparse pruning calibration that uses fractal images in place of representative data, applied post-quantization, achieving 2x compression.
- 2018 Learning Compression from Limited Unlabeled Data - Uses unlabeled data to improve quantization accuracy in a very fast fine-tuning step.
- 2020 Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors - AutoQKeras: per-layer quantization optimization using a meta-heuristic DSE based on Bayesian optimization; makes use of QKeras and hls4ml.
- 2020 Full Approximation of Deep Neural Networks through Efficient Optimization - Selects efficient approximate multipliers through retraining while minimizing accuracy loss (EvoApprox).
- 2019 ALWANN: Automatic Layer-Wise Approximation of Deep Neural Network Accelerators without Retraining - Uses NSGA-II to optimize both the set of implemented approximate multipliers and the mapping of DNN layers onto them (EvoApprox).
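Several of the papers above (the cross-layer DSE for printed circuits, AutoQKeras, ALWANN) search for trade-offs between accuracy and hardware cost. The basic building block of such multi-objective searches is Pareto dominance; NSGA-II, used by ALWANN, ranks its population by non-dominated sorting. The sketch below only extracts the Pareto front from a hypothetical list of candidate configurations; it is not NSGA-II, and the numbers are placeholders made up for illustration.

```python
# Minimal sketch: extracting the Pareto front from candidate configurations
# scored on two objectives to minimize. The candidates and their numbers are
# hypothetical placeholders, not results from any listed paper.


def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Names of the non-dominated entries of {name: objective tuple}, sorted."""
    return sorted(
        name
        for name, obj in candidates.items()
        if not any(dominates(other, obj) for other in candidates.values())
    )


if __name__ == "__main__":
    # Hypothetical (accuracy loss in %, relative multiplier energy) per configuration.
    candidates = {
        "exact": (0.0, 1.00),
        "cfg_a": (0.3, 0.82),
        "cfg_b": (0.5, 0.85),  # dominated by cfg_a
        "cfg_c": (1.2, 0.70),
        "cfg_d": (2.0, 0.71),  # dominated by cfg_c
    }
    print(pareto_front(candidates))  # ['cfg_a', 'cfg_c', 'exact']
```

A point never dominates itself (the strict-improvement condition fails), so no self-exclusion is needed. For a full evolutionary search, Pymoo ships an NSGA-II implementation.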
- MLPerf / MLCommons - Benchmarking contest for ML acceleration
- Papers with Code - Latest papers and code in ML, with state-of-the-art leaderboards for several applications (CV, NLP, medical, ...)
- TIMM - Excellent model zoo and training scripts for PyTorch
- ONNX Model Zoo - Collection of pre-trained ONNX models
- TensorFlow Hub - Pre-trained models that can be imported as Keras layers for deployment or fine-tuning
- Keras Applications - Popular pre-trained CNNs implemented in Keras; can be customized and fine-tuned
- Torchvision - The PyTorch equivalent of Keras Applications
- OpenVINO pre-trained models - Intel pre-trained models for use with OpenVINO
- Google OR-Tools - Constraint programming, routing and other optimization tools
- Facebook BoTorch - Bayesian optimization accelerated by a PyTorch backend, with a Python API
- Pymoo - Collection of multi-objective optimization implementations in Python, with a user-friendly interface
- MMdnn - Microsoft tool for cross-framework conversion, retraining, visualization and deployment
- ONNX - Model format for exchanging frozen models between ML frameworks
- TensorBoard - Visualization tool for TensorFlow, PyTorch, ...; can show the model graph, metric evolution during training and more; very adaptable
- Netron - Tool to display ONNX graphs with all their attributes
- MLflow - Very flexible experiment logging tool (client/server) for logging parameters and metrics, with object storage; Python and shell interfaces
- Xilinx Vivado HLS - C/C++-based HLS for Xilinx FPGAs
- Intel Quartus HLS - C++ HLS for Altera/Intel FPGAs
- Mentor Catapult HLS - C++/SystemC HLS from Mentor (Siemens), targeting ASICs and FPGAs
- Blog post - related to recent mobile architectures
- https://github.com/juliagusak/model-compression-and-acceleration-progress
- https://github.com/ZhishengWang/Embedded-Neural-Network
- https://github.com/memoiry/Awesome-model-compression-and-acceleration
- https://github.com/sun254/awesome-model-compression-and-acceleration
- https://github.com/guan-yuan/awesome-AutoML-and-Lightweight-Models
- https://github.com/chester256/Model-Compression-Papers
- https://github.com/mapleam/model-compression-and-acceleration-4-DNN
- https://github.com/cedrickchee/awesome-ml-model-compression
- https://github.com/jnjaby/Model-Compression-Acceleration
- https://github.com/he-y/Awesome-Pruning