Support block-wise quantization #779

huningxin · 2024-11-06T02:31:00Z

Block-wise quantization divides input tensors into smaller blocks that are quantized independently, resulting in faster optimization and higher-precision quantization. It is used by popular language models, such as the int4-quantized Phi-3 mini model.

Native ML API support

DirectML: DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE, introduced in Feature Level 6.3
CoreML: constexpr_blockwise_shift_scale
TFLite: ?

Proposal

No API signature changes are needed relative to @fdwr's proposal of the dequantizeLinear and quantizeLinear ops.

MLOperand dequantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});
MLOperand quantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});

The block_size is not passed explicitly. It is an integer implied along each dimension by block_size = input_size / scale_size, where input_size % scale_size == 0 must hold. zeroPoint and scale should have the same shape.
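As a rough illustration of the rule above, the following Python sketch infers the per-dimension block sizes from the input and scale shapes. The function name `infer_block_shape` is hypothetical, and the requirement that input and scale have the same rank is an assumption made for this sketch; the proposal text only states the divisibility rule.

```python
def infer_block_shape(input_shape, scale_shape):
    """Return the block size implied along each dimension.

    Assumption for this sketch: input and scale have the same rank.
    """
    if len(input_shape) != len(scale_shape):
        raise ValueError("input and scale must have the same rank")
    block_shape = []
    for axis, (input_size, scale_size) in enumerate(zip(input_shape, scale_shape)):
        # block_size = input_size / scale_size must be an integer.
        if scale_size <= 0 or input_size % scale_size != 0:
            raise ValueError(
                f"axis {axis}: input size {input_size} is not a multiple "
                f"of scale size {scale_size}"
            )
        block_shape.append(input_size // scale_size)
    return block_shape


# A [4, 64] input with a [4, 2] scale implies blocks of 1 x 32 elements.
block_shape = infer_block_shape([4, 64], [4, 2])  # [1, 32]
```

Under this reading, per-tensor quantization (a scale whose dimensions are all 1) and per-element quantization (a scale with the input's shape) fall out as the two extremes of the same rule.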


fdwr · 2024-11-07T04:15:01Z

Thanks for the paper link. I'd be surprised if TFLite didn't have some blockwise support somewhere, but if not, it might need decomposition (e.g. scale and zeroPoint expanded blockwise up to the input shape via tf.tile, tf.repeat, tf.image.resize, or some other similar function, then dq = (input - zeroPoint) * scale).
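The decomposition described above can be sketched in plain Python for the 2-D case. This is only an illustration, not any backend's implementation: the helper names `repeat_blockwise` and `dequantize_blockwise` are hypothetical, and nested lists stand in for tensors so the example stays self-contained.

```python
def repeat_blockwise(values, block_rows, block_cols):
    """Expand a 2-D list so each element covers a block_rows x block_cols block."""
    expanded = []
    for row in values:
        # Repeat each element along the column axis.
        wide_row = [v for v in row for _ in range(block_cols)]
        # Repeat the widened row along the row axis.
        for _ in range(block_rows):
            expanded.append(list(wide_row))
    return expanded


def dequantize_blockwise(quantized, scale, zero_point):
    """Compute dq = (input - zeroPoint) * scale with blockwise scale/zeroPoint."""
    # Block sizes are implied by the shapes (assumed to divide evenly).
    block_rows = len(quantized) // len(scale)
    block_cols = len(quantized[0]) // len(scale[0])
    scale_full = repeat_blockwise(scale, block_rows, block_cols)
    zero_point_full = repeat_blockwise(zero_point, block_rows, block_cols)
    return [
        [(q - z) * s for q, z, s in zip(q_row, z_row, s_row)]
        for q_row, z_row, s_row in zip(quantized, zero_point_full, scale_full)
    ]


# A [2, 4] input with [1, 2] scale and zeroPoint implies 2 x 2 blocks.
quantized = [[0, 1, 2, 3], [4, 5, 6, 7]]
scale = [[0.5, 2.0]]
zero_point = [[1, 4]]
dequantized = dequantize_blockwise(quantized, scale, zero_point)
```

Expanding scale and zeroPoint to the full input shape costs extra memory, which is presumably why a native blockwise operator is preferable wherever a backend offers one.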

Block-wise quantization divides input tensors into smaller blocks that are independently quantized, resulting in faster optimization and high precision quantization [1]. It is used for popular language models, such as phi-3 mini int4 quantized model [2]. Related WG issue [3] has been opened to discussion.

Firstly, this CL validates scale and zero point tensors for block-wise quantization. Besides, this CL also implements the block-wise quantization in DirectML backend by using DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE which are available in FL >= 6.3. More validation and conformance tests are added to verify the implementation.

[1]: https://arxiv.org/abs/2110.02861
[2]: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
[3]: webmachinelearning/webnn#779

Bug: 40206287
Change-Id: I977b0be57deebd7afcae216edc3ddc3818b8c09f
Cq-Include-Trybots: luci.chromium.try:mac14.arm64-blink-rel, mac14-blink-rel, mac15.arm64-blink-rel, mac15-blink-rel, linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5964816
Reviewed-by: Rafael Cintron <rafael.cintron@microsoft.com>
Reviewed-by: ningxin hu <ningxin.hu@intel.com>
Commit-Queue: ningxin hu <ningxin.hu@intel.com>
Cr-Commit-Position: refs/heads/main@{#1380767}

…or DirectML backend, a=testonly

Automatic update from web-platform-tests
webnn: Support block-wise quantization for DirectML backend

Block-wise quantization divides input tensors into smaller blocks that are independently quantized, resulting in faster optimization and high precision quantization [1]. It is used for popular language models, such as phi-3 mini int4 quantized model [2]. Related WG issue [3] has been opened to discussion.

Firstly, this CL validates scale and zero point tensors for block-wise quantization. Besides, this CL also implements the block-wise quantization in DirectML backend by using DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE which are available in FL >= 6.3. More validation and conformance tests are added to verify the implementation.

[1]: https://arxiv.org/abs/2110.02861
[2]: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
[3]: webmachinelearning/webnn#779

Bug: 40206287
Change-Id: I977b0be57deebd7afcae216edc3ddc3818b8c09f
Cq-Include-Trybots: luci.chromium.try:mac14.arm64-blink-rel, mac14-blink-rel, mac15.arm64-blink-rel, mac15-blink-rel, linux-blink-rel
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/5964816
Reviewed-by: Rafael Cintron <rafael.cintron@microsoft.com>
Reviewed-by: ningxin hu <ningxin.hu@intel.com>
Commit-Queue: ningxin hu <ningxin.hu@intel.com>
Cr-Commit-Position: refs/heads/main@{#1380767}
--
wpt-commits: 8686b7a6d288d3b2c22b5ddb5a21773619b22b85
wpt-pr: 49083
UltraBlame original commit: 6b8a19bf1f5562bfae60549575af9c2b422b4975

anssiko added the operator specific label Nov 6, 2024

fdwr mentioned this issue Nov 7, 2024

Add QuantizeLinear and DequantizeLinear for mixed precision #93

Open

chromium-wpt-export-bot mentioned this issue Nov 9, 2024

webnn: Support block-wise quantization for DirectML backend web-platform-tests/wpt#49083

Merged
