[tracking] E2EShark Model Tests Onnx Mode #566

saienduri · 2024-03-28T21:32:04Z

Below is the list of issues we are hitting when running vision int8 models end to end using onnx mode (onnx export/import -> torch-mlir -> iree-compile -> iree-runtime). You can find the models that lead to each issue in the issue description.

To reproduce the error, please setup SHARK-TestSuite and then run the run.py file with the respective command line flags (More Instructions can be found here.)

To fix the issue, you need to either modify the OnnxToTorch lowering of the corresponding op or add the missing support in the TorchToLinalg lowering. You can find more information in either the model-run.log or iree-compile.log after running the test. This can help you create a smaller repro and then try to fix that, then check if it fixes the model.

You can find the specific logs on what is failing in these locations for <model_name> where SHARK-TestSuite/e2eshark/test-onnx is test run directory:

$ SHARK-TestSuite/e2eshark/test-onnx/pytorch/models/<model_name>/model-run.log
$ SHARK-TestSuite/e2eshark/test-onnx/pytorch/models/<model_name>/iree-compile.log

Issues:

torch-to-linalg

torch.aten.quantize_per_tensor to linalg #683
-> Add transpose before ExtractSliceOp iree-org/iree#17574 @IanWood1
-> Add preprocessing TransposeExtractConcat pass iree-org/iree#17692 @IanWood1
-> [TorchToLinalg] remove extract_slice grid_sample lowering llvm/torch-mlir#3483
onnx/RAFT_vaiq_int8

iree:

'func.func' op exceeded stack allocation limit
ConvNeXt_vaiq_int8 onnx model compile failed #809
large vector sizes failure - cpu compilation - quantised models iree-org/iree#18005
dpn68_vaiq / dpn92_vaiq / dpn98_vaiq / dpn107_vaiq / dpn131_vaiq
skresnet34_vaiq / skresnet18_vaiq
DeepLabV3_resnet50_vaiq_int8 / RAFT_vaiq_int8 / U-2-Net_vaiq_int8 / RAFT_vaiq_int8
Number of dims and results of reindexed AffineMap doesn't match on Vectorization iree-org/iree#17591 @jinchen62 supposed to back to this after 0630
pytorch/bart-large
dominance-related failure occurs when compiling an opt model iree-org/iree#17759 (Solved, uncovered issue below:)
Unable to lower op 'tensor.from_elements' with multiple parameters iree-org/iree#17086
onnx/opt-125M-awq

Onnx VAIQ Models

To run all tests :
python run.py --torchmlirbuild /path_to/torch-mlir/build --ireebuild /path-to/iree-build --cachedir /path-to/model-cache-dir -r test-onnx --tolerance .001 .001 --mode onnx --report -f onnx -g models

To run specific test (ex: onnx/models/AlexNet_vaiq_int8)
python run.py --torchmlirbuild /path_to/torch-mlir/build --ireebuild /path-to/iree-build --cachedir /path-to/model-cache-dir -r test-onnx --tolerance .001 .001 --mode onnx --report --tests onnx/models/AlexNet_vaiq_int8

Versions:
torch-mlir - main - a7302a68
iree - main - 40f25334d2

Status:
Check the latest run report in e2eshark-reports : e2eshark-reports/<DATE>/onnx_reports/statusreport.md

onnx model Pass(with --torchtolinalg):28/34, Day:08/08

tests	torch-mlir	iree-compile	inference	Comments
onnx/models/opt-125M-awq	passed	failed	notrun	move to privatestorage
onnx/models/retinanet_resnet50_fpn_vaiq_int8	passed	notrun	notrun	onnx.if #696
onnx/models/KeypointRCNN_vaiq_int8	passed	notrun	notrun	onnx.if #696
onnx/models/RAFT_vaiq_int8	passed	failed	notrun	large vector size
onnx/models/Inception_v4_vaiq_int8	passed	passed	failed	inference faile: 32 outputs specified but the provided variant list only has 1 elements;

pytorch model , Pass(with --torchtolinalg): 4/17/28, Day: 08/08

tests	torch-mlir	iree-compile	inference	Comments
pytorch/models/opt-1.3b	passed	failed	notrun	onnx.Add torchtolinalg
pytorch/models/bart-large	passed	failed	notrun	Crash & Number of dims and results of reindexed AffineMap doesn't match on Vectorization iree-org/iree#17591
pytorch/models/llama2-7b-hf	notrun	notrun	notrun	too big to run locally
pytorch/models/vicuna-13b-v1.3	notrun	notrun	notrun	too big to run locally, running out 126G memory
pytorch/models/dlrm	notrun	notrun	notrun	onnx import incompatible function arguments.The following argument types are supported: 1. (arg0: str, arg1: str, arg2: str) -> str
pytorch/models/gpt2-xl	notrun	notrun	notrun	onnx import incompatible function arguments.
pytorch/models/llama2-7b-GPTQ	notrun	notrun	notrun	onnx import incompatible function arguments
pytorch/models/phi-1_5	notrun	notrun	notrun	onnx import incompatible function arguments
pytorch/models/phi-2	notrun	notrun	notrun	onnx import incompatible function arguments
pytorch/models/stablelm-3b-4e1t	notrun	notrun	notrun	onnx import incompatible function arguments
pytorch/models/t5-large	notrun	notrun	notrun	onnx import incompatible function arguments

The text was updated successfully, but these errors were encountered:

schnkmwt · 2024-04-01T18:47:14Z

Please add Path to the logs directory to make it clear where to look for them. For model <model_name> they are located here assuming SHARK-TestSuite/e2eshark/test-onnx is the test run directory:

$ SHARK-TestSuite/e2eshark/test-onnx/pytorch/models/<model_name>/model-run.log
$ SHARK-TestSuite/e2eshark/test-onnx/pytorch/models/<model_name>/iree-compile.log

schnkmwt · 2024-04-01T23:19:00Z

Working on the "Add" Issue. Please assign: #586

AmosLewis · 2024-05-01T00:09:03Z

zjgarvey · 2024-05-13T17:20:39Z

I'm not sure what is the cause of the discrepancy with the current list of issues, but with an up-to-date torch-mlir with a few minor edits to the recent work in fuse quantized ops, here's a triage list for torch-mlir failures when running:

python run.py --cachedir="/home/zjgar/.cache/" --torchtolinalg -c "/home/zjgar/code/torch-mlir/build/" --mode=onnx --groups=models --framework=onnx

list of failures (and a brief triage):

Test onnx/models/VideoResNet_vaiq_int8 failed [torch-mlir]
    onnx.constant??
Test onnx/models/MobileNetV3_small_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/RegNet_y_8gf_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/Inception_v4_vaiq_int8 failed [torch-mlir]
    average Pool
Test onnx/models/pytorch-3dunet_vaiq_int8 failed [torch-mlir]
    resize
Test onnx/models/ShuffleNet_v2_x2_0_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/MNASNet_1_3_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/LRASPP_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/RRDB_ESRGAN_vaiq_int8 failed [torch-mlir]
    resize
Test onnx/models/KeypointRCNN_vaiq_int8 failed [torch-mlir]
    onnx if
Test onnx/models/EfficientNet_v2_s_vaiq_int8 failed [torch-mlir]
    grouped q convolution
Test onnx/models/retinanet_resnet50_fpn_vaiq_int8 failed [torch-mlir]
    onnx if
Test onnx/models/ConvNeXt_vaiq_int8 failed [torch-mlir]
    grouped q convolution

grouped q convolution @zjgarvey
averagePool @vivekkhandelwal1 @AmosLewis
resize @aldesilv
onnx if @renxida (see: onnx.If errors when the two branches yield with different shapes #696)
VideoResnet onnx.constant issue

renxida · 2024-05-14T04:07:43Z

Ty! On it. Will dive deeper tomorrow.

Also, when posting commands, will so love it if you could do

python run.py --cachedir="~/.cache/" --torchtolinalg -c "~/torch-mlir/build/" --mode=onnx --groups=models --framework=onnx

I think a sizeable chunk have each repo directly cloned to our home dir so something like this would be directly runnable.

AmosLewis · 2024-05-14T17:45:43Z

Test onnx/models/pytorch-3dunet_vaiq_int8 failed [torch-mlir]
resize @aldesilv

Test onnx/models/RRDB_ESRGAN_vaiq_int8 failed [torch-mlir]
resize @zjgarvey

aldesilv · 2024-05-14T18:20:07Z

Test onnx/models/pytorch-3dunet_vaiq_int8 failed [torch-mlir] resize @aldesilv

the immediate issue is the dynamic dims in the input torch.vtensor<[?,256,?,?,?],f32> checked causing the compile error. Next would be the 3-d input

…ering (#3351) Addresses [Shark-Turbine #196](nod-ai/SHARK-TestSuite#196) Related tracker [Shark-Turbine #566](nod-ai/SHARK-ModelDev#566) Related onnx.Resize issues [Shark-Turbine #616](nod-ai/SHARK-ModelDev#616)

…ering (llvm#3351) Addresses [Shark-Turbine llvm#196](nod-ai/SHARK-TestSuite#196) Related tracker [Shark-Turbine llvm#566](nod-ai/SHARK-ModelDev#566) Related onnx.Resize issues [Shark-Turbine llvm#616](nod-ai/SHARK-ModelDev#616)

AmosLewis · 2024-05-30T17:39:14Z

3 pytorch model failed again from 2024-05-29 to 2024-05-30 e2eshark-reports.

| pytorch/models/bert-large-uncased                | passed      | passed        | notrun       | failed         | notrun      |
| pytorch/models/bge-base-en-v1.5                  | passed      | passed        | notrun       | failed         | notrun      |
| pytorch/models/miniLM-L12-H384-uncased           | passed      | passed        | notrun       | failed         | notrun      |

zjgarvey · 2024-05-30T17:50:59Z

3 pytorch model failed again from 2024-05-29 to 2024-05-30 e2eshark-reports.

| pytorch/models/bert-large-uncased                | passed      | passed        | notrun       | failed         | notrun      |
| pytorch/models/bge-base-en-v1.5                  | passed      | passed        | notrun       | failed         | notrun      |
| pytorch/models/miniLM-L12-H384-uncased           | passed      | passed        | notrun       | failed         | notrun      |

Any idea what they are failing on?

AmosLewis · 2024-05-30T23:22:49Z

Any idea what they are failing on?

Not sure, working with @saienduri to figure it out. I just tested with 0530 torch-mlir d7b8f00 and iree candidate-20240530.909 locally, they passed. It's kind of weird. Sai think it might pass with latest iree, let's see what's going on with 0531 report.

saienduri · 2024-05-31T20:13:14Z

Any idea what they are failing on?

Not sure, working with @saienduri to figure it out.

We have rooted the 3 model regression to 40 passes in https://github.com/nod-ai/e2eshark-reports/tree/main/2024-05-31 to the convert-torch-onnx-to-torch pass being outdated in iree (generating different mlirs compared to torch-mlir TOM). So, once torch-mlir gets bumped in iree, they should pass again :)

This addresses 7 of the model failures I'm seeing in the test suite. See [Shark-Turbine issue #566](nod-ai/SHARK-ModelDev#566). Need the op ```linalg.conv_2d_ngchw_gfchw_q``` to be added upstream before merging this. See [llvm-project PR #92136 ](llvm/llvm-project#92136). A small additional expansion to operand quantization is included in this patch to address a model failure that occurs when unblocking the quantized group convolutions in one of these onnx models.

…ering (llvm#3351) Addresses [Shark-Turbine llvm#196](nod-ai/SHARK-TestSuite#196) Related tracker [Shark-Turbine llvm#566](nod-ai/SHARK-ModelDev#566) Related onnx.Resize issues [Shark-Turbine llvm#616](nod-ai/SHARK-ModelDev#616)

This addresses 7 of the model failures I'm seeing in the test suite. See [Shark-Turbine issue llvm#566](nod-ai/SHARK-ModelDev#566). Need the op ```linalg.conv_2d_ngchw_gfchw_q``` to be added upstream before merging this. See [llvm-project PR #92136 ](llvm/llvm-project#92136). A small additional expansion to operand quantization is included in this patch to address a model failure that occurs when unblocking the quantized group convolutions in one of these onnx models.

saienduri mentioned this issue Mar 28, 2024

[Tracker] Onnx FE Support #564

Open

schnkmwt mentioned this issue Apr 1, 2024

Fix E2EShark Model Tests Onnx Mode Issue #571 #586

Open

This was referenced Apr 19, 2024

[tracking] TorchToLinalg Op Support #347

Open

[tracking] ONNX Op Support #215

Open

AmosLewis assigned AmosLewis, PhaneeshB and saienduri Apr 22, 2024

This was referenced Apr 24, 2024

[tracking] E2EShark IREE Tests - Onnx Op - Compile #563

Open

[tracking] E2EShark IREE Tests - Onnx Op - Runtime #583

Open

zjgarvey mentioned this issue May 14, 2024

[TorchToLinalg] add support for quantized group conv llvm/torch-mlir#3341

Merged

zjgarvey mentioned this issue May 15, 2024

[ONNX][TorchToLinalg] Add support for dynamic dims in Interpolate lowering llvm/torch-mlir#3351

Merged

renxida mentioned this issue May 21, 2024

onnx.If errors when the two branches yield with different shapes #696

Open

jinchen62 mentioned this issue Aug 2, 2024

[tracking] HF CNN fp32 model tests #801

Open

30 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[tracking] E2EShark Model Tests Onnx Mode #566

[tracking] E2EShark Model Tests Onnx Mode #566

saienduri commented Mar 28, 2024 •

edited by kumardeepakamd

Loading

schnkmwt commented Apr 1, 2024 •

edited by kumardeepakamd

Loading

schnkmwt commented Apr 1, 2024 •

edited

Loading

AmosLewis commented May 1, 2024 •

edited

Loading

zjgarvey commented May 13, 2024 •

edited by renxida

Loading

renxida commented May 14, 2024

AmosLewis commented May 14, 2024

aldesilv commented May 14, 2024 •

edited

Loading

AmosLewis commented May 30, 2024

zjgarvey commented May 30, 2024

AmosLewis commented May 30, 2024 •

edited

Loading

saienduri commented May 31, 2024 •

edited

Loading

[tracking] E2EShark Model Tests Onnx Mode #566

[tracking] E2EShark Model Tests Onnx Mode #566

Comments

saienduri commented Mar 28, 2024 • edited by kumardeepakamd Loading

Onnx VAIQ Models

onnx model Pass(with --torchtolinalg):28/34, Day:08/08

pytorch model , Pass(with --torchtolinalg): 4/17/28, Day: 08/08

schnkmwt commented Apr 1, 2024 • edited by kumardeepakamd Loading

schnkmwt commented Apr 1, 2024 • edited Loading

AmosLewis commented May 1, 2024 • edited Loading

zjgarvey commented May 13, 2024 • edited by renxida Loading

renxida commented May 14, 2024

AmosLewis commented May 14, 2024

aldesilv commented May 14, 2024 • edited Loading

AmosLewis commented May 30, 2024

zjgarvey commented May 30, 2024

AmosLewis commented May 30, 2024 • edited Loading

saienduri commented May 31, 2024 • edited Loading

saienduri commented Mar 28, 2024 •

edited by kumardeepakamd

Loading

schnkmwt commented Apr 1, 2024 •

edited by kumardeepakamd

Loading

schnkmwt commented Apr 1, 2024 •

edited

Loading

AmosLewis commented May 1, 2024 •

edited

Loading

zjgarvey commented May 13, 2024 •

edited by renxida

Loading

aldesilv commented May 14, 2024 •

edited

Loading

AmosLewis commented May 30, 2024 •

edited

Loading

saienduri commented May 31, 2024 •

edited

Loading