Segmentation fault TensorRT 10.3 (and older versions) with GCC13 #4173
Comments
I have the same error when trying to build a TRT file from an ONNX file, on the same OS version, with driver version 560.35.03 and TensorRT 10.4. Also, when I try to load a TRT file that was built last week, before upgrading to GCC 13, I get this stack trace:
This happens when deallocating a When using
Edit 1: I was able to fix my bug above by keeping the IRuntime alive longer than the engine. Then I had a secondary logic bug (which is why it didn't work the first time I tried that). But this wasn't needed before, or it just kept going. As for creating the TRT file from and an Hope this helps with diagnosing the problem.
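The lifetime fix described above can be illustrated with a minimal sketch. `FakeRuntime`, `FakeEngine`, and `InferenceSession` are hypothetical stand-ins invented for this example, not the TensorRT API. They only record when they are destroyed. The sketch relies on the C++ rule that class members are destroyed in reverse declaration order, so declaring the runtime before the engine makes the runtime outlive the engine. Whether this matches the commenter's actual code is an assumption.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for nvinfer1::IRuntime and nvinfer1::ICudaEngine.
// They only record destruction order; they are NOT the TensorRT API.
struct FakeRuntime {
    std::vector<std::string>& log;
    explicit FakeRuntime(std::vector<std::string>& l) : log(l) {}
    ~FakeRuntime() { log.push_back("runtime destroyed"); }
};

struct FakeEngine {
    std::vector<std::string>& log;
    explicit FakeEngine(std::vector<std::string>& l) : log(l) {}
    ~FakeEngine() { log.push_back("engine destroyed"); }
};

// Members are destroyed in reverse declaration order, so declaring the
// runtime first guarantees that it outlives the engine.
struct InferenceSession {
    std::unique_ptr<FakeRuntime> runtime;  // declared first, destroyed last
    std::unique_ptr<FakeEngine> engine;    // declared last, destroyed first

    explicit InferenceSession(std::vector<std::string>& log)
        : runtime(std::make_unique<FakeRuntime>(log)),
          engine(std::make_unique<FakeEngine>(log)) {}
};

// Builds and tears down a session, returning the order of destruction.
std::vector<std::string> destruction_order() {
    std::vector<std::string> log;
    {
        InferenceSession session(log);
    }  // session goes out of scope here
    return log;
}
```

Calling `destruction_order()` returns the engine entry first and the runtime entry second. Swapping the two member declarations would reverse that order, which is the situation the comment above describes as crashing.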
Please see the support matrix for the supported GCC versions on each platform: https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html#software-version-platform
Does setting
@zeroepoch I built TRT with
We force the older C++ ABI to increase compatibility with RHEL 7. That will be changing in a future release.
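For readers who want to see which ABI their own build uses, here is a small sketch. It assumes the "older C++ ABI" mentioned above refers to libstdc++'s dual ABI, which is selected by the `_GLIBCXX_USE_CXX11_ABI` macro. The thread does not state this explicitly. The function names `libstdcxx_abi_mode` and `describe_abi` are made up for the example.

```cpp
#include <string>  // any libstdc++ header defines _GLIBCXX_USE_CXX11_ABI

// Returns 1 for the new (C++11) libstdc++ ABI, 0 for the old ABI,
// and -1 when the macro is not defined (for example, not libstdc++).
int libstdcxx_abi_mode() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
    return _GLIBCXX_USE_CXX11_ABI;
#else
    return -1;
#endif
}

// Human-readable description of the ABI this translation unit was built with.
std::string describe_abi() {
    switch (libstdcxx_abi_mode()) {
        case 1:  return "new C++11 ABI";
        case 0:  return "old pre-C++11 ABI";
        default: return "unknown (libstdc++ macro not defined)";
    }
}
```

With GCC, compiling with `-D_GLIBCXX_USE_CXX11_ABI=0` normally selects the old ABI for that translation unit. The result only describes your own binary, not how libnvinfer itself was built.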
Could you try the latest release, TRT 10.6? We've officially supported Ubuntu 24.04 with GCC 13 for the last few TRT releases.
Hi @zeroepoch! Many thanks for your support! I created a repo with instructions on how to replicate the issue: https://github.com/jokla/trt_gcc13. I added a vanilla YOLOv8n ONNX model from Ultralytics. It was generated with:
!pip install ultralytics
from ultralytics import YOLO
model = YOLO("yolov8n.pt")
success = model.export(format="onnx")
I get a segmentation fault when I try to parse the model with FP16, on reaching CaskConvolution[0x80000009].
Log before segmentation fault:
[11/10/2024-17:27:44] [V] [TRT] /models.0/backbone/backbone/dark2/dark2.1/m/m.0/conv2/conv/Conv + PWN(PWN(PWN(/models.0/backbone/backbone/dark2/dark2.1/m/m.0/conv2/act/Sigmoid), PWN(/models.0/backbone/backbone/dark2/dark2.1/m/m.0/conv2/act/Mul)), PWN(/models.0/backbone/backbone/dark2/dark2.1/m/m.0/Add)) (CaskConvolution[0x80000009]) profiling completed in 0.370127 seconds. Fastest Tactic: 0x0866ddee325d07a6 Time: 0.0348142
However, I discovered that trtexec also crashes when I run it with an invalid parameter, such as trtexec --test. I tested the following:
I don't think we can easily move to Ubuntu 24.04 since we are using NVIDIA Holoscan, so I have tried to avoid installing GCC 13 from apt (the Ubuntu 22.04 version is GCC 13.1). Instead, I tried to build GCC 13.2 from source and use it on the I am not sure why TRT is not happy with the GCC 13.1 installed by apt. I haven't found a reason yet. Maybe something was fixed in GCC 13.2? This is the list of bugs fixed for that release: https://gcc.gnu.org/bugzilla/buglist.cgi?bug_status=RESOLVED&resolution=FIXED&target_milestone=13.2
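Since the comment above hinges on the difference between GCC 13.1 and 13.2, a quick way to confirm which compiler actually built a given binary is to embed the version from the compiler's predefined macros. This is only a sketch. `compiler_version` is a hypothetical helper name, and it reports the compiler used for your own code, not the one used to build libnvinfer.

```cpp
#include <string>

// Returns the compiler version as "major.minor.patch", built from the
// predefined macros. Clang is checked first because it also defines
// __GNUC__ for compatibility, which would otherwise be misleading.
std::string compiler_version() {
#if defined(__clang__)
    return "clang " + std::to_string(__clang_major__) + "." +
           std::to_string(__clang_minor__) + "." +
           std::to_string(__clang_patchlevel__);
#elif defined(__GNUC__)
    return "gcc " + std::to_string(__GNUC__) + "." +
           std::to_string(__GNUC_MINOR__) + "." +
           std::to_string(__GNUC_PATCHLEVEL__);
#else
    return "unknown compiler";
#endif
}
```

Printing this string at startup makes it easy to tell whether the apt-installed or the source-built toolchain was picked up by the build.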
Hi @jokla, I was able to reproduce your problem. Thank you for the very detailed repro steps! I'm not exactly sure where the problem is being introduced, but I can speculate that it's due to I was able to find a workaround by rebuilding trtexec. With it, both the invalid argument case and the original model you're trying to convert work without crashing. I added to your existing Docker container with the following
Within this new container the following commands work:
I also want to mention that 24.11, which will be released in a week or so, will be based on Ubuntu 24.04, so it will have GCC 13.2 as you mentioned. Maybe this will help with your Holoscan situation?
Hi @zeroepoch ! Thanks for the update. I tried to build trtexec as you suggested, the
Could you confirm that it is actually working for you?
With this command, you don't get a segmentation fault message because it exits before printing anything to the terminal. If you add any command afterward, it will show a segmentation fault (at least for me).
Many thanks for your support!
Hi @jokla, When running this command:
It ends with:
When running this command:
It ends with:
As you mentioned, it segfaults at the end. I wasn't seeing it before, probably because the container exits before the error gets printed. I'll need to investigate further. Based on the backtrace it looks like a similar issue as before when
Since
Hi @zeroepoch, I'm curious about what's causing the issue, but unfortunately, I don't have access to the source code, so I can't investigate further. It's not ideal, but I went ahead and built GCC 13.2 from source and configured our environment to use that version. So far, everything seems to be working without any issues. Thanks for your support!
Description
TensorRT segfaults when parsing an ONNX model (YOLOv8 QAT) with GCC 13 installed on Ubuntu 22.04.
Environment
TensorRT Version: 10.3 or older
NVIDIA GPU: NVIDIA RTX A6000
NVIDIA Driver Version: 560.28.03
CUDA Version: 12.6
CUDNN Version: 8.9.6.50-1+cuda12.2
Operating System:
Container: ubuntu-22.04.Dockerfile + GCC 13 installed
Same issue with nvcr.io/nvidia/tensorrt:24.08-py3 with GCC 13 installed on top of it.
Relevant Files
Steps To Reproduce
Commands or scripts:
It seems that the issue is coming from libnvinfer.so.10 and GCC 13. The TRT open source version uses a prebuilt nvinfer (from https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.3.0/tars/TensorRT-10.3.0.26.Linux.x86_64-gnu.cuda-12.5.tar.gz), possibly compiled with an older GCC (GCC 8, looking at this table). The conversion works on an Orin with JetPack 6 (probably because TRT is built with a newer GCC version).
How can I make TRT (and libnvinfer) compatible with GCC 13? Also, is there a specific reason why it's only built with an old version of GCC?
Many thanks!
Have you tried the latest release?: Yes, same issue
Can this model run on other frameworks? For example, run the ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Yes