guides/triton-inference-server/ #8241

2024-02-16T09:22:33Z

giscus[bot]
bot Feb 16, 2024

guides/triton-inference-server/

A step-by-step guide on integrating Ultralytics YOLOv8 with Triton Inference Server for scalable and high-performance deep learning inference deployments.

https://docs.ultralytics.com/guides/triton-inference-server/

Sachi9soni · 2024-02-16T09:22:34Z

Sachi9soni
Feb 16, 2024 — with giscus

i am following the same approach and able to complete inference with above code,
i did one extra step i also converted onnx model into TensorRT using ( nvcr.io/nvidia/tensorrt:22.11-py3 ) this TRT container
and i ran inference with (nvcr.io/nvidia/tritonserver 22.11-py3 )
and i ran the infernce succesfully on any size of input image:

however i want to run inference with different model size (i.e. 1024 or 800 or 1088), as i want model to be more accuratly work with certain size of images
so i exported the onnx model with different size (1024)
onnx_file = model.export(format='onnx', dynamic=True,imgsz=1024)
converted mode to tesorRT engine using the same conatiner (22.11-py3)
trtexec --onnx=yolov8l.onnx --saveEngine=model.plan

the config.pbxt file i am using are also working fine
model loaded succesfully showing status ready
but at the time of inference its showing this error::
tritonclient.utils.InferenceServerException: [400] [request id: <id_unknown>] unexpected shape for input 'images' for model 'yolov8l'. Expected [1,3,1024,1024], got [1,3,640,640]

@glenn-jocher

10 replies

pderrenger Feb 18, 2024
Maintainer

Hey there! Great job on successfully running inference with Triton Inference Server! 🎉 To work with the response object from your Triton client, you'll want to extract the inference results, which typically include the detected objects' bounding boxes, classes, and confidence scores.

Here's a quick example of how you might process the response object:

# Extract the output from the response
output_data = response.as_numpy('<output_layer_name>')  # Replace with your actual output layer name

# Now, output_data contains the results of the inference
# You can process these results as needed, for example:
for detection in output_data:
    x, y, w, h = detection[:4]  # Bounding box coordinates
    conf = detection[4]  # Confidence score
    class_id = detection[5]  # Class ID
    # ... Now you can use these values as needed

Make sure to replace '<output_layer_name>' with the actual name of the output layer from your YOLO model.

If you need further assistance or have more questions, feel free to reach out. Happy coding! 😊

Sachi9soni Feb 18, 2024 — with giscus

hey thanks for your response , i was following this code for drwaing and visualizing the result which is availble on ultralytics::::::::
model = YOLO(f'http://localhost:8000/v8_p2x_2_1080_50', task='detect')
results=model(input_image)

Visualize the results on the frame

annotated_frame = results[0].plot()

Display the annotated frame

cv2.imshow("YOLOv8 Inference", annotated_frame)

which seems quiet simple to work with... but now i am getting tensor so should i have to write a cutomize python script to extracting the result?

glenn-jocher Feb 18, 2024
Maintainer

@Sachi9soni hey, thanks for reaching out! 😊 You're on the right track with the Triton Inference Server. If you've got the response tensor from Triton and you're looking to visualize the results similar to how it's done with Ultralytics directly, you're correct that you'll need to process the tensor.

The response from Triton will give you raw output, which you'll need to decode to get the bounding boxes, classes, and scores. Here's a simplified way to do it:

# Assuming 'output_data' is the raw tensor you got from Triton
output_data = response.as_numpy('<output_layer_name>')  # Replace with your actual output layer name

# Decode the detections (you'll need to implement the decoding based on your model's output format)
detections = decode_output(output_data)  # This is a placeholder function

# Now you can visualize the detections on your image
for detection in detections:
    x1, y1, x2, y2, conf, class_id = detection
    cv2.rectangle(input_image, (x1, y1), (x2, y2), (255, 0, 0), 2)  # Draw rectangle
    cv2.putText(input_image, f'{class_id} {conf:.2f}', (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.9, (255, 0, 0), 2)  # Put class label

# Show the annotated image
cv2.imshow('YOLOv8 Inference', input_image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Remember to replace '<output_layer_name>' with the name of the output layer from your model and implement the decode_output function according to how your model formats its output.

Hope this helps! If you need more detailed guidance, the Ultralytics docs and community are always here to help. Happy coding! 🚀

Sachi9soni Feb 20, 2024 — with giscus

hey thanks for your responsee , i was trying for interepret the results i was getting (1,5,98260) size tensor .
and also getting negative values and so many zero values after printing out the tensor values.
i am thinking it like this :
image size =1088x1088
is there 1088/32=34 x34 grids
and each grids have 85 values?
so 34x34x85=98260
and there are 5 boxes for each grid??
am i right?
this is the output 5*98260
https://drive.google.com/file/d/1YmRey1AHUNr3lBQDYVjHJQpGu_cLqELi/view?usp=drive_link

glenn-jocher Feb 21, 2024
Maintainer

@Sachi9soni hey there! You're on the right track with your understanding of the output tensor. The tensor shape (1, 5, 98260) indeed suggests that your model is outputting predictions for a grid of 34x34 (since 1088/32=34) with 85 values per grid cell, and 5 anchor boxes per grid cell, which gives you the 98260 when multiplied together.

However, the negative and zero values you're seeing are likely due to raw model outputs that need to be processed through a detection post-processing step. This step typically involves applying a sigmoid function to certain outputs to get probabilities, decoding the bounding boxes, and applying Non-Maximum Suppression (NMS) to remove duplicates and keep only the most confident detections.

The output you're seeing is not directly interpretable as detections; it's an intermediate representation that needs to be decoded. The YOLO class in Ultralytics handles this for you when you use it for inference, as in the code snippet you provided.

If you're working with raw output tensors from Triton, you'll need to implement similar post-processing to interpret the results. If you'd like to stick with the simplicity of the Ultralytics interface, you can continue using the YOLO class as you've been doing, and it will take care of the post-processing for you.

Hope this helps clear things up! If you have any more questions, feel free to ask. 😊👍

Manishthakur2503 · 2024-06-12T12:08:20Z

Manishthakur2503
Jun 12, 2024 — with giscus

Hi , I am following the same thing Triton inference server using ultralytics . But what I understood is the ultralytics library is typically used to run inference locally rather than interacting with a remote server like Triton.

For running inference on a model hosted on Triton, we need to use a client that can communicate with Triton's REST or GRPC APIs.

I am using yolov8n model and deployed it to triton and this is my config file name:
"yolov8n"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
{
name: "images"
data_type: TYPE_FP32
dims: [1, 3, 640, 640]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [1, 84, 8400]
}
]
instance_group [
{
kind: KIND_CPU
count: 1
}
]

Is there any script to perform inference using the tritonclient library?

3 replies

glenn-jocher Jun 13, 2024
Maintainer

@Manishthakur2503 hi there! 😊

Thank you for your detailed question and for sharing your Triton configuration. You're correct that the Ultralytics library is typically used for local inference, but it can indeed be integrated with Triton Inference Server for remote inference.

To perform inference using the tritonclient library, you can follow the steps below. This script will help you communicate with the Triton server and run inference on your deployed YOLOv8 model.

Prerequisites

Ensure you have the tritonclient library installed:

pip install tritonclient[all]

Example Script for Inference with Triton Client

Here's a Python script to perform inference using the tritonclient library:

import numpy as np
import tritonclient.http as httpclient
from PIL import Image

# Load and preprocess the image
image_path = "path/to/image.jpg"
image = Image.open(image_path).resize((640, 640))
image = np.array(image).astype(np.float32)
image = np.transpose(image, (2, 0, 1))  # Convert to CHW format
image = np.expand_dims(image, axis=0)  # Add batch dimension

# Create Triton client
url = "localhost:8000"
model_name = "yolov8n"
client = httpclient.InferenceServerClient(url=url)

# Prepare inputs and outputs
inputs = [httpclient.InferInput("images", image.shape, "FP32")]
inputs[0].set_data_from_numpy(image)

outputs = [httpclient.InferRequestedOutput("output0")]

# Perform inference
response = client.infer(model_name, inputs, outputs=outputs)

# Get the output
output_data = response.as_numpy("output0")
print("Inference result:", output_data)

Explanation

Image Preprocessing: The image is loaded, resized to 640x640, and converted to the required format (CHW).
Triton Client Setup: A Triton client is created to communicate with the server.
Inputs and Outputs: The input tensor is prepared and set with the image data. The output tensor is specified to retrieve the inference results.
Inference: The infer method is called to perform inference, and the results are printed.

Running the Triton Server

Ensure your Triton server is running with the YOLOv8 model deployed:

docker run -d --rm -v /path/to/triton_repo:/models -p 8000:8000 nvcr.io/nvidia/tritonserver:23.09-py3 tritonserver --model-repository=/models

Cleanup

Don't forget to clean up the Docker container after testing:

docker kill <container_id>

For more detailed steps and additional configurations, please refer to our Triton Inference Server Guide.

If you encounter any issues or have further questions, feel free to reach out. Happy coding! 🚀

gerasim13 Oct 30, 2024

how can i decode this output_data?

glenn-jocher Oct 31, 2024
Maintainer

To decode the output_data, you'll need to reshape it according to your model's output dimensions and interpret the results based on the YOLOv8 output format. Typically, this involves extracting bounding boxes, class scores, and confidence levels. For detailed guidance, you might want to refer to the YOLOv8 documentation or relevant sections in the Triton Inference Server guide.

kwayne-08 · 2024-07-16T18:05:25Z

kwayne-08
Jul 16, 2024 — with giscus

I followed the outline in this post to successfully infer on Triton using the onnx model as instructed. However, the resulting images are labeling with class numbers instead of the names. I can easily translate the numbers to names after the fact, but is there a way to insert the names ahead of inference so images are labeled with names?

1 reply

glenn-jocher Jul 16, 2024
Maintainer

@kwayne-08 great to hear you got Triton working with YOLOv8! To display class names instead of numbers directly during inference, you can modify the config.pbtxt file in your Triton model repository to include the class labels. This way, the server will use the names during inference. If you need more details on configuring this, feel free to ask!

liulan93 · 2024-08-05T05:14:03Z

liulan93
Aug 5, 2024 — with giscus

Hi, I am following the same thing Triton inference server using ultralytics, and using wrk to request the server. However, the GPU utilization keeps low(using 319M, 8G in total) when running inference, is there any advice to improve the GPU utilization? thanks.

1 reply

pderrenger Aug 5, 2024
Maintainer

@liulan93 to improve GPU utilization when using Triton Inference Server with Ultralytics YOLOv8, consider the following suggestions:

Batch Processing: Increase the batch size to process multiple images simultaneously, which can better utilize the GPU.
Concurrency: Run multiple concurrent inference requests to keep the GPU busy.
Model Optimization: Ensure the model is optimized for inference, such as using TensorRT for further performance gains.

For more detailed guidance, refer to the Triton Inference Server documentation: https://docs.ultralytics.com/guides/triton-inference-server/.

Tariq-droid · 2024-09-10T08:10:15Z

Tariq-droid
Sep 10, 2024 — with giscus

I followed the steps and I am encountering this error:
AttributeError: 'AutoBackend' object has no attribute 'task'

can someone help?
I have upgraded to the latest version

3 replies

glenn-jocher Sep 10, 2024
Maintainer

@Tariq-droid it seems like there might be an issue with the model setup. Please ensure that the model is correctly loaded and that the Triton server is running with the appropriate configuration. Double-check the model export and repository setup steps. If the problem persists, consider reviewing the Triton Inference Server documentation for further troubleshooting.

zeograd Oct 15, 2024

I faced the same issue and opened a PR with a tiny fix. It did the trick for me.

glenn-jocher Oct 16, 2024
Maintainer

@zeograd thank you for your contribution! If the PR resolves the issue, please ensure it aligns with the project's guidelines. For further assistance, feel free to engage with the community here.

faheem-khaskheli · 2024-09-30T04:19:05Z

faheem-khaskheli
Sep 30, 2024 — with giscus

hello, i am trying to use .engine (tensorrt) format when using triton inference server but i get error, because triton does not recognize the backend needed for .engine file.

5 replies

glenn-jocher Sep 30, 2024
Maintainer

@faheem-khaskheli it seems Triton doesn't support the .engine format directly. Please export your YOLOv8 model to ONNX format for compatibility with Triton. You can find detailed instructions in the Triton Inference Server guide.

faheem-khaskheli Sep 30, 2024 — with giscus

does .onnx give same speed as .engine file?
also i have read the documentation of triton it does say that it support tensorrt, by doing some research i found it uses model.plan file. but i donot know how to make it.
also i have tried .pt file it say constant.pkl file not found.
in config.pbtxt i changes the backend: pytorch or backend: tensorrt.
still cannot find any working solution.

glenn-jocher Sep 30, 2024
Maintainer

For Triton Inference Server, please export your YOLOv8 model to ONNX format, as Triton supports ONNX natively. The .engine format is not directly supported by Triton. Ensure your config.pbtxt is correctly set up for ONNX. For more details, refer to the Triton documentation.

faheem-khaskheli Sep 30, 2024

https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

if you look at the heading major features, first one say it support multiple DL frameworks, also github link clearly mention tensorrr, pytorch, tensorflow and all other popular frameworks.

https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton.

also here is support matrix of OS and DL frameworks for triton.
https://github.com/triton-inference-server/backend/blob/main/docs/backend_platform_support_matrix.md.

i want to find out which format work better with triton specially on GPU.

glenn-jocher Sep 30, 2024
Maintainer

For optimal performance with Triton Inference Server, exporting your YOLOv8 model to ONNX format is recommended. While Triton supports TensorRT, it requires a .plan file, which can be generated using TensorRT tools outside of Triton. ONNX generally provides good performance and compatibility across various frameworks. If you encounter issues with .pt files, ensure your configuration aligns with Triton's supported backends. For further details, please consult the Triton documentation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ultralytics

guides/triton-inference-server/ #8241

{{title}}

Replies: 6 comments 23 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Ultralytics

guides/triton-inference-server/ #8241

giscus[bot] bot Feb 16, 2024

guides/triton-inference-server/

Replies: 6 comments · 23 replies

Sachi9soni Feb 16, 2024 — with giscus

pderrenger Feb 18, 2024 Maintainer

Sachi9soni Feb 18, 2024 — with giscus

Visualize the results on the frame

Display the annotated frame

glenn-jocher Feb 18, 2024 Maintainer

Sachi9soni Feb 20, 2024 — with giscus

glenn-jocher Feb 21, 2024 Maintainer

Manishthakur2503 Jun 12, 2024 — with giscus

glenn-jocher Jun 13, 2024 Maintainer

Prerequisites

Example Script for Inference with Triton Client

Explanation

Running the Triton Server

Cleanup

gerasim13 Oct 30, 2024

glenn-jocher Oct 31, 2024 Maintainer

kwayne-08 Jul 16, 2024 — with giscus

glenn-jocher Jul 16, 2024 Maintainer

liulan93 Aug 5, 2024 — with giscus

pderrenger Aug 5, 2024 Maintainer

Tariq-droid Sep 10, 2024 — with giscus

glenn-jocher Sep 10, 2024 Maintainer

zeograd Oct 15, 2024

glenn-jocher Oct 16, 2024 Maintainer

faheem-khaskheli Sep 30, 2024 — with giscus

glenn-jocher Sep 30, 2024 Maintainer

faheem-khaskheli Sep 30, 2024 — with giscus

glenn-jocher Sep 30, 2024 Maintainer

faheem-khaskheli Sep 30, 2024

glenn-jocher Sep 30, 2024 Maintainer

giscus[bot]
bot Feb 16, 2024

Replies: 6 comments 23 replies

Sachi9soni
Feb 16, 2024 — with giscus

pderrenger Feb 18, 2024
Maintainer

glenn-jocher Feb 18, 2024
Maintainer

glenn-jocher Feb 21, 2024
Maintainer

Manishthakur2503
Jun 12, 2024 — with giscus

glenn-jocher Jun 13, 2024
Maintainer

glenn-jocher Oct 31, 2024
Maintainer

kwayne-08
Jul 16, 2024 — with giscus

glenn-jocher Jul 16, 2024
Maintainer

liulan93
Aug 5, 2024 — with giscus

pderrenger Aug 5, 2024
Maintainer

Tariq-droid
Sep 10, 2024 — with giscus

glenn-jocher Sep 10, 2024
Maintainer

glenn-jocher Oct 16, 2024
Maintainer

faheem-khaskheli
Sep 30, 2024 — with giscus

glenn-jocher Sep 30, 2024
Maintainer

glenn-jocher Sep 30, 2024
Maintainer

glenn-jocher Sep 30, 2024
Maintainer