Triton Inference Server Documentation

Installation | Getting Started | User Guide | API Guide | Additional Resources | Customization Guide

New to Triton Inference Server? Make use of these tutorials to begin your Triton journey!

Installation

Before you can use the Triton Docker image, you must install Docker. If you plan on using a GPU for inference, you must also install the NVIDIA Container Toolkit. DGX users should follow Preparing to use NVIDIA Containers.

Pull the image using the following command.

$ docker pull nvcr.io/nvidia/tritonserver:<yy.mm>-py3

Where <yy.mm> is the version of Triton that you want to pull. For a complete list of all the variants and versions of the Triton Inference Server Container, visit the NGC Page. More information about customizing the Triton Container can be found in this section of the User Guide.

Getting Started

This guide covers the simplest possible workflow for deploying a model using Triton Inference Server.

Triton Inference Server has a considerable list of versatile and powerful features. All new users are encouraged to explore the User Guide and the Additional Resources sections for the features most relevant to their use case.

User Guide

The User Guide describes how to configure Triton, organize and configure your models, use the C++ and Python clients, etc. This guide includes the following:

Model Repository

Model Repositories are the organizational hub for using Triton. All models, configuration files, and additional resources needed to serve the models are housed inside a model repository.
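
As a rough sketch (the model and file names here are hypothetical), a model repository is a directory with one sub-directory per model and one numbered sub-directory per model version:

    model_repository/
      my_model/
        config.pbtxt
        1/
          model.onnx
        2/
          model.onnx

The repository path is then passed to tritonserver via its --model-repository option.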

Model Configuration

A Model Configuration file is where you set the model-level options, such as output tensor reshaping and dynamic batch sizing.
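
For illustration only, the hypothetical fragment below shows what two such options can look like in a config.pbtxt: an output reshape and an (empty) dynamic batching stanza. The tensor name and shapes are made up.

    # fragment of a config.pbtxt (not a complete configuration)
    output [
      {
        name: "OUTPUT0"
        data_type: TYPE_FP32
        # shape reported to clients ...
        dims: [ 1000 ]
        # ... while the framework model actually produces [ 1, 1000 ]
        reshape: { shape: [ 1, 1000 ] }
      }
    ]
    # enable dynamic batching with default settings
    dynamic_batching { }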

Required Model Configuration

Triton Inference Server requires a minimum set of parameters to be filled in the Model Configuration. These required parameters essentially describe the structure of the model. For TensorFlow, ONNX, and TensorRT models, users can rely on Triton to Auto Generate the minimum required model configuration.
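
A minimal hand-written configuration typically just names the backend (or platform), the maximum batch size, and the input and output tensors. The sketch below assumes a hypothetical ONNX model with one input and one output; the names, types, and shapes are placeholders.

    name: "example_model"
    backend: "onnxruntime"
    max_batch_size: 8
    input [
      {
        name: "INPUT0"
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
      }
    ]
    output [
      {
        name: "OUTPUT0"
        data_type: TYPE_FP32
        dims: [ 1000 ]
      }
    ]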

Versioning Models

Users need the ability to save and serve different versions of models based on business requirements. Triton allows users to set policies to make available different versions of the model as needed. Learn More.
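
Version policies are set in the model configuration. As an illustrative example, the snippet below keeps only the two most recent versions available; the all and specific policies are the other options.

    # serve only the two most recently added versions of this model
    version_policy: { latest { num_versions: 2 } }

    # alternatives:
    # version_policy: { all { } }
    # version_policy: { specific { versions: [ 1, 3 ] } }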

Instance Groups

Triton allows users to run multiple instances of the same model. Users can specify how many instances (copies) of a model to load and whether to use GPU or CPU. If the model is being loaded on GPU, users can also select which GPUs to use. Learn more.
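
As an illustrative instance_group stanza (the counts and GPU indices are arbitrary), the configuration below loads two copies of the model across GPUs 0 and 1 and two more copies on the CPU:

    instance_group [
      {
        count: 2
        kind: KIND_GPU
        gpus: [ 0, 1 ]
      },
      {
        count: 2
        kind: KIND_CPU
      }
    ]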

Optimization Settings

The Model Configuration ModelOptimizationPolicy property is used to specify optimization and prioritization settings for a model. These settings control if/how a model is optimized by the backend and how it is scheduled and executed by Triton. See the ModelConfig Protobuf and Optimization Documentation for the currently available settings.
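
As one hedged example of such settings, the snippet below asks the ONNX Runtime backend to use the TensorRT execution accelerator with FP16 precision; the available accelerators and parameters depend on the backend, so check the Optimization Documentation for what applies to your model.

    optimization {
      execution_accelerators {
        gpu_execution_accelerator: [
          {
            name: "tensorrt"
            parameters { key: "precision_mode" value: "FP16" }
          }
        ]
      }
    }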

Scheduling and Batching

Triton supports batching individual inference requests together to improve compute resource utilization. This is important because individual requests typically do not saturate GPU resources, leaving the parallelism provided by GPUs underutilized. Learn more about Triton's Batcher and Scheduler.
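
Dynamic batching is enabled per model in its configuration. A typical (illustrative) stanza looks like the following, where the preferred batch sizes and queue delay are values you would tune for your own model:

    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      # wait up to 100 microseconds to form a larger batch
      max_queue_delay_microseconds: 100
    }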

Rate Limiter

The rate limiter manages the rate at which requests are scheduled on model instances by Triton. It operates across all models loaded in Triton to allow cross-model prioritization. Learn more.
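
Rate limiting is expressed per instance group as resources and a priority. The sketch below is illustrative; the resource name and counts are made up, and the server itself must be started with rate limiting enabled (for example with --rate-limit=execution_count).

    instance_group [
      {
        count: 1
        kind: KIND_GPU
        rate_limiter {
          # this instance needs 4 units of the user-defined resource "R1" to run
          resources [
            {
              name: "R1"
              count: 4
            }
          ]
          # relative scheduling priority among instances
          priority: 2
        }
      }
    ]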

Model Warmup

For a few of the backends (check Additional Resources), some or all initialization is deferred until the first inference request is received. The benefit is resource conservation, but it comes with the downside of the initial requests being processed more slowly than expected. Users can "warm up" the model by instructing Triton to initialize it ahead of time. Learn more.
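
Warmup requests are described in the model configuration and sent to the model when it is loaded. The snippet below is a sketch with made-up tensor names and shapes, using zero-filled data for a single warmup request.

    model_warmup [
      {
        name: "zero_sample_warmup"
        batch_size: 1
        inputs {
          key: "INPUT0"
          value: {
            data_type: TYPE_FP32
            dims: [ 3, 224, 224 ]
            zero_data: true
          }
        }
      }
    ]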

Inference Request/Response Cache

Triton has a feature which allows inference responses to be cached, so repeated requests can be served from the cache instead of re-executing the model. Learn More.
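
Caching is opted into per model in its configuration, and a cache implementation must also be enabled when the server is started (for example via the --cache-config option). A minimal, hedged example:

    # allow responses for this model to be cached
    response_cache {
      enable: true
    }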

Model Pipeline

Building ensembles is as easy as adding an additional configuration file that outlines the specific flow of tensors from one model to another. Any additional changes required by the model ensemble can be made in the existing (individual) model configurations.
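
As a hedged sketch of such an ensemble configuration (all model and tensor names below are hypothetical), a two-step pipeline that feeds a preprocessing model into a classifier might look like this:

    name: "ensemble_example"
    platform: "ensemble"
    max_batch_size: 8
    input [
      {
        name: "RAW_IMAGE"
        data_type: TYPE_UINT8
        dims: [ -1 ]
      }
    ]
    output [
      {
        name: "CLASSIFICATION"
        data_type: TYPE_FP32
        dims: [ 1000 ]
      }
    ]
    ensemble_scheduling {
      step [
        {
          model_name: "preprocess_model"
          model_version: -1
          input_map { key: "IN" value: "RAW_IMAGE" }
          output_map { key: "OUT" value: "preprocessed_image" }
        },
        {
          model_name: "classification_model"
          model_version: -1
          input_map { key: "INPUT0" value: "preprocessed_image" }
          output_map { key: "OUTPUT0" value: "CLASSIFICATION" }
        }
      ]
    }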

Model Management

Users can specify policies in the model configuration for loading and unloading of models. This section covers user-selectable policy details.

Metrics

Triton provides Prometheus metrics like GPU Utilization, Memory Usage, Latency and more. Learn about available metrics.

Framework Custom Operations

Some frameworks provide the option of building custom layers/operations. These can be added to the specific Triton Backends for those frameworks. Learn more.

Client Libraries and Examples

Use the Triton Client API to integrate client applications with Triton over its HTTP/gRPC network API, or integrate applications directly with Triton using CUDA shared memory to remove network overhead.

Cancelling Inference Requests

Triton can detect and handle requests that have been cancelled from the client side. This document discusses the scope and limitations of the feature.

Performance Analysis

Understanding inference performance is key to better resource utilization. Use Triton's tools to customize your deployment.

Jetson and JetPack

Triton can be deployed on edge devices. Explore resources and examples.

Resources

The following resources are recommended to explore the full suite of Triton Inference Server's functionalities.

  • Clients: Triton Inference Server comes with C++, Python, and Java APIs with which users can send HTTP/REST or gRPC requests (with possible extensions for other languages). Explore the client repository for examples and documentation.

  • Configuring Deployment: Triton comes with three tools which can be used to configure deployment settings, measure performance, and recommend optimizations.

    • Model Analyzer: Model Analyzer is a CLI tool built to recommend deployment configurations for Triton Inference Server based on the user's Quality of Service requirements. It also generates detailed reports about model performance to summarize the benefits and trade-offs of different configurations.
    • Perf Analyzer: Perf Analyzer is a CLI application built to generate inference requests and measure the latency of those requests and the throughput of the model being served.
    • Model Navigator: The Triton Model Navigator is a tool that automates the process of moving a model from source to the optimal format and configuration for deployment on Triton Inference Server. The tool supports exporting a model from source to all possible formats and applies the Triton Inference Server backend optimizations.
  • Backends: Triton supports a wide variety of frameworks used to run models. Users can extend this functionality by creating custom backends.

    • PyTorch: Widely used Open Source DL Framework
    • TensorFlow: Widely used Open Source DL Framework
    • TensorRT: NVIDIA TensorRT is an inference acceleration SDK that provides a range of graph optimizations, kernel optimizations, use of lower precision, and more.
    • ONNX: ONNX Runtime is a cross-platform inference and training machine-learning accelerator.
    • OpenVINO: OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference.
    • PaddlePaddle: Widely used Open Source DL Framework
    • Python: Users can add custom business logic, or any python code/model for serving requests.
    • Forest Inference Library: Backend built for forest models trained by several popular machine learning frameworks (including XGBoost, LightGBM, Scikit-Learn, and cuML)
    • DALI: NVIDIA DALI is a Data Loading Library purpose-built to accelerate the pre-processing and data loading steps in a Deep Learning pipeline.
    • HugeCTR: HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates.
    • Managed Stateful Models: This backend automatically manages the input and output states of a model. The states are associated with a sequence ID and need to be tracked for inference requests associated with that sequence ID.
    • Faster Transformer: NVIDIA FasterTransformer (FT) is a library implementing an accelerated engine for the inference of transformer-based neural networks, with a special emphasis on large models, spanning many GPUs and nodes in a distributed manner.
    • Building Custom Backends
    • Sample Custom Backend: Repeat_backend: Backend built to demonstrate sending zero, one, or multiple responses per request.

Customization Guide

This guide describes how to build and test Triton and also how Triton can be extended with new functionality.