Inference in Triton ensemble model is much slower than single model in Triton #7214

Open
AWallyAllah opened this issue May 14, 2024 · 3 comments
Labels
investigating: The development team is investigating this issue

Comments

@AWallyAllah

Description

I'm using a Triton Server ensemble model for several models connected to each other, say [Model A, Model B, Model C, Model D]. The ensemble model takes an input image and passes it sequentially through this pipeline (Model A, then Model B, then Model C, then Model D). Only one model is a deep-learning model that runs on the GPU (Model B); the other three (Model A, Model C and Model D, typically pre-processing and post-processing models) run on the CPU. I use:

  1. Dynamic batching: each model produces a single-batch image (1, 3, w, h), but I have multiple clients connecting to Triton.
  2. Ragged tensors: Model C produces a variable number of detections.
  3. TensorRT accelerator for the GPU model (GPU utilization from metrics: 0.16).
  4. OpenVINO accelerator for the CPU models (CPU utilization from metrics: 0.997; see the statistics snippet after this list).
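(The utilization figures above come from Triton's metrics.) A related check for whether dynamic batching is actually forming batches is to compare each model's inference count with its execution count through the statistics API. The sketch below uses standard tritonclient calls and an assumed gRPC endpoint; it is not code from the report. If batching works, inference_count grows noticeably faster than execution_count:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")  # assumed gRPC port

# inference_count counts individual inferences, execution_count counts model
# executions (batched runs); a ratio near 1 means requests run one at a time.
stats = client.get_inference_statistics(model_name="ModelB")
for m in stats.model_stats:
    print(m.name, "inference_count:", m.inference_count,
          "execution_count:", m.execution_count)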

It looks like the CPU and GPU utilization is good, except it is not batching correctly! I get a VERY LOW FPS when inferring in Triton compared to outside Triton. For instance, if I deploy only Model B (the deep-learning model) in Triton and do the pre-processing and post-processing outside Triton, it performs much better (~25 FPS). But I get ~6 FPS if I deploy the full ensemble model with the pre-processing and post-processing inside Triton.
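The ensemble configuration itself is not included in the report. For readers, a pipeline like the one described is typically wired with an ensemble_scheduling block along the following lines; the ensemble-level input/output names match the client snippets further below, but the intermediate tensor names ("preprocessed", "raw_detections", "filtered_detections") are assumptions, not the reporter's actual config:

# Hypothetical sketch only: the ensemble config is not part of the report.
name: "detection_ensemble_model"
platform: "ensemble"
max_batch_size: 64

input [
  { name: "input_image" data_type: TYPE_UINT8 dims: [ 1520, 2688, 3 ] },
  { name: "input_camera_id" data_type: TYPE_FP16 dims: [ 1 ] }
]
output [
  { name: "predictions" data_type: TYPE_FP16 dims: [ -1, 7 ] }
]

ensemble_scheduling {
  step [
    {
      model_name: "ModelA"
      model_version: -1
      input_map { key: "input0" value: "input_image" }
      output_map { key: "output0" value: "preprocessed" }
    },
    {
      model_name: "ModelB"
      model_version: -1
      input_map { key: "input0" value: "preprocessed" }
      output_map { key: "output0" value: "raw_detections" }
    },
    {
      model_name: "ModelC"
      model_version: -1
      input_map { key: "input0" value: "raw_detections" }
      output_map { key: "output0" value: "filtered_detections" }
    },
    {
      model_name: "ModelD"
      model_version: -1
      input_map { key: "input0" value: "filtered_detections" }
      input_map { key: "input1" value: "input_camera_id" }
      output_map { key: "input2" value: "predictions" }
    }
  ]
}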

Triton Information

What version of Triton are you using?

nvcr.io/nvidia/tritonserver:24.04-py3

Are you using the Triton container or did you build it yourself?

Triton Container

To Reproduce

  1. Model A Config.pbtxt
name: "ModelA"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_UINT8
    dims: [ 1520, 2688, 3 ]
}
]

output [
{
    name: "output0"
    data_type: TYPE_FP16
    dims: [ 3, 768, 1280 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
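Model A's model.py is not shown (it is requested in the comments below). For context on how dynamic batching reaches a python-backend model: Triton delivers the batched requests as a list to execute(), and the model must return one response per request; whether the work is vectorized across those requests is up to the model code. The following is a generic skeleton only, with tensor names taken from the config above and the preprocessing left as a placeholder:

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # With dynamic batching, `requests` may contain several client requests
        # grouped by Triton; one response must be returned per request.
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(request, "input0").as_numpy()
            # Placeholder preprocessing: resize/normalize to (3, 768, 1280).
            preprocessed = np.zeros((image.shape[0], 3, 768, 1280), dtype=np.float16)
            out = pb_utils.Tensor("output0", preprocessed)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

If the heavy work happens inside this per-request loop, dynamic batching by itself gives little speedup for a python-backend model.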
  2. Model B Config.pbtxt:
name: "ModelB"
backend: "tensorrt"
max_batch_size: 64

input [
  {
    name: "input0"
    data_type: TYPE_FP16
    dims:  [3 , 768, 1280]
  }]
output [
  {
    name: "output0"
    data_type: TYPE_FP16
    dims: [61200, 8]
  }
]

dynamic_batching {
}

instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
]

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "4294967296" }
      parameters { key: "trt_engine_cache_enable" value: "1" }
    }]
  }
}
  3. Model C Config.pbtxt:
name: "ModelC"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_FP16
    dims: [61200, 8]
}
]

output [
{
    name: "output0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
  4. Model D Config.pbtxt:
name: "ModelD"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
    allow_ragged_batch: true
}
]

batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "detection_bytetracker_input"
  }
]

input [
{
    name: "input1"
    data_type: TYPE_FP16
    dims: [ 1 ]
}
]

output [
{
    name: "input2"
    data_type: TYPE_FP16
    dims: [ -1, 7 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
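For readers unfamiliar with ragged batching: with allow_ragged_batch, Triton concatenates the variable-length inputs of the batched requests into one tensor and uses the batch_input (here BATCH_ACCUMULATED_ELEMENT_COUNT) to tell the backend where each request's data ends. A purely conceptual NumPy illustration, not Triton API, simplified to a flat 1-D source input:

import numpy as np

# Conceptual only: recover per-request slices from a ragged, concatenated batch.
ragged = np.arange(9, dtype=np.float16)                    # requests of length 4, 3 and 2, concatenated
accumulated = np.array([4.0, 7.0, 9.0], dtype=np.float32)  # BATCH_ACCUMULATED_ELEMENT_COUNT
ends = accumulated.astype(int)
starts = np.concatenate(([0], ends[:-1]))
per_request = [ragged[s:e] for s, e in zip(starts, ends)]  # lengths 4, 3, 2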
  5. Inference from the clients uses cudashm (these are code snippets, not the entire code):
self.triton_client = grpcclient.InferenceServerClient(url=self.triton_server_ip, verbose=self.verbose)
self.input = [grpcclient.InferInput("input_image", (1, self.input_camera_height, self.input_camera_width, 3), "UINT8"),
              grpcclient.InferInput("input_camera_id", (1, 1), "FP16")]
self.output = grpcclient.InferRequestedOutput("predictions")

...

# Create Output in Shared Memory and store shared memory handles
self.shm_op_handle = cudashm.create_shared_memory_region(f"output_data_{self.camera_id}",
                                                         self.output_byte_size, 0)
self.shm_ip_image_handle = cudashm.create_shared_memory_region(f"input_data_image_{self.camera_id}",
                                                         self.input_image_byte_size, 0)
self.shm_ip_camera_handle = cudashm.create_shared_memory_region(f"input_data_camera_{self.camera_id}",
                                                         self.input_camera_byte_size, 0)
# Register Output shared memory with Triton Server
self.triton_client.register_cuda_shared_memory(f"output_data_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_op_handle), 0,
                                               self.output_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_image_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_ip_image_handle),
                                               0, self.input_image_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_camera_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_ip_camera_handle),
                                               0, self.input_camera_byte_size)

...

self.input[0].set_shared_memory(f"input_data_image_{self.camera_id}", self.input_image_byte_size)
self.input[1].set_shared_memory(f"input_data_camera_{self.camera_id}", self.input_camera_byte_size)
self.output.set_shared_memory(f"output_data_{self.camera_id}", self.output_byte_size)

...

# Set CUDA Shared memory
cudashm.set_shared_memory_region(self.shm_ip_image_handle, [frame])
cudashm.set_shared_memory_region(self.shm_ip_camera_handle, [np.expand_dims(np.array(self.camera_id), 0).astype(np.float16)])

# Inference with server
results = self.triton_client.infer(model_name="detection_ensemble_model", inputs=[self.input[0], self.input[1]],
                                   outputs=[self.output], client_timeout=self.client_timeout)
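The snippets stop before the results are read back; with CUDA shared memory the output data is retrieved from the registered region rather than from the response body. A sketch following the standard tritonclient pattern, assuming the handles created above:

from tritonclient.utils import triton_to_np_dtype

# The response only carries metadata for outputs placed in shared memory;
# the data itself is read back from the registered CUDA region.
output = results.get_output("predictions")
if output is not None:
    predictions = cudashm.get_contents_as_numpy(
        self.shm_op_handle, triton_to_np_dtype(output.datatype), output.shape)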

Expected behavior

Each model produces an output with batch_size=1, since I have multiple Triton clients that each send a single image at a time. I expected that putting the pre-processing and post-processing models into Triton would be faster with dynamic batching: I expect Triton to concatenate requests along the batch dimension, so that, for example, if 10 requests arrive at the same time, each with shape (1, 3, 768, 1280), Triton batches them as (10, 3, 768, 1280) and processes them all at once. Instead I get a VERY LOW FPS. It looks like the requests are still being processed sequentially instead of being batched!
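One detail that matters for whether such batches can form at all: an empty dynamic_batching {} block leaves max_queue_delay_microseconds at its default of 0, so the scheduler only combines requests that are already queued when a model instance becomes free. Allowing a short queue delay gives concurrent clients a window in which to be merged. A sketch with illustrative values only, not a recommendation from this thread:

dynamic_batching {
  # Illustrative values: wait up to ~1 ms for more requests before dispatching,
  # and prefer batch sizes of 8 or 16 when enough requests are queued.
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 1000
}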

@statiraju added the investigating label May 15, 2024
@statiraju

opened [DLIS-6702]

@Tabrizian
Member

Tabrizian commented May 15, 2024

@AWallyAllah By any chance are you using PyTorch in your Python model? Could you share the code for model A?

@krishung5
Contributor

@AWallyAllah Could you please share the model file with us so that we can further investigate?
