Inference in Triton ensemble model is much slower than single model in Triton #7214

Open
AWallyAllah opened this issue May 14, 2024 · 3 comments
Labels
investigating: The development team is investigating this issue

Comments

@AWallyAllah

Description

I'm using a Triton Server ensemble model for several models connected to each other, say [Model A, Model B, Model C, Model D]. The ensemble model takes an input image and passes it sequentially through this pipeline (Model A, then Model B, then Model C, then Model D). Only one model is a deep-learning model that runs on the GPU (Model B); the other three (Model A, Model C and Model D, typically pre-processing and post-processing models) run on the CPU. I use:

  1. Dynamic batching: each model produces a single-batch image (1, 3, w, h), but I have multiple clients connecting to Triton.
  2. Ragged tensors: Model C produces a variable number of detections.
  3. TensorRT accelerator for the GPU model (GPU utilization from metrics: 0.16).
  4. OpenVINO accelerator for the CPU models (CPU utilization from metrics: 0.997; see the statistics snippet after this list).
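(The utilization figures above come from Triton's metrics.) A related check for whether dynamic batching is actually forming batches is to compare each model's inference count with its execution count through the statistics API. The sketch below uses standard tritonclient calls and an assumed gRPC endpoint; it is not code from the report. If batching works, inference_count grows noticeably faster than execution_count:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")  # assumed gRPC port

# inference_count counts individual inferences, execution_count counts model
# executions (batched runs); a ratio near 1 means requests run one at a time.
stats = client.get_inference_statistics(model_name="ModelB")
for m in stats.model_stats:
    print(m.name, "inference_count:", m.inference_count,
          "execution_count:", m.execution_count)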

It looks like the CPU and GPU utilization is good, except it is not batching correctly! I get a VERY LOW FPS when inferring in Triton compared to outside Triton. For instance, if I deploy only Model B (the deep-learning model) in Triton and do the pre-processing and post-processing outside Triton, it performs much better (~25 FPS). But I get ~6 FPS if I deploy the full ensemble model with the pre-processing and post-processing inside Triton.
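The ensemble configuration itself is not included in the report. For readers, a pipeline like the one described is typically wired with an ensemble_scheduling block along the following lines; the ensemble-level input/output names match the client snippets further below, but the intermediate tensor names ("preprocessed", "raw_detections", "filtered_detections") are assumptions, not the reporter's actual config:

# Hypothetical sketch only: the ensemble config is not part of the report.
name: "detection_ensemble_model"
platform: "ensemble"
max_batch_size: 64

input [
  { name: "input_image" data_type: TYPE_UINT8 dims: [ 1520, 2688, 3 ] },
  { name: "input_camera_id" data_type: TYPE_FP16 dims: [ 1 ] }
]
output [
  { name: "predictions" data_type: TYPE_FP16 dims: [ -1, 7 ] }
]

ensemble_scheduling {
  step [
    {
      model_name: "ModelA"
      model_version: -1
      input_map { key: "input0" value: "input_image" }
      output_map { key: "output0" value: "preprocessed" }
    },
    {
      model_name: "ModelB"
      model_version: -1
      input_map { key: "input0" value: "preprocessed" }
      output_map { key: "output0" value: "raw_detections" }
    },
    {
      model_name: "ModelC"
      model_version: -1
      input_map { key: "input0" value: "raw_detections" }
      output_map { key: "output0" value: "filtered_detections" }
    },
    {
      model_name: "ModelD"
      model_version: -1
      input_map { key: "input0" value: "filtered_detections" }
      input_map { key: "input1" value: "input_camera_id" }
      output_map { key: "input2" value: "predictions" }
    }
  ]
}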

Triton Information

What version of Triton are you using?

nvcr.io/nvidia/tritonserver:24.04-py3

Are you using the Triton container or did you build it yourself?

Triton Container

To Reproduce

  1. Model A Config.pbtxt
name: "ModelA"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_UINT8
    dims: [ 1520, 2688, 3 ]
}
]

output [
{
    name: "output0"
    data_type: TYPE_FP16
    dims: [ 3, 768, 1280 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
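Model A's model.py is not shown (it is requested in the comments below). For context on how dynamic batching reaches a python-backend model: Triton delivers the batched requests as a list to execute(), and the model must return one response per request; whether the work is vectorized across those requests is up to the model code. The following is a generic skeleton only, with tensor names taken from the config above and the preprocessing left as a placeholder:

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # With dynamic batching, `requests` may contain several client requests
        # grouped by Triton; one response must be returned per request.
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(request, "input0").as_numpy()
            # Placeholder preprocessing: resize/normalize to (3, 768, 1280).
            preprocessed = np.zeros((image.shape[0], 3, 768, 1280), dtype=np.float16)
            out = pb_utils.Tensor("output0", preprocessed)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

If the heavy work happens inside this per-request loop, dynamic batching by itself gives little speedup for a python-backend model.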
  2. Model B Config.pbtxt:
name: "ModelB"
backend: "tensorrt"
max_batch_size: 64

input [
  {
    name: "input0"
    data_type: TYPE_FP16
    dims:  [3 , 768, 1280]
  }]
output [
  {
    name: "output0"
    data_type: TYPE_FP16
    dims: [61200, 8]
  }
]

dynamic_batching {
}

instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
]

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "4294967296" }
      parameters { key: "trt_engine_cache_enable" value: "1" }
    }]
  }
}
  3. Model C Config.pbtxt:
name: "ModelC"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_FP16
    dims: [61200, 8]
}
]

output [
{
    name: "output0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
  4. Model D Config.pbtxt:
name: "ModelD"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
    allow_ragged_batch: true
}
]

batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "detection_bytetracker_input"
  }
]

input [
{
    name: "input1"
    data_type: TYPE_FP16
    dims: [ 1 ]
}
]

output [
{
    name: "input2"
    data_type: TYPE_FP16
    dims: [ -1, 7 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
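For readers unfamiliar with ragged batching: with allow_ragged_batch, Triton concatenates the variable-length inputs of the batched requests into one tensor and uses the batch_input (here BATCH_ACCUMULATED_ELEMENT_COUNT) to tell the backend where each request's data ends. A purely conceptual NumPy illustration, not Triton API, simplified to a flat 1-D source input:

import numpy as np

# Conceptual only: recover per-request slices from a ragged, concatenated batch.
ragged = np.arange(9, dtype=np.float16)                    # requests of length 4, 3 and 2, concatenated
accumulated = np.array([4.0, 7.0, 9.0], dtype=np.float32)  # BATCH_ACCUMULATED_ELEMENT_COUNT
ends = accumulated.astype(int)
starts = np.concatenate(([0], ends[:-1]))
per_request = [ragged[s:e] for s, e in zip(starts, ends)]  # lengths 4, 3, 2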
  5. Inference from the clients uses cudashm (these are code snippets, not the entire code):
self.triton_client = grpcclient.InferenceServerClient(url=self.triton_server_ip, verbose=self.verbose)
self.input = [grpcclient.InferInput("input_image", (1, self.input_camera_height, self.input_camera_width, 3), "UINT8"),
              grpcclient.InferInput("input_camera_id", (1, 1), "FP16")]
self.output = grpcclient.InferRequestedOutput("predictions")

...

# Create Output in Shared Memory and store shared memory handles
self.shm_op_handle = cudashm.create_shared_memory_region(f"output_data_{self.camera_id}",
                                                         self.output_byte_size, 0)
self.shm_ip_image_handle = cudashm.create_shared_memory_region(f"input_data_image_{self.camera_id}",
                                                         self.input_image_byte_size, 0)
self.shm_ip_camera_handle = cudashm.create_shared_memory_region(f"input_data_camera_{self.camera_id}",
                                                         self.input_camera_byte_size, 0)
# Register Output shared memory with Triton Server
self.triton_client.register_cuda_shared_memory(f"output_data_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_op_handle), 0,
                                               self.output_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_image_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_ip_image_handle),
                                               0, self.input_image_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_camera_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_ip_camera_handle),
                                               0, self.input_camera_byte_size)

...

self.input[0].set_shared_memory(f"input_data_image_{self.camera_id}", self.input_image_byte_size)
self.input[1].set_shared_memory(f"input_data_camera_{self.camera_id}", self.input_camera_byte_size)
self.output.set_shared_memory(f"output_data_{self.camera_id}", self.output_byte_size)

...

# Set CUDA Shared memory
cudashm.set_shared_memory_region(self.shm_ip_image_handle, [frame])
cudashm.set_shared_memory_region(self.shm_ip_camera_handle, [np.expand_dims(np.array(self.camera_id), 0).astype(np.float16)])

# Inference with server
results = self.triton_client.infer(model_name="detection_ensemble_model", inputs=[self.input[0], self.input[1]],
                                   outputs=[self.output], client_timeout=self.client_timeout)
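The snippets stop before the results are read back; with CUDA shared memory the output data is retrieved from the registered region rather than from the response body. A sketch following the standard tritonclient pattern, assuming the handles created above:

from tritonclient.utils import triton_to_np_dtype

# The response only carries metadata for outputs placed in shared memory;
# the data itself is read back from the registered CUDA region.
output = results.get_output("predictions")
if output is not None:
    predictions = cudashm.get_contents_as_numpy(
        self.shm_op_handle, triton_to_np_dtype(output.datatype), output.shape)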

Expected behavior

Each model produces an output with batch_size=1, since I have multiple Triton clients that each send a single image at a time. I expected that putting the pre-processing and post-processing models into Triton would be faster with dynamic batching: I expect Triton to concatenate requests along the batch dimension, so that, for example, if 10 requests arrive at the same time, each with shape (1, 3, 768, 1280), Triton batches them as (10, 3, 768, 1280) and processes them all at once. Instead I get a VERY LOW FPS. It looks like the requests are still being processed sequentially instead of being batched!
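One detail that matters for whether such batches can form at all: an empty dynamic_batching {} block leaves max_queue_delay_microseconds at its default of 0, so the scheduler only combines requests that are already queued when a model instance becomes free. Allowing a short queue delay gives concurrent clients a window in which to be merged. A sketch with illustrative values only, not a recommendation from this thread:

dynamic_batching {
  # Illustrative values: wait up to ~1 ms for more requests before dispatching,
  # and prefer batch sizes of 8 or 16 when enough requests are queued.
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 1000
}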

@statiraju added the investigating label May 15, 2024
@statiraju

opened [DLIS-6702]

@Tabrizian
Member

Tabrizian commented May 15, 2024

@AWallyAllah By any chance are you using PyTorch in your Python model? Could you share the code for model A?

@krishung5
Contributor

@AWallyAllah Could you please share the model file with us so that we can further investigate?
