
Python Backend: one model instance over multiple GPUs #7169

Open

CollinHU opened this issue Apr 29, 2024 · 2 comments
Problem
I am using the Python backend to deploy an LLM (14B parameters, ~28 GB) on 2 GPUs (16 GB each). The model is too large to fit on a single GPU, but Triton Inference Server keeps creating 2 model instances (1 instance per GPU), which results in a CUDA out-of-memory error. The instance_group setting does not seem to help in this case.

Features
Is it possible to control the total number of model instances created? In my case, the machine can only support 1 model instance, but this does not seem to be configurable in model.py or config.pbtxt.
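
For context, with kind: KIND_GPU Triton creates `count` instances on each GPU listed in `gpus`, so a group like the following (values are only an illustration, not this issue's actual config) still yields two instances in total, one per GPU:

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0, 1]
  }
]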

More details
Triton version: tritonserver:23.11-py3
Model loading with transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Inside the Python backend model: device_map="auto" shards the weights
# across all GPUs visible to this model instance.
self.model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)
self.tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side='left')
@jbkyang-nvi (Contributor)

Hello, what does your model configuration file config.pbtxt look like? Also, Triton is up to 24.03 right now. Is there a reason why you are not using the latest version?

@CollinHU (Author) commented May 1, 2024

> Hello, what does your model configuration file config.pbtxt look like? Also, Triton is up to 24.03 right now. Is there a reason why you are not using the latest version?

The cloud platform I am using hasn't introduced the newest version of Triton yet. The config.pbtxt looks like this:

name: "llm"
backend: "python"
max_batch_size: 4
input [
{
name: "prompt"
data_type: TYPE_STRING
dims: [1]
}
]
output [
{
name: "generated_text"
data_type: TYPE_STRING
dims: [1]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0, 1]
}

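One possible workaround, assuming the intent is for the model itself to manage device placement via device_map="auto": use kind: KIND_MODEL, which makes Triton create instances without pinning them to a specific GPU, so count: 1 yields a single instance in total. This is only a sketch of the idea, not a verified fix for this setup:

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]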