
Python Backend: one model instance over multiple GPUs #7169

Open

CollinHU opened this issue Apr 29, 2024 · 2 comments
Problem
I am using the Python backend to deploy an LLM (14B parameters, ~28 GB) on 2 GPUs (16 GB each). The model is too large to fit on a single GPU, but Triton Inference Server keeps creating 2 model instances (1 instance per GPU), which results in a CUDA out-of-memory error. The instance_group setting does not seem to help in this case.

Features
Is it possible to control the total number of model instances created? In my case, the machine can only support 1 model instance, but this does not seem to be configurable in model.py or config.pbtxt.
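
For context, with kind: KIND_GPU Triton creates `count` instances on each GPU listed in `gpus`, so a group like the following (values are only an illustration, not this issue's actual config) still yields two instances in total, one per GPU:

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0, 1]
  }
]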

More details
Triton version: tritonserver:23.11-py3
Model loading with transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Inside the Python backend model: device_map="auto" shards the weights
# across all GPUs visible to this model instance.
self.model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)
self.tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side='left')
@jbkyang-nvi (Contributor)

Hello, what does your model configuration file config.pbtxt look like? Also, Triton is up to 24.03 right now. Is there a reason why you are not using the latest version?

@CollinHU (Author) commented May 1, 2024

> Hello, what does your model configuration file config.pbtxt look like? Also, Triton is up to 24.03 right now. Is there a reason why you are not using the latest version?

The cloud platform I am using hasn't introduced the newest version of Triton yet. The config.pbtxt looks like this:

name: "llm"
backend: "python"
max_batch_size: 4
input [
{
name: "prompt"
data_type: TYPE_STRING
dims: [1]
}
]
output [
{
name: "generated_text"
data_type: TYPE_STRING
dims: [1]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0, 1]
}

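One possible workaround, assuming the intent is for the model itself to manage device placement via device_map="auto": use kind: KIND_MODEL, which makes Triton create instances without pinning them to a specific GPU, so count: 1 yields a single instance in total. This is only a sketch of the idea, not a verified fix for this setup:

instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]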