Problem
I am using the Python backend to deploy an LLM (14B parameters, ~28 GB) on 2 GPUs (16 GB each). The model is too large to fit on a single GPU, but Triton Inference Server keeps creating 2 model instances (1 instance per GPU), which results in a CUDA out-of-memory error. And instance_group does not seem to help in this case.
Features
Is it possible to control the total number of model instances created? In my case, my machine can only support 1 model instance, but this cannot be coded in model.py or config.pbtxt. (A sketch of the instance_group stanza I mean is below.)
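For context, when config.pbtxt has no instance_group, Triton defaults to one execution instance per visible GPU, which is what produces the two copies here. A minimal sketch of a stanza that requests a single instance and leaves device placement to the model itself (assuming KIND_MODEL behaves this way in 23.11) would be:

```
# Sketch only: count: 1 asks for a single instance in total, and KIND_MODEL
# tells Triton not to pin it to one GPU, so model.py controls placement.
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
```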
More details
Triton version: tritonserver:23.11-py3
transformers:
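For a single instance to actually use both GPUs, model.py has to shard the weights itself. Below is a minimal sketch assuming a Hugging Face checkpoint and the accelerate-backed device_map="auto" option; the checkpoint path and the "prompt"/"completion" tensor names are placeholders, not from the original setup:

```python
import json

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path; substitute the real one.
MODEL_PATH = "/models/llm-14b"


class TritonPythonModel:
    def initialize(self, args):
        # With instance_group count: 1 and KIND_MODEL, Triton launches exactly
        # one instance and leaves GPU placement to this code. device_map="auto"
        # (via accelerate) shards the ~28 GB of fp16 weights across both
        # visible 16 GB GPUs instead of loading a full copy per GPU.
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
        self.model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH,
            torch_dtype=torch.float16,
            device_map="auto",
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # Assumes a TYPE_STRING input "prompt" and output "completion"
            # declared in config.pbtxt.
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt")
            text = prompt.as_numpy()[0].decode("utf-8")
            inputs = self.tokenizer(text, return_tensors="pt").to(self.model.device)
            output_ids = self.model.generate(**inputs, max_new_tokens=128)
            completion = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
            out = pb_utils.Tensor(
                "completion", np.array([completion.encode("utf-8")], dtype=object)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```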
Hello, what does your model configuration file config.pbtxt look like? Also, Triton is up to 24.03 right now. Is there a reason why you are not using the latest version?
The cloud platform I am using hasn't introduced the newest version of Triton yet. The config.pbtxt looks like this: