
ORT-TRT backend uses too much CPU memory #7180

Open
ShuaiShao93 opened this issue May 2, 2024 · 1 comment
Description
When using the ORT-TRT backend on GPU, CPU memory usage is as high as when we run inference on CPU.

Triton Information
What version of Triton are you using?
2.45.0

Are you using the Triton container or did you build it yourself?
container

To Reproduce

  • Use any ONNX model (e.g., DeBERTa)
  • Use the ONNX Runtime backend with the TensorRT execution provider (EP)
  • Start the server on a T4 machine with docker run
  • Verify that the model runs on the GPU
  • Check CPU memory usage: it is ~14 GB
  • Force the model to run on CPU instead
  • Check CPU memory usage: it is no higher than the 14 GB seen on GPU
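For reference, a minimal model configuration that enables the TensorRT EP through Triton's ONNX Runtime backend looks roughly like the sketch below. The model name and tensor shapes are placeholders, not taken from the report; the `optimization` block follows the onnxruntime_backend's documented `gpu_execution_accelerator` syntax.

```
# config.pbtxt — sketch, assuming a DeBERTa-style model with two int64 inputs
name: "deberta"                # hypothetical model name
backend: "onnxruntime"
max_batch_size: 8

input [
  { name: "input_ids",      data_type: TYPE_INT64, dims: [ -1 ] },
  { name: "attention_mask", data_type: TYPE_INT64, dims: [ -1 ] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [ -1 ] }
]

instance_group [ { kind: KIND_GPU, count: 1 } ]

# Route execution through the TensorRT EP instead of plain CUDA
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
```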

Expected behavior
CPU memory usage should be very low when the model runs with the ORT-TRT backend on GPU.

@ShuaiShao93
Author

A similar issue was reported before: #5392
