
ORT-TRT backend uses too much CPU memory #7180

Open
ShuaiShao93 opened this issue May 2, 2024 · 1 comment
Description
When using the ORT-TRT backend on GPU, CPU memory usage is as high as when we run inference on CPU.

Triton Information
What version of Triton are you using?
2.45.0

Are you using the Triton container or did you build it yourself?
container

To Reproduce

  • Use any ONNX model (e.g., DeBERTa)
  • Use the ONNX Runtime backend with the TensorRT execution provider (EP)
  • Start the server on a T4 machine with docker run
  • Verify that the model runs on the GPU
  • Check CPU memory usage: it is ~14 GB
  • Force the model to run on CPU instead
  • Check CPU memory usage: it is no higher than the 14 GB seen on GPU
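For reference, a minimal model configuration that enables the TensorRT EP through Triton's ONNX Runtime backend looks roughly like the sketch below. The model name and tensor shapes are placeholders, not taken from the report; the `optimization` block follows the onnxruntime_backend's documented `gpu_execution_accelerator` syntax.

```
# config.pbtxt — sketch, assuming a DeBERTa-style model with two int64 inputs
name: "deberta"                # hypothetical model name
backend: "onnxruntime"
max_batch_size: 8

input [
  { name: "input_ids",      data_type: TYPE_INT64, dims: [ -1 ] },
  { name: "attention_mask", data_type: TYPE_INT64, dims: [ -1 ] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [ -1 ] }
]

instance_group [ { kind: KIND_GPU, count: 1 } ]

# Route execution through the TensorRT EP instead of plain CUDA
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
```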

Expected behavior
CPU memory usage should be very low when the model runs with the ORT-TRT backend on GPU.

@ShuaiShao93
Author

A similar issue was reported before: #5392
