Is onnxruntime-genai supported? #7182

Open
jackylu0124 opened this issue May 4, 2024 · 2 comments
Labels
question (Further information is requested)

Comments

@jackylu0124

Hey all, I have a quick question: is onnxruntime-genai (https://onnxruntime.ai/docs/genai/api/python.html) supported in Triton Inference Server's ONNX Runtime backend? I couldn't find anything relevant in the documentation. Thanks!

@nnshah1 added the question label on May 4, 2024
nnshah1 (Contributor) commented May 4, 2024

@jackylu0124 Support for onnxruntime-genai is currently a work in progress. The Python bindings should work within the Python backend, but we haven't had a chance to test that ourselves yet.

That being said, we are actively investigating support. Can you share more about your use case and the timeline on which you would need support?
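
For concreteness, here is a rough, untested sketch of what "the Python bindings within the Python backend" could look like. The I/O tensor names (text_input / text_output), the model-directory layout, and the onnxruntime-genai calls (og.Model, og.Tokenizer, og.GeneratorParams, Model.generate) are assumptions based on the public Python API docs, not a configuration Triton ships or has verified:

```python
# model.py -- hypothetical Triton Python backend wrapper around onnxruntime-genai.
# Untested sketch: assumes a model repository layout of
#   <repo>/onnx_genai_llm/1/{model.py, genai_config.json, *.onnx, tokenizer files}
# and the onnxruntime-genai Python API as documented.
import numpy as np
import onnxruntime_genai as og
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the genai-exported model from this model's version directory.
        model_dir = f'{args["model_repository"]}/{args["model_version"]}'
        self.model = og.Model(model_dir)
        self.tokenizer = og.Tokenizer(self.model)

    def execute(self, requests):
        responses = []
        for request in requests:
            # "text_input" / "text_output" are illustrative tensor names,
            # declared as TYPE_STRING in the (hypothetical) config.pbtxt.
            prompt = (pb_utils.get_input_tensor_by_name(request, "text_input")
                      .as_numpy()[0].decode("utf-8"))

            params = og.GeneratorParams(self.model)
            params.set_search_options(max_length=256)
            params.input_ids = self.tokenizer.encode(prompt)

            output_tokens = self.model.generate(params)
            text = self.tokenizer.decode(output_tokens[0])

            out = pb_utils.Tensor(
                "text_output",
                np.array([text.encode("utf-8")], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```

The corresponding config.pbtxt would use backend: "python" and declare text_input / text_output as TYPE_STRING tensors (again, names assumed here for illustration only).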

@jackylu0124
Author

Hi @nnshah1, thank you very much for your fast reply! By "the Python bindings should work within the Python backend", do you mean that I could do things like import onnxruntime_genai and write the custom inference logic myself in the Python backend, as opposed to having Triton Inference Server automatically manage all of my .onnx model files (the ones that use onnxruntime-genai) in the model repository for me (which is the feature currently in development)? Is my understanding correct?

My use case is mainly serving LLMs, some of which are ONNX models that depend on onnxruntime_genai. I don't have a specific timeline; I am mainly interested in knowing whether this feature is on Triton Inference Server's development roadmap.

Also, a follow-up question regarding serving LLMs: what would be the best backend for serving models and achieving token streaming, outside of using the TensorRT-LLM backend?
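
For what it's worth, one presumable route to token streaming from the Python backend is Triton's decoupled transaction policy (model_transaction_policy { decoupled: true } in config.pbtxt), sending one response per generated token. Below is a rough, untested sketch of a streaming variant of the model.py above, reusing the same assumed tensor names and the token-level Generator / TokenizerStream API from the onnxruntime-genai docs:

```python
# Hypothetical streaming variant of the Python backend model above.
# Untested sketch: assumes the model's config.pbtxt enables decoupled mode
# (model_transaction_policy { decoupled: true }) so one response can be sent
# per generated token.
import numpy as np
import onnxruntime_genai as og
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        model_dir = f'{args["model_repository"]}/{args["model_version"]}'
        self.model = og.Model(model_dir)
        self.tokenizer = og.Tokenizer(self.model)

    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            prompt = (pb_utils.get_input_tensor_by_name(request, "text_input")
                      .as_numpy()[0].decode("utf-8"))

            params = og.GeneratorParams(self.model)
            params.set_search_options(max_length=256)
            params.input_ids = self.tokenizer.encode(prompt)

            generator = og.Generator(self.model, params)
            stream = self.tokenizer.create_stream()
            while not generator.is_done():
                generator.compute_logits()
                generator.generate_next_token()
                piece = stream.decode(generator.get_next_tokens()[0])
                out = pb_utils.Tensor(
                    "text_output",
                    np.array([piece.encode("utf-8")], dtype=np.object_))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
            # Tell Triton this request will receive no further responses.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None; responses go through the sender.
        return None
```

A client would then consume the per-token responses via Triton's streaming gRPC API.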

Thanks!
