How to conduct INT8 quantization and calibration in Python? #3858
Comments
By the way, I use TensorRT 8.4.1; does the calibration API in Python not work? Hoping someone can give some help, many thanks. The last log line is:
[05/15/2024-15:50:14] [TRT] [I] Starting Calibration.
You can make use of Polygraphy, see https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/convert/01_int8_calibration_in_tensorrt
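For reference, the linked example drives calibration through a small data-loader script: `polygraphy convert model.onnx --int8 --data-loader-script data_loader.py -o engine.plan` looks for a `load_data()` generator that yields one feed_dict (input name to NumPy array) per calibration batch. A minimal sketch of such a script; the input names and shapes below are assumptions for illustration, not taken from the poster's model:

```python
import numpy as np

# Assumed data_loader.py for `polygraphy convert --data-loader-script`:
# Polygraphy calls load_data() and feeds each yielded dict as one
# calibration batch. Names/shapes here are hypothetical placeholders.
def load_data():
    for _ in range(4):  # four calibration batches
        yield {
            "xs": np.random.rand(1, 80, 1000).astype(np.float32),
            "xlen": np.array([1000], dtype=np.int32),
        }

batches = list(load_data())
```

In practice you would replace the random tensors with real preprocessed samples, since calibration quality depends on representative data.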
@zerollzeng Thanks a lot for the support. By the way, can an engine built with the EngineFromNetwork API be saved to disk, so I get the new quantized TensorRT engine file? I use the following code to generate the TensorRT engine:
But why is the new engine file not 1/4 the size of the old FP32 one? It went from 156 MB to 95 MB, and the ONNX file is 153 MB. What's wrong? Is the engine-saving code right?
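A quick back-of-the-envelope check (using the sizes quoted in this thread) shows why a partially quantized engine lands between the full-FP32 and full-INT8 sizes: if only a fraction f of the weights is stored in INT8 (1/4 the bytes of FP32) and the rest stays FP32, the weight storage is roughly 153 * (f/4 + (1 - f)) MB. This is a rough model that ignores engine metadata and activations:

```python
# Rough size model: fraction f of 153 MB of FP32 weights quantized to INT8,
# the remaining (1 - f) left in FP32. Numbers are the ones from this thread.
ONNX_MB = 153.0

def engine_mb(f):
    """Approximate engine size (MB) when fraction f of weights is INT8."""
    return ONNX_MB * (f / 4.0 + (1.0 - f))

full_int8 = engine_mb(1.0)   # ~38 MB: the "1/4 size" you expected
full_fp32 = engine_mb(0.0)   # 153 MB: nothing quantized

# Invert the model for the observed 95 MB engine:
f_observed = (1.0 - 95.0 / ONNX_MB) / 0.75  # ~0.51
```

Under this crude model, a 95 MB engine corresponds to only about half the weights actually being stored in INT8, consistent with some layers falling back to FP32/FP16.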
@zerollzeng Sorry to bother you again. I tried to use the trtexec tool to generate an INT8 quantized engine without calibration, like this:
The quantized TensorRT engine size became 51 MB. Why is it much smaller than the engine generated with Polygraphy? Is it because the latter contains Q/DQ layers? I also tested the inference speed of the FP32 engine and the INT8 engine, and they are almost the same. What's wrong? I tested them on an A100 GPU.
Many factors affect the final engine size; I don't have a clear conclusion in your case.
My guess is sub-optimal Q/DQ placement. You can check the engine layer information in the verbose log, or check the layer profile (see trtexec -h) to confirm. You can take PTQ as the best-performance target: use the model without Q/DQ, build with --best, and see how good the performance is.
Sorry for the late reply, and many thanks for the help. I think some layers cannot be quantized during engine generation, so the size of the INT8 quantized engine is not 1/4 of the FP32 engine. As for the speed, I think it may also depend on the GPU: when I tested again on a 3080 GPU, it gave a reasonable result. But when I use a newer TensorRT version to conduct INT8 calibration and quantization, it fails. Why do TensorRT 8.5 and above not support dynamic-shape input during calibration? It works fine in TensorRT 8.4. TensorRT 8.5 gives an error like:
And TensorRT 8.6 gives an error like:
It seems TensorRT 8.5 and 8.6 will only use the optimal shape to conduct calibration? Why? Is there any solution?
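If newer TensorRT versions do run calibration at a single fixed shape (the opt shape of the calibration profile), one common workaround is to pad or crop every calibration batch to exactly that shape before the calibrator hands it over. A small sketch of that padding step; the opt shape and input name are hypothetical, and in real code this would run inside the calibrator's get_batch() before copying to the device:

```python
import numpy as np

# Assumed workaround sketch: force every variable-length calibration batch
# to the fixed opt shape that calibration runs at. OPT_SHAPE is a made-up
# example for an input like "xs"; use your profile's actual opt shape.
OPT_SHAPE = (1, 80, 1000)

def to_opt_shape(batch, opt_shape=OPT_SHAPE):
    """Zero-pad (or crop) a batch to exactly opt_shape."""
    out = np.zeros(opt_shape, dtype=np.float32)
    # Copy the overlapping region; anything beyond opt_shape is cropped,
    # anything short of it stays zero-padded.
    region = tuple(slice(0, min(a, b)) for a, b in zip(batch.shape, opt_shape))
    out[region] = batch[region]
    return out

short = np.ones((1, 80, 640), dtype=np.float32)  # shorter-than-opt sample
fixed = to_opt_shape(short)
```

Whether zero-padding is statistically acceptable for your calibration data is model-dependent; padding with silence/background values closer to real inputs may give better dynamic ranges.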
Hi all, I'm trying to convert an ONNX model to TensorRT with INT8 quantization in a Python environment. Here is the code:
The model has two input tensors ("xs" and "xlen") with dynamic input shapes. When I run this script, it always gives the following error:
[05/13/2024-17:39:19] [TRT] [W] parsers/onnx/onnx2trt_utils.cpp:367: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Building an engine from file ./onnx_model/model.onnx, this may take a while...
quantilize.py:173: DeprecationWarning: Use build_serialized_network instead.
engine = builder.build_engine(network, config)
[05/13/2024-17:39:26] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[05/13/2024-17:39:26] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[ERROR] Exception caught in get_batch(): Unable to cast Python instance to C++ type (compile in debug mode for details)
[05/13/2024-17:39:44] [TRT] [E] 1: Unexpected exception _Map_base::at
Failed to create the engine
What's wrong? Is there an error in my code? How can I fix it and finish this job successfully? Can anyone give some help? Thanks a lot in advance!
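The "Unable to cast Python instance to C++ type" error raised inside get_batch() typically means the calibrator returned something TensorRT cannot convert: get_batch() must return either a list of plain Python ints (one device-pointer address per requested input name) or None when the data is exhausted. A pure-Python sketch of that contract; real code would subclass tensorrt.IInt8EntropyCalibrator2 and the addresses would come from CUDA allocations (e.g. pycuda mem_alloc), so the integer values below are stand-ins:

```python
# Hypothetical calibrator sketch illustrating the get_batch() contract that
# the "Unable to cast Python instance to C++ type" error points at. A real
# calibrator subclasses tensorrt.IInt8EntropyCalibrator2; the device-pointer
# ints here are fake addresses so the logic runs without a GPU.
class EntropyCalibratorSketch:
    def __init__(self, batches, device_ptrs):
        self._batches = iter(batches)
        self._device_ptrs = device_ptrs  # input name -> device address (int)

    def get_batch(self, names):
        try:
            next(self._batches)  # real code: copy this host batch to device
        except StopIteration:
            return None          # None tells TensorRT calibration is finished
        # Must be a list of plain Python ints, one per name, in order:
        return [int(self._device_ptrs[n]) for n in names]

calib = EntropyCalibratorSketch(
    batches=["batch0", "batch1"],
    device_ptrs={"xs": 0x7F00, "xlen": 0x7F80},  # fake device addresses
)
```

Returning a numpy array, a device-allocation object, or a tuple of arrays instead of this int list is a common cause of exactly this cast failure.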