
TensorRT mixed precision or INT8 conversion: mixed precision has almost the same size and speed as INT8 but better precision, and the converted model gives good detection results with mixed precision. #10046

Open

wants to merge 6 commits into base: main

Conversation

ZouJiu1
Contributor

@ZouJiu1 ZouJiu1 commented Apr 15, 2024

  1. I have checked for existing contributions: before submitting, I verified that my contribution is unique and complementary.

  2. No related issues.

new feature

Running the model on edge devices like NVIDIA Jetson.

  1. Purpose: I want to run yolov8x.pt on an NVIDIA Jetson Orin NX 16G device, so I have to convert the trained yolov8x.pt model to a TensorRT model. FP16 conversion and inference both work, and the speed is faster than before, but I want an even higher FPS.

So I wanted to convert it to INT8. When I added code to ultralytics and converted yolov8x.pt to an INT8 engine, the conversion itself was fine, but when running inference with the INT8 engine I found it produced no detection results.

I searched many blogs and GitHub issues, including the TensorRT GitHub, looking for a reason and a solution, but I could not find one. I tried all the official yolov8*.pt models with INT8; none of the converted INT8 engines produce any detection results. But I found a possible solution: mixed precision, as in https://github.com/NVIDIA/TensorRT/blob/main/samples/python/efficientdet/build_engine.py#L188-L218. Mixed precision means that some layers run in FP16 while the other layers run in INT8.

When I use mixed precision to convert all the official yolov8*.pt models, keeping the first, second, and last convolution layers in FP16 and the other layers in INT8, the converted engines give good detection results on bus.jpg, and the file size is only about 200KB larger, because only three convolution layers use FP16 while the others still use INT8. So my code is right, and my conversion and calibration procedure are correct. I think YOLOv8 could add mixed precision conversion for better performance.

i. The procedure for converting a mixed precision engine:

  1. Download COCO val_2017.zip and unzip it to val_2017.

  2. Set calib_input=./val_2017, cache_file=./calibration.cache, half=True, int8=True.

  3. Run the script; it will convert the model to an engine file in mixed precision mode and run inference on bus.jpg.

    Before running, I commented out line 322 (raise SyntaxError(string + CLI_HELP_MSG) from e) in ultralytics\cfg\__init__.py.

ii. The procedure for converting an INT8 engine: the same as above except for this (half=False, int8=True); see the sketch below.
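
For reference, a minimal sketch of the two export calls (the source and calib_batch arguments are the ones added by this PR, not upstream ultralytics options):

from ultralytics import YOLO

model = YOLO("yolov8x.pt", task="detect")

# i. mixed precision (FP16 + INT8 layers): half=True together with int8=True
model.export(format="engine", source="./val_2017", calib_batch=20, half=True, int8=True, imgsz=640, device=0)

# ii. pure INT8: the identical call, except half=False
model.export(format="engine", source="./val_2017", calib_batch=20, half=False, int8=True, imgsz=640, device=0)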

The converted models (INT8 or mixed precision, built with TensorRT 10.0.0b6) can be downloaded here: https://www.alipan.com/s/FdfFoPDGCWH. *mix is a mixed precision engine, *int8 is an INT8 engine, and nod means no detection.

Mixed precision log:

TensorRT: input "images" with shape(1, 3, 640, 640) DataType.FLOAT
TensorRT: output "output0" with shape(1, 84, 8400) DataType.FLOAT
Mixed-Precision Layer /model.0/conv/Conv set to HALF STRICT data type
Mixed-Precision Layer /model.1/conv/Conv set to HALF STRICT data type
Mixed-Precision Layer /model.22/dfl/conv/Conv set to HALF STRICT data type
TensorRT: building FP16 engine as yolov8x.engine
TensorRT: building INT8 engine as yolov8x.engine

The engines built with mixed precision give good detection results on bus.jpg, while the engines built with INT8 give no results.

Related pull requests: #9840, #9941 and #9969.

Differences from #9840:

I changed the input and output to FLOAT instead of FP16; FLOAT input and output work better.

Use a different Calibrator base class

According to the TensorRT developer guide, we can use different Calibrator base classes in calibrator.py: IInt8EntropyCalibrator2 is suitable for CNN networks and IInt8MinMaxCalibrator is suitable for NLP networks, so it is worth trying different calibrators.
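
For example, a minimal calibrator sketch (not the EngineCalibrator in this PR; it assumes pycuda for device buffers, and batches is a hypothetical list of preprocessed FP32 numpy arrays of identical shape). Switching the base class between IInt8EntropyCalibrator2 and IInt8MinMaxCalibrator is a one-line change:

import os

import numpy as np
import pycuda.autoinit  # noqa: F401, creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt


class SimpleCalibrator(trt.IInt8EntropyCalibrator2):  # or trt.IInt8MinMaxCalibrator for NLP-style models
    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)  # batches: list of (N, 3, 640, 640) float32 arrays, all the same shape
        self.cache_file = cache_file
        self.batch_size = batches[0].shape[0]
        self.device_input = cuda.mem_alloc(batches[0].nbytes)  # one reusable device buffer

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None  # tells TensorRT the calibration data is exhausted
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch, dtype=np.float32))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)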

INT8 conversion reference:
https://github.com/NVIDIA/TensorRT/blob/main/samples/python/detectron2

Mixed precision conversion reference (much better):

Tip: the first layer and the last layer should be FP16; the others can be INT8. Just try it.
https://github.com/NVIDIA/TensorRT/blob/main/samples/python/efficientdet

The code below can be used with the first commit, commits/82ec7ccd7cf77353d9b39f87c317f88878a2a34b:

import os
import gc
import sys
sys.path.append(r'E:\work\codeRepo\deploy\jz\ultralytics')
from ultralytics import YOLO # newest version from "git clone and git pull"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

if __name__ == '__main__':
    file = r'yolov8n.pt'
    # file = r'yolov8n-cls.pt'
    # file = r'yolov8n-seg.pt'
    # file = r'yolov8n-pose.pt'
    # file = r'yolov8n-obb.pt'
    # task: [classify, detect, segment, pose, obb]
    model = YOLO(file, task='detect')  # load a pretrained model (recommended for training)
    calib_input = r'E:\work\codeRepo\deploy\jz\val2017'
    '''
    https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#enable_int8_c
    To avoid this issue, calibrate with as large a single batch as possible, 
    and ensure that calibration batches are well randomized and have similar distribution.
    '''
    imgsz = 640
    model.export(format=r"engine", source = calib_input, batch=1, calib_batch=20, 
                 simplify=True, half=True, int8=True, device=0, 
                 imgsz=imgsz)
    del model
    gc.collect()
    model = YOLO(r"E:\work\%s"%(file.replace(".pt", ".engine")))
    result = model.predict('https://ultralytics.com/images/bus.jpg',
                           save=True,
                           imgsz=imgsz)

I have read the CLA Document and I sign the CLA

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Introducing Mixed Precision & Enhanced INT8 Support for Accelerated Inference! ⚡🔍

📊 Key Changes

  • Mixed Precision Mode: New experimental feature combining FP16 and INT8 precision for optimized inference accuracy and latency.
  • Enhanced INT8 Engine: Improved INT8 calibration with the introduction of an Engine Calibrator and updated to utilize TensorRT's IInt8MinMaxCalibrator.
  • Precision Control: Specific layers within the network can now be pinned to FP16 to achieve a balance between accuracy and performance.
  • Calibration Process Enhancements: The calibration process for INT8 has been refined with a more effective and efficient batch and cache handling approach.

🎯 Purpose & Impact

  • Accuracy and Speed: The mix of FP16 and INT8 precision aims to maintain high inference accuracy while benefiting from reduced latency and computational requirements.
  • Enhanced Calibration: With the new Engine Calibrator, users can expect more reliable and quicker INT8 calibration, paving the way for faster model optimization without significant accuracy loss.
  • Greater Flexibility: Developers have more control over precision, allowing them to fine-tune models for an optimal balance between speed and accuracy for their specific use case.
  • User Experience: These enhancements make the model training and deployment process both more efficient and user-friendly, supporting a broader range of applications and hardware capabilities.

⚙️🚀 Whether it's speeding up your existing models or pushing the envelope on accuracy, these updates are all about giving you the tools to make the most out of your AI solutions.

@ZouJiu1
Contributor Author

ZouJiu1 commented Apr 15, 2024

@glenn-jocher
my fork's main branch follows the upstream main at https://github.com/ultralytics/ultralytics, so I abandoned my earlier commits to keep up with main.

this is my last pull request.

  1. The calibrator batch size needs to be large, but the ONNX export batch size is smaller, sometimes even 1, so I think the calib_batch argument is necessary, according to https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#enable_int8_c:

    To avoid this issue, calibrate with as large a single batch as possible, 
    and ensure that calibration batches are well randomized and have similar distribution.
    
  2. The redundant self.Model is removed; I just use the internal self.model in Export: config.int8_calibrator = EngineCalibrator(cache_file, self.args, self.model).

  3. The predefined exact layer names are also removed. When using mixed precision, the first two convolutions and the last convolution are FP16 and the others are INT8, so there is no need to care about which model or which PyPI package version is used.

first_conv = 0    # number of convolution layers pinned to FP16 so far
last_conv = None  # most recent convolution layer seen
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.type == trt.LayerType.CONVOLUTION:
        if first_conv < 2:  # pin the first two convolution layers to FP16
            first_conv += 1
            layer.precision = trt.DataType.HALF
            LOGGER.info("Mixed-Precision Layer {} set to HALF STRICT data type".format(layer.name))
        last_conv = layer
# pin the last convolution layer to FP16 as well
last_conv.precision = trt.DataType.HALF
LOGGER.info("Mixed-Precision Layer {} set to HALF STRICT data type".format(last_conv.name))
LOGGER.info(f"{prefix} building a Mix Precision with FP16 and INT8 engine as {f}")

The image size needs to be kept the same when exporting to an engine and when running inference with that engine (see the snippet below).
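
For illustration, a minimal sketch of keeping imgsz consistent (calibration arguments omitted for brevity):

from ultralytics import YOLO

imgsz = 640  # keep the same value for export and inference
engine_path = YOLO("yolov8x.pt", task="detect").export(format="engine", int8=True, imgsz=imgsz, device=0)
YOLO(engine_path, task="detect").predict("https://ultralytics.com/images/bus.jpg", imgsz=imgsz)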


codecov bot commented Apr 15, 2024

Codecov Report

Attention: Patch coverage is 0%, with 54 lines in your changes missing coverage. Please review.

Project coverage is 75.67%. Comparing base (911a0ed) to head (8cd59a2).

Files                            Patch %   Lines
ultralytics/nn/calibrator.py     0.00%     43 Missing ⚠️
ultralytics/engine/exporter.py   0.00%     11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10046      +/-   ##
==========================================
- Coverage   78.83%   75.67%   -3.17%     
==========================================
  Files         121      122       +1     
  Lines       15351    15404      +53     
==========================================
- Hits        12102    11657     -445     
- Misses       3249     3747     +498     
Flag Coverage Δ
Benchmarks 35.90% <0.00%> (-0.13%) ⬇️
GPU 37.81% <0.00%> (-0.14%) ⬇️
Tests 70.98% <0.00%> (-3.60%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@glenn-jocher
Member

Hey there! 👋 Thanks for sharing your insights and improvements on mixed precision and INT8 conversion. It's great to see your continuous effort to enhance inference efficiency while keeping the process streamlined.

I absolutely agree that leveraging a larger batch size for calibration can be crucial for achieving accurate INT8 quantization, as recommended by the NVIDIA documentation. Your approach of setting calibration batch sizes dynamically, separate from the ONNX export settings, seems like a practical solution.

As for simplifying the use of self.model within the Export process and removing predefined exact layers for mixed precision conversion, these modifications contribute to cleaner and more adaptable code. It’s smart to focus on the initial and final convolutional layers for FP16 precision while setting others to INT8, simplifying the process regardless of the model architecture or package version - a clever move!

Finally, ensuring consistent image sizes during export and inference is key to maintaining performance and avoiding any unexpected behavior.

Your contributions are highly valued, and your last pull request seems to encapsulate these thoughtful changes well. Keep up the fantastic work! If there's anything more we can assist with or discuss further, feel free to reach out. Happy coding! 😊

@Burhan-Q Burhan-Q self-assigned this Apr 15, 2024
@Burhan-Q
Member

@ZouJiu1 currently this is not working with either argument set, (int8=True) or (int8=True, half=True), as an AttributeError is raised due to calib_batch. Additionally, I'm not seeing any way to specify a location for calibration data; right now this assumes a .cache file exists, which will not always be the case.

Also, please don't close and re-open a PR when you make changes. The entire reason to use git is to ensure there is a history of changes to know what was attempted and have a log of updates, modifications, and changes. Please just make commits to this PR directly instead.

@ZouJiu1
Contributor Author

ZouJiu1 commented Apr 15, 2024

@Burhan-Q ok, I will commit to this PR directly.

The batch size calib_batch affects the .cache file, and different models generate different inference results and different .cache files. So I think relying on an existing .cache file to convert a model to an engine is a bad idea. Using the "source" argument to set an images directory is the right choice.

The calibration data is set by the "source" argument, so I think we need to add an example to the documentation, e.g. a sample that downloads the VOC2007 dataset and unzips it automatically.

calib_batch leads to an AttributeError; the cfg/default.yaml file needs to be modified to add the calib_batch argument.

now, it is time to sleep. 23:02, see you tomorrow.

@ZouJiu1
Contributor Author

ZouJiu1 commented Apr 16, 2024

@Burhan-Q, I added the calib_batch argument to default.yaml, and added INT8 and mixed precision TensorRT examples and the argument to some documents. There is no AttributeError with calib_batch anymore, and the usage and explanation have been added to the documentation.

@Burhan-Q
Member

@ZouJiu1 I see your additions and will have to provide my feedback tomorrow. I have done some testing, and I think there is still a lot of work needed to clean up and optimize the addition of INT8 for TensorRT. I'm happy to contribute to this PR, but you have to understand that there will be changes I need to make that are specific to Ultralytics, and it may end up looking very different.

As it is now, this is a lot more code than is needed and is likely easily broken. To incorporate INT8 for TensorRT, we need to make certain that it's easy and simple to use and to maintain. Additionally, we will need to verify that all tasks and other functionality are working, plus write some unit tests whenever possible.
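
As a rough illustration of the kind of unit test meant here, a hypothetical pytest sketch (assumes a CUDA device with TensorRT installed and INT8 export support in place; not code from this PR):

import pytest
import torch
from ultralytics import YOLO


@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA + TensorRT")
def test_export_engine_int8():
    engine = YOLO("yolov8n.pt").export(format="engine", int8=True, imgsz=160, device=0)
    results = YOLO(engine, task="detect").predict("https://ultralytics.com/images/bus.jpg", imgsz=160)
    assert len(results) == 1  # one Results object per image, even if it contains no boxes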

@Burhan-Q Burhan-Q added the enhancement New feature or request label Apr 16, 2024
@Burhan-Q
Member

@ZouJiu1 I think that the "mixed precision" option is not necessary per the Nvidia TensorRT documentation:

When processing implicitly quantized networks, TensorRT treats the model as a floating-point model when applying the graph optimizations, and uses INT8 opportunistically to optimize layer execution time. If a layer runs faster in INT8 and has assigned quantization scales on its data inputs and outputs, then a kernel with INT8 precision is assigned to that layer. Otherwise, a high-precision floating-point (that is, FP32, FP16 or BF16) kernel is assigned.

While running calibration (implicit quantization), TensorRT will select the optimal data type for each layer. I have confirmed that there is a mix of float and INT8 tensors present in the engine after exporting, and I believe this means that setting specific layers to a fixed precision is not needed and would only create more issues.

@ZouJiu1
Contributor Author

ZouJiu1 commented Apr 17, 2024

@Burhan-Q I understand that explicit quantization has many Q/DQ nodes; like PyTorch QAT or pytorch_quantization QAT, it generates many quantization/dequantization nodes. I also agree that TensorRT conversion will select the computational precision based on performance considerations. See sampleINT8API, lastest_sampleINT8API, explicit-implicit-quantization.

But using mixed precision, you can determine the precision of each layer. If the converted engine's precision, recall, or mAP is not good, or even very bad, then we can use mixed precision to get higher precision, recall, or mAP.

I didn't test the converted engine file's precision, recall, or mAP, so I cannot tell you how much mixed precision helps to increase them.

But I think mixed precision is better than INT8. When I set the ONNX input and output to half and use it to convert an INT8 engine, the converted engine file has half input and half output. All models, yolov8x_int8.engine, yolov8n_int8.engine, yolov8l_int8.engine and so on, have no inference result on bus.jpg.
But when I convert them to mixed precision engines (the first two convs and the last conv set to FP16), the converted engine files also have half input and half output, and all models, yolov8x_mix.engine, yolov8n_mix.engine, yolov8l_mix.engine and so on, produce inference results on bus.jpg, and the detection result is right: one bus and three persons.

After pushing several pull requests, I made some modifications to my code and found the reason why the INT8 engines have no detection results: the input and output dtype was half (FP16). The input and output dtype has an important and significant impact on the result.
When I set the ONNX input and output to float32 and use it to convert an INT8 engine, the converted engine file has float32 input and float32 output. All models, yolov8x_int8.engine, yolov8n_int8.engine and so on, produce inference results on bus.jpg, and the detection result is right. Mixed precision with float32 input and output also gives the right result.

When the input and output dtype is the same half (FP16), mixed precision gives a good result but INT8 gives no result, so I think mixed precision is better: setting the precision forces TensorRT to choose implementations which run at that precision, so you can determine each layer's precision instead of letting TensorRT choose.
ILayer::SetPrecision

Set the computational precision of this layer. 
Setting the precision forces TensorRT to choose the implementations which run at this precision. 
If precision is not set, TensorRT will select the computational precision based on performance considerations 
and the flags specified to the builder.

The mixed precision example in the TensorRT GitHub, release/10.0/samples/python/efficientdet/build_engine.py#L188-L218:

    def set_mixed_precision(self):
        """
        Experimental precision mode.
        Enable mixed-precision mode. When set, the layers defined here will be forced to FP16 to maximize
        INT8 inference accuracy, while having minimal impact on latency.
        """
        self.config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
        self.config.set_flag(trt.BuilderFlag.DIRECT_IO)
        self.config.set_flag(trt.BuilderFlag.REJECT_EMPTY_ALGORITHMS)

        # All convolution operations in the first four blocks of the graph are pinned to FP16.
        # These layers have been manually chosen as they give a good middle-point between int8 and fp16
        # accuracy in COCO, while maintining almost the same latency as a normal int8 engine.
        # To experiment with other datasets, or a different balance between accuracy/latency, you may
        # add or remove blocks.
        for i in range(self.network.num_layers):
            layer = self.network.get_layer(i)
            if layer.type == trt.LayerType.CONVOLUTION and any([
                    # AutoML Layer Names:
                    "/stem/" in layer.name,
                    "/blocks_0/" in layer.name,
                    "/blocks_1/" in layer.name,
                    "/blocks_2/" in layer.name,
                    # TFOD Layer Names:
                    "/stem_conv2d/" in layer.name,
                    "/stack_0/block_0/" in layer.name,
                    "/stack_1/block_0/" in layer.name,
                    "/stack_1/block_1/" in layer.name,
                ]):
                self.network.get_layer(i).precision = trt.DataType.HALF
                log.info("Mixed-Precision Layer {} set to HALF STRICT data type".format(layer.name))

@Burhan-Q
Member

@ZouJiu1 I'm not seeing INT8 input and output layers when exporting the quantized models; they're tensorrt.DataType.FLOAT or tensorrt.float32. Again referencing the Nvidia documentation:

Note that even if the precision flags are enabled, the input/output bindings of the engine defaults to FP32.

The default is that they will be FP32 and not INT8, even when exporting to INT8 quantized models. To be clear, I'm not questioning your tests or results, I believe you. The issue is that including a "mixed precision" option is going to have limited benefit for the majority of users and would primarily become more of an issue for maintenance.

@Burhan-Q
Member

I tested exporting yolov8x.pt to .engine with INT8 (without having to manually configure any layer precisions) and had no issue with detections on bus.jpg. FYI, this isn't using the code you have in this PR; it's code I've written for exporting TensorRT to INT8.

from ultralytics import YOLO, ASSETS

model = YOLO("yolov8x.pt")
im = ASSETS / "bus.jpg"

out = model.export(format="engine", data="coco.yaml", int8=True, batch=8, dynamic=True, workspace=2)

model = YOLO(out, task="detect")
result = model.predict(im)
>>> Loading yolov8x.engine for TensorRT inference...
>>> [04/17/2024-09:28:08] [TRT] [I] Loaded engine size: 72 MiB
>>> [04/17/2024-09:28:08] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1110, now: CPU 0, GPU 1176 (MiB)

>>> image 1/1 ultralytics/assets/bus.jpg: 640x640 4 persons, 1 bus, 10.9ms
Speed: 4.3ms preprocess, 10.9ms inference, 1243.7ms postprocess per image at shape (1, 3, 640, 640)

result[0].boxes.data
>>> tensor([[1.0418e+01, 2.2991e+02, 7.9810e+02, 7.3954e+02, 9.4087e-01, 5.0000e+00],
            [2.2316e+02, 4.0466e+02, 3.4449e+02, 8.4835e+02, 8.6566e-01, 0.0000e+00],
            [6.6894e+02, 3.9504e+02, 8.1000e+02, 8.7364e+02, 8.6506e-01, 0.0000e+00],
            [5.0419e+01, 3.9674e+02, 2.4674e+02, 9.0418e+02, 8.5865e-01, 0.0000e+00],
            [6.2339e-02, 5.5515e+02, 7.8911e+01, 8.7119e+02, 7.1263e-01, 0.0000e+00]], device='cuda:0')

eng = model.predictor.model.model
eng.get_tensor_dtype("images")
>>> <DataType.FLOAT: 0>
eng.get_tensor_dtype("output0")
>>> <DataType.FLOAT: 0>
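
To check which precision each layer actually ended up with, a hedged sketch using the engine inspector (requires TensorRT >= 8.2; detailed per-layer output additionally assumes the engine was built with profiling_verbosity set to DETAILED):

import tensorrt as trt

eng = model.predictor.model.model  # the deserialized ICudaEngine, as above
inspector = eng.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))  # per-layer info as JSON text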

@Burhan-Q
Member

@ZouJiu1 my intention in giving feedback here is that I want to try to help make your PR something that could be merged. I had planned to add TensorRT INT8 support myself, but since you started a PR, I thought I'd try to collaborate with you on making it work. The issue is that in its current state, this PR is unlikely to be accepted. We can work together on making changes, or you can leave it as is and I can open my own PR, which is more likely (though not certain) to be accepted. Please let me know how you'd like to proceed.

@ZouJiu1
Contributor Author

ZouJiu1 commented Apr 17, 2024

@Burhan-Q, OK, I understand; I just don't know what I should do in the next step, maybe remove the mixed precision part from the code and docs and just keep INT8 in this PR. If the mixed precision part should be removed, I will do it.

Also, if you have a better INT8 implementation, I think you should open your own PR; no need to worry about mine.

What I used before is below. I modified engine/exporter.py#L240-L241:

if self.args.half and onnx and self.device.type != "cpu":
    im, model = im.half(), model.half()  # to FP16
if (self.args.half or self.args.int8) and engine and self.device.type != "cpu":
    im, model = im.half(), model.half()  # to FP16

Then the engine's input and output dtype will be FP16 (half), and detection will produce no inference results.

TensorRT: input "images" with shape(1, 3, 640, 640) DataType.HALF
TensorRT: output "output0" with shape(1, 84, 8400) DataType.HALF
TensorRT: building INT8 engine as yolov8n.engine

The code I used to convert:

import os
import gc
import sys
sys.path.append(r'E:\work\codeRepo\deploy\jz\ultralytics')
from ultralytics import YOLO # newest version from "git clone and git pull"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

if __name__ == '__main__':
    file = r'yolov8n.pt'
    # file = r'yolov8n-cls.pt'
    # file = r'yolov8n-seg.pt'
    # file = r'yolov8n-pose.pt'
    # file = r'yolov8n-obb.pt'
    # task: [classify, detect, segment, pose, obb]
    model = YOLO(file, task='detect')  # load a pretrained model (recommended for training)
    calib_input = r'E:\work\codeRepo\deploy\jz\val2017'
    '''
    https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#enable_int8_c
    To avoid this issue, calibrate with as large a single batch as possible, 
    and ensure that calibration batches are well randomized and have similar distribution.
    '''
    imgsz = 640
    model.export(format=r"engine", source = calib_input, batch=1, calib_batch=20, 
                 simplify=True, half=True, int8=True, device=0, 
                 imgsz=imgsz)
    del model
    gc.collect()
    model = YOLO(r"E:\work\%s"%(file.replace(".pt", ".engine")))
    result = model.predict('https://ultralytics.com/images/bus.jpg',
                           save=True,
                           imgsz=imgsz)
    eng = model.predictor.model.model
    k = eng.get_tensor_dtype("images")
    k1 = eng.get_tensor_dtype("output0")

The result has no detections:
image 1/1 E:\work\bus.jpg: 640x640 (no detections), 2.0ms
k = <DataType.HALF: 1>
k1 = <DataType.HALF: 1>

However, if I use mixed precision to convert, detection gives a good result.

    model.export(format=r"engine", source = calib_input, batch=1, calib_batch=20, 
                 simplify=True, half=True, int8=True, device=0, 
                 imgsz=imgsz)

the output log

......
TensorRT: input "images" with shape(1, 3, 640, 640) DataType.HALF
TensorRT: output "output0" with shape(1, 84, 8400) DataType.HALF
Mixed-Precision Layer /model.0/conv/Conv set to HALF STRICT data type
Mixed-Precision Layer /model.1/conv/Conv set to HALF STRICT data type
Mixed-Precision Layer /model.22/dfl/conv/Conv set to HALF STRICT data type
TensorRT: building a Mix Precision with FP16 and INT8 engine as yolov8n.engine
TensorRT: building FP16 engine as yolov8n.engine
TensorRT: building INT8 engine as yolov8n.engine
......
......

This is why I think mixed precision is better than INT8. But now I cannot reproduce the result with INT8 input and output; maybe something is wrong somewhere, I am not sure where.

@ZouJiu1
Contributor Author

ZouJiu1 commented Apr 18, 2024

I agree with you that the maintenance burden would be much higher, so I have removed the mixed precision part now. If you find that mixed precision is better and necessary, let me know and I will add it again.

@glenn-jocher
Member

Thanks for understanding and taking action on the feedback! It's always great to see such responsive and considerate collaboration. 🙌 If we find a strong need for the mixed precision feature in the future, we'll definitely reach out for your insights and contributions. For now, focusing on improving and refining INT8 support seems like our best path forward. Keep up the fantastic work! If you have any further updates or questions, don't hesitate to share.

@Burhan-Q
Member

Burhan-Q commented Apr 19, 2024

I found this to be an issue in my testing:

if (self.args.half or self.args.int8) and engine and self.device.type != "cpu":
    im, model = im.half(), model.half()  # to FP16

If you set model or im to FP16, this will feed incorrect data to TensorRT when calibrating. I believe it's expecting FP32 and calibrates based on that. I'm not 100% certain, but I did find that it causes the problem you describe.
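
A minimal sketch of the guard this suggests (an assumption, not the code in #10165): skip the FP16 cast whenever INT8 calibration is requested, so the calibrator always receives FP32 data.

# assumed fix, not the merged code: never cast to FP16 when INT8 calibration will run
if self.args.half and engine and self.device.type != "cpu" and not self.args.int8:
    im, model = im.half(), model.half()  # FP16 cast only for a pure FP16 engine export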

I still have some work to do before it's ready, but I also opened a PR #10165 for adding INT8 with TensorRT and included some of my results as well.
