
Drop in mAP after TensorRT optimization #315

Open

philipp-schmidt opened this issue Jan 2, 2021 · 29 comments
@philipp-schmidt (Contributor)

@jkjung-avt
Hi, could we work together on the problem of the reduced accuracy? I believe I have similar issues in my implementation, and I don't use any ONNX conversion whatsoever. I would like to get this fixed, and additional examples of where it goes wrong would help me determine the cause.

We could start with the post-processing. I started from existing code for the yolo layer plugin, similar to yours, and have already had to fix a few errors. Please let me know if my code improves your precision:

https://github.com/isarsoft/yolov4-triton-tensorrt/blob/master/clients/python/processing.py
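
For context, the heart of that post-processing is a confidence filter followed by non-maximum suppression (NMS). A minimal numpy sketch of the general idea (illustrative only, not the actual processing.py code; the box format and threshold are assumptions):

import numpy as np

def iou(box, boxes):
    # IoU of one box against many; boxes are [x1, y1, x2, y2]
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.45):
    # greedily keep the highest-scoring box, drop boxes that overlap it too much
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep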

@philipp-schmidt (Contributor, Author)

Here are all the fixes I have made so far:
https://github.com/isarsoft/yolov4-triton-tensorrt/commits/master/clients/python/processing.py

@jkjung-avt (Owner)

Hi, could we work together on the problem of the reduced accuracy?

That sounds good.

I have read through your commit history. I think my current code does not have the issues you've fixed in your own code...

I did reference the original AlexeyAB/darknet code to develop my implementation. For example, "scale_x_y", which is used in the yolov4/yolov4-tiny models, affects how the center x/y coordinates of bboxes are calculated, and I implemented that calculation in the "yolo_layer" plugin:

det->bbox[0] = (col + scale_sigmoidGPU(*(cur_input + 0 * total_grids), scale_x_y)) / yolo_width; // [0, 1]
det->bbox[1] = (row + scale_sigmoidGPU(*(cur_input + 1 * total_grids), scale_x_y)) / yolo_height; // [0, 1]
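
For readers unfamiliar with "scale_x_y": it is darknet's scaled sigmoid, which stretches the sigmoid output so predicted centers can reach the grid-cell borders. A rough numpy equivalent of the two lines above, assuming scale_sigmoidGPU matches darknet's scaled sigmoid (the sample values are made up for illustration):

import numpy as np

def scale_sigmoid(x, scale_x_y):
    # darknet's scaled sigmoid: sigmoid stretched by scale_x_y and re-centered
    return scale_x_y / (1.0 + np.exp(-x)) - 0.5 * (scale_x_y - 1.0)

# hypothetical values for a single grid cell at (row, col)
yolo_width, yolo_height = 13, 13
row, col = 7, 4
scale_x_y = 1.05                  # 1.0 in yolov3; e.g. 1.05/1.1/1.2 in yolov4 cfgs
x_logit, y_logit = 0.3, -0.2      # raw conv outputs for this cell

bbox_cx = (col + scale_sigmoid(x_logit, scale_x_y)) / yolo_width    # in [0, 1]
bbox_cy = (row + scale_sigmoid(y_logit, scale_x_y)) / yolo_height   # in [0, 1]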

@jkjung-avt (Owner)

Related issues:

@philipp-schmidt (Contributor, Author)

I will probably have time this weekend to cross-check the implementations. I will get back to you when I have more info.

@jkjung-avt (Owner)

@philipp-schmidt Looking forward to your updates. Meanwhile, I'm inclined to think the problem more likely lies in the darknet -> ONNX -> TensorRT conversion. I will also review the code when I have time.

@philipp-schmidt (Contributor, Author)

philipp-schmidt commented Jan 26, 2021

Hi, a major source of wrong results and bad accuracy has been fixed for me in Triton Inference Server. It was a server-side race condition... I had been hunting ghosts for many weeks... triton-inference-server/server#2339

Now I can focus on mAP, I'll keep you posted.

philipp-schmidt changed the title from "Post Processing" to "Drop in mAP after conversion" on Jan 26, 2021
philipp-schmidt changed the title from "Drop in mAP after conversion" to "Drop in mAP after TensorRT optimization" on Jan 26, 2021
@jkjung-avt (Owner)

NVIDIA has this Polygraphy tool, which can be used to compare layer-wise outputs between the ONNX model and the TensorRT engine. I think that would be an effective way to debug this mAP drop problem.

Here is an example Polygraphy debugging output: NVIDIA/TensorRT#1087 (comment)

I'm not sure when I'll have time to look into this, though.

@philipp-schmidt (Contributor, Author)

Unfortunately, I haven't yet been able to make time to fully tackle this either.
This Polygraphy tool seems very helpful regardless; thanks for the pointer.

@jkjung-avt (Owner)

NVIDIA's Polygraphy tool turns out to be very easy to use. I just followed the installation instructions and used the following command to debug the models.

$ polygraphy run yolov3-tiny-416.onnx --trt --fp16 --onnxrt
......
[I] Accuracy Comparison | trt-runner-N0-04/10/21-21:37:09 vs. onnxrt-runner-N0-04/10/21-21:37:09
[I]     Comparing Output: '016_convolutional' (dtype=float32, shape=(1, 255, 13, 13)) with '016_convolutional' (dtype=float32, shape=(1, 255, 13, 13))
[I]         Required tolerances: [atol=0.089517] OR [rtol=1e-05, atol=0.089425] OR [rtol=5.9166, atol=1e-05] | Mean Error: Absolute=0.010562, Relative=0.0033428
            Runner: trt-runner-N0-04/10/21-21:37:09          | Stats: mean=-6.5803, min=-15.992 at (0, 174, 0, 0), max=2.1582 at (0, 90, 12, 2)
            Runner: onnxrt-runner-N0-04/10/21-21:37:09       | Stats: mean=-6.5821, min=-16.004 at (0, 174, 0, 12), max=2.1647 at (0, 90, 12, 2)
[E]         FAILED | Difference exceeds tolerance (rtol=1e-05, atol=1e-05)
[I]     Comparing Output: '023_convolutional' (dtype=float32, shape=(1, 255, 26, 26)) with '023_convolutional' (dtype=float32, shape=(1, 255, 26, 26))
[I]         Required tolerances: [atol=0.095589] OR [rtol=1e-05, atol=0.095568] OR [rtol=268.68, atol=1e-05] | Mean Error: Absolute=0.012998, Relative=0.0078038
            Runner: trt-runner-N0-04/10/21-21:37:09          | Stats: mean=-7.1557, min=-18.188 at (0, 174, 15, 25), max=3.3008 at (0, 249, 15, 21)
            Runner: onnxrt-runner-N0-04/10/21-21:37:09       | Stats: mean=-7.1579, min=-18.159 at (0, 174, 15, 25), max=3.3272 at (0, 249, 15, 21)
[E]         FAILED | Difference exceeds tolerance (rtol=1e-05, atol=1e-05)
[E]     FAILED | Mismatched outputs: ['016_convolutional', '023_convolutional']

I summarize the results below. All comparisons are done between TensorRT FP16 and ONNX Runtime.

  • yolov3-tiny-416

    • '016_convolutional' Mean Error: Absolute=0.010562, Relative=0.0033428
    • '023_convolutional' Mean Error: Absolute=0.012998, Relative=0.0078038
  • yolov3-608

    • '082_convolutional' Mean Error: Absolute=0.018218, Relative=0.0046612
    • '094_convolutional' Mean Error: Absolute=0.018218, Relative=0.0046612
    • '106_convolutional' Mean Error: Absolute=0.020347, Relative=0.0078671
  • yolov4-tiny-416

    • '030_convolutional' Mean Error: Absolute=0.01394, Relative=0.0032779
    • '037_convolutional' Mean Error: Absolute=0.013386, Relative=0.0069264
  • yolov4-608

    • '139_convolutional' Mean Error: Absolute=0.0051023, Relative=0.0026887
    • '150_convolutional' Mean Error: Absolute=0.0070509, Relative=0.0040541
    • '161_convolutional' Mean Error: Absolute=0.0074914, Relative=0.001748

@ROBYER1

ROBYER1 commented Apr 10, 2021

NVIDIA's Polygraphy tool turns out to be very easy to use. [full Polygraphy output quoted in the previous comment]

I am guessing this is where the loss of accuracy comes from? Will there be a fix?

@philipp-schmidt (Contributor, Author)

philipp-schmidt commented Apr 10, 2021 via email

@jkjung-avt (Owner)

I re-ran Polygraphy by specifying the correct input data range for the yolo models ("--float-min 0.0 --float-max 1.0"), e.g.

$ polygraphy run yolov3-tiny-416.onnx --trt --fp16 --onnxrt --float-min 0.0 --float-max 1.0
......
[I] Accuracy Comparison | trt-runner-N0-04/11/21-12:47:58 vs. onnxrt-runner-N0-04/11/21-12:47:58
[I]     Comparing Output: '016_convolutional' (dtype=float32, shape=(1, 255, 13, 13)) with '016_convolutional' (dtype=float32, shape=(1, 255, 13, 13))
[I]         Required tolerances: [atol=0.049671] OR [rtol=1e-05, atol=0.049614] OR [rtol=18.328, atol=1e-05] | Mean Error: Absolute=0.008115, Relative=0.0037584
            Runner: trt-runner-N0-04/11/21-12:47:58          | Stats: mean=-5.2187, min=-18.516 at (0, 174, 11, 3), max=1.5859 at (0, 111, 4, 4)
            Runner: onnxrt-runner-N0-04/11/21-12:47:58       | Stats: mean=-5.2171, min=-18.497 at (0, 174, 11, 11), max=1.5708 at (0, 0, 11, 0)
[E]         FAILED | Difference exceeds tolerance (rtol=1e-05, atol=1e-05)
[I]     Comparing Output: '023_convolutional' (dtype=float32, shape=(1, 255, 26, 26)) with '023_convolutional' (dtype=float32, shape=(1, 255, 26, 26))
[I]         Required tolerances: [atol=0.069397] OR [rtol=1e-05, atol=0.069256] OR [rtol=9084.6, atol=1e-05] | Mean Error: Absolute=0.010339, Relative=0.058467
            Runner: trt-runner-N0-04/11/21-12:47:58          | Stats: mean=-5.6, min=-18.625 at (0, 174, 7, 25), max=2.4707 at (0, 19, 12, 23)
            Runner: onnxrt-runner-N0-04/11/21-12:47:58       | Stats: mean=-5.5999, min=-18.625 at (0, 174, 7, 25), max=2.4672 at (0, 19, 12, 23)
[E]         FAILED | Difference exceeds tolerance (rtol=1e-05, atol=1e-05)
[E]     FAILED | Mismatched outputs: ['016_convolutional', '023_convolutional']
[E] FAILED | Command: /home/jkjung/project/MODNet/venv/bin/polygraphy run yolov3-tiny-416.onnx --trt --fp16 --onnxrt --float-min 0.0 --float-max 1.0

Here are the results (FP16):

  • yolov3-tiny-416

    • '016_convolutional' Mean Error: Absolute=0.008115, Relative=0.0037584
    • '023_convolutional' Mean Error: Absolute=0.010339, Relative=0.058467
  • yolov3-608

    • '082_convolutional' Mean Error: Absolute=0.01309, Relative=0.0043352
    • '094_convolutional' Mean Error: Absolute=0.016002, Relative=0.0091567
    • '106_convolutional' Mean Error: Absolute=0.016827, Relative=0.007058
  • yolov4-tiny-416

    • '030_convolutional' Mean Error: Absolute=0.0065569, Relative=0.0021531
    • '037_convolutional' Mean Error: Absolute=0.0080654, Relative=0.0048672
  • yolov4-608

    • '139_convolutional' Mean Error: Absolute=0.01843, Relative=0.010256
    • '150_convolutional' Mean Error: Absolute=0.014698, Relative=0.0067943
    • '161_convolutional' Mean Error: Absolute=0.010814, Relative=0.0046399

The TensorRT "yolov3-tiny" FP16 engine is the only one that generates an output with >5% mean relative error versus onnxruntime (all the others are <1%). I think this indeed explains why the TensorRT "yolov3-tiny" engine evaluates to a much worse mAP than its DarkNet counterpart, compared to the other models ("yolov3-608", "yolov4-tiny-416" and "yolov4-608")...
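
For anyone who wants to sanity-check such numbers outside of Polygraphy, here is a minimal numpy sketch of how a mean absolute/relative error between two runners' outputs can be computed (my reading of the metrics above; Polygraphy's exact formula may differ):

import numpy as np

def mean_errors(trt_out, onnx_out, eps=1e-12):
    # element-wise errors averaged over the whole output tensor
    abs_err = np.abs(trt_out - onnx_out)
    rel_err = abs_err / (np.abs(onnx_out) + eps)  # eps guards against division by zero
    return abs_err.mean(), rel_err.mean()

# e.g. compare the '016_convolutional' outputs dumped from both runners:
# mean_abs, mean_rel = mean_errors(trt_016, onnx_016)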

@Duarte-Nunes

Hello, sorry for not adding anything to the discussion, but I wanted to check: I'm currently trying to deploy this repository on a Jetson Nano.

Does the yolov4-tiny model also exhibit the mAP drop that has been discussed mainly for yolov3?

Anyway, if this is unclear I will conduct my own tests on a custom dataset and report the results back to you.

@jkjung-avt (Owner)

Does the yolov4-tiny model also present the mAP drop that has been discussed mainly for yolov3?

Based on my mAP evaluation results, "yolov3-tiny" suffers from this problem quite a bit. The other models ("yolov3", "yolov4-tiny" and "yolov4") are probably OK.

I would focus on solving the problem for "yolov3-tiny" if I have time.

@akashAD98

akashAD98 commented Nov 15, 2021

@jkjung-avt I see the same problem for the yolov4-mish and yolov4-csp-swish models: I'm getting lots of false positives, and the results are not the same as darknet's. May I know the reasons behind it, and how can we solve the false-positive problem?

@jkjung-avt (Owner)

@akashAD98 This is a known issue. I've done my best to make sure the code is correct for both TensorRT engine building and inference. But TensorRT engine optimization does result in an mAP drop for various YOLO models.

I have also tried to analyze this problem with Polygraphy, as shown above, but failed to find the root cause and a solution. I don't have a good answer now. That's why I've kept this issue open...

@akashAD98

@jkjung-avt Thanks for your kind reply. We all appreciate your great work. I hope you'll find a solution in the future.

@akashAD98

@jkjung-avt Can we run inference on the ONNX model and check its FPS and false predictions? Do you think its accuracy (false predictions) would be the same as TensorRT's?
Do you have any script for running inference on the ONNX model, like the one for TensorRT? That way we would get an idea of whether the problem is in the darknet-to-ONNX conversion or in the ONNX-to-TensorRT conversion.

@jkjung-avt (Owner)

Do you have any script for doing inference on onnx model?

I have done that for MODNet, but not for the YOLO models. Some of the code could be reused, though: https://github.com/jkjung-avt/tensorrt_demos/blob/master/modnet/test_onnx.py

In order to check mAP and false detections with the ONNX YOLO models, you'll also have to implement the "yolo" layers in the post-processing code (this part is handled by the "yolo_layer" plugin in the TensorRT case). I don't think I'll have time to do that in the near future...
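
For reference, a rough numpy sketch of what such a "yolo" decode step has to do for one output scale, mirroring the kind of computation the "yolo_layer" plugin performs (the names, anchor handling, and exact output layout here are illustrative assumptions, not code from the repo):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolo_scale(out, anchors, num_classes, input_size, scale_x_y=1.0):
    # out: raw conv output of shape (1, num_anchors*(5+num_classes), gh, gw)
    _, _, gh, gw = out.shape
    na = len(anchors)
    out = out.reshape(na, 5 + num_classes, gh, gw)
    col, row = np.meshgrid(np.arange(gw), np.arange(gh))  # both of shape (gh, gw)
    detections = []
    for a in range(na):
        cx = (col + scale_x_y * sigmoid(out[a, 0]) - 0.5 * (scale_x_y - 1)) / gw
        cy = (row + scale_x_y * sigmoid(out[a, 1]) - 0.5 * (scale_x_y - 1)) / gh
        w = anchors[a][0] * np.exp(out[a, 2]) / input_size  # anchors given in pixels
        h = anchors[a][1] * np.exp(out[a, 3]) / input_size
        obj = sigmoid(out[a, 4])                            # objectness
        cls = sigmoid(out[a, 5:])                           # per-class probabilities
        score = obj * cls.max(axis=0)                       # box confidence
        cls_id = cls.argmax(axis=0)
        detections.append((cx, cy, w, h, score, cls_id))
    return detections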

@akashAD98

akashAD98 commented Nov 20, 2021

Hi @jkjung-avt, do you have any idea how I should solve this issue? onnx/tutorials#253 (comment)

[screenshot of the error]

This is the script: inference_onnx_yolov4-mish.ipynb.txt

@akashAD98

@jkjung-avt Please have a look.

@jkjung-avt (Owner)

@akashAD98 I already commented: onnx/tutorials#253 (comment)

You need to modify the post-processing code yourself.

@akashAD98

akashAD98 commented Nov 26, 2021

The preprocessed image's original shape is (1, 416, 416, 3).

I converted channels-first to channels-last:

[screenshot of the conversion code]

and got:

[screenshot of the error]

So it's saying we need (1, 3, 416, 416).
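
In case it helps: the error says the model wants channels-first (NCHW), so the conversion needed is from the (1, 416, 416, 3) channels-last layout to (1, 3, 416, 416). A minimal numpy sketch:

import numpy as np

img = np.zeros((1, 416, 416, 3), dtype=np.float32)  # NHWC, as produced by the preprocessing
img = np.transpose(img, (0, 3, 1, 2))               # NHWC -> NCHW
img = np.ascontiguousarray(img)                     # some runtimes require contiguous memory
print(img.shape)                                    # (1, 3, 416, 416)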

@akashAD98

akashAD98 commented Nov 29, 2021

@jkjung-avt Is there any model that gives almost the same results as darknet? yolov4-csp and yolov4-mish have false-prediction issues, so I'm looking for a good TensorRT model. Is yolov4 the best?

@jkjung-avt (Owner)

Please refer to the "mAP and FPS" table in Demo #5: YOLOv4.

@akashAD98

akashAD98 commented Mar 3, 2022

@jkjung-avt
I converted YOLO models to TensorRT, and I'm getting too many false predictions, as I said earlier in this issue.

One observation from my experiments:
I trained a home-category model with 50 classes, which has very few false predictions.
I trained a music-category model with only 10 classes, and I'm getting too many false predictions.

I have done the same experiment with a few more sets of classes, and from those experiments I've learned that there are fewer false positives when there are more classes and more false positives when there are fewer.

This is just my experimental observation; if you think it can help solve this issue, please let us know. Thanks.

@jkjung-avt (Owner)

@akashAD98 Thanks for sharing the info. I tried to think about possible causes of such results but could not come up with any. I will keep this in mind and share my experience/thoughts when I have new findings.

@ThomasGoud

ThomasGoud commented Apr 27, 2022

Hello @jkjung-avt,

For my use case, I am trying to detect only one type of object (a single class) with yolov3.

After comparing with the keras-yolo3 code (https://github.com/experiencor/keras-yolo3), I observe a major difference in how the output class probabilities are processed.

In the original code (https://github.com/experiencor/keras-yolo3/blob/master/utils/utils.py at line 179), they apply softmax to all class probabilities:
netout[..., 5:] = netout[..., 4][..., np.newaxis] * _softmax(netout[..., 5:])

In your code (https://github.com/jkjung-avt/tensorrt_demos/blob/master/plugins/yolo_layer.cu at line 167), you post-process the class probabilities with a sigmoid:
float max_cls_prob = sigmoidGPU(max_cls_logit);

In my case, with only one class:

  • with the original code: the softmax is always equal to one, so the score associated with the bounding box is pobj*pclass0 = pobj.
  • with your implementation: the score of the bounding box is pobj*pclass0, which is smaller (since the per-class sigmoid is less than one).

I think this problem can explain why the mAP is better with more classes (because with more classes the softmax becomes more similar to the sigmoid).
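
To make the single-class difference concrete, here is a small numpy sketch of the two scoring schemes (the logit values are made up):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

obj_logit = 2.0                # hypothetical objectness logit
cls_logits = np.array([1.5])   # a single class

p_obj = sigmoid(obj_logit)

# keras-yolo3 style: softmax over classes; with one class it is always 1.0
score_softmax = p_obj * softmax(cls_logits)[0]   # == p_obj ~= 0.8808

# yolo_layer plugin style: per-class sigmoid; always <= p_obj
score_sigmoid = p_obj * sigmoid(cls_logits[0])   # ~= 0.7201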

Thank you for your contribution with tensorrt_demos,
Thomas

@jkjung-avt (Owner)

@ThomasGoud Thanks for sharing your thoughts. But according to the original DarkNet implementation, the objectness and class scores are calculated by applying LOGISTIC (i.e. sigmoid) activation to the outputs of the previous convolutional layers.

You can refer to the source code linked below.

https://github.com/AlexeyAB/darknet/blob/8a0bf84c19e38214219dbd3345f04ce778426c57/src/yolo_layer.c#L680

https://github.com/AlexeyAB/darknet/blob/8a0bf84c19e38214219dbd3345f04ce778426c57/src/yolo_layer.c#L1190
