
Really bad Performance with YOLOv8 #12638

SpaceFlier100 opened this issue May 13, 2024 · 6 comments
Labels
question Further information is requested

Comments

@SpaceFlier100

Search before asking

Question

Running YOLOv8 on a laptop with a 4070, I get around 1-2 FPS. Task Manager shows GPU usage at around 20%, and the GPU is never fully utilised. I have tried resizing the input stream, but that has no visible effect on frame times, even at a resolution as low as 360x360.
My code:

import cv2
from ultralytics import YOLO

model = YOLO('runs/detect/yolov8n_v8_100e/weights/best.pt')
model.to('cuda')
cam = cv2.VideoCapture(0)

while True:
    result, img = cam.read()
    results = model(img, conf=0.2, show=True)

Additional

No response

@SpaceFlier100 added the question label May 13, 2024

👋 Hello @SpaceFlier100, thank you for your interest in Ultralytics YOLOv8 🚀! We recommend a visit to the Docs for new users where you can find many Python and CLI usage examples and where many of the most common questions may already be answered.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Join the vibrant Ultralytics Discord 🎧 community for real-time conversations and collaborations. This platform offers a perfect space to inquire, showcase your work, and connect with fellow Ultralytics users.

Install

Pip install the ultralytics package including all requirements in a Python>=3.8 environment with PyTorch>=1.8.

pip install ultralytics

Environments

YOLOv8 may be run in any of our up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled).

Status

If the Ultralytics CI badge is green, all Ultralytics CI tests are currently passing. CI tests verify correct operation of all YOLOv8 Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

@SpaceFlier100 hi there! It sounds frustrating to deal with such performance issues. 🙁 Let's try a few things to improve your GPU utilization:

  1. Ensure that CUDA is properly configured - Sometimes, the system may not effectively delegate the task to the GPU, despite calling .to('cuda').

  2. Minimize data transfer overhead - Running inference in a loop that transfers images from CPU to GPU one at a time might be slowing you down. Try minimizing this overhead or batch processing if possible (see the streaming sketch after this list).

  3. Adjust inference settings - Tweak parameters like batch size, number of workers in data loading, or enable features like Automatic Mixed Precision (amp) to possibly enhance the performance:

import cv2
from ultralytics import YOLO

model = YOLO('runs/detect/yolov8n_v8_100e/weights/best.pt', amp=True)  # Enabling AMP
model.to('cuda')
cam = cv2.VideoCapture(0)

while True:
    result, img = cam.read()
    results = model(img, conf=0.2, show=True, batch=8)  # Adjust the batch size as needed
  4. Check and update drivers - Sometimes, older GPU drivers might lead to poor utilization. Ensure your NVIDIA drivers are up-to-date.
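
On point 2, a low-effort option is to let ultralytics drive the capture loop itself. A minimal sketch, assuming a recent ultralytics release where predict() accepts an integer webcam index as source plus a stream=True generator mode:

from ultralytics import YOLO

model = YOLO('runs/detect/yolov8n_v8_100e/weights/best.pt')

# stream=True yields results frame by frame instead of building a list,
# and the built-in loader handles capture and preprocessing for you
for result in model(source=0, conf=0.2, stream=True, show=True, device='cuda'):
    pass  # per-frame results are available here if further processing is needed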

See if these changes help, and please reach back out if you continue experiencing issues or have any more questions!

@SpaceFlier100
Copy link
Author

Hi, I tried implementing some of your recommendations. My interpreter doesn't recognise the amp keyword, so I'm getting a TypeError. I'm afraid I need low latency for my application, so I can't batch process. I have the latest drivers. Also, can you elaborate on "Ensure that CUDA is properly configured"? I'm not sure what you mean by that.

@glenn-jocher
Copy link
Member

Hello @SpaceFlier100! Thanks for trying out the suggestions and providing feedback. Let's resolve these issues step by step:

  1. AMP Support: The TypeError occurs because the YOLO constructor doesn't accept an amp keyword. Unfortunately, AMP can't be enabled directly through the YOLO API as of now. My apologies for the incorrect advice there! (A possible FP16 alternative follows this list.)

  2. CUDA Configuration: Ensuring CUDA is properly configured typically involves checking that PyTorch is using the available GPU. You can check this by running:

    import torch
    print(torch.cuda.is_available())  # Should return True
    print(torch.cuda.current_device())  # Should show your GPU index, usually 0
    print(torch.cuda.get_device_name(0))  # Should display your GPU name
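
On the AMP point above: if the goal was lower-precision inference for speed, recent ultralytics releases accept a half argument at predict time (worth verifying against your installed version, so treat this as an assumption rather than a guarantee):

results = model(img, conf=0.2, show=True, half=True)  # FP16 inference, if supported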

Even without batch processing, you can still work on keeping per-frame overhead down. Reducing the resolution, as you've tried, might not be enough if the fixed cost of setting up each frame's detection is high.

If your setup confirms that CUDA is being used but the GPU still isn’t fully utilized, consider profiling your code to identify any bottlenecks. Tools like NVIDIA’s Nsight Systems can be quite revealing for this purpose.
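
Before reaching for Nsight, a lighter-weight first pass is plain Python timing around each stage of your loop. This sketch adapts your script and drops show=True so that window rendering doesn't pollute the inference measurement:

import time
import cv2
from ultralytics import YOLO

model = YOLO('runs/detect/yolov8n_v8_100e/weights/best.pt')
model.to('cuda')
cam = cv2.VideoCapture(0)

while True:
    t0 = time.perf_counter()
    ok, img = cam.read()  # camera capture
    if not ok:
        break
    t1 = time.perf_counter()
    results = model(img, conf=0.2)  # inference only
    t2 = time.perf_counter()
    print(f'capture {(t1 - t0) * 1e3:.1f} ms, inference {(t2 - t1) * 1e3:.1f} ms')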

Hope this helps, and let's keep the optimizations going! 😊

@SpaceFlier100
Copy link
Author

The CUDA checks all output what they should. Upon looking into Nsight, it seems to be a graphics debugger rather than an AI tool; can it be applied to AI applications as well? Do you have any other tests or suggestions to determine why my GPU isn't fully used? The only place the code hangs is when results = model(img, conf=0.2, show=True) is called; everywhere else it is lightning fast.

@glenn-jocher
Copy link
Member

Hello @SpaceFlier100! Great job diving into CUDA checks and using Nsight for debugging! 🚀 Indeed, while Nsight Systems is primarily targeted towards debugging and profiling applications at the system level, it can definitely be applied to AI applications to visualize compute and memory usage, which can help identify bottlenecks in your application.

If you're noticing that the code hangs specifically at the model inference step (results = model(img, conf=0.2, show=True)), it could be valuable to check whether data loading or preprocessing is indirectly causing delays. Given that the GPU isn't fully used, you might want to ensure that the image data is pre-loaded and ready for the GPU without waiting; a sketch of one way to do that follows.
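
Here is a hedged sketch of that idea: a background thread keeps grabbing frames so cam.read() never blocks inference, and only the most recent frame is kept (the grab helper is just illustrative, not an ultralytics API):

import threading
import cv2
from ultralytics import YOLO

latest = {'frame': None}

def grab(cam):
    # continuously overwrite the latest frame; stale frames are simply dropped
    while True:
        ok, frame = cam.read()
        if ok:
            latest['frame'] = frame

cam = cv2.VideoCapture(0)
threading.Thread(target=grab, args=(cam,), daemon=True).start()

model = YOLO('runs/detect/yolov8n_v8_100e/weights/best.pt')
model.to('cuda')

while True:
    img = latest['frame']
    if img is None:
        continue  # the grabber hasn't produced a frame yet
    results = model(img, conf=0.2, show=True)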

Another quick thing to check is whether any other processes are competing for GPU resources. Simplifying the model or reducing the resolution were good initial tests but might not address the root cause if it’s related to how the incoming data is managed.

If these suggestions don't help, running a simplified version of your script with minimal overhead can help isolate whether the issue is framework-related or specific to your current setup; a sketch of such a test is below. Keep pushing; you're on the right track! 👍
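
As a minimal isolation test (a sketch, with 'test.jpg' standing in for any local image), run inference repeatedly on one fixed frame with no camera and no display. If this reports tens of milliseconds per call on a 4070, the bottleneck is on the capture/display side rather than in the framework:

import time
import cv2
from ultralytics import YOLO

model = YOLO('runs/detect/yolov8n_v8_100e/weights/best.pt')
model.to('cuda')

img = cv2.imread('test.jpg')  # hypothetical local image; any frame will do
model(img, conf=0.2)          # warm-up call (CUDA init, first-run overhead)

n = 100
t0 = time.perf_counter()
for _ in range(n):
    model(img, conf=0.2)
print(f'{(time.perf_counter() - t0) / n * 1e3:.1f} ms per frame')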
