Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add/Implement NUMA awareness #1093

Open
GermanAizek opened this issue Jan 22, 2024 · 10 comments
Open

Add/Implement NUMA awareness #1093

GermanAizek opened this issue Jan 22, 2024 · 10 comments
Labels
enhancement Enhancements

Comments

@GermanAizek
Copy link
Contributor

Current and Expected Behavior

threadpool(size_t minimum = 2, size_t maximum = std::thread::hardware_concurrency());

_context->thread_count = static_cast<int>(std::thread::hardware_concurrency());

auto p = obs_properties_add_int_slider(grp, ST_KEY_FFMPEG_THREADS, D_TRANSLATE(ST_I18N_FFMPEG_THREADS), 0, static_cast<int64_t>(std::thread::hardware_concurrency()) * 2, 1);

Steps to Reproduce the Problem

For more detailed explanation issue and fixes bug using my PR changes, you can read further at link here: notepad-plus-plus/notepad-plus-plus#14627

Log files & Crash Dumps

No response

Any additional Information we need to know?

No response

@GermanAizek GermanAizek added the bug Bugs such as crashing, freezing, broken functionality, etc. label Jan 22, 2024
@Xaymar
Copy link
Owner

Xaymar commented Jan 22, 2024

Seems fine as it is. It isn't returning 0, so nothing breaks.

@Xaymar Xaymar added enhancement Enhancements and removed bug Bugs such as crashing, freezing, broken functionality, etc. labels Jan 22, 2024
@GermanAizek
Copy link
Contributor Author

On my configuration hes return 36, not 72.

@Xaymar
Copy link
Owner

Xaymar commented Jan 23, 2024

I think I get what's going on, and why you're expecting a different value. You have a NUMA-enabled system with multiple CPUs, which you incorrectly assume to act as free additional performance. However in order to actually take advantage of this NUMA design (even the fake NUMA design on AMD Ryzen), we'd need to be fully aware of it.

Why? Because the OS scheduler will pin our threads and resources to the (fake) NUMA node that is closest to the primary resource. In StreamFX case, this is the CPU closest to the GPU. The other CPU will be deprioritized by the OS scheduler, so more threads won't run faster automatically. If anything, any memory allocated on the wrong CPU will end up significantly slower than normal.

The value you see reported here, 36, is the correct value for NUMA unaware software. Since there is no reason to support NUMA architectures, there's no reason to change this. If you believe that an encoder would work better with more threads, or you believe that the micro-tasks really require more cores (on average there's 1-2 tasks per minute), then you're free to implement the required NUMA awareness code.

@Xaymar Xaymar changed the title Wrong detect all threads in multi-CPU motherboards Add/Implement NUMA awareness Jan 23, 2024
@Xaymar
Copy link
Owner

Xaymar commented Jan 23, 2024

Also you can use the Custom Settings field to override what the UI offers for the FFmpeg encoders. Pretty sure there's no encoder out there that can effectively make use of that many threads in real time anyway.

@GermanAizek
Copy link
Contributor Author

@Xaymar,
well, for correct detection, do not need to rewrite a lot. It is enough to write your own function taking into detect logical cores, I think it will be useful for your project. I can do PR for you to merge, if u dont really want to do it yourself. Without additional half threads, my CPU load does not grow more than ~55-60% for entire OS.

@Xaymar
Copy link
Owner

Xaymar commented Jan 25, 2024

You do need to rewrite a lot for NUMA awareness. What is already being detected is correct for NUMA unaware applications. You can manually override the threads value using Custom Settings as already stated.

@GermanAizek
Copy link
Contributor Author

I understand, u don't know ffmpeg itself can use all logical cores, then I need to do PR there to fix it?

@Xaymar
Copy link
Owner

Xaymar commented Jan 25, 2024

You would need to make FFmpeg and StreamFX and OBS Studio fully NUMA aware on all platforms. This means that you need to replace new/delete, make_shared/make_unique, etc. with NUMA aware allocations close to where the primary resource related to the task is. Anything else will reduce performance.

Again, you can simply override the number of threads to the number of threads in your system. That's what Custom Settings is for. Note that Windows' incorrectly reports Core/Thread pairs as two cores, and thus fully relies on the CPU to report full utilization. A 36 Core system with 72 Threads at 50% is actually 100% utilized on Windows, but it can still do some additional work using unused or stalled pipelines. Threads are not more cores, just subdivisions of the Core that happen while the primary work on the Core is stalled.

AFAIK, only Intel properly reports Thread usage. AMD incorrectly claims Threads are at 0% usage if nothing is running on them or can run on them.

@GermanAizek
Copy link
Contributor Author

GermanAizek commented Jan 26, 2024

@Xaymar,
Indeed, u are right. For example, in the x64dbg debugger, rewriting VirtualAllocExNuma here really gave good performance boost.
x64dbg/x64dbg#3307

@Xaymar
Copy link
Owner

Xaymar commented Jan 26, 2024

Indeed. And for OBS Studio, and by extension StreamFX, to run better, you would need to allocate memory and threads close to the resource in question. This would mean allocating GPU-reliant data near the NUMA node that has the GPU, network-reliant data near the NUMA node that has the NIC, and disk related data near the NUMA node that has the disk. All for a tiny gain in performance that 99.9% of users will never need, and the remaining 0.1% could simply use Custom Settings and type in -threads=NUMBERHERE and be done in StreamFX.

Not to mention that it would actually slow down OBS Studio and StreamFX for users that do not have a NUMA system, since additional care has to be taken on every single allocation. With the amount of data StreamFX throws around per frame, and the amount of data OBS Studio throws around per frame, you'd end up slower than if you just used the Custom Settings as described above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancements
Projects
None yet
Development

No branches or pull requests

2 participants