[FEA] FP8 grouped gemm kernel without TMA #1483
@masahi: Could you share the problem shape for your specific grouped GEMM so we can better recommend next steps?
Additionally, for really small values of M you are likely to be bandwidth-bound anyway, in which case you can likely get roofline performance by recompiling the CUTLASS 2.x Ampere kernels (with or without stream-K).
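(As a rough back-of-the-envelope check of the bandwidth-bound claim, not a figure from this thread: for one group of shape M×N×K in FP8, at one byte per element, the kernel moves roughly M·K + K·N + M·N bytes and performs 2·M·N·K FLOPs, so the arithmetic intensity is about 2·M·N·K / (M·K + K·N + M·N) ≈ 2·M FLOP/byte when M ≪ N, K. With M on the order of a few dozen, that is well below the several-hundred FLOP/byte ratio of FP8 compute to HBM bandwidth on a Hopper-class GPU, so the K×N weight traffic dominates and the kernel sits on the memory roofline.)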
We did observe that the SM80 kernel is faster for small M.
I'm working on LLM inference. The problem shape … (the batch size of 20 is just an example). Depending on the size of the model and the number of active users, the batch size can range from a few dozen to a few hundred. Such problem shapes might seem ridiculously small from your perspective, but we can't always keep the batch size large, since we need to minimize latency.
Hi @masahi, I believe the Group Tile Scheduler should be easy to plug in: sm90_tile_scheduler_group.hpp. You can then change the pointers to A and B based on the group index: https://github.com/NVIDIA/cutlass/blob/7d49e6c7e2f8896c47f586706e67e1fb215529dc/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_pingpong.hpp#L344C12-L344C19. The epilogue from the TMA Hopper example (cutlass::epilogue::PtrArrayNoSmemWarpSpecialized) should work as-is, too.
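As a rough illustration of that suggestion, the per-group pointer selection might look like the sketch below. This is a hypothetical sketch, not a patch from this thread: names such as `scheduler`, `work_tile_info.L_idx`, `params.mainloop.ptr_A`, and `get_problem_shape` follow the conventions of the CUTLASS 3.x SM90 kernels and the group tile scheduler, but the exact members and signatures vary between CUTLASS versions.

```cpp
// Hypothetical sketch of per-group pointer selection inside a persistent SM90
// mainloop once the group tile scheduler is plugged in. Not a complete or verified
// patch; member names are assumptions based on the CUTLASS 3.x kernel/scheduler code.
auto work_tile_info = scheduler.initial_work_tile_info(ClusterShape{});

while (work_tile_info.is_valid()) {
  // With the group tile scheduler, the L (batch) index of the work tile
  // plays the role of the group index.
  int32_t group_idx = work_tile_info.L_idx;

  // Per-group problem shape and operand pointers (arrays with one entry per group).
  auto problem_shape_MNK = params.problem_shape.get_problem_shape(group_idx);
  ElementA const* ptr_A  = params.mainloop.ptr_A[group_idx];
  ElementB const* ptr_B  = params.mainloop.ptr_B[group_idx];

  // ... construct gA/gB from ptr_A/ptr_B and problem_shape_MNK, then run the
  //     mainloop and the PtrArrayNoSmemWarpSpecialized epilogue for this tile
  //     exactly as in the single-GEMM kernel ...

  work_tile_info = scheduler.fetch_next_work(work_tile_info);
}
```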
This issue has been labeled …
We want to try a tile size M smaller than 128, which fits our workload better. I'm assuming the requirement that the tile size M be a multiple of 128 comes from TMA, but for the small problem sizes we encounter in practice, TMA might be overkill.
According to @hwu36, this is not currently supported.
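For reference, the knob in question looks roughly like the sketch below in the CUTLASS 3.x builder interface. This is illustrative only: the element types, layouts, alignments, and schedule tags are placeholders, and whether a given schedule accepts an M tile smaller than 128 is exactly the open question in this issue.

```cpp
#include "cutlass/cutlass.h"
#include "cutlass/gemm/collective/collective_builder.hpp"

// Illustrative sketch of where the CTA tile shape is chosen with the CUTLASS 3.x
// CollectiveBuilder. Shrinking the first (M) mode, e.g. _64 instead of _128, is the
// change being asked about; the builder/scheduler may reject it for some schedules.
using TileShape    = cute::Shape<cute::_64, cute::_128, cute::_128>;  // (M, N, K) per CTA
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::float_e4m3_t, cutlass::layout::RowMajor,    16,  // A: FP8, 128-bit aligned
    cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, 16,  // B: FP8, 128-bit aligned
    float,                                                    // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;
```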