[core][experimental] Higher than expected overhead for shared memory channels with NCCL #45319
Labels: accelerated-dag, bug, P1, performance
What happened + What you expected to happen
Microbenchmark results for a single-actor accelerated DAG show about 30k calls/s, or about 30us/call. That is consistent with other microbenchmarks that @jackhumphries ran for channel performance, which showed low tens of microseconds per channel op.
However, a microbenchmark for the recently added NCCL transport shows about 5.8k calls/s for NCCL alone and 3.2k calls/s for DAG+NCCL. This translates to about 130us of added overhead per DAG call, more than 4x the roughly 30us/call expected from the shared memory channel results.
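For reference, the per-call arithmetic behind these numbers can be sketched as follows. This is a back-of-the-envelope check only, assuming the overhead figure is the difference between the DAG+NCCL and NCCL-only per-call latencies. The helper `per_call_us` and the variable names are illustrative and not part of Ray. With the rounded throughputs quoted here the difference comes out near 140us, in the same ballpark as the "about 130us" figure.

```python
# Back-of-the-envelope check of the throughput figures quoted in this issue.
# Assumption: DAG overhead = (DAG+NCCL latency) - (NCCL-only latency).
# Inputs are the rounded calls/s numbers from the issue text.

def per_call_us(calls_per_second: float) -> float:
    """Convert a throughput in calls/s to a latency in microseconds per call."""
    return 1e6 / calls_per_second

dag_only_us = per_call_us(30_000)   # single-actor accelerated DAG
nccl_only_us = per_call_us(5_800)   # NCCL transport alone
dag_nccl_us = per_call_us(3_200)    # DAG + NCCL

# Extra time per call attributable to the DAG layer when NCCL is used.
dag_overhead_us = dag_nccl_us - nccl_only_us

# How that overhead compares with the DAG-only per-call cost.
overhead_ratio = dag_overhead_us / dag_only_us

print(f"DAG only:     {dag_only_us:.1f} us/call")
print(f"NCCL only:    {nccl_only_us:.1f} us/call")
print(f"DAG + NCCL:   {dag_nccl_us:.1f} us/call")
print(f"DAG overhead: {dag_overhead_us:.1f} us/call ({overhead_ratio:.1f}x DAG only)")
```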
Versions / Dependencies
3.0dev
Reproduction script
See linked microbenchmarks.
Issue Severity
None