
[core][experimental] Higher than expected overhead for shared memory channels with NCCL #45319

Open
stephanie-wang opened this issue May 14, 2024 · 1 comment
Labels
accelerated-dag, bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), performance

Comments

@stephanie-wang
Contributor

stephanie-wang commented May 14, 2024

What happened + What you expected to happen

Microbenchmark results for a single-actor accelerated DAG show about 30k calls/s, or about 30us/call. That is consistent with other microbenchmarks that @jackhumphries ran for channel performance, which showed low tens of microseconds per channel op.

However, a microbenchmark for the recently added NCCL transport shows about 5.8k calls/s for NCCL alone and 3.2k calls/s for DAG+NCCL. This translates to about 130us of overhead per DAG call, more than 4x what's expected.
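For reference, a minimal sketch of this kind of single-actor DAG microbenchmark (the actor, method, and the execute/get pattern are illustrative assumptions; the accelerated DAG execution API has varied across versions, so see the linked microbenchmarks for the actual code):

```python
import time

import ray
from ray.dag import InputNode


@ray.remote
class EchoActor:
    # Trivial actor method so the benchmark measures DAG overhead,
    # not application work.
    def echo(self, x):
        return x


ray.init()
actor = EchoActor.remote()

# Build and compile a single-actor accelerated DAG.
with InputNode() as inp:
    dag = actor.echo.bind(inp)
compiled_dag = dag.experimental_compile()

# Time end-to-end DAG calls (execute/get pattern is an assumption here).
n = 10_000
start = time.perf_counter()
for _ in range(n):
    ray.get(compiled_dag.execute(b"x"))
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} calls/s ({elapsed / n * 1e6:.1f} us/call)")
```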

Versions / Dependencies

3.0dev

Reproduction script

See linked microbenchmarks.

Issue Severity

None

@stephanie-wang stephanie-wang added the bug, P1, and accelerated-dag labels and removed the triage label on May 14, 2024
stephanie-wang added a commit that referenced this issue May 20, 2024
…passed via NCCL in accelerated DAG (#45332)

This adds support for dynamically sized torch.Tensors to be passed
between accelerated DAG nodes via NCCL. Specifically, the following code
is now supported, whereas previously `shape` and `dtype` had to be
explicitly passed to `TorchTensorType`.

```python
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

# sender and receiver are existing actor handles.
with InputNode() as inp:
    dag = sender.send.bind(inp)
    # No explicit shape/dtype: the tensor's size is resolved at runtime.
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
```

The feature works by creating a shared memory channel to pass the
metadata for the shape and dtype of the tensor. The metadata is then
used to create a buffer of the correct size on the NCCL receiver.
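Conceptually, the send/receive path looks something like the following sketch (the channel objects and function names here are hypothetical stand-ins, not Ray's internal classes):

```python
import torch

# Illustrative sketch only: `meta_channel` and `nccl_channel` stand in for
# Ray's internal shared-memory and NCCL channel objects.
def send_dynamic_tensor(tensor, meta_channel, nccl_channel):
    # Ship the dynamic shape/dtype over the small shared-memory channel first.
    meta_channel.write((tuple(tensor.shape), tensor.dtype))
    # Then send the tensor contents over NCCL.
    nccl_channel.send(tensor)

def recv_dynamic_tensor(meta_channel, nccl_channel, device="cuda"):
    # Read the metadata so we know how large a receive buffer to allocate.
    shape, dtype = meta_channel.read()
    buf = torch.empty(shape, dtype=dtype, device=device)
    # Receive the tensor contents directly into the correctly sized buffer.
    nccl_channel.recv(buf)
    return buf
```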

Initial microbenchmarks show this adds about 50% throughput overhead
compared to statically declaring the shape and dtype, or about 160us/DAG
call. This is a bit higher than expected (see also #45319).

This also adds a few other fixes:
- Adds support for reusing actors to create new NCCL groups, which is
needed if a DAG is torn down and a new one is created.
- Adds a lock to DAG teardown, to prevent the same NCCL group from
being destructed twice.
- A user-defined TorchTensorType shape or dtype is now used as a hint for
the buffer size, instead of a required size. Since buffers are currently
static, an error is thrown if the user tries to return a too-large
tensor (see the sketch after this list).
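
A rough illustration of the hint semantics (the `shape`/`dtype` parameter names and the exact failure mode are assumptions based on the description above, not the confirmed `TorchTensorType` signature):

```python
import torch
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

# sender and receiver are actor handles as in the snippet above.
with InputNode() as inp:
    dag = sender.send.bind(inp)
    # shape/dtype now size the static buffer as a hint, not a hard contract.
    dag = dag.with_type_hint(
        TorchTensorType(shape=(1024,), dtype=torch.float16, transport="nccl")
    )
    dag = receiver.recv.bind(dag)
compiled_dag = dag.experimental_compile()

# A tensor that fits within the hinted buffer works; returning one larger
# than the hint raises an error, since channel buffers are statically sized.
```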

Part 1 of #45306; a follow-up PR will add support for nested tensors.


---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
@anyscalesam
Collaborator

Discussed during standup today: the goal is to improve the NCCL performance to within 50% overhead, versus the current 4x. This should also help with vLLM performance (it may or may not have a direct impact, but it will draw that gap closer as well).
