
[core][experimental] Higher than expected overhead for shared memory channels with NCCL #45319

Open
stephanie-wang opened this issue May 14, 2024 · 1 comment
Labels
accelerated-dag, bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), performance

Comments

@stephanie-wang
Contributor

stephanie-wang commented May 14, 2024

What happened + What you expected to happen

Microbenchmark results for a single-actor accelerated DAG show about 30k calls/s, or about 30us/call. That is consistent with other microbenchmarks that @jackhumphries ran for channel performance, which showed low tens of microseconds per channel op.

However, a microbenchmark for the recently added NCCL transport shows about 5.8k calls/s for NCCL alone and 3.2k calls/s for DAG+NCCL. This translates to about 130us of overhead per DAG call, more than 4x what's expected.
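For reference, a minimal sketch of this kind of single-actor DAG microbenchmark (the actor, method, and the execute/get pattern are illustrative assumptions; the accelerated DAG execution API has varied across versions, so see the linked microbenchmarks for the actual code):

```python
import time

import ray
from ray.dag import InputNode


@ray.remote
class EchoActor:
    # Trivial actor method so the benchmark measures DAG overhead,
    # not application work.
    def echo(self, x):
        return x


ray.init()
actor = EchoActor.remote()

# Build and compile a single-actor accelerated DAG.
with InputNode() as inp:
    dag = actor.echo.bind(inp)
compiled_dag = dag.experimental_compile()

# Time end-to-end DAG calls (execute/get pattern is an assumption here).
n = 10_000
start = time.perf_counter()
for _ in range(n):
    ray.get(compiled_dag.execute(b"x"))
elapsed = time.perf_counter() - start
print(f"{n / elapsed:.0f} calls/s ({elapsed / n * 1e6:.1f} us/call)")
```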

Versions / Dependencies

3.0dev

Reproduction script

See linked microbenchmarks.

Issue Severity

None

@stephanie-wang stephanie-wang added the bug, P1, and accelerated-dag labels and removed the triage label on May 14, 2024
stephanie-wang added a commit that referenced this issue May 20, 2024
…passed via NCCL in accelerated DAG (#45332)

This adds support for dynamically sized torch.Tensors to be passed
between accelerated DAG nodes via NCCL. Specifically, the following code
is now supported, whereas previously `shape` and `dtype` had to be
explicitly passed to `TorchTensorType`.

```python
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

# sender and receiver are existing actor handles.
with InputNode() as inp:
    dag = sender.send.bind(inp)
    # No explicit shape/dtype: the tensor's size is resolved at runtime.
    dag = dag.with_type_hint(TorchTensorType(transport="nccl"))
    dag = receiver.recv.bind(dag)

compiled_dag = dag.experimental_compile()
```

The feature works by creating a shared memory channel to pass the
metadata for the shape and dtype of the tensor. The metadata is then
used to create a buffer of the correct size on the NCCL receiver.
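Conceptually, the send/receive path looks something like the following sketch (the channel objects and function names here are hypothetical stand-ins, not Ray's internal classes):

```python
import torch

# Illustrative sketch only: `meta_channel` and `nccl_channel` stand in for
# Ray's internal shared-memory and NCCL channel objects.
def send_dynamic_tensor(tensor, meta_channel, nccl_channel):
    # Ship the dynamic shape/dtype over the small shared-memory channel first.
    meta_channel.write((tuple(tensor.shape), tensor.dtype))
    # Then send the tensor contents over NCCL.
    nccl_channel.send(tensor)

def recv_dynamic_tensor(meta_channel, nccl_channel, device="cuda"):
    # Read the metadata so we know how large a receive buffer to allocate.
    shape, dtype = meta_channel.read()
    buf = torch.empty(shape, dtype=dtype, device=device)
    # Receive the tensor contents directly into the correctly sized buffer.
    nccl_channel.recv(buf)
    return buf
```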

Initial microbenchmarks show this adds about 50% throughput overhead
compared to statically declaring the shape and dtype, or about 160us/DAG
call. This is a bit higher than expected (see also #45319).

This also adds a few other fixes:
- Adds support for reusing actors to create new NCCL groups, which is
needed if a DAG is torn down and a new one is created.
- Adds a lock to DAG teardown, to prevent the same NCCL group from
being destructed twice.
- A user-defined TorchTensorType shape or dtype is now used as a hint for
the buffer size, instead of a required size. Since buffers are currently
static, an error is thrown if the user tries to return a too-large
tensor (see the sketch after this list).
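
A rough illustration of the hint semantics (the `shape`/`dtype` parameter names and the exact failure mode are assumptions based on the description above, not the confirmed `TorchTensorType` signature):

```python
import torch
from ray.dag import InputNode
from ray.experimental.channel.torch_tensor_type import TorchTensorType

# sender and receiver are actor handles as in the snippet above.
with InputNode() as inp:
    dag = sender.send.bind(inp)
    # shape/dtype now size the static buffer as a hint, not a hard contract.
    dag = dag.with_type_hint(
        TorchTensorType(shape=(1024,), dtype=torch.float16, transport="nccl")
    )
    dag = receiver.recv.bind(dag)
compiled_dag = dag.experimental_compile()

# A tensor that fits within the hinted buffer works; returning one larger
# than the hint raises an error, since channel buffers are statically sized.
```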

Part 1 of #45306; a follow-up PR will add support for nested tensors.


---------

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
@anyscalesam
Collaborator

Discussed during standup today: the goal is to improve the NCCL performance to within 50% overhead, versus the current 4x. This should also help with vLLM performance (it may or may not have a direct impact, but it will draw that gap closer as well).
