
Ray Cluster does not work across multiple docker containers #45252

Open
ccruttjr opened this issue May 10, 2024 · 3 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
@external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
P2: Important issue, but not time-critical

Comments

@ccruttjr

What happened + What you expected to happen

Without Docker, my two computers communicate fine. Likewise, if Ray runs inside a Docker container on one machine and the other computer connects to it without Docker, it works fine. But if both computers run Ray inside Docker containers, or if the Docker container is not the head, the cluster works for a while and then the worker's Docker container stops connecting to the head. I can see this happen with ray status. More detail and reproduction steps are below.

Versions / Dependencies

ray==2.20.0

Reproduction script

How to easily reproduce

This works (straight computer to computer):

# On Computer 1
ray start --head # Local node IP: 192.168.250.20
# On Computer 1
ray status # Shows one node
# On Computer 2 on same network
ray start --address='192.168.250.20:6379'
# On Computer 2, wait a few seconds then
ray status # shows two nodes
# On Computer 1, quickly
ray status # shows two nodes
# On Computer 1, wait a bit and then
ray status # shows two nodes
# On both computers
ray stop # should stop all processes :)
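To watch for the drop-off without re-running ray status by hand, a simple polling loop works (this is just a convenience sketch, not part of the original steps):

# On the head, print the cluster view every 30 seconds
while true; do
  date
  ray status
  sleep 30
done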

This semi-works (host computer to Docker container):

# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.04-py3
WORKDIR /app
CMD ["bash"]

# Build and run the image
docker build -t my-python-cuda-app .
# I know the port forwarding is clunky and overkill, but I just wanted to be sure
docker run -it --gpus all --ipc=host \
  -p 8000:8000 -p 6379:6379 -p 10001:10001 -p 10003:10003 -p 10004:10004 \
  -p 10005:10005 -p 10006:10006 -p 10007:10007 -p 10008:10008 -p 10009:10009 \
  -p 10010:10010 -p 10011:10011 -p 10012:10012 -p 10013:10013 -p 10014:10014 \
  -p 33189:33189 -p 38065:38065 -p 44217:44217 -p 63051:63051 \
  --name my-python-gpu-container my-python-cuda-app
# Now in docker instance
pip install ray==2.20.0
ray start --head # should give different ip
# On Computer 2
ray start --address='192.168.250.20:6379' # Still use the host computer's ip
# Run ray status like above and see two nodes are connected and staying connected
# Now stop both ray instances and make Computer 2 the head and Docker the worker
# If you do ray status soon after adding the Docker worker, it will show two nodes.
# If you wait a bit, however, it will only show Computer 2's node - the head

What is happening in Docker that isn't happening on a "normal" computer? Is it putting the process to sleep? As a side note, when I stop the worker's Ray while it is still connected to the head, it usually stops 2 Ray processes. But when I stop the Docker worker's Ray after the head only sees one node, it reports stopping only one process.
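For anyone digging into what the container-side node is doing when it drops off, a rough diagnostic is the following (assuming the default Ray temp directory and that ss is available in the container):

# Inside the worker container, after the node disappears from ray status:
# check what the raylet logged around the time it lost contact with the head
tail -n 50 /tmp/ray/session_latest/logs/raylet.out
tail -n 50 /tmp/ray/session_latest/logs/raylet.err
# and confirm which ports the Ray processes are actually listening on
ss -tlnp | grep -E 'raylet|gcs|python'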

Issue Severity

None

ccruttjr added the bug and triage labels on May 10, 2024
anyscalesam added the core label on May 13, 2024
@rynewang
Contributor

I think it's the ports. By default, Ray selects some random ports to serve internal traffic, and those port numbers change every time you start, so you can't forward ports based on one run's results.

https://docs.ray.io/en/latest/ray-core/configure.html#ports-configurations

You can set fixed port numbers based on this doc and see if it works.
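A rough sketch of what that could look like for the head-in-Docker case (the specific port numbers below are only examples, not taken from this issue; the full list of configurable ports is in the doc above):

# Inside the head container: pin the ports Ray would otherwise choose at random
ray start --head \
  --port=6379 \
  --ray-client-server-port=10001 \
  --node-manager-port=10002 \
  --object-manager-port=10003 \
  --min-worker-port=10100 \
  --max-worker-port=10199

# On the host: publish exactly those ports when starting the container
docker run -it --gpus all --ipc=host \
  -p 6379:6379 -p 10001:10001 -p 10002:10002 -p 10003:10003 \
  -p 10100-10199:10100-10199 \
  --name my-python-gpu-container my-python-cuda-app

The dashboard, metrics, and runtime-env agent ports can be pinned and published the same way if they also need to be reachable, per the linked doc.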

rynewang added the P2 and @external-author-action-required labels and removed the triage label on May 20, 2024
@ccruttjr
Author

ccruttjr commented May 22, 2024

Hmm, but why does the worker connect initially and then stop? Wouldn't it just never connect? Anyway, I also tried --network=host and --publish-all to no avail, if that was supposed to fix something.

I also tried running this, which I believe is what you were referencing in the link above, but got the same results.

ray start --head --max-worker-port 10005 --node-manager-port 10006 --object-manager-port 10007 --runtime-env-agent-port 10008

edit: also using ray==2.22.0 now instead of 2.20.0
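Since the container is the worker in the failing case, the worker-side ports probably need to be pinned and published as well, not just the head's. A hypothetical worker-side start, with example port numbers and a placeholder for the head's address, might look like this:

# Inside the worker container: pin the worker node's ports so the head can reach them
# (<computer-2-ip> is a placeholder for the head's address; port numbers are examples)
ray start --address='<computer-2-ip>:6379' \
  --node-manager-port=10006 \
  --object-manager-port=10007 \
  --runtime-env-agent-port=10008 \
  --min-worker-port=10002 \
  --max-worker-port=10005 \
  --metrics-export-port=10009
# Each pinned port then needs a matching -p <port>:<port> when starting the container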

@rynewang
Contributor

docker run -it --gpus all --ipc=host -p 8000:8000 -p 6379:6379 -p 10001:10001 -p 10003:10003 -p 10004:10004 -p 10005:10005 -p 10006:10006 -p 10007:10007 -p 10008:10008 -p 10009:10009 -p 10010:10010 -p 10011:10011 -p 10012:10012 -p 10013:10013 -p 10014:10014 -p 33189:33189 -p 38065:38065 -p 44217:44217 -p 63051:63051 --name my-python-gpu-container my-python-cuda-app

It turns out you can't easily dockerize a Ray worker, because we have many different interconnection requirements. Can you try this and see if it works?

docker run -it -v /ray/tmp:/ray/tmp --gpus all --ipc=host --pid=host --network=host --userns=keep-id --env-file <(env) --name my-python-gpu-container my-python-cuda-app
