
Ray Cluster does not work across multiple docker containers #45252

Open
ccruttjr opened this issue May 10, 2024 · 3 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
@external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
P2: Important issue, but not time-critical

Comments

@ccruttjr

What happened + What you expected to happen

Without Docker, my two computers communicate fine. Likewise, if Ray runs inside a Docker container on one machine and the other computer connects to it without Docker, it works fine. But if both computers run Ray inside Docker containers, or if the Docker container is not the head, the cluster works for a while and then the worker's Docker container stops connecting to the head. I can see this happen with ray status. More detail and reproduction steps are below.

Versions / Dependencies

ray==2.20.0

Reproduction script

How to easily reproduce

This works (straight computer to computer):

# On Computer 1
ray start --head # Local node IP: 192.168.250.20
# On Computer 1
ray status # Shows one node
# On Computer 2 on same network
ray start --address='192.168.250.20:6379'
# On Computer 2, wait a few seconds then
ray status # shows two nodes
# On Computer 1, quickly
ray status # shows two nodes
# On Computer 1, wait a bit and then
ray status # shows two nodes
# On both computers
ray stop # should stop all processes :)
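To watch for the drop-off without re-running ray status by hand, a simple polling loop works (this is just a convenience sketch, not part of the original steps):

# On the head, print the cluster view every 30 seconds
while true; do
  date
  ray status
  sleep 30
done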

This semi-works (host computer to Docker container):

# Dockerfile
FROM nvcr.io/nvidia/pytorch:24.04-py3
WORKDIR /app
CMD ["bash"]

# Build and run the image
docker build -t my-python-cuda-app .
# I know the port forwarding is clunky and overkill, but I just wanted to be sure
docker run -it --gpus all --ipc=host \
  -p 8000:8000 -p 6379:6379 -p 10001:10001 -p 10003:10003 -p 10004:10004 \
  -p 10005:10005 -p 10006:10006 -p 10007:10007 -p 10008:10008 -p 10009:10009 \
  -p 10010:10010 -p 10011:10011 -p 10012:10012 -p 10013:10013 -p 10014:10014 \
  -p 33189:33189 -p 38065:38065 -p 44217:44217 -p 63051:63051 \
  --name my-python-gpu-container my-python-cuda-app
# Now in docker instance
pip install ray==2.20.0
ray start --head # should give different ip
# On Computer 2
ray start --address='192.168.250.20:6379' # Still use the host computer's ip
# Run ray status like above and see two nodes are connected and staying connected
# Now stop both ray instances and make Computer 2 the head and Docker the worker
# If you do ray status soon after adding the Docker worker, it will show two nodes.
# If you wait a bit, however, it will only show Computer 2's node - the head

What is happening in Docker that isn't happening on a "normal" computer? Is it putting the process to sleep? As a side note, when I stop the worker's Ray while it is still connected to the head, it usually stops 2 Ray processes. But when I stop the Docker worker's Ray after the head only sees one node, it reports stopping only one process.
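For anyone digging into what the container-side node is doing when it drops off, a rough diagnostic is the following (assuming the default Ray temp directory and that ss is available in the container):

# Inside the worker container, after the node disappears from ray status:
# check what the raylet logged around the time it lost contact with the head
tail -n 50 /tmp/ray/session_latest/logs/raylet.out
tail -n 50 /tmp/ray/session_latest/logs/raylet.err
# and confirm which ports the Ray processes are actually listening on
ss -tlnp | grep -E 'raylet|gcs|python'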

Issue Severity

None

ccruttjr added the bug and triage labels on May 10, 2024
anyscalesam added the core label on May 13, 2024
@rynewang
Contributor

I think it's the ports. By default, Ray selects some random ports to serve internal traffic, and those port numbers change every time you start, so you can't forward ports based on one run's results.

https://docs.ray.io/en/latest/ray-core/configure.html#ports-configurations

You can set fixed port numbers based on this doc and see if it works.
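A rough sketch of what that could look like for the head-in-Docker case (the specific port numbers below are only examples, not taken from this issue; the full list of configurable ports is in the doc above):

# Inside the head container: pin the ports Ray would otherwise choose at random
ray start --head \
  --port=6379 \
  --ray-client-server-port=10001 \
  --node-manager-port=10002 \
  --object-manager-port=10003 \
  --min-worker-port=10100 \
  --max-worker-port=10199

# On the host: publish exactly those ports when starting the container
docker run -it --gpus all --ipc=host \
  -p 6379:6379 -p 10001:10001 -p 10002:10002 -p 10003:10003 \
  -p 10100-10199:10100-10199 \
  --name my-python-gpu-container my-python-cuda-app

The dashboard, metrics, and runtime-env agent ports can be pinned and published the same way if they also need to be reachable, per the linked doc.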

rynewang added the P2 and @external-author-action-required labels and removed the triage label on May 20, 2024
@ccruttjr
Author

ccruttjr commented May 22, 2024

Hmm, but why does the worker connect initially and then stop? Wouldn't it just never connect? Anyway, I also tried --network=host and --publish-all to no avail, if that was supposed to fix something.

I also tried running this, which I believe is what you were referencing in the link above, but got the same results.

ray start --head --max-worker-port 10005 --node-manager-port 10006 --object-manager-port 10007 --runtime-env-agent-port 10008

edit: also using ray==2.22.0 now instead of 2.20.0
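Since the container is the worker in the failing case, the worker-side ports probably need to be pinned and published as well, not just the head's. A hypothetical worker-side start, with example port numbers and a placeholder for the head's address, might look like this:

# Inside the worker container: pin the worker node's ports so the head can reach them
# (<computer-2-ip> is a placeholder for the head's address; port numbers are examples)
ray start --address='<computer-2-ip>:6379' \
  --node-manager-port=10006 \
  --object-manager-port=10007 \
  --runtime-env-agent-port=10008 \
  --min-worker-port=10002 \
  --max-worker-port=10005 \
  --metrics-export-port=10009
# Each pinned port then needs a matching -p <port>:<port> when starting the container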

@rynewang
Contributor

docker run -it --gpus all --ipc=host -p 8000:8000 -p 6379:6379 -p 10001:10001 -p 10003:10003 -p 10004:10004 -p 10005:10005 -p 10006:10006 -p 10007:10007 -p 10008:10008 -p 10009:10009 -p 10010:10010 -p 10011:10011 -p 10012:10012 -p 10013:10013 -p 10014:10014 -p 33189:33189 -p 38065:38065 -p 44217:44217 -p 63051:63051 --name my-python-gpu-container my-python-cuda-app

It turns out you can't easily dockerize a Ray worker, because we have many different interconnection requirements. Can you try this and see if it works?

docker run -it -v /ray/tmp:/ray/tmp --gpus all --ipc=host --pid=host --network=host --userns=keep-id --env-file <(env) --name my-python-gpu-container my-python-cuda-app
