
[Core] Possible memory leak in gcs_server #45338

Open
ScottShingler opened this issue May 14, 2024 · 5 comments
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P1 (Issue that should be fixed within a few weeks)

Comments

@ScottShingler

What happened + What you expected to happen

Running a Ray Core driver that repeatedly invokes a method on an actor results in a continuous overall increase in the peak RSS memory usage of the ray/core/src/ray/gcs/gcs_server process:

[Figure: plot of gcs_server peak RSS memory over time]

Over the course of ~112 hours, the peak values increased from ~595 MB to ~662 MB, which works out to roughly 598 KB per hour.

This plot and the data used to create it can be downloaded here.

Versions / Dependencies

ray 2.20.0

Reproduction script

I have created a repo that contains the reproduction script, as well as scripts to collect and plot the data: https://github.com/Prolucid/ray-memleak-repro.
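For readers who don't want to open the repo, a minimal sketch of this kind of driver might look like the following; the actor name, method, and loop timing are illustrative assumptions, not the actual reproduction script:

```python
# Hypothetical minimal reproduction sketch; the real script lives in the linked repo.
import time

import ray


@ray.remote
class Echo:
    def ping(self, i: int) -> int:
        return i


def main() -> None:
    ray.init()
    actor = Echo.remote()
    i = 0
    while True:
        # Repeatedly invoke an actor method; the report above observes the
        # gcs_server peak RSS creeping upward over many hours of such a loop.
        ray.get(actor.ping.remote(i))
        i += 1
        time.sleep(0.01)


if __name__ == "__main__":
    main()
```

The linked repo also includes the scripts used to collect and plot the gcs_server RSS data; the sketch above only covers the driver side.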

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@ScottShingler ScottShingler added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 14, 2024
@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label May 20, 2024
@rynewang rynewang added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 20, 2024
@rynewang rynewang self-assigned this May 20, 2024
@lmsh7

lmsh7 commented Jun 3, 2024

In my own scenario using uvicorn + FastAPI + a single-node Ray cluster, a similar situation occurs: after high-QPS load testing, the GCS server's memory did not fall back down the way uvicorn's did.
[screenshot: memory usage after load test]

@lmsh7

lmsh7 commented Jun 3, 2024


However, I found a possible solution in a related issue: export RAY_task_events_max_num_task_in_gcs=100 can significantly reduce the memory usage of the GCS.
[screenshot: GCS memory usage after setting the variable]
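For reference, a minimal sketch of applying that setting from a driver script, assuming the gcs_server spawned by a locally started cluster inherits the driver's environment, might look like this:

```python
import os

# Set the variable before the cluster starts, so the gcs_server process
# spawned by ray.init() inherits it. The value 100 is the one suggested above.
os.environ["RAY_task_events_max_num_task_in_gcs"] = "100"

import ray

ray.init()
```

On a cluster launched with ray start, the variable would instead be exported in the shell that starts the head node (as in the export command above), since that is the process tree the GCS server belongs to.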

@lmsh7
Copy link

lmsh7 commented Jun 3, 2024

It is also similar to issue #43253.

@rynewang
Contributor

rynewang commented Jun 3, 2024

Per #43253, can you retry with export RAY_task_events_max_num_task_in_gcs=100?

@ScottShingler
Author

I'll set up a test to run over the weekend and report back.
