[Core] Incorrectly detected TPU on a HPU-only node. #45302

Open
woshiyyya opened this issue May 13, 2024 · 2 comments
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. P1 Issue that should be fixed within a few weeks

Comments

@woshiyyya
Member

woshiyyya commented May 13, 2024

What happened + What you expected to happen

image

The author of this PR runs a distributed training workload on an 8-HPU node; however, Ray reports an additional TPU in the cluster. This could be a bug in Ray Core's device detection.

Versions / Dependencies

nightly

Reproduction script

Issue Severity

Low: It annoys or frustrates me.

@woshiyyya woshiyyya added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) core Issues that should be addressed in Ray Core labels May 13, 2024
@rynewang
Contributor

@allenwang28 would you mind taking a look?

@rynewang rynewang added P1 Issue that should be fixed within a few weeks @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 20, 2024
@allenwang28
Contributor

Thanks for the tag! Does the HPU node have something listed at /dev/vfio or /dev/accel*?
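
One way to answer the question above is to list what exists at those two locations on the HPU node. The snippet below is a minimal sketch for gathering that information. It is not Ray's detection code, and the helper name `list_accelerator_device_paths` and its `dev_root` parameter are hypothetical, introduced only for illustration.

```python
import glob
import os


def list_accelerator_device_paths(dev_root="/dev"):
    """Return entries under <dev_root>/vfio and paths matching <dev_root>/accel*.

    Hypothetical helper for this thread; it only inspects the filesystem.
    """
    vfio_dir = os.path.join(dev_root, "vfio")
    try:
        # Entries inside the vfio directory, if it exists and is readable.
        vfio_entries = sorted(os.listdir(vfio_dir))
    except OSError:
        vfio_entries = []
    # Any device nodes or directories whose name starts with "accel".
    accel_paths = sorted(glob.glob(os.path.join(dev_root, "accel*")))
    return {"vfio": vfio_entries, "accel": accel_paths}


if __name__ == "__main__":
    found = list_accelerator_device_paths()
    print("/dev/vfio entries:", found["vfio"])
    print("/dev/accel* paths:", found["accel"])
```

Running this on the affected node and pasting the two printed lines into the thread should show whether anything is present at either path.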
