xpu: python hangs on exit after check for xpu on multi-dev system #126259

Open
dvrogozh opened this issue May 15, 2024 · 2 comments
Labels
module: xpu Intel XPU related issues triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@dvrogozh
Contributor

dvrogozh commented May 15, 2024

As of fd48fb9, checking for the xpu backend on a multi-device system (2x Intel GPU ATS-M150 cards) causes a hang on exit (see the stack at the end):

$ python3 -c 'import torch; print(torch.xpu.is_available())'
True
^C
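
To detect the hang without pressing Ctrl+C by hand, the one-liner above could be run in a child process with a timeout. This is only a minimal sketch: the helper name `exits_within` and the timeout value are illustrative assumptions, not part of PyTorch.

```python
import subprocess
import sys

# The check from the report, to be run in a fresh interpreter.
XPU_CHECK = "import torch; print(torch.xpu.is_available())"


def exits_within(code: str, timeout_s: float) -> bool:
    """Run `code` in a child Python process; return False if it fails to exit in time.

    Only the exit timing is checked, not the child's return code.
    """
    try:
        subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no process is left behind.
        return False
    return True
```

Calling `exits_within(XPU_CHECK, 30.0)` on the affected system would be expected to return False if the hang reproduces; 30 seconds is an arbitrary allowance for importing torch and probing the devices.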

I caught this issue under the following environment:

Note that I believe this is due to the 2x Intel GPUs on the system because:

  • I did not see such an issue on another system which had just a single ATS-M150
  • There is no issue on the system where I reproduced it if I hack around it by making the /dev/dri/renderD129 node inaccessible, i.e.
$ sudo chmod a-r /dev/dri/renderD129
$ sudo chmod a-w /dev/dri/renderD129
$ python3 -c 'import torch; print(torch.xpu.is_available())'
True
$     <<<< no hang on exit
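
To confirm which render nodes the current user can open before and after the chmod workaround, something like the following could be used. It is a sketch under assumptions: `render_node_access` is a hypothetical helper, and it assumes the DRM render nodes live under /dev/dri with names starting with `renderD`, as in the commands above.

```python
import os
from pathlib import Path


def render_node_access(dri_dir: str = "/dev/dri") -> dict[str, bool]:
    """Map each renderD* node under `dri_dir` to whether it is readable and writable."""
    root = Path(dri_dir)
    if not root.is_dir():
        return {}
    return {
        node.name: os.access(node, os.R_OK | os.W_OK)
        for node in sorted(root.glob("renderD*"))
    }


if __name__ == "__main__":
    for name, ok in render_node_access().items():
        print(f"{name}: {'accessible' if ok else 'no access'}")
```

Note that `os.access` reports True for root regardless of the mode bits, so this should be run as the same unprivileged user that runs the python3 check.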

Retrying under gdb to get a stack:

(gdb) thread apply all bt
Thread 1 (Thread 0x7ffff7c50740 (LWP 769138) "pt_main_thread"):
#0  __GI___ioctl (fd=3, request=3224396954) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00007ffeec6762f7 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#2  0x00007ffeec6c2401 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#3  0x00007ffeec67cb76 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#4  0x00007ffeec66d841 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#5  0x00007ffeec61cf68 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#6  0x00007ffeec2c9fa1 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#7  0x00007ffeec2c9564 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#8  0x00007ffeec2ca452 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#9  0x00007ffeec2caa0d in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#10 0x00007ffeec22ee97 in ?? () from /lib/x86_64-linux-gnu/libze_intel_gpu.so.1
#11 0x00007ffff7fc924e in _dl_fini () at ./elf/dl-fini.c:142
#12 0x00007ffff7c98495 in __run_exit_handlers (status=0, listp=0x7ffff7e6d838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#13 0x00007ffff7c98610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#14 0x00007ffff7c7cd97 in __libc_start_call_main (main=main@entry=0x55555577bff0, argc=argc@entry=3, argv=argv@entry=0x7fffffffcc18) at ../sysdeps/nptl/libc_start_call_main.h:74
#15 0x00007ffff7c7ce40 in __libc_start_main_impl (main=0x55555577bff0, argc=3, argv=0x7fffffffcc18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffcc08) at ../csu/libc-start.c:392
#16 0x000055555577bf25 in _start ()
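
The ioctl request number in frame #0 can be decoded to see which kernel interface the process is blocked in. The sketch below applies the generic Linux `_IOC` bit layout used on x86-64 (8-bit command number, 8-bit type, 14-bit size, 2-bit direction); `decode_ioctl` is an illustrative helper, not an existing API.

```python
def decode_ioctl(request: int) -> dict[str, int]:
    """Split a Linux ioctl request number into its _IOC fields (generic layout)."""
    return {
        "nr": request & 0xFF,              # command number
        "type": (request >> 8) & 0xFF,     # subsystem "magic" byte
        "size": (request >> 16) & 0x3FFF,  # argument struct size in bytes
        "dir": (request >> 30) & 0x3,      # 1 = write, 2 = read, 3 = read/write
    }


fields = decode_ioctl(3224396954)  # value from frame #0, i.e. 0xC030649A
print({k: hex(v) for k, v in fields.items()})
# → {'nr': '0x9a', 'type': '0x64', 'size': '0x30', 'dir': '0x3'}
```

The type byte 0x64 is ASCII 'd', the base used by DRM ioctls, so the process appears to be blocked in a DRM call on fd 3 made from a libze_intel_gpu.so.1 exit handler. Which driver-specific command 0x9a corresponds to is not determined here.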

cc @gujinghui @EikanWang @fengyuan14 @guangyey

@mikaylagawarecki mikaylagawarecki added module: xpu Intel XPU related issues triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels May 15, 2024
@dvrogozh
Contributor Author

I should note that I don't see this issue on the other system, which has 1x ATS-M150 and 1x ATS-M75. It would be nice to try to reproduce this issue on a clean system with 2x ATS-M150 cards to rule out a system-specific issue.

@EikanWang
Collaborator

ATS-M is not supported yet, but we need to triage this anyway. @guangyey
