Checklist
1. I have searched related issues but cannot get the expected help.
2. I have read the FAQ documentation but cannot get the expected help.
3. The bug has not been fixed in the latest version.
Describe the bug
I am using a new Slurm cluster at my university and am unable to get mmdeploy working on the H100 nodes in this cluster. The problem appears to be specific to TensorRT.
I initially tried following the instructions at https://mmdeploy.readthedocs.io/en/latest/get_started.html precisely, then tried a number of permutations:
- compiling the mm packages from source rather than installing them with pip/mim
- using different versions of TensorRT, cuDNN, ONNX, etc.
I have installed and used mmdeploy with my custom networks on other clusters/GPUs in several other environments, so I am not sure what is going on here.
Reproduction
I initially followed the instructions at https://mmdeploy.readthedocs.io/en/latest/get_started.html. Note that I am on a Slurm cluster at my university. Here are the specific commands I ran after getting a node:
module load cuda12.1/toolkit/12.1.1
module load gcc12/12.2.0
conda create --name rtm python=3.8 -y
conda activate rtm
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -U openmim
mim install mmengine
mim install "mmcv>=2.0.0rc2"
# 1. install MMDeploy model converter
pip install mmdeploy==1.3.1
# 2. install the MMDeploy SDK for inference
# install one of them according to whether you need GPU inference
# 2.1 support onnxruntime
pip install mmdeploy-runtime==1.3.1
# 2.2 support onnxruntime-gpu, tensorrt
pip install mmdeploy-runtime-gpu==1.3.1
# 3. install inference engine
# 3.1 install TensorRT
# !!! If you want to convert a model to TensorRT or run inference with TensorRT,
# download the TensorRT-8.2.3.0 CUDA 11.x tar package from NVIDIA and extract it to the current directory
pip install TensorRT-8.2.3.0/python/tensorrt-8.2.3.0-cp38-none-linux_x86_64.whl
pip install pycuda
export TENSORRT_DIR=$(pwd)/TensorRT-8.2.3.0
export LD_LIBRARY_PATH=${TENSORRT_DIR}/lib:$LD_LIBRARY_PATH
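# (added check) confirm which TensorRT python package actually gets imported;
# 8.2.3 predates the H100's SM90, which is what the engine-build errors below complain about
python -c "import tensorrt; print(tensorrt.__version__)"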
# !!! Moreover, download the cuDNN 8.2.1 CUDA 11.x tar package from NVIDIA and extract it to the current directory
export CUDNN_DIR=$(pwd)/cuda
export LD_LIBRARY_PATH=$CUDNN_DIR/lib64:$LD_LIBRARY_PATH
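# (added check) make sure the cuDNN tarball really extracted where the env vars point
ls ${CUDNN_DIR}/lib64/libcudnn.so*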
# 3.2 install ONNX Runtime
# install one of them according to whether you need GPU inference
# 3.2.1 onnxruntime
wget https://github.com/microsoft/onnxruntime/releases/download/v1.8.1/onnxruntime-linux-x64-1.8.1.tgz
tar -zxvf onnxruntime-linux-x64-1.8.1.tgz
export ONNXRUNTIME_DIR=$(pwd)/onnxruntime-linux-x64-1.8.1
export LD_LIBRARY_PATH=$ONNXRUNTIME_DIR/lib:$LD_LIBRARY_PATH
# 3.2.2 onnxruntime-gpu
pip install onnxruntime-gpu==1.8.1
wget https://github.com/microsoft/onnxruntime/releases/download/v1.8.1/onnxruntime-linux-x64-gpu-1.8.1.tgz
tar -zxvf onnxruntime-linux-x64-gpu-1.8.1.tgz
export ONNXRUNTIME_DIR=$(pwd)/onnxruntime-linux-x64-gpu-1.8.1
export LD_LIBRARY_PATH=$ONNXRUNTIME_DIR/lib:$LD_LIBRARY_PATH
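# (added check) confirm the onnxruntime build and its providers load at all;
# onnxruntime-gpu 1.8.1 was built for CUDA 11.x, so this import is exactly where CUDA 12 setups fail
python -c "import onnxruntime as ort; print(ort.__version__, ort.get_available_providers())"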
cd mmdetection
mim install -v -e .
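If I then try to run the model conversion (the exact command is elided here; a representative tools/deploy.py invocation, with placeholder config and checkpoint paths rather than the ones I actually used, looks like the following):
python mmdeploy/tools/deploy.py \
    configs/mmdet/detection/detection_tensorrt_dynamic-320x320-1344x1344.py \
    path/to/my_model_config.py path/to/my_checkpoint.pth demo/demo.jpg \
    --work-dir work_dir --device cuda:0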
File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/site-packages/onnxruntime/capi/_ld_preload.py", line 12, in <module>
_libcudart = CDLL("libcudart.so.11.0", mode=RTLD_GLOBAL)
File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.11.0: cannot open shared object file: No such file or directory
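In hindsight this failure is self-explanatory: onnxruntime-gpu 1.8.1 was built against CUDA 11 and dlopens libcudart.so.11.0 at import time, but with only the cuda12.1 module loaded there is no CUDA 11 runtime anywhere on the library path. A sketch of how to check what the loader can actually see (paths depend on the cluster's module system):
echo $LD_LIBRARY_PATH | tr ':' '\n' | while read d; do ls "$d"/libcudart.so* 2>/dev/null; done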
I can switch to CUDA 11 and try again:
module load cuda11.8/toolkit/11.8.0
This time it runs up until:
[05/05/2024-16:16:18] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +7, GPU +258, now: CPU 761, GPU 1003 (MiB)
[05/05/2024-16:16:18] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +6, GPU +264, now: CPU 767, GPU 1267 (MiB)
[05/05/2024-16:16:18] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[05/05/2024-16:16:20] [TRT] [E] 1: [caskUtils.cpp::trtSmToCask::147] Error Code 1: Internal Error (Unsupported SM: 0x900)
[05/05/2024-16:16:20] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Process Process-3:
Traceback (most recent call last):
  File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/site-packages/mmdeploy/apis/core/pipeline_manager.py", line 107, in __call__
    ret = func(*args, **kwargs)
  File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/site-packages/mmdeploy/apis/utils/utils.py", line 98, in to_backend
    return backend_mgr.to_backend(
  File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/site-packages/mmdeploy/backend/tensorrt/backend_manager.py", line 127, in to_backend
    onnx2tensorrt(
  File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/site-packages/mmdeploy/backend/tensorrt/onnx2tensorrt.py", line 79, in onnx2tensorrt
    from_onnx(
  File "/home/tis697/miniconda3/envs/rtm/lib/python3.8/site-packages/mmdeploy/backend/tensorrt/utils.py", line 248, in from_onnx
    assert engine is not None, 'Failed to create TensorRT engine'
AssertionError: Failed to create TensorRT engine
05/05 16:16:21 - mmengine - ERROR - /home/tis697/miniconda3/envs/rtm/lib/python3.8/site-packages/mmdeploy/apis/core/pipeline_manager.py - pop_mp_output - 80 - `mmdeploy.apis.utils.utils.to_backend` with Call id: 1 failed. exit.
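The "Unsupported SM: 0x900" error appears to be the key line here: 0x900 is compute capability 9.0, i.e. the Hopper architecture of the H100, and TensorRT 8.2.3 predates Hopper, so its kernel lookup (trtSmToCask) has nothing for that SM and the engine build aborts. A quick way to confirm what the node reports (the compute_cap query field needs a reasonably recent nvidia-smi; the torch one-liner assumes the conda env from above):
nvidia-smi --query-gpu=name,compute_cap --format=csv
python -c "import torch; print(torch.cuda.get_device_capability())"
On an H100 these report 9.0 / (9, 0); NVIDIA's support matrix only lists SM 9.0 for newer TensorRT releases (8.5/8.6 and later), so the 8.2.3 tarball from the get_started instructions cannot target this GPU.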
If I use a different environment that was built with CUDA 11.8 instead of CUDA 12 from the start, I still get this error:
[05/05/2024-17:04:32] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 111.1.3
[05/05/2024-17:04:32] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +258, now: CPU 748, GPU 1003 (MiB)
[05/05/2024-17:04:32] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +264, now: CPU 750, GPU 1267 (MiB)
[05/05/2024-17:04:32] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[05/05/2024-17:04:34] [TRT] [E] 1: [caskUtils.cpp::trtSmToCask::147] Error Code 1: Internal Error (Unsupported SM: 0x900)
[05/05/2024-17:04:34] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Process Process-3:
Traceback (most recent call last):
  File "/home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/apis/core/pipeline_manager.py", line 107, in __call__
    ret = func(*args, **kwargs)
  File "/home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/apis/utils/utils.py", line 98, in to_backend
    return backend_mgr.to_backend(
  File "/home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/backend/tensorrt/backend_manager.py", line 127, in to_backend
    onnx2tensorrt(
  File "/home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/backend/tensorrt/onnx2tensorrt.py", line 79, in onnx2tensorrt
    from_onnx(
  File "/home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/backend/tensorrt/utils.py", line 248, in from_onnx
    assert engine is not None, 'Failed to create TensorRT engine'
AssertionError: Failed to create TensorRT engine
05/05 17:04:35 - mmengine - ERROR - /home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/apis/core/pipeline_manager.py - pop_mp_output - 80 - `mmdeploy.apis.utils.utils.to_backend` with Call id: 1 failed. exit.
If I instead install TensorRT via PyPI, I get a little farther: the engine now builds, but the run fails at the "visualize tensorrt model" step.
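The elided install step was presumably the plain PyPI package; a hypothetical reconstruction (pip's tensorrt package pulls a much newer build than the 8.2.3 tarball above, which would explain why the SM error goes away, and the cuDNN 8.9 references in the log below are consistent with that):
pip install tensorrt
The engine build then completes and the conversion pipeline finishes: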
[05/05/2024-17:17:47] [TRT] [I] Total Activation Memory: 415622144
[05/05/2024-17:17:47] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 3456, GPU 1851 (MiB)
[05/05/2024-17:17:47] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +264, now: CPU 3456, GPU 2115 (MiB)
[05/05/2024-17:17:47] [TRT] [W] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.7.0
[05/05/2024-17:17:47] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +160, now: CPU 0,GPU 160 (MiB)
05/05 17:17:49 - mmengine - INFO - Finish pipeline mmdeploy.apis.utils.utils.to_backend
05/05 17:17:50 - mmengine - INFO - visualize tensorrt model start.
05/05 17:17:54 - mmengine - WARNING - Failed to search registry with scope "mmdet" in the "Codebases" registry tree. As a workaround, the current "Codebases" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmdet" is a correct scope, or whether the registry is initialized.
05/05 17:17:54 - mmengine - WARNING - Failed to search registry with scope "mmdet" in the "mmdet_tasks" registry tree. As a workaround, the current "mmdet_tasks" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmdet" is a correct scope, or whether the registry is initialized.
05/05 17:17:54 - mmengine - WARNING - Failed to search registry with scope "mmdet" in the "backend_detectors" registry tree. As a workaround, the current "backend_detectors" registry in "mmdeploy" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmdet" is a correct scope, or whether the registry is initialized.
05/05 17:17:54 - mmengine - INFO - Successfully loaded tensorrt plugins from /home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/lib/libmmdeploy_tensorrt_ops.so
05/05 17:17:54 - mmengine - INFO - Successfully loaded tensorrt plugins from /home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/lib/libmmdeploy_tensorrt_ops.so
[05/05/2024-17:17:55] [TRT] [W] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.7.0
[05/05/2024-17:17:55] [TRT] [W] TensorRT was linked against cuDNN 8.9.0 but loaded cuDNN 8.7.0
#assertion/__w/mmdeploy/mmdeploy/csrc/mmdeploy/backend_ops/tensorrt/batched_nms/trt_batched_nms.cpp,103
05/05 17:17:56 - mmengine - ERROR - mmdeploy/tools/deploy.py - create_process - 82 - visualize tensorrt model failed.
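The assertion at trt_batched_nms.cpp:103 is raised inside mmdeploy's prebuilt TensorRT plugin library (libmmdeploy_tensorrt_ops.so, loaded twice in the log above). The prebuilt mmdeploy-runtime-gpu==1.3.1 wheel ships plugins compiled against a specific TensorRT 8.x, so loading them against the different TensorRT that pip installed is a plausible mismatch; the "linked against cuDNN 8.9.0 but loaded cuDNN 8.7.0" warnings point the same way. Two hedged checks (the .so path is taken from the log):
ldd /home/tis697/miniconda3/envs/mmdeploy/lib/python3.8/site-packages/mmdeploy/lib/libmmdeploy_tensorrt_ops.so | grep -i nvinfer
python -c "import tensorrt; print(tensorrt.__version__)"
If the libnvinfer sonames that ldd wants differ from the version python reports, rebuilding the mmdeploy plugins from source against the installed TensorRT would be the usual fix.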
I can then try to use the models I generated:
Environment
Error traceback
No response