# Environment Repair Log After Docker Restart

**Time**: 2025-10-30 11:17
**Status**: ✅ Fixed

---

## Problem Description

After Docker was restarted, mmcv could no longer be loaded:
```
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
```

---

## Root Cause

The PyTorch library filenames that mmcv-full 1.4.0 expected at build time do not match the filenames actually shipped with PyTorch 1.10.1+cu102:

**mmcv expects**:
- libtorch_cuda_cu.so
- libtorch_cuda_cpp.so
- libtorch_cpu_cpp.so

**PyTorch actually provides**:
- libtorch_cuda.so
- libtorch_cpu.so
- libtorch.so

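The mismatch described above can be checked directly by asking `ldd` which dependencies of the compiled mmcv extension fail to resolve. The following is a minimal sketch: `missing_libs` is a hypothetical helper name, and the `_ext*.so` filename pattern in the commented usage line is an assumption about how mmcv-full names its compiled extension.

```bash
# missing_libs is a hypothetical helper: it reads `ldd` output on stdin
# and prints the name of every dependency that ldd reports as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example use (the _ext*.so pattern is an assumption; adjust it to your install):
# ldd "$(python -c 'import mmcv, os; print(os.path.dirname(mmcv.__file__))')"/_ext*.so | missing_libs
```

If the names printed match the "mmcv expects" list, the extension was most likely built against a PyTorch build with a different library layout.
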
---

## Solution

Create symbolic links to bridge the difference in library filenames:

```bash
cd /opt/conda/lib/python3.8/site-packages/torch/lib

ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so
```

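If the links need to be recreated more than once, the same three commands can be wrapped in a small function that is safe to re-run. This is only a sketch: `link_torch_libs` is a hypothetical helper name, and the check that refuses to overwrite a regular file is an added precaution that is not part of the original fix.

```bash
# link_torch_libs is a hypothetical helper, not part of PyTorch or mmcv.
# It recreates the three compatibility links inside the given torch/lib directory.
link_torch_libs() {
    local lib_dir="$1"
    local pair target link
    # Each entry is "existing_library:name_mmcv_looks_for"
    for pair in \
        "libtorch_cuda.so:libtorch_cuda_cu.so" \
        "libtorch_cuda.so:libtorch_cuda_cpp.so" \
        "libtorch_cpu.so:libtorch_cpu_cpp.so"; do
        target="${pair%%:*}"
        link="${pair##*:}"
        # Stop if the library we want to point at does not exist
        if [ ! -e "$lib_dir/$target" ]; then
            echo "missing $lib_dir/$target, cannot create $link" >&2
            return 1
        fi
        # Never overwrite a real library file; only create or refresh symlinks
        if [ -e "$lib_dir/$link" ] && [ ! -L "$lib_dir/$link" ]; then
            echo "$link is a regular file, leaving it alone" >&2
            continue
        fi
        # Relative link, so it stays valid if the directory is moved
        ln -sfn "$target" "$lib_dir/$link"
    done
}
```

With the path used in this note, the call would be `link_torch_libs /opt/conda/lib/python3.8/site-packages/torch/lib`.
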
---

## Verification

```bash
$ python -c "from mmcv.ops import nms_match; import mmcv; print('mmcv:', mmcv.__version__)"
✅ mmcv: 1.4.0

$ python -c "from mmdet3d.apis import train_model; print('Training environment ready')"
✅ Training environment ready
```

---

## Environment

```
PyTorch: 1.10.1+cu102
CUDA: 10.2
mmcv-full: 1.4.0
torchvision: 0.11.2
GPU: 8x Tesla V100S-PCIE-32GB
```

---

## Persistence

These symbolic links persist across Docker restarts because they live inside the conda environment.

If PyTorch or mmcv is reinstalled in the future, the links must be recreated.

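Because a reinstall silently removes the links, a quick check before launching training may be useful. A minimal sketch, assuming the three link names listed in this note; `check_torch_links` is a hypothetical helper name.

```bash
# check_torch_links is a hypothetical helper name.
# Prints each expected link that is missing or dangling; returns 1 if any is.
check_torch_links() {
    local lib_dir="$1"
    local name status=0
    for name in libtorch_cuda_cu.so libtorch_cuda_cpp.so libtorch_cpu_cpp.so; do
        # -e follows symlinks, so a dangling link also fails this test
        if [ ! -e "$lib_dir/$name" ]; then
            echo "missing or dangling: $lib_dir/$name"
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_torch_links /opt/conda/lib/python3.8/site-packages/torch/lib || echo "recreate the links first"` would flag the problem before an `ImportError` appears mid-run.
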
---

**Status**: ✅ Environment fully repaired; training can begin