# Environment Fix Record After Docker Restart
**Time**: 2025-10-30 11:17
**Status**: ✅ Fixed
---
## Problem Description
After a Docker restart, mmcv fails to load:
```
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
```
---
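## Root Cause
The mmcv-full 1.4.0 binary was built expecting PyTorch shared-library file names that differ from the ones PyTorch 1.10.1+cu102 actually ships.
**Expected by mmcv**:
- libtorch_cuda_cu.so
- libtorch_cuda_cpp.so
- libtorch_cpu_cpp.so
**Actually provided by PyTorch**: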
- libtorch_cuda.so
- libtorch_cpu.so
- libtorch.so
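
The mismatch listed above can be confirmed before changing anything. The following is a minimal sketch, not part of the original fix: the helper name `find_missing_libs` is hypothetical, and the directory is assumed to be the `torch/lib` path used in this record. It reports which of the expected names cannot be resolved.

```python
from pathlib import Path

# Library names mmcv-full expects, taken from the list above.
EXPECTED_BY_MMCV = [
    "libtorch_cuda_cu.so",
    "libtorch_cuda_cpp.so",
    "libtorch_cpu_cpp.so",
]


def find_missing_libs(lib_dir, expected=EXPECTED_BY_MMCV):
    """Return the expected file names that cannot be resolved in lib_dir."""
    lib_dir = Path(lib_dir)
    # Path.exists() follows symlinks, so a dangling link also counts as missing.
    return [name for name in expected if not (lib_dir / name).exists()]


if __name__ == "__main__":
    # Assumed install location, copied from this record; adjust if yours differs.
    print(find_missing_libs("/opt/conda/lib/python3.8/site-packages/torch/lib"))
```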
---
## 解决方案
Create symbolic links to bridge the difference in library file names:
```bash
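# Directory that holds PyTorch's shared libraries
cd /opt/conda/lib/python3.8/site-packages/torch/lib
# Alias each name mmcv expects to the library PyTorch actually ships
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so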
```
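
The same linking step can be scripted so it is easy to rerun. This is a sketch under stated assumptions rather than the procedure actually used here: `create_alias_links` and `ALIASES` are hypothetical names, and unlike `ln -sf` the function skips an alias whose target is absent and leaves a real (non-symlink) file untouched.

```python
import os
from pathlib import Path

# Alias name -> real library, mirroring the three ln -sf commands above.
ALIASES = {
    "libtorch_cuda_cu.so": "libtorch_cuda.so",
    "libtorch_cuda_cpp.so": "libtorch_cuda.so",
    "libtorch_cpu_cpp.so": "libtorch_cpu.so",
}


def create_alias_links(lib_dir, aliases=ALIASES):
    """Create relative symlinks in lib_dir and return the aliases that were linked."""
    lib_dir = Path(lib_dir)
    created = []
    for alias, target in aliases.items():
        if not (lib_dir / target).exists():
            # Skip instead of creating a dangling link (stricter than ln -sf).
            continue
        link = lib_dir / alias
        if link.is_symlink():
            link.unlink()  # replace an existing link, as ln -sf would
        elif link.exists():
            continue  # a real file already has this name; leave it alone
        os.symlink(target, link)  # relative target, resolved inside lib_dir
        created.append(alias)
    return created
```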
---
## Verification Results
```bash
$ python -c "from mmcv.ops import nms_match; import mmcv; print('mmcv:', mmcv.__version__)"
✅ mmcv: 1.4.0
$ python -c "from mmdet3d.apis import train_model; print('Training environment ready')"
✅ Training environment ready
```
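
The two one-liners above can also be folded into a single check that reports the failure reason instead of stopping at the first error. A minimal sketch, assuming the module names `mmcv.ops` and `mmdet3d.apis` from the commands above; `check_imports` is a hypothetical helper, not an mmcv or mmdet3d API.

```python
import importlib


def check_imports(module_names):
    """Try each import; map module name -> None on success, or the error text."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # e.g. ImportError raised by a failed .so load
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    for name, error in check_imports(["mmcv.ops", "mmdet3d.apis"]).items():
        print(name, "OK" if error is None else error)
```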
---
## Environment Configuration
```
PyTorch: 1.10.1+cu102
CUDA: 10.2
mmcv-full: 1.4.0
torchvision: 0.11.2
GPU: 8x Tesla V100S-PCIE-32GB
```
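
To compare a container against the package versions recorded above, they can be read from the installed package metadata. This is a sketch only: `collect_versions` is a hypothetical helper, the distribution names `torch`, `torchvision`, and `mmcv-full` are assumed to match how the packages were installed, and the CUDA version and GPU model are not covered.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    results = {}
    for name in packages:
        try:
            results[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results


if __name__ == "__main__":
    versions = collect_versions(["torch", "torchvision", "mmcv-full"])
    for name, version in versions.items():
        print(f"{name}: {version or 'not installed'}")
```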
---
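## Persistence
These symbolic links persist across Docker restarts because they live inside the conda environment.
If PyTorch or mmcv is reinstalled in the future, the links must be recreated.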
---
**Status**: ✅ Environment fully repaired; training can begin