# Environment Issue Log - Phase 4A Startup Failure

**Date**: 2025-10-30
**Problem**: Phase 4A training fails to start
**Error**: ImportError: libtorch_cuda_cu.so: cannot open shared object file

---

## Error Details

### Full Error Message
```
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory

  File "/opt/conda/lib/python3.8/site-packages/mmcv/ops/assign_score_withk.py", line 5, in <module>
    ext_module = ext_loader.load_ext(
```

### Attempted Solutions

1. ❌ Setting LD_LIBRARY_PATH
```bash
export LD_LIBRARY_PATH=/opt/conda/lib:/opt/conda/lib/python3.8/site-packages/torch/lib:$LD_LIBRARY_PATH
```

2. ❌ Using full paths
```bash
/opt/conda/bin/torchpack dist-run -np 6 /opt/conda/bin/python tools/train.py
```

3. ❌ Using torch.distributed.launch
```bash
python -m torch.distributed.launch --nproc_per_node=6
```

---

## Comparison: Working Phase 3 Environment

**Phase 3 training (successful)**:
- Period: 2025-10-21 to 2025-10-29
- Command: same format as Phase 4A
- Result: ran stably for 8 days and completed 23 epochs
- Environment: the same Docker container

**Open questions**:
- What changed in the environment between Phase 3 and Phase 4A?
- Was there a system update or a Docker restart?
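
One way to approach the first question, assuming the filesystem timestamps inside the container can be trusted, is to list the files under the PyTorch and mmcv install directories that were modified after Phase 3 started. The sketch below was not part of the original investigation: `changed_since` is a hypothetical helper, the cutoff is the Phase 3 start date given above, and the paths in the usage comment assume the conda layout shown in the traceback. It relies on the `-newermt` test of GNU `find`, and it cannot show files that were deleted.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list regular files under a directory that were
# modified after a given date (uses GNU find's -newermt test).
changed_since() {
    local dir="$1"    # directory to scan
    local since="$2"  # cutoff date, e.g. "2025-10-21"
    find "$dir" -type f -newermt "$since" 2>/dev/null | sort
}

# Example (assumed paths; adjust to the actual install location):
#   changed_since /opt/conda/lib/python3.8/site-packages/torch/lib 2025-10-21
#   changed_since /opt/conda/lib/python3.8/site-packages/mmcv 2025-10-21
```
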

---

## Diagnostic Suggestions

### Immediate Checks
```bash
# 1. Check the Docker container status
docker ps

# 2. Check whether Docker needs to be restarted
# (if the system or Docker was restarted earlier)

# 3. Check the Python environment
which python
python --version
python -c "import torch; print(torch.__version__)"

# 4. Check the library files
find /opt/conda -name "libtorch_cuda_cu.so"
ls -l /opt/conda/pkgs/pytorch-*/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cu.so

# 5. Test a simple import
python -c "from mmcv.ops import nms_match"
```
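
To see which libraries the dynamic loader actually fails to resolve, the compiled mmcv extension can also be inspected with `ldd`. The following is a minimal sketch rather than one of the original checks: `filter_missing` and `missing_libs` are hypothetical helper names, and the path in the usage comment assumes the compiled module is installed as `_ext*.so` inside the `mmcv` package directory, which may differ in this container. Because such extensions usually rely on `import torch` having loaded the PyTorch libraries first, the check is more meaningful with `torch/lib` on `LD_LIBRARY_PATH`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for inspecting unresolved shared-library dependencies.

# Read `ldd` output on stdin and print the libraries marked "not found".
filter_missing() {
    awk '/not found/ {print $1}'
}

# Print the unresolved dependencies of the shared object given as $1.
missing_libs() {
    ldd "$1" 2>/dev/null | filter_missing
}

# Example (assumed paths; adjust to the actual location of the extension):
#   export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:$LD_LIBRARY_PATH
#   for so in /opt/conda/lib/python3.8/site-packages/mmcv/_ext*.so; do
#       missing_libs "$so"
#   done
```

If `libtorch_cuda_cu.so` is still listed with `torch/lib` on the path, and `find` does not locate it anywhere under `/opt/conda`, the extension was probably built against a different PyTorch build than the one currently installed. In that case adjusting `LD_LIBRARY_PATH` alone would not be expected to help.
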

### Possible Solutions

**Option 1: Restart the Docker container**
```bash
# Exit the container
exit

# Re-enter it
docker start [CONTAINER_ID]
docker exec -it [CONTAINER_ID] /bin/bash
cd /workspace/bevfusion

# Retry
bash start_phase4a_bev2x_fixed.sh
```

**Option 2: Rebuild mmcv (last resort)**
```bash
# This can take a while (1-2 hours)
pip uninstall mmcv-full -y
pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html
```
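
The `cu113/torch1.10.0` segment of the index URL should match the CUDA and PyTorch versions actually installed in the container; otherwise pip may fetch a binary built against different libraries or fall back to a source build. The sketch below derives the URL from the values reported by `torch.version.cuda` and `torch.__version__`. The helper name `mmcv_index_url` is hypothetical, and the URL pattern is only extrapolated from the single URL above, so the result should be verified before it is used.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build an mmcv wheel index URL from version strings.
# The URL pattern is extrapolated from the URL used above, not taken from
# official documentation.
mmcv_index_url() {
    local cuda="$1"         # e.g. "11.3", as printed by torch.version.cuda
    local torch="${2%%+*}"  # e.g. "1.10.0+cu113" -> "1.10.0"
    echo "https://download.openmmlab.com/mmcv/dist/cu${cuda//./}/torch${torch}/index.html"
}

# Example: query the installed PyTorch, then print the matching index URL.
#   cuda=$(python -c "import torch; print(torch.version.cuda)")
#   torch=$(python -c "import torch; print(torch.__version__)")
#   mmcv_index_url "$cuda" "$torch"
```

For the versions implied by the command above (CUDA 11.3, PyTorch 1.10.0), the helper reproduces the same URL.
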

**Option 3: Use a different launcher**
```bash
# If torchpack is the problem, use PyTorch DDP directly
python -m torch.distributed.run --nproc_per_node=6 --master_port=29500 \
  tools/train.py configs/.../multitask_BEV2X_phase4a.yaml \
  --launcher pytorch \
  --load_from runs/enhanced_from_epoch19/epoch_23.pth
```

---

## Temporary Measures

### If the Problem Cannot Be Resolved Immediately

Other work can proceed in the meantime:
1. Analyze the Phase 3 results in detail
2. Prepare the real-vehicle data collection plan
3. Investigate the MapTR integration approach
4. Design the model compression approach

### Fallback Options

If BEV 2x cannot be started at all, consider:
1. BEV 1.5x (0.2 m resolution), which cuts computation by 50%
2. Continuing Phase 3 training for more epochs
3. Moving directly to Phase 4B model compression

---

**Status**: Open
**Priority**: P0 (highest)
**Impact**: Blocks Phase 4A training
**Recommendation**: Resolve the environment issue first, since all subsequent work depends on it