# Environment Issue Record - Phase 4A Launch Failure
**Date**: 2025-10-30
**Problem**: Phase 4A training fails to start
**Error**: ImportError: libtorch_cuda_cu.so: cannot open shared object file

---
## Error Details
### Full Error Message
```
ImportError: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
File "/opt/conda/lib/python3.8/site-packages/mmcv/ops/assign_score_withk.py", line 5, in <module>
ext_module = ext_loader.load_ext(
```
### Attempted Fixes
1. ❌ Set LD_LIBRARY_PATH
```bash
export LD_LIBRARY_PATH=/opt/conda/lib:/opt/conda/lib/python3.8/site-packages/torch/lib:$LD_LIBRARY_PATH
```
2. ❌ Use full paths to the executables
```bash
/opt/conda/bin/torchpack dist-run -np 6 /opt/conda/bin/python tools/train.py
```
3. ❌ Use torch.distributed.launch
```bash
python -m torch.distributed.launch --nproc_per_node=6
```
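
Setting `LD_LIBRARY_PATH` can only help if one of the listed directories actually contains `libtorch_cuda_cu.so`. The following is a minimal sketch for checking that; the function name `dirs_with_lib` is hypothetical and not part of the project. If it prints nothing for the path used in attempt 1, the file is probably absent from the installed torch build rather than merely hidden from the loader.

```bash
# Print every directory in a colon-separated search path that contains the given library.
# Usage: dirs_with_lib LIBNAME [SEARCH_PATH]   (SEARCH_PATH defaults to $LD_LIBRARY_PATH)
dirs_with_lib() {
    local lib="$1" search_path="${2-${LD_LIBRARY_PATH:-}}"
    local -a dirs
    local dir
    # Split the search path on ':' into an array.
    IFS=':' read -r -a dirs <<< "$search_path"
    for dir in "${dirs[@]}"; do
        # Skip empty entries (for example from a doubled or trailing colon).
        [ -n "$dir" ] || continue
        [ -e "$dir/$lib" ] && printf '%s\n' "$dir"
    done
    return 0
}
# Example (inside the container):
#   dirs_with_lib libtorch_cuda_cu.so
```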
---
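## Comparison: Phase 3 Working Environment
**Phase 3 training (succeeded)**:
- Dates: 2025-10-21 to 2025-10-29
- Command: same format as Phase 4A
- Result: ran stably for 8 days, 23 epochs completed
- Environment: the same Docker container
**Open questions**:
- What changed in the environment between Phase 3 and Phase 4A?
- Was there a system update or a Docker restart?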
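
One way to approach the questions above is to list which packages were touched after Phase 3 finished. This is a rough sketch, assuming GNU `find` and that file modification times inside the container reflect when packages were installed or upgraded; `changed_since` is a hypothetical helper name.

```bash
# List top-level entries in DIR that were modified after DATE.
# DATE can be any format accepted by GNU find's -newermt, e.g. "2025-10-29".
changed_since() {
    local dir="$1" date="$2"
    find "$dir" -mindepth 1 -maxdepth 1 -newermt "$date" | sort
}
# Example (inside the container), using the site-packages path from the traceback:
#   changed_since /opt/conda/lib/python3.8/site-packages "2025-10-29"
```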
---
## Diagnostic Suggestions
### Immediate Checks
```bash
# 1. Check Docker container status
docker ps
# 2. Check whether Docker needs a restart
# (if the system or Docker was restarted earlier)
# 3. Check the Python environment
which python
python --version
python -c "import torch; print(torch.__version__)"
# 4. Check the library file
find /opt/conda -name "libtorch_cuda_cu.so"
ls -l /opt/conda/pkgs/pytorch-*/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cu.so
# 5. Test a simple import
python -c "from mmcv.ops import nms_match"
```
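
The checks above show whether the file exists, but not which binary is asking for it. A small sketch that filters `ldd` output down to the unresolved libraries may help here; `missing_libs` is a hypothetical helper, and the location of the compiled mmcv extension in the example comment is an assumption about how mmcv-full is installed in this container.

```bash
# Read `ldd` output on stdin and print the names of libraries the loader could not resolve.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
# Example (inside the container; the _ext*.so path is an assumption):
#   ldd /opt/conda/lib/python3.8/site-packages/mmcv/_ext*.so | missing_libs
```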
### Possible Solutions
**Option 1: Restart the Docker container**
```bash
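# Exit the container
exit
# Restart it from the host, then re-enter
# (docker restart works whether or not `exit` stopped the container)
docker restart [CONTAINER_ID]
docker exec -it [CONTAINER_ID] /bin/bash
cd /workspace/bevfusion
# Retry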
bash start_phase4a_bev2x_fixed.sh
```
**Option 2: Rebuild mmcv (last resort)**
```bash
# This can take a while (1-2 hours)
pip uninstall mmcv-full -y
pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html
```
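
The wheel index in the command above encodes CUDA 11.3 and torch 1.10.0, so it is worth confirming that these match the torch build actually installed before reinstalling. The sketch below builds the index URL from a torch version and a CUDA version; `mmcv_index_url` is a hypothetical helper, and the URL pattern is inferred from the single URL in this record, so treat it as an assumption (the index may also be keyed by a different patch version than the one torch reports).

```bash
# Build an mmcv wheel index URL from a torch version and a CUDA version.
# A local version suffix such as "+cu113" is stripped, and "11.3" becomes "113".
mmcv_index_url() {
    local torch_version="${1%%+*}" cuda_version="${2//./}"
    printf 'https://download.openmmlab.com/mmcv/dist/cu%s/torch%s/index.html\n' \
        "$cuda_version" "$torch_version"
}
# Example (inside the container): feed in the values reported by the installed torch.
#   python -c "import torch; print(torch.__version__, torch.version.cuda)"
#   mmcv_index_url 1.10.0+cu113 11.3
```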
**Option 3: Use a different launcher**
```bash
# If torchpack is the problem, use PyTorch DDP directly
python -m torch.distributed.run --nproc_per_node=6 --master_port=29500 \
tools/train.py configs/.../multitask_BEV2X_phase4a.yaml \
--launcher pytorch \
--load_from runs/enhanced_from_epoch19/epoch_23.pth
```
---
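## Interim Measures
### If the Issue Cannot Be Resolved Immediately
Other work can proceed in the meantime:
1. Analyze the detailed Phase 3 results
2. Prepare the real-vehicle data collection plan
3. Study the MapTR integration approach
4. Design the model compression approach
### Fallback Options
If BEV 2x cannot be launched at all, consider:
1. BEV 1.5x (0.2 m resolution) - reduces compute by 50%
2. Continuing Phase 3 training for more epochs
3. Moving directly to Phase 4B model compression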
---
**Status**: Open
**Priority**: P0 (highest)
**Impact**: Blocks Phase 4A training
**Recommendation**: Fix the environment issue first; all follow-up work depends on it