# BEVFusion Project Summary - One-Page Version

**Date**: 2025-10-30
**Status**: Phase 4A Stage 1 is training 🚀

---

## 📊 Project History

| Phase | Epoch | Configuration | Performance | Status |
|------|-------|------|------|------|
| Phase 1-2 | 1-19 | Base model | Baseline | ✅ |
| Phase 3 | 20-23 | Enhanced head, 400×400 | NDS 0.6941, mIoU 0.41 | ✅ |
| Stage 1 | 24-33 | Enhanced head, 600×600 | Training | 🔄 |
| Stage 2 | TBD | Enhanced head, 800×800 | To be planned | ⏸️ |

---

## ⚠️ 8 Key Problems and Fixes

### 1. mmcv fails to load after a Docker restart ⭐⭐⭐
**Error**: `ImportError: libtorch_cuda_cu.so`
**Cause**: the library file names mmcv looks for do not match the ones in the installed PyTorch
**Fix**:
```bash
# Alias the library names mmcv expects to the files PyTorch actually ships.
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so
```

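
The same aliases can also be created from Python, which makes the step easy to re-run after every restart. This is a minimal sketch, not the project's own tooling: `ensure_links` and `LINKS` are hypothetical names, and unlike `ln -sf` the function leaves an existing alias untouched instead of replacing it.

```python
import os
from pathlib import Path

# Alias -> real library, copied from the ln -sf commands in this section.
LINKS = {
    "libtorch_cuda_cu.so": "libtorch_cuda.so",
    "libtorch_cuda_cpp.so": "libtorch_cuda.so",
    "libtorch_cpu_cpp.so": "libtorch_cpu.so",
}


def ensure_links(lib_dir, links=LINKS):
    """Create each missing alias as a relative symlink; return the names created."""
    lib_dir = Path(lib_dir)
    created = []
    for alias, target in links.items():
        alias_path = lib_dir / alias
        if not (lib_dir / target).exists():
            continue  # real library is missing, so there is nothing to link to
        if alias_path.is_symlink() or alias_path.exists():
            continue  # alias already present, leave it alone
        os.symlink(target, alias_path)
        created.append(alias)
    return created
```

Calling `ensure_links("/opt/conda/lib/python3.8/site-packages/torch/lib")` inside the container would correspond to the shell commands, assuming that directory is where PyTorch is installed.
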
### 2. Out of memory at 800×800 ⭐⭐⭐
**Error**: `CUDA out of memory` (18GB/32GB)
**Cause**: 800×800 has 4× as many BEV cells as 400×400, so memory demand grows roughly 4×
**Fix**: progressive training (600×600 → 800×800)

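
The scaling behind this is simple arithmetic: the cell count grows with the square of the side length. The sketch below only compares grid sizes; it assumes memory tracks the number of output cells, which holds for the resolution-dependent activations but not for weights or fixed-size buffers. `cell_ratio` is a hypothetical helper name.

```python
def cell_ratio(new_side, base_side=400):
    """Ratio of BEV grid cells between two square resolutions."""
    return (new_side / base_side) ** 2


for side in (400, 600, 800):
    print(f"{side}x{side}: {cell_ratio(side):.2f}x the cells of 400x400")
```

The 600×600 step lands at 2.25×, a bit more than halfway to the 4× of 800×800, which is why it serves as the intermediate stage.
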
### 3. Shape mismatch ⭐⭐
**Error**: `Target 800×800 vs Input 400×400`
**Cause**: `output_scope` was misconfigured
**Fix**: correct the configuration and add adaptive interpolation

### 4. Interpolation dtype error ⭐⭐
**Error**: `upsample not implemented for Long`
**Cause**: PyTorch does not support interpolating integer tensors
**Fix**: cast with `.float()` before interpolating and keep the result as float

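
Problems 3 and 4 meet in one place: a tensor is resized to a different grid, and the resize needs a floating dtype. The sketch below shows one way this could look, assuming the tensor being resized is an integer `(N, C, H, W)` label map; which tensor the project actually resizes is not stated here, and `resize_labels` is a hypothetical name.

```python
import torch
import torch.nn.functional as F


def resize_labels(labels: torch.Tensor, size: tuple) -> torch.Tensor:
    """Resize an integer (N, C, H, W) label map to `size`, returning float."""
    labels = labels.float()  # cast first: the resize runs on a floating dtype
    if tuple(labels.shape[-2:]) == tuple(size):
        return labels  # already the right size, nothing to interpolate
    # "nearest" copies existing values, so no new label values are invented.
    return F.interpolate(labels, size=size, mode="nearest")


masks = torch.randint(0, 2, (1, 6, 4, 4))  # torch.int64, i.e. Long
out = resize_labels(masks, (8, 8))
print(out.shape, out.dtype)
```

Skipping the interpolation when the sizes already agree is what makes the step adaptive: the same code path works for any configured resolution.
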
### 5. LD_LIBRARY_PATH not propagated ⭐
**Cause**: torchpack may not pass environment variables through
**Fix**: set the environment variable explicitly in front of the command

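
The principle is to hand the variable to the child process explicitly rather than rely on inheritance. A minimal Python launcher sketch of that idea follows. `TORCH_LIB` is an assumption (it reuses the directory from the symlink fix), `env_with_lib_path` is a hypothetical helper, and the child here is a stand-in that only echoes the variable, not the real training command.

```python
import os
import subprocess
import sys

# Assumed location of the PyTorch shared libraries.
TORCH_LIB = "/opt/conda/lib/python3.8/site-packages/torch/lib"


def env_with_lib_path(lib_dir=TORCH_LIB, base=None):
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base is None else base)
    old = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + old if old else "")
    return env


# Stand-in child process: it prints the value it actually received.
child = [sys.executable, "-c", "import os; print(os.environ['LD_LIBRARY_PATH'])"]
result = subprocess.run(
    child, env=env_with_lib_path(), capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```
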
### 6. DataLoader shared memory ⭐
**Error**: `unable to write to file </torch_xxx>`
**Fix**: `--data.workers_per_gpu 0` (data is then loaded in the main process, without worker shared memory)

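
Worker processes pass batches through shared memory, and a Docker container gets only 64 MB of `/dev/shm` unless it is started with a larger `--shm-size`. The sketch below checks the free space before choosing a worker count. The 1 GiB threshold and the preferred count of 4 are illustrative assumptions, and both function names are hypothetical.

```python
import shutil


def free_gib(path="/dev/shm"):
    """Free space at `path` in GiB."""
    return shutil.disk_usage(path).free / 2**30


def workers_for(path="/dev/shm", min_free_gib=1.0, preferred=4):
    """Fall back to 0 workers when shared memory looks too small."""
    try:
        return preferred if free_gib(path) >= min_free_gib else 0
    except OSError:
        return 0  # path missing or unreadable, stay on the safe side


print(workers_for())
```

Setting workers to 0 trades data-loading throughput for stability, so enlarging the container's shared memory is the alternative when loading becomes the bottleneck.
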
### 7. Python cache ⭐
**Symptom**: code changes do not take effect
**Fix**: `find . -name __pycache__ -exec rm -rf {} +`

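
For use inside a launch script, the same cleanup can be sketched in Python. `clear_pycache` is a hypothetical helper; it removes directories only and reports how many it deleted.

```python
import shutil
from pathlib import Path


def clear_pycache(root="."):
    """Delete every __pycache__ directory under root; return how many were found."""
    targets = [p for p in Path(root).rglob("__pycache__") if p.is_dir()]
    for path in targets:
        shutil.rmtree(path, ignore_errors=True)
    return len(targets)
```
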
### 8. Config not synchronized ⭐
**Symptom**: `total_epochs` is wrong after copying a config
**Fix**: check every resolution and epoch setting

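
A copied config can be checked mechanically before a run starts. The sketch below compares a nested config dictionary against the values the new stage should have. Apart from `total_epochs`, every key name and value in it is hypothetical and only illustrates the shape of such a check.

```python
def find_mismatches(cfg, expected):
    """Return {dotted_key: (found, wanted)} for every expected value that differs."""
    bad = {}
    for dotted, wanted in expected.items():
        node = cfg
        for part in dotted.split("."):
            # Walk down one level; a missing or non-dict level yields None.
            node = node.get(part) if isinstance(node, dict) else None
        if node != wanted:
            bad[dotted] = (node, wanted)
    return bad


# Hypothetical config copied from an earlier phase and only partly updated.
cfg = {"total_epochs": 4, "model": {"heads": {"map": {"output_size": [600, 600]}}}}
expected = {"total_epochs": 10, "model.heads.map.output_size": [600, 600]}
print(find_mismatches(cfg, expected))
```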
---
## 🎯 Current Status (13:15)

```
Training: Phase 4A Stage 1
Config: 600×600, 4-layer decoder, deep supervision + Dice
GPU: 4 GPUs @ 100% utilization, ~30GB memory
Progress: Epoch 1, iter 100+/30895
Loss: 6.9 → 6.3 (decreasing)
ETA: ~9 days
```

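
The ETA can be cross-checked against the other figures in this summary. The sketch assumes one epoch takes the ~21 hours quoted for the Epoch 1 validation checkpoint and that Stage 1 spans 10 epochs (24-33 in the project history table); validation time is ignored.

```python
ITERS_PER_EPOCH = 30895  # from the progress line
HOURS_PER_EPOCH = 21     # assumed: the ~21 hours until the Epoch 1 validation
EPOCHS = 10              # assumed: epochs 24-33 in the project history table

sec_per_iter = HOURS_PER_EPOCH * 3600 / ITERS_PER_EPOCH
total_days = HOURS_PER_EPOCH * EPOCHS / 24
print(f"{sec_per_iter:.2f} s/iter, {total_days:.2f} days in total")
# prints "2.45 s/iter, 8.75 days in total", in line with the ~9 day ETA
```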
---
## ✅ 3-Step Startup After a Docker Restart

```bash
# 1. Create the symbolic links
cd /opt/conda/lib/python3.8/site-packages/torch/lib && \
ln -sf libtorch_cuda.so libtorch_cuda_cu.so && \
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so && \
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so

# 2. Verify the environment
cd /workspace/bevfusion && python -c "from mmcv.ops import nms_match; print('OK')"

# 3. Start training
bash START_PHASE4A_STAGE1.sh
```
---
## 📚 Full Documentation

**Most comprehensive**: `项目进展与问题解决总结_20251030.md` (this document)
**Quick reference**: `QUICK_REFERENCE_CARD.md`
**Index**: `项目状态总览_20251030.md`

---

**Next checkpoint**: Epoch 1 validation (in ~21 hours)