================================================================================
BEVFusion Project Progress Summary - For the User
================================================================================
Generated: 2025-10-30 13:25
Current status: ✅ Phase 4A Stage 1 training is running stably

================================================================================
Project Progress Overview
================================================================================

✅ Phase 3 complete:
- NDS: 0.6941, mAP: 0.6446, mIoU: 0.41
- Stop Line: 0.27, Divider: 0.19 (both need improvement)
- Checkpoint: epoch_23.pth (516MB)

🚀 Phase 4A Stage 1 training in progress:
- Resolution: 600×600 (+50% per side vs. Phase 3)
- Model: 4-layer decoder + Deep Supervision + Dice Loss
- Progress: Epoch 1, iter 350+/30895
- Loss: 6.9 → 5.7 (decreasing steadily)
- GPU: 4 GPUs @ 100% utilization
- Estimated completion: in about 9 days

================================================================================
8 Key Issues Resolved (required reading for future training runs!)
================================================================================

⭐⭐⭐ Issue 1: mmcv fails to load after a Docker restart
Fix: create symbolic links
  cd /opt/conda/lib/python3.8/site-packages/torch/lib
  ln -sf libtorch_cuda.so libtorch_cuda_cu.so
  ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
  ln -sf libtorch_cpu.so libtorch_cpu_cpp.so
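
The same links can also be created idempotently from Python, for example from a container entrypoint. This is a minimal sketch rather than the project's own script: the function name ensure_torch_symlinks is hypothetical, and the link map simply mirrors the three ln -sf commands above.

```python
import os
from pathlib import Path

# Link name -> target, mirroring the three `ln -sf` commands above.
TORCH_LINKS = {
    "libtorch_cuda_cu.so": "libtorch_cuda.so",
    "libtorch_cuda_cpp.so": "libtorch_cuda.so",
    "libtorch_cpu_cpp.so": "libtorch_cpu.so",
}


def ensure_torch_symlinks(lib_dir):
    """Create (or refresh) relative symlinks inside lib_dir, like `ln -sf`.

    Returns the list of link names that were written.
    """
    lib_dir = Path(lib_dir)
    written = []
    for link_name, target in TORCH_LINKS.items():
        link = lib_dir / link_name
        # `ln -sf` replaces an existing entry, so remove any old one first.
        if link.is_symlink() or link.exists():
            link.unlink()
        os.symlink(target, link)  # relative target, resolved inside lib_dir
        written.append(link_name)
    return written
```

Calling ensure_torch_symlinks("/opt/conda/lib/python3.8/site-packages/torch/lib") after each container restart reproduces the manual fix. Running it twice is harmless because existing links are replaced.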

⭐⭐⭐ Issue 2: Out of GPU memory at 800×800
Fix: progressive training (600×600 → 800×800)

⭐⭐ Issue 3: Shape mismatch (target 800×800 vs. output 400×400)
Fix: corrected the config and added adaptive interpolation in the code

⭐⭐ Issue 4: Wrong dtype for interpolation (Long tensors cannot be interpolated)
Fix: cast with .float() before interpolating, and keep the result as float for the focal loss
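
Issues 3 and 4 both come down to bringing the ground-truth map and the prediction to the same spatial size before the loss is computed. In PyTorch this roughly corresponds to calling torch.nn.functional.interpolate with mode="nearest" on a float copy of the target. The dependency-free sketch below only illustrates the nearest-neighbour index mapping on a nested list. The function name resize_nearest is hypothetical, the choice to resize the target rather than the output is an assumption, and the actual code in enhanced.py is not shown in this summary.

```python
def resize_nearest(grid, out_h, out_w):
    """Resize a 2D label grid to (out_h, out_w) by nearest-neighbour lookup.

    Each output cell copies the source cell at floor(index * in / out),
    so label values are never blended. Values are returned as floats,
    matching the "keep float for the focal loss" fix described above.
    """
    in_h, in_w = len(grid), len(grid[0])
    return [
        [float(grid[(y * in_h) // out_h][(x * in_w) // out_w]) for x in range(out_w)]
        for y in range(out_h)
    ]


# A 4x4 binary mask resized to the 2x2 size of a hypothetical model output.
target = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
resized = resize_nearest(target, 2, 2)
print(resized)
```

Nearest-neighbour lookup is used here because averaging would blur the class labels in a segmentation mask.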

⭐ Issue 5: LD_LIBRARY_PATH environment variable
Fix: set the variable explicitly in front of the launch command
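
When the launch is driven from Python instead of a shell, the same idea is to pass an explicit environment to the child process. A minimal sketch follows. The helper name with_library_path is hypothetical, and which directory belongs on LD_LIBRARY_PATH is an assumption, since this summary does not state it.

```python
import os


def with_library_path(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH.

    base_env defaults to the current process environment. Existing entries
    are kept, and a duplicate of lib_dir is dropped so the path does not grow
    on repeated calls.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    parts = [p for p in current.split(":") if p and p != lib_dir]
    env["LD_LIBRARY_PATH"] = ":".join([lib_dir] + parts)
    return env
```

The returned dict can be handed to subprocess.run(["bash", "START_PHASE4A_STAGE1.sh"], env=..., check=True), so the variable is set for that launch regardless of what the parent shell exported.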

⭐ Issue 6: DataLoader shared-memory error
Fix: workers_per_gpu=0

⭐ Issue 7: Stale Python bytecode cache
Fix: find . -name __pycache__ -exec rm -rf {} +
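
For environments where find is not convenient, the cache cleanup can be done from Python. This is a sketch with a hypothetical helper name, clear_pycache. It removes the same __pycache__ directories as the command above and reports how many it deleted.

```python
import shutil
from pathlib import Path


def clear_pycache(root="."):
    """Delete every __pycache__ directory under root; return how many were removed."""
    # Materialise the list first so deletion does not disturb the directory walk.
    cache_dirs = [p for p in Path(root).rglob("__pycache__") if p.is_dir()]
    removed = 0
    for cache_dir in cache_dirs:
        if cache_dir.exists():  # may already be gone if nested in a removed dir
            shutil.rmtree(cache_dir)
            removed += 1
    return removed
```

Running clear_pycache("/workspace/bevfusion") before a restart makes Python recompile the edited modules instead of loading stale bytecode.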

⭐ Issue 8: Config parameters out of sync
Fix: manually check every key config value
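
The manual check can be backed by a small comparison helper once the config has been loaded into a nested dict. This is only a sketch: find_mismatches is a hypothetical name, and the keys and values in the example are illustrative, not the real contents of the Stage 1 YAML.

```python
def find_mismatches(config, expected):
    """Compare dotted-key expectations against a nested config dict.

    Returns {key: (expected_value, actual_value)} for every key that differs
    or is missing (actual_value is None when the key is missing).
    """
    mismatches = {}
    for dotted_key, want in expected.items():
        node = config
        for part in dotted_key.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                node = None
                break
        if node != want:
            mismatches[dotted_key] = (want, node)
    return mismatches


# Hypothetical keys and values, for illustration only.
config = {"data": {"workers_per_gpu": 4}, "model": {"decoder_layers": 4}}
expected = {"data.workers_per_gpu": 0, "model.decoder_layers": 4}
print(find_mismatches(config, expected))
```

An empty result means every listed value matches. Anything else names the exact key to fix before launching.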

================================================================================
Technical Improvements (Phase 3 → Stage 1)
================================================================================

Resolution: 360×360 → 540×540 (+50% per side)
GT labels: 400×400 → 600×600 (+50% per side)
Decoder: 2 layers → 4 layers (twice as deep)
New features: + Deep Supervision + Dice Loss
GPU memory: ~8GB/GPU → ~30GB/GPU
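
The +50% figures refer to side length. The number of grid cells grows faster, which likely contributes to the higher memory use alongside the deeper decoder. A quick arithmetic check, using only the sizes listed above:

```python
def scale_report(old_side, new_side):
    """Return (per-side increase in %, pixel-count ratio) for a square grid."""
    side_ratio = new_side / old_side
    return (side_ratio - 1) * 100, side_ratio ** 2


for name, old, new in [("resolution", 360, 540), ("GT labels", 400, 600)]:
    pct, area = scale_report(old, new)
    print(f"{name}: +{pct:.0f}% per side, {area:.2f}x cells")
```

Both rows come out at 1.5x per side and 2.25x in cell count.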

================================================================================
Quick Recovery After a Docker Restart (3 steps)
================================================================================

1. Create the symbolic links:
   cd /opt/conda/lib/python3.8/site-packages/torch/lib
   ln -sf libtorch_cuda.so libtorch_cuda_cu.so
   ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
   ln -sf libtorch_cpu.so libtorch_cpu_cpp.so

2. Verify the environment:
   cd /workspace/bevfusion
   python -c "from mmcv.ops import nms_match; print('✅ OK')"

3. Start training:
   bash START_PHASE4A_STAGE1.sh

================================================================================
Monitoring Commands
================================================================================

Routine monitoring: bash monitor_phase4a_stage1.sh
Live log: tail -f phase4a_stage1_*.log | grep "Epoch \["
GPU status: nvidia-smi
Stop training: pkill -9 -f "torchpack|mpirun"
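
For trend tracking beyond grep, the progress lines can be parsed into numbers. The sketch below assumes each line has the shape "Epoch [epoch][iter/total] ... loss: value", which is consistent with the grep pattern above but is not confirmed by this summary. The sample lines are synthetic, and parse_progress is a hypothetical helper name.

```python
import re

# Assumed line shape: "Epoch [1][350/30895] ... loss: 5.7000"
LINE_RE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\].*?\bloss: ([0-9.]+)")


def parse_progress(lines):
    """Yield (epoch, iteration, total_iterations, loss) from matching log lines."""
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            epoch, it, total, loss = match.groups()
            yield int(epoch), int(it), int(total), float(loss)


# Synthetic lines for illustration; real logs carry more fields.
sample = [
    "INFO - Epoch [1][50/30895]\tlr: 1.0e-05, loss: 6.9000",
    "INFO - some unrelated message",
    "INFO - Epoch [1][350/30895]\tlr: 1.0e-05, loss: 5.7000",
]
for epoch, it, total, loss in parse_progress(sample):
    print(f"epoch {epoch}: {it}/{total} ({it / total:.1%}) loss={loss}")
```

Passing an open log file in place of sample works the same way, because the function only iterates over lines.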

================================================================================
Key File Locations
================================================================================

Config: configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/
  └─ multitask_BEV2X_phase4a_stage1.yaml

Code: mmdet3d/models/heads/segm/enhanced.py
  (interpolation dtype bug fixed)

Checkpoint:
  Phase 3: runs/enhanced_from_epoch19/epoch_23.pth
  Stage 1: runs/run-326653dc-c038af2c/epoch_*.pth

Log: phase4a_stage1_20251030_130707.log

================================================================================
Full Documentation (17 files)
================================================================================

⭐⭐⭐ 3 must-reads:
1. 项目进展与问题解决总结_20251030.md (most detailed)
2. QUICK_REFERENCE_CARD.md (quick reference)
3. 训练总结_一页纸版本.md (condensed version)

Other documents:
- PROJECT_SUMMARY_20251030_FINAL.md (overall status)
- PHASE4A_STAGE1_LAUNCHED_SUCCESS.md (Stage 1 launch record)
- ENVIRONMENT_FIX_RECORD.md (environment fix record)
- 项目状态一览_LATEST.txt (live status)

... plus 10 more detailed documents

================================================================================
Next Steps
================================================================================

Short term (daily): monitor loss and GPU stability
Epoch 1 (~21 hours): verify the performance gain
Epoch 5 (~4.5 days): assess whether the target is met (Stop Line 0.32+)
Completion (~9 days): final Stage 1 evaluation + plan Stage 2
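
The milestones above are mutually consistent if the run lasts 10 epochs at about 21 hours each. The epoch count is an assumption inferred from "~4.5 days at epoch 5" and "~9 days at completion"; it is not stated in this summary. Treating 30895 as the per-epoch iteration count is also an assumption. A quick check of the arithmetic:

```python
ITERS_PER_EPOCH = 30895   # from the progress line "iter 350+/30895"
HOURS_PER_EPOCH = 21      # from the "Epoch 1 (~21 hours)" milestone
ASSUMED_EPOCHS = 10       # assumption: implied by ~4.5 days at epoch 5, ~9 days total

seconds_per_iter = HOURS_PER_EPOCH * 3600 / ITERS_PER_EPOCH
epoch5_days = 5 * HOURS_PER_EPOCH / 24
total_days = ASSUMED_EPOCHS * HOURS_PER_EPOCH / 24

print(f"{seconds_per_iter:.2f} s/iter")
print(f"epoch 5 after ~{epoch5_days:.1f} days, finish after ~{total_days:.1f} days")
```

If the measured time per iteration drifts from this figure, the same three lines give an updated completion estimate.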

================================================================================
Training is running normally and the loss keeps falling! 🎉
================================================================================