BEVFusion Training Quick Reference Card
Updated: 2025-10-30
Purpose: quick reference for subsequent training runs
🚨 Required after every Docker restart
cd /workspace/bevfusion
export PATH=/opt/conda/bin:$PATH
# 1. Create the required symbolic links (critical!)
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so
# 2. Verify the environment
cd /workspace/bevfusion
python -c "import torch; from mmcv.ops import nms_match; print('✅ Environment OK')"
# 3. Check training status
bash monitor_phase4a_stage1.sh  # if Stage 1 is running
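Before running the import check in step 2, it can help to confirm that the three links from step 1 actually exist. The following is a minimal check-only sketch; `check_torch_links` is a hypothetical helper name, not part of this repository, and the directory is passed in as an argument rather than hard-coded.

```shell
#!/usr/bin/env bash
# Check-only sketch: report which of the three torch library links are
# missing or dangling. check_torch_links is a hypothetical helper name.
check_torch_links() {
    local lib_dir="$1" missing=0 name
    for name in libtorch_cuda_cu.so libtorch_cuda_cpp.so libtorch_cpu_cpp.so; do
        # -e follows symlinks, so a dangling link also counts as missing.
        if [ ! -e "$lib_dir/$name" ]; then
            echo "missing: $lib_dir/$name"
            missing=1
        fi
    done
    return "$missing"
}
```

With the path used in this card, the call would be `check_torch_links /opt/conda/lib/python3.8/site-packages/torch/lib`; a non-zero exit status means step 1 needs to be repeated.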
⚡ Quick-start training
cd /workspace/bevfusion
# Stage 1 (600×600) - currently recommended
bash START_PHASE4A_STAGE1.sh
# Monitor
bash monitor_phase4a_stage1.sh
tail -f phase4a_stage1_*.log | grep "Epoch \["
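The `phase4a_stage1_*.log` glob follows every matching log at once, which gets noisy after several restarts. As a sketch, assuming the newest log is the one of interest, a small helper (the name `latest_log` is hypothetical) can pick the most recently modified file:

```shell
#!/usr/bin/env bash
# latest_log is a hypothetical helper: print the newest .log file whose name
# starts with the given prefix, or nothing if no file matches.
latest_log() {
    local prefix="$1"
    # ls -t sorts by modification time, newest first.
    ls -t "$prefix"*.log 2>/dev/null | head -n 1
}
```

Usage from `/workspace/bevfusion` would look like `tail -f "$(latest_log phase4a_stage1_)" | grep "Epoch \["`.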
🔧 Quick fixes for common problems
mmcv fails to load
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so
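Because this fix has to be repeated after every container restart, it may be convenient to wrap it in a function that refuses to create a link when the target library is absent. This is a sketch only; `link_torch_libs` is a hypothetical name, and the three target/link pairs are copied from the commands above.

```shell
#!/usr/bin/env bash
# link_torch_libs is a hypothetical helper; pass the torch lib directory.
# It recreates the three links and stops if a target library is missing.
link_torch_libs() {
    local lib_dir="$1" pair target link
    for pair in \
        "libtorch_cuda.so:libtorch_cuda_cu.so" \
        "libtorch_cuda.so:libtorch_cuda_cpp.so" \
        "libtorch_cpu.so:libtorch_cpu_cpp.so"; do
        target="${pair%%:*}"
        link="${pair##*:}"
        if [ ! -e "$lib_dir/$target" ]; then
            echo "skip $link: $target not found in $lib_dir" >&2
            return 1
        fi
        # Relative target, equivalent to running ln -sf inside lib_dir.
        ln -sf "$target" "$lib_dir/$link"
    done
}
```

Running it twice is harmless, since `ln -sf` replaces an existing link.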
Out of GPU memory
# Reduce the number of GPUs or lower the resolution
# 600×600: feasible on 4 GPUs
# 800×800: 3 GPUs + gradient checkpointing
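To see how close each GPU is to its limit before changing the configuration, the per-GPU usage can be read from `nvidia-smi` in CSV form. The sketch below assumes rows of the form `index, used, total` in MiB; `gpus_over_limit` is a hypothetical helper, and the 90% threshold in the usage line is an arbitrary example value.

```shell
#!/usr/bin/env bash
# gpus_over_limit is a hypothetical helper. It reads "index, used, total"
# rows (MiB) on stdin and prints the GPUs whose usage exceeds a percentage.
gpus_over_limit() {
    local limit_pct="$1"
    awk -F', *' -v limit="$limit_pct" '
        $3 > 0 && ($2 / $3) * 100 > limit {
            printf "GPU %s: %d/%d MiB (%.0f%%)\n", $1, $2, $3, $2 / $3 * 100
        }'
}
```

Usage: `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits | gpus_over_limit 90`.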
Code changes do not take effect
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
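A variant of the same cleanup, sketched with `-prune` so that `find` does not try to descend into directories it is about to delete, which avoids the need to silence errors. `clean_pycache` is a hypothetical helper name; the root directory defaults to the current one.

```shell
#!/usr/bin/env bash
# clean_pycache is a hypothetical helper: remove every __pycache__ directory
# under the given root (default: current directory).
clean_pycache() {
    local root="${1:-.}"
    # -prune stops find from descending into a directory that will be removed.
    find "$root" -type d -name "__pycache__" -prune -exec rm -rf {} +
}
```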
Training hangs
pkill -9 -f "torchpack|mpirun"  # pkill takes an extended regex, so | is alternation
nvidia-smi  # check the GPUs
bash START_SCRIPT.sh  # restart training
📊 Performance baseline
Phase 3 (epoch_23):
NDS: 0.6941
mAP: 0.6446
mIoU: 0.41
Stop Line: 0.27
Divider: 0.19
Stage 1 targets (10 epochs):
Stop Line: 0.35+
Divider: 0.28+
mIoU: 0.48+
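The improvement Stage 1 is expected to deliver over the Phase 3 baseline follows directly from the two lists above. As a worked example, using a hypothetical `gain` helper that subtracts the baseline from the target:

```shell
#!/usr/bin/env bash
# gain is a hypothetical helper: target minus baseline, to two decimals.
gain() {
    awk -v base="$1" -v target="$2" 'BEGIN { printf "%+.2f\n", target - base }'
}

gain 0.27 0.35   # Stop Line: +0.08
gain 0.19 0.28   # Divider:   +0.09
gain 0.41 0.48   # mIoU:      +0.07
```

These are minimum gains, since each target is stated as a lower bound ("+").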
📂 Key file locations
Checkpoint:
Phase 3: runs/enhanced_from_epoch19/epoch_23.pth
Stage 1: runs/run-326653dc-c038af2c/epoch_*.pth
Configs:
Phase 3: configs/.../multitask_enhanced_phase1_HIGHRES.yaml
Stage 1: configs/.../multitask_BEV2X_phase4a_stage1.yaml
Launch scripts:
Stage 1: START_PHASE4A_STAGE1.sh
Monitoring:
monitor_phase4a_stage1.sh
Code:
Segmentation head: mmdet3d/models/heads/segm/enhanced.py
🎯 Training configuration at a glance
| Setting | Phase 3 | Stage 1 | Stage 2 (planned) |
|---|---|---|---|
| BEV resolution | 0.3 m (360×360) | 0.2 m (540×540) | 0.15 m (720×720) |
| GT resolution | 0.25 m (400×400) | 0.167 m (600×600) | 0.125 m (800×800) |
| Decoder | 2 layers | 4 layers | 4 layers |
| Deep supervision | ❌ | ✅ | ✅ |
| Dice loss | ❌ | ✅ | ✅ |
| GPUs | 8 | 4 | 3-4 |
| Memory per GPU | ~8 GB | ~30 GB | ~32 GB |
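The resolution and grid-size pairs in the table are consistent with a fixed spatial extent per row: multiplying them out gives about 108 m per side for the BEV grid (0.3 × 360) and about 100 m per side for the GT grid (0.25 × 400). These extents are inferred from the table, not stated in it. Under that assumption, the cell size for any grid size can be sketched as follows; `cell_size` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# cell_size is a hypothetical helper: metres per cell for a square grid,
# given the side length in metres and the number of cells per side.
cell_size() {
    awk -v extent="$1" -v cells="$2" 'BEGIN { printf "%.3f\n", extent / cells }'
}

cell_size 108 360   # BEV, Phase 3: 0.300
cell_size 108 540   # BEV, Stage 1: 0.200
cell_size 100 600   # GT, Stage 1:  0.167
cell_size 100 800   # GT, Stage 2:  0.125
```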
Full documentation: 项目进展与问题解决总结_20251030.md