================================================================================
BEVFusion Training Quick Reference
Last updated: 2025-11-06 15:46 (Beijing time, UTC+8)
================================================================================

Current training phase: Phase 4A Stage 1 - Task-specific GCA optimization
Config file: multitask_BEV2X_phase4a_stage1_task_gca.yaml
Status: 🟢 Running
Progress: Epoch 6, iter 4950/15448 (32% complete)
Start time: 2025-11-06 03:59 UTC (11:59 Beijing time)
Estimated completion: 2025-11-13 (in about 7 days)

================================================================================
1. Current Training Status - Live
================================================================================

[Latest status - 2025-11-06 15:46]
✅ Training is running normally
- Epoch: [1]/20 in the log (continued from epoch_5.pth, so effectively Epoch 6)
- Progress: 4950/15448 iterations (32.0%)
- Elapsed: 3 h 47 min
- Loss: 2.4543 (decreasing steadily)
- GPU: all 8 GPUs fully loaded
- GPU memory: ~29GB/GPU
- Speed: 2.66 s/iteration

[Key performance indicators]
Divider Dice Loss: 0.5339 (Epoch 5: 0.5140, a change of +3.9%; lower Dice loss is better, so this is not yet an improvement)
Detection IoU: 0.6179 (good)
Overall Loss: 2.45 (down from 2.47 at launch)

[Time estimates]
- Epoch 6 complete: tonight at 23:26 (Beijing time)
- All training complete: 2025-11-13 (about 7 days)

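The time estimates above follow from the iteration counts and the measured 2.66 s/iteration. A minimal sketch of that arithmetic, assuming the pace stays constant and ignoring checkpoint and evaluation time (the helper name `remaining_time` is illustrative and not part of the training code):

```python
from datetime import datetime, timedelta

def remaining_time(done_iters: int, total_iters: int, sec_per_iter: float) -> timedelta:
    """Time left in the current epoch at a constant pace."""
    return timedelta(seconds=(total_iters - done_iters) * sec_per_iter)

# Figures taken from the status block above.
left = remaining_time(4950, 15448, 2.66)
now = datetime(2025, 11, 6, 15, 46)           # Beijing time
epoch_end = now + left
full_epoch = timedelta(seconds=15448 * 2.66)  # one complete epoch

print(f"remaining in epoch: {left}")
print(f"epoch ends around:  {epoch_end:%Y-%m-%d %H:%M}")
print(f"one full epoch:     {full_epoch}")
```

At exactly 2.66 s/iteration this gives about 7 h 45 min remaining and an end time of 23:31, a few minutes later than the 23:26 quoted above, which presumably used a slightly different pace.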
================================================================================
1.1 Training History
================================================================================

Phase 5 Enhanced training: ✅ Complete
- Model: EnhancedBEVSegmentationHead
- Epochs: 23/23
- Completed: 2025-10-29 23:21
- Checkpoint: enhanced_from_epoch19/epoch_23.pth (516MB)
- Features: ASPP + Attention + Deep Supervision + GroupNorm

Phase 4A Stage 1 status: ✅ Running
- Script: START_PHASE4A_TASK_GCA_BACKGROUND.sh (launched)
- Config: multitask_BEV2X_phase4a_stage1_task_gca.yaml
- Starting point: epoch_5.pth (loaded)
- Current: Epoch 6 in progress (32% complete)
- Target: epoch_20.pth (15 epochs to go, counting Epoch 6)
- Log: /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log
- PID: 1234388 (18 related processes)

================================================================================
2. Task-specific GCA Architecture
================================================================================

[Core innovation]
Decoder Neck → raw BEV (512 channels, full information)
  ├─ Detection GCA (independent) → detection-optimized BEV → detection head
  └─ Segmentation GCA (independent) → segmentation-optimized BEV → segmentation head

[Config parameters]
task_specific_gca:
  enabled: true
  in_channels: 512
  reduction: 4
  object_reduction: 4  # channel reduction ratio of the detection GCA
  map_reduction: 4     # channel reduction ratio of the segmentation GCA

[Advantages]
✅ Each task selects channels according to its own needs
✅ Avoids the compromise forced by a single shared selection
✅ Parameter increase: only 0.26M
✅ Compute increase: ~1.6ms (0.06%)

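The 0.26M and 0.06% figures can be sanity-checked with a little arithmetic. The sketch below assumes each GCA is a squeeze-and-excitation-style gate built from two bias-free projections (512 → 512/4 → 512); that layer structure is an assumption for illustration and is not specified in this document.

```python
def gate_params(in_channels: int, reduction: int) -> int:
    """Weights in a two-layer bottleneck gate (assumed structure, no biases)."""
    hidden = in_channels // reduction
    return in_channels * hidden + hidden * in_channels

detection = gate_params(512, 4)      # object_reduction: 4
segmentation = gate_params(512, 4)   # map_reduction: 4
total = detection + segmentation

# 1.6 ms of extra compute measured against one 2.66 s training iteration.
overhead = 1.6e-3 / 2.66

print(f"extra parameters: {total} (~{total / 1e6:.2f}M)")
print(f"relative overhead: {overhead:.2%}")
```

Under these assumptions the two gates add 262,144 weights, which matches the quoted 0.26M, and 1.6 ms is about 0.06% of a 2.66 s iteration.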
[Current effect]
Divider Dice loss: 0.5140 → 0.5339 (+3.9%) at 32% of Epoch 6; lower is better, so no improvement yet
Overall loss stable: 2.45 (fluctuating between 2.3 and 2.5)

================================================================================
3. Quick Monitoring Commands - Current Run
================================================================================

[Live monitoring]
# Follow the training log (recommended)
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "Epoch \["

# Show only the loss and key metrics
# (--line-buffered stops the first grep from holding output back inside the pipe)
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep --line-buffered -E "loss/map/divider|loss:" | grep "Epoch \["

# Follow Divider performance
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "loss/map/divider/dice"

[GPU monitoring]
# Live GPU status
watch -n 5 nvidia-smi

# Query GPU utilization, memory and temperature
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu --format=csv

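For scripted checks, the CSV query above is straightforward to parse. A minimal sketch, assuming `nvidia-smi` is on PATH and adding the `noheader,nounits` format options so each line is plain numbers; the 90% utilization threshold and the helper names are arbitrary illustrations:

```python
import subprocess

QUERY = "index,utilization.gpu,memory.used,temperature.gpu"

def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem_mib, temp = (int(field) for field in line.split(","))
        rows.append({"index": index, "util": util, "mem_mib": mem_mib, "temp_c": temp})
    return rows

def query_gpus() -> list[dict]:
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

def idle_gpus(rows: list[dict], min_util: int = 90) -> list[int]:
    """Indices of GPUs whose utilization is below the threshold."""
    return [row["index"] for row in rows if row["util"] < min_util]
```

Calling `idle_gpus(query_gpus())` on the training host should return an empty list while all 8 GPUs are busy.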
[Process monitoring]
# Count training processes (the [t] keeps grep from counting itself)
ps aux | grep "[t]rain.py" | wc -l  # should be about 18

# Show the main processes
ps aux | grep "[t]rain.py" | head -3

[Log analysis]
# Show recent progress
tail -100 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "Epoch \[" | tail -5

# Get the current iteration (search recent lines, since the very last line may not be a progress line)
tail -100 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep -oP '\[\d+/\d+\]' | tail -1

# Check the log file size
ls -lh /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log

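The grep patterns above can be combined into a small parser when the numbers are needed in a script. This is a sketch that assumes mmcv-style progress lines of the form `Epoch [e][i/n] ... key: value, ...`; the sample line is made up for illustration, so adjust the patterns if the real log differs.

```python
import re

# Assumed line shape (mmcv-style); not confirmed against the real log.
ITER_RE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\]")
METRIC_RE = re.compile(r"([\w/.]+): ([-+]?\d+(?:\.\d+)?(?:e[-+]?\d+)?)")

def parse_line(line: str):
    """Return (epoch, iteration, total, metrics), or None for non-progress lines."""
    m = ITER_RE.search(line)
    if m is None:
        return None
    epoch, it, total = map(int, m.groups())
    # Only scan the text after the progress marker, so the timestamp is ignored.
    metrics = {key: float(value) for key, value in METRIC_RE.findall(line[m.end():])}
    return epoch, it, total, metrics

# Hypothetical sample line using the values reported in this document.
sample = ("2025-11-06 15:46:02,123 - mmdet3d - INFO - Epoch [1][4950/15448]\t"
          "lr: 2.000e-05, time: 2.660, loss/map/divider/dice: 0.5339, loss: 2.4543")
epoch, it, total, metrics = parse_line(sample)
print(f"{it}/{total} ({it / total:.1%})", metrics["loss/map/divider/dice"], metrics["loss"])
```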
================================================================================
4. Training Launch Commands
================================================================================

[Currently running job]
Launch command:
cd /workspace/bevfusion && bash START_PHASE4A_TASK_GCA_BACKGROUND.sh

Command actually executed:
torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1_task_gca.yaml \
  --load_from /workspace/bevfusion/runs/run-326653dc-2334d461/epoch_5.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0

Status: ✅ Running (PID: 1234388)
Log: /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log

[To restart (do not run this casually)]
# 1. Stop the current training first
#    Caution: this kills every process named python that you are allowed to signal, not only this run
killall -9 python

# 2. Wait for the GPUs to be released
sleep 5

# 3. Relaunch
cd /workspace/bevfusion && bash START_PHASE4A_TASK_GCA_BACKGROUND.sh

================================================================================
5. Checkpoint Management
================================================================================

[Available checkpoints]
runs/run-326653dc-2334d461/
  ├─ epoch_3.pth (525MB)
  ├─ epoch_4.pth (525MB)
  └─ epoch_5.pth (525MB) ✅ starting point of the current run

runs/enhanced_from_epoch19/
  └─ epoch_23.pth (516MB) ✅ Phase 5 complete

/data/runs/phase4a_stage1_task_gca/
  └─ (checkpoints from Epoch 6 onward will be saved here, ~525MB each)

[Inspecting checkpoints]
# List the newest checkpoints
ls -lt /data/runs/phase4a_stage1_task_gca/*.pth 2>/dev/null | head -5

# Show checkpoint sizes
du -sh /data/runs/phase4a_stage1_task_gca/*.pth

================================================================================
6. Performance Tracking
================================================================================

[Current performance (Epoch 6, 32% complete)]
Segmentation (Dice loss, lower is better):
  Drivable Area: 0.11 ✅ excellent
  Ped Crossing:  0.23 🟡 moderate
  Walkway:       0.22 🟡 moderate
  Stop Line:     0.35 🔴 hard
  Carpark Area:  0.20 🟡 moderate
  Divider:       0.53 ⚠️ hardest (up 3.9% from Epoch 5's 0.5140, so not yet improved)

Detection:
  Matched IoU: 0.6179 ✅ good

[Expected final performance (Epoch 20)]
Overall mIoU: 52% → 61% (+9 points, about +17% relative)
Divider: Dice loss 0.51 → 0.42 (about 18% lower)
Detection mAP: 0.67 → 0.70 (about +4.5% relative)

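The percentages in this section are relative changes, which is easy to confuse with percentage points. A short sketch of the calculation, using only the numbers quoted in this document (the helper name `relative_change` is illustrative):

```python
def relative_change(before: float, after: float) -> float:
    """Signed change of `after` relative to `before`."""
    return (after - before) / before

targets = {
    "mIoU": (0.52, 0.61),               # higher is better
    "Divider Dice loss": (0.51, 0.42),  # lower is better
    "Detection mAP": (0.67, 0.70),      # higher is better
}
for name, (before, after) in targets.items():
    print(f"{name}: {relative_change(before, after):+.1%}")

# Current Divider Dice loss compared with the Epoch 5 value.
print(f"Divider so far: {relative_change(0.5140, 0.5339):+.1%}")
```

The last line prints +3.9%: the Divider Dice loss has risen since Epoch 5, which for a loss is a move in the wrong direction, while the Epoch 20 target corresponds to a drop of roughly 18%.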
================================================================================
7. Evaluation and Testing
================================================================================

[Evaluation settings]
Evaluation settings for the current run:
- Frequency: every 10 epochs (75% less evaluation overhead)
- Validation samples: 3010 (load_interval=2, a 50% reduction)
- Next evaluation: Epoch 10

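The 3010 figure is what a stride of 2 yields on the nuScenes validation split, which has 6019 samples (that total comes from the nuScenes dataset, not from this document). A small sketch of the subsampling, assuming `load_interval` keeps every n-th sample starting from the first:

```python
def subsample(num_samples: int, load_interval: int) -> int:
    """Number of samples kept when taking every `load_interval`-th one."""
    return len(range(0, num_samples, load_interval))

kept = subsample(6019, 2)
print(kept, f"({kept / 6019:.0%} of the full split)")
```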
[Manual evaluation]
# Evaluate a specific checkpoint
torchpack dist-run -np 8 python tools/test.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1_task_gca.yaml \
  /data/runs/phase4a_stage1_task_gca/epoch_6.pth \
  --eval bbox map

================================================================================
8. Key Facts About the Current Run
================================================================================

[Task-specific GCA architecture]
Core innovation:
Decoder Neck → raw BEV (512 channels)
  ├─ Detection GCA (independent) → detection-optimized features → detection head
  └─ Segmentation GCA (independent) → segmentation-optimized features → segmentation head

Advantages:
✅ Each task independently selects its own best features
✅ Avoids the compromise of a shared GCA
✅ Only 0.26M extra parameters
✅ Only +1.6ms compute overhead

Config parameters:
task_specific_gca:
  enabled: true
  in_channels: 512
  reduction: 4
  object_reduction: 4  # detection GCA
  map_reduction: 4     # segmentation GCA

[Current performance observations]
At 32% of Epoch 6:
- Divider Dice loss: 0.5140 → 0.5339 (+3.9%; lower is better, so not yet an improvement)
- Loss stable: 2.45 (fluctuating between 2.3 and 2.5)
- Detection IoU: 0.6179 (good)

[Expected final results] (Epoch 20)
- Overall mIoU: 52% → 61% (+9 points, about +17% relative)
- Divider: Dice loss 0.51 → 0.42 (about 18% lower)
- Detection mAP: 0.67 → 0.70 (about +4.5% relative)

[Evaluation optimizations]
- Evaluation frequency: every 10 epochs (75% less overhead)
- Validation samples: 3010 (a 50% reduction)
- Avoids the out-of-disk-space crash that occurred earlier at Epoch 5

================================================================================
9. Troubleshooting
================================================================================

[Problem 1: Training appears stuck]
Check:
ps aux | grep "[t]rain.py" | wc -l  # should be about 18
nvidia-smi  # check whether the GPUs are doing work

Fix:
If the processes exist but the GPUs are idle, the run may be deadlocked
killall -9 python  # caution: kills every python process you are allowed to signal
cd /workspace/bevfusion && bash START_PHASE4A_TASK_GCA_BACKGROUND.sh

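One way to tell a hang from a slow step is to check when the log was last written. A minimal sketch, assuming the trainer appends to the log at a regular interval; the `log_every` and `slack` values are guesses and should be set to match the real logging interval:

```python
import os
import time

def seconds_since_update(path: str) -> float:
    """Seconds since the file at `path` was last modified."""
    return time.time() - os.path.getmtime(path)

def looks_stalled(path: str, sec_per_iter: float = 2.66, log_every: int = 50,
                  slack: float = 3.0) -> bool:
    """True if the log has been silent for several logging intervals.

    `log_every` (iterations between log lines) and `slack` are assumptions.
    """
    return seconds_since_update(path) > sec_per_iter * log_every * slack

if __name__ == "__main__":
    log = "/data/runs/phase4a_stage1_task_gca/train_20251106_035913.log"
    if os.path.exists(log):
        print("stalled" if looks_stalled(log) else "still writing")
```

With the defaults the threshold is about 400 seconds of silence. A stale log together with idle GPUs in `nvidia-smi` points toward the deadlock case described above.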
[Problem 2: Out of disk space]
Check (both the workspace and the checkpoint output location):
df -h /workspace /data/runs/phase4a_stage1_task_gca/
du -sh /data/runs/phase4a_stage1_task_gca/

Fix:
Delete old checkpoints (keep the most recent few)
rm /workspace/bevfusion/runs/run-326653dc-2334d461/epoch_3.pth

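Before deleting anything by hand, it can help to list which checkpoints would go. The sketch below is a dry run only: it assumes the `epoch_N.pth` naming used in this document and prints candidates without removing them (the function name and `keep` default are illustrative).

```python
import re
from pathlib import Path

EPOCH_RE = re.compile(r"epoch_(\d+)\.pth$")

def prune_candidates(run_dir: str, keep: int = 3) -> list[Path]:
    """Return epoch_N.pth files older than the newest `keep`, oldest first."""
    found = []
    for path in Path(run_dir).glob("epoch_*.pth"):
        m = EPOCH_RE.search(path.name)
        if m:
            found.append((int(m.group(1)), path))
    found.sort()  # numeric order, so epoch_10 sorts after epoch_9
    if keep <= 0:
        return [path for _, path in found]
    return [path for _, path in found[:-keep]]

if __name__ == "__main__":
    for path in prune_candidates("/data/runs/phase4a_stage1_task_gca", keep=3):
        print(f"would delete {path} ({path.stat().st_size / 2**20:.0f} MiB)")
```

Sorting by the parsed epoch number rather than by filename avoids treating `epoch_10.pth` as older than `epoch_9.pth`.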
[Problem 3: Loss not decreasing, or abnormal]
Check:
tail -50 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "loss:"

What to look for:
- Loss should fluctuate between 2.3 and 2.5
- If it suddenly spikes above 10, the learning rate may need to be lowered

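The two rules of thumb above can be turned into a simple check for use in a monitoring script. A sketch using the thresholds from this section; the label names and the sample values in the loop are made up for illustration:

```python
def classify_loss(loss: float, low: float = 2.3, high: float = 2.5,
                  blowup: float = 10.0) -> str:
    """Label a loss value using the thresholds from the checklist above."""
    if loss != loss or loss >= blowup:  # NaN (never equal to itself) or a blow-up
        return "diverged"
    if low <= loss <= high:
        return "normal"
    return "watch"

for value in (2.4543, 2.71, 12.8, float("nan")):
    print(value, classify_loss(value))
```

Values outside the 2.3 to 2.5 band but below 10 are only flagged as "watch", since a single noisy iteration is not by itself a reason to change the learning rate.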
[Problem 4: GPU OOM]
The current config is already tuned:
- samples_per_gpu=1
- workers_per_gpu=0
- GPU memory use ~29GB/GPU

If OOM still occurs:
- Check whether other processes are using the GPUs
- Consider FP16 (not recommended; the current run is already stable)

[Problem 5: Viewing detailed errors]
tail -200 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log

================================================================================
10. Important File Paths
================================================================================

[Config file]
/workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1_task_gca.yaml

[Training script]
/workspace/bevfusion/START_PHASE4A_TASK_GCA_BACKGROUND.sh

[Log file]
/data/runs/phase4a_stage1_task_gca/train_20251106_035913.log

[Checkpoints]
/data/runs/phase4a_stage1_task_gca/epoch_*.pth
/workspace/bevfusion/runs/run-326653dc-2334d461/epoch_5.pth (starting point)

[Monitoring script]
/workspace/bevfusion/MONITOR_TASK_GCA.sh

[Status documents]
/workspace/bevfusion/TRAINING_STATUS_LIVE.md (most detailed)
/workspace/bevfusion/PROJECT_STATUS_SUMMARY.md (project overview)
/workspace/bevfusion/BEVFUSION_TRAINING_STATUS.md (history)

================================================================================
11. Training Timeline
================================================================================

[Completed]
✅ 2025-10-21: Phase 1-4 base training (19 epochs)
✅ 2025-10-29: Phase 5 Enhanced training (23 epochs)
✅ 2025-11-05: Phase 4A initial training (5 epochs)
✅ 2025-11-06 03:59 UTC: Phase 4A Task-GCA launched

[In progress]
🟢 2025-11-06 15:46: Epoch 6 training (32% complete)

[Projected] (Beijing time)
⏳ 2025-11-06 23:26: Epoch 6 complete
⏳ 2025-11-07 11:16: Epoch 7 complete
⏳ 2025-11-08 (approx., at ~11h50m per epoch): Epoch 10 complete (first evaluation)
⏳ 2025-11-13: Epoch 20 complete (final)

================================================================================
12. Quick Command Summary
================================================================================

# Follow training status
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "Epoch \["

# Watch the GPUs
watch -n 5 nvidia-smi

# Count training processes (the [t] keeps grep from counting itself)
ps aux | grep "[t]rain.py" | wc -l

# Show the latest loss
tail -30 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "loss:" | tail -3

# Follow Divider performance
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "divider/dice"

# Check checkpoints
ls -lht /data/runs/phase4a_stage1_task_gca/*.pth | head -3

# Disk space (workspace and checkpoint output location)
df -h /workspace /data/runs/phase4a_stage1_task_gca/

================================================================================
Generated: 2025-11-06 15:46
Next update: tomorrow, to review the full Epoch 6 results
================================================================================