================================================================================
BEVFusion Training Quick Reference
Last updated: 2025-11-06 15:46 (Beijing time)
================================================================================

Current training phase: Phase 4A Stage 1 - Task-specific GCA optimization
Config file: multitask_BEV2X_phase4a_stage1_task_gca.yaml
Status: 🟢 Running
Progress: Epoch 6, Iter 4950/15448 (32% complete)
Start time: 2025-11-06 03:59 UTC
Estimated completion: 2025-11-13 (in 7 days)

================================================================================
1. Current Training Status - Live
================================================================================

[Latest status - 2025-11-06 15:46]
✅ Training is running normally
- Epoch: [1]/20 (continued from epoch_5.pth, so this is effectively Epoch 6)
- Progress: 4950/15448 iterations (32.0%)
- Elapsed: 3 h 47 min
- Loss: 2.4543 (decreasing steadily)
- GPU: all 8 GPUs fully loaded
- GPU memory: ~29 GB per GPU
- Speed: 2.66 s/iteration

[Key performance metrics]
Divider Dice Loss: 0.5339 (3.9% higher than Epoch 5's 0.5140; lower Dice loss is better, so this is not yet an improvement)
Detection IoU: 0.6179 (good)
Overall Loss: 2.45 (down from 2.47 at launch)

[Time estimates]
- Epoch 6 complete: tonight at 23:26 (Beijing time)
- All training complete: 2025-11-13 (about 7 days)

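The time estimates above follow from the iteration count and the measured speed. Here is a minimal sketch of that arithmetic, assuming the 2.66 s/iteration speed stays constant and ignoring evaluation and checkpoint-saving overhead. The function name `remaining_seconds` is illustrative and not part of the codebase.

```python
from datetime import datetime, timedelta

def remaining_seconds(done_iters: int, total_iters: int, sec_per_iter: float) -> float:
    """Seconds left in the current epoch at a constant iteration speed."""
    return (total_iters - done_iters) * sec_per_iter

# Values taken from the status block above.
left = remaining_seconds(4950, 15448, 2.66)
now = datetime(2025, 11, 6, 15, 46)                 # Beijing time of this snapshot
eta = now + timedelta(seconds=left)

print(f"remaining:  {left / 3600:.2f} h")           # 7.76 h
print(f"epoch ETA:  {eta:%Y-%m-%d %H:%M}")          # 2025-11-06 23:31
print(f"full epoch: {15448 * 2.66 / 3600:.2f} h")   # 11.41 h
```

The result lands within a few minutes of the 23:26 estimate. The small gap is expected because the quoted speed is rounded.
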
================================================================================
1.1 Training History
================================================================================

Phase 5 Enhanced training: ✅ Complete
- Model: EnhancedBEVSegmentationHead
- Epochs: 23/23
- Completed: 2025-10-29 23:21
- Checkpoint: enhanced_from_epoch19/epoch_23.pth (516MB)
- Features: ASPP + Attention + Deep Supervision + GroupNorm

Phase 4A Stage 1 status: ✅ Running
- Script: START_PHASE4A_TASK_GCA_BACKGROUND.sh (launched)
- Config: multitask_BEV2X_phase4a_stage1_task_gca.yaml
- Starting point: epoch_5.pth (loaded)
- Current: Epoch 6 in progress (32% complete)
- Target: epoch_20.pth (15 more epochs needed)
- Log: /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log
- PID: 1234388 (18 related processes)

================================================================================
2. Task-specific GCA Architecture
================================================================================

[Core innovation]
Decoder Neck → raw BEV (512 channels, full information)
  ├─ Detection GCA (independent) → detection-optimized BEV → detection head
  └─ Segmentation GCA (independent) → segmentation-optimized BEV → segmentation head

[Config parameters]
task_specific_gca:
  enabled: true
  in_channels: 512
  reduction: 4
  object_reduction: 4  # channel reduction ratio of the detection GCA
  map_reduction: 4     # channel reduction ratio of the segmentation GCA

[Advantages]
✅ Each task selects channels according to its own needs
✅ Avoids the compromise forced by a single shared selection
✅ Added parameters: only 0.26M
✅ Added compute: ~1.6 ms (0.06%)

[Current effect]
Divider Dice loss: 3.9% higher than Epoch 5 (at 32% of Epoch 6); lower is better, so no improvement yet
Overall Loss stable: 2.45 (fluctuating between 2.3 and 2.5)

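As a rough cross-check of the 0.26M figure, the sketch below counts parameters under the assumption that each GCA is a squeeze-and-excitation style channel gate with two 1x1 convolutions (512 → 512/reduction → 512, with biases). This document does not describe the GCA internals, so that layout and the helper name `gate_params` are assumptions.

```python
def gate_params(in_channels: int, reduction: int, bias: bool = True) -> int:
    """Parameter count of a two-layer channel gate (assumed SE-style layout)."""
    hidden = in_channels // reduction
    weights = in_channels * hidden + hidden * in_channels   # squeeze conv + excite conv
    biases = (hidden + in_channels) if bias else 0
    return weights + biases

det = gate_params(512, 4)       # object_reduction: 4
seg = gate_params(512, 4)       # map_reduction: 4
total = det + seg

print(det, seg, total)          # 131712 131712 263424
print(f"{total / 1e6:.2f}M")    # 0.26M
```

Under this assumption the two independent gates together add about 0.26M parameters, consistent with the figure quoted above.
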
================================================================================
3. Quick Monitoring Commands - Current Run
================================================================================

[Live monitoring]
# Follow the training log (recommended)
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "Epoch \["

# Show only the loss and key metrics (--line-buffered keeps the first grep from holding back output in the pipe)
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep --line-buffered -E "loss/map/divider|loss:" | grep "Epoch \["

# Watch Divider performance
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "loss/map/divider/dice"

[GPU monitoring]
# Live GPU status
watch -n 5 nvidia-smi

# Query GPU utilization
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,temperature.gpu --format=csv

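To check the CSV query programmatically, its output can be parsed row by row. This is a minimal sketch, assuming the usual `--format=csv` layout: a header row, then one comma-separated row per GPU with units such as `%` and `MiB`. The sample text and the 90% threshold are illustrative, not measurements from this run.

```python
def parse_gpu_csv(text: str) -> list:
    """Parse CSV output of the nvidia-smi query above into one dict per GPU."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:                          # skip the header row
        idx, util, mem, temp = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(idx),
            "util_pct": int(util.split()[0]),       # "98 %" -> 98
            "mem_mib": int(mem.split()[0]),         # "29500 MiB" -> 29500
            "temp_c": int(temp),
        })
    return rows

# Illustrative sample, not real output from this run.
sample = """index, utilization.gpu [%], memory.used [MiB], temperature.gpu
0, 98 %, 29500 MiB, 65
1, 12 %, 29480 MiB, 58
"""
gpus = parse_gpu_csv(sample)
underused = [g["index"] for g in gpus if g["util_pct"] < 90]
print(underused)  # [1]
```
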
[Process monitoring]
# Count training processes
ps aux | grep train.py | wc -l  # should print 18 (the count may include the grep process itself)

# Show the main processes
ps aux | grep "[t]rain.py" | head -3

[Log analysis]
# Show recent progress
tail -100 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "Epoch \[" | tail -5

# Get the current iteration (search the last 100 lines, since the very last line may not be a progress line)
tail -100 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep -oP '\[\d+/\d+\]' | tail -1

# Check the log file size
ls -lh /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log

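The same progress information can be pulled out of a log line in Python. This is a minimal sketch, assuming progress lines of the form `Epoch [1][4950/15448] ... loss: 2.4543`. The exact log format is not shown in this document, so the regular expression and the sample line are assumptions.

```python
import re

# "(?<![\w/])loss:" matches the total loss but not sub-losses such as "loss/map/divider/dice:".
LINE_RE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\].*?(?<![\w/])loss: ([\d.]+)")

def parse_progress(line: str):
    """Return (epoch, iteration, total_iterations, loss), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    epoch, it, total = (int(m.group(i)) for i in (1, 2, 3))
    return epoch, it, total, float(m.group(4))

# Hypothetical line assembled from the numbers quoted in this document.
sample = "Epoch [1][4950/15448] loss/map/divider/dice: 0.5339, loss: 2.4543"
epoch, it, total, loss = parse_progress(sample)
print(f"epoch {epoch}: {it}/{total} ({100 * it / total:.1f}%), loss {loss}")
# epoch 1: 4950/15448 (32.0%), loss 2.4543
```
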
================================================================================
4. Training Launch Commands
================================================================================

[Currently running job]
Launch command:
cd /workspace/bevfusion && bash START_PHASE4A_TASK_GCA_BACKGROUND.sh

Command actually executed:
torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1_task_gca.yaml \
  --load_from /workspace/bevfusion/runs/run-326653dc-2334d461/epoch_5.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0

Status: ✅ Running (PID: 1234388)
Log: /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log

[To restart (do not run this casually)]
# 1. Stop the current training first (this kills every process named python, not only the training job)
killall -9 python

# 2. Wait for the GPUs to be released
sleep 5

# 3. Relaunch
bash START_PHASE4A_TASK_GCA_BACKGROUND.sh

================================================================================
5. Checkpoint Management
================================================================================

[Available checkpoints]
runs/run-326653dc-2334d461/
  ├─ epoch_3.pth (525MB)
  ├─ epoch_4.pth (525MB)
  └─ epoch_5.pth (525MB) ✅ starting point of the current run

runs/enhanced_from_epoch19/
  └─ epoch_23.pth (516MB) ✅ Phase 5 complete

/data/runs/phase4a_stage1_task_gca/
  └─ (checkpoints for Epoch 6 onward will be saved here, ~525MB each)

[Inspecting checkpoints]
# List the newest checkpoints
ls -lt /data/runs/phase4a_stage1_task_gca/*.pth 2>/dev/null | head -5

# Show checkpoint sizes
du -sh /data/runs/phase4a_stage1_task_gca/*.pth

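With each checkpoint at roughly 525MB, disk use grows by about that much per epoch, so older files eventually need pruning. This is a minimal sketch of choosing which checkpoints to delete while keeping the newest few, assuming the `epoch_N.pth` naming shown above. The helper `checkpoints_to_delete` and the choice of keeping 3 are illustrative.

```python
import re

def checkpoints_to_delete(filenames, keep=3):
    """Return epoch_N.pth names to remove, oldest first, keeping the `keep` highest epochs."""
    numbered = []
    for name in filenames:
        m = re.fullmatch(r"epoch_(\d+)\.pth", name)
        if m:                                   # ignore anything that is not epoch_N.pth
            numbered.append((int(m.group(1)), name))
    numbered.sort()                             # numeric order, so epoch_10 sorts after epoch_9
    if keep <= 0:
        return [name for _, name in numbered]
    return [name for _, name in numbered[:-keep]]

files = ["epoch_6.pth", "epoch_10.pth", "epoch_7.pth", "epoch_8.pth", "epoch_9.pth", "notes.txt"]
print(checkpoints_to_delete(files, keep=3))     # ['epoch_6.pth', 'epoch_7.pth']
print(f"{15 * 525 / 1024:.1f} GB for 15 checkpoints of ~525MB each")  # 7.7 GB
```

The function only returns names. Deleting the files is left to the operator, which keeps the step easy to review before anything is removed.
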
================================================================================
6. Performance Tracking
================================================================================

[Current performance (Epoch 6, 32% complete)]
Segmentation (Dice Loss, lower is better):
  Drivable Area: 0.11 ✅ excellent
  Ped Crossing:  0.23 🟡 moderate
  Walkway:       0.22 🟡 moderate
  Stop Line:     0.35 🔴 hard
  Carpark Area:  0.20 🟡 moderate
  Divider:       0.53 ⚠️ hardest (3.9% above Epoch 5)

Detection:
  Matched IoU: 0.6179 ✅ good

[Expected final performance (Epoch 20)]
Overall mIoU: 52% → 61% (+17% relative)
Divider: Dice 0.51 → 0.42 (18% lower, an improvement)
Detection mAP: 0.67 → 0.70 (+4.5% relative)

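The percentages in this section are relative changes. A small sketch of that arithmetic, using only numbers quoted in this document, makes the direction of each change explicit (for Dice loss lower is better, for mIoU and mAP higher is better):

```python
def rel_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

print(f"mIoU 52 -> 61:            {rel_change(52, 61):+.1f}%")          # +17.3%
print(f"Divider 0.51 -> 0.42:     {rel_change(0.51, 0.42):+.1f}%")      # -17.6%
print(f"mAP 0.67 -> 0.70:         {rel_change(0.67, 0.70):+.1f}%")      # +4.5%
print(f"Divider 0.5140 -> 0.5339: {rel_change(0.5140, 0.5339):+.1f}%")  # +3.9%
```
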
================================================================================
7. Evaluation and Testing
================================================================================

[Evaluation settings]
Evaluation settings for the current run:
- Frequency: every 10 epochs (75% less overhead)
- Validation samples: 3010 (load_interval=2, a 50% reduction)
- Next evaluation: Epoch 10

[Manual evaluation]
# Evaluate a specific checkpoint
torchpack dist-run -np 8 python tools/test.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1_task_gca.yaml \
  /data/runs/phase4a_stage1_task_gca/epoch_6.pth \
  --eval bbox map

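The evaluation settings reduce to simple arithmetic. This is a minimal sketch, assuming `load_interval=2` keeps every second validation sample (as a `[::2]` slice would) and that the full nuScenes validation split has 6019 samples. The 20-epoch horizon is taken from the `[1]/20` counter in section 1, and both helper names are illustrative.

```python
def eval_epochs(max_epochs: int, interval: int) -> list:
    """Epochs at which evaluation runs when it fires every `interval` epochs."""
    return [e for e in range(1, max_epochs + 1) if e % interval == 0]

def subsampled(n_samples: int, load_interval: int) -> int:
    """Number of samples kept when taking every `load_interval`-th one."""
    return len(range(0, n_samples, load_interval))

print(eval_epochs(20, 10))     # [10, 20]
print(subsampled(6019, 2))     # 3010
```
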
================================================================================
8. Key Facts About the Current Run
================================================================================

[Task-specific GCA architecture]
Core innovation:
Decoder Neck → raw BEV (512 channels)
  ├─ Detection GCA (independent) → detection-optimized features → detection head
  └─ Segmentation GCA (independent) → segmentation-optimized features → segmentation head

Advantages:
✅ Each task independently selects its best features
✅ Avoids the compromise of a shared GCA
✅ Only 0.26M additional parameters
✅ Only +1.6 ms compute overhead

Config parameters:
task_specific_gca:
  enabled: true
  in_channels: 512
  reduction: 4
  object_reduction: 4  # detection GCA
  map_reduction: 4     # segmentation GCA

[Current performance observations]
At 32% of Epoch 6:
- Divider Dice loss: 0.5140 → 0.5339 (+3.9%; lower is better, so no improvement yet)
- Loss stable: 2.45 (fluctuating between 2.3 and 2.5)
- Detection IoU: 0.6179 (good)

[Expected final results] (Epoch 20)
- Overall mIoU: 52% → 61% (+17% relative)
- Divider: Dice 0.51 → 0.42 (18% lower, an improvement)
- Detection mAP: 0.67 → 0.70 (+4.5% relative)

[Evaluation optimizations]
- Evaluation frequency: every 10 epochs (75% less overhead)
- Validation samples: 3010 (a 50% reduction)
- Avoids the out-of-disk-space crash that occurred at Epoch 5

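To illustrate the idea of two independent gates reading the same shared BEV feature, here is a toy pure-Python sketch. It assumes each GCA behaves like a squeeze-and-excitation channel gate (global average pool, two linear layers, sigmoid, per-channel scaling). The actual GCA implementation is not shown in this document, so the function, weights and tensor sizes below are all hypothetical.

```python
import math

def channel_gate(bev, w1, w2):
    """Reweight the channels of `bev` (a list of C channels, each a flat list of values)."""
    squeezed = [sum(ch) / len(ch) for ch in bev]                    # global average pool -> C values
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))   # C -> C/r, then ReLU
              for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))  # C/r -> C, sigmoid
             for row in w2]
    return [[g * v for v in ch] for g, ch in zip(gates, bev)]

# Toy shared BEV: 4 channels x 2 spatial positions (the real feature has 512 channels).
bev = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0], [1.0, 1.0]]

# Each task owns its weights; with 4 channels and reduction 4 the hidden size is 1.
det_w1, det_w2 = [[0.5, 0.0, 0.0, 0.0]], [[2.0], [0.0], [0.0], [0.0]]
seg_w1, seg_w2 = [[0.0, 0.0, 0.5, 0.0]], [[0.0], [0.0], [2.0], [0.0]]

det_bev = channel_gate(bev, det_w1, det_w2)   # would feed the detection head
seg_bev = channel_gate(bev, seg_w1, seg_w2)   # would feed the segmentation head

print(det_bev[0], seg_bev[0])   # channel 0 is emphasized for detection only
print(det_bev[2], seg_bev[2])   # channel 2 is emphasized for segmentation only
```

Because neither gate modifies `bev` in place, each head sees its own reweighted copy while the shared feature stays intact.
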
================================================================================
9. Troubleshooting
================================================================================

[Problem 1: Training hangs]
Check:
ps aux | grep train.py | wc -l  # should be 18
nvidia-smi                      # check whether the GPUs are busy

Fix:
If the processes exist but the GPUs are idle, the job may be deadlocked:
killall -9 python
bash START_PHASE4A_TASK_GCA_BACKGROUND.sh

[Problem 2: Out of disk space]
Check (both the workspace and the /data volume that holds this run's logs and checkpoints):
df -h /workspace /data
du -sh /data/runs/phase4a_stage1_task_gca/

Fix:
Delete old checkpoints (keep the most recent few):
rm /workspace/bevfusion/runs/run-326653dc-2334d461/epoch_3.pth

[Problem 3: Loss not decreasing or abnormal]
Check:
tail -50 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "loss:"

What to look for:
- The loss should fluctuate between 2.3 and 2.5
- If it suddenly spikes to 10+, the learning rate may need to be lowered

[Problem 4: GPU OOM]
The current config is already tuned:
- samples_per_gpu=1
- workers_per_gpu=0
- GPU memory use ~29 GB per GPU

If OOM still occurs:
- Check whether other processes are using the GPUs
- Consider FP16 (not recommended; the run is already stable)

[Problem 5: Viewing detailed errors]
tail -200 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log

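The loss check in Problem 3 can be written as a small rule. This is a minimal sketch using the thresholds stated above (normal band 2.3 to 2.5, spike at 10 or more). The function name and the three labels are illustrative.

```python
def loss_status(loss: float, low: float = 2.3, high: float = 2.5, spike: float = 10.0) -> str:
    """Classify a loss value against the expected band from this document."""
    if loss != loss or loss >= spike:      # NaN compares unequal to itself
        return "diverged"                  # consider lowering the learning rate
    if low <= loss <= high:
        return "normal"
    return "watch"                         # outside the band, but not a spike

for value in (2.4543, 2.8, 12.0, float("nan")):
    print(value, loss_status(value))
```
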
================================================================================
10. Important File Paths
================================================================================

[Config file]
/workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1_task_gca.yaml

[Training script]
/workspace/bevfusion/START_PHASE4A_TASK_GCA_BACKGROUND.sh

[Log file]
/data/runs/phase4a_stage1_task_gca/train_20251106_035913.log

[Checkpoints]
/data/runs/phase4a_stage1_task_gca/epoch_*.pth
/workspace/bevfusion/runs/run-326653dc-2334d461/epoch_5.pth (starting point)

[Monitoring script]
/workspace/bevfusion/MONITOR_TASK_GCA.sh

[Status documents]
/workspace/bevfusion/TRAINING_STATUS_LIVE.md (most detailed)
/workspace/bevfusion/PROJECT_STATUS_SUMMARY.md (project overview)
/workspace/bevfusion/BEVFUSION_TRAINING_STATUS.md (history)

================================================================================
11. Training Timeline
================================================================================

[Completed]
✅ 2025-10-21: Phase 1-4 base training (19 epochs)
✅ 2025-10-29: Phase 5 Enhanced training (23 epochs)
✅ 2025-11-05: Phase 4A initial training (5 epochs)
✅ 2025-11-06 03:59 UTC: Phase 4A Task-GCA launched

[In progress]
🟢 2025-11-06 15:46 (Beijing time): Epoch 6 training (32% complete)

[Projected, Beijing time]
⏳ 2025-11-06 23:26: Epoch 6 complete
⏳ 2025-11-07 11:16: Epoch 7 complete
⏳ 2025-11-10: Epoch 10 complete (first evaluation)
⏳ 2025-11-13: Epoch 20 complete (final)

================================================================================
12. Quick Command Summary
================================================================================

# Check training status
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "Epoch \["

# Check the GPUs
watch -n 5 nvidia-smi

# Check processes
ps aux | grep train.py | wc -l

# Show the latest loss
tail -30 /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "loss:" | tail -3

# Watch Divider performance
tail -f /data/runs/phase4a_stage1_task_gca/train_20251106_035913.log | grep "divider/dice"

# Check checkpoints
ls -lht /data/runs/phase4a_stage1_task_gca/*.pth | head -3

# Disk space (workspace and the /data volume used by this run)
df -h /workspace /data

================================================================================
Generated: 2025-11-06 15:46
Next update: tomorrow, to review the full Epoch 6 results
================================================================================