# BEVFusion Training Status Report

**Generated**: 2025-10-30 15:15
**Training task**: Phase 4A Stage 1 (600×600 resolution)

---

## 📊 Current Training Status

### Basic information
```
Task: Phase 4A Stage 1
Config: 600×600 resolution, 4-layer decoder, deep supervision + Dice loss
GPUs: 4 (GPU 0-3)
Resumed from checkpoint: epoch_23.pth
Target epochs: 10
```

### Training progress
```
Current epoch: 1 / 10
Iteration progress: in progress
Iterations per epoch: 30,895
```

### Loss trend
```
Starting loss: ~6.9
Current loss: ~4.5
Decrease: ~35%
Trend: steady, continuous decline ✅
Grad norm: normal (range 8-13)
```
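
The ~35% figure follows from the two rounded loss readings above. A minimal sketch of that arithmetic, assuming the values 6.9 and 4.5 as reported (the helper name `relative_decrease` is illustrative and not part of the training code):

```python
def relative_decrease(start: float, current: float) -> float:
    """Return the fractional drop from `start` to `current`."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (start - current) / start


# Rounded loss values taken from the report.
drop = relative_decrease(6.9, 4.5)
print(f"Loss decrease: {drop:.1%}")  # prints "Loss decrease: 34.8%"
```

The same helper can be applied to any pair of logged loss values to track the decline between two checkpoints.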

### GPU status
```
GPU 0: 30.4 GB / 32 GB (93%) @ 100% utilization
GPU 1: 30.9 GB / 32 GB (94%) @ 100% utilization
GPU 2: 30.7 GB / 32 GB (94%) @ 100% utilization
GPU 3: 30.7 GB / 32 GB (94%) @ 100% utilization

Temperature: 47-50°C (normal)
Memory usage: stable at 93-94%
Utilization: 100% (fully loaded)
```
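
Readings of this kind can be collected with the CSV query mode of `nvidia-smi`. The following is a minimal sketch, not the script used to produce this report: the names `GpuReading`, `parse_readings`, and `query_gpus` are hypothetical, and the sample values are made up for illustration rather than taken from the run.

```python
import subprocess
from typing import List, NamedTuple

# Fields requested from nvidia-smi; with "nounits", memory is reported in MiB.
QUERY = "index,memory.used,memory.total,utilization.gpu,temperature.gpu"


class GpuReading(NamedTuple):
    index: int
    mem_used_mib: int
    mem_total_mib: int
    util_pct: int
    temp_c: int

    @property
    def mem_pct(self) -> float:
        """Memory usage as a percentage of total memory."""
        return 100.0 * self.mem_used_mib / self.mem_total_mib


def parse_readings(csv_text: str) -> List[GpuReading]:
    """Parse 'csv,noheader,nounits' output, one GPU per line."""
    readings = []
    for line in csv_text.strip().splitlines():
        fields = [int(f.strip()) for f in line.split(",")]
        readings.append(GpuReading(*fields))
    return readings


def query_gpus() -> List[GpuReading]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_readings(out)


# Illustrative sample only (made-up values, not from the actual run).
sample = "0, 30000, 32000, 100, 48\n1, 30400, 32000, 100, 50\n"
for r in parse_readings(sample):
    print(f"GPU {r.index}: {r.mem_pct:.0f}% memory, {r.util_pct}% util, {r.temp_c}°C")
```

Calling `query_gpus()` on the training host would return live readings in the same structure.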

### Training efficiency
```
Time/iter: ~2.61 s
Data loading: ~0.44 s/iter
Compute: ~2.17 s/iter

Estimated completion:
  Epoch 1: ~21 hours (from launch)
  10 epochs: ~8.5 days

Current ETA: 18 days 15 hours (expected to shrink toward the actual ~8.5 days as training speeds up)
```
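
The per-iteration time is the sum of data loading and compute (0.44 s + 2.17 s = 2.61 s), and the epoch estimate is that time multiplied by the iterations per epoch. A rough sketch of the calculation, assuming a constant 2.61 s/iter and ignoring validation and checkpointing time (the function name `epoch_hours` is illustrative). Under those assumptions the result is about 22.4 hours per epoch, slightly above the rounded ~21 hours quoted above, so the quoted estimates imply a somewhat faster steady-state iteration rate.

```python
ITERS_PER_EPOCH = 30_895   # from the training progress section
DATA_TIME_S = 0.44         # data loading per iteration
COMPUTE_TIME_S = 2.17      # compute per iteration
EPOCHS = 10


def epoch_hours(sec_per_iter: float, iters: int = ITERS_PER_EPOCH) -> float:
    """Hours per epoch at a constant per-iteration time."""
    return sec_per_iter * iters / 3600.0


sec_per_iter = DATA_TIME_S + COMPUTE_TIME_S
per_epoch = epoch_hours(sec_per_iter)
total_days = per_epoch * EPOCHS / 24.0
print(f"{sec_per_iter:.2f} s/iter -> {per_epoch:.1f} h/epoch, "
      f"{total_days:.1f} days for {EPOCHS} epochs")
# prints "2.61 s/iter -> 22.4 h/epoch, 9.3 days for 10 epochs"
```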

### Metric progress (iter 2600)
```
Segmentation loss:
  Drivable Area: dice=0.33 ↓, focal=0.043
  Ped Crossing:  dice=0.63 ↓, focal=0.032
  Walkway:       dice=0.54 ↓, focal=0.044
  Stop Line:     dice=0.74 ↓, focal=0.041 ⭐
  Carpark:       dice=0.63 ↓, focal=0.020
  Divider:       dice=0.86 ↓, focal=0.029 ⭐

3D detection loss:
  Heatmap: 0.224
  Classification: 0.035
  BBox: 0.318
  Matched IoU: 0.622 ✅

Total loss: 4.583 (down from 6.9)
```

---

## 🎯 Comparison with Phase 3

### Phase 3 (Epoch 23) Baseline
```
Config: 400×400, 2-layer decoder, no deep supervision
3D detection: NDS 0.6941, mAP 0.6446
BEV segmentation: mIoU 0.4130
  - Stop Line: 0.2657
  - Divider: 0.1903
```

### Stage 1 current training (Epoch 1 in progress)
```
Config: 600×600, 4-layer decoder, deep supervision + Dice
Loss: decreasing steadily ✅
Expected: significant improvement on Stop Line and Divider
```

---

## 📁 Key File Locations

### Checkpoint
```
Phase 3: runs/enhanced_from_epoch19/epoch_23.pth (516MB)
Stage 1: runs/run-326653dc-c038af2c/
         → latest.pth (saved when the epoch completes)
```

### Config file
```
Stage 1: configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/
         └─ multitask_BEV2X_phase4a_stage1.yaml
```

### Log files
```
Training log: phase4a_stage1_20251030_130707.log
Training directory: runs/run-326653dc-c038af2c/20251030_130713.log
```

---

## ✅ Training Health

### Stability checks
```
✅ Loss decreasing steadily (6.9 → 4.5)
✅ Grad norm normal (no NaN or explosion)
✅ GPU utilization 100% (fully loaded)
✅ Memory usage stable (93-94%)
✅ No errors or warnings
✅ Data loading normal
```
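
The grad-norm check above can be automated over the values parsed from the training log. A minimal sketch, assuming the norms are already available as floats; the function name `grad_norm_ok` and the explosion threshold of 50 are assumptions chosen for illustration (the report only states that the observed range is 8-13).

```python
import math
from typing import Iterable


def grad_norm_ok(norms: Iterable[float], max_norm: float = 50.0) -> bool:
    """True if every grad norm is finite (no NaN/inf) and below max_norm."""
    return all(math.isfinite(n) and n < max_norm for n in norms)


print(grad_norm_ok([8.2, 10.5, 12.9]))     # True: finite and within range
print(grad_norm_ok([9.1, float("nan")]))   # False: NaN detected
print(grad_norm_ok([9.1, 480.0]))          # False: exceeds the threshold
```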

### Metric health
```
✅ 3D detection IoU maintained: 0.622 (vs 0.633 in Phase 3)
✅ Segmentation dice loss decreasing: all classes improving
✅ Stop Line and Divider losses show a clear downward trend
```

---

## 🔄 GPU Resource Allocation

### Current usage
```
GPU 0-3: Stage 1 training (93-94% memory, 100% utilization)
GPU 4-7: fully idle (available for epoch 23 evaluation)
```

### Why training is not affected
```
✓ Physical GPU isolation (0-3 vs 4-7)
✓ Isolation enforced via CUDA_VISIBLE_DEVICES
✓ Separate process space
✓ Separate GPU memory allocation
```
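
The `CUDA_VISIBLE_DEVICES` isolation listed above amounts to launching the evaluation process with only GPUs 4-7 exposed. A minimal sketch of that idea, assuming the evaluation is started as a subprocess; the helper names `env_for_gpus` and `launch` are hypothetical, and the actual evaluation command is not specified in this report.

```python
import os
import subprocess
from typing import Dict, List, Sequence


def env_for_gpus(gpu_ids: Sequence[int]) -> Dict[str, str]:
    """Copy the current environment and restrict the visible CUDA devices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launch(cmd: List[str], gpu_ids: Sequence[int]) -> subprocess.Popen:
    """Start `cmd` as a separate process that can only see `gpu_ids`."""
    return subprocess.Popen(cmd, env=env_for_gpus(gpu_ids))


eval_env = env_for_gpus([4, 5, 6, 7])
print(eval_env["CUDA_VISIBLE_DEVICES"])  # prints "4,5,6,7"
```

Inside a process launched this way, the four visible devices are renumbered 0-3, so the evaluation code does not need to know the physical GPU indices.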

---

## ⏭️ Next Steps

### Training (ongoing)
- 🔄 Continue training Epoch 1
- ⏸️ Epoch 1 completes in ~21 hours
- ⏸️ Verify the performance improvement

### Evaluation (new Docker)
- 📋 Prepare a guide for the new Docker environment
- 📋 Configure the deployment test environment
- 📋 Run the epoch_23 evaluation

---

**Training status**: ✅ Excellent! Loss decreasing steadily, GPUs fully loaded, no anomalies
**Available resources**: GPU 4-7 fully idle
**Next step**: Prepare the new Docker evaluation guide for you