# BEVFusion Training Status Report

**Generated**: 2025-10-30 15:15
**Training task**: Phase 4A Stage 1 (600×600 resolution)

---

## 📊 Current Training Status

### Basic Information
```
Task: Phase 4A Stage 1
Config: 600×600 resolution, 4-layer decoder, Deep Supervision + Dice Loss
GPUs: 4 (GPU 0-3)
Starting checkpoint: epoch_23.pth
Target epochs: 10
```

### Training Progress
```
Current epoch: 1 / 10
Iteration progress: in progress
Total iterations: 30,895 iters/epoch
```

### Loss Trend
```
Starting loss: ~6.9
Current loss: ~4.5
Decrease: ~35%
Trend: steady, continuous decline ✅
Grad norm: normal (range 8-13)
```

### GPU Status
```
GPU 0: 30.4 GB / 32 GB (93%) @ 100% utilization
GPU 1: 30.9 GB / 32 GB (94%) @ 100% utilization
GPU 2: 30.7 GB / 32 GB (94%) @ 100% utilization
GPU 3: 30.7 GB / 32 GB (94%) @ 100% utilization

Temperature: 47-50°C (normal)
Memory usage: stable at 93-94%
Utilization: 100% (fully loaded)
```

### Training Efficiency
```
Time/iter: ~2.61 s
Data loading: ~0.44 s/iter
Compute time: ~2.17 s/iter

Estimated completion:
  Epoch 1: ~21 hours (from launch)
  10 epochs: ~8.5 days

Current ETA: 18 days 15 hours (expected to shrink toward the actual ~8.5 days as training speeds up)
```
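
As a rough cross-check of these estimates, the sketch below recomputes the epoch time from the reported iteration count and time per iteration, assuming the speed stays constant at ~2.61 s/iter. The function and constant names are illustrative only and do not come from the training code.

```python
# Figures taken from this report; a constant iteration speed is an assumption.
ITERS_PER_EPOCH = 30_895
SECONDS_PER_ITER = 2.61
TARGET_EPOCHS = 10


def estimate_hours(iters: int, seconds_per_iter: float) -> float:
    """Return wall-clock hours for `iters` iterations at a constant speed."""
    return iters * seconds_per_iter / 3600.0


epoch_hours = estimate_hours(ITERS_PER_EPOCH, SECONDS_PER_ITER)
total_days = epoch_hours * TARGET_EPOCHS / 24.0

print(f"one epoch: {epoch_hours:.1f} h")
print(f"{TARGET_EPOCHS} epochs: {total_days:.1f} days")
```

At a constant 2.61 s/iter this gives about 22.4 hours per epoch and about 9.3 days for 10 epochs. The reported ~21 hours and ~8.5 days are somewhat lower, which would be consistent with the iteration time dropping later in the run.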

### Metric Progress (iter 2600)
```
Segmentation loss:
  Drivable Area: dice=0.33 ↓, focal=0.043
  Ped Crossing: dice=0.63 ↓, focal=0.032
  Walkway: dice=0.54 ↓, focal=0.044
  Stop Line: dice=0.74 ↓, focal=0.041 ⭐
  Carpark: dice=0.63 ↓, focal=0.020
  Divider: dice=0.86 ↓, focal=0.029 ⭐

3D detection loss:
  Heatmap: 0.224
  Classification: 0.035
  BBox: 0.318
  Matched IoU: 0.622 ✅

Total loss: 4.583 (down from 6.9)
```
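
The dice values above are loss terms, so lower is better. The report does not show how the training code computes them; the sketch below uses the common soft Dice formulation for a single class, purely as an assumption, to illustrate how the number behaves. `soft_dice_loss` and its `eps` parameter are hypothetical names.

```python
def soft_dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss for one class over flattened BEV cells.

    probs:   predicted probabilities in [0, 1]
    targets: ground-truth labels, 0 or 1
    Returns 1 - Dice coefficient, so 0 means a perfect match.
    """
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    # eps keeps the ratio defined when both prediction and target are empty.
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


labels = [1, 0, 1, 0]
perfect = soft_dice_loss([1.0, 0.0, 1.0, 0.0], labels)  # prediction equals labels
partial = soft_dice_loss([0.5, 0.5, 0.5, 0.5], labels)  # uninformative prediction
missed = soft_dice_loss([0.0, 0.0, 0.0, 0.0], labels)   # nothing predicted

print(perfect, partial, missed)
```

Under this formulation a perfect prediction scores close to 0, an uninformative one about 0.5, and a completely missed class close to 1, which helps put the Stop Line (0.74) and Divider (0.86) values in context.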

---

## 🎯 Comparison with Phase 3

### Phase 3 (Epoch 23) Baseline
```
Config: 400×400, 2-layer decoder, no Deep Supervision
3D detection: NDS 0.6941, mAP 0.6446
BEV segmentation: mIoU 0.4130
  - Stop Line: 0.2657
  - Divider: 0.1903
```

### Stage 1 Current Training (Epoch 1 in Progress)
```
Config: 600×600, 4-layer decoder, Deep Supervision + Dice
Loss: steadily decreasing ✅
Expectation: significant improvement on Stop Line and Divider
```

---

## 📁 Important File Locations

### Checkpoint
```
Phase 3: runs/enhanced_from_epoch19/epoch_23.pth (516MB)
Stage 1: runs/run-326653dc-c038af2c/
  → latest.pth (saved when the epoch completes)
```

### Config File
```
Stage 1: configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/
  └─ multitask_BEV2X_phase4a_stage1.yaml
```

### Log Files
```
Training log: phase4a_stage1_20251030_130707.log
Log in training directory: runs/run-326653dc-c038af2c/20251030_130713.log
```

---

## ✅ Training Health

### Stability Checks
```
✅ Loss decreasing steadily (6.9 → 4.5)
✅ Grad norm normal (no NaN or explosion)
✅ GPU utilization 100% (fully loaded)
✅ Memory usage stable (93-94%)
✅ No errors or warnings
✅ Data loading normal
```

### Metric Health
```
✅ 3D detection IoU maintained: 0.622 (vs. 0.633 in Phase 3)
✅ Segmentation dice loss decreasing: all classes are improving
✅ Clear downward trend in Stop Line and Divider loss
```

---

## 🔄 GPU Resource Allocation

### Current Usage
```
GPU 0-3: Stage 1 training (93-94% memory, 100% utilization)
GPU 4-7: completely idle (available for the epoch 23 evaluation)
```

### Guarantees That Training Is Unaffected
```
✓ Physical GPU isolation (0-3 vs 4-7)
✓ Isolation enforced via CUDA_VISIBLE_DEVICES
✓ Separate process space
✓ Separate GPU memory allocation
```
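
A minimal sketch of how the `CUDA_VISIBLE_DEVICES` isolation could be applied when launching a job on the idle GPUs. The report does not give the actual evaluation command, so the child process here is only a stand-in that prints the variable it receives; `isolated_env` is a hypothetical helper.

```python
import os
import subprocess
import sys


def isolated_env(gpu_ids):
    """Copy the current environment, exposing only `gpu_ids` to CUDA."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


eval_env = isolated_env([4, 5, 6, 7])

# Stand-in for the real evaluation command (not given in this report):
# the child process just reports which GPUs it is allowed to see.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=eval_env,
    capture_output=True,
    text=True,
    check=True,
)
visible = result.stdout.strip()
print(visible)
```

Because the variable is set only in the child's environment, the running training processes keep their own GPU 0-3 assignment. Inside the child, CUDA renumbers the visible devices, so physical GPUs 4-7 appear as devices 0-3.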

---

## ⏭️ Next Steps

### Training (Ongoing)
- 🔄 Continue training Epoch 1
- ⏸️ Epoch 1 completes in ~21 hours
- ⏸️ Validate the performance improvement

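### Evaluation (New Docker)
- 📋 Prepare a guide for the new Docker environment
- 📋 Configure the deployment test environment
- 📋 Run the epoch_23 evaluation

---

**Training status**: ✅ Excellent! Loss is decreasing steadily, GPUs are fully loaded, no anomalies
**Available resources**: GPU 4-7 completely idle
**Next step**: Prepare the new Docker evaluation guide for you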