# Updated Training and Evaluation Plan

**Last updated**: 2025-10-30 15:04
**Strategy**: Training first; parallel evaluation after Epoch 1

---

## 🎯 Parallel Task Strategy

### Current Task (GPU 0-3)
```
✅ Phase 4A Stage 1 training
- Progress: Epoch 1, iter 2600+/30895
- Loss: 6.9 → 4.5 (decreasing well)
- GPU: 4 GPUs @ 100% utilization
- Status: running stably
```

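The time estimates used in this plan (~21 hours to the end of Epoch 1, ~4.5 days to Epoch 5, ~9 days to the end of Stage 1) can be sanity-checked with simple arithmetic. The sketch below is only an illustration: `remaining_hours` is a hypothetical helper, and the seconds-per-iteration value is an assumption that should be replaced with the figure reported in the training log.

```python
def remaining_hours(done_iters: int, iters_per_epoch: int,
                    target_epoch: int, sec_per_iter: float) -> float:
    """Hours left until `target_epoch` finishes.

    `done_iters` counts iterations completed since the start of epoch 1.
    """
    total_iters = iters_per_epoch * target_epoch
    remaining = max(total_iters - done_iters, 0)
    return remaining * sec_per_iter / 3600.0


if __name__ == "__main__":
    # Assumed throughput; read the real value from the training log.
    SEC_PER_ITER = 2.6
    for epoch in (1, 5, 10):
        hours = remaining_hours(2600, 30895, epoch, SEC_PER_ITER)
        print(f"epoch {epoch}: ~{hours:.1f} h (~{hours / 24:.1f} days)")
```

Re-running this with the measured iteration time after each checkpoint keeps the evaluation schedule aligned with actual training speed.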
### Evaluation Plan (Optimized)

#### Step 1: Baseline from the training log (immediately)
```
✅ Extract validation results from the Phase 3 training log
- File: runs/enhanced_from_epoch19/20251021_202200.log
- Contains: validation performance for Epochs 20-23
- Purpose: establish a baseline quickly

Advantages:
✓ Available immediately
✓ No extra cost
✓ Does not affect the current training run
```

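Extracting validation results from a log can be sketched as below. The log line format is an assumption (an `Epoch(val) [N]` marker followed by `name: value` pairs); the actual format of the Phase 3 log may differ, in which case the two regular expressions need adjusting. `parse_val_metrics` is a hypothetical helper name.

```python
import re
from typing import Dict, Iterable

# Assumed format: "... Epoch(val) [20][753]  object/nds: 0.5, object/map: 0.25"
VAL_LINE = re.compile(r"Epoch\(val\) \[(\d+)\]")
KEY_VALUE = re.compile(
    r"([\w/@.\-]+):\s+([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_val_metrics(lines: Iterable[str]) -> Dict[int, Dict[str, float]]:
    """Collect validation metrics per epoch from log lines."""
    results: Dict[int, Dict[str, float]] = {}
    for line in lines:
        match = VAL_LINE.search(line)
        if match is None:
            continue  # not a validation line
        epoch = int(match.group(1))
        # Only parse the text after the marker, so timestamps are ignored.
        tail = line[match.end():]
        metrics = {key: float(value) for key, value in KEY_VALUE.findall(tail)}
        results.setdefault(epoch, {}).update(metrics)
    return results


# Usage (path taken from the plan above):
#   with open("runs/enhanced_from_epoch19/20251021_202200.log") as f:
#       baseline = parse_val_metrics(f)
```

The result maps each epoch number to its metric dictionary, so the Epoch 20-23 entries can be written straight into a baseline report.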
#### Step 2: Full evaluation after Epoch 1 completes (in ~21 hours)
```
⏸️ Evaluate two checkpoints at the same time:
- epoch_23.pth (Phase 3, on GPU 0-3; training must be paused to free these GPUs)
- epoch_1.pth (Stage 1, on GPU 4-7)

Advantages:
✓ Parallel evaluation, fast comparison
✓ Uses all 8 GPUs
✓ Directly quantifies the Stage 1 improvement

Estimated time: 2-3 hours
```

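One way to run the two evaluations side by side is to launch one process per checkpoint and pin each to its GPU group through `CUDA_VISIBLE_DEVICES`. This is a minimal sketch: `build_job` and `run_parallel` are hypothetical helpers, and the evaluation command is a placeholder, since the plan does not name the evaluation script.

```python
import os
import subprocess
from typing import Dict, List, Sequence, Tuple

Job = Tuple[List[str], Dict[str, str]]


def build_job(checkpoint: str, gpus: Sequence[int],
              eval_cmd: Sequence[str]) -> Job:
    """Return (command, environment) for evaluating one checkpoint."""
    env = dict(os.environ)
    # Restrict the process to the given GPUs.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return [*eval_cmd, checkpoint], env


def run_parallel(jobs: Sequence[Job]) -> List[int]:
    """Start all jobs at once and return their exit codes in order."""
    procs = [subprocess.Popen(cmd, env=env) for cmd, env in jobs]
    return [proc.wait() for proc in procs]


# Usage (placeholder command; substitute the project's evaluation script):
#   cmd = ["python", "path/to/eval_script.py"]
#   codes = run_parallel([
#       build_job("epoch_23.pth", [0, 1, 2, 3], cmd),
#       build_job("epoch_1.pth", [4, 5, 6, 7], cmd),
#   ])
```

A non-zero exit code from either job indicates that evaluation failed and the comparison should not be trusted.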
#### Step 3: Epoch 5 evaluation (in ~4.5 days)
```
⏸️ Evaluate epoch_5.pth
Compare:
- vs Epoch 23: overall improvement
- vs Epoch 1: training progress

Decision: whether results meet expectations and whether adjustments are needed
```

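The checkpoint comparisons can be reduced to per-metric differences. The helper below is a hypothetical sketch that works on plain metric dictionaries (for example, ones parsed from evaluation logs); the metric names in the usage comment are illustrative only.

```python
from typing import Dict


def metric_deltas(baseline: Dict[str, float],
                  candidate: Dict[str, float]) -> Dict[str, float]:
    """Return candidate - baseline for every metric present in both.

    A positive delta is an improvement for score metrics (NDS, mAP, IoU)
    and a regression for error metrics (mATE, mASE, mAOE, mAVE, mAAE).
    """
    shared = baseline.keys() & candidate.keys()
    return {name: candidate[name] - baseline[name] for name in sorted(shared)}


# Usage:
#   metric_deltas(epoch_23_metrics, epoch_5_metrics)  # overall improvement
#   metric_deltas(epoch_1_metrics, epoch_5_metrics)   # training progress
```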
#### Step 4: Final Stage 1 evaluation (in ~9 days)
```
⏸️ Evaluate epoch_10.pth (or the best checkpoint)
Full performance analysis:
- Comparison across all metrics
- Failure-case analysis
- Attribution of improvements

Decision: implementation plan for Stage 2
```

---

## 📋 Evaluation Metrics Checklist

### 3D Detection Metrics
- [ ] NDS (nuScenes Detection Score)
- [ ] mAP (mean Average Precision)
- [ ] Per-class AP (Car, Pedestrian, Truck, etc.; 10 classes)
- [ ] mATE, mASE, mAOE, mAVE, mAAE
- [ ] Detailed per-class analysis

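For reference, NDS combines mAP with the five true-positive error metrics listed above: NDS = (1/10) · [5 · mAP + Σ (1 − min(1, mTP))], summed over mATE, mASE, mAOE, mAVE and mAAE. The function below is a small sketch of that formula; the name `nds` and the dictionary layout are choices made for illustration, and the official nuScenes devkit remains the reference implementation.

```python
from typing import Dict

TP_METRICS = ("mATE", "mASE", "mAOE", "mAVE", "mAAE")


def nds(m_ap: float, tp_errors: Dict[str, float]) -> float:
    """nuScenes Detection Score from mAP and the five mean TP errors."""
    missing = [name for name in TP_METRICS if name not in tp_errors]
    if missing:
        raise ValueError(f"missing TP metrics: {missing}")
    # Each error is clipped at 1 and converted into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, tp_errors[name]) for name in TP_METRICS]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0
```

Because mAP carries half of the total weight, a change in mAP moves NDS five times as much as the same change in any single error metric.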
### BEV Segmentation Metrics
- [ ] Overall mIoU
- [ ] Per-class IoU (6 classes)
  - [ ] Drivable Area
  - [ ] Pedestrian Crossing
  - [ ] Walkway
  - [ ] Stop Line ⭐ priority
  - [ ] Carpark Area
  - [ ] Divider ⭐ priority
- [ ] Per-scene performance distribution
- [ ] Identification of hard cases

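As a reminder of what the segmentation numbers mean, the sketch below computes IoU for one class from flattened binary masks and averages per-class values into mIoU. It shows only the basic definition; the project's evaluator may differ (for example in thresholding or in how counts are accumulated over the dataset), and both function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over union of two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # With an empty union the class is absent; report NaN, not a score.
    return inter / union if union else float("nan")


def mean_iou(per_class: Dict[str, float]) -> float:
    """Average per-class IoU, skipping classes with undefined (NaN) IoU."""
    valid = [v for v in per_class.values() if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")
```

Tracking the per-class values alongside the mean makes it easier to see whether the priority classes (Stop Line, Divider) are moving.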
---

## 🚀 Updated Action Plan

### Now (immediately)
```
1. ✅ Keep monitoring Stage 1 training
2. 🔄 Extract validation results from the Phase 3 log
3. ✅ Generate the Epoch 23 baseline report
```

### After Epoch 1 completes (~21 hours)
```
1. ⏸️ Pause training (or keep it running and use the other GPUs)
2. ⏸️ Evaluate epoch_23 and epoch_1 at the same time
3. ⏸️ Compare the performance difference
4. ⏸️ Decide:
   - whether to change the GPU count (4→6)
   - whether to change workers (0→1)
   - whether to change other parameters
5. ⏸️ Continue Stage 1 training
```

### Epoch 5 (~4.5 days)
```
1. ⏸️ Evaluate epoch_5
2. ⏸️ Decide whether to finish early or continue
```

### Stage 1 complete (~9 days)
```
1. ⏸️ Final evaluation
2. ⏸️ Full performance report
3. ⏸️ Plan Stage 2
```

---

## 📊 GPU Usage Optimization Recommendations

### Current usage (during training)
```
GPU 0-3: training (100% utilized)
GPU 4-7: idle

Overall utilization: 50%
```

### Optimization options (after Epoch 1)

**Option 1: Evaluate in parallel during validation gaps** (recommended)
```
Training epoch finishes → validation starts
During validation (GPU 0-3 lightly loaded):
  → run evaluation on GPU 4-7
  → or run evaluation on all of GPU 0-7 (after validation finishes)

Overall utilization: 80-100%
```

**Option 2: Continuous parallelism**
```
Training: GPU 0-3 (continuous)
Evaluation: GPU 4-7 (periodic, e.g. evaluate an intermediate checkpoint every 2 days)

Overall utilization: up to 100% (while an evaluation is running)
```

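For periodic evaluation of intermediate checkpoints, a scheduler first needs to know which epochs have been saved but not yet evaluated. The sketch below assumes checkpoints follow the `epoch_N.pth` naming used in this plan; `pending_epochs` is a hypothetical helper that works on a list of file names.

```python
import re
from typing import Iterable, List

EPOCH_FILE = re.compile(r"epoch_(\d+)\.pth")


def pending_epochs(filenames: Iterable[str],
                   evaluated: Iterable[int]) -> List[int]:
    """Return saved epochs that have not been evaluated yet, in order."""
    done = set(evaluated)
    saved = set()
    for name in filenames:
        match = EPOCH_FILE.fullmatch(name)
        if match:  # ignores files such as latest.pth
            saved.add(int(match.group(1)))
    return sorted(saved - done)


# Usage (the checkpoint directory is whatever the training run writes to):
#   import os
#   todo = pending_epochs(os.listdir(checkpoint_dir), evaluated=[1])
```

Running this on a timer and evaluating the returned epochs on GPU 4-7 leaves the training process on GPU 0-3 untouched.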
---

## 📝 Monitoring Commands

### Training monitoring
```bash
# Run the Stage 1 monitoring script
bash monitor_phase4a_stage1.sh
# Follow the training log, showing only the per-iteration "Epoch [" lines
tail -f phase4a_stage1_*.log | grep "Epoch \["
```

### Evaluation monitoring (after Epoch 1)
```bash
# Follow the evaluation logs of all runs
tail -f eval_results/*/eval.log
# Refresh the top of the nvidia-smi output every 5 seconds
watch -n 5 'nvidia-smi | head -15'
```

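To turn GPU load into a single overall figure, the utilization of each GPU can be read through `nvidia-smi`'s CSV query mode and averaged. This is a sketch under the assumption that `nvidia-smi` is on the `PATH`; the helper names are hypothetical.

```python
import subprocess
from typing import Dict

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(text: str) -> Dict[int, int]:
    """Parse lines of the form "<index>, <percent>" into {index: percent}."""
    result: Dict[int, int] = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2:
            continue
        try:
            result[int(parts[0])] = int(parts[1])
        except ValueError:
            continue  # skip unparsable values such as "[N/A]"
    return result


def mean_utilization(per_gpu: Dict[int, int]) -> float:
    """Average utilization in percent across all reported GPUs."""
    return sum(per_gpu.values()) / len(per_gpu) if per_gpu else 0.0


def query_utilization() -> Dict[int, int]:
    """Run nvidia-smi once and return the per-GPU utilization."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(out.stdout)
```

With four GPUs at 100% and four idle, `mean_utilization` gives the 50% overall figure quoted in this plan.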
### Parallel monitoring
```bash
bash monitor_all_tasks.sh
```

---

## ✅ Summary

### Added to the plan
- ✅ Evaluation scripts created
- ✅ Parallel monitoring prepared
- ✅ Task plan updated

### Execution strategy
1. **Now**: extract Phase 3 performance from the log
2. **After Epoch 1**: full parallel evaluation
3. **Periodically**: evaluate at Epochs 5 and 10
4. **Finally**: comprehensive performance comparison report

### Expected benefits
- 📊 Quantify the contribution of each improvement precisely
- 📊 Guide the direction of further optimization
- 📊 Make full use of GPU resources
- ⏱️ No increase in total training time

---

**Status**:
- Training: 🚀 running normally
- Evaluation: 📋 planned; runs after Epoch 1
- Documentation: ✅ updated