# FP16 + Batch=2 Optimization Configuration Summary

**Configured**: 2025-11-01 22:20 UTC
**Status**: ✅ Ready to launch

---

## 🎯 Configuration Overview

### Core Optimizations

```
✓ FP16 mixed-precision training
✓ Batch size: 1 → 2 per GPU
✓ Total batch: 8 → 16
✓ Learning rate: 2e-5 → 4e-5
```
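The learning-rate change follows the linear scaling rule: doubling the global batch doubles the LR. A minimal sketch (`scale_lr` is a hypothetical helper, not part of the training code):

```python
def scale_lr(base_lr, base_total_batch, new_total_batch):
    """Linear LR scaling rule: the learning rate grows in
    proportion to the total (global) batch size."""
    return base_lr * new_total_batch / base_total_batch

# Global batch 8 -> 16 doubles the LR: 2e-5 -> 4e-5
new_lr = scale_lr(2e-5, 8, 16)
```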

### Expected Performance Gains

| Metric | FP32 baseline | FP16 + Batch 2 | Improvement |
|------|----------|------------|------|
| **Training speed** | 2.65 s/iter | ~1.5 s/iter | **+43%** ⚡ |
| **GPU memory** | 29 GB/GPU | ~24-26 GB/GPU | 5 GB saved |
| **Epoch time** | 11 h | ~6 h | **-45%** |
| **10 epochs** | 9 days | **~5 days** | **4 days saved** ⭐ |

---

## 📋 Configuration Details

### Config File

```yaml
# configs/.../multitask_BEV2X_phase4a_stage1_fp16.yaml

work_dir: /data/runs/phase4a_stage1_fp16_batch2

# FP16 mixed precision
fp16:
  loss_scale: dynamic

# Batch optimization
data:
  samples_per_gpu: 2    # ⭐ raised to 2
  workers_per_gpu: 0

# Learning-rate adjustment
optimizer:
  type: AdamW
  lr: 4.0e-5            # ⭐ 2× batch → 2× lr
  weight_decay: 0.01
```
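`loss_scale: dynamic` means the FP16 loss scale adapts during training instead of staying fixed: it backs off when a gradient overflow is detected and grows again after a run of clean steps. A minimal pure-Python sketch of that policy (the class and its constants are illustrative, not mmcv's actual implementation):

```python
class DynamicLossScaler:
    """Sketch of dynamic loss scaling: back off on overflow, grow
    again after a run of overflow-free steps. Constants are
    illustrative defaults, not any library's exact values."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow):
        """Call once per optimizer step with the overflow flag."""
        if found_overflow:
            # Inf/NaN gradients: skip the step and shrink the scale.
            self.scale *= self.backoff_factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                # A clean run: safe to try a larger scale again.
                self.scale *= self.growth_factor
                self._good_steps = 0
```

A fixed scale that is too high overflows forever; one that is too low underflows small gradients, which is why the dynamic policy is the usual choice for FP16 runs.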

### Training Command

```bash
torchpack dist-run -np 8 python tools/train.py \
    configs/.../multitask_BEV2X_phase4a_stage1_fp16.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --cfg-options work_dir=/data/runs/phase4a_stage1_fp16_batch2
```

---

## 🚀 Launch

### Option 1: One-click launch (recommended) ⭐

```bash
cd /workspace/bevfusion
bash CLEANUP_AND_START_FP16_BATCH2.sh
```

**What the script does**:
- ✅ Cleans up zombie processes
- ✅ Checks GPU status
- ✅ Starts FP16 + Batch 2 training
- ✅ Prints the monitoring commands

### Option 2: Manual steps

```bash
# Step 1: clean up zombie processes
pkill -9 -f "train.py"
sleep 5

# Step 2: start training
cd /workspace/bevfusion
bash RESTART_PHASE4A_STAGE1_FP16.sh
```

---

## 📊 Expected Timeline

### At ~1.5 s/iter

```
Epoch 1 done:  ~6 h    (11/2 04:00 UTC)
Epoch 2 done:  ~12 h   (11/2 10:00 UTC)
Epoch 5 done:  ~30 h   (11/3 04:00 UTC)
Epoch 10 done: ~5 days (11/6 22:00 UTC) ⭐
```

**Savings vs FP32**: 4 days
**Savings vs FP16 single-batch**: 1.5 days

---

## ⚠️ Monitoring Checklist

### Check within 5 minutes of launch

**1. Confirm FP16 is active**

```bash
# Check per-GPU memory (expect 24-26 GB)
nvidia-smi

# Still 29 GB  → FP16 not active
# Around 19 GB → Batch=2 not active
# 24-26 GB     → ✅ correct
```
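These three readings can be folded into one quick check. A small sketch; the thresholds are the approximate figures quoted above, not hard cutoffs:

```python
def diagnose_memory(gb_per_gpu):
    """Rough diagnosis from observed per-GPU memory on a 32 GB V100S.
    Thresholds approximate the figures above, not exact boundaries."""
    if gb_per_gpu >= 28:
        return "FP16 not active"
    if gb_per_gpu <= 21:
        return "Batch=2 not active"
    if 24 <= gb_per_gpu <= 26:
        return "OK"
    return "unclear - check the training logs"
```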

**2. Confirm the speedup**

```bash
# Check iteration speed
tail -50 $(ls -t phase4a_stage1_fp16_batch2*.log | head -1) | grep "time:"

# Expect roughly 1.4-1.6 s/iter
# Still 2.6 s → the optimization is not active
```

**3. Confirm the loss is healthy**

```bash
# Check recent loss values
tail -100 $(ls -t phase4a_stage1_fp16_batch2*.log | head -1) | grep "loss:" | tail -5

# Loss should sit in the 2.5-2.8 range
# Watch for NaN
```
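The NaN check can be automated. A sketch that assumes mmdet-style text logs with a `loss: <value>` field (adjust the regex for a different logger):

```python
import math
import re

def check_losses(log_text):
    """Extract total-loss values from training-log text and collect any
    NaN/inf entries. Assumes an mmdet-style `loss: <value>` field."""
    vals = [float(m) for m in re.findall(r"\bloss: ([0-9.]+|nan|inf)", log_text)]
    bad = [v for v in vals if math.isnan(v) or math.isinf(v)]
    return vals, bad

sample = "iter 100 loss: 2.7314\niter 150 loss: 2.6890\niter 200 loss: nan\n"
vals, bad = check_losses(sample)
# Any entry in `bad` means the FP16 run is diverging and the
# loss scale / learning rate should be revisited.
```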

### Check after 1 hour

**Stability checks**:
- Loss trend is stable (no wild swings)
- Gradient norm is in the normal 10-20 range
- No OOM errors
- GPU memory steady at 24-26 GB

---

## 🔧 Potential Issues and Fixes

### Issue 1: CUDA OOM

**Symptom**:
```
RuntimeError: CUDA out of memory
```

**Fix**: drop back to batch=1 in the config file:
```yaml
data:
  samples_per_gpu: 1

optimizer:
  lr: 2.0e-5
```

### Issue 2: Loss oscillation

**Symptom**: the loss swings wildly

**Fix**:
```yaml
# Lower the learning rate
optimizer:
  lr: 3.0e-5   # down from 4e-5
```

### Issue 3: No speedup

**Checks**:
```bash
# 1. Confirm FP16 is active
grep -i "fp16\|amp" <latest-log>.log

# 2. Confirm batch=2 is active
grep "samples_per_gpu" <latest-log>.log
```

---

## 📈 Benchmark Comparison

### Three configurations

| Config | Batch/GPU | Total batch | Memory | Speed | Epoch | 10 epochs | LR |
|------|-----------|---------|------|------|-------|----------|-----|
| **FP32 baseline** | 1 | 8 | 29 GB | 2.65 s | 11 h | **9 days** | 2e-5 |
| **FP16 single-batch** | 1 | 8 | 19 GB | 1.9 s | 7.5 h | **6.5 days** | 2e-5 |
| **FP16 + Batch 2** ⭐ | 2 | 16 | 25 GB | 1.5 s | 6 h | **5 days** | 4e-5 |

### Speedups

```
FP32 → FP16 single-batch: ~30% faster, 2.5 days saved
FP32 → FP16 + Batch 2:    43% faster, 4 days saved ⭐
```
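The headline percentages follow directly from the per-iteration and per-epoch times in the table. A quick check (the helper is illustrative):

```python
def time_saved_pct(base_time, new_time):
    """Percent reduction in wall time (per iteration or per epoch)."""
    return round(100 * (base_time - new_time) / base_time)

# FP32 (2.65 s/iter) -> FP16 + Batch 2 (1.5 s/iter): 43% less time per iter
# FP32 epoch (11 h)  -> FP16 + Batch 2 epoch (6 h):  45% shorter epochs
```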
---

## ✅ Configuration Checklist

### Files
- [x] Config file: `multitask_BEV2X_phase4a_stage1_fp16.yaml`
- [x] Launch script: `RESTART_PHASE4A_STAGE1_FP16.sh`
- [x] One-click script: `CLEANUP_AND_START_FP16_BATCH2.sh`
- [x] Checkpoint: `/data/runs/phase4a_stage1/epoch_1.pth`
- [x] Pretrained model: `/data/pretrained/swint-nuimages-pretrained.pth`

### Config values
- [x] FP16 enabled: ✅ `fp16.loss_scale: dynamic`
- [x] Batch=2: ✅ `samples_per_gpu: 2`
- [x] Learning rate: ✅ `lr: 4.0e-5`
- [x] work_dir: ✅ `/data/runs/phase4a_stage1_fp16_batch2`
- [x] Workers: ✅ `workers_per_gpu: 0`

### GPU environment
- [x] GPU model: Tesla V100S-PCIE-32GB
- [x] CUDA compute capability: 7.0 (Tensor Cores supported)
- [x] GPU count: 8
- [x] PyTorch: 1.10.1+cu102
- [x] FP16 support: ✅

---

## 🎯 Expected Final Results

### Performance
```
Estimated completion:  2025-11-06 22:00 UTC (5 days out)
vs FP32 baseline:      4 days saved
vs FP16 single-batch:  1.5 days saved

Expected final metrics:
  mIoU:          0.48+ (vs 0.41 in Phase 3)
  Divider IoU:   0.28+ (vs 0.19)
  Stop Line IoU: 0.35+ (vs 0.27)
```

### Resource efficiency
```
GPU utilization:     100%
Memory utilization:  75-80% (24-26 GB of 32 GB)
Training throughput: +43% vs FP32
```

---

## 📝 Next Steps

### After training completes (11/6)
1. Evaluate the performance metrics
2. Compare against the Phase 3 baseline
3. Decide whether to continue with Stage 2 (800×800)
4. Or start Phase 4B (MapTR-enhanced Divider)

### Optional further optimization
If memory usage stays below 23 GB and training is stable, consider:
- Raising to batch=3 (a further 10-15% speedup)
- This requires raising the learning rate to 6e-5

---

**Document version**: 1.0
**Last updated**: 2025-11-01 22:20 UTC
**Status**: ✅ Configured, ready to launch

**Recommendation**: run `CLEANUP_AND_START_FP16_BATCH2.sh` now to start training! ⚡