================================================================================
BEVFusion Memory Usage Analysis and Optimization Plan
================================================================================

【Current status】
GPU memory usage: 28.8-29.3 GB / 32 GB (88-89%)
Batch size: 1 per GPU × 8 GPUs = 8
Training speed: 2.67 s/iteration
Time per epoch: 11 hours
10 epochs: 9.5 days
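
The status figures can be cross-checked with a few lines of arithmetic. The sketch below derives the iteration count per epoch from the stated 2.67 s/iteration and 11 h/epoch, then multiplies out 10 epochs. The function names are illustrative and do not come from the project.

```python
SECONDS_PER_HOUR = 3600


def iterations_per_epoch(epoch_hours: float, sec_per_iter: float) -> int:
    """Iterations implied by an epoch duration and a per-iteration time."""
    return round(epoch_hours * SECONDS_PER_HOUR / sec_per_iter)


def training_days(epoch_hours: float, epochs: int) -> float:
    """Pure training time in days, ignoring evaluation and checkpointing."""
    return epoch_hours * epochs / 24


iters = iterations_per_epoch(11, 2.67)
days = training_days(11, 10)
print(f"{iters} iterations/epoch, {days:.1f} days of pure training for 10 epochs")
```

Iteration time alone comes to about 4.6 days for 10 epochs, so the 9.5-day estimate presumably includes time that is not itemized here (evaluation or checkpointing, for example) or a longer schedule. That gap is worth confirming against the training logs.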

【Main memory consumers】
1. LSS Transform outer product: ~9 GB ⚠️⚠️ (largest bottleneck)
2. Swin Transformer activations: ~3 GB ⚠️
3. BEV Decoder (4 layers): ~4 GB ⚠️
4. Optimizer state (AdamW): ~7 GB
5. Other (parameters, gradients, etc.): ~6 GB
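
As a quick sanity check, the breakdown can be summed and expressed as shares of the total. This is a minimal sketch that uses only the approximate figures listed above. The dictionary and function names are illustrative.

```python
# Approximate per-GPU memory breakdown in GB, copied from the list above.
BREAKDOWN_GB = {
    "LSS Transform outer product": 9,
    "Swin Transformer activations": 3,
    "BEV Decoder (4 layers)": 4,
    "Optimizer state (AdamW)": 7,
    "Other (parameters, gradients, etc.)": 6,
}


def shares(breakdown):
    """Return the total and each component's fraction of that total."""
    total = sum(breakdown.values())
    return total, {name: gb / total for name, gb in breakdown.items()}


total_gb, share = shares(BREAKDOWN_GB)
for name, frac in sorted(share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<38} {frac:6.1%}")
print(f"total: {total_gb} GB")
```

The components add up to 29 GB, in line with the measured 28.8-29.3 GB, and the LSS outer product alone accounts for roughly 31% of it.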

【Optimization plan comparison】

Plan A: FP16 + Batch=4 (recommended) ⭐⭐⭐
─────────────────────────────────────
GPU memory: 29 GB → 20 GB (saves 9 GB)
Batch: 8 → 32 (4×)
Speed: +33% (7.5 h/epoch)
Completion: 6.5 days (vs 9.5 days)
Accuracy: no impact expected
Difficulty: low (config changes only)

Config file: multitask_BEV2X_phase4a_stage1_fp16_batch4.yaml
Launch script: START_OPTIMIZED_TRAINING.sh

Plan B: FP16 + gradient checkpointing (CP) + Batch=8 ⭐⭐
─────────────────────────────────────
GPU memory: 29 GB → 15 GB (saves 14 GB)
Batch: 8 → 64 (8×)
Speed: +40% (although checkpointing itself slows training by 15%)
Completion: ~4 days
Accuracy: slight impact (reduced depth resolution)
Difficulty: medium (convergence must be tested)

Plan C: Increase batch to 2 only (conservative) ⭐⭐
─────────────────────────────────────
GPU memory: 29 GB → 25 GB (saves 4 GB)
Batch: 8 → 16 (2×)
Speed: +15% (9.5 h/epoch)
Completion: 8 days
Accuracy: no impact expected
Difficulty: low (can be done immediately)
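
To make the comparison easier to audit, the sketch below recomputes the derived quantities (total batch, memory saved, per-epoch time reduction) from the raw numbers in the three plans. The `Plan` class and constant names are illustrative. Plan B has no per-epoch time in the comparison, so that field is left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

NUM_GPUS = 8
BASELINE_EPOCH_HOURS = 11.0
BASELINE_MEM_GB = 29


@dataclass
class Plan:
    name: str
    samples_per_gpu: int
    mem_gb: int
    epoch_hours: Optional[float]  # None where no per-epoch time is given

    @property
    def total_batch(self) -> int:
        return self.samples_per_gpu * NUM_GPUS

    @property
    def mem_saved_gb(self) -> int:
        return BASELINE_MEM_GB - self.mem_gb

    @property
    def epoch_time_reduction(self) -> Optional[float]:
        """Fraction of wall-clock time saved per epoch versus the baseline."""
        if self.epoch_hours is None:
            return None
        return 1 - self.epoch_hours / BASELINE_EPOCH_HOURS


PLANS = [
    Plan("A", samples_per_gpu=4, mem_gb=20, epoch_hours=7.5),
    Plan("B", samples_per_gpu=8, mem_gb=15, epoch_hours=None),
    Plan("C", samples_per_gpu=2, mem_gb=25, epoch_hours=9.5),
]

for p in PLANS:
    cut = p.epoch_time_reduction
    cut_text = "n/a" if cut is None else f"{cut:.0%}"
    print(f"Plan {p.name}: batch {p.total_batch}, saves {p.mem_saved_gb} GB, "
          f"epoch time reduced by {cut_text}")
```

Read this way, the "+33%" and "+15%" speed figures correspond to per-epoch time reductions of about 32% and 14% relative to the 11-hour baseline.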

【Ready to run now: Plan A】

1. Start training with the optimized config:
   bash START_OPTIMIZED_TRAINING.sh

2. Monitor GPU memory usage:
   watch -n 5 'nvidia-smi --query-gpu=index,memory.used --format=csv'

3. Follow training progress:
   tail -f phase4a_stage1_fp16_batch4_*.log | grep "Epoch"
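
If the memory check needs to be scripted rather than watched by eye, the same `nvidia-smi` query can be parsed from Python. The sketch below assumes the usual CSV layout for this query (a header row followed by lines such as `0, 19500 MiB`). The helper names and the 20 GB threshold, taken from the expected steady-state range in this note, are illustrative.

```python
import csv
import io
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used", "--format=csv"]


def parse_memory_csv(text: str) -> dict:
    """Map GPU index -> used memory in MiB from nvidia-smi CSV output."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(rows, None)  # skip the header row
    usage = {}
    for row in rows:
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        index, used = row
        usage[int(index)] = int(used.split()[0])  # "19500 MiB" -> 19500
    return usage


def over_budget(usage: dict, limit_mib: int) -> list:
    """GPU indices whose used memory exceeds the limit."""
    return sorted(i for i, used in usage.items() if used > limit_mib)


def read_gpu_memory() -> dict:
    """Run the query on this machine (requires nvidia-smi on PATH)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)
```

For example, `over_budget(read_gpu_memory(), 20 * 1024)` would list the GPUs currently above 20 GB.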

【Optimized configuration notes】

FP16 mixed precision:
  ✓ Activation memory is roughly halved
  ✓ Training is 20-30% faster
  ✓ Native Tensor Core support on V100
  ✓ Dynamic loss scaling prevents gradient underflow
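
To illustrate why dynamic loss scaling matters, the sketch below first shows a small gradient value underflowing to zero in FP16, then models the scale-adjustment rule in plain Python. It is a simplified stand-in and not the project's training code. The default values mirror those of PyTorch's `GradScaler`, and the actual config may use different ones.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


class DynamicLossScaler:
    """Halve the scale on overflow; double it after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads) -> bool:
        """Return True if the optimizer step should be applied."""
        if any(not math.isfinite(g) for g in grads):
            self.scale *= self.backoff_factor  # overflow: skip step, back off
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor  # stable: try a larger scale
            self._good_steps = 0
        return True


tiny_grad = 1e-8
print(to_fp16(tiny_grad))            # underflows to zero in FP16
print(to_fp16(tiny_grad * 2 ** 16))  # survives once the loss is scaled
```

Scaling the loss multiplies every gradient by the same factor, so small values stay representable in FP16. The gradients are divided by that factor again before the optimizer step.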
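
Batch increased to 4:
  ✓ Uses the memory freed by FP16
  ✓ More stable gradients
  ✓ More accurate BatchNorm statistics
  ✓ Learning rate scaled up (2e-5 → 4e-5): 2× for a 4× larger batch, which is square-root scaling; strictly linear scaling would give 8e-5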
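
The learning-rate change can be checked numerically. Going from a total batch of 8 to 32 is a 4× increase, so the linear rule gives 8e-5, while the square-root rule gives the 4e-5 used here. The helper below is an illustrative sketch, not a function from the training code.

```python
import math


def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Scale a learning rate for a new total batch size."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule!r}")


print(scale_lr(2e-5, 8, 32, "linear"))  # linear rule
print(scale_lr(2e-5, 8, 32, "sqrt"))    # square-root rule
```

If training proves stable at 4e-5, trying a value closer to the linear-rule result is one option. Whether that helps is an empirical question for this model.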

Workers increased to 2:
  ✓ Faster data loading
  ✓ Data is already prepared by the time the GPU needs it
  ✓ Lower data_time share per iteration

【Notes】

1. This is the first FP16 training attempt, so monitor:
   - Whether the loss decreases normally
   - Whether NaN/Inf values appear
   - Whether final accuracy meets the target

2. Batch=4 may require:
   - A longer warmup (already set to 1000 iters)
   - A slight learning-rate adjustment (if training does not converge)

3. GPU memory monitoring:
   - The first few iterations may use more (initialization)
   - Usage should settle at 18-20 GB
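
The first two loss checks above lend themselves to automation. The sketch below flags non-finite loss values immediately and compares the mean loss of consecutive windows to spot a rising trend. The class name, window size, and status strings are illustrative choices, and feeding it values parsed from the training log is left to the reader since the log format is not shown here.

```python
import math
from collections import deque


class LossMonitor:
    """Flag NaN/Inf losses and windows whose mean loss went up."""

    def __init__(self, window: int = 100):
        self.recent = deque(maxlen=window)
        self.previous_mean = None

    def update(self, loss: float) -> str:
        if not math.isfinite(loss):
            return "non-finite"
        self.recent.append(loss)
        if len(self.recent) < self.recent.maxlen:
            return "ok"
        # Window is full: compare its mean with the previous window's mean.
        mean = sum(self.recent) / len(self.recent)
        status = "ok"
        if self.previous_mean is not None and mean > self.previous_mean:
            status = "rising"
        self.previous_mean = mean
        self.recent.clear()
        return status
```

A single "rising" window is not necessarily a problem, since loss curves are noisy. Several in a row, or any "non-finite" result, would be a reason to consider the fallback.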

【Fallback】

If FP16 training runs into problems:
1. Revert to the original config: bash START_FROM_EPOCH1.sh
2. Increase the batch to 2 only: data.samples_per_gpu=2
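
The dotted form `data.samples_per_gpu=2` addresses a nested config key. As a minimal sketch of what such an override does, the hypothetical helper below sets a nested value on a plain dictionary. It is not the project's config loader, and the base dictionary only reproduces the per-GPU batch size of 1 stated in this note.

```python
import copy


def apply_override(cfg: dict, dotted_key: str, value):
    """Return a copy of cfg with the nested key set to value."""
    new_cfg = copy.deepcopy(cfg)
    node = new_cfg
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # create intermediate levels if missing
    node[leaf] = value
    return new_cfg


base = {"data": {"samples_per_gpu": 1}}
fallback = apply_override(base, "data.samples_per_gpu", 2)
print(fallback)
```

Working on a copy keeps the original config intact, so the baseline settings remain available for comparison.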

【Full documentation】

Detailed analysis: project/docs/BEVFusion内存占用分析_20251101.md

================================================================================
Generated: 2025-11-01 12:30 UTC
================================================================================