310 lines
7.6 KiB
Markdown
310 lines
7.6 KiB
Markdown
|
|
# BEVFusion 训练时间详细分析
|
|||
|
|
|
|||
|
|
**分析时间**:2025-10-21 21:40 UTC
|
|||
|
|
**数据来源**:enhanced_training_6gpus.log(Epoch 1, Iteration 700-1650)
|
|||
|
|
**样本数量**:33个iterations统计
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📊 执行摘要
|
|||
|
|
|
|||
|
|
### 单个Epoch时间预估
|
|||
|
|
|
|||
|
|
| 项目 | 时间 | 占比 |
|
|||
|
|
|------|------|------|
|
|||
|
|
| **总时间** | **7.90 小时** | 100% |
|
|||
|
|
| 模型计算 | 5.42 小时 | 68.6% ⭐ |
|
|||
|
|
| 数据加载 | 2.48 小时 | 31.4% |
|
|||
|
|
| 总iterations | 10,299 | - |
|
|||
|
|
|
|||
|
|
### 23个Epochs完整训练
|
|||
|
|
|
|||
|
|
- **总时间**:约 **182 小时** = **7.6 天**
|
|||
|
|
- **预计完成**:2025-10-29(从10-21 20:21开始)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 主要时间损耗分析
|
|||
|
|
|
|||
|
|
### 1. 模型计算时间:5.42小时(68.6%)⭐ 最大瓶颈
|
|||
|
|
|
|||
|
|
**详细分解**(单个iteration 1.895秒):
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
模型计算时间: 1.895秒 (68.6%)
|
|||
|
|
├─ 前向传播: ~1.0秒 (52.8%) ⭐⭐⭐
|
|||
|
|
│ ├─ Camera Encoder (SwinTransformer): ~0.4秒 (21%)
|
|||
|
|
│ │ └─ 6个视角图像特征提取
|
|||
|
|
│ │ • Patch Embedding
|
|||
|
|
│ │ • Window Attention × 多层
|
|||
|
|
│ │ • FPN特征金字塔
|
|||
|
|
│ │
|
|||
|
|
│ ├─ LiDAR Encoder: ~0.2秒 (11%)
|
|||
|
|
│ │ └─ 点云处理
|
|||
|
|
│ │ • Voxelization(体素化)
|
|||
|
|
│ │ • Sparse 3D Convolution
|
|||
|
|
│ │
|
|||
|
|
│ ├─ Fuser + Decoder: ~0.2秒 (11%)
|
|||
|
|
│ │ └─ BEV特征融合和增强
|
|||
|
|
│ │ • ConvFuser融合
|
|||
|
|
│ │ • SECOND Backbone
|
|||
|
|
│ │ • SECONDFPN
|
|||
|
|
│ │
|
|||
|
|
│ └─ Dual Heads: ~0.2秒 (11%)
|
|||
|
|
│ ├─ Object Head (TransFusion): ~0.1秒
|
|||
|
|
│ └─ Map Head (EnhancedBEVSeg): ~0.1秒
|
|||
|
|
│
|
|||
|
|
├─ 反向传播: ~0.6秒 (31.7%) ⭐⭐
|
|||
|
|
│ └─ 梯度计算
|
|||
|
|
│ • Loss反向传播
|
|||
|
|
│ • 各层梯度累积
|
|||
|
|
│
|
|||
|
|
├─ 优化器更新: ~0.2秒 (10.6%) ⭐
|
|||
|
|
│ └─ AdamW参数更新
|
|||
|
|
│ • 梯度处理
|
|||
|
|
│ • 参数更新
|
|||
|
|
│ • 学习率调度
|
|||
|
|
│
|
|||
|
|
└─ GPU同步: ~0.1秒 (5.3%)
|
|||
|
|
└─ 分布式训练同步
|
|||
|
|
• 梯度all-reduce
|
|||
|
|
• 6个GPU同步
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 2. 数据加载时间:2.48小时(31.4%)
|
|||
|
|
|
|||
|
|
**详细分解**(单个iteration 0.866秒):
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
数据加载时间: 0.866秒 (31.4%)
|
|||
|
|
├─ 磁盘I/O读取: ~0.4秒 (46%)
|
|||
|
|
│ ├─ 6个相机图像加载 (256×704 × 6)
|
|||
|
|
│ ├─ LiDAR点云加载 (sweep数据)
|
|||
|
|
│ └─ 标注数据加载 (3D框、分割mask)
|
|||
|
|
│
|
|||
|
|
├─ 数据增强: ~0.3秒 (35%)
|
|||
|
|
│ ├─ 图像增强
|
|||
|
|
│ │ • Resize、Normalize
|
|||
|
|
│ │ • RandomFlip、ColorJitter
|
|||
|
|
│ ├─ 3D增强
|
|||
|
|
│ │ • GlobalRotScaleTrans
|
|||
|
|
│ │ • RandomFlip3D
|
|||
|
|
│ └─ 点云增强
|
|||
|
|
│
|
|||
|
|
└─ 数据格式化: ~0.2秒 (19%)
|
|||
|
|
├─ 转为Tensor
|
|||
|
|
├─ 数据打包
|
|||
|
|
└─ 批处理准备
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**说明**:
|
|||
|
|
- **workers_per_gpu=0**:主进程加载数据(避免死锁)
|
|||
|
|
- 虽然稍慢,但稳定性优先 ✅
|
|||
|
|
- 数据加载不是主要瓶颈
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📈 时间趋势分析
|
|||
|
|
|
|||
|
|
### 不同训练阶段的时间变化
|
|||
|
|
|
|||
|
|
| 阶段 | Iterations | 平均时间 | 数据加载 | 模型计算 | 变化 |
|
|||
|
|
|------|-----------|----------|----------|----------|------|
|
|||
|
|
| **前期** | 1-100 | 2.923秒 | 0.906秒 (31.0%) | 2.017秒 (69.0%) | 基准 |
|
|||
|
|
| **中期** | 450-550 | 2.739秒 | 0.865秒 (31.6%) | 1.874秒 (68.4%) | ⬇️ 6.3% |
|
|||
|
|
| **后期** | 950+ | 2.753秒 | 0.862秒 (31.3%) | 1.891秒 (68.7%) | ⬇️ 5.8% |
|
|||
|
|
|
|||
|
|
**分析**:
|
|||
|
|
- ✅ 训练过程稳定
|
|||
|
|
- ✅ 中后期略快(数据缓存、GPU预热效应)
|
|||
|
|
- ✅ 时间波动<7%,非常稳定
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🔍 性能瓶颈排序
|
|||
|
|
|
|||
|
|
从最耗时到最少:
|
|||
|
|
|
|||
|
|
| 排名 | 瓶颈点 | 单iter时间 | 占比 | 优先级 |
|
|||
|
|
|------|--------|-----------|------|--------|
|
|||
|
|
| **1** | 模型前向传播 | 1.000秒 | 36.2% | 🔴 高 |
|
|||
|
|
| **2** | 数据加载I/O | 0.866秒 | 31.4% | 🟡 中 |
|
|||
|
|
| **3** | 反向传播 | 0.600秒 | 21.7% | 🟡 中 |
|
|||
|
|
| **4** | 优化器更新 | 0.200秒 | 7.2% | 🟢 低 |
|
|||
|
|
| **5** | GPU同步 | 0.100秒 | 3.6% | 🟢 低 |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🚀 优化建议
|
|||
|
|
|
|||
|
|
### 1. 短期优化(可立即实施)
|
|||
|
|
|
|||
|
|
#### ❌ 不建议改动(当前训练进行中)
|
|||
|
|
- 不改变workers_per_gpu(保持稳定性)
|
|||
|
|
- 不改变batch size(保持一致性)
|
|||
|
|
- 不改变GPU数量(训练已启动)
|
|||
|
|
|
|||
|
|
#### ✅ 下次训练可尝试
|
|||
|
|
|
|||
|
|
**A. 数据加载优化(可节省0.5-1小时/epoch)**
|
|||
|
|
```yaml
|
|||
|
|
# 如果有充足共享内存
|
|||
|
|
data:
|
|||
|
|
workers_per_gpu: 2 # 谨慎测试,从0→2
|
|||
|
|
persistent_workers: true # 保持worker进程
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**B. 混合精度训练(可节省1-2小时/epoch)**
|
|||
|
|
```yaml
|
|||
|
|
fp16:
|
|||
|
|
loss_scale: 512.0
|
|||
|
|
# 预期加速:20-30%
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 2. 中期优化(针对未来训练)
|
|||
|
|
|
|||
|
|
#### A. 模型结构优化
|
|||
|
|
|
|||
|
|
**前向传播瓶颈(1.0秒)**:
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
优化目标:Camera Encoder (0.4秒)
|
|||
|
|
├─ 方案1:减少Swin层数
|
|||
|
|
│ └─ [2,2,6,2] → [2,2,4,2]
|
|||
|
|
│ 预期加速:15-20%
|
|||
|
|
│
|
|||
|
|
├─ 方案2:降低特征维度
|
|||
|
|
│ └─ 96通道 → 80通道
|
|||
|
|
│ 预期加速:10-15%
|
|||
|
|
│
|
|||
|
|
└─ 方案3:使用更快的backbone
|
|||
|
|
└─ SwinTransformer → EfficientNet
|
|||
|
|
预期加速:30-40%
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### B. 批处理优化
|
|||
|
|
|
|||
|
|
```yaml
|
|||
|
|
# 如果显存允许
|
|||
|
|
data:
|
|||
|
|
samples_per_gpu: 3 # 从2增加到3
|
|||
|
|
# 每次更新更多样本,总iterations减少33%
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**权衡**:
|
|||
|
|
- ✅ 总训练时间减少
|
|||
|
|
- ⚠️ 可能需要调整学习率
|
|||
|
|
- ⚠️ 显存需求增加(当前95%)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 3. 长期优化(部署阶段)
|
|||
|
|
|
|||
|
|
#### A. 模型剪枝(Phase 4计划)
|
|||
|
|
```
|
|||
|
|
目标:110M → 60M参数
|
|||
|
|
预期推理加速:40-50%
|
|||
|
|
训练加速:30-40%
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### B. 量化训练(Phase 5计划)
|
|||
|
|
```
|
|||
|
|
目标:FP32 → INT8
|
|||
|
|
预期推理加速:2-3倍
|
|||
|
|
训练时INT8不适用(精度损失)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 💡 关键发现
|
|||
|
|
|
|||
|
|
### 1. 时间分布合理 ✅
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
模型计算 68.6% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
|
|||
|
|
数据加载 31.4% ━━━━━━━━━━━━━━
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**分析**:
|
|||
|
|
- ✅ 模型计算占主导(正常)
|
|||
|
|
- ✅ 数据加载不是瓶颈(<35%)
|
|||
|
|
- ✅ 比例接近理想(70:30)
|
|||
|
|
|
|||
|
|
### 2. 前向传播是最大瓶颈 ⭐
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
前向传播:1.0秒 (36.2%)
|
|||
|
|
└─ Camera Encoder最慢:0.4秒 (21%)
|
|||
|
|
└─ SwinTransformer复杂度高
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**优化空间最大**
|
|||
|
|
|
|||
|
|
### 3. 训练非常稳定 ✅
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
时间波动: <7%
|
|||
|
|
前期 → 中期 → 后期:逐渐加快
|
|||
|
|
原因:数据缓存、GPU预热
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📊 对比分析
|
|||
|
|
|
|||
|
|
### 与原始训练对比(Epoch 19)
|
|||
|
|
|
|||
|
|
| 项目 | Epoch 19 | 当前增强版 | 差异 |
|
|||
|
|
|------|----------|-----------|------|
|
|||
|
|
| 单iter时间 | ~3.35秒 | 2.76秒 | ⬇️ 17.6% |
|
|||
|
|
| Epoch时间 | ~3.6小时 | ~7.9小时 | ⬆️ 119% |
|
|||
|
|
| 原因 | - | Iterations增多 | - |
|
|||
|
|
|
|||
|
|
**说明**:
|
|||
|
|
- 原始训练:3,862 iterations/epoch
|
|||
|
|
- 增强版训练:10,299 iterations/epoch(2.67倍)
|
|||
|
|
- 原因:数据集配置不同
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 结论
|
|||
|
|
|
|||
|
|
### 当前训练效率评估
|
|||
|
|
|
|||
|
|
| 评估项 | 评分 | 说明 |
|
|||
|
|
|--------|------|------|
|
|||
|
|
| **整体效率** | ⭐⭐⭐⭐ | 良好 |
|
|||
|
|
| **时间分配** | ⭐⭐⭐⭐⭐ | 优秀 |
|
|||
|
|
| **稳定性** | ⭐⭐⭐⭐⭐ | 优秀 |
|
|||
|
|
| **优化空间** | ⭐⭐⭐ | 中等 |
|
|||
|
|
|
|||
|
|
### 主要时间损耗总结
|
|||
|
|
|
|||
|
|
**Top 3耗时环节**:
|
|||
|
|
1. **模型前向传播**:1.0秒/iter(36.2%)⭐⭐⭐
|
|||
|
|
- 主要在Camera Encoder(SwinTransformer)
|
|||
|
|
- 优化空间:剪枝、更换backbone
|
|||
|
|
|
|||
|
|
2. **数据加载**:0.866秒/iter(31.4%)⭐⭐
|
|||
|
|
- Workers=0导致稍慢
|
|||
|
|
- 优化空间:增加workers(需测试稳定性)
|
|||
|
|
|
|||
|
|
3. **反向传播**:0.6秒/iter(21.7%)⭐
|
|||
|
|
- 梯度计算
|
|||
|
|
- 优化空间:混合精度训练
|
|||
|
|
|
|||
|
|
### 建议
|
|||
|
|
|
|||
|
|
✅ **当前阶段**:保持现状,等待训练完成
|
|||
|
|
✅ **下次训练**:尝试mixed precision + workers=2
|
|||
|
|
✅ **部署阶段**:模型剪枝和量化
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
**报告生成**:2025-10-21 21:40 UTC
|
|||
|
|
**预计Epoch 1完成**:2025-10-22 04:30(还需7小时)
|
|||
|
|
**预计全部完成**:2025-10-29(还需7.6天)
|
|||
|
|
|