# BEV Resolution Upgrade: Options Analysis

**Question**: How should BEVFusion's resolution be raised: 2× the input image, or 2× the BEV features?

**Date**: 2025-10-25
**Current config**: multitask_enhanced_phase1_HIGHRES.yaml

---

## 📊 Current Resolution Configuration

### Status quo
```
Input image resolution:
  image_size: [256, 704]  # H × W
  → after the SwinT backbone: [32, 88] (8× downsampling)

BEV feature resolution:
  xbound: [-54.0, 54.0, 0.3]  # 108m / 0.3m = 360 grids
  ybound: [-54.0, 54.0, 0.3]  # 108m / 0.3m = 360 grids
  → Camera BEV: (1, 80, 360, 360)
  → LiDAR BEV:  (1, 256, 360, 360)
  → Fused BEV:  (1, 256, 360, 360)

BEV output resolution:
  output_scope: [-50, 50, 0.5]  # 100m / 0.5m = 200 grids
  → Segmentation output: (1, 6, 200, 200)
```
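
The grid counts quoted in the comments follow directly from each `[lower, upper, step]` triple. A minimal sketch of that arithmetic is shown here; `grid_count` is a hypothetical helper name, not part of BEVFusion, and `round` is used instead of `int` so that float noise in the division cannot drop a cell.

```python
def grid_count(lower: float, upper: float, step: float) -> int:
    """Number of cells along one axis for a [lower, upper, step] bound."""
    # round() guards against results like 359.99999999 from float division
    return round((upper - lower) / step)


# Current config: 0.3 m BEV cells over 108 m, 0.5 m output cells over 100 m
bev_cells = grid_count(-54.0, 54.0, 0.3)
out_cells = grid_count(-50.0, 50.0, 0.5)
print(f"BEV: {bev_cells}x{bev_cells}, output: {out_cells}x{out_cells}")
```

The same helper can be reused to check any candidate step size before editing the config.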

### Problem diagnosis
The Epoch 10 evaluation shows **poor segmentation of small objects**:
- Stop Line IoU: 0.24 ⚠️
- Divider IoU: 0.17 ⚠️

**Root cause**: insufficient resolution. At 0.5 m/grid the output cannot represent thin lines (width < 0.3 m) accurately.

---

## 🎯 Option Comparison

### Plan A: Raise the input image resolution

**Change**: 256×704 → 512×1408 (2×)

**Pros**:
```
✅ Richer image detail
✅ Better recognition of small objects
✅ More accurate long-range detection
✅ Simple to implement: only the config changes
```

**Cons**:
```
❌ Large increase in compute
   - Backbone FLOPs: ×4 (H×W)
   - Memory: ×4
   - Training time: ×3-4

❌ GPU memory pressure
   - Current: 256×704 × 6 cameras ≈ 19GB
   - After:   512×1408 × 6 cameras ≈ 50GB (itemized estimate; ≈ 76GB if all memory scaled 4×) ⚠️ exceeds the V100's 32GB!

❌ BEV feature resolution unchanged
   - The BEV is still 360×360
   - Detail may be lost in the view transform
```

**Compute analysis**:
```
# SwinTransformer compute
Input: (B, N, 3, H, W)
FLOPs ∝ number of patches ∝ H × W

256×704:  FLOPs_base
512×1408: FLOPs_base × 4 ⚠️

Training speed:
  Current:   2.7 s/iter
  Estimated: 10-12 s/iter (4× slower)
```
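
To make the ×4 figure concrete, the sketch below counts the tokens produced by a Swin-T style patch partition (4×4 patches) at each input size and scales the measured 2.7 s/iter by that ratio. `swin_tokens` is a hypothetical helper, and scaling the whole iteration time by the token ratio is an assumption that gives an upper-end estimate, since the BEV stages do not grow with the image.

```python
def swin_tokens(height: int, width: int, patch: int = 4) -> int:
    """Tokens after patch partition (patch=4 as in Swin-T)."""
    return (height // patch) * (width // patch)


base_tokens = swin_tokens(256, 704)    # current input
high_tokens = swin_tokens(512, 1408)   # Plan A input
ratio = high_tokens / base_tokens

# Assumption: iteration time grows linearly with the token count
est_iter_s = 2.7 * ratio
print(f"tokens: {base_tokens} -> {high_tokens} (x{ratio:.1f}), ~{est_iter_s:.1f} s/iter")
```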

**Memory analysis**:
```
# Main sources of GPU memory usage
1. Images: B × N × 3 × H × W × 4 bytes
   256×704:  1 × 6 × 3 × 256 × 704 × 4  = 13.0 MB
   512×1408: 1 × 6 × 3 × 512 × 1408 × 4 = 51.9 MB (+38.9MB)

2. Backbone intermediate features (multiple layers)
   256×704:  ~5GB
   512×1408: ~20GB (+15GB) ⚠️

3. BEV features
   Unchanged: ~8GB

4. Gradients + optimizer
   Doubled: +15GB

Total:
   256×704:  ~19GB ✅ current
   512×1408: ~50GB ❌ out of memory!
```
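
The image-tensor term is easy to verify by evaluating the formula B × N × 3 × H × W × 4 bytes directly. The sketch below does that for both input sizes, assuming FP32 inputs and 6 cameras as in the text; `image_tensor_bytes` is a hypothetical helper name.

```python
def image_tensor_bytes(batch, cams, height, width, channels=3, bytes_per_elem=4):
    """Size in bytes of a (B, N, C, H, W) FP32 image tensor."""
    return batch * cams * channels * height * width * bytes_per_elem


low = image_tensor_bytes(1, 6, 256, 704)
high = image_tensor_bytes(1, 6, 512, 1408)
print(f"{low / 1e6:.1f} MB -> {high / 1e6:.1f} MB")  # 13.0 MB -> 51.9 MB
```

Either way the raw images are a negligible share of the budget; the backbone activations dominate.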

**Conclusion**: ❌ **Not recommended**. It does not fit in memory without reducing the batch size or enabling gradient checkpointing.

---

### Plan B: Raise the BEV feature resolution (recommended)

**Change**: 360×360 (0.3m) → 720×720 (0.15m) (2×)

**Pros**:
```
✅ Raises BEV resolution directly
✅ More precise spatial representation (output grid 0.5m → 0.25m)
✅ Large improvement expected for small-object segmentation
✅ Compute increase is manageable (mostly in the BEV stage)
✅ Moderate memory increase (about +8-10GB)
✅ Trainable on the current hardware
```

**Cons**:
```
⚠️ Slower training (about 1.5-2×)
⚠️ Several config entries must change
⚠️ The LiDAR encoder may need adjusting
```

**Config changes**:
```yaml
# Plan B config
model:
  encoders:
    camera:
      vtransform:
        image_size: [256, 704]  # unchanged
        feature_size: [32, 88]  # unchanged
        xbound: [-54.0, 54.0, 0.15]  # 0.3 → 0.15 (2×)
        ybound: [-54.0, 54.0, 0.15]  # 0.3 → 0.15 (2×)
        # BEV output: 720×720

    lidar:
      voxelize:
        voxel_size: [0.075, 0.075, 0.2]  # unchanged
        # or raise it too: [0.0375, 0.0375, 0.2]
      backbone:
        sparse_shape: [1440, 1440, 41]  # unchanged
        # or raise to: [2880, 2880, 41]

  heads:
    map:
      grid_transform:
        input_scope: [[-54.0, 54.0, 0.15], [-54.0, 54.0, 0.15]]  # 720×720
        output_scope: [[-50, 50, 0.25], [-50, 50, 0.25]]  # 200→400
```

**Compute analysis**:
```
# BEV-stage compute
Current 360×360:
  - Camera BEV pooling: ~2GB memory
  - Fuser: 360×360 convolutions
  - Decoder: 360×360 → 180×180

Raised to 720×720:
  - Camera BEV pooling: ~8GB memory (+6GB)
  - Fuser: 720×720 convolutions (4× FLOPs)
  - Decoder: 720×720 → 360×360 (4× FLOPs)

Total increase: about +8-10GB memory, 2× training time
```

**Memory usage** (rough estimates; a 2× finer grid has 4× as many cells, so the BEV figures may be optimistic):
```
Current config:
  Images: 13.0 MB
  Backbone features: 5 GB
  BEV features (360×360): 8 GB
  Decoder: 3 GB
  Other: 2 GB
  Total: ~18-19 GB ✅

Plan B (720×720):
  Images: 13.0 MB (unchanged)
  Backbone features: 5 GB (unchanged)
  BEV features (720×720): 16 GB (+8GB)
  Decoder: 6 GB (+3GB)
  Other: 2 GB
  Total: ~28-29 GB ✅ feasible!
```
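
Summing the per-component estimates programmatically makes it easy to compare a candidate configuration against the 32 GB of a V100. The figures in this sketch are the document's rough estimates, not measurements, and the dictionary keys and `total_gb` helper are hypothetical names chosen for illustration.

```python
V100_MEMORY_GB = 32  # capacity quoted in the text

# Rough per-component estimates in GB (images from B*N*3*H*W*4 bytes)
current = {"images": 0.013, "backbone": 5.0, "bev": 8.0, "decoder": 3.0, "other": 2.0}
# Plan B only changes the BEV-stage terms
plan_b = {**current, "bev": 16.0, "decoder": 6.0}


def total_gb(budget):
    """Sum a memory budget expressed as {component: GB}."""
    return sum(budget.values())


for name, budget in (("current", current), ("plan B", plan_b)):
    total = total_gb(budget)
    verdict = "fits" if total < V100_MEMORY_GB else "does not fit"
    print(f"{name}: {total:.1f} GB, {verdict} in {V100_MEMORY_GB} GB")
```

The headroom under 32 GB is small, which is why a short trial run that watches actual memory usage is worth doing before committing to full training.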

**Estimated gains**:
```
Stop Line IoU: 0.24 → 0.40 (+67%)
Divider IoU:   0.17 → 0.30 (+76%)
mIoU:          0.39 → 0.48 (+23%)
```

---

### Plan C: Hybrid (compromise)

**Change**:
- Input image: 256×704 → 384×1056 (1.5×)
- BEV features: 360×360 (0.3m) → 540×540 (0.2m) (1.5×)
- BEV output: 200×200 (0.5m) → about 300×300 (0.33m; a 0.33 step gives exactly 303×303) (1.5×)

**Pros**:
```
✅ Balances input and BEV resolution
✅ Clear performance gain
✅ Moderate compute increase (about 2.5×)
✅ Memory under control (~23-25GB)
```

**Config**:
```yaml
# variable definitions
image_size: [384, 1056]  # 1.5×

model:
  encoders:
    camera:
      vtransform:
        xbound: [-54.0, 54.0, 0.2]  # 540 grids
        ybound: [-54.0, 54.0, 0.2]

  heads:
    map:
      grid_transform:
        output_scope: [[-50, 50, 0.33], [-50, 50, 0.33]]  # 303×303
```

---

## 📐 Detailed Technical Analysis

### Impact chain

#### Effect of input resolution
```
Image resolution ↑
    ↓ (through the backbone)
Feature-map detail ↑
    ↓ (through DepthNet)
Depth estimation accuracy ↑
    ↓ (BEV pooling)
BEV feature quality ↑ (but its size is capped by xbound/ybound)
    ↓
Final accuracy ↑ (limited)
```

**Key bottleneck**: BEV pooling resamples onto a fixed 360×360 grid, so **input detail may be lost**.

#### Effect of BEV resolution
```
BEV grid resolution ↑ (0.3m → 0.15m)
    ↓
Spatial representation ↑ (direct)
    ↓
Small-object segmentation ↑ (significant)
    ↓
Detection accuracy ↑ (more accurate center localization)
    ↓
Final accuracy ↑ (significant)
```

**Advantage**: it **acts directly on the spatial representation**, so the effect is immediate.

### Why does BEV resolution matter more?

**1. The nature of the segmentation task**
```
Segmentation cares about "spatial position" rather than "image detail".

Example:
  Lane-line width: 0.15m

  0.5m resolution:  the line covers ~0.3 of one cell   ← blurred
  0.25m resolution: the line covers ~0.6 of one cell   ← clearer
  0.15m resolution: the line covers about one full cell ← precise
```
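
How much of a grid cell a thin line fills is just the ratio of line width to cell size. The sketch below evaluates that ratio for a 0.15 m lane line at the three resolutions discussed; `cells_spanned` is a hypothetical helper, and the calculation ignores where the line happens to fall relative to cell boundaries.

```python
def cells_spanned(line_width_m: float, cell_size_m: float) -> float:
    """Line width expressed in units of grid cells."""
    return line_width_m / cell_size_m


LANE_WIDTH_M = 0.15
for cell in (0.5, 0.25, 0.15):
    frac = cells_spanned(LANE_WIDTH_M, cell)
    print(f"{cell} m grid: a {LANE_WIDTH_M} m line spans {frac:.2f} cells")
```

Below roughly one cell per line width, the line can only appear as a partially covered cell, which is consistent with the low Stop Line and Divider IoU at 0.5 m.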

**2. Information path**
```
Plan A (raise input):
  image detail ↑ → Backbone → ... → BEV Pooling → 360×360 (bottleneck) → output

Plan B (raise BEV):
  image detail → Backbone → ... → BEV Pooling → 720×720 (direct gain) → output
                                                ^^^^^^^
                                                the key!
```

**3. Experimental evidence**
```
Figures attributed to the original BEVFusion paper (approximate):
  - Raising image resolution: NDS +1-2%
  - Raising BEV resolution: mIoU +5-10%

For segmentation in particular, BEV resolution is what matters most!
```

---

## 🎯 Recommended Plan

### First choice: Plan B (raise BEV resolution)

**Stage 1: 2× BEV** (can start immediately)
```yaml
# configs/vehicle_4cam/high_resolution_bev.yaml

model:
  encoders:
    camera:
      vtransform:
        image_size: [256, 704]  # unchanged
        xbound: [-54.0, 54.0, 0.15]  # 0.3→0.15, 720 grids
        ybound: [-54.0, 54.0, 0.15]
        downsample: 1  # changed from 2 to 1

  heads:
    map:
      grid_transform:
        output_scope: [[-50, 50, 0.25], [-50, 50, 0.25]]  # 400×400

# Data pipeline
train_pipeline:
  - type: LoadBEVSegmentation
    xbound: [-50.0, 50.0, 0.25]  # the GT must also be 400×400
    ybound: [-50.0, 50.0, 0.25]
```

**Expected effect**:
```
✅ Stop Line IoU: 0.24 → 0.38 (+58%)
✅ Divider IoU: 0.17 → 0.28 (+65%)
✅ mIoU: 0.39 → 0.47 (+20%)
✅ NDS: 0.697 → 0.705 (+1%)

Compute cost:
  Memory: 19GB → ~27-29GB (+8-10GB) ✅ feasible
  Speed: 2.7s/iter → 4.5s/iter (1.7× slower)
  Training time: 4 days → 7 days
```
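
The percentages quoted here are relative gains, (after − before) / before, and the training-time figure follows from the per-iteration slowdown. The sketch below reproduces both from the target numbers in the text; the targets are the document's estimates rather than measured results, and `relative_gain` is a hypothetical helper.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100


targets = {
    "Stop Line IoU": (0.24, 0.38),
    "Divider IoU": (0.17, 0.28),
    "mIoU": (0.39, 0.47),
    "NDS": (0.697, 0.705),
}
for name, (before, after) in targets.items():
    print(f"{name}: {before} -> {after} ({relative_gain(before, after):+.1f}%)")

# Training time scales with seconds per iteration (same number of iterations assumed)
slowdown = 4.5 / 2.7
est_days = 4 * slowdown
print(f"slowdown x{slowdown:.2f}, about {est_days:.1f} days")
```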

**Implementation steps**:
```bash
# 1. Create the new config
cp configs/.../multitask_enhanced_phase1_HIGHRES.yaml \
   configs/.../multitask_enhanced_phase1_HIGHRES_BEV2X.yaml

# Edit xbound/ybound to use a 0.15 step

# 2. Continue training from the current checkpoint
torchpack dist-run -np 6 python tools/train.py \
  configs/.../multitask_enhanced_phase1_HIGHRES_BEV2X.yaml \
  --load_from runs/enhanced_from_epoch19/epoch_10.pth \
  --run-dir runs/highres_bev_2x

# 3. Fine-tune for 5-10 epochs to adapt to the new resolution
```

---

### Second choice: Plan C (1.5× hybrid)

**Change**: image 384×1056, BEV 540×540 (0.2m)

**When to use**: if GPU memory is tight

```yaml
image_size: [384, 1056]  # 1.5×

model:
  encoders:
    camera:
      vtransform:
        xbound: [-54.0, 54.0, 0.2]  # 540 grids
        ybound: [-54.0, 54.0, 0.2]

  heads:
    map:
      grid_transform:
        output_scope: [[-50, 50, 0.33], [-50, 50, 0.33]]  # 303×303
```

**Expected effect**:
```
Stop Line IoU: 0.24 → 0.32 (+33%)
Divider IoU: 0.17 → 0.23 (+35%)
mIoU: 0.39 → 0.44 (+13%)

Compute cost:
  Memory: 19GB → 23GB
  Speed: 2.7s/iter → 3.8s/iter
```

---

### Not recommended: Plan A (2× input only)

**Reasons**:
```
❌ Not enough GPU memory (needs roughly 50-76GB)
❌ Training takes too long (×4)
❌ The BEV bottleneck remains
❌ Poor cost-benefit ratio
```

**If you insist**:
```
Compromises required:
  1. Reduce the batch size: 2 → 1 (samples_per_gpu)
  2. Use gradient checkpointing (saves roughly 50% memory, about 20% slower)
  3. Reduce the number of cameras: 6 → 4
  4. Mixed-precision training: FP32 → FP16

This may still not be enough; not advised.
```

---

## 💡 Best-Practice Recommendations

### Recommended path

**Short term (right after Phase 3 completes)**:
```
✅ Raise BEV resolution first (Plan B)
   Config: xbound/ybound 0.3 → 0.15
   Effect: mIoU +20%
   Cost: memory +8-10GB, time ×1.7
```

**Mid term (Phase 6, on-vehicle fine-tuning)**:
```
✅ Adjust the input to the on-vehicle camera resolution
   Vehicle cameras: 1920×1080
   Training: 384×1056 (1.5× increase, acceptable)
   Or: 320×880 (conservative)
```

**Long term (Phase 7, deployment)**:
```
✅ Resolution can be lowered at deployment to save compute
   Training: BEV 720×720
   Deployment: BEV 540×540 or 360×360 (adjusted dynamically)
```

### Implementation advice

**Can be done immediately (after Phase 3 completes)**:
```
Experiment 1: 2× BEV
  Config: xbound 0.3 → 0.15
  Training: 5 epochs
  Evaluation: focus on small-object IoU
  Budget: 1-2 days

If results are good → train 15 more epochs
If memory runs out → drop to 1.5× (0.2m)
```

**At on-vehicle deployment**:
```
Experiment 2: 1.5× input
  Config: image_size [256,704] → [384,1056]
  Reason: the on-vehicle cameras have a higher resolution (1920×1080)
  Training: 10 epochs on vehicle data
  Budget: 2-3 days
```

---

## 📊 Performance vs. Cost

| Plan | Input resolution | BEV resolution | Output resolution | Memory | Speed (per iter) | mIoU gain (est.) | Rating |
|------|------------------|----------------|-------------------|--------|------------------|------------------|--------|
| **Current** | 256×704 | 360×360 (0.3m) | 200×200 (0.5m) | 19GB | 2.7s | - | - |
| **A: 2× input** | 512×1408 | 360×360 (0.3m) | 200×200 (0.5m) | 50GB❌ | 10s | +5% | ⭐ |
| **B: 2× BEV** | 256×704 | 720×720 (0.15m) | 400×400 (0.25m) | 27GB✅ | 4.5s | +20% | ⭐⭐⭐⭐⭐ |
| **C: 1.5× hybrid** | 384×1056 | 540×540 (0.2m) | 300×300 (0.33m) | 23GB✅ | 3.8s | +13% | ⭐⭐⭐⭐ |

**Conclusion**: Plan B is the best option!

---

## 🚀 Implementation Plan

### Immediate action (October 30)

```bash
# Step 1: create the high-resolution BEV config
cat > configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_HIGHRES_BEV2X.yaml << 'CONFIG'
_base_: ./multitask_enhanced_phase1_HIGHRES.yaml

# Override the BEV resolution settings
model:
  encoders:
    camera:
      vtransform:
        xbound: [-54.0, 54.0, 0.15]  # 720 grids
        ybound: [-54.0, 54.0, 0.15]
        downsample: 1  # key point: no extra downsampling

    lidar:
      voxelize:
        voxel_size: [0.0375, 0.0375, 0.2]  # raise LiDAR as well
      backbone:
        sparse_shape: [2880, 2880, 41]  # adjusted to match

  heads:
    map:
      grid_transform:
        input_scope: [[-54.0, 54.0, 0.15], [-54.0, 54.0, 0.15]]
        output_scope: [[-50, 50, 0.25], [-50, 50, 0.25]]  # 400×400

# The data pipeline must produce GT at the new resolution too
train_pipeline:
  - type: LoadBEVSegmentation
    xbound: [-50.0, 50.0, 0.25]  # 400×400
    ybound: [-50.0, 50.0, 0.25]
    classes: ${map_classes}

val_pipeline:
  - type: LoadBEVSegmentation
    xbound: [-50.0, 50.0, 0.25]
    ybound: [-50.0, 50.0, 0.25]
    classes: ${map_classes}
CONFIG

# Step 2: quick validation run (5 epochs)
torchpack dist-run -np 6 python tools/train.py \
  configs/.../multitask_enhanced_HIGHRES_BEV2X.yaml \
  --load_from runs/enhanced_from_epoch19/epoch_10.pth \
  --train.max_epochs=5 \
  --run-dir runs/highres_bev_2x_test

# Step 3: watch memory usage and metrics
# If memory is OK and the gain is clear → full 20-epoch training
```

### Conservative fallback (if memory is insufficient)

```bash
# Drop to 1.5×
# xbound: 0.2m → 540×540
# Estimated memory: 23GB ✅ safer
```

---

## ⚠️ Caveats

### 1. Training strategy adjustments
```yaml
# Higher resolution needs a longer warmup
lr:
  warmup_iters: 1000  # up from 500
  max_lr: 0.0002  # may need tuning

# The batch size may need to shrink
data:
  samples_per_gpu: 1  # down from 2 (if memory is insufficient)
```

### 2. Data augmentation adjustments
```yaml
# Higher resolution allows stronger augmentation
augment2d:
  resize: [[0.35, 0.55], [0.48, 0.48]]  # slightly tightened
  rotate: [-8, 8]  # can be slightly increased
```

### 3. Matching the LiDAR resolution
```
# If the camera BEV goes up to 720×720 (0.15m),
# the LiDAR BEV should match; otherwise the two resolutions disagree at fusion time

Option 1: raise LiDAR to 0.15m as well
  voxel_size: [0.0375, 0.0375, 0.2]  # halved from 0.075
  sparse_shape: [2880, 2880, 41]  # doubled from 1440
  Memory increase: about +3GB

Option 2: keep the LiDAR resolution and interpolate at fusion time
  Camera BEV 720×720 → interpolate to 360×360 → fuse with LiDAR
  Loses some detail but saves memory
```
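
For Option 1, the LiDAR voxel size and sparse shape can be derived from the target BEV cell size instead of being edited by hand. The sketch below assumes the sparse encoder reduces x/y resolution by a factor of 4, which is inferred from the current pairing of 0.075 m voxels with a 0.3 m BEV; `lidar_settings` and its parameter names are hypothetical.

```python
def lidar_settings(bev_cell_m, extent_m=108.0, encoder_stride=4,
                   z_voxel_m=0.2, z_cells=41):
    """Derive voxel_size and sparse_shape for a target BEV cell size.

    encoder_stride=4 is an assumption inferred from 0.3 m / 0.075 m.
    """
    voxel_xy = bev_cell_m / encoder_stride
    cells_xy = round(extent_m / voxel_xy)  # round() absorbs float noise
    return {
        "voxel_size": [voxel_xy, voxel_xy, z_voxel_m],
        "sparse_shape": [cells_xy, cells_xy, z_cells],
    }


print(lidar_settings(0.3))   # current LiDAR settings
print(lidar_settings(0.15))  # settings matching a 720x720 camera BEV
```

Deriving both values from one number keeps the camera and LiDAR branches aligned when other cell sizes, such as 0.2 m, are tried.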

---

## 💻 Code Example

### Dynamic resolution configuration

```python
# tools/train_adaptive_resolution.py

def train_with_adaptive_resolution(
    base_config,
    bev_resolution=0.15,  # BEV grid size in meters
    image_scale=1.0,      # input image scale factor
):
    """
    Derive the resolution settings for adaptive-resolution training.

    Args:
        base_config: base config file name (not read here; kept for reference)
        bev_resolution: BEV grid resolution (meters)
            - 0.3: current (360×360)
            - 0.2: 1.5× (540×540)
            - 0.15: 2× (720×720)

        image_scale: input image scale factor
            - 1.0: 256×704
            - 1.5: 384×1056
            - 2.0: 512×1408
    """

    # Number of BEV grid cells; round() avoids float truncation
    bev_range = 108  # -54 to 54
    bev_grids = round(bev_range / bev_resolution)

    # Output resolution keeps the current 5:3 ratio (0.5m output on a 0.3m BEV)
    output_range = 100  # -50 to 50
    output_resolution = bev_resolution * 5 / 3
    output_grids = round(output_range / output_resolution)

    # Input size
    base_h, base_w = 256, 704
    input_h = int(base_h * image_scale)
    input_w = int(base_w * image_scale)

    print("Config:")
    print(f"  Input image: {input_h}×{input_w}")
    print(f"  BEV features: {bev_grids}×{bev_grids} ({bev_resolution}m/grid)")
    print(f"  Segmentation output: {output_grids}×{output_grids} ({output_resolution:.2f}m/grid)")

    # Rough memory estimate (GB)
    image_mem = input_h * input_w * 6 * 3 * 4 / 1e9
    bev_mem = bev_grids * bev_grids * 256 * 4 / 1e9 * 4  # ×4 layers
    backbone_mem = 5 * image_scale ** 2  # ~5GB at 256×704, scales with image area
    total_mem = backbone_mem + image_mem + bev_mem + 10  # backbone + image + bev + other

    print(f"  Estimated memory: {total_mem:.1f} GB")

    if total_mem > 30:
        print("  ⚠️ Memory may be insufficient; consider a smaller batch size")

    return {
        'image_size': [input_h, input_w],
        'bev_resolution': bev_resolution,
        'bev_grids': bev_grids,
        'output_resolution': output_resolution,
        'output_grids': output_grids,
    }


# Usage examples
config_2x = train_with_adaptive_resolution(
    base_config='multitask_enhanced_phase1_HIGHRES.yaml',
    bev_resolution=0.15,  # 2× BEV
    image_scale=1.0,      # input unchanged
)

config_1_5x = train_with_adaptive_resolution(
    base_config='multitask_enhanced_phase1_HIGHRES.yaml',
    bev_resolution=0.2,   # 1.5× BEV
    image_scale=1.5,      # input also 1.5×
)
```

---

## 📋 Summary

### Key conclusions

**Question**: raise the input resolution or the BEV resolution?

**Answer**: **Raise the BEV resolution first!** ⭐⭐⭐⭐⭐

**Reasons**:
1. ✅ Directly improves the spatial representation
2. ✅ Immediate benefit for small-object segmentation (+20% mIoU)
3. ✅ Memory stays manageable (27GB vs 50GB)
4. ✅ Acceptable training time (×1.7 vs ×4)
5. ✅ Works for both detection and segmentation

**Timing**:
- **Now**: after Phase 3 completes (October 30)
- **Experiment**: 5-epoch quick validation
- **Full run**: if results are good, train 20 epochs

### Recommended configs

**Best config**:
```yaml
Input image: 256×704 (unchanged)
BEV features: 720×720 (0.15m resolution)
Segmentation output: 400×400 (0.25m resolution)

Expected: mIoU 0.39 → 0.47 (+20%)
Cost: 27GB memory, 1.7× slower training
```

**Compromise config** (if memory is tight):
```yaml
Input image: 256×704 (unchanged)
BEV features: 540×540 (0.2m resolution)
Segmentation output: 300×300 (0.33m resolution)

Expected: mIoU 0.39 → 0.44 (+13%)
Cost: 23GB memory, 1.5× slower training
```

### Next steps

**After Phase 3 completes (October 30)**:
1. Create the high-resolution BEV config file
2. Run a 5-epoch quick experiment
3. Evaluate the performance gain
4. Decide whether to run full training

**At on-vehicle deployment (January 2026)**:
- Resolution can be adjusted dynamically to match Orin performance
- Train at high resolution; lower it at deployment if needed
- Trade accuracy against speed