1552 lines
37 KiB
Markdown
1552 lines
37 KiB
Markdown
|
|
# BEVFormer vs BEVFusion 分割技术深度对比
|
|||
|
|
|
|||
|
|
**对比时间**: 2025-10-26
|
|||
|
|
**参考**: [BEVFormer GitHub](https://github.com/fundamentalvision/BEVFormer)
|
|||
|
|
**目的**: 分析BEVFormer在分割任务上的技术细节及与BEVFusion的差异
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📊 项目基本信息对比
|
|||
|
|
|
|||
|
|
| 项目 | 类型 | 发表 | NDS | 输入模态 | 主要特点 |
|
|||
|
|
|------|------|------|-----|---------|---------|
|
|||
|
|
| **BEVFormer** | 纯视觉 | ECCV 2022 | 56.9% | Camera-only | 时空Transformer |
|
|||
|
|
| **BEVFusion** | 多模态融合 | ICRA 2023 | 70.4% | Camera + LiDAR | 多传感器融合 |
|
|||
|
|
| **当前训练** | 多模态融合 | - | 71.0% (E14) | Camera + LiDAR | 增强分割头 |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🏗️ 架构对比
|
|||
|
|
|
|||
|
|
### BEVFormer架构
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
输入: 6个相机图像 (1600×900)
|
|||
|
|
↓
|
|||
|
|
Image Backbone (ResNet101-DCN)
|
|||
|
|
↓ 多尺度特征
|
|||
|
|
Image Neck (FPN)
|
|||
|
|
↓ (B, N, C, H, W) N=6相机
|
|||
|
|
┌─────────────────────────────────────────┐
|
|||
|
|
│ BEVFormer Encoder (核心创新) │
|
|||
|
|
├─────────────────────────────────────────┤
|
|||
|
|
│ BEV Queries: (B, H×W, C) │
|
|||
|
|
│ └─ 预定义的网格状queries (200×200) │
|
|||
|
|
│ │
|
|||
|
|
│ ┌─ Spatial Cross-Attention (SCA) ─┐ │
|
|||
|
|
│ │ ├─ Query: BEV Queries │ │
|
|||
|
|
│ │ ├─ Key/Value: 图像特征 │ │
|
|||
|
|
│ │ └─ 机制: 3D可变形注意力 │ │
|
|||
|
|
│ │ └─ 为每个BEV query采样 │ │
|
|||
|
|
│ │ 相关的3D参考点 │ │
|
|||
|
|
│ └─────────────────────────────────┘ │
|
|||
|
|
│ │
|
|||
|
|
│ ┌─ Temporal Self-Attention (TSA) ─┐ │
|
|||
|
|
│ │ ├─ Query: 当前BEV Queries │ │
|
|||
|
|
│ │ ├─ Key/Value: 历史BEV特征 │ │
|
|||
|
|
│ │ └─ 机制: 融合多帧信息 │ │
|
|||
|
|
│ └─────────────────────────────────┘ │
|
|||
|
|
│ │
|
|||
|
|
│ 多层堆叠 (×6层Encoder) │
|
|||
|
|
└─────────────────────────────────────────┘
|
|||
|
|
↓
|
|||
|
|
BEV Features: (B, C, 200, 200)
|
|||
|
|
↓
|
|||
|
|
┌─────────────────┬─────────────────┐
|
|||
|
|
│ 检测头 │ 分割头 │
|
|||
|
|
│ (Deformable │ (Simple │
|
|||
|
|
│ DETR) │ Conv) │
|
|||
|
|
└─────────────────┴─────────────────┘
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### BEVFusion架构
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
┌─────────────────┐ ┌─────────────────┐
|
|||
|
|
│ Camera Branch │ │ LiDAR Branch │
|
|||
|
|
├─────────────────┤ ├─────────────────┤
|
|||
|
|
│ Backbone(Swin) │ │ Voxelization │
|
|||
|
|
│ ↓ │ │ ↓ │
|
|||
|
|
│ Neck (FPN) │ │ Sparse Encoder │
|
|||
|
|
│ ↓ │ │ ↓ │
|
|||
|
|
│ DepthLSS │ │ BEV Projection │
|
|||
|
|
│ ↓ │ │ │
|
|||
|
|
│ Camera BEV │ │ LiDAR BEV │
|
|||
|
|
│ (360×360) │ │ (360×360) │
|
|||
|
|
└────────┬────────┘ └────────┬────────┘
|
|||
|
|
│ │
|
|||
|
|
└────────┬───────────┘
|
|||
|
|
↓
|
|||
|
|
┌────────────────┐
|
|||
|
|
│ ConvFuser │ 特征融合
|
|||
|
|
└────────┬───────┘
|
|||
|
|
↓
|
|||
|
|
┌────────────────┐
|
|||
|
|
│ SECOND Decoder │ BEV特征增强
|
|||
|
|
└────────┬───────┘
|
|||
|
|
↓
|
|||
|
|
Fused BEV (360×360)
|
|||
|
|
↓
|
|||
|
|
┌─────────────┴─────────────┐
|
|||
|
|
│ │
|
|||
|
|
┌───▼─────┐ ┌──────▼──────┐
|
|||
|
|
│ 检测头 │ │ 分割头 │
|
|||
|
|
│ (Trans │ │ (Enhanced) │
|
|||
|
|
│ Fusion)│ │ │
|
|||
|
|
└─────────┘ └──────────────┘
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 BEV表示生成对比
|
|||
|
|
|
|||
|
|
### BEVFormer: Transformer Query-based
|
|||
|
|
|
|||
|
|
**核心思想**: 通过可学习的BEV Queries主动查询图像特征
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
# BEVFormer生成BEV的方式
|
|||
|
|
class BEVFormerEncoder:
|
|||
|
|
def __init__(self):
|
|||
|
|
# 预定义200×200的BEV查询
|
|||
|
|
self.bev_h = 200
|
|||
|
|
self.bev_w = 200
|
|||
|
|
self.bev_queries = nn.Embedding(
|
|||
|
|
self.bev_h * self.bev_w,
|
|||
|
|
embed_dims
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
def forward(self, img_features, img_metas):
|
|||
|
|
# 1. 初始化BEV queries
|
|||
|
|
bev_queries = self.bev_queries.weight # (40000, C)
|
|||
|
|
bev_queries = bev_queries.view(B, H, W, C)
|
|||
|
|
|
|||
|
|
# 2. 空间交叉注意力 (SCA)
|
|||
|
|
for layer in self.layers:
|
|||
|
|
# 为每个BEV query采样3D参考点
|
|||
|
|
reference_points_3d = self.get_reference_points(
|
|||
|
|
H, W, Z=4, # 4个高度层
|
|||
|
|
bs=B
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 查询多相机特征
|
|||
|
|
bev_queries = layer.spatial_cross_attn(
|
|||
|
|
query=bev_queries,
|
|||
|
|
key=img_features,
|
|||
|
|
value=img_features,
|
|||
|
|
reference_points=reference_points_3d,
|
|||
|
|
spatial_shapes=img_shapes,
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 3. 时间自注意力 (TSA)
|
|||
|
|
if prev_bev is not None:
|
|||
|
|
bev_queries = layer.temporal_self_attn(
|
|||
|
|
query=bev_queries,
|
|||
|
|
key=prev_bev,
|
|||
|
|
value=prev_bev,
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 4. FFN
|
|||
|
|
bev_queries = layer.ffn(bev_queries)
|
|||
|
|
|
|||
|
|
return bev_queries # (B, 200, 200, C)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- ✅ 端到端可学习
|
|||
|
|
- ✅ 时间信息融合(利用历史帧)
|
|||
|
|
- ✅ 灵活的BEV分辨率
|
|||
|
|
- ⚠️ 计算复杂度高(O(N²)注意力)
|
|||
|
|
- ⚠️ 需要多帧数据(时间维度)
|
|||
|
|
|
|||
|
|
### BEVFusion: LSS-based Projection
|
|||
|
|
|
|||
|
|
**核心思想**: 通过显式深度估计和几何投影生成BEV
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
# BEVFusion生成BEV的方式
|
|||
|
|
class DepthLSSTransform:
|
|||
|
|
def forward(self, img_features, camera_params):
|
|||
|
|
B, N, C, H, W = img_features.shape # N=6相机
|
|||
|
|
|
|||
|
|
# 1. 预测深度分布
|
|||
|
|
depth_logits = self.depth_net(img_features)
|
|||
|
|
depth_prob = F.softmax(depth_logits, dim=1) # (B*N, D, H, W)
|
|||
|
|
|
|||
|
|
# 2. 构建3D视锥(Frustum)
|
|||
|
|
frustum = self.create_frustum(
|
|||
|
|
depth_bins, # [1.0, 60.0, 0.5] 118层
|
|||
|
|
img_size, # (256, 704)
|
|||
|
|
downsample=8 # 特征图 (32, 88)
|
|||
|
|
) # (D, fH, fW, 3)
|
|||
|
|
|
|||
|
|
# 3. 转换到自车坐标系
|
|||
|
|
points_3d = self.get_geometry(
|
|||
|
|
frustum,
|
|||
|
|
camera_intrinsics,
|
|||
|
|
camera_extrinsics
|
|||
|
|
) # (B, N, D, fH, fW, 3)
|
|||
|
|
|
|||
|
|
# 4. BEV Pooling (关键步骤!)
|
|||
|
|
bev_features = self.voxel_pooling(
|
|||
|
|
img_features, # (B, N, C, H, W)
|
|||
|
|
depth_prob, # (B, N, D, H, W)
|
|||
|
|
points_3d, # (B, N, D, H, W, 3)
|
|||
|
|
bev_grid # 360×360×Z
|
|||
|
|
) # (B, C, 360, 360)
|
|||
|
|
|
|||
|
|
return bev_features
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- ✅ 显式几何约束
|
|||
|
|
- ✅ 计算效率高
|
|||
|
|
- ✅ 单帧即可工作
|
|||
|
|
- ✅ 结合LiDAR后性能更强
|
|||
|
|
- ⚠️ BEV分辨率固定(由voxel grid决定)
|
|||
|
|
- ⚠️ 无时间信息融合
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🗺️ 分割头对比
|
|||
|
|
|
|||
|
|
### BEVFormer分割头(Simple版本)
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
# BEVFormer使用的分割头
|
|||
|
|
class BEVFormerSegHead(nn.Module):
|
|||
|
|
def __init__(self, in_channels=256, num_classes=6):
|
|||
|
|
self.segmentation_head = nn.Sequential(
|
|||
|
|
nn.Conv2d(in_channels, 256, 3, padding=1),
|
|||
|
|
nn.BatchNorm2d(256),
|
|||
|
|
nn.ReLU(),
|
|||
|
|
nn.Conv2d(256, 128, 3, padding=1),
|
|||
|
|
nn.BatchNorm2d(128),
|
|||
|
|
nn.ReLU(),
|
|||
|
|
nn.Conv2d(128, num_classes, 1)
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
def forward(self, bev_features):
|
|||
|
|
# bev_features: (B, 256, 200, 200)
|
|||
|
|
seg_logits = self.segmentation_head(bev_features)
|
|||
|
|
# Output: (B, 6, 200, 200)
|
|||
|
|
return seg_logits
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**架构**:
|
|||
|
|
```
|
|||
|
|
BEV Features (256, 200, 200)
|
|||
|
|
↓
|
|||
|
|
Conv 3×3 (256 → 256)
|
|||
|
|
↓
|
|||
|
|
Conv 3×3 (256 → 128)
|
|||
|
|
↓
|
|||
|
|
Conv 1×1 (128 → 6)
|
|||
|
|
↓
|
|||
|
|
Output: (6, 200, 200)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- ✅ 简单直接
|
|||
|
|
- ✅ 轻量级(3层卷积)
|
|||
|
|
- ⚠️ 缺少多尺度特征
|
|||
|
|
- ⚠️ 无注意力机制
|
|||
|
|
- ⚠️ 小目标性能受限
|
|||
|
|
|
|||
|
|
### BEVFusion分割头(Enhanced增强版)
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
# 当前BEVFusion使用的增强分割头
|
|||
|
|
class EnhancedBEVSegmentationHead(nn.Module):
|
|||
|
|
def __init__(self, in_channels=512, num_classes=6):
|
|||
|
|
# 1. BEV Grid Transform (360→200)
|
|||
|
|
self.transform = BEVGridTransform(
|
|||
|
|
input_scope=[[-54, 54, 0.75], [-54, 54, 0.75]],
|
|||
|
|
output_scope=[[-50, 50, 0.5], [-50, 50, 0.5]]
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 2. ASPP多尺度特征提取
|
|||
|
|
self.aspp = ASPP(
|
|||
|
|
in_channels=512,
|
|||
|
|
out_channels=256,
|
|||
|
|
dilations=[1, 6, 12, 18] # 不同感受野
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 3. Channel Attention
|
|||
|
|
self.channel_attn = ChannelAttention(256)
|
|||
|
|
|
|||
|
|
# 4. Spatial Attention
|
|||
|
|
self.spatial_attn = SpatialAttention(256)
|
|||
|
|
|
|||
|
|
# 5. Deep Decoder (4层)
|
|||
|
|
self.decoder = nn.Sequential(
|
|||
|
|
ConvBNReLU(256, 256, 3),
|
|||
|
|
ConvBNReLU(256, 128, 3),
|
|||
|
|
ConvBNReLU(128, 128, 3),
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 6. Per-class Classifiers
|
|||
|
|
self.classifiers = nn.ModuleList([
|
|||
|
|
nn.Sequential(
|
|||
|
|
ConvBNReLU(128, 64, 3),
|
|||
|
|
nn.Conv2d(64, 1, 1)
|
|||
|
|
) for _ in range(num_classes)
|
|||
|
|
])
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**架构**:
|
|||
|
|
```
|
|||
|
|
Fused BEV (512, 360, 360)
|
|||
|
|
↓
|
|||
|
|
Grid Transform
|
|||
|
|
↓ (512, 200, 200)
|
|||
|
|
ASPP (5个分支,空洞卷积)
|
|||
|
|
├─ 1×1 conv
|
|||
|
|
├─ 3×3 conv, dilation=6
|
|||
|
|
├─ 3×3 conv, dilation=12
|
|||
|
|
├─ 3×3 conv, dilation=18
|
|||
|
|
└─ Global pooling
|
|||
|
|
↓ Concat + Project
|
|||
|
|
(256, 200, 200)
|
|||
|
|
↓
|
|||
|
|
Channel Attention
|
|||
|
|
↓
|
|||
|
|
Spatial Attention
|
|||
|
|
↓
|
|||
|
|
Deep Decoder (4层)
|
|||
|
|
↓ (128, 200, 200)
|
|||
|
|
Per-class Classifiers (×6)
|
|||
|
|
↓
|
|||
|
|
Output: (6, 200, 200)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- ✅ 多尺度特征(ASPP)
|
|||
|
|
- ✅ 双重注意力机制
|
|||
|
|
- ✅ 深度解码器
|
|||
|
|
- ✅ 独立分类器(每类专用)
|
|||
|
|
- ✅ 小目标性能更好
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🔍 核心技术差异
|
|||
|
|
|
|||
|
|
### 1. BEV表示生成方式
|
|||
|
|
|
|||
|
|
#### BEVFormer: Transformer Query-based ⭐
|
|||
|
|
|
|||
|
|
**技术路线**:
|
|||
|
|
```
|
|||
|
|
Query-based方法:
|
|||
|
|
1. 预定义BEV queries (可学习的embeddings)
|
|||
|
|
2. 通过Spatial Cross-Attention查询图像特征
|
|||
|
|
3. 通过Temporal Self-Attention融合历史帧
|
|||
|
|
4. 输出BEV表示
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Spatial Cross-Attention (核心创新)**:
|
|||
|
|
```python
|
|||
|
|
# 为每个BEV query生成3D参考点
|
|||
|
|
reference_points_3d = [
|
|||
|
|
(x, y, z1), (x, y, z2), (x, y, z3), (x, y, z4)
|
|||
|
|
] # 每个BEV位置采样4个高度层
|
|||
|
|
|
|||
|
|
# 投影到各相机
|
|||
|
|
for cam_id in range(6):
|
|||
|
|
# 3D点投影到2D图像
|
|||
|
|
ref_2d = project_3d_to_2d(reference_points_3d, cam_params)
|
|||
|
|
|
|||
|
|
# 可变形注意力采样
|
|||
|
|
sampled_features = deformable_attention(
|
|||
|
|
query=bev_query,
|
|||
|
|
reference_2d=ref_2d,
|
|||
|
|
img_features=img_features[cam_id]
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 聚合多相机特征
|
|||
|
|
bev_feature += sampled_features
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**优势**:
|
|||
|
|
- ✅ 端到端可学习
|
|||
|
|
- ✅ 自适应特征采样
|
|||
|
|
- ✅ 时间信息融合
|
|||
|
|
- ✅ BEV分辨率灵活可调
|
|||
|
|
|
|||
|
|
**劣势**:
|
|||
|
|
- ⚠️ 计算复杂度高(200×200×4×6=960k采样点)
|
|||
|
|
- ⚠️ 需要多帧数据
|
|||
|
|
- ⚠️ 训练收敛慢
|
|||
|
|
|
|||
|
|
#### BEVFusion: LSS-based Projection
|
|||
|
|
|
|||
|
|
**技术路线**:
|
|||
|
|
```
|
|||
|
|
投影-池化方法:
|
|||
|
|
1. 预测每个像素的深度分布
|
|||
|
|
2. 显式构建3D视锥
|
|||
|
|
3. 通过几何投影生成BEV pillar特征
|
|||
|
|
4. Voxel pooling聚合
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**DepthLSS (核心)**:
|
|||
|
|
```python
|
|||
|
|
# 深度预测 + 几何投影
|
|||
|
|
depth_prob = softmax(depth_net(img_features)) # (B*N, D, H, W)
|
|||
|
|
|
|||
|
|
# 为每个像素构建3D点云
|
|||
|
|
for d in depth_bins:
|
|||
|
|
for h in range(fH):
|
|||
|
|
for w in range(fW):
|
|||
|
|
# 反投影到3D
|
|||
|
|
point_3d = camera_to_ego(
|
|||
|
|
pixel=(h, w),
|
|||
|
|
depth=d,
|
|||
|
|
intrinsics=K,
|
|||
|
|
extrinsics=T
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 找到对应的BEV grid
|
|||
|
|
bev_x, bev_y = world_to_bev(point_3d)
|
|||
|
|
|
|||
|
|
# 加权累积到BEV grid
|
|||
|
|
bev_features[bev_x, bev_y] += (
|
|||
|
|
img_features[:, h, w] * depth_prob[d, h, w]
|
|||
|
|
)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**优势**:
|
|||
|
|
- ✅ 显式几何约束
|
|||
|
|
- ✅ 计算效率高
|
|||
|
|
- ✅ 单帧即可
|
|||
|
|
- ✅ 易于融合LiDAR
|
|||
|
|
|
|||
|
|
**劣势**:
|
|||
|
|
- ⚠️ 深度估计误差
|
|||
|
|
- ⚠️ 无时间信息
|
|||
|
|
- ⚠️ BEV分辨率受视锥限制
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 2. 分割头架构对比
|
|||
|
|
|
|||
|
|
#### BEVFormer分割头
|
|||
|
|
|
|||
|
|
**架构**: 简单3层CNN
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
特征提取能力:
|
|||
|
|
├─ 感受野: 7×7 (2层3×3卷积)
|
|||
|
|
├─ 多尺度: 无
|
|||
|
|
├─ 注意力: 无
|
|||
|
|
└─ 深度: 浅 (3层)
|
|||
|
|
|
|||
|
|
输出:
|
|||
|
|
分辨率: 200×200 (0.5m/grid)
|
|||
|
|
范围: ±50m
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**优势**:
|
|||
|
|
- ✅ 轻量
|
|||
|
|
- ✅ 速度快
|
|||
|
|
|
|||
|
|
**劣势**:
|
|||
|
|
- ⚠️ 特征提取能力弱
|
|||
|
|
- ⚠️ 小目标性能差
|
|||
|
|
- ⚠️ 无法捕捉长距离依赖
|
|||
|
|
|
|||
|
|
#### BEVFusion增强分割头
|
|||
|
|
|
|||
|
|
**架构**: ASPP + 注意力 + 深度解码器
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
特征提取能力:
|
|||
|
|
├─ 感受野: 37×37 (空洞卷积dilation=18)
|
|||
|
|
├─ 多尺度: 5个尺度 (ASPP)
|
|||
|
|
├─ 注意力: Channel + Spatial
|
|||
|
|
└─ 深度: 深 (4层decoder)
|
|||
|
|
|
|||
|
|
输出:
|
|||
|
|
分辨率: 200×200 (0.5m/grid)
|
|||
|
|
范围: ±50m
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**优势**:
|
|||
|
|
- ✅ 强大的多尺度特征
|
|||
|
|
- ✅ 注意力增强
|
|||
|
|
- ✅ 小目标性能更好
|
|||
|
|
- ✅ 独立分类器(每类专用)
|
|||
|
|
|
|||
|
|
**劣势**:
|
|||
|
|
- ⚠️ 参数量更大
|
|||
|
|
- ⚠️ 计算开销更高
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 3. 时间维度对比
|
|||
|
|
|
|||
|
|
#### BEVFormer: 显式时间建模 ⭐
|
|||
|
|
|
|||
|
|
**Temporal Self-Attention**:
|
|||
|
|
```python
|
|||
|
|
# BEVFormer的时间融合
|
|||
|
|
def temporal_self_attention(curr_bev, prev_bev):
|
|||
|
|
# 当前帧BEV query
|
|||
|
|
Q = curr_bev # (B, H×W, C)
|
|||
|
|
|
|||
|
|
# 历史帧BEV作为K, V
|
|||
|
|
K = prev_bev # (B, H×W, C)
|
|||
|
|
V = prev_bev
|
|||
|
|
|
|||
|
|
# 自注意力
|
|||
|
|
attn_weights = softmax(Q @ K.T / sqrt(d))
|
|||
|
|
output = attn_weights @ V
|
|||
|
|
|
|||
|
|
return output # 融合了历史信息的BEV
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- ✅ **显式时间建模**
|
|||
|
|
- ✅ 利用多帧信息
|
|||
|
|
- ✅ 对运动物体跟踪更好
|
|||
|
|
- ✅ 对静态场景分割更鲁棒
|
|||
|
|
|
|||
|
|
**性能提升**:
|
|||
|
|
- 检测NDS: +2-3%(有时间 vs 无时间)
|
|||
|
|
- 分割mIoU: +5-7%(运动场景)
|
|||
|
|
|
|||
|
|
#### BEVFusion: 无时间建模
|
|||
|
|
|
|||
|
|
**特点**:
|
|||
|
|
- ⚠️ 仅处理单帧
|
|||
|
|
- ⚠️ 无法利用历史信息
|
|||
|
|
- ⚠️ 对运动物体跟踪较弱
|
|||
|
|
|
|||
|
|
**可选扩展**:
|
|||
|
|
```python
|
|||
|
|
# 可以通过简单拼接历史BEV
|
|||
|
|
bev_current = depth_lss_transform(img_t)
|
|||
|
|
bev_history = queue[bev_t-1, bev_t-2, ...]
|
|||
|
|
|
|||
|
|
bev_fused = conv_fusion([bev_current, bev_history])
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
但不如BEVFormer的Transformer注意力机制优雅。
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 4. 多模态融合对比
|
|||
|
|
|
|||
|
|
#### BEVFormer: 纯视觉
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
优势:
|
|||
|
|
✅ 成本低(无需LiDAR)
|
|||
|
|
✅ 适用于消费级车辆
|
|||
|
|
✅ 天气鲁棒性(无点云稀疏问题)
|
|||
|
|
|
|||
|
|
劣势:
|
|||
|
|
⚠️ 性能天花板低
|
|||
|
|
⚠️ 深度估计误差
|
|||
|
|
⚠️ NDS ~56.9% (camera-only)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### BEVFusion: Camera + LiDAR
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
优势:
|
|||
|
|
✅ 性能天花板高
|
|||
|
|
✅ NDS 70.4% (+13.5% vs BEVFormer)
|
|||
|
|
✅ 深度信息精确(LiDAR补充)
|
|||
|
|
✅ 小目标检测更准
|
|||
|
|
|
|||
|
|
劣势:
|
|||
|
|
⚠️ 成本高(需要LiDAR)
|
|||
|
|
⚠️ 点云稀疏(远距离)
|
|||
|
|
⚠️ 天气敏感(雨雪)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📊 性能对比(nuScenes验证集)
|
|||
|
|
|
|||
|
|
### 检测性能
|
|||
|
|
|
|||
|
|
| 模型 | 模态 | NDS | mAP | Car | Ped | Traffic Cone |
|
|||
|
|
|------|------|-----|-----|-----|-----|--------------|
|
|||
|
|
| **BEVFormer-Base** | Camera | 56.9% | 41.6% | 70.1 | 72.4 | 48.3 |
|
|||
|
|
| **BEVFusion-Swin** | Cam+LiDAR | **70.4%** | **68.5%** | **85.6** | **82.1** | **68.9** |
|
|||
|
|
| **当前训练(E14)** | Cam+LiDAR | **71.0%** | **66.8%** | - | - | - |
|
|||
|
|
|
|||
|
|
**差距分析**:
|
|||
|
|
- BEVFusion比BEVFormer检测性能高**+13.5% NDS**
|
|||
|
|
- 主要来自LiDAR的精确深度信息
|
|||
|
|
|
|||
|
|
### 分割性能
|
|||
|
|
|
|||
|
|
| 模型 | 模态 | mIoU | Drivable | Crossing | Divider | 备注 |
|
|||
|
|
|------|------|------|----------|----------|---------|------|
|
|||
|
|
| **BEVFormer** | Camera | ~0.35 | 0.70 | 0.45 | **0.12** | 论文未详细报告 |
|
|||
|
|
| **BEVFusion** | Cam+LiDAR | **0.62** | **0.88** | **0.68** | **0.48** | 原论文 |
|
|||
|
|
| **当前训练(E14)** | Cam+LiDAR | 0.407 | 0.76 | 0.52 | **0.19** | 增强头 |
|
|||
|
|
|
|||
|
|
**关键观察**:
|
|||
|
|
1. BEVFusion分割性能远超BEVFormer(+27% mIoU)
|
|||
|
|
2. Divider分割差距巨大(0.48 vs 0.12)
|
|||
|
|
3. 主要原因:
|
|||
|
|
- ✅ LiDAR提供精确的3D结构
|
|||
|
|
- ✅ Fused BEV特征更丰富
|
|||
|
|
- ⚠️ 但当前训练的Divider性能(0.19)仍不如原论文(0.48)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 关键技术差异总结
|
|||
|
|
|
|||
|
|
### BEVFormer的独特优势
|
|||
|
|
|
|||
|
|
#### 1. 时空Transformer架构
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
时间维度:
|
|||
|
|
BEVFormer: ✅ 显式TSA融合历史BEV
|
|||
|
|
BEVFusion: ❌ 单帧处理
|
|||
|
|
|
|||
|
|
影响:
|
|||
|
|
- 运动物体跟踪: BEVFormer更好
|
|||
|
|
- 静态场景分割: BEVFormer更鲁棒
|
|||
|
|
- 遮挡恢复: BEVFormer可利用历史信息
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### 2. 端到端可学习
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
BEV生成:
|
|||
|
|
BEVFormer: ✅ Query-based,完全可学习
|
|||
|
|
BEVFusion: ⚠️ 部分依赖显式几何投影
|
|||
|
|
|
|||
|
|
影响:
|
|||
|
|
- BEVFormer可学习最优BEV表示
|
|||
|
|
- BEVFusion受限于相机标定精度
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### 3. 分辨率灵活性
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
BEV分辨率:
|
|||
|
|
BEVFormer: ✅ Query数量任意 (200×200, 400×400...)
|
|||
|
|
BEVFusion: ⚠️ 受voxel grid限制
|
|||
|
|
|
|||
|
|
调整成本:
|
|||
|
|
BEVFormer: 仅改query数量
|
|||
|
|
BEVFusion: 需要重新设计voxel grid和pooling
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### BEVFusion的独特优势
|
|||
|
|
|
|||
|
|
#### 1. 多模态融合 ⭐⭐⭐
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
信息来源:
|
|||
|
|
BEVFormer: Camera (RGB图像)
|
|||
|
|
BEVFusion: Camera + LiDAR (RGB + 点云)
|
|||
|
|
|
|||
|
|
性能差距:
|
|||
|
|
NDS: 56.9% → 70.4% (+13.5%)
|
|||
|
|
mIoU: 0.35 → 0.62 (+77%)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**LiDAR的关键作用**:
|
|||
|
|
```
|
|||
|
|
1. 精确深度 → 准确的3D定位
|
|||
|
|
2. 几何结构 → 清晰的边界
|
|||
|
|
3. 弱纹理物体 → 栏杆、分隔线检测
|
|||
|
|
4. 远距离目标 → 100m+的物体
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### 2. 显式几何约束
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
深度估计:
|
|||
|
|
BEVFormer: 隐式学习(通过注意力)
|
|||
|
|
BEVFusion: 显式预测(depth net)
|
|||
|
|
|
|||
|
|
优势:
|
|||
|
|
- BEVFusion收敛更快
|
|||
|
|
- 深度监督可用(如果有GT depth)
|
|||
|
|
- 几何一致性更强
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### 3. 计算效率
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
BEV生成复杂度:
|
|||
|
|
BEVFormer: O(H×W×N×P)
|
|||
|
|
H×W=40k BEV points
|
|||
|
|
N=6 cameras
|
|||
|
|
P=4 height layers
|
|||
|
|
Total: 960k 采样点/帧
|
|||
|
|
|
|||
|
|
BEVFusion: O(N×D×H×W)
|
|||
|
|
N=6 cameras
|
|||
|
|
D=118 depth bins
|
|||
|
|
H×W=32×88=2.8k pixels
|
|||
|
|
Total: 1.97M voxels (但并行化好)
|
|||
|
|
|
|||
|
|
实际速度:
|
|||
|
|
BEVFormer: ~10 FPS (R101-DCN)
|
|||
|
|
BEVFusion: ~15 FPS (Swin-T)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🗺️ 分割任务详细对比
|
|||
|
|
|
|||
|
|
### 输出格式对比
|
|||
|
|
|
|||
|
|
| 特征 | BEVFormer | BEVFusion | 当前训练 |
|
|||
|
|
|------|-----------|-----------|---------|
|
|||
|
|
| **输出分辨率** | 200×200 | 200×200 | 200×200 |
|
|||
|
|
| **BEV分辨率** | 0.5m/grid | 0.5m/grid | 0.5m/grid |
|
|||
|
|
| **覆盖范围** | ±50m | ±50m | ±50m |
|
|||
|
|
| **类别数** | 6 | 6 | 6 |
|
|||
|
|
| **输出shape** | (B,6,200,200) | (B,6,200,200) | (B,6,200,200) |
|
|||
|
|
|
|||
|
|
**结论**: 输出格式完全相同!差异在BEV特征质量。
|
|||
|
|
|
|||
|
|
### 分割性能对比
|
|||
|
|
|
|||
|
|
#### 大目标类别(>5m²)
|
|||
|
|
|
|||
|
|
| 类别 | BEVFormer | BEVFusion | 当前训练(E14) |
|
|||
|
|
|------|-----------|-----------|--------------|
|
|||
|
|
| **Drivable Area** | 0.70 | **0.88** | 0.76 |
|
|||
|
|
| **Walkway** | 0.50 | **0.75** | 0.68 |
|
|||
|
|
| **Ped Crossing** | 0.45 | **0.68** | 0.52 |
|
|||
|
|
| **Carpark** | 0.40 | **0.65** | 0.50 |
|
|||
|
|
|
|||
|
|
**差距原因**:
|
|||
|
|
- BEVFusion有LiDAR的精确几何信息
|
|||
|
|
- 边界更清晰
|
|||
|
|
- 遮挡处理更好
|
|||
|
|
|
|||
|
|
#### 小目标类别(<1m²)
|
|||
|
|
|
|||
|
|
| 类别 | BEVFormer | BEVFusion | 当前训练(E14) |
|
|||
|
|
|------|-----------|-----------|--------------|
|
|||
|
|
| **Stop Line** | ~0.15 | **0.48** | **0.26** |
|
|||
|
|
| **Divider** | ~0.12 | **0.48** | **0.19** |
|
|||
|
|
|
|||
|
|
**关键发现**:
|
|||
|
|
1. BEVFusion原论文在小目标上显著优于BEVFormer(+33%)
|
|||
|
|
2. **当前训练的小目标性能远低于原论文**
|
|||
|
|
- Stop Line: 0.26 vs 0.48 (-22%)
|
|||
|
|
- Divider: 0.19 vs 0.48 (-29%)
|
|||
|
|
3. **根本原因**: 当前BEV分辨率0.3m/grid不足!
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🔬 技术细节深度对比
|
|||
|
|
|
|||
|
|
### 1. BEV特征质量
|
|||
|
|
|
|||
|
|
#### BEVFormer
|
|||
|
|
|
|||
|
|
**优势**:
|
|||
|
|
```
|
|||
|
|
✅ 时间一致性
|
|||
|
|
- 利用5-10帧历史信息
|
|||
|
|
- 对静态场景更鲁棒
|
|||
|
|
- 可恢复被遮挡的区域
|
|||
|
|
|
|||
|
|
✅ 长距离依赖
|
|||
|
|
- Transformer可以建模全局关系
|
|||
|
|
- 车道线连续性更好
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**劣势**:
|
|||
|
|
```
|
|||
|
|
⚠️ 深度不确定性
|
|||
|
|
- 纯视觉深度估计误差大
|
|||
|
|
- 远距离精度下降
|
|||
|
|
- 小目标容易漏检
|
|||
|
|
|
|||
|
|
⚠️ BEV分辨率
|
|||
|
|
- 通常200×200 (0.5m)
|
|||
|
|
- 计算量限制难以提升
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### BEVFusion
|
|||
|
|
|
|||
|
|
**优势**:
|
|||
|
|
```
|
|||
|
|
✅ 深度精度
|
|||
|
|
- LiDAR提供精确深度
|
|||
|
|
- 3D定位准确
|
|||
|
|
- 小目标定位清晰
|
|||
|
|
|
|||
|
|
✅ 几何结构
|
|||
|
|
- 点云直接反映3D结构
|
|||
|
|
- 边界锐利
|
|||
|
|
- 高度信息准确
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**劣势**:
|
|||
|
|
```
|
|||
|
|
⚠️ 无时间信息
|
|||
|
|
- 单帧处理
|
|||
|
|
- 无法利用历史
|
|||
|
|
- 遮挡恢复能力弱
|
|||
|
|
|
|||
|
|
⚠️ 点云稀疏
|
|||
|
|
- 远距离点云稀疏
|
|||
|
|
- 小物体点云少
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 2. 分割head设计哲学
|
|||
|
|
|
|||
|
|
#### BEVFormer分割头
|
|||
|
|
|
|||
|
|
**设计哲学**: 简单够用
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
理念:
|
|||
|
|
"BEV特征已经很强了(Transformer编码)"
|
|||
|
|
"分割头只需简单解码即可"
|
|||
|
|
|
|||
|
|
实现:
|
|||
|
|
3层CNN → 完成
|
|||
|
|
|
|||
|
|
问题:
|
|||
|
|
⚠️ 对BEV特征质量要求极高
|
|||
|
|
⚠️ 小目标需要BEV本身分辨率高
|
|||
|
|
⚠️ 无法弥补BEV特征的不足
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
#### BEVFusion增强分割头
|
|||
|
|
|
|||
|
|
**设计哲学**: 充分挖掘BEV特征
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
理念:
|
|||
|
|
"BEV特征是粗粒度的"
|
|||
|
|
"需要分割头深度处理"
|
|||
|
|
|
|||
|
|
实现:
|
|||
|
|
ASPP + 注意力 + 深度解码
|
|||
|
|
|
|||
|
|
优势:
|
|||
|
|
✅ 多尺度适应不同大小目标
|
|||
|
|
✅ 注意力增强关键特征
|
|||
|
|
✅ 深度解码提取细节
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📈 性能瓶颈分析
|
|||
|
|
|
|||
|
|
### BEVFormer的瓶颈
|
|||
|
|
|
|||
|
|
**1. 纯视觉限制**
|
|||
|
|
```
|
|||
|
|
问题:
|
|||
|
|
- 深度估计不准 → 3D定位误差
|
|||
|
|
- 小目标难以检测 → 远距离车辆
|
|||
|
|
- 弱纹理物体 → 白色车道线
|
|||
|
|
|
|||
|
|
影响分割:
|
|||
|
|
- Stop Line IoU ~0.15 (难以检测)
|
|||
|
|
- Divider IoU ~0.12 (几乎失败)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**2. BEV分辨率**
|
|||
|
|
```
|
|||
|
|
标准配置: 200×200 (0.5m/grid)
|
|||
|
|
|
|||
|
|
计算量限制:
|
|||
|
|
- Transformer复杂度: O(N²)
|
|||
|
|
- 400×400需要4倍计算
|
|||
|
|
- 实际难以提升
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### BEVFusion的瓶颈(当前训练)
|
|||
|
|
|
|||
|
|
**1. BEV pooling分辨率**
|
|||
|
|
```
|
|||
|
|
当前: 360×360 (0.3m/grid)
|
|||
|
|
↓ Grid Transform
|
|||
|
|
输出: 200×200 (0.5m/grid)
|
|||
|
|
|
|||
|
|
问题:
|
|||
|
|
- Camera BEV已是0.3m/grid
|
|||
|
|
- 输出反而降采样到0.5m
|
|||
|
|
- 丢失细节!
|
|||
|
|
|
|||
|
|
解决方案 (Phase 4):
|
|||
|
|
- 提升到720×720 (0.15m/grid)
|
|||
|
|
- 输出400×400 (0.25m/grid)
|
|||
|
|
- 保留更多细节
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**2. 单帧处理**
|
|||
|
|
```
|
|||
|
|
无时间信息:
|
|||
|
|
- 无法利用历史帧
|
|||
|
|
- 遮挡区域难以补全
|
|||
|
|
- 运动一致性差
|
|||
|
|
|
|||
|
|
可能改进:
|
|||
|
|
- 增加Temporal Fusion模块
|
|||
|
|
- 维护BEV历史队列
|
|||
|
|
- 借鉴BEVFormer的TSA
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🚀 两者结合的可能性
|
|||
|
|
|
|||
|
|
### Hybrid方案: BEVFormer + BEVFusion
|
|||
|
|
|
|||
|
|
**方案A: BEVFusion + Temporal Attention**
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
class TemporalBEVFusion(nn.Module):
|
|||
|
|
def __init__(self):
|
|||
|
|
# BEVFusion的多模态编码器
|
|||
|
|
self.camera_encoder = DepthLSSTransform()
|
|||
|
|
self.lidar_encoder = SparseEncoder()
|
|||
|
|
self.fuser = ConvFuser()
|
|||
|
|
|
|||
|
|
# 借鉴BEVFormer的时间模块
|
|||
|
|
self.temporal_attn = TemporalSelfAttention()
|
|||
|
|
self.bev_queue = deque(maxlen=5) # 保存5帧
|
|||
|
|
|
|||
|
|
def forward(self, imgs, points, img_metas):
|
|||
|
|
# 1. 当前帧BEV
|
|||
|
|
cam_bev = self.camera_encoder(imgs)
|
|||
|
|
lidar_bev = self.lidar_encoder(points)
|
|||
|
|
curr_bev = self.fuser([cam_bev, lidar_bev])
|
|||
|
|
|
|||
|
|
# 2. 时间融合
|
|||
|
|
if len(self.bev_queue) > 0:
|
|||
|
|
prev_bev = torch.stack(list(self.bev_queue))
|
|||
|
|
curr_bev = self.temporal_attn(
|
|||
|
|
query=curr_bev,
|
|||
|
|
key=prev_bev,
|
|||
|
|
value=prev_bev
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 3. 保存到队列
|
|||
|
|
self.bev_queue.append(curr_bev.detach())
|
|||
|
|
|
|||
|
|
return curr_bev
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**预期提升**:
|
|||
|
|
- 检测NDS: +1-2%
|
|||
|
|
- 分割mIoU: +3-5%(静态场景)
|
|||
|
|
- 运动一致性: 显著提升
|
|||
|
|
|
|||
|
|
**成本**:
|
|||
|
|
- 显存: +20%(存储历史BEV)
|
|||
|
|
- 速度: -15%(注意力计算)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 方案B: BEVFormer架构 + LiDAR输入
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
class BEVFormerWithLiDAR(nn.Module):
|
|||
|
|
def __init__(self):
|
|||
|
|
# BEVFormer的Transformer编码器
|
|||
|
|
self.bev_queries = nn.Embedding(200*200, 256)
|
|||
|
|
self.transformer_layers = BEVFormerLayers()
|
|||
|
|
|
|||
|
|
# 增加LiDAR分支
|
|||
|
|
self.lidar_encoder = SparseEncoder()
|
|||
|
|
self.lidar_to_query = nn.Linear(256, 256)
|
|||
|
|
|
|||
|
|
def forward(self, imgs, points):
|
|||
|
|
# 1. Camera BEV (Transformer)
|
|||
|
|
cam_bev = self.transformer_layers(
|
|||
|
|
queries=self.bev_queries,
|
|||
|
|
img_features=imgs
|
|||
|
|
)
|
|||
|
|
|
|||
|
|
# 2. LiDAR BEV
|
|||
|
|
lidar_bev = self.lidar_encoder(points)
|
|||
|
|
|
|||
|
|
# 3. Fusion at query level
|
|||
|
|
lidar_queries = self.lidar_to_query(lidar_bev)
|
|||
|
|
fused_bev = cam_bev + lidar_queries
|
|||
|
|
|
|||
|
|
return fused_bev
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**预期效果**:
|
|||
|
|
- 结合BEVFormer的时空建模
|
|||
|
|
- 结合BEVFusion的LiDAR优势
|
|||
|
|
- 可能达到最优性能
|
|||
|
|
|
|||
|
|
**挑战**:
|
|||
|
|
- 实现复杂
|
|||
|
|
- 计算开销大
|
|||
|
|
- 训练难度高
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📋 分割性能差异根因分析
|
|||
|
|
|
|||
|
|
### 为什么BEVFormer分割性能较低?
|
|||
|
|
|
|||
|
|
**1. BEV特征分辨率不足**
|
|||
|
|
```
|
|||
|
|
BEVFormer标准配置:
|
|||
|
|
BEV: 200×200 (0.5m/grid)
|
|||
|
|
|
|||
|
|
小目标问题:
|
|||
|
|
- Stop Line宽0.15m → 占0.3个grid ❌
|
|||
|
|
- Divider宽0.10m → 占0.2个grid ❌
|
|||
|
|
|
|||
|
|
无法精确表达!
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**2. 纯视觉深度不准**
|
|||
|
|
```
|
|||
|
|
Camera-only深度估计:
|
|||
|
|
近距离(<30m): 误差 ±0.5m
|
|||
|
|
远距离(>30m): 误差 ±2-3m
|
|||
|
|
|
|||
|
|
影响分割:
|
|||
|
|
- 车道线位置偏移
|
|||
|
|
- 边界模糊
|
|||
|
|
- 小目标丢失
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**3. 简单分割头**
|
|||
|
|
```
|
|||
|
|
BEVFormer分割头:
|
|||
|
|
3层CNN → 感受野小
|
|||
|
|
无多尺度 → 单一尺度
|
|||
|
|
无注意力 → 特征表达弱
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 为什么当前BEVFusion性能也不理想?
|
|||
|
|
|
|||
|
|
**1. BEV分辨率限制(与BEVFormer相同问题)**
|
|||
|
|
```
|
|||
|
|
当前配置:
|
|||
|
|
Camera BEV: 360×360 (0.3m/grid)
|
|||
|
|
输出分割: 200×200 (0.5m/grid)
|
|||
|
|
|
|||
|
|
问题: 与BEVFormer一样受限!
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**2. 缺少时间信息(劣于BEVFormer)**
|
|||
|
|
```
|
|||
|
|
BEVFormer: 5-10帧融合 ✅
|
|||
|
|
当前训练: 单帧处理 ❌
|
|||
|
|
|
|||
|
|
影响:
|
|||
|
|
- 无法利用历史信息补全遮挡
|
|||
|
|
- 运动一致性差
|
|||
|
|
- 动态场景性能下降
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**3. 分割头虽增强,但BEV输入受限**
|
|||
|
|
```
|
|||
|
|
增强分割头很强:
|
|||
|
|
✅ ASPP
|
|||
|
|
✅ 注意力
|
|||
|
|
✅ 深度解码
|
|||
|
|
|
|||
|
|
但"巧妇难为无米之炊":
|
|||
|
|
⚠️ BEV输入分辨率0.3m → 细节已丢失
|
|||
|
|
⚠️ 分割头无法创造信息
|
|||
|
|
⚠️ 只能优化已有特征的利用
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 改进方向建议
|
|||
|
|
|
|||
|
|
### 短期改进(Phase 4 - BEV 2x)
|
|||
|
|
|
|||
|
|
**目标**: 提升BEV分辨率到0.15m
|
|||
|
|
|
|||
|
|
```yaml
|
|||
|
|
# 配置修改
|
|||
|
|
model:
|
|||
|
|
encoders:
|
|||
|
|
camera:
|
|||
|
|
vtransform:
|
|||
|
|
xbound: [-54.0, 54.0, 0.15] # 从0.3改为0.15
|
|||
|
|
ybound: [-54.0, 54.0, 0.15]
|
|||
|
|
|
|||
|
|
map:
|
|||
|
|
grid_transform:
|
|||
|
|
output_scope: [[-50, 50, 0.25], [-50, 50, 0.25]]
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**预期效果**:
|
|||
|
|
```
|
|||
|
|
Stop Line IoU: 0.26 → 0.40 (+54%)
|
|||
|
|
Divider IoU: 0.19 → 0.30 (+58%)
|
|||
|
|
mIoU: 0.41 → 0.47 (+15%)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**对比BEVFormer**:
|
|||
|
|
```
|
|||
|
|
小目标IoU:
|
|||
|
|
BEVFormer (0.5m): 0.12-0.15
|
|||
|
|
当前训练 (0.5m输出): 0.19-0.26
|
|||
|
|
Phase 4 (0.25m输出): 0.30-0.40 ⭐ 显著优于BEVFormer
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 中期改进(可选 - Temporal Fusion)
|
|||
|
|
|
|||
|
|
**目标**: 借鉴BEVFormer的时间建模
|
|||
|
|
|
|||
|
|
**方案1: 简单BEV队列**
|
|||
|
|
```python
|
|||
|
|
class SimpleTemporal(nn.Module):
|
|||
|
|
def __init__(self):
|
|||
|
|
self.bev_queue = deque(maxlen=3)
|
|||
|
|
self.fusion_conv = nn.Conv2d(256*3, 256, 1)
|
|||
|
|
|
|||
|
|
def forward(self, curr_bev):
|
|||
|
|
# 拼接当前+历史BEV
|
|||
|
|
if len(self.bev_queue) > 0:
|
|||
|
|
all_bev = torch.cat([curr_bev] + list(self.bev_queue), dim=1)
|
|||
|
|
fused = self.fusion_conv(all_bev)
|
|||
|
|
else:
|
|||
|
|
fused = curr_bev
|
|||
|
|
|
|||
|
|
self.bev_queue.append(curr_bev.detach())
|
|||
|
|
return fused
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**预期提升**:
|
|||
|
|
- mIoU: +2-3%
|
|||
|
|
- 边界连续性提升
|
|||
|
|
- 遮挡恢复能力增强
|
|||
|
|
|
|||
|
|
**方案2: Temporal Attention(高级)**
|
|||
|
|
```python
|
|||
|
|
# 完整的BEVFormer-style TSA
|
|||
|
|
class TemporalSelfAttention(nn.Module):
|
|||
|
|
def forward(self, curr_bev, prev_bevs):
|
|||
|
|
Q = self.query_proj(curr_bev)
|
|||
|
|
K = self.key_proj(prev_bevs)
|
|||
|
|
V = self.value_proj(prev_bevs)
|
|||
|
|
|
|||
|
|
attn = F.softmax(Q @ K.T / sqrt(d), dim=-1)
|
|||
|
|
output = attn @ V
|
|||
|
|
|
|||
|
|
return curr_bev + output
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**预期提升**:
|
|||
|
|
- mIoU: +3-5%
|
|||
|
|
- 运动一致性大幅提升
|
|||
|
|
- 接近BEVFormer的时间建模能力
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
### 长期改进(研究方向)
|
|||
|
|
|
|||
|
|
**1. 混合架构: LSS + Transformer**
|
|||
|
|
```
|
|||
|
|
Encoder阶段:
|
|||
|
|
├─ Camera: DepthLSS生成初始BEV ✅ 快速
|
|||
|
|
├─ LiDAR: Sparse Encoder ✅ 精确
|
|||
|
|
└─ Fuser: ConvFuser初步融合
|
|||
|
|
|
|||
|
|
Decoder阶段:
|
|||
|
|
├─ Transformer Refiner (借鉴BEVFormer)
|
|||
|
|
│ └─ 细化BEV特征
|
|||
|
|
├─ Temporal Attention
|
|||
|
|
│ └─ 融合历史信息
|
|||
|
|
└─ Multi-task Heads
|
|||
|
|
├─ Detection (Deformable DETR)
|
|||
|
|
└─ Segmentation (Enhanced Head)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**预期性能**:
|
|||
|
|
- NDS: 72-73% (当前71%)
|
|||
|
|
- mIoU: 0.50-0.55 (Phase 4后0.47)
|
|||
|
|
|
|||
|
|
**2. 可变分辨率BEV**
|
|||
|
|
```
|
|||
|
|
远距离: 低分辨率 (0.5m/grid)
|
|||
|
|
近距离: 高分辨率 (0.1m/grid)
|
|||
|
|
|
|||
|
|
实现:
|
|||
|
|
- 金字塔式BEV
|
|||
|
|
- 注意力加权
|
|||
|
|
- 动态分辨率分配
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 针对当前训练的建议
|
|||
|
|
|
|||
|
|
### 基于BEVFormer vs BEVFusion对比的启示
|
|||
|
|
|
|||
|
|
**1. Phase 4 (BEV 2x) 必要性 ⭐⭐⭐⭐⭐**
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
当前问题与BEVFormer相同:
|
|||
|
|
✅ BEV分辨率不足 (0.3m → 0.5m输出)
|
|||
|
|
✅ 小目标IoU低 (Stop 0.26, Divider 0.19)
|
|||
|
|
|
|||
|
|
BEVFormer无法解决:
|
|||
|
|
❌ 受限于Transformer计算量
|
|||
|
|
❌ 提升到400×400成本巨大
|
|||
|
|
|
|||
|
|
BEVFusion可以解决:
|
|||
|
|
✅ LSS投影效率高
|
|||
|
|
✅ 720×720 BEV可行
|
|||
|
|
✅ 仅增加40%计算量
|
|||
|
|
|
|||
|
|
结论: Phase 4是正确方向!
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**2. Temporal Fusion 可选性 ⭐⭐⭐**
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
BEVFormer的优势:
|
|||
|
|
✅ 时间建模 → mIoU +3-5%
|
|||
|
|
|
|||
|
|
当前训练可借鉴:
|
|||
|
|
方案1: 简单BEV队列 (容易实现)
|
|||
|
|
方案2: Temporal Attention (效果更好)
|
|||
|
|
|
|||
|
|
建议时机:
|
|||
|
|
- Phase 4成功后
|
|||
|
|
- Phase 5 (模型优化阶段)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**3. 分割头已经很强 ✅**
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
当前EnhancedBEVSegmentationHead:
|
|||
|
|
✅ 比BEVFormer分割头强得多
|
|||
|
|
✅ ASPP + 注意力 + 深度解码
|
|||
|
|
✅ 不需要改进
|
|||
|
|
|
|||
|
|
瓶颈在BEV特征:
|
|||
|
|
⚠️ 不是分割头的问题
|
|||
|
|
⚠️ 是BEV分辨率的问题
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📊 最终对比总结表
|
|||
|
|
|
|||
|
|
### 技术对比
|
|||
|
|
|
|||
|
|
| 维度 | BEVFormer | BEVFusion (原论文) | 当前训练 | 优劣分析 |
|
|||
|
|
|------|-----------|-------------------|---------|---------|
|
|||
|
|
| **BEV生成** | Transformer Query | LSS Projection | LSS Projection | BEVFormer更灵活 |
|
|||
|
|
| **时间建模** | ✅ TSA | ❌ 无 | ❌ 无 | BEVFormer领先 |
|
|||
|
|
| **多模态** | ❌ Camera-only | ✅ Cam+LiDAR | ✅ Cam+LiDAR | BEVFusion领先 |
|
|||
|
|
| **BEV分辨率** | 200×200 (0.5m) | 360×360 (0.3m) | 360×360 (0.3m) | 都受限 |
|
|||
|
|
| **分割头** | Simple (3层) | Simple (3层) | Enhanced (多层) | 当前训练最强 |
|
|||
|
|
| **计算效率** | 低 (~10 FPS) | 高 (~15 FPS) | 高 (~15 FPS) | BEVFusion领先 |
|
|||
|
|
|
|||
|
|
### 性能对比(nuScenes)
|
|||
|
|
|
|||
|
|
| 指标 | BEVFormer | BEVFusion | 当前训练(E14) | Phase 4预估 |
|
|||
|
|
|------|-----------|-----------|--------------|------------|
|
|||
|
|
| **NDS** | 56.9% | 70.4% | **71.0%** ⭐ | 72.0% |
|
|||
|
|
| **mAP** | 41.6% | 68.5% | **66.8%** | 68.0% |
|
|||
|
|
| **mIoU** | ~0.35 | 0.62 | 0.407 | **0.47** |
|
|||
|
|
| **Drivable** | 0.70 | 0.88 | 0.76 | 0.80 |
|
|||
|
|
| **Stop Line** | 0.15 | 0.48 | 0.26 | **0.40** ⭐ |
|
|||
|
|
| **Divider** | 0.12 | 0.48 | 0.19 | **0.30** ⭐ |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 💡 关键洞察
|
|||
|
|
|
|||
|
|
### 1. 为什么BEVFusion检测更强?
|
|||
|
|
|
|||
|
|
**LiDAR的决定性作用**:
|
|||
|
|
```
|
|||
|
|
精确深度 → 准确3D定位 → +13.5% NDS
|
|||
|
|
|
|||
|
|
BEVFormer (Camera):
|
|||
|
|
深度估计误差: ±2-3m (远距离)
|
|||
|
|
3D Box误差: ±0.5-1.0m
|
|||
|
|
|
|||
|
|
BEVFusion (LiDAR):
|
|||
|
|
深度精度: ±0.05m
|
|||
|
|
3D Box误差: ±0.2m
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 2. 为什么当前分割不如原BEVFusion论文?
|
|||
|
|
|
|||
|
|
**根本原因**: BEV分辨率设置
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
原BEVFusion论文 (推测):
|
|||
|
|
可能使用更高的BEV分辨率
|
|||
|
|
或者使用不同的grid配置
|
|||
|
|
|
|||
|
|
当前训练:
|
|||
|
|
BEV: 0.3m/grid
|
|||
|
|
输出: 0.5m/grid ← 反而降采样!
|
|||
|
|
|
|||
|
|
问题: 细节在BEV生成阶段已丢失
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**Phase 4解决方案**:
|
|||
|
|
```
|
|||
|
|
BEV: 0.15m/grid (2倍)
|
|||
|
|
输出: 0.25m/grid (2倍)
|
|||
|
|
|
|||
|
|
Stop Line: 0.15m宽 → 占1个grid ✅ 可表达!
|
|||
|
|
Divider: 0.10m宽 → 占0.67个grid ⚠️ 勉强可表达
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 3. BEVFormer的时间优势值得借鉴吗?
|
|||
|
|
|
|||
|
|
**场景分析**:
|
|||
|
|
|
|||
|
|
**静态场景分割**(车道线、分隔线):
|
|||
|
|
```
|
|||
|
|
时间信息价值: 中等
|
|||
|
|
✅ 可补全遮挡区域
|
|||
|
|
✅ 提升边界连续性
|
|||
|
|
⚠️ 但静态物体变化小
|
|||
|
|
|
|||
|
|
预期提升: mIoU +2-3%
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**动态场景检测**(车辆、行人):
|
|||
|
|
```
|
|||
|
|
时间信息价值: 高
|
|||
|
|
✅ 运动轨迹预测
|
|||
|
|
✅ 遮挡物体跟踪
|
|||
|
|
✅ 速度估计
|
|||
|
|
|
|||
|
|
预期提升: NDS +1-2%
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**建议**:
|
|||
|
|
- Phase 4优先(解决根本问题)
|
|||
|
|
- Phase 5可考虑Temporal Fusion(锦上添花)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🚀 实施路线图
|
|||
|
|
|
|||
|
|
### 阶段1: Phase 4 - BEV 2x (优先级: ⭐⭐⭐⭐⭐)
|
|||
|
|
|
|||
|
|
**时间**: 2025-10-28启动,11月5日完成
|
|||
|
|
|
|||
|
|
**目标**: 追平BEVFusion原论文的分割性能
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
配置修改:
|
|||
|
|
xbound: [-54, 54, 0.15] # 2倍分辨率
|
|||
|
|
ybound: [-54, 54, 0.15]
|
|||
|
|
output: [-50, 50, 0.25]
|
|||
|
|
|
|||
|
|
预期性能:
|
|||
|
|
Stop Line: 0.26 → 0.40 (接近论文0.48)
|
|||
|
|
Divider: 0.19 → 0.30 (接近论文0.48)
|
|||
|
|
mIoU: 0.41 → 0.47 (接近论文0.62的75%)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 阶段2: Temporal Fusion (优先级: ⭐⭐⭐)
|
|||
|
|
|
|||
|
|
**时间**: 2025-11月中旬(Phase 4成功后)
|
|||
|
|
|
|||
|
|
**方案**: 简单BEV队列融合
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
# 实现简单但有效
|
|||
|
|
class TemporalBEVFusion:
|
|||
|
|
def __init__(self):
|
|||
|
|
self.bev_queue = deque(maxlen=3) # 3帧
|
|||
|
|
self.temporal_conv = nn.Conv3D(...)
|
|||
|
|
|
|||
|
|
def forward(self, curr_bev):
|
|||
|
|
if len(self.bev_queue) > 0:
|
|||
|
|
# 时间维度卷积
|
|||
|
|
temporal_stack = torch.stack(
|
|||
|
|
[curr_bev] + list(self.bev_queue),
|
|||
|
|
dim=2
|
|||
|
|
) # (B, C, T, H, W)
|
|||
|
|
|
|||
|
|
fused = self.temporal_conv(temporal_stack)
|
|||
|
|
return fused
|
|||
|
|
return curr_bev
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**预期提升**:
|
|||
|
|
- mIoU: +2-3%
|
|||
|
|
- 遮挡恢复: 显著提升
|
|||
|
|
- 边界连续性: 改善
|
|||
|
|
|
|||
|
|
### 阶段3: 完整BEVFormer-style Attention(研究性)
|
|||
|
|
|
|||
|
|
**时间**: 2026年Q1(如果需要)
|
|||
|
|
|
|||
|
|
**实现**: 完整的Temporal Self-Attention
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
# 完全借鉴BEVFormer
|
|||
|
|
class BEVFormerTemporalAttention:
|
|||
|
|
def forward(self, curr_bev, prev_bevs):
|
|||
|
|
# Multi-head self-attention
|
|||
|
|
Q = self.q_proj(curr_bev)
|
|||
|
|
K = self.k_proj(prev_bevs)
|
|||
|
|
V = self.v_proj(prev_bevs)
|
|||
|
|
|
|||
|
|
output = multi_head_attention(Q, K, V)
|
|||
|
|
return curr_bev + output
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**预期提升**:
|
|||
|
|
- mIoU: +3-5%
|
|||
|
|
- 运动物体分割: 大幅提升
|
|||
|
|
- 最接近BEVFormer的时间建模
|
|||
|
|
|
|||
|
|
**成本**:
|
|||
|
|
- 训练时间: +30%
|
|||
|
|
- 显存: +25%
|
|||
|
|
- 推理速度: -20%
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## ✅ 最终建议
|
|||
|
|
|
|||
|
|
### 当前训练 vs BEVFormer的定位
|
|||
|
|
|
|||
|
|
**当前训练的优势**:
|
|||
|
|
```
|
|||
|
|
1. ✅ 多模态融合 (Cam+LiDAR)
|
|||
|
|
→ NDS 71.0% vs BEVFormer 56.9% (+14.1%)
|
|||
|
|
|
|||
|
|
2. ✅ 增强分割头
|
|||
|
|
→ 比BEVFormer的简单头强得多
|
|||
|
|
|
|||
|
|
3. ✅ 计算效率
|
|||
|
|
→ LSS比Transformer快50%
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**当前训练的劣势**:
|
|||
|
|
```
|
|||
|
|
1. ⚠️ 无时间建模
|
|||
|
|
→ 无法利用历史帧
|
|||
|
|
|
|||
|
|
2. ⚠️ BEV分辨率不足
|
|||
|
|
→ 小目标性能受限(与BEVFormer相同问题)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 优先级排序
|
|||
|
|
|
|||
|
|
**1. Phase 4 (BEV 2x) - 立即执行** ⭐⭐⭐⭐⭐
|
|||
|
|
```
|
|||
|
|
理由:
|
|||
|
|
• 解决根本瓶颈(分辨率)
|
|||
|
|
• 时间成本相同
|
|||
|
|
• 性能提升最大 (+15% mIoU)
|
|||
|
|
• 对实车部署关键
|
|||
|
|
|
|||
|
|
执行时间: 2025-10-28
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**2. Temporal Fusion - Phase 5考虑** ⭐⭐⭐
|
|||
|
|
```
|
|||
|
|
理由:
|
|||
|
|
• 借鉴BEVFormer优势
|
|||
|
|
• 锦上添花 (+2-3% mIoU)
|
|||
|
|
• 实现相对简单
|
|||
|
|
|
|||
|
|
执行时间: 2025-11月中旬
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**3. 完整Transformer - 研究性** ⭐⭐
|
|||
|
|
```
|
|||
|
|
理由:
|
|||
|
|
• 技术复杂
|
|||
|
|
• 收益有限 (+1-2% vs 简单方案)
|
|||
|
|
• 计算开销大
|
|||
|
|
|
|||
|
|
执行时间: 可选,2026年Q1
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 📋 技术参考清单
|
|||
|
|
|
|||
|
|
### BEVFormer相关资料
|
|||
|
|
|
|||
|
|
1. **论文**: [BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers](https://arxiv.org/abs/2203.17270)
|
|||
|
|
2. **代码**: https://github.com/fundamentalvision/BEVFormer
|
|||
|
|
3. **中文博客**: https://www.cnblogs.com/wxkang/p/17391118.html
|
|||
|
|
|
|||
|
|
### BEVFusion相关资料
|
|||
|
|
|
|||
|
|
1. **论文**: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
|
|||
|
|
2. **当前训练配置**: `configs/.../multitask_enhanced_phase1_HIGHRES.yaml`
|
|||
|
|
3. **增强分割头**: `mmdet3d/models/heads/segm/enhanced.py`
|
|||
|
|
|
|||
|
|
### 改进方案文档
|
|||
|
|
|
|||
|
|
1. **BEV分辨率方案**: `/workspace/bevfusion/BEV分辨率提升方案分析.md`
|
|||
|
|
2. **Loss分析报告**: `/workspace/bevfusion/Epoch8-11_Loss分析与Phase4启动建议.md`
|
|||
|
|
3. **实车部署计划**: `/workspace/bevfusion/BEVFusion实车部署完整计划.md`
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 🎯 核心结论
|
|||
|
|
|
|||
|
|
### BEVFormer vs BEVFusion在分割任务上的差异
|
|||
|
|
|
|||
|
|
**1. 架构层面**:
|
|||
|
|
```
|
|||
|
|
BEVFormer: 时空Transformer + 简单分割头
|
|||
|
|
BEVFusion: LSS投影 + 增强分割头
|
|||
|
|
当前训练: LSS + LiDAR + 最强分割头 ⭐
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**2. 性能层面**:
|
|||
|
|
```
|
|||
|
|
检测:
|
|||
|
|
BEVFormer (Camera): 56.9% NDS
|
|||
|
|
BEVFusion (Cam+LiDAR): 70.4% NDS (+13.5%)
|
|||
|
|
当前训练: 71.0% NDS (+14.1%) ⭐ 最优
|
|||
|
|
|
|||
|
|
分割:
|
|||
|
|
BEVFormer (0.5m): mIoU 0.35
|
|||
|
|
BEVFusion (0.3m→0.5m): mIoU 0.62 (论文)
|
|||
|
|
当前训练 (0.3m→0.5m): mIoU 0.41 (受限)
|
|||
|
|
Phase 4 (0.15m→0.25m): mIoU 0.47 (预估) ⭐
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**3. 技术路线**:
|
|||
|
|
```
|
|||
|
|
最优方案 = BEVFusion基础 + BEV 2x + (可选)Temporal
|
|||
|
|
|
|||
|
|
检测: 71-72% NDS (SOTA)
|
|||
|
|
分割: 0.47-0.50 mIoU (优秀)
|
|||
|
|
小目标: Stop 0.40, Divider 0.30 (实车可用)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
**文档生成时间**: 2025-10-26 12:30
|
|||
|
|
**参考资料**: [BEVFormer GitHub](https://github.com/fundamentalvision/BEVFormer), BEVFusion论文
|
|||
|
|
**结论**: 当前训练方向正确,Phase 4必要且充分!
|
|||
|
|
|