# BEVFusion Camera Configuration Flexibility: Analysis and Options

**Analysis date**: 2025-11-06

**Current configuration**: 6 cameras + LiDAR (nuScenes standard)

**Goal**: support flexible camera configurations (1-N cameras)

---

## 📊 Current Architecture Analysis

### Existing Camera Processing Pipeline

```
Data flow:
img: (B, N, C, H, W)              # N = 6 cameras
  ↓
Backbone: (B*N, C', H', W')       # all 6 cameras share weights
  ↓
Neck: (B*N, 256, H'', W'')
  ↓
VTransform: (B, N, 80, D, H, W)   # lift to BEV space
  ↓
BEV Pooling: (B, 80*D, H, W)      # aggregate the N cameras
  ↓
Fuser + Decoder: (B, 512, H, W)
  ↓
Task Heads
```

### Key Findings

1. ✅ **The camera count N is dynamic**: in the code, N can be any value
2. ✅ **Shared weights**: all cameras share the same backbone/neck
3. ✅ **BEV pooling aggregates automatically**: no matter how many cameras there are, everything is pooled into the same BEV space
4. ⚠️ **Fixed configuration**: nuScenes hard-codes 6 specific cameras
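
The first three findings can be sanity-checked with a toy forward pass: because the backbone only ever sees a flattened `(B*N, C, H, W)` batch, the same weights handle any camera count N. A minimal sketch, with a single conv standing in for the shared backbone (shapes are illustrative, not the real BEVFusion ones):

```python
import torch
import torch.nn as nn

# Stand-in for the shared camera backbone: one conv applied to all views at once.
backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)

for N in (4, 6):  # different camera counts, same weights
    img = torch.randn(2, N, 3, 32, 32)           # (B, N, C, H, W)
    B, _, C, H, W = img.shape
    feat = backbone(img.view(B * N, C, H, W))    # shared weights over B*N views
    feat = feat.view(B, N, 8, H, W)              # back to the per-camera layout
    assert feat.shape == (2, N, 8, 32, 32)
```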
---

## 🎯 Options for Flexible Camera Configurations

### Option 1: Simple Dynamic Configuration (simplest) ⭐

**Use case**: 3-8 cameras of a similar type

**Approach**:
- Only the data loading changes; the model needs no modification
- All cameras share weights
- BEV pooling adapts automatically

```yaml
# configs/custom/flexible_cameras.yaml

# Camera count
num_cameras: 4  # can be 1-N

camera_names:
  - CAM_FRONT
  - CAM_FRONT_LEFT
  - CAM_FRONT_RIGHT
  - CAM_BACK
  # entries can be added or removed

# Model config - no changes needed!
model:
  type: BEVFusion
  encoders:
    camera:
      backbone:
        type: SwinTransformer  # shared weights
      neck:
        type: GeneralizedLSSFPN
      vtransform:
        type: DepthLSSTransform
        # BEV pooling handles N cameras automatically
```

**Pros**:
- ✅ Simple to implement; no model code changes
- ✅ Cameras can be added or removed at will
- ✅ Stable training

**Cons**:
- ⚠️ All cameras must share weights
- ⚠️ No per-camera specialization

---

### Option 2: Camera-specific Adapter (recommended) ⭐⭐⭐

**Use case**: different camera types (e.g. wide-angle + telephoto)

**Core idea**:
- Shared backbone
- An independent lightweight adapter per camera
- Suited to heterogeneous cameras

```python
# mmdet3d/models/vtransforms/camera_aware_lss.py
import torch
import torch.nn as nn


class CameraAwareLSS(BaseTransform):
    """
    Camera-aware LSS transform.

    Each camera gets its own adapter over the shared features.
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        num_cameras: int = 6,        # dynamic camera count
        camera_types: list = None,   # e.g. ['wide', 'tele', 'wide', ...]
        **kwargs
    ):
        super().__init__(in_channels, out_channels, **kwargs)

        # One lightweight adapter per camera
        self.camera_adapters = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, 1, 1),
                nn.BatchNorm2d(in_channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, in_channels, 1),
            ) for _ in range(num_cameras)
        ])

        # Optional camera-type embedding
        self.camera_types = camera_types
        self.use_camera_embedding = camera_types is not None
        if self.use_camera_embedding:
            unique_types = list(set(camera_types))
            self.camera_type_embed = nn.Embedding(
                len(unique_types),
                in_channels
            )
            self.type_to_id = {t: i for i, t in enumerate(unique_types)}

        # Parameter count: num_cameras x (in_channels^2 x 9 + in_channels^2)
        # Example: 6 x (256^2 x 10) ≈ 4M parameters

    def get_cam_feats(self, x, mats_dict):
        """
        Args:
            x: (B, N, C, fH, fW) - features from N cameras
            mats_dict: camera matrices
        """
        B, N, C, fH, fW = x.shape

        # Apply the per-camera adapters
        adapted_features = []
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, fH, fW)

            # Camera-specific adapter
            cam_feat = self.camera_adapters[i](cam_feat)

            # (Optional) add the camera-type embedding
            if self.use_camera_embedding:
                cam_type_id = self.type_to_id[self.camera_types[i]]
                type_embed = self.camera_type_embed(
                    torch.tensor(cam_type_id, device=cam_feat.device)
                )
                cam_feat = cam_feat + type_embed.view(1, -1, 1, 1)

            adapted_features.append(cam_feat)

        # Re-stack into (B, N, C, fH, fW)
        x = torch.stack(adapted_features, dim=1)

        # Continue with the regular LSS processing
        return super().get_cam_feats(x, mats_dict)
```
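
The parameter estimate in the code comment can be verified directly by counting the adapter's parameters (a quick check, assuming `in_channels=256` and 6 cameras as in the example; bias and BatchNorm terms push the exact count slightly above the rough formula):

```python
import torch.nn as nn

in_channels, num_cameras = 256, 6
adapter = nn.Sequential(
    nn.Conv2d(in_channels, in_channels, 3, 1, 1),
    nn.BatchNorm2d(in_channels),
    nn.ReLU(inplace=True),
    nn.Conv2d(in_channels, in_channels, 1),
)
per_adapter = sum(p.numel() for p in adapter.parameters())
total = per_adapter * num_cameras
print(per_adapter, total)  # ≈ 0.66M per camera, ≈ 3.9M for 6 cameras
```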

**Config example**:

```yaml
model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS
        num_cameras: 4
        camera_types: ['wide', 'tele', 'wide', 'wide']
        in_channels: 256
        out_channels: 80
```

**Pros**:
- ✅ Independent processing capacity per camera
- ✅ Modest parameter increase (~4M)
- ✅ Can adapt to different camera types
- ✅ Backward compatible

**Cons**:
- ⚠️ Requires changes to the vtransform code
- ⚠️ Slightly longer training (~5%)

---

### Option 3: Mixture of Experts (MoE) ⭐⭐⭐⭐

**Use case**: multiple camera types, best possible performance

**Core idea**:
- Dedicated expert networks for different camera types
- A router selects experts dynamically
- Routing can be driven by camera attributes

```python
# mmdet3d/models/modules/camera_moe.py
import torch
import torch.nn as nn
import torch.nn.functional as F


class CameraMoE(nn.Module):
    """
    Camera Mixture of Experts.

    Dynamically selects experts based on camera type.
    """

    def __init__(
        self,
        in_channels: int,
        num_experts: int = 3,       # e.g. wide, tele, fisheye
        expert_capacity: int = 256,
        **kwargs
    ):
        super().__init__()

        # Expert networks
        self.experts = nn.ModuleList([
            CameraExpert(in_channels, expert_capacity)
            for _ in range(num_experts)
        ])

        # Router: decides which experts to use
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global pooling
            nn.Flatten(),
            nn.Linear(in_channels, num_experts),
            nn.Softmax(dim=-1)
        )

        # Camera attribute encoder (FOV, focal length, etc.)
        self.camera_attr_encoder = nn.Sequential(
            nn.Linear(4, 64),  # [FOV, focal_x, focal_y, type_id]
            nn.ReLU(),
            nn.Linear(64, in_channels),
        )

    def forward(self, x, camera_attrs):
        """
        Args:
            x: (B, N, C, H, W) - N cameras
            camera_attrs: (B, N, 4) - [FOV, focal_x, focal_y, type_id]

        Returns:
            (B, N, C, H, W) - processed features
        """
        B, N, C, H, W = x.shape

        outputs = []
        for i in range(N):
            cam_feat = x[:, i]             # (B, C, H, W)
            cam_attr = camera_attrs[:, i]  # (B, 4)

            # 1. Encode the camera attributes
            attr_embed = self.camera_attr_encoder(cam_attr)      # (B, C)
            attr_embed = attr_embed.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)

            # 2. Router picks the expert weights
            router_input = cam_feat + attr_embed
            expert_weights = self.router(router_input)  # (B, num_experts)

            # 3. Combine the experts
            expert_outputs = []
            for expert in self.experts:
                expert_outputs.append(expert(cam_feat))

            # Weighted combination: (B, num_experts, C, H, W)
            expert_outputs = torch.stack(expert_outputs, dim=1)

            # Apply the router weights
            expert_weights = expert_weights.view(B, -1, 1, 1, 1)
            cam_output = (expert_outputs * expert_weights).sum(dim=1)

            outputs.append(cam_output)

        return torch.stack(outputs, dim=1)  # (B, N, C, H, W)


class CameraExpert(nn.Module):
    """A single expert network."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, 1, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, in_channels, 3, 1, 1),
            nn.BatchNorm2d(in_channels),
        )
        self.shortcut = nn.Identity()

    def forward(self, x):
        return F.relu(self.conv(x) + self.shortcut(x))
```
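
The router's weighted combination at the heart of the MoE reduces to a softmax-weighted sum over stacked expert outputs. A minimal shape sketch, using random tensors as stand-ins for the expert outputs and router logits (all shapes are illustrative):

```python
import torch
import torch.nn.functional as F

B, C, H, W = 2, 16, 8, 8
num_experts = 3

# Stand-ins for expert outputs: (B, num_experts, C, H, W)
expert_outputs = torch.stack(
    [torch.randn(B, C, H, W) for _ in range(num_experts)], dim=1
)
# Stand-in router logits -> softmax weights, one weight per expert per sample
expert_weights = F.softmax(torch.randn(B, num_experts), dim=-1)

# Broadcast the weights over (C, H, W) and sum over the expert axis
out = (expert_outputs * expert_weights.view(B, num_experts, 1, 1, 1)).sum(dim=1)
assert out.shape == (B, C, H, W)
# The mixture weights sum to 1 for each sample
assert torch.allclose(expert_weights.sum(-1), torch.ones(B))
```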

**Config example**:

```yaml
model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        # shared backbone

      neck:
        type: GeneralizedLSSFPN
        # add MoE
        use_camera_moe: true
        moe_config:
          num_experts: 3  # wide, tele, fisheye
          expert_capacity: 256

      vtransform:
        type: DepthLSSTransform

# Data config
camera_attributes:
  CAM_FRONT:
    type: 'wide'
    fov: 120.0
    focal_length: [1266.0, 1266.0]

  CAM_FRONT_TELE:
    type: 'tele'
    fov: 30.0
    focal_length: [2532.0, 2532.0]

  CAM_FRONT_LEFT:
    type: 'wide'
    fov: 120.0
    focal_length: [1266.0, 1266.0]
```

**Pros**:
- ✅ Strongest expressive power
- ✅ Learns the optimal processing per camera automatically
- ✅ Supports highly heterogeneous cameras
- ✅ Can incorporate camera attribute information

**Cons**:
- ⚠️ High implementation complexity
- ⚠️ Larger parameter increase (~10M)
- ⚠️ Longer training (~10-15%)
- ⚠️ Needs enough data to train the router

---

### Option 4: Per-Camera Attention (most flexible) ⭐⭐⭐⭐⭐

**Use case**: any number and type of cameras

**Core idea**:
- Let the model learn each camera's importance
- Cross-camera attention
- Dynamic fusion across cameras

```python
# mmdet3d/models/modules/camera_attention.py
import torch
import torch.nn as nn


class MultiCameraAttentionFusion(nn.Module):
    """
    Multi-camera attention fusion.

    Features:
    - Supports any number of cameras (1-N)
    - Learns inter-camera relationships automatically
    - Position-aware camera fusion
    """

    def __init__(
        self,
        in_channels: int,
        num_cameras: int = 6,              # variable
        use_camera_position: bool = True,  # use camera pose information
        use_cross_attention: bool = True,  # inter-camera interaction
    ):
        super().__init__()

        self.num_cameras = num_cameras
        self.use_camera_position = use_camera_position

        # Camera pose encoding (pose relative to the vehicle)
        if use_camera_position:
            self.position_encoder = nn.Sequential(
                nn.Linear(6, 128),  # [x, y, z, roll, pitch, yaw]
                nn.ReLU(),
                nn.Linear(128, in_channels),
            )

        # Attention over the camera axis at each spatial location
        self.self_attention = nn.MultiheadAttention(
            embed_dim=in_channels,
            num_heads=8,
            dropout=0.1,
        )

        # Cross-camera attention
        if use_cross_attention:
            self.cross_attention = nn.MultiheadAttention(
                embed_dim=in_channels,
                num_heads=8,
                dropout=0.1,
            )

        # Camera importance weighting
        self.camera_weight_net = nn.Sequential(
            nn.Linear(in_channels, in_channels // 4),
            nn.ReLU(),
            nn.Linear(in_channels // 4, 1),
            nn.Sigmoid(),
        )

    def forward(self, x, camera_positions=None):
        """
        Args:
            x: (B, N, C, H, W) - N cameras
            camera_positions: (B, N, 6) - pose [x, y, z, roll, pitch, yaw] per camera

        Returns:
            (B, N, C, H, W) - attention-enhanced features
        """
        B, N, C, H, W = x.shape

        # Reshape for attention: (N, B*H*W, C)
        x_flat = x.permute(1, 0, 3, 4, 2).reshape(N, B * H * W, C)

        # 1. Add the camera pose encoding
        if self.use_camera_position and camera_positions is not None:
            pos_embed = self.position_encoder(camera_positions)  # (B, N, C)
            pos_embed = pos_embed.permute(1, 0, 2).unsqueeze(2)  # (N, B, 1, C)
            pos_embed = pos_embed.expand(-1, -1, H * W, -1).reshape(N, B * H * W, C)
            x_flat = x_flat + pos_embed

        # 2. Attention across cameras at each spatial location
        x_self, _ = self.self_attention(x_flat, x_flat, x_flat)

        # 3. Cross-camera attention (each camera attends to all cameras)
        if hasattr(self, 'cross_attention'):
            # Query from each camera, key/value from all cameras
            x_cross = []
            for i in range(N):
                query = x_self[i:i + 1]  # (1, B*H*W, C)
                key_value = x_self       # (N, B*H*W, C)
                attended, attn_weights = self.cross_attention(
                    query, key_value, key_value
                )
                x_cross.append(attended)

            x_attended = torch.cat(x_cross, dim=0)  # (N, B*H*W, C)
        else:
            x_attended = x_self

        # 4. Per-camera importance weights
        # Global pooling per camera
        x_pooled = x_attended.mean(dim=1)                  # (N, C)
        camera_weights = self.camera_weight_net(x_pooled)  # (N, 1)

        # Apply the weights
        camera_weights = camera_weights.view(N, 1, 1).expand(-1, B * H * W, C)
        x_weighted = x_attended * camera_weights

        # Reshape back: (N, B*H*W, C) -> (B, N, C, H, W)
        output = x_weighted.reshape(N, B, H, W, C).permute(1, 0, 4, 2, 3)

        return output
```
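
The `(N, B*H*W, C)` layout used above makes the camera index the sequence dimension, so `nn.MultiheadAttention` (with its default `batch_first=False`) attends across cameras at each spatial location. A minimal sketch verifying the reshapes round-trip (toy dimensions; the real module uses 8 heads and the feature channels of the neck):

```python
import torch
import torch.nn as nn

B, N, C, H, W = 2, 5, 32, 4, 4
attn = nn.MultiheadAttention(embed_dim=C, num_heads=4)  # batch_first=False

x = torch.randn(B, N, C, H, W)
x_flat = x.permute(1, 0, 3, 4, 2).reshape(N, B * H * W, C)  # (seq=N, batch=B*H*W, C)
out, weights = attn(x_flat, x_flat, x_flat)
assert out.shape == (N, B * H * W, C)
assert weights.shape == (B * H * W, N, N)  # per-location camera-to-camera attention

# Invert the reshape to recover the per-camera layout
back = out.reshape(N, B, H, W, C).permute(1, 0, 4, 2, 3)
assert back.shape == (B, N, C, H, W)
```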

**Integration into BEVFusion**:

```python
# mmdet3d/models/fusion_models/bevfusion.py (modified)

def extract_camera_features(self, img, ...):
    B, N, C, H, W = img.shape

    # Backbone (shared)
    x = img.view(B * N, C, H, W)
    x = self.encoders["camera"]["backbone"](x)
    x = self.encoders["camera"]["neck"](x)

    # Reshape: (B*N, C', H', W') -> (B, N, C', H', W')
    _, C_out, H_out, W_out = x.shape
    x = x.view(B, N, C_out, H_out, W_out)

    # ✨ Multi-camera attention fusion
    if hasattr(self, 'camera_attention'):
        x = self.camera_attention(x, camera_positions)

    # VTransform
    x = self.encoders["camera"]["vtransform"](x, ...)

    return x
```

**Pros**:
- ✅ Most flexible; supports any N cameras
- ✅ Learns camera importance automatically
- ✅ Inter-camera information exchange
- ✅ Position-aware fusion
- ✅ Can handle missing cameras (dynamic N)

**Cons**:
- ⚠️ Complex to implement
- ⚠️ More parameters (~8M)
- ⚠️ Compute overhead (+15-20ms)

---

### Option 5: Sparse MoE (efficient variant) ⭐⭐⭐⭐

**Use case**: many cameras (>6) where efficiency matters

**Core idea**:
- Activate only the top-K experts each time
- Lower compute cost
- Suited to heterogeneous camera systems

```python
# mmdet3d/models/modules/sparse_camera_moe.py
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseCameraMoE(nn.Module):
    """
    Sparse camera MoE.

    For each sample, only the top-K experts are evaluated,
    which greatly reduces compute.
    """

    def __init__(
        self,
        in_channels: int,
        num_experts: int = 8,         # supports 8 camera types
        num_active_experts: int = 3,  # only 3 used per sample
        expert_hidden_dim: int = 256,
    ):
        super().__init__()

        self.num_experts = num_experts
        self.num_active_experts = num_active_experts

        # Expert networks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, expert_hidden_dim, 3, 1, 1),
                nn.BatchNorm2d(expert_hidden_dim),
                nn.ReLU(),
                nn.Conv2d(expert_hidden_dim, in_channels, 3, 1, 1),
                nn.BatchNorm2d(in_channels),
            ) for _ in range(num_experts)
        ])

        # Gating network: selects which experts to use
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, num_experts),
        )

        # Load-balancing auxiliary loss weight
        self.load_balancing_loss_weight = 0.01

    def forward(self, x, camera_types=None):
        """
        Args:
            x: (B, N, C, H, W)
            camera_types: (B, N) - camera type IDs

        Returns:
            output: (B, N, C, H, W)
            aux_loss: load-balancing loss
        """
        B, N, C, H, W = x.shape

        outputs = []
        gate_scores_all = []

        for i in range(N):
            cam_feat = x[:, i]  # (B, C, H, W)

            # Gate scores for expert selection
            gate_scores = self.gate(cam_feat)  # (B, num_experts)
            gate_scores_all.append(gate_scores)

            # Top-K selection
            top_k_scores, top_k_indices = torch.topk(
                gate_scores,
                self.num_active_experts,
                dim=-1
            )  # (B, K)

            # Normalize the top-K scores
            top_k_scores = F.softmax(top_k_scores, dim=-1)

            # Evaluate only the top-K experts
            expert_outs = []
            for b in range(B):
                sample_out = torch.zeros_like(cam_feat[b:b + 1])

                for k in range(self.num_active_experts):
                    expert_idx = top_k_indices[b, k]
                    expert_weight = top_k_scores[b, k]

                    expert_result = self.experts[expert_idx](cam_feat[b:b + 1])
                    sample_out += expert_result * expert_weight

                expert_outs.append(sample_out)

            cam_output = torch.cat(expert_outs, dim=0)
            outputs.append(cam_output)

        output = torch.stack(outputs, dim=1)  # (B, N, C, H, W)

        # Load-balancing loss (encourages uniform expert usage)
        gate_scores_all = torch.stack(gate_scores_all, dim=1)  # (B, N, num_experts)
        gate_mean = gate_scores_all.mean(dim=[0, 1])           # (num_experts,)
        load_balance_loss = (gate_mean.std() ** 2) * self.load_balancing_loss_weight

        return output, load_balance_loss
```
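
The top-K gating step can be checked in isolation: `torch.topk` picks the active experts and a softmax over the kept scores renormalizes their weights. A minimal sketch (the dense scatter at the end is one optional way to collect per-expert usage for load-balancing statistics, not part of the module above):

```python
import torch
import torch.nn.functional as F

B, num_experts, K = 4, 8, 3
gate_scores = torch.randn(B, num_experts)  # stand-in router logits
top_k_scores, top_k_indices = torch.topk(gate_scores, K, dim=-1)
top_k_weights = F.softmax(top_k_scores, dim=-1)  # renormalize over the K kept experts

assert top_k_indices.shape == (B, K)
assert torch.allclose(top_k_weights.sum(-1), torch.ones(B))

# Dense weight vector with zeros for inactive experts
dense = torch.zeros(B, num_experts).scatter_(-1, top_k_indices, top_k_weights)
assert torch.allclose(dense.sum(-1), torch.ones(B))
```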

**Config example**:

```yaml
model:
  encoders:
    camera:
      neck:
        type: GeneralizedLSSFPN
        use_sparse_moe: true
        moe_config:
          num_experts: 8
          num_active_experts: 3  # only 3 used at a time
          expert_hidden_dim: 256

# Supports many camera types
camera_configs:
  - type: 'wide'        # expert 0
  - type: 'tele'        # expert 1
  - type: 'fisheye'     # expert 2
  - type: 'ultra_wide'  # expert 3
  # ... more can be defined
```

**Pros**:
- ✅ Supports many camera types (8+)
- ✅ Computationally efficient (top-K only)
- ✅ Learns expert selection automatically
- ✅ Load balancing keeps training stable

**Cons**:
- ⚠️ Most complex to implement
- ⚠️ Router training needs tuning
- ⚠️ Requires sufficiently diverse camera data

---

## 🔍 Option Comparison

| Option | Flexibility | Complexity | Params | Compute cost | Recommended for |
|--------|-------------|------------|--------|--------------|-----------------|
| **Simple dynamic** | ⭐⭐ | ⭐ | +0 | +0% | 3-6 similar cameras |
| **Camera Adapter** | ⭐⭐⭐ | ⭐⭐ | +4M | +5% | Mixed camera types (4-6) |
| **MoE** | ⭐⭐⭐⭐ | ⭐⭐⭐ | +10M | +20% | Many camera types |
| **Per-Camera Attn** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | +8M | +15% | Any setup, best performance |
| **Sparse MoE** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | +12M | +10% | Many cameras (>6) |

---

## 💡 Recommendations for Your BEVFusion Project

### Current Status

Your current setup:
- ✅ 6 cameras (nuScenes standard)
- ✅ 1 LiDAR
- ✅ Task-specific GCA already implemented

### Recommended Path

**Short term (once the current training run finishes)**:

**Recommended: Option 2 - Camera Adapter**

Rationale:
1. ✅ Simple to implement, low risk
2. ✅ Small parameter overhead (~4M)
3. ✅ Compatible with the Task-GCA architecture
4. ✅ Can support 4-8 cameras right away
5. ✅ Implementable in 2-3 days

Implementation steps:
```python
# Step 1: Create the CameraAwareLSS class
# Step 2: Enable it in the config file
# Step 3: Fine-tune from the existing checkpoint
# Step 4: Test different camera counts (4, 5, 6, 8)
```

**Mid term (if stronger performance is needed)**:

**Recommended: Option 4 - Per-Camera Attention**

Rationale:
1. ✅ Best performance
2. ✅ Highest flexibility
3. ✅ Combines with Task-GCA
4. ✅ High academic value

Combined architecture:
```
Camera Input (N cameras)
  ↓
Shared Backbone (Swin Transformer)
  ↓
Multi-Camera Attention Fusion  ← new
  ↓
VTransform (LSS)
  ↓
Fuser (Camera + LiDAR)
  ↓
Decoder
  ↓
Task-specific GCA  ← existing
  ├─ Detection GCA
  └─ Segmentation GCA
  ↓
Task Heads
```

---

## 🚀 Implementation Plan

### Phase 1: Basic Flexibility (1 week)

```python
# 1. Modify data loading to support a dynamic camera count
# mmdet3d/datasets/pipelines/loading.py
from PIL import Image


@PIPELINES.register_module()
class LoadMultiViewImageFromFiles:
    def __init__(self, to_float32=False, camera_names=None):
        self.to_float32 = to_float32
        self.camera_names = camera_names or [
            'CAM_FRONT', 'CAM_FRONT_RIGHT', 'CAM_FRONT_LEFT',
            'CAM_BACK', 'CAM_BACK_LEFT', 'CAM_BACK_RIGHT'
        ]
        self.num_cameras = len(self.camera_names)

    def __call__(self, results):
        # Load only the requested cameras
        images = []
        for cam_name in self.camera_names:
            if cam_name in results['cams']:
                img_path = results['cams'][cam_name]['data_path']
                images.append(Image.open(img_path))

        results['img'] = images
        results['num_cameras'] = len(images)
        return results
```

### Phase 2: Camera Adapter (1 week)

```bash
# Implement the camera-specific adapter
vim mmdet3d/models/vtransforms/camera_aware_lss.py

# Config file
vim configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_camera_aware.yaml

# Test
python tools/train.py configs/.../multitask_camera_aware.yaml \
    --load_from runs/run-326653dc-2334d461/epoch_5.pth \
    --data.samples_per_gpu 1
```

### Phase 3: MoE/Attention (optional, 2 weeks)

```bash
# For maximum performance, implement the attention or MoE variant
vim mmdet3d/models/modules/camera_attention.py
vim mmdet3d/models/modules/camera_moe.py
```

---

## 📊 Expected Results

### Option 2 (Camera Adapter)

```
Configuration flexibility:
✅ Supports 1-8 cameras
✅ Independent processing per camera
✅ Adapts automatically to different FOVs

Performance impact:
Params: +4M (110M → 114M)
Speed: +5% (2.66s → 2.79s/iter)
Accuracy: +1-2% (adapter specialization)

Training:
Starting from epoch_5.pth: needs 3-5 epochs to adapt
Total time: ~2 days
```

### Option 4 (Per-Camera Attention)

```
Configuration flexibility:
✅ Supports any number of cameras
✅ Inter-camera information exchange
✅ Dynamic camera weighting

Performance impact:
Params: +8M (110M → 118M)
Speed: +15% (2.66s → 3.06s/iter)
Accuracy: +2-4% (attention refinement)

Training:
Starting from epoch_5.pth: needs 5-8 epochs
Total time: ~3-4 days
```

---

## 🎯 Concrete Implementation Roadmap

### If You Need It Now

**Recommended path: the Camera Adapter option**

```
Week 1: Implementation (while the current training run finishes)
  Day 1: Design the CameraAwareLSS interface
  Day 2-3: Implement the camera adapter module
  Day 4: Write the config file
  Day 5: Unit tests

Week 2: Training (after Epoch 20 completes)
  Day 1: Fine-tune from epoch_20.pth
  Day 2-3: Train the Camera Adapter (5 epochs)
  Day 4: Evaluate different camera configurations
  Day 5: Performance comparison and documentation

Week 3: Optimization (optional)
  Day 1-3: Upgrade to the attention variant
  Day 4-5: Further tuning
```

---

## 📝 Code Templates

I can create these right away:

1. **CameraAwareLSS implementation** (`mmdet3d/models/vtransforms/camera_aware_lss.py`)
2. **Config file template** (`configs/.../multitask_camera_aware.yaml`)
3. **Test script** (`tools/test_camera_configs.py`)
4. **Documentation** (`docs/CAMERA_FLEXIBILITY_GUIDE.md`)

---

## ✅ Summary

### For Your Project

**Doable immediately**:
1. ✅ Simple change: support 4-8 cameras (data loading only)
2. ✅ Camera Adapter: 2 weeks to implement, 1-2% accuracy gain
3. ✅ Compatible with Task-GCA: the two can be stacked

**Advanced options**:
1. 🎯 Per-Camera Attention: if peak performance is needed
2. 🎯 Sparse MoE: if there are many camera types (>8)

**My recommendation**:
- Finish the current Task-GCA training first (Epoch 20)
- After evaluating, if camera processing needs a further boost
- Implement the Camera Adapter option (highest ROI)
- If it works well, consider upgrading to the attention variant

**Shall I start implementing the Camera Adapter code now?**