BEVFusion Camera Configuration Flexibility: Analysis and Options
Analysis date: 2025-11-06
Current configuration: 6 cameras + LiDAR (nuScenes standard)
Goal: support flexible camera configurations (1-N cameras)
📊 Current Architecture Analysis
Existing camera processing pipeline
Data flow:
img: (B, N, C, H, W) # N = 6 cameras
↓
Backbone: (B*N, C', H', W') # weights shared across all 6 cameras
↓
Neck: (B*N, 256, H'', W'')
↓
VTransform: (B, N, 80, D, H, W) # lifted into BEV space
↓
BEV Pooling: (B, 80*D, H, W) # aggregates the N cameras
↓
Fuser + Decoder: (B, 512, H, W)
↓
Task Heads
Key findings
- ✅ The number of cameras N is dynamic: in the code, N can take any value
- ✅ Shared weights: all cameras share the same backbone/neck
- ✅ BEV pooling aggregates automatically: however many cameras there are, everything is pooled into the same BEV space
- ⚠️ Fixed configuration: nuScenes hard-codes 6 specific cameras
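The "N is dynamic" observation is easy to verify in isolation: folding the camera dimension into the batch lets one shared backbone process any number of views. A minimal sketch, with a plain `nn.Conv2d` standing in for the real Swin backbone:

```python
import torch
from torch import nn

# A plain conv stands in for the shared Swin backbone; the point is
# that folding N into the batch dimension makes the camera count free.
backbone = nn.Conv2d(3, 16, kernel_size=3, padding=1)

shapes = []
for num_cams in (1, 4, 6):
    img = torch.randn(2, num_cams, 3, 32, 32)   # (B, N, C, H, W)
    B, N, C, H, W = img.shape
    feats = backbone(img.view(B * N, C, H, W))  # (B*N, C', H, W)
    shapes.append(tuple(feats.shape))
# shapes: [(2, 16, 32, 32), (8, 16, 32, 32), (12, 16, 32, 32)]
```

The same module weights are applied regardless of N, which is exactly why the data loading, not the model, is the only thing that pins the camera count.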
🎯 Options for Flexible Camera Configurations
Option 1: Simple dynamic configuration (simplest) ⭐
Suitable for: 3-8 cameras of similar type
Implementation:
- Only the data loading changes; the model is untouched
- All cameras share weights
- BEV pooling adapts automatically
# configs/custom/flexible_cameras.yaml
# Number of cameras
num_cameras: 4 # anywhere from 1 to N
camera_names:
- CAM_FRONT
- CAM_FRONT_LEFT
- CAM_FRONT_RIGHT
- CAM_BACK
# add or remove entries as needed
# Model config - no changes required!
model:
type: BEVFusion
encoders:
camera:
backbone:
type: SwinTransformer # weights shared
neck:
type: GeneralizedLSSFPN
vtransform:
type: DepthLSSTransform
# BEV pooling handles N cameras automatically
Pros:
- ✅ Simple to implement; no model code changes
- ✅ Cameras can be added or removed freely
- ✅ Stable training
Cons:
- ⚠️ All cameras must share weights
- ⚠️ No way to specialise for a particular camera
Option 2: Camera-specific Adapter (recommended) ⭐⭐⭐
Suitable for: cameras of different types (e.g. wide-angle + telephoto)
Core idea:
- Shared backbone
- An independent adapter per camera
- Suited to heterogeneous cameras
# mmdet3d/models/vtransforms/camera_aware_lss.py
import torch
from torch import nn

class CameraAwareLSS(BaseTransform):
    """
    Camera-aware LSS transform.
    Each camera gets its own lightweight adapter on top of shared features.
    """
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        num_cameras: int = 6,  # dynamic number of cameras
        camera_types: list = None,  # e.g. ['wide', 'tele', 'wide', ...]
        **kwargs
    ):
        super().__init__(in_channels, out_channels, **kwargs)
        # One lightweight adapter per camera
        self.camera_adapters = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, 1, 1),
                nn.BatchNorm2d(in_channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, in_channels, 1),
            ) for _ in range(num_cameras)
        ])
        # Optional camera-type embedding
        self.camera_types = camera_types  # kept for lookup in get_cam_feats
        self.use_camera_embedding = camera_types is not None
        if self.use_camera_embedding:
            unique_types = list(set(camera_types))
            self.camera_type_embed = nn.Embedding(
                len(unique_types),
                in_channels
            )
            self.type_to_id = {t: i for i, t in enumerate(unique_types)}
        # Parameter count: num_cameras * (in_channels^2 * 9 + in_channels^2)
        # Example: 6 * (256^2 * 10) ≈ 4M parameters

    def get_cam_feats(self, x, mats_dict):
        """
        Args:
            x: (B, N, C, fH, fW) - features from N cameras
            mats_dict: camera matrices
        """
        B, N, C, fH, fW = x.shape
        # Apply each camera's adapter
        adapted_features = []
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, fH, fW)
            # Camera-specific adapter
            cam_feat = self.camera_adapters[i](cam_feat)
            # (Optional) add the camera-type embedding
            if self.use_camera_embedding:
                cam_type_id = self.type_to_id[self.camera_types[i]]
                type_embed = self.camera_type_embed(
                    torch.tensor(cam_type_id, device=cam_feat.device)
                )
                cam_feat = cam_feat + type_embed.view(1, -1, 1, 1)
            adapted_features.append(cam_feat)
        # Recombine
        x = torch.stack(adapted_features, dim=1)  # (B, N, C, fH, fW)
        # Continue with the standard LSS processing
        return super().get_cam_feats(x, mats_dict)
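The parameter estimate in the comment above can be sanity-checked with simple arithmetic (conv weights only; BatchNorm parameters and conv biases are ignored, as in the comment):

```python
# Weight count per adapter: a 3x3 conv (in_ch * in_ch * 9) plus a
# 1x1 conv (in_ch * in_ch); biases and BatchNorm are ignored.
def adapter_weight_count(num_cameras, in_channels):
    conv3x3 = in_channels * in_channels * 9
    conv1x1 = in_channels * in_channels * 1
    return num_cameras * (conv3x3 + conv1x1)

total = adapter_weight_count(6, 256)
# 6 * 256^2 * 10 = 3,932,160, i.e. roughly 4M parameters
```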
Example config:
model:
encoders:
camera:
vtransform:
type: CameraAwareLSS
num_cameras: 4
camera_types: ['wide', 'tele', 'wide', 'wide']
in_channels: 256
out_channels: 80
Pros:
- ✅ Each camera gets its own processing capacity
- ✅ Modest parameter overhead (~4M)
- ✅ Adapts to different camera types
- ✅ Backward compatible
Cons:
- ⚠️ Requires changes to the vtransform code
- ⚠️ Slightly longer training time (~5%)
Option 3: Mixture of Experts (MoE) ⭐⭐⭐⭐
Suitable for: multiple camera types where peak performance matters
Core idea:
- Build expert networks for the different camera types
- A router selects experts dynamically
- Routing can be driven automatically by camera attributes
# mmdet3d/models/modules/camera_moe.py
import torch
from torch import nn
import torch.nn.functional as F

class CameraMoE(nn.Module):
    """
    Camera Mixture of Experts.
    Dynamically selects experts based on camera type.
    """
    def __init__(
        self,
        in_channels: int,
        num_experts: int = 3,  # e.g. three experts: wide, tele, fisheye
        expert_capacity: int = 256,
        **kwargs
    ):
        super().__init__()
        # Build the experts
        self.experts = nn.ModuleList([
            CameraExpert(in_channels, expert_capacity)
            for _ in range(num_experts)
        ])
        # Router: decides which expert to use
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global pooling
            nn.Flatten(),
            nn.Linear(in_channels, num_experts),
            nn.Softmax(dim=-1)
        )
        # Camera attribute encoder (FOV, focal length, etc.)
        self.camera_attr_encoder = nn.Sequential(
            nn.Linear(4, 64),  # [FOV, focal_x, focal_y, type_id]
            nn.ReLU(),
            nn.Linear(64, in_channels),
        )

    def forward(self, x, camera_attrs):
        """
        Args:
            x: (B, N, C, H, W) - N cameras
            camera_attrs: (B, N, 4) - [FOV, focal_x, focal_y, type_id]
        Returns:
            (B, N, C, H, W) - processed features
        """
        B, N, C, H, W = x.shape
        outputs = []
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, H, W)
            cam_attr = camera_attrs[:, i]  # (B, 4)
            # 1. Encode the camera attributes
            attr_embed = self.camera_attr_encoder(cam_attr)  # (B, C)
            attr_embed = attr_embed.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
            # 2. Router produces the expert weights
            router_input = cam_feat + attr_embed
            expert_weights = self.router(router_input)  # (B, num_experts)
            # 3. Combine the experts
            expert_outputs = []
            for expert in self.experts:
                expert_outputs.append(expert(cam_feat))
            # Weighted combination: (B, num_experts, C, H, W)
            expert_outputs = torch.stack(expert_outputs, dim=1)
            # Apply the router weights
            expert_weights = expert_weights.view(B, -1, 1, 1, 1)
            cam_output = (expert_outputs * expert_weights).sum(dim=1)
            outputs.append(cam_output)
        return torch.stack(outputs, dim=1)  # (B, N, C, H, W)

class CameraExpert(nn.Module):
    """A single expert network."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, 1, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, in_channels, 3, 1, 1),
            nn.BatchNorm2d(in_channels),
        )
        self.shortcut = nn.Identity()

    def forward(self, x):
        return F.relu(self.conv(x) + self.shortcut(x))
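The router's weighted combination boils down to a softmax-weighted average over the experts. A pure-Python sketch, with scalars standing in for the expert feature maps:

```python
import math

# Softmax over router logits, then a weighted sum of expert outputs.
def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

router_logits = [2.0, 0.5, -1.0]     # one logit per expert
weights = softmax(router_logits)
expert_outputs = [10.0, 20.0, 30.0]  # scalar stand-ins for feature maps
combined = sum(w * o for w, o in zip(weights, expert_outputs))
# The weights sum to 1, so `combined` is a convex combination of the
# expert outputs, dominated by the highest-logit expert.
```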
Example config:
model:
encoders:
camera:
backbone:
type: SwinTransformer
# shared backbone
neck:
type: GeneralizedLSSFPN
# add the MoE
use_camera_moe: true
moe_config:
num_experts: 3 # wide, tele, fisheye
expert_capacity: 256
vtransform:
type: DepthLSSTransform
# Data configuration
camera_attributes:
CAM_FRONT:
type: 'wide'
fov: 120.0
focal_length: [1266.0, 1266.0]
CAM_FRONT_TELE:
type: 'tele'
fov: 30.0
focal_length: [2532.0, 2532.0]
CAM_FRONT_LEFT:
type: 'wide'
fov: 120.0
focal_length: [1266.0, 1266.0]
Pros:
- ✅ Strongest representational capacity
- ✅ Automatically learns the best processing per camera
- ✅ Supports highly heterogeneous cameras
- ✅ Can incorporate camera attribute information
Cons:
- ⚠️ High implementation complexity
- ⚠️ Larger parameter overhead (~10M)
- ⚠️ Longer training time (~10-15%)
- ⚠️ Needs enough data to train the router
Option 4: Per-Camera Attention (most flexible) ⭐⭐⭐⭐⭐
Suitable for: any number and type of cameras
Core idea:
- Let the model learn each camera's importance
- Cross-camera attention
- Dynamically fuse the different cameras
# mmdet3d/models/modules/camera_attention.py
import torch
from torch import nn

class MultiCameraAttentionFusion(nn.Module):
    """
    Multi-camera attention fusion.
    Features:
    - Supports any number of cameras (1-N)
    - Automatically learns inter-camera relationships
    - Position-aware camera fusion
    """
    def __init__(
        self,
        in_channels: int,
        num_cameras: int = 6,  # variable
        use_camera_position: bool = True,  # use camera pose information
        use_cross_attention: bool = True,  # inter-camera interaction
    ):
        super().__init__()
        self.num_cameras = num_cameras
        self.use_camera_position = use_camera_position
        # Camera pose encoding (pose relative to the vehicle)
        if use_camera_position:
            self.position_encoder = nn.Sequential(
                nn.Linear(6, 128),  # [x, y, z, roll, pitch, yaw]
                nn.ReLU(),
                nn.Linear(128, in_channels),
            )
        # Self-attention over the camera dimension
        self.self_attention = nn.MultiheadAttention(
            embed_dim=in_channels,
            num_heads=8,
            dropout=0.1,
        )
        # Cross-camera attention
        if use_cross_attention:
            self.cross_attention = nn.MultiheadAttention(
                embed_dim=in_channels,
                num_heads=8,
                dropout=0.1,
            )
        # Camera importance weighting
        self.camera_weight_net = nn.Sequential(
            nn.Linear(in_channels, in_channels // 4),
            nn.ReLU(),
            nn.Linear(in_channels // 4, 1),
            nn.Sigmoid(),
        )

    def forward(self, x, camera_positions=None):
        """
        Args:
            x: (B, N, C, H, W) - N cameras
            camera_positions: (B, N, 6) - each camera's pose [x, y, z, roll, pitch, yaw]
        Returns:
            (B, N, C, H, W) - attention-enhanced features
        """
        B, N, C, H, W = x.shape
        # Reshape for attention: (N, B*H*W, C)
        x_flat = x.permute(1, 0, 3, 4, 2).reshape(N, B * H * W, C)
        # 1. Add the camera pose encoding
        if self.use_camera_position and camera_positions is not None:
            pos_embed = self.position_encoder(camera_positions)  # (B, N, C)
            pos_embed = pos_embed.permute(1, 0, 2).unsqueeze(2)  # (N, B, 1, C)
            pos_embed = pos_embed.expand(-1, -1, H * W, -1).reshape(N, B * H * W, C)
            x_flat = x_flat + pos_embed
        # 2. Self-attention (sequence dim = cameras, batch dim = B*H*W)
        x_self, _ = self.self_attention(x_flat, x_flat, x_flat)
        # 3. Cross-camera attention (each camera queries all cameras)
        if hasattr(self, 'cross_attention'):
            x_cross = []
            for i in range(N):
                query = x_self[i:i + 1]  # (1, B*H*W, C)
                key_value = x_self  # (N, B*H*W, C)
                attended, _ = self.cross_attention(query, key_value, key_value)
                x_cross.append(attended)
            x_attended = torch.cat(x_cross, dim=0)  # (N, B*H*W, C)
        else:
            x_attended = x_self
        # 4. Compute each camera's importance weight
        x_pooled = x_attended.mean(dim=1)  # (N, C) - global pooling per camera
        camera_weights = self.camera_weight_net(x_pooled)  # (N, 1)
        # Apply the weights
        camera_weights = camera_weights.view(N, 1, 1).expand(-1, B * H * W, C)
        x_weighted = x_attended * camera_weights
        # Reshape back: (N, B*H*W, C) -> (B, N, C, H, W)
        output = x_weighted.reshape(N, B, H, W, C).permute(1, 0, 4, 2, 3)
        return output
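The permute/reshape pair in `forward()` is the easiest place to introduce a silent bug, so it is worth checking in isolation that flattening to `(N, B*H*W, C)` and restoring `(B, N, C, H, W)` is lossless:

```python
import torch

B, N, C, H, W = 2, 4, 8, 3, 3
x = torch.arange(B * N * C * H * W, dtype=torch.float32).view(B, N, C, H, W)

# Same layout change as in forward(): (B, N, C, H, W) -> (N, B*H*W, C)
x_flat = x.permute(1, 0, 3, 4, 2).reshape(N, B * H * W, C)
# ... attention would run here ...
# Restore the original layout: (N, B*H*W, C) -> (B, N, C, H, W)
x_back = x_flat.reshape(N, B, H, W, C).permute(1, 0, 4, 2, 3)

roundtrip_ok = torch.equal(x_back, x)  # the round trip is lossless
```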
Integrating into BEVFusion:
# mmdet3d/models/fusion_models/bevfusion.py (modified)
def extract_camera_features(self, img, ...):
B, N, C, H, W = img.shape
# Backbone (shared)
x = img.view(B * N, C, H, W)
x = self.encoders["camera"]["backbone"](x)
x = self.encoders["camera"]["neck"](x)
# Reshape: (B*N, C', H', W') -> (B, N, C', H', W')
_, C_out, H_out, W_out = x.shape
x = x.view(B, N, C_out, H_out, W_out)
# ✨ Multi-camera attention fusion
if hasattr(self, 'camera_attention'):
x = self.camera_attention(x, camera_positions)
# VTransform
x = self.encoders["camera"]["vtransform"](x, ...)
return x
Pros:
- ✅ Most flexible; supports any N cameras
- ✅ Automatically learns camera importance
- ✅ Inter-camera information exchange
- ✅ Position-aware fusion
- ✅ Can handle missing cameras (dynamic N)
Cons:
- ⚠️ Complex to implement
- ⚠️ More parameters (~8M)
- ⚠️ Extra compute (+15-20 ms)
Option 5: Sparse MoE (efficient variant) ⭐⭐⭐⭐
Suitable for: many cameras (>6) where efficiency matters
Core idea:
- Activate only the top-K experts at a time
- Reduces compute cost
- Suited to heterogeneous camera rigs
# mmdet3d/models/modules/sparse_camera_moe.py
import torch
from torch import nn
import torch.nn.functional as F

class SparseCameraMoE(nn.Module):
    """
    Sparse camera MoE.
    For each sample, only the top-K experts are evaluated,
    which cuts the compute cost substantially.
    """
    def __init__(
        self,
        in_channels: int,
        num_experts: int = 8,  # supports 8 camera types
        num_active_experts: int = 3,  # only 3 are active at a time
        expert_hidden_dim: int = 256,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.num_active_experts = num_active_experts
        # Build the experts
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, expert_hidden_dim, 3, 1, 1),
                nn.BatchNorm2d(expert_hidden_dim),
                nn.ReLU(),
                nn.Conv2d(expert_hidden_dim, in_channels, 3, 1, 1),
                nn.BatchNorm2d(in_channels),
            ) for _ in range(num_experts)
        ])
        # Gating network: decides which experts to use
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, num_experts),
        )
        # Load-balancing auxiliary loss
        self.load_balancing_loss_weight = 0.01

    def forward(self, x, camera_types=None):
        """
        Args:
            x: (B, N, C, H, W)
            camera_types: (B, N) - camera type IDs
        Returns:
            output: (B, N, C, H, W)
            aux_loss: load-balancing loss
        """
        B, N, C, H, W = x.shape
        outputs = []
        gate_scores_all = []
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, H, W)
            # Gate scores for expert selection
            gate_scores = self.gate(cam_feat)  # (B, num_experts)
            gate_scores_all.append(gate_scores)
            # Top-K selection
            top_k_scores, top_k_indices = torch.topk(
                gate_scores,
                self.num_active_experts,
                dim=-1
            )  # (B, K)
            # Normalise the top-K scores
            top_k_scores = F.softmax(top_k_scores, dim=-1)
            # Evaluate only the top-K experts
            expert_outs = []
            for b in range(B):
                sample_out = torch.zeros_like(cam_feat[b:b + 1])
                for k in range(self.num_active_experts):
                    expert_idx = top_k_indices[b, k]
                    expert_weight = top_k_scores[b, k]
                    expert_result = self.experts[expert_idx](cam_feat[b:b + 1])
                    sample_out += expert_result * expert_weight
                expert_outs.append(sample_out)
            cam_output = torch.cat(expert_outs, dim=0)
            outputs.append(cam_output)
        output = torch.stack(outputs, dim=1)  # (B, N, C, H, W)
        # Load-balancing loss (encourages even expert usage)
        gate_scores_all = torch.stack(gate_scores_all, dim=1)  # (B, N, num_experts)
        gate_mean = gate_scores_all.mean(dim=[0, 1])  # (num_experts,)
        load_balance_loss = (gate_mean.std() ** 2) * self.load_balancing_loss_weight
        return output, load_balance_loss
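The core of the sparse routing is `torch.topk` followed by a softmax over only the surviving scores. A minimal demo:

```python
import torch
import torch.nn.functional as F

# One sample, four experts: keep the two largest gate scores,
# renormalise them, and ignore the other experts entirely.
gate_scores = torch.tensor([[1.0, 3.0, 0.5, 2.0]])  # (B=1, num_experts=4)
top_k_scores, top_k_indices = torch.topk(gate_scores, k=2, dim=-1)
top_k_weights = F.softmax(top_k_scores, dim=-1)

active = sorted(top_k_indices[0].tolist())  # experts 1 and 3 are selected
weight_sum = top_k_weights.sum().item()     # renormalised weights sum to 1
```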
Example config:
model:
encoders:
camera:
neck:
type: GeneralizedLSSFPN
use_sparse_moe: true
moe_config:
num_experts: 8
num_active_experts: 3 # only 3 active at a time
expert_hidden_dim: 256
# Supports multiple camera types
camera_configs:
- type: 'wide' # expert 0
- type: 'tele' # expert 1
- type: 'fisheye' # expert 2
- type: 'ultra_wide' # expert 3
# ... define more as needed
Pros:
- ✅ Supports many camera types (8+)
- ✅ Compute-efficient (only the top-K are evaluated)
- ✅ Expert selection is learned automatically
- ✅ Load balancing keeps training stable
Cons:
- ⚠️ Most complex to implement
- ⚠️ Router training needs debugging
- ⚠️ Requires sufficiently diverse camera data
🔍 Option Comparison
| Option | Flexibility | Complexity | Params | Compute cost | Recommended scenario |
|---|---|---|---|---|---|
| Simple dynamic | ⭐⭐ | ⭐ | +0 | +0% | 3-6 similar cameras |
| Camera Adapter | ⭐⭐⭐ | ⭐⭐ | +4M | +5% | Different camera types (4-6) |
| MoE | ⭐⭐⭐⭐ | ⭐⭐⭐ | +10M | +20% | Multiple camera types |
| Per-Camera Attn | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | +8M | +15% | Any configuration, best performance |
| Sparse MoE | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | +12M | +10% | Many cameras (>6) |
💡 Recommendations for Your BEVFusion Project
Current status
Your current configuration:
- ✅ 6 cameras (nuScenes standard)
- ✅ 1 LiDAR
- ✅ Task-specific GCA already implemented
Recommended path
Short term (once the current training run finishes):
Recommendation: Option 2 - Camera Adapter
Rationale:
- ✅ Simple to implement, low risk
- ✅ Small parameter overhead (~4M)
- ✅ Compatible with the Task-GCA architecture
- ✅ Supports 4-8 cameras from the start
- ✅ Implementable in 2-3 days
Implementation steps:
# Step 1: Create the CameraAwareLSS class
# Step 2: Enable it in the config file
# Step 3: Fine-tune from the existing checkpoint
# Step 4: Test different camera counts (4, 5, 6, 8)
Mid term (if stronger performance is needed):
Recommendation: Option 4 - Per-Camera Attention
Rationale:
- ✅ Best performance
- ✅ Highest flexibility
- ✅ Combines with Task-GCA
- ✅ High academic value
Combined architecture:
Camera Input (N cameras)
↓
Shared Backbone (Swin Transformer)
↓
Multi-Camera Attention Fusion ← new
↓
VTransform (LSS)
↓
Fuser (Camera + LiDAR)
↓
Decoder
↓
Task-specific GCA ← existing
├─ Detection GCA
└─ Segmentation GCA
↓
Task Heads
🚀 Implementation Plan
Stage 1: Basic flexibility (1 week)
# 1. Modify the data loading to support a dynamic camera count
# mmdet3d/datasets/pipelines/loading.py
from PIL import Image

@PIPELINES.register_module()
class LoadMultiViewImageFromFiles:
    def __init__(self, to_float32=False, camera_names=None):
        self.to_float32 = to_float32
        self.camera_names = camera_names or [
            'CAM_FRONT', 'CAM_FRONT_RIGHT', 'CAM_FRONT_LEFT',
            'CAM_BACK', 'CAM_BACK_LEFT', 'CAM_BACK_RIGHT'
        ]
        self.num_cameras = len(self.camera_names)

    def __call__(self, results):
        # Load only the requested cameras
        images = []
        for cam_name in self.camera_names:
            if cam_name in results['cams']:
                img_path = results['cams'][cam_name]['data_path']
                images.append(Image.open(img_path))
        results['img'] = images
        results['num_cameras'] = len(images)
        return results
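The camera-selection logic in `__call__` can be exercised without any image files by stubbing out the actual loading; the helper below is hypothetical and only mirrors the name-filtering step (cameras missing from the sample are silently skipped):

```python
# Hypothetical helper mirroring the name-filtering step of
# LoadMultiViewImageFromFiles.__call__, with image loading stubbed out.
def select_camera_paths(results, camera_names):
    paths = []
    for cam_name in camera_names:
        if cam_name in results['cams']:
            paths.append(results['cams'][cam_name]['data_path'])
    return paths

results = {
    'cams': {
        'CAM_FRONT': {'data_path': 'front.jpg'},
        'CAM_BACK': {'data_path': 'back.jpg'},
    }
}
# CAM_BACK_RIGHT is not in the sample, so only two paths are returned
paths = select_camera_paths(results, ['CAM_FRONT', 'CAM_BACK', 'CAM_BACK_RIGHT'])
```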
Stage 2: Camera Adapter (1 week)
# Implement the camera-specific adapter
vim mmdet3d/models/vtransforms/camera_aware_lss.py
# Config file
vim configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_camera_aware.yaml
# Test
python tools/train.py configs/.../multitask_camera_aware.yaml \
--load_from runs/run-326653dc-2334d461/epoch_5.pth \
--data.samples_per_gpu 1
Stage 3: MoE/Attention (optional, 2 weeks)
# If maximum performance is needed, implement the attention or MoE variant
vim mmdet3d/models/modules/camera_attention.py
vim mmdet3d/models/modules/camera_moe.py
📊 Expected Results
Option 2 (Camera Adapter)
Configuration flexibility:
✅ Supports 1-8 cameras
✅ Independent processing per camera
✅ Automatically adapts to different FOVs
Performance impact:
Params: +4M (110M → 114M)
Speed: +5% (2.66s → 2.79s/iter)
Accuracy: +1-2% (from the adapters)
Training:
Starting from epoch_5.pth: needs 3-5 epochs to adapt
Total time: ~2 days
Option 4 (Per-Camera Attention)
Configuration flexibility:
✅ Supports any number of cameras
✅ Inter-camera information exchange
✅ Dynamic camera weighting
Performance impact:
Params: +8M (110M → 118M)
Speed: +15% (2.66s → 3.06s/iter)
Accuracy: +2-4% (from attention)
Training:
Starting from epoch_5.pth: needs 5-8 epochs
Total time: ~3-4 days
🎯 Concrete Roadmap
If you need this now
Recommended path: the Camera Adapter option
Week 1: Implementation (while the current training run finishes)
Day 1: Design the CameraAwareLSS interface
Day 2-3: Implement the camera adapter module
Day 4: Write the config files
Day 5: Unit tests
Week 2: Training (after Epoch 20 completes)
Day 1: Start fine-tuning from epoch_20.pth
Day 2-3: Train the Camera Adapter (5 epochs)
Day 4: Evaluate the different camera configurations
Day 5: Performance comparison and documentation
Week 3: Optimisation (optional)
Day 1-3: Upgrade to the attention variant
Day 4-5: Further tuning
📝 Code Templates
I can create the following right away:
- CameraAwareLSS implementation (mmdet3d/models/vtransforms/camera_aware_lss.py)
- Config file template (configs/.../multitask_camera_aware.yaml)
- Test script (tools/test_camera_configs.py)
- Documentation (docs/CAMERA_FLEXIBILITY_GUIDE.md)
✅ Summary
For your project
Immediately actionable:
- ✅ Simple change: support 4-8 cameras (data loading only)
- ✅ Camera Adapter: 2 weeks to implement, 1-2% accuracy gain
- ✅ Compatible with Task-GCA: can be stacked on top
Advanced options:
- 🎯 Per-Camera Attention: if you need the best performance
- 🎯 Sparse MoE: if there are many camera types (>8)
My recommendation:
- Finish the current Task-GCA training first (Epoch 20)
- Evaluate, and if camera processing needs a further boost
- Implement the Camera Adapter option (highest ROI)
- If it works well, consider upgrading to attention
Shall I start implementing the Camera Adapter code now?