# BEVFusion Camera Configuration: Flexibility Analysis and Options
**Analysis date**: 2025-11-06
**Current configuration**: 6 cameras + LiDAR (nuScenes standard)
**Goal**: support flexible camera configurations (1-N cameras)
---
## 📊 Current Architecture Analysis
### Existing Camera Processing Flow
```
Data flow:
img: (B, N, C, H, W)             # N=6 cameras
Backbone: (B*N, C', H', W')      # weights shared across the 6 cameras
Neck: (B*N, 256, H'', W'')
VTransform: (B, N, 80, D, H, W)  # lift to BEV space
BEV Pooling: (B, 80*D, H, W)     # aggregate the N cameras
Fuser + Decoder: (B, 512, H, W)
Task Heads
```
### Key Findings
1. ✅ **The camera count N is dynamic**: in the code, N can take any value
2. ✅ **Shared weights**: all cameras share a single backbone/neck
3. ✅ **BEV pooling aggregates automatically**: any number of cameras is pooled into the same BEV space
4. ⚠️ **Fixed configuration**: nuScenes hard-codes the 6 specific cameras
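The first two findings can be checked with a toy stand-in for the shared backbone: folding N into the batch dimension is exactly why N never has to be fixed. The module and sizes below are illustrative, not the real SwinTransformer:

```python
import torch
import torch.nn as nn

# Toy stand-in for the shared camera backbone: one set of weights,
# applied to any number of cameras by folding N into the batch axis.
backbone = nn.Conv2d(3, 16, kernel_size=3, padding=1)

for N in (1, 4, 6):
    img = torch.randn(2, N, 3, 32, 32)          # (B, N, C, H, W)
    B, n, C, H, W = img.shape
    feat = backbone(img.view(B * n, C, H, W))   # (B*N, 16, H, W)
    feat = feat.view(B, n, 16, H, W)            # back to per-camera layout
    assert feat.shape == (2, N, 16, 32, 32)
```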
---
## 🎯 Options for Flexible Camera Configurations
### Option 1: Simple Dynamic Configuration (simplest) ⭐
**Use case**: 3-8 cameras of similar type
**Approach**:
- Only the data loading changes; the model needs no modification
- All cameras share weights
- BEV pooling adapts automatically
```yaml
# configs/custom/flexible_cameras.yaml
# Camera count
num_cameras: 4  # can be 1-N
camera_names:
  - CAM_FRONT
  - CAM_FRONT_LEFT
  - CAM_FRONT_RIGHT
  - CAM_BACK
  # add or remove entries freely

# Model config: no changes needed!
model:
  type: BEVFusion
  encoders:
    camera:
      backbone:
        type: SwinTransformer  # weights shared
      neck:
        type: GeneralizedLSSFPN
      vtransform:
        type: DepthLSSTransform
        # BEV pooling handles N cameras automatically
```
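As a sanity check on the "BEV pooling adapts automatically" claim, here is a toy reduction over the camera axis. The real BEV pooling is a geometry-aware scatter, not a plain sum; this sketch only shows why the output shape is independent of N:

```python
import torch

# Toy stand-in for BEV pooling: assume each camera's features are already
# splatted onto a shared BEV grid; aggregation is then a reduction over
# the camera axis, so any N yields the same output shape.
def toy_bev_pool(cam_bev):        # (B, N, C, H, W)
    return cam_bev.sum(dim=1)     # (B, C, H, W)

for N in (1, 4, 8):
    out = toy_bev_pool(torch.randn(2, N, 80, 128, 128))
    assert out.shape == (2, 80, 128, 128)
```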
**Pros**:
- ✅ Simple to implement; no model code changes
- ✅ Cameras can be added or removed freely
- ✅ Stable training
**Cons**:
- ⚠️ All cameras must share weights
- ⚠️ No per-camera specialization
---
### Option 2: Camera-specific Adapter (recommended) ⭐⭐⭐
**Use case**: cameras of different types (e.g. wide-angle + telephoto)
**Core idea**:
- Backbone stays shared
- Each camera gets its own lightweight adapter
- Suited to heterogeneous camera rigs
```python
# mmdet3d/models/vtransforms/camera_aware_lss.py
import torch
from torch import nn


class CameraAwareLSS(BaseTransform):
    """Camera-aware LSS transform.

    Each camera gets its own lightweight adapter on top of the shared
    backbone features.
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        num_cameras: int = 6,        # dynamic camera count
        camera_types: list = None,   # e.g. ['wide', 'tele', 'wide', ...]
        **kwargs,
    ):
        super().__init__(in_channels, out_channels, **kwargs)
        # One lightweight adapter per camera
        self.camera_adapters = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, 1, 1),
                nn.BatchNorm2d(in_channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, in_channels, 1),
            ) for _ in range(num_cameras)
        ])
        # Optional camera-type embedding
        self.camera_types = camera_types
        self.use_camera_embedding = camera_types is not None
        if self.use_camera_embedding:
            unique_types = list(set(camera_types))
            self.camera_type_embed = nn.Embedding(len(unique_types), in_channels)
            self.type_to_id = {t: i for i, t in enumerate(unique_types)}
        # Parameter count: num_cameras × (in_channels² × 9 + in_channels²)
        # Example: 6 × (256² × 10) ≈ 4M parameters

    def get_cam_feats(self, x, mats_dict):
        """
        Args:
            x: (B, N, C, fH, fW) - features of the N cameras
            mats_dict: camera matrices
        """
        B, N, C, fH, fW = x.shape
        # Apply each camera's adapter to its slice
        adapted_features = []
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, fH, fW)
            cam_feat = self.camera_adapters[i](cam_feat)
            # Optionally add the camera-type embedding
            if self.use_camera_embedding:
                cam_type_id = self.type_to_id[self.camera_types[i]]
                type_embed = self.camera_type_embed(
                    torch.tensor(cam_type_id, device=cam_feat.device)
                )
                cam_feat = cam_feat + type_embed.view(1, -1, 1, 1)
            adapted_features.append(cam_feat)
        # Recombine and continue with the standard LSS path
        x = torch.stack(adapted_features, dim=1)  # (B, N, C, fH, fW)
        return super().get_cam_feats(x, mats_dict)
```
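The per-camera adapter loop at the heart of `get_cam_feats` can be exercised in isolation; the channel and spatial sizes below are made up for the demo:

```python
import torch
import torch.nn as nn

# Mini version of the adapter loop: one small adapter per camera,
# applied to that camera's feature slice only.
num_cameras, C = 4, 32
adapters = nn.ModuleList(
    nn.Conv2d(C, C, kernel_size=1) for _ in range(num_cameras)
)

x = torch.randn(2, num_cameras, C, 16, 16)  # (B, N, C, fH, fW)
adapted = torch.stack(
    [adapters[i](x[:, i]) for i in range(num_cameras)], dim=1
)
assert adapted.shape == x.shape  # per-camera processing keeps the layout
```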
**Config example**:
```yaml
model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS
        num_cameras: 4
        camera_types: ['wide', 'tele', 'wide', 'wide']
        in_channels: 256
        out_channels: 80
```
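The ~4M parameter figure can be verified by instantiating one adapter with the config's `in_channels: 256` and counting:

```python
import torch.nn as nn

# Rebuild one adapter exactly as in CameraAwareLSS and count parameters.
in_channels, num_cameras = 256, 6
adapter = nn.Sequential(
    nn.Conv2d(in_channels, in_channels, 3, 1, 1),
    nn.BatchNorm2d(in_channels),
    nn.ReLU(inplace=True),
    nn.Conv2d(in_channels, in_channels, 1),
)
per_adapter = sum(p.numel() for p in adapter.parameters())
total = per_adapter * num_cameras
print(total)  # 3938304, i.e. ~3.9M for 6 cameras
```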
**Pros**:
- ✅ Each camera gets its own processing capacity
- ✅ Modest parameter overhead (~4M)
- ✅ Adapts to different camera types
- ✅ Backward compatible
**Cons**:
- ⚠️ Requires changes to the vtransform code
- ⚠️ Slightly longer training time (~5%)
---
### Option 3: Mixture of Experts (MoE) ⭐⭐⭐⭐
**Use case**: multiple camera types where peak performance matters
**Core idea**:
- Dedicated expert networks for different camera types
- A router selects experts dynamically
- Routing can be conditioned on camera attributes
```python
# mmdet3d/models/modules/camera_moe.py
import torch
from torch import nn
import torch.nn.functional as F


class CameraMoE(nn.Module):
    """Camera mixture of experts.

    Dynamically weights experts per camera based on its features and
    attributes.
    """

    def __init__(
        self,
        in_channels: int,
        num_experts: int = 3,       # e.g. wide, tele, fisheye
        expert_capacity: int = 256,
        **kwargs,
    ):
        super().__init__()
        # Expert networks
        self.experts = nn.ModuleList([
            CameraExpert(in_channels, expert_capacity)
            for _ in range(num_experts)
        ])
        # Router: decides how to weight the experts
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global pooling
            nn.Flatten(),
            nn.Linear(in_channels, num_experts),
            nn.Softmax(dim=-1),
        )
        # Camera attribute encoder (FOV, focal length, etc.)
        self.camera_attr_encoder = nn.Sequential(
            nn.Linear(4, 64),  # [FOV, focal_x, focal_y, type_id]
            nn.ReLU(),
            nn.Linear(64, in_channels),
        )

    def forward(self, x, camera_attrs):
        """
        Args:
            x: (B, N, C, H, W) - features of the N cameras
            camera_attrs: (B, N, 4) - [FOV, focal_x, focal_y, type_id]
        Returns:
            (B, N, C, H, W) - processed features
        """
        B, N, C, H, W = x.shape
        outputs = []
        for i in range(N):
            cam_feat = x[:, i]             # (B, C, H, W)
            cam_attr = camera_attrs[:, i]  # (B, 4)
            # 1. Encode the camera attributes
            attr_embed = self.camera_attr_encoder(cam_attr)      # (B, C)
            attr_embed = attr_embed.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
            # 2. Route on features conditioned by attributes
            router_input = cam_feat + attr_embed
            expert_weights = self.router(router_input)  # (B, num_experts)
            # 3. Run all experts and combine
            expert_outputs = torch.stack(
                [expert(cam_feat) for expert in self.experts], dim=1
            )  # (B, num_experts, C, H, W)
            expert_weights = expert_weights.view(B, -1, 1, 1, 1)
            cam_output = (expert_outputs * expert_weights).sum(dim=1)
            outputs.append(cam_output)
        return torch.stack(outputs, dim=1)  # (B, N, C, H, W)


class CameraExpert(nn.Module):
    """A single expert network (residual block)."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, 1, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, in_channels, 3, 1, 1),
            nn.BatchNorm2d(in_channels),
        )
        self.shortcut = nn.Identity()

    def forward(self, x):
        return F.relu(self.conv(x) + self.shortcut(x))
```
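The weighted expert combination in `CameraMoE.forward` reduces to a stack-and-sum; here is a standalone shape check, with sizes invented for the demo:

```python
import torch
import torch.nn.functional as F

# Soft expert mixing: run every expert, then blend with router weights.
B, C, H, W, num_experts = 2, 32, 16, 16, 3
expert_outputs = torch.randn(B, num_experts, C, H, W)  # stacked expert outputs
router_logits = torch.randn(B, num_experts)
weights = F.softmax(router_logits, dim=-1).view(B, num_experts, 1, 1, 1)
mixed = (expert_outputs * weights).sum(dim=1)          # (B, C, H, W)
assert mixed.shape == (B, C, H, W)
```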
**Config example**:
```yaml
model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer  # shared backbone
      neck:
        type: GeneralizedLSSFPN
        # enable MoE
        use_camera_moe: true
        moe_config:
          num_experts: 3  # wide, tele, fisheye
          expert_capacity: 256
      vtransform:
        type: DepthLSSTransform

# Data config
camera_attributes:
  CAM_FRONT:
    type: 'wide'
    fov: 120.0
    focal_length: [1266.0, 1266.0]
  CAM_FRONT_TELE:
    type: 'tele'
    fov: 30.0
    focal_length: [2532.0, 2532.0]
  CAM_FRONT_LEFT:
    type: 'wide'
    fov: 120.0
    focal_length: [1266.0, 1266.0]
```
**Pros**:
- ✅ Strongest representational capacity
- ✅ Learns the best processing for each camera type automatically
- ✅ Handles highly heterogeneous camera rigs
- ✅ Can fuse camera attribute information
**Cons**:
- ⚠️ High implementation complexity
- ⚠️ Larger parameter overhead (~10M)
- ⚠️ Longer training time (~10-15%)
- ⚠️ The router needs enough data to train well
---
### Option 4: Per-Camera Attention (most flexible) ⭐⭐⭐⭐⭐
**Use case**: any number and mix of camera types
**Core idea**:
- Let the model learn each camera's importance
- Cross-camera attention
- Dynamic fusion across cameras
```python
# mmdet3d/models/modules/camera_attention.py
import torch
from torch import nn


class MultiCameraAttentionFusion(nn.Module):
    """Multi-camera attention fusion.

    Features:
    - Supports any number of cameras (1-N)
    - Learns the relationships between cameras
    - Position-aware camera fusion
    """

    def __init__(
        self,
        in_channels: int,
        num_cameras: int = 6,              # variable
        use_camera_position: bool = True,  # use camera pose information
        use_cross_attention: bool = True,  # cross-camera interaction
    ):
        super().__init__()
        self.num_cameras = num_cameras
        self.use_camera_position = use_camera_position
        # Camera pose encoding (pose relative to the vehicle)
        if use_camera_position:
            self.position_encoder = nn.Sequential(
                nn.Linear(6, 128),  # [x, y, z, roll, pitch, yaw]
                nn.ReLU(),
                nn.Linear(128, in_channels),
            )
        # Attention over the camera axis
        self.self_attention = nn.MultiheadAttention(
            embed_dim=in_channels, num_heads=8, dropout=0.1,
        )
        # Cross-camera attention
        if use_cross_attention:
            self.cross_attention = nn.MultiheadAttention(
                embed_dim=in_channels, num_heads=8, dropout=0.1,
            )
        # Camera importance weighting
        self.camera_weight_net = nn.Sequential(
            nn.Linear(in_channels, in_channels // 4),
            nn.ReLU(),
            nn.Linear(in_channels // 4, 1),
            nn.Sigmoid(),
        )

    def forward(self, x, camera_positions=None):
        """
        Args:
            x: (B, N, C, H, W) - features of the N cameras
            camera_positions: (B, N, 6) - per-camera pose [x, y, z, r, p, y]
        Returns:
            (B, N, C, H, W) - attention-enhanced features
        """
        B, N, C, H, W = x.shape
        # Reshape for attention: cameras form the sequence axis
        x_flat = x.permute(1, 0, 3, 4, 2).reshape(N, B * H * W, C)
        # 1. Add the camera pose encoding
        if self.use_camera_position and camera_positions is not None:
            pos_embed = self.position_encoder(camera_positions)  # (B, N, C)
            pos_embed = pos_embed.permute(1, 0, 2).unsqueeze(2)  # (N, B, 1, C)
            pos_embed = pos_embed.expand(-1, -1, H * W, -1).reshape(N, B * H * W, C)
            x_flat = x_flat + pos_embed
        # 2. Self-attention over the camera axis
        x_self, _ = self.self_attention(x_flat, x_flat, x_flat)
        # 3. Cross-camera attention: query from each camera, key/value from all
        if hasattr(self, 'cross_attention'):
            x_cross = []
            for i in range(N):
                query = x_self[i:i + 1]  # (1, B*H*W, C)
                attended, attn_weights = self.cross_attention(query, x_self, x_self)
                x_cross.append(attended)
            x_attended = torch.cat(x_cross, dim=0)  # (N, B*H*W, C)
        else:
            x_attended = x_self
        # 4. Per-camera importance weights via global pooling
        x_pooled = x_attended.mean(dim=1)                  # (N, C)
        camera_weights = self.camera_weight_net(x_pooled)  # (N, 1)
        camera_weights = camera_weights.view(N, 1, 1).expand(-1, B * H * W, C)
        x_weighted = x_attended * camera_weights
        # Reshape back: (N, B*H*W, C) -> (B, N, C, H, W)
        output = x_weighted.reshape(N, B, H, W, C).permute(1, 0, 4, 2, 3)
        return output
```
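A quick check that `nn.MultiheadAttention` really is N-agnostic when the cameras form the sequence axis, as in the module above (the dimensions are placeholders):

```python
import torch
import torch.nn as nn

C = 64
attn = nn.MultiheadAttention(embed_dim=C, num_heads=8)  # seq-first layout

for N in (3, 6):  # works for any camera count
    B, H, W = 2, 8, 8
    x = torch.randn(B, N, C, H, W)
    # cameras -> sequence axis, spatial positions -> batch axis
    x_flat = x.permute(1, 0, 3, 4, 2).reshape(N, B * H * W, C)
    out, _ = attn(x_flat, x_flat, x_flat)
    assert out.shape == (N, B * H * W, C)
```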
**Integration into BEVFusion**:
```python
# mmdet3d/models/fusion_models/bevfusion.py (modified)
def extract_camera_features(self, img, ...):
    B, N, C, H, W = img.shape
    # Backbone (shared)
    x = img.view(B * N, C, H, W)
    x = self.encoders["camera"]["backbone"](x)
    x = self.encoders["camera"]["neck"](x)
    # Reshape: (B*N, C', H', W') -> (B, N, C', H', W')
    _, C_out, H_out, W_out = x.shape
    x = x.view(B, N, C_out, H_out, W_out)
    # ✨ Multi-camera attention fusion
    if hasattr(self, 'camera_attention'):
        x = self.camera_attention(x, camera_positions)
    # VTransform
    x = self.encoders["camera"]["vtransform"](x, ...)
    return x
```
**Pros**:
- ✅ Most flexible: supports any N cameras
- ✅ Learns camera importance automatically
- ✅ Information exchange between cameras
- ✅ Position-aware fusion
- ✅ Can handle missing cameras (dynamic N)
**Cons**:
- ⚠️ Complex implementation
- ⚠️ More parameters (~8M)
- ⚠️ Extra compute (+15-20ms)
---
### Option 5: Sparse MoE (efficient variant) ⭐⭐⭐⭐
**Use case**: many cameras (>6) where efficiency matters
**Core idea**:
- Activate only the top-K experts per sample
- Lower compute cost
- Suited to heterogeneous camera systems
```python
# mmdet3d/models/modules/sparse_camera_moe.py
import torch
from torch import nn
import torch.nn.functional as F


class SparseCameraMoE(nn.Module):
    """Sparse camera MoE.

    Uses only the top-K experts per sample, which cuts compute
    substantially.
    """

    def __init__(
        self,
        in_channels: int,
        num_experts: int = 8,         # supports 8 camera types
        num_active_experts: int = 3,  # only 3 used per sample
        expert_hidden_dim: int = 256,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.num_active_experts = num_active_experts
        # Expert networks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, expert_hidden_dim, 3, 1, 1),
                nn.BatchNorm2d(expert_hidden_dim),
                nn.ReLU(),
                nn.Conv2d(expert_hidden_dim, in_channels, 3, 1, 1),
                nn.BatchNorm2d(in_channels),
            ) for _ in range(num_experts)
        ])
        # Gating network: selects the experts
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, num_experts),
        )
        # Load-balancing auxiliary loss
        self.load_balancing_loss_weight = 0.01

    def forward(self, x, camera_types=None):
        """
        Args:
            x: (B, N, C, H, W)
            camera_types: (B, N) - camera type IDs
        Returns:
            output: (B, N, C, H, W)
            aux_loss: load-balancing loss
        """
        B, N, C, H, W = x.shape
        outputs = []
        gate_scores_all = []
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, H, W)
            # Gate scores: which experts to use
            gate_scores = self.gate(cam_feat)  # (B, num_experts)
            gate_scores_all.append(gate_scores)
            # Top-K selection, then renormalize the kept scores
            top_k_scores, top_k_indices = torch.topk(
                gate_scores, self.num_active_experts, dim=-1
            )  # (B, K)
            top_k_scores = F.softmax(top_k_scores, dim=-1)
            # Evaluate only the top-K experts
            expert_outs = []
            for b in range(B):
                sample_out = torch.zeros_like(cam_feat[b:b + 1])
                for k in range(self.num_active_experts):
                    expert_idx = top_k_indices[b, k]
                    expert_weight = top_k_scores[b, k]
                    expert_result = self.experts[expert_idx](cam_feat[b:b + 1])
                    sample_out += expert_result * expert_weight
                expert_outs.append(sample_out)
            outputs.append(torch.cat(expert_outs, dim=0))
        output = torch.stack(outputs, dim=1)  # (B, N, C, H, W)
        # Load-balancing loss (encourages uniform expert usage);
        # computed on softmaxed gate probabilities rather than raw logits
        gate_scores_all = torch.stack(gate_scores_all, dim=1)  # (B, N, num_experts)
        gate_probs = F.softmax(gate_scores_all, dim=-1)
        gate_mean = gate_probs.mean(dim=[0, 1])                # (num_experts,)
        load_balance_loss = (gate_mean.std() ** 2) * self.load_balancing_loss_weight
        return output, load_balance_loss
```
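The top-K gating step in isolation, mirroring what `SparseCameraMoE` does per camera; the sizes are arbitrary:

```python
import torch
import torch.nn.functional as F

# Pick the K highest-scoring experts per sample and renormalize
# their scores so the kept weights sum to 1.
B, num_experts, K = 4, 8, 3
gate_scores = torch.randn(B, num_experts)
top_k_scores, top_k_indices = torch.topk(gate_scores, K, dim=-1)
top_k_weights = F.softmax(top_k_scores, dim=-1)

assert top_k_indices.shape == (B, K)
assert torch.allclose(top_k_weights.sum(dim=-1), torch.ones(B))
```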
**Config example**:
```yaml
model:
  encoders:
    camera:
      neck:
        type: GeneralizedLSSFPN
        use_sparse_moe: true
        moe_config:
          num_experts: 8
          num_active_experts: 3  # only 3 active at a time
          expert_hidden_dim: 256

# Supports many camera types
camera_configs:
  - type: 'wide'        # expert 0
  - type: 'tele'        # expert 1
  - type: 'fisheye'     # expert 2
  - type: 'ultra_wide'  # expert 3
  # ... more can be defined
```
**Pros**:
- ✅ Supports many camera types (8+)
- ✅ Compute-efficient (only top-K experts run)
- ✅ Expert selection is learned automatically
- ✅ Load balancing keeps training stable
**Cons**:
- ⚠️ Most complex to implement
- ⚠️ Router training needs debugging
- ⚠️ Needs sufficiently diverse camera data
---
## 🔍 Option Comparison
| Option | Flexibility | Complexity | Params | Compute overhead | Recommended for |
|--------|-------------|------------|--------|------------------|-----------------|
| **Simple dynamic** | ⭐⭐ | ⭐ | +0 | +0% | 3-6 similar cameras |
| **Camera Adapter** | ⭐⭐⭐ | ⭐⭐ | +4M | +5% | mixed camera types (4-6) |
| **MoE** | ⭐⭐⭐⭐ | ⭐⭐⭐ | +10M | +20% | many camera types |
| **Per-Camera Attn** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | +8M | +15% | any configuration, best performance |
| **Sparse MoE** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | +12M | +10% | many cameras (>6) |
---
## 💡 Recommendations for Your BEVFusion Project
### Current State
Your current configuration:
- ✅ 6 cameras (nuScenes standard)
- ✅ 1 LiDAR
- ✅ Task-specific GCA already implemented
### Recommended Path
**Short term (after the current training run finishes)**:
**Recommended: Option 2 - Camera Adapter**
Rationale:
1. ✅ Simple to implement, low risk
2. ✅ Small parameter overhead (~4M)
3. ✅ Compatible with the Task-GCA architecture
4. ✅ Can support 4-8 cameras right away
5. ✅ Implementable in 2-3 days
Implementation steps:
```python
# Step 1: create the CameraAwareLSS class
# Step 2: enable it in the config file
# Step 3: fine-tune from an existing checkpoint
# Step 4: test different camera counts (4, 5, 6, 8)
```
**Medium term (if stronger performance is needed)**:
**Recommended: Option 4 - Per-Camera Attention**
Rationale:
1. ✅ Best performance
2. ✅ Highest flexibility
3. ✅ Combines with Task-GCA
4. ✅ High academic value
Combined architecture:
```
Camera Input (N cameras)
  ↓
Shared Backbone (Swin Transformer)
  ↓
Multi-Camera Attention Fusion   ← new
  ↓
VTransform (LSS)
  ↓
Fuser (Camera + LiDAR)
  ↓
Decoder
  ↓
Task-specific GCA               ← existing
  ├─ Detection GCA
  └─ Segmentation GCA
  ↓
Task Heads
```
---
## 🚀 Implementation Plan
### Phase 1: Basic Flexibility (1 week)
```python
# 1. Make the data loading support a dynamic camera count
# mmdet3d/datasets/pipelines/loading.py
from PIL import Image


@PIPELINES.register_module()
class LoadMultiViewImageFromFiles:
    def __init__(self, to_float32=False, camera_names=None):
        self.to_float32 = to_float32
        self.camera_names = camera_names or [
            'CAM_FRONT', 'CAM_FRONT_RIGHT', 'CAM_FRONT_LEFT',
            'CAM_BACK', 'CAM_BACK_LEFT', 'CAM_BACK_RIGHT',
        ]
        self.num_cameras = len(self.camera_names)

    def __call__(self, results):
        # Load only the configured cameras
        images = []
        for cam_name in self.camera_names:
            if cam_name in results['cams']:
                img_path = results['cams'][cam_name]['data_path']
                images.append(Image.open(img_path))
        results['img'] = images
        results['num_cameras'] = len(images)
        return results
```
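The camera-subset logic in `__call__` can be tested without touching any image files; the paths below are placeholders:

```python
# Subset selection only: given a per-sample `cams` dict, keep the
# configured camera names in order. No image I/O involved.
def select_camera_paths(cams, camera_names):
    return [cams[n]["data_path"] for n in camera_names if n in cams]

cams = {
    name: {"data_path": f"/data/{name}.jpg"}  # placeholder paths
    for name in [
        "CAM_FRONT", "CAM_FRONT_RIGHT", "CAM_FRONT_LEFT",
        "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
    ]
}
paths = select_camera_paths(cams, ["CAM_FRONT", "CAM_BACK"])
assert paths == ["/data/CAM_FRONT.jpg", "/data/CAM_BACK.jpg"]
```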
### Phase 2: Camera Adapter (1 week)
```bash
# Implement the camera-specific adapter
vim mmdet3d/models/vtransforms/camera_aware_lss.py

# Config file
vim configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_camera_aware.yaml

# Test
python tools/train.py configs/.../multitask_camera_aware.yaml \
    --load_from runs/run-326653dc-2334d461/epoch_5.pth \
    --data.samples_per_gpu 1
```
### Phase 3: MoE/Attention (optional, 2 weeks)
```bash
# If peak performance is needed, implement the attention or MoE module
vim mmdet3d/models/modules/camera_attention.py
vim mmdet3d/models/modules/camera_moe.py
```
---
## 📊 Expected Results
### Option 2 (Camera Adapter)
```
Configuration flexibility:
✅ Supports 1-8 cameras
✅ Independent processing per camera
✅ Adapts to different FOVs
Performance impact:
Params: +4M (110M → 114M)
Speed: +5% (2.66s → 2.79s/iter)
Accuracy: +1-2% (from the adapters)
Training:
Starting from epoch_5.pth: needs 3-5 epochs to adapt
Total time: ~2 days
```
### Option 4 (Per-Camera Attention)
```
Configuration flexibility:
✅ Supports any number of cameras
✅ Information exchange between cameras
✅ Dynamic camera weighting
Performance impact:
Params: +8M (110M → 118M)
Speed: +15% (2.66s → 3.06s/iter)
Accuracy: +2-4% (from the attention)
Training:
Starting from epoch_5.pth: needs 5-8 epochs
Total time: ~3-4 days
```
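The speed figures in both summaries follow directly from the overhead percentages applied to the current 2.66 s/iter baseline:

```python
# Arithmetic behind the quoted figures (baseline 2.66 s/iter).
base = 2.66
assert round(base * 1.05, 2) == 2.79  # Camera Adapter: +5%
assert round(base * 1.15, 2) == 3.06  # Per-Camera Attention: +15%
assert 110 + 4 == 114 and 110 + 8 == 118  # parameters, in millions
```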
---
## 🎯 Implementation Roadmap
### If You Need This Now
**Recommended path: the Camera Adapter option**
```
Week 1: implementation (while the current training run finishes)
  Day 1: design the CameraAwareLSS interface
  Day 2-3: implement the camera adapter module
  Day 4: write the config files
  Day 5: unit tests
Week 2: training (after epoch 20 completes)
  Day 1: start fine-tuning from epoch_20.pth
  Day 2-3: train the Camera Adapter (5 epochs)
  Day 4: evaluate different camera configurations
  Day 5: performance comparison and documentation
Week 3: optimization (optional)
  Day 1-3: upgrade to the attention option
  Day 4-5: further tuning
```
---
## 📝 Code Templates
I can create the following immediately:
1. **CameraAwareLSS implementation** (`mmdet3d/models/vtransforms/camera_aware_lss.py`)
2. **Config file template** (`configs/.../multitask_camera_aware.yaml`)
3. **Test script** (`tools/test_camera_configs.py`)
4. **Documentation** (`docs/CAMERA_FLEXIBILITY_GUIDE.md`)
---
## ✅ Summary
### For This Project
**Immediately doable**:
1. ✅ Simple change to support 4-8 cameras: only the data loading needs to change
2. ✅ Camera Adapter: ~2 weeks to implement, 1-2% performance gain
3. ✅ Compatible with Task-GCA: the two can be stacked
**Advanced options**:
1. 🎯 Per-Camera Attention: if peak performance is needed
2. 🎯 Sparse MoE: if there are many camera types (>8)
**My recommendation**:
- Finish the current Task-GCA training first (epoch 20)
- After evaluation, if camera processing needs a further boost:
- Implement the Camera Adapter option (highest ROI)
- If it works well, consider upgrading to attention
**Shall I start implementing the Camera Adapter code now?**