
BEVFusion Camera Configuration Flexibility: Analysis and Options

Analysis date: 2025-11-06
Current configuration: 6 cameras + LiDAR (nuScenes standard)
Goal: support flexible camera configurations (1-N cameras)


📊 Current Architecture Analysis

Existing Camera Processing Pipeline

Data flow:
  img: (B, N, C, H, W)  # N = 6 cameras
    ↓
  Backbone: (B*N, C', H', W')  # all 6 cameras share weights
    ↓
  Neck: (B*N, 256, H'', W'')
    ↓
  VTransform: (B, N, 80, D, H, W)  # lift to BEV space
    ↓
  BEV Pooling: (B, 80*D, H, W)  # aggregates the N cameras
    ↓
  Fuser + Decoder: (B, 512, H, W)
    ↓
  Task Heads

Key Findings

  1. The camera count N is dynamic: in the code, N can be any value
  2. Shared weights: all cameras share the same backbone/neck
  3. BEV pooling aggregates automatically: no matter how many cameras, everything is pooled into the same BEV space
  4. ⚠️ Fixed configuration: nuScenes hard-codes six specific cameras
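The first three findings can be demonstrated with a minimal PyTorch sketch; a toy convolution stands in for the real SwinTransformer backbone, and the function body mirrors the view/reshape pattern the model uses:

```python
import torch
import torch.nn as nn

# Toy stand-in for the shared camera backbone (the real model uses
# SwinTransformer); only the reshape logic matters here.
backbone = nn.Conv2d(3, 16, kernel_size=3, padding=1)

def extract_camera_features(img: torch.Tensor) -> torch.Tensor:
    """img: (B, N, C, H, W) with an arbitrary camera count N."""
    B, N, C, H, W = img.shape
    x = img.view(B * N, C, H, W)       # fold cameras into the batch dim
    x = backbone(x)                    # weights shared across all N cameras
    return x.view(B, N, *x.shape[1:])  # back to (B, N, C', H, W)

# N is nowhere hard-coded: 1, 4, 6, or 8 cameras all work unchanged
for n in (1, 4, 6, 8):
    feats = extract_camera_features(torch.randn(2, n, 3, 32, 32))
    assert feats.shape == (2, n, 16, 32, 32)
```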

🎯 Options for Supporting Flexible Camera Configurations

Option 1: Simple Dynamic Configuration (simplest)

Suitable for: 3-8 cameras of similar type

Implementation:

  • Only the data loading changes; the model is untouched
  • All cameras share weights
  • BEV pooling adapts automatically
# configs/custom/flexible_cameras.yaml

# configure the number of cameras
num_cameras: 4  # can be 1-N

camera_names: 
  - CAM_FRONT
  - CAM_FRONT_LEFT  
  - CAM_FRONT_RIGHT
  - CAM_BACK
  # add or remove entries freely

# model config - no changes needed!
model:
  type: BEVFusion
  encoders:
    camera:
      backbone:
        type: SwinTransformer  # weights shared
      neck:
        type: GeneralizedLSSFPN
      vtransform:
        type: DepthLSSTransform
        # BEV pooling handles N cameras automatically

Pros:

  • Simple to implement; no model code changes
  • Cameras can be added or removed freely
  • Training is stable

Cons:

  • ⚠️ All cameras must share weights
  • ⚠️ No per-camera specialization

Option 2: Camera-specific Adapter (recommended)

Suitable for: cameras of different types (e.g. wide-angle + telephoto)

Core idea:

  • Shared backbone
  • An independent adapter per camera
  • Suited to heterogeneous camera rigs
# mmdet3d/models/vtransforms/camera_aware_lss.py

import torch
import torch.nn as nn

class CameraAwareLSS(BaseTransform):
    """
    Camera感知的LSS Transform
    每个camera有独立的adapter处理特征
    """
    
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        num_cameras: int = 6,  # dynamic camera count
        camera_types: list = None,  # e.g. ['wide', 'tele', 'wide', ...]
        **kwargs
    ):
        super().__init__(in_channels, out_channels, **kwargs)
        
        # create a lightweight adapter per camera
        self.camera_adapters = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, 1, 1),
                nn.BatchNorm2d(in_channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, in_channels, 1),
            ) for _ in range(num_cameras)
        ])
        
        # camera-type embedding (optional)
        self.camera_types = camera_types  # kept for lookups in get_cam_feats
        self.use_camera_embedding = camera_types is not None
        if self.use_camera_embedding:
            unique_types = list(set(camera_types))
            self.camera_type_embed = nn.Embedding(
                len(unique_types), 
                in_channels
            )
            self.type_to_id = {t: i for i, t in enumerate(unique_types)}
        
        # parameter count: num_cameras × (in_channels² × 9 + in_channels²)
        # e.g. 6 × (256² × 10) ≈ 4M parameters
    
    def get_cam_feats(self, x, mats_dict):
        """
        Args:
            x: (B, N, C, fH, fW) - N个cameras的特征
            mats_dict: camera矩阵
        """
        B, N, C, fH, fW = x.shape
        
        # apply each camera's adapter
        adapted_features = []
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, fH, fW)
            
            # Camera-specific adapter
            cam_feat = self.camera_adapters[i](cam_feat)
            
            # (optional) add the camera-type embedding
            if self.use_camera_embedding:
                cam_type_id = self.type_to_id[self.camera_types[i]]
                type_embed = self.camera_type_embed(
                    torch.tensor(cam_type_id).to(cam_feat.device)
                )
                cam_feat = cam_feat + type_embed.view(1, -1, 1, 1)
            
            adapted_features.append(cam_feat)
        
        # recombine
        x = torch.stack(adapted_features, dim=1)  # (B, N, C, fH, fW)
        
        # continue with the standard LSS processing
        return super().get_cam_feats(x, mats_dict)

Configuration example:

model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS
        num_cameras: 4
        camera_types: ['wide', 'tele', 'wide', 'wide']
        in_channels: 256
        out_channels: 80

Pros:

  • Each camera gets its own processing capacity
  • Small parameter increase (~4M)
  • Can adapt to different camera types
  • Backward compatible

Cons:

  • ⚠️ Requires modifying the vtransform code
  • ⚠️ Slightly longer training time (~5%)
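The ~4M estimate from the in-code comment can be double-checked with a short sketch (assuming 256 input channels and 6 cameras, matching the defaults above):

```python
import torch.nn as nn

def adapter_params(in_channels: int, num_cameras: int) -> int:
    """Total parameters of the per-camera adapters (conv3x3 + BN + conv1x1)."""
    adapter = nn.Sequential(
        nn.Conv2d(in_channels, in_channels, 3, 1, 1),
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, in_channels, 1),
    )
    per_camera = sum(p.numel() for p in adapter.parameters())
    return per_camera * num_cameras

total = adapter_params(256, 6)
print(f"{total / 1e6:.2f}M")  # ≈ 3.94M, consistent with the ~4M estimate
```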

Option 3: Mixture of Experts (MoE)

Suitable for: multiple camera types needing optimal performance

Core idea:

  • Create expert networks for different camera types
  • A router selects experts dynamically
  • Routing can be driven by camera attributes
# mmdet3d/models/modules/camera_moe.py

import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraMoE(nn.Module):
    """
    Camera Mixture of Experts
    
    根据camera类型动态选择expert
    """
    
    def __init__(
        self,
        in_channels: int,
        num_experts: int = 3,  # 3 expert types: wide, tele, fisheye
        expert_capacity: int = 256,
        **kwargs
    ):
        super().__init__()
        
        # create the experts
        self.experts = nn.ModuleList([
            CameraExpert(in_channels, expert_capacity)
            for _ in range(num_experts)
        ])
        
        # Router: decides which expert to use
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # Global pooling
            nn.Flatten(),
            nn.Linear(in_channels, num_experts),
            nn.Softmax(dim=-1)
        )
        
        # camera attribute encoder (FOV, focal length, etc.)
        self.camera_attr_encoder = nn.Sequential(
            nn.Linear(4, 64),  # [FOV, focal_x, focal_y, type_id]
            nn.ReLU(),
            nn.Linear(64, in_channels),
        )
    
    def forward(self, x, camera_attrs):
        """
        Args:
            x: (B, N, C, H, W) - N cameras
            camera_attrs: (B, N, 4) - [FOV, focal_x, focal_y, type_id]
        
        Returns:
            (B, N, C, H, W) - processed features
        """
        B, N, C, H, W = x.shape
        
        outputs = []
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, H, W)
            cam_attr = camera_attrs[:, i]  # (B, 4)
            
            # 1. encode the camera attributes
            attr_embed = self.camera_attr_encoder(cam_attr)  # (B, C)
            attr_embed = attr_embed.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
            
            # 2. router selects the expert weights
            router_input = cam_feat + attr_embed
            expert_weights = self.router(router_input)  # (B, num_experts)
            
            # 3. combine the experts
            expert_outputs = []
            for expert in self.experts:
                expert_out = expert(cam_feat)
                expert_outputs.append(expert_out)
            
            # weighted combination: (B, num_experts, C, H, W)
            expert_outputs = torch.stack(expert_outputs, dim=1)
            
            # apply the router weights
            expert_weights = expert_weights.view(B, -1, 1, 1, 1)
            cam_output = (expert_outputs * expert_weights).sum(dim=1)
            
            outputs.append(cam_output)
        
        return torch.stack(outputs, dim=1)  # (B, N, C, H, W)


class CameraExpert(nn.Module):
    """单个Expert网络"""
    
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, 1, 1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, in_channels, 3, 1, 1),
            nn.BatchNorm2d(in_channels),
        )
        self.shortcut = nn.Identity()
    
    def forward(self, x):
        return F.relu(self.conv(x) + self.shortcut(x))
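The weighted-combination step in `CameraMoE.forward` can be sanity-checked in isolation; here random tensors stand in for the expert outputs and router weights:

```python
import torch

B, E, C, H, W = 2, 3, 8, 4, 4          # batch, experts, channels, spatial
expert_outputs = torch.randn(B, E, C, H, W)
expert_weights = torch.softmax(torch.randn(B, E), dim=-1)  # router output

# broadcast the weights over (C, H, W) and sum over the expert dimension
combined = (expert_outputs * expert_weights.view(B, E, 1, 1, 1)).sum(dim=1)
assert combined.shape == (B, C, H, W)

# with one-hot weights the combination degenerates to picking one expert
one_hot = torch.eye(E)[torch.tensor([0, 2])]  # sample 0 → expert 0, sample 1 → expert 2
selected = (expert_outputs * one_hot.view(B, E, 1, 1, 1)).sum(dim=1)
assert torch.allclose(selected[0], expert_outputs[0, 0])
assert torch.allclose(selected[1], expert_outputs[1, 2])
```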

Configuration example:

model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        # shared backbone
      
      neck:
        type: GeneralizedLSSFPN
        # add the MoE
        use_camera_moe: true
        moe_config:
          num_experts: 3  # wide, tele, fisheye
          expert_capacity: 256
      
      vtransform:
        type: DepthLSSTransform

# data configuration
camera_attributes:
  CAM_FRONT:
    type: 'wide'
    fov: 120.0
    focal_length: [1266.0, 1266.0]
  
  CAM_FRONT_TELE:
    type: 'tele'
    fov: 30.0
    focal_length: [2532.0, 2532.0]
  
  CAM_FRONT_LEFT:
    type: 'wide'
    fov: 120.0
    focal_length: [1266.0, 1266.0]

Pros:

  • Strongest modeling capacity
  • Automatically learns the best processing per camera type
  • Supports highly heterogeneous cameras
  • Can incorporate camera attribute information

Cons:

  • ⚠️ High implementation complexity
  • ⚠️ Larger parameter increase (~10M)
  • ⚠️ Longer training time (~10-15%)
  • ⚠️ Needs enough data to train the router

Option 4: Per-Camera Attention (most flexible)

Suitable for: any number and type of cameras

Core idea:

  • Let the model learn each camera's importance
  • Cross-camera attention
  • Dynamically fuses the different cameras
# mmdet3d/models/modules/camera_attention.py

import torch
import torch.nn as nn

class MultiCameraAttentionFusion(nn.Module):
    """
    多相机注意力融合
    
    特点:
      - 支持任意数量cameras (1-N)
      - 自动学习camera间的关系
      - 位置感知的camera融合
    """
    
    def __init__(
        self,
        in_channels: int,
        num_cameras: int = 6,  # variable
        use_camera_position: bool = True,  # use camera pose information
        use_cross_attention: bool = True,  # inter-camera interaction
    ):
        super().__init__()
        
        self.num_cameras = num_cameras
        self.use_camera_position = use_camera_position
        
        # camera pose encoding (position relative to the vehicle)
        if use_camera_position:
            self.position_encoder = nn.Sequential(
                nn.Linear(6, 128),  # [x, y, z, roll, pitch, yaw]
                nn.ReLU(),
                nn.Linear(128, in_channels),
            )
        
        # Self-attention for each camera
        self.self_attention = nn.MultiheadAttention(
            embed_dim=in_channels,
            num_heads=8,
            dropout=0.1,
        )
        
        # Cross-camera attention
        if use_cross_attention:
            self.cross_attention = nn.MultiheadAttention(
                embed_dim=in_channels,
                num_heads=8,
                dropout=0.1,
            )
        
        # Camera importance weighting
        self.camera_weight_net = nn.Sequential(
            nn.Linear(in_channels, in_channels // 4),
            nn.ReLU(),
            nn.Linear(in_channels // 4, 1),
            nn.Sigmoid(),
        )
    
    def forward(self, x, camera_positions=None):
        """
        Args:
            x: (B, N, C, H, W) - N个cameras
            camera_positions: (B, N, 6) - 每个camera的位置[x,y,z,r,p,y]
        
        Returns:
            (B, N, C, H, W) - attention-enhanced features
        """
        B, N, C, H, W = x.shape
        
        # Reshape for attention: (N, B*H*W, C)
        x_flat = x.permute(1, 0, 3, 4, 2).reshape(N, B*H*W, C)
        
        # 1. add the camera pose encoding
        if self.use_camera_position and camera_positions is not None:
            pos_embed = self.position_encoder(camera_positions)  # (B, N, C)
            pos_embed = pos_embed.permute(1, 0, 2).unsqueeze(2)  # (N, B, 1, C)
            pos_embed = pos_embed.expand(-1, -1, H*W, -1).reshape(N, B*H*W, C)
            x_flat = x_flat + pos_embed
        
        # 2. Self-attention (within each camera)
        x_self, _ = self.self_attention(x_flat, x_flat, x_flat)
        
        # 3. Cross-camera attention (between cameras)
        if hasattr(self, 'cross_attention'):
            # Query from each camera, Key/Value from all cameras
            x_cross = []
            for i in range(N):
                query = x_self[i:i+1]  # (1, B*H*W, C)
                key_value = x_self     # (N, B*H*W, C)
                attended, attn_weights = self.cross_attention(
                    query, key_value, key_value
                )
                x_cross.append(attended)
            
            x_attended = torch.cat(x_cross, dim=0)  # (N, B*H*W, C)
        else:
            x_attended = x_self
        
        # 4. compute a per-camera importance weight
        # pool per camera and per sample so that batch items are not mixed
        x_pooled = x_attended.reshape(N, B, H * W, C).mean(dim=2)  # (N, B, C)
        camera_weights = self.camera_weight_net(x_pooled)          # (N, B, 1)
        
        # apply the weights (broadcast over channels)
        camera_weights = camera_weights.expand(-1, -1, H * W)
        x_weighted = x_attended * camera_weights.reshape(N, B * H * W, 1)
        
        # Reshape back: (N, B*H*W, C) -> (B, N, C, H, W)
        output = x_weighted.reshape(N, B, H, W, C).permute(1, 0, 4, 2, 3)
        
        return output

Integration into BEVFusion:

# mmdet3d/models/fusion_models/bevfusion.py (modified)

def extract_camera_features(self, img, ...):
    B, N, C, H, W = img.shape
    
    # backbone (shared)
    x = img.view(B * N, C, H, W)
    x = self.encoders["camera"]["backbone"](x)
    x = self.encoders["camera"]["neck"](x)
    
    # Reshape: (B*N, C', H', W') -> (B, N, C', H', W')
    _, C_out, H_out, W_out = x.shape
    x = x.view(B, N, C_out, H_out, W_out)
    
    # ✨ Multi-camera attention fusion
    if hasattr(self, 'camera_attention'):
        x = self.camera_attention(x, camera_positions)
    
    # VTransform
    x = self.encoders["camera"]["vtransform"](x, ...)
    
    return x

Pros:

  • Most flexible: supports any number N of cameras
  • Learns camera importance automatically
  • Inter-camera information exchange
  • Position-aware fusion
  • Can handle missing cameras (dynamic N)

Cons:

  • ⚠️ Complex to implement
  • ⚠️ More parameters (~8M)
  • ⚠️ Extra compute (+15-20ms)
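The layout juggling inside `MultiCameraAttentionFusion.forward` (folding (B, N, C, H, W) into the (sequence, batch, embed) layout that `nn.MultiheadAttention` expects, and back) is easy to get wrong; this standalone sketch verifies the round-trip is lossless:

```python
import torch

B, N, C, H, W = 2, 5, 16, 4, 6
x = torch.randn(B, N, C, H, W)

# forward: cameras become the sequence dim, pixels join the batch dim
x_flat = x.permute(1, 0, 3, 4, 2).reshape(N, B * H * W, C)
assert x_flat.shape == (N, B * H * W, C)

# backward: invert the reshape, then the permutation
x_back = x_flat.reshape(N, B, H, W, C).permute(1, 0, 4, 2, 3)
assert torch.equal(x, x_back)  # exact round-trip, no values moved or lost
```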

Option 5: Sparse MoE (efficient variant)

Suitable for: many cameras (>6) where efficiency matters

Core idea:

  • Only the top-K experts are active per input
  • Lower compute cost
  • Suited to heterogeneous camera systems
# mmdet3d/models/modules/sparse_camera_moe.py

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseCameraMoE(nn.Module):
    """
    稀疏Camera MoE
    
    对于每个sample只使用top-K个experts
    大幅降低计算开销
    """
    
    def __init__(
        self,
        in_channels: int,
        num_experts: int = 8,  # supports 8 camera types
        num_active_experts: int = 3,  # only 3 active per sample
        expert_hidden_dim: int = 256,
    ):
        super().__init__()
        
        self.num_experts = num_experts
        self.num_active_experts = num_active_experts
        
        # create the experts
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, expert_hidden_dim, 3, 1, 1),
                nn.BatchNorm2d(expert_hidden_dim),
                nn.ReLU(),
                nn.Conv2d(expert_hidden_dim, in_channels, 3, 1, 1),
                nn.BatchNorm2d(in_channels),
            ) for _ in range(num_experts)
        ])
        
        # gating network: selects which experts to use
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, num_experts),
        )
        
        # Load balancing auxiliary loss
        self.load_balancing_loss_weight = 0.01
    
    def forward(self, x, camera_types=None):
        """
        Args:
            x: (B, N, C, H, W)
            camera_types: (B, N) - camera type IDs
        
        Returns:
            output: (B, N, C, H, W)
            aux_loss: load balancing loss
        """
        B, N, C, H, W = x.shape
        
        outputs = []
        gate_scores_all = []
        
        for i in range(N):
            cam_feat = x[:, i]  # (B, C, H, W)
            
            # gate scores: which experts to select
            gate_scores = self.gate(cam_feat)  # (B, num_experts)
            gate_scores_all.append(gate_scores)
            
            # Top-K selection
            top_k_scores, top_k_indices = torch.topk(
                gate_scores, 
                self.num_active_experts, 
                dim=-1
            )  # (B, K)
            
            # Normalize top-K scores
            top_k_scores = F.softmax(top_k_scores, dim=-1)
            
            # evaluate only the top-K experts
            expert_outs = []
            for b in range(B):
                sample_out = torch.zeros_like(cam_feat[b:b+1])
                
                for k in range(self.num_active_experts):
                    expert_idx = top_k_indices[b, k]
                    expert_weight = top_k_scores[b, k]
                    
                    expert_result = self.experts[expert_idx](cam_feat[b:b+1])
                    sample_out += expert_result * expert_weight
                
                expert_outs.append(sample_out)
            
            cam_output = torch.cat(expert_outs, dim=0)
            outputs.append(cam_output)
        
        output = torch.stack(outputs, dim=1)  # (B, N, C, H, W)
        
        # load balancing loss (encourages uniform expert usage)
        gate_scores_all = torch.stack(gate_scores_all, dim=1)  # (B, N, num_experts)
        gate_mean = gate_scores_all.mean(dim=[0, 1])  # (num_experts,)
        load_balance_loss = (gate_mean.std() ** 2) * self.load_balancing_loss_weight
        
        return output, load_balance_loss
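The top-K gating step above can be checked on its own (random logits stand in for the learned gate):

```python
import torch
import torch.nn.functional as F

B, num_experts, K = 4, 8, 3
gate_scores = torch.randn(B, num_experts)

# pick the K highest-scoring experts per sample
top_k_scores, top_k_indices = torch.topk(gate_scores, K, dim=-1)  # (B, K)

# renormalize so the selected experts' weights sum to 1
top_k_weights = F.softmax(top_k_scores, dim=-1)

assert top_k_indices.shape == (B, K)
assert torch.allclose(top_k_weights.sum(dim=-1), torch.ones(B))
# topk returns exactly the scores stored at the returned indices
assert torch.equal(top_k_scores, gate_scores.gather(-1, top_k_indices))
```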

Configuration example:

model:
  encoders:
    camera:
      neck:
        type: GeneralizedLSSFPN
        use_sparse_moe: true
        moe_config:
          num_experts: 8
          num_active_experts: 3  # only 3 active at a time
          expert_hidden_dim: 256

# supports multiple camera types
camera_configs:
  - type: 'wide'        # expert 0
  - type: 'tele'        # expert 1
  - type: 'fisheye'     # expert 2
  - type: 'ultra_wide'  # expert 3
  # ... define as many as needed

Pros:

  • Supports many camera types (8+)
  • Compute-efficient (only top-K experts run)
  • Learns expert selection automatically
  • Load balancing keeps training stable

Cons:

  • ⚠️ Most complex to implement
  • ⚠️ Router training needs debugging
  • ⚠️ Requires sufficiently diverse camera data

🔍 Option Comparison

| Option | Extra params | Compute overhead | Recommended scenario |
|---|---|---|---|
| Simple dynamic | +0 | +0% | 3-6 similar cameras |
| Camera Adapter | +4M | +5% | different camera types (4-6 cameras) |
| MoE | +10M | +20% | multiple camera types |
| Per-Camera Attn | +8M | +15% | any configuration, best performance |
| Sparse MoE | +12M | +10% | many cameras (>6) |

💡 Recommendations for Your BEVFusion Project

Current State

Your current configuration:

  • 6 cameras (nuScenes standard)
  • 1 LiDAR
  • Task-specific GCA already implemented

Recommended Approach

Short term (after the current training run finishes):

Recommendation: Option 2 - Camera Adapter

Rationale:

  1. Simple to implement, low risk
  2. Small parameter overhead (~4M)
  3. Compatible with the Task-GCA architecture
  4. Can start by supporting 4-8 cameras
  5. Implementable in 2-3 days

Implementation steps:

# Step 1: create the CameraAwareLSS class
# Step 2: enable it in the config file
# Step 3: fine-tune from the existing checkpoint
# Step 4: test different camera counts (4, 5, 6, 8)

Medium term (if stronger performance is needed):

Recommendation: Option 4 - Per-Camera Attention

Rationale:

  1. Best performance
  2. Highest flexibility
  3. Combines with Task-GCA
  4. High academic value

Combined architecture:

Camera Input (N cameras)
  ↓
Shared Backbone (Swin Transformer)
  ↓
Multi-Camera Attention Fusion  ← new
  ↓
VTransform (LSS)
  ↓
Fuser (Camera + LiDAR)
  ↓
Decoder
  ↓
Task-specific GCA  ← existing
  ├─ Detection GCA
  └─ Segmentation GCA
  ↓
Task Heads

🚀 Implementation Plan

Phase 1: Basic Flexibility (1 week)

# 1. modify data loading to support a dynamic camera count
# mmdet3d/datasets/pipelines/loading.py

from PIL import Image

@PIPELINES.register_module()
class LoadMultiViewImageFromFiles:
    def __init__(self, to_float32=False, camera_names=None):
        self.to_float32 = to_float32
        self.camera_names = camera_names or [
            'CAM_FRONT', 'CAM_FRONT_RIGHT', 'CAM_FRONT_LEFT',
            'CAM_BACK', 'CAM_BACK_LEFT', 'CAM_BACK_RIGHT'
        ]
        self.num_cameras = len(self.camera_names)
    
    def __call__(self, results):
        # load only the specified cameras
        images = []
        for cam_name in self.camera_names:
            if cam_name in results['cams']:
                img_path = results['cams'][cam_name]['data_path']
                images.append(Image.open(img_path))
        
        results['img'] = images
        results['num_cameras'] = len(images)
        return results

Phase 2: Camera Adapter (1 week)

# implement the camera-specific adapter
vim mmdet3d/models/vtransforms/camera_aware_lss.py

# config file
vim configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_camera_aware.yaml

# test
python tools/train.py configs/.../multitask_camera_aware.yaml \
  --load_from runs/run-326653dc-2334d461/epoch_5.pth \
  --data.samples_per_gpu 1

Phase 3: MoE/Attention (optional, 2 weeks)

# if the strongest performance is needed, implement attention or MoE
vim mmdet3d/models/modules/camera_attention.py
vim mmdet3d/models/modules/camera_moe.py

📊 Expected Results

Option 2 (Camera Adapter)

Configuration flexibility:
  ✅ supports 1-8 cameras
  ✅ independent processing per camera
  ✅ automatically adapts to different FOVs

Performance impact:
  params: +4M (110M → 114M)
  speed: +5% (2.66s → 2.79s/iter)
  accuracy: +1-2% (from the adapters)

Training:
  starting from epoch_5.pth: needs 3-5 epochs to adapt
  total time: ~2 days

Option 4 (Per-Camera Attention)

Configuration flexibility:
  ✅ supports any number of cameras
  ✅ inter-camera information exchange
  ✅ dynamic per-camera weights

Performance impact:
  params: +8M (110M → 118M)
  speed: +15% (2.66s → 3.06s/iter)
  accuracy: +2-4% (from attention)

Training:
  starting from epoch_5.pth: needs 5-8 epochs
  total time: ~3-4 days

🎯 Implementation Roadmap

If you need this right away

Recommended path: the Camera Adapter option

Week 1: implementation (while waiting for the current training to finish)
  Day 1: design the CameraAwareLSS interface
  Day 2-3: implement the camera adapter module
  Day 4: write the config file
  Day 5: unit tests

Week 2: training (after epoch 20 completes)
  Day 1: start fine-tuning from epoch_20.pth
  Day 2-3: train the Camera Adapter (5 epochs)
  Day 4: evaluate different camera configurations
  Day 5: performance comparison and documentation

Week 3: optimization (optional)
  Day 1-3: upgrade to the attention option
  Day 4-5: further tuning

📝 Code Templates

I can create the following right away:

  1. CameraAwareLSS implementation (mmdet3d/models/vtransforms/camera_aware_lss.py)
  2. Config file template (configs/.../multitask_camera_aware.yaml)
  3. Test script (tools/test_camera_configs.py)
  4. Documentation (docs/CAMERA_FLEXIBILITY_GUIDE.md)

Summary

For your project

Immediately actionable:

  1. A simple change supports 4-8 cameras: only the data loading needs modification
  2. Camera Adapter: ~2 weeks to implement, 1-2% performance gain
  3. Compatible with Task-GCA: the two can be stacked

Advanced options:

  1. 🎯 Per-Camera Attention: if optimal performance is needed
  2. 🎯 Sparse MoE: if there are many camera types (>8)

My recommendation:

  • Finish the current Task-GCA training first (epoch 20)
  • After evaluating, if camera processing needs a further boost,
    implement the Camera Adapter option (highest ROI)
  • If the results are good, consider upgrading to attention

Shall I start implementing the Camera Adapter code now?