# Camera Configuration Flexibility: Options at a Glance

**Goal**: support a variable number and mix of camera types

**Current**: 6 cameras (nuScenes)

**Requirement**: flexible configuration for 1-N cameras

---
## 🎯 Five Options at a Glance

### 1️⃣ Simple Dynamic Configuration ⭐

```
Implementation difficulty: ★☆☆☆☆
Extra parameters: 0
Speed impact: 0%
Best for: 3-6 similar cameras

Only the data loading changes; the model code stays untouched.
```
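With this option the only change is on the data side: slice the per-camera tensors down to the subset you want before they enter the model. A minimal sketch, assuming nuScenes-style camera names and illustrative shapes (`select_cameras` is a hypothetical helper, not part of the codebase):

```python
import torch

# nuScenes camera order (the 6-camera baseline).
ALL_CAMS = ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
            "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT"]

def select_cameras(imgs, intrinsics, cam2lidar, keep_names):
    """Slice a 6-camera batch down to the cameras in keep_names.

    imgs:       (B, 6, 3, H, W)
    intrinsics: (B, 6, 4, 4)
    cam2lidar:  (B, 6, 4, 4)
    """
    idx = torch.tensor([ALL_CAMS.index(n) for n in keep_names])
    return imgs[:, idx], intrinsics[:, idx], cam2lidar[:, idx]

# Example: keep 4 of the 6 cameras.
imgs = torch.randn(2, 6, 3, 256, 704)
K = torch.randn(2, 6, 4, 4)
c2l = torch.randn(2, 6, 4, 4)
imgs4, K4, c2l4 = select_cameras(
    imgs, K, c2l,
    ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT", "CAM_BACK"])
```

Because BEV pooling aggregates over whatever N it receives, the model can consume the smaller batch as-is.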
### 2️⃣ Camera Adapter ⭐⭐⭐ **Recommended**

```
Implementation difficulty: ★★☆☆☆
Extra parameters: +4M
Speed impact: +5%
Best for: heterogeneous camera types (wide-angle / telephoto)

Each camera gets its own lightweight adapter and can learn a camera-specific processing strategy.
```
### 3️⃣ Mixture of Experts (MoE) ⭐⭐⭐⭐

```
Implementation difficulty: ★★★☆☆
Extra parameters: +10M
Speed impact: +20%
Best for: many different camera types

A router automatically selects which expert processes each camera.
```
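As a sketch of the soft-routing idea (illustrative only, not the actual implementation): a small router looks at each camera's pooled descriptor and blends the outputs of a few expert convolutions. All names and sizes below are assumptions:

```python
import torch
import torch.nn as nn

class CameraMoE(nn.Module):
    """Soft mixture-of-experts over per-camera features (sketch).

    A router predicts per-camera expert weights from a globally pooled
    descriptor; the output is the weighted sum of the expert outputs.
    """
    def __init__(self, channels=64, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_experts))
        self.router = nn.Linear(channels, num_experts)

    def forward(self, x):                          # x: (B, N, C, H, W)
        B, N, C, H, W = x.shape
        flat = x.reshape(B * N, C, H, W)
        desc = flat.mean(dim=(2, 3))               # (B*N, C) descriptor
        w = self.router(desc).softmax(dim=-1)      # (B*N, E) expert weights
        out = sum(w[:, i, None, None, None] * expert(flat)
                  for i, expert in enumerate(self.experts))
        return out.reshape(B, N, C, H, W)

moe = CameraMoE()
y = moe(torch.randn(2, 5, 64, 16, 16))  # works for any camera count N
```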
### 4️⃣ Per-Camera Attention ⭐⭐⭐⭐⭐ **Strongest**

```
Implementation difficulty: ★★★☆☆
Extra parameters: +8M
Speed impact: +15%
Best for: arbitrary cameras, when peak performance matters

Cameras exchange information and are fused with dynamic weights.
```
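A minimal sketch of the cross-camera interaction, assuming pooled per-camera tokens and a learned per-camera gate (all module names and sizes here are illustrative, not the project's actual design):

```python
import torch
import torch.nn as nn

class CrossCameraAttention(nn.Module):
    """Each camera's pooled descriptor attends to the other cameras,
    and the mixed descriptor produces a per-camera fusion weight."""
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.gate = nn.Linear(channels, 1)

    def forward(self, x):                     # x: (B, N, C, H, W)
        B, N, C, H, W = x.shape
        tokens = x.mean(dim=(3, 4))           # (B, N, C) one token per camera
        mixed, _ = self.attn(tokens, tokens, tokens)
        w = torch.sigmoid(self.gate(mixed))   # (B, N, 1) dynamic weight
        return x * w[..., None, None]         # re-weight each camera's map

att = CrossCameraAttention()
out = att(torch.randn(2, 6, 64, 8, 8))
```

Because attention operates over a variable-length token sequence, the same module handles any number of cameras.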
### 5️⃣ Sparse MoE ⭐⭐⭐⭐

```
Implementation difficulty: ★★★★☆
Extra parameters: +12M
Speed impact: +10%
Best for: many cameras (>6)

Top-K activation keeps many-camera setups efficient.
```
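The efficiency comes from the gating: only the top-K experts get nonzero weight per camera token, so the rest can be skipped. A small illustrative sketch of such a gate:

```python
import torch
import torch.nn.functional as F

def topk_gate(logits, k=2):
    """Keep the top-k expert scores per token and renormalise them;
    every other expert gets exactly zero weight (and can be skipped)."""
    topv, topi = logits.topk(k, dim=-1)
    gates = torch.zeros_like(logits)
    gates.scatter_(-1, topi, F.softmax(topv, dim=-1))
    return gates

logits = torch.randn(8, 4)   # 8 camera tokens, 4 experts
g = topk_gate(logits, k=2)   # each row: 2 nonzero weights summing to 1
```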
---

## 📊 Decision Tree for Choosing an Option

```
How many cameras do you have?
│
├─ 3-6, similar types
│    └─→ Option 1: Simple Dynamic Configuration ✅
│
├─ 4-6, wide-angle + telephoto mix
│    └─→ Option 2: Camera Adapter ✅ Recommended
│
├─ 6-8, several types
│    └─→ Option 3: MoE, or Option 4: Per-Camera Attention
│
└─ >8, heterogeneous rig
     └─→ Option 5: Sparse MoE
```
---

## 🚀 Recommendation for This Project

### Current Status

```
✅ Phase 4A Task-GCA training in progress
✅ Epoch 6/20 (32% complete)
✅ Expected to finish on 11/13
```

### Recommended Option: Camera Adapter

**Timing**: after Task-GCA training finishes (after 11/13)

**Rationale**:

1. ✅ Compatible with the Task-GCA architecture
2. ✅ Fast to implement (1 week)
3. ✅ Low risk
4. ✅ High ROI (+1-2% performance)
**Combined architecture**:

```
Input (N cameras)
      ↓
Swin Backbone (shared)
      ↓
Camera Adapters (N independent) ← new
      ↓
LSS Transform
      ↓
BEV Pooling
      ↓
Fuser
      ↓
Decoder
      ↓
Task-specific GCA (existing)
 ├─ Detection GCA
 └─ Segmentation GCA
      ↓
Heads
```
---

## 📋 Implementation Plan

### Phase A: Camera Adapter implementation (1 week)

```bash
# Day 1-2: implementation
# Create: mmdet3d/models/vtransforms/camera_aware_lss.py
# Modify: mmdet3d/models/fusion_models/bevfusion.py
# Test:   tests/test_camera_adapter.py

# Day 3-4: configuration and integration
# Create: configs/.../multitask_camera_adapter.yaml
# Verify: existing checkpoint loads
# Test:   forward pass runs cleanly

# Day 5: documentation
# Write: CAMERA_ADAPTER_GUIDE.md
```
### Phase B: training and validation (1 week)

```bash
# Day 1: fine-tune
torchpack dist-run -np 8 python tools/train.py \
    configs/.../multitask_camera_adapter.yaml \
    --load_from /data/runs/phase4a_stage1_task_gca/epoch_20.pth \
    --max_epochs 5

# Day 2-4: test different configurations
# 4 cameras: front, front_left, front_right, back
# 5 cameras: + back_left
# 6 cameras: + back_right (original setup)
# 8 cameras: + left, right (hypothetical)

# Day 5: evaluate and compare
```
---

## 💻 Core Code Example

### Minimal implementation (Camera Adapter)

```python
# mmdet3d/models/vtransforms/camera_aware_lss.py

import torch
import torch.nn as nn

from .lss import LSSTransform


class CameraAwareLSS(LSSTransform):
    """Camera-aware LSS with per-camera adapters."""

    def __init__(self, num_cameras=6, **kwargs):
        super().__init__(**kwargs)

        # One lightweight adapter per camera.
        self.camera_adapters = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(self.C, self.C, 3, 1, 1, groups=self.C // 8),
                nn.BatchNorm2d(self.C),
                nn.ReLU(),
                nn.Conv2d(self.C, self.C, 1),
            ) for _ in range(num_cameras)
        ])

    def get_cam_feats(self, x, mats_dict):
        """
        x: (B, N, C, fH, fW)
        """
        B, N, C, fH, fW = x.shape

        # Apply the camera-specific adapters.
        adapted = []
        for i in range(N):
            feat = x[:, i]  # (B, C, fH, fW)
            adapted_feat = self.camera_adapters[i](feat)
            adapted.append(adapted_feat)

        x = torch.stack(adapted, dim=1)  # (B, N, C, fH, fW)

        # Continue with the original LSS processing.
        return super().get_cam_feats(x, mats_dict)
```
**Using it from a config**:

```yaml
model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS  # replaces DepthLSSTransform
        num_cameras: 6        # change to 4, 5, 8, ...
        in_channels: 256
        out_channels: 80
        # ... remaining parameters identical to LSSTransform
```
---

## 🎓 Technical Details

### How does BEV pooling handle N cameras?

```python
# Key code: mmdet3d/models/vtransforms/base.py

def bev_pool(self, geom_feats, x):
    """
    Args:
        x:          (B, N, D, H, W, C)  # N can be any value!
        geom_feats: (B, N, D, H, W, 3)  # geometry

    Process:
        1. Flatten: (B*N*D*H*W, C)
        2. Project onto the BEV grid using the geometry
        3. Features landing in the same BEV cell are accumulated
        4. Return: (B, C*D, BEV_H, BEV_W)

    Key points:
        - Features from all N cameras are aggregated automatically
        - Nothing downstream needs to know the exact value of N
        - Overlapping regions accumulate (implicit fusion)
    """
    B, N, D, H, W, C = x.shape

    # Flatten across all cameras
    x = x.reshape(B * N * D * H * W, C)

    # Geometric projection and pooling
    x = bev_pool_kernel(x, geom_feats, ...)  # CUDA kernel

    return x  # (B, C*D, BEV_H, BEV_W)
```
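The accumulation step can be reproduced in a few lines of plain PyTorch. This is an illustrative reference version only (the real code uses a fused CUDA kernel and handles the depth axis); the point is that the camera count N never appears:

```python
import torch

def bev_pool_reference(feats, bev_idx, bev_h, bev_w):
    """Reference scatter-add BEV pooling (sketch, no CUDA kernel).

    feats:   (M, C) flattened features from ALL cameras / depth bins
    bev_idx: (M,)   flat BEV cell index per feature (h * bev_w + w)
    """
    C = feats.shape[1]
    bev = torch.zeros(bev_h * bev_w, C)
    bev.index_add_(0, bev_idx, feats)  # overlapping cells accumulate
    return bev.reshape(bev_h, bev_w, C)

# N never appears: 4 or 8 cameras merely change M, nothing else.
feats = torch.randn(1000, 16)
idx = torch.randint(0, 32 * 32, (1000,))
bev = bev_pool_reference(feats, idx, 32, 32)
```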
**Takeaways**:

✅ BEV pooling natively supports a dynamic N
✅ Only the geometry has to be correct
✅ The model code needs almost no changes
---

## ⚙️ Configuration Examples

### Example 1: 4-camera configuration

```yaml
# configs/custom/4cameras.yaml

num_cameras: 4

camera_names:
  - CAM_FRONT
  - CAM_FRONT_LEFT
  - CAM_FRONT_RIGHT
  - CAM_BACK

model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS
        num_cameras: 4  # ← change this
        # all other parameters unchanged
```
### Example 2: mixed cameras (wide-angle + telephoto)

```yaml
num_cameras: 5

camera_configs:
  CAM_FRONT_WIDE:
    type: wide
    fov: 120
    focal: 1266.0
    adapter_id: 0

  CAM_FRONT_TELE:
    type: tele
    fov: 30
    focal: 2532.0
    adapter_id: 1

  CAM_LEFT:
    type: wide
    fov: 120
    focal: 1266.0
    adapter_id: 0  # shares the wide adapter

  CAM_RIGHT:
    type: wide
    adapter_id: 0

  CAM_BACK:
    type: fisheye
    fov: 190
    adapter_id: 2

model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS
        num_cameras: 5
        camera_types: ['wide', 'tele', 'wide', 'wide', 'fisheye']
        # the adapter is selected automatically based on type
```
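A sketch of how type-based adapter sharing could look in the module itself: cameras with the same type reuse one adapter, mirroring the `adapter_id` fields above. `TypedAdapters` is a hypothetical name and the 1×1 convolutions are placeholders:

```python
import torch
import torch.nn as nn

class TypedAdapters(nn.Module):
    """One adapter per camera *type*, shared by all cameras of that type."""
    def __init__(self, camera_types, channels=64):
        super().__init__()
        unique = sorted(set(camera_types))                 # e.g. 3 types
        type_to_id = {t: i for i, t in enumerate(unique)}
        self.ids = [type_to_id[t] for t in camera_types]   # camera → adapter
        self.adapters = nn.ModuleList(
            nn.Conv2d(channels, channels, 1) for _ in unique)

    def forward(self, x):                                  # x: (B, N, C, H, W)
        out = [self.adapters[self.ids[i]](x[:, i])
               for i in range(x.shape[1])]
        return torch.stack(out, dim=1)

ta = TypedAdapters(['wide', 'tele', 'wide', 'wide', 'fisheye'])
y = ta(torch.randn(2, 5, 64, 8, 8))
```

Sharing by type keeps the parameter count tied to the number of camera types rather than the number of cameras.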
---

## 🔧 Quick Start

### Test the current model's camera flexibility

```python
# test_camera_flexibility.py

import torch
from mmcv import Config
from mmdet3d.models import build_model

# Load the current model
cfg = Config.fromfile('configs/.../multitask_BEV2X_phase4a_stage1_task_gca.yaml')
model = build_model(cfg.model)
model.load_state_dict(torch.load('epoch_5.pth')['state_dict'])

# Try different camera counts
for num_cams in [3, 4, 5, 6, 8]:
    print(f"\nTesting {num_cams} cameras:")

    # Simulated inputs
    img = torch.randn(1, num_cams, 3, 900, 1600)  # dynamic N
    camera_intrinsics = torch.randn(1, num_cams, 4, 4)
    camera2lidar = torch.randn(1, num_cams, 4, 4)
    # ... remaining inputs

    try:
        # Forward pass
        output = model.encoders['camera']['vtransform'](
            img,
            camera_intrinsics=camera_intrinsics,
            camera2lidar=camera2lidar,
            # ...
        )
        print(f"  ✅ OK! Output shape: {output.shape}")
    except Exception as e:
        print(f"  ❌ Failed: {e}")
```
**Expected result**:

- ✅ 3-8 cameras should work at the code level
- ⚠️ Retraining is still required (the weights were trained for 6 cameras)
---

## 📊 Performance Estimates

### Camera Adapter option

| Cameras | Params | Training time | Expected mIoU | vs. 6-cam |
|---------|--------|---------------|---------------|-----------|
| 4       | 114M   | -15%          | 58-60%        | -1 to -3% |
| 5       | 114M   | -8%           | 60-61%        | 0 to -1%  |
| 6       | 114M   | baseline      | 61%           | 0%        |
| 8       | 114M   | +8%           | 62-63%        | +1 to +2% |

**Analysis**:

- More cameras → more viewpoints → better performance
- But with diminishing returns (6→8 only gains 1-2%)
- 4 cameras still reach 58-60% (acceptable)
---

## ✨ Summary

**Your question**: how to configure cameras flexibly?

**My recommendation**:

1. **Usable immediately** (no model changes):
   - Modify the data loading to support 4-8 cameras
   - Fine-tune from the existing checkpoint
   - 1 day to implement

2. **Recommended option** (best ROI):
   - Implement the Camera Adapter
   - 1 week of development + 1 week of training
   - +1-2% performance, far more flexibility

3. **Advanced option** (if you need the best possible results):
   - Per-Camera Attention
   - 2 weeks of development + 1 week of training
   - +2-4% performance, supports arbitrary configurations

4. **MoE is not the best choice here**:
   - High compute overhead (+20%)
   - Complex to train
   - Gains are less clear-cut than with Attention
   - Only worth it with very many camera types (>8)

**Next steps**:

1. Wait for the current training run to finish (11/13)
2. Decide whether camera flexibility is actually needed
3. If so, I can implement the Camera Adapter option right away

**Shall I start writing the code now?** 🚀