# Camera Configuration Flexibility: Options at a Glance

**Goal**: support different numbers and types of cameras

**Current**: 6 cameras (nuScenes)

**Requirement**: flexible configuration of 1-N cameras

---
## 🎯 Five Options at a Glance

### 1️⃣ Simple Dynamic Configuration ⭐

```
Implementation difficulty: ★☆☆☆☆
Extra parameters: 0
Speed impact: 0%
Best for: 3-6 similar cameras

Only the data loading changes; the model code stays untouched.
```
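The data-loading change can be sketched as follows. `SELECTED_CAMERAS` and `load_sample` are illustrative names, not the project's actual API: the point is that the loader stacks a chosen subset of cameras, and the rest of the pipeline simply sees a smaller N.

```python
import torch

# Hypothetical selection; names follow the nuScenes convention
SELECTED_CAMERAS = ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT", "CAM_BACK"]

def load_sample(sample_images):
    """Stack only the selected cameras; N tracks len(SELECTED_CAMERAS)."""
    return torch.stack([sample_images[name] for name in SELECTED_CAMERAS], dim=0)

# Toy sample with all 6 nuScenes cameras available; only 4 get loaded
all_cams = ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
            "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT"]
sample = {name: torch.randn(3, 64, 64) for name in all_cams}
batch = load_sample(sample)
assert batch.shape == (4, 3, 64, 64)  # (N, C, H, W) with N = 4
```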
### 2️⃣ Camera Adapter ⭐⭐⭐ **Recommended**

```
Implementation difficulty: ★★☆☆☆
Extra parameters: +4M
Speed impact: +5%
Best for: cameras of different types (wide-angle / telephoto)

Each camera gets its own adapter, so the model can learn a separate processing strategy per camera.
```
### 3️⃣ Mixture of Experts (MoE) ⭐⭐⭐⭐

```
Implementation difficulty: ★★★☆☆
Extra parameters: +10M
Speed impact: +20%
Best for: many camera types

A router automatically selects which expert processes each camera.
```
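A minimal dense-MoE sketch (module name and dimensions are illustrative, not the project's code): a router scores each camera's pooled features and mixes the expert outputs accordingly.

```python
import torch
import torch.nn as nn

class CameraMoE(nn.Module):
    """Illustrative dense MoE: the router weighs every expert per camera."""

    def __init__(self, channels=80, num_experts=3):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_experts)
        ])
        self.router = nn.Linear(channels, num_experts)

    def forward(self, x):  # x: (B, N, C, H, W)
        B, N, C, H, W = x.shape
        out = []
        for i in range(N):
            feat = x[:, i]                               # (B, C, H, W)
            # Route on this camera's global feature descriptor
            desc = feat.mean(dim=(2, 3))                 # (B, C)
            weights = self.router(desc).softmax(dim=-1)  # (B, E)
            expert_outs = torch.stack([e(feat) for e in self.experts], dim=1)
            mixed = (weights[:, :, None, None, None] * expert_outs).sum(dim=1)
            out.append(mixed)
        return torch.stack(out, dim=1)                   # (B, N, C, H, W)

moe = CameraMoE(channels=8, num_experts=3)
y = moe(torch.randn(2, 6, 8, 16, 16))
assert y.shape == (2, 6, 8, 16, 16)
```

Note this dense variant runs every expert on every camera, which is where the +20% speed cost comes from.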
### 4️⃣ Per-Camera Attention ⭐⭐⭐⭐⭐ **Strongest**

```
Implementation difficulty: ★★★☆☆
Extra parameters: +8M
Speed impact: +15%
Best for: any camera setup where peak performance matters

Cameras exchange information with each other, fused with dynamic weights.
```
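A minimal sketch of the idea (assumed module names and dimensions): each camera is summarized into a descriptor, the cameras attend to one another, and the fused context is injected back into the feature maps.

```python
import torch
import torch.nn as nn

class CrossCameraAttention(nn.Module):
    """Illustrative sketch: cameras attend to each other over pooled
    descriptors, producing dynamically weighted per-camera context."""

    def __init__(self, channels=80, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, N, C, H, W)
        desc = x.mean(dim=(3, 4))               # (B, N, C), one token per camera
        fused, _ = self.attn(desc, desc, desc)  # cameras exchange information
        # Broadcast the fused context back to every spatial location
        return x + fused[:, :, :, None, None]

m = CrossCameraAttention(channels=8, num_heads=2)
y = m(torch.randn(2, 6, 8, 16, 16))
assert y.shape == (2, 6, 8, 16, 16)
```

Because attention treats the cameras as a sequence, the same module works for any N without retraining the attention weights' shapes.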
### 5️⃣ Sparse MoE ⭐⭐⭐⭐

```
Implementation difficulty: ★★★★☆
Extra parameters: +12M
Speed impact: +10%
Best for: large camera counts (>6)

Top-K expert activation keeps processing many cameras efficient.
```
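A minimal top-K gating sketch (illustrative, with assumed dimensions): only the K best-scoring experts run for each camera, so compute scales with K rather than with the total expert count.

```python
import torch
import torch.nn as nn

class SparseCameraMoE(nn.Module):
    """Illustrative top-K MoE: only K experts are activated per camera."""

    def __init__(self, channels=80, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Conv2d(channels, channels, 1) for _ in range(num_experts)
        ])
        self.router = nn.Linear(channels, num_experts)
        self.top_k = top_k

    def forward(self, feat):  # feat: (B, C, H, W), one camera's features
        desc = feat.mean(dim=(2, 3))                  # (B, C)
        scores = self.router(desc)                    # (B, E)
        topv, topi = scores.topk(self.top_k, dim=-1)  # keep only K experts
        gates = topv.softmax(dim=-1)                  # renormalize over the K
        out = torch.zeros_like(feat)
        for b in range(feat.shape[0]):
            for k in range(self.top_k):
                e = topi[b, k].item()
                out[b] += gates[b, k] * self.experts[e](feat[b:b + 1])[0]
        return out

m = SparseCameraMoE(channels=8, num_experts=4, top_k=2)
y = m(torch.randn(2, 8, 16, 16))
assert y.shape == (2, 8, 16, 16)
```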
---

## 📊 Decision Tree for Choosing an Option

```
How many cameras do you have?
│
├─ 3-6, similar types
│   └─→ Option 1: Simple dynamic configuration ✅
│
├─ 4-6, wide-angle + telephoto mix
│   └─→ Option 2: Camera Adapter ✅ Recommended
│
├─ 6-8, multiple types
│   └─→ Option 3: MoE or Option 4: Per-Camera Attention
│
└─ >8, heterogeneous system
    └─→ Option 5: Sparse MoE
```
---

## 🚀 Recommendation for This Project

### Current Status
```
✅ Phase 4A Task-GCA training in progress
✅ Epoch 6/20 (32% complete)
✅ Expected to finish on 11/13
```

### Recommended Option: Camera Adapter

**Timing**: after Task-GCA training completes (after 11/13)

**Rationale**:
1. ✅ Compatible with the Task-GCA architecture
2. ✅ Fast to implement (~1 week)
3. ✅ Low risk
4. ✅ High ROI (expected +1-2% performance)

**Combined architecture**:
```
Input (N cameras)
    ↓
Swin Backbone (shared)
    ↓
Camera Adapters (N independent) ← new
    ↓
LSS Transform
    ↓
BEV Pooling
    ↓
Fuser
    ↓
Decoder
    ↓
Task-specific GCA (existing)
    ├─ Detection GCA
    └─ Segmentation GCA
    ↓
Heads
```
---

## 📋 Implementation Plan

### Phase A: Camera Adapter Implementation (1 week)

```bash
# Day 1-2: code
#   Create:  mmdet3d/models/vtransforms/camera_aware_lss.py
#   Modify:  mmdet3d/models/fusion_models/bevfusion.py
#   Test:    tests/test_camera_adapter.py

# Day 3-4: configuration and integration
#   Create:  configs/.../multitask_camera_adapter.yaml
#   Verify:  the existing checkpoint loads
#   Test:    forward pass runs cleanly

# Day 5: documentation
#   Write:   CAMERA_ADAPTER_GUIDE.md
```

### Phase B: Training and Validation (1 week)

```bash
# Day 1: fine-tune
torchpack dist-run -np 8 python tools/train.py \
    configs/.../multitask_camera_adapter.yaml \
    --load_from /data/runs/phase4a_stage1_task_gca/epoch_20.pth \
    --max_epochs 5

# Day 2-4: test different configurations
# 4 cameras: front, front_left, front_right, back
# 5 cameras: + back_left
# 6 cameras: + back_right (original configuration)
# 8 cameras: + left, right (hypothetical)

# Day 5: evaluation and comparison
```
---

## 💻 Core Code Example

### Minimal Implementation (Camera Adapter)

```python
# mmdet3d/models/vtransforms/camera_aware_lss.py

import torch
import torch.nn as nn

from .lss import LSSTransform


class CameraAwareLSS(LSSTransform):
    """Camera-aware LSS with per-camera adapters."""

    def __init__(self, num_cameras=6, **kwargs):
        super().__init__(**kwargs)

        # One lightweight adapter per camera
        self.camera_adapters = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(self.C, self.C, 3, 1, 1, groups=self.C // 8),
                nn.BatchNorm2d(self.C),
                nn.ReLU(),
                nn.Conv2d(self.C, self.C, 1),
            ) for _ in range(num_cameras)
        ])

    def get_cam_feats(self, x, mats_dict):
        """
        x: (B, N, C, fH, fW)
        """
        B, N, C, fH, fW = x.shape

        # Apply the camera-specific adapters
        adapted = []
        for i in range(N):
            feat = x[:, i]  # (B, C, fH, fW)
            adapted_feat = self.camera_adapters[i](feat)
            adapted.append(adapted_feat)

        x = torch.stack(adapted, dim=1)  # (B, N, C, fH, fW)

        # Continue with the original LSS processing
        return super().get_cam_feats(x, mats_dict)
```

**Configuration**:

```yaml
model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS  # replaces DepthLSSTransform
        num_cameras: 6        # change to 4, 5, 8, ...
        in_channels: 256
        out_channels: 80
        # ... remaining parameters identical to LSSTransform
```
---

## 🎓 Technical Details

### How does BEV pooling handle N cameras?

```python
# Key code (simplified): mmdet3d/models/vtransforms/base.py

def bev_pool(self, geom_feats, x):
    """
    Args:
        x: (B, N, D, H, W, C)           # N can be any value!
        geom_feats: (B, N, D, H, W, 3)  # geometry

    Process:
        1. Flatten: (B*N*D*H*W, C)
        2. Project into the BEV grid using the geometry
        3. Accumulate features that land in the same BEV cell
        4. Return: (B, C*D, BEV_H, BEV_W)

    Key points:
        - Features from all N cameras are aggregated automatically
        - The kernel never needs to know the exact value of N
        - Overlapping regions accumulate (implicit fusion)
    """
    B, N, D, H, W, C = x.shape

    # Flatten across all cameras
    x = x.reshape(B * N * D * H * W, C)

    # Geometric projection and pooling
    x = bev_pool_kernel(x, geom_feats, ...)  # CUDA kernel

    return x  # (B, C*D, BEV_H, BEV_W)
```

**Conclusion**:
✅ BEV pooling natively supports a dynamic N
✅ Only the geometry needs to be correct
✅ The model code needs almost no changes
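The N-agnostic accumulation can be demonstrated with a toy scatter-add in plain PyTorch (purely illustrative; the real implementation is a fused CUDA kernel):

```python
import torch

def toy_bev_pool(feats, cell_idx, num_cells):
    """feats: (M, C) flattened features from all cameras;
    cell_idx: (M,) BEV cell index each feature projects into."""
    out = torch.zeros(num_cells, feats.shape[1])
    out.index_add_(0, cell_idx, feats)  # same-cell features accumulate
    return out

# Two different camera counts, one pooling code path:
for N in (4, 6):
    feats = torch.randn(N * 10, 8)           # N cameras × 10 points, C = 8
    idx = torch.randint(0, 50, (N * 10,))    # projected into 50 BEV cells
    bev = toy_bev_pool(feats, idx, num_cells=50)
    assert bev.shape == (50, 8)              # output shape independent of N
```

The camera dimension is flattened away before accumulation, which is exactly why the output shape never depends on N.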
---

## ⚙️ Configuration Examples

### Example 1: 4-Camera Configuration

```yaml
# configs/custom/4cameras.yaml

num_cameras: 4

camera_names:
  - CAM_FRONT
  - CAM_FRONT_LEFT
  - CAM_FRONT_RIGHT
  - CAM_BACK

model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS
        num_cameras: 4  # ← change this
        # all other parameters unchanged
```
### Example 2: Mixed Cameras (Wide-Angle + Telephoto)

```yaml
num_cameras: 5

camera_configs:
  CAM_FRONT_WIDE:
    type: wide
    fov: 120
    focal: 1266.0
    adapter_id: 0

  CAM_FRONT_TELE:
    type: tele
    fov: 30
    focal: 2532.0
    adapter_id: 1

  CAM_LEFT:
    type: wide
    fov: 120
    focal: 1266.0
    adapter_id: 0  # shares the wide adapter

  CAM_RIGHT:
    type: wide
    adapter_id: 0

  CAM_BACK:
    type: fisheye
    fov: 190
    adapter_id: 2

model:
  encoders:
    camera:
      vtransform:
        type: CameraAwareLSS
        num_cameras: 5
        camera_types: ['wide', 'tele', 'wide', 'wide', 'fisheye']
        # the adapter is chosen automatically from the camera type
```
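One hypothetical way to realize this type-based adapter sharing (the module name and structure are assumptions, not existing project code): map each distinct camera type to a single shared adapter, so three types need only three adapters regardless of the camera count.

```python
import torch
import torch.nn as nn

class TypeSharedAdapters(nn.Module):
    """Hypothetical sketch: one adapter per camera *type*, shared across cameras."""

    def __init__(self, camera_types, channels=80):
        super().__init__()
        unique = sorted(set(camera_types))                   # e.g. fisheye/tele/wide
        type_to_idx = {t: i for i, t in enumerate(unique)}
        self.cam_to_idx = [type_to_idx[t] for t in camera_types]
        self.adapters = nn.ModuleList([
            nn.Conv2d(channels, channels, 1) for _ in unique
        ])

    def forward(self, x):  # x: (B, N, C, H, W)
        out = [self.adapters[self.cam_to_idx[i]](x[:, i])
               for i in range(x.shape[1])]
        return torch.stack(out, dim=1)

types = ['wide', 'tele', 'wide', 'wide', 'fisheye']
m = TypeSharedAdapters(types, channels=8)
y = m(torch.randn(2, 5, 8, 16, 16))
assert y.shape == (2, 5, 8, 16, 16)
assert len(m.adapters) == 3  # wide, tele, fisheye share 3 adapters total
```

Sharing by type also means a newly added camera of an existing type reuses trained adapter weights instead of starting from scratch.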
---

## 🔧 Quick Start

### Testing the current model's camera flexibility

```python
# test_camera_flexibility.py

import torch
from mmcv import Config
from mmdet3d.models import build_model

# Load the current model
cfg = Config.fromfile('configs/.../multitask_BEV2X_phase4a_stage1_task_gca.yaml')
model = build_model(cfg.model)
model.load_state_dict(torch.load('epoch_5.pth')['state_dict'])

# Try different camera counts
for num_cams in [3, 4, 5, 6, 8]:
    print(f"\nTesting {num_cams} cameras:")

    # Simulated inputs
    img = torch.randn(1, num_cams, 3, 900, 1600)  # dynamic N
    camera_intrinsics = torch.randn(1, num_cams, 4, 4)
    camera2lidar = torch.randn(1, num_cams, 4, 4)
    # ... remaining inputs

    try:
        # Forward pass
        output = model.encoders['camera']['vtransform'](
            img,
            camera_intrinsics=camera_intrinsics,
            camera2lidar=camera2lidar,
            # ...
        )
        print(f"  ✅ OK! Output shape: {output.shape}")
    except Exception as e:
        print(f"  ❌ Failed: {e}")
```

**Expected outcome**:
- ✅ 3-8 cameras should work at the code level
- ⚠️ Retraining is still required (the weights were trained with 6 cameras)
---

## 📊 Performance Estimates

### Camera Adapter Option

| Cameras | Params | Train time | Expected mIoU | vs 6-cam   |
|---------|--------|------------|---------------|------------|
| 4       | 114M   | -15%       | 58-60%        | -1 to -3%  |
| 5       | 114M   | -8%        | 60-61%        | 0 to -1%   |
| 6       | 114M   | baseline   | 61%           | 0%         |
| 8       | 114M   | +8%        | 62-63%        | +1 to +2%  |

**Analysis**:
- More cameras → more viewpoints → better performance
- But returns diminish (6→8 adds only 1-2%)
- 4 cameras still reach 58-60% (acceptable)
---

## ✨ Summary

**Your question**: how can the camera setup be configured flexibly?

**My recommendation**:

1. **Usable immediately** (no model changes):
   - Modify the data loading to support 4-8 cameras
   - Fine-tune from the existing checkpoint
   - ~1 day to implement

2. **Recommended option** (best ROI):
   - Implement the Camera Adapter
   - 1 week development + 1 week training
   - +1-2% performance, much greater flexibility

3. **Advanced option** (if peak performance is required):
   - Per-Camera Attention
   - 2 weeks development + 1 week training
   - +2-4% performance, supports arbitrary configurations

4. **MoE is not the best choice here**:
   - High compute overhead (+20%)
   - Training is complex
   - Gains are less clear-cut than with attention
   - Worth it only with very many camera types (>8)

**Next steps**:
1. Wait for the current training run to finish (11/13)
2. Evaluate whether camera flexibility is actually needed
3. If so, I can implement the Camera Adapter right away

**Shall I start writing the code now?** 🚀