# GeneralizedLSSFPN: A Detailed Walkthrough

**Question**: Does GeneralizedLSSFPN take LiDAR depth as input?

**Answer**: ❌ **No. GeneralizedLSSFPN operates purely on image features and does not use LiDAR depth.**

---
## 🎯 Core Function

**GeneralizedLSSFPN is a purely vision-based feature pyramid network (FPN)** that fuses the multi-scale image features produced by the Swin Transformer backbone.

---
## 📐 Full Architecture

```
Input: 3 scales from the Swin Transformer
├─ laterals[0]: 192 ch @ H/8 × W/8   (shallow)
├─ laterals[1]: 384 ch @ H/16 × W/16 (middle)
└─ laterals[2]: 768 ch @ H/32 × W/32 (deep)
        ↓
[Top-down fusion]
        ↓
Step 1: start from the deepest level (laterals[2])
  768 ch @ H/32×W/32
    ↓ 2× upsample (bilinear interpolation)
  768 ch @ H/16×W/16
    ↓ concat
  [384 ch + 768 ch] @ H/16×W/16 = 1152 ch
    ↓ lateral_conv (1×1 conv, channel reduction)
  256 ch @ H/16×W/16
    ↓ fpn_conv (3×3 conv, smoothing)
  256 ch @ H/16×W/16  ← intermediate output
        ↓
Step 2: continue upward
  256 ch @ H/16×W/16
    ↓ 2× upsample
  256 ch @ H/8×W/8
    ↓ concat
  [192 ch + 256 ch] @ H/8×W/8 = 448 ch
    ↓ lateral_conv (1×1 conv)
  256 ch @ H/8×W/8
    ↓ fpn_conv (3×3 conv)
  256 ch @ H/8×W/8  ← final output
        ↓
Output: unified 256-channel features @ H/8×W/8
```

---
## 💻 Source Code Walkthrough

### Initialization (`__init__`)

```python
class GeneralizedLSSFPN(BaseModule):
    def __init__(
        self,
        in_channels,        # [192, 384, 768] - input channels of the 3 scales
        out_channels,       # 256 - unified output channels
        num_outs,           # number of outputs
        start_level=0,      # start from level 0
        end_level=-1,       # up to the last level
        no_norm_on_lateral=False,
        conv_cfg=None,
        norm_cfg=dict(type="BN2d"),
        act_cfg=dict(type="ReLU"),
        upsample_cfg=dict(mode="bilinear", align_corners=True),
    ):
```

**Key points**:

- ❌ No depth-related parameters of any kind
- ❌ No LiDAR input parameters
- ✅ Processes image features only
### Building the Layers

```python
# Lateral convolutions (channel reduction)
for i in range(self.start_level, self.backbone_end_level):
    l_conv = ConvModule(
        in_channels[i] + (
            in_channels[i + 1] if i == self.backbone_end_level - 1
            else out_channels
        ),
        out_channels,  # output 256 channels
        1,             # 1×1 convolution
        ...
    )

    # Output smoothing convolution
    fpn_conv = ConvModule(
        out_channels,
        out_channels,
        3,             # 3×3 convolution
        padding=1,
        ...
    )
```

**Roles**:

1. **lateral_conv**: reduces the concatenated features to 256 channels
2. **fpn_conv**: smooths the features with a 3×3 convolution, removing upsampling artifacts
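The conditional in the loop above determines each lateral conv's input width: the deepest fused level concatenates the next backbone scale's raw channels, while shallower levels concatenate the already-reduced 256-channel map. A minimal sketch (hypothetical helper `lateral_in_channels`, assuming `backbone_end_level == len(in_channels) - 1` as in this configuration):

```python
def lateral_in_channels(in_channels, out_channels=256):
    """Input channel count of each lateral conv after the top-down concat.

    Mirrors the conditional above: the deepest fused level concats the next
    raw backbone scale; every shallower level concats the reduced features.
    """
    end = len(in_channels) - 1  # backbone_end_level in this configuration
    return [
        in_channels[i] + (in_channels[i + 1] if i == end - 1 else out_channels)
        for i in range(end)
    ]

# For this project's scales: [192+256, 384+768] = [448, 1152]
lateral_in_channels([192, 384, 768])
```

This matches the 448-channel and 1152-channel concat widths in the architecture diagram.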
### Forward Pass (`forward`)

```python
def forward(self, inputs):
    # inputs: [laterals[0], laterals[1], laterals[2]]
    # Receives image features only - no depth input!

    assert len(inputs) == len(self.in_channels)

    # 1. Build lateral features
    laterals = [inputs[i + self.start_level] for i in range(len(inputs))]

    # 2. Top-down path
    used_backbone_levels = len(laterals) - 1
    for i in range(used_backbone_levels - 1, -1, -1):
        # Upsample the higher-level features
        x = F.interpolate(
            laterals[i + 1],
            size=laterals[i].shape[2:],  # upsample to the lower level's resolution
            mode='bilinear',             # bilinear interpolation
            align_corners=True,
        )

        # Concat: lower-level features + upsampled higher-level features
        laterals[i] = torch.cat([laterals[i], x], dim=1)

        # Channel reduction: 1×1 convolution
        laterals[i] = self.lateral_convs[i](laterals[i])

        # Smoothing: 3×3 convolution
        laterals[i] = self.fpn_convs[i](laterals[i])

    # 3. Return the fused features
    outs = [laterals[i] for i in range(used_backbone_levels)]
    return tuple(outs)
```

**Key findings**:

- ❌ `forward` receives only `inputs` (image features)
- ❌ No extra depth argument
- ✅ Pure feature pyramid fusion

---
## 🔍 Why There Is No LiDAR Depth

### 1. Design Philosophy

BEVFusion adopts a **late-fusion** strategy:

```
Camera branch:
  images → backbone → FPN → LSS → camera BEV
            ↓
  no LiDAR depth; purely visual

LiDAR branch:
  point cloud → voxelization → backbone → LiDAR BEV
            ↓
  processed independently

Fusion stage:
  camera BEV + LiDAR BEV → ConvFuser → final BEV
```
### 2. Where Does Depth Come From?

**Answer: the network learns to predict depth itself!**

In the subsequent `DepthLSSTransform`:

```python
# Simplified LSS flow
def forward(self, x):
    # x: 256-ch image features (from the FPN)

    # 1. Depth prediction network (learned!)
    depth_logits = self.depth_net(x)            # predict a depth distribution
    depth_prob = softmax(depth_logits, dim=1)   # convert to probabilities

    # 2. Lift to 3D
    features_3d = x.unsqueeze(3) * depth_prob.unsqueeze(1)

    # 3. Splat to BEV
    bev_features = cumsum(features_3d, dim=depth_axis)

    return bev_features
```

**Key points**:

- ✅ Depth is **predicted by the network**
- ✅ No dependence on LiDAR depth annotations
- ✅ End-to-end trainable
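To make the Lift step concrete, here is a minimal pure-Python sketch (a hypothetical helper `lift_pixel` with toy values, not the real `DepthLSSTransform`) of the `feature × depth_prob` outer product for a single pixel:

```python
import math

def lift_pixel(feature, depth_logits):
    """Lift one pixel's C-dim feature into C x D depth-weighted copies.

    Mirrors x.unsqueeze(3) * depth_prob.unsqueeze(1) for a single pixel:
    each depth bin receives the feature scaled by that bin's probability.
    """
    # Numerically stable softmax over the D depth bins
    m = max(depth_logits)
    exps = [math.exp(l - m) for l in depth_logits]
    total = sum(exps)
    depth_prob = [e / total for e in exps]

    # Outer product: C channels x D depth bins
    lifted = [[f * p for p in depth_prob] for f in feature]
    return lifted, depth_prob

feature = [1.0, 2.0]       # C = 2 channels (toy values)
logits = [0.0, 0.0, 0.0]   # D = 3 depth bins, uniform after softmax
lifted, prob = lift_pixel(feature, logits)
```

Because the probabilities sum to 1, the total feature mass is preserved; the network concentrates it at the depths it believes are occupied.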
### 3. Why Not Use LiDAR Depth?

#### ❌ Drawbacks (if it were used)
```
1. Dependency:
   - LiDAR would be required at inference time
   - no camera-only deployment

2. Modality mismatch:
   - LiDAR is sparse, images are dense
   - alignment is difficult

3. Training complexity:
   - requires accurate camera-LiDAR calibration
   - high demands on time synchronization
```

#### ✅ Advantages (purely visual)
```
1. Independence:
   - the camera branch works on its own
   - camera-only deployment is possible

2. End-to-end:
   - depth prediction is learnable
   - adapts to different scenes

3. Flexibility:
   - LiDAR can supervise training
   - but inference does not depend on it
```

---
## 🔄 Full Data Flow

### Camera Branch (no LiDAR depth)

```
Raw images (1600×900, 6 views)
    ↓
[Swin Transformer backbone]
Outputs: [192ch@H/8, 384ch@H/16, 768ch@H/32]
    ↓
[GeneralizedLSSFPN]  ← we are here!
Input: [192ch, 384ch, 768ch]
❌ no depth input
Operation: pure image-feature fusion
Output: 256ch @ H/8×W/8
    ↓
[DepthLSSTransform]
Input: 256-ch image features
Operations:
  1. depth prediction network → depth distribution
  2. Lift (2D → 3D)
  3. Splat (3D → BEV)
Output: 80-ch BEV features
    ↓
[ConvFuser]
Input: camera BEV (80ch) + LiDAR BEV (256ch)
Output: 256-ch fused BEV
```

---
## 📊 Concrete Example

### Input
```python
inputs = [
    x0,  # 192ch, 200×112 (H/8×W/8)   - Stage 2
    x1,  # 384ch, 100×56  (H/16×W/16) - Stage 3
    x2,  # 768ch, 50×28   (H/32×W/32) - Stage 4
]
```

### Processing
#### Iteration 1 (i=1, starting from the second-deepest level)
```python
# Upsample x2
x2_up = interpolate(x2, size=(100, 56))   # 768ch, 100×56

# Concat
concat = cat([x1, x2_up], dim=1)          # [384+768]ch = 1152ch, 100×56

# Lateral conv (channel reduction)
lateral = lateral_conv_1(concat)          # 256ch, 100×56

# FPN conv (smoothing)
x1_out = fpn_conv_1(lateral)              # 256ch, 100×56
```
#### Iteration 2 (i=0, the shallowest level)
```python
# Upsample x1_out
x1_up = interpolate(x1_out, size=(200, 112))  # 256ch, 200×112

# Concat
concat = cat([x0, x1_up], dim=1)              # [192+256]ch = 448ch, 200×112

# Lateral conv
lateral = lateral_conv_0(concat)              # 256ch, 200×112

# FPN conv
x0_out = fpn_conv_0(lateral)                  # 256ch, 200×112
```
### Output
```python
outputs = [
    x0_out,   # 256ch, 200×112 (H/8×W/8) ← final FPN output
    # x1_out, # 256ch, 100×56 (optional; usually only the highest resolution is used)
]
```

---
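The two iterations above can be verified with plain shape arithmetic. A minimal sketch (pure Python, no real tensors; shapes as hypothetical `(C, H, W)` tuples):

```python
def fuse_step(low, high, out_channels=256):
    """Trace shapes through one top-down fusion step.

    low, high: (C, H, W) shapes of the lower and higher FPN level.
    Returns the output shape after upsample -> concat -> 1x1 -> 3x3,
    plus the intermediate concatenated channel count.
    """
    c_low, h, w = low
    c_high = high[0]
    # Upsample `high` to `low`'s resolution, then concat along channels
    c_concat = c_low + c_high
    # The 1x1 lateral conv reduces to out_channels; the 3x3 conv keeps them
    return (out_channels, h, w), c_concat

x0, x1, x2 = (192, 200, 112), (384, 100, 56), (768, 50, 28)
x1_out, c1 = fuse_step(x1, x2)       # Iteration 1: concat = 384+768
x0_out, c0 = fuse_step(x0, x1_out)   # Iteration 2: concat = 192+256
```

Running this reproduces the 1152-channel and 448-channel concat widths and the 256-channel outputs from the walkthrough.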
## 🆚 vs. Traditional FPN

### Traditional FPN
```
deep features → upsample → ADD (element-wise)
                             ↑
                      shallow features
```

### GeneralizedLSSFPN
```
deep features → upsample → CONCAT (channel concatenation)
                             ↑
                      shallow features
                             ↓
                      1×1 conv (channel reduction)
                             ↓
                      3×3 conv (smoothing)
```

**Advantages**:

- ✅ Concat preserves more information than ADD
- ✅ The 1×1 convolution learns the best way to fuse
- ✅ The 3×3 convolution removes upsampling artifacts

---
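Why a learned 1×1 conv over concatenated channels is at least as expressive as element-wise add can be seen directly: with weights of 1 on both inputs it reduces to add, but any other weighting is also reachable. A toy single-pixel sketch (hypothetical `conv1x1` helper):

```python
def conv1x1(weights, channels):
    """Apply a 1x1 convolution at a single spatial location.

    weights: out_ch x in_ch matrix; channels: in_ch values at that pixel.
    """
    return [sum(w * c for w, c in zip(row, channels)) for row in weights]

shallow, deep = [3.0], [5.0]   # one channel each, one pixel
concat = shallow + deep        # 2 input channels after concatenation

# With weights [1, 1], concat + 1x1 conv reduces to element-wise ADD...
add_like = conv1x1([[1.0, 1.0]], concat)
# ...but the weights are learned, so any weighting of the two sources is possible
weighted = conv1x1([[0.25, 0.75]], concat)
```

Traditional FPN hard-codes the first case; GeneralizedLSSFPN lets training choose the weights per output channel.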
## 🎯 Role in BEVFusion

### 1. Multi-Scale Information Fusion
```
shallow (H/8):  detail + precise localization
middle (H/16):  a balance of both
deep (H/32):    semantics + global context
        ↓
fused into a unified 256-ch representation
```

### 2. Well-Prepared Features for LSS
```
LSS needs:
- rich semantic information   (provided by the deep levels)
- precise spatial information (provided by the shallow levels)
- a unified feature dimension (provided by the FPN)

The FPN output satisfies all three.
```

### 3. Better Small-Object Detection
```
Small targets such as stop lines and dividers:
- depend on high-resolution features (H/8)
- the FPN ensures shallow detail reaches the output
- this underpins the Phase 4A performance gains
```

---
## 🔬 Ablation (from the BEVFusion paper)

### Without FPN (single scale only)
```
Setup: use only Stage 4 (H/32) deep features
Results:
- small-object detection drops ~15%
- overall mAP drops ~5%
Reason: the resolution is too low and detail is lost
```

### With FPN (multi-scale fusion)
```
Setup: fuse 3 scales (this project's configuration)
Results:
- best small-object detection ✅
- large objects remain well detected ✅
- moderate increase in compute
```

---
## 💡 Key Insights

### 1. Purely Visual Design
```
GeneralizedLSSFPN has no LiDAR dependency at all
        ↓
the camera branch can work independently
        ↓
camera-only deployment is supported
```

### 2. Learned Depth Prediction
```
LiDAR depth is not used as an input
        ↓
the network learns to predict depth from images
        ↓
end-to-end optimization, better adaptability
```

### 3. Late-Fusion Strategy
```
camera: processed independently → camera BEV
LiDAR:  processed independently → LiDAR BEV
        ↓
fused in BEV space
        ↓
each modality contributes its strengths
```

---
## 📐 Parameters and Compute

### Parameter Count
```
Assuming 3 scales: [192, 384, 768] → 256

lateral_conv_0: (192+256) × 256 × 1×1 ≈ 115K
fpn_conv_0:     256 × 256 × 3×3       ≈ 590K

lateral_conv_1: (384+768) × 256 × 1×1 ≈ 295K
fpn_conv_1:     256 × 256 × 3×3       ≈ 590K

Total: ~1.6M parameters
```

### Compute (FLOPs)
```
Main contributors:
- upsampling (bilinear interpolation): moderate
- 1×1 convolutions: small
- 3×3 convolutions: the dominant cost

Estimate: ~2 GFLOPs (<1% of the whole model)
```
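The figures above can be checked with a few lines of Python (weights only; biases and norm layers omitted for simplicity):

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution (biases and norm omitted)."""
    return in_ch * out_ch * k * k

lateral_0 = conv_params(192 + 256, 256, 1)   # 114,688  ≈ 115K
fpn_0     = conv_params(256, 256, 3)         # 589,824  ≈ 590K
lateral_1 = conv_params(384 + 768, 256, 1)   # 294,912  ≈ 295K
fpn_1     = conv_params(256, 256, 3)         # 589,824  ≈ 590K

total = lateral_0 + fpn_0 + lateral_1 + fpn_1  # 1,589,248 ≈ 1.6M
```

The exact weight total of 1,589,248 matches the ~1.6M estimate; the two 3×3 convs account for roughly three quarters of it.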
---
## 🎓 Summary

### What GeneralizedLSSFPN Is
```
✅ a purely visual feature pyramid network
✅ fuses the backbone's multi-scale outputs
✅ outputs unified 256-channel features
❌ does not use LiDAR depth
❌ does not depend on external depth information
```

### Its Place in BEVFusion
```
Swin Transformer (multi-scale extraction)
        ↓
GeneralizedLSSFPN (multi-scale fusion)  ← we are here
        ↓
DepthLSSTransform (learned depth prediction)
        ↓
ConvFuser (multi-modal fusion)
```

### Contribution to Phase 4A
```
1. preserves shallow detail       → small-object detection ⭐
2. fuses deep semantics           → scene understanding
3. unified feature representation → efficient LSS processing
```

---
**Final answer**:

- ❌ **GeneralizedLSSFPN has no LiDAR depth input**
- ✅ **It is a purely visual FPN that processes image features only**
- ✅ **Depth information is predicted by a later, learned network**
- ✅ **This reflects BEVFusion's late-fusion design**