# Detailed Analysis of Input/Output Sizes for the BEV Segmentation Task

**Config**: `fusion-det-seg-swint-enhanced.yaml`

**Task**: BEV map segmentation

**Classes**: 6 (drivable_area, ped_crossing, walkway, stop_line, carpark_area, divider)

---
## 📐 Full Data-Flow Size Trace

### 0. Raw Input Data

```python
# Camera images
images: (B, N, 3, H, W)
B = 1   (batch size, single GPU)
N = 6   (number of cameras)
H = 256 (image height)
W = 704 (image width)
Shape: (1, 6, 3, 256, 704)

# LiDAR point cloud
points: List[Tensor]
Per sample: (N_points, 5)  # x, y, z, intensity, timestamp
Range: [-54m, 54m] × [-54m, 54m] × [-5m, 3m]
```

---
### 1. Camera Encoder → BEV Features

#### 1.1 Backbone (SwinTransformer)
```python
Input: (1, 6, 3, 256, 704)
  ↓
SwinT backbone (3 output scales)
├─ Stage 1: (1, 6, 192, 64, 176)  # H/4, W/4
├─ Stage 2: (1, 6, 384, 32, 88)   # H/8, W/8
└─ Stage 3: (1, 6, 768, 16, 44)   # H/16, W/16
```

#### 1.2 Neck (GeneralizedLSSFPN)
```python
Input: 3 scales with [192, 384, 768] channels
  ↓
FPN
  ↓
Output: (1, 6, 256, 32, 88)  # unified to 32×88 spatial size, 256 channels
```

#### 1.3 View Transform (DepthLSSTransform)
```python
Input: (1, 6, 256, 32, 88)
  ↓
DepthNet: 256 → 199 channels (119 depth bins + 80 context channels)
  ↓
3D volume: (1, 6, 80, 119, 32, 88)
  ↓
BEV pooling (projection onto the BEV plane)
  ↓
Camera BEV: (1, 80, 360, 360)  # 80 channels

Calculation:
xbound: [-54, 54, 0.3] → 108m / 0.3m = 360 grids
ybound: [-54, 54, 0.3] → 108m / 0.3m = 360 grids
```
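
The grid counts above follow directly from the bound definitions. A small helper makes the arithmetic explicit (the function name `num_grids` is ours, not from the codebase; `round` absorbs floating-point noise from steps like 0.3):

```python
def num_grids(lower: float, upper: float, step: float) -> int:
    """Number of BEV grid cells along one axis for a [lower, upper, step] bound."""
    return round((upper - lower) / step)

# xbound/ybound of the camera BEV: [-54, 54, 0.3]
print(num_grids(-54.0, 54.0, 0.3))  # → 360
# output_scope of the segmentation head: [-50, 50, 0.5]
print(num_grids(-50.0, 50.0, 0.5))  # → 200
```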
---
### 2. LiDAR Encoder → BEV Features

#### 2.1 Voxelization
```python
Input: points (N_points, 5)
  ↓
Voxelization
voxel_size: [0.075m, 0.075m, 0.2m]
point_range: [-54m, 54m] × [-54m, 54m] × [-5m, 3m]
  ↓
Voxels: (N_voxels, 10, 5)  # at most 10 points per voxel

Sparse shape:
X: 108m / 0.075m = 1440 grids
Y: 108m / 0.075m = 1440 grids
Z: 8m / 0.2m = 40 grids
→ (1440, 1440, 40)
```
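
The same sparse shape can be derived from the range and voxel size alone. A sketch (the helper name `sparse_shape` is ours):

```python
def sparse_shape(point_range, voxel_size):
    """Grid shape implied by a point-cloud range [x0, y0, z0, x1, y1, z1]
    and a per-axis voxel size [vx, vy, vz]."""
    x0, y0, z0, x1, y1, z1 = point_range
    spans = (x1 - x0, y1 - y0, z1 - z0)
    return tuple(round(s / v) for s, v in zip(spans, voxel_size))

print(sparse_shape([-54, -54, -5, 54, 54, 3], [0.075, 0.075, 0.2]))
# → (1440, 1440, 40)
```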
#### 2.2 Sparse Encoder
```python
Input: sparse voxels (1440, 1440, 40)
  ↓
SparseEncoder (4 stages)
  ↓
Output: dense BEV (1, 256, 360, 360)

Calculation:
Sparse (1440, 1440) → dense (360, 360)
Downsampling factor: 1440 / 360 = 4×
```

---

### 3. Fuser → Fused BEV Features

```python
Camera BEV: (1, 80, 360, 360)
LiDAR BEV:  (1, 256, 360, 360)
  ↓
ConvFuser (camera 80 → 256, then element-wise addition)
  ↓
Fused BEV: (1, 256, 360, 360)
```
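
Since fusion only changes channels, the shape arithmetic can be checked independently of any framework. `fuse_shapes` below is our own sketch of the projection-then-add scheme described above (not the actual ConvFuser implementation):

```python
def fuse_shapes(cam_shape, lidar_shape, out_channels=256):
    """Shape of the fused BEV, assuming the camera BEV is projected to
    `out_channels` and added to the LiDAR BEV (as described above)."""
    b, _, h, w = cam_shape
    lb, lc, lh, lw = lidar_shape
    assert (b, h, w) == (lb, lh, lw), "BEV grids must align before fusion"
    assert lc == out_channels, "LiDAR BEV must already have out_channels"
    return (b, out_channels, h, w)

print(fuse_shapes((1, 80, 360, 360), (1, 256, 360, 360)))  # → (1, 256, 360, 360)
```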
---

### 4. Decoder → Multi-Scale Features

#### 4.1 SECOND Backbone
```python
Input: (1, 256, 360, 360)
  ↓
SECOND (2 stages)
├─ Stage 1: (1, 128, 360, 360)  # stride=1
└─ Stage 2: (1, 256, 180, 180)  # stride=2, downsampled
```

#### 4.2 SECONDFPN Neck
```python
Input:
├─ Stage 1: (1, 128, 360, 360)
└─ Stage 2: (1, 256, 180, 180)
  ↓
FPN
├─ Feature 1: (1, 256, 360, 360)  # Stage 1 → 256 channels
└─ Feature 2: (1, 256, 360, 360)  # Stage 2 upsampled 2× → 256 channels
  ↓
Concat
  ↓
Output: (1, 512, 360, 360)  # 256 × 2 = 512 channels
```
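
The neck's output shape is just "map each stage to 256 channels, upsample the coarse one 2×, concatenate". A sketch of that bookkeeping (the helper name `secondfpn_out` is ours):

```python
def secondfpn_out(stage1, stage2, out_ch=256):
    """Output shape of the neck: each stage mapped to `out_ch` channels,
    the coarser stage upsampled 2x, then concatenated along channels."""
    b, _, h1, w1 = stage1
    _, _, h2, w2 = stage2
    assert (h1, w1) == (2 * h2, 2 * w2), "stage 2 is assumed to be half resolution"
    return (b, 2 * out_ch, h1, w1)

print(secondfpn_out((1, 128, 360, 360), (1, 256, 180, 180)))  # → (1, 512, 360, 360)
```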
**Key point**: the decoder output has **512 channels at a 360×360 spatial size**.

---

## 🎯 5. Segmentation Head: Detailed Size Analysis

### Configuration
```yaml
map:
  in_channels: 512  # decoder output
  grid_transform:
    input_scope: [[-54.0, 54.0, 0.75], [-54.0, 54.0, 0.75]]
    output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]
```

**Calculation**:
```python
# Input scope
input_x_size = (54.0 - (-54.0)) / 0.75 = 108 / 0.75 = 144 grids
input_y_size = (54.0 - (-54.0)) / 0.75 = 108 / 0.75 = 144 grids

# Output scope
output_x_size = (50 - (-50)) / 0.5 = 100 / 0.5 = 200 grids
output_y_size = (50 - (-50)) / 0.5 = 100 / 0.5 = 200 grids
```
---

### 5.1 Original BEVSegmentationHead Size Flow

```python
Input (from decoder):
Shape: (B, 512, 360, 360)
Size: 1 × 512 × 360 × 360

↓ Step 1: BEVGridTransform

# Grid transform, step by step:
# 1. Resample from 360×360 to 144×144 (input_scope)
# 2. Generate grid coordinates
#    Range: [-54, 54] → 144 grid points, 0.75m step
# 3. grid_sample interpolation to 200×200 (output_scope)
#    Range: [-50, 50] → 200 grid points, 0.5m step

BEV grid transform:
Input:  (B, 512, 360, 360)
Output: (B, 512, 200, 200)

Notes:
- The decoder outputs 360×360, but segmentation only uses the central region
- The 360×360 map is cropped/interpolated down to 200×200
- The spatial range shrinks from ±54m to ±50m

↓ Step 2: Classifier layer 1

Conv2d(512, 512, 3, padding=1) + BN + ReLU
Input:  (B, 512, 200, 200)
Output: (B, 512, 200, 200)

↓ Step 3: Classifier layer 2

Conv2d(512, 512, 3, padding=1) + BN + ReLU
Input:  (B, 512, 200, 200)
Output: (B, 512, 200, 200)

↓ Step 4: Final classifier

Conv2d(512, 6, 1)  # 6 classes
Input:  (B, 512, 200, 200)
Output: (B, 6, 200, 200)  # logits

↓ Step 5 (inference only): sigmoid activation

torch.sigmoid(logits)
Output: (B, 6, 200, 200)  # probabilities in [0, 1]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Final output:
Shape: (B, 6, 200, 200)
Batch: B = 1 (single GPU)
Classes: 6 (one channel per class)
Spatial: 200 × 200 grids
Range: ±50m × ±50m (actual coverage)
Resolution: 0.5m per grid
```
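
The three convolutions above can be sized up with a quick parameter count. This is a back-of-the-envelope sketch (BatchNorm parameters omitted, bias assumed on every conv):

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of a Conv2d(c_in, c_out, k) layer."""
    return c_out * c_in * k * k + (c_out if bias else 0)

head = (
    conv2d_params(512, 512, 3)    # classifier layer 1
    + conv2d_params(512, 512, 3)  # classifier layer 2
    + conv2d_params(512, 6, 1)    # final 1x1 classifier
)
print(head)  # → 4722694 (~4.7M parameters, BN not counted)
```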
---

### 5.2 Enhanced EnhancedBEVSegmentationHead Size Flow

```python
Input (from decoder):
Shape: (B, 512, 360, 360)

↓ Step 1: BEV grid transform

BEVGridTransform:
Input:  (B, 512, 360, 360)
Output: (B, 512, 200, 200)

↓ Step 2: ASPP multi-scale feature extraction

ASPP (5 branches):
Branch 1 (1×1):     (B, 512, 200, 200) → (B, 256, 200, 200)
Branch 2 (3×3@d6):  (B, 512, 200, 200) → (B, 256, 200, 200)
Branch 3 (3×3@d12): (B, 512, 200, 200) → (B, 256, 200, 200)
Branch 4 (3×3@d18): (B, 512, 200, 200) → (B, 256, 200, 200)
Branch 5 (global):  (B, 512, 200, 200) → (B, 256, 200, 200)
  ↓
Concat: (B, 1280, 200, 200)  # 256 × 5
  ↓
Projection conv 1×1: (B, 1280, 200, 200) → (B, 256, 200, 200)

↓ Step 3: Channel attention

ChannelAttention:
Input:  (B, 256, 200, 200)
Output: (B, 256, 200, 200)  # channel-weighted

↓ Step 4: Spatial attention

SpatialAttention:
Input:  (B, 256, 200, 200)
Output: (B, 256, 200, 200)  # spatially weighted

↓ Step 5: Auxiliary classifier (deep supervision)

Conv2d(256, 6, 1)  [training only]
Input:  (B, 256, 200, 200)
Output: (B, 6, 200, 200)  # auxiliary supervision

↓ Step 6: Deep decoder (3 layers)

Layer 1: Conv(256, 256, 3) + BN + ReLU + Dropout
Input:  (B, 256, 200, 200)
Output: (B, 256, 200, 200)

Layer 2: Conv(256, 128, 3) + BN + ReLU + Dropout
Input:  (B, 256, 200, 200)
Output: (B, 128, 200, 200)

Layer 3: Conv(128, 128, 3) + BN + ReLU + Dropout
Input:  (B, 128, 200, 200)
Output: (B, 128, 200, 200)

↓ Step 7: Per-class classifiers (6 independent classifiers)

For each class (×6):
Conv(128, 64, 3) + BN + ReLU
Input:  (B, 128, 200, 200)
Output: (B, 64, 200, 200)
Conv(64, 1, 1)
Input:  (B, 64, 200, 200)
Output: (B, 1, 200, 200)

Concat of the 6 outputs:
Output: (B, 6, 200, 200)  # logits

↓ Step 8 (inference only): sigmoid

torch.sigmoid(logits)
Output: (B, 6, 200, 200)  # probabilities in [0, 1]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Final output:
Shape: (B, 6, 200, 200)
Size: 1 × 6 × 200 × 200 = 240,000 values
Memory: 240,000 × 4 bytes (float32) = 960 KB
```
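
The ASPP step is pure channel bookkeeping: every branch emits the same spatial map at 256 channels, the concat stacks them, and the 1×1 projection compresses back. A sketch of that arithmetic (the helper name `aspp_shapes` is ours):

```python
def aspp_shapes(in_shape, branch_ch=256, n_branches=5):
    """Channel bookkeeping for the ASPP block described above: every branch
    maps to `branch_ch` channels, the concat stacks them along channels,
    and a 1x1 projection brings the result back down to `branch_ch`."""
    b, _, h, w = in_shape
    concat = (b, branch_ch * n_branches, h, w)
    projected = (b, branch_ch, h, w)
    return concat, projected

print(aspp_shapes((1, 512, 200, 200)))
# → ((1, 1280, 200, 200), (1, 256, 200, 200))
```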
---

## 📊 Key Size Summary

### Input Sizes

| Stage | Name | Shape | Spatial size | Channels |
|------|------|------|---------|--------|
| **Decoder output** | decoder neck output | (B, 512, 360, 360) | **360×360** | **512** |
| **After grid transform** | BEV grid transform | (B, 512, 200, 200) | **200×200** | 512 |

### Intermediate Sizes (enhanced head)

| Stage | Name | Shape | Channels |
|------|------|------|--------|
| ASPP output | multi-scale features | (B, 256, 200, 200) | 256 |
| After attention | refined features | (B, 256, 200, 200) | 256 |
| Decoder layer 1 | deep decoding | (B, 256, 200, 200) | 256 |
| Decoder layer 2 | deep decoding | (B, 128, 200, 200) | 128 |
| Decoder layer 3 | deep decoding | (B, 128, 200, 200) | 128 |

### Output Sizes

| Stage | Name | Shape | Notes |
|------|------|------|------|
| **Classifier output** | logits | (B, 6, 200, 200) | 6 classes, unnormalized |
| **Final output** | probabilities | (B, 6, 200, 200) | after sigmoid, in [0, 1] |

---
## 🗺️ Spatial Range in Detail

### BEV Grid Configuration

```yaml
grid_transform:
  input_scope: [[-54.0, 54.0, 0.75], [-54.0, 54.0, 0.75]]
  output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]
```

**Detailed calculation**:

#### Input Scope (decoder output space)
```python
Range: [-54m, 54m] × [-54m, 54m]
Resolution: 0.75m per grid

Grid counts:
X axis: (54 - (-54)) / 0.75 = 108 / 0.75 = 144 grids
Y axis: (54 - (-54)) / 0.75 = 108 / 0.75 = 144 grids

Actual decoder output: 360×360
Grid transform: 360×360 → 144×144 (downsampling)
```

**Issue**: the decoder actually outputs 360×360, while input_scope implies 144×144.

**Explanation**: BEVGridTransform resolves the size mismatch by interpolating with grid_sample:
```python
import torch.nn.functional as F

# grid_sample handles the resampling automatically:
# conceptually, the 360×360 map is sampled into the 144×144 input_scope frame,
# then interpolated onto the 200×200 output grid
F.grid_sample(
    x,     # (B, 512, 360, 360)
    grid,  # 200×200 sampling coordinates, mapped into the 360×360 map
    mode='bilinear',
)
# Output: (B, 512, 200, 200)
```
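
What `grid_sample` needs is each output cell's position expressed in the input map's normalized [-1, 1] coordinates. A minimal sketch of that mapping, assuming cell-center conventions (the function name and defaults are ours, not from BEVGridTransform):

```python
def out_cell_to_norm(i, out_lo=-50.0, out_res=0.5, in_lo=-54.0, in_hi=54.0):
    """World x-coordinate of output cell i's center, normalized to [-1, 1]
    over the input scope (the coordinate convention grid_sample expects)."""
    x = out_lo + (i + 0.5) * out_res           # cell center in meters
    return (x - in_lo) / (in_hi - in_lo) * 2 - 1

print(round(out_cell_to_norm(0), 4))    # first cell  → -0.9213
print(round(out_cell_to_norm(199), 4))  # last cell   → 0.9213
# The ±50m output scope never reaches ±1: it only samples the inner
# part of the ±54m input map, which is exactly the intended crop.
```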
#### Output Scope (segmentation output space)
```python
Range: [-50m, 50m] × [-50m, 50m]
Resolution: 0.5m per grid

Grid counts:
X axis: (50 - (-50)) / 0.5 = 100 / 0.5 = 200 grids
Y axis: (50 - (-50)) / 0.5 = 100 / 0.5 = 200 grids

Final output: 200×200 BEV grid
```

---

## 📐 Size Changes Visualized

```
Full pipeline size flow:

Raw images: (1, 6, 3, 256, 704)
  ↓
SwinT → FPN
  ↓ (1, 6, 256, 32, 88)

DepthLSS → BEV pooling
  ↓ Camera BEV: (1, 80, 360, 360)

LiDAR sparse encoder
  ↓ LiDAR BEV: (1, 256, 360, 360)

ConvFuser (fusion)
  ↓ Fused BEV: (1, 256, 360, 360)

SECOND backbone
  ↓ (1, 128, 360, 360) + (1, 256, 180, 180)

SECONDFPN neck
  ↓ (1, 512, 360, 360) ← segmentation head input

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Segmentation head:

BEVGridTransform
Input:  (1, 512, 360, 360)
Output: (1, 512, 200, 200) ← spatial downsampling

ASPP (enhanced head)
Output: (1, 256, 200, 200) ← channel reduction

Dual attention
Output: (1, 256, 200, 200)

Deep decoder (3 layers)
Output: (1, 128, 200, 200) ← further channel reduction

Per-class classifiers
Output: (1, 6, 200, 200) ← final segmentation masks

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
---

## 🎯 Core Size Summary

### Input Size
```
Segmentation head input:
Shape: (B, 512, 360, 360)
Batch: B = 1 (single GPU)
Channels: 512 (from the SECONDFPN concat)
Spatial: 360 × 360 grids
Resolution: 0.3m per grid
Range: ±54m × ±54m
Memory: 1 × 512 × 360 × 360 × 4 bytes ≈ 265 MB
```

### Output Size
```
Segmentation head output:
Shape: (B, 6, 200, 200)
Batch: B = 1
Classes: 6 (one channel per class)
Spatial: 200 × 200 grids
Resolution: 0.5m per grid
Range: ±50m × ±50m (100m × 100m = 10,000 m²)
Memory: 1 × 6 × 200 × 200 × 4 bytes = 960 KB
```

---

## 🔍 Per-Class Output in Detail

```python
# Final output
output: (B, 6, 200, 200)

# Split by class
output[0, 0, :, :] → (200, 200) drivable_area
output[0, 1, :, :] → (200, 200) ped_crossing
output[0, 2, :, :] → (200, 200) walkway
output[0, 3, :, :] → (200, 200) stop_line
output[0, 4, :, :] → (200, 200) carpark_area
output[0, 5, :, :] → (200, 200) divider

# Per class
Each pixel value: 0.0 – 1.0 (probability)
> 0.5 → the pixel belongs to this class
≤ 0.5 → the pixel does not belong to this class

# Spatial correspondence (cell [i, j] covers [-50 + 0.5i, -50 + 0.5(i+1)) meters)
grid[0, 0]     → world position (-50m, -50m)
grid[100, 100] → world position (0m, 0m) – the ego-vehicle center
grid[199, 199] → world position (+49.5m, +49.5m) – the last cell before +50m
```
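
The index-to-world correspondence above is a simple affine mapping. A sketch under the cell-corner convention used here (the helper name `grid_to_world` is ours):

```python
def grid_to_world(i, j, origin=-50.0, res=0.5):
    """World (x, y) of the lower corner of BEV cell (i, j).
    Cell (i, j) covers [origin + res*i, origin + res*(i+1)) on each axis."""
    return (origin + res * i, origin + res * j)

print(grid_to_world(0, 0))      # → (-50.0, -50.0)
print(grid_to_world(100, 100))  # → (0.0, 0.0)  ego-vehicle center
print(grid_to_world(199, 199))  # → (49.5, 49.5)  last cell's corner
```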
---

## 📊 Output Sizes Under Different Configurations

### Official Segmentation-Only Setup
```yaml
grid_transform:
  input_scope: [[-51.2, 51.2, 0.8], [-51.2, 51.2, 0.8]]
  output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]

Calculation:
Input:  102.4m / 0.8m = 128 × 128
Output: 100m / 0.5m = 200 × 200

Output: (B, 6, 200, 200)
```

### Our Dual-Task Setup
```yaml
grid_transform:
  input_scope: [[-54.0, 54.0, 0.75], [-54.0, 54.0, 0.75]]
  output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]

Calculation:
Input:  108m / 0.75m = 144 × 144
Output: 100m / 0.5m = 200 × 200

Output: (B, 6, 200, 200)
```

**Conclusion**: the final output size is identical in both cases: **(B, 6, 200, 200)**.

---

## 💾 Memory Footprint Analysis

### Segmentation Head Memory (single sample, float32)

```python
# Intermediate tensors in the forward pass (4 bytes per element)

BEV grid transform output:
(1, 512, 200, 200) = 81.92 MB

ASPP intermediates:
concat of 5 branches: (1, 1280, 200, 200) = 204.8 MB
after projection:     (1, 256, 200, 200)  = 40.96 MB

Attention modules:
channel attention: (1, 256, 200, 200) = 40.96 MB
spatial attention: (1, 256, 200, 200) = 40.96 MB

Decoder intermediates:
Layer 1: (1, 256, 200, 200) = 40.96 MB
Layer 2: (1, 128, 200, 200) = 20.48 MB
Layer 3: (1, 128, 200, 200) = 20.48 MB

Per-class classification:
per class: (1, 64, 200, 200) = 10.24 MB × 6 = 61.44 MB

Final output:
(1, 6, 200, 200) = 0.96 MB

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Peak activation (at the ASPP concat): ~205 MB (single sample)
Total GPU memory (with gradients): ~19 GB (8 GPUs, full model)
```
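
These per-tensor figures are straightforward to recompute. A small helper (ours) prints sizes in MB at float32, using 1 MB = 10⁶ bytes:

```python
from math import prod

def mb(shape, bytes_per_elem=4):
    """Size of a dense float32 tensor in megabytes (1 MB = 1e6 bytes)."""
    return prod(shape) * bytes_per_elem / 1e6

print(mb((1, 512, 200, 200)))   # grid transform output → 81.92
print(mb((1, 1280, 200, 200)))  # ASPP concat (peak)    → 204.8
print(mb((1, 6, 200, 200)))     # final output          → 0.96
```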
---

## 🎯 Sizes in Practice

### Physical Coverage
```
Output: (B, 6, 200, 200)

Physical space:
Range: ±50m × ±50m
Area: 100m × 100m = 10,000 m²
Resolution: 0.5m per grid

Grid size:
Each grid cell: 0.5m × 0.5m = 0.25 m²
Total cells: 200 × 200 = 40,000 grids
Total coverage: 40,000 × 0.25 = 10,000 m² ✅
```

### Spatial Precision
```
Each grid cell: 0.5m × 0.5m

For different objects:
├─ Vehicle (4m × 2m):         about 8×4 = 32 grids ✅ precise enough
├─ Walkway (1.5m wide):       about 3 grids        ✅ recognizable
├─ Lane divider (0.15m wide): about 0.3 grids      ⚠️ sub-pixel, hard
└─ Stop line (0.3m wide):     about 0.6 grids      ⚠️ hard to localize precisely

Note: this is one of the reasons stop_line and divider scores are low!
```
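
The per-object figures above come from dividing footprints by the cell size. A sketch (the helper name `cells_covered` is ours; it assumes an axis-aligned footprint):

```python
def cells_covered(width_m, length_m, res=0.5):
    """Approximate number of BEV cells an axis-aligned footprint occupies."""
    return round(width_m / res) * round(length_m / res)

print(cells_covered(2.0, 4.0))  # vehicle → 32 cells
print(0.15 / 0.5)               # divider width in cells → 0.3 (sub-pixel)
```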
---

## 🔬 Resolution Impact Analysis

### Raising the Output Resolution

**Option 1: 250×250**
```yaml
output_scope: [[-50, 50, 0.4], [-50, 50, 0.4]]

Calculation:
100m / 0.4m = 250 × 250 grids

Impact:
✅ Better lane-divider recognition (0.15m → 0.375 grids)
⚠️ ~56% more computation (200² → 250²)
⚠️ ~3GB more GPU memory
```

**Option 2: 400×400**
```yaml
output_scope: [[-50, 50, 0.25], [-50, 50, 0.25]]

Calculation:
100m / 0.25m = 400 × 400 grids

Impact:
✅ Markedly better lane-divider recognition (0.15m → 0.6 grids)
❌ 4× the computation
❌ GPU memory blow-up (+8GB)
❌ Not recommended
```
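
The compute figures quoted for both options follow from the quadratic growth in pixel count (a proxy for the head's cost; the helper name is ours):

```python
def pixel_cost_ratio(new_grids, base_grids=200):
    """Relative per-frame pixel count when changing the output grid size."""
    return (new_grids / base_grids) ** 2

print(pixel_cost_ratio(250))  # → 1.5625 (≈ +56% over 200×200)
print(pixel_cost_ratio(400))  # → 4.0
```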
**Recommendation**: keep 200×200 and improve performance through a stronger network architecture instead.

---

## 📈 Trade-Offs in Size Selection

### Why 200×200?

**Advantages**:
- ✅ Moderate compute (200² = 40K pixels)
- ✅ Reasonable memory footprint
- ✅ Precise enough for large targets (drivable area, carpark areas)
- ✅ Consistent with the official benchmark

**Disadvantages**:
- ⚠️ Limited precision for thin linear targets (0.15m lane dividers)
- ⚠️ Sub-pixel features are hard to capture

**Mitigations**:
- Do not raise the resolution (too expensive)
- Use a stronger architecture (ASPP, attention)
- Use Dice loss to optimize for small targets
- Use class weights to emphasize hard classes

---
## 🎯 Comparison with the Detection Output

### Detection Head Output
```python
Detection head output:
boxes_3d: (N_objects, 9)   # N_objects 3D boxes
Each box: [x, y, z, l, w, h, yaw, vx, vy]
scores_3d: (N_objects,)    # confidence
labels_3d: (N_objects,)    # class

Characteristics:
- Sparse output (only detected objects)
- Variable count (N_objects is typically 10–50)
- 9 values per object
```

### Segmentation Head Output
```python
Segmentation head output:
masks_bev: (B, 6, 200, 200)  # dense output

Characteristics:
- Dense output (a prediction for every grid cell)
- Fixed size (200×200)
- 6 class probabilities per location
- Total: 240,000 predicted values
```

**Comparison**:
- Detection: sparse, dynamic count, high-dimensional per-object representation
- Segmentation: dense, fixed size, 2D plane

---
## 🔢 Detailed Size Calculation Table

### Full Pipeline Sizes

| Step | Module | Input shape | Output shape | Spatial size |
|------|------|---------|---------|----------|
| 1 | Raw images | - | (1, 6, 3, 256, 704) | 256×704 |
| 2 | SwinT | (1, 6, 3, 256, 704) | (1, 6, 768, 16, 44) | 16×44 |
| 3 | FPN | (1, 6, 768, 16, 44) | (1, 6, 256, 32, 88) | 32×88 |
| 4 | DepthLSS | (1, 6, 256, 32, 88) | (1, 80, 360, 360) | **360×360** |
| 5 | LiDAR encoder | points | (1, 256, 360, 360) | **360×360** |
| 6 | ConvFuser | 2 BEV maps | (1, 256, 360, 360) | **360×360** |
| 7 | SECOND | (1, 256, 360, 360) | (1, 128, 360, 360) | **360×360** |
| 8 | SECONDFPN | 2 scales | (1, 512, 360, 360) | **360×360** |
| 9 | **Grid transform** | (1, 512, 360, 360) | (1, 512, **200, 200**) | **200×200** ← downsampled |
| 10 | ASPP | (1, 512, 200, 200) | (1, 256, 200, 200) | 200×200 |
| 11 | Attention | (1, 256, 200, 200) | (1, 256, 200, 200) | 200×200 |
| 12 | Decoder | (1, 256, 200, 200) | (1, 128, 200, 200) | 200×200 |
| 13 | **Classifiers** | (1, 128, 200, 200) | **(1, 6, 200, 200)** | **200×200** |

---
## 📊 Quick Reference

### Key Sizes at a Glance

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Segmentation head input:
✓ Shape: (1, 512, 360, 360)
✓ Channels: 512
✓ Spatial: 360×360 grids
✓ Resolution: 0.3m/grid
✓ Range: ±54m

Segmentation head output:
✓ Shape: (1, 6, 200, 200)
✓ Classes: 6
✓ Spatial: 200×200 grids
✓ Resolution: 0.5m/grid
✓ Range: ±50m
✓ Total area: 10,000 m²

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

---
## 💡 Design Considerations

### Why 360×360 → 200×200?

**Reasons**:

1. **Compute efficiency**:
   - 360×360 is too large; the segmentation head's compute cost would explode
   - 200×200 balances performance and efficiency

2. **Region of interest**:
   - ±54m is too far out; segmentation precision is low at range
   - ±50m covers the main region autonomous driving cares about

3. **Annotation quality**:
   - nuScenes map annotations mostly lie within ±50m
   - Distant annotations may be unreliable

4. **Consistency with the official setup**:
   - Official benchmarks all use a 200×200 output
   - This makes performance comparisons straightforward

---
## 🎓 Summary

### Core Sizes
```
Input: (1, 512, 360, 360) – 512 channels, 360×360 spatial
  ↓
Grid transform (360×360 → 200×200)
  ↓
Output: (1, 6, 200, 200) – 6 classes, 200×200 spatial

Spatial range: ±50m × ±50m = 10,000 m²
Spatial resolution: 0.5m per grid
```

---

**Generated**: 2025-10-19

**Document version**: 1.0