# BEV Segmentation Task: Detailed Input/Output Size Analysis

**Config**: `fusion-det-seg-swint-enhanced.yaml`
**Task**: BEV Map Segmentation
**Classes**: 6 (drivable_area, ped_crossing, walkway, stop_line, carpark_area, divider)

---

## 📐 Full Pipeline Size Trace

### 0. Raw Input Data

```python
# Camera images
images: (B, N, 3, H, W)
    B = 1   (batch size, per GPU)
    N = 6   (number of cameras)
    H = 256 (image height)
    W = 704 (image width)
Shape: (1, 6, 3, 256, 704)

# LiDAR point cloud
points: List[Tensor]
    per sample: (N_points, 5)  # x, y, z, intensity, timestamp
    range: [-54 m, 54 m] × [-54 m, 54 m] × [-5 m, 3 m]
```

---

### 1. Camera Encoder Output → BEV Features

#### 1.1 Backbone (SwinTransformer)
```python
Input: (1, 6, 3, 256, 704)
    ↓
SwinT backbone (3 output scales)
    ├─ Stage 1: (1, 6, 192, 64, 176)  # H/4,  W/4
    ├─ Stage 2: (1, 6, 384, 32, 88)   # H/8,  W/8
    └─ Stage 3: (1, 6, 768, 16, 44)   # H/16, W/16
```

#### 1.2 Neck (GeneralizedLSSFPN)
```python
Input: 3 scales with [192, 384, 768] channels
    ↓
FPN
    ↓
Output: (1, 6, 256, 32, 88)  # unified to 32×88 spatial size, 256 channels
```

#### 1.3 View Transform (DepthLSSTransform)
```python
Input: (1, 6, 256, 32, 88)
    ↓
DepthNet: 256 → 199 channels (119 depth bins + 80 context)
    ↓
3D volume: (1, 6, 80, 119, 32, 88)
    ↓
BEV pooling (projection onto the BEV plane)
    ↓
Camera BEV: (1, 80, 360, 360)  # 80 channels

Derivation:
    xbound: [-54, 54, 0.3] → 108 m / 0.3 m = 360 grids
    ybound: [-54, 54, 0.3] → 108 m / 0.3 m = 360 grids
```

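The bound-to-grid arithmetic above can be made checkable with a small helper (an illustrative sketch; `grid_size` is not a function from the codebase):

```python
def grid_size(lower: float, upper: float, step: float) -> int:
    """Number of BEV cells along one axis, from a [lower, upper, step] bound."""
    return round((upper - lower) / step)

# xbound / ybound from the config: [-54, 54, 0.3] → 360 cells per axis
print(grid_size(-54.0, 54.0, 0.3))
```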
---

### 2. LiDAR Encoder Output → BEV Features

#### 2.1 Voxelization
```python
Input: points (N_points, 5)
    ↓
Voxelization
    voxel_size:  [0.075 m, 0.075 m, 0.2 m]
    point_range: [-54 m, 54 m] × [-54 m, 54 m] × [-5 m, 3 m]
    ↓
Voxels: (N_voxels, 10, 5)  # at most 10 points per voxel

Sparse shape:
    X: 108 m / 0.075 m = 1440 grids
    Y: 108 m / 0.075 m = 1440 grids
    Z: 8 m / 0.2 m     = 40 grids
    → (1440, 1440, 40)
```

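The sparse-shape numbers follow directly from the point range and voxel size; a hypothetical helper (not repo code) reproduces the arithmetic:

```python
def sparse_shape(point_range, voxel_size):
    """Grid counts per axis.

    point_range = [x_min, y_min, z_min, x_max, y_max, z_max]
    voxel_size  = [dx, dy, dz]
    """
    return tuple(
        round((point_range[i + 3] - point_range[i]) / voxel_size[i])
        for i in range(3)
    )

print(sparse_shape([-54, -54, -5, 54, 54, 3], [0.075, 0.075, 0.2]))
```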
#### 2.2 Sparse Encoder
```python
Input: sparse voxels (1440, 1440, 40)
    ↓
SparseEncoder (4 stages)
    ↓
Output: dense BEV (1, 256, 360, 360)

Derivation:
    sparse (1440, 1440) → dense (360, 360)
    downsampling factor: 1440 / 360 = 4×
```

---

### 3. Fuser Output → Fused BEV Features

```python
Camera BEV: (1, 80, 360, 360)
LiDAR BEV:  (1, 256, 360, 360)
    ↓
ConvFuser (channel concat 80 + 256 = 336, then conv to 256)
    ↓
Fused BEV: (1, 256, 360, 360)
```

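A shape-only sketch of concat-then-conv fusion (assumed ConvFuser behavior; `fuse_shapes` is an illustrative name, not the project's API):

```python
def fuse_shapes(cam_shape, lidar_shape, out_channels=256):
    """Return (concat_shape, fused_shape) for channel-concat fusion."""
    b, c_cam, h, w = cam_shape
    b2, c_lidar, h2, w2 = lidar_shape
    # Both modalities must share the same BEV grid before fusion
    assert (b, h, w) == (b2, h2, w2), "BEV grids must align before fusion"
    concat = (b, c_cam + c_lidar, h, w)  # 80 + 256 = 336 channels
    fused = (b, out_channels, h, w)      # conv projects back to 256
    return concat, fused

print(fuse_shapes((1, 80, 360, 360), (1, 256, 360, 360)))
```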
---

### 4. Decoder Output → Multi-Scale Features

#### 4.1 SECOND Backbone
```python
Input: (1, 256, 360, 360)
    ↓
SECOND (2 stages)
    ├─ Stage 1: (1, 128, 360, 360)  # stride=1
    └─ Stage 2: (1, 256, 180, 180)  # stride=2, downsampled
```

#### 4.2 SECONDFPN Neck
```python
Input:
    ├─ Stage 1: (1, 128, 360, 360)
    └─ Stage 2: (1, 256, 180, 180)
    ↓
FPN
    ├─ Feature 1: (1, 256, 360, 360)  # Stage 1 → 256 channels
    └─ Feature 2: (1, 256, 360, 360)  # Stage 2 upsampled 2× → 256 channels
    ↓
Concat
    ↓
Output: (1, 512, 360, 360)  # 256 × 2 = 512 channels
```

**Key point**: the decoder output has **512 channels at 360×360 spatial size**.

---

## 🎯 5. Segmentation Head: Detailed Size Analysis

### Configuration
```yaml
map:
  in_channels: 512  # decoder output
  grid_transform:
    input_scope: [[-54.0, 54.0, 0.75], [-54.0, 54.0, 0.75]]
    output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]
```

**Derivation**:
```python
# Input scope
input_x_size = (54.0 - (-54.0)) / 0.75 = 108 / 0.75 = 144 grids
input_y_size = (54.0 - (-54.0)) / 0.75 = 108 / 0.75 = 144 grids

# Output scope
output_x_size = (50 - (-50)) / 0.5 = 100 / 0.5 = 200 grids
output_y_size = (50 - (-50)) / 0.5 = 100 / 0.5 = 200 grids
```

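The same scope arithmetic in runnable form, as a sanity check (`scope_to_grids` is a made-up helper name, not the project's API):

```python
def scope_to_grids(scope):
    """[[lo, hi, step], ...] → grid count per axis."""
    return [round((hi - lo) / step) for lo, hi, step in scope]

print(scope_to_grids([[-54.0, 54.0, 0.75], [-54.0, 54.0, 0.75]]))  # input scope → [144, 144]
print(scope_to_grids([[-50, 50, 0.5], [-50, 50, 0.5]]))            # output scope → [200, 200]
```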
---

### 5.1 Size Flow: Original BEVSegmentationHead

```python
Input (from decoder):
    Shape: (B, 512, 360, 360)
    Size: 1 × 512 × 360 × 360

↓ Step 1: BEVGridTransform

# Grid transform in detail:
# 1. Resample from 360×360 to the 144×144 input_scope
# 2. Generate grid coordinates
#    range: [-54, 54] → 144 grid points at 0.75 m steps
# 3. grid_sample-interpolate to the 200×200 output_scope
#    range: [-50, 50] → 200 grid points at 0.5 m steps

BEV Grid Transform:
    Input:  (B, 512, 360, 360)
    Output: (B, 512, 200, 200)

Notes:
    - The decoder outputs 360×360, but segmentation only uses the central region
    - 360×360 is cropped/interpolated down to 200×200
    - The spatial range shrinks from ±54 m to ±50 m

↓ Step 2: Classifier layer 1

Conv2d(512, 512, 3, padding=1) + BN + ReLU
    Input:  (B, 512, 200, 200)
    Output: (B, 512, 200, 200)

↓ Step 3: Classifier layer 2

Conv2d(512, 512, 3, padding=1) + BN + ReLU
    Input:  (B, 512, 200, 200)
    Output: (B, 512, 200, 200)

↓ Step 4: Final classifier

Conv2d(512, 6, 1)  # 6 classes
    Input:  (B, 512, 200, 200)
    Output: (B, 6, 200, 200)  # logits

↓ Step 5 (inference only): sigmoid activation

torch.sigmoid(logits)
    Output: (B, 6, 200, 200)  # probabilities in [0, 1]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Final output:
    Shape:      (B, 6, 200, 200)
    Batch:      B = 1 (per GPU)
    Classes:    6 (one channel per class)
    Spatial:    200 × 200 grid cells
    Range:      ±50 m × ±50 m (actual coverage)
    Resolution: 0.5 m per grid cell
```

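Because each channel carries an independent per-class probability (sigmoid, not softmax), binary masks come from per-channel thresholding, and classes may overlap. A minimal NumPy sketch, assuming the usual 0.5 threshold (the array here is a random stand-in for the head's output):

```python
import numpy as np

# Stand-in for the head's sigmoid output: (B, classes, H, W) in [0, 1]
probs = np.random.default_rng(0).random((1, 6, 200, 200))

masks = probs > 0.5                      # one boolean mask per class
coverage = masks.mean(axis=(0, 2, 3))    # fraction of BEV cells per class

print(masks.shape, masks.dtype)
```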
---

### 5.2 Size Flow: EnhancedBEVSegmentationHead

```python
Input (from decoder):
    Shape: (B, 512, 360, 360)

↓ Step 1: BEV Grid Transform

BEVGridTransform:
    Input:  (B, 512, 360, 360)
    Output: (B, 512, 200, 200)

↓ Step 2: ASPP multi-scale feature extraction

ASPP (5 branches):
    Branch 1 (1×1):     (B, 512, 200, 200) → (B, 256, 200, 200)
    Branch 2 (3×3@d6):  (B, 512, 200, 200) → (B, 256, 200, 200)
    Branch 3 (3×3@d12): (B, 512, 200, 200) → (B, 256, 200, 200)
    Branch 4 (3×3@d18): (B, 512, 200, 200) → (B, 256, 200, 200)
    Branch 5 (global):  (B, 512, 200, 200) → (B, 256, 200, 200)
    ↓
Concat: (B, 1280, 200, 200)  # 256 × 5
    ↓
Projection conv 1×1: (B, 1280, 200, 200) → (B, 256, 200, 200)

↓ Step 3: Channel attention

ChannelAttention:
    Input:  (B, 256, 200, 200)
    Output: (B, 256, 200, 200)  # channel-wise reweighting

↓ Step 4: Spatial attention

SpatialAttention:
    Input:  (B, 256, 200, 200)
    Output: (B, 256, 200, 200)  # spatial reweighting

↓ Step 5: Auxiliary classifier (deep supervision)

Conv2d(256, 6, 1)  [training only]
    Input:  (B, 256, 200, 200)
    Output: (B, 6, 200, 200)  # auxiliary supervision

↓ Step 6: Deep decoder

Layer 1: Conv(256, 256, 3) + BN + ReLU + Dropout
    Input:  (B, 256, 200, 200)
    Output: (B, 256, 200, 200)

Layer 2: Conv(256, 128, 3) + BN + ReLU + Dropout
    Input:  (B, 256, 200, 200)
    Output: (B, 128, 200, 200)

Layer 3: Conv(128, 128, 3) + BN + ReLU + Dropout
    Input:  (B, 128, 200, 200)
    Output: (B, 128, 200, 200)

↓ Step 7: Per-class classifiers (6 independent heads)

For each class (×6):
    Conv(128, 64, 3) + BN + ReLU
        Input:  (B, 128, 200, 200)
        Output: (B, 64, 200, 200)
    Conv(64, 1, 1)
        Input:  (B, 64, 200, 200)
        Output: (B, 1, 200, 200)

Concat of the 6 outputs:
    Output: (B, 6, 200, 200)  # logits

↓ Step 8 (inference only): sigmoid

torch.sigmoid(logits)
    Output: (B, 6, 200, 200)  # probabilities in [0, 1]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Final output:
    Shape:  (B, 6, 200, 200)
    Values: 1 × 6 × 200 × 200 = 240,000
    Memory: 240,000 × 4 bytes (float32) = 960 KB
```

---

## 📊 Key Size Summary

### Input sizes

| Stage | Name | Shape | Spatial size | Channels |
|------|------|------|---------|--------|
| **Decoder output** | decoder neck output | (B, 512, 360, 360) | **360×360** | **512** |
| **After grid transform** | BEV grid transform | (B, 512, 200, 200) | **200×200** | 512 |

### Intermediate sizes (enhanced head)

| Stage | Name | Shape | Channels |
|------|------|------|--------|
| ASPP output | multi-scale features | (B, 256, 200, 200) | 256 |
| After attention | refined features | (B, 256, 200, 200) | 256 |
| Decoder layer 1 | deep decoding | (B, 256, 200, 200) | 256 |
| Decoder layer 2 | deep decoding | (B, 128, 200, 200) | 128 |
| Decoder layer 3 | deep decoding | (B, 128, 200, 200) | 128 |

### Output sizes

| Stage | Name | Shape | Notes |
|------|------|------|------|
| **Classifier output** | logits | (B, 6, 200, 200) | 6 classes, unnormalized |
| **Final output** | probabilities | (B, 6, 200, 200) | after sigmoid, in [0, 1] |

---

## 🗺️ Spatial Ranges in Detail

### BEV grid configuration

```yaml
grid_transform:
  input_scope: [[-54.0, 54.0, 0.75], [-54.0, 54.0, 0.75]]
  output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]
```

**Detailed computation**:

#### Input Scope (decoder output space)
```python
Range:      [-54 m, 54 m] × [-54 m, 54 m]
Resolution: 0.75 m per grid cell

Grid counts:
    X: (54 - (-54)) / 0.75 = 108 / 0.75 = 144 grids
    Y: (54 - (-54)) / 0.75 = 108 / 0.75 = 144 grids

Actual decoder output:    360×360
Grid transform resamples: 360×360 → 144×144 (downsampling)
```

**Apparent mismatch**: the decoder actually outputs 360×360, while input_scope describes a 144×144 grid.

**Explanation**: BEVGridTransform resolves the mismatch by interpolating with grid_sample:
```python
# grid_sample handles the resampling implicitly:
# the 200×200 output coordinates are mapped into the 360×360 input
F.grid_sample(
    x,      # (B, 512, 360, 360)
    grid,   # 200×200 sampling coordinates, normalized to the 360×360 input
    mode='bilinear',
)
# Output: (B, 512, 200, 200)
```

#### Output Scope (segmentation output space)
```python
Range:      [-50 m, 50 m] × [-50 m, 50 m]
Resolution: 0.5 m per grid cell

Grid counts:
    X: (50 - (-50)) / 0.5 = 100 / 0.5 = 200 grids
    Y: (50 - (-50)) / 0.5 = 100 / 0.5 = 200 grids

Final output: 200×200 BEV grid
```

---

## 📐 Size-Flow Visualization

```
Full pipeline size flow:

Raw images: (1, 6, 3, 256, 704)
    ↓
SwinT → FPN
    ↓ (1, 6, 256, 32, 88)

DepthLSS → BEV pooling
    ↓ Camera BEV: (1, 80, 360, 360)

LiDAR sparse encoder
    ↓ LiDAR BEV: (1, 256, 360, 360)

ConvFuser (fusion)
    ↓ Fused BEV: (1, 256, 360, 360)

SECOND backbone
    ↓ (1, 128, 360, 360) + (1, 256, 180, 180)

SECONDFPN neck
    ↓ (1, 512, 360, 360) ← segmentation head input

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Segmentation head:

BEVGridTransform
    Input:  (1, 512, 360, 360)
    Output: (1, 512, 200, 200) ← spatial downsampling

ASPP (enhanced head)
    Output: (1, 256, 200, 200) ← channel reduction

Dual attention
    Output: (1, 256, 200, 200)

Deep decoder
    Output: (1, 128, 200, 200) ← further channel reduction

Per-class classifiers
    Output: (1, 6, 200, 200) ← final segmentation masks

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

---

## 🎯 Core Size Summary

### Input
```
Segmentation head input:
    Shape:      (B, 512, 360, 360)
    Batch:      B = 1 (per GPU)
    Channels:   512 (from the SECONDFPN concat)
    Spatial:    360 × 360 grids
    Resolution: 0.3 m per grid cell
    Range:      ±54 m × ±54 m
    Memory:     1 × 512 × 360 × 360 × 4 bytes ≈ 265 MB
```

### Output
```
Segmentation head output:
    Shape:      (B, 6, 200, 200)
    Batch:      B = 1
    Classes:    6 (one channel per class)
    Spatial:    200 × 200 grids
    Resolution: 0.5 m per grid cell
    Range:      ±50 m × ±50 m (100 m × 100 m = 10,000 m²)
    Memory:     1 × 6 × 200 × 200 × 4 bytes = 960 KB
```

---

## 🔍 Per-Class Output Breakdown

```python
# Final output
output: (B, 6, 200, 200)

# Split by class
output[0, 0, :, :] → (200, 200) drivable_area
output[0, 1, :, :] → (200, 200) ped_crossing
output[0, 2, :, :] → (200, 200) walkway
output[0, 3, :, :] → (200, 200) stop_line
output[0, 4, :, :] → (200, 200) carpark_area
output[0, 5, :, :] → (200, 200) divider

# Per class
each pixel value: 0.0 – 1.0 (probability)
    > 0.5 → the pixel belongs to this class
    ≤ 0.5 → the pixel does not

# Spatial correspondence
grid[0, 0]     → world position (-50 m, -50 m)
grid[100, 100] → world position (0 m, 0 m) — ego center
grid[199, 199] → world position (+50 m, +50 m)
```

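The grid-to-world mapping above can be made explicit. A hedged sketch using cell centers (the exact convention, cell corner vs. cell center, depends on the implementation; `grid_to_world` is illustrative, not repo code):

```python
def grid_to_world(ix, iy, lower=-50.0, step=0.5):
    """World (x, y) of the CENTER of BEV cell (ix, iy)."""
    return (lower + (ix + 0.5) * step, lower + (iy + 0.5) * step)

print(grid_to_world(0, 0))      # corner cell, near (-50 m, -50 m)
print(grid_to_world(100, 100))  # near the ego center
print(grid_to_world(199, 199))  # opposite corner, near (+50 m, +50 m)
```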
---

## 📊 Output Sizes Across Configurations

### Official segmentation-only config
```yaml
grid_transform:
  input_scope: [[-51.2, 51.2, 0.8], [-51.2, 51.2, 0.8]]
  output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]

Derivation:
  Input:  102.4 m / 0.8 m = 128 × 128
  Output: 100 m / 0.5 m   = 200 × 200

Output: (B, 6, 200, 200)
```

### Our dual-task config
```yaml
grid_transform:
  input_scope: [[-54.0, 54.0, 0.75], [-54.0, 54.0, 0.75]]
  output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]

Derivation:
  Input:  108 m / 0.75 m = 144 × 144
  Output: 100 m / 0.5 m  = 200 × 200

Output: (B, 6, 200, 200)
```

**Conclusion**: the final output size is identical in both cases: **(B, 6, 200, 200)**.

---

## 💾 Memory Footprint

### Segmentation head (single sample, float32)

```python
# Forward-pass intermediate tensors

BEV grid transform output:
    (1, 512, 200, 200) = 81.92 MB

ASPP intermediates:
    5-branch concat:  (1, 1280, 200, 200) = 204.8 MB
    after projection: (1, 256, 200, 200)  = 40.96 MB

Attention modules:
    channel attention: (1, 256, 200, 200) = 40.96 MB
    spatial attention: (1, 256, 200, 200) = 40.96 MB

Decoder intermediates:
    layer 1: (1, 256, 200, 200) = 40.96 MB
    layer 2: (1, 128, 200, 200) = 20.48 MB
    layer 3: (1, 128, 200, 200) = 20.48 MB

Per-class classifiers:
    per class: (1, 64, 200, 200) = 10.24 MB × 6 = 61.44 MB

Final output:
    (1, 6, 200, 200) = 0.96 MB

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Peak activation (at the ASPP concat): ~205 MB (single sample, float32)
Total GPU memory (with gradients):    ~19 GB (8 GPUs, full model)
```

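The per-tensor figures above are plain element-count-times-dtype arithmetic; a small helper (illustrative, not project code) reproduces them for float32:

```python
def tensor_mb(shape, bytes_per_elem=4):
    """Size of a dense tensor in MB (10^6 bytes); float32 by default."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_elem / 1e6

print(tensor_mb((1, 1280, 200, 200)))  # ASPP concat
print(tensor_mb((1, 6, 200, 200)))     # final output
```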
---

## 🎯 Sizes in Practice

### Physical coverage

```
Output: (B, 6, 200, 200)

Physical space:
    Range:      ±50 m × ±50 m
    Area:       100 m × 100 m = 10,000 m²
    Resolution: 0.5 m per grid cell

Grid size:
    per cell:    0.5 m × 0.5 m = 0.25 m²
    total cells: 200 × 200 = 40,000
    coverage:    40,000 × 0.25 = 10,000 m² ✅
```

### Spatial precision

```
Each cell: 0.5 m × 0.5 m

For typical targets:
    ├─ vehicle (4 m × 2 m):       ≈ 8×4 = 32 cells ✅ precise enough
    ├─ walkway (1.5 m wide):      ≈ 3 cells        ✅ recognizable
    ├─ lane divider (0.15 m wide): ≈ 0.3 cells     ⚠️ sub-pixel, hard
    └─ stop line (0.3 m wide):    ≈ 0.6 cells      ⚠️ hard to localize precisely

Note: this is one reason stop_line and divider score lowest.
```

---

## 🔬 Resolution Trade-Off Analysis

### Raising the output resolution

**Option 1: 250×250**
```yaml
output_scope: [[-50, 50, 0.4], [-50, 50, 0.4]]

Derivation:
  100 m / 0.4 m = 250 × 250 grids

Impact:
  ✅ better lane-divider recognition (0.15 m → ~0.4 cells)
  ⚠️ +56% compute (200² → 250²)
  ⚠️ ~3 GB more GPU memory
```

**Option 2: 400×400**
```yaml
output_scope: [[-50, 50, 0.25], [-50, 50, 0.25]]

Derivation:
  100 m / 0.25 m = 400 × 400 grids

Impact:
  ✅ markedly better lane-divider recognition (0.15 m → 0.6 cells)
  ❌ 4× compute
  ❌ GPU memory blow-up (+8 GB)
  ❌ not recommended
```

**Recommendation**: keep 200×200 and improve accuracy through a stronger head architecture instead.

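The compute-scaling claims follow from the quadratic growth in cell count; a quick check (hypothetical helper, not repo code):

```python
def cells(extent_m=100.0, res_m=0.5):
    """Total BEV cells for a square extent at a given resolution."""
    g = round(extent_m / res_m)
    return g * g

base = cells(res_m=0.5)              # 200×200 → 40,000 cells
print(cells(res_m=0.4) / base - 1)   # 0.5625 → +56% compute
print(cells(res_m=0.25) / base)      # 4.0 → 4× compute
```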
---

## 📈 The Resolution Trade-Off

### Why 200×200?

**Advantages**:
- ✅ moderate compute (200² = 40K cells)
- ✅ reasonable GPU memory footprint
- ✅ precise enough for large targets (drivable area, carparks)
- ✅ consistent with the official benchmark

**Disadvantages**:
- ⚠️ limited precision for thin linear targets (0.15 m lane dividers)
- ⚠️ sub-pixel features are hard to capture

**Mitigations**:
- do not raise the resolution (too costly)
- use a stronger head architecture (ASPP, attention)
- use Dice loss to optimize small targets
- use class weights to emphasize hard classes

---

## 🎯 Comparison with the Detection Output

### Detection head output

```python
Detection head output:
    boxes_3d:  (N_objects, 9)  # one 3D box per detected object
        each box: [x, y, z, l, w, h, yaw, vx, vy]
    scores_3d: (N_objects,)    # confidences
    labels_3d: (N_objects,)    # class labels

Properties:
    - sparse output (only detected objects)
    - variable count (N_objects typically 10–50)
    - 9 values per object
```

### Segmentation head output

```python
Segmentation head output:
    masks_bev: (B, 6, 200, 200)  # dense output

Properties:
    - dense output (a prediction for every grid cell)
    - fixed size (200×200)
    - 6 class probabilities per location
    - total: 240,000 predicted values
```

**Contrast**:
- detection: sparse, variable count, high-dimensional per-object representation
- segmentation: dense, fixed size, 2D plane

---

## 🔢 Detailed Size Table

### Full pipeline

| Step | Module | Input shape | Output shape | Spatial size |
|------|------|---------|---------|----------|
| 1 | raw images | - | (1, 6, 3, 256, 704) | 256×704 |
| 2 | SwinT | (1, 6, 3, 256, 704) | (1, 6, 768, 16, 44) | 16×44 |
| 3 | FPN | (1, 6, 768, 16, 44) | (1, 6, 256, 32, 88) | 32×88 |
| 4 | DepthLSS | (1, 6, 256, 32, 88) | (1, 80, 360, 360) | **360×360** |
| 5 | LiDAR encoder | points | (1, 256, 360, 360) | **360×360** |
| 6 | ConvFuser | 2 BEV maps | (1, 256, 360, 360) | **360×360** |
| 7 | SECOND | (1, 256, 360, 360) | (1, 128, 360, 360) | **360×360** |
| 8 | SECONDFPN | 2 scales | (1, 512, 360, 360) | **360×360** |
| 9 | **Grid transform** | (1, 512, 360, 360) | (1, 512, **200, 200**) | **200×200** ← downsampling |
| 10 | ASPP | (1, 512, 200, 200) | (1, 256, 200, 200) | 200×200 |
| 11 | attention | (1, 256, 200, 200) | (1, 256, 200, 200) | 200×200 |
| 12 | decoder | (1, 256, 200, 200) | (1, 128, 200, 200) | 200×200 |
| 13 | **classifiers** | (1, 128, 200, 200) | **(1, 6, 200, 200)** | **200×200** |

---

## 📊 Quick Reference

### Key sizes at a glance

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Segmentation head input:
    ✓ shape:      (1, 512, 360, 360)
    ✓ channels:   512
    ✓ spatial:    360×360 grids
    ✓ resolution: 0.3 m/cell
    ✓ range:      ±54 m

Segmentation head output:
    ✓ shape:      (1, 6, 200, 200)
    ✓ classes:    6
    ✓ spatial:    200×200 grids
    ✓ resolution: 0.5 m/cell
    ✓ range:      ±50 m
    ✓ total area: 10,000 m²

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```

---

## 💡 Design Rationale

### Why 360×360 → 200×200?

1. **Compute efficiency**:
   - running the segmentation head at 360×360 would be far too expensive
   - 200×200 balances accuracy and cost

2. **Region of interest**:
   - beyond ±50 m, segmentation accuracy drops sharply
   - ±50 m covers the region that matters most for autonomous driving

3. **Annotation quality**:
   - nuScenes map annotations are concentrated within ±50 m
   - far-range annotations may be unreliable

4. **Consistency with the official setup**:
   - the official benchmarks all use a 200×200 output
   - this keeps results directly comparable

---

## 🎓 Summary

### Core sizes
```
Input: (1, 512, 360, 360) — 512 channels, 360×360 spatial
    ↓
Grid transform (360×360 → 200×200)
    ↓
Output: (1, 6, 200, 200) — 6 classes, 200×200 spatial

Spatial range:      ±50 m × ±50 m = 10,000 m²
Spatial resolution: 0.5 m per grid cell (50 cm)
```

---

**Generated**: 2025-10-19
**Document version**: 1.0