# BEVFusion Multi-Task Multi-Head Support Guide
## ✅ Answer: Fully Supported!
BEVFusion **fully supports running 3D object detection and BEV map segmentation simultaneously**; this is one of the framework's core design features.
## Architecture Design
### Multi-Head Structure
```
BEVFusion
├── Encoders (multi-modal encoders)
│   ├── Camera Encoder
│   └── LiDAR Encoder
├── Fuser (feature fusion)
├── Decoder (BEV decoder)
└── Heads (multi-task heads)
    ├── object: 3D object detection head (TransFusion/CenterPoint)
    └── map: BEV map segmentation head (BEVSegmentationHead)
```
### Code Implementation (bevfusion.py)
```python
class BEVFusion(Base3DFusionModel):
    def __init__(self, encoders, fuser, decoder, heads, **kwargs):
        # Build one head per task
        self.heads = nn.ModuleDict()
        for name in heads:
            if heads[name] is not None:
                self.heads[name] = build_head(heads[name])

        # Per-task loss weights (default 1.0)
        self.loss_scale = dict()
        for name in heads:
            if heads[name] is not None:
                self.loss_scale[name] = 1.0

    def forward_single(self, ...):
        # 1. Multi-modal feature extraction
        features = []
        for sensor in self.encoders:
            feature = self.extract_features(...)
            features.append(feature)

        # 2. Feature fusion
        x = self.fuser(features)

        # 3. BEV decoding
        x = self.decoder["backbone"](x)
        x = self.decoder["neck"](x)

        # 4. Multi-task heads
        if self.training:
            outputs = {}
            for type, head in self.heads.items():
                if type == "object":
                    # 3D object detection
                    pred_dict = head(x, metas)
                    losses = head.loss(gt_bboxes_3d, gt_labels_3d, pred_dict)
                elif type == "map":
                    # BEV map segmentation
                    losses = head(x, gt_masks_bev)
                # Collect the weighted losses
                for name, val in losses.items():
                    outputs[f"loss/{type}/{name}"] = val * self.loss_scale[type]
            return outputs
        else:
            # Inference: return detection and segmentation results together
            outputs = [{} for _ in range(batch_size)]
            for type, head in self.heads.items():
                if type == "object":
                    pred_dict = head(x, metas)
                    bboxes = head.get_bboxes(pred_dict, metas)
                    for k, (boxes, scores, labels) in enumerate(bboxes):
                        outputs[k].update({
                            "boxes_3d": boxes.to("cpu"),
                            "scores_3d": scores.cpu(),
                            "labels_3d": labels.cpu(),
                        })
                elif type == "map":
                    logits = head(x)
                    for k in range(batch_size):
                        outputs[k].update({
                            "masks_bev": logits[k].cpu(),
                            "gt_masks_bev": gt_masks_bev[k].cpu(),
                        })
            return outputs
```
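How the per-task losses become a single training objective is handled by the surrounding training framework. Below is a minimal sketch, assuming an mmdet/mmdet3d-style reduction in which every entry whose key starts with `loss` is summed (after the `loss_scale` weighting already applied in `forward_single()`); the function name and the example values are illustrative only.
```python
import torch

def parse_losses(outputs: dict) -> torch.Tensor:
    """Sum all weighted loss entries; 'stats/...' entries are logged but not optimized."""
    return sum(v for k, v in outputs.items() if k.startswith("loss"))

# Hypothetical loss dict shaped like the one returned by forward_single() above
outputs = {
    "loss/object/bbox": torch.tensor(0.456),
    "loss/map/seg": torch.tensor(0.345),
    "stats/object/matched": torch.tensor(120.0),  # excluded from the total
}
total_loss = parse_losses(outputs)  # tensor(0.8010)
```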
## Configuration File Examples
### Option 1: Detection Only (configs/nuscenes/det/default.yaml)
```yaml
model:
  type: BEVFusion
  heads:
    object:        # enable the detection head
      type: TransFusionHead
      # ... detection head config
    map: null      # disable the segmentation head
```
### Option 2: Segmentation Only (configs/nuscenes/seg/default.yaml)
```yaml
model:
  type: BEVFusion
  heads:
    object: null   # disable the detection head
    map:           # enable the segmentation head
      type: BEVSegmentationHead
      # ... segmentation head config
```
### Option 3: Multi-Task (Detection + Segmentation) ✨
```yaml
model:
  type: BEVFusion
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        # ... camera config
      neck:
        type: GeneralizedLSSFPN
        # ... neck config
      vtransform:
        type: LSSTransform
        # ... vtransform config
    lidar:
      voxelize:
        # ... voxelization config
      backbone:
        type: SparseEncoder
        # ... lidar backbone config
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      # ... decoder config
    neck:
      type: SECONDFPN
      in_channels: [128, 256]
      out_channels: [256, 256]
      # ... neck config
  heads:
    # Task 1: 3D object detection
    object:
      type: TransFusionHead
      num_proposals: 200
      auxiliary: true
      in_channels: 512
      num_classes: 10
      num_heads: 8
      nms_kernel_size: 3
      ffn_channel: 256
      dropout: 0.1
      common_heads:
        center: [2, 2]
        height: [1, 2]
        dim: [3, 2]
        rot: [2, 2]
        vel: [2, 2]
      bbox_coder:
        type: TransFusionBBoxCoder
        pc_range: [-54.0, -54.0]
        post_center_range: [-61.2, -61.2, -10.0, 61.2, 61.2, 10.0]
        voxel_size: [0.075, 0.075]
      loss_cls:
        type: FocalLoss
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        reduction: mean
      loss_bbox:
        type: L1Loss
        reduction: mean
        loss_weight: 0.25
      loss_iou:
        type: GIoULoss
        reduction: mean
        loss_weight: 0.0
    # Task 2: BEV map segmentation
    map:
      type: BEVSegmentationHead
      in_channels: 512
      grid_transform:
        input_scope: [[-54.0, 54.0, 0.8], [-54.0, 54.0, 0.8]]
        output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]
      classes: ['drivable_area', 'ped_crossing', 'walkway', 'stop_line',
                'carpark_area', 'divider']
      loss:
        type: FocalLoss   # or CrossEntropyLoss
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
  # Optional: set different loss weights for different tasks
  loss_scale:
    object: 1.0   # detection loss weight
    map: 1.0      # segmentation loss weight
```
## Creating a Multi-Task Configuration
### Step 1: Create the Configuration File
Create `configs/nuscenes/multitask/fusion-det-seg.yaml`:
```yaml
# Inherit the base configuration
_base_:
  - ../default.yaml

# Model configuration
model:
  type: BEVFusion

  # Encoders (reuse the detection configuration)
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        embed_dims: 96
        depths: [2, 2, 6, 2]
        num_heads: [3, 6, 12, 24]
        window_size: 7
        mlp_ratio: 4
        qkv_bias: true
        qk_scale: null
        drop_rate: 0.
        attn_drop_rate: 0.
        drop_path_rate: 0.2
        patch_norm: true
        out_indices: [1, 2, 3]
        with_cp: false
        convert_weights: true
        init_cfg:
          type: Pretrained
          checkpoint: pretrained/swint-nuimages-pretrained.pth
      neck:
        type: GeneralizedLSSFPN
        in_channels: [192, 384, 768]
        out_channels: 256
        start_level: 0
        num_outs: 3
      vtransform:
        type: LSSTransform
        in_channels: 256
        out_channels: 80
        image_size: [256, 704]
        feature_size: [32, 88]
        xbound: [-54.0, 54.0, 0.3]
        ybound: [-54.0, 54.0, 0.3]
        zbound: [-10.0, 10.0, 20.0]
        dbound: [1.0, 60.0, 0.5]
        downsample: 2
    lidar:
      voxelize:
        max_num_points: 10
        point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
        voxel_size: [0.075, 0.075, 0.2]
        max_voxels: [120000, 160000]
      backbone:
        type: SparseEncoder
        in_channels: 5
        sparse_shape: [1440, 1440, 41]
        output_channels: 128
        order: [conv, norm, act]
        encoder_channels:
          - [16, 16, 32]
          - [32, 32, 64]
          - [64, 64, 128]
          - [128, 128]
        encoder_paddings:
          - [0, 0, 1]
          - [0, 0, 1]
          - [0, 0, [1, 1, 0]]
          - [0, 0]
        block_type: basicblock

  # Fuser
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256

  # Decoder
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      layer_nums: [5, 5]
      layer_strides: [1, 2]
    neck:
      type: SECONDFPN
      in_channels: [128, 256]
      out_channels: [256, 256]
      upsample_strides: [1, 2]

  # Multi-task heads
  heads:
    # 3D object detection
    object:
      type: TransFusionHead
      in_channels: 512
      num_proposals: 200
      auxiliary: true
      num_classes: 10
      num_heads: 8
      nms_kernel_size: 3
      ffn_channel: 256
      dropout: 0.1
      common_heads:
        center: [2, 2]
        height: [1, 2]
        dim: [3, 2]
        rot: [2, 2]
        vel: [2, 2]
      loss_cls:
        type: FocalLoss
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
      loss_bbox:
        type: L1Loss
        loss_weight: 0.25
    # BEV map segmentation
    map:
      type: BEVSegmentationHead
      in_channels: 512
      classes: ['drivable_area', 'ped_crossing', 'walkway',
                'stop_line', 'carpark_area', 'divider']
      loss: focal

  # Loss weights (optional)
  loss_scale:
    object: 1.0
    map: 1.0

# Training configuration
optimizer:
  type: AdamW
  lr: 2.0e-4           # multi-task training may require a different learning rate
  weight_decay: 0.01

lr_config:
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
  min_lr_ratio: 1.0e-3

runner:
  type: EpochBasedRunner
  max_epochs: 20

# Evaluation configuration
evaluation:
  interval: 1
  pipeline:
    # Evaluate detection and segmentation together
    - type: DetEval
      metric: bbox
    - type: SegEval
      metric: map
```
### Step 2: Training Command
```bash
# Multi-task training
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/multitask/fusion-det-seg.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
    --load_from pretrained/lidar-only-det.pth
```
### Step 3: Testing / Inference
```bash
# Multi-task testing (evaluates detection and segmentation together)
torchpack dist-run -np 8 python tools/test.py \
    configs/nuscenes/multitask/fusion-det-seg.yaml \
    runs/multitask/latest.pth \
    --eval bbox map
```
## Output Format
### Training Output (Losses)
```python
{
    'loss/object/heatmap': 0.234,
    'loss/object/bbox': 0.456,
    'loss/object/iou': 0.123,
    'loss/map/seg': 0.345,
    'loss/depth': 0.089,        # if BEVDepth is used
    'stats/object/...': ...,
    'stats/map/...': ...,
}
```
### Inference Output (Predictions)
```python
# Output for each sample
[
    {
        # 3D detection results
        'boxes_3d': LiDARInstance3DBoxes(...),  # shape: (N, 9)
        'scores_3d': tensor([...]),             # shape: (N,)
        'labels_3d': tensor([...]),             # shape: (N,)
        # BEV segmentation results
        'masks_bev': tensor([[...]]),           # shape: (C, H, W)
        'gt_masks_bev': tensor([[...]]),        # shape: (C, H, W), if GT is available
    },
    # ... more samples
]
```
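Downstream code usually thresholds these raw outputs before using them. The sketch below is illustrative (the score and mask thresholds are assumptions, not BEVFusion defaults): detections are filtered by confidence, and because the map head above is trained with a sigmoid-based focal loss, the BEV logits are binarized with a sigmoid.
```python
import torch

def postprocess(sample, score_thr=0.3, mask_thr=0.5):
    """Illustrative post-processing of one inference output dict."""
    # Keep only sufficiently confident detections
    keep = sample["scores_3d"] > score_thr
    boxes = sample["boxes_3d"][keep]      # LiDARInstance3DBoxes supports boolean indexing
    labels = sample["labels_3d"][keep]

    # Per-class BEV masks: logits -> probabilities -> binary masks
    masks = torch.sigmoid(sample["masks_bev"]) > mask_thr   # (C, H, W), bool

    return boxes, labels, masks
```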
## Visualizing Multi-Task Results
```python
import torch
import matplotlib.pyplot as plt
from mmdet3d.core.bbox import LiDARInstance3DBoxes

def visualize_multitask_results(data, prediction):
    """Visualize the multi-task outputs."""
    # 1. 3D detection results (BEV view)
    boxes_3d = prediction['boxes_3d']
    scores_3d = prediction['scores_3d']
    labels_3d = prediction['labels_3d']
    # 2. BEV segmentation results
    masks_bev = prediction['masks_bev']  # (C, H, W)

    fig, axes = plt.subplots(1, 2, figsize=(15, 7))

    # Left: 3D detection
    ax = axes[0]
    # Draw the BEV plane and the detection boxes
    for i, (box, score, label) in enumerate(zip(boxes_3d.tensor, scores_3d, labels_3d)):
        # Draw the box (simplified example)
        corners = boxes_3d.corners[i]
        # ... drawing logic
    ax.set_title('3D Object Detection')

    # Right: BEV segmentation
    ax = axes[1]
    seg_map = torch.argmax(masks_bev, dim=0)  # (H, W)
    im = ax.imshow(seg_map.cpu().numpy())
    ax.set_title('BEV Map Segmentation')
    plt.colorbar(im, ax=ax)

    plt.tight_layout()
    plt.savefig('multitask_result.png')
```
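A hypothetical call, assuming `results` is the list of per-sample output dicts described in the previous section (the `data` argument is unused by the sketch above and can be left as `None`):
```python
# Visualize the predictions for the first sample (variable names are illustrative)
visualize_multitask_results(data=None, prediction=results[0])
```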
## Performance and Resource Consumption
### Single-Task vs Multi-Task Comparison
| Configuration | GPU memory per GPU | Training time | Performance |
|------|----------|----------|------|
| Detection only | ~18 GB | 20-24 h | mAP: 68-70% |
| Segmentation only | ~14 GB | 12-15 h | mIoU: 62-63% |
| **Multi-task** | **~22 GB** | **28-32 h** | **mAP: 67-69%<br>mIoU: 61-62%** |
Notes:
- Multi-task training uses slightly more GPU memory (roughly 4 GB extra)
- Training time is roughly the sum of the two single-task runs
- Per-task performance may be slightly lower than training each task separately, but shared feature extraction improves overall efficiency
- At inference time both results are produced in a single forward pass; the model does not need to be run twice
### Optimization Suggestions
1. **Adjust the loss weights**
```yaml
loss_scale:
  object: 1.0   # can be tuned, e.g. within 0.5-2.0
  map: 1.0      # can be tuned, e.g. within 0.5-2.0
```
2. **Staged training strategy** (a sketch for freezing one head follows this list)
```bash
# Stage 1: train detection first, freeze the segmentation head
# Stage 2: then train segmentation, freeze the detection head
# Stage 3: joint fine-tuning
```
3. **Use a larger batch size**
```yaml
data:
  samples_per_gpu: 2   # if GPU memory allows
```
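A minimal sketch of the freezing step used in stages 1 and 2 of the staged strategy above, assuming the `heads` `nn.ModuleDict` from `bevfusion.py` shown earlier (the helper name is illustrative):
```python
import torch.nn as nn

def freeze_head(model: nn.Module, head_name: str) -> None:
    """Freeze one task head so gradients only flow through the other task and the shared trunk."""
    head = model.heads[head_name]   # e.g. "object" or "map"
    head.eval()                     # keep BatchNorm statistics / Dropout fixed
    for param in head.parameters():
        param.requires_grad = False

# Stage 1 (hypothetical usage): train detection while the segmentation head is frozen
# freeze_head(model, "map")
```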
## Practical Application Scenarios
### 1. Full Perception for Autonomous Driving
```
Multi-task outputs:
├── 3D object detection → vehicles, pedestrians, obstacles
└── BEV segmentation    → drivable area, pedestrian crossings, parking areas

Advantages:
- Unified BEV representation
- Shared feature extraction
- One inference pass yields a complete understanding of the scene
```
### 2. Real-Time System Deployment
```
Detection + segmentation (multi-task) vs. two separate models
├── Inference time:  1x vs 1.8x
├── GPU memory:      1x vs 1.6x
└── Parameter count: 1x vs 1.7x
```
### 3. End-to-End Training
```
Advantages:
- The two tasks reinforce each other
- Segmentation helps detection understand the scene structure
- Detection helps segmentation focus on the important regions
```
## FAQ
### Q1: Does multi-task training hurt the performance of each individual task?
**A**: There may be a small drop (1-2%), but:
- shared feature extraction improves efficiency,
- the two tasks can reinforce each other,
- in practice, both kinds of output are usually needed at the same time anyway.
### Q2: Can I run inference for only one of the tasks?
**A**: Yes! Set the following in the configuration file:
```yaml
heads:
  object: {...}   # keep
  map: null       # disable
```
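Alternatively, a head can also be dropped from an already-built model at runtime, since `heads` is an `nn.ModuleDict`; this is a minimal sketch of that assumption rather than an official BEVFusion switch:
```python
# Remove the segmentation head from a loaded model so only detection runs (illustrative)
if "map" in model.heads:
    del model.heads["map"]
```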
### Q3: How do I balance the losses of the two tasks?
**A**: Adjust `loss_scale`:
```yaml
loss_scale:
  object: 2.0   # put more weight on detection
  map: 1.0
```
### Q4: What data does multi-task training require?
**A**: The dataset must provide both:
- 3D detection annotations (gt_bboxes_3d, gt_labels_3d)
- BEV segmentation annotations (gt_masks_bev)

The nuScenes dataset provides both kinds of annotation.
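A minimal sanity-check sketch, assuming the data pipeline yields per-sample dicts containing the annotation keys consumed by `forward_single()` above (the helper name is illustrative; the exact batch layout depends on the dataset pipeline):
```python
# Annotation keys required for joint detection + segmentation training
REQUIRED_KEYS = ["gt_bboxes_3d", "gt_labels_3d", "gt_masks_bev"]

def check_sample(sample: dict) -> None:
    missing = [k for k in REQUIRED_KEYS if k not in sample]
    if missing:
        raise KeyError(f"Sample is missing multi-task annotations: {missing}")
```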
### Q5: Can more task heads be added?
**A**: Absolutely. For example, velocity prediction or trajectory prediction heads (a sketch follows the snippet below):
```yaml
heads:
  object: {...}
  map: {...}
  velocity: {...}     # custom task head
  trajectory: {...}   # custom task head
```
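A minimal sketch of what such a custom head could look like, assuming the mmdet3d `HEADS` registry that `build_head()` draws from (the class, its loss, and the import path are illustrative and may differ between mmdet3d versions):
```python
import torch.nn as nn
import torch.nn.functional as F
from mmdet3d.models.builder import HEADS  # registry used by build_head(); path may vary by version

@HEADS.register_module()
class VelocityHead(nn.Module):
    """Hypothetical extra task head: predicts a 2-channel BEV velocity field."""

    def __init__(self, in_channels=512):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 2, kernel_size=1)

    def forward(self, x, target=None):
        pred = self.conv(x)                      # (B, 2, H, W)
        if target is None:                       # inference: return raw predictions
            return pred
        return {"vel": F.l1_loss(pred, target)}  # training: loss dict, like the other heads
```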
## Summary
**BEVFusion fully supports multi-task, multi-head output:**
- ✅ 3D detection and BEV segmentation at the same time
- ✅ Shared feature extraction and BEV representation
- ✅ Unified training and inference pipeline
- ✅ Flexible configuration system
- ✅ Extensible to additional tasks

🚀 **A multi-task configuration is recommended:**
- Higher inference efficiency
- Tasks reinforce each other
- More complete scene understanding
- Well suited for real-world deployment
---
Generated: 2025-10-16