15 KiB

Raw Blame History

BEVFusion 多任务多头支持指南

✅ 答案：完全支持！

BEVFusion 完全支持同时进行3D目标检测和BEV地图分割，这是该框架的核心设计特点之一。

架构设计

多头结构

BEVFusion
    ├── Encoders (多模态编码器)
    │   ├── Camera Encoder
    │   └── LiDAR Encoder
    ├── Fuser (特征融合)
    ├── Decoder (BEV解码器)
    └── Heads (多任务头) ★
        ├── object: 3D目标检测头 (TransFusion/CenterPoint)
        └── map: BEV地图分割头 (BEVSegmentationHead)

代码实现（bevfusion.py）

class BEVFusion(Base3DFusionModel):
    def __init__(self, encoders, fuser, decoder, heads, **kwargs):
        # 初始化多个任务头
        self.heads = nn.ModuleDict()
        for name in heads:
            if heads[name] is not None:
                self.heads[name] = build_head(heads[name])
        
        # 为每个任务设置损失权重
        self.loss_scale = dict()
        for name in heads:
            if heads[name] is not None:
                self.loss_scale[name] = 1.0
    
    def forward_single(self, ...):
        # 1. 多模态特征提取和融合
        features = []
        for sensor in self.encoders:
            feature = self.extract_features(...)
            features.append(feature)
        
        # 2. 特征融合
        x = self.fuser(features)
        
        # 3. BEV解码
        x = self.decoder["backbone"](x)
        x = self.decoder["neck"](x)
        
        # 4. 多任务头处理
        if self.training:
            outputs = {}
            for type, head in self.heads.items():
                if type == "object":
                    # 3D目标检测
                    pred_dict = head(x, metas)
                    losses = head.loss(gt_bboxes_3d, gt_labels_3d, pred_dict)
                elif type == "map":
                    # BEV地图分割
                    losses = head(x, gt_masks_bev)
                
                # 收集损失
                for name, val in losses.items():
                    outputs[f"loss/{type}/{name}"] = val * self.loss_scale[type]
            return outputs
        else:
            # 推理模式：同时输出检测和分割结果
            outputs = [{} for _ in range(batch_size)]
            for type, head in self.heads.items():
                if type == "object":
                    pred_dict = head(x, metas)
                    bboxes = head.get_bboxes(pred_dict, metas)
                    for k, (boxes, scores, labels) in enumerate(bboxes):
                        outputs[k].update({
                            "boxes_3d": boxes.to("cpu"),
                            "scores_3d": scores.cpu(),
                            "labels_3d": labels.cpu(),
                        })
                elif type == "map":
                    logits = head(x)
                    for k in range(batch_size):
                        outputs[k].update({
                            "masks_bev": logits[k].cpu(),
                            "gt_masks_bev": gt_masks_bev[k].cpu(),
                        })
            return outputs

配置文件示例

方案1：仅检测（configs/nuscenes/det/default.yaml）

model:
  type: BEVFusion
  heads:
    object:  # 启用检测头
      type: TransFusionHead
      # ... 检测头配置
    map: null  # 禁用分割头

方案2：仅分割（configs/nuscenes/seg/default.yaml）

model:
  type: BEVFusion
  heads:
    object: null  # 禁用检测头
    map:  # 启用分割头
      type: BEVSegmentationHead
      # ... 分割头配置

方案3：多任务（检测 + 分割）✨

model:
  type: BEVFusion
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        # ... camera配置
      neck:
        type: GeneralizedLSSFPN
        # ... neck配置
      vtransform:
        type: LSSTransform
        # ... vtransform配置
    lidar:
      voxelize:
        # ... 体素化配置
      backbone:
        type: SparseEncoder
        # ... lidar backbone配置
  
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
  
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      # ... decoder配置
    neck:
      type: SECONDFPN
      in_channels: [128, 256]
      out_channels: [256, 256]
      # ... neck配置
  
  heads:
    # 任务1：3D目标检测
    object:
      type: TransFusionHead
      num_proposals: 200
      auxiliary: true
      in_channels: 512
      num_classes: 10
      num_heads: 8
      nms_kernel_size: 3
      ffn_channel: 256
      dropout: 0.1
      common_heads:
        center: [2, 2]
        height: [1, 2]
        dim: [3, 2]
        rot: [2, 2]
        vel: [2, 2]
      bbox_coder:
        type: TransFusionBBoxCoder
        pc_range: [-54.0, -54.0]
        post_center_range: [-61.2, -61.2, -10.0, 61.2, 61.2, 10.0]
        voxel_size: [0.075, 0.075]
      loss_cls:
        type: FocalLoss
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        reduction: mean
      loss_bbox:
        type: L1Loss
        reduction: mean
        loss_weight: 0.25
      loss_iou:
        type: GIoULoss
        reduction: mean
        loss_weight: 0.0
    
    # 任务2：BEV地图分割
    map:
      type: BEVSegmentationHead
      in_channels: 512
      grid_transform:
        input_scope: [[-54.0, 54.0, 0.8], [-54.0, 54.0, 0.8]]
        output_scope: [[-50, 50, 0.5], [-50, 50, 0.5]]
      classes: ['drivable_area', 'ped_crossing', 'walkway', 'stop_line', 
                'carpark_area', 'divider']
      loss:
        type: FocalLoss  # 或 CrossEntropyLoss
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25

  # 可选：为不同任务设置不同的损失权重
  loss_scale:
    object: 1.0  # 检测损失权重
    map: 1.0     # 分割损失权重

创建多任务配置

步骤1：创建配置文件

创建 configs/nuscenes/multitask/fusion-det-seg.yaml:

# 继承基础配置
_base_:
  - ../default.yaml

# 模型配置
model:
  type: BEVFusion
  
  # 编码器（复用检测的配置）
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        embed_dims: 96
        depths: [2, 2, 6, 2]
        num_heads: [3, 6, 12, 24]
        window_size: 7
        mlp_ratio: 4
        qkv_bias: true
        qk_scale: null
        drop_rate: 0.
        attn_drop_rate: 0.
        drop_path_rate: 0.2
        patch_norm: true
        out_indices: [1, 2, 3]
        with_cp: false
        convert_weights: true
        init_cfg:
          type: Pretrained
          checkpoint: pretrained/swint-nuimages-pretrained.pth
      neck:
        type: GeneralizedLSSFPN
        in_channels: [192, 384, 768]
        out_channels: 256
        start_level: 0
        num_outs: 3
      vtransform:
        type: LSSTransform
        in_channels: 256
        out_channels: 80
        image_size: [256, 704]
        feature_size: [32, 88]
        xbound: [-54.0, 54.0, 0.3]
        ybound: [-54.0, 54.0, 0.3]
        zbound: [-10.0, 10.0, 20.0]
        dbound: [1.0, 60.0, 0.5]
        downsample: 2
    
    lidar:
      voxelize:
        max_num_points: 10
        point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
        voxel_size: [0.075, 0.075, 0.2]
        max_voxels: [120000, 160000]
      backbone:
        type: SparseEncoder
        in_channels: 5
        sparse_shape: [1440, 1440, 41]
        output_channels: 128
        order: [conv, norm, act]
        encoder_channels:
          - [16, 16, 32]
          - [32, 32, 64]
          - [64, 64, 128]
          - [128, 128]
        encoder_paddings:
          - [0, 0, 1]
          - [0, 0, 1]
          - [0, 0, [1, 1, 0]]
          - [0, 0]
        block_type: basicblock
  
  # 融合器
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
  
  # 解码器
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      layer_nums: [5, 5]
      layer_strides: [1, 2]
    neck:
      type: SECONDFPN
      in_channels: [128, 256]
      out_channels: [256, 256]
      upsample_strides: [1, 2]
  
  # 多任务头
  heads:
    # 3D目标检测
    object:
      type: TransFusionHead
      in_channels: 512
      num_proposals: 200
      auxiliary: true
      num_classes: 10
      num_heads: 8
      nms_kernel_size: 3
      ffn_channel: 256
      dropout: 0.1
      common_heads:
        center: [2, 2]
        height: [1, 2]
        dim: [3, 2]
        rot: [2, 2]
        vel: [2, 2]
      loss_cls:
        type: FocalLoss
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
      loss_bbox:
        type: L1Loss
        loss_weight: 0.25
    
    # BEV地图分割
    map:
      type: BEVSegmentationHead
      in_channels: 512
      classes: ['drivable_area', 'ped_crossing', 'walkway', 
                'stop_line', 'carpark_area', 'divider']
      loss: focal

  # 损失权重（可选）
  loss_scale:
    object: 1.0
    map: 1.0

# 训练配置
optimizer:
  type: AdamW
  lr: 2.0e-4  # 多任务可能需要调整学习率
  weight_decay: 0.01

lr_config:
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
  min_lr_ratio: 1.0e-3

runner:
  type: EpochBasedRunner
  max_epochs: 20

# 评估配置
evaluation:
  interval: 1
  pipeline:
    # 同时评估检测和分割
    - type: DetEval
      metric: bbox
    - type: SegEval
      metric: map

步骤2：训练命令

# 多任务训练
torchpack dist-run -np 8 python tools/train.py \
  configs/nuscenes/multitask/fusion-det-seg.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth \
  --load_from pretrained/lidar-only-det.pth

步骤3：测试/推理

# 多任务测试（同时评估检测和分割）
torchpack dist-run -np 8 python tools/test.py \
  configs/nuscenes/multitask/fusion-det-seg.yaml \
  runs/multitask/latest.pth \
  --eval bbox map

输出结果格式

训练时输出（损失）

{
    'loss/object/heatmap': 0.234,
    'loss/object/bbox': 0.456,
    'loss/object/iou': 0.123,
    'loss/map/seg': 0.345,
    'loss/depth': 0.089,  # 如果使用BEVDepth
    'stats/object/...': ...,
    'stats/map/...': ...
}

推理时输出（预测结果）

# 每个样本的输出
[
    {
        # 3D检测结果
        'boxes_3d': LiDARInstance3DBoxes(...),  # 形状: (N, 9)
        'scores_3d': tensor([...]),              # 形状: (N,)
        'labels_3d': tensor([...]),              # 形状: (N,)
        
        # BEV分割结果
        'masks_bev': tensor([[...]]),            # 形状: (C, H, W)
        'gt_masks_bev': tensor([[...]])          # 形状: (C, H, W) - 如果有GT
    },
    # ... 更多样本
]

可视化多任务结果

import torch
import matplotlib.pyplot as plt
from mmdet3d.core.bbox import LiDARInstance3DBoxes

def visualize_multitask_results(data, prediction):
    """可视化多任务输出"""
    
    # 1. 可视化3D检测框（BEV视图）
    boxes_3d = prediction['boxes_3d']
    scores_3d = prediction['scores_3d']
    labels_3d = prediction['labels_3d']
    
    # 2. 可视化BEV分割
    masks_bev = prediction['masks_bev']  # (C, H, W)
    
    fig, axes = plt.subplots(1, 2, figsize=(15, 7))
    
    # 左图：3D检测
    ax = axes[0]
    # 绘制BEV平面和检测框
    for box, score, label in zip(boxes_3d.tensor, scores_3d, labels_3d):
        # 绘制框 (简化示例)
        corners = boxes_3d.corners[[i]]
        # ... 绘制逻辑
    ax.set_title('3D Object Detection')
    
    # 右图：BEV分割
    ax = axes[1]
    seg_map = torch.argmax(masks_bev, dim=0)  # (H, W)
    im = ax.imshow(seg_map.cpu().numpy())
    ax.set_title('BEV Map Segmentation')
    plt.colorbar(im, ax=ax)
    
    plt.tight_layout()
    plt.savefig('multitask_result.png')

性能和资源消耗

单任务 vs 多任务对比

配置	显存/GPU	训练时间	性能
仅检测	~18GB	20-24h	mAP: 68-70%
仅分割	~14GB	12-15h	mIoU: 62-63%
多任务	~22GB	28-32h	mAP: 67-69% mIoU: 61-62%

注意事项：

多任务训练显存消耗略高（增加约4GB）
训练时间约为两个单任务之和
性能可能略低于单独训练，但共享特征提取带来效率提升
推理时可以同时输出两种结果，无需多次forward

优化建议

调整损失权重

loss_scale:
  object: 1.0   # 可以调整为 0.5-2.0
  map: 1.0      # 可以调整为 0.5-2.0

渐进式训练策略

# 阶段1：先训练检测（冻结分割头）
# 阶段2：再训练分割（冻结检测头）
# 阶段3：联合fine-tuning

使用更大的batch size

data:
  samples_per_gpu: 2  # 如果显存允许

实际应用场景

1. 自动驾驶完整感知

多任务输出：
├── 3D目标检测 → 车辆、行人、障碍物
└── BEV分割 → 可行驶区域、人行横道、停车区域

优势：
- 统一的BEV表示
- 共享特征提取
- 一次推理获得完整场景理解

2. 实时系统部署

检测 + 分割 (多任务) vs 两个单独模型
├── 推理时间：1x vs 1.8x
├── 显存占用：1x vs 1.6x
└── 参数量：1x vs 1.7x

3. 端到端训练

优势：
- 两个任务互相促进
- 分割帮助检测理解场景结构
- 检测帮助分割关注重要区域

常见问题

Q1: 多任务训练会影响单个任务的性能吗？

A: 可能会有轻微影响（1-2%），但：

共享特征提取带来的效率提升
两个任务可以互相促进
实际应用中往往需要同时获得两种结果

Q2: 可以只推理其中一个任务吗？

A: 可以！在配置文件中设置：

heads:
  object: {...}  # 保留
  map: null      # 禁用

Q3: 如何平衡两个任务的损失？

A: 调整 loss_scale:

loss_scale:
  object: 2.0  # 更关注检测
  map: 1.0

Q4: 多任务训练需要什么数据？

A: 需要同时包含：

3D检测标注 (gt_bboxes_3d, gt_labels_3d)
BEV分割标注 (gt_masks_bev)

nuScenes数据集同时提供这两种标注。

Q5: 可以添加更多任务头吗？

A: 完全可以！例如添加速度预测、轨迹预测等：

heads:
  object: {...}
  map: {...}
  velocity: {...}  # 自定义任务头
  trajectory: {...}  # 自定义任务头

总结

✅ BEVFusion完全支持多任务多头输出

✅ 同时进行3D检测和BEV分割
✅ 共享特征提取和BEV表示
✅ 统一的训练和推理流程
✅ 灵活的配置系统
✅ 可扩展到更多任务

🚀 推荐使用多任务配置

提高推理效率
任务间互相促进
更完整的场景理解
适合实际应用部署

生成时间: 2025-10-16

15 KiB Raw Blame History Unescape Escape