bev-project/archive/docs_old/Batch机制完整分析_20251102.md


BEVFusion Batch Size Mechanism: Complete Analysis

Analysis time: 2025-11-02 12:45 UTC
Conclusion: batch=2 is fully feasible in theory


🎯 Key Findings

Your understanding is exactly right!

One batch sample = 6 cameras + 1 LiDAR point cloud
├─ Camera dimensions: (B, N, C, H, W)
│   ├─ B = batch_size (samples_per_gpu)
│   └─ N = 6 (fixed at 6 cameras)
└─ LiDAR dimensions: List[B point clouds]

When batch=2:

Sample 1: 6 cameras + 1 lidar
Sample 2: 6 cameras + 1 lidar
───────────────────────────
Total: 12 camera images + 2 point clouds (processed together)
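
The composition above can be sketched in a few lines. This is a minimal illustration, not BEVFusion's actual collate code: NumPy arrays stand in for torch tensors, `make_sample` and `collate` are hypothetical helpers, and the point counts and the 5 features per point are made-up values. Images stack into a single array because every sample has the same shape, while point clouds stay in a list because each one has a different number of points.

```python
import numpy as np

N_CAMS, C, H, W = 6, 3, 256, 704  # camera layout from the analysis above

def make_sample(num_points):
    """Hypothetical sample: 6 camera images plus one LiDAR point cloud."""
    img = np.zeros((N_CAMS, C, H, W), dtype=np.float32)
    points = np.zeros((num_points, 5), dtype=np.float32)  # 5 features per point is an assumption
    return {"img": img, "points": points}

def collate(samples):
    """Stack images into (B, N, C, H, W); keep point clouds as a list of length B."""
    img = np.stack([s["img"] for s in samples], axis=0)
    points = [s["points"] for s in samples]
    return img, points

img, points = collate([make_sample(30000), make_sample(28000)])
print(img.shape)    # (2, 6, 3, 256, 704)
print(len(points))  # 2
```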

📋 Batch Processing Flow

Data flow (batch=2 as an example)

Input:
  img: (2, 6, 3, 256, 704)      # 2 samples, 6 cameras each
  points: [cloud_1, cloud_2]    # 2 LiDAR point clouds

Camera Encoder:
  reshape: (2, 6, 3, 256, 704) → (12, 3, 256, 704)  # flatten the 6 cameras into the batch
  backbone processes 12 images
  reshape back: (2, 6, C, H, W)
  vtransform → (2, 80, BEV_H, BEV_W)  # camera BEV features

LiDAR Encoder:
  voxelize: 2 point clouds → sparse voxels
  backbone → (2, 256, BEV_H, BEV_W)  # LiDAR BEV features

Fuser:
  features = [camera_feat, lidar_feat]
  len(features) = 2  # ← two modalities: camera and LiDAR
  
  self.fuser(features)  # ← this is the path that should run
  output: (2, 256, BEV_H, BEV_W)  # fused BEV features

Decoder & Heads:
  batch=2 in, batch=2 out
  the batch dimension is fully supported
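
The flatten-and-restore step in the camera encoder is the part most worth checking by hand. The sketch below assumes NumPy arrays in place of torch tensors, a small image size to keep it light, and a channel mean as a stand-in for the real backbone. It shows that camera k of sample b sits at flat index b * N + k and comes back to position [b, k] after the second reshape.

```python
import numpy as np

B, N, C, H, W = 2, 6, 3, 8, 16  # small H, W to keep the example light
x = np.arange(B * N * C * H * W, dtype=np.float32).reshape(B, N, C, H, W)

# Flatten samples and cameras so a 2D backbone sees B * N independent images.
flat = x.reshape(B * N, C, H, W)

# Stand-in for the backbone: any per-image op that keeps the leading dimension.
feat = flat.mean(axis=1, keepdims=True)  # (B * N, 1, H, W)

# Restore the batch dimension; camera k of sample b is at flat index b * N + k.
BN = feat.shape[0]
restored = feat.reshape(B, BN // B, *feat.shape[1:])

print(flat.shape)      # (12, 3, 8, 16)
print(restored.shape)  # (2, 6, 1, 8, 16)
assert np.array_equal(restored[1, 4], feat[1 * N + 4])
```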

🚨 Why Batch=2 Fails

Analysis of the failing assertion

Code:

if self.fuser is not None:
    x = self.fuser(features)  # ← this branch should run
else:
    assert len(features) == 1  # ← this branch should not run
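
The branch can be reproduced in isolation to see that the assertion depends only on whether a fuser exists, not on the batch size. This is a minimal sketch: `fuse` is a hypothetical free function mirroring the snippet above, strings stand in for BEV tensors, and the lambda stands in for ConvFuser.

```python
def fuse(features, fuser=None):
    """Mirror of the branch above: use the fuser if present, else expect one modality."""
    if fuser is not None:
        return fuser(features)
    assert len(features) == 1, f"No fuser but {len(features)} features!"
    return features[0]

camera_feat, lidar_feat = "camera_bev", "lidar_bev"  # placeholders for BEV tensors

# With a fuser, two modalities are fine (the lambda is a stand-in for ConvFuser).
print(fuse([camera_feat, lidar_feat], fuser=lambda fs: "+".join(fs)))  # camera_bev+lidar_bev

# Without a fuser, two modalities trip the assertion regardless of batch size.
try:
    fuse([camera_feat, lidar_feat], fuser=None)
except AssertionError as err:
    print(err)  # No fuser but 2 features!
```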

Fuser config check:

model:
  fuser:
    type: ConvFuser          # ✅ present
    in_channels: [80, 256]   # ✅ correct
    out_channels: 256        # ✅ correct

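🔍 Possible Causes

Cause 1: Fuser initialization failed (most likely)

What to check:

# bevfusion.py: Line 58-61
if fuser is not None:
    self.fuser = build_fuser(fuser)
else:
    self.fuser = None

Possible scenarios:

  • A fuser config was passed in, but build_fuser() failed
  • It returned None, or raised an exception that was caught
  • As a result, self.fuser = None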
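
One way to turn such a silent failure into a loud one is to check the result right after construction. The sketch below is hypothetical: `build_fuser_checked` is not part of BEVFusion, `builder` stands in for build_fuser, and `broken_builder` only simulates a builder that swallows its own error.

```python
def build_fuser_checked(fuser_cfg, builder):
    """Build the fuser and fail loudly if a config was given but nothing was built.

    `builder` stands in for build_fuser; this wrapper is hypothetical.
    """
    if fuser_cfg is None:
        return None
    fuser = builder(fuser_cfg)
    if fuser is None:
        raise RuntimeError(f"fuser config {fuser_cfg!r} produced no module")
    return fuser

cfg = {"type": "ConvFuser", "in_channels": [80, 256], "out_channels": 256}

def broken_builder(_cfg):
    """Simulated builder that swallows its own error and returns None."""
    try:
        raise KeyError("ConvFuser is not registered")
    except KeyError:
        return None

# The problem now surfaces at init time instead of at the forward assertion.
try:
    build_fuser_checked(cfg, broken_builder)
except RuntimeError as err:
    print(err)
```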

Cause 2: Config inheritance problem

# multitask_BEV2X_phase4a_stage1_fp16.yaml
_base_: ./multitask_BEV2X_phase4a_stage1.yaml

# The FP16 config may be overriding some key parameters
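
How an override can drop the fuser is easiest to see with a generic recursive merge. This is a sketch under assumptions: `merge_config` is a hypothetical helper, and the merge semantics of the project's real config loader may differ. It only illustrates that a child config touching `data` leaves the fuser alone, while a child that sets `fuser` to null replaces it.

```python
def merge_config(base, override):
    """Recursively merge `override` into `base`; non-dict values replace wholesale."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {
    "model": {"fuser": {"type": "ConvFuser", "in_channels": [80, 256], "out_channels": 256}},
    "data": {"samples_per_gpu": 1},
}

# A child that only touches data leaves the fuser intact.
safe_child = {"data": {"samples_per_gpu": 2}}
# A child that sets fuser to null (None in Python) removes it.
bad_child = {"model": {"fuser": None}, "data": {"samples_per_gpu": 2}}

print(merge_config(base, safe_child)["model"]["fuser"]["type"])  # ConvFuser
print(merge_config(base, bad_child)["model"]["fuser"])           # None
```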

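Cause 3: A problem introduced by load_from

--load_from /data/runs/phase4a_stage1/epoch_1.pth

If epoch_1.pth was trained under a config without a fuser, loading it may have affected the fuser. Loading a checkpoint normally copies weights into existing modules rather than removing them, so this cause should be confirmed before acting on it.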
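
A quick way to test this hypothesis is to list the fuser parameters stored in the checkpoint. The sketch below uses toy dictionaries with made-up keys; `fuser_keys` is a hypothetical helper, and the assumption that the real file exposes its weights under a "state_dict" entry should be verified against the actual checkpoint.

```python
def fuser_keys(state_dict, prefix="fuser."):
    """Return the parameter names in a state dict that belong to the fuser."""
    return sorted(k for k in state_dict if k.startswith(prefix))

# Toy state dicts with made-up keys. A real one would come from something like
# torch.load(path, map_location="cpu")["state_dict"] (the "state_dict" key is an assumption).
with_fuser = {"encoders.camera.backbone.w": 0, "fuser.0.weight": 0, "fuser.1.weight": 0}
without_fuser = {"encoders.camera.backbone.w": 0, "heads.object.w": 0}

print(fuser_keys(with_fuser))     # ['fuser.0.weight', 'fuser.1.weight']
print(fuser_keys(without_fuser))  # []
```

An empty list would mean the checkpoint was saved from a model without a fuser, which is worth knowing even though it would not by itself set self.fuser to None.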


Solutions

Option A: Debug fuser initialization (recommended)

Add debug output:

# mmdet3d/models/fusion_models/bevfusion.py
# Change lines 58-61 to:

if fuser is not None:
    print(f"[DEBUG] Building fuser: {fuser.get('type', 'unknown')}")
    self.fuser = build_fuser(fuser)
    print(f"[DEBUG] Fuser built successfully: {self.fuser is not None}")
else:
    print("[DEBUG] No fuser config, setting to None")
    self.fuser = None

Option B: Explicitly declare the fuser in the FP16 config

# multitask_BEV2X_phase4a_stage1_fp16.yaml

_base_: ./multitask_BEV2X_phase4a_stage1.yaml

# Declare the fuser explicitly so it cannot be overridden
model:
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256

# FP16 and batch settings
fp16:
  loss_scale: dynamic

data:
  samples_per_gpu: 2
  workers_per_gpu: 0

optimizer:
  lr: 4.0e-5

Option C: Check the base config file (verification)

# Inspect the config inheritance chain
grep "_base_" configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml

# Inspect convfuser.yaml to confirm it defines the fuser as expected
cat configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml

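💡 Key Points

The three dimensions of batch size in BEVFusion

Dimension 1: Sample batch (samples_per_gpu) ✓ adjustable

samples_per_gpu = 2
Meaning: 2 driving scene samples are processed together
Each sample = 6 cameras + 1 lidar

Dimension 2: Number of cameras (N=6) ✗ fixed

Each sample has a fixed set of 6 camera views
This is determined by the dataset and cannot be changed
In code: B, N, C, H, W

Dimension 3: Number of modalities (len(features)=2) ✗ fixed

features = [camera_feat, lidar_feat]
len(features) = 2 (two modalities: camera + lidar)
This is determined by the architecture, not by the batch size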

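🚀 Batch=2 Is Fully Feasible in Theory

Code support verification

1. Camera Encoder supports batch>1

# Line 121-131
B, N, C, H, W = x.size()  # B can be any value
x = x.view(B * N, C, H, W)  # flatten cameras into the batch dimension
# ... backbone ...
BN, C, H, W = x.size()  # shape after the backbone; BN = B * N
x = x.view(B, int(BN / B), C, H, W)  # restore the batch dimension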

2. LiDAR Encoder supports batch>1

# Line 151-154
batch_size = coords[-1, 0] + 1  # derive the batch size from the voxel coordinates
x = self.encoders[sensor]["backbone"](feats, coords, batch_size, ...)
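
The `coords[-1, 0] + 1` shortcut reads the batch index of the last voxel, so it is correct only when rows are ordered by batch index. The sketch below uses NumPy in place of torch and assumes a (batch_idx, x, y, z) column layout purely for illustration; the coordinate values are made up.

```python
import numpy as np

# Voxel coordinates as (batch_idx, x, y, z); the layout is an assumption for illustration.
coords = np.array([
    [0, 10, 20, 1],
    [0, 11, 20, 1],
    [1,  5,  7, 0],
    [1,  6,  7, 0],
])

batch_size = int(coords[-1, 0]) + 1   # reads the last row's batch index
print(batch_size)                     # 2

# The shortcut relies on rows being ordered by batch index; max() does not.
shuffled = coords[[2, 0, 3, 1]]
print(int(shuffled[-1, 0]) + 1)       # 1 (wrong)
print(int(shuffled[:, 0].max()) + 1)  # 2
```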

3. Fuser supports batch>1

# ConvFuser is a standard convolution and supports any batch size
# features[0]: (B, 80, H, W)  camera
# features[1]: (B, 256, H, W) lidar
# Output: (B, 256, H, W)
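
The shape bookkeeping for the fuser can be checked with a small stand-in. This sketch assumes the fuser concatenates the two modalities along the channel axis and then projects 80 + 256 = 336 channels down to 256; a 1x1 projection with random weights replaces the real convolution, and NumPy replaces torch, so only the shapes are meaningful.

```python
import numpy as np

B, H, W = 2, 4, 4                      # small BEV grid for illustration
camera = np.random.rand(B, 80, H, W)   # camera BEV features
lidar = np.random.rand(B, 256, H, W)   # LiDAR BEV features

# Step 1: concatenate the modalities along the channel axis.
stacked = np.concatenate([camera, lidar], axis=1)  # (B, 336, H, W)

# Step 2: project 336 -> 256 channels. A 1x1 projection stands in for the real
# convolution; the weights are random and the kernel size is a simplification.
weight = np.random.rand(256, 336)
fused = np.einsum("oc,bchw->bohw", weight, stacked)

print(stacked.shape)  # (2, 336, 4, 4)
print(fused.shape)    # (2, 256, 4, 4)
```

Nothing in either step refers to a specific value of B, which is why the fuser itself places no limit on the batch size.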

4. Decoder supports batch>1

# Line 337-340
batch_size = x.shape[0]  # read the batch size
x = self.decoder["backbone"](x)  # SECOND supports batching
x = self.decoder["neck"](x)      # FPN supports batching

5. Heads support batch>1

# Line 342-349
# Both TransFusionHead and SegmentationHead support the batch dimension

🎯 Final Conclusion

Batch=2 should be fully feasible

All modules support batch>1:

  • Camera Encoder
  • LiDAR Encoder
  • ConvFuser
  • Decoder
  • Detection Head
  • Segmentation Head

The problem is not the batch size; it most likely lies in fuser initialization


🔧 Suggested Debugging Steps

Step 1: Add debug logging

Modify mmdet3d/models/fusion_models/bevfusion.py:

# Line 58-61
if fuser is not None:
    print(f"[INIT] Fuser config: {fuser}")
    self.fuser = build_fuser(fuser)
    print(f"[INIT] Fuser built: {self.fuser}")
else:
    print("[INIT] No fuser config")
    self.fuser = None

# Line 331-335
print(f"[FORWARD] self.fuser is None: {self.fuser is None}")
print(f"[FORWARD] len(features): {len(features)}")
if self.fuser is not None:
    x = self.fuser(features)
else:
    assert len(features) == 1, f"No fuser but {len(features)} features!"
    x = features[0]

Step 2: Restart training

cd /workspace/bevfusion
bash CLEANUP_AND_START_FP16_BATCH2.sh

Step 3: Inspect the debug output

# Show the fuser initialization log lines
grep "\[INIT\] Fuser" $(ls -t phase4a_stage1_fp16*.log | head -1)

# Show the first few forward log lines
grep "\[FORWARD\]" $(ls -t phase4a_stage1_fp16*.log | head -1) | head -5
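
The same filtering can be done in Python if the log needs further processing. This is a sketch only: `summarize_debug_lines` is a hypothetical helper, and the log excerpt is made up from the print formats in Step 1 to show one possible failure pattern, not captured from a real run.

```python
def summarize_debug_lines(log_text):
    """Collect the [INIT] and [FORWARD] lines emitted by the debug prints."""
    summary = {"init": [], "forward": []}
    for line in log_text.splitlines():
        if "[INIT]" in line:
            summary["init"].append(line.strip())
        elif "[FORWARD]" in line:
            summary["forward"].append(line.strip())
    return summary

# Made-up excerpt for illustration only; real output will differ.
sample_log = """
[INIT] No fuser config
[FORWARD] self.fuser is None: True
[FORWARD] len(features): 2
"""

summary = summarize_debug_lines(sample_log)
print(summary["init"])
print(len(summary["forward"]))
```

An "[INIT] No fuser config" line would point to Cause 2 (the config reaching the model without a fuser), whereas an "[INIT] Fuser config" line followed by a None fuser would point to Cause 1.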

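📝 Summary

Key points:

  1. Batch size = samples_per_gpu (number of samples)
  2. Each sample = 6 cameras + 1 lidar (fixed)
  3. Features = [camera_feat, lidar_feat] (number of modalities = 2)
  4. Batch=2 is fully feasible in theory

Problem localization:

  • The batch size is not the cause
  • The feature dimensions are not the cause
  • Fuser initialization may have failed

Next step: add debug logging to pinpoint the fuser initialization problem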


Document version: 1.0
Status: mechanism analysis complete, awaiting debug verification