BEVFusion Batch Size Mechanism: Full Analysis
Analysis time: 2025-11-02 12:45 UTC
Conclusion: ✅ In theory, batch=2 is fully feasible.
🎯 Core Findings
Your understanding is correct. ✅
One batch sample = 6 cameras + 1 LiDAR point cloud
├─ Camera dimensions: (B, N, C, H, W)
│  ├─ B = batch_size (samples_per_gpu)
│  └─ N = 6 (fixed, 6 cameras)
└─ LiDAR dimensions: list of B point clouds
With batch=2:
Sample 1: 6 cameras + 1 lidar
Sample 2: 6 cameras + 1 lidar
───────────────────────────
Total: 12 camera images + 2 point clouds (processed together)
📋 Batch Processing Flow
Data flow (batch=2 as an example)
Input:
  img: (2, 6, 3, 256, 704)  # 2 samples, 6 cameras each
  points: [cloud_1, cloud_2]  # 2 LiDAR point clouds
Camera Encoder:
  reshape: (2, 6, 3, 256, 704) → (12, 3, 256, 704)  # fold the camera axis into the batch axis
  backbone processes 12 images
  reshape back: (2, 6, C, H, W)
  vtransform → (2, 80, BEV_H, BEV_W)  # camera BEV features
LiDAR Encoder:
  voxelize: 2 point clouds → sparse voxels
  backbone → (2, 256, BEV_H, BEV_W)  # lidar BEV features
Fuser:
  features = [camera_feat, lidar_feat]
  len(features) = 2  # ← two modalities: camera and lidar
  self.fuser(features)  # ← this is the path that should run
  Output: (2, 256, BEV_H, BEV_W)  # fused BEV features
Decoder & Heads:
  batch=2 in, batch=2 out
  The batch dimension is fully supported. ✅
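The camera reshape in the flow above can be traced with plain shape arithmetic. The following is a minimal sketch, not the BEVFusion code itself: `fold_cameras` and `unfold_cameras` are hypothetical helper names that only mimic what the two `view` calls do to the tensor shape.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def fold_cameras(shape: Shape) -> Shape:
    """(B, N, C, H, W) -> (B*N, C, H, W), mirroring x.view(B * N, C, H, W)."""
    b, n, c, h, w = shape
    return (b * n, c, h, w)


def unfold_cameras(shape: Shape, batch_size: int) -> Shape:
    """(B*N, C, H, W) -> (B, N, C, H, W), recovering N from the folded axis."""
    bn, c, h, w = shape
    if bn % batch_size != 0:
        raise ValueError(f"{bn} images cannot be split into {batch_size} samples")
    return (batch_size, bn // batch_size, c, h, w)


img = (2, 6, 3, 256, 704)          # 2 samples, 6 cameras each
folded = fold_cameras(img)         # what the backbone sees
restored = unfold_cameras(folded, batch_size=2)

print(folded)    # (12, 3, 256, 704)
print(restored)  # (2, 6, 3, 256, 704)
```

Nothing in this bookkeeping depends on B being 1, which is why the camera branch is expected to accept batch=2.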
🚨 Why Does Batch=2 Fail?
Analysis of the failing assertion
Code:
if self.fuser is not None:
    x = self.fuser(features)  # ← expected path
else:
    assert len(features) == 1  # ← should not be reached
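To make the failure mode concrete, here is a minimal sketch of the same branch outside the model. `fuse` and `toy_fuser` are hypothetical stand-ins, and the features are plain strings instead of tensors; the point is only that the assertion depends on the number of modalities and on whether a fuser exists, not on the batch size.

```python
from typing import Callable, List, Optional


def fuse(features: List[str], fuser: Optional[Callable[[List[str]], str]]) -> str:
    """Same control flow as the snippet above, on toy data."""
    if fuser is not None:
        return fuser(features)
    assert len(features) == 1, f"No fuser but {len(features)} features!"
    return features[0]


def toy_fuser(features: List[str]) -> str:
    """Hypothetical fuser: just joins the modality names."""
    return "+".join(features)


features = ["camera_feat", "lidar_feat"]

print(fuse(features, toy_fuser))  # camera_feat+lidar_feat

try:
    fuse(features, None)          # two modalities, no fuser
except AssertionError as err:
    print(f"AssertionError: {err}")  # AssertionError: No fuser but 2 features!
```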
Fuser config check:
model:
  fuser:
    type: ConvFuser  # ✅ present
    in_channels: [80, 256]  # ✅ correct
    out_channels: 256  # ✅ correct
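🔍 Possible Causes
Cause 1: Fuser initialization failed (most likely) ⭐
What to check:
# bevfusion.py: Lines 58-61
if fuser is not None:
    self.fuser = build_fuser(fuser)
else:
    self.fuser = None
Possible scenarios:
- The fuser config was passed in, but build_fuser() failed
- It returned None, or raised an exception that was caught and swallowed
- Either way, the result is self.fuser = None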
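One way to rule out a silent failure is to make the build step fail fast. The sketch below assumes nothing about the real `build_fuser`; `build_checked`, `broken_builder`, and `ok_builder` are hypothetical names used only to show the pattern of refusing a `None` result when a config was actually supplied.

```python
from typing import Any, Callable, Dict, Optional


def build_checked(
    cfg: Optional[Dict[str, Any]],
    builder: Callable[[Dict[str, Any]], Any],
) -> Any:
    """Build a module from cfg and refuse to continue with a silent None."""
    if cfg is None:
        return None  # no fuser requested: legitimate single-modality setup
    module = builder(cfg)
    if module is None:
        raise RuntimeError(f"builder returned None for config {cfg!r}")
    return module


def broken_builder(cfg: Dict[str, Any]) -> Any:
    """Hypothetical builder that swallows an error and returns None."""
    try:
        raise KeyError(cfg["type"])  # e.g. type not found in a registry
    except KeyError:
        return None


def ok_builder(cfg: Dict[str, Any]) -> Any:
    """Hypothetical builder that succeeds; returns a placeholder object."""
    return {"built": cfg["type"]}


fuser_cfg = {"type": "ConvFuser", "in_channels": [80, 256], "out_channels": 256}

print(build_checked(fuser_cfg, ok_builder))  # {'built': 'ConvFuser'}

try:
    build_checked(fuser_cfg, broken_builder)
except RuntimeError as err:
    print(f"RuntimeError: {err}")
```

With a guard like this, a fuser that fails to build stops the run at construction time instead of surfacing later as the assertion in forward.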
Cause 2: Config inheritance problem
# multitask_BEV2X_phase4a_stage1_fp16.yaml
_base_: ./multitask_BEV2X_phase4a_stage1.yaml
# Could the FP16 config be overriding some key parameter?
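The inheritance concern can be illustrated with a simplified recursive merge. This is a sketch under the assumption that the child config is merged key by key into the base; the actual loader used by the project may behave differently. `merge_config` is a hypothetical helper, and the base value `samples_per_gpu: 1` is made up for the example.

```python
from copy import deepcopy


def merge_config(base: dict, child: dict) -> dict:
    """Recursively overlay child onto base; non-dict values replace wholesale."""
    merged = deepcopy(base)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


base = {
    "model": {
        "fuser": {"type": "ConvFuser", "in_channels": [80, 256], "out_channels": 256}
    },
    "data": {"samples_per_gpu": 1},  # assumed base value, for illustration only
}

# A child that only touches fp16 and data leaves the fuser intact.
fp16_child = {"fp16": {"loss_scale": "dynamic"}, "data": {"samples_per_gpu": 2}}
merged = merge_config(base, fp16_child)
print(merged["model"]["fuser"]["type"])   # ConvFuser
print(merged["data"]["samples_per_gpu"])  # 2

# A child that sets model.fuser to null would erase it.
bad_child = {"model": {"fuser": None}}
print(merge_config(base, bad_child)["model"]["fuser"])  # None
```

Under this model, the fuser only disappears if the child (or an intermediate file in the chain) explicitly replaces `model.fuser`, so dumping the fully merged config is a direct way to confirm or rule out this cause.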
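Cause 3: A problem introduced by load_from
--load_from /data/runs/phase4a_stage1/epoch_1.pth
If epoch_1.pth was trained with a config that had no fuser, loading it might have affected the fuser. Note that loading a checkpoint normally only copies weights into modules that already exist, so this would more plausibly leave the fuser weights uninitialized than set self.fuser to None. That makes it the least likely of the three causes.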
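Whether the checkpoint contains fuser weights at all can be checked by looking at its parameter names. The sketch below works on any iterable of key names; `keys_with_prefix` is a hypothetical helper and the example keys are invented for illustration, not read from epoch_1.pth.

```python
from typing import Iterable, List


def keys_with_prefix(keys: Iterable[str], prefix: str = "fuser.") -> List[str]:
    """Return the sorted parameter names that belong to one submodule."""
    return sorted(k for k in keys if k.startswith(prefix))


# Invented key names, standing in for a real state_dict's keys.
example_keys = [
    "encoders.camera.backbone.patch_embed.projection.weight",
    "fuser.1.weight",
    "fuser.0.weight",
    "decoder.backbone.blocks.0.0.weight",
]

print(keys_with_prefix(example_keys))              # ['fuser.0.weight', 'fuser.1.weight']
print(keys_with_prefix(example_keys, "heads."))    # []
```

With the real file, the keys would typically come from something like `torch.load(path, map_location="cpu")["state_dict"].keys()`, assuming the checkpoint stores its weights under a top-level `state_dict` entry. An empty result for the `fuser.` prefix would indicate the checkpoint was saved from a model without a fuser.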
✅ Solutions
Option A: Debug fuser initialization (recommended) ⭐
Add debug output:
# mmdet3d/models/fusion_models/bevfusion.py
# Change lines 58-61 to:
if fuser is not None:
    print(f"[DEBUG] Building fuser: {fuser.get('type', 'unknown')}")
    self.fuser = build_fuser(fuser)
    print(f"[DEBUG] Fuser built successfully: {self.fuser is not None}")
else:
    print("[DEBUG] No fuser config, setting to None")
    self.fuser = None
Option B: Declare the fuser explicitly in the FP16 config
# multitask_BEV2X_phase4a_stage1_fp16.yaml
_base_: ./multitask_BEV2X_phase4a_stage1.yaml
# Declare the fuser explicitly (so it cannot be overridden)
model:
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
# FP16 and batch settings
fp16:
  loss_scale: dynamic
data:
  samples_per_gpu: 2
  workers_per_gpu: 0
optimizer:
  lr: 4.0e-5
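Option C: Inspect the base config file (verification)
# Show the _base_ entry of this config (repeat on each parent to follow the whole inheritance chain)
grep "_base_" configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
# Confirm that convfuser.yaml contains the expected fuser settings
cat configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml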
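💡 Key Points
Three "dimensions" that are easy to confuse with batch size in BEVFusion
Dimension 1: Sample batch (samples_per_gpu) ⭐ tunable
  samples_per_gpu = 2
  Meaning: 2 driving-scene samples are processed at once
  Each sample = 6 cameras + 1 lidar
Dimension 2: Number of cameras (N=6) ✗ fixed
  Every sample has exactly 6 camera views
  This is determined by the dataset and cannot be changed
  In code: B, N, C, H, W
Dimension 3: Number of modalities (len(features)=2) ✗ fixed
  features = [camera_feat, lidar_feat]
  len(features) = 2 (two modalities: camera + lidar)
  This is determined by the architecture; it is not the batch size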
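🚀 Batch=2 Is Feasible in Theory
Verifying support in the code
1. Camera Encoder supports batch>1 ✅
# Lines 121-131
B, N, C, H, W = x.size()  # B can be any value
x = x.view(B * N, C, H, W)  # fold the cameras into the batch axis
# ... backbone ...
BN, C, H, W = x.size()  # feature-map shape after the backbone; BN == B * N
x = x.view(B, int(BN / B), C, H, W)  # restore the batch dimension
2. LiDAR Encoder supports batch>1 ✅
# Lines 151-154
batch_size = coords[-1, 0] + 1  # read the batch size from the last voxel's batch index
x = self.encoders[sensor]["backbone"](feats, coords, batch_size, ...)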
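The `coords[-1, 0] + 1` expression relies on how the voxel coordinates are laid out. The sketch below assumes that column 0 holds the sample index and that rows are grouped by ascending sample index; the remaining columns are treated as opaque voxel indices. `batch_size_from_coords` is a hypothetical helper operating on plain tuples rather than a tensor.

```python
from typing import List, Sequence


def batch_size_from_coords(coords: List[Sequence[int]]) -> int:
    """Batch size = batch index of the last voxel row + 1 (rows sorted by sample)."""
    if not coords:
        raise ValueError("no voxels: batch size cannot be inferred")
    return coords[-1][0] + 1


# Two samples: rows 0-1 belong to sample 0, rows 2-4 to sample 1.
coords = [
    (0, 10, 4, 1),
    (0, 11, 4, 1),
    (1, 3, 7, 0),
    (1, 3, 8, 0),
    (1, 5, 9, 2),
]

print(batch_size_from_coords(coords))      # 2
print(batch_size_from_coords(coords[:2]))  # 1
```

One caveat of this scheme is that it undercounts if the last sample in the batch produces no voxels, since its index never appears in the coordinates.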
3. Fuser supports batch>1 ✅
# ConvFuser is built from standard convolution layers, which accept any batch size
# features[0]: (B, 80, H, W) camera
# features[1]: (B, 256, H, W) lidar
# Output: (B, 256, H, W)
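The shape contract of the fuser can be written down as a small check. This sketch assumes the fuser concatenates its inputs along the channel axis and then applies a convolution that preserves the spatial size; `conv_fuser_output_shape` is a hypothetical helper, and the 180×180 BEV grid is an invented value for illustration.

```python
from typing import List, Sequence, Tuple


def conv_fuser_output_shape(
    feature_shapes: List[Tuple[int, int, int, int]],
    in_channels: Sequence[int],
    out_channels: int,
) -> Tuple[int, int, int, int]:
    """Validate (B, C_i, H, W) inputs and return the fused (B, C_out, H, W) shape."""
    if len(feature_shapes) != len(in_channels):
        raise ValueError(
            f"expected {len(in_channels)} feature maps, got {len(feature_shapes)}"
        )
    batch, _, height, width = feature_shapes[0]
    for (b, c, h, w), expected_c in zip(feature_shapes, in_channels):
        if c != expected_c:
            raise ValueError(f"expected {expected_c} channels, got {c}")
        if (b, h, w) != (batch, height, width):
            raise ValueError("batch and spatial sizes must match across modalities")
    return (batch, out_channels, height, width)


camera_feat = (2, 80, 180, 180)   # (B, 80, BEV_H, BEV_W), grid size assumed
lidar_feat = (2, 256, 180, 180)   # (B, 256, BEV_H, BEV_W)

print(conv_fuser_output_shape([camera_feat, lidar_feat], [80, 256], 256))
# (2, 256, 180, 180)
```

The only batch-related requirement is that both modalities carry the same B; its value is passed through unchanged.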
4. Decoder supports batch>1 ✅
# Lines 337-340
batch_size = x.shape[0]  # read the batch size
x = self.decoder["backbone"](x)  # SECOND supports batching
x = self.decoder["neck"](x)  # FPN supports batching
5. Heads support batch>1 ✅
# Lines 342-349
# Both TransFusionHead and SegmentationHead handle the batch dimension
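🎯 Final Conclusion
Batch=2 should be fully feasible. ⭐⭐⭐
Every module supports batch>1:
- ✅ Camera Encoder
- ✅ LiDAR Encoder
- ✅ ConvFuser
- ✅ Decoder
- ✅ Detection Head
- ✅ Segmentation Head
The problem is not the batch size; it most likely lies in fuser initialization.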
🔧 Suggested Debugging Steps
Step 1: Add debug logging
Edit mmdet3d/models/fusion_models/bevfusion.py:
# Lines 58-61
if fuser is not None:
    print(f"[INIT] Fuser config: {fuser}")
    self.fuser = build_fuser(fuser)
    print(f"[INIT] Fuser built: {self.fuser}")
else:
    print("[INIT] No fuser config")
    self.fuser = None
# Lines 331-335
print(f"[FORWARD] self.fuser is None: {self.fuser is None}")
print(f"[FORWARD] len(features): {len(features)}")
if self.fuser is not None:
    x = self.fuser(features)
else:
    assert len(features) == 1, f"No fuser but {len(features)} features!"
    x = features[0]
Step 2: Restart training
cd /workspace/bevfusion
bash CLEANUP_AND_START_FP16_BATCH2.sh
Step 3: Inspect the debug output
# Fuser initialization log lines from the newest log file
grep "\[INIT\] Fuser" "$(ls -t phase4a_stage1_fp16*.log | head -1)"
# First few forward-pass log lines
grep "\[FORWARD\]" "$(ls -t phase4a_stage1_fp16*.log | head -1)" | head -5
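📝 Summary
Key points:
- ✅ Batch size = samples_per_gpu (number of samples)
- ✅ Each sample = 6 cameras + 1 lidar (fixed)
- ✅ Features = [camera_feat, lidar_feat] (number of modalities = 2)
- ✅ Batch=2 is feasible in theory
Problem localization:
- ❌ Not caused by the batch size
- ❌ Not caused by the feature dimensions
- ✅ Possibly a fuser initialization failure
Next step: add debug logging and locate the fuser initialization problem
Document version: 1.0
Status: ✅ Mechanism analysis complete; awaiting verification through debugging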