BEVFusion Model Optimization Kickoff Plan
Start date: 2025-10-30
Baseline: Epoch 23 (NDS 0.6941, mAP 0.6446, mIoU 0.4130)
Goal: prepare an optimized model for Orin deployment
🎯 Optimization Targets
Final deployment targets
Hardware: NVIDIA Orin 270T
Inference time: <80ms (ideally <60ms)
Throughput: >12 FPS (ideally >16 FPS)
Power: <60W (ideally <45W)
Accuracy loss: <3%
Optimization roadmap
Original model: 110M params, 450 GFLOPs, 90ms@A100
↓
Pruned model: 60M params, 250 GFLOPs, 50ms@A100 (-45%)
↓
INT8 model: 15M params, 62 GFLOPs, 40ms@A100 (-56%)
↓
TensorRT: 15M params, optimized kernels, 30ms@A100 (-67%)
↓
Orin deployment: 50-60ms inference, 16+ FPS, <50W, targets met ✅
📋 Three-Phase Optimization Plan
Phase 1: Model analysis (1-2 days, start immediately)
Task list
- Analyze model parameter count and FLOPs
- Profile inference performance bottlenecks
- Sensitivity analysis (which layers can be pruned)
- Decide on a pruning strategy
Required tools
tools/analysis/
├── model_complexity.py # model complexity analysis
├── profile_inference.py # inference performance profiling
├── sensitivity_analysis.py # sensitivity analysis
└── layer_statistics.py # per-layer statistics
Phase 2: Model pruning (3-5 days)
Targets
Parameters: 110M → 60M (-45%)
FLOPs: 450G → 250G (-44%)
Accuracy loss: <1.5%
Pruning strategy
1. SwinTransformer backbone
- Channel pruning: remove 20-30% of channels
- Layer pruning: optionally drop attention layers
2. FPN neck
- Channel pruning: remove 25-30% of channels
3. Decoder
- Channel pruning: remove 20% of channels
4. Detection/segmentation heads
- Prune cautiously: remove only 10-15% (accuracy-sensitive)
Pruning tools
- Torch-Pruning (recommended)
- torch.nn.utils.prune (built-in)
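To get a feel for the built-in API, here is a minimal channel-pruning sketch on a toy conv layer. This is an illustration only, not the BEVFusion backbone; for the real model, Torch-Pruning is preferable because it propagates channel removal across layer dependencies:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for one conv block; the real model comes from build_model(cfg.model).
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# Structured L2 pruning along dim=0 zeroes whole output filters,
# which is the granularity that can actually shrink FLOPs.
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)

# 25% of the 128 output filters are now masked to exactly zero.
zeroed = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(zeroed)  # 32

# Fold the mask into .weight permanently. Note the tensor shape is unchanged:
# to realize the speedup, the zeroed filters (and the next layer's matching
# input channels) must be physically removed, which Torch-Pruning automates.
prune.remove(conv, "weight")
```

The masked-vs-removed distinction is why Torch-Pruning is recommended above: `torch.nn.utils.prune` only zeroes weights, while a dependency-aware pruner rebuilds the smaller layers.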
Phase 3: Quantization training (4-6 days)
Targets
Model size: 441MB (FP32) → 110MB (INT8) (-75%)
Inference speed: 2-3x faster
Accuracy loss: <2% (cumulative <3%)
Quantization strategy
1. PTQ (Post-Training Quantization)
- Quickly validates feasibility
- Expected accuracy loss: 2-3%
2. QAT (Quantization-Aware Training)
- Train to recover accuracy
- 5 epochs, lr=1e-6
- Expected accuracy recovery: 1-2%
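Before running PTQ on the full model, the eager-mode flow (insert observers → calibrate → convert) can be exercised on a toy module. A minimal sketch, assuming PyTorch eager-mode static quantization on an x86 backend; the toy network and shapes are illustrative, not BEVFusion's, and the sparse-conv LiDAR branch would need different tooling:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantWrapper, get_default_qconfig, prepare, convert

# Toy CNN standing in for one dense sub-module of the pruned model.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 30 * 30, 10),
).eval()

# QuantWrapper adds QuantStub/DeQuantStub so float tensors enter and leave.
wrapped = QuantWrapper(model)
wrapped.qconfig = get_default_qconfig("fbgemm")
prepared = prepare(wrapped)

# "Calibration": run a few representative batches so observers record ranges.
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(1, 3, 32, 32))

# Convert replaces float modules with their INT8 counterparts.
quantized = convert(prepared)
out = quantized(torch.randn(1, 3, 32, 32))
print(tuple(out.shape))  # (1, 10)
```

QAT follows the same skeleton with `prepare_qat` on a training-mode model, which is what the 5-epoch recovery run above would use.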
🚀 Immediate Action: Kick Off Phase 1
Step 1: Model complexity analysis
Create the analysis script:

```python
# tools/analysis/model_complexity.py
import torch
import torch.nn as nn
from thop import profile, clever_format
from mmcv import Config
from mmdet3d.models import build_model


def analyze_model_complexity(config_file, checkpoint_file=None):
    """Analyze model complexity."""
    # Load the config
    cfg = Config.fromfile(config_file)

    # Build the model
    model = build_model(cfg.model)
    model.eval()

    if checkpoint_file:
        checkpoint = torch.load(checkpoint_file, map_location='cpu')
        model.load_state_dict(checkpoint['state_dict'])

    # Parameter counts
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print("=" * 80)
    print("Model parameter statistics")
    print("=" * 80)
    print(f"Total parameters: {total_params:,} ({total_params/1e6:.2f}M)")
    print(f"Trainable parameters: {trainable_params:,} ({trainable_params/1e6:.2f}M)")
    print(f"Model size (FP32): {total_params * 4 / 1024 / 1024:.2f} MB")
    print()

    # Per-module statistics
    print("=" * 80)
    print("Per-module parameter statistics")
    print("=" * 80)
    module_params = {}
    for name, module in model.named_children():
        params = sum(p.numel() for p in module.parameters())
        module_params[name] = params
        print(f"{name:30s}: {params:12,} ({params/total_params*100:5.2f}%)")
    print()

    # FLOPs (requires dummy inputs)
    print("=" * 80)
    print("FLOPs statistics (requires dummy inputs)")
    print("=" * 80)
    batch_size = 1
    dummy_images = torch.randn(batch_size, 6, 3, 256, 704)  # 6 camera views
    dummy_points = torch.randn(batch_size, 40000, 5)        # point cloud

    try:
        flops, params = profile(model, inputs=(dummy_images, dummy_points))
        flops, params = clever_format([flops, params], "%.3f")
        print(f"FLOPs: {flops}")
        print(f"Params: {params}")
    except Exception as e:
        print(f"FLOPs computation failed: {e}")
        print("The model forward may need changes to support profiling")

    return model, total_params, module_params


if __name__ == '__main__':
    import sys
    if len(sys.argv) < 2:
        print("Usage: python model_complexity.py <config_file> [checkpoint_file]")
        sys.exit(1)
    config_file = sys.argv[1]
    checkpoint_file = sys.argv[2] if len(sys.argv) > 2 else None
    model, total_params, module_params = analyze_model_complexity(
        config_file,
        checkpoint_file
    )
    print("\nAnalysis complete!")
```
Step 2: Inference performance profiling

```python
# tools/analysis/profile_inference.py
import torch
import time
import numpy as np
from mmcv import Config
from mmdet3d.models import build_model
from mmdet3d.datasets import build_dataloader, build_dataset


def profile_inference(config_file, checkpoint_file, num_samples=100):
    """Profile inference performance."""
    # Load the config and model
    cfg = Config.fromfile(config_file)
    model = build_model(cfg.model).cuda()
    checkpoint = torch.load(checkpoint_file)
    model.load_state_dict(checkpoint['state_dict'])
    model.eval()

    # Build the dataset
    dataset = build_dataset(cfg.data.val)
    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=1,
        workers_per_gpu=0,
        dist=False,
        shuffle=False
    )

    # Warmup
    print("Warming up GPU...")
    with torch.no_grad():
        for i, data in enumerate(data_loader):
            if i >= 10:
                break
            _ = model(return_loss=False, rescale=True, **data)

    # Timed runs
    print(f"\nStarting profiling ({num_samples} samples)...")
    times = []
    with torch.no_grad():
        for i, data in enumerate(data_loader):
            if i >= num_samples:
                break
            torch.cuda.synchronize()
            start = time.time()
            _ = model(return_loss=False, rescale=True, **data)
            torch.cuda.synchronize()
            end = time.time()
            times.append((end - start) * 1000)  # ms
            if (i + 1) % 10 == 0:
                print(f"  Processed: {i+1}/{num_samples}")

    # Statistics
    times = np.array(times)
    print("\n" + "=" * 80)
    print("Inference performance statistics")
    print("=" * 80)
    print(f"Mean latency: {np.mean(times):.2f} ms")
    print(f"Median: {np.median(times):.2f} ms")
    print(f"Min: {np.min(times):.2f} ms")
    print(f"Max: {np.max(times):.2f} ms")
    print(f"Std dev: {np.std(times):.2f} ms")
    print(f"P95: {np.percentile(times, 95):.2f} ms")
    print(f"P99: {np.percentile(times, 99):.2f} ms")
    print(f"\nThroughput: {1000/np.mean(times):.2f} FPS")
    print("=" * 80)

    return times


if __name__ == '__main__':
    import sys
    if len(sys.argv) < 3:
        print("Usage: python profile_inference.py <config> <checkpoint> [num_samples]")
        sys.exit(1)
    config_file = sys.argv[1]
    checkpoint_file = sys.argv[2]
    num_samples = int(sys.argv[3]) if len(sys.argv) > 3 else 100
    times = profile_inference(config_file, checkpoint_file, num_samples)
    print("\nProfiling complete!")
```
Step 3: Sensitivity analysis

```python
# tools/analysis/sensitivity_analysis.py
import torch
import torch.nn as nn
import copy
from tqdm import tqdm
from mmcv import Config
from mmdet3d.models import build_model
from mmdet3d.datasets import build_dataloader, build_dataset
from mmdet3d.apis import single_gpu_test


def prune_layer_channels(model, layer_name, ratio=0.5):
    """Temporarily prune the channels of the given layer."""
    # Simplified here; a real implementation must handle each layer type
    pruned_model = copy.deepcopy(model)
    # Find the target layer and prune it
    for name, module in pruned_model.named_modules():
        if name == layer_name:
            if isinstance(module, nn.Conv2d):
                # Simplification: keep only the leading channels
                out_channels = module.out_channels
                keep_channels = int(out_channels * (1 - ratio))
                # The actual pruning implementation goes here
                pass
    return pruned_model


def evaluate_model(model, data_loader):
    """Quick model evaluation."""
    model.eval()
    results = []
    with torch.no_grad():
        for data in tqdm(data_loader, desc="Evaluating"):
            result = model(return_loss=False, rescale=True, **data)
            results.extend(result)
    # Simplification: return a placeholder score (a real run should compute mAP/NDS)
    return len(results)  # placeholder


def analyze_sensitivity(config_file, checkpoint_file, prune_ratio=0.5):
    """Analyze per-layer pruning sensitivity."""
    print("Loading model...")
    cfg = Config.fromfile(config_file)
    model = build_model(cfg.model).cuda()
    checkpoint = torch.load(checkpoint_file)
    model.load_state_dict(checkpoint['state_dict'])

    # Build the dataset (use a small sample for a quick pass)
    print("Building dataset...")
    # cfg.data.val.ann_file can be pointed at a mini val set here
    dataset = build_dataset(cfg.data.val)
    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=1,
        workers_per_gpu=0,
        dist=False,
        shuffle=False
    )

    # Baseline performance
    print("\nEvaluating baseline performance...")
    baseline_score = evaluate_model(model, data_loader)
    print(f"Baseline score: {baseline_score}")

    # Per-layer sensitivity
    sensitivities = {}
    print(f"\nStarting sensitivity analysis (prune ratio: {prune_ratio})...")
    for name, module in tqdm(model.named_modules()):
        # Only analyze Conv2d layers
        if not isinstance(module, nn.Conv2d):
            continue
        if module.out_channels < 64:  # skip small layers
            continue

        print(f"\nTesting layer: {name}")

        # Temporarily prune this layer
        pruned_model = prune_layer_channels(model, name, prune_ratio)

        # Evaluate
        pruned_score = evaluate_model(pruned_model, data_loader)

        # Sensitivity = performance drop
        sensitivity = baseline_score - pruned_score
        sensitivities[name] = sensitivity

        print(f"  Score after pruning: {pruned_score}")
        print(f"  Sensitivity: {sensitivity:.4f}")

        del pruned_model

    # Sort and report
    sorted_sens = sorted(sensitivities.items(), key=lambda x: x[1])
    print("\n" + "=" * 80)
    print("Sensitivity ranking (low to high; low sensitivity = safe to prune)")
    print("=" * 80)
    for name, sens in sorted_sens[:20]:  # show the first 20
        print(f"{name:60s}: {sens:.4f}")

    return sensitivities


if __name__ == '__main__':
    import sys
    if len(sys.argv) < 3:
        print("Usage: python sensitivity_analysis.py <config> <checkpoint>")
        sys.exit(1)
    config_file = sys.argv[1]
    checkpoint_file = sys.argv[2]
    sensitivities = analyze_sensitivity(config_file, checkpoint_file)

    # Save the results
    import json
    with open('sensitivity_results.json', 'w') as f:
        json.dump(sensitivities, f, indent=2)
    print("\nSensitivity analysis complete! Results saved to sensitivity_results.json")
```
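The `prune_layer_channels` stub leaves the masking unimplemented. A cheap stand-in for "temporary pruning" during sensitivity probing is to zero the lowest-L2-norm output filters in place, then restore the weights after evaluation. A sketch on a toy layer; `mask_low_norm_channels` is a hypothetical helper, not part of the codebase:

```python
import torch
import torch.nn as nn

def mask_low_norm_channels(conv: nn.Conv2d, ratio: float) -> torch.Tensor:
    """Zero the `ratio` fraction of output filters with the smallest L2 norm.
    Returns a copy of the original weight so the caller can restore it."""
    original = conv.weight.data.clone()
    norms = conv.weight.data.flatten(1).norm(dim=1)  # one norm per output channel
    n_drop = int(conv.out_channels * ratio)
    drop_idx = norms.argsort()[:n_drop]              # weakest filters first
    with torch.no_grad():
        conv.weight[drop_idx] = 0.0
        if conv.bias is not None:
            conv.bias[drop_idx] = 0.0
    return original

conv = nn.Conv2d(8, 16, 3)
backup = mask_low_norm_channels(conv, ratio=0.5)

# Half of the 16 output filters are now zeroed.
zeroed = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(zeroed)  # 8

# Restore after evaluating, so the next probed layer starts from the same baseline.
conv.weight.data.copy_(backup)
```

Masking in place avoids the `copy.deepcopy` of the whole model per probed layer, which matters when the loop visits dozens of conv layers.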
📊 Commands to Run Now
1. Model complexity analysis (5 minutes)

```bash
cd /workspace/bevfusion

# Create the tool and output directories
mkdir -p tools/analysis analysis_results

# Create and run the analysis script
python tools/analysis/model_complexity.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml \
    runs/enhanced_from_epoch19/epoch_23.pth \
    > analysis_results/model_complexity.txt

cat analysis_results/model_complexity.txt
```

2. Inference performance profiling (15 minutes)

```bash
# Profile inference performance
python tools/analysis/profile_inference.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml \
    runs/enhanced_from_epoch19/epoch_23.pth \
    100 \
    > analysis_results/inference_profile.txt

cat analysis_results/inference_profile.txt
```

3. Sensitivity analysis (1-2 hours, optional)

```bash
# Sensitivity analysis (use a mini val set for a quick pass)
python tools/analysis/sensitivity_analysis.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml \
    runs/enhanced_from_epoch19/epoch_23.pth \
    > analysis_results/sensitivity_analysis.txt
```
📊 Expected Analysis Results
Based on the BEVFusion architecture, the expected results:
Model complexity
Total parameters: ~110M
- Camera Encoder (SwinT): ~47M (43%) ← largest module
- LiDAR Encoder: ~19M (17%)
- Fuser: ~2M (2%)
- Decoder: ~16M (14%)
- Detection Head: ~18M (16%)
- Segmentation Head: ~8M (7%)
FLOPs: ~450 GFLOPs
Model size: ~441 MB (FP32)
Inference performance (A100)
Mean latency: ~90ms
- Camera branch: ~40ms (44%) ← main bottleneck
- LiDAR branch: ~17ms (19%)
- Fusion + Decoder: ~15ms (17%)
- Heads: ~18ms (20%)
Throughput: ~11 FPS
Optimization potential
1. Camera Encoder pruning
- Potential: 40-50% fewer parameters
- Speedup: 20-30%
- Sensitivity: medium
2. Decoder simplification
- Potential: 30-40% fewer parameters
- Speedup: 10-15%
- Sensitivity: low
3. INT8 quantization
- Speedup: 2-3x
- Accuracy loss: <2%
🎯 Today's Goals
Must complete
- Create the analysis tool scripts
- Run the model complexity analysis
- Run the inference performance profiling
- Generate the analysis report
Optional
- Sensitivity analysis (time permitting)
- Decide on a pruning strategy
- Prepare the pruning tools
📅 Plan for the Next 7 Days
Day 1 (today):
✓ Model analysis
✓ Profiling
✓ Decide on the optimization strategy
Day 2-3:
→ Apply pruning
→ Fine-tune the pruned model (3 epochs)
Day 4:
→ Evaluate the pruned model
→ PTQ quantization test
Day 5-6:
→ QAT quantization training (5 epochs)
Day 7:
→ Evaluate the quantized model
→ Write the optimization report
→ Prepare the TensorRT conversion
🚀 Start Now
Stage 1 training is currently running (GPUs 0-3); model analysis can proceed in parallel (GPUs 4-7 or CPU)
Create the analysis tools

```bash
cd /workspace/bevfusion
mkdir -p tools/analysis
mkdir -p analysis_results

# Create the analysis scripts (see the Python code above)
# Then run the analysis
```

Status: 🚀 Ready to start model optimization
Focus: analyze first, optimize second
Parallelism: does not interfere with Stage 1 training