
BEVFusion Model Optimization Kickoff Plan

Start date: 2025-10-30
Baseline: Epoch 23 (NDS 0.6941, mAP 0.6446, mIoU 0.4130)
Goal: prepare an optimized model for Orin deployment


🎯 Optimization Targets

Final deployment targets

Hardware: NVIDIA Orin 270T
Inference time: <80 ms (ideally <60 ms)
Throughput: >12 FPS (ideally >16 FPS)
Power: <60 W (ideally <45 W)
Accuracy loss: <3%

Optimization roadmap

Original model: 110M params, 450 GFLOPs, 90 ms @ A100
    ↓
Pruned model: 60M params, 250 GFLOPs, 50 ms @ A100  (-45%)
    ↓
INT8 model: 15M params, 62 GFLOPs, 40 ms @ A100   (-56%)
    ↓
TensorRT: 15M params, optimized kernels, 30 ms @ A100  (-67%)
    ↓
Orin deployment: 50-60 ms inference, 16+ FPS, <50 W   (target met ✅)

📋 Three-Stage Optimization Plan

Stage 1: Model analysis (1-2 days, start immediately)

Task list

  • Analyze parameter count and FLOPs
  • Profile inference performance bottlenecks
  • Sensitivity analysis (which layers can be pruned)
  • Decide on a pruning strategy

Tools needed

tools/analysis/
├── model_complexity.py        # model complexity analysis
├── profile_inference.py       # inference performance profiling
├── sensitivity_analysis.py    # sensitivity analysis
└── layer_statistics.py        # per-layer statistics

Stage 2: Model pruning (3-5 days)

Goals

Parameters: 110M → 60M (-45%)
FLOPs: 450G → 250G (-44%)
Accuracy loss: <1.5%

Pruning strategy

1. SwinTransformer backbone
   - Channel pruning: reduce channels by 20-30%
   - Depth pruning: optionally remove attention layers

2. FPN neck
   - Channel pruning: reduce channels by 25-30%

3. Decoder
   - Channel pruning: reduce channels by 20%

4. Detection/segmentation heads
   - Prune cautiously: reduce by 10-15% (accuracy-sensitive)

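The per-module ratios above can be turned into a rough savings estimate before touching the model. A minimal sketch (module names and per-module parameter counts are illustrative placeholders, not measured from the checkpoint; conv parameters shrink roughly quadratically when both input and output channels are pruned):

```python
# Hypothetical pruning plan: module -> (approx. params in millions, channel prune ratio).
# Names and numbers are illustrative, not measured from the actual checkpoint.
PRUNE_PLAN = {
    "camera_backbone": (47.0, 0.25),  # SwinTransformer, 20-30% channels
    "fpn_neck":        (10.0, 0.28),  # 25-30% channels
    "decoder":         (16.0, 0.20),  # 20% channels
    "det_head":        (18.0, 0.12),  # 10-15%, prune cautiously
}

def estimate_savings_m(plan):
    """Rough parameter savings in millions: when both in- and out-channels
    shrink by `ratio`, conv params scale by roughly (1 - ratio)^2."""
    saved = 0.0
    for params_m, ratio in plan.values():
        kept_fraction = (1.0 - ratio) ** 2
        saved += params_m * (1.0 - kept_fraction)
    return saved

print(f"Estimated savings: ~{estimate_savings_m(PRUNE_PLAN):.1f}M params")
```

This is only a planning aid; the real savings depend on how pruning propagates through skip connections and attention blocks.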
Pruning tools

  • Torch-Pruning (recommended)
  • torch.nn.utils.prune (built-in)

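For a quick feel of the built-in API, `torch.nn.utils.prune` can mask whole output channels. A minimal sketch on a toy conv layer (note: this only zeros weights without shrinking tensors, so a structural tool like Torch-Pruning is still needed to realize actual FLOPs savings):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy conv layer standing in for one backbone block.
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# L2-structured pruning: zero out 25% of output channels (dim=0).
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)

# Count output channels whose weights are now entirely zero.
channel_norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
zeroed = int((channel_norms == 0).sum())
print(f"Zeroed output channels: {zeroed}/{conv.out_channels}")  # 32/128

# Bake the mask into the weight tensor (removes the reparametrization).
prune.remove(conv, "weight")
```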
Stage 3: Quantization training (4-6 days)

Goals

Model size: 441 MB (FP32) → 110 MB (INT8) (-75%)
Inference speed: 2-3x faster
Accuracy loss: <2% (cumulative <3%)

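The size target is simple arithmetic on bytes per weight (a sanity check using decimal MB; 110M is the unpruned baseline parameter count from this plan):

```python
# Back-of-envelope model size: parameter count x bytes per weight.
params = 110e6  # ~110M parameters (baseline model)

fp32_mb = params * 4 / 1e6  # 4 bytes per FP32 weight
int8_mb = params * 1 / 1e6  # 1 byte per INT8 weight
reduction = (1 - int8_mb / fp32_mb) * 100

print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB ({reduction:.0f}% smaller)")
# FP32: 440 MB, INT8: 110 MB (75% smaller)
```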
Quantization strategy

1. PTQ (Post-Training Quantization)
   - Quick feasibility check
   - Expected accuracy loss: 2-3%

2. QAT (Quantization-Aware Training)
   - Training to recover accuracy
   - 5 epochs, lr=1e-6
   - Expected accuracy recovery: 1-2%

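For intuition on the INT8 mechanics, PyTorch's post-training dynamic quantization works in a few lines on a toy MLP (the layers here are placeholders; a conv-heavy model like BEVFusion needs static PTQ with calibration data, or QAT, since dynamic quantization only covers Linear/RNN layers):

```python
import torch
import torch.nn as nn

# Toy model standing in for a head; not the BEVFusion pipeline.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
model.eval()

# Dynamic PTQ: weights stored as INT8, activations quantized on the fly.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 256)
with torch.no_grad():
    out = qmodel(x)
print(out.shape)  # torch.Size([1, 10])
```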
🚀 Immediate Action: Stage 1 Kickoff

Step 1: Model complexity analysis

Create the analysis script:

# tools/analysis/model_complexity.py

import torch
import torch.nn as nn
from thop import profile, clever_format
from mmcv import Config
from mmdet3d.models import build_model

def analyze_model_complexity(config_file, checkpoint_file=None):
    """Analyze model complexity."""

    # Load config
    cfg = Config.fromfile(config_file)

    # Build model
    model = build_model(cfg.model)
    model.eval()

    if checkpoint_file:
        checkpoint = torch.load(checkpoint_file, map_location='cpu')
        model.load_state_dict(checkpoint['state_dict'])

    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print("=" * 80)
    print("Model parameter statistics")
    print("=" * 80)
    print(f"Total params: {total_params:,} ({total_params/1e6:.2f}M)")
    print(f"Trainable params: {trainable_params:,} ({trainable_params/1e6:.2f}M)")
    print(f"Model size (FP32): {total_params * 4 / 1024 / 1024:.2f} MB")
    print()

    # Per-module statistics
    print("=" * 80)
    print("Per-module parameter statistics")
    print("=" * 80)

    module_params = {}
    for name, module in model.named_children():
        params = sum(p.numel() for p in module.parameters())
        module_params[name] = params
        print(f"{name:30s}: {params:12,} ({params/total_params*100:5.2f}%)")

    print()

    # FLOPs counting needs a dummy input
    print("=" * 80)
    print("FLOPs statistics (requires dummy input)")
    print("=" * 80)

    # Create dummy inputs
    batch_size = 1
    dummy_images = torch.randn(batch_size, 6, 3, 256, 704)  # 6 camera views
    dummy_points = torch.randn(batch_size, 40000, 5)  # point cloud

    try:
        flops, params = profile(model, inputs=(dummy_images, dummy_points))
        flops, params = clever_format([flops, params], "%.3f")
        print(f"FLOPs: {flops}")
        print(f"Params: {params}")
    except Exception as e:
        print(f"FLOPs computation failed: {e}")
        print("The model forward may need changes to support thop's profile")

    return model, total_params, module_params

if __name__ == '__main__':
    import sys

    if len(sys.argv) < 2:
        print("Usage: python model_complexity.py <config_file> [checkpoint_file]")
        sys.exit(1)

    config_file = sys.argv[1]
    checkpoint_file = sys.argv[2] if len(sys.argv) > 2 else None

    model, total_params, module_params = analyze_model_complexity(
        config_file,
        checkpoint_file
    )

    print("\nAnalysis complete!")

Step 2: Inference performance profiling

# tools/analysis/profile_inference.py

import torch
import time
import numpy as np
from mmcv import Config
from mmdet3d.models import build_model
from mmdet3d.datasets import build_dataloader, build_dataset

def profile_inference(config_file, checkpoint_file, num_samples=100):
    """Profile inference performance."""

    # Load config and model
    cfg = Config.fromfile(config_file)
    model = build_model(cfg.model).cuda()
    checkpoint = torch.load(checkpoint_file)
    model.load_state_dict(checkpoint['state_dict'])
    model.eval()

    # Build dataset
    dataset = build_dataset(cfg.data.val)
    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=1,
        workers_per_gpu=0,
        dist=False,
        shuffle=False
    )

    # Warm-up
    print("Warming up GPU...")
    with torch.no_grad():
        for i, data in enumerate(data_loader):
            if i >= 10:
                break
            _ = model(return_loss=False, rescale=True, **data)

    # Benchmark
    print(f"\nStarting profiling ({num_samples} samples)...")
    times = []

    with torch.no_grad():
        for i, data in enumerate(data_loader):
            if i >= num_samples:
                break

            torch.cuda.synchronize()
            start = time.time()

            _ = model(return_loss=False, rescale=True, **data)

            torch.cuda.synchronize()
            end = time.time()

            times.append((end - start) * 1000)  # ms

            if (i + 1) % 10 == 0:
                print(f"  Processed: {i+1}/{num_samples}")

    # Statistics
    times = np.array(times)

    print("\n" + "=" * 80)
    print("Inference performance statistics")
    print("=" * 80)
    print(f"Mean inference time: {np.mean(times):.2f} ms")
    print(f"Median: {np.median(times):.2f} ms")
    print(f"Min: {np.min(times):.2f} ms")
    print(f"Max: {np.max(times):.2f} ms")
    print(f"Std dev: {np.std(times):.2f} ms")
    print(f"P95: {np.percentile(times, 95):.2f} ms")
    print(f"P99: {np.percentile(times, 99):.2f} ms")
    print(f"\nThroughput: {1000/np.mean(times):.2f} FPS")
    print("=" * 80)

    return times

if __name__ == '__main__':
    import sys

    if len(sys.argv) < 3:
        print("Usage: python profile_inference.py <config> <checkpoint> [num_samples]")
        sys.exit(1)

    config_file = sys.argv[1]
    checkpoint_file = sys.argv[2]
    num_samples = int(sys.argv[3]) if len(sys.argv) > 3 else 100

    times = profile_inference(config_file, checkpoint_file, num_samples)

    print("\nProfiling complete!")

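The statistics printed by the script can be factored into a small reusable helper. A sketch with synthetic timings (real values come from the timing loop):

```python
import numpy as np

def latency_report(times_ms):
    """Summarize per-sample latencies (ms) into the stats the script prints."""
    t = np.asarray(times_ms, dtype=float)
    return {
        "mean": float(t.mean()),
        "median": float(np.median(t)),
        "p95": float(np.percentile(t, 95)),
        "p99": float(np.percentile(t, 99)),
        "fps": 1000.0 / float(t.mean()),
    }

# Synthetic example:
report = latency_report([88.0, 90.0, 92.0, 95.0, 110.0])
print(f"mean={report['mean']:.1f} ms, fps={report['fps']:.2f}")
# mean=95.0 ms, fps=10.53
```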
Step 3: Sensitivity analysis

# tools/analysis/sensitivity_analysis.py

import torch
import torch.nn as nn
import copy
from tqdm import tqdm
from mmcv import Config
from mmdet3d.models import build_model
from mmdet3d.datasets import build_dataloader, build_dataset
from mmdet3d.apis import single_gpu_test

def prune_layer_channels(model, layer_name, ratio=0.5):
    """Temporarily prune channels of the given layer."""
    # Simplified here; a real version must handle each layer type
    pruned_model = copy.deepcopy(model)

    # Find the target layer and prune it
    for name, module in pruned_model.named_modules():
        if name == layer_name:
            if isinstance(module, nn.Conv2d):
                # Simplified: keep only the first (1 - ratio) of channels
                out_channels = module.out_channels
                keep_channels = int(out_channels * (1 - ratio))
                # TODO: actual pruning implementation needed here
                pass

    return pruned_model

def evaluate_model(model, data_loader):
    """Quick model evaluation."""
    model.eval()
    results = []

    with torch.no_grad():
        for data in tqdm(data_loader, desc="Evaluating"):
            result = model(return_loss=False, rescale=True, **data)
            results.extend(result)

    # Placeholder: returns sample count; a real version computes mAP/NDS
    return len(results)

def analyze_sensitivity(config_file, checkpoint_file, prune_ratio=0.5):
    """Analyze per-layer pruning sensitivity."""

    print("Loading model...")
    cfg = Config.fromfile(config_file)
    model = build_model(cfg.model).cuda()
    checkpoint = torch.load(checkpoint_file)
    model.load_state_dict(checkpoint['state_dict'])

    # Build dataset (use a small subset for quick testing)
    print("Building dataset...")
    cfg.data.val.ann_file = cfg.data.val.ann_file  # TODO: point at a mini val split
    dataset = build_dataset(cfg.data.val)
    data_loader = build_dataloader(
        dataset,
        samples_per_gpu=1,
        workers_per_gpu=0,
        dist=False,
        shuffle=False
    )

    # Baseline performance
    print("\nEvaluating baseline...")
    baseline_score = evaluate_model(model, data_loader)
    print(f"Baseline score: {baseline_score}")

    # Per-layer sensitivity
    sensitivities = {}

    print(f"\nStarting sensitivity analysis (prune ratio: {prune_ratio})...")
    for name, module in tqdm(model.named_modules()):
        # Only analyze Conv2d layers
        if not isinstance(module, nn.Conv2d):
            continue

        if module.out_channels < 64:  # skip small layers
            continue

        print(f"\nTesting layer: {name}")

        # Temporarily prune this layer
        pruned_model = prune_layer_channels(model, name, prune_ratio)

        # Evaluate
        pruned_score = evaluate_model(pruned_model, data_loader)

        # Compute sensitivity
        sensitivity = baseline_score - pruned_score
        sensitivities[name] = sensitivity

        print(f"  Score after pruning: {pruned_score}")
        print(f"  Sensitivity: {sensitivity:.4f}")

        del pruned_model

    # Sort and print
    sorted_sens = sorted(sensitivities.items(), key=lambda x: x[1])

    print("\n" + "=" * 80)
    print("Sensitivity ranking (low to high; low sensitivity = safe to prune)")
    print("=" * 80)
    for name, sens in sorted_sens[:20]:  # show top 20
        print(f"{name:60s}: {sens:.4f}")

    return sensitivities

if __name__ == '__main__':
    import sys

    if len(sys.argv) < 3:
        print("Usage: python sensitivity_analysis.py <config> <checkpoint>")
        sys.exit(1)

    config_file = sys.argv[1]
    checkpoint_file = sys.argv[2]

    sensitivities = analyze_sensitivity(config_file, checkpoint_file)

    # Save results
    import json
    with open('sensitivity_results.json', 'w') as f:
        json.dump(sensitivities, f, indent=2)

    print("\nSensitivity analysis complete! Results saved to sensitivity_results.json")

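Once sensitivity_results.json exists, a short filter can turn it into a pruning candidate list. A sketch (the threshold and layer names below are made up for illustration):

```python
import json

def select_prunable(sensitivities, max_sensitivity=0.01, top_k=None):
    """Layers whose sensitivity is at or below the threshold, safest first."""
    candidates = [(n, s) for n, s in sensitivities.items() if s <= max_sensitivity]
    candidates.sort(key=lambda x: x[1])
    return candidates[:top_k] if top_k else candidates

# In practice: sens = json.load(open("sensitivity_results.json"))
fake = {"neck.conv1": 0.002, "head.conv": 0.05, "decoder.conv2": 0.008}
print(select_prunable(fake, max_sensitivity=0.01))
# [('neck.conv1', 0.002), ('decoder.conv2', 0.008)]
```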
📊 Commands to Run Now

1. Model complexity analysis (~5 min)

cd /workspace/bevfusion

# Create the analysis and results directories
mkdir -p tools/analysis
mkdir -p analysis_results

# Create and run the analysis script
python tools/analysis/model_complexity.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml \
  runs/enhanced_from_epoch19/epoch_23.pth \
  > analysis_results/model_complexity.txt

cat analysis_results/model_complexity.txt

2. Inference performance profiling (~15 min)

# Profile inference performance
python tools/analysis/profile_inference.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml \
  runs/enhanced_from_epoch19/epoch_23.pth \
  100 \
  > analysis_results/inference_profile.txt

cat analysis_results/inference_profile.txt

3. Sensitivity analysis (1-2 h, optional)

# Sensitivity analysis (use a mini val set for a quick test)
python tools/analysis/sensitivity_analysis.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml \
  runs/enhanced_from_epoch19/epoch_23.pth \
  > analysis_results/sensitivity_analysis.txt

📋 Expected Analysis Results

Expected results given the BEVFusion architecture

Model complexity

Total params: ~110M
  - Camera encoder (SwinT): ~47M (43%)  ← largest module
  - LiDAR encoder: ~19M (17%)
  - Fuser: ~2M (2%)
  - Decoder: ~16M (14%)
  - Detection head: ~18M (16%)
  - Segmentation head: ~8M (7%)

FLOPs: ~450 GFLOPs
Model size: ~441 MB (FP32)

Inference performance (A100)

Mean inference time: ~90 ms
  - Camera branch: ~40 ms (44%)  ← biggest bottleneck
  - LiDAR branch: ~17 ms (19%)
  - Fusion + decoder: ~15 ms (17%)
  - Heads: ~18 ms (20%)

Throughput: ~11 FPS

Optimization potential

1. Camera encoder pruning
   - Potential: 40-50% fewer parameters
   - Speedup: 20-30%
   - Sensitivity: medium

2. Decoder simplification
   - Potential: 30-40% fewer parameters
   - Speedup: 10-15%
   - Sensitivity: low

3. INT8 quantization
   - Speedup: 2-3x
   - Accuracy loss: <2%

🎯 Today's Goals

Must finish

  • Create the analysis tool scripts
  • Run the model complexity analysis
  • Run the inference profiling
  • Produce the analysis report

Optional

  • Sensitivity analysis (time permitting)
  • Decide on the pruning strategy
  • Prepare the pruning tools

📅 Next 7 Days

Day 1 (today):
  ✓ Model analysis
  ✓ Profiling
  ✓ Decide on the optimization strategy

Day 2-3:
  → Apply pruning
  → Fine-tune the pruned model (3 epochs)

Day 4:
  → Evaluate the pruned model
  → PTQ quantization test

Day 5-6:
  → QAT quantization training (5 epochs)

Day 7:
  → Evaluate the quantized model
  → Write the optimization report
  → Prepare the TensorRT conversion

🚀 Start Now

Stage 1 training is currently running (GPUs 0-3); model analysis can run in parallel (GPUs 4-7, or CPU).

Create the analysis tools

cd /workspace/bevfusion
mkdir -p tools/analysis
mkdir -p analysis_results

# Create the analysis scripts (see the Python code above)
# Then run the analyses

Status: 🚀 ready to start model optimization
Focus: analyze first, then optimize
Parallelism: does not interfere with Stage 1 training