# BEVFusion Deployment Plan for NVIDIA Orin 270T

## 🎯 Goals

Deploy the trained BEVFusion dual-task/triple-task model to the **NVIDIA AGX Orin 270T** and achieve:

- ✅ Real-time inference (>10 FPS)
- ✅ Low latency (<100 ms)
- ✅ Low power (<60 W)
- ✅ Preserved accuracy (mAP drop <3%)

---

## 📊 NVIDIA Orin 270T Specifications

### Hardware

| Parameter | Specification |
|------|------|
| **GPU** | 2048 CUDA cores + 64 Tensor cores |
| **AI compute** | 275 TOPS (INT8) |
| **Memory** | 64 GB unified memory |
| **CPU** | 12-core ARM Cortex-A78AE |
| **Power** | 15 W - 60 W (configurable) |
| **Architecture** | Ampere (same generation as A100) |

### Performance baselines

- **FP32**: ~5 TFLOPS
- **FP16**: ~10 TFLOPS
- **INT8**: ~20 TOPS
- **vs. A100**: roughly 1/10 the performance at 1/5 the power

---

## 📋 Deployment Pipeline Overview

```
Training complete (A100 × 8)
    ↓
Step 1: Model analysis and optimization (1-2 days)
    ↓
Step 2: Structured pruning (2-3 days)
    ↓
Step 3: Quantization-aware training (QAT) (3-4 days)
    ↓
Step 4: TensorRT optimization (2-3 days)
    ↓
Step 5: On-device testing on Orin (1-2 days)
    ↓
Step 6: Performance tuning (2-3 days)
    ↓
Production deployment ✅
```

**Total time**: roughly 2-3 weeks

---

## 🔧 Step 1: Model Analysis and Optimization (1-2 days)

### 1.1 Model complexity analysis

```bash
# Analyze parameter count and FLOPs
python tools/analysis/model_complexity.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth
```

**Expected output**:

```
BEVFusion dual-task model:
- Parameters: 110M
- FLOPs: 450 GFLOPs
- Inference time (A100): 90 ms
- Inference time (Orin, estimated): 450-900 ms (too slow!)
```

### 1.2 Bottleneck analysis

Profile with Nsight Systems:

```bash
# Profile on the A100
nsys profile -o bevfusion_profile \
    python tools/benchmark.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth
```

**Modules to watch**:
- SwinTransformer backbone (the most expensive)
- Multi-head attention
- 3D convolutions
- NMS post-processing

### 1.3 Export a baseline model

```bash
# Export to ONNX
python tools/export_onnx.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth \
    --output bevfusion_fp32.onnx
```

---

## ✂️ Step 2: Structured Pruning (2-3 days)

### 2.1 Pruning strategy

**Target**: cut parameters and FLOPs by 40-50%

**Plan**:

1. **Channel pruning**
   - SwinTransformer: remove 20% of channels
   - FPN: remove 30% of channels
   - Decoder: remove 25% of channels

2. **Layer pruning**
   - SwinTransformer: 6 layers → 4 layers
   - Decoder: 5 layers → 4 layers

3. **Attention-head pruning**
   - Number of heads: 8 → 6

### 2.2 Tooling

**Recommended**: Torch-Pruning

```python
# tools/pruning/prune_bevfusion.py
import torch
import torch_pruning as tp

# Load the trained model (build_model / config are this repo's helpers)
model = build_model(config)
model.load_state_dict(checkpoint)

# L1-magnitude channel importance (Torch-Pruning >= 1.x API)
importance = tp.importance.MagnitudeImportance(p=1)

# Prune the camera backbone (SwinTransformer)
pruner = tp.pruner.MagnitudePruner(
    model.encoders['camera'].backbone,
    example_inputs=example_images,
    importance=importance,
    pruning_ratio=0.3,   # remove 30% of channels overall
    iterative_steps=5,   # spread the pruning over 5 rounds
)

# Alternate pruning rounds with short fine-tuning
for i in range(5):
    pruner.step()
    finetune(model, train_loader, epochs=5)

# Save the pruned model
torch.save(model.state_dict(), 'bevfusion_pruned.pth')
```

### 2.3 Post-pruning fine-tuning

```bash
# Fine-tune for 5 epochs on the original dataset
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/multitask/fusion-det-seg-swint_pruned.yaml \
    --load_from bevfusion_pruned.pth \
    --cfg-options \
        max_epochs=5 \
        optimizer.lr=5.0e-5   # smaller learning rate
```

**Expected results**:
- Parameters: 110M → 60M (-45%)
- FLOPs: 450G → 250G (-44%)
- Accuracy loss: <2%
- Inference time: 90 ms → 50 ms (A100)

---
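Before investing in QAT, it is worth confirming that the pruned checkpoint actually hits the 40-50% reduction budget. A minimal sketch using Torch-Pruning's op counter — `build_model`, `config`/`pruned_config`, and `example_inputs` are assumed to come from this repo's tooling, and loading the pruned weights presumes a config that matches the pruned shapes:

```python
# tools/pruning/eval_pruned_model.py -- sketch: verify the pruning budget
import torch
import torch_pruning as tp

def report(model, example_inputs, tag):
    # count_ops_and_params returns (MACs, parameter count)
    macs, params = tp.utils.count_ops_and_params(model, example_inputs)
    print(f"[{tag}] params: {params / 1e6:.1f}M  MACs: {macs / 1e9:.1f}G")
    return macs, params

baseline = build_model(config)                 # assumed repo helper
baseline.load_state_dict(torch.load('runs/run-xxx/epoch_20.pth'))
pruned = build_model(pruned_config)            # config matching the pruned shapes
pruned.load_state_dict(torch.load('bevfusion_pruned.pth'))

m0, p0 = report(baseline, example_inputs, 'baseline')
m1, p1 = report(pruned, example_inputs, 'pruned')
print(f"reduction: params -{100 * (1 - p1 / p0):.0f}%  MACs -{100 * (1 - m1 / m0):.0f}%")
```

If the reduction falls short, revisit the per-module ratios in 2.1 now; pruning again after QAT would force a second round of quantization training.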
## 🔢 Step 3: Quantization-Aware Training, QAT (3-4 days)

### 3.1 Quantization strategy

**Target**: FP32 → INT8 with <2% accuracy loss

**Plan**:

```
FP32 model (60M params after pruning)
    ↓
PTQ (Post-Training Quantization) — quick feasibility check
    ↓
QAT (Quantization-Aware Training) — accuracy recovery
    ↓
INT8 model (~4× compression: same parameter count, 1 byte per weight)
```

### 3.2 QAT with PyTorch quantization

```python
# tools/quantization/quantize_bevfusion.py
import torch
from torch.quantization import get_default_qat_qconfig, prepare_qat, convert

# Load the pruned model (load_pruned_model is this repo's helper)
model = load_pruned_model('bevfusion_pruned.pth')
model.train()  # prepare_qat requires training mode

# Attach the QAT config; 'fbgemm' targets x86 — on ARM, 'qnnpack' is the usual
# choice, though the final INT8 engine is built by TensorRT either way
model.qconfig = get_default_qat_qconfig('fbgemm')

# Insert observers and fake-quantization modules
model_qat = prepare_qat(model)

# QAT fine-tuning (important!): small learning rate, 3-5 epochs
train_qat(model_qat, train_loader, epochs=5, lr=1e-5)

# Convert to a true INT8 model (switch to eval mode first)
model_int8 = convert(model_qat.eval())

# Save
torch.save(model_int8.state_dict(), 'bevfusion_int8.pth')
```

### 3.3 QAT training configuration

```yaml
# configs/nuscenes/multitask/fusion-det-seg-swint_qat.yaml
_base_: ./fusion-det-seg-swint_pruned.yaml

# QAT-specific settings
quantization:
  enabled: true
  qconfig: 'fbgemm'

# Training parameters
max_epochs: 5
optimizer:
  lr: 1.0e-5             # very small learning rate
  weight_decay: 0.0001

# Weakened data augmentation
augment2d:
  resize: [[0.45, 0.48], [0.48, 0.48]]  # narrower resize range
  rotate: [-2.0, 2.0]                   # less rotation
augment3d:
  scale: [0.95, 1.05]                   # less scaling
  rotate: [-0.39, 0.39]                 # less rotation
  translate: 0.25                       # less translation
```

### 3.4 Quantization validation

```bash
# Evaluate the INT8 model
python tools/test.py \
    configs/nuscenes/multitask/fusion-det-seg-swint_qat.yaml \
    bevfusion_int8.pth \
    --eval bbox map
```

**Expected results**:
- Weight storage: −75% (FP32 → INT8; parameter count unchanged)
- Inference speed: 2-4× faster
- Accuracy loss: 1-2%
- Memory footprint: −75%

---

## 🚀 Step 4: TensorRT Optimization (2-3 days)

### 4.1 TensorRT conversion

```python
# tools/tensorrt/convert_to_trt.py
import tensorrt as trt
import torch

# 1. Export ONNX from the QAT model
torch.onnx.export(
    model_int8,
    dummy_input,
    'bevfusion_int8.onnx',
    opset_version=17,
    input_names=['images', 'points'],
    output_names=['bboxes', 'scores', 'labels', 'masks'],
    dynamic_axes={
        'images': {0: 'batch'},
        'points': {0: 'batch'},
    },
)

# 2. Build the TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the ONNX graph
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('bevfusion_int8.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('ONNX parsing failed')

# Builder configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GB

# INT8, with FP16 as fallback for layers that cannot run in INT8
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)

# Calibrator (used on the PTQ path, when the graph carries no Q/DQ scales)
config.int8_calibrator = BEVFusionCalibrator(
    calibration_dataset,
    cache_file='bevfusion_calibration.cache',
)

# Build and serialize the engine
serialized_engine = builder.build_serialized_network(network, config)

with open('bevfusion_int8.engine', 'wb') as f:
    f.write(serialized_engine)
```
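The `BEVFusionCalibrator` above is referenced but never defined; TensorRT expects a subclass of `trt.IInt8EntropyCalibrator2` that feeds calibration batches and caches the computed scales. A minimal sketch — the batch shape, the `batches` iterable, and the single-input simplification are assumptions (the real model would feed both `images` and `points`):

```python
# tools/tensorrt/calibrate.py -- sketch of an INT8 entropy calibrator
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

class BEVFusionCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file, batch_shape=(1, 6, 3, 256, 704)):
        super().__init__()
        self.batches = iter(batches)      # iterable of np.float32 arrays
        self.cache_file = cache_file
        self.device_mem = cuda.mem_alloc(int(np.prod(batch_shape)) * 4)

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                   # no more data: calibration finished
        cuda.memcpy_htod(self.device_mem, np.ascontiguousarray(batch))
        return [int(self.device_mem)]     # one device pointer per input name

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, 'rb') as f:
                return f.read()           # reuse cached scales when present
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)
```

Around 100 representative samples (matching the `calibration_100samples` set used by the deploy script) is a common starting point, and the cache file lets later engine rebuilds skip calibration entirely.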
### 4.2 TensorRT inference interface

```python
# tools/tensorrt/trt_inference.py
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context

class BEVFusionTRT:
    def __init__(self, engine_path):
        # Load the serialized engine
        with open(engine_path, 'rb') as f:
            runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        # Allocate GPU memory
        self.allocate_buffers()

    def allocate_buffers(self):
        self.inputs, self.outputs, self.bindings = [], [], []
        for i in range(self.engine.num_bindings):  # TensorRT 8.5 binding API
            name = self.engine.get_binding_name(i)
            shape = tuple(self.engine.get_binding_shape(i))
            dtype = trt.nptype(self.engine.get_binding_dtype(i))
            # Allocate device memory
            device_mem = cuda.mem_alloc(trt.volume(shape) * np.dtype(dtype).itemsize)
            self.bindings.append(int(device_mem))
            buf = {'name': name, 'memory': device_mem, 'shape': shape, 'dtype': dtype}
            if self.engine.binding_is_input(i):
                self.inputs.append(buf)
            else:
                self.outputs.append(buf)

    def infer(self, images, points):
        # Copy inputs to the GPU
        cuda.memcpy_htod(self.inputs[0]['memory'], images)
        cuda.memcpy_htod(self.inputs[1]['memory'], points)
        # Run inference
        self.context.execute_v2(bindings=self.bindings)
        # Copy outputs back to the CPU
        results = []
        for out in self.outputs:
            host_mem = cuda.pagelocked_empty(out['shape'], out['dtype'])
            cuda.memcpy_dtoh(host_mem, out['memory'])
            results.append(host_mem)
        return results

# Usage
trt_model = BEVFusionTRT('bevfusion_int8.engine')
bboxes, scores, labels, masks = trt_model.infer(images, points)
```

### 4.3 TensorRT tuning tips

**Orin-specific settings**:

```python
# 1. DLA acceleration (Orin has 2 DLA cores); unsupported layers fall back to GPU
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0  # use DLA core 0

# 2. Honor per-layer precision constraints where possible
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)

# 3. Faster dynamic-shape kernels (TensorRT 8.5 preview feature)
config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, True)

# 4. Optimization profile matched to real input shapes (Orin favors small batches)
profile = builder.create_optimization_profile()
profile.set_shape(
    "images",
    min=(1, 6, 3, 256, 704),
    opt=(1, 6, 3, 256, 704),  # the shape to optimize for
    max=(2, 6, 3, 256, 704),
)
config.add_optimization_profile(profile)
```

---

## 🧪 Step 5: Testing on Orin (1-2 days)

### 5.1 Environment setup

```bash
# Install dependencies on the Orin
# JetPack 5.1+ (ships CUDA 11.4, cuDNN 8.6, TensorRT 8.5)

# Python dependencies
pip3 install pycuda
pip3 install numpy opencv-python

# Copy the model over
scp bevfusion_int8.engine orin@192.168.1.100:/home/orin/models/
```

### 5.2 Latency benchmark

```python
# tools/deployment/benchmark_orin.py
import time
import numpy as np

# Load the TensorRT engine (BEVFusionTRT from tools/tensorrt/trt_inference.py)
trt_model = BEVFusionTRT('bevfusion_int8.engine')

# Warm up
for _ in range(10):
    trt_model.infer(dummy_images, dummy_points)

# Benchmark
times = []
for i in range(100):
    start = time.time()
    outputs = trt_model.infer(images, points)
    end = time.time()
    times.append((end - start) * 1000)  # ms

print(f"Mean latency: {np.mean(times):.2f} ms")
print(f"Throughput: {1000 / np.mean(times):.2f} FPS")
print(f"P99 latency: {np.percentile(times, 99):.2f} ms")
```

### 5.3 Power measurement

```bash
# Log power draw
sudo tegrastats --interval 1000 > power_log.txt &

# Run inference
python3 tools/benchmark_orin.py

# Inspect the GPU/SoC power rail
cat power_log.txt | grep "VDD_GPU_SOC"
```

### 5.4 Accuracy validation

```bash
# Run the nuScenes validation set on the Orin
python3 tools/test_orin.py \
    --engine bevfusion_int8.engine \
    --data-root /data/nuscenes \
    --eval bbox map
```

**Expected performance**:
- **Latency**: 60-80 ms (vs. 90 ms for the unoptimized model on A100)
- **Throughput**: 12-16 FPS ✅
- **Power**: 40-50 W
- **Accuracy loss**: <3%
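The wall-clock loop in 5.2 measures the whole Python round trip, including host-device copies. Before tuning, it helps to isolate pure GPU execution time with CUDA events so you know whether kernels or pipeline overhead dominate. A sketch reusing the `BEVFusionTRT` wrapper from 4.2:

```python
# sketch: GPU-only timing with CUDA events (pycuda)
import numpy as np
import pycuda.driver as cuda

start, end = cuda.Event(), cuda.Event()
gpu_times = []
for _ in range(100):
    start.record()
    trt_model.context.execute_v2(bindings=trt_model.bindings)
    end.record()
    end.synchronize()                       # wait for the GPU to finish
    gpu_times.append(start.time_till(end))  # elapsed GPU time in ms

print(f"GPU-only mean: {np.mean(gpu_times):.2f} ms")
# wall-clock mean (5.2) minus this value ~= copy + Python overhead
```

If the gap between wall-clock and GPU-only time dominates, the multi-stream and unified-memory work in Step 6 will pay off more than further kernel tuning.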
---

## ⚡ Step 6: Performance Tuning (2-3 days)

### 6.1 Multi-stream parallelism

```python
# Overlap uploads and inference with CUDA streams and events (pycuda)
import pycuda.driver as cuda

class OptimizedPipeline:
    def __init__(self, trt_model):
        self.trt_model = trt_model
        self.copy_stream = cuda.Stream()    # host -> device copies
        self.infer_stream = cuda.Stream()   # TensorRT execution
        self.copy_done = cuda.Event()

    def process_frame(self, raw_images, raw_points):
        # CPU-side preprocessing (repo helpers)
        images = preprocess_images(raw_images)
        points = preprocess_points(raw_points)

        # Asynchronous uploads on the copy stream
        cuda.memcpy_htod_async(self.trt_model.inputs[0]['memory'], images,
                               self.copy_stream)
        cuda.memcpy_htod_async(self.trt_model.inputs[1]['memory'], points,
                               self.copy_stream)
        self.copy_done.record(self.copy_stream)

        # Inference waits only for the copies and runs on its own stream,
        # so the next frame's preprocessing can proceed on the CPU meanwhile
        self.infer_stream.wait_for_event(self.copy_done)
        self.trt_model.context.execute_async_v2(
            bindings=self.trt_model.bindings,
            stream_handle=self.infer_stream.handle,
        )
        self.infer_stream.synchronize()
        # Download + postprocess (same D2H copies as BEVFusionTRT.infer)
        return postprocess_outputs(self.trt_model.outputs)
```

### 6.2 Memory optimization

```python
# Use unified memory to avoid explicit copies (a good fit for Orin's shared DRAM)
import numpy as np
import pycuda.driver as cuda

# Allocate managed (unified) memory
images_um = cuda.managed_empty(shape, dtype=np.float32,
                               mem_flags=cuda.mem_attach_flags.GLOBAL)
points_um = cuda.managed_empty(shape, dtype=np.float32,
                               mem_flags=cuda.mem_attach_flags.GLOBAL)

# Fill directly from the CPU
np.copyto(images_um, preprocessed_images)

# The GPU reads the same memory; no explicit htod copy needed
outputs = trt_model.infer(images_um, points_um)
```

### 6.3 DLA offload

Targeting Orin's two DLA cores — DLA handles convolution, pooling, and normalization well, while attention and other complex ops stay on the GPU:

```python
# Assign suitable layers to the DLA at engine-build time.
# The name keywords are illustrative; inspect the network's layer names first.
dla_keywords = ['backbone/conv1', 'backbone/layer1', 'voxelize']

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if any(k in layer.name for k in dla_keywords):
        if config.can_run_on_DLA(layer):
            config.set_device_type(layer, trt.DeviceType.DLA)
```

---

## 📊 Expected Performance

### Performance by optimization stage

| Stage | Params | Compute | Latency (Orin) | Accuracy Δ | Verdict |
|------|--------|-------|---------------|---------|------|
| **Original FP32** | 110M | 450 GFLOPs | 900 ms | - | too slow ❌ |
| **Pruned, FP32** | 60M | 250 GFLOPs | 500 ms | -1.5% | still slow ⚠️ |
| **Pruned + INT8** | 60M (INT8) | 250 G INT8 ops | 80 ms | -2.5% | usable ✅ |
| **+ TensorRT** | 60M (INT8) | 250 G INT8 ops | 65 ms | -2.5% | good ✅ |
| **+ Multi-stream** | 60M (INT8) | 250 G INT8 ops | 50 ms | -2.5% | best 🌟 |

(INT8 leaves parameter and op counts unchanged; it cuts weight storage and per-op cost by roughly 4×, so the 60M INT8 weights occupy about 60 MB.)

### Final performance targets

| Metric | Target | Expected |
|------|--------|---------|
| **Latency** | <80 ms | 50-65 ms ✅ |
| **Throughput** | >10 FPS | 15-20 FPS ✅ |
| **Power** | <60 W | 40-50 W ✅ |
| **Detection mAP** | >63% | 65-67% ✅ |
| **Segmentation mIoU** | >52% | 53-57% ✅ |
| **Memory footprint** | <4 GB | 2-3 GB ✅ |

---

## 🛠️ Tools and Scripts

### Scripts to create

```bash
tools/
├── pruning/
│   ├── prune_bevfusion.py      # pruning
│   └── eval_pruned_model.py    # evaluate the pruned model
├── quantization/
│   ├── quantize_bevfusion.py   # quantization
│   ├── qat_train.py            # QAT training
│   └── calibrate.py            # INT8 calibration
├── tensorrt/
│   ├── convert_to_trt.py       # ONNX → TensorRT
│   ├── trt_inference.py        # TensorRT inference
│   └── optimize_dla.py         # DLA optimization
├── deployment/
│   ├── benchmark_orin.py       # Orin benchmarks
│   ├── deploy_to_orin.sh       # one-click deployment
│   └── monitor_performance.py  # performance monitoring
└── analysis/
    ├── model_complexity.py     # model complexity analysis
    └── latency_breakdown.py    # latency breakdown
```

---

## 📅 Detailed Schedule

### Week 1: pruning and quantization prep

| Days | Task | Deliverable |
|------|------|------|
| Day 1-2 | Model analysis, ONNX export | baseline report |
| Day 3-4 | Structured pruning | pruned model (60M) |
| Day 5 | Fine-tune the pruned model | fine-tuned checkpoint |
| Day 6-7 | Initial PTQ tests | INT8 feasibility report |

### Week 2: QAT and TensorRT

| Days | Task | Deliverable |
|------|------|------|
| Day 8-10 | QAT training | INT8 model (~60 MB weights) |
| Day 11-12 | TensorRT conversion and tuning | TRT engine |
| Day 13 | TensorRT test on A100 | performance baseline |
| Day 14 | Prepare the Orin environment | deployment package |

### Week 3: Orin testing and tuning

| Days | Task | Deliverable |
|------|------|------|
| Day 15 | Deploy to the Orin | first results |
| Day 16 | Performance and power tests | test report |
| Day 17-18 | Accuracy validation | accuracy report |
| Day 19-20 | Multi-stream and DLA tuning | optimized model |
| Day 21 | Final validation and documentation | deployment docs ✅ |

---

## 🔍 Key Technical Points

### 1. Orin-specific considerations

**Orin vs. a discrete GPU**:
- ✅ Unified memory is a big win
- ✅ DLAs are available and suit convolutional layers
- ⚠️ Fewer Tensor cores, so FP16 gains are smaller
- ⚠️ Lower memory bandwidth; optimize access patterns

### 2. BEVFusion-specific optimization

**Per-module notes**:

1. **SwinTransformer**
   - The most expensive module (~40% of runtime)
   - Responds best to pruning
   - Window attention can be approximated with convolutions

2. **LSS view transform**
   - Dense 3D convolutions
   - Quantizes well to INT8
   - Consider decomposing the computation

3. **ConvFuser**
   - Simple concat + conv
   - Near-lossless to optimize

4. **TransFusion head**
   - Complex query mechanism
   - Quantize carefully
   - NMS can run in parallel on the CPU

### 3. Keeping accuracy

**QAT essentials** (the last two points are shown in the sketch below):
- ✅ Train on the full original dataset
- ✅ Keep the learning rate small (1e-5)
- ✅ 3-5 epochs is usually enough
- ✅ Leave BatchNorm layers unquantized
- ✅ Keep particularly sensitive layers in FP16
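In PyTorch QAT, those last two points amount to per-module `qconfig` overrides: setting `qconfig = None` on a submodule before `prepare_qat` keeps it in floating point. A sketch — the head path `model.heads['object']` is illustrative, not necessarily this repo's exact module name:

```python
# sketch: keep sensitive modules out of INT8 during QAT
import torch
from torch.quantization import get_default_qat_qconfig, prepare_qat

model.train()
model.qconfig = get_default_qat_qconfig('fbgemm')

# Keep the query-based detection head in float (hypothetical module path)
model.heads['object'].qconfig = None

# Leave normalization layers unquantized
for m in model.modules():
    if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d,
                      torch.nn.BatchNorm3d, torch.nn.LayerNorm)):
        m.qconfig = None

model_qat = prepare_qat(model)
```

On the TensorRT side the same intent is expressed per layer, e.g. `layer.precision = trt.float16` combined with the `PREFER_PRECISION_CONSTRAINTS` flag from 4.3.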
---

## 📦 Deployment Package Layout

```
bevfusion_orin_deploy/
├── models/
│   ├── bevfusion_int8.engine   # TensorRT engine
│   ├── config.yaml             # configuration
│   └── class_names.txt         # class names
├── lib/
│   ├── libbevfusion.so         # C++ inference library
│   └── python/
│       └── bevfusion_trt.py    # Python bindings
├── scripts/
│   ├── run_inference.sh        # inference launcher
│   └── benchmark.sh            # benchmarks
├── data/
│   └── sample_data/            # sample inputs
├── docs/
│   ├── API.md                  # API reference
│   └── OPTIMIZATION.md         # optimization notes
└── README.md                   # usage
```

---

## 🎯 Fallback Strategy

### If performance falls short

**Plan B options**:

1. **Prune further** (60M → 40M)
   - Costs 1-2% accuracy
   - Gains 20-30% speed

2. **Lower the input resolution**
   - Images: 256×704 → 192×512
   - BEV grid: 180×180 → 128×128
   - ~40% faster

3. **Simplify the task set**
   - Keep detection only
   - Or choose one of detection / segmentation

4. **Use two Orins**
   - Orin-1 for camera processing
   - Orin-2 for LiDAR processing
   - Inference in parallel

---

## 📚 References

### Official documentation
- [TensorRT Documentation](https://docs.nvidia.com/deeplearning/tensorrt/)
- [Orin Developer Guide](https://developer.nvidia.com/embedded/jetson-agx-orin-developer-kit)
- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)

### Open-source tools
- [Torch-Pruning](https://github.com/VainF/Torch-Pruning)
- [TensorRT-OSS](https://github.com/NVIDIA/TensorRT)
- [ONNX Runtime](https://github.com/microsoft/onnxruntime)

### Related papers
- "Learned Step Size Quantization" (LSQ)
- "Learning Efficient Convolutional Networks through Network Slimming" (channel pruning)
- "Accelerating Deep Learning with TensorRT"

---

## ✅ Success Criteria

### Minimum requirements
- ✅ Latency < 80 ms
- ✅ Throughput > 12 FPS
- ✅ Power < 60 W
- ✅ Detection mAP > 63%
- ✅ Segmentation mIoU > 52%

### Stretch goals
- 🌟 Latency < 60 ms
- 🌟 Throughput > 16 FPS
- 🌟 Power < 45 W
- 🌟 Detection mAP > 65%
- 🌟 Segmentation mIoU > 55%

---

## 🚀 Quick Start

### One-click deployment script

```bash
#!/bin/bash
# scripts/deploy_to_orin.sh

echo "========== BEVFusion Orin deployment =========="

# 1. Pruning
echo "Step 1: pruning..."
python tools/pruning/prune_bevfusion.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth \
    --output bevfusion_pruned.pth

# 2. Quantization
echo "Step 2: INT8 quantization..."
python tools/quantization/quantize_bevfusion.py \
    --model bevfusion_pruned.pth \
    --output bevfusion_int8.pth \
    --calibration-data data/nuscenes/calibration_100samples

# 3. TensorRT conversion
echo "Step 3: TensorRT conversion..."
python tools/tensorrt/convert_to_trt.py \
    --model bevfusion_int8.pth \
    --output bevfusion_int8.engine \
    --fp16 \
    --int8 \
    --workspace 4096

# 4. Benchmark
echo "Step 4: benchmarking..."
python tools/deployment/benchmark_orin.py \
    --engine bevfusion_int8.engine

echo "Deployment complete!"
```

---

Generated: 2025-10-17
Target hardware: NVIDIA AGX Orin 270T
Estimated deployment timeline: 2-3 weeks