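# BEVFusion Deployment Plan for NVIDIA Orin 270T
## 🎯 Goals
Deploy the trained BEVFusion dual-task/triple-task model to the **NVIDIA AGX Orin 270T** and achieve:
- ✅ Real-time inference (>10 FPS)
- ✅ Low latency (<100 ms)
- ✅ Low power (<60 W)
- ✅ Preserved accuracy (mAP drop <3%)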
---
## 📊 NVIDIA Orin 270T Specifications
### Hardware
| Parameter | Specification |
|------|------|
| **GPU** | 2048 CUDA cores + 64 Tensor cores |
| **AI compute** | 275 TOPS (INT8) |
| **Memory** | 64 GB unified memory |
| **CPU** | 12-core ARM Cortex-A78AE |
| **Power** | 15 W - 60 W (configurable) |
| **Architecture** | Ampere (similar to A100) |
### Performance Baseline
- **FP32**: ~5 TFLOPS
- **FP16**: ~10 TFLOPS
- **INT8**: ~20 TOPS
- **Compared with A100**: roughly 1/10 of the performance at only 1/5 of the power
---
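## 📋 Deployment Workflow Overview
```
Training complete (A100 × 8)
    ↓
Step 1: Model analysis and optimization (1-2 days)
    ↓
Step 2: Structured pruning (2-3 days)
    ↓
Step 3: Quantization-aware training (QAT) (3-4 days)
    ↓
Step 4: TensorRT optimization (2-3 days)
    ↓
Step 5: Testing on Orin (1-2 days)
    ↓
Step 6: Performance tuning (2-3 days)
    ↓
Production deployment ✅
```
**Total time**: about 2-3 weeks
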
---
## 🔧 Step 1: Model Analysis and Optimization (1-2 days)
### 1.1 Model Complexity Analysis
```bash
# Analyze the model's parameter count and FLOPs
python tools/analysis/model_complexity.py \
  --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
  --checkpoint runs/run-xxx/epoch_20.pth
```
**Expected output**:
```
BEVFusion dual-task model:
- Parameters: 110M
- FLOPs: 450 GFLOPs
- Inference time (A100): 90 ms
- Inference time (Orin, estimated): 450-900 ms (too slow!)
```
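
The Orin figure in the expected output is an extrapolation from the A100 measurement. A minimal sketch of that arithmetic, assuming a 5× to 10× slowdown (the range implied by 90 ms scaling to 450-900 ms); `estimate_orin_latency_ms` is an illustrative helper, not part of the project:

```python
def estimate_orin_latency_ms(a100_latency_ms, slowdown_range=(5.0, 10.0)):
    """Scale an A100 latency by an assumed (low, high) slowdown range."""
    low, high = slowdown_range
    return a100_latency_ms * low, a100_latency_ms * high


low_ms, high_ms = estimate_orin_latency_ms(90.0)
print(f"Estimated Orin latency: {low_ms:.0f}-{high_ms:.0f} ms")
```
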
### 1.2 Bottleneck Analysis
Profile the model with Nsight Systems:
```bash
# Profile on the A100
nsys profile -o bevfusion_profile \
  python tools/benchmark.py \
  --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
  --checkpoint runs/run-xxx/epoch_20.pth
```
**Modules to watch**:
- SwinTransformer backbone (the most time-consuming)
- Multi-head attention
- 3D convolutions
- NMS post-processing
### 1.3 Export the Baseline Model
```bash
# Export to ONNX
python tools/export_onnx.py \
  --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
  --checkpoint runs/run-xxx/epoch_20.pth \
  --output bevfusion_fp32.onnx
```
---
## ✂️ Step 2: Structured Pruning (2-3 days)
### 2.1 Pruning Strategy
**Goal**: reduce parameters and FLOPs by 40-50%
**Pruning plan**:
1. **Channel Pruning**
   - SwinTransformer: 20% fewer channels
   - FPN: 30% fewer channels
   - Decoder: 25% fewer channels
2. **Layer Pruning**
   - SwinTransformer: 6 layers → 4 layers
   - Decoder: 5 layers → 4 layers
3. **Attention Head Pruning**
   - Number of attention heads: 8 → 6
### 2.2 Choosing a Pruning Tool
**Recommended**: Torch-Pruning
```python
# tools/pruning/prune_bevfusion.py
import torch
import torch_pruning as tp
# Load the model (build_model, config, and checkpoint come from the project)
model = build_model(config)
model.load_state_dict(checkpoint)
# Define the importance criterion (L1 magnitude)
importance = tp.importance.MagnitudeImportance(p=1)
# Prune the SwinTransformer backbone
pruner = tp.pruner.MagnitudePruner(
    model.encoders['camera'].backbone,
    example_inputs=example_images,
    importance=importance,
    pruning_ratio=0.3,  # prune 30%
    iterative_steps=5,
)
# Prune iteratively
for i in range(5):
    pruner.step()
    # Fine-tune after each pruning step
    finetune(model, train_loader, epochs=5)
# Save the pruned model
torch.save(model.state_dict(), 'bevfusion_pruned.pth')
```
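
Magnitude-based pruning ranks channels by the size of their weights and removes the smallest ones. The pure-Python sketch below illustrates the L1 ranking idea on a toy layer. It is a simplified illustration, not how Torch-Pruning works internally (the library also tracks dependencies between layers), and `select_channels_to_prune` is a hypothetical name:

```python
def select_channels_to_prune(filters, pruning_ratio):
    """Return the indices of the output channels with the smallest L1 norms.

    filters: one flat list of weights per output channel.
    pruning_ratio: fraction of channels to remove (0.0 to 1.0).
    """
    l1_norms = [sum(abs(w) for w in channel) for channel in filters]
    num_to_prune = int(len(filters) * pruning_ratio)
    ranked = sorted(range(len(filters)), key=lambda i: l1_norms[i])
    return sorted(ranked[:num_to_prune])


# Toy layer with four output channels (made-up weights).
toy_filters = [
    [0.9, -0.8, 0.7],     # L1 norm 2.4
    [0.01, 0.02, -0.01],  # L1 norm 0.04
    [0.5, 0.5, -0.5],     # L1 norm 1.5
    [0.0, 0.05, 0.0],     # L1 norm 0.05
]
print(select_channels_to_prune(toy_filters, pruning_ratio=0.5))
```

With a ratio of 0.5, the two channels with the smallest norms (indices 1 and 3) are selected for removal.
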
### 2.3 Fine-tuning After Pruning
```bash
# Fine-tune for 5 epochs on the original dataset
torchpack dist-run -np 8 python tools/train.py \
  configs/nuscenes/multitask/fusion-det-seg-swint_pruned.yaml \
  --load_from bevfusion_pruned.pth \
  --cfg-options \
  max_epochs=5 \
  optimizer.lr=5.0e-5  # smaller learning rate
```
**Expected results**:
- Parameters: 110M → 60M (-45%)
- FLOPs: 450G → 250G (-44%)
- Accuracy loss: <2%
- Inference time: 90 ms → 50 ms (A100)
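
The percentages in the expected results are relative changes against the unpruned model. A small sketch of that calculation using the figures listed above (`relative_change_percent` is an illustrative helper):

```python
def relative_change_percent(before, after):
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


for name, before, after in [
    ("Parameters (M)", 110, 60),
    ("FLOPs (G)", 450, 250),
    ("A100 latency (ms)", 90, 50),
]:
    print(f"{name}: {relative_change_percent(before, after):.0f}%")
```
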
---
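## 🔢 Step 3: Quantization-Aware Training (QAT) (3-4 days)
### 3.1 Quantization Strategy
**Goal**: FP32 → INT8 while keeping accuracy loss <2%
**Quantization plan**:
```
FP32 model (110M parameters)
    ↓
PTQ (Post-Training Quantization) - quick validation
    ↓
QAT (Quantization-Aware Training) - accuracy recovery
    ↓
INT8 model (4× smaller weights, equivalent to 27.5M FP32 parameters)
```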
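
Both PTQ and QAT rely on mapping floating-point values to 8-bit integers and back. The sketch below shows symmetric per-tensor INT8 quantization in plain Python as one common scheme; the scheme actually used by a given toolkit may differ (per-channel scales, zero points), and the function names are illustrative:

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization. Returns (int8 values, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # made-up weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print(restored)
```

The round-trip error of each weight is at most half of `scale`. QAT simulates this rounding during training so the network can adapt to it.
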
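### 3.2 Using PyTorch Quantization
```python
# tools/quantization/quantize_bevfusion.py
import torch
from torch.quantization import prepare_qat, convert
# Load the pruned model (load_pruned_model is a project helper)
model = load_pruned_model('bevfusion_pruned.pth')
model.train()  # prepare_qat expects the model in training mode
# Set the quantization config ('fbgemm' is PyTorch's x86 CPU backend)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
# Prepare for QAT
model_qat = prepare_qat(model)
# QAT training (important!)
# Train for 3-5 epochs with a small learning rate
train_qat(
    model_qat,
    train_loader,
    epochs=5,
    lr=1e-5
)
# Convert to INT8 (convert expects the model in eval mode)
model_qat.eval()
model_int8 = convert(model_qat)
# Save
torch.save(model_int8.state_dict(), 'bevfusion_int8.pth')
```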
### 3.3 QAT Training Configuration
```yaml
# configs/nuscenes/multitask/fusion-det-seg-swint_qat.yaml
_base_: ./fusion-det-seg-swint_pruned.yaml
# QAT-specific settings
quantization:
  enabled: true
  qconfig: 'fbgemm'
# Training parameters
max_epochs: 5
optimizer:
  lr: 1.0e-5  # very small learning rate
  weight_decay: 0.0001
# Weaker data augmentation
augment2d:
  resize: [[0.45, 0.48], [0.48, 0.48]]  # narrower resize range
  rotate: [-2.0, 2.0]  # less rotation
augment3d:
  scale: [0.95, 1.05]  # less scaling
  rotate: [-0.39, 0.39]  # less rotation
  translate: 0.25  # less translation
```
### 3.4 Quantization Validation
```bash
# Validate the accuracy of the INT8 model
python tools/test.py \
  configs/nuscenes/multitask/fusion-det-seg-swint_qat.yaml \
  bevfusion_int8.pth \
  --eval bbox map
```
**Expected results**:
- Model size: 110M → 27.5M (-75%)
- Inference speed: 2-4× faster
- Accuracy loss: 1-2%
- Memory footprint: 75% lower
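
The 75% size reduction follows from storing each weight in 1 byte instead of 4; the number of parameters itself does not change. A back-of-the-envelope sketch, assuming weights dominate the file size and ignoring per-tensor scales and other metadata (`weight_size_mb` is an illustrative helper):

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(num_params, precision):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


fp32_mb = weight_size_mb(110_000_000, "fp32")
int8_mb = weight_size_mb(110_000_000, "int8")
reduction = (1 - int8_mb / fp32_mb) * 100
print(f"FP32: {fp32_mb:.0f} MB, INT8: {int8_mb:.0f} MB, reduction: {reduction:.0f}%")
```
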
---
## 🚀 Step 4: TensorRT Optimization (2-3 days)
### 4.1 TensorRT Conversion
```python
# tools/tensorrt/convert_to_trt.py
import tensorrt as trt
import torch
# 1. Export ONNX (from the INT8 model)
torch.onnx.export(
    model_int8,
    dummy_input,
    'bevfusion_int8.onnx',
    opset_version=17,
    input_names=['images', 'points'],
    output_names=['bboxes', 'scores', 'labels', 'masks'],
    dynamic_axes={
        'images': {0: 'batch'},
        'points': {0: 'batch'}
    }
)
# 2. Build the TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
# Parse the ONNX file and stop on parser errors
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('bevfusion_int8.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('Failed to parse bevfusion_int8.onnx')
# Configure TensorRT
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GB
# INT8 optimization
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)  # FP16 as a fallback
# Calibration (used for PTQ); BEVFusionCalibrator is a project-defined calibrator
config.int8_calibrator = BEVFusionCalibrator(
    calibration_dataset,
    cache_file='bevfusion_calibration.cache'
)
# Build the engine
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise RuntimeError('TensorRT engine build failed')
# Save
with open('bevfusion_int8.engine', 'wb') as f:
    f.write(serialized_engine)
```
### 4.2 TensorRT Inference Interface
```python
# tools/tensorrt/trt_inference.py
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates the CUDA context on import


class BEVFusionTRT:
    def __init__(self, engine_path):
        # Load the engine
        with open(engine_path, 'rb') as f:
            runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        # Allocate GPU memory
        self.allocate_buffers()

    def allocate_buffers(self):
        # Assumes static binding shapes; with dynamic axes the shapes must be
        # set on the execution context first
        self.inputs = []
        self.outputs = []
        self.bindings = []
        for i in range(self.engine.num_bindings):
            binding = self.engine.get_binding_name(i)
            shape = tuple(self.engine.get_binding_shape(i))
            dtype = np.dtype(trt.nptype(self.engine.get_binding_dtype(i)))
            # Allocate device memory
            device_mem = cuda.mem_alloc(trt.volume(shape) * dtype.itemsize)
            self.bindings.append(int(device_mem))
            entry = {'binding': binding, 'memory': device_mem,
                     'shape': shape, 'dtype': dtype}
            if self.engine.binding_is_input(i):
                self.inputs.append(entry)
            else:
                self.outputs.append(entry)

    def infer(self, images, points):
        # Copy inputs to the GPU (arrays must be contiguous)
        cuda.memcpy_htod(self.inputs[0]['memory'], np.ascontiguousarray(images))
        cuda.memcpy_htod(self.inputs[1]['memory'], np.ascontiguousarray(points))
        # Run inference
        self.context.execute_v2(bindings=self.bindings)
        # Copy outputs back to the CPU
        outputs = []
        for output in self.outputs:
            host_mem = cuda.pagelocked_empty(output['shape'], output['dtype'])
            cuda.memcpy_dtoh(host_mem, output['memory'])
            outputs.append(host_mem)
        return outputs


# Usage
trt_model = BEVFusionTRT('bevfusion_int8.engine')
bboxes, scores, labels, masks = trt_model.infer(images, points)
```
### 4.3 TensorRT Optimization Tips
**Orin-specific optimizations**:
```python
# 1. DLA acceleration (Orin has 2 DLA cores)
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0  # use DLA core 0
# 2. Prefer the precision constraints set on individual layers
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
# 3. Faster dynamic-shape handling (Orin works best with small batches)
config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, True)
# 4. Optimization profile matched to the real input shape
profile = builder.create_optimization_profile()
profile.set_shape(
    "images",
    min=(1, 6, 3, 256, 704),
    opt=(1, 6, 3, 256, 704),  # optimal shape
    max=(2, 6, 3, 256, 704)
)
config.add_optimization_profile(profile)
```
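
An engine built with an optimization profile only accepts inputs whose every dimension lies between the profile's `min` and `max` shapes. A small pre-flight check along those lines, written in plain Python with no TensorRT dependency (`shape_in_profile` is a hypothetical helper):

```python
def shape_in_profile(shape, min_shape, max_shape):
    """True if every dimension of `shape` lies within [min, max] of the profile."""
    if not (len(shape) == len(min_shape) == len(max_shape)):
        return False
    return all(lo <= dim <= hi for dim, lo, hi in zip(shape, min_shape, max_shape))


MIN_SHAPE = (1, 6, 3, 256, 704)
MAX_SHAPE = (2, 6, 3, 256, 704)

print(shape_in_profile((1, 6, 3, 256, 704), MIN_SHAPE, MAX_SHAPE))  # batch 1 fits
print(shape_in_profile((4, 6, 3, 256, 704), MIN_SHAPE, MAX_SHAPE))  # batch 4 does not
```
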
---
## 🧪 Step 5: Testing on Orin (1-2 days)
### 5.1 Environment Setup
```bash
# Install dependencies on the Orin
# JetPack 5.1+ (includes CUDA 11.4, cuDNN 8.6, TensorRT 8.5)
# Install Python dependencies
pip3 install pycuda
pip3 install numpy opencv-python
# Copy the model file
scp bevfusion_int8.engine orin@192.168.1.100:/home/orin/models/
```
### 5.2 Performance Test
```python
# tools/benchmark_orin.py
import time
import numpy as np
from trt_inference import BEVFusionTRT  # class from section 4.2; adjust the import path to your layout
# Load the TensorRT model
trt_model = BEVFusionTRT('bevfusion_int8.engine')
# images and points are preprocessed input arrays prepared beforehand
# Warm up
for _ in range(10):
    trt_model.infer(images, points)
# Benchmark
times = []
for i in range(100):
    start = time.perf_counter()
    outputs = trt_model.infer(images, points)
    end = time.perf_counter()
    times.append((end - start) * 1000)  # ms
print(f"Mean inference time: {np.mean(times):.2f} ms")
print(f"Throughput: {1000/np.mean(times):.2f} FPS")
print(f"P99 latency: {np.percentile(times, 99):.2f} ms")
```
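
The three reported metrics can also be computed without NumPy, which is handy for checking the benchmark logic off-device. A minimal sketch using only the standard library; note that it uses a nearest-rank P99, whereas `np.percentile` interpolates linearly by default, so the two can differ slightly (`summarize_latencies` is an illustrative name and the timings are made up):

```python
import statistics


def summarize_latencies(times_ms):
    """Return mean latency, throughput, and nearest-rank P99 for a list of timings."""
    ordered = sorted(times_ms)
    mean_ms = statistics.fmean(ordered)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integer math.
    rank = (99 * len(ordered) + 99) // 100
    return {
        "mean_ms": mean_ms,
        "fps": 1000.0 / mean_ms,
        "p99_ms": ordered[rank - 1],
    }


# Made-up timings: 98 fast frames and 2 slow outliers.
sample_times = [50.0] * 98 + [150.0] * 2
summary = summarize_latencies(sample_times)
print(summary)
```

The example shows why P99 is reported alongside the mean: two slow frames barely move the average but dominate the tail latency.
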
### 5.3 Power Test
```bash
# Monitor power in the background
sudo tegrastats --interval 1000 > power_log.txt &
# Run inference
python3 tools/benchmark_orin.py
# Stop monitoring
sudo tegrastats --stop
# Analyze power
grep "VDD_GPU_SOC" power_log.txt
```
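
The log can then be reduced to a single average figure. The sketch below assumes each tegrastats line reports the rail as `VDD_GPU_SOC <current>mW/<average>mW`; the exact field layout varies between JetPack releases, so check it against a real log first. The sample lines are made up, and this rail covers only the GPU and SoC, not total board power:

```python
import re

# Assumed field format: "VDD_GPU_SOC 12000mW/11800mW" (current/average).
RAIL_PATTERN = re.compile(r"VDD_GPU_SOC\s+(\d+)mW/(\d+)mW")


def gpu_soc_power_mw(log_lines):
    """Extract the instantaneous VDD_GPU_SOC reading (mW) from each matching line."""
    readings = []
    for line in log_lines:
        match = RAIL_PATTERN.search(line)
        if match:
            readings.append(int(match.group(1)))
    return readings


# Made-up sample lines in the assumed format.
sample_log = [
    "RAM 8123/62800MB CPU [20%@1190] VDD_GPU_SOC 12000mW/11800mW VDD_CPU_CV 3100mW/3000mW",
    "RAM 8130/62800MB CPU [22%@1190] VDD_GPU_SOC 14000mW/12900mW VDD_CPU_CV 3200mW/3050mW",
    "a line without the rail",
]
readings = gpu_soc_power_mw(sample_log)
average_w = sum(readings) / len(readings) / 1000.0
print(f"Average VDD_GPU_SOC power: {average_w:.1f} W")
```
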
### 5.4 Accuracy Validation
```bash
# Run the nuScenes validation set on the Orin
python3 tools/test_orin.py \
  --engine bevfusion_int8.engine \
  --data-root /data/nuscenes \
  --eval bbox map
```
**Expected performance**:
- **Inference time**: 60-80 ms (vs 90 ms on A100)
- **FPS**: 12-16 FPS
- **Power**: 40-50 W
- **Accuracy loss**: <3%
---
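## ⚡ Step 6: Performance Tuning (2-3 days)
### 6.1 Multi-Stream Parallelism
```python
# Use CUDA streams to overlap preprocessing, inference, and post-processing
import pycuda.driver as cuda


class OptimizedPipeline:
    def __init__(self, trt_model):
        self.trt_model = trt_model
        self.preprocess_stream = cuda.Stream()
        self.infer_stream = cuda.Stream()
        self.postprocess_stream = cuda.Stream()
        # Events order the work across streams
        self.preprocess_done = cuda.Event()
        self.infer_done = cuda.Event()

    def process_frame(self, raw_images, raw_points):
        # Preprocessing (asynchronous). preprocess_*, infer_async, and
        # postprocess_outputs are assumed helpers that enqueue work on `stream`
        images = preprocess_images(raw_images, stream=self.preprocess_stream)
        points = preprocess_points(raw_points, stream=self.preprocess_stream)
        self.preprocess_done.record(self.preprocess_stream)
        # Inference (asynchronous), once preprocessing has finished
        self.infer_stream.wait_for_event(self.preprocess_done)
        outputs = self.trt_model.infer_async(images, points, stream=self.infer_stream)
        self.infer_done.record(self.infer_stream)
        # Post-processing (asynchronous), once inference has finished
        self.postprocess_stream.wait_for_event(self.infer_done)
        results = postprocess_outputs(outputs, stream=self.postprocess_stream)
        self.postprocess_stream.synchronize()
        return results
```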
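
The benefit of overlapping stages shows up in throughput rather than per-frame latency: once the pipeline is full, a new frame completes every `max(stage times)` instead of every `sum(stage times)`. A sketch of that arithmetic with hypothetical stage timings (the 10/50/5 ms split is an assumption for illustration, not a measurement):

```python
def sequential_interval_ms(stage_times_ms):
    """Time between finished frames when stages run back to back."""
    return sum(stage_times_ms)


def pipelined_interval_ms(stage_times_ms):
    """Time between finished frames when stages overlap across frames."""
    return max(stage_times_ms)


# Hypothetical per-stage timings in milliseconds.
stage_times = {"preprocess": 10.0, "inference": 50.0, "postprocess": 5.0}
times = list(stage_times.values())

seq = sequential_interval_ms(times)
pipe = pipelined_interval_ms(times)
print(f"Sequential: {seq:.0f} ms per frame, {1000 / seq:.1f} FPS")
print(f"Pipelined:  {pipe:.0f} ms per frame, {1000 / pipe:.1f} FPS")
```

Each individual frame still takes the full 65 ms from input to result in this example; only the rate at which results arrive improves.
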
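### 6.2 Memory Optimization
```python
# Use unified memory to reduce copies
import numpy as np
import pycuda.driver as cuda
# Allocate unified (managed) memory
images_um = cuda.managed_empty(images_shape, dtype=np.float32,
                               mem_flags=cuda.mem_attach_flags.GLOBAL)
points_um = cuda.managed_empty(points_shape, dtype=np.float32,
                               mem_flags=cuda.mem_attach_flags.GLOBAL)
# Fill the buffers directly on the CPU
np.copyto(images_um, preprocessed_images)
np.copyto(points_um, preprocessed_points)
# The GPU can access these buffers without an explicit copy,
# provided infer() binds them directly instead of calling memcpy_htod
outputs = trt_model.infer(images_um, points_um)
```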
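### 6.3 DLA Offload
Using Orin's 2 DLA cores:
```python
# Offload part of the network to the DLA
# The DLA suits convolution, pooling, and normalization
# Attention and other complex ops stay on the GPU
# Specified at engine build time (layer-name prefixes are examples)
dla_layers = [
    'encoder/camera/backbone/conv1',
    'encoder/camera/backbone/layer1',
    'encoder/lidar/voxelize',
]
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
for i in range(network.num_layers):
    layer = network.get_layer(i)
    # Only assign layers that match a prefix and that the DLA supports
    if layer.name.startswith(tuple(dla_layers)) and config.can_run_on_DLA(layer):
        config.set_device_type(layer, trt.DeviceType.DLA)
```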
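
Before touching the builder, it can help to plan the split on paper. The sketch below partitions a layer list by operation type following the rule stated in the comments above (convolution, pooling, and normalization to the DLA, everything else to the GPU). The layer names and op labels are hypothetical, and real DLA eligibility also depends on layer parameters, so TensorRT's own check remains the authority:

```python
# Op types the plan above assigns to the DLA.
DLA_FRIENDLY_OPS = {"conv", "pool", "norm"}


def split_layers_by_device(layers):
    """Split (name, op_type) pairs into DLA candidates and GPU layers."""
    dla, gpu = [], []
    for name, op_type in layers:
        (dla if op_type in DLA_FRIENDLY_OPS else gpu).append(name)
    return dla, gpu


# Hypothetical layer list for illustration.
example_layers = [
    ("backbone/conv1", "conv"),
    ("backbone/norm1", "norm"),
    ("backbone/attn1", "attention"),
    ("neck/pool", "pool"),
    ("head/nms", "nms"),
]
dla_candidates, gpu_layers = split_layers_by_device(example_layers)
print("DLA:", dla_candidates)
print("GPU:", gpu_layers)
```
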
---
## 📊 Expected Performance Comparison
### Performance at Each Optimization Stage
| Stage | Parameters | FLOPs | Inference time (Orin) | Accuracy loss | Notes |
|------|--------|-------|---------------|---------|------|
| **Original FP32** | 110M | 450G | 900 ms | - | Too slow |
| **Pruned FP32** | 60M | 250G | 500 ms | -1.5% | Still slow |
| **Pruned + INT8** | 15M | 62G | 80 ms | -2.5% | Usable |
| **+ TensorRT** | 15M | 62G | 65 ms | -2.5% | Good |
| **+ Multi-stream** | 15M | 62G | 50 ms | -2.5% | Best 🌟 |
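
The latency column translates directly into throughput and cumulative speedup. A short sketch that derives both from the planned figures in the table (these are the plan's targets, not measurements; `stage_summary` is an illustrative helper):

```python
# (stage name, planned Orin latency in ms), taken from the table above.
STAGES = [
    ("Original FP32", 900),
    ("Pruned FP32", 500),
    ("Pruned + INT8", 80),
    ("+ TensorRT", 65),
    ("+ Multi-stream", 50),
]


def stage_summary(stages):
    """Return (name, fps, speedup vs the first stage) for each stage."""
    baseline_ms = stages[0][1]
    return [(name, 1000 / ms, baseline_ms / ms) for name, ms in stages]


summary_rows = stage_summary(STAGES)
for name, fps, speedup in summary_rows:
    print(f"{name:<16} {fps:5.1f} FPS  {speedup:4.1f}x")
```
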
### Final Performance Targets
| Metric | Target | Expected |
|------|--------|---------|
| **Inference time** | <80 ms | 50-65 ms ✅ |
| **Throughput** | >10 FPS | 15-20 FPS ✅ |
| **Power** | <60 W | 40-50 W ✅ |
| **Detection mAP** | >63% | 65-67% ✅ |
| **Segmentation mIoU** | >52% | 53-57% ✅ |
| **Memory usage** | <4 GB | 2-3 GB ✅ |
---
## 🛠️ Tools and Scripts
### Tool Scripts to Create
```bash
tools/
├── pruning/
│   ├── prune_bevfusion.py          # pruning script
│   └── eval_pruned_model.py        # evaluate the pruned model
├── quantization/
│   ├── quantize_bevfusion.py       # quantization script
│   ├── qat_train.py                # QAT training
│   └── calibrate.py                # INT8 calibration
├── tensorrt/
│   ├── convert_to_trt.py           # ONNX → TensorRT
│   ├── trt_inference.py            # TensorRT inference
│   └── optimize_dla.py             # DLA optimization
├── deployment/
│   ├── benchmark_orin.py           # Orin benchmark
│   ├── deploy_to_orin.sh           # one-step deployment script
│   └── monitor_performance.py      # performance monitoring
└── analysis/
    ├── model_complexity.py         # model complexity analysis
    └── latency_breakdown.py        # latency breakdown analysis
```
---
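## 📅 Detailed Schedule
### Week 1: Pruning and Quantization Preparation
| Days | Task | Output |
|------|------|------|
| Day 1-2 | Model analysis, ONNX export | Baseline test report |
| Day 3-4 | Structured pruning | Pruned model (60M) |
| Day 5 | Fine-tuning the pruned model | Fine-tuned checkpoint |
| Day 6-7 | Initial PTQ test | INT8 feasibility report |
### Week 2: Quantization Training and TensorRT
| Days | Task | Output |
|------|------|------|
| Day 8-10 | QAT training | INT8 model (15M) |
| Day 11-12 | TensorRT conversion and optimization | TRT engine |
| Day 13 | TensorRT test on A100 | Performance baseline |
| Day 14 | Prepare the Orin environment | Deployment package |
### Week 3: Orin Testing and Tuning
| Days | Task | Output |
|------|------|------|
| Day 15 | Deploy to Orin | Initial results |
| Day 16 | Performance and power tests | Test report |
| Day 17-18 | Accuracy validation | Accuracy report |
| Day 19-20 | Multi-stream and DLA optimization | Optimized model |
| Day 21 | Final validation and documentation | Deployment documentation |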
---
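## 🔍 Key Technical Points
### 1. Orin-Specific Optimizations
**Orin vs general-purpose GPUs**:
- Unified memory: a major advantage
- DLA available: well suited to convolution layers
- Fewer Tensor Cores: smaller gain from FP16
- Lower bandwidth: memory access needs to be optimized
### 2. BEVFusion-Specific Optimizations
**Key module optimizations**:
1. **SwinTransformer**
   - The most time-consuming module (~40%)
   - Benefits most from pruning
   - Window attention can be approximated with convolutions
2. **LSS View Transform**
   - Heavy in 3D convolutions
   - Responds well to INT8 quantization
   - Separable operations are worth considering
3. **ConvFuser**
   - Simple concat + conv
   - Can be optimized almost losslessly
4. **TransFusion Head**
   - Complex query mechanism
   - Needs careful quantization
   - NMS can run in parallel on the CPU
### 3. Techniques for Preserving Accuracy
**QAT training essentials**:
- Train on the full original dataset
- Keep the learning rate small (1e-5)
- 3-5 epochs of training are enough
- Do not quantize BatchNorm layers
- Keep certain sensitive layers in FP16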
---
## 📦 Deployment Package Structure
```
bevfusion_orin_deploy/
├── models/
│   ├── bevfusion_int8.engine    # TensorRT engine
│   ├── config.yaml              # configuration file
│   └── class_names.txt          # class names
├── lib/
│   ├── libbevfusion.so          # C++ inference library
│   └── python/
│       └── bevfusion_trt.py     # Python interface
├── scripts/
│   ├── run_inference.sh         # inference script
│   └── benchmark.sh             # benchmark
├── data/
│   └── sample_data/             # test data
├── docs/
│   ├── API.md                   # API documentation
│   └── OPTIMIZATION.md          # optimization notes
└── README.md                    # usage instructions
```
---
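## 🎯 Performance Assurance Strategy
### If Performance Falls Short
**Plan B options**:
1. **Prune further** (60M → 40M)
   - Sacrifices 1-2% accuracy
   - Speeds up inference by 20-30%
2. **Lower the input resolution**
   - Images: 256×704 → 192×512
   - BEV: 180×180 → 128×128
   - Speeds up inference by 40%
3. **Simplify the tasks**
   - Keep only the detection task
   - Or keep just one of detection and segmentation
4. **Use two Orins**
   - Camera processing on Orin-1
   - LiDAR processing on Orin-2
   - Parallel inference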
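
For the resolution option, the share of pixels (or BEV cells) that remains gives a rough upper bound on how much work is saved, assuming cost scales about linearly with the number of pixels. That is a simplification, since attention layers and fixed overheads scale differently. A quick calculation for the two changes listed above (`pixel_ratio` is an illustrative helper):

```python
def pixel_ratio(old_hw, new_hw):
    """Fraction of pixels remaining after resizing from old_hw to new_hw."""
    return (new_hw[0] * new_hw[1]) / (old_hw[0] * old_hw[1])


image_ratio = pixel_ratio((256, 704), (192, 512))
bev_ratio = pixel_ratio((180, 180), (128, 128))
print(f"Image pixels remaining: {image_ratio:.1%}")
print(f"BEV cells remaining:    {bev_ratio:.1%}")
```
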
---
## 📚 References
### Official Documentation
- [TensorRT Documentation](https://docs.nvidia.com/deeplearning/tensorrt/)
- [Orin Developer Guide](https://developer.nvidia.com/embedded/jetson-agx-orin-developer-kit)
- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)
### Open-Source Tools
- [Torch-Pruning](https://github.com/VainF/Torch-Pruning)
- [TensorRT-OSS](https://github.com/NVIDIA/TensorRT)
- [ONNX Runtime](https://github.com/microsoft/onnxruntime)
### Related Papers
- "Learned Step Size Quantization" (LSQ)
- "Network Slimming" (Channel Pruning)
- "Accelerating Deep Learning with TensorRT"
---
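## ✅ Success Criteria
### Minimum Requirements
- ✅ Inference time < 80 ms
- ✅ Throughput > 12 FPS
- ✅ Power < 60 W
- ✅ Detection mAP > 63%
- ✅ Segmentation mIoU > 52%
### Ideal Targets
- 🌟 Inference time < 60 ms
- 🌟 Throughput > 16 FPS
- 🌟 Power < 45 W
- 🌟 Detection mAP > 65%
- 🌟 Segmentation mIoU > 55%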
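
These thresholds are easy to check automatically at the end of a benchmark run. A minimal sketch that encodes both tiers and reports which criteria a set of measurements fails; the metric keys and the sample measurements are made up for illustration:

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt}

# Thresholds copied from the two lists above.
MINIMUM = {
    "latency_ms": ("<", 80),
    "fps": (">", 12),
    "power_w": ("<", 60),
    "map_percent": (">", 63),
    "miou_percent": (">", 52),
}
IDEAL = {
    "latency_ms": ("<", 60),
    "fps": (">", 16),
    "power_w": ("<", 45),
    "map_percent": (">", 65),
    "miou_percent": (">", 55),
}


def failed_criteria(measured, criteria):
    """Return the names of the criteria that `measured` does not satisfy."""
    return [
        name
        for name, (op, limit) in criteria.items()
        if not OPS[op](measured[name], limit)
    ]


# Made-up measurements for illustration.
measured = {
    "latency_ms": 65,
    "fps": 15.4,
    "power_w": 45,
    "map_percent": 65.5,
    "miou_percent": 54,
}
print("Fails minimum:", failed_criteria(measured, MINIMUM))
print("Fails ideal:  ", failed_criteria(measured, IDEAL))
```
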
---
## 🚀 Quick Start
### One-Step Deployment Script
```bash
#!/bin/bash
# scripts/deploy_to_orin.sh
echo "========== BEVFusion Orin Deployment =========="
# 1. Pruning
echo "Step 1: Pruning the model..."
python tools/pruning/prune_bevfusion.py \
  --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
  --checkpoint runs/run-xxx/epoch_20.pth \
  --output bevfusion_pruned.pth
# 2. Quantization
echo "Step 2: INT8 quantization..."
python tools/quantization/quantize_bevfusion.py \
  --model bevfusion_pruned.pth \
  --output bevfusion_int8.pth \
  --calibration-data data/nuscenes/calibration_100samples
# 3. TensorRT conversion
echo "Step 3: TensorRT conversion..."
python tools/tensorrt/convert_to_trt.py \
  --model bevfusion_int8.pth \
  --output bevfusion_int8.engine \
  --fp16 \
  --int8 \
  --workspace 4096
# 4. Testing
echo "Step 4: Performance test..."
python tools/deployment/benchmark_orin.py \
  --engine bevfusion_int8.engine
echo "Deployment complete!"
```
---
- Generated: 2025-10-17
- Target hardware: NVIDIA AGX Orin 270T
- Estimated deployment cycle: 2-3 weeks