# Deploying BEVFusion to NVIDIA Orin 270T

## 🎯 Goals

Deploy the trained BEVFusion dual-task/triple-task model to the **NVIDIA AGX Orin 270T** and achieve:

- ✅ Real-time inference (>10 FPS)
- ✅ Low latency (<100 ms)
- ✅ Low power (<60 W)
- ✅ Preserved accuracy (mAP drop <3%)

---

## 📊 NVIDIA Orin 270T Specifications

### Hardware

| Parameter | Specification |
|------|------|
| **GPU** | 2048 CUDA cores + 64 Tensor cores |
| **AI compute** | 275 TOPS (INT8) |
| **Memory** | 64 GB unified memory |
| **CPU** | 12-core ARM Cortex-A78AE |
| **Power** | 15 W - 60 W (configurable) |
| **Architecture** | Ampere (similar to A100) |

### Performance Baseline

- **FP32**: ~5 TFLOPS
- **FP16**: ~10 TFLOPS
- **INT8**: ~20 TOPS (the working estimate used in this plan; the 275 TOPS in the table is NVIDIA's peak figure, quoted with sparsity for GPU and DLA combined)
- **Versus A100**: ~1/10 of the performance at only ~1/5 of the power

---

## 📋 Deployment Workflow Overview

```
Training complete (A100 × 8)
↓
Step 1: Model analysis and optimization (1-2 days)
↓
Step 2: Structured pruning (2-3 days)
↓
Step 3: Quantization-aware training (QAT) (3-4 days)
↓
Step 4: TensorRT optimization (2-3 days)
↓
Step 5: Testing on Orin (1-2 days)
↓
Step 6: Performance tuning (2-3 days)
↓
Production deployment ✅
```

**Total time**: about 2-3 weeks

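As a quick sanity check on the total above, the per-step ranges can be summed. This is a minimal sketch: the step keys are shortened labels invented for the example, and the 5-day working week is an assumption, not something the plan states.

```python
# Per-step duration ranges in days, copied from the workflow above.
STEP_DAYS = {
    "analysis": (1, 2),
    "pruning": (2, 3),
    "qat": (3, 4),
    "tensorrt": (2, 3),
    "orin_test": (1, 2),
    "tuning": (2, 3),
}

def total_days(steps):
    """Return (min_days, max_days) summed over all steps."""
    lo = sum(low for low, _ in steps.values())
    hi = sum(high for _, high in steps.values())
    return lo, hi

lo, hi = total_days(STEP_DAYS)
# Assumption: 5 working days per week.
print(f"{lo}-{hi} days, about {lo / 5:.1f}-{hi / 5:.1f} working weeks")
```

The sum comes to 11-17 working days, which is consistent with the 2-3 week estimate.
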
---

## 🔧 Step 1: Model Analysis and Optimization (1-2 days)

### 1.1 Model Complexity Analysis

```bash
# Analyze the model's parameter count and FLOPs
python tools/analysis/model_complexity.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth
```

**Expected output**:
```
BEVFusion dual-task model:
- Parameters: 110M
- FLOPs: 450 GFLOPs
- Inference time (A100): 90 ms
- Inference time (Orin, estimated): 450-900 ms (too slow!)
```

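The 450-900 ms estimate is consistent with scaling the A100 latency by a 5-10× slowdown, where the upper bound matches the ~1/10 performance ratio quoted in the specification section. A minimal sketch of that arithmetic follows; the function name and the 5× lower bound are assumptions inferred from the numbers, not taken from the analysis script.

```python
def estimate_orin_latency_ms(a100_ms, slowdown=(5, 10)):
    """Scale a measured A100 latency by an assumed slowdown range."""
    low, high = slowdown
    return a100_ms * low, a100_ms * high

low, high = estimate_orin_latency_ms(90)
print(f"Orin estimate: {low}-{high} ms")
print(f"Throughput: {1000 / high:.1f}-{1000 / low:.1f} FPS")
```

At roughly 1-2 FPS, the unoptimized model is about an order of magnitude short of the >10 FPS goal, which is what motivates the pruning and quantization steps.
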
### 1.2 Bottleneck Analysis

Profile with Nsight Systems:

```bash
# Profile on the A100
nsys profile -o bevfusion_profile \
    python tools/benchmark.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth
```

**Modules to watch**:
- SwinTransformer backbone (most time-consuming)
- Multi-head attention
- 3D convolutions
- NMS post-processing

### 1.3 Export the Baseline Model

```bash
# Export to ONNX
python tools/export_onnx.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth \
    --output bevfusion_fp32.onnx
```

---

## ✂️ Step 2: Structured Pruning (2-3 days)

### 2.1 Pruning Strategy

**Goal**: reduce parameters and FLOPs by 40-50%

**Pruning plan**:
1. **Channel pruning**
   - SwinTransformer: 20% fewer channels
   - FPN: 30% fewer channels
   - Decoder: 25% fewer channels

2. **Layer pruning**
   - SwinTransformer: 6 layers → 4 layers
   - Decoder: 5 layers → 4 layers

3. **Attention head pruning**
   - Number of attention heads: 8 → 6

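To make the channel ratios concrete, the sketch below turns a pruning ratio into a target layer width. The base widths (96 and 256) and the rounding to a multiple of 8 are illustrative assumptions, not values read from the BEVFusion config; the rounding reflects the common practice of keeping channel counts friendly to Tensor Cores.

```python
def pruned_width(channels, ratio, multiple=8):
    """Channels left after removing `ratio` of them, rounded to `multiple`."""
    kept = channels * (1.0 - ratio)
    rounded = int(round(kept / multiple)) * multiple
    return max(multiple, rounded)

# Hypothetical base widths; the ratios are the ones listed in the plan.
plan = {
    "swin_stage1": (96, 0.20),
    "fpn": (256, 0.30),
    "decoder": (256, 0.25),
}

for name, (width, ratio) in plan.items():
    print(f"{name}: {width} -> {pruned_width(width, ratio)} channels")
```

Because of the rounding, the realized reduction per layer can differ slightly from the nominal ratio, so the final parameter count should be measured rather than assumed.
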
### 2.2 Choosing a Pruning Tool

**Recommended**: Torch-Pruning

```python
# tools/pruning/prune_bevfusion.py

import torch
import torch_pruning as tp

# Load the model. build_model, config, checkpoint, example_images,
# finetune and train_loader are project-specific and must be defined
# by the caller.
model = build_model(config)
model.load_state_dict(checkpoint)

# Importance criterion: L1 norm of the weights
importance = tp.importance.MagnitudeImportance(p=1)

# Prune the SwinTransformer backbone
pruner = tp.pruner.MagnitudePruner(
    model.encoders['camera'].backbone,
    example_inputs=example_images,
    importance=importance,
    pruning_ratio=0.3,  # prune 30% (named ch_sparsity in older releases)
    iterative_steps=5,
)

# Run the pruning steps
for i in range(5):
    pruner.step()

# Fine-tune
finetune(model, train_loader, epochs=5)

# Save the pruned model. The tensors now have pruned shapes, so reload
# them only into a model built with the pruned widths
# (for example the *_pruned.yaml config used in section 2.3).
torch.save(model.state_dict(), 'bevfusion_pruned.pth')
```

### 2.3 Fine-Tuning After Pruning

```bash
# Fine-tune on the original dataset for 5 epochs
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/multitask/fusion-det-seg-swint_pruned.yaml \
    --load_from bevfusion_pruned.pth \
    --cfg-options \
    max_epochs=5 \
    optimizer.lr=5.0e-5  # smaller learning rate
```

**Expected results**:
- Parameters: 110M → 60M (-45%)
- FLOPs: 450G → 250G (-44%)
- Accuracy loss: <2%
- Inference time: 90 ms → 50 ms (A100)

---

## 🔢 Step 3: Quantization-Aware Training, QAT (3-4 days)

### 3.1 Quantization Strategy

**Goal**: FP32 → INT8 with accuracy loss kept under 2%

**Quantization plan**:
```
Pruned FP32 model (60M parameters, ~240 MB of weights)
↓
PTQ (Post-Training Quantization) - quick validation
↓
QAT (Quantization-Aware Training) - accuracy recovery
↓
INT8 model (same 60M parameters, ~60 MB of weights, 4× compression)
```

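As an illustration of what the FP32 → INT8 step does to each weight, here is a minimal pure-Python sketch of symmetric per-tensor quantization. It is not necessarily the scheme PyTorch or TensorRT applies to this model (those may quantize per channel or asymmetrically); the function names and sample weights are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return [v * scale for v in q]

weights = [0.50, -1.27, 0.03, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst rounding error

# Storage: 4 bytes per FP32 weight versus 1 byte per INT8 weight.
params = 60_000_000
print(params * 4 / 1e6, "MB ->", params * 1 / 1e6, "MB")
```

The number of weights is unchanged by quantization; only the bytes per weight drop, which is where the 4× compression comes from. QAT exists to teach the network to tolerate the rounding error shown on the second line.
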
### 3.2 Using PyTorch Quantization

```python
# tools/quantization/quantize_bevfusion.py

import torch
from torch.quantization import prepare_qat, convert

# Load the pruned model. load_pruned_model, train_qat and train_loader
# are project-specific helpers, not part of PyTorch.
model = load_pruned_model('bevfusion_pruned.pth')
model.train()  # prepare_qat requires training mode

# Quantization config. 'fbgemm' targets x86 CPUs, and the eager-mode
# INT8 model it produces runs on the CPU, not on the Orin GPU.
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

# Prepare for QAT (inserts fake-quantization modules)
model_qat = prepare_qat(model)

# QAT training (important!)
# Use a small learning rate and train for 3-5 epochs
train_qat(
    model_qat,
    train_loader,
    epochs=5,
    lr=1e-5
)

# Convert to INT8 (switch to eval mode first)
model_qat.eval()
model_int8 = convert(model_qat)

# Save
torch.save(model_int8.state_dict(), 'bevfusion_int8.pth')
```

### 3.3 QAT Training Configuration

```yaml
# configs/nuscenes/multitask/fusion-det-seg-swint_qat.yaml

_base_: ./fusion-det-seg-swint_pruned.yaml

# QAT-specific settings
quantization:
  enabled: true
  qconfig: 'fbgemm'

# Training parameters
max_epochs: 5
optimizer:
  lr: 1.0e-5  # very small learning rate
  weight_decay: 0.0001

# Weaker data augmentation
augment2d:
  resize: [[0.45, 0.48], [0.48, 0.48]]  # narrower resize range
  rotate: [-2.0, 2.0]  # less rotation

augment3d:
  scale: [0.95, 1.05]  # less scaling
  rotate: [-0.39, 0.39]  # less rotation
  translate: 0.25  # less translation
```

### 3.4 Quantization Validation

```bash
# Validate the accuracy of the INT8 model
python tools/test.py \
    configs/nuscenes/multitask/fusion-det-seg-swint_qat.yaml \
    bevfusion_int8.pth \
    --eval bbox map
```

**Expected results**:
- Model size: ~240 MB → ~60 MB of weights (-75%); the parameter count stays at 60M
- Inference speed: 2-4× faster
- Accuracy loss: 1-2%
- Memory footprint: 75% lower

---

## 🚀 Step 4: TensorRT Optimization (2-3 days)

### 4.1 TensorRT Conversion

```python
# tools/tensorrt/convert_to_trt.py

import tensorrt as trt
import torch

# 1. Export ONNX (from the INT8 model).
# model_int8, dummy_input, calibration_dataset and BEVFusionCalibrator
# are project-specific and must be defined before this script runs.
torch.onnx.export(
    model_int8,
    dummy_input,
    'bevfusion_int8.onnx',
    opset_version=17,
    input_names=['images', 'points'],
    output_names=['bboxes', 'scores', 'labels', 'masks'],
    dynamic_axes={
        'images': {0: 'batch'},
        'points': {0: 'batch'}
    }
)

# 2. Build the TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the ONNX file and stop on parser errors
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('bevfusion_int8.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('Failed to parse bevfusion_int8.onnx')

# Configure TensorRT. The dynamic batch axis also needs an
# optimization profile (see section 4.3).
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GB

# INT8 optimization
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)  # FP16 as fallback

# Calibration (used for PTQ)
config.int8_calibrator = BEVFusionCalibrator(
    calibration_dataset,
    cache_file='bevfusion_calibration.cache'
)

# Build the engine (returns None on failure)
serialized_engine = builder.build_serialized_network(network, config)
if serialized_engine is None:
    raise RuntimeError('TensorRT engine build failed')

# Save
with open('bevfusion_int8.engine', 'wb') as f:
    f.write(serialized_engine)
```

### 4.2 TensorRT Inference Interface

```python
# tools/tensorrt/trt_inference.py

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates the CUDA context on import

class BEVFusionTRT:
    def __init__(self, engine_path):
        # Load the engine
        with open(engine_path, 'rb') as f:
            runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
            self.engine = runtime.deserialize_cuda_engine(f.read())

        self.context = self.engine.create_execution_context()

        # Allocate GPU memory
        self.allocate_buffers()

    def allocate_buffers(self):
        self.inputs = []
        self.outputs = []
        self.bindings = []

        # Binding-index API as shipped in TensorRT 8.x (JetPack 5.1).
        # Assumes static shapes; dynamic dimensions are reported as -1.
        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            shape = tuple(self.engine.get_binding_shape(i))
            dtype = np.dtype(trt.nptype(self.engine.get_binding_dtype(i)))

            # Allocate device memory
            device_mem = cuda.mem_alloc(trt.volume(shape) * dtype.itemsize)
            self.bindings.append(int(device_mem))

            # Keep shape and dtype so infer() can size the host buffers
            buf = {'binding': name, 'memory': device_mem,
                   'shape': shape, 'dtype': dtype}
            if self.engine.binding_is_input(i):
                self.inputs.append(buf)
            else:
                self.outputs.append(buf)

    def infer(self, images, points):
        # Copy inputs to the GPU (arrays must be C-contiguous)
        cuda.memcpy_htod(self.inputs[0]['memory'], np.ascontiguousarray(images))
        cuda.memcpy_htod(self.inputs[1]['memory'], np.ascontiguousarray(points))

        # Run inference
        self.context.execute_v2(bindings=self.bindings)

        # Copy outputs back to the CPU
        outputs = []
        for output in self.outputs:
            host_mem = cuda.pagelocked_empty(output['shape'], output['dtype'])
            cuda.memcpy_dtoh(host_mem, output['memory'])
            outputs.append(host_mem)

        return outputs

# Usage
trt_model = BEVFusionTRT('bevfusion_int8.engine')
bboxes, scores, labels, masks = trt_model.infer(images, points)
```

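The buffer sizes requested in `allocate_buffers` are the element count times the element size. A small sketch of that arithmetic for the camera input, assuming the `(1, 6, 3, 256, 704)` shape used in this guide's optimization profile; the helper name is made up for the example.

```python
import math

def buffer_bytes(shape, itemsize):
    """Bytes needed for a dense tensor: product of dims times bytes per element."""
    return math.prod(shape) * itemsize

images_shape = (1, 6, 3, 256, 704)   # batch, cameras, channels, height, width
fp32 = buffer_bytes(images_shape, 4)
int8 = buffer_bytes(images_shape, 1)
print(f"FP32 images: {fp32} bytes ({fp32 / 2**20:.1f} MiB)")
print(f"INT8 images: {int8} bytes ({int8 / 2**20:.1f} MiB)")
```

Input buffers of this size are small next to Orin's 64 GB of memory, so activation and workspace memory, not I/O buffers, are what the <4 GB budget has to cover.
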
### 4.3 TensorRT Optimization Tips

**Orin-specific optimizations**:

```python
# 1. DLA acceleration (Orin has 2 DLA cores)
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # unsupported layers run on the GPU
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0  # use DLA core 0

# 2. Prefer the per-layer precision constraints set on the network
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)

# 3. Faster dynamic-shape handling (a preview feature in TensorRT 8.5/8.6)
config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, True)

# 4. Optimization profile (matched to the real input shape;
#    Orin works best with small batches)
profile = builder.create_optimization_profile()
profile.set_shape(
    "images",
    min=(1, 6, 3, 256, 704),
    opt=(1, 6, 3, 256, 704),  # optimal shape
    max=(2, 6, 3, 256, 704)
)
config.add_optimization_profile(profile)
```

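The same build settings can also be expressed as a `trtexec` command line, which is convenient for building directly on the Orin. The sketch below only assembles the argument list and does not run anything. The flags are standard `trtexec` options; the function name, file names and shape strings are assumptions taken from the examples in this guide, and only the `images` input is given shapes here.

```python
import shlex

def trtexec_args(onnx, engine, dla_core=None, workspace_mib=4096,
                 shape="1x6x3x256x704", max_shape="2x6x3x256x704"):
    """Assemble a trtexec argument list for an INT8 build with FP16 fallback."""
    args = [
        "trtexec",
        f"--onnx={onnx}",
        f"--saveEngine={engine}",
        "--int8",
        "--fp16",
        f"--memPoolSize=workspace:{workspace_mib}",
        f"--minShapes=images:{shape}",
        f"--optShapes=images:{shape}",
        f"--maxShapes=images:{max_shape}",
    ]
    if dla_core is not None:
        # Run supported layers on the DLA and let the rest fall back to the GPU.
        args += [f"--useDLACore={dla_core}", "--allowGPUFallback"]
    return args

cmd = trtexec_args("bevfusion_int8.onnx", "bevfusion_int8.engine", dla_core=0)
print(shlex.join(cmd))
```

Keeping the arguments as a list makes it easy to pass them to `subprocess.run` later without shell-quoting issues.
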
---

## 🧪 Step 5: Testing on Orin (1-2 days)

### 5.1 Environment Setup

```bash
# Install dependencies on the Orin
# JetPack 5.1+ (ships CUDA 11.4, cuDNN 8.6, TensorRT 8.5)

# Install Python dependencies
pip3 install pycuda
pip3 install numpy opencv-python

# Copy the model files. A TensorRT engine is tied to the GPU and TensorRT
# version that built it, so copy the ONNX file and build
# bevfusion_int8.engine on the Orin itself.
scp bevfusion_int8.onnx orin@192.168.1.100:/home/orin/models/
```

### 5.2 Performance Test

```python
# tools/deployment/benchmark_orin.py

import time
import numpy as np
# BEVFusionTRT is the class from section 4.2; adjust the import path
from trt_inference import BEVFusionTRT

# Load the TensorRT model
trt_model = BEVFusionTRT('bevfusion_int8.engine')

# Dummy inputs. The point-cloud shape is a placeholder; match it to the engine.
dummy_images = np.random.rand(1, 6, 3, 256, 704).astype(np.float32)
dummy_points = np.random.rand(1, 200000, 5).astype(np.float32)

# Warm-up
for _ in range(10):
    trt_model.infer(dummy_images, dummy_points)

# Benchmark
times = []
for i in range(100):
    start = time.perf_counter()
    outputs = trt_model.infer(dummy_images, dummy_points)
    end = time.perf_counter()
    times.append((end - start) * 1000)  # ms

print(f"Mean inference time: {np.mean(times):.2f} ms")
print(f"Throughput: {1000/np.mean(times):.2f} FPS")
print(f"P99 latency: {np.percentile(times, 99):.2f} ms")
```

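Where NumPy is not installed, the same statistics can be computed with the standard library. A minimal sketch: the function name and the nearest-rank percentile definition are choices made for this example, so P99 may differ slightly from `np.percentile`, which interpolates between samples.

```python
import math
import statistics

def summarize(times_ms):
    """Mean latency, throughput and nearest-rank P99 from per-frame times in ms."""
    ordered = sorted(times_ms)
    mean = statistics.fmean(ordered)
    rank = math.ceil(len(ordered) * 99 / 100)   # nearest-rank percentile
    return {"mean_ms": mean, "fps": 1000.0 / mean, "p99_ms": ordered[rank - 1]}

# Illustrative timings: 99 frames at 60 ms and one 120 ms outlier.
stats = summarize([60.0] * 99 + [120.0])
print(stats)
```

Reporting P99 alongside the mean matters for the <100 ms latency goal, because a good average can hide occasional slow frames.
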
### 5.3 Power Test

```bash
# Monitor power
sudo tegrastats --interval 1000 > power_log.txt &

# Run inference
python3 tools/deployment/benchmark_orin.py

# Stop logging, then analyze power
sudo tegrastats --stop
grep "VDD_GPU_SOC" power_log.txt
```

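To turn the log into numbers, the rail readings can be extracted with a regular expression. The sketch assumes each rail appears as `NAME <current>mW/<average>mW`, which is how tegrastats typically prints power rails on Orin; the sample lines are made up and abbreviated, so check the pattern against your own `power_log.txt`. Note that `VDD_GPU_SOC` covers only the GPU/SoC rail and therefore understates total module power.

```python
import re

RAIL_RE = re.compile(r"(\w+) (\d+)mW/(\d+)mW")

def parse_rails(line):
    """Return {rail_name: instantaneous_mW} for one tegrastats line."""
    return {name: int(cur) for name, cur, _avg in RAIL_RE.findall(line)}

def mean_rail_watts(lines, rail="VDD_GPU_SOC"):
    """Average one rail over all log lines that report it, in watts."""
    samples = [rails[rail] for rails in map(parse_rails, lines) if rail in rails]
    return sum(samples) / len(samples) / 1000.0

# Made-up sample lines in the assumed format.
log = [
    "RAM 5120/62780MB GR3D_FREQ 80% VDD_GPU_SOC 21000mW/20500mW VDD_CPU_CV 6000mW/5900mW",
    "RAM 5125/62780MB GR3D_FREQ 85% VDD_GPU_SOC 23000mW/21000mW VDD_CPU_CV 6200mW/6000mW",
]
print(mean_rail_watts(log))
```
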
### 5.4 Accuracy Validation

```bash
# Run the nuScenes validation set on the Orin
python3 tools/test_orin.py \
    --engine bevfusion_int8.engine \
    --data-root /data/nuscenes \
    --eval bbox map
```

**Expected performance**:
- **Inference time**: 60-80 ms (vs. 90 ms for the unoptimized FP32 model on A100)
- **FPS**: 12-16 FPS ✅
- **Power**: 40-50 W
- **Accuracy loss**: <3%

---

## ⚡ Step 6: Performance Tuning (2-3 days)

### 6.1 Multi-Stream Parallelism

```python
# Use CUDA streams to overlap preprocessing, inference and post-processing.
# preprocess_images, preprocess_points and postprocess_outputs are
# project-specific; they are assumed to enqueue their GPU work on `stream`.
import pycuda.driver as cuda

class OptimizedPipeline:
    def __init__(self, trt_model):
        self.trt_model = trt_model
        self.preprocess_stream = cuda.Stream()
        self.infer_stream = cuda.Stream()
        self.postprocess_stream = cuda.Stream()
        self.preprocess_done = cuda.Event()
        self.infer_done = cuda.Event()

    def process_frame(self, raw_images, raw_points):
        # Preprocessing (asynchronous)
        images = preprocess_images(raw_images, stream=self.preprocess_stream)
        points = preprocess_points(raw_points, stream=self.preprocess_stream)
        self.preprocess_done.record(self.preprocess_stream)

        # Inference starts once preprocessing has finished. infer() from
        # section 4.2 is synchronous; an asynchronous variant would call
        # execute_async_v2 on infer_stream.
        self.infer_stream.wait_for_event(self.preprocess_done)
        outputs = self.trt_model.infer(images, points)
        self.infer_done.record(self.infer_stream)

        # Post-processing (asynchronous) starts once inference has finished
        self.postprocess_stream.wait_for_event(self.infer_done)
        results = postprocess_outputs(outputs, stream=self.postprocess_stream)
        self.postprocess_stream.synchronize()

        return results
```

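Pipelining the three stages does not shorten the work done per frame; it lets consecutive frames overlap, so in steady state a new result appears every `max(stage)` milliseconds instead of every `sum(stage)`. A small sketch of that relationship follows. The stage timings are illustrative assumptions chosen to match the 65 ms → 50 ms improvement targeted in this guide, not measurements.

```python
def pipeline_metrics(stage_ms):
    """Latency and steady-state throughput for sequential vs. overlapped stages."""
    latency = sum(stage_ms)          # one frame passing through every stage
    period = max(stage_ms)           # overlapped: the slowest stage sets the pace
    return {
        "latency_ms": latency,
        "sequential_fps": 1000.0 / latency,
        "pipelined_fps": 1000.0 / period,
    }

# Assumed split: 10 ms preprocessing, 50 ms inference, 5 ms post-processing.
m = pipeline_metrics([10, 50, 5])
print(m)
```

Under these assumptions throughput rises from about 15 FPS to 20 FPS while end-to-end latency per frame stays at 65 ms, so the latency goal still has to be met by the inference stage itself.
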
### 6.2 Memory Optimization

```python
# Use unified memory to reduce copies
import numpy as np
import pycuda.driver as cuda

# Allocate unified (managed) memory. image_shape and points_shape are
# placeholders for the real input shapes.
flags = cuda.mem_attach_flags.GLOBAL
images_um = cuda.managed_empty(image_shape, dtype=np.float32, mem_flags=flags)
points_um = cuda.managed_empty(points_shape, dtype=np.float32, mem_flags=flags)

# Fill the buffers directly on the CPU
np.copyto(images_um, preprocessed_images)

# The GPU can access the data directly, with no explicit copy.
# For that to pay off, infer() must bind these buffers instead of
# calling memcpy_htod as the section 4.2 version does.
outputs = trt_model.infer(images_um, points_um)
```

### 6.3 DLA Offload

Using Orin's 2 DLA cores:

```python
# Offload part of the network to the DLA
# DLA suits: convolution, pooling, normalization
# Keep on the GPU: attention, complex operations

# Specified at engine build time (the layer names are illustrative)
dla_layers = [
    'encoder/camera/backbone/conv1',
    'encoder/camera/backbone/layer1',
    'encoder/lidar/voxelize',
]

for i in range(network.num_layers):
    layer = network.get_layer(i)
    matches = any(layer.name.startswith(prefix) for prefix in dla_layers)
    # Assign a matching layer to the DLA only if the DLA supports it
    if matches and config.can_run_on_DLA(layer):
        config.set_device_type(layer, trt.DeviceType.DLA)
```

---

## 📊 Expected Performance Comparison

### Performance at Each Optimization Stage

| Stage | Parameters | FLOPs | Inference time (Orin) | Accuracy loss | Notes |
|------|--------|-------|---------------|---------|------|
| **Original FP32** | 110M | 450G | 900 ms | - | Too slow ❌ |
| **Pruned FP32** | 60M | 250G | 500 ms | -1.5% | Still slow ⚠️ |
| **Pruned + INT8** | 60M (~60 MB) | 250G (INT8 ops) | 80 ms | -2.5% | Usable ✅ |
| **+ TensorRT** | 60M (~60 MB) | 250G (INT8 ops) | 65 ms | -2.5% | Good ✅ |
| **+ Multi-stream** | 60M (~60 MB) | 250G (INT8 ops) | 50 ms | -2.5% | Best 🌟 |

### Final Performance Targets

| Metric | Target | Expected |
|------|--------|---------|
| **Inference time** | <80 ms | 50-65 ms ✅ |
| **Throughput** | >10 FPS | 15-20 FPS ✅ |
| **Power** | <60 W | 40-50 W ✅ |
| **Detection mAP** | >63% | 65-67% ✅ |
| **Segmentation mIoU** | >52% | 53-57% ✅ |
| **Memory footprint** | <4 GB | 2-3 GB ✅ |

---

## 🛠️ Tools and Scripts

### Required Tool Scripts

```bash
tools/
├── pruning/
│   ├── prune_bevfusion.py          # pruning script
│   └── eval_pruned_model.py        # evaluate the pruned model
├── quantization/
│   ├── quantize_bevfusion.py       # quantization script
│   ├── qat_train.py                # QAT training
│   └── calibrate.py                # INT8 calibration
├── tensorrt/
│   ├── convert_to_trt.py           # ONNX → TensorRT
│   ├── trt_inference.py            # TensorRT inference
│   └── optimize_dla.py             # DLA optimization
├── deployment/
│   ├── benchmark_orin.py           # Orin performance test
│   ├── deploy_to_orin.sh           # one-click deployment script
│   └── monitor_performance.py      # performance monitoring
└── analysis/
    ├── model_complexity.py         # model complexity analysis
    └── latency_breakdown.py        # latency breakdown analysis
```

---

## 📅 Detailed Schedule

### Week 1: Pruning and Quantization Preparation

| Days | Task | Output |
|------|------|------|
| Day 1-2 | Model analysis, ONNX export | Baseline benchmark report |
| Day 3-4 | Structured pruning | Pruned model (60M) |
| Day 5 | Fine-tune the pruned model | Fine-tuned checkpoint |
| Day 6-7 | Initial PTQ test | INT8 feasibility report |

### Week 2: Quantization Training and TensorRT

| Days | Task | Output |
|------|------|------|
| Day 8-10 | QAT training | INT8 model (60M parameters, ~60 MB) |
| Day 11-12 | TensorRT conversion and optimization | TRT engine |
| Day 13 | TensorRT test on A100 | Performance baseline |
| Day 14 | Prepare the Orin environment | Deployment package |

### Week 3: Orin Testing and Tuning

| Days | Task | Output |
|------|------|------|
| Day 15 | Deploy to Orin | Initial results |
| Day 16 | Performance and power tests | Test report |
| Day 17-18 | Accuracy validation | Accuracy report |
| Day 19-20 | Multi-stream and DLA optimization | Optimized model |
| Day 21 | Final validation and documentation | Deployment documentation ✅ |

---

## 🔍 Key Technical Points

### 1. Orin-Specific Optimizations

**Orin vs. general-purpose GPUs**:
- ✅ Unified memory is a major advantage
- ✅ DLA is available and well suited to convolution layers
- ⚠️ Fewer Tensor Cores, so the FP16 advantage is smaller
- ⚠️ Lower memory bandwidth, so memory access needs optimizing

### 2. BEVFusion-Specific Optimizations

**Key module optimizations**:
1. **SwinTransformer**
   - Most time-consuming (~40%)
   - Benefits most from pruning
   - Window attention can be approximated with convolutions

2. **LSS View Transform**
   - 3D-convolution heavy
   - Quantizes well to INT8
   - Consider decomposing the computation

3. **ConvFuser**
   - Simple concat + conv
   - Can be optimized almost losslessly

4. **TransFusion Head**
   - Complex query mechanism
   - Needs careful quantization
   - NMS can run in parallel on the CPU

### 3. Accuracy-Preservation Tips

**QAT training essentials**:
- ✅ Train on the full original dataset
- ✅ Use a small learning rate (1e-5)
- ✅ 3-5 epochs are enough
- ✅ Do not quantize BatchNorm layers
- ✅ Keep certain sensitive layers in FP16

---

## 📦 Deployment Package Layout

```
bevfusion_orin_deploy/
├── models/
│   ├── bevfusion_int8.engine       # TensorRT engine
│   ├── config.yaml                 # configuration file
│   └── class_names.txt             # class names
├── lib/
│   ├── libbevfusion.so             # C++ inference library
│   └── python/
│       └── bevfusion_trt.py        # Python interface
├── scripts/
│   ├── run_inference.sh            # inference script
│   └── benchmark.sh                # performance test
├── data/
│   └── sample_data/                # test data
├── docs/
│   ├── API.md                      # API documentation
│   └── OPTIMIZATION.md             # optimization notes
└── README.md                       # usage instructions
```

---

## 🎯 Performance Fallback Strategy

### If Performance Falls Short

**Plan B options**:

1. **Prune further** (60M → 40M)
   - Costs 1-2% accuracy
   - Gains 20-30% speed

2. **Lower the input resolution**
   - Images: 256×704 → 192×512
   - BEV: 180×180 → 128×128
   - About 40% faster

3. **Simplify the tasks**
   - Keep only the detection task
   - Or keep just one of detection and segmentation

4. **Use two Orins**
   - Orin-1 handles camera processing
   - Orin-2 handles LiDAR processing
   - Parallel inference

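For the resolution option, the reduction in image pixels and BEV cells gives a rough bound on the compute saved in the resolution-dependent layers. A quick sketch of that arithmetic; treating cost as proportional to element count is a simplifying assumption, and the function name is made up for the example.

```python
def area_ratio(old_hw, new_hw):
    """Fraction of elements that remain after changing an H×W resolution."""
    return (new_hw[0] * new_hw[1]) / (old_hw[0] * old_hw[1])

image = area_ratio((256, 704), (192, 512))
bev = area_ratio((180, 180), (128, 128))
print(f"Image pixels kept: {image:.1%}")
print(f"BEV cells kept:    {bev:.1%}")
```

Both ratios come out near one half, so the ~40% speedup listed above is plausible only if most of the runtime scales with resolution; fixed-cost parts such as the detection head will not shrink.
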
---

## 📚 References

### Official Documentation
- [TensorRT Documentation](https://docs.nvidia.com/deeplearning/tensorrt/)
- [Orin Developer Guide](https://developer.nvidia.com/embedded/jetson-agx-orin-developer-kit)
- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)

### Open-Source Tools
- [Torch-Pruning](https://github.com/VainF/Torch-Pruning)
- [TensorRT-OSS](https://github.com/NVIDIA/TensorRT)
- [ONNX Runtime](https://github.com/microsoft/onnxruntime)

### Related Papers
- "Learned Step Size Quantization" (LSQ)
- "Network Slimming" (Channel Pruning)
- "Accelerating Deep Learning with TensorRT"

---

## ✅ Success Criteria

### Minimum Requirements
- ✅ Inference time < 80 ms
- ✅ Throughput > 12 FPS
- ✅ Power < 60 W
- ✅ Detection mAP > 63%
- ✅ Segmentation mIoU > 52%

### Ideal Targets
- 🌟 Inference time < 60 ms
- 🌟 Throughput > 16 FPS
- 🌟 Power < 45 W
- 🌟 Detection mAP > 65%
- 🌟 Segmentation mIoU > 55%

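The two tiers above can be checked automatically at the end of a benchmark run. A minimal sketch: the thresholds are copied from the lists, while the dictionary keys, function names and the sample measurement are hypothetical.

```python
MINIMUM = {"latency_ms": 80, "fps": 12, "power_w": 60, "map": 63, "miou": 52}
IDEAL = {"latency_ms": 60, "fps": 16, "power_w": 45, "map": 65, "miou": 55}
LOWER_IS_BETTER = {"latency_ms", "power_w"}

def meets(result, tier):
    """True if every metric strictly beats the tier's threshold."""
    for key, limit in tier.items():
        value = result[key]
        ok = value < limit if key in LOWER_IS_BETTER else value > limit
        if not ok:
            return False
    return True

def grade(result):
    """Classify a result as 'ideal', 'minimum' or 'fail'."""
    if meets(result, IDEAL):
        return "ideal"
    return "minimum" if meets(result, MINIMUM) else "fail"

# Hypothetical measurement, not a real benchmark result.
sample = {"latency_ms": 65, "fps": 15.4, "power_w": 48, "map": 66, "miou": 54}
print(grade(sample))
```
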
---

## 🚀 Quick Start

### One-Click Deployment Script

```bash
#!/bin/bash
# tools/deployment/deploy_to_orin.sh
set -euo pipefail  # stop at the first failing step

echo "========== BEVFusion Orin Deployment =========="

# 1. Pruning
echo "Step 1: Pruning the model..."
python tools/pruning/prune_bevfusion.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth \
    --output bevfusion_pruned.pth

# 2. Quantization
echo "Step 2: INT8 quantization..."
python tools/quantization/quantize_bevfusion.py \
    --model bevfusion_pruned.pth \
    --output bevfusion_int8.pth \
    --calibration-data data/nuscenes/calibration_100samples

# 3. TensorRT conversion
echo "Step 3: TensorRT conversion..."
python tools/tensorrt/convert_to_trt.py \
    --model bevfusion_int8.pth \
    --output bevfusion_int8.engine \
    --fp16 \
    --int8 \
    --workspace 4096

# 4. Test
echo "Step 4: Performance test..."
python tools/deployment/benchmark_orin.py \
    --engine bevfusion_int8.engine

echo "Deployment complete!"
```

---

Generated: 2025-10-17
Target hardware: NVIDIA AGX Orin 270T
Estimated deployment cycle: 2-3 weeks