# BEVFusion Deployment Plan for NVIDIA Orin 270T
## 🎯 Goals
Deploy the trained dual-task/triple-task BEVFusion model to the **NVIDIA AGX Orin 270T** and achieve:
- ✅ Real-time inference (>10 FPS)
- ✅ Low latency (<100 ms)
- ✅ Low power (<60 W)
- ✅ Preserved accuracy (mAP drop <3%)
---
## 📊 NVIDIA Orin 270T Specifications
### Hardware
| Parameter | Specification |
|------|------|
| **GPU** | 2048 CUDA cores + 64 Tensor Cores |
| **AI compute** | 275 TOPS (INT8) |
| **Memory** | 64 GB unified memory |
| **CPU** | 12-core Arm Cortex-A78AE |
| **Power** | 15-60 W (configurable) |
| **Architecture** | Ampere (same family as A100) |
### Performance Baselines
- **FP32**: ~5 TFLOPS
- **FP16**: ~10 TFLOPS
- **INT8**: ~20 TOPS
- **vs. A100**: roughly 1/10 the performance at about 1/5 the power
---
## 📋 Deployment Pipeline Overview
```
Training complete (A100 × 8)
  → Step 1: Model analysis and optimization (1-2 days)
  → Step 2: Structured pruning (2-3 days)
  → Step 3: Quantization-aware training (QAT) (3-4 days)
  → Step 4: TensorRT optimization (2-3 days)
  → Step 5: Testing on Orin (1-2 days)
  → Step 6: Performance tuning (2-3 days)
Production deployment ✅
```
**Total time**: roughly 2-3 weeks
---
## 🔧 Step 1: Model Analysis and Optimization (1-2 days)
### 1.1 Model Complexity Analysis
```bash
# Analyze parameter count and FLOPs
python tools/analysis/model_complexity.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth
```
**Expected output**:
```
BEVFusion dual-task model:
- Parameters: 110M
- FLOPs: 450 GFLOPs
- Inference time (A100): 90 ms
- Inference time (Orin, estimated): 450-900 ms (too slow!)
```
### 1.2 Bottleneck Analysis
Profile with Nsight Systems:
```bash
# Profile on the A100
nsys profile -o bevfusion_profile \
    python tools/benchmark.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth
```
**Modules to watch** (see the profiler sketch below):
- SwinTransformer backbone (most expensive)
- Multi-head attention
- 3D convolutions
- NMS post-processing
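For a quick per-operator breakdown without Nsight, `torch.profiler` also works. A minimal sketch, assuming `model` is the loaded BEVFusion model and `example_batch` is one preprocessed nuScenes sample in whatever format the model's forward expects:
```python
# Sketch: rank operators by GPU time to locate hot spots
import torch
from torch.profiler import profile, ProfilerActivity

model.eval().cuda()
with torch.no_grad(), profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
) as prof:
    model(example_batch)  # one representative sample (assumed calling convention)

# The top rows should surface the SwinTransformer and attention blocks
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```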
### 1.3 Export a Baseline Model
```bash
# Export to ONNX
python tools/export_onnx.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth \
    --output bevfusion_fp32.onnx
```
---
## ✂️ Step 2: Structured Pruning (2-3 days)
### 2.1 Pruning Strategy
**Goal**: cut parameters and FLOPs by 40-50%
**Plan**:
1. **Channel pruning**
   - SwinTransformer: remove 20% of channels
   - FPN: remove 30% of channels
   - Decoder: remove 25% of channels
2. **Layer pruning**
   - SwinTransformer: 6 layers → 4 layers
   - Decoder: 5 layers → 4 layers
3. **Attention head pruning**
   - Number of heads: 8 → 6
### 2.2 Pruning Tooling
**Recommended**: Torch-Pruning
```python
# tools/pruning/prune_bevfusion.py
import torch
import torch_pruning as tp

# Load the trained model
model = build_model(config)
model.load_state_dict(checkpoint)

# L1-magnitude importance (Torch-Pruning v1.x API; tp.strategy.L1Strategy
# is the older v0.x equivalent)
importance = tp.importance.MagnitudeImportance(p=1)

# Prune the SwinTransformer camera backbone
pruner = tp.pruner.MagnitudePruner(
    model.encoders['camera'].backbone,
    example_inputs=example_images,
    importance=importance,
    pruning_ratio=0.3,   # remove 30% of channels
    iterative_steps=5,   # spread pruning over 5 rounds
)

# Run the pruning rounds
for i in range(5):
    pruner.step()

# Fine-tune to recover accuracy
finetune(model, train_loader, epochs=5)

# Save the pruned model
torch.save(model.state_dict(), 'bevfusion_pruned.pth')
```
### 2.3 Post-Pruning Fine-Tuning
```bash
# Fine-tune on the original dataset for 5 epochs
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/multitask/fusion-det-seg-swint_pruned.yaml \
    --load_from bevfusion_pruned.pth \
    --cfg-options \
        max_epochs=5 \
        optimizer.lr=5.0e-5  # reduced learning rate
```
**Expected results**:
- Parameters: 110M → 60M (-45%)
- FLOPs: 450G → 250G (-44%)
- Accuracy drop: <2%
- Inference time: 90 ms → 50 ms (A100)
---
## 🔢 Step 3: Quantization-Aware Training (QAT, 3-4 days)
### 3.1 Quantization Strategy
**Goal**: FP32 → INT8 with <2% accuracy loss
**Plan**:
```
FP32 model (110M parameters)
  → PTQ (post-training quantization) for a quick feasibility check
  → QAT (quantization-aware training) to recover accuracy
INT8 model (same parameter count, ~4× smaller weights)
```
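The PTQ feasibility check can be done in eager mode before committing to QAT. A minimal sketch, assuming `load_pruned_model`, `calibration_loader`, `val_loader`, and `evaluate` helpers exist; the fbgemm backend runs on x86 CPU, so this gauges accuracy impact only, not Orin speed:
```python
# Sketch: post-training static quantization as an INT8 feasibility check
import torch
from torch.ao.quantization import get_default_qconfig, prepare, convert

model_fp32 = load_pruned_model('bevfusion_pruned.pth').eval()
model_fp32.qconfig = get_default_qconfig('fbgemm')

model_prep = prepare(model_fp32)      # insert activation observers
with torch.no_grad():
    for batch in calibration_loader:  # ~100 calibration samples
        model_prep(batch)

model_int8 = convert(model_prep)      # fold observers into INT8 ops
evaluate(model_int8, val_loader)      # compare mAP/mIoU against FP32
```
If PTQ alone already stays within ~2% of FP32, QAT mainly serves as insurance; if it degrades badly, QAT is mandatory.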
### 3.2 QAT with PyTorch Quantization
```python
# tools/quantization/quantize_bevfusion.py
import torch
from torch.quantization import prepare_qat, convert, get_default_qat_qconfig

# Load the pruned model; prepare_qat expects training mode
model = load_pruned_model('bevfusion_pruned.pth')
model.train()

# Quantization configuration
model.qconfig = get_default_qat_qconfig('fbgemm')

# Insert fake-quantization modules
model_qat = prepare_qat(model)

# QAT fine-tuning (important!)
# Train 3-5 epochs with a small learning rate
train_qat(
    model_qat,
    train_loader,
    epochs=5,
    lr=1e-5,
)

# Convert to a real INT8 model
model_qat.eval()
model_int8 = convert(model_qat)

# Save
torch.save(model_int8.state_dict(), 'bevfusion_int8.pth')
```
### 3.3 QAT Training Config
```yaml
# configs/nuscenes/multitask/fusion-det-seg-swint_qat.yaml
_base_: ./fusion-det-seg-swint_pruned.yaml

# QAT-specific settings
quantization:
  enabled: true
  qconfig: 'fbgemm'

# Training
max_epochs: 5
optimizer:
  lr: 1.0e-5  # very small learning rate
  weight_decay: 0.0001

# Weakened data augmentation
augment2d:
  resize: [[0.45, 0.48], [0.48, 0.48]]  # narrower resize range
  rotate: [-2.0, 2.0]                   # less rotation
augment3d:
  scale: [0.95, 1.05]    # less scaling
  rotate: [-0.39, 0.39]  # less rotation
  translate: 0.25        # less translation
```
### 3.4 Quantization Validation
```bash
# Evaluate the INT8 model
python tools/test.py \
    configs/nuscenes/multitask/fusion-det-seg-swint_qat.yaml \
    bevfusion_int8.pth \
    --eval bbox map
```
**Expected results**:
- Model size: 110M → 27.5M (-75%)
- Inference speed: 2-4× faster
- Accuracy drop: 1-2%
- Memory footprint: down ~75%
---
## 🚀 Step 4: TensorRT Optimization (2-3 days)
### 4.1 Conversion to TensorRT
```python
# tools/tensorrt/convert_to_trt.py
import tensorrt as trt
import torch

# 1. Export ONNX from the (QAT) INT8 model
torch.onnx.export(
    model_int8,
    dummy_input,
    'bevfusion_int8.onnx',
    opset_version=17,
    input_names=['images', 'points'],
    output_names=['bboxes', 'scores', 'labels', 'masks'],
    dynamic_axes={
        'images': {0: 'batch'},
        'points': {0: 'batch'},
    },
)

# 2. Build the TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the ONNX graph
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('bevfusion_int8.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError('ONNX parsing failed')

# Builder configuration
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4 GB

# INT8 with FP16 fallback
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)

# Calibration (for PTQ)
config.int8_calibrator = BEVFusionCalibrator(
    calibration_dataset,
    cache_file='bevfusion_calibration.cache',
)

# Build the serialized engine
serialized_engine = builder.build_serialized_network(network, config)

# Save
with open('bevfusion_int8.engine', 'wb') as f:
    f.write(serialized_engine)
```
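`BEVFusionCalibrator` above is not defined anywhere in this plan. A minimal sketch of what it could look like, built on TensorRT's `IInt8EntropyCalibrator2` interface; the batch size of 1 and the dataset format (indexable, yielding dicts of named contiguous NumPy arrays) are assumptions:
```python
import os
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

class BEVFusionCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds preprocessed samples to TensorRT for INT8 calibration."""

    def __init__(self, dataset, cache_file):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.dataset = dataset        # assumed: dataset[i] -> {name: np.ndarray}
        self.cache_file = cache_file
        self.index = 0
        sample = dataset[0]
        # One device buffer per named input, sized from the first sample
        self.buffers = {name: cuda.mem_alloc(arr.nbytes)
                        for name, arr in sample.items()}

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        if self.index >= len(self.dataset):
            return None               # signals calibration is finished
        sample = self.dataset[self.index]
        self.index += 1
        for name in names:
            cuda.memcpy_htod(self.buffers[name],
                             np.ascontiguousarray(sample[name]))
        return [int(self.buffers[name]) for name in names]

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, 'rb') as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)
```
The cache file matters in practice: once written, rebuilds (including on the Orin itself) skip recalibration entirely.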
### 4.2 TensorRT Inference Wrapper
```python
# tools/tensorrt/trt_inference.py
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

class BEVFusionTRT:
    def __init__(self, engine_path):
        # Load the engine
        with open(engine_path, 'rb') as f:
            runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        # Allocate GPU memory
        self.allocate_buffers()

    def allocate_buffers(self):
        self.inputs = []
        self.outputs = []
        self.bindings = []
        for i in range(self.engine.num_bindings):
            name = self.engine.get_binding_name(i)
            shape = tuple(self.engine.get_binding_shape(i))
            dtype = trt.nptype(self.engine.get_binding_dtype(i))
            # Allocate device memory
            device_mem = cuda.mem_alloc(trt.volume(shape) * np.dtype(dtype).itemsize)
            self.bindings.append(int(device_mem))
            # Keep shape/dtype so infer() can allocate matching host buffers
            buf = {'name': name, 'memory': device_mem,
                   'shape': shape, 'dtype': dtype}
            if self.engine.binding_is_input(i):
                self.inputs.append(buf)
            else:
                self.outputs.append(buf)

    def infer(self, images, points):
        # Copy inputs to the GPU
        cuda.memcpy_htod(self.inputs[0]['memory'], images)
        cuda.memcpy_htod(self.inputs[1]['memory'], points)
        # Run inference
        self.context.execute_v2(bindings=self.bindings)
        # Copy outputs back to the host
        outputs = []
        for output in self.outputs:
            host_mem = cuda.pagelocked_empty(output['shape'], output['dtype'])
            cuda.memcpy_dtoh(host_mem, output['memory'])
            outputs.append(host_mem)
        return outputs

# Usage
trt_model = BEVFusionTRT('bevfusion_int8.engine')
bboxes, scores, labels, masks = trt_model.infer(images, points)
```
### 4.3 TensorRT Optimization Tips
**Orin-specific tweaks**:
```python
# 1. DLA acceleration (Orin has two DLA cores)
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0  # use DLA core 0

# 2. Prefer the per-layer precision constraints during kernel selection
config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)

# 3. Dynamic-shape tuning (Orin favors small batches)
config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, True)

# 4. Optimization profile matched to the real input shapes
profile = builder.create_optimization_profile()
profile.set_shape(
    "images",
    min=(1, 6, 3, 256, 704),
    opt=(1, 6, 3, 256, 704),  # the shape to optimize for
    max=(2, 6, 3, 256, 704),
)
config.add_optimization_profile(profile)
```
---
## 🧪 Step 5: Testing on Orin (1-2 days)
### 5.1 Environment Setup
```bash
# Install dependencies on the Orin
# JetPack 5.1+ (ships CUDA 11.4, cuDNN 8.6, TensorRT 8.5)

# Python dependencies
pip3 install pycuda
pip3 install numpy opencv-python

# Copy the model over
scp bevfusion_int8.engine orin@192.168.1.100:/home/orin/models/
```
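Before benchmarking, a one-minute sanity check that TensorRT and CUDA are visible from Python on the device saves debugging later; a minimal sketch:
```python
# Sketch: verify the Orin-side Python stack before running benchmarks
import tensorrt as trt
import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)
print("TensorRT:", trt.__version__)
print("GPU:", dev.name(), f"({dev.total_memory() / (1 << 30):.1f} GiB)")
```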
### 5.2 Performance Benchmark
```python
# tools/benchmark_orin.py
import time
import numpy as np

# Load the TensorRT model
trt_model = BEVFusionTRT('bevfusion_int8.engine')

# Warm up
for _ in range(10):
    trt_model.infer(dummy_images, dummy_points)

# Timed runs
times = []
for i in range(100):
    start = time.time()
    outputs = trt_model.infer(images, points)
    end = time.time()
    times.append((end - start) * 1000)  # ms

print(f"Mean latency: {np.mean(times):.2f} ms")
print(f"Throughput: {1000 / np.mean(times):.2f} FPS")
print(f"P99 latency: {np.percentile(times, 99):.2f} ms")
```
### 5.3 Power Measurement
```bash
# Log power readings once per second
sudo tegrastats --interval 1000 > power_log.txt &

# Run inference
python3 tools/benchmark_orin.py

# Inspect the GPU/SoC power rail
grep "VDD_GPU_SOC" power_log.txt
```
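A small parsing sketch for the log; it assumes tegrastats entries of the form `VDD_GPU_SOC 4944mW/4944mW` (instantaneous/average), so check the exact format on your JetPack version first:
```python
# Sketch: average and peak power on the GPU/SoC rail from a tegrastats log
import re

readings = []
with open('power_log.txt') as f:
    for line in f:
        m = re.search(r'VDD_GPU_SOC (\d+)mW', line)
        if m:
            readings.append(int(m.group(1)))  # instantaneous reading, mW

if readings:
    print(f"GPU+SoC rail: mean {sum(readings) / len(readings) / 1000:.1f} W, "
          f"peak {max(readings) / 1000:.1f} W")
```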
### 5.4 Accuracy Validation
```bash
# Run the nuScenes validation set on the Orin
python3 tools/test_orin.py \
    --engine bevfusion_int8.engine \
    --data-root /data/nuscenes \
    --eval bbox map
```
**Expected performance**:
- **Latency**: 60-80 ms (vs. 90 ms FP32 on A100)
- **FPS**: 12-16 ✅
- **Power**: 40-50 W
- **Accuracy drop**: <3%
---
## ⚡ Step 6: Performance Tuning (2-3 days)
### 6.1 Multi-Stream Parallelism
```python
# Overlap preprocessing, inference, and postprocessing with CUDA streams.
# preprocess_images / preprocess_points / postprocess_outputs are assumed
# to enqueue their GPU work on the stream they are given.
import pycuda.driver as cuda

class OptimizedPipeline:
    def __init__(self, trt_model):
        self.trt_model = trt_model
        self.preprocess_stream = cuda.Stream()
        self.infer_stream = cuda.Stream()
        self.postprocess_stream = cuda.Stream()

    def process_frame(self, raw_images, raw_points):
        # Preprocess asynchronously, then record a completion event
        images = preprocess_images(raw_images, stream=self.preprocess_stream)
        points = preprocess_points(raw_points, stream=self.preprocess_stream)
        preprocess_done = cuda.Event()
        preprocess_done.record(self.preprocess_stream)

        # Inference starts as soon as preprocessing finishes on-device
        self.infer_stream.wait_for_event(preprocess_done)
        outputs = self.trt_model.infer(images, points)
        infer_done = cuda.Event()
        infer_done.record(self.infer_stream)

        # Postprocessing overlaps with the next frame's preprocessing
        self.postprocess_stream.wait_for_event(infer_done)
        results = postprocess_outputs(outputs, stream=self.postprocess_stream)
        return results
```
### 6.2 Memory Optimization
```python
# Use unified memory to avoid explicit host-device copies
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

# Allocate managed (unified) memory, visible to both CPU and GPU
images_um = cuda.managed_empty(shape, dtype=np.float32,
                               mem_flags=cuda.mem_attach_flags.GLOBAL)
points_um = cuda.managed_empty(shape, dtype=np.float32,
                               mem_flags=cuda.mem_attach_flags.GLOBAL)

# Fill directly from the CPU
np.copyto(images_um, preprocessed_images)

# The GPU reads the same buffers; no explicit copy required
outputs = trt_model.infer(images_um, points_um)
```
### 6.3 DLA Offload
Target the Orin's two DLA cores:
```python
# Offload parts of the network to the DLA.
# The DLA suits convolutions, pooling, and normalization;
# keep attention and other complex ops on the GPU.

# Assign per-layer device placement at engine-build time
dla_layers = [
    'encoder/camera/backbone/conv1',
    'encoder/camera/backbone/layer1',
    'encoder/lidar/voxelize',
]
for i in range(network.num_layers):
    layer = network.get_layer(i)
    if layer.name in dla_layers:
        config.set_device_type(layer, trt.DeviceType.DLA)
```
---
## 📊 Expected Performance Comparison
### Performance by Optimization Stage
| Stage | Params (effective) | FLOPs | Latency (Orin) | Accuracy drop | Verdict |
|------|--------|-------|---------------|---------|------|
| **Original FP32** | 110M | 450G | 900 ms | - | too slow ❌ |
| **Pruned FP32** | 60M | 250G | 500 ms | -1.5% | still slow ⚠️ |
| **Pruned + INT8** | 15M | 62G | 80 ms | -2.5% | usable ✅ |
| **+ TensorRT** | 15M | 62G | 65 ms | -2.5% | good ✅ |
| **+ multi-stream** | 15M | 62G | 50 ms | -2.5% | best 🌟 |
### Final Performance Targets
| Metric | Target | Expected |
|------|--------|---------|
| **Latency** | <80 ms | 50-65 ms |
| **Throughput** | >10 FPS | 15-20 FPS ✅ |
| **Power** | <60 W | 40-50 W |
| **Detection mAP** | >63% | 65-67% ✅ |
| **Segmentation mIoU** | >52% | 53-57% ✅ |
| **Memory footprint** | <4 GB | 2-3 GB |
---
## 🛠️ Tools and Scripts
### Required Tool Scripts
```bash
tools/
├── pruning/
│   ├── prune_bevfusion.py       # pruning script
│   └── eval_pruned_model.py     # evaluate the pruned model
├── quantization/
│   ├── quantize_bevfusion.py    # quantization script
│   ├── qat_train.py             # QAT training
│   └── calibrate.py             # INT8 calibration
├── tensorrt/
│   ├── convert_to_trt.py        # ONNX → TensorRT
│   ├── trt_inference.py         # TensorRT inference
│   └── optimize_dla.py          # DLA optimization
├── deployment/
│   ├── benchmark_orin.py        # Orin benchmarking
│   ├── deploy_to_orin.sh        # one-click deployment
│   └── monitor_performance.py   # performance monitoring
└── analysis/
    ├── model_complexity.py      # model complexity analysis
    └── latency_breakdown.py     # latency breakdown
```
---
## 📅 Detailed Schedule
### Week 1: Pruning and Quantization Prep
| Days | Task | Deliverable |
|------|------|------|
| Day 1-2 | Model analysis, ONNX export | baseline report |
| Day 3-4 | Structured pruning | pruned model (60M) |
| Day 5 | Fine-tune pruned model | fine-tuned checkpoint |
| Day 6-7 | Initial PTQ test | INT8 feasibility report |
### Week 2: QAT and TensorRT
| Days | Task | Deliverable |
|------|------|------|
| Day 8-10 | QAT training | INT8 model (15M) |
| Day 11-12 | TensorRT conversion and optimization | TRT engine |
| Day 13 | TensorRT test on A100 | performance baseline |
| Day 14 | Prepare the Orin environment | deployment package |
### Week 3: Orin Testing and Tuning
| Days | Task | Deliverable |
|------|------|------|
| Day 15 | Deploy to Orin | initial results |
| Day 16 | Performance and power tests | test report |
| Day 17-18 | Accuracy validation | accuracy report |
| Day 19-20 | Multi-stream and DLA optimization | optimized model |
| Day 21 | Final validation and documentation | deployment docs ✅ |
---
## 🔍 Key Technical Points
### 1. Orin-Specific Optimizations
**Orin vs. discrete GPUs**:
- ✅ Unified memory is a big win
- ✅ DLA is available and suits convolutional layers
- ⚠️ Fewer Tensor Cores, so FP16 gains are smaller
- ⚠️ Lower memory bandwidth; optimize access patterns
### 2. BEVFusion-Specific Optimizations
**Per-module notes**:
1. **SwinTransformer**
   - Most expensive (~40% of latency)
   - Responds best to pruning
   - Window attention can be approximated with convolutions
2. **LSS view transform**
   - Dense 3D convolutions
   - Quantizes well to INT8
   - Consider factorizing the computation
3. **ConvFuser**
   - Simple concat + conv
   - Nearly lossless to optimize
4. **TransFusion head**
   - Complex query mechanism
   - Quantize carefully
   - NMS can run in parallel on the CPU
### 3. Accuracy-Preservation Tips
**QAT essentials** (see the per-layer precision sketch below):
- ✅ Train on the full original dataset
- ✅ Keep the learning rate small (1e-5)
- ✅ 3-5 epochs is enough
- ✅ Leave BatchNorm layers unquantized
- ✅ Keep sensitive layers in FP16
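One way to keep sensitive layers in FP16 is to pin per-layer precision when building the TensorRT engine, before `build_serialized_network` in §4.1. The name-keyword filter below is an assumption; in practice, identify the sensitive layers by comparing per-layer outputs between the FP32 and INT8 engines:
```python
import tensorrt as trt

# Make TensorRT honor the per-layer constraints set below
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

SENSITIVE_KEYWORDS = ('attention', 'head')  # assumed name patterns

for i in range(network.num_layers):
    layer = network.get_layer(i)
    if any(k in layer.name.lower() for k in SENSITIVE_KEYWORDS):
        layer.precision = trt.float16              # compute in FP16
        for j in range(layer.num_outputs):
            layer.set_output_type(j, trt.float16)  # keep outputs FP16 too
```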
---
## 📦 Deployment Package Layout
```
bevfusion_orin_deploy/
├── models/
│   ├── bevfusion_int8.engine    # TensorRT engine
│   ├── config.yaml              # configuration
│   └── class_names.txt          # class names
├── lib/
│   ├── libbevfusion.so          # C++ inference library
│   └── python/
│       └── bevfusion_trt.py     # Python bindings
├── scripts/
│   ├── run_inference.sh         # inference script
│   └── benchmark.sh             # benchmarking
├── data/
│   └── sample_data/             # test data
├── docs/
│   ├── API.md                   # API documentation
│   └── OPTIMIZATION.md          # optimization notes
└── README.md                    # usage
```
---
## 🎯 Performance Fallback Strategy
### If Targets Are Missed
**Plan B options**:
1. **Prune further** (60M → 40M)
   - costs 1-2% accuracy
   - gains 20-30% speed
2. **Lower the input resolution**
   - images: 256×704 → 192×512
   - BEV grid: 180×180 → 128×128
   - ~40% faster
3. **Simplify the task set**
   - keep detection only
   - or pick one of detection/segmentation
4. **Use two Orins**
   - Orin 1 handles cameras
   - Orin 2 handles LiDAR
   - inference runs in parallel
---
## 📚 References
### Official Documentation
- [TensorRT Documentation](https://docs.nvidia.com/deeplearning/tensorrt/)
- [Orin Developer Guide](https://developer.nvidia.com/embedded/jetson-agx-orin-developer-kit)
- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)
### Open-Source Tools
- [Torch-Pruning](https://github.com/VainF/Torch-Pruning)
- [TensorRT-OSS](https://github.com/NVIDIA/TensorRT)
- [ONNX Runtime](https://github.com/microsoft/onnxruntime)
### Related Papers
- "Learned Step Size Quantization" (LSQ)
- "Network Slimming" (channel pruning)
- "Accelerating Deep Learning with TensorRT"
---
## ✅ Success Criteria
### Minimum Requirements
- ✅ Latency < 80 ms
- ✅ Throughput > 12 FPS
- ✅ Power < 60 W
- ✅ Detection mAP > 63%
- ✅ Segmentation mIoU > 52%
### Stretch Goals
- 🌟 Latency < 60 ms
- 🌟 Throughput > 16 FPS
- 🌟 Power < 45 W
- 🌟 Detection mAP > 65%
- 🌟 Segmentation mIoU > 55%
---
## 🚀 Quick Start
### One-Click Deployment Script
```bash
#!/bin/bash
# scripts/deploy_to_orin.sh
echo "========== BEVFusion Orin Deployment =========="

# 1. Pruning
echo "Step 1: pruning..."
python tools/pruning/prune_bevfusion.py \
    --config configs/nuscenes/multitask/fusion-det-seg-swint.yaml \
    --checkpoint runs/run-xxx/epoch_20.pth \
    --output bevfusion_pruned.pth

# 2. Quantization
echo "Step 2: INT8 quantization..."
python tools/quantization/quantize_bevfusion.py \
    --model bevfusion_pruned.pth \
    --output bevfusion_int8.pth \
    --calibration-data data/nuscenes/calibration_100samples

# 3. TensorRT conversion
echo "Step 3: TensorRT conversion..."
python tools/tensorrt/convert_to_trt.py \
    --model bevfusion_int8.pth \
    --output bevfusion_int8.engine \
    --fp16 \
    --int8 \
    --workspace 4096

# 4. Benchmark
echo "Step 4: benchmarking..."
python tools/deployment/benchmark_orin.py \
    --engine bevfusion_int8.engine

echo "Deployment complete!"
```
---
Generated: 2025-10-17
Target hardware: NVIDIA AGX Orin 270T
Estimated deployment timeline: 2-3 weeks