# BEVFusion Project Progress Analysis and Preparation Checklist

**Analysis time**: 2025-10-22 14:16 UTC (22:16 Beijing time)
**Current phase**: Phase 2 - enhanced training in progress
**Overall progress**: 9.2% (Epoch 3/23)

---
## 📊 Current Training Status

### Live Progress
```
┌─────────────────────────────────────────────────────┐
│ Training status: 🟢 running normally                │
│ Elapsed: 17 h 55 min                                │
│ Current epoch: 3 / 23 (13%)                         │
│ Iteration: 1,200 / 10,299 (11.7%)                   │
│ Overall progress: 9.2%                              │
│ Current loss: 0.7436                                │
│ Estimated remaining: 6.8 days (163.6 h)             │
└─────────────────────────────────────────────────────┘
```

### GPU Usage
```
GPU 0-5: utilization 100%, memory 31GB/32GB, temperature 40°C ✅
GPU 6-7: idle ⚪
```

### Completed Work
- ✅ Epoch 1 finished (10-22 04:14)
- ✅ Epoch 2 finished and validated (10-22 12:42)
  - Detection mAP: 65.32%
  - Segmentation mIoU: 34.17%
- 🔄 Epoch 3 in progress (11.7%)

---
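The progress figures in the training status can be cross-checked with a little arithmetic. The sketch below is not the tool that produced the report. The function names are hypothetical, and the 7.88 hours per epoch is an assumption taken from the Epoch 1 wall-clock time (20:21 to 04:14).

```python
def finished_epochs(epoch, iteration, iters_per_epoch):
    """Epochs completed so far, counting the partially finished one."""
    return (epoch - 1) + iteration / iters_per_epoch


def overall_progress(epoch, iteration, iters_per_epoch, total_epochs):
    """Fraction of the whole run that is done."""
    return finished_epochs(epoch, iteration, iters_per_epoch) / total_epochs


def remaining_hours(epoch, iteration, iters_per_epoch, total_epochs, hours_per_epoch):
    """Remaining wall-clock hours, assuming a constant time per epoch."""
    left = total_epochs - finished_epochs(epoch, iteration, iters_per_epoch)
    return left * hours_per_epoch


progress = overall_progress(epoch=3, iteration=1200, iters_per_epoch=10299, total_epochs=23)
print(f"overall progress: {progress:.1%}")

# hours_per_epoch=7.88 is an assumption based on the Epoch 1 duration.
eta = remaining_hours(3, 1200, 10299, 23, hours_per_epoch=7.88)
print(f"estimated remaining: {eta:.1f} h")
```

With these inputs the progress comes out at the reported 9.2%. The remaining time lands within about an hour of the reported 163.6 h, and the exact value depends on which per-epoch duration is assumed.

---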
## 📈 Overall Project Progress

### Phase Overview

```
Phase 1: Baseline training   ████████████████████ 100% ✅ Done
├── Epochs 1-19 trained with the original config
├── Evaluation: mAP 66.26%, mIoU 36.44%
└── Completed: 2025-10-19

Phase 2: Enhanced training   ██░░░░░░░░░░░░░░░░░░ 9% 🔄 In progress
├── Current: Epoch 3/23
├── Expected completion: 2025-10-29
└── Target: mIoU 60-65%

Phase 3: MapTR integration   ░░░░░░░░░░░░░░░░░░░░ 0% ⏳ Pending decision
├── Three-task training
├── Estimated time: 2 weeks
└── Status: optional phase

Phase 4: Model optimization  ░░░░░░░░░░░░░░░░░░░░ 0% ⏳ Not started
├── Pruning: 110M → 60M
├── Quantization: FP32 → INT8
└── Estimated time: 1 week

Phase 5: TensorRT            ░░░░░░░░░░░░░░░░░░░░ 0% ⏳ Not started
├── ONNX export
├── Engine build
└── Estimated time: 4-5 days

Phase 6: Orin deployment     ░░░░░░░░░░░░░░░░░░░░ 0% ⏳ Not started
├── Deployment testing
├── Performance tuning
└── Estimated time: 1 week
```

---
## 🎯 Key Milestones

### Completed Milestones ✅
- ✅ 2025-10-19: Epoch 19 finished (original version)
- ✅ 2025-10-21 20:21: enhanced training started
- ✅ 2025-10-22 04:14: Epoch 1 finished
- ✅ 2025-10-22 12:42: Epoch 2 finished and validated

### Upcoming Milestones ⏳
- ⏳ 2025-10-22 20:30: Epoch 3 finished
- ⏳ 2025-10-23 12:00: Epoch 5 finished (short-term goal)
- ⏳ 2025-10-25 12:00: Epoch 10 finished (mid-training evaluation point)
- ⏳ 2025-10-27 12:00: Epoch 15 finished
- ⏳ 2025-10-29 17:49: Epoch 23 finished (training complete)

---
## 📋 Work That Can Be Prepared in Advance

### 🟢 P0 - Can be prepared immediately (this week)

#### 1. Prepare the evaluation script ⭐⭐⭐⭐⭐
**Time**: 1-2 hours
**Urgency**: high (needed for the Epoch 10 evaluation)

```bash
# Create the mid-training evaluation script
mkdir -p /workspace/bevfusion/scripts
cat > /workspace/bevfusion/scripts/evaluate_checkpoint.sh << 'SCRIPT'
#!/bin/bash
# Evaluate the performance of a given checkpoint

CHECKPOINT=$1
if [ -z "$CHECKPOINT" ]; then
    echo "Usage: bash evaluate_checkpoint.sh <checkpoint_path>"
    exit 1
fi

echo "Evaluating checkpoint: $CHECKPOINT"

# Evaluate detection performance
# Note: -np 8 assumes all 8 GPUs are free; lower it while training occupies GPU 0-5.
torchpack dist-run -np 8 python tools/test.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml \
    "$CHECKPOINT" \
    --eval bbox

# Evaluate segmentation performance
torchpack dist-run -np 8 python tools/test.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml \
    "$CHECKPOINT" \
    --eval map

echo "Evaluation finished!"
SCRIPT
```

**Preparation items**:
- ✅ Evaluation script (detection + segmentation)
- ✅ Result output path configuration
- ✅ Automated report generation script

---
#### 2. Prepare visualization tools ⭐⭐⭐⭐
**Time**: 2-3 hours
**Urgency**: medium-high

```python
# Create a comparison and visualization script
# /workspace/bevfusion/scripts/compare_epochs.py

# Features:
# - Compare performance across epochs
# - Plot the loss curve
# - Plot the mIoU trend
# - Compare per-class IoU
```

**Preparation items**:
- Script for comparing checkpoints
- Loss curve visualization
- mIoU trend plot
- Per-class IoU heatmap

---
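The planned `compare_epochs.py` could start from something as small as the sketch below. It only tabulates metric deltas between two checkpoints; plotting would be layered on top. The function name `compare_metrics` and the dictionary layout are hypothetical, and the numbers are the ones quoted in this document.

```python
def compare_metrics(baseline, candidate):
    """Return {metric: (baseline, candidate, delta)} for metrics present in both."""
    return {
        name: (baseline[name], candidate[name], round(candidate[name] - baseline[name], 2))
        for name in baseline
        if name in candidate
    }


# Values quoted in this document, in percent.
phase1_epoch19 = {"mAP": 66.26, "mIoU": 36.44}
enhanced_epoch2 = {"mAP": 65.32, "mIoU": 34.17}

report = compare_metrics(phase1_epoch19, enhanced_epoch2)
for name, (base, cand, delta) in report.items():
    print(f"{name:<5} {base:6.2f} -> {cand:6.2f} ({delta:+.2f})")
```

Keeping the comparison as plain data makes it easy to reuse for the Epoch 10 and final evaluations.

---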
#### 3. Research pruning and quantization tools ⭐⭐⭐⭐
**Time**: 3-4 hours
**Urgency**: medium

**What can be done now**:
```bash
# 1. Install the pruning tool
pip install torch-pruning

# 2. Study the Torch-Pruning documentation
# https://github.com/VainF/Torch-Pruning

# 3. Prepare a pruning script template
mkdir -p /workspace/bevfusion/tools/pruning
cat > /workspace/bevfusion/tools/pruning/prune_bevfusion.py << 'EOF'
import torch
import torch_pruning as tp
from mmdet3d.models import build_model

def prune_model(config, checkpoint, pruning_ratio=0.3):
    """
    Prune the BEVFusion model.

    Args:
        config: config file
        checkpoint: trained model checkpoint
        pruning_ratio: fraction to prune (0.3 means prune 30%)
    """
    # TODO: implement the pruning logic
    pass

if __name__ == '__main__':
    # Test pruning
    pass
EOF
```

**Preparation items**:
- Install torch-pruning
- Study the API documentation
- Prepare the pruning script skeleton
- Try small-scale pruning

---
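Before writing the pruning script, it may help to relate the Phase 4 target (110M → 60M parameters) to a per-layer channel pruning ratio. The sketch below rests on a strong simplifying assumption: every layer is a convolution whose input and output channels are both pruned by the same ratio, so the parameter count scales with the square of the kept fraction. Real BEVFusion layers will not all behave this way, so treat the result as a starting point only. Both function names are hypothetical.

```python
import math


def param_reduction(params_before, params_after):
    """Fraction of parameters removed."""
    return 1.0 - params_after / params_before


def uniform_channel_ratio(params_before, params_after):
    """Channel pruning ratio r such that (1 - r)**2 equals the kept parameter fraction.

    Assumption: parameters scale quadratically with the kept channel fraction.
    """
    keep_fraction = params_after / params_before
    return 1.0 - math.sqrt(keep_fraction)


reduction = param_reduction(110e6, 60e6)
ratio = uniform_channel_ratio(110e6, 60e6)
print(f"parameter reduction: {reduction:.1%}")
print(f"uniform channel pruning ratio under the quadratic assumption: {ratio:.1%}")
```

Under this assumption, removing about 45% of the parameters corresponds to pruning roughly a quarter of the channels per layer. This is why a parameter-count target and a `pruning_ratio` argument should not be treated as the same number.

---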
#### 4. Prepare quantization tools ⭐⭐⭐⭐
**Time**: 2-3 hours
**Urgency**: medium

```bash
# 1. Study PyTorch quantization
# https://pytorch.org/docs/stable/quantization.html

# 2. Prepare a quantization config template
mkdir -p /workspace/bevfusion/tools/quantization

cat > /workspace/bevfusion/tools/quantization/qat_config.yaml << 'EOF'
# QAT training configuration
quantization:
  enabled: true
  backend: fbgemm

  # Training parameters
  max_epochs: 5
  optimizer:
    lr: 1.0e-5  # very small learning rate

  # Layers excluded from quantization
  exclude_layers:
    - BatchNorm
    - LayerNorm
EOF
```

---
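To make the FP32 → INT8 step concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It illustrates the general idea only. It is not how PyTorch or TensorRT implement quantization internally, and the helper names are hypothetical.

```python
def int8_scale(values):
    """Symmetric per-tensor scale: the largest magnitude maps to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest integer step and clamp to the symmetric INT8 range."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in q_values]


weights = [0.5, -1.27, 0.003, 1.0]
scale = int8_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)
print(q)
print([round(r, 4) for r in restored])
```

The small weight `0.003` collapses to zero, while the larger ones are recovered to within half a quantization step. This rounding error is what QAT fine-tuning and INT8 calibration try to keep under control.

---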
### 🟡 P1 - Prepare in the near term (within 1-2 weeks)

#### 5. Study TensorRT ⭐⭐⭐⭐
**Time**: 4-6 hours
**Needed**: after training completes

**What can be done now**:
```bash
# 1. Download the TensorRT documentation
# https://docs.nvidia.com/deeplearning/tensorrt/

# 2. Learn ONNX export
# Understand the caveats of exporting BEVFusion to ONNX

# 3. Prepare the conversion script skeleton
mkdir -p /workspace/bevfusion/tools/tensorrt
```

**Learning focus**:
- ONNX export techniques
- TensorRT optimization strategies
- INT8 calibration methods
- DLA usage (specific to Orin)

---
#### 6. Prepare deployment environment documentation ⭐⭐⭐
**Time**: 3-4 hours
**Needed**: before Orin deployment

```markdown
# Documents to prepare
1. Orin environment configuration checklist
2. Dependency installation script
3. Model deployment steps
4. Performance test plan
```

---
#### 7. Prepare MapTR data (if integration is chosen) ⭐⭐⭐
**Time**: 4-6 hours
**Needed**: once the decision to integrate MapTR is made

**What can be done now**:
```bash
# 1. Check out the MapTR code
cd /workspace
git clone https://github.com/hustvl/MapTR.git

# 2. Study the vector map data format
# Get familiar with the nuScenes map API

# 3. Prepare the data extraction script
# Based on the plan in MAPTR_INTEGRATION_PLAN.md
```

---
### 🔵 P2 - Prepare later (in 2-4 weeks)

#### 8. Prepare the Orin hardware environment ⭐⭐
**Needed**: deployment phase

**Hardware list**:
- NVIDIA AGX Orin 270T development board
- 64GB SD card (JetPack system)
- Power adapter
- Network connection
- Cooling fan

---

#### 9. Prepare the test dataset ⭐⭐
**Needed**: during deployment validation

**Data preparation**:
- Subset of the nuScenes validation set (100-500 samples)
- Convert to a format usable on Orin
- Compress and transfer to Orin

---
## 🚀 Preparation Tasks That Can Start Right Now

### Task 1: Create the evaluation script (30 minutes)

```bash
mkdir -p /workspace/bevfusion/scripts

# Evaluation script
cat > /workspace/bevfusion/scripts/evaluate_checkpoint.sh << 'SCRIPT'
#!/bin/bash
CHECKPOINT=${1:-"runs/enhanced_from_epoch19/latest.pth"}
CONFIG="configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml"

echo "Evaluating: $CHECKPOINT"

# Detection evaluation
torchpack dist-run -np 8 python tools/test.py \
    "$CONFIG" "$CHECKPOINT" --eval bbox

# Segmentation evaluation
torchpack dist-run -np 8 python tools/test.py \
    "$CONFIG" "$CHECKPOINT" --eval map
SCRIPT

chmod +x /workspace/bevfusion/scripts/evaluate_checkpoint.sh
```

---
### Task 2: Create the loss monitoring script (20 minutes)

```python
# /workspace/bevfusion/scripts/monitor_training.py

import re
import matplotlib.pyplot as plt

# Assumes mmcv-style log lines such as "Epoch [3][1200/10299] ... loss: 0.7436".
LINE_RE = re.compile(r'Epoch \[(\d+)\]\[(\d+)/(\d+)\].*?\bloss: ([\d.]+)')

def plot_loss_curve(log_file):
    """Plot the training loss curve."""
    iters = []
    losses = []
    with open(log_file, 'r') as f:
        for line in f:
            match = LINE_RE.search(line)
            if match:
                epoch, it, per_epoch = (int(match.group(i)) for i in (1, 2, 3))
                # Use a global iteration index so epochs do not overlap on the x-axis.
                iters.append((epoch - 1) * per_epoch + it)
                losses.append(float(match.group(4)))

    plt.figure(figsize=(12, 6))
    plt.plot(iters, losses, linewidth=0.5, alpha=0.5)
    plt.xlabel('Iteration')
    plt.ylabel('Loss')
    plt.title('BEVFusion Training Loss Curve')
    plt.grid(True, alpha=0.3)
    plt.savefig('training_loss_curve.png', dpi=150)
    print("✅ Loss curve saved")

if __name__ == '__main__':
    plot_loss_curve('enhanced_training_6gpus.log')
```

---
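Per-iteration losses are noisy, so a smoothed curve is often easier to read than the raw one plotted in Task 2. The helper below is a minimal sketch of a trailing moving average. The name `moving_average` and the default window of 20 logged points are assumptions, not part of the monitoring script.

```python
from collections import deque


def moving_average(values, window=20):
    """Trailing moving average; early points average whatever is available so far."""
    buf = deque(maxlen=window)
    total = 0.0
    smoothed = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to be evicted by append()
        buf.append(v)
        total += v
        smoothed.append(total / len(buf))
    return smoothed


demo = moving_average([1, 2, 3, 4], window=2)
print(demo)
```

In the plotting function, the smoothed series could be drawn on top of the raw one, for example with a second `plt.plot(iters, moving_average(losses))` call.

---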
### Task 3: Study the pruning tool (1 hour)

```bash
# Install torch-pruning
pip install torch-pruning

# Read the documentation
# https://github.com/VainF/Torch-Pruning

# Test the basic functionality
python3 << 'EOF'
import torch
import torch_pruning as tp

# Build a simple test model
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(),
)

# Test pruning
example_input = torch.randn(1, 3, 224, 224)
pruner = tp.pruner.MagnitudePruner(
    model,
    example_input,
    importance=tp.importance.MagnitudeImportance(),
    pruning_ratio=0.3,
)

print("✅ Torch-Pruning is available")
EOF
```

---
### Task 4: Prepare a dataset subset (1 hour)

```bash
# Create a small test dataset
# Used to quickly validate the effect of pruning and quantization

mkdir -p /workspace/bevfusion/data/nuscenes_mini

# Copy 100 validation samples
# Used for quick tests
```

---
## 📅 Schedule

### Week 1 (current, 10-21 ~ 10-27)

| Date | Training progress | Preparation work | Status |
|------|-------------------|------------------|--------|
| 10-22 | Epoch 3 | ✅ Evaluation script | 🔄 |
| 10-23 | Epoch 5 | ✅ Visualization tools | To do |
| 10-24 | Epoch 7 | ✅ Pruning tool research | To do |
| 10-25 | Epoch 10 | ⭐ Mid-training evaluation | To do |
| 10-26 | Epoch 12 | ✅ Quantization tool research | To do |
| 10-27 | Epoch 15 | ✅ TensorRT study | To do |

### Week 2 (10-28 ~ 11-03)

| Date | Training progress | Preparation work | Status |
|------|-------------------|------------------|--------|
| 10-28 | Epoch 20 | ✅ Deployment documentation | To do |
| 10-29 | Epoch 23 finished | ⭐ Final evaluation | To do |
| 10-30 | - | 🔄 MapTR integration decision | To do |
| 10-31 | - | 🔄 Start pruning | To do |

---
## 🎯 Advance Preparation Checklist (by priority)

### Can be done immediately (no impact on training)

#### ✅ 1. Create evaluation and monitoring scripts
**Time**: 1 hour
**Value**: ⭐⭐⭐⭐⭐
**Contents**:
- [x] Evaluation script (detection + segmentation)
- [ ] Loss curve plotting
- [ ] Performance comparison script
- [ ] Automated report generation

**Run**:
```bash
mkdir -p /workspace/bevfusion/scripts
# Create the scripts described above
```

---
#### ✅ 2. Install and learn the pruning tool
**Time**: 2 hours
**Value**: ⭐⭐⭐⭐⭐
**Contents**:
- [ ] Install torch-pruning
- [ ] Read the documentation and examples
- [ ] Test the basic functionality
- [ ] Prepare the BEVFusion pruning script skeleton

**Run**:
```bash
pip install torch-pruning
# Learn the API and best practices
```

---

#### ✅ 3. Research quantization methods
**Time**: 2 hours
**Value**: ⭐⭐⭐⭐
**Contents**:
- [ ] PyTorch quantization documentation
- [ ] QAT vs PTQ comparison
- [ ] Prepare a quantization config template
- [ ] Understand the INT8 calibration workflow

**Resources**:
- PyTorch Quantization Guide
- TensorRT INT8 Calibration

---

#### ✅ 4. Learn TensorRT basics
**Time**: 3 hours
**Value**: ⭐⭐⭐⭐
**Contents**:
- [ ] Read the TensorRT documentation
- [ ] ONNX export best practices
- [ ] DLA usage (Orin only)
- [ ] Performance optimization techniques

**Resources**:
- NVIDIA TensorRT Documentation
- ONNX Runtime Optimization

---
#### ✅ 5. Prepare the deployment documentation skeleton
**Time**: 1 hour
**Value**: ⭐⭐⭐
**Contents**:
- [ ] Orin environment configuration checklist
- [ ] Dependency list
- [ ] Deployment steps outline
- [ ] Performance test plan

---
### Can be done this week (in parallel with training)

#### ✅ 6. MapTR decision preparation
**Time**: 4 hours
**Value**: ⭐⭐⭐
**Decision date**: 10-30

**Preparation items**:
```bash
# 1. Clone the MapTR code (if not already done)
cd /workspace
git clone https://github.com/hustvl/MapTR.git

# 2. Analyze the code structure
# Focus: the MapTRHead implementation

# 3. Estimate the effort
# - Data preparation: 1 day
# - Code integration: 2 days
# - Training: 2-3 days
# Total: 5-6 days
```

**Decision factors**:
- Is there enough time in the project schedule?
- Is a vector map really needed?
- Does the team have enough resources?

---
#### ✅ 7. Prepare the test dataset
**Time**: 2 hours
**Value**: ⭐⭐⭐⭐

```bash
# Create a small test set
# Used to quickly validate the effect of pruning/quantization

# 100 validation samples (representative sampling)
# Cover a variety of scenes: daytime, night, rain, etc.
```

---
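One way to get the "representative sampling" mentioned for the 100-sample test set is to stratify by scene condition. The sketch below assumes each validation sample has already been tagged as `day`, `night` or `rain`. How those tags are derived from nuScenes metadata is left open. The sample tokens and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(samples, total, seed=0):
    """Pick about `total` tokens, split across tags in proportion to their frequency."""
    by_tag = defaultdict(list)
    for token, tag in samples:
        by_tag[tag].append(token)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    picked = []
    for tag in sorted(by_tag):
        tokens = by_tag[tag]
        quota = max(1, round(total * len(tokens) / len(samples)))
        picked.extend(rng.sample(tokens, min(quota, len(tokens))))
    return picked


# Hypothetical tagged samples: 600 day, 300 night, 100 rain.
tagged = (
    [(f"day_{i}", "day") for i in range(600)]
    + [(f"night_{i}", "night") for i in range(300)]
    + [(f"rain_{i}", "rain") for i in range(100)]
)
subset = stratified_sample(tagged, total=100)
print(len(subset))
```

Because every tag gets at least one sample and quotas are rounded, the subset size can differ slightly from `total` for other tag distributions.

---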
#### ✅ 8. Prepare the benchmark script
**Time**: 2 hours
**Value**: ⭐⭐⭐

```python
# /workspace/bevfusion/tools/benchmark.py
# Measure inference speed and resource usage

import time
import torch
import numpy as np

def benchmark_model(model, data_loader, num_iters=100):
    """
    Benchmark model performance:
    - inference time
    - throughput
    - GPU memory usage
    """
    times = []
    for i, data in enumerate(data_loader):
        if i >= num_iters:
            break

        torch.cuda.synchronize()
        start = time.time()

        with torch.no_grad():
            # Adapt this call to the model's actual inference signature.
            model(data)

        torch.cuda.synchronize()
        end = time.time()

        times.append((end - start) * 1000)

    print(f"Mean inference time: {np.mean(times):.2f} ms")
    print(f"Throughput: {1000/np.mean(times):.2f} FPS")
    print(f"P99 latency: {np.percentile(times, 99):.2f} ms")
```

---
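The first iterations of a GPU benchmark usually include one-off costs such as kernel selection and memory allocation, so the latency summary is often computed after discarding a warm-up window. The sketch below shows that post-processing step in plain Python. The function name and the default of 10 warm-up iterations are assumptions. It uses a nearest-rank P99, which can differ slightly from the linearly interpolated `np.percentile`.

```python
import statistics


def summarize_latencies(times_ms, warmup=10):
    """Drop warm-up iterations, then report mean, throughput and nearest-rank P99."""
    steady = sorted(times_ms[warmup:])
    if not steady:
        raise ValueError("need more measurements than warm-up iterations")
    mean_ms = statistics.fmean(steady)
    rank = (99 * len(steady) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return {"mean_ms": mean_ms, "fps": 1000.0 / mean_ms, "p99_ms": steady[rank - 1]}


# Synthetic example: 10 slow warm-up iterations followed by 1..100 ms.
measurements = [500.0] * 10 + [float(ms) for ms in range(1, 101)]
summary = summarize_latencies(measurements)
print(summary)
```

The list of per-iteration times collected by the benchmark loop could be passed to this function in place of the three `print` statements.

---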
### Can be done next week (as training nears completion)

#### ✅ 9. Arrange the Orin hardware
**Time**: depends on the procurement process
**Needed**: early November

**Preparation items**:
- [ ] Confirm the Orin model (AGX Orin 270T)
- [ ] Arrange purchase or loan
- [ ] Prepare a JetPack installation USB drive
- [ ] Prepare the network environment

---

#### ✅ 10. Prepare a data transfer plan
**Time**: 1 hour
**Value**: ⭐⭐⭐

```bash
# How to transfer the large model and data to Orin
# Option 1: rsync over SSH
# Option 2: mounted network storage
# Option 3: USB external drive
```

---
## 📊 Preparation Timeline

```
Now ~ 10-23 (before Epoch 5)
├─ ✅ Create evaluation script         1 h       P0
├─ ✅ Prepare visualization tools      2 h       P0
├─ ✅ Install pruning tool             1 h       P0
└─ ✅ Research quantization methods    2 h       P0
   Total: 6 h

10-23 ~ 10-25 (before Epoch 10)
├─ ✅ Learn TensorRT                   3 h       P1
├─ ✅ Prepare benchmark script         2 h       P1
├─ ✅ Prepare test dataset             2 h       P1
└─ ✅ Study MapTR code                 4 h       P1
   Total: 11 h

10-25 ~ 10-29 (during training)
├─ ⭐ Epoch 10 mid-training evaluation immediately on completion
├─ ✅ Prepare deployment docs          2 h       P1
├─ ✅ Prepare Orin environment list    1 h       P1
└─ 🔄 MapTR decision                   10-30

10-29 ~ 11-05 (after training completes)
├─ ⭐ Final evaluation and comparison  immediately on completion
├─ 🔄 Run pruning                      2-3 days  P0
├─ 🔄 Quantization training            3-4 days  P0
└─ 🔄 TensorRT conversion              2-3 days  P0
```

---
## 💡 Priority Recommendations

### Must do this week (P0)
1. ✅ **Create the evaluation script** (1 hour) - needed for the Epoch 10 evaluation
2. ✅ **Install the pruning tool** (1 hour) - used right after training completes
3. ✅ **Research quantization methods** (2 hours) - get familiar in advance

### Recommended this week (P1)
4. ✅ **Prepare visualization tools** (2 hours) - needed for performance comparison
5. ✅ **Learn TensorRT** (3 hours) - required for deployment
6. ✅ **Prepare test data** (2 hours) - needed for quick validation

### Can be done next week (P2)
7. ✅ **Study the MapTR code** (4 hours) - if integration is chosen
8. ✅ **Prepare the Orin hardware** (procurement) - before deployment
9. ✅ **Deployment documentation skeleton** (2 hours) - plan ahead

---
## 🎯 Key Decision Points

### Decision Point 1: Epoch 10 evaluation (10-25) ⭐⭐⭐⭐⭐
**What to evaluate**:
- Whether detection mAP stays at 65% or above
- Whether segmentation mIoU reaches 45-50%
- Whether the loss convergence trend is healthy

**Decision**:
- If progress is good → continue training
- If mIoU < 40% → consider adjusting the configuration
- If detection drops → check the loss weights

---
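The Epoch 10 decision rules can be written down as a small function so the mid-training evaluation produces a reproducible recommendation. This is a sketch, and the function name is hypothetical. It reads "detection drops" as "mAP falls below the 65% level named in the evaluation criteria", which is an interpretation rather than something the rules state.

```python
def epoch10_decision(map_pct, miou_pct, map_floor_pct=65.0):
    """Return the follow-up actions suggested by the Decision Point 1 rules."""
    actions = []
    if miou_pct < 40.0:
        actions.append("consider adjusting the configuration")
    if map_pct < map_floor_pct:  # assumed meaning of "detection drops"
        actions.append("check the loss weights")
    return actions or ["continue training"]


print(epoch10_decision(map_pct=65.5, miou_pct=47.0))
print(epoch10_decision(map_pct=63.0, miou_pct=38.0))
```

Loss convergence health is not covered here, since it needs the full loss curve rather than two numbers.

---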
### Decision Point 2: MapTR integration decision (10-30) ⭐⭐⭐⭐
**Factors to consider**:
- Whether the project schedule has enough slack
- Whether a vector map is needed
- The team's technical capability

**Recommendation**:
- ✅ Tight schedule → skip MapTR and go straight to optimization
- ⚠️ Enough time → integration can be attempted

---
### Decision Point 3: Pruning degree (after training completes) ⭐⭐⭐
**Decide based on final performance**:
- If mIoU > 60% → aggressive pruning is possible (45-50%)
- If mIoU = 55-60% → conservative pruning (30-35%)
- If mIoU < 55% → light pruning (20-25%)

---
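The pruning-degree rule from Decision Point 3 maps directly onto a lookup. The sketch below returns the ratio range as fractions. The function name is hypothetical, and a final mIoU of exactly 60% is treated as part of the 55-60% band.

```python
def pruning_range(final_miou_pct):
    """Map the final mIoU (percent) to the pruning-ratio range from Decision Point 3."""
    if final_miou_pct > 60.0:
        return (0.45, 0.50)  # aggressive
    if final_miou_pct >= 55.0:
        return (0.30, 0.35)  # conservative
    return (0.20, 0.25)      # light


for miou in (62.0, 57.0, 50.0):
    low, high = pruning_range(miou)
    print(f"mIoU {miou:.1f}% -> prune {low:.0%}-{high:.0%}")
```

The lower end of the returned range could be used as the first value to try for the `pruning_ratio` argument of the pruning script.

---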
## 📝 Action List for This Week

### Today (10-22) ✅
- [ ] Create the evaluate_checkpoint.sh script
- [ ] Create the monitor_training.py script
- [ ] Install the torch-pruning library
- [ ] Read the pruning documentation

**Estimated time**: 2-3 hours
**Benefit**: lays the groundwork for the later stages

---

### Tomorrow (10-23) ✅
- [ ] Prepare the loss visualization script
- [ ] Prepare the performance comparison script
- [ ] Study the PyTorch quantization documentation
- [ ] Prepare the quantization config template

**Estimated time**: 3-4 hours

---

### This weekend (10-26 ~ 10-27) ✅
- [ ] Learn TensorRT basics
- [ ] Prepare the ONNX export script
- [ ] Study the MapTR code (if leaning towards integration)
- [ ] Prepare the deployment documentation skeleton

**Estimated time**: 6-8 hours

---
## 🚀 Quick Start Commands

### Create all preparation scripts (one-shot)

```bash
cd /workspace/bevfusion

# Create the directories
mkdir -p scripts tools/pruning tools/quantization tools/tensorrt

# 1. Evaluation script
cat > scripts/evaluate_checkpoint.sh << 'EOF'
#!/bin/bash
CHECKPOINT=${1:-"runs/enhanced_from_epoch19/latest.pth"}
CONFIG="configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml"
torchpack dist-run -np 8 python tools/test.py "$CONFIG" "$CHECKPOINT" --eval bbox map
EOF
chmod +x scripts/evaluate_checkpoint.sh

# 2. Loss monitoring script
cat > scripts/plot_loss.py << 'EOF'
import re, matplotlib.pyplot as plt
with open('enhanced_training_6gpus.log') as f:
    lines = f.readlines()
losses = []
for line in lines:
    if 'loss:' in line:
        match = re.search(r'loss: ([\d.]+)', line)
        if match:
            losses.append(float(match.group(1)))
plt.plot(losses, linewidth=0.5)
plt.xlabel('Iteration (×50)'); plt.ylabel('Loss')
plt.title('Training Loss Curve'); plt.grid(True)
plt.savefig('loss_curve.png', dpi=150)
print("✅ saved to loss_curve.png")
EOF

# 3. Pruning script skeleton
cat > tools/pruning/prune_template.py << 'EOF'
# BEVFusion pruning script template
# TODO: implement after training completes
import torch
import torch_pruning as tp
EOF

# 4. Quantization config template
cat > tools/quantization/qat_config.yaml << 'EOF'
# QAT configuration
max_epochs: 5
optimizer:
  lr: 1.0e-5
EOF

echo "✅ All preparation scripts created!"
```

---
## 📊 Summary

### Current Status
- ✅ Training is running normally (17.9 hours)
- ✅ 2 epochs completed
- ✅ Epoch 2 validation passed (mAP 65.32%)
- 🔄 Epoch 3 in progress (11.7%)
- ⏳ Expected to finish in 6.8 days

### Work That Can Be Prepared in Advance (10 items)

**Can be done immediately (this week)**:
1. ✅ Evaluation script (1 hour) ⭐⭐⭐⭐⭐
2. ✅ Visualization tools (2 hours) ⭐⭐⭐⭐
3. ✅ Pruning tool research (2 hours) ⭐⭐⭐⭐⭐
4. ✅ Quantization methods study (2 hours) ⭐⭐⭐⭐
5. ✅ TensorRT study (3 hours) ⭐⭐⭐⭐

**Can be done next week**:
6. ✅ MapTR code study (4 hours) ⭐⭐⭐
7. ✅ Test data preparation (2 hours) ⭐⭐⭐
8. ✅ Deployment documentation skeleton (2 hours) ⭐⭐⭐
9. ✅ Benchmark script (2 hours) ⭐⭐⭐
10. ✅ Orin hardware preparation (procurement) ⭐⭐

**Total**: about 20 hours of preparation (the sum of the estimates above, excluding hardware procurement), which fits within the training period

---

**Recommendation**: start with the high-priority P0 tasks and use the waiting time during training to prepare thoroughly for the later phases.