# Epoch 23 Evaluation and Deployment - Quick Start Guide

**Last updated**: 2025-10-30
**Document type**: Quick operations guide

---

## 🎯 One-Sentence Summary

**Use the idle GPUs 4-7 now to evaluate epoch 23, giving Stage 1 a precise comparison baseline without disturbing the ongoing training.**

---

## ⚡ 5-Minute Quick Start

### Option 1: One-step launch (recommended)

```bash
cd /workspace/bevfusion

# Make the script executable
chmod +x EVAL_EPOCH23_COMPLETE.sh

# Launch the evaluation in the background, logging to a timestamped file
LOG="eval_epoch23_$(date +%Y%m%d_%H%M%S).log"
nohup bash EVAL_EPOCH23_COMPLETE.sh > "$LOG" 2>&1 &

# Record the process ID
echo $! > eval_epoch23.pid

# Follow this run's log in real time (avoids matching logs from earlier runs)
tail -f "$LOG"
```

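
Once the job is in the background, it can be useful to confirm that the recorded PID is still alive. The following is a minimal sketch, assuming the PID file name `eval_epoch23.pid` from the commands above; the helper name `eval_is_running` is hypothetical and not part of the project.

```bash
# Hypothetical helper: succeeds if the PID stored in the given file is alive.
eval_is_running() {
    pid_file="${1:-eval_epoch23.pid}"
    [ -s "$pid_file" ] || return 1      # missing or empty PID file
    pid=$(cat "$pid_file")
    # Signal 0 delivers nothing; it only checks that the process exists
    # and that we are allowed to signal it.
    kill -0 "$pid" 2>/dev/null
}
```

For example, `eval_is_running && echo "evaluation running" || echo "evaluation not running"` gives a one-line status check.
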
### Option 2: Step-by-step launch

```bash
cd /workspace/bevfusion

# Create the evaluation script (if it does not exist yet)
# Script already created: EVAL_EPOCH23_COMPLETE.sh

# Run the evaluation in the foreground
bash EVAL_EPOCH23_COMPLETE.sh
```

---

## 📊 Monitoring Evaluation Progress

### Terminal 1: Watch the evaluation log
```bash
# "完成" ("complete") matches the script's completion messages
tail -f eval_epoch23_*.log | grep -E "(Epoch|NDS|mAP|mIoU|完成)"
```

### Terminal 2: Watch the training log (confirm it is unaffected)
```bash
tail -f phase4a_stage1_*.log | grep "Epoch \["
```

### Terminal 3: Watch GPU status
```bash
# One CSV row per GPU, so all 8 GPUs stay visible
# (the default nvidia-smi table is likely cut off by `head -20` on an 8-GPU node)
watch -n 10 'nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv'
```

**Expected GPU status**:
```
GPUs 0-3: training (Stage 1)    - memory ~30 GB, utilization 100%
GPUs 4-7: evaluating (Epoch 23) - memory ~20 GB, utilization 70-90%

All 8 GPUs in use ✅
```

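
Before launching, it may be worth checking that GPUs 4-7 really are idle. A rough sketch follows: the helper `idle_gpus` is hypothetical, it parses the CSV produced by `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`, and the 1000 MiB idle threshold is an assumption rather than a value from this guide.

```bash
# Hypothetical helper: reads "index, memory.used" CSV rows on stdin and prints
# the indices of GPUs in [FIRST, LAST] using at most MAX_MIB MiB of memory.
idle_gpus() {
    first="$1"; last="$2"; max_mib="$3"
    awk -F', *' -v lo="$first" -v hi="$last" -v max="$max_mib" \
        '$1 >= lo && $1 <= hi && $2 <= max { print $1 }'
}

# Example (requires an NVIDIA driver):
# nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | idle_gpus 4 7 1000
```

If the example prints all of 4, 5, 6 and 7, the evaluation GPUs are free under that threshold.
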
---

## ⏱️ Expected Duration

```
Phase 1: 3D detection evaluation      45-60 min
Phase 2: BEV segmentation evaluation  30-45 min
Phase 3: Combined evaluation          60-90 min
─────────────────────────────
Total:                                approx. 2.5-3 hours
```

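
As a quick sanity check on the total, the per-phase bounds can be added up directly. This small sketch only uses the three ranges listed above:

```bash
# Sum the lower and upper bounds (in minutes) of the three phases.
min_total=$((45 + 30 + 60))
max_total=$((60 + 45 + 90))
echo "total: ${min_total}-${max_total} minutes"   # prints "total: 135-195 minutes"
```

The strict bounds are therefore 135-195 minutes (2.25-3.25 hours); the quoted 2.5-3 hours is a rounded estimate that sits inside that range.
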
---

## 📁 Where the Results Are Stored

```
eval_results/epoch23_complete_<timestamp>/
├── detection_eval.log          # Detection evaluation log
├── segmentation_eval.log       # Segmentation evaluation log
├── complete_eval.log           # Combined evaluation log
├── detection_results.pkl       # Detection results (15 GB)
├── segmentation_results.pkl    # Segmentation results (8 GB)
├── complete_results.pkl        # Combined results (18 GB)
└── SUMMARY.txt                 # Evaluation summary ⭐ read this first
```

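
The three result files listed above add up to 15 + 8 + 18 = 41 GB, so a free-space check before launching may save a failed run. This is a sketch only; the helper name `has_free_space` is hypothetical, and it relies on the POSIX `df -Pk` output format.

```bash
# Hypothetical helper: succeeds if the filesystem holding DIR has at least
# NEED_GB gigabytes available. `df -Pk` reports sizes in 1024-byte blocks,
# with the available space in the fourth column of the second line.
has_free_space() {
    dir="$1"; need_gb="$2"
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example:
# has_free_space /workspace/bevfusion 41 || echo "not enough disk space for the result files"
```
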
---

## 📊 Viewing the Results

### Quick look at the summary
```bash
# Find the most recent evaluation directory
EVAL_DIR=$(ls -td eval_results/epoch23_complete_* | head -1)

# Show the summary
cat "$EVAL_DIR/SUMMARY.txt"
```

### Detailed metrics

#### 3D detection
```bash
grep -A 30 "Evaluation" "$EVAL_DIR/detection_eval.log"
```

#### BEV segmentation
```bash
grep -A 20 "IoU" "$EVAL_DIR/segmentation_eval.log"
```

#### Per-class details
```bash
# Per-class detection AP
grep -E "(car|pedestrian|truck|bus)" "$EVAL_DIR/detection_eval.log"

# Per-class segmentation IoU
grep -E "(drivable|walkway|stop_line|divider)" "$EVAL_DIR/segmentation_eval.log"
```

---

## ✅ Success Criteria

### Signs that the evaluation has finished
- ✅ The log contains "全部评估完成" ("all evaluations complete")
- ✅ `SUMMARY.txt` has been generated
- ✅ All three `.pkl` files exist
- ✅ The log file is larger than 100 MB

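
The file-based criteria above can be checked in one step. A minimal sketch, assuming the file names from the results directory listing in this guide; the helper `eval_outputs_complete` is hypothetical:

```bash
# Hypothetical helper: succeeds only if DIR contains a non-empty SUMMARY.txt
# and all three non-empty result files.
eval_outputs_complete() {
    dir="$1"
    for f in SUMMARY.txt detection_results.pkl segmentation_results.pkl complete_results.pkl; do
        [ -s "$dir/$f" ] || { echo "missing or empty: $f" >&2; return 1; }
    done
}
```

It could be called as `eval_outputs_complete "$EVAL_DIR" && echo "all outputs present"`, where `EVAL_DIR` points at the evaluation directory.
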
### Expected performance (based on the training logs)
```
3D detection:
  NDS: ~0.6941
  mAP: ~0.6446

BEV segmentation:
  mIoU: ~0.4130
  Drivable Area: ~0.7063
  Stop Line: ~0.2657
  Divider: ~0.1903
```

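
To compare a measured metric with these expected values, a small tolerance check can help. This is a sketch under assumptions: the helper `within_tolerance` is hypothetical, and the tolerance of 0.005 in the example is an arbitrary choice, not a threshold defined by this guide.

```bash
# Hypothetical helper: succeeds if MEASURED is within TOL of EXPECTED.
# awk is used because POSIX shell arithmetic is integer-only.
within_tolerance() {
    awk -v m="$1" -v e="$2" -v t="$3" \
        'BEGIN { d = m - e; if (d < 0) d = -d; exit !(d <= t) }'
}

# Example: a measured NDS of 0.6925 against the expected ~0.6941
# within_tolerance 0.6925 0.6941 0.005 && echo "NDS is consistent with the training log"
```
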
---

## 🔧 FAQ

### Q1: Will the evaluation affect training?
**A**: No. `CUDA_VISIBLE_DEVICES=4,5,6,7` restricts the evaluation to its own GPUs.
- Training uses GPUs 0-3
- Evaluation uses GPUs 4-7
- No GPU contention; host CPU, RAM and disk I/O are still shared

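
The isolation works because an environment variable written as a prefix applies only to that one command. The sketch below demonstrates the scoping with plain `sh`, so it runs without a GPU:

```bash
# A VAR=value prefix applies only to that one command's environment.
CUDA_VISIBLE_DEVICES=4,5,6,7 sh -c 'echo "child sees: $CUDA_VISIBLE_DEVICES"'

# The calling shell is unchanged, so other jobs started from it are unaffected.
echo "parent sees: ${CUDA_VISIBLE_DEVICES:-<unset>}"
```

Inside the evaluation process, CUDA renumbers the visible devices from 0, so physical GPUs 4-7 appear there as devices 0-3.
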
### Q2: How do I stop the evaluation?
```bash
# Look up the process ID
cat eval_epoch23.pid

# Stop the evaluation
kill $(cat eval_epoch23.pid)

# Or, if evaluation worker processes are still running, match them by command line
pkill -f "test.py.*epoch_23"
```

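
A gentler variant is to send SIGTERM first and escalate to SIGKILL only if the process does not exit. This is a sketch, not part of the project scripts: the helper `stop_eval` and its 30-second default grace period are assumptions, and it only signals the PID recorded in the PID file, so worker processes may still need the `pkill` pattern shown above.

```bash
# Hypothetical helper: send SIGTERM, wait up to TIMEOUT seconds, then SIGKILL.
stop_eval() {
    pid_file="${1:-eval_epoch23.pid}"; timeout="${2:-30}"
    pid=$(cat "$pid_file") || return 1
    kill "$pid" 2>/dev/null || return 0      # process is already gone
    while [ "$timeout" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        timeout=$((timeout - 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"                       # still present after the grace period
    fi
    rm -f "$pid_file"
}
```
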
### Q3: What if the evaluation fails?
```bash
# Check the end of the log for errors
tail -100 eval_epoch23_*.log

# Common problems:
# 1. Worker shared-memory errors → the script already sets workers=0
# 2. GPU conflicts → check CUDA_VISIBLE_DEVICES
# 3. Environment problems → restart the Docker container
```

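
When the log is long, searching for typical failure signatures can be faster than reading the tail. A rough sketch: the helper `find_eval_errors` is hypothetical, and the pattern list is an assumption about what the failures look like, not something taken from the evaluation script.

```bash
# Hypothetical helper: print numbered lines that match common failure
# signatures. The pattern list is an assumption; extend it as needed.
find_eval_errors() {
    grep -nE 'Traceback|RuntimeError|CUDA out of memory|shared memory' "$@"
}

# Example:
# find_eval_errors eval_epoch23_*.log | tail -20
```
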
### Q4: Can I evaluate only detection or only segmentation?
```bash
# Detection only (~1 hour)
CUDA_VISIBLE_DEVICES=4,5,6,7 \
  torchpack dist-run -np 4 python tools/test.py \
    configs/.../multitask_enhanced_phase1_HIGHRES.yaml \
    runs/enhanced_from_epoch19/epoch_23.pth \
    --eval bbox \
    --cfg-options data.workers_per_gpu=0

# Segmentation only (~45 minutes)
CUDA_VISIBLE_DEVICES=4,5,6,7 \
  torchpack dist-run -np 4 python tools/test.py \
    configs/.../multitask_enhanced_phase1_HIGHRES.yaml \
    runs/enhanced_from_epoch19/epoch_23.pth \
    --eval map \
    --cfg-options data.workers_per_gpu=0
```

---

## 📈 What to Do After the Evaluation

### 1. Generate a detailed analysis report
```bash
# Extract detailed performance data
python tools/analysis/extract_metrics.py \
  --eval-dir eval_results/epoch23_complete_<timestamp> \
  --output epoch23_detailed_report.md
```

### 2. Compare against Stage 1 progress
```bash
# Once Stage 1 epoch 1 has finished
python tools/analysis/compare_checkpoints.py \
  --baseline runs/enhanced_from_epoch19/epoch_23.pth \
  --current runs/run-326653dc-c038af2c/epoch_1.pth
```

### 3. Identify room for improvement
```bash
# Analyze failure cases
python tools/analysis/analyze_failures.py \
  --results eval_results/epoch23_complete_<timestamp>/complete_results.pkl \
  --threshold 0.3
```

---

## 🚀 Next Steps

### Now (during the evaluation)
- ✅ Launch the evaluation script
- ✅ Check that both evaluation and training are running normally
- ✅ Keep waiting for Stage 1 training

### After the evaluation finishes (about 3 hours later)
- [ ] Review the evaluation results
- [ ] Generate the detailed report
- [ ] Compare against the baseline data
- [ ] Record key findings

### This week
- [ ] Complete the model analysis
- [ ] Design the optimization strategy
- [ ] Prepare the pruning tools

### Starting next week
- [ ] Run model pruning
- [ ] Quantization training
- [ ] TensorRT conversion

---

## 📚 Related Documents

### Main documents
- 📘 **EPOCH23_评估与部署完整计划.md** - Full plan (recommended first read)
- 📗 **EPOCH23_快速启动指南.md** - Quick start guide (this document)

### Reference documents
- 📄 `PHASE3_EPOCH23_BASELINE_PERFORMANCE.md` - Baseline data
- 📄 `ORIN_DEPLOYMENT_PLAN.md` - Detailed deployment plan
- 📄 `UPDATED_PLAN_WITH_EVAL.md` - Evaluation strategy

---

## ⚡ TL;DR (minimal version)

```bash
# 1. Launch the evaluation (in the background)
cd /workspace/bevfusion
chmod +x EVAL_EPOCH23_COMPLETE.sh
nohup bash EVAL_EPOCH23_COMPLETE.sh > eval_epoch23_$(date +%Y%m%d_%H%M%S).log 2>&1 &

# 2. Monitor progress
tail -f eval_epoch23_*.log

# 3. View the results (about 3 hours later)
cat eval_results/epoch23_complete_*/SUMMARY.txt

# Done!
```

---

**Status**: ✅ Script is ready
**Ready to run now**: Yes
**Estimated duration**: 2.5-3 hours
**Impact on training**: None

**Recommendation**: Start now and make full use of the GPUs! 🚀