# Epoch 23 Deployment Test Plan Analysis

**Time**: 2025-10-30 15:15

**Requirement**: run evaluation/deployment tests on epoch_23.pth

---
## 🔍 Current Environment Status
### GPU Resources
```
GPU 0-3: Stage 1 training in progress (30 GB VRAM, 100% utilization)
GPU 4-7: completely idle (available for testing)

Available resources: 4 GPUs (GPU 4-7)
```
### Docker Environment
```
PyTorch: 1.10.1+cu102 ✅
mmcv: 1.4.0 ✅
Symlinks: fixed ✅
Working directory: /workspace/bevfusion ✅
```
### Why the Previous Evaluation Failed
```
Issue 1: the MASTER_HOST environment variable was missing
Issue 2: DataLoader workers hit a shared-memory error
         → RuntimeError: unable to write to file </torch_xxx>
```
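That shared-memory error typically means the container's `/dev/shm` is too small for the DataLoader workers' IPC buffers (Docker defaults to 64 MB). A quick diagnostic sketch; the `--shm-size` value shown is only illustrative, not taken from this setup:

```shell
# Check how much shared memory the container actually has; Docker's
# small default is far too little for multi-worker DataLoaders.
df -h /dev/shm

# If it is tiny, either set workers to 0 (the fix used below) or
# restart the container with a larger segment, e.g.:
#   docker run --shm-size=32g ...
```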
---
## 💡 Comparing the Two Options
### Option A: Parallel Testing in the Same Docker Container (Recommended ⭐⭐⭐)

#### Configuration
```
GPUs: 4-7 (four cards, avoiding the training GPUs)
Workers: 0 (avoids the shared-memory issue)
Process isolation: CUDA_VISIBLE_DEVICES=4,5,6,7
```
#### Pros
```
✅ No extra Docker container needed
✅ Direct access to checkpoints and data
✅ Environment already configured (symlinks, etc.)
✅ Training progress can be compared in real time
✅ 100% resource utilization (all 8 GPUs in use)
```
#### Cons
```
⚠️ GPU isolation must be enforced (via CUDA_VISIBLE_DEVICES)
⚠️ Requires workers=0 to avoid shared-memory conflicts
⚠️ Possible resource contention (I/O, CPU)
```
#### Risk Assessment
```
GPU conflict risk: very low (isolated via CUDA_VISIBLE_DEVICES)
Shared-memory risk: low (workers=0)
Impact-on-training risk: very low (separate GPUs)
I/O contention risk: low (a 2-3 h evaluation vs. a 9-day training run)

Overall risk: low
Estimated success probability: 85%
```

---
### Option B: Testing in a New Docker Container

#### Configuration
```
New container: an independent BEVFusion environment
Mount: shares the /workspace/bevfusion directory
GPUs: can use 4-7 or a separate allocation
```
#### Pros
```
✅ Full isolation, zero interference risk
✅ Can use a different configuration
✅ More flexible resource allocation
```
#### Cons
```
❌ A new Docker container must be started
❌ The environment must be reconfigured (symlinks, etc.)
❌ Dependencies may need to be reinstalled
❌ Data access requires mount configuration
❌ Higher management overhead
```
#### Setup Cost
```
Environment preparation: 30-60 minutes
- start the Docker container
- configure the environment
- create the symlinks
- verify mmcv, etc.

vs. the same-Docker option: 5 minutes
```

---
## 🎯 My Recommendation

### ⭐⭐⭐ Recommended: Option A - Parallel Testing in the Same Docker Container

#### Core Reasons
**1. The environment is already set up**
```
In the current Docker container:
✓ PyTorch 1.10.1 configured
✓ mmcv symlinks fixed
✓ all dependencies installed
✓ checkpoint and data in place

A new Docker container would require:
✗ reconfiguring the whole environment
✗ 30-60 minutes of preparation time
```
**2. GPU isolation is simple**
```bash
# CUDA_VISIBLE_DEVICES gives complete per-process isolation:
CUDA_VISIBLE_DEVICES=4,5,6,7 <evaluation command>
# → the evaluation process can only see GPU 4-7
# → the training process can only see GPU 0-3
# → zero conflict risk
```
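Because the variable is read per process, the isolation mechanism can be sanity-checked without touching the GPUs at all; a minimal sketch:

```shell
# Each child process gets its own copy of the variable, so training and
# evaluation never see each other's device list.
CUDA_VISIBLE_DEVICES=4,5,6,7 sh -c 'echo "eval process sees: $CUDA_VISIBLE_DEVICES"'
CUDA_VISIBLE_DEVICES=0,1,2,3 sh -c 'echo "train process sees: $CUDA_VISIBLE_DEVICES"'
```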
**3. The shared-memory problem is solvable**
```bash
# Previous failure cause: workers=4
# Fix: workers=0

torchpack dist-run -np 4 python tools/test.py \
    --cfg-options data.workers_per_gpu=0   # ← the key change
```
**4. The risk is manageable**
```
Worst case: the evaluation fails
Impact: none (training keeps running)
Cost: 2-3 hours of wasted GPU time
Recovery: immediate (just stop the evaluation)

vs. the new-Docker option:
Worst case: environment setup fails
Cost: 1-2 hours of wasted debugging
```

---
## 🚀 Recommended Implementation

### The Fixed Evaluation Script

Create `EVAL_EPOCH23_FIXED.sh`:
```bash
#!/bin/bash
# Epoch 23 evaluation - fixed version (parallel, same Docker container)

set -e

export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH

cd /workspace/bevfusion

echo "========================================================================"
echo "Phase 3 Epoch 23 evaluation (GPU 4-7, workers=0)"
echo "========================================================================"
echo ""

EVAL_DIR="eval_results/epoch23_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$EVAL_DIR"

CONFIG="configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml"
CHECKPOINT="runs/enhanced_from_epoch19/epoch_23.pth"

echo "Config: $CONFIG"
echo "Checkpoint: $CHECKPOINT"
echo "Output: $EVAL_DIR"
echo ""

# Key fixes:
# 1. Use GPU 4-7 (via CUDA_VISIBLE_DEVICES)
# 2. workers=0 (avoids the shared-memory issue)
# 3. Environment variables set explicitly

CUDA_VISIBLE_DEVICES=4,5,6,7 \
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
/opt/conda/bin/torchpack dist-run -np 4 /opt/conda/bin/python tools/test.py \
    "$CONFIG" \
    "$CHECKPOINT" \
    --eval bbox \
    --out "$EVAL_DIR/results.pkl" \
    --cfg-options data.workers_per_gpu=0 \
    2>&1 | tee "$EVAL_DIR/eval.log"

echo ""
echo "========================================================================"
echo "Evaluation complete!"
echo "========================================================================"
echo "Results: $EVAL_DIR/results.pkl"
echo "Log: $EVAL_DIR/eval.log"
echo ""
echo "Extracting performance metrics:"
# "|| true" keeps set -e from aborting if no metric line is present yet
grep -E "(NDS|mAP|mIoU)" "$EVAL_DIR/eval.log" | tail -20 || true
echo "========================================================================"
```
---
## 📋 Implementation Steps
### Step 1: Verify That Training Is Unaffected
```bash
# Check which GPUs the training job occupies
nvidia-smi | grep -A 1 "GPU 0\|GPU 1\|GPU 2\|GPU 3"

# Confirm training is progressing normally
tail -5 phase4a_stage1_*.log | grep "Epoch \["
```
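Before launching, it may also be worth confirming that GPU 4-7 are actually idle. A hedged sketch (assumes the driver's `nvidia-smi` supports `--query-compute-apps`, which recent versions do):

```shell
# Count compute processes on each evaluation GPU; all counts should be 0.
if command -v nvidia-smi >/dev/null 2>&1; then
  for g in 4 5 6 7; do
    n=$(nvidia-smi --id="$g" --query-compute-apps=pid --format=csv,noheader | grep -c .)
    echo "GPU $g: $n compute process(es)"
  done
else
  echo "(nvidia-smi not available on this host)"
fi
```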
### Step 2: Launch the Evaluation
```bash
bash EVAL_EPOCH23_FIXED.sh
```
### Step 3: Monitor Both Jobs
```bash
# Terminal 1: watch training
tail -f phase4a_stage1_*.log | grep "Epoch \["

# Terminal 2: watch the evaluation
tail -f eval_results/epoch23_*/eval.log

# Terminal 3: GPU status
watch -n 5 nvidia-smi
```
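The three terminals above can also be folded into a single snapshot. The following is a hypothetical sketch of what a `monitor_all_tasks.sh` might contain; the log-file globs are taken from this document and are assumptions, not verified paths:

```shell
#!/bin/bash
# Hypothetical monitor_all_tasks.sh: one combined status snapshot.

echo "=== GPU status ==="
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi || echo "(nvidia-smi not available)"

echo "=== Training progress ==="
tail -q -n 2 phase4a_stage1_*.log 2>/dev/null | grep "Epoch \[" || echo "(no training log line yet)"

echo "=== Evaluation progress ==="
tail -q -n 2 eval_results/epoch23_*/eval.log 2>/dev/null || echo "(no eval log yet)"
```

Run it under `watch -n 30` for a continuously refreshing view.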
### Step 4: After the Evaluation Finishes
```bash
# Inspect the results
cat eval_results/epoch23_*/eval.log | grep -A 50 "Evaluation"

# Compare against the baseline
# (baseline numbers are already on hand)
```

---
## ⚠️ Caveats

### Parameters That Must Be Set
**1. CUDA_VISIBLE_DEVICES=4,5,6,7**
```
Purpose: restricts the evaluation process to GPU 4-7
Importance: ⭐⭐⭐ prevents GPU conflicts
```
**2. data.workers_per_gpu=0**
```
Purpose: avoids the DataLoader shared-memory error
Importance: ⭐⭐⭐ the root cause of the previous failure
```
**3. Environment variables**
```
LD_LIBRARY_PATH: must be set
PATH: use the conda Python
Importance: ⭐⭐ ensures libraries load correctly
```
### Risk Mitigation
**If the evaluation slows down training**:
```bash
# Stop the evaluation immediately
pkill -f "test.py"

# Training is unaffected (it runs in separate processes)
```
**If a GPU conflict appears**:
```bash
# Check the GPU allocation
nvidia-smi

# Confirm that CUDA_VISIBLE_DEVICES took effect
echo $CUDA_VISIBLE_DEVICES

# Stop the evaluation and relaunch it
```
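To see exactly which PID sits on which card, `nvidia-smi`'s query mode can help; a sketch using its documented `--query-compute-apps` fields:

```shell
# List compute processes with their GPU; training PIDs should appear
# only on GPU 0-3 and the evaluation PIDs only on GPU 4-7.
command -v nvidia-smi >/dev/null 2>&1 \
  && nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv \
  || echo "(nvidia-smi not available)"
```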
---
## 📊 Option Comparison Table

| Metric | Option A: same Docker, parallel | Option B: new Docker |
|--------|---------------------------------|----------------------|
| **Preparation time** | 5 min | 30-60 min |
| **Environment setup** | ready | must be reconfigured |
| **GPU isolation** | CUDA_VISIBLE_DEVICES | physical isolation |
| **Risk level** | low | very low |
| **Resource utilization** | 100% (8 GPUs) | depends on allocation |
| **Management overhead** | low | medium |
| **Impact on training** | very low | none |
| **Success probability** | 85% | 95% |
| **Recommendation** | ⭐⭐⭐ | ⭐ |

---
## 🎯 Final Recommendation

### Recommended: Option A - Same-Docker Parallel Testing (85% confidence)

**Implementation steps**:
```bash
# 1. Create the fixed evaluation script
#    (see EVAL_EPOCH23_FIXED.sh above)

# 2. Launch the evaluation
bash EVAL_EPOCH23_FIXED.sh

# 3. Monitor (in a new terminal)
bash monitor_all_tasks.sh

# 4. Wait for completion (2-3 hours)
```
**Expected outcome**:
```
Evaluation time: 2-3 hours
Impact on training: none
GPU utilization: 100% (all 8 cards in use)
Success probability: 85%
```
**If it fails**:
```
Fallback: evaluate after the current Epoch 1 completes
- evaluate while training is paused
- all 8 GPUs can be used
- finishes in about 1 hour
- success probability: 95%
```

---
### Fallback: Option B - New Docker Container (15% recommendation)

**Use only when**:
- Option A has been tried and failed
- there is zero tolerance for interference with training
- there is time to configure a new environment

**Setup cost**: 30-60 minutes

---
## 🔧 The Fixed Evaluation Script

Key improvements:

1. ✅ Uses GPU 4-7 (via CUDA_VISIBLE_DEVICES)
2. ✅ workers=0 (avoids shared-memory errors)
3. ✅ Full set of environment variables

The script is ready and can be launched at any time.

---
**Recommended**: parallel testing in the same Docker container ✅

**Fallback**: a new Docker container (if the parallel run fails)

**Monitoring**: monitor_all_tasks.sh