# Guide: Evaluating Epoch 23 in a New Docker Container

**Purpose**: Evaluate `epoch_23.pth` in a new Docker container
**Advantage**: Fully isolated; does not affect the training container
**Generated**: 2025-10-30 15:15

---

## 🎯 Overview

```
Training container (current):
  - GPU 0-3: Stage 1 training
  - Keeps running, unaffected

Evaluation container (new):
  - GPU 4-7 or others: Epoch 23 evaluation
  - Separate environment, fully isolated
  - Can be removed after the evaluation finishes
```

---

## 📋 Steps to Start the New Container

### Step 1: Prepare the Docker command (run on the host)

```bash
# Option 1: same image as training, with the dataset mounted (recommended)
docker run -it --gpus '"device=4,5,6,7"' \
  --shm-size=8g \
  -v /workspace/bevfusion:/workspace/bevfusion \
  -v /path/to/dataset:/dataset \
  --name bevfusion-eval \
  <same image name> \
  /bin/bash

# Option 2: same explicit device IDs, without a separate dataset mount
# (only suitable if the dataset is reachable under /workspace/bevfusion)
docker run -it --gpus '"device=4,5,6,7"' \
  --shm-size=8g \
  -v /workspace/bevfusion:/workspace/bevfusion \
  --name bevfusion-eval \
  <image name> \
  /bin/bash
```
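The two commands above differ only in the image, the GPU list, and the optional dataset mount, so they can be generated from a few variables. The following is a minimal sketch and not part of the original setup: `build_eval_cmd` is a hypothetical helper, the image name and dataset path in the example are placeholders, and the function only prints the command for review without calling Docker.

```bash
# Hypothetical helper (not part of BEVFusion or Docker): print the
# "docker run" command for the evaluation container without running it.
#   $1 = image name, $2 = comma-separated GPU IDs, $3 = host dataset dir (may be empty)
build_eval_cmd() {
  image="$1"
  gpus="$2"
  dataset_dir="$3"
  cmd="docker run -it --gpus '\"device=${gpus}\"' --shm-size=8g"
  cmd="$cmd -v /workspace/bevfusion:/workspace/bevfusion"
  # Mount the dataset only when a host path was given.
  if [ -n "$dataset_dir" ]; then
    cmd="$cmd -v ${dataset_dir}:/dataset"
  fi
  cmd="$cmd --name bevfusion-eval ${image} /bin/bash"
  printf '%s\n' "$cmd"
}

# Example (image name and dataset path are placeholders):
#   build_eval_cmd my-bevfusion-image 4,5,6,7 /path/to/dataset
```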

**Key parameters**:
- `--gpus '"device=4,5,6,7"'`: assigns GPUs 4-7 to the container
- `--shm-size=8g`: enlarges shared memory (avoids DataLoader errors)
- `-v /workspace/bevfusion:/workspace/bevfusion`: mounts the working directory
- `--name bevfusion-eval`: container name

### Step 2: Configure the environment (run inside the new container)

```bash
# 2.1 Set PATH
export PATH=/opt/conda/bin:$PATH

# 2.2 Create the symbolic links (critical!)
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so

# 2.3 Set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# 2.4 Verify the environment
cd /workspace/bevfusion
python -c "import torch; print('PyTorch:', torch.__version__, 'CUDA:', torch.cuda.is_available())"
python -c "from mmcv.ops import nms_match; import mmcv; print('mmcv:', mmcv.__version__)"
python -c "from mmdet3d.apis import train_model; print('✅ All dependencies OK')"

# 2.5 Check GPU visibility
python -c "import torch; print('Visible GPUs:', torch.cuda.device_count())"
nvidia-smi
```
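Because the three links from step 2.2 are easy to forget in a fresh container, it can help to check them before importing mmcv. This is a minimal sketch: `check_torch_links` is a hypothetical helper, and the directory you pass should be whatever torch `lib` path your image actually uses. A dangling link (target missing) is reported as missing as well.

```bash
# Hypothetical helper: verify that the three compatibility links exist
# and resolve to a real file. Returns non-zero if any of them is missing.
check_torch_links() {
  lib_dir="$1"
  missing=0
  for name in libtorch_cuda_cu.so libtorch_cuda_cpp.so libtorch_cpu_cpp.so; do
    # -e follows symlinks, so a dangling link also counts as missing.
    if [ ! -e "$lib_dir/$name" ]; then
      echo "missing: $lib_dir/$name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage inside the container:
#   check_torch_links /opt/conda/lib/python3.8/site-packages/torch/lib \
#     && echo "links OK"
```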

### Step 3: Prepare the evaluation script (inside the new container)

Create `/workspace/bevfusion/eval_in_new_docker.sh`:

```bash
#!/bin/bash
# Epoch 23 evaluation script for the new container

# pipefail: a failure in tools/test.py must abort the script
# even though its output is piped through tee
set -eo pipefail

export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH

cd /workspace/bevfusion

echo "========================================================================"
echo "Epoch 23 evaluation (new Docker container)"
echo "========================================================================"
echo "Checkpoint: epoch_23.pth"
echo "GPU: all visible GPUs"
echo "========================================================================"
echo ""

# Create the output directory
EVAL_DIR="eval_results/epoch23_new_docker_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$EVAL_DIR"

CONFIG="configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml"
CHECKPOINT="runs/enhanced_from_epoch19/epoch_23.pth"

echo "Config: $CONFIG"
echo "Checkpoint: $CHECKPOINT"
echo "Output: $EVAL_DIR"
echo ""

# Count the visible GPUs
GPU_COUNT=$(python -c "import torch; print(torch.cuda.device_count())")
echo "Visible GPUs: $GPU_COUNT"
echo ""

echo "Starting evaluation..."
echo ""

# Use all visible GPUs
/opt/conda/bin/torchpack dist-run -np "$GPU_COUNT" /opt/conda/bin/python tools/test.py \
  "$CONFIG" \
  "$CHECKPOINT" \
  --eval bbox \
  --out "$EVAL_DIR/results.pkl" \
  --cfg-options data.workers_per_gpu=4 \
  2>&1 | tee "$EVAL_DIR/eval.log"

echo ""
echo "========================================================================"
echo "Evaluation finished!"
echo "========================================================================"
echo "Results: $EVAL_DIR/results.pkl"
echo "Log: $EVAL_DIR/eval.log"
echo ""

# Extract key metrics (|| true: an empty match must not abort the script)
echo "Metrics summary:"
echo "========================================================================"
grep -E "(NDS|mAP|mIoU)" "$EVAL_DIR/eval.log" | tail -30 || true

echo ""
echo "For full results see: $EVAL_DIR/eval.log"
echo "========================================================================"
```

### Step 4: Run the evaluation

```bash
cd /workspace/bevfusion
chmod +x eval_in_new_docker.sh
bash eval_in_new_docker.sh
```

---

## 📊 Data Access

### Directories to mount

**Required mounts**:
```
/workspace/bevfusion  → code, configs, checkpoints
/path/to/nuscenes     → nuScenes dataset
```

**Verify data access**:
```bash
# Check the checkpoint
ls -lh /workspace/bevfusion/runs/enhanced_from_epoch19/epoch_23.pth

# Check the dataset
ls /dataset/nuScenes/  # or your dataset path

# Check dataset_root in the config
grep "dataset_root" /workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_enhanced_phase1_HIGHRES.yaml
```
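The `grep` check above only prints the matching line; a script may want the value itself so it can test whether the directory exists. The sketch below assumes the config file declares `dataset_root` as a top-level `key: value` line. That is an assumption: the key may instead be inherited from a parent config, in which case nothing is printed. `config_dataset_root` is a hypothetical name.

```bash
# Hypothetical helper: print the value of a top-level "dataset_root:" line.
# Prints nothing if the file does not define the key itself.
config_dataset_root() {
  # Strip the key and any following whitespace; keep the first match only.
  sed -n 's/^dataset_root:[[:space:]]*//p' "$1" | head -n 1
}

# Usage (run from /workspace/bevfusion so relative paths resolve;
# CONFIG is the evaluation config path):
#   root=$(config_dataset_root "$CONFIG")
#   [ -d "$root" ] || echo "dataset_root not found: $root"
```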

---

## 🔧 Environment Setup Checklist

### Required in the new container

- [ ] Start the container (GPU 4-7)
- [ ] Set the PATH environment variable
- [ ] Create the symbolic links (3)
- [ ] Set LD_LIBRARY_PATH
- [ ] Verify that PyTorch works
- [ ] Verify that mmcv loads
- [ ] Verify GPU visibility
- [ ] Check dataset access
- [ ] Check checkpoint access
- [ ] Run the evaluation script

---

## ⚠️ Common Problems and Fixes

### Problem 1: mmcv fails to load
```bash
# Symptom:
#   ImportError: libtorch_cuda_cu.so

# Fix
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so
```

### Problem 2: GPUs not visible
```bash
# Check
nvidia-smi
python -c "import torch; print(torch.cuda.device_count())"

# If the GPU count is wrong,
# check the --gpus argument passed to docker run
```
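
### Problem 3: Wrong dataset path
```bash
# Check the config (CONFIG_FILE is the path of the evaluation config)
grep "dataset_root" CONFIG_FILE

# If the path is wrong, override it by adding this to the tools/test.py command:
#   --cfg-options dataset_root=/your/actual/path
# (keys that were derived from dataset_root may need to be overridden as well)
```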

### Problem 4: Not enough shared memory
```bash
# Symptom:
#   RuntimeError: unable to write to file

# Fix 1: start the container with --shm-size=8g
# Fix 2: evaluate with workers=0 by setting this in the tools/test.py command:
#   --cfg-options data.workers_per_gpu=0
```
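Before lowering the worker count, it is worth confirming how much shared memory the container actually received. This is a minimal sketch using POSIX `df`; `fs_size_kib` is a hypothetical helper, and the 8 GiB threshold in the usage comment simply mirrors `--shm-size=8g`.

```bash
# Hypothetical helper: print the total size, in KiB, of the filesystem
# that holds the given path.
fs_size_kib() {
  # -P forces the one-line-per-filesystem POSIX format; -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { print $2 }'
}

# Usage inside the container (8 GiB = 8388608 KiB):
#   shm_kib=$(fs_size_kib /dev/shm)
#   [ "$shm_kib" -ge 8388608 ] || echo "/dev/shm is smaller than 8 GiB: ${shm_kib} KiB"
```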

---

## 📊 Expected Evaluation Output

### Evaluation time
```
GPUs: 4 (GPU 4-7)
Validation set size: ~6,000 samples
Estimated time: 2-3 hours
```

### Output files
```
eval_results/epoch23_new_docker_TIMESTAMP/
├── results.pkl         # detailed predictions
├── eval.log            # evaluation log
└── (possible visualization output)
```

### Performance metrics
```
3D detection:
  - NDS, mAP
  - Per-class AP (Car, Pedestrian, etc.)
  - Error metrics (mATE, mASE, mAOE, etc.)

BEV segmentation:
  - mIoU
  - Per-class IoU (6 classes)
  - Performance at different thresholds
```

---

## 🔄 Coordination with the Training Container

### File sharing
```
Via the /workspace/bevfusion mount:
  ✓ Evaluation results are saved in the shared directory
  ✓ The training container can read the evaluation results
  ✓ Convenient for later comparison
```

### GPU isolation
```
Training container:   GPU 0-3
Evaluation container: GPU 4-7
→ Physically separate GPUs, no conflicts
```

### Resource contention
```
CPU: possibly slight contention
Memory: evaluation needs ~8GB, should be sufficient
Disk I/O: slight contention, minor impact
Network: no contention
```

---

## 📝 Quick-Start Checklist

**Run on the host**:
```bash
# 1. Confirm the image name
docker images | grep bevfusion

# 2. Start the new container (replace <image name>;
#    add -v /path/to/dataset:/dataset if the dataset lives outside /workspace/bevfusion)
docker run -it --gpus '"device=4,5,6,7"' \
  --shm-size=8g \
  -v /workspace/bevfusion:/workspace/bevfusion \
  --name bevfusion-eval \
  <image name> \
  /bin/bash
```

**Run inside the new container**:
```bash
# 1. Set up the environment
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# 2. Create the symbolic links
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so

# 3. Verify the environment
cd /workspace/bevfusion
python -c "from mmcv.ops import nms_match; print('✅ OK')"

# 4. Run the evaluation
bash eval_in_new_docker.sh
```

---

## 🎯 After the Evaluation

### Extract the results
```bash
# In either the training container or the new container
cd /workspace/bevfusion

# View the evaluation summary
grep -A 100 "Evaluation" eval_results/epoch23_new_docker_*/eval.log

# Extract key metrics
grep -E "(NDS|mAP|mIoU|stop_line|divider)" eval_results/epoch23_new_docker_*/eval.log
```
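When comparing runs, a single number per metric is easier to work with than raw `grep` output. The sketch below assumes the log contains summary lines whose first field is the metric name followed by a colon, such as `mAP: 0.1234` or `NDS: 0.1234`. That format is an assumption, so check your own `eval.log` and adjust the match if it differs. `last_metric` is a hypothetical helper.

```bash
# Hypothetical helper: print the value from the last "NAME: VALUE" line
# in a log file. Prints nothing if no such line exists.
last_metric() {
  log="$1"
  name="$2"
  # Remember the second field of every matching line; emit the last one seen.
  awk -v key="${name}:" '$1 == key { value = $2 } END { if (value != "") print value }' "$log"
}

# Usage (LOG is a placeholder for the real eval.log path):
#   last_metric "$LOG" NDS
#   last_metric "$LOG" mAP
```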

### Comparative analysis
```
Epoch 23 evaluation results vs. baseline (extracted from logs)
  → Verify consistency
  → Establish an exact baseline

Once Epoch 1 finishes:
  Epoch 1 evaluation results vs. Epoch 23
  → Quantify the improvement
```
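
---

## ✅ Summary of Advantages

### Advantages of the new-container approach
```
✅ Full isolation: no risk of interfering with training
✅ Separate environment: can use a different configuration
✅ Flexible allocation: GPUs can be chosen freely
✅ Safe and reliable: the training run is left untouched
```

### Implementation cost
```
Preparation: 30-60 minutes
  - Start the container: 5 minutes
  - Environment setup: 10-15 minutes
  - Verification: 10 minutes

Evaluation: 2-3 hours
Total: 3-4 hours
```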

---

## 📞 Information Needed

To start the new container, please confirm:

1. **Docker image name**:
   ```
   Which image does the training container use?
   Check with: docker ps
   ```

2. **Dataset mount path**:
   ```
   Where is the nuScenes dataset on the host?
   Where does dataset_root point in the current config?
   ```

3. **GPU allocation**:
   ```
   Should the evaluation use GPU 4-7?
   Or a different set of GPUs?
   ```

---

## 🚀 Simplified Launch (if the environment is identical)

If the new container uses exactly the same image as the training container:

```bash
# Run on the host: one-command launch
docker run -it --gpus '"device=4,5,6,7"' \
  --shm-size=8g \
  -v /workspace/bevfusion:/workspace/bevfusion \
  --name bevfusion-eval \
  <training container image name> \
  bash -c "
    export PATH=/opt/conda/bin:\$PATH && \
    export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:\$LD_LIBRARY_PATH && \
    cd /opt/conda/lib/python3.8/site-packages/torch/lib && \
    ln -sf libtorch_cuda.so libtorch_cuda_cu.so && \
    ln -sf libtorch_cuda.so libtorch_cuda_cpp.so && \
    ln -sf libtorch_cpu.so libtorch_cpu_cpp.so && \
    cd /workspace/bevfusion && \
    python -c 'from mmcv.ops import nms_match; print(\"\u2705 Environment ready\")' && \
    bash eval_in_new_docker.sh
  "
```

---

**Status**: The new-container evaluation guide is ready
**Next step**: Provide the Docker image name and dataset path to launch