================================================================================
BEVFusion Training Status + New Docker Evaluation Guide
================================================================================
Time: 2025-10-30 16:25

================================================================================
1. Current Training Status ✅ Excellent
================================================================================

Progress: Epoch [1][4450/30895] (14.4% complete)
Loss: 6.9 → 4.5 → 4.3 (steady decrease of about 35%) ✅
GPU: 0-3 @ 100% utilization, 93-94% memory in use
Temperature: 44-46°C (healthy)
Processes: 24 running normally

Estimated completion:
- Epoch 1: ~15 hours
- 10 epochs: ~8.5 days

Performance improvement:
Stop Line dice: 0.94→0.74 (still optimizing)
Divider dice: 0.96→0.85 (still optimizing)
3D IoU: 0.620 (stable)

Conclusion: training is very stable; no intervention needed ✅

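The progress percentage and the time estimates can be recomputed from the log marker itself. The following is a minimal sketch, assuming the marker always has the form `Epoch [e][i/n]` shown in the progress line; `parse_progress`, `percent_done`, and `eta_days` are hypothetical helper names, not part of BEVFusion.

```python
import re

def parse_progress(line):
    """Return (epoch, done_iters, total_iters) from an 'Epoch [e][i/n]' marker, or None."""
    m = re.search(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\]", line)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())

def percent_done(done_iters, total_iters):
    """Share of the current epoch that has been processed, in percent."""
    return 100.0 * done_iters / total_iters

def eta_days(hours_per_epoch, epochs):
    """Pure training time in days, ignoring validation and checkpoint overhead."""
    return hours_per_epoch * epochs / 24.0

epoch, done, total = parse_progress("Epoch [1][4450/30895]")
print(f"epoch {epoch}: {percent_done(done, total):.1f}% complete")
print(f"10 epochs at 15 h each: {eta_days(15, 10):.2f} days of pure training")
```

At 15 hours per epoch, ten epochs of pure training come to about 6.25 days, so the ~8.5-day figure presumably also budgets for per-epoch validation, checkpointing, or slowdowns. That reading is an assumption, not something the logs state.
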
================================================================================
2. New Docker Evaluation Guide (3 steps to start)
================================================================================

Step 1: Start the new Docker container (on the host)
────────────────────────────────────────
docker run -it --gpus '"device=4,5,6,7"' \
--shm-size=8g \
-v /workspace/bevfusion:/workspace/bevfusion \
-v <dataset_path>:/dataset/nuScenes \
--name bevfusion-eval \
<image_name> \
/bin/bash

Replace:
<dataset_path> - path to nuScenes on the host
<image_name> - same image as the training container

Step 2: Configure the environment (inside the new container, ~10 minutes)
────────────────────────────────────────
# 2.1 Environment variables
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# 2.2 Symbolic links
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so

# 2.3 Verification
cd /workspace/bevfusion
python -c "from mmcv.ops import nms_match; print('✅ OK')"
# Should report 4: host GPUs 4-7 appear as devices 0-3 inside the container
python -c "import torch; print('GPU count:', torch.cuda.device_count())"

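The symbolic links from step 2.2 can also be double-checked from Python before the evaluation starts. This is a minimal sketch, assuming the library directory and the link names listed in step 2.2; `missing_links` is a hypothetical helper, not part of PyTorch or BEVFusion.

```python
from pathlib import Path

# Link names created in step 2.2 and the library each one should point to.
EXPECTED_LINKS = {
    "libtorch_cuda_cu.so": "libtorch_cuda.so",
    "libtorch_cuda_cpp.so": "libtorch_cuda.so",
    "libtorch_cpu_cpp.so": "libtorch_cpu.so",
}

def missing_links(lib_dir, expected=EXPECTED_LINKS):
    """Return the link names that are absent or do not resolve to their target."""
    lib_dir = Path(lib_dir)
    problems = []
    for link_name, target_name in expected.items():
        link = lib_dir / link_name
        target = lib_dir / target_name
        ok = (link.is_symlink() and target.exists()
              and link.resolve() == target.resolve())
        if not ok:
            problems.append(link_name)
    return problems

if __name__ == "__main__":
    torch_lib = "/opt/conda/lib/python3.8/site-packages/torch/lib"
    bad = missing_links(torch_lib)
    print("all links OK" if not bad else f"broken or missing: {bad}")
```

An empty result means all three links resolve to the expected libraries; any name in the list points to a link that should be recreated with the `ln -sf` commands from step 2.2.
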

Step 3: Run the evaluation (2-3 hours)
────────────────────────────────────────
cd /workspace/bevfusion
bash eval_in_new_docker.sh

# Monitor (open another terminal on the host and enter the same container)
docker exec -it bevfusion-eval bash
cd /workspace/bevfusion
tail -f eval_results/epoch23_new_docker_*/eval.log

================================================================================
3. GPU Resource Allocation
================================================================================

Training container:
GPU 0-3: Stage 1 training ████████ (running continuously)

Evaluation container:
GPU 4-7: Epoch 23 evaluation ████████ (2-3 hours)

Overall utilization: 100% ✅ all 8 GPUs in use while the evaluation runs
Physical isolation: ✅ disjoint GPU sets, so no GPU contention between the two containers

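Before starting the evaluation container, it can help to confirm on the host that GPUs 4-7 are actually idle. The sketch below parses the CSV output of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`. The helper names and the idle thresholds (5% utilization, 500 MiB) are assumptions chosen for illustration, and the parser assumes every field is numeric.

```python
import shutil
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_rows(csv_text):
    """Parse 'index, util %, memory MiB' rows into {index: (util, mem_mib)}."""
    rows = {}
    for line in csv_text.strip().splitlines():
        index, util, mem = (int(field.strip()) for field in line.split(","))
        rows[index] = (util, mem)
    return rows

def busy_gpus(rows, wanted, max_util=5, max_mem_mib=500):
    """Return the wanted GPU indices that are missing or look occupied."""
    return [i for i in wanted
            if i not in rows or rows[i][0] > max_util or rows[i][1] > max_mem_mib]

if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; run this on the GPU host")
    else:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        print("busy or missing GPUs:", busy_gpus(parse_gpu_rows(out), wanted=[4, 5, 6, 7]))
```

An empty list means all four evaluation GPUs are free. The same check with `wanted=[0, 1, 2, 3]` should list every training GPU as busy while Stage 1 is running.
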
================================================================================
4. Prepared Files
================================================================================

In /workspace/bevfusion/ (shared directory):

Evaluation preparation:
✅ eval_in_new_docker.sh (evaluation script)
✅ NEW_DOCKER_EVAL_GUIDE.md (detailed guide)
✅ EVAL_DEPLOYMENT_ANALYSIS.md (comparison of deployment options)

Baseline data:
✅ PHASE3_EPOCH23_BASELINE_PERFORMANCE.md
→ NDS 0.6941, mAP 0.6446
→ mIoU 0.4130, Stop Line 0.2657, Divider 0.1903

Training status:
✅ TRAINING_STATUS_REPORT_20251030_1515.md
✅ monitor_phase4a_stage1.sh

Required files (already present):
✅ epoch_23.pth (516MB)
✅ Configuration file

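Because the new container sees these files only through the shared mount, a quick presence check inside it can catch a wrong `-v` mapping early. This is a minimal sketch, assuming the named files sit directly under the shared directory; `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# File names listed above. The checkpoint and the configuration file are left
# out because their exact paths are not given in this report.
PREPARED_FILES = [
    "eval_in_new_docker.sh",
    "NEW_DOCKER_EVAL_GUIDE.md",
    "EVAL_DEPLOYMENT_ANALYSIS.md",
    "PHASE3_EPOCH23_BASELINE_PERFORMANCE.md",
    "TRAINING_STATUS_REPORT_20251030_1515.md",
    "monitor_phase4a_stage1.sh",
]

def missing_files(root, names=PREPARED_FILES):
    """Return the names that do not exist as regular files under root."""
    root = Path(root)
    return [name for name in names if not (root / name).is_file()]

if __name__ == "__main__":
    print("missing:", missing_files("/workspace/bevfusion"))
```
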
================================================================================
5. After the Evaluation Completes
================================================================================

Result location:
eval_results/epoch23_new_docker_TIMESTAMP/
├── results.pkl
└── eval.log

View performance:
grep -E "(NDS|mAP|mIoU)" eval_results/epoch23_*/eval.log

Purpose:
→ Verify the accuracy of the baseline
→ Compare once Epoch 1 finishes
→ Quantify the Stage 1 improvement

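The baseline check can be made mechanical by extracting the metrics from `eval.log` and comparing them with the recorded epoch 23 values. The sketch below assumes each metric appears in the log as its name followed by `:`, `=`, or whitespace and then a decimal number. The real log format is not shown in this report, so the pattern may need adjusting. The function names and the 0.001 tolerance are likewise assumptions.

```python
import re

# Baseline values recorded for epoch 23 in PHASE3_EPOCH23_BASELINE_PERFORMANCE.md.
BASELINE = {"NDS": 0.6941, "mAP": 0.6446, "mIoU": 0.4130}

def extract_metrics(log_text, names=("NDS", "mAP", "mIoU")):
    """Return the last reported value for each metric name found in the log text."""
    found = {}
    for name in names:
        # Assumed format: metric name, optional ':' or '=', then a decimal number.
        matches = re.findall(rf"\b{name}\b\s*[:=]?\s*([0-9]*\.[0-9]+)", log_text)
        if matches:
            found[name] = float(matches[-1])
    return found

def compare_to_baseline(metrics, baseline=BASELINE, tol=1e-3):
    """Map each metric to (value, delta vs. baseline, within tolerance?)."""
    return {name: (value, round(value - baseline[name], 4),
                   abs(value - baseline[name]) <= tol)
            for name, value in metrics.items() if name in baseline}

if __name__ == "__main__":
    sample = "NDS: 0.6941\nmAP: 0.6446\nmIoU: 0.4130\n"
    for name, row in compare_to_baseline(extract_metrics(sample)).items():
        print(name, row)
```

In practice `sample` would be replaced by the contents of the `eval.log` file under `eval_results/`. A metric flagged as outside the tolerance suggests the new container does not reproduce the baseline and the environment setup should be rechecked before comparing against Stage 1.
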
================================================================================
Quick Reference
================================================================================

Training monitoring (current container):
bash monitor_phase4a_stage1.sh
tail -f phase4a_stage1_*.log | grep "Epoch \["

Evaluation monitoring (inside the new container):
tail -f eval_results/epoch23_*/eval.log

GPU monitoring (host):
watch -n 5 nvidia-smi

Stop the evaluation (inside the new container):
pkill -f "test.py"

Remove the evaluation container (host, after the evaluation completes):
docker stop bevfusion-eval
docker rm bevfusion-eval

================================================================================

Current status: training is running stably; the new Docker evaluation guide is ready ✅