
BEVFusion Training Quick Reference Card

Updated: 2025-10-30
Purpose: quick reference for subsequent training runs


🚨 Required after every Docker restart

cd /workspace/bevfusion
export PATH=/opt/conda/bin:$PATH

# 1. Create the required symbolic links (critical!)
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so

# 2. Verify the environment
cd /workspace/bevfusion
python -c "import torch; from mmcv.ops import nms_match; print('✅ Environment OK')"

# 3. Check training status
bash monitor_phase4a_stage1.sh  # if Stage 1 is running
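
Because the symlink step has to be repeated after every restart, it can be wrapped in a small helper that is safe to rerun. The following is only a sketch: the function name `link_torch_libs` is hypothetical, and it assumes the same three library aliases as the manual commands above.

```shell
# Hypothetical helper: recreate the torch library aliases used above.
# Argument: the torch lib directory.
link_torch_libs() {
    local lib_dir="$1"
    local pair link_name target
    # link_name:target pairs copied from the manual ln commands
    for pair in \
        "libtorch_cuda_cu.so:libtorch_cuda.so" \
        "libtorch_cuda_cpp.so:libtorch_cuda.so" \
        "libtorch_cpu_cpp.so:libtorch_cpu.so"; do
        link_name="${pair%%:*}"
        target="${pair##*:}"
        # Refuse to create a dangling link if the real library is absent
        if [ ! -e "$lib_dir/$target" ]; then
            echo "missing: $lib_dir/$target" >&2
            return 1
        fi
        # -s: symbolic link, -f: replace an existing link of the same name
        ln -sf "$target" "$lib_dir/$link_name"
    done
}
```

Under these assumptions it would be called as `link_torch_libs /opt/conda/lib/python3.8/site-packages/torch/lib`; a non-zero exit status means one of the real libraries was not found.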

Quick-start training

cd /workspace/bevfusion

# Stage 1 (600×600), currently recommended
bash START_PHASE4A_STAGE1.sh

# Monitor
bash monitor_phase4a_stage1.sh
tail -f phase4a_stage1_*.log | grep "Epoch \["

🔧 Quick fixes for common problems

mmcv fails to load

cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so

Out of GPU memory

# Reduce the number of GPUs or lower the resolution
# 600×600: feasible on 4 GPUs
# 800×800: 3 GPUs + gradient checkpointing
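
Before changing the GPU count or resolution, it can help to see which GPUs are actually close to their memory limit. The sketch below is an assumption-laden example, not part of the project scripts: `gpus_near_limit` is a hypothetical helper that reads `index, used, total` CSV lines (the shape produced by `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`) and prints the indices of GPUs at or above a usage percentage.

```shell
# Hypothetical helper: print indices of GPUs whose memory usage is at or
# above a percentage threshold (default 90).
# Input on stdin: one "index, used_MiB, total_MiB" line per GPU.
gpus_near_limit() {
    local threshold="${1:-90}"
    awk -F', *' -v t="$threshold" \
        '$3 > 0 && ($2 / $3) * 100 >= t { print $1 }'
}
```

A typical call would pipe the query into it: `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits | gpus_near_limit 90`. Empty output means no GPU is above the threshold.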

Code changes do not take effect

find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
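
The same cleanup can be written as a reusable function. This is a minimal sketch with a hypothetical name, `clean_pycache`; it adds `-prune` so that `find` does not try to descend into a directory it is about to delete, which is the usual source of the errors the command above silences with `2>/dev/null`.

```shell
# Hypothetical helper: delete every __pycache__ directory under a root
# (default: current directory) so stale bytecode cannot shadow edited code.
clean_pycache() {
    local root="${1:-.}"
    # -prune stops find from descending into the matched directory
    find "$root" -type d -name "__pycache__" -prune -exec rm -rf {} +
}
```

From the repository root, `clean_pycache .` would have the same effect as the one-liner above.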

Training hangs

# pkill -f takes an extended regex, so alternation is a plain "|"
pkill -9 -f "torchpack|mpirun"
nvidia-smi  # check the GPUs
bash START_SCRIPT.sh  # restart training

📊 Performance baseline

Phase 3 (epoch_23):
  NDS: 0.6941
  mAP: 0.6446
  mIoU: 0.41
  Stop Line: 0.27
  Divider: 0.19

Stage 1 targets (10 epochs):
  Stop Line: 0.35+
  Divider: 0.28+
  mIoU: 0.48+

📂 Key file locations

Checkpoint:
  Phase 3: runs/enhanced_from_epoch19/epoch_23.pth
  Stage 1: runs/run-326653dc-c038af2c/epoch_*.pth

Configs:
  Phase 3: configs/.../multitask_enhanced_phase1_HIGHRES.yaml
  Stage 1: configs/.../multitask_BEV2X_phase4a_stage1.yaml

Launch script:
  Stage 1: START_PHASE4A_STAGE1.sh
  
Monitoring:
  monitor_phase4a_stage1.sh

Code:
  Segmentation head: mmdet3d/models/heads/segm/enhanced.py

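🎯 Training configuration cheat sheet

| Setting | Phase 3 | Stage 1 | Stage 2 (planned) |
|---|---|---|---|
| BEV resolution | 0.3m (360×360) | 0.2m (540×540) | 0.15m (720×720) |
| GT resolution | 0.25m (400×400) | 0.167m (600×600) | 0.125m (800×800) |
| Decoder | 2 layers | 4 layers | 4 layers |
| Deep Sup | | | |
| Dice Loss | | | |
| GPUs | 8 | 4 | 3-4 |
| Memory per GPU | ~8GB | ~30GB | ~32GB |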
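
Each resolution entry pairs a cell size with a grid size, and multiplying the two gives the covered extent. For the values listed, this works out to roughly 108 m for the BEV rows and 100 m for the GT rows in every column, so the stages refine the grid without changing the area covered. The helper below is a small sketch of that arithmetic; the name `bev_extent` is hypothetical.

```shell
# Hypothetical helper: covered extent in metres = cell size (m) * cells per side.
bev_extent() {
    awk -v res="$1" -v cells="$2" 'BEGIN { printf "%.1f\n", res * cells }'
}

bev_extent 0.3 360     # Phase 3 BEV: prints 108.0
bev_extent 0.167 600   # Stage 1 GT: prints 100.2 (0.167 m is 1/6 m rounded)
```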

Full documentation: 项目进展与问题解决总结_20251030.md