# BEVFusion Training Quick Reference Card

**Updated**: 2025-10-30
**Purpose**: Quick reference for subsequent training runs

---

## 🚨 Required After Every Docker Restart

```bash
cd /workspace/bevfusion
export PATH=/opt/conda/bin:$PATH

# 1. Create the required symlinks (critical!)
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so

# 2. Verify the environment
cd /workspace/bevfusion
python -c "import torch; from mmcv.ops import nms_match; print('✅ Environment OK')"

# 3. Check training status
bash monitor_phase4a_stage1.sh  # if Stage 1 is running
```
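
The symlink step can be wrapped in a small function so that it is safe to re-run and fails loudly when a target library is missing. This is a minimal sketch: `ensure_torch_symlinks` is a hypothetical helper name, not part of BEVFusion, and its default directory is simply the path used in this card.

```bash
# Hypothetical helper: recreate the torch alias symlinks in a library directory.
ensure_torch_symlinks() {
    local lib_dir="${1:-/opt/conda/lib/python3.8/site-packages/torch/lib}"
    local pair target link
    for pair in \
        "libtorch_cuda.so:libtorch_cuda_cu.so" \
        "libtorch_cuda.so:libtorch_cuda_cpp.so" \
        "libtorch_cpu.so:libtorch_cpu_cpp.so"; do
        target="${pair%%:*}"   # real library (text before the colon)
        link="${pair##*:}"     # alias name (text after the colon)
        if [ ! -e "$lib_dir/$target" ]; then
            echo "missing target: $lib_dir/$target" >&2
            return 1
        fi
        # Relative link inside lib_dir; -n replaces an existing link instead of following it.
        ln -sfn "$target" "$lib_dir/$link"
    done
}
```

After sourcing it, `ensure_torch_symlinks && python -c "import torch; from mmcv.ops import nms_match"` would run the import check only if all three links were created.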

---

## ⚡ Quick-Start Training

```bash
cd /workspace/bevfusion

# Stage 1 (600×600), currently recommended
bash START_PHASE4A_STAGE1.sh

# Monitoring
bash monitor_phase4a_stage1.sh
tail -f phase4a_stage1_*.log | grep "Epoch \["
```
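
To check progress without keeping `tail -f` attached, the same `Epoch [` filter can be applied once to a log file. The sketch below assumes only that progress lines contain `Epoch [N]`, as the grep in this card implies; the rest of the log format is unknown here, and both function names are hypothetical.

```bash
# Print the most recent progress line (any line containing "Epoch [").
latest_epoch_line() {
    grep -F "Epoch [" "$1" | tail -n 1
}

# Print only the epoch number from that line, e.g. 3 for "... Epoch [3][50/100] ...".
latest_epoch_number() {
    latest_epoch_line "$1" | sed -n 's/.*Epoch \[\([0-9][0-9]*\)].*/\1/p'
}
```

Both functions take a single log path, so a glob such as `phase4a_stage1_*.log` needs to be narrowed to one file first.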

---

## 🔧 Quick Fixes for Common Problems

### mmcv fails to load
```bash
cd /opt/conda/lib/python3.8/site-packages/torch/lib
ln -sf libtorch_cuda.so libtorch_cuda_cu.so
ln -sf libtorch_cuda.so libtorch_cuda_cpp.so
ln -sf libtorch_cpu.so libtorch_cpu_cpp.so
```

### Out of GPU memory
```bash
# Use fewer GPUs or lower the resolution
# 600×600: feasible on 4 GPUs
# 800×800: 3 GPUs + gradient checkpointing
```
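
Before choosing a GPU count, it can help to see which GPUs actually have headroom. The sketch below filters the CSV output of `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits` (values in MiB); `gpus_with_free_memory` is a hypothetical helper, and the threshold is whatever the target stage needs.

```bash
# Read "index, used_mib, total_mib" lines on stdin and print the index of
# every GPU with at least min_free_mib of free memory.
gpus_with_free_memory() {
    local min_free_mib="$1"
    awk -F', *' -v min="$min_free_mib" '($3 - $2) >= min { print $1 }'
}

# Usage (on the training host):
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#       --format=csv,noheader,nounits | gpus_with_free_memory 30000
```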

### Code changes not taking effect
```bash
# Remove stale bytecode caches so Python recompiles the edited modules
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null
```
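
A slightly more defensive variant, assuming a `find` that supports `-prune` and `-exec ... +` (GNU and BSD both do): `-prune` stops `find` from descending into a directory it is about to delete, so the error suppression is no longer needed. `clear_pycache` is a hypothetical name.

```bash
# Delete every __pycache__ directory under the given root (default: current directory).
clear_pycache() {
    local root="${1:-.}"
    find "$root" -type d -name "__pycache__" -prune -exec rm -rf {} +
}
```

Source files are left untouched; only the cache directories are removed.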
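
### Training hangs
```bash
# pkill -f takes an extended regex, so alternation is a bare | (a \| would match a literal pipe)
pkill -9 -f "torchpack|mpirun"
nvidia-smi            # check that the GPUs have been released
bash START_SCRIPT.sh  # restart with the appropriate launch script
```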

---

## 📊 Performance Baseline

```
Phase 3 (epoch_23):
  NDS: 0.6941
  mAP: 0.6446
  mIoU: 0.41
  Stop Line: 0.27
  Divider: 0.19

Stage 1 targets (10 epochs):
  Stop Line: 0.35+
  Divider: 0.28+
  mIoU: 0.48+
```
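
The gap between the Phase 3 baseline and the Stage 1 targets is simple arithmetic, sketched here with `awk` because bash has no floating-point comparison. Both function names are hypothetical; the numbers in the usage comments are the ones listed in this card.

```bash
# Succeed (exit 0) when a measured value reaches its target.
meets_target() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v >= t) }'
}

# Print how far a baseline is from its target, to two decimals.
gap_to_target() {
    awk -v b="$1" -v t="$2" 'BEGIN { printf "%.2f\n", t - b }'
}

# Usage with the values from this card:
#   gap_to_target 0.27 0.35   # Stop Line
#   gap_to_target 0.19 0.28   # Divider
#   gap_to_target 0.41 0.48   # mIoU
#   meets_target 0.36 0.35 && echo "Stop Line target reached"
```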

---

## 📂 Key File Locations

```
Checkpoints:
  Phase 3: runs/enhanced_from_epoch19/epoch_23.pth
  Stage 1: runs/run-326653dc-c038af2c/epoch_*.pth

Configs:
  Phase 3: configs/.../multitask_enhanced_phase1_HIGHRES.yaml
  Stage 1: configs/.../multitask_BEV2X_phase4a_stage1.yaml

Launch script:
  Stage 1: START_PHASE4A_STAGE1.sh

Monitoring:
  monitor_phase4a_stage1.sh

Code:
  Segmentation head: mmdet3d/models/heads/segm/enhanced.py
```
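
Because the Stage 1 checkpoints follow the `epoch_*.pth` pattern, the newest one can be picked by numeric order rather than by eye. This is a minimal sketch assuming a `sort` with version ordering (`-V`, available in GNU coreutils); `latest_checkpoint` is a hypothetical helper.

```bash
# Print the epoch_N.pth file with the highest N in a run directory
# (prints nothing if the directory holds no matching checkpoints).
latest_checkpoint() {
    local run_dir="$1" f
    for f in "$run_dir"/epoch_*.pth; do
        [ -e "$f" ] && printf '%s\n' "$f"
    done | sort -V | tail -n 1
}

# Usage: latest_checkpoint runs/run-326653dc-c038af2c
```

Version ordering matters here: a plain lexical sort would rank `epoch_9.pth` after `epoch_10.pth`.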

---

## 🎯 Training Configuration Cheat Sheet

| Setting | Phase 3 | Stage 1 | Stage 2 (planned) |
|------|---------|---------|-------------|
| BEV resolution | 0.3 m (360×360) | 0.2 m (540×540) | 0.15 m (720×720) |
| GT resolution | 0.25 m (400×400) | 0.167 m (600×600) | 0.125 m (800×800) |
| Decoder | 2 layers | 4 layers | 4 layers |
| Deep supervision | ❌ | ✅ | ✅ |
| Dice loss | ❌ | ✅ | ✅ |
| GPUs | 8 | 4 | 3-4 |
| Memory per GPU | ~8 GB | ~30 GB | ~32 GB |
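
The cell sizes and grid sizes in the table are tied together by the covered extent: cell size = extent / cells per side. Multiplying the table entries gives a 108 m extent for every BEV grid and roughly 100 m for every GT grid; those two extents are inferred from the table, not read from a config. A small sketch, with `cell_size_m` as a hypothetical helper:

```bash
# Print the cell size in metres for a square grid: extent_m / cells_per_side.
cell_size_m() {
    awk -v extent="$1" -v cells="$2" 'BEGIN { printf "%.3f\n", extent / cells }'
}

# Usage with the inferred extents:
#   cell_size_m 108 540   # Stage 1 BEV grid
#   cell_size_m 100 600   # Stage 1 GT grid
#   cell_size_m 100 800   # Stage 2 GT grid
```

This also shows that the table's 0.167 m is 100/600 rounded to three decimals.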

---

**Full documentation**: `项目进展与问题解决总结_20251030.md`