# Phase 4A Stage 1 - 8-GPU Training Quick Reference

**Version**: v1.0
**Updated**: 2025-11-01

---

## 🚀 Quick Start

### Launch Training

```bash
cd /workspace/bevfusion
nohup bash START_FROM_EPOCH1.sh > /tmp/train_8gpu.log 2>&1 &
```
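
Because `nohup ... &` returns immediately, a failed launch can go unnoticed. The following is a minimal sketch of a startup check, assuming the launch script starts writing to `/tmp/train_8gpu.log` shortly after it is started. `wait_for_log` is a hypothetical helper, not part of the project, and a non-empty log only shows that output has begun, not that training is healthy.

```bash
# wait_for_log FILE [TIMEOUT_SECONDS]
# Hypothetical helper: succeeds once FILE exists and is non-empty,
# polling once per second until TIMEOUT_SECONDS (default 30) has elapsed.
wait_for_log() {
  local file="$1" timeout="${2:-30}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if [ -s "$file" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  # Final check after the loop; its exit status is the function's status.
  [ -s "$file" ]
}

# Example (run manually after launching):
#   wait_for_log /tmp/train_8gpu.log 60 && tail -n 20 /tmp/train_8gpu.log
```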

### Monitor Status

```bash
# GPU status
nvidia-smi

# Training progress (follows the newest matching log)
tail -f "$(ls -t phase4a_stage1_new_*.log | head -n 1)" | grep "Epoch"

# Disk space
df -h | grep -E "/workspace|/data"
```
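
The training-progress command assumes at least one matching log already exists. The sketch below fails with a clear message otherwise. `latest_log` is a hypothetical helper name, and it assumes the logs live in the directory passed to it (the current directory by default).

```bash
# latest_log [DIR]
# Hypothetical helper: prints the most recently modified
# phase4a_stage1_new_*.log in DIR, or returns 1 if none exists.
latest_log() {
  local dir="${1:-.}" newest="" f
  for f in "$dir"/phase4a_stage1_new_*.log; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  if [ -z "$newest" ]; then
    echo "no phase4a_stage1_new_*.log found in $dir" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}

# Example (run manually):
#   log=$(latest_log .) && tail -f "$log" | grep "Epoch"
```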

---

## 📊 Key Configuration

```yaml
GPU: 8×Tesla V100S-32GB
Batch: 1/GPU (total batch=8)
Resolution: 600×600 GT
Epochs: 10
Estimated duration: 9.5 days
Output: /data/runs/phase4a_stage1/
```
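
The derived figures in this configuration are simple arithmetic, sketched here with the listed values. The function names are illustrative only, and the per-epoch figure assumes every epoch takes roughly the same time.

```bash
# total_batch GPUS PER_GPU_BATCH: effective batch size across all GPUs.
total_batch() {
  echo $(( $1 * $2 ))
}

# hours_per_epoch TOTAL_DAYS EPOCHS: average wall-clock hours per epoch.
hours_per_epoch() {
  awk -v d="$1" -v e="$2" 'BEGIN { printf "%.1f\n", d * 24 / e }'
}

total_batch 8 1          # prints 8
hours_per_epoch 9.5 10   # prints 22.8
```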

---

## 🎯 Expected Performance

**Epoch 1 (completes Nov 2)**
- Stop Line: 0.27 → 0.30+
- Divider: 0.19 → 0.22+
- mIoU: 0.41 → 0.43+

**Epoch 10 (completes Nov 10)**
- Stop Line: 0.27 → 0.35+
- Divider: 0.19 → 0.28+
- mIoU: 0.41 → 0.48+
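
Once evaluation numbers are available, they can be checked against these targets. This is a minimal sketch that assumes the measured value is supplied by hand. `meets_target` is a hypothetical helper and does not parse the project's evaluation output.

```bash
# meets_target VALUE TARGET
# Hypothetical helper: exits 0 when VALUE >= TARGET, 1 otherwise.
# awk performs the floating-point comparison that plain [ ] cannot.
meets_target() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 >= t + 0) }'
}

# Example (run manually, with $miou set to the measured mIoU):
#   meets_target "$miou" 0.48 && echo "Epoch 10 mIoU target reached"
```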

---

## 🔧 Common Commands

```bash
# Stop training (matches every process whose command line contains "train.py")
pkill -f "train.py"

# Clean up evaluation files
bash cleanup_eval_hook.sh

# List checkpoints
ls -lh /data/runs/phase4a_stage1/

# Show running training processes
ps aux | grep "train.py" | grep -v grep
```

---

## 📝 Full Documentation

See: `project/docs/Phase4A_Stage1_8GPU配置_20251101.md`