# Phase 4A Stage 1 - Current 8-GPU Training Configuration

**Updated**: 2025-11-01 12:20
**Status**: ✅ Training in progress

---

## 🚀 Key Configuration

### Hardware
```
GPU: 8× Tesla V100S-32GB
GPU memory: 28.8-29.3 GB/GPU (88-89% utilization)
Temperature: 44-47°C
Power: 65-70 W/GPU
Status: 100% load
```

### Training Parameters
```
Config file: configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
Launch script: START_FROM_EPOCH1.sh
Distributed: torchpack dist-run -np 8
Batch size: 1/GPU (total batch = 8)
Resolution: 600×600 GT
Epochs: 10
Output directory: /data/runs/phase4a_stage1/
```

### Performance Metrics
```
Training speed: 2.67 s/iteration
Per epoch: ~11 hours
10 epochs: ~9.5 days
Speedup: 1.7× (vs. 18 days on 4 GPUs)
```
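
As a rough cross-check of the per-epoch figure, the sketch below multiplies the measured seconds per iteration by the number of iterations per epoch. The 15448 iterations-per-epoch value comes from the progress section of this document, and the function name `epoch_hours` is hypothetical. It counts pure iteration time only, so evaluation, checkpointing, and any restarts would add to the wall-clock total.

```python
def epoch_hours(sec_per_iter: float, iters_per_epoch: int) -> float:
    """Return the pure training time for one epoch, in hours."""
    return sec_per_iter * iters_per_epoch / 3600.0


if __name__ == "__main__":
    # 2.67 s/iteration and 15448 iterations/epoch are the figures reported here.
    hours = epoch_hours(2.67, 15448)
    print(f"Estimated time per epoch: {hours:.1f} h")
```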

---

## 📂 Important Paths

```bash
# Config file
/workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml

# Launch script
/workspace/bevfusion/START_FROM_EPOCH1.sh

# Output directory
/data/runs/phase4a_stage1/

# Pretrained model
/data/pretrained/swint-nuimages-pretrained.pth

# Initial weights
/data/runs/phase4a_stage1/epoch_1.pth

# Training logs
/workspace/bevfusion/phase4a_stage1_new_*.log

# Documentation directory
/workspace/bevfusion/project/
```

---

## 🔍 Monitoring Commands

```bash
# GPU status
nvidia-smi

# Training progress: last 5 progress lines from the newest log (any epoch)
tail -100 "$(ls -t /workspace/bevfusion/phase4a_stage1_new_*.log | head -1)" | grep -E "Epoch \[[0-9]+\]\[" | tail -5

# Disk space
df -h | grep -E "/workspace|/data"

# Process status: number of running train.py processes
ps aux | grep "train.py" | grep -v grep | wc -l

# Checkpoints
ls -lh /data/runs/phase4a_stage1/*.pth
```
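
The log lookup and disk-space check above can also be scripted, for example when they need to run from a scheduled job. The following is a minimal sketch using only the Python standard library; the helper names `latest_log` and `free_gib` are hypothetical, and the paths are the ones listed in this document.

```python
import glob
import os
import shutil
from typing import Optional


def latest_log(pattern: str) -> Optional[str]:
    """Return the most recently modified file matching pattern, or None."""
    paths = glob.glob(pattern)
    return max(paths, key=os.path.getmtime) if paths else None


def free_gib(path: str) -> float:
    """Return the free space on the filesystem containing path, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    print("Newest log:", latest_log("/workspace/bevfusion/phase4a_stage1_new_*.log"))
    for mount in ("/workspace", "/data"):
        # Skip mount points that do not exist on this machine.
        if os.path.exists(mount):
            print(f"{mount}: {free_gib(mount):.1f} GiB free")
```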

---

## 📊 Current Progress

```
Epoch: 1/10
Iteration: 4000/15448 (25.9%)
Loss: 2.63-2.78 (decreasing)
Learning rate: 2.000e-05
IoU: 0.618-0.623

Estimated completion:
- Epoch 1: 2025-11-02 20:00
- Epoch 10: 2025-11-10 20:00
```
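
The iteration percentage above can be derived directly from a training log line. The sketch below assumes the log uses an mmcv-style progress prefix of the form `Epoch [1][4000/15448]`; only the `Epoch [1][` part is confirmed by the grep pattern in this document, so the rest of the format, the sample line, and the function name `parse_progress` are assumptions.

```python
import re
from typing import Optional, Tuple

# Assumed log format: "Epoch [<epoch>][<iteration>/<iters_per_epoch>] ..."
PROGRESS_RE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\]")


def parse_progress(line: str) -> Optional[Tuple[int, int, int, float]]:
    """Return (epoch, iteration, iters_per_epoch, percent), or None if no match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    epoch, iteration, total = (int(g) for g in match.groups())
    # Guard against a zero total to avoid a division error on malformed lines.
    percent = 100.0 * iteration / total if total else 0.0
    return epoch, iteration, total, percent


if __name__ == "__main__":
    sample = "Epoch [1][4000/15448]\tlr: 2.000e-05"  # hypothetical sample line
    print(parse_progress(sample))
```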

---

## 📖 Detailed Documentation

- **Full configuration**: `project/docs/Phase4A_Stage1_8GPU配置_20251101.md`
- **Quick reference**: `project/docs/8卡训练快速参考.md`
- **Project overview**: `project/README.md`

---

## 🎯 Optimization Highlights

1. ✅ **8-GPU speedup**: from 18 days on 4 GPUs to 9.5 days on 8 GPUs
2. ✅ **Disk management**: `evaluation.interval` raised from 1 to 5
3. ✅ **Output path**: `work_dir` moved to the `/data` partition
4. ✅ **Memory optimization**: batch=1, workers=0
5. ✅ **Stable operation**: no OOM, no anomalies

---

**Training is expected to finish on 2025-11-10!** 🚀