# Phase 4A Stage 1 - Current 8-GPU Training Configuration
**Updated**: 2025-11-01 12:20
**Status**: ✅ Training in progress
---
## 🚀 Key Configuration
### Hardware
```
GPU: 8×Tesla V100S-32GB
Memory: 28.8-29.3 GB/GPU (88-89% utilization)
Temperature: 44-47°C
Power: 65-70 W/GPU
Status: 100% load
```
### Training Parameters
```
Config file: configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
Launch script: START_FROM_EPOCH1.sh
Distributed launch: torchpack dist-run -np 8
Batch size: 1/GPU (total batch = 8)
Resolution: 600×600 GT
Epochs: 10
Output directory: /data/runs/phase4a_stage1/
```
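The parameters above can be combined into a single launch command. The following is a minimal sketch, not the contents of `START_FROM_EPOCH1.sh`: it assumes the training entry point is `tools/train.py` with the `--run-dir` and `--load_from` options used by upstream BEVFusion, and that the initial weights are loaded rather than resumed. The function only prints the command so it can be inspected before running.

```shell
#!/usr/bin/env bash
# Hypothetical reconstruction of the launch command from the parameters listed above.
CONFIG=configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
RUN_DIR=/data/runs/phase4a_stage1
NUM_GPUS=8

build_train_cmd() {
    # Assumption: tools/train.py accepts --run-dir and --load_from as in upstream BEVFusion.
    printf 'torchpack dist-run -np %d python tools/train.py %s --run-dir %s --load_from %s\n' \
        "$NUM_GPUS" "$CONFIG" "$RUN_DIR" "$RUN_DIR/epoch_1.pth"
}

# Print the command for review; nothing is launched here.
build_train_cmd
```

Printing rather than executing keeps the sketch safe to run on a machine without GPUs; the real script may pass additional overrides that are not recorded in this document.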
### Performance Metrics
```
Training speed: 2.67 s/iteration
Per epoch: ~11 hours
10 epochs: ~9.5 days
Speedup: 1.7× (vs. 18 days on 4 GPUs)
```
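As a sanity check on these figures, the per-epoch time can be recomputed from the iteration time. The sketch below assumes 15448 iterations per epoch (the count reported in the progress section) and counts iteration time only, ignoring evaluation, checkpointing, and restarts. It gives about 11.5 hours per epoch, in line with the ~11-hour figure; the 10-epoch total it prints is a lower bound and is not meant to reproduce the ~9.5-day estimate, whose overhead budget is not broken down in this document.

```shell
#!/usr/bin/env bash
# Rough wall-clock estimate from iteration time alone (no eval/checkpoint overhead).
SEC_PER_ITER=2.67      # training speed reported above
ITERS_PER_EPOCH=15448  # assumption: iteration count taken from the progress section
EPOCHS=10

estimate_training_time() {
    # $1 = seconds per iteration, $2 = iterations per epoch, $3 = number of epochs
    awk -v s="$1" -v n="$2" -v e="$3" 'BEGIN {
        epoch_h = s * n / 3600
        printf "per-epoch: %.1f h, total: %.1f days\n", epoch_h, epoch_h * e / 24
    }'
}

estimate_training_time "$SEC_PER_ITER" "$ITERS_PER_EPOCH" "$EPOCHS"
# prints: per-epoch: 11.5 h, total: 4.8 days
```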
---
## 📂 Important Paths
```bash
# Config file
/workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
# Launch script
/workspace/bevfusion/START_FROM_EPOCH1.sh
# Output directory
/data/runs/phase4a_stage1/
# Pretrained model
/data/pretrained/swint-nuimages-pretrained.pth
# Initial weights
/data/runs/phase4a_stage1/epoch_1.pth
# Training logs
/workspace/bevfusion/phase4a_stage1_new_*.log
# Documentation directory
/workspace/bevfusion/project/
```
---
## 🔍 Monitoring Commands
```bash
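# GPU status
nvidia-smi
# Training progress (last 5 iteration log lines from the newest log, any epoch)
tail -100 $(ls -t /workspace/bevfusion/phase4a_stage1_new_*.log | head -1) | grep -E "Epoch \[[0-9]+\]\[" | tail -5
# Disk space
df -h | grep -E "/workspace|/data"
# Number of running train.py processes
ps aux | grep "train.py" | grep -v grep | wc -l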
# Checkpoints
ls -lh /data/runs/phase4a_stage1/*.pth
```
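The progress check above can be reduced to a one-line summary. This is a minimal sketch that assumes the log uses progress markers of the form `Epoch [1][4000/15448]`, which is what the `grep` pattern above implies; `parse_progress` is a hypothetical helper name, not part of the project.

```shell
#!/usr/bin/env bash
# Summarize the most recent progress marker found on stdin.
# Assumption: log lines contain markers shaped like "Epoch [E][I/N]".
parse_progress() {
    grep -oE 'Epoch \[[0-9]+\]\[[0-9]+/[0-9]+\]' | tail -n 1 |
        sed -E 's/[^0-9]+/ /g' |
        awk '{ printf "epoch=%d iter=%d/%d (%.1f%%)\n", $1, $2, $3, 100 * $2 / $3 }'
}

# Usage (not executed here):
#   tail -100 "$(ls -t /workspace/bevfusion/phase4a_stage1_new_*.log | head -1)" | parse_progress
```

If no progress marker is present in the input, the function prints nothing, which makes it easy to distinguish a stalled or freshly started run from a parsing problem.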
---
## 📊 Current Progress
```
Epoch: 1/10
Iteration: 4000/15448 (25.9%)
Loss: 2.63-2.78 (decreasing)
Learning rate: 2.000e-05
IoU: 0.618-0.623
Estimated completion:
- Epoch 1: 2025-11-02 20:00
- Epoch 10: 2025-11-10 20:00
```
---
## 📖 Detailed Documentation
- **Full configuration**: `project/docs/Phase4A_Stage1_8GPU配置_20251101.md`
- **Quick reference**: `project/docs/8卡训练快速参考.md`
- **Project overview**: `project/README.md`
---
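## 🎯 Optimization Highlights
1. **8-GPU speedup**: 18 days on 4 GPUs → 9.5 days on 8 GPUs
2. **Disk management**: `evaluation.interval` 1 → 5
3. **Output path**: `work_dir` moved to the /data partition
4. **Memory optimization**: batch=1, workers=0
5. **Stable operation**: no OOM, no errors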
---
**Training is expected to complete on 2025-11-10** 🚀