# Current 8-GPU Training Configuration Summary (Before FP16 Optimization)
**Last updated**: 2025-11-01 12:50
**Status**: ✅ Running
**Progress**: Epoch 1 - 4400/15448 (28.5%)

---
## 📜 Training Script
### START_FROM_EPOCH1.sh
```bash
#!/bin/bash
# Phase 4A Stage 1: load weights from epoch_1.pth and start a fresh training run
set -eo pipefail  # pipefail: a failed training run is not masked by tee's exit status
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
cd /workspace/bevfusion
echo "========================================================================"
echo "Phase 4A Stage 1: restart training from epoch_1.pth (8 GPUs)"
echo "========================================================================"
echo "Loading weights: epoch_1.pth (a model already trained at 600×600)"
echo "Training epochs: 1-10"
echo "Output directory: /data/runs/phase4a_stage1"
echo "GPU setup: 8×Tesla V100S-32GB"
echo "========================================================================"
# Environment check
python -c "import torch; print('✓ PyTorch:', torch.__version__)"
python -c "from mmcv.ops import nms_match; import mmcv; print('✓ mmcv:', mmcv.__version__)" || exit 1
echo "✓ Environment check passed"
# Make sure the checkpoint exists
if [ ! -f "/data/runs/phase4a_stage1/epoch_1.pth" ]; then
    echo "❌ Cannot find /data/runs/phase4a_stage1/epoch_1.pth"
    exit 1
fi
echo "✓ epoch_1.pth is ready"
LOG_FILE="phase4a_stage1_new_$(date +%Y%m%d_%H%M%S).log"
echo ""
echo "Starting training..."
echo "Log file: $LOG_FILE"
echo ""
# Load weights from epoch_1.pth and start fresh (no resume), using 8 GPUs
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
--model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
--load_from /data/runs/phase4a_stage1/epoch_1.pth \
--data.samples_per_gpu 1 \
--data.workers_per_gpu 0 \
2>&1 | tee "$LOG_FILE"
```
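**Key parameters**:
- `-np 8`: 8 processes, one per GPU
- `samples_per_gpu 1`: batch size 1 per GPU
- `workers_per_gpu 0`: no data-loading worker processes (data is loaded in the main process)
- `--load_from`: loads only the weights from epoch_1.pth; the previous run's state and logs are not resumed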
---
## 📋 Configuration File
### multitask_BEV2X_phase4a_stage1.yaml
**Base configuration**:
```yaml
_base_: ./convfuser.yaml
work_dir: /data/runs/phase4a_stage1
# LiDAR settings
voxel_size: [0.075, 0.075, 0.2]
point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
```
**Model configuration**:
```yaml
model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        embed_dims: 96
        depths: [2, 2, 6, 2]
        num_heads: [3, 6, 12, 24]
        window_size: 7
        out_indices: [1, 2, 3]
        with_cp: false  # ⚠️ gradient checkpointing not enabled
      neck:
        type: GeneralizedLSSFPN
        in_channels: [192, 384, 768]
        out_channels: 256
      vtransform:
        type: DepthLSSTransform
        in_channels: 256
        out_channels: 80
        xbound: [-54.0, 54.0, 0.2]  # 540×540 BEV
        ybound: [-54.0, 54.0, 0.2]
        zbound: [-10.0, 10.0, 20.0]
        dbound: [1.0, 60.0, 0.5]  # 118 depth bins
        downsample: 2
    lidar:
      voxelize:
        max_num_points: 10
        max_voxels: [120000, 160000]
      backbone:
        type: SparseEncoder
        sparse_shape: [1440, 1440, 41]
        output_channels: 128
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      layer_nums: [5, 5]
  heads:
    object:
      type: TransFusionHead
      num_proposals: 200
      in_channels: 512  # 256 * 2
    map:
      type: EnhancedBEVSegmentationHead
      in_channels: 256
      decoder_channels: [256, 256, 128, 128]  # 4-layer decoder
      num_classes: 6
      deep_supervision: true
      loss_cfg:
        - type: FocalLoss  # weight 1.0
        - type: DiceLoss  # weight 2.0
```
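The grid sizes quoted in the comments above follow directly from the bounds and step sizes. Below is a minimal Python sketch of that arithmetic. It assumes each axis is divided as (upper - lower) / step, and that the extra cell along z in `sparse_shape` (41 rather than 40) is a convention of the sparse encoder config rather than something derived from the ranges. The helper name `num_cells` is illustrative and not part of the project.

```python
def num_cells(lower: float, upper: float, step: float) -> int:
    """Number of grid cells along one axis; round() absorbs float error."""
    return round((upper - lower) / step)


# Camera BEV grid from xbound / ybound: [-54.0, 54.0, 0.2]
bev_cells = num_cells(-54.0, 54.0, 0.2)

# Depth bins from dbound: [1.0, 60.0, 0.5]
depth_bins = num_cells(1.0, 60.0, 0.5)

# LiDAR voxel grid from point_cloud_range and voxel_size
pc_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
voxel_size = [0.075, 0.075, 0.2]
voxel_grid = [
    num_cells(pc_range[i], pc_range[i + 3], voxel_size[i]) for i in range(3)
]

# Assumption: sparse_shape carries one extra cell along z
sparse_shape = [voxel_grid[0], voxel_grid[1], voxel_grid[2] + 1]

print(bev_cells, depth_bins, sparse_shape)
```

Under these assumptions the printed values match the config: a 540-cell BEV axis, 118 depth bins, and a sparse shape of 1440 × 1440 × 41.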
**Training configuration**:
```yaml
# Data
data:
  samples_per_gpu: 1  # batch size 1 per GPU
  workers_per_gpu: 0  # no workers, to avoid conflicts
  train:
    type: NuScenesDataset
    # ... (see the config file for details)
# Learning-rate schedule
lr_config:
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
  min_lr_ratio: 1.0e-3
# Optimizer
optimizer:
  type: AdamW
  lr: 2.0e-5  # learning rate (lowered from 5e-5 to 2e-5)
  weight_decay: 0.01
optimizer_config:
  grad_clip:
    max_norm: 35
    norm_type: 2
# Runner
runner:
  type: EpochBasedRunner
  max_epochs: 20
# Checkpoints
checkpoint_config:
  interval: 1
  max_keep_ckpts: 5
# Evaluation
evaluation:
  interval: 5  # ⭐ evaluate every 5 epochs to avoid filling the disk
  pipeline: ${test_pipeline}
  metric: [bbox, map]
  save_best: auto
# Logging
log_config:
  interval: 50
# Miscellaneous
find_unused_parameters: false
sync_bn: true
cudnn_benchmark: true
```
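To make the `lr_config` values concrete, here is a rough Python sketch of a cosine-annealing schedule with linear warmup, written in the style of mmcv's LR updater hooks. It assumes the annealing progresses per epoch (not per iteration) and that warmup scales the regular learning rate linearly from `warmup_ratio` up to 1 over the first 500 iterations. The function names are illustrative, and the project's real hook may differ in detail.

```python
import math

BASE_LR = 2.0e-5
WARMUP_ITERS = 500
WARMUP_RATIO = 0.33333333
MIN_LR_RATIO = 1.0e-3
MAX_EPOCHS = 20


def cosine_lr(epoch: int) -> float:
    """Cosine annealing from BASE_LR to BASE_LR * MIN_LR_RATIO over MAX_EPOCHS.

    `epoch` is 0-based, so the epoch logged as "Epoch 1" is epoch 0 here.
    """
    target = BASE_LR * MIN_LR_RATIO
    cos_out = math.cos(math.pi * epoch / MAX_EPOCHS) + 1.0
    return target + 0.5 * (BASE_LR - target) * cos_out


def lr_at(epoch: int, global_iter: int) -> float:
    """Learning rate at a 0-based epoch and global iteration count."""
    regular = cosine_lr(epoch)
    if global_iter < WARMUP_ITERS:
        # Linear warmup: starts at regular * WARMUP_RATIO, reaches regular
        k = (1.0 - global_iter / WARMUP_ITERS) * (1.0 - WARMUP_RATIO)
        return regular * (1.0 - k)
    return regular


print(f"{lr_at(0, 0):.3e}")     # first iteration, in warmup
print(f"{lr_at(0, 4400):.3e}")  # iteration 4400 of the first epoch
print(f"{cosine_lr(MAX_EPOCHS):.3e}")  # floor reached at the end of the schedule
```

Under these assumptions the learning rate at iteration 4400 of the first epoch is the full 2.000e-05, which agrees with the value reported in the training log.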
**GT label configuration**:
```yaml
# BEV segmentation GT resolution: 600×600 (0.167 m/pixel)
map_grid_conf:
  xbound: [-50.0, 50.0, 0.167]  # 600 pixels
  ybound: [-50.0, 50.0, 0.167]  # 600 pixels
```
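The 0.167 step is a rounded value of 100 m / 600, so whether a loader ends up with exactly 600 cells depends on how it rounds. The sketch below only illustrates that arithmetic, assuming the canvas size is derived from (upper - lower) / step. How this project's loader actually rounds is not shown in this summary.

```python
import math

lower, upper, step = -50.0, 50.0, 0.167
extent = upper - lower  # 100 m

cells_floor = math.floor(extent / step)  # truncating division
cells_round = round(extent / step)       # nearest-integer division
exact_step = extent / 600                # step that gives exactly 600 cells

print(cells_floor, cells_round, round(exact_step, 5))
```

If exactly 600 cells are required, a step of 100/600 (about 0.16667) avoids the ambiguity.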
---
## 📊 Current Training Status
### Live Performance Metrics
**GPU status** (2025-11-01 12:32):
```
GPU 0-7: 100% utilization
Memory: 28.8-29.3GB/GPU (88-89%)
Temperature: 44-47°C
Power: 65-70W/GPU
```
**Training progress**:
```
Epoch: 1/10
Iteration: 4400/15448 (28.5%)
Learning rate: 2.000e-05
Loss: 2.63-2.79 (decreasing)
Gradient norm: 9.9-17.8
IOU: 0.615-0.623
```
**Performance metrics**:
```
Time per iteration: 2.66 s
Data loading: 0.45 s (17%)
Compute: 2.21 s (83%)
Peak memory: 18.9GB
```
### Loss Trend (Epoch 1)
**BEV segmentation**:
```
Drivable Area Dice: 0.15 (decreasing)
Stop Line Dice: 0.38-0.41 (fluctuating)
Divider Dice: 0.55-0.60 (decreasing slowly)
```
**3D detection**:
```
Heatmap Loss: 0.22-0.23
Bbox Loss: 0.29-0.32
Matched IOU: 0.615-0.623 (stable)
```
---
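## ⏱️ Estimated Time
### Current Configuration
```
Start time: 2025-11-01 09:15 UTC
Epoch 1 progress: 28.5% (4400/15448)
Time per iteration: 2.66 s
Remaining iterations: 11,048
Epoch 1 remaining: ~8.2 hours
Epoch 1 completes: 2025-11-01 20:30 UTC
10 epochs complete: 2025-11-10 20:00 UTC (9.5 days)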
```
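The estimates above can be re-derived from the logged iteration time. Below is a minimal Python sketch that assumes a constant 2.66 s per iteration and ignores evaluation, checkpointing, and startup overhead. The helper names are illustrative.

```python
from datetime import datetime, timedelta

ITERS_PER_EPOCH = 15448
SEC_PER_ITER = 2.66


def remaining_epoch_hours(done_iters: int) -> float:
    """Hours left in the current epoch after `done_iters` iterations."""
    return (ITERS_PER_EPOCH - done_iters) * SEC_PER_ITER / 3600


def training_days(num_epochs: int) -> float:
    """Pure iteration time for `num_epochs` full epochs, in days."""
    return num_epochs * ITERS_PER_EPOCH * SEC_PER_ITER / 86400


start = datetime(2025, 11, 1, 9, 15)  # start time from the log name (UTC)
epoch1_end = start + timedelta(seconds=ITERS_PER_EPOCH * SEC_PER_ITER)

print(f"Epoch 1 remaining: {remaining_epoch_hours(4400):.1f} h")
print(f"Epoch 1 ends around: {epoch1_end:%Y-%m-%d %H:%M} UTC")
for n in (10, 20):
    print(f"{n} epochs: {training_days(n):.1f} days")
```

At this rate one epoch takes about 11.4 hours of pure iteration time, so 10 epochs come to roughly 4.8 days and 20 epochs (the configured `max_epochs`) to roughly 9.5 days. Which of these the 9.5-day figure refers to, and how much overhead it includes, should be checked against the actual run.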
---
## 🔍 Key Configuration Comparison
### Current Configuration vs. FP16 Optimization Plan
| Setting | Current (8 GPUs, FP32) | FP16-optimized | Change |
|--------|----------------|----------|------|
| **GPU memory** | 29GB | 20GB | -31% |
| **Batch/GPU** | 1 | 4 | 4× |
| **Total batch** | 8 | 32 | 4× |
| **Workers** | 0 | 2 | +2 |
| **Learning rate** | 2e-5 | 4e-5 | 2× |
| **Iteration time** | 2.66s | ~2.0s | -25% |
| **Epoch time** | 11h | 7.5h | -32% |
| **10 epochs** | 9.5 days | 6.5 days | -32% |
| **Precision** | FP32 | FP16 | mixed precision |
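
The 2× learning-rate increase for a 4× larger total batch is consistent with square-root scaling. That is an assumption about how the value was chosen, not something the table states. The sketch below shows this rule and how the iteration count per epoch would change. It treats the dataset size as 15448 × 8 samples, which is only an estimate inferred from the current iteration count.

```python
import math


def sqrt_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Square-root learning-rate scaling for a changed total batch size."""
    return base_lr * math.sqrt(new_batch / base_batch)


def iters_per_epoch(num_samples: int, total_batch: int) -> int:
    """Iterations needed to see every sample once."""
    return math.ceil(num_samples / total_batch)


current_total_batch = 8 * 1  # 8 GPUs × 1 sample per GPU
fp16_total_batch = 8 * 4     # 8 GPUs × 4 samples per GPU

# Assumption: dataset size inferred from 15448 iterations at total batch 8
approx_samples = 15448 * current_total_batch

new_lr = sqrt_scaled_lr(2e-5, current_total_batch, fp16_total_batch)
new_iters = iters_per_epoch(approx_samples, fp16_total_batch)

print(f"scaled lr: {new_lr:.1e}, iterations per epoch: {new_iters}")
```

Linear scaling would instead give 8e-5, so the rule is worth confirming before the learning rate is changed.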
---
## 📂 File Locations
### Scripts and Configuration
```
Training script: /workspace/bevfusion/START_FROM_EPOCH1.sh
Config file: /workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
```
### Data and Output
```
Output directory: /data/runs/phase4a_stage1/
Training log: /workspace/bevfusion/phase4a_stage1_new_20251101_091503.log
Checkpoints: /data/runs/phase4a_stage1/epoch_*.pth
Pretrained model: /data/pretrained/swint-nuimages-pretrained.pth
Initial weights: /data/runs/phase4a_stage1/epoch_1.pth
```
---
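## 🎯 Key Characteristics
### ✅ Strengths
1. **Stable**: GPUs at 100% utilization, no OOM
2. **Memory headroom**: 3.5-4GB still free per GPU
3. **Loss decreasing**: converging normally
4. **Evaluation tuned**: interval=5 avoids filling the disk
### ⚠️ Room for Optimization
1. **Memory usage**: only 88% used, so there is still room to optimize
2. **Batch size**: could be raised to 2-4
3. **FP16**: mixed-precision training is not enabled
4. **Checkpointing**: gradient checkpointing is not enabled
5. **Workers**: with 0 workers, data loading takes 17% of each iteration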
---
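## 📌 Summary
**Current configuration**:
- ✅ 8×V100S-32GB, running at 100% load
- ✅ Batch=1/GPU, running stably
- ✅ 10 epochs expected to finish in 9.5 days
- ⚠️ Room for optimization remains (FP16 + larger batch)

**Optimization potential**:
- FP16 mixed precision → saves about 9GB of GPU memory
- Batch raised to 4 → cuts training time by about a third
- Completion time shortened to 6.5 days
---
*Generated: 2025-11-01 12:50 UTC*
*Based on: phase4a_stage1_new_20251101_091503.log*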