bev-project/当前8卡训练配置总结.md

359 lines
8.3 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 当前8卡训练配置总结FP16优化前
**更新时间**: 2025-11-01 12:50
**状态**: ✅ 正在运行中
**进度**: Epoch 1 - 4400/15448 (28.5%)
---
## 📜 训练脚本
### START_FROM_EPOCH1.sh
```bash
#!/bin/bash
# Phase 4A Stage 1: 从epoch_1.pth加载权重重新开始训练
set -e
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
cd /workspace/bevfusion
echo "========================================================================"
echo "Phase 4A Stage 1: 从epoch_1.pth重新开始训练 (8 GPUs)"
echo "========================================================================"
echo "加载权重: epoch_1.pth (已训练过600×600的模型)"
echo "训练Epochs: 1-10"
echo "输出目录: /data/runs/phase4a_stage1"
echo "GPU配置: 8×Tesla V100S-32GB"
echo "========================================================================"
# 环境验证
python -c "import torch; print('✓ PyTorch:', torch.__version__)"
python -c "from mmcv.ops import nms_match; import mmcv; print('✓ mmcv:', mmcv.__version__)" || exit 1
echo "✓ 环境验证成功"
# 确认文件存在
if [ ! -f "/data/runs/phase4a_stage1/epoch_1.pth" ]; then
echo "❌ 找不到 /data/runs/phase4a_stage1/epoch_1.pth"
exit 1
fi
echo "✓ epoch_1.pth已就绪"
LOG_FILE="phase4a_stage1_new_$(date +%Y%m%d_%H%M%S).log"
echo ""
echo "开始训练..."
echo "日志文件: $LOG_FILE"
echo ""
# 从epoch_1.pth加载权重重新开始训练不resume- 使用8卡
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
--model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
--load_from /data/runs/phase4a_stage1/epoch_1.pth \
--data.samples_per_gpu 1 \
--data.workers_per_gpu 0 \
2>&1 | tee "$LOG_FILE"
```
**关键参数**
- `-np 8`: 8个进程8张GPU
- `samples_per_gpu 1`: 每GPU batch=1
- `workers_per_gpu 0`: 无数据加载worker
- `--load_from`: 从epoch_1.pth加载权重不resume日志
---
## 📋 配置文件
### multitask_BEV2X_phase4a_stage1.yaml
**基础配置**
```yaml
_base_: ./convfuser.yaml
work_dir: /data/runs/phase4a_stage1
# LiDAR配置
voxel_size: [0.075, 0.075, 0.2]
point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
```
**模型配置**
```yaml
model:
encoders:
camera:
backbone:
type: SwinTransformer
embed_dims: 96
depths: [2, 2, 6, 2]
num_heads: [3, 6, 12, 24]
window_size: 7
out_indices: [1, 2, 3]
with_cp: false # ⚠️ 未启用gradient checkpointing
neck:
type: GeneralizedLSSFPN
in_channels: [192, 384, 768]
out_channels: 256
vtransform:
type: DepthLSSTransform
in_channels: 256
out_channels: 80
xbound: [-54.0, 54.0, 0.2] # 540×540 BEV
ybound: [-54.0, 54.0, 0.2]
zbound: [-10.0, 10.0, 20.0]
dbound: [1.0, 60.0, 0.5] # 118 depth bins
downsample: 2
lidar:
voxelize:
max_num_points: 10
max_voxels: [120000, 160000]
backbone:
type: SparseEncoder
sparse_shape: [1440, 1440, 41]
output_channels: 128
fuser:
type: ConvFuser
in_channels: [80, 256]
out_channels: 256
decoder:
backbone:
type: SECOND
in_channels: 256
out_channels: [128, 256]
layer_nums: [5, 5]
heads:
object:
type: TransFusionHead
num_proposals: 200
in_channels: 256 * 2
map:
type: EnhancedBEVSegmentationHead
in_channels: 256
decoder_channels: [256, 256, 128, 128] # 4层decoder
num_classes: 6
deep_supervision: true
loss_cfg:
- type: FocalLoss (weight 1.0)
- type: DiceLoss (weight 2.0)
```
**训练配置**
```yaml
# 数据配置
data:
samples_per_gpu: 1 # 每GPU batch=1
workers_per_gpu: 0 # 无worker避免冲突
train:
type: NuScenesDataset
# ... (详见配置文件)
# 学习率配置
lr_config:
policy: CosineAnnealing
warmup: linear
warmup_iters: 500
warmup_ratio: 0.33333333
min_lr_ratio: 1.0e-3
# 优化器
optimizer:
type: AdamW
lr: 2.0e-5 # 学习率 (从5e-5降至2e-5)
weight_decay: 0.01
optimizer_config:
grad_clip:
max_norm: 35
norm_type: 2
# Runner配置
runner:
type: EpochBasedRunner
max_epochs: 20
# Checkpoint配置
checkpoint_config:
interval: 1
max_keep_ckpts: 5
# 评估配置
evaluation:
interval: 5 # ⭐ 每5个epoch评估避免磁盘满
pipeline: ${test_pipeline}
metric: [bbox, map]
save_best: auto
# 日志配置
log_config:
interval: 50
# 其他配置
find_unused_parameters: false
sync_bn: true
cudnn_benchmark: true
```
**GT标签配置**
```yaml
# BEV分割GT分辨率: 600×600 (0.167m/pixel)
map_grid_conf:
xbound: [-50.0, 50.0, 0.167] # 600 pixels
ybound: [-50.0, 50.0, 0.167] # 600 pixels
```
---
## 📊 当前训练状态
### 实时性能指标
**GPU状态**2025-11-01 12:32
```
GPU 0-7: 100% 利用率
显存: 28.8-29.3GB/GPU (88-89%)
温度: 44-47°C
功耗: 65-70W/GPU
```
**训练进度**
```
Epoch: 1/10
迭代: 4400/15448 (28.5%)
学习率: 2.000e-05
Loss: 2.63-2.79 (下降中)
梯度范数: 9.9-17.8
IOU: 0.615-0.623
```
**性能指标**
```
每迭代时间: 2.66秒
数据加载: 0.45秒 (17%)
计算时间: 2.21秒 (83%)
显存峰值: 18.9GB
```
### Loss趋势Epoch 1
**BEV分割**
```
Drivable Area Dice: 0.15 (下降中)
Stop Line Dice: 0.38-0.41 (波动)
Divider Dice: 0.55-0.60 (缓慢下降)
```
**3D检测**
```
Heatmap Loss: 0.22-0.23
Bbox Loss: 0.29-0.32
Matched IOU: 0.615-0.623 (稳定)
```
---
## ⏱️ 预计时间
### 当前配置
```
启动时间: 2025-11-01 09:15 UTC
Epoch 1进度: 28.5% (4400/15448)
每迭代时间: 2.66秒
剩余迭代: 11,048次
Epoch 1剩余: ~8.2小时
Epoch 1完成: 2025-11-01 20:30 UTC
10 epochs完成: 2025-11-10 20:00 UTC (9.5天)
```
---
## 🔍 关键配置对比
### 当前配置 vs FP16优化方案
| 配置项 | 当前8卡FP32 | FP16优化 | 改进 |
|--------|----------------|----------|------|
| **显存** | 29GB | 20GB | -31% |
| **Batch/GPU** | 1 | 4 | 4× |
| **总Batch** | 8 | 32 | 4× |
| **Workers** | 0 | 2 | +2 |
| **学习率** | 2e-5 | 4e-5 | 2× |
| **迭代时间** | 2.66s | ~2.0s | -25% |
| **Epoch时间** | 11h | 7.5h | -32% |
| **10 epochs** | 9.5天 | 6.5天 | -32% |
| **精度** | FP32 | FP16 | 混合 |
---
## 📂 文件位置
### 脚本和配置
```
训练脚本: /workspace/bevfusion/START_FROM_EPOCH1.sh
配置文件: /workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
```
### 数据和输出
```
输出目录: /data/runs/phase4a_stage1/
训练日志: /workspace/bevfusion/phase4a_stage1_new_20251101_091503.log
Checkpoints: /data/runs/phase4a_stage1/epoch_*.pth
预训练模型: /data/pretrained/swint-nuimages-pretrained.pth
初始权重: /data/runs/phase4a_stage1/epoch_1.pth
```
---
## 🎯 关键特点
### ✅ 优点
1. **稳定运行**: GPU 100%利用无OOM
2. **显存充足**: 还有3.5-4GB剩余
3. **Loss下降**: 正常收敛中
4. **评估优化**: interval=5避免磁盘满
### ⚠️ 可优化点
1. **显存利用**: 仅用88%,还有优化空间
2. **Batch Size**: 可增大到2-4
3. **FP16**: 未启用混合精度训练
4. **Checkpointing**: 未启用gradient checkpointing
5. **Workers**: 0 workers数据加载占17%
---
## 📌 总结
**当前配置**
- ✅ 8×V100S-32GB100%满载
- ✅ Batch=1/GPU稳定运行
- ✅ 预计9.5天完成10 epochs
- ⚠️ 还有优化空间FP16+Batch↑
**优化潜力**
- FP16混合精度 → 节省9GB显存
- Batch增至4 → 训练加速33%
- 完成时间缩短至6.5天
---
*生成时间: 2025-11-01 12:50 UTC*
*基于: phase4a_stage1_new_20251101_091503.log*