# Current 8-GPU Training Configuration Summary (Before FP16 Optimization)

**Last updated**: 2025-11-01 12:50
**Status**: ✅ Running
**Progress**: Epoch 1 - 4400/15448 (28.5%)

---

## 📜 Training Script

### START_FROM_EPOCH1.sh

```bash
#!/bin/bash
# Phase 4A Stage 1: load weights from epoch_1.pth and restart training

set -e

export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH

cd /workspace/bevfusion

echo "========================================================================"
echo "Phase 4A Stage 1: restart training from epoch_1.pth (8 GPUs)"
echo "========================================================================"
echo "Loading weights: epoch_1.pth (model already trained at 600×600)"
echo "Training epochs: 1-10"
echo "Output directory: /data/runs/phase4a_stage1"
echo "GPU configuration: 8×Tesla V100S-32GB"
echo "========================================================================"

# Verify the environment
python -c "import torch; print('✓ PyTorch:', torch.__version__)"
python -c "from mmcv.ops import nms_match; import mmcv; print('✓ mmcv:', mmcv.__version__)" || exit 1
echo "✓ Environment check passed"

# Confirm the checkpoint exists
if [ ! -f "/data/runs/phase4a_stage1/epoch_1.pth" ]; then
  echo "❌ Cannot find /data/runs/phase4a_stage1/epoch_1.pth"
  exit 1
fi
echo "✓ epoch_1.pth is ready"

LOG_FILE="phase4a_stage1_new_$(date +%Y%m%d_%H%M%S).log"

echo ""
echo "Starting training..."
echo "Log file: $LOG_FILE"
echo ""

# Load weights from epoch_1.pth and restart training (no resume), using 8 GPUs
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
  --load_from /data/runs/phase4a_stage1/epoch_1.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0 \
  2>&1 | tee "$LOG_FILE"
```


**Key parameters**:
- `-np 8`: 8 processes (one per GPU)
- `samples_per_gpu 1`: batch size of 1 per GPU
- `workers_per_gpu 0`: no data-loading worker processes
- `--load_from`: loads weights from `epoch_1.pth` without resuming the previous run (training restarts with a fresh log)

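The flags above determine the effective batch size. The following is a minimal sketch of that arithmetic, assuming standard data-parallel training in which each of the 8 processes consumes `samples_per_gpu` samples per iteration; the iteration count per epoch is taken from the progress counter (4400/15448), and the variable names are illustrative only.

```python
# Values copied from the script flags and the progress counter in this report.
num_gpus = 8              # torchpack dist-run -np 8
samples_per_gpu = 1       # --data.samples_per_gpu 1
iters_per_epoch = 15448   # denominator of the progress counter

# Assumption: one process per GPU, each drawing samples_per_gpu samples per step.
global_batch = num_gpus * samples_per_gpu
samples_per_epoch = iters_per_epoch * global_batch

print(f"global batch size: {global_batch}")
print(f"samples processed per epoch: {samples_per_epoch}")
```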

---

## 📋 Configuration File

### multitask_BEV2X_phase4a_stage1.yaml

**Base configuration**:
```yaml
_base_: ./convfuser.yaml

work_dir: /data/runs/phase4a_stage1

# LiDAR configuration
voxel_size: [0.075, 0.075, 0.2]
point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
```


**Model configuration**:
```yaml
model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        embed_dims: 96
        depths: [2, 2, 6, 2]
        num_heads: [3, 6, 12, 24]
        window_size: 7
        out_indices: [1, 2, 3]
        with_cp: false  # ⚠️ gradient checkpointing not enabled

      neck:
        type: GeneralizedLSSFPN
        in_channels: [192, 384, 768]
        out_channels: 256

      vtransform:
        type: DepthLSSTransform
        in_channels: 256
        out_channels: 80
        xbound: [-54.0, 54.0, 0.2]  # 540×540 BEV
        ybound: [-54.0, 54.0, 0.2]
        zbound: [-10.0, 10.0, 20.0]
        dbound: [1.0, 60.0, 0.5]  # 118 depth bins
        downsample: 2

    lidar:
      voxelize:
        max_num_points: 10
        max_voxels: [120000, 160000]
      backbone:
        type: SparseEncoder
        sparse_shape: [1440, 1440, 41]
        output_channels: 128

  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256

  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      layer_nums: [5, 5]

  heads:
    object:
      type: TransFusionHead
      num_proposals: 200
      in_channels: 512  # 256 * 2

    map:
      type: EnhancedBEVSegmentationHead
      in_channels: 256
      decoder_channels: [256, 256, 128, 128]  # 4-layer decoder
      num_classes: 6
      deep_supervision: true
      loss_cfg:
        - type: FocalLoss  # weight 1.0
        - type: DiceLoss   # weight 2.0
```

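The grid sizes quoted in the comments follow directly from the bounds and step sizes. The sketch below recomputes them; `num_cells` is a hypothetical helper written for this illustration, not a function from the codebase. The extra `+ 1` on the z axis of `sparse_shape` appears consistent with the convention used by mmdet3d-style sparse encoders, but that reading is an assumption.

```python
def num_cells(lower: float, upper: float, step: float) -> int:
    """Number of cells of width `step` needed to cover [lower, upper]."""
    return round((upper - lower) / step)

# Camera view transform (xbound / ybound / dbound in the config above).
bev_cells = num_cells(-54.0, 54.0, 0.2)
depth_bins = num_cells(1.0, 60.0, 0.5)

# LiDAR voxel grid (point_cloud_range divided by voxel_size).
voxels_xy = num_cells(-54.0, 54.0, 0.075)
voxels_z = num_cells(-5.0, 3.0, 0.2)

print(f"BEV grid: {bev_cells} x {bev_cells}")
print(f"depth bins: {depth_bins}")
# Assumption: sparse_shape uses one extra cell along z.
print(f"sparse_shape: [{voxels_xy}, {voxels_xy}, {voxels_z + 1}]")
```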

**Training configuration**:
```yaml
# Data configuration
data:
  samples_per_gpu: 1  # batch size of 1 per GPU
  workers_per_gpu: 0  # no workers (avoids conflicts)
  train:
    type: NuScenesDataset
    # ... (see the config file for details)

# Learning-rate schedule
lr_config:
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
  min_lr_ratio: 1.0e-3

# Optimizer
optimizer:
  type: AdamW
  lr: 2.0e-5  # learning rate (lowered from 5e-5 to 2e-5)
  weight_decay: 0.01

optimizer_config:
  grad_clip:
    max_norm: 35
    norm_type: 2

# Runner configuration
runner:
  type: EpochBasedRunner
  max_epochs: 20

# Checkpoint configuration
checkpoint_config:
  interval: 1
  max_keep_ckpts: 5

# Evaluation configuration
evaluation:
  interval: 5  # ⭐ evaluate every 5 epochs (avoids filling the disk)
  pipeline: ${test_pipeline}
  metric: [bbox, map]
  save_best: auto

# Logging configuration
log_config:
  interval: 50

# Other settings
find_unused_parameters: false
sync_bn: true
cudnn_benchmark: true
```

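To make the `lr_config` values concrete, here is a rough sketch of the resulting learning-rate curve. It assumes mmcv 1.x hook semantics: linear warmup scales the regular rate by `1 - (1 - cur_iter / warmup_iters) * (1 - warmup_ratio)`, and cosine annealing is stepped per epoch (the `by_epoch` default). That epoch-wise assumption is at least consistent with the logged rate of 2.000e-05 at iteration 4400. The function names are illustrative and not part of mmcv.

```python
import math

BASE_LR = 2.0e-5          # optimizer.lr
WARMUP_ITERS = 500        # lr_config.warmup_iters
WARMUP_RATIO = 1.0 / 3.0  # lr_config.warmup_ratio (0.33333333)
MIN_LR_RATIO = 1.0e-3     # lr_config.min_lr_ratio
MAX_EPOCHS = 20           # runner.max_epochs


def cosine_lr(epoch: int) -> float:
    """Cosine annealing from BASE_LR down to BASE_LR * MIN_LR_RATIO, per epoch."""
    target = BASE_LR * MIN_LR_RATIO
    cos_out = math.cos(math.pi * epoch / MAX_EPOCHS) + 1.0
    return target + 0.5 * (BASE_LR - target) * cos_out


def lr_at(epoch: int, cur_iter: int) -> float:
    """Learning rate at a given epoch index and global iteration (assumed semantics)."""
    regular = cosine_lr(epoch)
    if cur_iter < WARMUP_ITERS:
        k = (1.0 - cur_iter / WARMUP_ITERS) * (1.0 - WARMUP_RATIO)
        return regular * (1.0 - k)
    return regular


for epoch, it in [(0, 0), (0, 250), (0, 4400), (10, 154480), (20, 308960)]:
    print(f"epoch {epoch:2d}, iter {it:6d}: lr = {lr_at(epoch, it):.3e}")
```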

**GT label configuration**:
```yaml
# BEV segmentation GT resolution: ~600×600 (0.167 m/pixel)
map_grid_conf:
  xbound: [-50.0, 50.0, 0.167]  # ~600 pixels
  ybound: [-50.0, 50.0, 0.167]  # ~600 pixels
```

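The 600-pixel figure is a rounded value: a 100 m extent divided by a 0.167 m step lands slightly under 600. The short check below shows the arithmetic; how the dataset code rounds or truncates this value is not stated in this report, so the exact GT size should be confirmed against the pipeline.

```python
extent_m = 50.0 - (-50.0)   # xbound / ybound span in metres
step_m = 0.167              # configured resolution

cells = extent_m / step_m          # slightly under 600
exact_step = extent_m / 600        # step that would give exactly 600 cells

print(f"cells at {step_m} m/pixel: {cells:.2f}")
print(f"step for exactly 600 cells: {exact_step:.5f} m/pixel")
```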

---

## 📊 Current Training Status

### Live Performance Metrics

**GPU status** (2025-11-01 12:32):
```
GPU 0-7: 100% utilization
Memory: 28.8-29.3GB/GPU (88-89%)
Temperature: 44-47°C
Power draw: 65-70W/GPU
```

**Training progress**:
```
Epoch: 1/10
Iteration: 4400/15448 (28.5%)
Learning rate: 2.000e-05
Loss: 2.63-2.79 (decreasing)
Gradient norm: 9.9-17.8
IoU: 0.615-0.623
```

**Performance metrics**:
```
Time per iteration: 2.66 s
Data loading: 0.45 s (17%)
Compute time: 2.21 s (83%)
Peak memory: 18.9GB
```

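The percentages above, and the implied throughput, can be reproduced from the two logged timings. This is a small sketch that assumes a global batch of 8 samples per iteration (8 GPUs with one sample each); the variable names are illustrative.

```python
iter_time_s = 2.66   # logged time per iteration
data_time_s = 0.45   # logged data-loading time per iteration
global_batch = 8     # assumption: 8 GPUs x 1 sample per GPU

compute_time_s = iter_time_s - data_time_s
data_share = data_time_s / iter_time_s
throughput = global_batch / iter_time_s

print(f"compute time: {compute_time_s:.2f} s")
print(f"data loading share: {data_share:.1%}")
print(f"throughput: {throughput:.2f} samples/s")
```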

### Loss Trends (Epoch 1)

**BEV segmentation**:
```
Drivable Area Dice: 0.15 (decreasing)
Stop Line Dice: 0.38-0.41 (fluctuating)
Divider Dice: 0.55-0.60 (slowly decreasing)
```

**3D detection**:
```
Heatmap Loss: 0.22-0.23
Bbox Loss: 0.29-0.32
Matched IoU: 0.615-0.623 (stable)
```

---

## ⏱️ Estimated Time

### Current Configuration
```
Start time: 2025-11-01 09:15 UTC
Epoch 1 progress: 28.5% (4400/15448)
Time per iteration: 2.66 s
Remaining iterations: 11,048

Epoch 1 remaining: ~8.2 hours
Epoch 1 completes: 2025-11-01 20:30 UTC
10 epochs complete: 2025-11-10 20:00 UTC (9.5 days)
```

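The estimate can be re-derived from the iteration count and the per-iteration time. The sketch below counts pure training time only and ignores evaluation and checkpointing. Under that assumption, 10 epochs come to roughly 4.8 days, while 9.5 days matches 20 epochs, the `max_epochs: 20` value in the config. The quoted 9.5-day figure may therefore refer to the full 20-epoch schedule; this is an inference from the arithmetic, not something the log confirms.

```python
ITERS_PER_EPOCH = 15448
ITERS_DONE = 4400
SEC_PER_ITER = 2.66

remaining_iters = ITERS_PER_EPOCH - ITERS_DONE
remaining_hours = remaining_iters * SEC_PER_ITER / 3600
epoch_hours = ITERS_PER_EPOCH * SEC_PER_ITER / 3600


def training_days(num_epochs: int) -> float:
    """Pure training time in days, excluding evaluation and checkpoint saving."""
    return num_epochs * epoch_hours / 24


print(f"remaining iterations in epoch 1: {remaining_iters}")
print(f"remaining time in epoch 1: {remaining_hours:.1f} h")
print(f"time per full epoch: {epoch_hours:.1f} h")
print(f"10 epochs: {training_days(10):.1f} days")
print(f"20 epochs: {training_days(20):.1f} days")
```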

---

## 🔍 Key Configuration Comparison

### Current Configuration vs. FP16 Optimization Plan

| Setting | Current (8 GPUs, FP32) | FP16 optimization | Change |
|--------|----------------|----------|------|
| **GPU memory** | 29GB | 20GB | -31% |
| **Batch/GPU** | 1 | 4 | 4× |
| **Total batch** | 8 | 32 | 4× |
| **Workers** | 0 | 2 | +2 |
| **Learning rate** | 2e-5 | 4e-5 | 2× |
| **Iteration time** | 2.66s | ~2.0s | -25% |
| **Epoch time** | 11h | 7.5h | -32% |
| **10 epochs** | 9.5 days | 6.5 days | -32% |
| **Precision** | FP32 | FP16 | Mixed |

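The "Change" column and the learning-rate row can be checked with a few lines of arithmetic. The 2× learning-rate increase for a 4× batch happens to match the square-root scaling heuristic; whether that rule was the actual motivation is an assumption. `scaled_lr` and `rel_change` are hypothetical helpers for this illustration.

```python
import math


def scaled_lr(base_lr: float, base_batch: int, new_batch: int, rule: str) -> float:
    """Scale a learning rate for a new batch size using a common heuristic."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")


def rel_change(before: float, after: float) -> float:
    """Relative change from `before` to `after` as a fraction."""
    return (after - before) / before


print(f"sqrt-scaled lr:   {scaled_lr(2e-5, 8, 32, 'sqrt'):.1e}")
print(f"linear-scaled lr: {scaled_lr(2e-5, 8, 32, 'linear'):.1e}")
print(f"GPU memory:     {rel_change(29, 20):+.0%}")
print(f"iteration time: {rel_change(2.66, 2.0):+.0%}")
print(f"epoch time:     {rel_change(11, 7.5):+.0%}")
print(f"total time:     {rel_change(9.5, 6.5):+.0%}")
```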

---

## 📂 File Locations

### Scripts and Configuration
```
Training script: /workspace/bevfusion/START_FROM_EPOCH1.sh
Config file: /workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
```

### Data and Outputs
```
Output directory: /data/runs/phase4a_stage1/
Training log: /workspace/bevfusion/phase4a_stage1_new_20251101_091503.log
Checkpoints: /data/runs/phase4a_stage1/epoch_*.pth
Pretrained model: /data/pretrained/swint-nuimages-pretrained.pth
Initial weights: /data/runs/phase4a_stage1/epoch_1.pth
```


---

## 🎯 Key Characteristics

### ✅ Strengths
1. **Stable operation**: GPUs at 100% utilization, no OOM
2. **Memory headroom**: 3.5-4GB still free per GPU
3. **Loss decreasing**: converging normally
4. **Evaluation tuned**: interval=5 avoids filling the disk

### ⚠️ Areas for Optimization
1. **Memory utilization**: only 88% used, leaving room for optimization
2. **Batch size**: can be raised to 2-4
3. **FP16**: mixed-precision training not enabled
4. **Checkpointing**: gradient checkpointing not enabled
5. **Workers**: 0 workers; data loading accounts for 17% of iteration time


---

## 📌 Summary

**Current configuration**:
- ✅ 8×V100S-32GB, fully loaded at 100%
- ✅ Batch=1/GPU, running stably
- ✅ Estimated 9.5 days to complete 10 epochs
- ⚠️ Room for optimization remains (FP16 + larger batch)

**Optimization potential**:
- FP16 mixed precision → saves 9GB of GPU memory
- Batch raised to 4 → about 33% faster training
- Completion time shortened to 6.5 days

---

*Generated: 2025-11-01 12:50 UTC*
*Based on: phase4a_stage1_new_20251101_091503.log*