Current 8-GPU training configuration summary (before FP16 optimization)
Last updated: 2025-11-01 12:50
Status: ✅ Running
Progress: Epoch 1 - 4400/15448 (28.5%)
📜 Training script
START_FROM_EPOCH1.sh
#!/bin/bash
# Phase 4A Stage 1: restart training from the weights in epoch_1.pth
set -e
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
cd /workspace/bevfusion
echo "========================================================================"
echo "Phase 4A Stage 1: restarting training from epoch_1.pth (8 GPUs)"
echo "========================================================================"
echo "Loading weights: epoch_1.pth (model already trained at 600×600)"
echo "Training epochs: 1-10"
echo "Output directory: /data/runs/phase4a_stage1"
echo "GPU configuration: 8×Tesla V100S-32GB"
echo "========================================================================"
# Verify the environment
python -c "import torch; print('✓ PyTorch:', torch.__version__)"
python -c "from mmcv.ops import nms_match; import mmcv; print('✓ mmcv:', mmcv.__version__)" || exit 1
echo "✓ Environment check passed"
# Make sure the checkpoint exists
if [ ! -f "/data/runs/phase4a_stage1/epoch_1.pth" ]; then
    echo "❌ Cannot find /data/runs/phase4a_stage1/epoch_1.pth"
    exit 1
fi
echo "✓ epoch_1.pth is ready"
LOG_FILE="phase4a_stage1_new_$(date +%Y%m%d_%H%M%S).log"
echo ""
echo "Starting training..."
echo "Log file: $LOG_FILE"
echo ""
# Load weights from epoch_1.pth and start a fresh run (no resume), using 8 GPUs
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
--model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
--load_from /data/runs/phase4a_stage1/epoch_1.pth \
--data.samples_per_gpu 1 \
--data.workers_per_gpu 0 \
2>&1 | tee "$LOG_FILE"
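Key parameters:
- -np 8: 8 processes (one per GPU)
- samples_per_gpu 1: batch size of 1 per GPU
- workers_per_gpu 0: no data-loading worker processes
- --load_from: load weights from epoch_1.pth (does not resume the previous run)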
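The launch flags above determine the effective batch size. As a rough sanity check (a sketch only, not part of the training code; the helper name `global_batch_size` is hypothetical and gradient accumulation is assumed to be off), the numbers reported in this document combine as follows:

```python
# Values taken from the launch command and the progress line in this document.
NUM_PROCESSES = 8          # torchpack dist-run -np 8
SAMPLES_PER_GPU = 1        # --data.samples_per_gpu 1
ITERS_PER_EPOCH = 15448    # from the progress line "4400/15448"


def global_batch_size(num_processes: int, samples_per_gpu: int) -> int:
    """Samples consumed per iteration across all GPUs (one process per GPU)."""
    return num_processes * samples_per_gpu


batch = global_batch_size(NUM_PROCESSES, SAMPLES_PER_GPU)
# Upper bound on samples seen per epoch; the final batch may be padded.
samples_per_epoch = batch * ITERS_PER_EPOCH

print(f"global batch size: {batch}")
print(f"samples per epoch (upper bound): {samples_per_epoch}")
```

Any change to `-np` or `samples_per_gpu` changes the iteration count per epoch proportionally, which matters when comparing per-iteration timings between configurations.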
📋 Configuration file
multitask_BEV2X_phase4a_stage1.yaml
Base configuration:
_base_: ./convfuser.yaml
work_dir: /data/runs/phase4a_stage1
# LiDAR configuration
voxel_size: [0.075, 0.075, 0.2]
point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
Model configuration:
model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        embed_dims: 96
        depths: [2, 2, 6, 2]
        num_heads: [3, 6, 12, 24]
        window_size: 7
        out_indices: [1, 2, 3]
        with_cp: false # ⚠️ gradient checkpointing not enabled
      neck:
        type: GeneralizedLSSFPN
        in_channels: [192, 384, 768]
        out_channels: 256
      vtransform:
        type: DepthLSSTransform
        in_channels: 256
        out_channels: 80
        xbound: [-54.0, 54.0, 0.2] # 540×540 BEV
        ybound: [-54.0, 54.0, 0.2]
        zbound: [-10.0, 10.0, 20.0]
        dbound: [1.0, 60.0, 0.5] # 118 depth bins
        downsample: 2
    lidar:
      voxelize:
        max_num_points: 10
        max_voxels: [120000, 160000]
      backbone:
        type: SparseEncoder
        sparse_shape: [1440, 1440, 41]
        output_channels: 128
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      layer_nums: [5, 5]
  heads:
    object:
      type: TransFusionHead
      num_proposals: 200
      in_channels: 512 # 256 * 2
    map:
      type: EnhancedBEVSegmentationHead
      in_channels: 256
      decoder_channels: [256, 256, 128, 128] # 4-layer decoder
      num_classes: 6
      deep_supervision: true
      loss_cfg:
        - type: FocalLoss (weight 1.0)
        - type: DiceLoss (weight 2.0)
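The grid sizes noted in the comments above follow directly from the bounds and step sizes. The sketch below recomputes them (the helper `num_cells` is hypothetical and only illustrates the arithmetic; it is not taken from the BEVFusion code base):

```python
def num_cells(lower: float, upper: float, step: float) -> int:
    """Number of grid cells covering [lower, upper] at the given step."""
    return round((upper - lower) / step)


# Camera BEV grid and depth bins, from the vtransform bounds.
bev_x = num_cells(-54.0, 54.0, 0.2)
bev_y = num_cells(-54.0, 54.0, 0.2)
bev_z = num_cells(-10.0, 10.0, 20.0)
depth_bins = num_cells(1.0, 60.0, 0.5)

# LiDAR voxel grid, from point_cloud_range and voxel_size.
pc_range = [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
voxel_size = [0.075, 0.075, 0.2]
voxel_grid = [
    num_cells(pc_range[i], pc_range[i + 3], voxel_size[i]) for i in range(3)
]

print(f"camera BEV grid: {bev_x}x{bev_y}x{bev_z}, depth bins: {depth_bins}")
print(f"LiDAR voxel grid (x, y, z): {voxel_grid}")
```

The configured `sparse_shape` uses 41 along z, one more than the 40 cells computed here. This is probably the usual one-cell padding used by SECOND-style sparse encoders, but that is an assumption; the document does not state it.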
Training configuration:
# Data configuration
data:
  samples_per_gpu: 1 # batch size of 1 per GPU
  workers_per_gpu: 0 # no workers (avoids conflicts)
  train:
    type: NuScenesDataset
    # ... (see the configuration file for details)
# Learning rate schedule
lr_config:
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
  min_lr_ratio: 1.0e-3
# Optimizer
optimizer:
  type: AdamW
  lr: 2.0e-5 # learning rate (lowered from 5e-5 to 2e-5)
  weight_decay: 0.01
optimizer_config:
  grad_clip:
    max_norm: 35
    norm_type: 2
# Runner configuration
runner:
  type: EpochBasedRunner
  max_epochs: 20
# Checkpoint configuration
checkpoint_config:
  interval: 1
  max_keep_ckpts: 5
# Evaluation configuration
evaluation:
  interval: 5 # ⭐ evaluate every 5 epochs (avoids filling the disk)
  pipeline: ${test_pipeline}
  metric: [bbox, map]
  save_best: auto
# Logging configuration
log_config:
  interval: 50
# Other settings
find_unused_parameters: false
sync_bn: true
cudnn_benchmark: true
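To make the learning-rate settings concrete, here is a minimal sketch of cosine annealing with linear warmup using the values from `lr_config` and `optimizer`. It assumes the textbook formulas and that the cosine term is stepped once per epoch (an assumption; the document does not say whether the schedule is per-epoch or per-iteration). The function names are illustrative, not the framework's API.

```python
import math

BASE_LR = 2.0e-5           # optimizer.lr
WARMUP_ITERS = 500         # lr_config.warmup_iters
WARMUP_RATIO = 0.33333333  # lr_config.warmup_ratio
MIN_LR_RATIO = 1.0e-3      # lr_config.min_lr_ratio
MAX_EPOCHS = 20            # runner.max_epochs


def cosine_lr(epoch: int) -> float:
    """Cosine annealing from BASE_LR down to BASE_LR * MIN_LR_RATIO."""
    min_lr = BASE_LR * MIN_LR_RATIO
    progress = epoch / MAX_EPOCHS
    return min_lr + 0.5 * (BASE_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


def lr_at(epoch: int, cur_iter: int) -> float:
    """Learning rate at a 0-based epoch index and global iteration count."""
    regular = cosine_lr(epoch)
    if cur_iter < WARMUP_ITERS:
        # Linear warmup: the scale factor rises from WARMUP_RATIO to 1.0.
        k = (1.0 - cur_iter / WARMUP_ITERS) * (1.0 - WARMUP_RATIO)
        return regular * (1.0 - k)
    return regular


print(f"iteration 0:    {lr_at(0, 0):.3e}")
print(f"iteration 4400: {lr_at(0, 4400):.3e}")
print(f"final epoch:    {cosine_lr(MAX_EPOCHS):.3e}")
```

Under these assumptions the rate at iteration 4400 of the first epoch is still the full 2e-5, which agrees with the learning rate reported in the training progress.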
GT label configuration:
# BEV segmentation GT resolution: ≈600×600 (0.167 m/pixel)
map_grid_conf:
  xbound: [-50.0, 50.0, 0.167] # ≈600 pixels (100 / 0.167 ≈ 598.8)
  ybound: [-50.0, 50.0, 0.167] # ≈600 pixels (100 / 0.167 ≈ 598.8)
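Because 0.167 is a rounded step, the grid it implies falls slightly short of 600 pixels. The sketch below shows the arithmetic; whether the data loader truncates or rounds the result is not stated in this document, so both are computed (the helper `grid_pixels` is hypothetical):

```python
def grid_pixels(lower: float, upper: float, step: float):
    """Return the exact, truncated, and rounded pixel counts for a 1-D bound."""
    exact = (upper - lower) / step
    return exact, int(exact), round(exact)


exact, truncated, rounded = grid_pixels(-50.0, 50.0, 0.167)
# Step size that would give exactly 600 pixels over the 100 m span.
step_for_600 = 100.0 / 600

print(f"exact: {exact:.2f}, truncated: {truncated}, rounded: {rounded}")
print(f"step for exactly 600 pixels: {step_for_600:.5f} m")
```

If the segmentation head is expected to emit exactly 600×600 outputs, it may be worth checking the actual GT tensor shape in the loader output against this calculation.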
📊 Current training status
Live performance metrics
GPU status (2025-11-01 12:32):
GPU 0-7: 100% utilization
Memory: 28.8-29.3 GB/GPU (88-89%)
Temperature: 44-47°C
Power: 65-70 W/GPU
Training progress:
Epoch: 1/10
Iteration: 4400/15448 (28.5%)
Learning rate: 2.000e-05
Loss: 2.63-2.79 (decreasing)
Gradient norm: 9.9-17.8
IoU: 0.615-0.623
Performance metrics:
Time per iteration: 2.66 s
Data loading: 0.45 s (17%)
Compute: 2.21 s (83%)
Peak memory: 18.9 GB
Loss trends (Epoch 1)
BEV segmentation:
Drivable Area Dice: 0.15 (decreasing)
Stop Line Dice: 0.38-0.41 (fluctuating)
Divider Dice: 0.55-0.60 (decreasing slowly)
3D detection:
Heatmap Loss: 0.22-0.23
Bbox Loss: 0.29-0.32
Matched IoU: 0.615-0.623 (stable)
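⏱️ Estimated time
Current configuration
Start time: 2025-11-01 09:15 UTC
Epoch 1 progress: 28.5% (4400/15448)
Time per iteration: 2.66 s
Remaining iterations: 11,048
Epoch 1 remaining: ~8.2 hours
Epoch 1 completes: 2025-11-01 20:30 UTC
10 epochs complete: 2025-11-10 20:00 UTC (9.5 days)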
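The estimates above can be reproduced from the per-iteration time. This sketch counts pure iteration time only (evaluation and checkpointing are excluded) and assumes the 12:32 GPU-status timestamp is in UTC. Under those assumptions 10 epochs come to roughly 4.8 days, while the 9.5-day figure matches 20 epochs, which is the `max_epochs` value in the runner configuration. The document does not say which was intended.

```python
from datetime import datetime, timedelta

SEC_PER_ITER = 2.66
ITERS_PER_EPOCH = 15448
ITERS_DONE = 4400
SNAPSHOT = datetime(2025, 11, 1, 12, 32)  # GPU status timestamp; UTC assumed

remaining_iters = ITERS_PER_EPOCH - ITERS_DONE
remaining_hours = remaining_iters * SEC_PER_ITER / 3600
epoch_hours = ITERS_PER_EPOCH * SEC_PER_ITER / 3600
epoch1_done = SNAPSHOT + timedelta(hours=remaining_hours)


def training_days(num_epochs: int) -> float:
    """Pure training time in days, ignoring evaluation and checkpoint overhead."""
    return num_epochs * epoch_hours / 24


print(f"remaining in epoch 1: {remaining_hours:.1f} h, done at {epoch1_done:%H:%M}")
print(f"one epoch: {epoch_hours:.1f} h")
print(f"10 epochs: {training_days(10):.1f} days, 20 epochs: {training_days(20):.1f} days")
```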
🔍 Key configuration comparison
Current configuration vs. FP16 optimization plan
| Setting | Current (8 GPUs, FP32) | FP16 optimized | Change |
|---|---|---|---|
| GPU memory | 29 GB | 20 GB | -31% |
| Batch/GPU | 1 | 4 | 4× |
| Total batch | 8 | 32 | 4× |
| Workers | 0 | 2 | +2 |
| Learning rate | 2e-5 | 4e-5 | 2× |
| Iteration time | 2.66 s | ~2.0 s | -25% |
| Epoch time | 11 h | 7.5 h | -32% |
| 10 epochs | 9.5 days | 6.5 days | -32% |
| Precision | FP32 | FP16 | Mixed |
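The percentage column can be checked with a few lines of arithmetic. The sketch also compares two common learning-rate scaling heuristics. The table's 2× learning rate for a 4× batch matches square-root scaling rather than linear scaling, although the document does not state which rule was applied. The helper `pct_change` is hypothetical.

```python
import math


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


rows = {
    "memory_gb": (29.0, 20.0),
    "iter_time_s": (2.66, 2.0),
    "epoch_time_h": (11.0, 7.5),
    "total_days": (9.5, 6.5),
}
changes = {name: round(pct_change(b, a)) for name, (b, a) in rows.items()}

batch_ratio = 32 / 8                             # total batch 8 -> 32
linear_lr = 2e-5 * batch_ratio                   # linear scaling heuristic
sqrt_lr = 2e-5 * math.sqrt(batch_ratio)          # square-root scaling heuristic

print(changes)
print(f"linear scaling: {linear_lr:.1e}, sqrt scaling: {sqrt_lr:.1e}")
```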
📂 File locations
Scripts and configuration
Training script: /workspace/bevfusion/START_FROM_EPOCH1.sh
Configuration file: /workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
Data and outputs
Output directory: /data/runs/phase4a_stage1/
Training log: /workspace/bevfusion/phase4a_stage1_new_20251101_091503.log
Checkpoints: /data/runs/phase4a_stage1/epoch_*.pth
Pretrained model: /data/pretrained/swint-nuimages-pretrained.pth
Initial weights: /data/runs/phase4a_stage1/epoch_1.pth
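🎯 Key characteristics
✅ Strengths
- Stable: GPUs at 100% utilization, no OOM
- Memory headroom: 3.5-4 GB still free
- Loss decreasing: converging normally
- Evaluation tuned: interval=5 avoids filling the disk
⚠️ Room for improvement
- Memory utilization: only 88% used, leaving room to optimize
- Batch size: could be raised to 2-4
- FP16: mixed-precision training not enabled
- Checkpointing: gradient checkpointing not enabled
- Workers: 0 workers; data loading takes 17% of each iteration
📌 Summary
Current configuration:
- ✅ 8×V100S-32GB, fully loaded at 100%
- ✅ Batch=1/GPU, running stably
- ✅ 10 epochs expected to finish in 9.5 days
- ⚠️ Still room to optimize (FP16 + larger batch)
Optimization potential:
- FP16 mixed precision → saves 9 GB of GPU memory
- Batch raised to 4 → 33% faster training
- Completion time shortened to 6.5 days
Generated: 2025-11-01 12:50 UTC
Based on: phase4a_stage1_new_20251101_091503.log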