bev-project/当前8卡训练配置总结.md

8.3 KiB
Raw Blame History

当前8卡训练配置总结FP16优化前

更新时间: 2025-11-01 12:50
状态: 正在运行中
进度: Epoch 1 - 4400/15448 (28.5%)


📜 训练脚本

START_FROM_EPOCH1.sh

#!/bin/bash
# Phase 4A Stage 1: 从epoch_1.pth加载权重重新开始训练

set -e

export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH

cd /workspace/bevfusion

echo "========================================================================"
echo "Phase 4A Stage 1: 从epoch_1.pth重新开始训练 (8 GPUs)"
echo "========================================================================"
echo "加载权重: epoch_1.pth (已训练过600×600的模型)"
echo "训练Epochs: 1-10"
echo "输出目录: /data/runs/phase4a_stage1"
echo "GPU配置: 8×Tesla V100S-32GB"
echo "========================================================================"

# 环境验证
python -c "import torch; print('✓ PyTorch:', torch.__version__)"
python -c "from mmcv.ops import nms_match; import mmcv; print('✓ mmcv:', mmcv.__version__)" || exit 1
echo "✓ 环境验证成功"

# 确认文件存在
if [ ! -f "/data/runs/phase4a_stage1/epoch_1.pth" ]; then
    echo "❌ 找不到 /data/runs/phase4a_stage1/epoch_1.pth"
    exit 1
fi
echo "✓ epoch_1.pth已就绪"

LOG_FILE="phase4a_stage1_new_$(date +%Y%m%d_%H%M%S).log"

echo ""
echo "开始训练..."
echo "日志文件: $LOG_FILE"
echo ""

# 从epoch_1.pth加载权重重新开始训练不resume- 使用8卡
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
  --load_from /data/runs/phase4a_stage1/epoch_1.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0 \
  2>&1 | tee "$LOG_FILE"

关键参数

  • -np 8: 8个进程8张GPU
  • samples_per_gpu 1: 每GPU batch=1
  • workers_per_gpu 0: 无数据加载worker
  • --load_from: 从epoch_1.pth加载权重不resume日志

📋 配置文件

multitask_BEV2X_phase4a_stage1.yaml

基础配置

_base_: ./convfuser.yaml

work_dir: /data/runs/phase4a_stage1

# LiDAR配置
voxel_size: [0.075, 0.075, 0.2]
point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]

模型配置

model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        embed_dims: 96
        depths: [2, 2, 6, 2]
        num_heads: [3, 6, 12, 24]
        window_size: 7
        out_indices: [1, 2, 3]
        with_cp: false  # ⚠️ 未启用gradient checkpointing
        
      neck:
        type: GeneralizedLSSFPN
        in_channels: [192, 384, 768]
        out_channels: 256
        
      vtransform:
        type: DepthLSSTransform
        in_channels: 256
        out_channels: 80
        xbound: [-54.0, 54.0, 0.2]  # 540×540 BEV
        ybound: [-54.0, 54.0, 0.2]
        zbound: [-10.0, 10.0, 20.0]
        dbound: [1.0, 60.0, 0.5]    # 118 depth bins
        downsample: 2
        
    lidar:
      voxelize:
        max_num_points: 10
        max_voxels: [120000, 160000]
      backbone:
        type: SparseEncoder
        sparse_shape: [1440, 1440, 41]
        output_channels: 128
        
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
    
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      layer_nums: [5, 5]
      
  heads:
    object:
      type: TransFusionHead
      num_proposals: 200
      in_channels: 256 * 2
      
    map:
      type: EnhancedBEVSegmentationHead
      in_channels: 256
      decoder_channels: [256, 256, 128, 128]  # 4层decoder
      num_classes: 6
      deep_supervision: true
      loss_cfg:
        - type: FocalLoss (weight 1.0)
        - type: DiceLoss (weight 2.0)

训练配置

# 数据配置
data:
  samples_per_gpu: 1  # 每GPU batch=1
  workers_per_gpu: 0  # 无worker避免冲突
  train:
    type: NuScenesDataset
    # ... (详见配置文件)

# 学习率配置
lr_config:
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
  min_lr_ratio: 1.0e-3

# 优化器
optimizer:
  type: AdamW
  lr: 2.0e-5  # 学习率 (从5e-5降至2e-5)
  weight_decay: 0.01

optimizer_config:
  grad_clip:
    max_norm: 35
    norm_type: 2

# Runner配置
runner:
  type: EpochBasedRunner
  max_epochs: 20

# Checkpoint配置
checkpoint_config:
  interval: 1
  max_keep_ckpts: 5

# 评估配置
evaluation:
  interval: 5  # ⭐ 每5个epoch评估避免磁盘满
  pipeline: ${test_pipeline}
  metric: [bbox, map]
  save_best: auto

# 日志配置
log_config:
  interval: 50

# 其他配置
find_unused_parameters: false
sync_bn: true
cudnn_benchmark: true

GT标签配置

# BEV分割GT分辨率: 600×600 (0.167m/pixel)
map_grid_conf:
  xbound: [-50.0, 50.0, 0.167]  # 600 pixels
  ybound: [-50.0, 50.0, 0.167]  # 600 pixels

📊 当前训练状态

实时性能指标

GPU状态2025-11-01 12:32

GPU 0-7: 100% 利用率
显存:    28.8-29.3GB/GPU (88-89%)
温度:    44-47°C
功耗:    65-70W/GPU

训练进度

Epoch:     1/10
迭代:      4400/15448 (28.5%)
学习率:    2.000e-05
Loss:      2.63-2.79 (下降中)
梯度范数:  9.9-17.8
IOU:       0.615-0.623

性能指标

每迭代时间:   2.66秒
数据加载:     0.45秒 (17%)
计算时间:     2.21秒 (83%)
显存峰值:     18.9GB

Loss趋势Epoch 1

BEV分割

Drivable Area Dice: 0.15 (下降中)
Stop Line Dice:     0.38-0.41 (波动)
Divider Dice:       0.55-0.60 (缓慢下降)

3D检测

Heatmap Loss: 0.22-0.23
Bbox Loss:    0.29-0.32
Matched IOU:  0.615-0.623 (稳定)

⏱️ 预计时间

当前配置

启动时间:       2025-11-01 09:15 UTC
Epoch 1进度:   28.5% (4400/15448)
每迭代时间:     2.66秒
剩余迭代:       11,048次

Epoch 1剩余:   ~8.2小时
Epoch 1完成:   2025-11-01 20:30 UTC
10 epochs完成: 2025-11-10 20:00 UTC (9.5天)

🔍 关键配置对比

当前配置 vs FP16优化方案

配置项 当前8卡FP32 FP16优化 改进
显存 29GB 20GB -31%
Batch/GPU 1 4 4×
总Batch 8 32 4×
Workers 0 2 +2
学习率 2e-5 4e-5 2×
迭代时间 2.66s ~2.0s -25%
Epoch时间 11h 7.5h -32%
10 epochs 9.5天 6.5天 -32%
精度 FP32 FP16 混合

📂 文件位置

脚本和配置

训练脚本: /workspace/bevfusion/START_FROM_EPOCH1.sh
配置文件: /workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml

数据和输出

输出目录:   /data/runs/phase4a_stage1/
训练日志:   /workspace/bevfusion/phase4a_stage1_new_20251101_091503.log
Checkpoints: /data/runs/phase4a_stage1/epoch_*.pth
预训练模型: /data/pretrained/swint-nuimages-pretrained.pth
初始权重:   /data/runs/phase4a_stage1/epoch_1.pth

🎯 关键特点

优点

  1. 稳定运行: GPU 100%利用无OOM
  2. 显存充足: 还有3.5-4GB剩余
  3. Loss下降: 正常收敛中
  4. 评估优化: interval=5避免磁盘满

⚠️ 可优化点

  1. 显存利用: 仅用88%,还有优化空间
  2. Batch Size: 可增大到2-4
  3. FP16: 未启用混合精度训练
  4. Checkpointing: 未启用gradient checkpointing
  5. Workers: 0 workers数据加载占17%

📌 总结

当前配置

  • 8×V100S-32GB100%满载
  • Batch=1/GPU稳定运行
  • 预计9.5天完成10 epochs
  • ⚠️ 还有优化空间FP16+Batch↑

优化潜力

  • FP16混合精度 → 节省9GB显存
  • Batch增至4 → 训练加速33%
  • 完成时间缩短至6.5天

生成时间: 2025-11-01 12:50 UTC
基于: phase4a_stage1_new_20251101_091503.log