# Current 8-GPU Training Configuration Summary (Before FP16 Optimization)

**Last updated**: 2025-11-01 12:50

**Status**: ✅ Running

**Progress**: Epoch 1 - 4400/15448 (28.5%)

---

## 📜 Training Script

### START_FROM_EPOCH1.sh

```bash
#!/bin/bash
# Phase 4A Stage 1: load weights from epoch_1.pth and restart training

set -e

export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH

cd /workspace/bevfusion

echo "========================================================================"
echo "Phase 4A Stage 1: restart training from epoch_1.pth (8 GPUs)"
echo "========================================================================"
echo "Loading weights: epoch_1.pth (model already trained at 600×600)"
echo "Training epochs: 1-10"
echo "Output directory: /data/runs/phase4a_stage1"
echo "GPU setup: 8×Tesla V100S-32GB"
echo "========================================================================"

# Verify the environment
python -c "import torch; print('✓ PyTorch:', torch.__version__)"
python -c "from mmcv.ops import nms_match; import mmcv; print('✓ mmcv:', mmcv.__version__)" || exit 1
echo "✓ Environment check passed"

# Make sure the checkpoint exists
if [ ! -f "/data/runs/phase4a_stage1/epoch_1.pth" ]; then
    echo "❌ Cannot find /data/runs/phase4a_stage1/epoch_1.pth"
    exit 1
fi
echo "✓ epoch_1.pth is ready"

LOG_FILE="phase4a_stage1_new_$(date +%Y%m%d_%H%M%S).log"

echo ""
echo "Starting training..."
echo "Log file: $LOG_FILE"
echo ""

# Load weights from epoch_1.pth and start a fresh run (no resume), using 8 GPUs
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
  --load_from /data/runs/phase4a_stage1/epoch_1.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0 \
  2>&1 | tee "$LOG_FILE"
```

**Key parameters**:
- `-np 8`: 8 processes (one per GPU)
- `samples_per_gpu 1`: batch size of 1 per GPU
- `workers_per_gpu 0`: no data-loading worker processes
- `--load_from`: loads weights from epoch_1.pth without resuming the previous run (logs start fresh)

---

## 📋 Configuration File

### multitask_BEV2X_phase4a_stage1.yaml

**Base configuration**:
```yaml
_base_: ./convfuser.yaml

work_dir: /data/runs/phase4a_stage1

# LiDAR settings
voxel_size: [0.075, 0.075, 0.2]
point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
```

**Model configuration**:
```yaml
model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        embed_dims: 96
        depths: [2, 2, 6, 2]
        num_heads: [3, 6, 12, 24]
        window_size: 7
        out_indices: [1, 2, 3]
        with_cp: false  # ⚠️ gradient checkpointing not enabled
      neck:
        type: GeneralizedLSSFPN
        in_channels: [192, 384, 768]
        out_channels: 256
      vtransform:
        type: DepthLSSTransform
        in_channels: 256
        out_channels: 80
        xbound: [-54.0, 54.0, 0.2]  # 540×540 BEV
        ybound: [-54.0, 54.0, 0.2]
        zbound: [-10.0, 10.0, 20.0]
        dbound: [1.0, 60.0, 0.5]  # 118 depth bins
        downsample: 2
    lidar:
      voxelize:
        max_num_points: 10
        max_voxels: [120000, 160000]
      backbone:
        type: SparseEncoder
        sparse_shape: [1440, 1440, 41]
        output_channels: 128
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      layer_nums: [5, 5]
  heads:
    object:
      type: TransFusionHead
      num_proposals: 200
      in_channels: 512  # 256 * 2
    map:
      type: EnhancedBEVSegmentationHead
      in_channels: 256
      decoder_channels: [256, 256, 128, 128]  # 4-layer decoder
      num_classes: 6
      deep_supervision: true
      loss_cfg:
        - type: FocalLoss  # weight 1.0
        - type: DiceLoss   # weight 2.0
```

**Training configuration**:
```yaml
# Data
data:
  samples_per_gpu: 1  # batch size of 1 per GPU
  workers_per_gpu: 0  # no workers (avoids conflicts)
  train:
    type: NuScenesDataset
    # ... (see the config file for details)

# Learning-rate schedule
lr_config:
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
  min_lr_ratio: 1.0e-3

# Optimizer
optimizer:
  type: AdamW
  lr: 2.0e-5  # learning rate (lowered from 5e-5 to 2e-5)
  weight_decay: 0.01

optimizer_config:
  grad_clip:
    max_norm: 35
    norm_type: 2

# Runner
runner:
  type: EpochBasedRunner
  max_epochs: 20

# Checkpoints
checkpoint_config:
  interval: 1
  max_keep_ckpts: 5

# Evaluation
evaluation:
  interval: 5  # ⭐ evaluate every 5 epochs (avoids filling the disk)
  pipeline: ${test_pipeline}
  metric: [bbox, map]
  save_best: auto

# Logging
log_config:
  interval: 50

# Miscellaneous
find_unused_parameters: false
sync_bn: true
cudnn_benchmark: true
```

**Ground-truth label configuration**:
```yaml
# BEV segmentation GT resolution: 600×600 (0.167 m/pixel)
map_grid_conf:
  xbound: [-50.0, 50.0, 0.167]  # 600 pixels
  ybound: [-50.0, 50.0, 0.167]  # 600 pixels
```

---

## 📊 Current Training Status

### Live Performance Metrics

**GPU status** (2025-11-01 12:32):
```
GPU 0-7: 100% utilization
Memory: 28.8-29.3GB/GPU (88-89%)
Temperature: 44-47°C
Power: 65-70W/GPU
```

**Training progress**:
```
Epoch: 1/10
Iteration: 4400/15448 (28.5%)
Learning rate: 2.000e-05
Loss: 2.63-2.79 (decreasing)
Gradient norm: 9.9-17.8
IoU: 0.615-0.623
```

**Performance metrics**:
```
Time per iteration: 2.66 s
Data loading: 0.45 s (17%)
Compute: 2.21 s (83%)
Peak memory: 18.9GB
```

### Loss Trends (Epoch 1)

**BEV segmentation**:
```
Drivable Area Dice: 0.15 (decreasing)
Stop Line Dice: 0.38-0.41 (fluctuating)
Divider Dice: 0.55-0.60 (decreasing slowly)
```

**3D detection**:
```
Heatmap Loss: 0.22-0.23
Bbox Loss: 0.29-0.32
Matched IoU: 0.615-0.623 (stable)
```

---

## ⏱️ Estimated Time

### Current Configuration

```
Start time: 2025-11-01 09:15 UTC
Epoch 1 progress: 28.5% (4400/15448)
Time per iteration: 2.66 s
Remaining iterations: 11,048
Epoch 1 remaining: ~8.2 hours
Epoch 1 completes: 2025-11-01 20:30 UTC
10 epochs complete: 2025-11-10 20:00 UTC (9.5 days)
```

---

## 🔍 Key Configuration Comparison

### Current Configuration vs. FP16 Optimization Plan

| Setting | Current (8 GPUs, FP32) | FP16 plan | Change |
|--------|----------------|----------|------|
| **GPU memory** | 29GB | 20GB | -31% |
| **Batch/GPU** | 1 | 4 | 4× |
| **Total batch** | 8 | 32 | 4× |
| **Workers** | 0 | 2 | +2 |
| **Learning rate** | 2e-5 | 4e-5 | 2× |
| **Iteration time** | 2.66s | ~2.0s | -25% |
| **Epoch time** | 11h | 7.5h | -32% |
| **10 epochs** | 9.5 days | 6.5 days | -32% |
| **Precision** | FP32 | FP16 | Mixed |

---

## 📂 File Locations

### Script and Config

```
Training script: /workspace/bevfusion/START_FROM_EPOCH1.sh
Config file: /workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
```

### Data and Output

```
Output directory: /data/runs/phase4a_stage1/
Training log: /workspace/bevfusion/phase4a_stage1_new_20251101_091503.log
Checkpoints: /data/runs/phase4a_stage1/epoch_*.pth
Pretrained model: /data/pretrained/swint-nuimages-pretrained.pth
Initial weights: /data/runs/phase4a_stage1/epoch_1.pth
```

---

## 🎯 Key Characteristics

### ✅ Strengths

1. **Stable run**: GPUs at 100% utilization, no OOM
2. **Memory headroom**: 3.5-4GB still free
3. **Loss decreasing**: converging normally
4. **Evaluation tuned**: interval=5 avoids filling the disk

### ⚠️ Room for Improvement

1. **Memory utilization**: only 88% used, so there is room to optimize
2. **Batch size**: can be raised to 2-4
3. **FP16**: mixed-precision training is not enabled
4. **Checkpointing**: gradient checkpointing is not enabled
5. **Workers**: 0 workers, and data loading takes 17% of each iteration

---

## 📌 Summary

**Current configuration**:
- ✅ 8×V100S-32GB at 100% load
- ✅ Batch=1/GPU, running stably
- ✅ 10 epochs expected to finish in 9.5 days
- ⚠️ Still room to optimize (FP16 + larger batch)

**Optimization potential**:
- FP16 mixed precision → saves 9GB of GPU memory
- Batch raised to 4 → training 33% faster
- Completion time shortened to 6.5 days

---

*Generated: 2025-11-01 12:50 UTC*

*Based on: phase4a_stage1_new_20251101_091503.log*
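
The time estimates in this report come down to simple arithmetic on the logged iteration time. Below is a minimal Python sketch of that calculation, assuming a constant 2.66 s per iteration and ignoring evaluation and checkpointing overhead. The function names `remaining_hours` and `training_days` are illustrative and not part of the training code.

```python
# Figures copied from the status report; everything else is an assumption.
ITERS_PER_EPOCH = 15448   # iterations in one epoch
CURRENT_ITER = 4400       # iterations completed in epoch 1
SECONDS_PER_ITER = 2.66   # logged time per iteration, assumed constant


def remaining_hours(iters_done, iters_per_epoch=ITERS_PER_EPOCH,
                    sec_per_iter=SECONDS_PER_ITER):
    """Hours left in the current epoch at a constant iteration time."""
    return (iters_per_epoch - iters_done) * sec_per_iter / 3600


def training_days(num_epochs, iters_per_epoch=ITERS_PER_EPOCH,
                  sec_per_iter=SECONDS_PER_ITER):
    """Days of pure training time for num_epochs full epochs.

    Evaluation, checkpoint writes, and restarts are not included.
    """
    return num_epochs * iters_per_epoch * sec_per_iter / 86400


if __name__ == "__main__":
    print(f"Epoch 1 remaining: {remaining_hours(CURRENT_ITER):.1f} h")
    for epochs in (10, 20):
        print(f"{epochs} epochs: {training_days(epochs):.1f} days")
```

Under these assumptions the sketch gives about 8.2 hours for the rest of epoch 1, about 4.8 days for 10 epochs, and about 9.5 days for 20 epochs. The 9.5-day figure quoted in this report therefore appears to line up with `max_epochs: 20` rather than with 10 epochs, so it is worth confirming which epoch count the estimate is meant to cover.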