bev-project/当前8卡训练配置总结.md

# Current 8-GPU Training Configuration Summary (Before FP16 Optimization)

**Last updated**: 2025-11-01 12:50  
**Status**: ✅ Running  
**Progress**: Epoch 1 - 4400/15448 (28.5%)

---

## 📜 Training Script

### START_FROM_EPOCH1.sh

```bash
#!/bin/bash
# Phase 4A Stage 1: load weights from epoch_1.pth and restart training

set -e

export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH

cd /workspace/bevfusion

echo "========================================================================"
echo "Phase 4A Stage 1: restart training from epoch_1.pth (8 GPUs)"
echo "========================================================================"
echo "Loading weights: epoch_1.pth (model already trained at 600×600)"
echo "Training epochs: 1-10"
echo "Output directory: /data/runs/phase4a_stage1"
echo "GPU setup: 8×Tesla V100S-32GB"
echo "========================================================================"

# Environment check
python -c "import torch; print('✓ PyTorch:', torch.__version__)"
python -c "from mmcv.ops import nms_match; import mmcv; print('✓ mmcv:', mmcv.__version__)" || exit 1
echo "✓ Environment check passed"

# Make sure the checkpoint exists
if [ ! -f "/data/runs/phase4a_stage1/epoch_1.pth" ]; then
    echo "❌ Cannot find /data/runs/phase4a_stage1/epoch_1.pth"
    exit 1
fi
echo "✓ epoch_1.pth is ready"

LOG_FILE="phase4a_stage1_new_$(date +%Y%m%d_%H%M%S).log"

echo ""
echo "Starting training..."
echo "Log file: $LOG_FILE"
echo ""

# Load weights from epoch_1.pth and start a fresh run (no resume), using 8 GPUs
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
  --load_from /data/runs/phase4a_stage1/epoch_1.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0 \
  2>&1 | tee "$LOG_FILE"
```

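**Key parameters**:
- `-np 8`: 8 processes (one per GPU)
- `samples_per_gpu 1`: batch size of 1 per GPU
- `workers_per_gpu 0`: no data-loading worker processes (data is loaded in the main training process)
- `--load_from`: initializes the model from the weights in epoch_1.pth without resuming the earlier run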
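
The parameters above fix the global batch size. As a minimal sketch (the helper name `global_batch_size` is hypothetical; the numbers come from the launch command and the progress counter 4400/15448 in this summary), the batch per optimizer step and the approximate number of samples per epoch work out as follows:

```python
# Values taken from the launch command and the progress line in this summary.
num_processes = 8        # torchpack dist-run -np 8
samples_per_gpu = 1      # --data.samples_per_gpu 1
iters_per_epoch = 15448  # from the progress counter 4400/15448


def global_batch_size(num_processes: int, samples_per_gpu: int) -> int:
    """Samples consumed per optimizer step across all data-parallel processes."""
    return num_processes * samples_per_gpu


batch = global_batch_size(num_processes, samples_per_gpu)

# Upper bound: the last batch of an epoch may be padded by the sampler.
samples_per_epoch = batch * iters_per_epoch

print(f"global batch size: {batch}")
print(f"samples per epoch (at most): {samples_per_epoch}")
```

With `samples_per_gpu` raised to 4, the same helper gives the total batch of 32 listed in the FP16 comparison table.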

---

## 📋 Configuration File

### multitask_BEV2X_phase4a_stage1.yaml

**Base configuration**:
```yaml
_base_: ./convfuser.yaml

work_dir: /data/runs/phase4a_stage1

# LiDAR settings
voxel_size: [0.075, 0.075, 0.2]
point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
```

**Model configuration**:
```yaml
model:
  encoders:
    camera:
      backbone:
        type: SwinTransformer
        embed_dims: 96
        depths: [2, 2, 6, 2]
        num_heads: [3, 6, 12, 24]
        window_size: 7
        out_indices: [1, 2, 3]
        with_cp: false  # ⚠️ gradient checkpointing not enabled
        
      neck:
        type: GeneralizedLSSFPN
        in_channels: [192, 384, 768]
        out_channels: 256
        
      vtransform:
        type: DepthLSSTransform
        in_channels: 256
        out_channels: 80
        xbound: [-54.0, 54.0, 0.2]  # 540×540 BEV
        ybound: [-54.0, 54.0, 0.2]
        zbound: [-10.0, 10.0, 20.0]
        dbound: [1.0, 60.0, 0.5]    # 118 depth bins
        downsample: 2
        
    lidar:
      voxelize:
        max_num_points: 10
        max_voxels: [120000, 160000]
      backbone:
        type: SparseEncoder
        sparse_shape: [1440, 1440, 41]
        output_channels: 128
        
  fuser:
    type: ConvFuser
    in_channels: [80, 256]
    out_channels: 256
    
  decoder:
    backbone:
      type: SECOND
      in_channels: 256
      out_channels: [128, 256]
      layer_nums: [5, 5]
      
  heads:
    object:
      type: TransFusionHead
      num_proposals: 200
      in_channels: 512  # 256 * 2
      
    map:
      type: EnhancedBEVSegmentationHead
      in_channels: 256
      decoder_channels: [256, 256, 128, 128]  # 4-layer decoder
      num_classes: 6
      deep_supervision: true
      loss_cfg:  # summarized; see the config file for the exact keys
        - type: FocalLoss  # weight 1.0
        - type: DiceLoss   # weight 2.0
```
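
The grid sizes quoted in the comments above follow directly from the bounds and step sizes. A small sketch to check them (the helper `num_cells` is hypothetical; all bounds are copied from the configuration shown here):

```python
def num_cells(lower: float, upper: float, step: float) -> int:
    """Number of grid cells covering [lower, upper) at the given step."""
    # round() absorbs floating-point noise such as 108 / 0.075 = 1440.0000000000002
    return round((upper - lower) / step)


# Camera view transform (vtransform)
bev_cells = num_cells(-54.0, 54.0, 0.2)    # xbound / ybound
z_cells = num_cells(-10.0, 10.0, 20.0)     # zbound
depth_bins = num_cells(1.0, 60.0, 0.5)     # dbound

# LiDAR voxel grid from point_cloud_range and voxel_size
voxels_xy = num_cells(-54.0, 54.0, 0.075)
voxels_z = num_cells(-5.0, 3.0, 0.2)

print(f"BEV grid: {bev_cells}x{bev_cells}, height cells: {z_cells}")
print(f"depth bins: {depth_bins}")
print(f"voxel grid: {voxels_xy}x{voxels_xy}x{voxels_z}")
```

The voxel grid comes out as 1440×1440×40, while `sparse_shape` lists 41 along z. The extra cell is not explained in this summary; it appears to be a convention of the sparse encoder rather than a property of the range itself.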

**Training configuration**:
```yaml
# Data settings
data:
  samples_per_gpu: 1  # batch size of 1 per GPU
  workers_per_gpu: 0  # no workers (avoids conflicts)
  train:
    type: NuScenesDataset
    # ... (see the config file for details)

# Learning-rate schedule
lr_config:
  policy: CosineAnnealing
  warmup: linear
  warmup_iters: 500
  warmup_ratio: 0.33333333
  min_lr_ratio: 1.0e-3

# Optimizer
optimizer:
  type: AdamW
  lr: 2.0e-5  # learning rate (lowered from 5e-5 to 2e-5)
  weight_decay: 0.01

optimizer_config:
  grad_clip:
    max_norm: 35
    norm_type: 2

# Runner settings
runner:
  type: EpochBasedRunner
  max_epochs: 20

# Checkpoint settings
checkpoint_config:
  interval: 1
  max_keep_ckpts: 5

# Evaluation settings
evaluation:
  interval: 5  # ⭐ evaluate every 5 epochs (avoids filling the disk)
  pipeline: ${test_pipeline}
  metric: [bbox, map]
  save_best: auto

# Logging settings
log_config:
  interval: 50

# Miscellaneous
find_unused_parameters: false
sync_bn: true
cudnn_benchmark: true
```
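
To make the schedule concrete, here is a rough sketch of linear warmup followed by cosine annealing using the values from `lr_config` and `optimizer`. It assumes the usual mmcv-style formulas and an annealing step applied once per epoch; neither assumption is confirmed by this summary, and the function names are hypothetical.

```python
import math

BASE_LR = 2.0e-5           # optimizer.lr
WARMUP_ITERS = 500         # lr_config.warmup_iters
WARMUP_RATIO = 0.33333333  # lr_config.warmup_ratio
MIN_LR_RATIO = 1.0e-3      # lr_config.min_lr_ratio
MAX_EPOCHS = 20            # runner.max_epochs


def cosine_lr(epoch: int, max_epochs: int = MAX_EPOCHS) -> float:
    """Cosine annealing from BASE_LR down to BASE_LR * MIN_LR_RATIO (assumed per-epoch)."""
    target = BASE_LR * MIN_LR_RATIO
    factor = 0.5 * (1.0 + math.cos(math.pi * epoch / max_epochs))
    return target + (BASE_LR - target) * factor


def lr_at(cur_iter: int, epoch: int) -> float:
    """Learning rate at a global iteration, with linear warmup over the first iterations."""
    regular = cosine_lr(epoch)
    if cur_iter < WARMUP_ITERS:
        # Linear warmup: starts at WARMUP_RATIO * regular and reaches regular at WARMUP_ITERS.
        k = (1.0 - cur_iter / WARMUP_ITERS) * (1.0 - WARMUP_RATIO)
        return regular * (1.0 - k)
    return regular


for it, ep in [(0, 0), (250, 0), (4400, 0)]:
    print(f"iter {it:>5}, epoch {ep}: lr = {lr_at(it, ep):.3e}")
```

Under these assumptions the rate at iteration 4400 of the first epoch is exactly the base value, which agrees with the 2.000e-05 reported in the training progress.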

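**Ground-truth label configuration**:
```yaml
# BEV segmentation GT resolution: ~600×600 (0.167 m/pixel)
map_grid_conf:
  xbound: [-50.0, 50.0, 0.167]  # ~600 pixels (100 / 0.167 ≈ 598.8; exactly 600 needs a step of 100/600 ≈ 0.1667)
  ybound: [-50.0, 50.0, 0.167]  # ~600 pixels
```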

---

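## 📊 Current Training Status

### Live performance metrics

**GPU status** (2025-11-01 12:32):
```
GPU 0-7: 100% utilization
Memory:  28.8-29.3GB/GPU (88-89%)
Temp:    44-47°C
Power:   65-70W/GPU
```

**Training progress**:
```
Epoch:         1/10
Iteration:     4400/15448 (28.5%)
Learning rate: 2.000e-05
Loss:          2.63-2.79 (decreasing)
Grad norm:     9.9-17.8
IOU:           0.615-0.623
```

**Performance metrics**:
```
Time per iteration: 2.66 s
Data loading:       0.45 s (17%)
Compute:            2.21 s (83%)
Peak memory:        18.9GB (as reported in the training log)
```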
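
The timing split above can be turned into a throughput figure and a rough ceiling on what faster data loading could gain. This is only a sketch based on the three numbers reported here plus the global batch of 8 from the launch command; the variable names are hypothetical.

```python
iter_time_s = 2.66   # time per iteration
data_time_s = 0.45   # data loading per iteration
global_batch = 8     # 8 GPUs x 1 sample per GPU

compute_time_s = iter_time_s - data_time_s
data_share = data_time_s / iter_time_s

# Samples processed per second across all GPUs.
throughput = global_batch / iter_time_s

# Best case if data loading were fully hidden behind compute (e.g. by worker processes).
best_case_speedup = iter_time_s / compute_time_s

print(f"compute time:      {compute_time_s:.2f} s")
print(f"data-loading share: {data_share:.1%}")
print(f"throughput:        {throughput:.2f} samples/s")
print(f"best-case speedup: {best_case_speedup:.2f}x")
```

The speedup is an upper bound under the assumption that loading and compute do not overlap today, which is what `workers_per_gpu: 0` suggests.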

### Loss trends (Epoch 1)

**BEV segmentation**:
```
Drivable Area Dice loss: 0.15 (decreasing)
Stop Line Dice loss:     0.38-0.41 (fluctuating)
Divider Dice loss:       0.55-0.60 (slowly decreasing)
```

**3D detection**:
```
Heatmap Loss: 0.22-0.23
Bbox Loss:    0.29-0.32
Matched IOU:  0.615-0.623 (stable)
```

---

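## ⏱️ Estimated Time

### Current configuration
```
Start time:          2025-11-01 09:15 UTC
Epoch 1 progress:    28.5% (4400/15448)
Time per iteration:  2.66 s
Remaining iterations: 11,048

Epoch 1 remaining:   ~8.2 hours
Epoch 1 completes:   2025-11-01 20:30 UTC
10 epochs complete:  2025-11-10 20:00 UTC (9.5 days)
                     (at ~11.4 h/epoch, 9.5 days equals 20 epochs, the config's max_epochs; 10 epochs alone would be ~4.8 days)
```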
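
The estimates can be reproduced from the iteration time and the iteration counts. The sketch below uses only figures from this summary and ignores evaluation and checkpointing time; the helper `hours_for` is hypothetical.

```python
from datetime import datetime, timedelta

ITER_TIME_S = 2.66       # time per iteration
ITERS_PER_EPOCH = 15448  # iterations per epoch
ITERS_DONE = 4400        # progress at the time of the snapshot
START = datetime(2025, 11, 1, 9, 15)  # start time (UTC)


def hours_for(iterations: int) -> float:
    """Wall-clock hours for a number of iterations at the measured rate."""
    return iterations * ITER_TIME_S / 3600.0


remaining_iters = ITERS_PER_EPOCH - ITERS_DONE
remaining_h = hours_for(remaining_iters)
epoch_h = hours_for(ITERS_PER_EPOCH)

days_10_epochs = 10 * epoch_h / 24.0
days_20_epochs = 20 * epoch_h / 24.0
epoch1_done = START + timedelta(hours=epoch_h)

print(f"remaining in epoch 1: {remaining_iters} iters, {remaining_h:.1f} h")
print(f"one epoch: {epoch_h:.1f} h, epoch 1 done around {epoch1_done:%Y-%m-%d %H:%M} UTC")
print(f"10 epochs: {days_10_epochs:.1f} days, 20 epochs: {days_20_epochs:.1f} days")
```

At this rate the 9.5-day figure lines up with 20 epochs (the `max_epochs` value in the config), while 10 epochs would take roughly half as long.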

---

## 🔍 Key Configuration Comparison

### Current configuration vs. FP16 optimization plan

| Setting | Current (8-GPU FP32) | FP16 plan (projected) | Change |
|--------|----------------|----------|------|
| **GPU memory** | 29GB | 20GB | -31% |
| **Batch/GPU** | 1 | 4 | 4× |
| **Total batch** | 8 | 32 | 4× |
| **Workers** | 0 | 2 | +2 |
| **Learning rate** | 2e-5 | 4e-5 | 2× |
| **Iteration time** | 2.66s | ~2.0s | -25% |
| **Epoch time** | 11h | 7.5h | -32% |
| **10 epochs** | 9.5 days | 6.5 days | -32% |
| **Precision** | FP32 | FP16 | mixed |
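
The last column of the table and the learning-rate change can be checked with a few lines. The sketch below is illustrative only: `pct_change` and `scale_lr` are hypothetical helpers, and this summary does not state which scaling rule was used to pick 4e-5. It only shows that a 2× rate for a 4× batch coincides with square-root scaling rather than linear scaling.

```python
import math


def pct_change(current: float, planned: float) -> float:
    """Relative change from current to planned, in percent."""
    return (planned - current) / current * 100.0


def scale_lr(base_lr: float, base_batch: int, new_batch: int, rule: str = "linear") -> float:
    """Scale a learning rate with the global batch size using a common heuristic."""
    ratio = new_batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")


rows = {
    "GPU memory (GB)": (29.0, 20.0),
    "Iteration time (s)": (2.66, 2.0),
    "Epoch time (h)": (11.0, 7.5),
    "Total time (days)": (9.5, 6.5),
}
for name, (current, planned) in rows.items():
    print(f"{name:<20} {pct_change(current, planned):+.0f}%")

print(f"linear rule: {scale_lr(2e-5, 8, 32, 'linear'):.1e}")
print(f"sqrt rule:   {scale_lr(2e-5, 8, 32, 'sqrt'):.1e}")
```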

---

## 📂 File Locations

### Script and configuration
```
Training script: /workspace/bevfusion/START_FROM_EPOCH1.sh
Config file:     /workspace/bevfusion/configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml
```

### Data and outputs
```
Output directory:  /data/runs/phase4a_stage1/
Training log:      /workspace/bevfusion/phase4a_stage1_new_20251101_091503.log
Checkpoints:       /data/runs/phase4a_stage1/epoch_*.pth
Pretrained model:  /data/pretrained/swint-nuimages-pretrained.pth
Initial weights:   /data/runs/phase4a_stage1/epoch_1.pth
```

---

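## 🎯 Key Characteristics

### ✅ Strengths
1. **Stable run**: GPUs at 100% utilization, no OOM
2. **Memory headroom**: 3.5-4GB still free per GPU
3. **Loss decreasing**: converging normally
4. **Evaluation tuned**: interval=5, which avoids filling the disk

### ⚠️ Room for improvement
1. **Memory utilization**: only 88% in use, so there is room to optimize
2. **Batch size**: could be raised to 2-4
3. **FP16**: mixed-precision training is not enabled
4. **Checkpointing**: gradient checkpointing is not enabled
5. **Workers**: 0 workers; data loading takes 17% of each iteration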

---

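## 📌 Summary

**Current configuration**:
- ✅ 8×V100S-32GB at 100% load
- ✅ Batch=1/GPU, running stably
- ✅ Estimated 9.5 days to finish 10 epochs
- ⚠️ Room to optimize (FP16 + larger batch)

**Optimization potential**:
- FP16 mixed precision → saves 9GB of GPU memory
- Batch raised to 4 → 33% faster training
- Completion time cut to 6.5 days

---

*Generated: 2025-11-01 12:50 UTC*  
*Based on: phase4a_stage1_new_20251101_091503.log*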
Complete project state snapshot: Phase 4B RMT-PPAD Integration

🎯 Training Status:
- Current Epoch: 2/10 (13.3% complete)
- Segmentation Dice: 0.9594
- Detection IoU: 0.5742
- Training stable with 8 GPUs

🔧 Technical Achievements:
- ✅ RMT-PPAD Transformer segmentation decoder integrated
- ✅ Task-specific GCA architecture optimized
- ✅ Multi-scale feature fusion (180×180, 360×360, 600×600)
- ✅ Adaptive scale weight learning implemented
- ✅ BEVFusion multi-task framework enhanced

📊 Performance Highlights:
- Divider segmentation: 0.9793 Dice (excellent)
- Pedestrian crossing: 0.9812 Dice (excellent)
- Stop line: 0.9812 Dice (excellent)
- Carpark area: 0.9802 Dice (excellent)
- Walkway: 0.9401 Dice (good)
- Drivable area: 0.8959 Dice (good)

🛠️ Code Changes Included:
- Enhanced BEVFusion model (bevfusion.py)
- RMT-PPAD integration modules (rmtppad_integration.py)
- Transformer segmentation head (enhanced_transformer.py)
- GCA module optimizations (gca.py)
- Configuration updates (Phase 4B configs)
- Training scripts and automation tools
- Comprehensive documentation and analysis reports

📅 Snapshot Date: Fri Nov 14 09:06:09 UTC 2025
📍 Environment: Docker container
🎯 Phase: RMT-PPAD Integration Complete

											
										
										
											2025-11-14 17:06:09 +08:00