# BEVFusion Multi-Node Multi-GPU Training Configuration Guide
**Version**: v1.0
**Last updated**: 2025-11-01
**Current support**: single node × 8 GPUs → multi-node scaling
---
## 📋 Table of Contents
1. [Current single-node setup](#current-single-node-setup)
2. [Multi-node architecture](#multi-node-architecture)
3. [Environment preparation](#environment-preparation)
4. [Option 1: multi-node training with torchpack](#option-1-multi-node-training-with-torchpack)
5. [Option 2: native PyTorch DDP](#option-2-native-pytorch-ddp)
6. [Network configuration](#network-configuration)
7. [Common issues](#common-issues)
---
## Current Single-Node Setup
### Existing configuration (single node, 8 GPUs)
```bash
# START_FROM_EPOCH1.sh
torchpack dist-run -np 8 python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
  --load_from /data/runs/phase4a_stage1/epoch_1.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0
```
**Characteristics**
- Single node (localhost)
- 8 processes (`-np 8`)
- One GPU per process
- Communication: inter-process shared memory
---
## Multi-Node Architecture
### Typical Setups
**2 nodes, 16 GPUs**
```
Node 1 (master): 8×V100S (GPU 0-7)
Node 2 (worker): 8×V100S (GPU 0-7)
Total: 16 GPUs
```
**4 nodes, 32 GPUs**
```
Node 1 (master): 8×V100S
Node 2 (worker): 8×V100S
Node 3 (worker): 8×V100S
Node 4 (worker): 8×V100S
Total: 32 GPUs
```
### Estimated Speedups
| Configuration | GPUs | Time per epoch | 10 epochs | Speedup |
|------|-------|-----------|-----------|--------|
| 1 node, 4 GPUs | 4 | 18 h | 18 days | 1.0× |
| 1 node, 8 GPUs | 8 | 11 h | 9.5 days | 1.7× |
| **2 nodes, 16 GPUs** | **16** | **~6 h** | **~5 days** | **3.0×** |
| **4 nodes, 32 GPUs** | **32** | **~3.5 h** | **~3 days** | **4.5×** |
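The speedups above are sub-linear. Dividing each estimated speedup by the ideal value (GPU count relative to the 4-GPU baseline) gives the scaling efficiency the table implies:

```shell
# Scaling efficiency implied by the table above:
# efficiency = estimated speedup / (GPU count / 4-GPU baseline)
awk 'BEGIN {
  printf "8 GPUs:  %.0f%%\n", 100 * 1.7 / (8  / 4)
  printf "16 GPUs: %.0f%%\n", 100 * 3.0 / (16 / 4)
  printf "32 GPUs: %.0f%%\n", 100 * 4.5 / (32 / 4)
}'
```

Efficiency drops as nodes are added because gradient all-reduce has to cross the network; this is why the 10 Gbps+ requirement below matters.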
---
## Environment Preparation
### 1. Hardware Requirements
**Per node**
```yaml
GPU: 8×V100S (or equivalent)
VRAM: 32 GB per GPU
Network: 10 Gbps+ (InfiniBand ideal)
Storage: shared storage, or an identical local dataset path
```
### 2. Network Requirements
**Must be satisfied**
- ✅ All nodes can ping each other
- ✅ The rendezvous port is open (e.g. 29500)
- ✅ Passwordless SSH (required by torchpack)
- ✅ Identical CUDA/PyTorch versions on all nodes
**Verify connectivity**
```bash
# Run from the master node
ping <worker_node_ip>
# Test passwordless SSH
ssh <worker_node_ip> "hostname"
# Check the port
nc -zv <worker_node_ip> 29500
```
### 3. Dataset Layout
**Option A: shared storage (recommended)**
```bash
# All nodes mount the same NFS/GPFS share
/data/nuscenes/   (same path on every node)
```
**Option B: local copies**
```bash
# Each node stores its own copy
# The dataset path must be identical on every node
node1: /data/nuscenes/
node2: /data/nuscenes/
```
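With option B it is easy for per-node copies to drift apart. A minimal sketch for detecting this (the helper name is my own; it hashes file names and sizes only, not contents, so it is fast but not content-proof):

```shell
#!/bin/bash
# dataset_fingerprint: hash the sorted list of relative paths and file sizes
# under a directory. Run it on every node and compare the printed hash;
# a mismatch means the copies differ.
dataset_fingerprint() {
  local root="$1"
  # GNU find: %P = path relative to $root, %s = size in bytes
  find "$root" -type f -printf '%P %s\n' | sort | md5sum | awk '{print $1}'
}

dataset_fingerprint /data/nuscenes
```

For example, `for h in node1 node2; do ssh $h 'bash fingerprint.sh'; done` should print the same hash twice.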
---
## Option 1: Multi-Node Training with torchpack
### Advantages
- ✅ Simple: a single command starts everything
- ✅ Automatically SSHes into each node to spawn processes
- ✅ Compatible with the existing scripts
### Setup Steps
#### 1. Passwordless SSH
```bash
# Run on the master node
ssh-keygen -t rsa   # accept all defaults
# Copy the public key to every worker node
ssh-copy-id root@node2
ssh-copy-id root@node3
ssh-copy-id root@node4
# Verify
ssh node2 "nvidia-smi"
```
#### 2. Create the Host File
```bash
# Method 1: IP addresses
cat > /workspace/bevfusion/hosts.txt << 'EOF'
192.168.1.101:8 # node1 (master) - 8 GPUs
192.168.1.102:8 # node2 (worker) - 8 GPUs
EOF
# Method 2: hostnames (requires matching /etc/hosts entries)
cat > /workspace/bevfusion/hosts.txt << 'EOF'
node1:8
node2:8
EOF
```
#### 3. Multi-Node Training Script
```bash
#!/bin/bash
# START_MULTINODE_TRAINING.sh
set -e
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
cd /workspace/bevfusion

echo "========================================================================"
echo "Phase 4A Stage 1: multi-node training (2 nodes × 8 GPUs = 16 GPUs)"
echo "========================================================================"
echo "Node configuration:"
echo "  - node1 (master): 192.168.1.101 - 8×V100S"
echo "  - node2 (worker): 192.168.1.102 - 8×V100S"
echo "========================================================================"

# Node definitions
MASTER_ADDR="192.168.1.101"
NODE1="192.168.1.101:8"
NODE2="192.168.1.102:8"
TOTAL_GPUS=16
LOG_FILE="phase4a_stage1_multinode_$(date +%Y%m%d_%H%M%S).log"

# Multi-node launch via torchpack
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run \
  -np ${TOTAL_GPUS} \
  -H ${NODE1},${NODE2} \
  /opt/conda/bin/python tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
  --load_from /data/runs/phase4a_stage1/epoch_1.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0 \
  2>&1 | tee "$LOG_FILE"

echo "Training finished!"
```
**Key parameters**
```bash
-np 16               # total process count = total GPU count
-H node1:8,node2:8   # node list, format hostname:gpu_count
```
---
## Option 2: Native PyTorch DDP
### Advantages
- ✅ Finer-grained control
- ✅ No dependency on torchpack's SSH mechanism
- ✅ Suited to complex network environments
### Setup Steps
#### 1. Master-Node Launch Script
```bash
#!/bin/bash
# MASTER_NODE_TRAIN.sh
export MASTER_ADDR="192.168.1.101"  # master node IP
export MASTER_PORT="29500"
export WORLD_SIZE=16                # total GPU count
cd /workspace/bevfusion
mkdir -p logs
# Launch 8 processes, one per master-node GPU
for i in {0..7}; do
  export RANK=$i        # master node covers global ranks 0-7
  export LOCAL_RANK=$i
  CUDA_VISIBLE_DEVICES=$i python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --data.samples_per_gpu 1 \
    --data.workers_per_gpu 0 \
    2>&1 | tee "logs/master_gpu${i}.log" &
done
wait
```
#### 2. Worker-Node Launch Script
```bash
#!/bin/bash
# WORKER_NODE_TRAIN.sh
export MASTER_ADDR="192.168.1.101"  # master node IP
export MASTER_PORT="29500"
export WORLD_SIZE=16                # total GPU count
export NODE_RANK=1                  # worker node index: 1, 2, 3, ...
cd /workspace/bevfusion
mkdir -p logs
# Launch 8 processes, one per worker-node GPU
for i in {0..7}; do
  # export (not plain assignment) -- the training process must see these
  export RANK=$((NODE_RANK * 8 + i))  # global rank = node rank*8 + local GPU
  export LOCAL_RANK=$i
  CUDA_VISIBLE_DEVICES=$i python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --data.samples_per_gpu 1 \
    --data.workers_per_gpu 0 \
    2>&1 | tee "logs/worker${NODE_RANK}_gpu${i}.log" &
done
wait
```
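The global-rank arithmetic in the worker loop is worth sanity-checking: with 8 GPUs per node, worker node 1 must cover global ranks 8-15 so that, together with the master's ranks 0-7, all of `WORLD_SIZE=16` is accounted for exactly once. A quick check of the boundary ranks:

```shell
# global_rank = NODE_RANK * GPUS_PER_NODE + local_gpu_index
GPUS_PER_NODE=8
for NODE_RANK in 0 1; do
  for i in 0 7; do   # check the boundary local ranks 0 and 7
    echo "node ${NODE_RANK}, gpu ${i} -> global rank $((NODE_RANK * GPUS_PER_NODE + i))"
  done
done
```

This prints ranks 0, 7, 8, 15: no gaps and no overlap between the two nodes.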
#### 3. Launch Procedure
```bash
# 1. Start on the master node
ssh node1
bash MASTER_NODE_TRAIN.sh &
# 2. Start on each worker node
ssh node2
bash WORKER_NODE_TRAIN.sh &
# 3. Monitor
watch -n 5 'nvidia-smi'
```
---
## Option 3: torchrun (Recommended, PyTorch 1.9+)
### The Simplest Multi-Node Launcher
#### Master Node
```bash
#!/bin/bash
# Run on the master node
export MASTER_ADDR="192.168.1.101"
export MASTER_PORT="29500"
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=0 \
  --master_addr=${MASTER_ADDR} \
  --master_port=${MASTER_PORT} \
  tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
  --load_from /data/runs/phase4a_stage1/epoch_1.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0
```
#### Worker Node
```bash
#!/bin/bash
# Run on each worker node
export MASTER_ADDR="192.168.1.101"
export MASTER_PORT="29500"
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=1 \
  --master_addr=${MASTER_ADDR} \
  --master_port=${MASTER_PORT} \
  tools/train.py \
  configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
  --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
  --load_from /data/runs/phase4a_stage1/epoch_1.pth \
  --data.samples_per_gpu 1 \
  --data.workers_per_gpu 0
```
---
## Network Configuration
### 1. Firewall Rules
```bash
# Run on every node
# Open the master port (note: NCCL may also open additional ephemeral
# ports between nodes, so a permissive intra-cluster policy is simplest)
firewall-cmd --permanent --add-port=29500/tcp
firewall-cmd --reload
# Or temporarily disable the firewall (testing only)
systemctl stop firewalld
```
### 2. /etc/hosts
```bash
# Add to /etc/hosts on every node
192.168.1.101 node1 master
192.168.1.102 node2 worker1
192.168.1.103 node3 worker2
192.168.1.104 node4 worker3
```
### 3. Network Bandwidth Test
```bash
# Install iperf3
apt-get install iperf3
# Start the server on node1
iperf3 -s
# Run the test from node2
iperf3 -c node1 -t 10
# Expected: >1 Gbps (a 10 Gbps network is ideal)
```
---
## Common Issues
### 1. SSH Connection Fails
**Symptom**: `ssh: connect to host node2 port 22: Connection refused`
**Fix**:
```bash
# Check the SSH service
systemctl status sshd
# Start SSH
systemctl start sshd
# Re-run the passwordless setup
ssh-copy-id root@node2
```
### 2. NCCL Initialization Timeout
**Symptom**: `NCCL timeout in init`
**Fix**:
```bash
# Allow a longer timeout
export NCCL_SOCKET_TIMEOUT=3600
export NCCL_TIMEOUT=3600
# Enable NCCL debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
```
### 3. Inconsistent Paths Across Nodes
**Symptom**: `FileNotFoundError: /data/nuscenes/...`
**Fix**:
```bash
# Option 1: unify the paths (recommended)
# Use the same absolute path on every node
# Option 2: symlink
ln -s /mnt/dataset/nuscenes /data/nuscenes
```
### 4. GPU Communication Errors
**Symptom**: `RuntimeError: NCCL error`
**Fix**:
```bash
# Check the NCCL version
python -c "import torch; print(torch.cuda.nccl.version())"
# Make sure P2P and InfiniBand transports are enabled (0 = enabled)
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=0  # if InfiniBand is available
```
### 5. Unbalanced GPU Memory Usage
**Symptom**: the master node uses more GPU memory than the workers
**Cause**: logging and checkpoint saving both happen on the master node
**Fix**:
```yaml
# In the config file
checkpoint_config:
  interval: 1
  max_keep_ckpts: 2  # cap the number of checkpoints kept
```
---
## Monitoring and Debugging
### 1. Live Monitoring
```bash
# Watch the GPUs on every node
while true; do
  echo "=== Node1 ==="
  ssh node1 "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv"
  echo "=== Node2 ==="
  ssh node2 "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv"
  sleep 5
done
```
### 2. Distributed Logs
```bash
# Follow the master log
tail -f phase4a_stage1_multinode_*.log | grep "Epoch"
# Follow per-node logs
ssh node1 "tail -f /workspace/bevfusion/logs/master_gpu0.log"
ssh node2 "tail -f /workspace/bevfusion/logs/worker1_gpu0.log"
```
### 3. NCCL Performance Analysis
```bash
# Enable NCCL debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL,P2P,NET
# Run training and inspect the communication pattern.
# The output shows:
# - Ring, tree, or direct communication
# - Bandwidth utilization
# - P2P/IB status
```
---
## Performance Tuning Tips
### 1. Batch Size
```yaml
# Multi-node training allows a larger batch size
data:
  samples_per_gpu: 2  # up from 1
  workers_per_gpu: 2  # up from 0
```
### 2. Gradient Accumulation
```yaml
# With batch size 1 per GPU, gradient accumulation can emulate a larger batch
optimizer_config:
  type: GradientCumulativeOptimizerHook
  cumulative_iters: 4  # accumulate 4 steps = effective batch 4 per GPU
```
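Assuming the 2-node setup and the values above, the effective batch size works out as follows (simple arithmetic, not a measured figure):

```shell
# effective batch = samples_per_gpu x total GPUs x cumulative_iters
SAMPLES_PER_GPU=1
TOTAL_GPUS=16
CUMULATIVE_ITERS=4
echo "effective batch size: $((SAMPLES_PER_GPU * TOTAL_GPUS * CUMULATIVE_ITERS))"
# -> effective batch size: 64
```

Keep this in mind when tuning the learning rate: a larger effective batch usually calls for a proportionally adjusted schedule.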
### 3. Mixed-Precision Training
```yaml
fp16:
  loss_scale: dynamic
```
### 4. Communication Tuning
```bash
# InfiniBand
export NCCL_IB_HCA=mlx5_0,mlx5_1
export NCCL_IB_GID_INDEX=3
# Ethernet
export NCCL_SOCKET_IFNAME=eth0
export NCCL_NET_GDR_LEVEL=5
```
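Pointing `NCCL_SOCKET_IFNAME` at the wrong interface is a common source of hangs when nodes have several NICs. A hedged sketch (the helper is my own) that picks the interface carrying the default route by parsing `ip route` output:

```shell
# default_iface: print the device name from an `ip route show default` line,
# e.g. "default via 10.0.0.1 dev eth0 proto dhcp" -> "eth0"
default_iface() {
  awk '/^default/ {print $5; exit}'
}

export NCCL_SOCKET_IFNAME="$(ip route show default | default_iface)"
echo "NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME}"
```

On multi-homed nodes, prefer the interface on the fastest inter-node network even if it does not carry the default route, so treat this as a starting point rather than a rule.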
---
## Configuration Templates
### 2 Nodes, 16 GPUs (torchpack, recommended)
```bash
torchpack dist-run \
  -np 16 \
  -H node1:8,node2:8 \
  python tools/train.py config.yaml
```
### 4 Nodes, 32 GPUs (torchrun, recommended)
```bash
# Run on every node, changing --node_rank
torchrun \
  --nnodes=4 \
  --nproc_per_node=8 \
  --node_rank=<0,1,2,3> \
  --master_addr=node1 \
  --master_port=29500 \
  tools/train.py config.yaml
```
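To avoid hand-editing `--node_rank` on every machine, one hypothetical approach (node names assumed; not part of the original scripts) is deriving it from the host's position in a fixed node list, so the same launch script can be copied verbatim to all nodes:

```shell
# node_rank_for: print the index of a hostname in the NODES array, which is
# exactly the --node_rank value torchrun expects. Assumed names node1..node4.
NODES=(node1 node2 node3 node4)
node_rank_for() {
  local host="$1" i
  for i in "${!NODES[@]}"; do
    if [ "${NODES[$i]}" = "$host" ]; then
      echo "$i"
      return 0
    fi
  done
  return 1   # hostname not in the list: refuse to launch
}

node_rank_for node3   # -> 2; in a real script: node_rank_for "$(hostname -s)"
```

The failure branch matters: launching with a wrong or duplicate `--node_rank` makes the rendezvous hang rather than error out.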
---
## Expected Performance (Phase 4A Stage 1)
### 2 Nodes, 16 GPUs
```
Training speed: ~6 h/epoch
10 epochs: ~5 days
Speedup: 3.0× vs 1 node with 4 GPUs
Total training time saved: 13 days
```
### 4 Nodes, 32 GPUs
```
Training speed: ~3.5 h/epoch
10 epochs: ~3 days
Speedup: 4.5× vs 1 node with 4 GPUs
Total training time saved: 15 days
```
---
## Next Steps
1. ✅ Set up passwordless SSH
2. ✅ Verify network connectivity
3. ✅ Unify the dataset paths
4. ✅ Pick a launcher (torchpack/torchrun)
5. ✅ Write the launch scripts
6. ✅ Run a small-scale test (2 epochs)
7. ✅ Run the full training
---
**For help configuring a specific environment, please provide:**
- Number of nodes and their IPs
- GPUs per node
- Network type (10 Gbps Ethernet / InfiniBand)
- Storage setup (NFS / local)
---
*Document version: 1.0*
*Last updated: 2025-11-01*
*Applies to: BEVFusion Phase 4A and later phases*