# BEVFusion Multi-Node Multi-GPU Training Guide

**Version**: v1.0

**Last updated**: 2025-11-01

**Current support**: single node with 8 GPUs → multi-node, multi-GPU scaling

---

## 📋 Table of Contents

1. [Current Single-Node Setup](#current-single-node-setup)
2. [Multi-Node Architecture](#multi-node-architecture)
3. [Environment Preparation](#environment-preparation)
4. [Option 1: Multi-Node Training with torchpack](#option-1-multi-node-training-with-torchpack)
5. [Option 2: Native PyTorch DDP](#option-2-native-pytorch-ddp)
6. [Network Configuration](#network-configuration)
7. [FAQ](#faq)

---
## Current Single-Node Setup

### Existing configuration (single node, 8 GPUs)

```bash
# START_FROM_EPOCH1.sh
torchpack dist-run -np 8 python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --data.samples_per_gpu 1 \
    --data.workers_per_gpu 0
```

**Characteristics**:
- Single node: localhost
- 8 processes: `-np 8`
- 1 GPU per process
- Communication: shared memory between local processes

---
## Multi-Node Architecture

### Typical setups

**2 nodes, 16 GPUs**
```
Node 1 (master): 8×V100S (GPU 0-7)
Node 2 (worker): 8×V100S (GPU 0-7)
Total: 16 GPUs
```

**4 nodes, 32 GPUs**
```
Node 1 (master): 8×V100S
Node 2 (worker): 8×V100S
Node 3 (worker): 8×V100S
Node 4 (worker): 8×V100S
Total: 32 GPUs
```

### Estimated speedup

| Setup | GPUs | Time per epoch | 10 epochs | Speedup |
|-------|------|----------------|-----------|---------|
| 1 node, 4 GPUs | 4 | 18 h | 18 days | 1.0× |
| 1 node, 8 GPUs | 8 | 11 h | 9.5 days | 1.7× |
| **2 nodes, 16 GPUs** | **16** | **~6 h** | **~5 days** | **3.0×** |
| **4 nodes, 32 GPUs** | **32** | **~3.5 h** | **~3 days** | **4.5×** |

---
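As the GPU count grows, the global batch size grows with it (global batch = `samples_per_gpu` × total GPUs), and learning-rate schedules are commonly adjusted with the linear scaling rule. A minimal sketch of that arithmetic; the base learning rate and batch values below are illustrative assumptions, not taken from the training config:

```python
# Linear scaling rule sketch: global batch size and scaled learning rate
# when moving from a single 8-GPU node to multi-node setups.
# base_lr and samples_per_gpu here are illustrative placeholders.

def scaled_schedule(base_lr, samples_per_gpu, base_gpus, new_gpus):
    """Return (global_batch, scaled_lr) for the new GPU count."""
    global_batch = samples_per_gpu * new_gpus
    scaled_lr = base_lr * new_gpus / base_gpus  # linear scaling rule
    return global_batch, scaled_lr

if __name__ == "__main__":
    for gpus in (8, 16, 32):
        batch, lr = scaled_schedule(base_lr=1e-4, samples_per_gpu=1,
                                    base_gpus=8, new_gpus=gpus)
        print(f"{gpus} GPUs -> global batch {batch}, lr {lr:.1e}")
```

Whether to actually rescale the learning rate depends on the optimizer and warmup settings in the config; treat this as a starting point, not a rule.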
## Environment Preparation

### 1. Hardware requirements

**Per node**:
```yaml
GPU: 8×V100S (or equivalent)
VRAM: 32 GB per GPU
Network: 10 Gbps+ (InfiniBand is best)
Storage: shared storage, or an identical dataset path on every node
```

### 2. Network requirements

**Prerequisites**:
- ✅ All nodes can ping each other
- ✅ The rendezvous port is open (e.g. 29500)
- ✅ Passwordless SSH login (required by torchpack)
- ✅ Identical CUDA/PyTorch versions on all nodes

**Verify the network**:
```bash
# Run on the master node
ping <worker_node_ip>

# Test passwordless SSH
ssh <worker_node_ip> "hostname"

# Check the port
nc -zv <worker_node_ip> 29500
```

### 3. Dataset layout

**Option A: shared storage (recommended)**
```bash
# All nodes mount the same NFS/GPFS share
/data/nuscenes/    # identical path on every node
```

**Option B: local copies**
```bash
# Each node stores its own copy
# Make sure the dataset path is exactly the same everywhere
node1: /data/nuscenes/
node2: /data/nuscenes/
```

---
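The per-node checks above can be consolidated into one loop over the workers. A dry-run sketch, assuming the example hostname `node2` from this guide; it only prints the commands it would run, so you can pipe or copy them into a real shell once SSH is configured:

```shell
#!/bin/bash
# Dry-run cluster sanity check built from the verification steps above.
# Prints the commands to run against each worker node; execute them once
# passwordless SSH is in place. Hostnames/port match this guide's examples.

check_node() {   # usage: check_node <hostname> [port]
  local node=$1 port=${2:-29500}
  echo "=== $node ==="
  echo "ssh $node hostname"        # passwordless SSH works
  echo "ssh $node nvidia-smi -L"   # GPUs visible on the node
  echo "ssh $node python -c 'import torch; print(torch.__version__)'"  # same PyTorch
  echo "nc -zv $node $port"        # rendezvous port open
}

for node in node2; do
  check_node "$node" 29500
done
```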
## Option 1: Multi-Node Training with torchpack

### Pros
- ✅ Simple: a single command starts everything
- ✅ Automatically SSHes into each node to launch processes
- ✅ Compatible with the existing scripts

### Setup steps

#### 1. Passwordless SSH

```bash
# Run on the master node
ssh-keygen -t rsa  # accept all defaults

# Copy the public key to every worker node
ssh-copy-id root@node2
ssh-copy-id root@node3
ssh-copy-id root@node4

# Verify
ssh node2 "nvidia-smi"
```

#### 2. Create a host file

```bash
# Method 1: IP addresses
cat > /workspace/bevfusion/hosts.txt << 'EOF'
192.168.1.101:8  # node1 (master) - 8 GPUs
192.168.1.102:8  # node2 (worker) - 8 GPUs
EOF

# Method 2: hostnames (requires /etc/hosts entries)
cat > /workspace/bevfusion/hosts.txt << 'EOF'
node1:8
node2:8
EOF
```

#### 3. Multi-node training script

```bash
#!/bin/bash
# START_MULTINODE_TRAINING.sh

set -e

export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH

cd /workspace/bevfusion

echo "========================================================================"
echo "Phase 4A Stage 1: multi-node training (2 nodes × 8 GPUs = 16 GPUs)"
echo "========================================================================"
echo "Node layout:"
echo "  - node1 (master): 192.168.1.101 - 8×V100S"
echo "  - node2 (worker): 192.168.1.102 - 8×V100S"
echo "========================================================================"

# Node definitions
MASTER_ADDR="192.168.1.101"
NODE1="192.168.1.101:8"
NODE2="192.168.1.102:8"
TOTAL_GPUS=16

LOG_FILE="phase4a_stage1_multinode_$(date +%Y%m%d_%H%M%S).log"

# torchpack multi-node launch
LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH \
PATH=/opt/conda/bin:$PATH \
PYTHONPATH=/workspace/bevfusion:$PYTHONPATH \
/opt/conda/bin/torchpack dist-run \
    -np ${TOTAL_GPUS} \
    -H ${NODE1},${NODE2} \
    /opt/conda/bin/python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --data.samples_per_gpu 1 \
    --data.workers_per_gpu 0 \
    2>&1 | tee "$LOG_FILE"

echo "Training finished!"
```

**Key parameters**:
```bash
-np 16              # total number of processes = total GPU count
-H node1:8,node2:8  # node list, format: hostname:gpu_count
```

---
## Option 2: Native PyTorch DDP

### Pros
- ✅ Finer-grained control
- ✅ No dependency on torchpack's SSH mechanism
- ✅ Suited to complicated network environments

### Setup steps

#### 1. Master node launch script

```bash
#!/bin/bash
# MASTER_NODE_TRAIN.sh

export MASTER_ADDR="192.168.1.101"  # master node IP
export MASTER_PORT="29500"
export WORLD_SIZE=16                # total GPU count

cd /workspace/bevfusion
mkdir -p logs

# Launch 8 processes (one per GPU on the master node); ranks 0-7 live here.
# RANK/LOCAL_RANK are set per process on the command line so each child
# gets its own value (a plain shell assignment would not be inherited).
for i in {0..7}; do
  RANK=$i LOCAL_RANK=$i CUDA_VISIBLE_DEVICES=$i python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --data.samples_per_gpu 1 \
    --data.workers_per_gpu 0 \
    2>&1 | tee "logs/master_gpu${i}.log" &
done

wait
```

#### 2. Worker node launch script

```bash
#!/bin/bash
# WORKER_NODE_TRAIN.sh

export MASTER_ADDR="192.168.1.101"  # master node IP
export MASTER_PORT="29500"
export WORLD_SIZE=16                # total GPU count
NODE_RANK=1                         # worker node index (1, 2, 3, ...)

cd /workspace/bevfusion
mkdir -p logs

# Launch 8 processes (one per GPU on the worker node)
for i in {0..7}; do
  # global rank = node rank × 8 + local GPU index
  RANK=$((NODE_RANK * 8 + i)) LOCAL_RANK=$i CUDA_VISIBLE_DEVICES=$i python tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --data.samples_per_gpu 1 \
    --data.workers_per_gpu 0 \
    2>&1 | tee "logs/worker${NODE_RANK}_gpu${i}.log" &
done

wait
```

#### 3. Launch sequence

```bash
# 1. Start on the master node
ssh node1
bash MASTER_NODE_TRAIN.sh &

# 2. Start on the worker node
ssh node2
bash WORKER_NODE_TRAIN.sh &

# 3. Monitor
watch -n 5 'nvidia-smi'
```

---
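The rank bookkeeping in the two scripts above follows a single formula: global rank = node rank × GPUs per node + local rank, with rank 0 always on the master. A quick sketch of that mapping and the invariant it must satisfy:

```python
# Global-rank mapping used by the master/worker scripts above:
# ranks 0..7 on node 0 (master), ranks 8..15 on node 1, and so on.

def global_rank(node_rank: int, local_rank: int, gpus_per_node: int = 8) -> int:
    return node_rank * gpus_per_node + local_rank

# Every (node, GPU) pair must map to a unique rank in [0, world_size).
world_size = 16
ranks = {global_rank(n, g) for n in range(2) for g in range(8)}
assert ranks == set(range(world_size))
```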
## Option 3: torchrun (Recommended for PyTorch 1.9+)

### The simplest multi-node launcher

#### Master node
```bash
#!/bin/bash
# Run on the master node

export MASTER_ADDR="192.168.1.101"
export MASTER_PORT="29500"

torchrun \
    --nnodes=2 \
    --nproc_per_node=8 \
    --node_rank=0 \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --data.samples_per_gpu 1 \
    --data.workers_per_gpu 0
```

#### Worker node
```bash
#!/bin/bash
# Run on the worker node

export MASTER_ADDR="192.168.1.101"
export MASTER_PORT="29500"

torchrun \
    --nnodes=2 \
    --nproc_per_node=8 \
    --node_rank=1 \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    tools/train.py \
    configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/multitask_BEV2X_phase4a_stage1.yaml \
    --model.encoders.camera.backbone.init_cfg.checkpoint /data/pretrained/swint-nuimages-pretrained.pth \
    --load_from /data/runs/phase4a_stage1/epoch_1.pth \
    --data.samples_per_gpu 1 \
    --data.workers_per_gpu 0
```

---
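Unlike the manual DDP scripts, torchrun computes the ranks itself and exports the standard rendezvous environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) into every worker process, so the training entry point only has to read them. A minimal, framework-free sketch of that parsing (the fallback defaults are illustrative single-process values):

```python
import os
from dataclasses import dataclass

@dataclass
class DistInfo:
    rank: int
    local_rank: int
    world_size: int
    master_addr: str
    master_port: int

def dist_info_from_env(env=os.environ) -> DistInfo:
    """Read the variables torchrun exports; fall back to single-process defaults."""
    return DistInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
        master_addr=env.get("MASTER_ADDR", "127.0.0.1"),
        master_port=int(env.get("MASTER_PORT", 29500)),
    )
```

In a real training script these values would be passed to `torch.distributed.init_process_group`; the sketch stops at the parsing step.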
## Network Configuration

### 1. Firewall rules

```bash
# Run on every node
# Open the master port
firewall-cmd --permanent --add-port=29500/tcp
firewall-cmd --reload

# Or temporarily disable the firewall (testing only)
systemctl stop firewalld
```

### 2. /etc/hosts

```bash
# Add on every node's /etc/hosts
192.168.1.101 node1 master
192.168.1.102 node2 worker1
192.168.1.103 node3 worker2
192.168.1.104 node4 worker3
```

### 3. Bandwidth test

```bash
# Install iperf3
apt-get install iperf3

# Start the server on node1
iperf3 -s

# Test from node2
iperf3 -c node1 -t 10
# Expect: >1 Gbps (a 10 Gbps network is ideal)
```

---
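Bandwidth matters because every training step all-reduces the full gradient across nodes. With ring all-reduce, each GPU sends and receives roughly 2·(N−1)/N × model_size bytes per step, so a lower bound on per-step communication time is that traffic divided by the link bandwidth. A back-of-the-envelope sketch; the model size and bandwidth are illustrative assumptions, not BEVFusion measurements:

```python
# Rough ring all-reduce communication-time estimate per training step.
# The numbers in __main__ are placeholders; substitute your own.

def allreduce_seconds(model_bytes: float, n_workers: int,
                      bandwidth_bytes_per_s: float) -> float:
    """Lower-bound time for one ring all-reduce of the gradients."""
    # Per-GPU bytes on the wire for ring all-reduce: 2 * (N-1)/N * payload
    traffic = 2 * (n_workers - 1) / n_workers * model_bytes
    return traffic / bandwidth_bytes_per_s

if __name__ == "__main__":
    model_bytes = 160e6      # ~40M fp32 parameters (illustrative)
    ten_gbps = 10e9 / 8      # 10 Gbps expressed in bytes/s
    t = allreduce_seconds(model_bytes, n_workers=16,
                          bandwidth_bytes_per_s=ten_gbps)
    print(f"~{t * 1000:.0f} ms of communication per step on a 10 Gbps link")
```

If this estimate is comparable to the per-step compute time, the network is the bottleneck and InfiniBand (or gradient compression/accumulation) pays off.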
## FAQ

### 1. SSH connection failure

**Symptom**: `ssh: connect to host node2 port 22: Connection refused`

**Fix**:
```bash
# Check the SSH service
systemctl status sshd

# Start SSH
systemctl start sshd

# Re-run the passwordless setup
ssh-copy-id root@node2
```

### 2. NCCL initialization timeout

**Symptom**: `NCCL timeout in init`

**Fix**:
```bash
# Allow a longer timeout
export NCCL_SOCKET_TIMEOUT=3600
export NCCL_TIMEOUT=3600

# Turn on NCCL debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
```

### 3. Inconsistent paths across nodes

**Symptom**: `FileNotFoundError: /data/nuscenes/...`

**Fix**:
```bash
# Option 1: unify the paths (recommended)
# Use the same absolute path on every node

# Option 2: symlink
ln -s /mnt/dataset/nuscenes /data/nuscenes
```

### 4. GPU communication errors

**Symptom**: `RuntimeError: NCCL error`

**Fix**:
```bash
# Check the NCCL version
python -c "import torch; print(torch.cuda.nccl.version())"

# Enable NCCL P2P
export NCCL_P2P_DISABLE=0
export NCCL_IB_DISABLE=0  # if InfiniBand is available
```

### 5. Uneven GPU memory usage

**Symptom**: the master node uses more GPU memory than the workers

**Cause**: logging and checkpoint saving both happen on the master node

**Fix**:
```yaml
# In the config file
checkpoint_config:
  interval: 1
  max_keep_ckpts: 2  # cap the number of saved checkpoints
```

---
## Monitoring and Debugging

### 1. Live monitoring

```bash
# Monitor GPUs on all nodes
while true; do
  echo "=== Node1 ==="
  ssh node1 "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv"
  echo "=== Node2 ==="
  ssh node2 "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv"
  sleep 5
done
```

### 2. Distributed logs

```bash
# Follow the master log
tail -f phase4a_stage1_multinode_*.log | grep "Epoch"

# Follow per-node logs
ssh node1 "tail -f /workspace/bevfusion/logs/master_gpu0.log"
ssh node2 "tail -f /workspace/bevfusion/logs/worker1_gpu0.log"
```

### 3. NCCL performance analysis

```bash
# Enable NCCL debugging
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL,P2P,NET

# Run training and inspect the communication pattern.
# The output shows:
# - Ring, Tree, or Direct communication
# - bandwidth utilization
# - P2P/IB status
```

---
## Performance Tuning Tips

### 1. Batch size

```yaml
# Multi-node training leaves headroom for a larger batch size
data:
  samples_per_gpu: 2  # up from 1
  workers_per_gpu: 2  # up from 0
```

### 2. Gradient accumulation

```yaml
# With batch=1 per GPU, gradient accumulation can emulate a larger batch
optimizer_config:
  type: GradientCumulativeOptimizerHook
  cumulative_iters: 4  # accumulate 4 steps = effective batch 4 per GPU
```
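Per-GPU batch size, GPU count, and accumulation steps multiply into one effective global batch, which is the number that matters for learning-rate tuning. A quick sketch of that bookkeeping:

```python
# Effective global batch = samples_per_gpu × total GPUs × accumulation steps.

def effective_batch(samples_per_gpu: int, total_gpus: int,
                    cumulative_iters: int = 1) -> int:
    return samples_per_gpu * total_gpus * cumulative_iters

# e.g. 1 sample/GPU on 2 nodes × 8 GPUs with 4-step accumulation
print(effective_batch(1, 16, 4))  # prints 64
```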
### 3. Mixed-precision training

```yaml
fp16:
  loss_scale: dynamic
```

### 4. Communication tuning

```bash
# InfiniBand
export NCCL_IB_HCA=mlx5_0,mlx5_1
export NCCL_IB_GID_INDEX=3

# Ethernet
export NCCL_SOCKET_IFNAME=eth0
export NCCL_NET_GDR_LEVEL=5
```

---
## Configuration Template Summary

### 2 nodes, 16 GPUs (torchpack recommended)

```bash
torchpack dist-run \
    -np 16 \
    -H node1:8,node2:8 \
    python tools/train.py config.yaml
```

### 4 nodes, 32 GPUs (torchrun recommended)

```bash
# Run on every node, changing --node_rank
torchrun \
    --nnodes=4 \
    --nproc_per_node=8 \
    --node_rank=<0,1,2,3> \
    --master_addr=node1 \
    --master_port=29500 \
    tools/train.py config.yaml
```

---
## Expected Performance (Phase 4A Stage 1)

### 2 nodes, 16 GPUs
```
Training speed: ~6 h/epoch
10 epochs: ~5 days
Speedup: 3.0× vs 1 node, 4 GPUs
Total training time saved: 13 days
```

### 4 nodes, 32 GPUs
```
Training speed: ~3.5 h/epoch
10 epochs: ~3 days
Speedup: 4.5× vs 1 node, 4 GPUs
Total training time saved: 15 days
```

---
## Next Steps

1. ✅ Set up passwordless SSH
2. ✅ Verify network connectivity
3. ✅ Unify dataset paths
4. ✅ Pick a launcher (torchpack/torchrun)
5. ✅ Write the launch scripts
6. ✅ Small-scale test run (2 epochs)
7. ✅ Full training run

---

**For help configuring a specific environment, please provide**:
- Number of nodes and their IPs
- GPUs per node
- Network type (10 Gbps Ethernet / InfiniBand)
- Storage setup (NFS / local)

---

*Document version: 1.0*
*Last updated: 2025-11-01*
*Applies to: BEVFusion Phase 4A and later stages*