================================================================================
BEVFusion Multi-Node Multi-GPU Training - Quick Reference
================================================================================

【Supported Configurations】
✓ Single node:  8 GPUs (current configuration)
✓ 2 nodes:     16 GPUs (~3.0× speedup, finishes in ~5 days)
✓ 4 nodes:     32 GPUs (~4.5× speedup, finishes in ~3 days)

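The speedups above come from scaling the effective batch size with the total GPU count; the learning rate is then usually scaled in proportion (the linear-scaling rule). A minimal sketch of that arithmetic — the per-GPU batch size of 4 and base LR are illustrative assumptions, not values taken from the BEVFusion config:

```python
def scaled_hparams(num_nodes, gpus_per_node, batch_per_gpu=4,
                   base_lr=1.0e-4, base_gpus=8):
    """Compute total GPUs, effective batch size, and linearly scaled LR.

    The LR grows with the ratio of total GPUs to the single-node
    baseline (linear-scaling rule); values here are illustrative.
    """
    total_gpus = num_nodes * gpus_per_node
    effective_batch = total_gpus * batch_per_gpu
    lr = base_lr * total_gpus / base_gpus
    return total_gpus, effective_batch, lr

# 2 nodes x 8 GPUs: 16 GPUs total, effective batch 64, LR doubled
print(scaled_hparams(2, 8))  # (16, 64, 0.0002)
```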
【Prerequisites】
1. Passwordless SSH login
   ssh-keygen -t rsa
   ssh-copy-id root@node2

||
2. Network connectivity between nodes
   ping node2
   nc -zv node2 29500

||
3. Identical dataset path on every node
   All nodes: /data/nuscenes/

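The `nc -zv` port check above can also be scripted; a minimal Python sketch of the same TCP reachability test (run it against whatever master host and port your cluster uses):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_open("node2", 29500) before launching training
```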
||
【Launch Methods】

||
Option 1: torchpack (recommended, simple)
─────────────────────────────────────
# Edit the IPs in START_MULTINODE_TRAINING.sh
MASTER_ADDR="192.168.1.101"
WORKER1_ADDR="192.168.1.102"

# Launch
bash START_MULTINODE_TRAINING.sh

||
Option 2: torchrun (flexible)
─────────────────────────────────────
# Master node
torchrun --nnodes=2 --nproc_per_node=8 \
    --node_rank=0 --master_addr=192.168.1.101 --master_port=29500 \
    tools/train.py config.yaml

# Worker node
torchrun --nnodes=2 --nproc_per_node=8 \
    --node_rank=1 --master_addr=192.168.1.101 --master_port=29500 \
    tools/train.py config.yaml

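torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT into each worker process's environment; a training script typically reads these before calling torch.distributed.init_process_group. A minimal sketch of that parsing step (torch itself is left out so the snippet stays self-contained):

```python
import os

def read_dist_env():
    """Read the environment variables torchrun sets for each process."""
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", 29500)),
    }

# On node_rank=1 of a 2x8 setup, ranks 8..15 run here with WORLD_SIZE=16.
```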
||
【Monitoring Commands】
# GPU usage on all nodes
ssh node1 "nvidia-smi" && ssh node2 "nvidia-smi"

# Training progress
tail -f phase4a_stage1_multinode_*.log | grep Epoch

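The `tail | grep Epoch` pipeline above can be extended into a small progress parser; a sketch assuming log lines shaped like `Epoch [3/24] ...` — the exact line format depends on the training framework, so adjust the regex to match your logs:

```python
import re

# Assumed log line shape: "Epoch [current/total] ..."
EPOCH_RE = re.compile(r"Epoch\s*\[(\d+)/(\d+)\]")

def latest_progress(log_text):
    """Return (current_epoch, total_epochs) from the last Epoch line, or None."""
    matches = EPOCH_RE.findall(log_text)
    if not matches:
        return None
    cur, total = matches[-1]
    return int(cur), int(total)

log = "Epoch [1/24] loss=0.91\nEpoch [2/24] loss=0.85\n"
print(latest_progress(log))  # (2, 24)
```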
||
【Common Issues】
1. SSH connection fails → re-check ssh-copy-id
2. NCCL timeout → export NCCL_SOCKET_TIMEOUT=3600
3. Inconsistent dataset paths → unify to /data/nuscenes

||
【Detailed Documentation】
project/docs/多机多卡训练配置指南.md (Multi-Node Multi-GPU Training Configuration Guide)

================================================================================