================================================================================
          BEVFusion Multi-Node Multi-GPU Training - Quick Reference
================================================================================
【Supported Configurations】

  ✓ 1 node,  8 GPUs:  8 GPUs (current setup)
  ✓ 2 nodes, 16 GPUs: 16 GPUs (~3.0× speedup, ~5 days to finish)
  ✓ 4 nodes, 32 GPUs: 32 GPUs (~4.5× speedup, ~3 days to finish)
【Prerequisites】

  1. Passwordless SSH between nodes
       ssh-keygen -t rsa
       ssh-copy-id root@node2

  2. Network connectivity between nodes
       ping node2
       nc -zv node2 29500

  3. Identical dataset path on every node
       All nodes: /data/nuscenes/
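The three prerequisites above can be checked from the master node in one pass. A minimal sketch, assuming the hostnames in `NODES` and the master address used in the launch examples; the leading `echo` keeps it a dry run (remove it to actually execute the checks):

```shell
# Dry-run preflight: for each node, verify SSH login, the dataset path,
# and reachability of the master's rendezvous port (29500).
# NODES is an assumption -- substitute your actual hostnames.
NODES="node1 node2"
MASTER=192.168.1.101
checked=0
for n in $NODES; do
    echo ssh "$n" "test -d /data/nuscenes && nc -zv $MASTER 29500"
    checked=$((checked + 1))
done
echo "prepared checks for $checked node(s)"
```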
【Launch Methods】

  Option 1: torchpack (recommended, simple)
  ─────────────────────────────────────
  # Edit the IPs in START_MULTINODE_TRAINING.sh
  MASTER_ADDR="192.168.1.101"
  WORKER1_ADDR="192.168.1.102"

  # Launch
  bash START_MULTINODE_TRAINING.sh
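If you need to adapt the launcher, one possible shape for such a script is sketched below. This is a hypothetical reconstruction, not the real START_MULTINODE_TRAINING.sh (which may use torchpack's own launcher instead): it starts the worker rank over ssh, then the master rank locally, and the `echo` prefixes keep it a dry run.

```shell
# Hypothetical sketch of a two-node launch script -- an assumption,
# not the actual START_MULTINODE_TRAINING.sh contents.
MASTER_ADDR="192.168.1.101"
WORKER1_ADDR="192.168.1.102"
LAUNCH="torchrun --nnodes=2 --nproc_per_node=8 --master_addr=$MASTER_ADDR --master_port=29500"
# 'echo' makes this a dry run; drop it to actually launch.
echo ssh "$WORKER1_ADDR" "$LAUNCH --node_rank=1 tools/train.py config.yaml &"
echo $LAUNCH --node_rank=0 tools/train.py config.yaml
```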
  Option 2: torchrun (flexible)
  ─────────────────────────────────────
  # On the master node
  torchrun --nnodes=2 --nproc_per_node=8 \
      --node_rank=0 --master_addr=192.168.1.101 --master_port=29500 \
      tools/train.py config.yaml

  # On the worker node
  torchrun --nnodes=2 --nproc_per_node=8 \
      --node_rank=1 --master_addr=192.168.1.101 --master_port=29500 \
      tools/train.py config.yaml
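For orientation, the relationship between the flags above and the per-process ranks torchrun derives (and exports to each process as WORLD_SIZE / RANK / LOCAL_RANK) can be worked through numerically:

```shell
# Rank arithmetic for the 2-node x 8-GPU job above. The concrete values
# node_rank=1, local_rank=3 are an illustrative example (the 4th GPU
# process on the worker node).
nnodes=2; nproc_per_node=8
node_rank=1; local_rank=3
world_size=$(( nnodes * nproc_per_node ))
global_rank=$(( node_rank * nproc_per_node + local_rank ))
echo "WORLD_SIZE=$world_size RANK=$global_rank"   # -> WORLD_SIZE=16 RANK=11
```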
【Monitoring】

  # GPU usage on all nodes
  ssh node1 "nvidia-smi" && ssh node2 "nvidia-smi"

  # Training progress
  tail -f phase4a_stage1_multinode_*.log | grep Epoch
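Beyond live tailing, the same grep pattern can pull out the most recent epoch line on demand. A runnable demo, using a stand-in log file and line format (the real phase4a logs may format epochs differently):

```shell
# Extract the latest "Epoch" line from a log. The file name and line
# format below are stand-ins, not the real training log format.
printf 'Epoch [1/20] loss=0.92\nEpoch [2/20] loss=0.71\n' > /tmp/demo_train.log
last_epoch=$(grep Epoch /tmp/demo_train.log | tail -n 1)
echo "$last_epoch"   # -> Epoch [2/20] loss=0.71
```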
【Troubleshooting】

  1. SSH connection fails   → re-run ssh-copy-id and verify passwordless login
  2. NCCL timeout           → export NCCL_SOCKET_TIMEOUT=3600
  3. Inconsistent paths     → use /data/nuscenes on every node
【Full Documentation】

  project/docs/多机多卡训练配置指南.md
  (Multi-Node Multi-GPU Training Configuration Guide)

================================================================================