================================================================================
BEVFusion Multi-Node Multi-GPU Training - Quick Reference
================================================================================
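【Supported configurations】
✓ 1 node × 8 GPUs: 8 GPUs (current configuration)
✓ 2 nodes × 8 GPUs: 16 GPUs (~3.0× speedup, finishes in 5 days)
✓ 4 nodes × 8 GPUs: 32 GPUs (~4.5× speedup, finishes in 3 days)
【Prerequisites】
1. Passwordless SSH login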
ssh-keygen -t rsa
ssh-copy-id root@node2
2. Network connectivity between nodes
ping node2
nc -zv node2 29500
3. Same dataset path on every node
All nodes: /data/nuscenes/
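The SSH and dataset-path checks above can be scripted. A minimal sketch, assuming the worker list is kept in a hypothetical NODES variable and login is as root (as in the ssh-copy-id example); the function names are illustrative, not part of the project:

```shell
#!/bin/sh
# Preflight sketch: run on the master node before launching.
# NODES and DATA_DIR are assumptions taken from the examples above.
NODES="${NODES:-node2}"
DATA_DIR="${DATA_DIR:-/data/nuscenes}"

# Succeeds only if key-based SSH works without a password prompt.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true
}

# Succeeds if the dataset directory exists on the remote node.
check_data() {
    ssh -o BatchMode=yes "root@$1" "test -d '$2'"
}

# Check every node; report each failure and return non-zero if any check failed.
preflight() {
    failed=0
    for node in $NODES; do
        check_ssh "$node" || { echo "FAIL ssh $node"; failed=1; continue; }
        check_data "$node" "$DATA_DIR" || { echo "FAIL data $node:$DATA_DIR"; failed=1; }
    done
    return "$failed"
}
```

Usage: NODES="node2 node3" preflight && echo "all nodes ready"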
【Launch methods】
Option 1: torchpack (recommended, simple)
─────────────────────────────────────
# Edit the IPs in START_MULTINODE_TRAINING.sh
MASTER_ADDR="192.168.1.101"
WORKER1_ADDR="192.168.1.102"
# Launch
bash START_MULTINODE_TRAINING.sh
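The contents of START_MULTINODE_TRAINING.sh are not shown here. A launcher of this kind might reduce to a single torchpack dist-run call; the sketch below only builds and prints such a command. It assumes 8 GPUs per node, torchpack's -np and -H options, and config.yaml as a placeholder config path:

```shell
MASTER_ADDR="192.168.1.101"
WORKER1_ADDR="192.168.1.102"
GPUS_PER_NODE=8   # assumption: 8 GPUs on every node

# Build the host list "ip:slots,ip:slots" from the node addresses.
build_hosts() {
    hosts=""
    for addr in "$@"; do
        hosts="${hosts:+$hosts,}$addr:$GPUS_PER_NODE"
    done
    echo "$hosts"
}

# Total number of processes = number of nodes × GPUs per node.
total_procs() {
    echo $(( $# * GPUS_PER_NODE ))
}

HOSTS=$(build_hosts "$MASTER_ADDR" "$WORKER1_ADDR")
NP=$(total_procs "$MASTER_ADDR" "$WORKER1_ADDR")

# Print the command instead of running it; config.yaml is a placeholder.
echo "torchpack dist-run -np $NP -H $HOSTS python tools/train.py config.yaml"
```

Adding a node means passing one more address to build_hosts and total_procs.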
Option 2: torchrun (flexible)
─────────────────────────────────────
# master node (--master_port is omitted, so torchrun uses its default, 29500)
torchrun --nnodes=2 --nproc_per_node=8 \
--node_rank=0 --master_addr=192.168.1.101 \
tools/train.py config.yaml
# worker node
torchrun --nnodes=2 --nproc_per_node=8 \
--node_rank=1 --master_addr=192.168.1.101 \
tools/train.py config.yaml
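The two commands differ only in --node_rank, so one script can serve every node. A sketch, assuming a hypothetical NODE_RANK variable set per node (0 on master, 1 on worker) and static rendezvous, where ranks are laid out node by node:

```shell
NNODES=2
NPROC_PER_NODE=8
MASTER_ADDR="192.168.1.101"
MASTER_PORT=29500   # torchrun's default, written out explicitly

# Print the torchrun command for the given node rank ($1).
torchrun_cmd() {
    echo "torchrun --nnodes=$NNODES --nproc_per_node=$NPROC_PER_NODE" \
         "--node_rank=$1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT" \
         "tools/train.py config.yaml"
}

# World size = nodes × processes per node.
world_size() { echo $(( NNODES * NPROC_PER_NODE )); }

# Global rank of a process = node_rank ($1) × nproc_per_node + local_rank ($2).
global_rank() { echo $(( $1 * NPROC_PER_NODE + $2 )); }
```

Usage on each node: eval "$(torchrun_cmd "${NODE_RANK:-0}")"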
【Monitoring commands】
# Check GPUs on all nodes
ssh node1 "nvidia-smi" && ssh node2 "nvidia-smi"
# Follow training progress
tail -f phase4a_stage1_multinode_*.log | grep Epoch
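With more than two nodes, the `&&` chain gets long and stops at the first unreachable node. A loop-based sketch, assuming the node names are kept in a hypothetical NODES variable; it prints one compact line per GPU and keeps going if a node fails:

```shell
NODES="${NODES:-node1 node2}"

# Print GPU index, utilization and memory use for every node in NODES.
gpu_summary() {
    for node in $NODES; do
        echo "== $node =="
        ssh "$node" "nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader" \
            || echo "unreachable: $node"
    done
}
```

Usage: NODES="node1 node2 node3 node4" gpu_summary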
【Common issues】
1. SSH connection fails → check ssh-copy-id
2. NCCL timeout → export NCCL_SOCKET_TIMEOUT=3600
3. Paths differ between nodes → standardize on /data/nuscenes
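When a NCCL problem persists, NCCL's own logging usually shows which interface and transport each rank picked. A sketch of two standard NCCL environment variables to set on every node before launching; the interface name eth0 is an assumption and should be replaced with whichever interface carries the inter-node traffic:

```shell
# Diagnostic settings; export on every node before launching.
export NCCL_DEBUG=INFO            # print NCCL init and transport details to the log
export NCCL_SOCKET_IFNAME=eth0    # assumption: eth0 carries the 192.168.1.x traffic
```

The interface name can be looked up with `ip addr` on each node.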
【Detailed documentation】
project/docs/多机多卡训练配置指南.md
================================================================================