# Task-specific GCA Background Training Guide

---

## 🚀 One-Command Launch (Background)

### Run inside the Docker container

```bash
cd /workspace/bevfusion
bash START_PHASE4A_TASK_GCA_BACKGROUND.sh
```

**Features**:
- ✅ Runs in the background automatically
- ✅ Redirects output to a log file
- ✅ Training keeps running after the terminal is closed
- ✅ Prints the monitoring commands automatically
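
The contents of the launch script are not reproduced in this guide. As a rough illustration of how the features listed above are commonly achieved, the sketch below combines `nohup`, output redirection, and `&`. The function name `launch_in_background` and its arguments are hypothetical, and the real `START_PHASE4A_TASK_GCA_BACKGROUND.sh` may work differently; only the `train_YYYYMMDD_HHMMSS.log` naming follows the log path documented in this guide. A call such as `launch_in_background /tmp/demo_logs sleep 60` would start the command and return immediately.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; NOT the actual START_PHASE4A_TASK_GCA_BACKGROUND.sh.

# launch_in_background LOG_DIR COMMAND [ARGS...]
# Runs COMMAND detached from the terminal and logs its output to a
# timestamped file inside LOG_DIR.
launch_in_background() {
    local log_dir="$1"
    shift
    mkdir -p "$log_dir"
    local log_file
    log_file="$log_dir/train_$(date +%Y%m%d_%H%M%S).log"

    # nohup ignores SIGHUP, so closing the terminal does not stop the command.
    # stdout and stderr both go to the log file; & puts the job in the background.
    nohup "$@" > "$log_file" 2>&1 < /dev/null &
    local pid=$!

    # Tell the user how to follow up on the job.
    echo "PID: $pid"
    echo "Log: $log_file"
    echo "Monitor with: tail -f $log_file"
}
```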

---

## 📊 Monitoring Training

### Option 1: Use the monitoring script (recommended)

```bash
docker exec -it bevfusion bash /workspace/bevfusion/MONITOR_TASK_GCA.sh
```

It shows:
- Training process status
- GPU usage
- The latest log output (last 100 lines)

### Option 2: Follow the log in real time

```bash
# bash -c makes the * wildcard expand inside the container rather than on the host
docker exec -it bevfusion bash -c 'tail -f /data/runs/phase4a_stage1_task_gca/train_*.log'
```

### Option 3: Watch key metrics

```bash
docker exec -it bevfusion bash -c "tail -f /data/runs/phase4a_stage1_task_gca/train_*.log | grep -E 'Epoch|loss/map/divider|loss/object'"
```
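
To track a single metric rather than whole log lines, a small filter can pull out just the numbers. The sketch below assumes metrics appear as `name: value` pairs, as in the sample log output shown in this guide; the function name `extract_metric` is hypothetical, and the pattern may need adjusting if the real log format differs.

```bash
#!/usr/bin/env bash
# extract_metric NAME
# Reads log text on stdin and prints the numeric value that follows "NAME: "
# wherever it appears, one value per line.
# Assumption: metrics are logged as "name: value" pairs.
extract_metric() {
    local name="$1"
    # -o prints only the matched "NAME: number" part; awk keeps the number.
    grep -oE "${name}: [0-9.eE+-]+" | awk '{print $NF}'
}
```

For example, piping a log file into `extract_metric loss/map/divider/dice | tail -n 5` would list the five most recent values of that loss.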

### Option 4: GPU monitoring

```bash
docker exec -it bevfusion nvidia-smi -l 5
```

---

## 🔍 Checking Training Status

### List the training process

```bash
docker exec -it bevfusion ps aux | grep "tools/train.py"
```

### List the latest checkpoints

```bash
# bash -c makes the wildcard and the pipe run inside the container
docker exec -it bevfusion bash -c 'ls -lth /data/runs/phase4a_stage1_task_gca/epoch_*.pth | head -5'
```
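
`ls -t` orders files by modification time, so a copied or touched checkpoint can appear out of order. As an alternative, the sketch below picks the checkpoint with the highest epoch number in its file name. The function name `latest_checkpoint` is hypothetical; the `epoch_N.pth` naming is taken from the output paths listed in this guide.

```bash
#!/usr/bin/env bash
# latest_checkpoint DIR
# Prints the path of the epoch_N.pth file in DIR with the largest N.
# Returns a non-zero status if DIR contains no such file.
latest_checkpoint() {
    local dir="$1" f n best=-1 best_file=""
    for f in "$dir"/epoch_*.pth; do
        [ -e "$f" ] || continue              # the glob matched nothing
        n="${f##*/epoch_}"                   # strip the directory and "epoch_"
        n="${n%.pth}"                        # strip the extension
        case "$n" in
            ''|*[!0-9]*) continue ;;         # skip names without a pure number
        esac
        if [ "$n" -gt "$best" ]; then
            best="$n"
            best_file="$f"
        fi
    done
    [ -n "$best_file" ] && printf '%s\n' "$best_file"
}
```

Run inside the container, `latest_checkpoint /data/runs/phase4a_stage1_task_gca` would print the highest-numbered checkpoint regardless of file timestamps.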

### Log summary

```bash
docker exec -it bevfusion bash -c 'tail -n 200 /data/runs/phase4a_stage1_task_gca/train_*.log | grep -E "Task-specific|Epoch \[|loss/"'
```

---

## ⏸️ Stopping Training

### Find the process ID

```bash
docker exec -it bevfusion ps aux | grep "tools/train.py" | grep -v grep
```

### Stop the process

```bash
docker exec -it bevfusion kill <PID>
```

Or stop it gracefully by process name (`pkill`, like `kill`, sends SIGTERM by default):

```bash
docker exec -it bevfusion pkill -f "tools/train.py"
```
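
A process may take a while to react to SIGTERM, or may ignore it. The sketch below shows one common pattern: send SIGTERM, wait for a bounded time, and fall back to SIGKILL only if the process is still alive. The function name `stop_gracefully` and the 30-second default are assumptions for illustration, not part of the training scripts. It would be run inside the container with the PID found above.

```bash
#!/usr/bin/env bash
# stop_gracefully PID [TIMEOUT_SECONDS]
# Sends SIGTERM, waits up to TIMEOUT_SECONDS (default 30) for the process to
# exit, and only then falls back to SIGKILL.
stop_gracefully() {
    local pid="$1" timeout="${2:-30}" waited=0

    kill -TERM "$pid" 2>/dev/null || {
        echo "cannot signal PID $pid (already exited or not permitted)" >&2
        return 1
    }

    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid still alive after ${timeout}s, sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```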

---

## 📁 Output Locations

```
Log file:
/data/runs/phase4a_stage1_task_gca/train_YYYYMMDD_HHMMSS.log

Checkpoints:
/data/runs/phase4a_stage1_task_gca/epoch_1.pth
/data/runs/phase4a_stage1_task_gca/epoch_2.pth
...
/data/runs/phase4a_stage1_task_gca/epoch_20.pth

Config snapshot:
/data/runs/phase4a_stage1_task_gca/configs.yaml
```

---

## 🎯 Expected Log Output

### Startup phase

```
[BEVFusion] ⚪ Skipping camera backbone init_weights
[BEVFusion] ✨✨ Task-specific GCA mode enabled ✨✨
[object] GCA: params: 131,072
[map] GCA: params: 131,072

load checkpoint from .../epoch_5.pth

The following keys in model are not found in checkpoint:
task_gca.* (expected: these weights are randomly initialized)
```

### Training phase

```
Epoch [1][50/xxx]
lr: 2.00e-05
loss/object/loss_heatmap: 0.240
loss/object/loss_bbox: 0.310
loss/map/divider/dice: 0.525
loss/map/divider/focal: 0.180
loss/map/drivable_area/dice: 0.090
grad_norm: 12.5
memory: 18500
```

---

## 📊 Key Metrics to Monitor

### Check every 50 iterations

```
Detection:
  loss/object/loss_heatmap      # should hold steady or decrease
  stats/object/matched_ious     # should increase

Segmentation:
  loss/map/divider/dice         # 0.52 → 0.45 → 0.42 (lower is better)
  loss/map/drivable_area/dice

General:
  grad_norm                     # 8-15 is normal
  memory                        # < 20000 MB
```
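
The two general thresholds above (`grad_norm` between 8 and 15, `memory` below 20000 MB) can be checked mechanically. The sketch below is a hypothetical helper, not part of the monitoring script; it takes the two values as arguments, prints `OK` or a warning per violated threshold, and returns a non-zero status on any warning. `awk` is used because `grad_norm` is a floating-point value, which plain shell arithmetic cannot compare. For the sample values in this guide, `check_health 12.5 18500` prints `OK`.

```bash
#!/usr/bin/env bash
# check_health GRAD_NORM MEMORY_MB
# Compares the two values against the ranges given in this guide.
check_health() {
    awk -v g="$1" -v m="$2" 'BEGIN {
        status = 0
        if (g < 8 || g > 15) {
            print "WARN grad_norm=" g " is outside 8-15"
            status = 1
        }
        if (m >= 20000) {
            print "WARN memory=" m " MB is not below 20000 MB"
            status = 1
        }
        if (status == 0) print "OK"
        exit status
    }'
}
```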

---

## ⏰ Time Estimate

```
Remaining epochs: 15 (epochs 6-20)
Time per epoch: ~11 hours
Total time: ~7 days (15 × 11 h = 165 h)
Estimated completion: 2025-11-13
```
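
The estimate is simple arithmetic and can be recomputed when the remaining epoch count or the per-epoch time changes. The helper below is a hypothetical sketch; with the figures above, `estimate_training_time 15 11` prints `165 hours (~6.9 days)`.

```bash
#!/usr/bin/env bash
# estimate_training_time EPOCHS HOURS_PER_EPOCH
# Prints the total remaining time in hours and in days.
estimate_training_time() {
    local epochs="$1" hours_per_epoch="$2"
    awk -v e="$epochs" -v h="$hours_per_epoch" 'BEGIN {
        total = e * h                        # total hours
        printf "%g hours (~%.1f days)\n", total, total / 24
    }'
}
```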

---

## 🛠️ Troubleshooting

### Training does not start

```bash
# Inspect the log
cat /data/runs/phase4a_stage1_task_gca/train_*.log

# Common causes:
# 1. Environment variables are not set
# 2. The checkpoint path is wrong
# 3. Not enough disk space
```
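
Disk space is the one cause in this list that is easy to rule out before launching. The sketch below is a hypothetical pre-flight check built on `df`; the function name `has_free_space` and the threshold argument are assumptions, and this guide does not state how much space a run needs.

```bash
#!/usr/bin/env bash
# has_free_space PATH MIN_GB
# Succeeds if the filesystem holding PATH has at least MIN_GB gigabytes free.
has_free_space() {
    local path="$1" min_gb="$2" avail_kb
    # -P gives one stable line per filesystem, -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 {print $4}')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, `has_free_space /data/runs 50 || echo "low disk space"` would warn when less than 50 GB is free, where 50 is an arbitrary illustrative value.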

### Training was interrupted

```bash
# View the last 100 lines of the log
tail -n 100 /data/runs/phase4a_stage1_task_gca/train_*.log

# Common causes:
# 1. OOM (out of GPU memory)
# 2. Not enough disk space
# 3. Network problems
```

### Restarting

```bash
# Find the most recent checkpoint
ls -lt /data/runs/phase4a_stage1_task_gca/epoch_*.pth | head -1

# Update the LATEST_CKPT path in the script,
# then run the launch script again
```

---

## 📋 Quick Command Reference

```bash
# All commands below are run inside the container

# Start training (background)
bash START_PHASE4A_TASK_GCA_BACKGROUND.sh

# Monitor training
bash MONITOR_TASK_GCA.sh

# Follow the log in real time
tail -f /data/runs/phase4a_stage1_task_gca/train_*.log

# GPU status
nvidia-smi -l 5

# Check the process
ps aux | grep "tools/train.py" | grep -v grep

# Stop training
pkill -f "tools/train.py"
```

---

**🎉 The background training script is ready! Run `bash START_PHASE4A_TASK_GCA_BACKGROUND.sh` to start training!**