# Task-specific GCA Training Launch - Complete Steps

📅 **Date**: 2025-11-06

⚠️ **Important**: All commands must be run inside the Docker container.

---

## ⚠️ Fixing the Environment Problem

### Problem: torchpack: command not found

**Cause**: The command was run outside the Docker container, or the environment variables were not set.

**Fix**: The launch script now sets the environment variables automatically ✅

---

## 🚀 How to Launch Correctly

### Option 1: Run inside the Docker container (recommended)

```bash
# Step 1: Enter the Docker container from the host
docker exec -it bevfusion bash

# Step 2: Run the launch script inside the container
cd /workspace/bevfusion
bash START_PHASE4A_TASK_GCA.sh

# Type 'y' when prompted
```

### Option 2: One-line command (enters the container automatically)

```bash
# Run on the host (一键启动.sh is the "one-click launch" script)
docker exec -it bevfusion bash /workspace/bevfusion/一键启动.sh
```

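Because both options depend on the commands running inside the container, a launch script could refuse to continue when started on the host. The sketch below relies on the `/.dockerenv` marker file that Docker normally creates inside containers; treating its presence as proof is a heuristic, and the function name `in_container` is hypothetical (it is not taken from `START_PHASE4A_TASK_GCA.sh`).

```bash
# Hypothetical guard, not part of the original launch script.
# Heuristic: Docker normally creates /.dockerenv inside a container.
# The marker path is a parameter so the check can be exercised in isolation.
in_container() {
    local marker="${1:-/.dockerenv}"
    [ -f "$marker" ]
}

# Example use at the top of a launch script:
#   in_container || { echo "Run this inside the bevfusion container" >&2; exit 1; }
```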
---

## ✅ Environment Configuration

The launch script sets the following environment variables automatically:

```bash
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
```

It then verifies the environment and prints:

```
✅ PyTorch: 1.10.1
✅ mmcv: 1.4.0
✅ torchpack: /opt/conda/bin/torchpack
```

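A verification step of this kind might look roughly like the sketch below. `check_cmd` is a hypothetical helper (the real script's implementation is not shown in this document); it relies only on the shell builtin `command -v`. The commented Python one-liners read the standard `__version__` attributes of `torch` and `mmcv`.

```bash
# Hypothetical helper: report where a command resolves on PATH.
check_cmd() {
    local resolved
    if resolved="$(command -v "$1" 2>/dev/null)"; then
        echo "✅ $1: $resolved"
    else
        echo "❌ $1: not found on PATH" >&2
        return 1
    fi
}

# Example use inside the container:
#   check_cmd torchpack
#   python -c 'import torch; print("PyTorch:", torch.__version__)'
#   python -c 'import mmcv; print("mmcv:", mmcv.__version__)'
```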
---

## 📊 Complete Launch Flow

```
1. Enter the container and run the launch script
   docker exec -it bevfusion bash
   cd /workspace/bevfusion && bash START_PHASE4A_TASK_GCA.sh

2. The script automatically:
   ├─ Sets the environment variables ✅
   ├─ Verifies the Python environment ✅
   ├─ Checks disk space ✅
   ├─ Confirms the checkpoint exists ✅
   ├─ Cleans up .eval_hook ✅
   └─ Prints a configuration summary

3. User confirmation:
   Type 'y' to start

4. Training starts:
   torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py ...

5. Log output:
   /data/runs/phase4a_stage1_task_gca/*.log
```

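Two of the automatic steps, the disk-space check and the `y` confirmation, can be sketched as small shell functions. This is an illustration only: the names `has_free_gib` and `confirm`, and the 50 GiB threshold in the usage comment, are assumptions and are not taken from the actual launch script.

```bash
# Hypothetical preflight helpers; names and thresholds are illustrative.

# Succeeds if the filesystem holding $1 has at least $2 GiB available.
# df -Pk prints POSIX-format output in KiB; column 4 is the available space.
has_free_gib() {
    local dir="$1" need_gib="$2" avail_kib
    avail_kib="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ "$avail_kib" -ge $(( need_gib * 1024 * 1024 )) ]
}

# Prints a prompt, reads one line from stdin, succeeds only on 'y' or 'Y'.
confirm() {
    local reply
    printf '%s [y/N] ' "$1"
    read -r reply
    [ "$reply" = "y" ] || [ "$reply" = "Y" ]
}

# Example use:
#   has_free_gib /data 50 || { echo "Not enough disk space" >&2; exit 1; }
#   confirm "Start training?" || exit 0
```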
---

## ✅ Post-launch Verification

### Check that Task-specific GCA is enabled

```bash
# Search the first 200 log lines for the GCA banner.
# The command is wrapped in bash -c so the *.log glob expands inside the container, not on the host.
docker exec bevfusion bash -c 'head -n 200 /data/runs/phase4a_stage1_task_gca/*.log' | grep -A 10 "Task-specific"
```

The log should contain:

```
[BEVFusion] ⚪ Shared BEV-level GCA disabled
[BEVFusion] ✨✨ Task-specific GCA mode enabled ✨✨
[object] GCA:
  - in_channels: 512
  - reduction: 4
  - params: 131,072
[map] GCA:
  - in_channels: 512
  - reduction: 4
  - params: 131,072
Total task-specific GCA params: 262,144
Advantage: Each task selects features by its own needs ✅
```

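For a scripted check, the banner line from the expected output can be matched as a fixed string. The sketch below is a minimal example; `gca_enabled` is a hypothetical helper name, and the exact log filename is deliberately not assumed.

```bash
# Hypothetical check: succeeds if the log file contains the banner line.
# -F treats the pattern as a fixed string, -q suppresses output.
gca_enabled() {
    grep -qF "Task-specific GCA mode enabled" "$1"
}

# Example use inside the container:
#   for f in /data/runs/phase4a_stage1_task_gca/*.log; do
#       if gca_enabled "$f"; then echo "OK: $f"; else echo "NOT ENABLED: $f"; fi
#   done
```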
### View the training loss

```bash
# Follow the log in real time (bash -c makes the glob expand inside the container)
docker exec -it bevfusion bash -c 'tail -f /data/runs/phase4a_stage1_task_gca/*.log'

# Watch the divider loss improve; --line-buffered keeps grep output live through the pipe
docker exec bevfusion bash -c 'tail -f /data/runs/phase4a_stage1_task_gca/*.log' | grep --line-buffered "loss/map/divider/dice"
```

---

## 📊 Metrics to Monitor

### What to watch every 50 iterations

```
Detection:
  loss/object/loss_heatmap     # should stay flat or decrease
  stats/object/matched_ious    # should increase

Segmentation:
  loss/map/divider/dice        # should go 0.52 → 0.45 → 0.42 (lower is better!)
  loss/map/drivable_area/dice

General:
  grad_norm                    # 8-15 is normal
  memory                       # < 20000 (MB)
```

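To check these values without reading the log by eye, a metric can be pulled out of each log line and compared against the ranges listed above. The sketch assumes the log writes metrics as `name: value` pairs (for example `grad_norm: 10.5`); that format and the helper names `extract_metric` and `in_range` are assumptions, so adjust the pattern to the actual log.

```bash
# Hypothetical helper: print every value logged for one metric, read from stdin.
# Assumes log lines contain "name: value" pairs, e.g. "grad_norm: 10.5".
extract_metric() {
    grep -oE "$1: [-+0-9.eE]+" | awk '{ print $NF }'
}

# Succeeds if value $1 lies within the closed interval [$2, $3].
in_range() {
    awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(v >= lo && v <= hi) }'
}

# Example use inside the container:
#   tail -n 50 /data/runs/phase4a_stage1_task_gca/*.log | extract_metric grad_norm |
#       while read -r g; do in_range "$g" 8 15 || echo "grad_norm out of range: $g"; done
```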
---

## 🎯 Expected Performance (Epoch 20)

```
Detection:     mAP        0.68  → 0.70  (+2.9%)
Segmentation:  mIoU       0.55  → 0.61  (+11%)
Divider:       Dice loss  0.525 → 0.420 (-20%, an improvement!)
```

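The percentages in this table are relative changes, (new - old) / old. A small helper reproduces them; `rel_change` is a hypothetical name used only for this illustration.

```bash
# Hypothetical helper: relative change from $1 to $2, as a signed percentage.
rel_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", (b - a) / a * 100 }'
}

rel_change 0.68 0.70     # mAP:       +2.9%
rel_change 0.55 0.61     # mIoU:      +10.9% (listed as +11%)
rel_change 0.525 0.420   # Dice loss: -20.0%
```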
**Important**: For Dice loss, lower is better!

---

## 📁 Output Locations

```
Checkpoints:
  /data/runs/phase4a_stage1_task_gca/epoch_*.pth

Logs:
  /data/runs/phase4a_stage1_task_gca/*.log

Config snapshot:
  /data/runs/phase4a_stage1_task_gca/configs.yaml
```

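Since checkpoints and logs are addressed by glob patterns, it can help to resolve the newest matching file. The sketch below picks it by modification time; `latest_file` is a hypothetical helper, and it assumes paths without spaces or newlines because it parses `ls` output.

```bash
# Hypothetical helper: print the most recently modified file in directory $1
# matching glob pattern $2, or fail if nothing matches.
# Assumes file names without whitespace, since it parses ls output.
latest_file() {
    local dir="$1" pattern="$2" newest
    newest="$(ls -t "$dir"/$pattern 2>/dev/null | head -n 1)"
    [ -n "$newest" ] && echo "$newest"
}

# Example use inside the container:
#   latest_file /data/runs/phase4a_stage1_task_gca 'epoch_*.pth'
#   latest_file /data/runs/phase4a_stage1_task_gca '*.log'
```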
---

## ⏰ Time Estimate

```
Remaining epochs:     15 (epochs 6-20)
Time per epoch:       ~11 hours
Total time:           ~7 days
Expected completion:  2025-11-13
```

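The total follows directly from the two figures above: 15 epochs at about 11 hours each. A short sketch of the arithmetic, with `eta` as a hypothetical helper name:

```bash
# Hypothetical helper: total training time from epoch count and hours per epoch.
eta() {
    awk -v e="$1" -v h="$2" \
        'BEGIN { t = e * h; printf "%d h (~%.1f days)\n", t, t / 24 }'
}

eta 15 11    # 165 h (~6.9 days)
```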
---

## 🔧 Troubleshooting

### If torchpack is still not found

```bash
# Set the environment manually
export PATH=/opt/conda/bin:$PATH
which torchpack

# Or call it by its full path
/opt/conda/bin/torchpack --version
```

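If the manual export ends up in a shell profile or is sourced repeatedly, `PATH` grows by one duplicate entry each time. A minimal sketch of an idempotent variant follows; `path_prepend` is a hypothetical helper, not something the launch script is known to use.

```bash
# Hypothetical helper: prepend a directory to PATH only if it is not already
# present, so running the setup repeatedly does not grow PATH.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Example use:
#   path_prepend /opt/conda/bin
#   command -v torchpack
```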
### If Python raises import errors

```bash
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
```

---

## 📋 Quick Command Reference

```bash
# Enter the container
docker exec -it bevfusion bash

# Start training
cd /workspace/bevfusion
bash START_PHASE4A_TASK_GCA.sh

# Monitor the log
tail -f /data/runs/phase4a_stage1_task_gca/*.log

# Check the GPUs
nvidia-smi

# Check disk usage
df -h /workspace /data
```

---

**🎉 The environment problem is fixed, and training can now be launched correctly!**

**Run inside the Docker container**: `bash START_PHASE4A_TASK_GCA.sh`