# Task-specific GCA Training Launch: Complete Steps
📅 **Date**: 2025-11-06

⚠️ **Important**: training must be launched inside the Docker container.

---
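## ⚠️ Fixing the Environment Problem
### Problem: torchpack: command not found
**Cause**: the command was run outside the Docker container, or the environment variables were not set.

**Fix**: the launch script now sets the environment variables automatically ✅

---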
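## 🚀 Correct Ways to Launch
### Option 1: Run inside the Docker container (recommended)
```bash
# Step 1: enter the Docker container from the host
docker exec -it bevfusion bash
# Step 2: run the launch script inside the container
cd /workspace/bevfusion
bash START_PHASE4A_TASK_GCA.sh
# Type 'y' when prompted
```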
### Option 2: One-line command (enters the container automatically)
```bash
# Run on the host
docker exec -it bevfusion bash /workspace/bevfusion/一键启动.sh
```
---
## ✅ Environment Configuration
The launch script sets the following environment variables automatically:
```bash
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
```
It then verifies the environment and prints:
```
✅ PyTorch: 1.10.1
✅ mmcv: 1.4.0
✅ torchpack: /opt/conda/bin/torchpack
```
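The same checks can be repeated by hand inside the container. The sketch below is only an illustration of how such a check might look; it is not taken from the launch script, and `check_cmd` and `check_py_module` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; not the contents of START_PHASE4A_TASK_GCA.sh.

# Report whether a command is on PATH and where it resolves to.
check_cmd() {
    local resolved
    if resolved=$(command -v "$1"); then
        echo "✅ $1: $resolved"
    else
        echo "❌ $1: not found on PATH"
        return 1
    fi
}

# Report the version of an importable Python module.
check_py_module() {
    local py="${PYTHON:-python}"   # interpreter to use; override with PYTHON=...
    "$py" -c "import $1; print('✅ $1:', $1.__version__)" 2>/dev/null \
        || { echo "❌ $1: import failed"; return 1; }
}

# Usage inside the container:
#   check_cmd torchpack
#   check_py_module torch
#   check_py_module mmcv
```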
---
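## 📊 Full Launch Flow
```
1. Enter the container
   docker exec -it bevfusion bash
2. The script automatically:
   ├─ sets the environment variables ✅
   ├─ verifies the Python environment ✅
   ├─ checks disk space ✅
   ├─ confirms the checkpoint ✅
   ├─ cleans up .eval_hook ✅
   └─ prints a configuration summary
3. User confirmation:
   type 'y' to start
4. Training starts:
   torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py ...
5. Log output:
   /data/runs/phase4a_stage1_task_gca/*.log
```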
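For readers who want to reproduce the disk-space check and the confirmation prompt in their own scripts, here is a minimal sketch. It is an assumption about how such steps could be written, not the actual launch script; `has_free_gb`, `confirm`, and the 50 GiB threshold in the usage comment are all hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; the real launch script may differ.

# Succeed if the filesystem holding $1 has at least $2 GiB available.
has_free_gb() {
    local avail_kb
    # df -Pk prints one POSIX-format data row; column 4 is available 1 KiB blocks.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Ask for confirmation; succeed only on 'y' or 'Y'.
confirm() {
    local answer
    read -r -p "$1 [y/N] " answer
    [ "$answer" = "y" ] || [ "$answer" = "Y" ]
}

# Usage:
#   has_free_gb /data 50 || { echo "not enough disk space"; exit 1; }
#   confirm "Start training?" || exit 0
```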
---
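## ✅ Post-launch Verification
### Check that Task-specific GCA is enabled
```bash
# Search the first 200 lines of the log (run on the host).
# The glob and the pipe are wrapped in bash -c so they are evaluated inside the container.
docker exec -it bevfusion bash -c 'head -n 200 /data/runs/phase4a_stage1_task_gca/*.log | grep -A 10 "Task-specific"'
```
You should see: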
```
[BEVFusion] ⚪ Shared BEV-level GCA disabled
[BEVFusion] ✨✨ Task-specific GCA mode enabled ✨✨
[object] GCA:
- in_channels: 512
- reduction: 4
- params: 131,072
[map] GCA:
- in_channels: 512
- reduction: 4
- params: 131,072
Total task-specific GCA params: 262,144
Advantage: Each task selects features by its own needs ✅
```
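The parameter counts in this log are consistent with a squeeze-and-excitation style gate built from two bias-free linear layers (C → C/r → C), which has 2·C·(C/r) weights. That layer structure is an assumption inferred from the numbers, not something the log states, and `gca_params` is a hypothetical helper used only for the arithmetic.

```bash
#!/usr/bin/env bash
# Assumed structure: two bias-free linear layers, C -> C/r -> C.
gca_params() {
    local channels=$1 reduction=$2
    local hidden=$(( channels / reduction ))
    echo $(( 2 * channels * hidden ))
}

per_task=$(gca_params 512 4)
echo "per task:  $per_task"              # per task:  131072
echo "two tasks: $(( 2 * per_task ))"    # two tasks: 262144
```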
### Watch the training loss
```bash
# Follow the log in real time
docker exec -it bevfusion bash -c 'tail -f /data/runs/phase4a_stage1_task_gca/*.log'
# Track the divider improvement
docker exec -it bevfusion bash -c 'tail -f /data/runs/phase4a_stage1_task_gca/*.log | grep "loss/map/divider/dice"'
```
---
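## 📊 Metrics to Monitor
### Check every 50 iterations
```
Detection:
  loss/object/loss_heatmap     # should hold steady or decrease
  stats/object/matched_ious    # should increase
Segmentation:
  loss/map/divider/dice        # should go 0.52 → 0.45 → 0.42 (lower is better!)
  loss/map/drivable_area/dice
General:
  grad_norm                    # 8-15 is normal
  memory                       # < 20000 (MB)
```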
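These checks can be scripted rather than read off the scrolling log. The sketch below assumes each training line contains `name: value` pairs (for example `grad_norm: 12.3`); that format is an assumption, so adjust the pattern if the actual log differs. `latest_metric` and `in_range` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Print the most recent value of a metric from a log file.
# Assumes "name: value" pairs; not verified against the real log format.
latest_metric() {
    local name=$1 logfile=$2
    grep -oE "${name}: [0-9.eE+-]+" "$logfile" | tail -n 1 | awk '{ print $NF }'
}

# Succeed if $1 lies within the closed interval [$2, $3].
in_range() {
    awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(v >= lo && v <= hi) }'
}

# Usage inside the container (the log file name is a placeholder):
#   g=$(latest_metric grad_norm /data/runs/phase4a_stage1_task_gca/train.log)
#   in_range "$g" 8 15 || echo "grad_norm $g is outside the usual 8-15 band"
```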
---
## 🎯 Expected Performance (Epoch 20)
```
Detection:    mAP 0.68 → 0.70 (+2.9%)
Segmentation: mIoU 0.55 → 0.61 (+11%)
Divider:      Dice Loss 0.525 → 0.420 (-20% = better!)
```
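The percentages in the block above are relative changes, (new − old) / old. A small sketch for reproducing them, with `pct_change` as a hypothetical helper:

```bash
#!/usr/bin/env bash
# Relative change from $1 to $2, in percent with one decimal place.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f\n", (b - a) / a * 100 }'
}

pct_change 0.68 0.70      # mAP:               +2.9
pct_change 0.55 0.61      # mIoU:              +10.9 (rounded to +11% above)
pct_change 0.525 0.420    # divider Dice loss: -20.0
```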
**Important**: lower Dice Loss is better.

---
## 📁 Output Locations
```
Checkpoints:
  /data/runs/phase4a_stage1_task_gca/epoch_*.pth
Logs:
  /data/runs/phase4a_stage1_task_gca/*.log
Config snapshot:
  /data/runs/phase4a_stage1_task_gca/configs.yaml
```
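A plain `ls` sorts `epoch_10.pth` before `epoch_5.pth`, so picking the newest checkpoint needs a numeric comparison. The sketch below relies only on the `epoch_*.pth` naming shown above; `latest_checkpoint` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Print the checkpoint with the highest epoch number in a run directory.
latest_checkpoint() {
    local run_dir=$1 best=-1 f n
    for f in "$run_dir"/epoch_*.pth; do
        n=${f##*/epoch_}    # strip the directory and the "epoch_" prefix
        n=${n%.pth}         # strip the ".pth" suffix
        case $n in
            ''|*[!0-9]*) continue ;;   # skip non-numeric names and an unmatched glob
        esac
        if [ "$n" -gt "$best" ]; then
            best=$n
        fi
    done
    if [ "$best" -ge 0 ]; then
        echo "$run_dir/epoch_${best}.pth"
    else
        return 1
    fi
}

# Usage:
#   latest_checkpoint /data/runs/phase4a_stage1_task_gca
```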
---
## ⏰ Time Estimate
```
Remaining epochs:  15 (epochs 6-20)
Time per epoch:    ~11 hours
Total time:        ~7 days
Expected finish:   2025-11-13
```
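The total follows directly from the epoch count and the per-epoch time: 15 epochs × 11 hours = 165 hours, just under 7 days. As a sketch, with `remaining_hours` as a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hours needed to train epochs $1 through $2 (inclusive) at $3 hours per epoch.
remaining_hours() {
    local first=$1 last=$2 hours_per_epoch=$3
    echo $(( (last - first + 1) * hours_per_epoch ))
}

hours=$(remaining_hours 6 20 11)
echo "total: ${hours} h"                              # total: 165 h
echo "days (rounded up): $(( (hours + 23) / 24 ))"    # days (rounded up): 7
```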
---
## 🔧 Troubleshooting
### If torchpack is still not found
```bash
# Set the environment manually
export PATH=/opt/conda/bin:$PATH
which torchpack
# Or call it by its full path
/opt/conda/bin/torchpack --version
```
### If Python import errors occur
```bash
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
```
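Re-running these exports in the same shell prepends the same directories again, and an unset variable leaves a trailing `:` behind. One way to avoid both is a small idempotent helper; `prepend_path` is a hypothetical name and this is a sketch, not part of the launch script.

```bash
#!/usr/bin/env bash
# Prepend directory $2 to the colon-separated variable named $1, unless already present.
prepend_path() {
    local var=$1 dir=$2 current=${!1:-}
    case ":$current:" in
        *":$dir:"*) ;;                                  # already present: nothing to do
        *) export "$var=$dir${current:+:$current}" ;;   # no trailing ':' when the variable was empty
    esac
}

# Usage:
#   prepend_path PATH /opt/conda/bin
#   prepend_path PYTHONPATH /workspace/bevfusion
```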
---
## 📋 Quick Command Reference
```bash
# Enter the container
docker exec -it bevfusion bash
# Start training
cd /workspace/bevfusion
bash START_PHASE4A_TASK_GCA.sh
# Follow the log
tail -f /data/runs/phase4a_stage1_task_gca/*.log
# Check the GPUs
nvidia-smi
# Check disk space
df -h /workspace /data
```
---
**🎉 The environment problem is fixed, and training can now be launched correctly!**

**Run inside the Docker container**: `bash START_PHASE4A_TASK_GCA.sh`