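# Task-specific GCA Training Launch - Complete Steps

📅 **Date**: 2025-11-06

⚠️ **Important**: Must be run inside the Docker container

---

## ⚠️ Environment Troubleshooting

### Problem: torchpack: command not found

**Cause**: You are not inside the Docker container, or the environment variables are not set

**Fix**: The launch script now sets the environment variables automatically ✅

---

## 🚀 Correct Way to Launch

### Option 1: Run inside the Docker container (recommended)

```bash
# Step 1: Enter the Docker container from the host
docker exec -it bevfusion bash

# Step 2: Run the launch script inside the container
cd /workspace/bevfusion
bash START_PHASE4A_TASK_GCA.sh

# Type 'y' when prompted
```

### Option 2: One-line command (enters the container automatically)

```bash
# Run on the host
docker exec -it bevfusion bash /workspace/bevfusion/一键启动.sh
```

---

## ✅ Environment Configuration

The launch script automatically sets the following environment variables:

```bash
export PATH=/opt/conda/bin:$PATH
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
```

It then verifies:

```
✅ PyTorch: 1.10.1
✅ mmcv: 1.4.0
✅ torchpack: /opt/conda/bin/torchpack
```

---

## 📊 Complete Launch Flow

```
1. Enter the container
   docker exec -it bevfusion bash

2. The script runs automatically:
   ├─ Set environment variables ✅
   ├─ Verify the Python environment ✅
   ├─ Check disk space ✅
   ├─ Confirm the checkpoint ✅
   ├─ Clean up .eval_hook ✅
   └─ Show the configuration summary

3. User confirmation: type 'y' to launch

4. Training starts:
   torchpack dist-run -np 8 /opt/conda/bin/python tools/train.py ...

5. Log output:
   /data/runs/phase4a_stage1_task_gca/*.log
```

---

## ✅ Post-launch Verification

### Check that Task-specific GCA is enabled

```bash
# Search the first 200 lines of the log.
# bash -c makes the *.log glob and the pipe run inside the container, not on the host.
docker exec bevfusion bash -c 'head -n 200 /data/runs/phase4a_stage1_task_gca/*.log | grep -A 10 "Task-specific"'
```

You should see:

```
[BEVFusion] ⚪ Shared BEV-level GCA disabled
[BEVFusion] ✨✨ Task-specific GCA mode enabled ✨✨
  [object] GCA:
    - in_channels: 512
    - reduction: 4
    - params: 131,072
  [map] GCA:
    - in_channels: 512
    - reduction: 4
    - params: 131,072
  Total task-specific GCA params: 262,144
  Advantage: Each task selects features by its own needs ✅
```

### Watch the training loss

```bash
# Live monitoring (glob expanded inside the container)
docker exec -it bevfusion bash -c 'tail -f /data/runs/phase4a_stage1_task_gca/*.log'

# Follow the divider loss; --line-buffered keeps grep from holding back output
docker exec bevfusion bash -c 'tail -f /data/runs/phase4a_stage1_task_gca/*.log | grep --line-buffered "loss/map/divider/dice"'
```

---

## 📊 Metrics to Monitor

### Check every 50 iterations

```
Detection:
  loss/object/loss_heatmap      # should be stable or decreasing
  stats/object/matched_ious     # should increase

Segmentation:
  loss/map/divider/dice         # should go 0.52→0.45→0.42 (lower is better!)
  loss/map/drivable_area/dice

General:
  grad_norm                     # 8-15 is normal
  memory                        # <20000
```

---

## 🎯 Expected Performance (Epoch 20)

```
Detection:     mAP        0.68 → 0.70    (+2.9%)
Segmentation:  mIoU       0.55 → 0.61    (+11%)
Divider:       Dice Loss  0.525 → 0.420  (-20% = better!)
```

**Important**: A lower Dice Loss is better!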
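As a quick sanity check on the percentages quoted above, the relative changes can be recomputed from the before/after values. This is only a minimal sketch: the helper name `relative_change` is hypothetical and not part of the training code, and the numbers are simply copied from the expected-performance block.

```python
def relative_change(before: float, after: float) -> float:
    """Return the relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Before/after values copied from the expected-performance block above.
expected = {
    "mAP": (0.68, 0.70),
    "mIoU": (0.55, 0.61),
    "divider Dice loss": (0.525, 0.420),
}

for name, (before, after) in expected.items():
    change = relative_change(before, after)
    print(f"{name}: {before} -> {after} ({change:+.1f}%)")
```

A negative change for the Dice loss is the desired direction, whereas mAP and mIoU should move upward.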

---

## 📁 Output Locations

```
Checkpoints:      /data/runs/phase4a_stage1_task_gca/epoch_*.pth
Logs:             /data/runs/phase4a_stage1_task_gca/*.log
Config snapshot:  /data/runs/phase4a_stage1_task_gca/configs.yaml
```

---

## ⏰ Time Estimate

```
Remaining epochs:    15 (epochs 6-20)
Time per epoch:      ~11 hours
Total time:          ~7 days
Expected completion: 2025-11-13
```

---

## 🔧 Troubleshooting

### If torchpack is still not found

```bash
# Set the environment manually
export PATH=/opt/conda/bin:$PATH
which torchpack

# Or use the full path
/opt/conda/bin/torchpack --version
```

### If Python raises import errors

```bash
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export PYTHONPATH=/workspace/bevfusion:$PYTHONPATH
```

---

## 📋 Quick Command Reference

```bash
# Enter the container
docker exec -it bevfusion bash

# Start training
cd /workspace/bevfusion
bash START_PHASE4A_TASK_GCA.sh

# Monitor the log
tail -f /data/runs/phase4a_stage1_task_gca/*.log

# Check the GPUs
nvidia-smi

# Check disk usage
df -h /workspace /data
```

---

**🎉 The environment issue is fixed! Training can now be launched correctly.**

**Run inside the Docker container**: `bash START_PHASE4A_TASK_GCA.sh`
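
The figures in the Time Estimate section can be reproduced with a few lines of arithmetic. The sketch below assumes a start time of 09:00 on 2025-11-06; the document gives only the date, so the hour is a hypothetical value chosen for illustration, and the per-epoch duration is the approximate figure quoted above.

```python
from datetime import datetime, timedelta

REMAINING_EPOCHS = 15   # epochs 6-20, from the Time Estimate section
HOURS_PER_EPOCH = 11    # approximate value from the Time Estimate section

# Hypothetical start time: only the date (2025-11-06) comes from the document.
start = datetime(2025, 11, 6, 9, 0)

total_hours = REMAINING_EPOCHS * HOURS_PER_EPOCH
finish = start + timedelta(hours=total_hours)

print(f"Total: {total_hours} h (~{total_hours / 24:.1f} days)")
print(f"Estimated finish: {finish:%Y-%m-%d %H:%M}")
```

Because the per-epoch time is approximate, the result is best read as a rough completion date and not as a precise timestamp.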