# Environment Change Detection Report

**Detected**: 2025-10-30
**Severity**: 🔴 High - blocks Phase 4A training

---

## 🚨 Key Findings

### PyTorch Version Mismatch

**Current environment**:
```
PyTorch: 2.4.1+cu121 ❌ new version
CUDA: 12.1
```

**Phase 3 training environment** (2025-10-21 ~ 10-29):
```
PyTorch: 1.10.1 (inferred) ✅ old version
CUDA: 11.3
```

**Problem**:
- mmcv was compiled against PyTorch 1.10.1
- PyTorch has since been upgraded to 2.4.1
- mmcv's C++ extensions are incompatible with the new PyTorch

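The incompatibility described here can be sketched as a simple version check. This is a minimal illustration, not mmcv's own logic: the helper names `parse_torch_version` and `extension_matches` are hypothetical, and the rule that a compiled extension needs the same major.minor PyTorch release it was built against is a simplifying assumption.

```python
import re


def parse_torch_version(version: str):
    """Split a version such as '2.4.1+cu121' into ((2, 4, 1), 'cu121')."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\+(\w+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, local = match.groups()
    return (int(major), int(minor), int(patch)), local


def extension_matches(built_for: str, running: str) -> bool:
    """Assumed rule: a compiled extension is usable only if major.minor match."""
    built, _ = parse_torch_version(built_for)
    current, _ = parse_torch_version(running)
    return built[:2] == current[:2]


# Versions taken from this report.
print(extension_matches("1.10.1", "2.4.1+cu121"))   # False
print(extension_matches("1.10.1", "1.10.1+cu113"))  # True
```
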
---

## 💡 Solutions

### Option 1: Downgrade PyTorch (recommended ⭐⭐⭐⭐⭐)

**Restore the environment in which training succeeded**:
```bash
# 1. Uninstall the current PyTorch
pip uninstall torch torchvision -y

# 2. Install PyTorch 1.10.1 + CUDA 11.3
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 \
    -f https://download.pytorch.org/whl/cu113/torch_stable.html

# 3. Verify
python -c "import torch; print(torch.__version__)"
python -c "from mmcv.ops import nms_match; print('success')"

# 4. Start training
bash START_PHASE4A_BEV2X.sh
```

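The verification step can also be collected into one small script that reports every problem at once. This is only a sketch: `verify_environment` is a hypothetical helper, the default expected version `1.10.1` comes from this report, and the `import_module` parameter exists only so the check can be exercised without a real PyTorch install.

```python
import importlib


def verify_environment(expected_torch="1.10.1",
                       import_module=importlib.import_module):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    try:
        torch = import_module("torch")
    except ImportError as exc:
        return [f"torch import failed: {exc}"]

    # Compare without the local build tag, e.g. '1.10.1+cu113' -> '1.10.1'.
    installed = str(torch.__version__).split("+")[0]
    if installed != expected_torch:
        problems.append(f"torch is {installed}, expected {expected_torch}")

    try:
        # Importing mmcv.ops pulls in the compiled extension, so an
        # incompatible build is expected to surface here as an ImportError.
        import_module("mmcv.ops")
    except ImportError as exc:
        problems.append(f"mmcv.ops import failed: {exc}")
    return problems
```

Running `print(verify_environment() or "OK")` inside the training environment prints either `OK` or the list of detected problems.
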
### Option 2: Rebuild mmcv for PyTorch 2.4

**Reinstall mmcv** (takes 1-2 hours):
```bash
# Uninstall the old mmcv
pip uninstall mmcv-full -y

# Install an mmcv build for PyTorch 2.4
# (if no prebuilt wheel matches, pip falls back to compiling from source)
pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.4/index.html

# Or build from source
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
MMCV_WITH_OPS=1 pip install -e .
```

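The wheel index URL in Option 2 encodes the CUDA and PyTorch versions. The sketch below rebuilds that URL from version strings, assuming the pattern visible in the URL here (`cu<CUDA digits>/torch<major.minor>`). The helper name `mmcv_index_url` is hypothetical, and the code does not check that an index actually exists for a given pair.

```python
def mmcv_index_url(torch_version: str, cuda_version: str) -> str:
    """Build a find-links URL following the pattern used in this report."""
    # Drop the local build tag and keep major.minor: '2.4.1+cu121' -> '2.4'.
    major, minor = torch_version.split("+")[0].split(".")[:2]
    # '12.1' -> 'cu121'
    cuda_tag = "cu" + cuda_version.replace(".", "")
    return (
        "https://download.openmmlab.com/mmcv/dist/"
        f"{cuda_tag}/torch{major}.{minor}/index.html"
    )


print(mmcv_index_url("2.4.1+cu121", "12.1"))
# https://download.openmmlab.com/mmcv/dist/cu121/torch2.4/index.html
```
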
### Option 3: Restore from a Docker Image (if a backup exists)

```bash
# Check whether a Docker image from Phase 3 training exists
docker images | grep bevfusion

# Run the old image
docker run -it [OLD_IMAGE_ID] /bin/bash
```

---

## 🔍 Environment Comparison

### Phase 3 (succeeded)
```
Dates: 2025-10-21 ~ 10-29
PyTorch: 1.10.1
CUDA: 11.3
mmcv: 1.4.0 (for torch 1.10.1)
Status: ✅ 23 epochs completed successfully
```

### Phase 4A (failed)
```
Date: 2025-10-30
PyTorch: 2.4.1 ❌ version changed
CUDA: 12.1
mmcv: 1.4.0 (for torch 1.10.1) ❌ incompatible
Status: ❌ cannot start
```

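The Phase 3 and Phase 4A environments can also be diffed programmatically. The following is a small sketch using the versions listed in this report; `diff_environments` is a hypothetical helper, and in practice the two dictionaries might be filled from saved `pip freeze` output.

```python
# Versions as recorded in this report.
phase3 = {"PyTorch": "1.10.1", "CUDA": "11.3", "mmcv": "1.4.0"}
phase4a = {"PyTorch": "2.4.1", "CUDA": "12.1", "mmcv": "1.4.0"}


def diff_environments(old: dict, new: dict) -> dict:
    """Map each changed component to its (old, new) version pair."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


for name, (before, after) in diff_environments(phase3, phase4a).items():
    print(f"{name}: {before} -> {after}")
# CUDA: 11.3 -> 12.1
# PyTorch: 1.10.1 -> 2.4.1
```

mmcv is absent from the output because its version did not change, which is the core of the problem: the same mmcv build now runs against a different PyTorch.
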
---

## ⏭️ Immediate Action

**Recommended: downgrade to PyTorch 1.10.1** (fastest and most stable)

```bash
cd /workspace/bevfusion

# Downgrade PyTorch
pip uninstall torch torchvision -y
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 \
    -f https://download.pytorch.org/whl/cu113/torch_stable.html

# Verify
python -c "import torch; print(torch.__version__)"
python -c "from mmcv.ops import nms_match; print('mmcv OK')"

# Start Phase 4A
bash START_PHASE4A_BEV2X.sh
```

---

**Root cause**: the PyTorch environment was upgraded between Phase 3 and Phase 4A
**Time to fix**: 10-15 minutes (downgrade PyTorch)
**Risk**: low (reverts to a previously working version)