# Why .eval_hook Was Created & nuScenes Dataset Integrity Report

📅 **Date**: 2025-11-06
🎯 **Question**: Why was a 75GB .eval_hook cache created? Is the dataset complete?

---

## 1. Why .eval_hook Was Created

### 1. Trigger mechanism

```yaml
# Configuration in multitask_BEV2X_phase4a_stage1.yaml
evaluation:
  interval: 5  # ⬅️ Key setting: evaluate once every 5 epochs
  pipeline: ${test_pipeline}
  metric:
    - bbox  # 3D detection evaluation
    - map   # BEV segmentation evaluation
```

**Trigger timeline**:
```
Epoch 1 ✅ Training complete → checkpoint saved
Epoch 2 ✅ Training complete → checkpoint saved
Epoch 3 ✅ Training complete → checkpoint saved
Epoch 4 ✅ Training complete → checkpoint saved
Epoch 5 ✅ Training complete → 🔥 evaluation triggered (interval=5)
            ↓
         📁 Create the .eval_hook directory
            ↓
         🚀 Run inference on the validation set
            ↓
         💾 Save inference results to .eval_hook/part_X.pkl
            ↓
         ❌ Disk full → process interrupted → cleanup never ran
```

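The interval rule behind this timeline can be sketched in a few lines. This is a minimal illustration, assuming the hook fires whenever the 1-based epoch number is divisible by `interval`; the function name `evaluation_epochs` is hypothetical, and the 20-epoch total is taken from the schedule mentioned later in this report.

```python
def evaluation_epochs(total_epochs: int, interval: int) -> list[int]:
    """Return the 1-based epochs after which an evaluation would run."""
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    # Assumption: evaluation fires when the epoch number is a multiple of interval.
    return [epoch for epoch in range(1, total_epochs + 1) if epoch % interval == 0]


if __name__ == "__main__":
    # With interval=5 the first evaluation happens right after epoch 5.
    print(evaluation_epochs(20, 5))
    # With interval=10 the first evaluation is delayed until epoch 10.
    print(evaluation_epochs(20, 10))
```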
---

## 2. Evaluation Workflow

### Step 1: Distributed inference
```
GPU 0 → processes val samples 1-4269      → .eval_hook/part_0.pkl (9.8GB)
GPU 1 → processes val samples 4270-8538   → .eval_hook/part_1.pkl (13GB)
GPU 2 → processes val samples 8539-12807  → .eval_hook/part_2.pkl (6.8GB)
GPU 3 → processes val samples 12808-17076 → .eval_hook/part_3.pkl (7.6GB)
GPU 4 → processes val samples 17077-21345 → .eval_hook/part_4.pkl (8.5GB)
GPU 5 → processes val samples 21346-25614 → .eval_hook/part_5.pkl (15GB)
GPU 6 → processes val samples 25615-29883 → .eval_hook/part_6.pkl (11GB)
GPU 7 → processes val samples 29884-34151 → .eval_hook/part_7.pkl (4.5GB)
```

**Total**: 8 GPUs × ~9.4GB on average ≈ **75GB**

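As a quick sanity check on the total, the per-file sizes listed above can simply be added up. The sketch below uses only the rounded sizes from the listing (the dictionary name `part_sizes_gb` is illustrative); because the inputs are rounded, the sum lands slightly above the quoted 75GB.

```python
# Rounded part-file sizes in GB, copied from the listing above.
part_sizes_gb = {
    "part_0.pkl": 9.8,
    "part_1.pkl": 13.0,
    "part_2.pkl": 6.8,
    "part_3.pkl": 7.6,
    "part_4.pkl": 8.5,
    "part_5.pkl": 15.0,
    "part_6.pkl": 11.0,
    "part_7.pkl": 4.5,
}

total_gb = sum(part_sizes_gb.values())
mean_gb = total_gb / len(part_sizes_gb)
largest = max(part_sizes_gb, key=part_sizes_gb.get)
smallest = min(part_sizes_gb, key=part_sizes_gb.get)

if __name__ == "__main__":
    print(f"total = {total_gb:.1f} GB, mean = {mean_gb:.2f} GB per part")
    print(f"largest = {largest}, smallest = {smallest}")
```

The spread between the largest and smallest file is also worth noting: the parts are far from uniform, so free-space planning should not assume an even split across ranks.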
### Step 2: Contents of each part file

```python
# Schematic layout of part_X.pkl (shapes are written as strings for illustration)
part_x_layout = {
    'bbox_results': [
        # 3D detection results for each sample
        {
            'boxes_3d': 'Tensor[N, 9]',   # N 3D boxes
            'scores_3d': 'Tensor[N]',     # confidence scores
            'labels_3d': 'Tensor[N]',     # 10 classes
        },
        ...  # about 4000 samples
    ],
    'pts_seg_results': [
        # BEV segmentation results for each sample
        {
            'seg_pred': 'Tensor[6, 600, 600]',       # 6-class segmentation @ 600×600
            'seg_logits': 'Tensor[6, 600, 600, 2]',  # logits
        },
        ...  # about 4000 samples
    ]
}
```

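**Why is it as large as 75GB?**
- High resolution: 600×600 (vs. 180×180 in Phase 3)
- 6 segmentation classes × 600×600 × 4 bytes (float32) = 8.6MB **per sample**
- 34,151 validation samples × 8.6MB = **294GB in theory** (this count conflicts with the 6,019 validation keyframes listed in Section 4; at 6,019 samples the same estimate is about 52GB)
- The actual size is 75GB because of compression and because only the necessary fields are saved
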
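The estimate in the bullets above is plain arithmetic and can be reproduced as follows. This is a sketch under the stated assumptions (one float32 value per class per BEV cell, decimal gigabytes); the function names are hypothetical. Using the exact per-sample size of 8.64MB rather than the rounded 8.6MB gives about 295GB instead of 294GB.

```python
def seg_bytes_per_sample(num_classes: int = 6, height: int = 600,
                         width: int = 600, bytes_per_value: int = 4) -> int:
    """Bytes needed for one dense segmentation tensor (float32 by default)."""
    return num_classes * height * width * bytes_per_value


def estimate_gb(num_samples: int, **tensor_shape) -> float:
    """Uncompressed size in decimal GB for num_samples segmentation tensors."""
    return num_samples * seg_bytes_per_sample(**tensor_shape) / 1e9


if __name__ == "__main__":
    print(seg_bytes_per_sample())                       # bytes per sample at 600×600
    print(seg_bytes_per_sample(height=180, width=180))  # bytes per sample at 180×180
    print(round(estimate_gb(34151), 1))                 # sample count used in this section
    print(round(estimate_gb(6019), 1))                  # validation keyframes from Section 4
```

Going from 180×180 to 600×600 multiplies the per-sample size by (600/180)², roughly 11×, which is why this problem did not show up in Phase 3.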
### Step 3: Aggregation on the main process
```python
import shutil

# Simplified sketch of the mmdetection evaluation hook (not the verbatim source)
def after_train_epoch(self, runner):
    if self.every_n_epochs(runner, self.interval):  # interval=5
        # 1. Collect all part files (CPU collection reads them back from .eval_hook/)
        results = collect_results_cpu(...)

        # 2. Compute metrics
        metrics = self.dataloader.dataset.evaluate(results)
        # - 3D Detection: mAP, NDS
        # - BEV Segmentation: mIoU (per-class)

        # 3. 🗑️ Clean up temporary files
        if runner.rank == 0:
            shutil.rmtree('.eval_hook/')  # ⬅️ never reached because the disk filled up
```

---

## 3. Why Was It Not Cleaned Up Automatically?

### Normal flow
```
Epoch 5 training complete
    ↓
Create .eval_hook/
    ↓
8 GPUs run inference in parallel → save part_0.pkl ~ part_7.pkl
    ↓
Main process collects results → computes mAP/mIoU
    ↓
✅ .eval_hook/ deleted automatically  ⬅️ normal case
    ↓
Continue with Epoch 6
```

### What happened this time
```
Epoch 5 training complete (10:06:01 UTC)
    ↓
Save epoch_5.pth (525MB) ✅
    ↓
Create .eval_hook/
    ↓
8 GPUs run inference in parallel
    ↓
part_0.pkl ~ part_5.pkl saved successfully (63GB cumulative)
    ↓
part_6.pkl half written...
    ↓
❌ Disk full (100% /workspace)
    ↓
Process killed / abnormal exit
    ↓
Cleanup code never ran → .eval_hook/ left behind ❌
```

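The failure mode above, where cleanup sits at the end of the happy path and is skipped on error, can be illustrated with a small sketch. This is not how mmdetection implements its hook; it is a generic pattern, assuming a hypothetical helper named `scratch_dir`, that removes the temporary directory even when writing fails with an exception. It does not help if the process is terminated with SIGKILL, since no Python code runs in that case.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def scratch_dir(parent: Path, name: str = ".eval_hook"):
    """Create parent/name and remove it on exit, even if the body raises."""
    path = parent / name
    path.mkdir(parents=True, exist_ok=True)
    try:
        yield path
    finally:
        # Runs on normal exit and on exceptions (not on SIGKILL).
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        work_dir = Path(work_dir)
        try:
            with scratch_dir(work_dir) as tmp:
                (tmp / "part_0.pkl").write_bytes(b"partial results")
                # Simulate the write failure seen in this incident.
                raise OSError("No space left on device")
        except OSError as err:
            print(f"evaluation failed: {err}")
        print("leftover directory exists:", (work_dir / ".eval_hook").exists())
```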
---

## 4. nuScenes Dataset Integrity Check

### ✅ Dataset location
```
Raw data:        /data/nuscenes/
Symlink:         /home/data -> /data
Project access:  /workspace/bevfusion/data/nuscenes -> /data/nuscenes
```

### ✅ Index files (complete)
```
/data/nuscenes/
├─ nuscenes_infos_train.pkl        1.2GB   ✅ training set index
├─ nuscenes_infos_val.pkl          257MB   ✅ validation set index
├─ nuscenes_infos_train_radar.pkl  15MB    ✅ training set radar index
├─ nuscenes_infos_val_radar.pkl    3.9MB   ✅ validation set radar index
├─ nuscenes_dbinfos_train.pkl      2.7MB   ✅ GT database sampling
├─ vector_maps_train.pkl           1.4MB   ✅ BEV segmentation labels (train)
├─ vector_maps_val.pkl             1.4MB   ✅ BEV segmentation labels (val)
└─ vector_maps_bevfusion.pkl       1.4MB   ✅ labels in BEVFusion format
```

### ✅ Raw data (complete)
```
/data/nuscenes/
├─ samples/                     53GB       ✅ keyframe data
│  ├─ CAM_FRONT/                34,149 images ✅
│  ├─ CAM_BACK/                 34,149 images ✅
│  ├─ CAM_FRONT_LEFT/           34,149 images ✅
│  ├─ CAM_FRONT_RIGHT/          34,149 images ✅
│  ├─ CAM_BACK_LEFT/            34,149 images ✅
│  ├─ CAM_BACK_RIGHT/           34,149 images ✅
│  ├─ LIDAR_TOP/                34,149 files  ✅
│  └─ RADAR_*/                  5 radars      ✅
│
├─ sweeps/                      (multi-frame data) ✅
│
├─ v1.0-trainval/               2.5GB      ✅ metadata
│  ├─ sample.json               7.1MB      ✅ sample index
│  ├─ sample_annotation.json    557MB      ✅ 3D annotations
│  ├─ ego_pose.json             616MB      ✅ ego vehicle poses
│  ├─ calibrated_sensor.json    3.2MB      ✅ sensor calibration
│  ├─ instance.json             16MB       ✅ instance tracking
│  └─ category.json             4.7KB      ✅ category definitions
│
├─ maps/                                   ✅ HD maps
│  ├─ boston-seaport/
│  ├─ singapore-hollandvillage/
│  ├─ singapore-onenorth/
│  └─ singapore-queenstown/
│
└─ nuscenes_gt_database/        912K dirs  ✅ GT augmentation database
```

### ✅ Dataset statistics
```
Total size:          ~120GB (raw data + indexes)
Training samples:    28,130 keyframes
Validation samples:  6,019 keyframes
Total:               34,149 keyframes
Sensors:             6 cameras + 1 LiDAR + 5 radars
Annotation classes:  10 3D object classes + 6 BEV segmentation classes
```

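A check like the one summarized above can be automated by counting the files in each keyframe sensor directory and comparing against the expected 34,149. The sketch below is one possible way to do that, not the procedure actually used for this report; the function names are hypothetical, and it assumes each sensor directory under `samples/` holds exactly one file per keyframe.

```python
from pathlib import Path

EXPECTED_KEYFRAMES = 34_149  # 28,130 train + 6,019 val, from the statistics above
KEYFRAME_SENSORS = [
    "CAM_FRONT", "CAM_BACK", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK_LEFT", "CAM_BACK_RIGHT", "LIDAR_TOP",
]


def count_keyframe_files(samples_dir: Path, sensors=KEYFRAME_SENSORS) -> dict[str, int]:
    """Count regular files per sensor directory; a missing directory counts as 0."""
    counts = {}
    for sensor in sensors:
        sensor_dir = samples_dir / sensor
        if sensor_dir.is_dir():
            counts[sensor] = sum(1 for p in sensor_dir.iterdir() if p.is_file())
        else:
            counts[sensor] = 0
    return counts


def find_mismatches(counts: dict[str, int], expected: int = EXPECTED_KEYFRAMES) -> dict[str, int]:
    """Return only the sensors whose file count differs from the expected value."""
    return {sensor: n for sensor, n in counts.items() if n != expected}


if __name__ == "__main__":
    counts = count_keyframe_files(Path("/data/nuscenes/samples"))
    mismatches = find_mismatches(counts)
    print("all sensors complete" if not mismatches else f"mismatches: {mismatches}")
```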
---

## 5. Solutions

### Option A: Clean up immediately (recommended)
```bash
# 1. Delete the .eval_hook cache (frees 75GB)
rm -rf /workspace/bevfusion/runs/run-326653dc-2334d461/.eval_hook/

# 2. Verify disk space
df -h /workspace  # should show ~75GB available

# 3. Resume training
cd /workspace/bevfusion
bash RESTART_FP32_STABLE.sh
```

**Advantages**:
- ✅ Simple and fast, takes effect immediately
- ✅ Safe: .eval_hook holds only temporary results, so checkpoints are not affected
- ✅ Training resumes seamlessly from epoch_5

### Option B: Reduce the evaluation frequency
```yaml
# Edit multitask_BEV2X_phase4a_stage1.yaml
evaluation:
  interval: 10  # changed from 5 to 10 to evaluate less often
```

**Effect**:
- Fewer evaluations: only 2 over 20 epochs (vs. 4)
- Lower disk pressure overall, although each evaluation still writes the same amount of temporary data

### Option C: Use the /data directory
```bash
# 1. Move runs to /data (240GB available)
mv /workspace/bevfusion/runs /data/
ln -s /data/runs /workspace/bevfusion/runs

# 2. Update the config (this is a YAML setting, not a shell command):
#    work_dir: /data/runs/phase4a_stage1

# 3. Remove .eval_hook
rm -rf /data/runs/run-326653dc-2334d461/.eval_hook/
```

**Advantages**:
- ✅ Permanently solves the disk-space problem
- ✅ /data has 240GB of free space
- ✅ Prevents the disk from filling up again

---

## 6. Preventive Measures

### 1. Monitor disk usage
```bash
# Run in a separate terminal alongside training (watch is interactive and would block a script)
watch -n 60 'df -h /workspace | tail -1'
```

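Monitoring can also be turned into a pre-flight check that runs before an evaluation starts. The following is a minimal sketch using the standard-library `shutil.disk_usage`; the helper name `has_free_space` and the 80GB threshold are assumptions chosen for illustration (the threshold is simply a margin above the 75GB observed in this incident).

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem containing path has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


if __name__ == "__main__":
    # "." is used so the sketch runs anywhere; in this setup the path would be /workspace.
    if has_free_space(".", required_gb=80):
        print("enough space for evaluation output")
    else:
        print("not enough free space, skip or redirect evaluation output")
```

A check of this kind only reports the state at the moment it runs, so it complements rather than replaces periodic monitoring.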
### 2. Automatic cleanup script
```bash
# Periodically clean up during training
while true; do
  sleep 3600  # check once per hour
  # -prune keeps find from descending into a directory it is about to delete
  find /workspace/bevfusion/runs -name ".eval_hook" -type d -mmin +30 -prune -exec rm -rf {} + 2>/dev/null
done &
```

### 3. Use tmpfs (temporary)
```yaml
# mmdetection config
evaluation:
  # RAM-backed temporary directory; it must be able to hold the full
  # evaluation output (about 75GB in this run)
  tmpdir: '/dev/shm/eval_tmp'
```

---

## 7. Summary

### Root cause
```
1. evaluation.interval=5 → evaluation triggered at Epoch 5
2. High-resolution BEV (600×600) → large footprint per sample
3. 8 GPUs in parallel → 8 part files generated
4. Disk full → process interrupted → cleanup never ran
5. .eval_hook left behind → occupies 75GB
```

### nuScenes dataset
```
✅ Complete: 34,149 samples, 120GB of raw data
✅ Location: /data/nuscenes/ (symlinked to the project's data/nuscenes/)
✅ Indexes:  train/val pkl files all present
✅ Labels:   vector_maps_*.pkl (BEV segmentation) all present
```

### Immediate action
```bash
# Run the cleanup
rm -rf /workspace/bevfusion/runs/run-326653dc-2334d461/.eval_hook/

# Restart training
cd /workspace/bevfusion && bash RESTART_FP32_STABLE.sh
```

**Expected result**: 75GB freed → training continues from Epoch 6 → the remaining 15 epochs complete