Add training details (#150)
* [Major] Update training details. * [Minor] Update README.md. * [Minor] Remove comment.
This commit is contained in:
parent
e4d599edd5
commit
f39a4a0752
36
README.md
36
README.md
|
|
@ -34,8 +34,8 @@ Multi-sensor fusion is essential for an accurate and reliable autonomous driving
|
|||
|
||||
| Model | Modality | mAP | NDS | Checkpoint |
|
||||
| :------------------: | :------: | :--: | :--: | :---------: |
|
||||
| [BEVFusion](configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml) | C+L | 68.39 | 71.32 | [Link](https://bevfusion.mit.edu/files/pretrained/bevfusion-det.pth) |
|
||||
| [Camera-Only Baseline](configs/nuscenes/det/centerhead/lssfpn/camera/256x704/swint/default.yaml) | C | 33.25 | 40.15 | [Link](https://bevfusion.mit.edu/files/pretrained/camera-only-det.pth) |
|
||||
| [BEVFusion](configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml) | C+L | 68.52 | 71.38 | [Link](https://bevfusion.mit.edu/files/pretrained_updated/bevfusion-det.pth) |
|
||||
| [Camera-Only Baseline](configs/nuscenes/det/centerhead/lssfpn/camera/256x704/swint/default.yaml) | C | 35.56 | 41.21 | [Link](https://bevfusion.mit.edu/files/pretrained_updated/camera-only-det.pth) |
|
||||
| [LiDAR-Only Baseline](configs/nuscenes/det/transfusion/secfpn/lidar/voxelnet_0p075.yaml) | L | 64.68 | 69.28 | [Link](https://bevfusion.mit.edu/files/pretrained/lidar-only-det.pth) |
|
||||
|
||||
*Note*: The camera-only object detection baseline is a variant of BEVDet-Tiny with a much heavier view transformer and other differences in hyperparameters. Thanks to our [efficient BEV pooling](mmdet3d/ops/bev_pool) operator, this model runs fast and has higher mAP than BEVDet-Tiny under the same input resolution. Please refer to [BEVDet repo](https://github.com/HuangJunjie2017/BEVDet) for the original BEVDet-Tiny implementation. The LiDAR-only baseline is TransFusion-L.
|
||||
|
|
@ -44,8 +44,8 @@ Multi-sensor fusion is essential for an accurate and reliable autonomous driving
|
|||
|
||||
| Model | Modality | mIoU | Checkpoint |
|
||||
| :------------------: | :------: | :--: | :---------: |
|
||||
| [BEVFusion](configs/nuscenes/seg/fusion-bev256d2-lss.yaml) | C+L | 62.69 | [Link](https://bevfusion.mit.edu/files/pretrained/bevfusion-seg.pth) |
|
||||
| [Camera-Only Baseline](configs/nuscenes/seg/camera-bev256d2.yaml) | C | 56.56 | [Link](https://bevfusion.mit.edu/files/pretrained/camera-only-seg.pth) |
|
||||
| [BEVFusion](configs/nuscenes/seg/fusion-bev256d2-lss.yaml) | C+L | 62.95 | [Link](https://bevfusion.mit.edu/files/pretrained_updated/bevfusion-seg.pth) |
|
||||
| [Camera-Only Baseline](configs/nuscenes/seg/camera-bev256d2.yaml) | C | 57.09 | [Link](https://bevfusion.mit.edu/files/pretrained_updated/camera-only-seg.pth) |
|
||||
| [LiDAR-Only Baseline](configs/nuscenes/seg/lidar-centerpoint-bev128.yaml) | L | 48.56 | [Link](https://bevfusion.mit.edu/files/pretrained/lidar-only-seg.pth) |
|
||||
|
||||
## Usage
|
||||
|
|
@ -122,6 +122,34 @@ While for the segmentation variant of BEVFusion, this command will be helpful:
|
|||
torchpack dist-run -np 8 python tools/test.py configs/nuscenes/seg/fusion-bev256d2-lss.yaml pretrained/bevfusion-seg.pth --eval map
|
||||
```
|
||||
|
||||
### Training
|
||||
|
||||
We provide instructions to reproduce our results on nuScenes.
|
||||
|
||||
For example, if you want to train the camera-only variant for object detection, please run:
|
||||
|
||||
```bash
|
||||
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/centerhead/lssfpn/camera/256x704/swint/default.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
|
||||
```
|
||||
|
||||
For camera-only BEV segmentation model, please run:
|
||||
|
||||
```bash
|
||||
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/seg/camera-bev256d2.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
|
||||
```
|
||||
|
||||
For LiDAR-only detector, please run:
|
||||
|
||||
```bash
|
||||
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/lidar/voxelnet_0p075.yaml
|
||||
```
|
||||
|
||||
For LiDAR-only BEV segmentation model, please run:
|
||||
|
||||
```bash
|
||||
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/seg/lidar-centerpoint-bev128.yaml
|
||||
```
|
||||
|
||||
## FAQs
|
||||
|
||||
Q: Can we directly use the info files prepared by mmdetection3d?
|
||||
|
|
|
|||
|
|
@ -11,16 +11,16 @@ voxel_size: [0.1, 0.1, 0.2]
|
|||
image_size: [256, 704]
|
||||
|
||||
augment2d:
|
||||
resize: [[0.48, 0.48], [0.48, 0.48]]
|
||||
rotate: [0.0, 0.0]
|
||||
resize: [[0.38, 0.55], [0.48, 0.48]]
|
||||
rotate: [-5.4, 5.4]
|
||||
gridmask:
|
||||
prob: 0.0
|
||||
fixed_prob: true
|
||||
|
||||
augment3d:
|
||||
scale: [1.0, 1.0]
|
||||
rotate: [0.0, 0.0]
|
||||
translate: 0.0
|
||||
scale: [0.9, 1.1]
|
||||
rotate: [-0.78539816, 0.78539816]
|
||||
translate: 0.5
|
||||
|
||||
object_classes:
|
||||
- car
|
||||
|
|
@ -121,7 +121,7 @@ train_pipeline:
|
|||
resize_lim: ${augment2d.resize[0]}
|
||||
bot_pct_lim: [0.0, 0.0]
|
||||
rot_lim: ${augment2d.rotate}
|
||||
rand_flip: false
|
||||
rand_flip: true
|
||||
is_train: true
|
||||
-
|
||||
type: GlobalRotScaleTrans
|
||||
|
|
@ -148,9 +148,8 @@ train_pipeline:
|
|||
classes: ${object_classes}
|
||||
-
|
||||
type: ImageNormalize
|
||||
mean: [103.530, 116.280, 123.675]
|
||||
std: [58.395, 57.12, 57.375]
|
||||
to_rgb: true
|
||||
mean: [0.485, 0.456, 0.406]
|
||||
std: [0.229, 0.224, 0.225]
|
||||
-
|
||||
type: GridMask
|
||||
use_h: true
|
||||
|
|
@ -180,6 +179,7 @@ train_pipeline:
|
|||
- camera2ego
|
||||
- lidar2ego
|
||||
- lidar2camera
|
||||
- camera2lidar
|
||||
- lidar2image
|
||||
- img_aug_matrix
|
||||
- lidar_aug_matrix
|
||||
|
|
@ -234,9 +234,8 @@ test_pipeline:
|
|||
point_cloud_range: ${point_cloud_range}
|
||||
-
|
||||
type: ImageNormalize
|
||||
mean: [103.530, 116.280, 123.675]
|
||||
std: [58.395, 57.12, 57.375]
|
||||
to_rgb: true
|
||||
mean: [0.485, 0.456, 0.406]
|
||||
std: [0.229, 0.224, 0.225]
|
||||
-
|
||||
type: DefaultFormatBundle3D
|
||||
classes: ${object_classes}
|
||||
|
|
@ -253,6 +252,7 @@ test_pipeline:
|
|||
- camera2ego
|
||||
- lidar2ego
|
||||
- lidar2camera
|
||||
- camera2lidar
|
||||
- lidar2image
|
||||
- img_aug_matrix
|
||||
- lidar_aug_matrix
|
||||
|
|
|
|||
|
|
@ -59,3 +59,27 @@ model:
|
|||
out_channels: 256
|
||||
scale_factor: 2
|
||||
|
||||
optimizer:
|
||||
paramwise_cfg:
|
||||
custom_keys:
|
||||
absolute_pos_embed:
|
||||
decay_mult: 0
|
||||
relative_position_bias_table:
|
||||
decay_mult: 0
|
||||
encoders.camera.backbone:
|
||||
lr_mult: 0.1
|
||||
|
||||
|
||||
lr_config:
|
||||
policy: cyclic
|
||||
target_ratio: 5.0
|
||||
cyclic_times: 1
|
||||
step_ratio_up: 0.4
|
||||
|
||||
momentum_config:
|
||||
policy: cyclic
|
||||
cyclic_times: 1
|
||||
step_ratio_up: 0.4
|
||||
|
||||
data:
|
||||
samples_per_gpu: 6
|
||||
|
|
|
|||
|
|
@ -1,6 +1,6 @@
|
|||
augment3d:
|
||||
scale: [1.0, 1.0]
|
||||
rotate: [0.0, 0.0]
|
||||
scale: [0.95, 1.05]
|
||||
rotate: [-0.3925, 0.3925]
|
||||
translate: 0
|
||||
|
||||
model:
|
||||
|
|
|
|||
|
|
@ -21,3 +21,15 @@ optimizer:
|
|||
type: AdamW
|
||||
lr: 2.0e-4
|
||||
weight_decay: 0.01
|
||||
|
||||
optimizer_config:
|
||||
grad_clip:
|
||||
max_norm: 35
|
||||
norm_type: 2
|
||||
|
||||
lr_config:
|
||||
policy: CosineAnnealing
|
||||
warmup: linear
|
||||
warmup_iters: 500
|
||||
warmup_ratio: 0.33333333
|
||||
min_lr_ratio: 1.0e-3
|
||||
|
|
@ -39,3 +39,9 @@ optimizer_config:
|
|||
grad_clip:
|
||||
max_norm: 35
|
||||
norm_type: 2
|
||||
|
||||
lr_config:
|
||||
policy: cyclic
|
||||
|
||||
momentum_config:
|
||||
policy: cyclic
|
||||
|
|
|
|||
|
|
@ -18,6 +18,9 @@ model:
|
|||
out_indices: [1, 2, 3]
|
||||
with_cp: false
|
||||
convert_weights: true
|
||||
init_cfg:
|
||||
type: Pretrained
|
||||
checkpoint: https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth
|
||||
neck:
|
||||
type: GeneralizedLSSFPN
|
||||
in_channels: [192, 384, 768]
|
||||
|
|
@ -63,3 +66,21 @@ model:
|
|||
optimizer:
|
||||
type: AdamW
|
||||
lr: 1.0e-4
|
||||
weight_decay: 0.01
|
||||
paramwise_cfg:
|
||||
custom_keys:
|
||||
absolute_pos_embed:
|
||||
decay_mult: 0
|
||||
relative_position_bias_table:
|
||||
decay_mult: 0
|
||||
|
||||
optimizer_config:
|
||||
grad_clip:
|
||||
max_norm: 35
|
||||
norm_type: 2
|
||||
|
||||
lr_config:
|
||||
policy: cyclic
|
||||
|
||||
momentum_config:
|
||||
policy: cyclic
|
||||
|
|
|
|||
|
|
@ -62,3 +62,14 @@ model:
|
|||
optimizer:
|
||||
type: AdamW
|
||||
lr: 1.0e-4
|
||||
|
||||
optimizer_config:
|
||||
grad_clip:
|
||||
max_norm: 35
|
||||
norm_type: 2
|
||||
|
||||
lr_config:
|
||||
policy: cyclic
|
||||
|
||||
momentum_config:
|
||||
policy: cyclic
|
||||
|
|
|
|||
|
|
@ -211,7 +211,6 @@ class NuScenesDataset(Custom3DDataset):
|
|||
|
||||
data = dict(
|
||||
token=info["token"],
|
||||
sample_idx=info['token'],
|
||||
lidar_path=info["lidar_path"],
|
||||
sweeps=info["sweeps"],
|
||||
timestamp=info["timestamp"],
|
||||
|
|
@ -236,6 +235,7 @@ class NuScenesDataset(Custom3DDataset):
|
|||
data["lidar2image"] = []
|
||||
data["camera2ego"] = []
|
||||
data["camera_intrinsics"] = []
|
||||
data["camera2lidar"] = []
|
||||
|
||||
for _, camera_info in info["cams"].items():
|
||||
data["image_paths"].append(camera_info["data_path"])
|
||||
|
|
@ -267,10 +267,14 @@ class NuScenesDataset(Custom3DDataset):
|
|||
camera2ego[:3, 3] = camera_info["sensor2ego_translation"]
|
||||
data["camera2ego"].append(camera2ego)
|
||||
|
||||
# TODO (Haotian): test set submission.
|
||||
# camera to lidar transform
|
||||
camera2lidar = np.eye(4).astype(np.float32)
|
||||
camera2lidar[:3, :3] = camera_info["sensor2lidar_rotation"]
|
||||
camera2lidar[:3, 3] = camera_info["sensor2lidar_translation"]
|
||||
data["camera2lidar"].append(camera2lidar)
|
||||
|
||||
annos = self.get_ann_info(index)
|
||||
data["ann_info"] = annos
|
||||
|
||||
return data
|
||||
|
||||
def get_ann_info(self, index):
|
||||
|
|
|
|||
|
|
@ -7,6 +7,8 @@ from mmdet3d.core.points import BasePoints
|
|||
from mmdet.datasets.builder import PIPELINES
|
||||
from mmdet.datasets.pipelines import to_tensor
|
||||
|
||||
import torch
|
||||
|
||||
|
||||
@PIPELINES.register_module()
|
||||
class DefaultFormatBundle3D:
|
||||
|
|
@ -96,14 +98,7 @@ class DefaultFormatBundle3D:
|
|||
dtype=np.int64,
|
||||
)
|
||||
if "img" in results:
|
||||
if isinstance(results["img"], list):
|
||||
# process multiple imgs in single frame
|
||||
imgs = [img.transpose(2, 0, 1) for img in results["img"]]
|
||||
imgs = np.ascontiguousarray(np.stack(imgs, axis=0))
|
||||
results["img"] = DC(to_tensor(imgs), stack=True)
|
||||
else:
|
||||
img = np.ascontiguousarray(results["img"].transpose(2, 0, 1))
|
||||
results["img"] = DC(to_tensor(img), stack=True)
|
||||
results["img"] = DC(torch.stack(results["img"]), stack=True)
|
||||
|
||||
for key in [
|
||||
"proposals",
|
||||
|
|
|
|||
|
|
@ -5,6 +5,8 @@ import mmcv
|
|||
import numpy as np
|
||||
from nuscenes.map_expansion.map_api import NuScenesMap
|
||||
from nuscenes.map_expansion.map_api import locations as LOCATIONS
|
||||
from PIL import Image
|
||||
|
||||
|
||||
from mmdet3d.core.points import BasePoints, get_points_type
|
||||
from mmdet.datasets.builder import PIPELINES
|
||||
|
|
@ -49,26 +51,25 @@ class LoadMultiViewImageFromFiles:
|
|||
"""
|
||||
filename = results["image_paths"]
|
||||
# img is of shape (h, w, c, num_views)
|
||||
img = np.stack(
|
||||
[mmcv.imread(name, self.color_type) for name in filename], axis=-1
|
||||
)
|
||||
if self.to_float32:
|
||||
img = img.astype(np.float32)
|
||||
# modified for waymo
|
||||
images = []
|
||||
h, w = 0, 0
|
||||
for name in filename:
|
||||
images.append(Image.open(name))
|
||||
|
||||
#TODO: consider image padding in waymo
|
||||
|
||||
results["filename"] = filename
|
||||
# unravel to list, see `DefaultFormatBundle` in formating.py
|
||||
# which will transpose each image separately and then stack into array
|
||||
results["img"] = [img[..., i] for i in range(img.shape[-1])]
|
||||
results["img_shape"] = img.shape
|
||||
results["ori_shape"] = img.shape
|
||||
results["img"] = images
|
||||
# [1600, 900]
|
||||
results["img_shape"] = images[0].size
|
||||
results["ori_shape"] = images[0].size
|
||||
# Set initial values for default meta_keys
|
||||
results["pad_shape"] = img.shape
|
||||
results["pad_shape"] = images[0].size
|
||||
results["scale_factor"] = 1.0
|
||||
num_channels = 1 if len(img.shape) < 3 else img.shape[2]
|
||||
results["img_norm_cfg"] = dict(
|
||||
mean=np.zeros(num_channels, dtype=np.float32),
|
||||
std=np.ones(num_channels, dtype=np.float32),
|
||||
to_rgb=False,
|
||||
)
|
||||
|
||||
return results
|
||||
|
||||
def __repr__(self):
|
||||
|
|
|
|||
|
|
@ -3,6 +3,7 @@ from typing import Any, Dict
|
|||
import mmcv
|
||||
import numpy as np
|
||||
import torch
|
||||
import torchvision
|
||||
from mmcv import is_tuple_of
|
||||
from mmcv.utils import build_from_cfg
|
||||
from numpy import random
|
||||
|
|
@ -34,7 +35,7 @@ class ImageAug3D:
|
|||
self.is_train = is_train
|
||||
|
||||
def sample_augmentation(self, results):
|
||||
H, W, _, _ = results["ori_shape"]
|
||||
W, H = results["ori_shape"]
|
||||
fH, fW = self.final_dim
|
||||
if self.is_train:
|
||||
resize = np.random.uniform(*self.resize_lim)
|
||||
|
|
@ -62,13 +63,11 @@ class ImageAug3D:
|
|||
self, img, rotation, translation, resize, resize_dims, crop, flip, rotate
|
||||
):
|
||||
# adjust image
|
||||
img = Image.fromarray(img.astype(np.uint8))
|
||||
img = img.resize(resize_dims)
|
||||
img = img.crop(crop)
|
||||
if flip:
|
||||
img = img.transpose(method=Image.FLIP_LEFT_RIGHT)
|
||||
img = img.rotate(rotate)
|
||||
img = np.array(img).astype(np.float32)
|
||||
|
||||
# post-homography transformation
|
||||
rotation *= resize
|
||||
|
|
@ -902,17 +901,19 @@ class ImagePad:
|
|||
|
||||
@PIPELINES.register_module()
|
||||
class ImageNormalize:
|
||||
def __init__(self, mean, std, to_rgb=True):
|
||||
self.mean = np.array(mean, dtype=np.float32)
|
||||
self.std = np.array(std, dtype=np.float32)
|
||||
self.to_rgb = to_rgb
|
||||
def __init__(self, mean, std):
|
||||
self.mean = mean
|
||||
self.std = std
|
||||
self.compose = torchvision.transforms.Compose(
|
||||
[
|
||||
torchvision.transforms.ToTensor(),
|
||||
torchvision.transforms.Normalize(mean=mean, std=std),
|
||||
]
|
||||
)
|
||||
|
||||
def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
|
||||
data["img"] = [
|
||||
mmcv.imnormalize(img, self.mean, self.std, self.to_rgb)
|
||||
for img in data["img"]
|
||||
]
|
||||
data["img_norm_cfg"] = dict(mean=self.mean, std=self.std, to_rgb=self.to_rgb)
|
||||
data["img"] = [self.compose(img) for img in data["img"]]
|
||||
data["img_norm_cfg"] = dict(mean=self.mean, std=self.std)
|
||||
return data
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -12,9 +12,10 @@ from mmdet3d.models.builder import (
|
|||
build_neck,
|
||||
build_vtransform,
|
||||
)
|
||||
from mmdet3d.ops import Voxelization
|
||||
from mmdet3d.ops import Voxelization, DynamicScatter
|
||||
from mmdet3d.models import FUSIONMODELS
|
||||
|
||||
|
||||
from .base import Base3DFusionModel
|
||||
|
||||
__all__ = ["BEVFusion"]
|
||||
|
|
@ -42,9 +43,13 @@ class BEVFusion(Base3DFusionModel):
|
|||
}
|
||||
)
|
||||
if encoders.get("lidar") is not None:
|
||||
if encoders["lidar"]["voxelize"].get("max_num_points", -1) > 0:
|
||||
voxelize_module = Voxelization(**encoders["lidar"]["voxelize"])
|
||||
else:
|
||||
voxelize_module = DynamicScatter(**encoders["lidar"]["voxelize"])
|
||||
self.encoders["lidar"] = nn.ModuleDict(
|
||||
{
|
||||
"voxelize": Voxelization(**encoders["lidar"]["voxelize"]),
|
||||
"voxelize": voxelize_module,
|
||||
"backbone": build_backbone(encoders["lidar"]["backbone"]),
|
||||
}
|
||||
)
|
||||
|
|
@ -89,6 +94,7 @@ class BEVFusion(Base3DFusionModel):
|
|||
lidar2camera,
|
||||
lidar2image,
|
||||
camera_intrinsics,
|
||||
camera2lidar,
|
||||
img_aug_matrix,
|
||||
lidar_aug_matrix,
|
||||
img_metas,
|
||||
|
|
@ -113,6 +119,7 @@ class BEVFusion(Base3DFusionModel):
|
|||
lidar2camera,
|
||||
lidar2image,
|
||||
camera_intrinsics,
|
||||
camera2lidar,
|
||||
img_aug_matrix,
|
||||
lidar_aug_matrix,
|
||||
img_metas,
|
||||
|
|
@ -130,18 +137,28 @@ class BEVFusion(Base3DFusionModel):
|
|||
def voxelize(self, points):
|
||||
feats, coords, sizes = [], [], []
|
||||
for k, res in enumerate(points):
|
||||
f, c, n = self.encoders["lidar"]["voxelize"](res)
|
||||
ret = self.encoders["lidar"]["voxelize"](res)
|
||||
if len(ret) == 3:
|
||||
# hard voxelize
|
||||
f, c, n = ret
|
||||
else:
|
||||
assert len(ret) == 2
|
||||
f, c = ret
|
||||
n = None
|
||||
feats.append(f)
|
||||
coords.append(F.pad(c, (1, 0), mode="constant", value=k))
|
||||
sizes.append(n)
|
||||
if n is not None:
|
||||
sizes.append(n)
|
||||
|
||||
feats = torch.cat(feats, dim=0)
|
||||
coords = torch.cat(coords, dim=0)
|
||||
sizes = torch.cat(sizes, dim=0)
|
||||
|
||||
if self.voxelize_reduce:
|
||||
feats = feats.sum(dim=1, keepdim=False) / sizes.type_as(feats).view(-1, 1)
|
||||
feats = feats.contiguous()
|
||||
if len(sizes) > 0:
|
||||
sizes = torch.cat(sizes, dim=0)
|
||||
if self.voxelize_reduce:
|
||||
feats = feats.sum(dim=1, keepdim=False) / sizes.type_as(feats).view(
|
||||
-1, 1
|
||||
)
|
||||
feats = feats.contiguous()
|
||||
|
||||
return feats, coords, sizes
|
||||
|
||||
|
|
@ -155,6 +172,48 @@ class BEVFusion(Base3DFusionModel):
|
|||
lidar2camera,
|
||||
lidar2image,
|
||||
camera_intrinsics,
|
||||
camera2lidar,
|
||||
img_aug_matrix,
|
||||
lidar_aug_matrix,
|
||||
metas,
|
||||
gt_masks_bev=None,
|
||||
gt_bboxes_3d=None,
|
||||
gt_labels_3d=None,
|
||||
**kwargs,
|
||||
):
|
||||
if isinstance(img, list):
|
||||
raise NotImplementedError
|
||||
else:
|
||||
outputs = self.forward_single(
|
||||
img,
|
||||
points,
|
||||
camera2ego,
|
||||
lidar2ego,
|
||||
lidar2camera,
|
||||
lidar2image,
|
||||
camera_intrinsics,
|
||||
camera2lidar,
|
||||
img_aug_matrix,
|
||||
lidar_aug_matrix,
|
||||
metas,
|
||||
gt_masks_bev,
|
||||
gt_bboxes_3d,
|
||||
gt_labels_3d,
|
||||
**kwargs,
|
||||
)
|
||||
return outputs
|
||||
|
||||
@auto_fp16(apply_to=("img", "points"))
|
||||
def forward_single(
|
||||
self,
|
||||
img,
|
||||
points,
|
||||
camera2ego,
|
||||
lidar2ego,
|
||||
lidar2camera,
|
||||
lidar2image,
|
||||
camera_intrinsics,
|
||||
camera2lidar,
|
||||
img_aug_matrix,
|
||||
lidar_aug_matrix,
|
||||
metas,
|
||||
|
|
@ -164,7 +223,9 @@ class BEVFusion(Base3DFusionModel):
|
|||
**kwargs,
|
||||
):
|
||||
features = []
|
||||
for sensor in self.encoders:
|
||||
for sensor in (
|
||||
self.encoders if self.training else list(self.encoders.keys())[::-1]
|
||||
):
|
||||
if sensor == "camera":
|
||||
feature = self.extract_camera_features(
|
||||
img,
|
||||
|
|
@ -174,6 +235,7 @@ class BEVFusion(Base3DFusionModel):
|
|||
lidar2camera,
|
||||
lidar2image,
|
||||
camera_intrinsics,
|
||||
camera2lidar,
|
||||
img_aug_matrix,
|
||||
lidar_aug_matrix,
|
||||
metas,
|
||||
|
|
@ -184,6 +246,10 @@ class BEVFusion(Base3DFusionModel):
|
|||
raise ValueError(f"unsupported sensor: {sensor}")
|
||||
features.append(feature)
|
||||
|
||||
if not self.training:
|
||||
# avoid OOM
|
||||
features = features[::-1]
|
||||
|
||||
if self.fuser is not None:
|
||||
x = self.fuser(features)
|
||||
else:
|
||||
|
|
|
|||
|
|
@ -78,16 +78,15 @@ class BaseTransform(nn.Module):
|
|||
@force_fp32()
|
||||
def get_geometry(
|
||||
self,
|
||||
rots,
|
||||
trans,
|
||||
camera2lidar_rots,
|
||||
camera2lidar_trans,
|
||||
intrins,
|
||||
post_rots,
|
||||
post_trans,
|
||||
lidar2ego_rots,
|
||||
lidar2ego_trans,
|
||||
**kwargs,
|
||||
):
|
||||
B, N, _ = trans.shape
|
||||
B, N, _ = camera2lidar_trans.shape
|
||||
|
||||
# undo post-transformation
|
||||
# B x N x D x H x W x 3
|
||||
points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
|
||||
|
|
@ -96,7 +95,7 @@ class BaseTransform(nn.Module):
|
|||
.view(B, N, 1, 1, 1, 3, 3)
|
||||
.matmul(points.unsqueeze(-1))
|
||||
)
|
||||
# cam_to_ego
|
||||
# cam_to_lidar
|
||||
points = torch.cat(
|
||||
(
|
||||
points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
|
||||
|
|
@ -104,17 +103,9 @@ class BaseTransform(nn.Module):
|
|||
),
|
||||
5,
|
||||
)
|
||||
combine = rots.matmul(torch.inverse(intrins))
|
||||
combine = camera2lidar_rots.matmul(torch.inverse(intrins))
|
||||
points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
|
||||
points += trans.view(B, N, 1, 1, 1, 3)
|
||||
# ego_to_lidar
|
||||
points -= lidar2ego_trans.view(B, 1, 1, 1, 1, 3)
|
||||
points = (
|
||||
torch.inverse(lidar2ego_rots)
|
||||
.view(B, 1, 1, 1, 1, 3, 3)
|
||||
.matmul(points.unsqueeze(-1))
|
||||
.squeeze(-1)
|
||||
)
|
||||
points += camera2lidar_trans.view(B, N, 1, 1, 1, 3)
|
||||
|
||||
if "extra_rots" in kwargs:
|
||||
extra_rots = kwargs["extra_rots"]
|
||||
|
|
@ -181,9 +172,9 @@ class BaseTransform(nn.Module):
|
|||
lidar2camera,
|
||||
lidar2image,
|
||||
camera_intrinsics,
|
||||
camera2lidar,
|
||||
img_aug_matrix,
|
||||
lidar_aug_matrix,
|
||||
metas=None,
|
||||
**kwargs,
|
||||
):
|
||||
rots = camera2ego[..., :3, :3]
|
||||
|
|
@ -193,19 +184,20 @@ class BaseTransform(nn.Module):
|
|||
post_trans = img_aug_matrix[..., :3, 3]
|
||||
lidar2ego_rots = lidar2ego[..., :3, :3]
|
||||
lidar2ego_trans = lidar2ego[..., :3, 3]
|
||||
camera2lidar_rots = camera2lidar[..., :3, :3]
|
||||
camera2lidar_trans = camera2lidar[..., :3, 3]
|
||||
|
||||
extra_rots = lidar_aug_matrix[..., :3, :3]
|
||||
extra_trans = lidar_aug_matrix[..., :3, 3]
|
||||
|
||||
geom = self.get_geometry(
|
||||
rots,
|
||||
trans,
|
||||
camera2lidar_rots,
|
||||
camera2lidar_trans,
|
||||
intrins,
|
||||
post_rots,
|
||||
post_trans,
|
||||
lidar2ego_rots,
|
||||
lidar2ego_trans,
|
||||
extra_rots=extra_rots,
|
||||
extra_trans=extra_trans
|
||||
extra_trans=extra_trans,
|
||||
)
|
||||
|
||||
x = self.get_cam_feats(img)
|
||||
|
|
@ -224,6 +216,7 @@ class BaseDepthTransform(BaseTransform):
|
|||
lidar2camera,
|
||||
lidar2image,
|
||||
cam_intrinsic,
|
||||
camera2lidar,
|
||||
img_aug_matrix,
|
||||
lidar_aug_matrix,
|
||||
metas,
|
||||
|
|
@ -236,18 +229,27 @@ class BaseDepthTransform(BaseTransform):
|
|||
post_trans = img_aug_matrix[..., :3, 3]
|
||||
lidar2ego_rots = lidar2ego[..., :3, :3]
|
||||
lidar2ego_trans = lidar2ego[..., :3, 3]
|
||||
extra_rots = lidar_aug_matrix[..., :3, :3]
|
||||
extra_trans = lidar_aug_matrix[..., :3, 3]
|
||||
camera2lidar_rots = camera2lidar[..., :3, :3]
|
||||
camera2lidar_trans = camera2lidar[..., :3, 3]
|
||||
|
||||
# print(img.shape, self.image_size, self.feature_size)
|
||||
|
||||
batch_size = len(points)
|
||||
depth = torch.zeros(batch_size, 6, 1, *self.image_size).to(points[0].device)
|
||||
depth = torch.zeros(batch_size, img.shape[1], 1, *self.image_size).to(
|
||||
points[0].device
|
||||
)
|
||||
|
||||
for b in range(batch_size):
|
||||
cur_coords = points[b][:, :3].transpose(1, 0)
|
||||
cur_coords = points[b][:, :3]
|
||||
cur_img_aug_matrix = img_aug_matrix[b]
|
||||
cur_lidar_aug_matrix = lidar_aug_matrix[b]
|
||||
cur_lidar2image = lidar2image[b]
|
||||
|
||||
# inverse aug
|
||||
cur_coords -= cur_lidar_aug_matrix[:3, 3]
|
||||
cur_coords = torch.inverse(cur_lidar_aug_matrix[:3, :3]).matmul(
|
||||
cur_coords.transpose(1, 0)
|
||||
)
|
||||
# lidar2image
|
||||
cur_coords = cur_lidar2image[:, :3, :3].matmul(cur_coords)
|
||||
cur_coords += cur_lidar2image[:, :3, 3].reshape(-1, 3, 1)
|
||||
|
|
@ -270,19 +272,21 @@ class BaseDepthTransform(BaseTransform):
|
|||
& (cur_coords[..., 1] < self.image_size[1])
|
||||
& (cur_coords[..., 1] >= 0)
|
||||
)
|
||||
for c in range(6):
|
||||
for c in range(on_img.shape[0]):
|
||||
masked_coords = cur_coords[c, on_img[c]].long()
|
||||
masked_dist = dist[c, on_img[c]]
|
||||
depth[b, c, 0, masked_coords[:, 0], masked_coords[:, 1]] = masked_dist
|
||||
|
||||
extra_rots = lidar_aug_matrix[..., :3, :3]
|
||||
extra_trans = lidar_aug_matrix[..., :3, 3]
|
||||
geom = self.get_geometry(
|
||||
rots,
|
||||
trans,
|
||||
camera2lidar_rots,
|
||||
camera2lidar_trans,
|
||||
intrins,
|
||||
post_rots,
|
||||
post_trans,
|
||||
lidar2ego_rots,
|
||||
lidar2ego_trans,
|
||||
extra_rots=extra_rots,
|
||||
extra_trans=extra_trans,
|
||||
)
|
||||
|
||||
x = self.get_cam_feats(img, depth)
|
||||
|
|
|
|||
|
|
@ -1,9 +1,10 @@
|
|||
mkdir pretrained &&
|
||||
cd pretrained &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained/bevfusion-det.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained/bevfusion-seg.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained_updated/bevfusion-det.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained_updated/bevfusion-seg.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained/lidar-only-det.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained/lidar-only-seg.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained/camera-only-det.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained/camera-only-seg.pth
|
||||
wget https://bevfusion.mit.edu/files/pretrained_updated/camera-only-det.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained_updated/camera-only-seg.pth &&
|
||||
wget https://bevfusion.mit.edu/files/pretrained_updated/swint-nuimages-pretrained.pth
|
||||
|
||||
|
|
|
|||
Loading…
Reference in New Issue