LION: Linear Group RNN for 3D Object Detection in Point Clouds

1 Huazhong University of Science and Technology, 2 The University of Hong Kong, 3 Baidu Inc.
(* Equal Contribution, Corresponding Author)

🔥 Highlights


1. Strong performance. LION achieves state-of-the-art performance on Waymo, nuScenes, Argoverse V2, and ONCE datasets. 💪


2. Strong generalization. LION can support almost all linear RNN operators including Mamba, RWKV, RetNet, and TTT. Anyone is welcome to verify more linear RNN operators 😀.


3. More friendly. All LION models can be trained with less than 24 GB of GPU memory (e.g., an RTX 3090, RTX 4090, V100, or A100 is enough to train LION) 😀.

Abstract

The benefit of transformers in large-scale 3D point cloud perception tasks, such as 3D object detection, is limited by their quadratic computation cost when modeling long-range relationships. In contrast, linear RNNs have low computational complexity and are suitable for long-range modeling. Toward this goal, we propose a simple and effective window-based framework built on LInear grOup RNN (i.e., performing linear RNN on grouped features) for accurate 3D object detection, called LION. Its key property is that it allows sufficient feature interaction within a much larger group than transformer-based methods can afford. However, effectively applying linear group RNN to 3D object detection in highly sparse point clouds is not trivial because of its limitations in spatial modeling. To tackle this problem, we introduce a 3D spatial feature descriptor and integrate it into the linear group RNN operators to enhance their spatial features, rather than blindly increasing the number of scanning orders for voxel features. To further address the challenge of highly sparse point clouds, we propose a 3D voxel generation strategy that densifies foreground features by exploiting the natural auto-regressive property of linear group RNN. Extensive experiments verify the effectiveness of the proposed components and the generalization of LION across different linear group RNN operators, including Mamba, RWKV, and RetNet. Finally, it is worth mentioning that LION-Mamba achieves state-of-the-art performance on the Waymo, nuScenes, Argoverse V2, and ONCE datasets.
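To make the idea of "linear RNN on grouped features" concrete, the following is a minimal sketch and not the LION implementation. It assumes scalar voxel features, a hypothetical (z, y, x) scan order, and a toy recurrence h_t = decay * h_{t-1} + x_t standing in for operators such as Mamba, RWKV, or RetNet. The function names `group_voxels`, `linear_scan`, and `linear_group_rnn` are illustrative only.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[int, int, int]
Voxel = Tuple[Coord, float]  # ((x, y, z), scalar feature); real features are vectors


def group_voxels(voxels: Sequence[Voxel], group_size: int) -> List[List[Voxel]]:
    """Sort sparse voxels along a scan order and cut them into fixed-size groups."""
    # Assumed scan order: sort by (z, y, x). The actual ordering is a design choice.
    ordered = sorted(voxels, key=lambda v: (v[0][2], v[0][1], v[0][0]))
    return [list(ordered[i:i + group_size]) for i in range(0, len(ordered), group_size)]


def linear_scan(features: Sequence[float], decay: float = 0.5) -> List[float]:
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t; cost grows linearly with length."""
    out: List[float] = []
    h = 0.0
    for x in features:
        h = decay * h + x
        out.append(h)
    return out


def linear_group_rnn(voxels: Sequence[Voxel], group_size: int, decay: float = 0.5) -> List[Voxel]:
    """Run the recurrence independently inside each group of voxels."""
    result: List[Voxel] = []
    for group in group_voxels(voxels, group_size):
        hidden = linear_scan([feat for _, feat in group], decay)
        result.extend((coord, h) for (coord, _), h in zip(group, hidden))
    return result
```

Because the per-group cost is linear in the group length, the group size can be made much larger than a typical attention window without the quadratic blow-up that pairwise attention would incur.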

Illustration of LION, which mainly consists of several LION blocks, each paired with a voxel generation step for feature enhancement and a voxel merging step for down-sampling features along the height dimension.
The LION block is the core component of our approach. It involves a LION layer for long-range feature interaction, a 3D spatial feature descriptor for capturing local 3D spatial information, voxel merging for feature down-sampling, and voxel expanding for feature up-sampling.
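The down-sampling and up-sampling steps named in the captions above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: features are scalars, merging sums the features of voxels that share (x, y, z // stride), and expanding copies each merged feature back to the original voxel positions. The names `merge_along_height` and `expand`, the `stride` parameter, and the summation rule are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


def merge_along_height(
    voxels: Dict[Coord, float], stride: int = 2
) -> Tuple[Dict[Coord, float], Dict[Coord, Coord]]:
    """Down-sample: voxels sharing (x, y, z // stride) collapse into one voxel."""
    merged: Dict[Coord, float] = defaultdict(float)
    parent: Dict[Coord, Coord] = {}
    for (x, y, z), feat in voxels.items():
        key = (x, y, z // stride)
        merged[key] += feat        # summation is an assumed pooling rule
        parent[(x, y, z)] = key    # remember where each original voxel went
    return dict(merged), parent


def expand(merged: Dict[Coord, float], parent: Dict[Coord, Coord]) -> Dict[Coord, float]:
    """Up-sample: copy each merged feature back to its original voxel positions."""
    return {coord: merged[key] for coord, key in parent.items()}
```

Keeping the `parent` mapping from the merge step is one simple way to restore the original sparse layout after processing at the coarser resolution.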

Evaluation

LION achieves state-of-the-art 3D object detection performance on four of the most popular large-scale datasets with different linear RNN operators. ⚡

Waymo

Performance on the Waymo Open Dataset validation set (trained with 100% of the training data).

nuScenes

Performance on the nuScenes validation and test sets.

Argoverse V2

Performance on the Argoverse V2 validation set.

ONCE

Performance on the ONCE validation set.

KITTI

Performance on the KITTI validation set.

Visualization

Waymo
nuScenes
Argoverse V2
ONCE

BibTeX

@misc{liu2024lion,
  author = {Zhe Liu and Jinghua Hou and Xingyu Wang and Xiaoqing Ye and Jingdong Wang and Hengshuang Zhao and Xiang Bai},
  title = {LION: Linear Group RNN for 3D Object Detection in Point Clouds},
  year={2024},
  eprint={2407.18232},
  archivePrefix={arXiv}
}