ICRA 2026

Real-Time BEVFormer: Fast Transformer-Based BEV Perception Network on Edge Device

Juyoung Yang, Seoha Baek, Eunbin Seo, Wonseok Jeon, Doyeon Kim, Jongsun Kim, Heeyeon Nah

AI summary

RT-BEVFormer enables real-time, high-accuracy 3D perception on edge devices by distilling foundation model knowledge into a lightweight backbone and replacing dynamic attention sampling with an efficient static method.

BEV perception, edge computing, knowledge distillation, static sampling, real-time inference, autonomous driving

Problem

Transformer-based BEV perception networks are typically too computationally heavy for real-time deployment on resource-constrained edge devices, primarily due to backbone complexity and latency bottlenecks from dynamic attention sampling.

Approach

The authors propose RT-BEVFormer, which boosts a lightweight backbone via knowledge distillation from a powerful foundation model and replaces dynamic spatial cross-attention with a fixed, efficient static sampling strategy to minimize latency and simplify deployment.
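A feature-level distillation objective of this kind can be sketched as follows. This is a minimal illustration, not the authors' training code: the cosine-distance loss, the linear adapter, and all names and values (`project`, `distillation_loss`, `adapter`) are assumptions, since the summary does not state the loss that is used. It is written in plain Python to stay dependency-free. The adapter exists only during training, which is one way distillation can leave inference cost unchanged.

```python
import math


def project(features, weight):
    """Linear projection of one student token to the teacher's feature width."""
    return [sum(w * x for w, x in zip(row, features)) for row in weight]


def cosine_distance(a, b, eps=1e-8):
    """1 - cosine similarity; close to 0 when both vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)


def distillation_loss(student_tokens, teacher_tokens, weight):
    """Mean cosine distance between projected student tokens and teacher tokens."""
    losses = [
        cosine_distance(project(s, weight), t)
        for s, t in zip(student_tokens, teacher_tokens)
    ]
    return sum(losses) / len(losses)


# Toy data (hypothetical): 2 tokens, student width 2, teacher width 3.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
adapter = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # 3x2 matrix, used in training only

aligned_loss = distillation_loss(student, teacher, adapter)
orthogonal_loss = distillation_loss([[1.0, 0.0]], [[0.0, 1.0, 0.0]], adapter)
```

Here `aligned_loss` is near zero because the projected student tokens are parallel to the teacher tokens, while `orthogonal_loss` is near one. In a real setup the student and adapter weights would be updated by gradient descent on this loss while the teacher stays frozen.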

Key results

Distills RADIO foundation model features into a compact student backbone without adding latency
Introduces efficient static sampling to replace dynamic deformable attention, drastically reducing encoder latency
Outperforms FastBEV and BEVFormer-tiny in accuracy while achieving a 412% FPS increase on NVIDIA Jetson Orin
Enables straightforward deployment via standard ONNX and TensorRT export without custom hardware plugins
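The static sampling idea in the results above can be illustrated with a small sketch. It assumes that "static" means the sampling locations do not depend on the input, so the bilinear corner indices and weights can be computed once offline. The function names and the toy feature map are hypothetical, and the actual sampling pattern used by RT-BEVFormer is not described in this summary.

```python
import math


def build_static_table(points, height, width):
    """Precompute bilinear corner indices and weights for fixed (x, y) locations.

    Assumes height >= 2 and width >= 2. Indices refer to a row-major
    flattened feature map.
    """
    table = []
    for x, y in points:
        # Clamp so that all four corners stay inside the feature map.
        x = min(max(x, 0.0), width - 1.0)
        y = min(max(y, 0.0), height - 1.0)
        x0 = min(int(math.floor(x)), width - 2)
        y0 = min(int(math.floor(y)), height - 2)
        fx, fy = x - x0, y - y0
        table.append([
            (y0 * width + x0, (1 - fx) * (1 - fy)),
            (y0 * width + x0 + 1, fx * (1 - fy)),
            ((y0 + 1) * width + x0, (1 - fx) * fy),
            ((y0 + 1) * width + x0 + 1, fx * fy),
        ])
    return table


def sample_static(flat_features, table):
    """Run-time step: a gather followed by a weighted sum, with no offset prediction."""
    return [sum(w * flat_features[i] for i, w in corners) for corners in table]


# Toy 2x3 single-channel feature map, flattened row by row (hypothetical values).
features = [0.0, 1.0, 2.0,
            10.0, 11.0, 12.0]
table = build_static_table([(0.5, 0.5), (2.0, 1.0)], height=2, width=3)
samples = sample_static(features, table)
```

Because the table is fixed, the run-time work reduces to indexing and multiply-accumulate operations, which is plausibly why such a scheme exports cleanly through standard tooling. A dynamic deformable-attention layer, by contrast, must predict offsets from the queries and resample on every forward pass.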

Why it matters

The work shows that transformer-based BEV perception can run in real time on edge hardware, making advanced 3D perception practical for autonomous vehicles and robots.

Abstract

The development of camera-based real-time 3D perception networks for edge devices is essential for embodied systems such as autonomous vehicles and robots. However, existing methods often demand substantial computational resources and tend to overlook performance on resource-constrained devices. In this paper, we propose RT-BEVFormer, a simple yet effective multi-task 3D perception framework designed for efficiency. Based on BEVFormer, RT-BEVFormer enhances the feature extraction capability of the backbone and redesigns the spatial cross-attention module in the encoder, guided by two key observations: 1) the computational load and total number of parameters are dominated by the backbone, and 2) the sampling process within the deformable attention module is a primary bottleneck. Specifically, we leverage powerful foundation models to distill their rich and comprehensive knowledge, thereby crafting a highly efficient student backbone. This allows RT-BEVFormer to achieve significant performance gains without incurring additional latency. Furthermore, we introduce an efficient static sampling method. This approach replaces the dynamic, deployment-unfriendly sampling of standard spatial cross-attention, allowing the model to focus on salient image features with minimal overhead. On NVIDIA Jetson Orin, a widely used edge device, RT-BEVFormer outperforms the previous state-of-the-art model in both accuracy and inference speed. Extensive experiments on the nuScenes dataset show that each component of our framework contributes to both inference speed and overall accuracy. Finally, because RT-BEVFormer is implemented without any model-specific custom plugin, it offers greater flexibility and ease of deployment.
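To make the geometry behind spatial cross-attention concrete: BEVFormer lifts each BEV cell to a pillar of 3D reference points and projects them into the camera views. The sketch below shows that projection with a pinhole model. With fixed calibration the projected locations are constants, which is one reason a static scheme is feasible. The camera parameters, the axis convention, and the function names are illustrative assumptions, not values from the paper.

```python
def project_to_image(point, rotation, translation, fx, fy, cx, cy):
    """Pinhole projection of a 3D point into pixel coordinates.

    rotation (3x3) and translation (3) map the point into the camera frame.
    Returns None when the point lies behind the camera.
    """
    xc, yc, zc = (
        sum(r * p for r, p in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    if zc <= 1e-6:
        return None
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def static_locations(bev_cells, heights, camera):
    """Project every (cell, height) reference point once into a constant table."""
    table = {}
    for cell_x, cell_y in bev_cells:
        hits = []
        for z in heights:
            uv = project_to_image((cell_x, cell_y, z), **camera)
            if uv is not None:
                hits.append(uv)
        table[(cell_x, cell_y)] = hits
    return table


# Hypothetical front camera: ego frame (x forward, y left, z up) mapped to
# camera frame (x right, y down, z forward), mounted at the ego origin.
front_camera = {
    "rotation": [[0.0, -1.0, 0.0], [0.0, 0.0, -1.0], [1.0, 0.0, 0.0]],
    "translation": [0.0, 0.0, 0.0],
    "fx": 100.0, "fy": 100.0, "cx": 50.0, "cy": 40.0,
}
locations = static_locations(
    bev_cells=[(10.0, 1.0), (-5.0, 0.0)],
    heights=[0.0, 2.0],
    camera=front_camera,
)
```

The cell in front of the vehicle yields two pixel locations, one per pillar height, while the cell behind the vehicle yields none for this camera. A full pipeline would also discard points that project outside the image bounds and repeat the process for every camera.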

Index terms

Deep Learning for Visual Perception; Object Detection, Segmentation and Categorization; Vision-Based Navigation