ICRA 2026

Efficient Multi-Camera Tokenization with Triplanes for End-To-End Driving

Boris Ivanovic, Cristiano Saltori, Yurong You, Yan Wang, Wenjie Luo, Marco Pavone

AI summary

Triplane-based tokenization cuts sensor tokens by up to 72% and accelerates driving policy inference by up to 50% while preserving open-loop planning accuracy.
Multi-camera tokenization, Triplanes, End-to-end driving, Autonomous vehicles, Neural rendering, Transformer efficiency

Problem

Patch-based image tokenizers cause token counts to scale linearly with camera count and resolution, preventing real-time inference of large autoregressive Transformers on embedded autonomous vehicle hardware.
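
To make the scaling concrete, here is a minimal sketch of how a ViT-style patch tokenizer's token count grows. It assumes one token per non-overlapping square patch; the function name, image sizes, and patch size are hypothetical illustrations, not values from the paper.

```python
def patch_token_count(num_cameras: int, height: int, width: int, patch_size: int) -> int:
    """Tokens per timestep for a patch tokenizer.

    Assumes one token per non-overlapping patch_size x patch_size patch
    and the same resolution for every camera (illustrative assumptions).
    """
    patches_per_image = (height // patch_size) * (width // patch_size)
    return num_cameras * patches_per_image


# Hypothetical rigs: the count grows in proportion to the number of cameras.
for cams in (1, 4, 7):
    print(cams, patch_token_count(cams, height=320, width=512, patch_size=16))

# Doubling height and width quadruples the pixel count, and the tokens with it.
print(patch_token_count(1, height=640, width=1024, patch_size=16))
```

Under these assumptions, each added camera or increase in pixel count lengthens the sequence the Transformer must process at every timestep.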

Approach

The authors encode multi-camera inputs into a fixed-size, geometry-aware 3D latent representation called triplanes, decoupling the token count from input resolution and camera configuration.
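
The decoupling can be sketched as follows. A triplane consists of three axis-aligned feature planes (XY, XZ, YZ), so its size is set by the chosen grid rather than by the inputs. The class name, grid sizes, and the one-token-per-plane-patch rule below are assumptions for illustration; the summary does not state how the authors size or tokenize their triplanes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriplaneGrid:
    """Hypothetical triplane layout: number of grid cells along each axis."""
    x: int
    y: int
    z: int

    def token_count(self, patch: int = 1) -> int:
        """Tokens if each plane is cut into patch x patch cells (assumed scheme)."""
        xy = (self.x // patch) * (self.y // patch)
        xz = (self.x // patch) * (self.z // patch)
        yz = (self.y // patch) * (self.z // patch)
        return xy + xz + yz


grid = TriplaneGrid(x=32, y=32, z=8)  # illustrative sizes, not from the paper
# No camera count or image resolution appears in the computation:
print(grid.token_count())         # one token per plane cell
print(grid.token_count(patch=2))  # coarser tokenization of the same planes
```

Because the count depends only on the grid, adding cameras or raising their resolution changes the encoder's input but not the number of tokens handed to the policy.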

Key results

  • Up to 72% fewer sensor tokens per timestep
  • Up to 50% faster policy inference runtime
  • Matches open-loop motion planning accuracy of baseline tokenizers
  • Improves offroad rates in closed-loop driving simulations

Why it matters

Enables real-time deployment of scalable, internet-pretrained Transformer policies on resource-constrained autonomous vehicles.

Abstract

Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on large-scale AV datasets and a state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.

Index terms

Representation Learning; Deep Learning for Visual Perception; Computer Vision for Transportation
