ICRA 2026

BridgeTA: Bridging the Representation Gap in Knowledge Distillation Via Teacher Assistant for Bird’s Eye View Map Segmentation

Beomjun Kim, Suhan Woo, Sejong Heo, Euntai Kim

AI summary

BridgeTA bridges the modality gap between fusion teachers and camera-only students via a lightweight Teacher Assistant, boosting segmentation accuracy by 4.2% mIoU without adding inference cost.

Knowledge Distillation, BEV Segmentation, Teacher Assistant, Camera-Only Perception, Representation Gap, Autonomous Driving

Problem

Camera-only BEV map segmentation lags behind LiDAR-camera fusion methods, while existing knowledge distillation techniques either inflate student model size or fail to adequately bridge the inherent representation gap between different sensor modalities.

Approach

The framework introduces a lightweight Teacher Assistant network that fuses teacher and student BEV features to create a shared latent space, decomposing direct distillation into stable dual paths and optimizing a theoretically grounded loss with multi-level alignment.
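The dual-path idea can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names (`ta_fuse`, `dual_path_loss`), the linear channel-mixing TA, the mean-squared-error feature loss, and the weight `lam` are hypothetical and are not taken from the paper or its code release.

```python
import numpy as np


def ta_fuse(teacher_bev, student_bev, w_t, w_s):
    """Hypothetical TA: mix teacher and student BEV features channel-wise.

    Features have shape (C, H, W); weights have shape (C, C), acting like
    1x1 convolutions. The real TA architecture is not specified here.
    """
    mixed_teacher = np.einsum("oc,chw->ohw", w_t, teacher_bev)
    mixed_student = np.einsum("oc,chw->ohw", w_s, student_bev)
    return mixed_teacher + mixed_student


def mse(a, b):
    """Mean squared error between two feature maps of equal shape."""
    return float(np.mean((a - b) ** 2))


def dual_path_loss(teacher_bev, student_bev, ta_bev, lam=1.0):
    """Assumed dual-path loss: teacher-TA term plus weighted TA-student term."""
    return mse(teacher_bev, ta_bev) + lam * mse(ta_bev, student_bev)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8, 8))  # stand-in for fusion BEV features
    student = rng.normal(size=(4, 8, 8))  # stand-in for camera-only BEV features
    half = 0.5 * np.eye(4)                # TA that simply averages both inputs
    ta = ta_fuse(teacher, student, half, half)
    print("direct path:", mse(teacher, student))
    print("dual path  :", dual_path_loss(teacher, student, ta))
```

In this sketch the TA is used only to compute the training loss, which is consistent with the claim that the student's architecture and inference cost stay unchanged. For any TA output `ta`, the direct term satisfies `mse(teacher, student) <= 2 * (mse(teacher, ta) + mse(ta, student))`, so driving the two path terms down also bounds the direct teacher-student discrepancy.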

Key results

4.2% mIoU improvement over camera-only baseline on nuScenes
mIoU gain up to 45% larger than that of other state-of-the-art KD methods
Zero additional inference cost or latency for the student model
Theoretically grounded dual-path distillation loss ensuring stable optimization

Why it matters

Lets camera-only autonomous driving systems narrow the accuracy gap to LiDAR-camera fusion without extra sensors or added inference cost, making high-performance BEV segmentation more cost-effective and scalable.

Abstract

Bird’s Eye View (BEV) map segmentation is one of the most important and challenging tasks in autonomous driving. Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. Knowledge Distillation (KD) has been explored to narrow this gap, but existing methods mainly enlarge the student model by mimicking the teacher’s architecture, leading to higher inference cost. To address this issue, we introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models through a Teacher Assistant (TA) network while keeping the student’s architecture and inference cost unchanged. A lightweight TA network combines the BEV representations of the teacher and student, creating a shared latent space that serves as an intermediate representation. To ground the framework theoretically, we derive a distillation loss using Young’s inequality, which decomposes the direct teacher-student distillation path into teacher-TA and TA-student dual paths, stabilizing optimization and strengthening knowledge transfer. Extensive experiments on the challenging nuScenes dataset demonstrate the effectiveness of our method, achieving an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods. The code will be available at https://github.com/kxxbeomjun/BridgeTA.
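As a rough illustration of how Young’s inequality can split a direct distillation term into two paths, consider the standard bound below. The symbols $F_T$, $F_S$, and $F_A$ (teacher, student, and TA features) and the free parameter $\epsilon$ are notation assumed here; the exact loss derived in the paper may differ.

```latex
\begin{aligned}
\|F_T - F_S\|^2
  &= \|(F_T - F_A) + (F_A - F_S)\|^2 \\
  &= \|F_T - F_A\|^2
     + 2\,\langle F_T - F_A,\; F_A - F_S \rangle
     + \|F_A - F_S\|^2 \\
  &\le (1+\epsilon)\,\|F_T - F_A\|^2
     + \left(1 + \tfrac{1}{\epsilon}\right)\|F_A - F_S\|^2,
  \qquad \epsilon > 0 .
\end{aligned}
```

The last step applies the Cauchy-Schwarz inequality followed by Young’s inequality, $2ab \le \epsilon a^2 + b^2/\epsilon$, to the cross term. Under these assumptions, minimizing the teacher-TA and TA-student terms minimizes an upper bound on the direct teacher-student term, with $\epsilon$ setting the balance between the two paths.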

Index terms

Computer Vision for Automation, Computer Vision for Transportation, Recognition