ICRA 2026

MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting

Mohammed Amine Bencheikh Lehocine, Julian Schmidt, Frank Moosmann, Dikshant Gupta, Fabian Flohr


AI summary


MASAR improves joint 3D detection and trajectory forecasting by predicting and refining past trajectories using appearance cues, achieving over 20% reduction in minADE/minFDE without tracking or map data.

Joint detection, trajectory forecasting, motion-appearance synergy, tracking-free, map-free, autonomous driving

Problem

Existing end-to-end autonomous driving models fail to fully exploit long-term motion cues and rely on noisy tracking or map information, limiting detection and forecasting accuracy.

Approach

MASAR introduces a tracking-free, map-free framework that jointly predicts multiple past trajectory hypotheses per object and refines them using appearance-guided scoring, then conditions future trajectory forecasting on these refined past trajectories.
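The refinement step can be pictured with a small sketch. This summary does not say how MASAR scores its hypotheses, so everything below is an assumption made for illustration: each past-trajectory hypothesis carries an embedding, the score is its dot product with an appearance embedding of the object, and the refined past trajectory is the softmax-weighted blend of the hypotheses. The names `refine_past`, `hypothesis_embeddings`, and `appearance_embedding` are hypothetical and are not taken from the paper or its code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_past(hypotheses, hypothesis_embeddings, appearance_embedding):
    """Blend K past-trajectory hypotheses using appearance-based weights.

    hypotheses: K trajectories, each a list of (x, y) waypoints of equal length.
    hypothesis_embeddings: K feature vectors, one per hypothesis (assumed).
    appearance_embedding: one feature vector for the object (assumed).
    Returns the blended trajectory and the per-hypothesis weights.
    """
    # Assumed scoring rule: dot product between each hypothesis embedding
    # and the object's appearance embedding.
    scores = [
        sum(h * a for h, a in zip(emb, appearance_embedding))
        for emb in hypothesis_embeddings
    ]
    weights = softmax(scores)

    # Weighted average of the hypotheses, waypoint by waypoint.
    refined = []
    for t in range(len(hypotheses[0])):
        x = sum(w * hyp[t][0] for w, hyp in zip(weights, hypotheses))
        y = sum(w * hyp[t][1] for w, hyp in zip(weights, hypotheses))
        refined.append((x, y))
    return refined, weights


if __name__ == "__main__":
    hyps = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 2.0)]]
    embs = [[1.0, 0.0], [1.0, 0.0]]
    refined, weights = refine_past(hyps, embs, [1.0, 0.0])
    print(refined, weights)
```

In a full model the refined past trajectory would then be fed to the forecasting head as conditioning; that step is left out here because the summary gives no details about it.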

Key results

New state-of-the-art on nuScenes without map data
Over 20% reduction in minADE and minFDE
Consistent gains across BEVFormer and SparseBEV backbones
Up to 6% minFDE improvement and 7% miss rate reduction via past conditioning
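For readers unfamiliar with the metrics quoted above, the following sketch shows how minADE and minFDE are commonly computed for multi-modal forecasts: the average and final displacement errors are evaluated for each predicted mode, and the best mode is kept. The helper names and the toy trajectories are illustrative only, and the paper's exact evaluation protocol (number of modes, horizon, matching of detections to ground truth) is not given in this summary.

```python
import math


def displacement_errors(pred, gt):
    """Per-timestep Euclidean distance between two (x, y) trajectories."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def min_ade_fde(modes, gt):
    """Return (minADE, minFDE) over K predicted modes for one agent.

    modes: K trajectories, each a list of (x, y) waypoints.
    gt: ground-truth trajectory with the same number of waypoints.
    """
    ades, fdes = [], []
    for mode in modes:
        errs = displacement_errors(mode, gt)
        ades.append(sum(errs) / len(errs))  # average displacement error
        fdes.append(errs[-1])               # final displacement error
    return min(ades), min(fdes)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    modes = [
        [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
        [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],
    ]
    print(min_ade_fde(modes, gt))
    print(relative_reduction(1.0, 0.8))
```

Under this reading, a "20% reduction" means the error falls to 80% of the baseline value, as in the `relative_reduction(1.0, 0.8)` call.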

Why it matters

Enables more robust and accurate perception-prediction pipelines for camera-based autonomous driving by eliminating reliance on error-prone tracking and high-definition maps.

Abstract

Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of “looking backward to look forward”, and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR’s effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at https://github.com/aminmed/MASAR.

Index terms

Object Detection, Segmentation and Categorization; Deep Learning for Visual Perception; Computer Vision for Transportation