ICRA 2026

A Two-Stage Framework for Ego-Centric Key Object Identification Via Object State Prediction

Shihong Ling, Yue Wan, Xiaowei Jia, Na Du

AI summary

A modular two-stage framework that predicts dynamic object states and integrates them with ego-centric spatial reasoning significantly outperforms existing methods in identifying critical driving objects.

Autonomous driving; Key object identification; Object state prediction; Ego-centric reasoning; Spatial-temporal features; Transformer

Problem

Existing key object identification methods either treat objects independently or rely on visual relationships, failing to explicitly account for the ego-vehicle's perspective and dynamic object states in complex traffic.

Approach

The framework first uses a category-specific modular predictor to classify object behaviors from spatial and visual features, then ranks object importance using a transformer that combines these predicted states with relative spatial changes to a virtual ego-vehicle.
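
The two-stage flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the hand-written rules stand in for the learned category-specific predictors, a softmax over a simple score stands in for the transformer ranker, and every name, threshold, and weight (`PREDICTORS`, `STATE_WEIGHT`, `d_dist`, `d_lateral`, `dist`) is a hypothetical choice made for the example.

```python
import math

# Stage 1: one small predictor per object category. These rules are
# hypothetical stand-ins for learned, category-specific classifiers.
def predict_vehicle_state(f):
    if f["d_dist"] < -0.5:   # distance to the ego vehicle is shrinking
        return "approaching"
    if f["d_dist"] > 0.5:    # distance to the ego vehicle is growing
        return "receding"
    return "steady"

def predict_pedestrian_state(f):
    return "crossing" if abs(f["d_lateral"]) > 0.3 else "waiting"

def predict_light_state(f):
    return f["color"]        # assumed to come from a visual classifier

PREDICTORS = {
    "car": predict_vehicle_state,
    "pedestrian": predict_pedestrian_state,
    "traffic_light": predict_light_state,
}

# Hypothetical importance prior for each predicted state.
STATE_WEIGHT = {
    "approaching": 2.0, "steady": 0.5, "receding": 0.0,
    "crossing": 2.5, "waiting": 1.0,
    "red": 3.0, "yellow": 2.0, "green": 0.2,
}

# Stage 2: combine the predicted state with distance to the virtual ego
# vehicle, then normalise scores with a softmax (a simple substitute for
# the transformer-based ranker).
def rank_objects(objects):
    scored = []
    for obj in objects:
        state = PREDICTORS[obj["category"]](obj["features"])
        logit = STATE_WEIGHT.get(state, 0.0) - 0.1 * obj["features"]["dist"]
        scored.append((obj["id"], state, logit))
    total = sum(math.exp(logit) for _, _, logit in scored)
    ranked = [(oid, state, math.exp(logit) / total) for oid, state, logit in scored]
    return sorted(ranked, key=lambda r: r[2], reverse=True)

# Made-up scene: a receding car, a crossing pedestrian, a red light.
scene = [
    {"id": "car_1", "category": "car",
     "features": {"d_dist": 1.2, "dist": 30.0}},
    {"id": "ped_1", "category": "pedestrian",
     "features": {"d_lateral": 0.6, "dist": 8.0}},
    {"id": "light_1", "category": "traffic_light",
     "features": {"color": "red", "dist": 20.0}},
]
ranking = rank_objects(scene)
```

With this made-up scene, the nearby crossing pedestrian ranks above the red light and the receding car. The dispatch table is the point of the sketch: each category gets its own predictor, and the ranker only sees predicted states plus ego-relative geometry.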

Key results

Outperforms CNN and transformer baselines in key object identification accuracy across cars, pedestrians, traffic lights, and stop signs
Demonstrates that modular, category-specific state prediction yields higher accuracy and F1 scores than unified models
Validates relative spatial positioning combined with object size or depth estimation as the most reliable feature input for state prediction
Achieves real-time inference at ≤0.10 ms per frame for both predictor and identifier modules
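
As a rough illustration of the "relative spatial positioning combined with object size" input mentioned in the results above, the sketch below derives ego-relative features from two consecutive bounding boxes. It rests on assumptions not taken from the paper: the virtual ego vehicle is anchored at the bottom centre of the image, boxes are `(x1, y1, x2, y2)` in pixels, and box area serves as a proxy for size. The function and key names are hypothetical.

```python
def box_center_and_area(box):
    """Return (center_x, center_y, area) for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0, (x2 - x1) * (y2 - y1)

def relative_features(box_prev, box_curr, image_size):
    """Ego-relative position and size of an object, plus their frame-to-frame change."""
    width, height = image_size
    # Assumed virtual ego-vehicle anchor: bottom centre of the image.
    ego_x, ego_y = width / 2.0, float(height)

    per_frame = []
    for box in (box_prev, box_curr):
        cx, cy, area = box_center_and_area(box)
        per_frame.append((
            (cx - ego_x) / width,     # normalised horizontal offset from ego
            (cy - ego_y) / height,    # normalised vertical offset from ego
            area / (width * height),  # fraction of the image the box covers
        ))

    (x0, y0, a0), (x1, y1, a1) = per_frame
    return {
        "rel_x": x1, "rel_y": y1, "rel_area": a1,
        "d_rel_x": x1 - x0, "d_rel_y": y1 - y0, "d_rel_area": a1 - a0,
    }

# A box that grows and drifts toward the image centre between two frames.
feats = relative_features((400, 200, 500, 300), (400, 200, 600, 400), (1000, 500))
```

Here the growing `rel_area` and the shrinking horizontal offset are the kind of relative spatial change a state predictor could read as an object getting closer; a depth estimate could replace the area term where one is available.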

Why it matters

Improves autonomous vehicle transparency and safety by enabling accurate, real-time identification of critical objects that directly influence immediate driving decisions.

Abstract

This paper presents a novel framework designed to enhance key object identification in autonomous driving. Existing methods primarily focus on either detecting objects independently or leveraging visual relationships, but they do not explicitly consider the ego-vehicle’s perspective in determining object importance. To address this gap, we propose a structured approach that integrates a virtual ego-vehicle representation and a modular object state predictor, enabling a more accurate estimation of object behaviors relative to the ego-vehicle. Subsequently, our framework employs spatial-temporal reasoning to refine key object identification, prioritizing objects based on their states and relative spatial information rather than relying solely on visual relationships. Experimental results on real-world driving datasets demonstrate the effectiveness of our approach in accurately detecting critical objects in complex traffic environments.

Index terms

Object Detection, Segmentation and Categorization; Deep Learning for Visual Perception; Computer Vision for Transportation