A Two-Stage Framework for Ego-Centric Key Object Identification Via Object State Prediction
Shihong Ling, Yue Wan, Xiaowei Jia, Na Du
AI summary
Problem
Existing key object identification methods either treat objects independently or rely on visual relationships, failing to explicitly account for the ego-vehicle's perspective and dynamic object states in complex traffic.
Approach
The framework first uses a category-specific modular predictor to classify object behaviors from spatial and visual features, then ranks object importance using a transformer that combines these predicted states with relative spatial changes to a virtual ego-vehicle.
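The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `DetectedObject`, the state labels (`approaching`, `crossing`, and so on), the threshold rules standing in for learned per-category predictors, and the hand-set score standing in for the transformer ranker are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class DetectedObject:
    object_id: int
    category: str               # "car", "pedestrian", "traffic_light", or "stop_sign"
    features: Sequence[float]   # stand-in for the spatial and visual features


# Stage 1: one predictor per object category. These threshold rules are
# hypothetical placeholders for learned modules, and the state labels are
# invented for this sketch.
def car_state(f: Sequence[float]) -> str:
    return "approaching" if f[0] > 0.5 else "receding"


def pedestrian_state(f: Sequence[float]) -> str:
    return "crossing" if f[0] > 0.5 else "waiting"


def traffic_light_state(f: Sequence[float]) -> str:
    return "red" if f[0] > 0.5 else "green"


def stop_sign_state(f: Sequence[float]) -> str:
    return "facing_ego" if f[0] > 0.5 else "facing_away"


STATE_PREDICTORS: Dict[str, Callable[[Sequence[float]], str]] = {
    "car": car_state,
    "pedestrian": pedestrian_state,
    "traffic_light": traffic_light_state,
    "stop_sign": stop_sign_state,
}


def predict_states(objects: List[DetectedObject]) -> Dict[int, str]:
    """Dispatch each object to the predictor registered for its category."""
    return {o.object_id: STATE_PREDICTORS[o.category](o.features) for o in objects}


# Stage 2: rank objects by importance. A hand-set additive score replaces the
# transformer here; it only shows that predicted states and relative spatial
# change are combined into one ranking.
STATE_PRIORITY = {"approaching": 1.0, "crossing": 1.0, "red": 1.0, "facing_ego": 1.0}


def rank_objects(states: Dict[int, str], closing: Dict[int, float]) -> List[int]:
    """Return object ids ordered from most to least important."""
    scores = {
        oid: STATE_PRIORITY.get(state, 0.0) + closing.get(oid, 0.0)
        for oid, state in states.items()
    }
    return sorted(scores, key=lambda oid: scores[oid], reverse=True)
```

Keeping the predictors in a dictionary keyed by category mirrors the modular design: a predictor for one category can be retrained or replaced without touching the others.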
Key results
- Outperforms CNN and transformer baselines in key object identification accuracy across cars, pedestrians, traffic lights, and stop signs
- Demonstrates that modular, category-specific state prediction yields higher accuracy and F1 scores than unified models
- Validates relative spatial positioning combined with object size or depth estimation as the most reliable feature input for state prediction
- Achieves real-time inference at ≤0.10 ms per frame for both predictor and identifier modules
Why it matters
Improves autonomous vehicle transparency and safety by enabling accurate, real-time identification of critical objects that directly influence immediate driving decisions.
Abstract
This paper presents a novel framework designed to enhance key object identification in autonomous driving. Existing methods primarily focus on either detecting objects independently or leveraging visual relationships, but they do not explicitly consider the ego-vehicle’s perspective in determining object importance. To address this gap, we propose a structured approach that integrates a virtual ego-vehicle representation and a modular object state predictor, enabling a more accurate estimation of object behaviors relative to the ego-vehicle. Subsequently, our framework employs spatial-temporal reasoning to refine key object identification, prioritizing objects based on their states and relative spatial information rather than relying solely on visual relationships. Experimental results on real-world driving datasets demonstrate the effectiveness of our approach in accurately detecting critical objects in complex traffic environments.
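To make "relative spatial information" with respect to a virtual ego-vehicle concrete, the sketch below computes an object's offset from a virtual ego box and how that offset changes between two frames. Everything specific here is an assumption for illustration: the abstract does not say where the virtual ego-vehicle is placed or which features are used, so the bottom-center placement, the box fractions, and the function names `virtual_ego_box`, `relative_features`, and `relative_change` are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def virtual_ego_box(image_w: float, image_h: float,
                    width_frac: float = 0.25, height_frac: float = 0.125) -> Box:
    """Assumed placement: a box centered horizontally at the bottom of the image."""
    w = image_w * width_frac
    h = image_h * height_frac
    cx = image_w / 2
    return (cx - w / 2, image_h - h, cx + w / 2, image_h)


def center(box: Box) -> Tuple[float, float]:
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def relative_features(obj: Box, ego: Box) -> Dict[str, float]:
    """Offset of the object's center from the ego box center, plus object area.

    Area is used as a rough size cue; a depth estimate could take its place.
    """
    ox, oy = center(obj)
    ex, ey = center(ego)
    area = (obj[2] - obj[0]) * (obj[3] - obj[1])
    return {"dx": ox - ex, "dy": oy - ey, "area": area}


def relative_change(prev: Box, curr: Box, ego: Box) -> Dict[str, float]:
    """Frame-to-frame change in the relative features of one tracked object."""
    before = relative_features(prev, ego)
    after = relative_features(curr, ego)
    return {key: after[key] - before[key] for key in before}
```

For example, an object whose `dy` increases toward zero while its `area` grows is moving down the image and getting larger, which is one plausible cue that it is approaching the ego-vehicle.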