ConsistencyPlanner: Real-Time Planning with Fast-Sampling Consistency Models
Qichao Zhang, Xing Fang, Jiaqi Fang, Zhenwen Cai, Jie Ling, Qiankun Yu, Dongbin Zhao
AI summary
Problem
Learning-based autonomous driving planners struggle to balance modeling diverse, multimodal driving behaviors with the low-latency requirements of real-time deployment, often resulting in unsafe or computationally prohibitive actions.
Approach
ConsistencyPlanner integrates fast-sampling consistency models with an attention-enhanced decoder to fuse scene and route features, enabling efficient single-step generation of diverse driving trajectories without iterative denoising.
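To make the idea of single-step generation concrete, the following is a minimal sketch of how a consistency model maps noise to samples in one network evaluation. It uses the standard consistency-model parameterization f(x, t) = c_skip(t) x + c_out(t) F(x, t), which satisfies f(x, eps) = x. The function `toy_network`, the conditioning vector `cond`, the trajectory shape, and the constants are all assumptions for illustration and are not taken from ConsistencyPlanner itself.

```python
import numpy as np

# Illustrative constants (assumed; not taken from the paper).
SIGMA_DATA = 0.5   # assumed scale of the clean data
EPS = 0.002        # smallest noise level, where f(x, EPS) = x must hold
T_MAX = 80.0       # largest noise level, used for the initial noise


def toy_network(x, t, cond):
    """Hypothetical stand-in for a trained network F_theta(x, t, cond)."""
    return np.tanh(0.1 * x + cond)


def consistency_fn(x, t, cond):
    """Consistency function with skip/output scalings enforcing the boundary condition."""
    c_skip = SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)
    c_out = SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)
    return c_skip * x + c_out * toy_network(x, t, cond)


def sample_trajectories(cond, num_samples, horizon, rng):
    """Draw several candidate trajectories with a single network evaluation each."""
    # Each sample starts from independent Gaussian noise at the largest noise level.
    noise = rng.standard_normal((num_samples, horizon, 2)) * T_MAX
    # One forward pass maps noise to a sample; there is no iterative denoising loop.
    return consistency_fn(noise, T_MAX, cond)


rng = np.random.default_rng(0)
cond = np.zeros(2)  # hypothetical conditioning vector (e.g. an encoded scene)
trajs = sample_trajectories(cond, num_samples=8, horizon=16, rng=rng)
print(trajs.shape)
```

A diffusion sampler would instead call the network once per denoising step, which is where the latency gap comes from. Different noise draws yield different outputs, which is how a batch of diverse candidates is obtained in one pass.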
Key results
- Achieves lowest collision rate (2.77%) and off-road rate (2.09%) on the Waymax benchmark
- Delivers real-time inference at ~15 ms latency, far faster than iterative diffusion models
- Surpasses state-of-the-art baselines in closed-loop safety metrics
- Validates that attention-based feature fusion significantly improves planning robustness
Why it matters
Offers a practical, low-latency planning solution for safety-critical autonomous vehicles that must navigate complex, dynamic traffic environments in real time.
Abstract
Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability needed for dynamic traffic environments. Learning-based approaches, despite their promise, struggle to balance modeling diverse, multimodal driving behaviors with real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose ConsistencyPlanner, a real-time planning framework built on fast-sampling consistency models. Our approach rests on two key technical contributions. (1) Efficient Multimodal Sampling: we employ fast-sampling consistency models to generate a diverse set of plausible future trajectories. This enables efficient, real-time exploration of multimodal actions, overcoming the computational bottlenecks of previous iterative generative methods. (2) Heterogeneous Feature Fusion: we introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features, including scene features and action tokens, into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.
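As an illustration of attention-based fusion of heterogeneous features, the sketch below uses generic scaled dot-product cross-attention in which action tokens query scene features, followed by a residual connection. The abstract does not specify the decoder's actual layout, so the function `cross_attention_fuse`, the tensor shapes, and the random projection matrices are assumptions rather than the authors' architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(action_tokens, scene_features, w_q, w_k, w_v):
    """Fuse scene context into action tokens (hypothetical helper).

    action_tokens:  (num_actions, d) array, used as queries
    scene_features: (num_scene, d) array, used as keys and values
    """
    q = action_tokens @ w_q
    k = scene_features @ w_k
    v = scene_features @ w_v
    # Scaled dot-product scores: one row per action token, one column per scene element.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Residual connection keeps the original action information.
    fused = action_tokens + weights @ v
    return fused, weights


rng = np.random.default_rng(0)
d = 32
action_tokens = rng.standard_normal((16, d))    # e.g. 16 future steps (assumed)
scene_features = rng.standard_normal((40, d))   # e.g. 40 agents/lane elements (assumed)
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention_fuse(action_tokens, scene_features, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to one, so every action token receives a convex combination of scene values. This lets the weighting over scene elements change from token to token rather than being fixed in advance.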