Predictive Local Planning with Multi-Step Reward and Q-Value Forecasting
Yuhan Du, Yuxiang Cui, Yulin Peng, Yiyuan Pan, Tianhao Cai, Yue Wang, Rong Xiong
AI summary
Problem
Explicit future observation prediction is brittle in uncertain, dynamic settings, while existing trajectory evaluation methods either overlook long-term consequences or lack short-term sensitivity, leading to suboptimal planning.
Approach
The framework rolls out and optimizes candidate trajectories entirely within a compact latent space by jointly forecasting multi-step rewards and terminal Q-values, then refines paths using smoothness-regularized MPPI and a lightweight social reward.
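One plausible way to write the trajectory score described above is given here as a sketch. The summary does not state the exact objective, so the form follows the common convention of a discounted reward sum plus a bootstrapped terminal value. All symbols are assumptions introduced for illustration: latent state $z_t$, latent dynamics $d$, reward head $\hat{r}$, value head $\hat{Q}$, policy $\pi$, horizon $H$, and discount $\gamma$.

```latex
\begin{aligned}
z_{t+1} &= d(z_t, a_t), \qquad t = 0, \dots, H-1, \\
J(a_{0:H-1}) &= \sum_{t=0}^{H-1} \gamma^{t}\, \hat{r}(z_t, a_t)
              \;+\; \gamma^{H}\, \hat{Q}\bigl(z_H, \pi(z_H)\bigr).
\end{aligned}
```

Under this reading, the reward sum supplies short-term sensitivity and the terminal term carries the long-term consequences beyond the rollout horizon.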
Key results
- Eliminates brittle explicit environment prediction by operating fully in latent space
- Jointly optimizes short-term rewards and long-term Q-values via MPPI
- Introduces a lightweight social reward mechanism for yielding behavior
- Outperforms SAC, NEUPAN, and TD-MPC baselines in success rate, path efficiency, and inference speed
Why it matters
Provides a robust, sample-efficient navigation strategy for robots operating in crowded, unpredictable environments where explicit forecasting fails.
Abstract
Planning in dynamic environments often relies on explicit future observation prediction or value-based estimation, both of which can be brittle or hard to generalize in uncertain settings. We propose a novel model-based reinforcement learning framework that performs trajectory rollout and optimization entirely in a learned latent space. Instead of predicting future observations explicitly, our method evaluates candidate trajectories through multi-step reward prediction and terminal Q-value estimation in the latent domain, enabling robust and generalizable planning in dynamic environments. A policy model generates an initial trajectory in latent space, which is then refined via a smoothness-regularized optimization using Model Predictive Path Integral (MPPI), guided by the predicted cumulative reward and Q-values. This avoids the complexity of future state reconstruction while ensuring dynamically feasible execution. To enhance the model’s deployment performance in crowded or interactive scenarios, we further introduce a lightweight social reward that penalizes unsafe overtaking and encourages yielding behavior. Experiments in both simulation and real-world environments show improved success rate, efficiency, and social acceptability compared to strong baselines.
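The refinement loop described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The toy `dynamics`, `reward`, `q_value`, and `policy` functions stand in for the learned latent models and are purely hypothetical, as are the hyperparameter values and the squared-difference smoothness penalty. The social reward is omitted because the abstract does not specify its form.

```python
import numpy as np

GOAL = np.zeros(2)  # hypothetical goal in a 2-D toy "latent" space

# Toy stand-ins for the learned latent models (assumptions, not the paper's networks).
def dynamics(z, a):
    return z + 0.1 * a

def reward(z, a):
    return -np.linalg.norm(z - GOAL, axis=-1)

def q_value(z, a):
    return -10.0 * np.linalg.norm(z - GOAL, axis=-1)

def policy(z):
    return GOAL - z

def score_trajectories(z0, actions, dynamics, reward, q_value, policy, gamma=0.99):
    """Discounted multi-step reward plus terminal Q for K action sequences.

    z0: (D,) initial latent; actions: (K, H, A). Returns an array of shape (K,).
    """
    num_candidates, horizon, _ = actions.shape
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    total = np.zeros(num_candidates)
    for t in range(horizon):
        total += (gamma ** t) * reward(z, actions[:, t])
        z = dynamics(z, actions[:, t])
    # Bootstrap beyond the horizon with the terminal Q-value.
    total += (gamma ** horizon) * q_value(z, policy(z))
    return total

def mppi_refine(z0, nominal, models, num_samples=512, sigma=0.3,
                temperature=0.2, smooth_weight=0.1, iterations=5, rng=None):
    """Refine a nominal action sequence (H, A) with smoothness-penalized MPPI."""
    rng = np.random.default_rng(0) if rng is None else rng
    mean = nominal.copy()
    for _ in range(iterations):
        noise = rng.normal(0.0, sigma, size=(num_samples,) + mean.shape)
        candidates = mean[None] + noise
        returns = score_trajectories(z0, candidates, **models)
        # Penalize large changes between consecutive actions.
        roughness = np.sum(np.diff(candidates, axis=1) ** 2, axis=(1, 2))
        scores = returns - smooth_weight * roughness
        # Softmax weights; subtracting the max keeps the exponent stable.
        weights = np.exp((scores - scores.max()) / temperature)
        weights /= weights.sum()
        mean = np.einsum("k,kha->ha", weights, candidates)
    return mean

models = dict(dynamics=dynamics, reward=reward, q_value=q_value, policy=policy)
z0 = np.array([1.0, 1.0])
nominal = np.zeros((5, 2))  # stands in for the policy's initial trajectory
plan = mppi_refine(z0, nominal, models)
baseline = score_trajectories(z0, nominal[None], **models)[0]
improved = score_trajectories(z0, plan[None], **models)[0]
```

In this toy setup the refined plan should score higher than the all-zero nominal plan, since the weighted average shifts the actions toward the goal. In the actual framework, the nominal sequence would come from the policy model and the scores from the learned reward and Q-value heads.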