ICRA 2026

Predictive Local Planning with Multi-Step Reward and Q-Value Forecasting

Yuhan Du, Yuxiang Cui, Yulin Peng, Yiyuan Pan, Tianhao Cai, Yue Wang, Rong Xiong

AI summary

Planning entirely in a learned latent space using joint multi-step reward and Q-value forecasting significantly improves navigation success, efficiency, and social compliance in dynamic environments.

latent-space planning, model-based reinforcement learning, multi-step reward forecasting, Q-value estimation, social navigation, MPPI

Problem

Explicit future observation prediction is brittle in uncertain, dynamic settings, while existing trajectory evaluation methods either overlook long-term consequences or lack short-term sensitivity, leading to suboptimal planning.

Approach

The framework rolls out and optimizes candidate trajectories entirely within a compact latent space by jointly forecasting multi-step rewards and terminal Q-values, then refines paths using smoothness-regularized MPPI and a lightweight social reward.
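The approach described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: `latent_step`, `reward_head`, and `q_head` are hypothetical hand-written stand-ins for the learned latent dynamics, reward predictor, and Q-value estimator, and the 2-D latent state, horizon, discount, noise scale, and temperature are all assumed values chosen for the toy example. The social reward is omitted because the summary does not specify its form.

```python
import numpy as np

def latent_step(z, a):
    # Hypothetical stand-in for a learned latent dynamics model.
    # The toy latent is a 2-D position; the action is a bounded displacement.
    return z + 0.1 * np.tanh(a)

def reward_head(z, a, goal):
    # Hypothetical stand-in for a learned per-step reward predictor.
    return -np.linalg.norm(z - goal, axis=-1) - 0.01 * np.sum(a**2, axis=-1)

def q_head(z, goal):
    # Hypothetical stand-in for a learned terminal value estimate.
    return -5.0 * np.linalg.norm(z - goal, axis=-1)

def score_trajectories(z0, actions, goal, gamma=0.95, smooth_weight=0.1):
    """Score K action sequences of shape (K, H, A) using latent rollouts only."""
    num_samples, horizon, _ = actions.shape
    z = np.repeat(z0[None, :], num_samples, axis=0)
    total = np.zeros(num_samples)
    for t in range(horizon):
        # Discounted multi-step reward forecast along the rollout.
        total += gamma**t * reward_head(z, actions[:, t], goal)
        z = latent_step(z, actions[:, t])
    # Terminal value covers consequences beyond the horizon.
    total += gamma**horizon * q_head(z, goal)
    # Smoothness regularizer: penalize changes between consecutive actions.
    total -= smooth_weight * np.sum(np.diff(actions, axis=1) ** 2, axis=(1, 2))
    return total

def mppi_refine(z0, nominal, goal, rng, num_samples=256,
                noise_std=0.5, temperature=1.0, iterations=5):
    """Refine a nominal action sequence of shape (H, A) with MPPI updates."""
    for _ in range(iterations):
        noise = rng.normal(0.0, noise_std, size=(num_samples,) + nominal.shape)
        candidates = nominal[None] + noise
        scores = score_trajectories(z0, candidates, goal)
        # Exponentiated, max-shifted scores give the MPPI importance weights.
        weights = np.exp((scores - scores.max()) / temperature)
        weights /= weights.sum()
        # The new nominal is the weighted average of the sampled sequences.
        nominal = np.tensordot(weights, candidates, axes=1)
    return nominal

rng = np.random.default_rng(0)
z0 = np.zeros(2)
goal = np.array([1.0, 1.0])
initial = np.zeros((8, 2))  # stand-in for the policy model's initial proposal
refined = mppi_refine(z0, initial, goal, rng)
```

In this sketch no observation is ever decoded: candidates are compared only through predicted rewards and a terminal value, which is the property the approach relies on. In a receding-horizon setup one would typically execute only the first refined action and replan at the next step.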

Key results

Eliminates brittle explicit environment prediction by operating fully in latent space
Jointly optimizes short-term rewards and long-term Q-values via MPPI
Introduces a lightweight social reward mechanism for yielding behavior
Outperforms SAC, NEUPAN, and TD-MPC baselines in success rate, path efficiency, and inference speed

Why it matters

Provides a robust, sample-efficient navigation strategy for robots operating in crowded, unpredictable environments where explicit forecasting fails.

Abstract

Planning in dynamic environments often relies on explicit future observation prediction or value-based estimation, both of which can be brittle or hard to generalize in uncertain settings. We propose a novel model-based reinforcement learning framework that performs trajectory rollout and optimization entirely in a learned latent space. Instead of predicting future observations explicitly, our method evaluates candidate trajectories through multi-step reward prediction and terminal Q-value estimation in the latent domain, enabling robust and generalizable planning in dynamic environments. A policy model generates an initial trajectory in latent space, which is then refined via a smoothness-regularized optimization using Model Predictive Path Integral (MPPI), guided by the predicted cumulative reward and Q-values. This avoids the complexity of future state reconstruction while ensuring dynamically feasible execution. To enhance the model’s deployment performance in crowded or interactive scenarios, we further introduce a lightweight social reward that penalizes unsafe overtaking and encourages yielding behavior. Experiments in both simulation and real-world environments show improved success rate, efficiency, and social acceptability compared to strong baselines.
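One plausible way to write the trajectory score and MPPI update that the abstract describes is given here. The notation is an assumption for illustration and is not taken from the paper: $z_t$ is the latent state, $f$ the latent dynamics, $\hat r$ and $\hat Q$ the reward and Q-value predictors, $\pi$ the policy model, $H$ the horizon, $\gamma$ the discount, $\lambda$ the smoothness weight, and $\beta$ the MPPI temperature.

```latex
\begin{align}
  z_{t+1} &= f(z_t, a_t), \qquad t = 0, \dots, H-1, \\
  J(a_{0:H-1}) &= \sum_{t=0}^{H-1} \gamma^{t}\, \hat r(z_t, a_t)
      + \gamma^{H}\, \hat Q\bigl(z_H, \pi(z_H)\bigr)
      - \lambda \sum_{t=1}^{H-1} \lVert a_t - a_{t-1} \rVert^{2}, \\
  w_k &= \frac{\exp\bigl(J^{(k)} / \beta\bigr)}{\sum_{j=1}^{K} \exp\bigl(J^{(j)} / \beta\bigr)},
  \qquad
  a_{0:H-1} \leftarrow \sum_{k=1}^{K} w_k\, a^{(k)}_{0:H-1}.
\end{align}
```

Under this reading, the reward sum supplies short-term sensitivity, the terminal Q-value accounts for consequences beyond the horizon, and the last term is the smoothness regularizer; how the paper actually weights or combines these terms may differ.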

Index terms

Motion and Path Planning; Planning under Uncertainty; Reinforcement Learning