TD-CD-MPPI: Temporal-Difference Constraint-Discounted Model Predictive Path Integral Control
Pietro Noah Crestaz, Ludovic De Matteïs, Elliot Chane-Sane, Nicolas Mansard, Andrea Del Prete
AI summary
Problem
Sampling-based control methods like MPPI suffer from computational costs that scale linearly with the planning horizon, limiting long-term reasoning, while constraint enforcement relies on brittle, handcrafted penalty functions that lack interpretability and scalability.
Approach
The method integrates a terminal value function learned offline via temporal-difference learning to approximate infinite-horizon costs, allowing shorter rollouts, and modulates trajectory discount factors based on constraint violations to replace traditional cost shaping.
Key results
- Enables stable locomotion with significantly shorter MPC horizons (e.g., H=8 vs H≥10)
- Provides a modular, interpretable mechanism for constraint-aware planning without penalty shaping
- Reduces computational cost while maintaining or improving sample efficiency
- Successfully transfers from simulation to real-world Solo12 quadruped hardware
Why it matters
Provides a practical, computationally efficient framework for real-time, constraint-aware locomotion control that bridges the gap between sampling-based optimization and learning-based long-horizon reasoning.
Abstract
Path Integral methods have demonstrated remarkable capabilities for solving non-linear stochastic optimal control problems through sampling-based optimization. However, their computational complexity grows linearly with the prediction horizon, limiting long-term reasoning, while constraints are merely enforced through handcrafted penalties. In this work, we propose a unified and efficient framework for enabling long-horizon reasoning and constraint enforcement within Model Predictive Path Integral (MPPI) control. First, we introduce a practical method to incorporate a terminal value function, learned offline via temporal-difference learning, to approximate the long-term cost-to-go. This allows for significantly shorter rollouts while enabling infinite-horizon reasoning, thereby improving computational efficiency and motion performance. Second, we propose a discount modulation strategy that adjusts the return of sampled trajectories based on constraint violations. This provides a more interpretable and effective mechanism for enforcing constraints compared to traditional cost shaping. Our formulation retains the flexibility and sampling efficiency of MPPI while supporting structured integration of long-term objectives and constraint handling. We validate our approach on both simulated and real-world robotic locomotion tasks, demonstrating improved performance, constraint-awareness, and generalization under reduced computational budgets.
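The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes a reward-maximization form with nonnegative rewards, a per-step discount of `gamma * (1 - violation)`, and a terminal value supplied as a plain array (standing in for a value function learned offline). All function and parameter names are hypothetical.

```python
import numpy as np

def constraint_discounted_return(rewards, violations, terminal_values, gamma=0.99):
    """Return of K sampled rollouts of length H (assumed form, not from the paper).

    rewards:         (K, H) nonnegative stage rewards
    violations:      (K, H) constraint violation levels in [0, 1]
    terminal_values: (K,)   value estimate at the final state of each rollout
    """
    num_rollouts = rewards.shape[0]
    # Assumption: a violation at step t shrinks the discount applied to everything after t.
    step_discount = gamma * (1.0 - np.clip(violations, 0.0, 1.0))
    cumulative = np.cumprod(step_discount, axis=1)
    # The reward at step t is weighted by the product of the discounts of steps 0..t-1.
    reward_weights = np.concatenate(
        [np.ones((num_rollouts, 1)), cumulative[:, :-1]], axis=1
    )
    # The terminal value stands in for the return beyond the short horizon.
    return (reward_weights * rewards).sum(axis=1) + cumulative[:, -1] * terminal_values

def mppi_weights(returns, temperature=1.0):
    """Softmax weights over rollouts; higher return gives higher weight."""
    shifted = (returns - returns.max()) / temperature  # shift for numerical stability
    unnormalized = np.exp(shifted)
    return unnormalized / unnormalized.sum()

def mppi_update(nominal_controls, noise, weights):
    """Weighted average of sampled perturbations added to the nominal plan.

    nominal_controls: (H, nu), noise: (K, H, nu), weights: (K,)
    """
    return nominal_controls + np.tensordot(weights, noise, axes=1)
```

Under these assumptions, a rollout that fully violates a constraint at its first step keeps only the reward collected at that step and loses its terminal value, so it receives a much smaller weight than an otherwise identical feasible rollout. No separate penalty term is added to the reward.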