Prompt-To-State Stable Vision-Language MPC for Approximated Neural Network Dynamics: A Case Study on Soft Robot Control
Nicotra Emanuele, James J. Davies, Kefan Zhu, Sharma Bibhu, Adrienne Ji, Phuoc Thien Phan, Hung Manh La, Nigel Hamilton Lovell, Thanh Nho Do
AI summary
Problem
Vision-language models deployed in closed-loop robotic control lack formal safety and stability guarantees, particularly when the system dynamics are approximated by neural networks whose prediction errors compound over the model predictive control horizon.
Approach
The authors introduce a two-loop architecture where a vision-language model translates natural language and visual feedback into MPC parameters, while a lower-level MPC uses a Taylor-expanded neural network dynamics model with a rigorously computed terminal cost weight to guarantee stability despite approximation errors.
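As a rough illustration of what a Taylor-expanded neural network dynamics model can look like, the sketch below linearizes a small one-hidden-layer network around an operating point. It is not the authors' implementation: the network architecture, the sizes `NX`, `NU`, `NH`, the random weights standing in for a trained model, and the choice of a first-order expansion are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
NX, NU, NH = 3, 2, 16  # hypothetical state, input, and hidden-layer sizes

# Random weights stand in for a trained dynamics network (assumption).
W1 = rng.normal(scale=0.5, size=(NH, NX + NU))
b1 = rng.normal(scale=0.1, size=NH)
W2 = rng.normal(scale=0.5, size=(NX, NH))
b2 = rng.normal(scale=0.1, size=NX)


def nn_dynamics(x, u):
    """One-step prediction x_next = W2 tanh(W1 [x; u] + b1) + b2."""
    z = np.concatenate([x, u])
    return W2 @ np.tanh(W1 @ z + b1) + b2


def linearize(x0, u0):
    """Return (f0, A, B) of the first-order Taylor expansion at (x0, u0)."""
    z0 = np.concatenate([x0, u0])
    h = np.tanh(W1 @ z0 + b1)
    # Chain rule: d/dz [W2 tanh(W1 z + b1)] = W2 diag(1 - h^2) W1
    J = W2 @ ((1.0 - h**2)[:, None] * W1)
    f0 = W2 @ h + b2
    return f0, J[:, :NX], J[:, NX:]


def taylor_dynamics(x, u, x0, u0):
    """Affine surrogate f(x0, u0) + A (x - x0) + B (u - u0)."""
    f0, A, B = linearize(x0, u0)
    return f0 + A @ (x - x0) + B @ (u - u0)


def approximation_error(step, x0, u0):
    """Norm of the gap between network and surrogate at a given offset."""
    x = x0 + step * np.ones(NX)
    u = u0 + step * np.ones(NU)
    return np.linalg.norm(nn_dynamics(x, u) - taylor_dynamics(x, u, x0, u0))


x0, u0 = np.zeros(NX), np.zeros(NU)
err_small = approximation_error(1e-2, x0, u0)
err_large = approximation_error(1e-1, x0, u0)
```

The surrogate is affine in the state and input, which is what makes it cheap to use inside an optimizer. The gap `approximation_error` grows as the query point moves away from the expansion point, and this is the kind of prediction error that a stability analysis has to bound.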
Key results
- Formal definition of Prompt-to-State Stability guaranteeing closed-loop stability under arbitrary prompts
- Derivation of a computable terminal cost weight ensuring Input-to-State Stability despite neural network approximation errors
- Development of a two-loop PSS-VLMPC framework that safely translates natural language commands into MPC parameters
- Validation via simulation and real-world experiments on a soft continuum robot executing language-specified tasks
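For orientation, the textbook discrete-time notion of Input-to-State Stability can be written as below. This is the standard form from the control literature, not a statement taken from the paper, whose ISS and Prompt-to-State Stability definitions may differ in detail. The symbols $x_k$ (state), $w_j$ (disturbance, here standing in for the model approximation error), $\beta$, and $\gamma$ are notation assumed for this sketch.

```latex
% Standard discrete-time ISS bound (assumed notation):
% beta is a class-KL function, gamma is a class-K function.
\[
  \|x_k\| \;\le\; \beta\bigl(\|x_0\|,\, k\bigr)
  \;+\; \gamma\Bigl(\sup_{0 \le j < k} \|w_j\|\Bigr)
  \qquad \text{for all } k \ge 0 .
\]
```

The first term decays with time for any initial state, while the second keeps the state within a bound that scales with the size of the disturbance, so a smaller approximation error implies a tighter ultimate bound.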
Why it matters
Enables safe, stable, and interpretable integration of large vision-language models into real-time robotic control loops for complex, hard-to-model systems.
Abstract
The integration of large-scale foundation models in control loops has proven effective for executing complex tasks from natural language inputs. However, ensuring stability and real-time performance remains a significant challenge when such models are used, especially for systems with hard-to-model dynamics. In this paper, we introduce the concept of Prompt-to-State Stability (PSS) and present the Prompt-to-State Stable Vision-Language Model Predictive Control (PSS-VLMPC), a novel framework that integrates a vision-language model (VLM) with a robust MPC. We use the VLM to interpret user commands and visual feedback, translating them into parameters for the MPC that controls the system. The system’s dynamics are learned entirely by a neural network and then approximated so that the MPC can run in real time. Starting from the prediction error bound, we provide rigorous stability guarantees for the closed-loop system, provided the environment dynamics do not exceed the VLM update rate. The effectiveness of the PSS-VLMPC is validated through simulations and real-world experiments on a soft continuum robot, demonstrating its capability to execute tasks from natural language inputs.