The Robot's Inner Critic: Self-Refinement of Social Behaviors through VLM-Based Replanning
Jiyu Lim, Youngwoo Yoon, Kwanghyun Park
AI summary
Problem
Conventional robot social behavior generation relies on rigid predefined motions or costly human feedback, which limits flexibility, autonomy, and cross-platform adaptability.
Approach
The CRISP framework uses a Vision-Language Model as an autonomous 'social critic' to evaluate robot motions and iteratively replan low-level joint control code based on the robot's structural file and situational context.
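The critique-and-replan cycle described above can be sketched as a small loop. This is an illustrative sketch only, not the authors' implementation: the names `Critique`, `critique_and_replan`, `toy_critic`, and `toy_replan` are hypothetical, the 0-to-1 score scale and acceptance threshold are assumptions, and the VLM critic and code generator are replaced by simple string-based stand-ins so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    score: float                # assumed scale: 0 (inappropriate) to 1 (appropriate)
    faulty_step: Optional[int]  # index of the step the critic flags, or None


def critique_and_replan(
    plan: List[str],
    critic: Callable[[List[str]], Critique],
    replan: Callable[[List[str], int], List[str]],
    max_rounds: int = 5,
    accept_score: float = 0.9,
) -> Tuple[List[str], float]:
    """Repair flagged steps, keeping a candidate only if its score improves."""
    best_plan, best = plan, critic(plan)
    for _ in range(max_rounds):
        if best.score >= accept_score or best.faulty_step is None:
            break
        candidate = replan(best_plan, best.faulty_step)
        result = critic(candidate)
        if result.score > best.score:
            best_plan, best = candidate, result
    return best_plan, best.score


# Stand-ins for the VLM critic and the replanner (hypothetical, for illustration).
def toy_critic(plan: List[str]) -> Critique:
    bad = [i for i, step in enumerate(plan) if "abruptly" in step]
    return Critique(score=1.0 - len(bad) / len(plan),
                    faulty_step=bad[0] if bad else None)


def toy_replan(plan: List[str], idx: int) -> List[str]:
    fixed = list(plan)
    fixed[idx] = fixed[idx].replace("abruptly", "smoothly")
    return fixed


plan = ["turn head toward person", "raise arm abruptly", "wave hand abruptly"]
refined, score = critique_and_replan(plan, toy_critic, toy_replan)
print(refined, score)
```

In a real system the critic would presumably receive rendered frames of the motion rather than text, and the replanner would regenerate joint control code for the flagged step. Both are reduced to string operations here to keep the control flow visible.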
Key results
- Autonomous VLM-driven critique and replanning cycle for social behaviors
- Platform-agnostic low-level joint control generation from structural files
- Significantly higher user preference and situational appropriateness ratings than previous methods in a user study spanning five robot types and 20 scenarios
- Ablation study validating the contribution of each framework component
Why it matters
Enables robots to generate flexible, human-like social interactions autonomously, reducing reliance on human feedback and enabling cross-platform deployment.
Abstract
Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a ‘human-like social critic.’ CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot’s description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot’s structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot’s autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at https://limjiyu99.github.io/inner-critic/.
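As a rough illustration of step (1), the sketch below lists the joints and their ranges from a small MJCF document using only Python's standard XML parser. It is a simplified assumption of what such an extraction might look like, not the paper's method: the function name `extract_joints` and the toy model are hypothetical, and the sketch ignores MJCF `<default>` classes, `<include>` files, and angle-unit settings that a full parser would need to resolve.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy model, written inline so the example is self-contained.
MJCF_EXAMPLE = """
<mujoco model="toy_arm">
  <worldbody>
    <body name="base">
      <body name="upper_arm">
        <joint name="shoulder_pitch" type="hinge" range="-90 90"/>
        <body name="forearm">
          <joint name="elbow" range="0 135"/>
          <body name="gripper">
            <joint name="gripper_slide" type="slide"/>
          </body>
        </body>
      </body>
    </body>
  </worldbody>
</mujoco>
"""


def extract_joints(mjcf_text):
    """Return one dict per <joint> under <worldbody>, in document order."""
    root = ET.fromstring(mjcf_text.strip())
    worldbody = root.find("worldbody")
    if worldbody is None:
        return []
    joints = []
    for joint in worldbody.iter("joint"):
        raw_range = joint.get("range")
        limits = tuple(float(v) for v in raw_range.split()) if raw_range else None
        joints.append({
            "name": joint.get("name"),
            # MJCF treats a joint with no explicit type as a hinge.
            "type": joint.get("type", "hinge"),
            "range": limits,  # None when the file gives no explicit range
        })
    return joints


joints = extract_joints(MJCF_EXAMPLE)
for j in joints:
    print(j)
```

A list like this could then be passed to the planning and code-generation stages as the set of controllable joints and their constraints. The range values are returned exactly as written in the file, so whether they are degrees or radians depends on the model's compiler settings.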