ICRA 2026

SOE: Sample-Efficient Robot Policy Self-Improvement Via On-Manifold Exploration

Yang Jin, Jun Lv, Han Xue, Wendi Chen, Chuan Wen, Cewu Lu

AI summary

Constraining exploration to a learned task manifold enables safe, diverse, and highly sample-efficient robot policy self-improvement without degrading base performance.

on-manifold exploration, policy self-improvement, variational information bottleneck, imitation learning, sample efficiency, controllable exploration

Problem

Robot policies often suffer from action mode collapse and therefore lack sufficient exploration capability, while existing methods that inject random perturbations are unsafe and produce unstable behaviors. Together, these limitations make it hard to refine policies at scale without resorting to costly human teleoperation.

Approach

SOE uses a variational information bottleneck to learn a compact latent representation of task-relevant factors, then restricts exploration to the manifold of valid actions in that latent space. It is packaged as a plug-in module that can be attached to existing policies.
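The general idea can be illustrated with a minimal NumPy sketch of a generic variational information bottleneck (VIB). It is not SOE's architecture: the linear encoder and decoder, the weights `W_mu`, `W_logvar`, `W_dec`, the coefficient `beta`, the exploration `scale`, and all function names are hypothetical, and the weights are random rather than trained. The loss trades action reconstruction against a KL penalty that compresses the latent code; exploration then perturbs the code in latent space and decodes it, so every exploratory action lies in the decoder's output set (here, because the decoder is linear, a 2-D subspace of a 7-D action space).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs, W_mu, W_logvar):
    """Hypothetical linear encoder: features -> Gaussian posterior q(z|obs)."""
    return obs @ W_mu, obs @ W_logvar

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0, axis=-1)

def vib_loss(obs, target_actions, W_mu, W_logvar, W_dec, beta, rng):
    """Action reconstruction + beta * KL bottleneck penalty (batch mean)."""
    mu, logvar = encode(obs, W_mu, W_logvar)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    recon = np.sum((z @ W_dec - target_actions) ** 2, axis=-1)
    return np.mean(recon + beta * kl_to_standard_normal(mu, logvar))

def explore(obs, W_mu, W_logvar, W_dec, scale, rng):
    """Latent-space exploration: perturb the code around the posterior, then decode."""
    mu, logvar = encode(obs, W_mu, W_logvar)
    z = mu + scale * np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    return z @ W_dec

# Toy dimensions (hypothetical): 8-D observation features, 2-D latent, 7-D action.
obs_dim, latent_dim, act_dim = 8, 2, 7
W_mu = 0.1 * rng.normal(size=(obs_dim, latent_dim))
W_logvar = 0.1 * rng.normal(size=(obs_dim, latent_dim))
W_dec = rng.normal(size=(latent_dim, act_dim))
obs = rng.normal(size=(16, obs_dim))
demo_actions = rng.normal(size=(16, act_dim))

loss = vib_loss(obs, demo_actions, W_mu, W_logvar, W_dec, beta=1e-3, rng=rng)
explored = explore(obs, W_mu, W_logvar, W_dec, scale=1.0, rng=rng)
```

Adding Gaussian noise directly to the 7-D actions would generally leave that 2-D subspace; sampling in the latent space cannot. A real system would use a nonlinear decoder and a curved manifold, but the principle of perturbing the compressed code rather than the raw action carries over.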

Key results

50.8% average relative improvement in real-world task success rates after one self-improvement round
Safe, temporally coherent exploration by constraining actions to a learned task manifold
Plug-in dual-path architecture that preserves base policy performance during joint training
Human-guided steering via disentangled latent dimensions for controllable exploration
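The plug-in and steering ideas in the list above can be sketched as a small wrapper. The summary does not describe SOE's actual dual-path design, so everything here is an assumption: `DualPathPolicy`, an encoder that returns a latent mean and log standard deviation, the linear decoder, and the `steer=(dim, offset)` argument are all hypothetical. The base path is never modified, so disabling exploration reproduces the base policy exactly; the exploration path samples a latent code, and a human can shift a single latent dimension to bias exploration in one chosen direction.

```python
import numpy as np
from typing import Callable, Optional, Tuple

class DualPathPolicy:
    """Hypothetical plug-in wrapper: untouched base path plus a latent exploration path."""

    def __init__(self, base_policy: Callable[[np.ndarray], np.ndarray],
                 encoder: Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]],
                 decoder: Callable[[np.ndarray], np.ndarray]):
        self.base_policy = base_policy  # original policy, never modified here
        self.encoder = encoder          # obs -> (latent mean, latent log-std)
        self.decoder = decoder          # latent code -> action

    def act(self, obs: np.ndarray, explore: bool = False,
            steer: Optional[Tuple[int, float]] = None,
            rng: Optional[np.random.Generator] = None) -> np.ndarray:
        if not explore:
            # Base path: output identical to the wrapped policy.
            return self.base_policy(obs)
        rng = rng if rng is not None else np.random.default_rng()
        mu, log_std = self.encoder(obs)
        # Exploration path: sample a latent code around the posterior mean.
        z = mu + np.exp(log_std) * rng.standard_normal(mu.shape)
        if steer is not None:
            # Human guidance: shift one (assumed disentangled) latent dimension.
            dim, offset = steer
            z[..., dim] += offset
        return self.decoder(z)

# Toy instantiation with random linear maps (illustrative only).
rng0 = np.random.default_rng(0)
obs_dim, latent_dim, act_dim = 8, 3, 7
W_base = rng0.normal(size=(obs_dim, act_dim))
W_mu = rng0.normal(size=(obs_dim, latent_dim))
W_dec = rng0.normal(size=(latent_dim, act_dim))

policy = DualPathPolicy(
    base_policy=lambda o: o @ W_base,
    encoder=lambda o: (o @ W_mu, np.full(latent_dim, -1.0)),
    decoder=lambda z: z @ W_dec,
)
obs = rng0.normal(size=obs_dim)
a_base = policy.act(obs)
a_explore = policy.act(obs, explore=True, rng=np.random.default_rng(1))
a_steered = policy.act(obs, explore=True, steer=(0, 0.5), rng=np.random.default_rng(1))
```

With the same random draw, the steered action differs from the unsteered one by exactly `0.5 * W_dec[0]`, because the toy decoder is linear. If latent dimensions are disentangled, each such offset corresponds to one interpretable direction of variation, which is the property that would make this kind of human guidance cheap.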

Why it matters

Provides a scalable, safe alternative to costly teleoperation and unstable random exploration, enabling roboticists and AI developers to efficiently improve real-world manipulation policies.

Abstract

Intelligent agents progress by continually refining their capabilities through actively exploring environments. Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement.
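One self-improvement round can be sketched as filtered behavior cloning: roll out the exploratory policy, keep only successful episodes, and retrain on the demonstrations plus those successes. This is an assumed, generic loop, not SOE's documented training procedure; `self_improvement_round`, `average_relative_improvement`, and the toy rollout and trainer are hypothetical, and the success rates in the usage lines are made-up illustrations rather than results from the paper. The second helper makes "relative improvement" concrete: a task going from 40% to 60% success is a 50% relative gain even though the absolute gain is 20 percentage points. Whether the reported 50.8% averages per-task ratios or compares averaged success rates is not stated here; this sketch assumes per-task ratios.

```python
from statistics import mean
from typing import Any, Callable, List, Sequence, Tuple

Rollout = Callable[[Any], Tuple[Any, bool]]  # policy -> (episode, success flag)
Trainer = Callable[[List[Any]], Any]         # dataset -> retrained policy

def self_improvement_round(policy: Any, dataset: List[Any], rollout: Rollout,
                           train: Trainer, num_rollouts: int) -> Tuple[Any, List[Any]]:
    """Collect exploratory rollouts, keep successes, retrain on the aggregate."""
    successes = [ep for ep, ok in (rollout(policy) for _ in range(num_rollouts)) if ok]
    new_dataset = dataset + successes
    return train(new_dataset), new_dataset

def average_relative_improvement(before: Sequence[float], after: Sequence[float]) -> float:
    """Mean of per-task (after - before) / before; assumes every before > 0."""
    return mean((a - b) / b for b, a in zip(before, after))

# Toy usage: a scripted rollout with 3 successes out of 5 attempts.
outcomes = iter([True, False, True, True, False])
def toy_rollout(policy):
    return f"episode-by-{policy}", next(outcomes)
def toy_train(data):
    return f"policy-trained-on-{len(data)}"

new_policy, data = self_improvement_round("base", ["demo1", "demo2"],
                                          toy_rollout, toy_train, num_rollouts=5)
gain = average_relative_improvement([0.40, 0.50], [0.60, 0.75])  # both tasks: +50% relative
```

Filtering to successes keeps the retraining target close to imitation learning, which is one plausible reason such loops stay stable; a real pipeline might instead weight or relabel episodes.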

Index terms

Imitation Learning; Learning from Experience; Deep Learning in Grasping and Manipulation