ICRA 2026

VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation

Yixiang Chen, Yan Huang, Keji He, Peiyan Li, Liang Wang

AI summary

Leveraging foundation models to generate task-adaptive virtual camera views significantly accelerates 3D robotic manipulation training and inference while improving success rates over multi-camera baselines.

3D robotic manipulation, foundation models, virtual camera, coarse-to-fine refinement, spatial reasoning, imitation learning

Problem

Multi-camera setups in 3D robotic manipulation introduce substantial redundancy and irrelevant information, increasing computational costs and training time while making it difficult to extract task-relevant features and mitigate occlusion.

Approach

VERM uses a foundation model to predict a task-adaptive virtual camera pose from multi-view observations, projects the 3D point cloud onto this single view, and combines it with a depth-aware module and dynamic coarse-to-fine refinement for precise 3D action planning.
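The projection step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple pinhole camera, a hard-coded camera pose standing in for the pose a foundation model would predict, and nearest-point z-buffering. The names `look_at`, `render_virtual_view`, `focal`, and `size` are hypothetical and chosen only for this illustration.

```python
import numpy as np


def look_at(eye, target, up=(0.0, 0.0, 1.0)):
    """Build a 4x4 world-to-camera matrix (x right, y down, z forward).

    Assumption: the viewing direction is not parallel to `up`.
    """
    eye = np.asarray(eye, dtype=float)
    target = np.asarray(target, dtype=float)
    forward = target - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    rot = np.stack([right, down, forward])  # rows: camera axes in world frame
    extrinsic = np.eye(4)
    extrinsic[:3, :3] = rot
    extrinsic[:3, 3] = -rot @ eye
    return extrinsic


def render_virtual_view(points, colors, extrinsic, focal, size):
    """Project colored world points into a size x size RGB image and depth map."""
    homog = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (extrinsic @ homog.T).T[:, :3]

    # Keep only points in front of the virtual camera.
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]
    z = cam[:, 2]

    # Pinhole projection with the principal point at the image center.
    u = np.round(focal * cam[:, 0] / z + size / 2).astype(int)
    v = np.round(focal * cam[:, 1] / z + size / 2).astype(int)
    inside = (u >= 0) & (u < size) & (v >= 0) & (v < size)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Z-buffer: each pixel keeps the depth of its nearest point.
    depth = np.full((size, size), np.inf)
    np.minimum.at(depth, (v, u), z)

    # Color each pixel with the point that produced that nearest depth.
    nearest = z == depth[v, u]
    rgb = np.zeros((size, size, 3))
    rgb[v[nearest], u[nearest]] = colors[nearest]
    return rgb, depth


# Hypothetical usage: a camera at (1, 0, 0) looking at the origin.
points = np.array([[0.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
pose = look_at(eye=(1.0, 0.0, 0.0), target=(0.0, 0.0, 0.0))
rgb, depth = render_virtual_view(points, colors, pose, focal=50.0, size=64)
```

In this toy scene the second point lies behind the first along the same ray and the third lies behind the camera, so only the first point should be visible. Returning a depth map alongside the colors is one simple way to keep 3D information in a single rendered view; how VERM's depth-aware module consumes depth is not specified in this summary.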

Key results

Achieves 1.89× faster training and 1.54× faster inference than state-of-the-art methods
Surpasses prior SOTA on RLBench simulation and real-world 3D manipulation tasks
Validates plug-and-play compatibility across multiple foundation models including GPT-4o, Qwen2.5, and Claude 3.5
Introduces a dynamic coarse-to-fine refinement mechanism that selectively triggers high-precision views only during critical task phases
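As a quick reading aid for the speedup figures above, the sketch below converts a speedup factor into the fraction of baseline time that remains. It assumes the usual definition, speedup = baseline time / new time; the summary does not state the exact measurement protocol, and the helper name `time_fraction` is hypothetical.

```python
def time_fraction(speedup):
    """Fraction of baseline wall-clock time remaining, assuming speedup = t_base / t_new."""
    return 1.0 / speedup


for label, factor in [("training", 1.89), ("inference", 1.54)]:
    remaining = time_fraction(factor)
    print(f"{label}: {remaining:.1%} of baseline time, {1 - remaining:.1%} saved")
```

Under that assumption, the reported factors correspond to roughly half the baseline training time and about two thirds of the baseline inference time.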

Why it matters

Provides a computationally efficient, task-adaptive visual perception pipeline that enables robots to perform precise 3D manipulation with reduced hardware dependency and faster learning, benefiting both academic researchers and industrial automation systems.

Abstract

When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both simulation benchmark RLBench and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving 1.89× speedup in training time and 1.54× speedup in inference speed.

Index terms

Deep Learning for Visual Perception; Deep Learning in Grasping and Manipulation; Learning from Demonstration